Replacing character values in Data Frame Column with numeric value

Question

I am working on the SAT Scores database: https://nycopendata.socrata.com/Education/SAT-Results/f9bf-2cp4?

This is what it looks like:

> head(SAT)
 DBN                                   SCHOOL.NAME Num.of.SAT.Test.Takers
1 01M292 HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES                     29
2 01M448           UNIVERSITY NEIGHBORHOOD HIGH SCHOOL                     91
3 01M450                    EAST SIDE COMMUNITY SCHOOL                     70
4 01M458                     FORSYTH SATELLITE ACADEMY                      7
5 01M509                       MARTA VALLE HIGH SCHOOL                     44
6 01M515       LOWER EAST SIDE PREPARATORY HIGH SCHOOL                    112
  SAT.Critical.Reading.Avg..Score SAT.Math.Avg..Score SAT.Writing.Avg..Score
1                             355                 404                    363
2                             383                 423                    366
3                             377                 402                    370
4                             414                 401                    359
5                             390                 433                    384
6                             332                 557                    316

In the column Num.of.SAT.Test.Takers, many values are simply the character 's'. In those rows, the score columns also contain 's' instead of numeric scores.

> SATnocandidates<-SAT[SAT$Num.of.SAT=='s', ]
> head(SATnocandidates)
      DBN                                 SCHOOL.NAME Num.of.SAT.Test.Takers
23 02M392                  MANHATTAN BUSINESS ACADEMY                      s
24 02M393                   BUSINESS OF SPORTS SCHOOL                      s
26 02M399  THE HIGH SCHOOL FOR LANGUAGE AND DIPLOMACY                      s
39 02M427       MANHATTAN ACADEMY FOR ARTS & LANGUAGE                      s
41 02M437 HUDSON HIGH SCHOOL OF LEARNING TECHNOLOGIES                      s
42 02M438   INTERNATIONAL HIGH SCHOOL AT UNION SQUARE                      s
   SAT.Critical.Reading.Avg..Score SAT.Math.Avg..Score SAT.Writing.Avg..Score
23                               s                   s                      s
24                               s                   s                      s
26                               s                   s                      s
39                               s                   s                      s
41                               s                   s                      s
42                               s                   s                      s

Questions

1. In the original SAT data frame, I want to replace all 's' values in the Num.of.SAT.Test.Takers column with the numeric value 0.
2. Subsequently, I want to selectively replace all 's' values in the corresponding score columns with 0.
3. How can I write a single overarching command to find all 's' values in the data frame and replace them with 0?

Is "s" a missing value? If so, set "s" as a na.strings value when read in the data....
– A5C1D2H2I1M1N2O1R2T1, Commented Feb 19, 2014 at 17:09
Indeed, NA is probably better than 0. (0 would mess up your histograms, your correlations, your averages...)
– David Robinson, Commented Feb 19, 2014 at 17:12
Ananda, I'm a beginner stumbling through with no programming background. It could be a missing value but I'd rather set it as numeric 0. because eventually I need to add rows, columns and do pie charts / box plots etc.
– vagabond, Commented Feb 19, 2014 at 17:12
@vagabond: All the more reason you want it to be NA (meaning missing value) rather than zero. If you show a boxplot, NA values will be automatically removed. If you set them to 0, your boxplot will stretch to zero and make it look like many people failed the test. Similarly, if you want to find the median or mean of a test, you can just set na.rm=TRUE and they'll be removed: but your zeroes would skew the mean/median very low.
– David Robinson, Commented Feb 19, 2014 at 17:13
@vagabond, NA would still be better than 0 even for what you state you have to do. NA and 0 mean pretty different things....
– A5C1D2H2I1M1N2O1R2T1, Commented Feb 19, 2014 at 17:14
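
To make the point in these comments concrete, here is a small sketch with made-up scores (the values and the names `scores` and `scores0` are hypothetical, not taken from the dataset). It compares the mean when a suppressed value is kept as NA with the mean when it is replaced by 0:

```r
# Hypothetical math scores; NA stands for a suppressed ("s") value.
scores <- c(404, 423, NA, 402)

# With NA present, mean() returns NA unless missing values are dropped.
mean(scores)
mean(scores, na.rm = TRUE)    # average of the three reported scores

# Replacing the missing value with 0 pulls the average down.
scores0 <- replace(scores, is.na(scores), 0)
mean(scores0)
```

With `na.rm = TRUE` the average is taken over the three reported scores only, while the zero-filled version divides the same total by four.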

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer · 2014-02-19 17:39:29Z

2

My comment as an answer...

Use the na.strings argument when you read your data in. Assuming you had downloaded the CSV version of the dataset to your "Downloads" directory, you would use a command like:

SAT <- read.csv("~/Downloads/SAT_Results.csv", na.strings = "s")

Note that the na.strings argument is plural--you can have multiple values that get read in as NA.
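
As a minimal sketch of passing several values, the example below reads a tiny inline CSV instead of the real file. The column names, the rows, and the extra marker "n/a" are all made up for illustration; only `read.csv()` and its `text` and `na.strings` arguments are real.

```r
# Toy CSV text standing in for the downloaded file (hypothetical values).
csv_text <- "DBN,Takers,Math
01M292,29,404
02M392,s,s
02M393,n/a,
01M448,91,423"

# Every string listed in na.strings is read in as NA.
toy <- read.csv(text = csv_text, na.strings = c("s", "n/a", ""))

str(toy)               # Takers and Math are read as numeric columns
colSums(is.na(toy))    # count of NA values per column
```

Because the markers are turned into NA while the data is read, the affected columns come in as numbers straight away and no later conversion is needed.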

Another option, if the data is already in your R workspace, is to get rid of your "s" values just by coercion. The columns are likely to be factors or characters at the moment. If you convert them to numeric, the "s" values would automatically become NA (you'll get warnings, but the warnings are only telling us what we already know).

So for instance, imagine we started here:

SAT <- read.csv("~/Downloads/SAT_Results.csv")

If we wanted to apply our operation across all the columns that should be numeric (all but the first two columns), we could do:

SAT[-c(1, 2)] <- lapply(SAT[-c(1, 2)], function(x) as.numeric(as.character(x)))

Alternatively, if you wanted to change just the third column, you can use something like the following:

SAT[[3]] <- as.numeric(as.character(SAT[[3]]))
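
Here is a small sketch of why the conversion goes through as.character() first, using a made-up data frame rather than the real file (the name `toy` and its values are hypothetical). The columns are built as factors on purpose, which is what older versions of read.csv() produced by default for text columns.

```r
# Toy data frame (hypothetical values) with factor columns containing "s".
toy <- data.frame(
  DBN    = c("01M292", "02M392", "01M448"),
  Takers = factor(c("29", "s", "91")),
  Math   = factor(c("404", "s", "423"))
)

# as.numeric() on a factor returns the internal level codes, not the values.
as.numeric(toy$Takers)

# Going through as.character() first recovers the values; "s" becomes NA.
# suppressWarnings() hides the expected "NAs introduced by coercion" warning.
toy[-1] <- lapply(toy[-1], function(x) {
  suppressWarnings(as.numeric(as.character(x)))
})

toy$Takers    # 29 NA 91
```

If the columns are already character rather than factor, the as.character() step is harmless, so the same line works in both cases.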

edited Feb 19, 2014 at 17:39

answered Feb 19, 2014 at 17:22

A5C1D2H2I1M1N2O1R2T1


2 Comments

vagabond Over a year ago

right. so if I have multiple values that I want to replace say, s, r and t, I could write: SAT <- read.csv("~/Downloads/SAT_Results.csv", na.strings = c("s","r", "t")) is that correct?

vagabond Over a year ago

Also, Ananda, this answers my third question: - replaces all 's' as NA. How about if I want to selectively replace a column or a row or one particular value?
