Replace multiple characters from multiple columns in R

Question

Given a dataframe as follows:

structure(list(date = structure(1:24, .Label = c("2010Y1-01m", 
"2010Y1-02m", "2010Y1-03m", "2010Y1-04m", "2010Y1-05m", "2010Y1-06m", 
"2010Y1-07m", "2010Y1-08m", "2010Y1-09m", "2010Y1-10m", "2010Y1-11m", 
"2010Y1-12m", "2011Y1-01m", "2011Y1-02m", "2011Y1-03m", "2011Y1-04m", 
"2011Y1-05m", "2011Y1-06m", "2011Y1-07m", "2011Y1-08m", "2011Y1-09m", 
"2011Y1-10m", "2011Y1-11m", "2011Y1-12m"), class = "factor"), 
    a = structure(c(1L, 18L, 19L, 20L, 22L, 23L, 2L, 4L, 5L, 
    7L, 8L, 10L, 1L, 21L, 3L, 6L, 9L, 11L, 12L, 13L, 14L, 15L, 
    16L, 17L), .Label = c("--", "10159.28", "10295.69", "10580.82", 
    "10995.65", "11245.84", "11327.23", "11621.99", "12046.63", 
    "12139.78", "12848.27", "13398.26", "13962.6", "14559.72", 
    "14982.58", "15518.64", "15949.87", "7363.45", "8237.71", 
    "8830.99", "9309.47", "9316.56", "9795.77"), class = "factor"), 
    b = structure(c(2L, 16L, 23L, 24L, 4L, 6L, 7L, 9L, 10L, 12L, 
    14L, 17L, 1L, 22L, 3L, 5L, 8L, 11L, 13L, 15L, 18L, 19L, 20L, 
    21L), .Label = c("-", "--", "1058.18", "1455.6", "1539.01", 
    "1867.07", "2036.92", "2102.23", "2372.84", "2693.96", "2769.65", 
    "2973.04", "3146.88", "3227.23", "3604.71", "365.07", "3678.01", 
    "4043.18", "4438.55", "4860.76", "5360.94", "555.51", "653.19", 
    "980.72"), class = "factor"), c = structure(c(2L, 6L, 10L, 
    11L, 13L, 15L, 16L, 18L, 20L, 22L, 24L, 7L, 1L, 9L, 12L, 
    14L, 17L, 19L, 21L, 23L, 3L, 4L, 5L, 8L), .Label = c("-", 
    "--", "1092.73", "1222.48", "1409.07", "158.18", "1748.44", 
    "2179.42", "227.68", "268.53", "331.81", "366.95", "434.19", 
    "486.41", "538.49", "606.62", "614.75", "651.46", "729.44", 
    "736.55", "836.46", "890.81", "929.72", "981.65"), class = "factor")), class = "data.frame", row.names = c(NA, 
-24L))

How could I replace -- and - in only columns a and b with NA? Thanks.

How are you reading the data in? Maybe you should look at avoiding the problem when it's being read in (using na.strings, for example). That way, the column types would also be correct.
– A5C1D2H2I1M1N2O1R2T1, Commented Jun 26, 2020 at 2:59
Have you tried using the na.strings argument in read.xlsx?
– A5C1D2H2I1M1N2O1R2T1, Commented Jun 26, 2020 at 3:05
Yes. But note that doing it while reading the data in would apply the rule to the entire dataset. If it's for specific columns, like you've indicated here, you can try the type.convert example I shared in my answer below.
– A5C1D2H2I1M1N2O1R2T1, Commented Jun 26, 2020 at 4:05
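
The read-time approach suggested in these comments might look roughly like the sketch below. It assumes the data is available as CSV text: the inline `csv_text` sample is made up for illustration, and `read.csv` stands in for whatever reader is actually used. As the last comment points out, `na.strings` applies to every column, so column `c` is converted as well.

```r
# Hypothetical CSV sample mimicking the question's data (not the real file).
csv_text <- "date,a,b,c
2010Y1-01m,--,--,--
2010Y1-02m,7363.45,365.07,158.18
2011Y1-01m,--,-,-
2011Y1-02m,9309.47,555.51,227.68"

# na.strings accepts several markers; they are applied to every column,
# including c, which the question wanted to leave untouched.
df_read <- read.csv(text = csv_text, na.strings = c("--", "-"))

str(df_read)
```

Because the markers are turned into NA before type detection, `a`, `b` and `c` all come back as numeric columns rather than factors or strings.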

Ronak Shah · Accepted Answer · 2020-06-26 02:57:31Z

2

You can use:

cols <- c('a', 'b')
df[cols][df[cols] == '--' | df[cols] == '-'] <- NA

Or using dplyr:

library(dplyr)
df %>% mutate(across(c(a, b), ~replace(., . %in% c('--', '-'), NA)))
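
A small self-contained check of the base R approach, using a made-up data frame `toy` with factor columns like those in the question. It is only a sketch: it shows that `c` is left alone and that `a` and `b` are still factors after the replacement, so a conversion step (added here, not part of the answer) is needed if numeric columns are wanted.

```r
# Toy data with factor columns, mimicking the structure in the question.
toy <- data.frame(
  a = factor(c("--", "7363.45", "8237.71")),
  b = factor(c("-", "365.07", "--")),
  c = factor(c("--", "158.18", "-"))
)

cols <- c("a", "b")
# Same replacement as in the answer, restricted to the selected columns.
toy[cols][toy[cols] == "--" | toy[cols] == "-"] <- NA

# a and b are still factors here; go through character to get the numbers.
toy[cols] <- lapply(toy[cols], function(x) as.numeric(as.character(x)))

str(toy)
```

Converting through `as.character()` matters: calling `as.numeric()` directly on a factor returns its integer codes, not the values shown in the labels.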

edited Jun 26, 2020 at 2:57

answered Jun 26, 2020 at 2:51

Ronak Shah

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer · 2020-06-26 03:32:33Z

1

I think it's better to avoid the data being read in like this in the first place, but if you need to correct it afterwards, you can try using the na.strings argument in type.convert. Notice that it's na.strings with an "s": the argument is plural, so more than one value can be used to represent NA values.

# as.is = TRUE avoids the warning that R >= 4.0 gives when as.is is not specified
df[c("a", "b")] <- lapply(df[c("a", "b")], type.convert, na.strings = c("--", "-"), as.is = TRUE)
str(df)
# 'data.frame':   24 obs. of  4 variables:
#  $ date: Factor w/ 24 levels "2010Y1-01m","2010Y1-02m",..: 1 2 3 4 5 6 7 8 9 10 ...
#  $ a   : num  NA 7363 8238 8831 9317 ...
#  $ b   : num  NA 365 653 981 1456 ...
#  $ c   : Factor w/ 24 levels "-","--","1092.73",..: 2 6 10 11 13 15 16 18 20 22 ...
head(df)
#         date       a       b      c
# 1 2010Y1-01m      NA      NA     --
# 2 2010Y1-02m 7363.45  365.07 158.18
# 3 2010Y1-03m 8237.71  653.19 268.53
# 4 2010Y1-04m 8830.99  980.72 331.81
# 5 2010Y1-05m 9316.56 1455.60 434.19
# 6 2010Y1-06m 9795.77 1867.07 538.49

Note that in this particular case, you could also use the side effect of as.numeric(as.character(...)) converting anything that can't be coerced to numeric to NA, but keep in mind that you will get a warning for each column that you use this approach on.

lapply(df[c("a", "b")], function(x) as.numeric(as.character(x)))
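
A sketch of how that coercion might be written back into the data frame, using a made-up `df_toy` in place of the question's data. Wrapping the call in `suppressWarnings()` is an addition here, not part of the answer: it hides the expected "NAs introduced by coercion" warning, but it would equally hide warnings caused by genuinely malformed values, so it should be used with care.

```r
# Toy stand-in for the question's data (values made up for illustration).
df_toy <- data.frame(
  a = factor(c("--", "7363.45")),
  b = factor(c("-", "365.07")),
  c = factor(c("--", "158.18"))
)

cols <- c("a", "b")
# as.character() first, so the labels (not the integer codes) are converted;
# anything that is not a number, such as "--" or "-", becomes NA.
df_toy[cols] <- suppressWarnings(
  lapply(df_toy[cols], function(x) as.numeric(as.character(x)))
)

str(df_toy)
```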

edited Jun 26, 2020 at 3:32

answered Jun 26, 2020 at 3:12

A5C1D2H2I1M1N2O1R2T1

1 Comment

ah bon Over a year ago

Thanks a lot for your detailed answer.
