
Given a dataframe as follows:

df <- structure(list(date = structure(1:24, .Label = c("2010Y1-01m", 
"2010Y1-02m", "2010Y1-03m", "2010Y1-04m", "2010Y1-05m", "2010Y1-06m", 
"2010Y1-07m", "2010Y1-08m", "2010Y1-09m", "2010Y1-10m", "2010Y1-11m", 
"2010Y1-12m", "2011Y1-01m", "2011Y1-02m", "2011Y1-03m", "2011Y1-04m", 
"2011Y1-05m", "2011Y1-06m", "2011Y1-07m", "2011Y1-08m", "2011Y1-09m", 
"2011Y1-10m", "2011Y1-11m", "2011Y1-12m"), class = "factor"), 
    a = structure(c(1L, 18L, 19L, 20L, 22L, 23L, 2L, 4L, 5L, 
    7L, 8L, 10L, 1L, 21L, 3L, 6L, 9L, 11L, 12L, 13L, 14L, 15L, 
    16L, 17L), .Label = c("--", "10159.28", "10295.69", "10580.82", 
    "10995.65", "11245.84", "11327.23", "11621.99", "12046.63", 
    "12139.78", "12848.27", "13398.26", "13962.6", "14559.72", 
    "14982.58", "15518.64", "15949.87", "7363.45", "8237.71", 
    "8830.99", "9309.47", "9316.56", "9795.77"), class = "factor"), 
    b = structure(c(2L, 16L, 23L, 24L, 4L, 6L, 7L, 9L, 10L, 12L, 
    14L, 17L, 1L, 22L, 3L, 5L, 8L, 11L, 13L, 15L, 18L, 19L, 20L, 
    21L), .Label = c("-", "--", "1058.18", "1455.6", "1539.01", 
    "1867.07", "2036.92", "2102.23", "2372.84", "2693.96", "2769.65", 
    "2973.04", "3146.88", "3227.23", "3604.71", "365.07", "3678.01", 
    "4043.18", "4438.55", "4860.76", "5360.94", "555.51", "653.19", 
    "980.72"), class = "factor"), c = structure(c(2L, 6L, 10L, 
    11L, 13L, 15L, 16L, 18L, 20L, 22L, 24L, 7L, 1L, 9L, 12L, 
    14L, 17L, 19L, 21L, 23L, 3L, 4L, 5L, 8L), .Label = c("-", 
    "--", "1092.73", "1222.48", "1409.07", "158.18", "1748.44", 
    "2179.42", "227.68", "268.53", "331.81", "366.95", "434.19", 
    "486.41", "538.49", "606.62", "614.75", "651.46", "729.44", 
    "736.55", "836.46", "890.81", "929.72", "981.65"), class = "factor")), class = "data.frame", row.names = c(NA, 
-24L))

How can I replace -- and - with NA in columns a and b only? Thanks.

  • How are you reading the data in? Maybe you should look at avoiding the problem when it's being read in (using na.strings, for example). That way, the column types would also be correct. Commented Jun 26, 2020 at 2:59
  • I'm using read.xlsx to read it. Commented Jun 26, 2020 at 3:04
  • Have you tried using the na.strings argument in read.xlsx? Commented Jun 26, 2020 at 3:05
  • You mean by adding na.strings = c('--', '-')? Commented Jun 26, 2020 at 3:58
  • Yes. But note that doing it while reading the data in would apply the rule to the entire dataset. If it's for specific columns, like you've indicated here, you can try the type.convert example I shared in my answer below. Commented Jun 26, 2020 at 4:05

2 Answers

2

You can use:

cols <- c('a', 'b')
df[cols][df[cols] == '--' | df[cols] == '-'] <- NA

Or, using dplyr:

library(dplyr)
df %>% mutate(across(c(a, b), ~replace(., . %in% c('--', '-'), NA)))
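
Both approaches only swap the placeholders for NA; the columns keep their original type (factor in the question's data). As a minimal sketch, assuming a small made-up data frame (df_small and its values are hypothetical, not the question's data), the base replacement can be followed by an explicit conversion to numeric:

```r
# Hypothetical toy data mimicking the question's structure (factor columns).
df_small <- data.frame(
  a = factor(c("--", "7363.45", "8237.71")),
  b = factor(c("-", "365.07", "653.19")),
  c = factor(c("--", "158.18", "268.53"))
)

cols <- c("a", "b")

# Replace the placeholder strings with NA in the selected columns only.
df_small[cols][df_small[cols] == "--" | df_small[cols] == "-"] <- NA

# The columns are still factors; go through character to recover the numbers.
df_small[cols] <- lapply(df_small[cols], function(x) as.numeric(as.character(x)))

str(df_small)
```

Column c is left untouched, so it stays a factor and still contains "--".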

1

I think it's better to avoid reading the data in like this in the first place, but if you need to correct it afterwards, you can use the na.strings argument of type.convert. Note that it's na.strings with an "s": it's plural, so more than one value can be used to represent NA.

df[c("a", "b")] <- lapply(df[c("a", "b")], type.convert, na.strings = c("--", "-"), as.is = TRUE)
str(df)
# 'data.frame':   24 obs. of  4 variables:
#  $ date: Factor w/ 24 levels "2010Y1-01m","2010Y1-02m",..: 1 2 3 4 5 6 7 8 9 10 ...
#  $ a   : num  NA 7363 8238 8831 9317 ...
#  $ b   : num  NA 365 653 981 1456 ...
#  $ c   : Factor w/ 24 levels "-","--","1092.73",..: 2 6 10 11 13 15 16 18 20 22 ...
head(df)
#         date       a       b      c
# 1 2010Y1-01m      NA      NA     --
# 2 2010Y1-02m 7363.45  365.07 158.18
# 3 2010Y1-03m 8237.71  653.19 268.53
# 4 2010Y1-04m 8830.99  980.72 331.81
# 5 2010Y1-05m 9316.56 1455.60 434.19
# 6 2010Y1-06m 9795.77 1867.07 538.49

Note that in this particular case, you could also rely on as.numeric(as.character(...)), which turns anything that can't be coerced to numeric into NA. Keep in mind that you will get a warning ("NAs introduced by coercion") for each column you convert this way.

# Assign the result back to keep the converted columns
df[c("a", "b")] <- lapply(df[c("a", "b")], function(x) as.numeric(as.character(x)))
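
For the read-time route mentioned at the top of this answer, here is a rough sketch of how na.strings behaves when the data is read in. It uses base read.csv on inline text as a stand-in for the spreadsheet (csv_text and df_read are made up for illustration); whether your read.xlsx accepts the same argument depends on which package it comes from, so check its help page.

```r
# Hypothetical inline CSV standing in for the spreadsheet in the question.
csv_text <- "date,a,b,c
2010Y1-01m,--,--,--
2010Y1-02m,7363.45,365.07,158.18
2011Y1-01m,--,-,-"

# na.strings is applied to every column at read time, so c is affected too.
df_read <- read.csv(text = csv_text, na.strings = c("--", "-"))

str(df_read)
```

Here a, b and c all come back as numeric with NA in place of the placeholders, which is why this route only fits if the rule should apply to the whole dataset.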

1 Comment

Thanks a lot for your detailed answer.
