52

I have an example data set with a column that reads somewhat like this:

Candy
Sanitizer
Candy
Water
Cake
Candy
Ice Cream
Gum
Candy
Coffee

What I'd like to do is recode it into just two levels: "Candy" and "Non-Candy". I can do this with Python/Pandas, but can't seem to figure out a dplyr-based solution. Thank you!

7 Answers

96

With dplyr, using base R's replace() inside mutate():

dat %>% 
    mutate(var = replace(var, var != "Candy", "Not Candy"))

This is significantly faster than the ifelse approaches. The initial data frame can be created as follows:

library(dplyr)
# as_data_frame() is deprecated; tibble() (re-exported by dplyr) names the column directly
dat <- tibble(var = c("Candy","Sanitizer","Candy","Water","Cake","Candy","Ice Cream","Gum","Candy","Coffee"))
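
As a quick check of the replace() approach, here is a minimal sketch that recodes the column and then converts it to a two-level factor, since the question asks for factors. It assumes var is a character column; if var were already a factor, replace() would produce NA with a warning, because "Non-Candy" is not an existing level. The name out is only illustrative.

```r
library(dplyr)

# Same example data as in the question
dat <- tibble(var = c("Candy", "Sanitizer", "Candy", "Water", "Cake",
                      "Candy", "Ice Cream", "Gum", "Candy", "Coffee"))

out <- dat %>%
  mutate(
    # overwrite every value that is not "Candy"
    var = replace(var, var != "Candy", "Non-Candy"),
    # convert the recoded character column to a two-level factor
    var = factor(var, levels = c("Candy", "Non-Candy"))
  )

table(out$var)
# Candy: 4, Non-Candy: 6
```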

2 Comments

Isn't there a function that doesn't require repeating var?
The only function I am familiar with that autopopulates the conditional statement is replace_na(). Explanation: the first var names the output column, the second var is the input column, and the third var is the column used in the condition. A function that omitted one of these would have to assume (hard-code) one or more of the three: input, output, or conditional column.
28

Another solution with dplyr using case_when:

dat %>%
    mutate(var = case_when(var == 'Candy' ~ 'Candy',
                           TRUE ~ 'Non-Candy'))

The syntax for case_when is condition ~ replacement value. See ?case_when for the documentation.

Probably less efficient than the solution using replace, but an advantage is that multiple replacements can be performed in a single command while remaining readable, e.g. to produce three levels:

dat %>%
    mutate(var = case_when(var == 'Candy' ~ 'Candy',
                           var == 'Water' ~ 'Water',
                           TRUE ~ 'Neither-Water-Nor-Candy'))
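
For what it's worth, dplyr 1.1.0 and later also accept a .default argument in case_when as an alternative to the TRUE ~ fallback. A minimal sketch, assuming dplyr >= 1.1.0 and a small illustrative data set (not the one from the question):

```r
library(dplyr)

# Illustrative data, shortened from the question's example
dat <- tibble(var = c("Candy", "Water", "Cake", "Candy", "Gum"))

out <- dat %>%
  mutate(var = case_when(var == "Candy" ~ "Candy",
                         var == "Water" ~ "Water",
                         # anything not matched above falls through to .default
                         .default = "Neither-Water-Nor-Candy"))

out$var
```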

16

Assuming your data frame is dat and your column is var:

dat = dat %>% mutate(candy.flag = factor(ifelse(var == "Candy", "Candy", "Non-Candy")))

1 Comment

@RichardScriven's approach (comments on mine) strictly dominates this
9

No need for dplyr. Assuming var is stored as a factor already:

non_c <- setdiff(levels(dat$var), "Candy")
    
levels(dat$var) <- list(Candy = "Candy", "Non-Candy" = non_c)

See ?levels.
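
A small self-contained sketch of the level-merging step, using a shortened, illustrative version of the data and assuming var is a factor:

```r
# Illustrative data; var must be a factor for levels() to apply
dat <- data.frame(var = factor(c("Candy", "Sanitizer", "Candy", "Water", "Cake")))

# every level except "Candy"
non_c <- setdiff(levels(dat$var), "Candy")

# assigning a named list merges the listed old levels into each new level
levels(dat$var) <- list(Candy = "Candy", "Non-Candy" = non_c)

levels(dat$var)
# "Candy" "Non-Candy"
as.character(dat$var)
# "Candy" "Non-Candy" "Candy" "Non-Candy" "Non-Candy"
```

Only the level labels are rewritten here, not the underlying integer codes for each row, which is why this tends to be cheap on long vectors.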

This is much more efficient than the ifelse approach, which is bound to be slow:

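library(dplyr)           # for %>% and mutate()
library(microbenchmark)  # for get_nanotime()
set.seed(01239)
# resample data; factor() ensures var has levels even if dat$var is character
smp <- data.frame(var = factor(sample(dat$var, 1e6, TRUE)))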
    
timings <- replicate(50, {
  # copy data to facilitate reuse
  cop <- smp
  t0 <- get_nanotime()
  levs <- setdiff(levels(cop$var), "Candy")
  levels(cop$var) <- list(Candy = "Candy", "Non-Candy" = levs)
  t1 <- get_nanotime() - t0

  cop <- smp
  t0 <- get_nanotime()
  cop = cop %>%
    mutate(candy.flag = factor(ifelse(var == "Candy", "Candy", "Non-Candy")))
  t2 <- get_nanotime() - t0

  cop <- smp
  t0 <- get_nanotime()
  cop$var <- 
    factor(cop$var == "Candy", labels = c("Non-Candy", "Candy"))
  t3 <- get_nanotime() - t0
  c(levels = t1, dplyr = t2, direct = t3)
})

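x <- apply(timings, 1, median)
x[2:3]/x[1]
#    dplyr   direct 
# 8.894303 4.962791 

That is, resetting the levels is about 9 times faster than the dplyr/ifelse approach and about 5 times faster than the direct factor() call.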

1 Comment

Or also factor(dat$var == "Candy", labels = c("Non-Candy", "Candy")) but I think resetting the levels is a nice way to go.
2

I didn't benchmark this, but at least in some cases with more than one condition, a combination of mutate and a list seems to provide an easy solution:

# assuming that all sweet things fall in one category

dat <- data.frame(var = c("Candy", "Sanitizer", "Candy", "Water", "Cake", "Candy", "Ice Cream", "Gum", "Candy", "Coffee"))

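conditions <- list("Candy" = TRUE, "Sanitizer" = FALSE, "Water" = FALSE, 
                   "Cake" = TRUE, "Ice Cream" = TRUE, "Gum" = TRUE, "Coffee" = FALSE)

# look up each value by name; as.character() guards against factor columns,
# and unlist() yields a plain logical column rather than a list-column
dat %>% mutate(sweet = unlist(conditions[as.character(var)], use.names = FALSE))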

1 Comment

Way too long to write
1

When you only need two values, a simple ifelse() is the prettiest option, I think.

Furthermore, nested ifelse() calls can reproduce the case_when solution proposed by PhJ (I do like its readability, though)!

dat %>%
    mutate(
        var = ifelse(var == "Candy", "Candy", "Non-Candy")
    )
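
For the nested case mentioned above, a minimal sketch that reproduces the three-level case_when example with nested ifelse() calls. The data set is a shortened, illustrative one and var is assumed to be a character column:

```r
library(dplyr)

# Illustrative data, shortened from the question's example
dat <- tibble(var = c("Candy", "Water", "Cake", "Candy", "Gum"))

out <- dat %>%
  mutate(
    # the inner ifelse() handles everything that is not "Candy"
    var = ifelse(var == "Candy", "Candy",
                 ifelse(var == "Water", "Water", "Neither-Water-Nor-Candy"))
  )

out$var
```

Each extra level adds another layer of nesting, which is where case_when starts to read better.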

1

A newer solution is to use case_match from dplyr (available from version 1.1.0):

library(dplyr)
dat %>% 
    mutate(var = case_match(var, "Candy" ~ var, 
                           # .default is a named argument, not a formula
                           .default = "Not Candy"))
