I have a data frame with two columns, and I want to create a third column that is essentially a boolean indicating whether column two contains any of a specified set of values.

f <- data.frame(name=c("John", "Sara", "David", "Chad"),
                 car=c("Honda|Ford", "BMW", "Toyota|Chevy|Ford", 
                 "Toyota|Chevy|Ford|Honda"))

The first thing I did was remove the | from each string in the second column and place those values in a third column:

library(stringr)
g = str_replace_all(f$car, "[^[:alnum:]]", " ")
f$make = c(g)
f

What I want to do now is create another column, which will be a boolean: 1 if make contains a common car, and 0 if it contains an uncommon car.

common = c("Honda", "Ford", "Toyota", "Chevy")
not_common = c("BMW", "Lexus", "Acura")

I've tried a few things, including the stringr package and ifelse, to produce the following output:

   name                     car                    make       common   
1  John              Honda|Ford              Honda Ford           1
2  Sara                     BMW                     BMW           0
3 David       Toyota|Chevy|Ford       Toyota Chevy Ford           1
4  Chad Toyota|Chevy|Ford|Honda Toyota Chevy Ford Honda           1

Since it's possible to have both a common and uncommon car as an entry, the uncommon make should override the common make and that row should take the value 0 in the common column. So if an entry had both BMW and Ford, that entry should take a 0 in the common column.

Can anyone help with this task?

Oh, and here's what I tried with the stringr package, but it doesn't work.

common = c("Honda", "Ford", "Toyota", "Chevy")
not_common = c("BMW", "Lexus", "Acura")
common_match <- str_c(common)
not_match <- str_c(not_common)

main <- function(df) {
  f$new_make <- str_detect(f$make, common_match)
  df
}

main(f)

Thanks!

2 Answers

Another way, with a timing comparison against the grep approach:

f2 <- f[rep(1:4,50000),]
system.time({
v <- sapply(f2$make, strsplit, " ")
# min() makes any uncommon make force a 0; max() checks for at least one common make
sapply(v, function(x) min(1-not_common %in% x)*max(common %in% x))
})
 user  system elapsed 
 7.94    0.01    8.00 

system.time(sapply(f2$car,function(x) ifelse(length(grep("BMW|Lexus|Acura",x))>0,0,1)))
 user  system elapsed 
28.72    0.04   28.87 
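
The override rule from the question (an uncommon make forces a 0 even when a common make is also present) can be checked on individual tokens rather than on the whole string. The following is only a minimal sketch under that reading of the rule: the helper name is_common is hypothetical, and it splits the original car column on the literal | instead of using the make column.

```r
common <- c("Honda", "Ford", "Toyota", "Chevy")
not_common <- c("BMW", "Lexus", "Acura")

# Hypothetical helper: returns 1 when an entry has at least one common make
# and no uncommon make, and 0 otherwise.
is_common <- function(car) {
  # fixed = TRUE treats "|" as a literal separator, not as regex alternation
  makes <- strsplit(as.character(car), "|", fixed = TRUE)
  vapply(makes,
         function(x) as.integer(any(x %in% common) && !any(x %in% not_common)),
         integer(1))
}

# "BMW|Ford" mixes both kinds, so the uncommon make should win
result <- is_common(c("Honda|Ford", "BMW", "BMW|Ford", "Toyota|Chevy|Ford|Honda"))
result
```

With the question's data this could be used as f$common <- is_common(f$car). Matching whole tokens with %in% avoids partial matches that a regular expression might pick up inside longer names.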

Not sure if this is the most efficient way, but try this one, which applies grep and ifelse to each value of f$car. The | characters in the pattern mean "or" in the regular expression, combining the search terms inside grep; they have nothing to do with the separator in your data.

f$common <- sapply(f$car,function(x) ifelse(length(grep("BMW|Lexus|Acura",x))>0,0,1))

Result:

> f
   name                     car common
1  John              Honda|Ford      1
2  Sara                     BMW      0
3 David       Toyota|Chevy|Ford      1
4  Chad Toyota|Chevy|Ford|Honda      1
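
Because grepl is vectorized over its input, the same check can likely be written without sapply. This is a minimal sketch, assuming the data frame f from the question (rebuilt here so the snippet runs on its own); the variable name pattern is introduced only for illustration.

```r
# Rebuild the example data from the question
f <- data.frame(name = c("John", "Sara", "David", "Chad"),
                car = c("Honda|Ford", "BMW", "Toyota|Chevy|Ford",
                        "Toyota|Chevy|Ford|Honda"),
                stringsAsFactors = FALSE)
not_common <- c("BMW", "Lexus", "Acura")

# Build the alternation pattern "BMW|Lexus|Acura" from the vector
pattern <- paste(not_common, collapse = "|")

# grepl() returns TRUE for every row containing an uncommon make;
# negate it and convert to 0/1
f$common <- as.integer(!grepl(pattern, f$car))
f
```

Building the pattern from not_common keeps the list of makes in one place, so adding a make does not require editing the regular expression by hand.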
