
I have a dataframe and a set of keywords. I want to add a column to the dataframe listing the keywords that appear in each row's colour string, and a second column listing the keywords that do not appear.

keyword <- c('yellow','blue','red','green','purple')

My dataframe:

colour id
blue A234
blue,black A5
yellow A6
blue,green,purple A7

What I hope to get is a dataframe like this:

colour id match non-match
blue A234 blue yellow,red,green,purple
blue,green A5 blue,green yellow,red,purple
yellow A6 yellow blue,red,green,purple
blue,green,purple A7 blue,green,purple yellow,red

I tried this to get the match column:

df %>% mutate(match = str_extract(paste(keyword,collapse="|"), tolower(colour)))

but it only worked for the first and third rows, not the 2nd and 4th rows. Appreciate any help with this and also to get a column of unmatched strings.

  • Row 2 is blue,black in the input data, but it changes to blue,green in the expected output. Commented Mar 1, 2021 at 3:42

2 Answers

Split each colour into its own row with separate_rows (splitting on commas). Then, for each id, intersect gives the matching keywords and setdiff gives the non-matching ones.

library(dplyr)
keyword <- c('yellow','blue','red','green','purple')

df %>%
  tidyr::separate_rows(colour, sep = ',\\s*') %>%
  group_by(id) %>%
  summarise(match = toString(intersect(keyword, colour)), 
            non_match = toString(setdiff(keyword, colour)), 
            colour = toString(colour))

#  id    match               non_match                  colour             
#* <chr> <chr>               <chr>                      <chr>              
#1 A234  blue                yellow, red, green, purple blue               
#2 A5    blue                yellow, red, green, purple blue, black        
#3 A6    yellow              blue, red, green, purple   yellow             
#4 A7    blue, green, purple yellow, red                blue, green, purple
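To see what the two set operations contribute, here is a minimal base R sketch for a single row. The variable name colours is only illustrative, and the input string is row A5 from the question.

```r
keyword <- c('yellow','blue','red','green','purple')

# Split one comma-separated colour string into a character vector
colours <- strsplit("blue,black", ",\\s*")[[1]]

# Keywords that appear in the row, in keyword order
intersect(keyword, colours)
# [1] "blue"

# Keywords that do not appear in the row
setdiff(keyword, colours)
# [1] "yellow" "red"    "green"  "purple"
```

Because keyword is the first argument in both calls, "black" (not a keyword) never shows up in either result.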

data

df <- structure(list(colour = c("blue", "blue,black", "yellow", "blue,green,purple"),
                     id = c("A234", "A5", "A6", "A7")),
                class = "data.frame", row.names = c(NA, -4L))

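Here is a base R solution. We can use apply in row mode to split each comma-separated colour string into a vector, then use %in% to pick out the keywords that match and those that do not. Copying colour straight into match works only when every colour is a keyword, so match is computed explicitly here.

# Keywords present in each row's colour string (x[1] is the colour column)
df$match <- apply(df, 1, function(x) {
    paste(keyword[keyword %in% strsplit(x[1], ",", fixed=TRUE)[[1]]], collapse=",")
})
# Keywords absent from each row's colour string
df$non_match <- apply(df, 1, function(x) {
    paste(keyword[!keyword %in% strsplit(x[1], ",", fixed=TRUE)[[1]]], collapse=",")
})
df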

             colour   id             match               non_match
1              blue A234              blue yellow,red,green,purple
2        blue,green   A5        blue,green       yellow,red,purple
3            yellow   A6            yellow   blue,red,green,purple
4 blue,green,purple   A7 blue,green,purple              yellow,red

Data:

keyword <- c('yellow','blue','red','green','purple')
df <- data.frame(colour=c("blue", "blue,green", "yellow", "blue,green,purple"),
                 id=c("A234", "A5", "A6", "A7"), stringsAsFactors=FALSE)
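As a side note on the str_extract attempt in the question: stringr functions take the string first and the pattern second, and str_extract returns only the first match per string. The following is a rough sketch of the regex route using str_extract_all, assuming the stringr package is installed. The names pattern and found are illustrative, and the data is the question's original input (including blue,black).

```r
library(stringr)

keyword <- c('yellow','blue','red','green','purple')
df <- data.frame(colour = c("blue", "blue,black", "yellow", "blue,green,purple"),
                 id = c("A234", "A5", "A6", "A7"), stringsAsFactors = FALSE)

# Alternation of all keywords; \\b stops a keyword matching inside a longer word
pattern <- paste0("\\b(", paste(keyword, collapse = "|"), ")\\b")

# str_extract_all(string, pattern) returns one character vector per row
found <- str_extract_all(tolower(df$colour), pattern)

# Collapse the matches, and the keywords that were not matched, into strings
df$match     <- sapply(found, paste, collapse = ",")
df$non_match <- sapply(found, function(x) paste(setdiff(keyword, x), collapse = ","))
df
```

Unlike the intersect approach, matches come back in the order they appear in colour, not in keyword order, so this is an alternative and not a drop-in replacement.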

1 Comment

this really worked for my problem. Thanks
