multiple patterns for string matching using case_when

Question

I'm trying to use str_detect and case_when to recode strings based on multiple patterns, and paste each occurrence of the recoded value(s) into a new column. The Correct column is the output I'm trying to achieve.

This is similar to this question and this question. If it can't be done with case_when (limited to one pattern, I think), is there a better way to achieve this while still using the tidyverse?

Fruit=c("Apples","apples, maybe bananas","Oranges","grapes w apples","pears")
Num=c(1,2,3,4,5)
data=data.frame(Num,Fruit)

df= data %>% mutate(Incorrect=
paste(case_when(
  str_detect(Fruit, regex("apples", ignore_case=TRUE)) ~ "good",
  str_detect(Fruit, regex("bananas", ignore_case=TRUE)) ~ "gross",
  str_detect(Fruit, regex("grapes | oranges", ignore_case=TRUE)) ~ "ok",
  str_detect(Fruit, regex("lemon", ignore_case=TRUE)) ~ "sour",
  TRUE ~ "other"
),sep=","))

  Num                 Fruit Incorrect
  1                Apples      good
  2 apples, maybe bananas      good
  3               Oranges      other
  4       grapes w apples      good
  5                pears       other

 Num                 Fruit    Correct
   1                Apples       good
   2 apples, maybe bananas good,gross
   3               Oranges         ok
   4       grapes w apples    ok,good
   5                pears       other

Related: stackoverflow.com/questions/53851627/… & stackoverflow.com/questions/56588108/…
– Tung, Commented Nov 13, 2020 at 0:18

Ronak Shah · Accepted Answer · 2019-11-30 02:58:16Z

6

case_when uses the first condition that is satisfied for each row and ignores any later conditions that also match. So in such cases it is usually better to put every entry in a separate row, so that it is easier to assign a value to each one and then summarise them together. However, in this case the Fruit column does not have a clear separator: some fruits are separated by a comma (,), some by whitespace, and there are additional words between them. To handle all such cases, we assign NA to the words which do not match and then remove them when summarising.
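To see the first-match behaviour in isolation, here is a minimal sketch. It is not part of the original answer; the names fruit and first_match are only for illustration, and it uses just two of the patterns from the question.

```r
library(dplyr)
library(stringr)

# Illustrative input: the second string contains both "apples" and "bananas".
fruit <- c("Apples", "apples, maybe bananas")

# case_when() returns the value of the FIRST condition that is TRUE for each
# element, so "gross" is never reported for the second string.
first_match <- case_when(
  str_detect(fruit, regex("apples",  ignore_case = TRUE)) ~ "good",
  str_detect(fruit, regex("bananas", ignore_case = TRUE)) ~ "gross",
  TRUE ~ "other"
)

first_match
# "good" "good"
```

The solution that follows works around this by splitting each string into pieces first, so that every piece gets its own case_when result.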

library(dplyr)
library(stringr)

data %>%
  # one row per piece, splitting on commas and on runs of whitespace
  tidyr::separate_rows(Fruit, sep = ",|\\s+") %>%
  mutate(Correct = case_when(
    str_detect(Fruit, regex("apples", ignore_case = TRUE)) ~ "good",
    str_detect(Fruit, regex("bananas", ignore_case = TRUE)) ~ "gross",
    str_detect(Fruit, regex("grapes|oranges", ignore_case = TRUE)) ~ "ok",
    str_detect(Fruit, regex("lemon", ignore_case = TRUE)) ~ "sour",
    TRUE ~ NA_character_)) %>%   # pieces that match nothing become NA
  group_by(Num) %>%
  # drop the NAs and collapse the remaining labels for each Num
  summarise(Correct = toString(na.omit(Correct))) %>%
  # bring back the original Fruit strings
  left_join(data, by = "Num")

#   Num Correct     Fruit                
#  <dbl> <chr>       <fct>                
#1     1 good        Apples               
#2     2 good, gross apples, maybe bananas
#3     3 ok          Oranges              
#4     4 ok, good    grapes w apples      
#5     5 sour        Lemons

The output above is from the original data, where row 5 was Lemons. For the updated data, we can first remove the extra words that occur and then do:

data %>%
  # remove the filler words; \\b keeps "w" from being stripped out of other words
  mutate(Fruit = gsub("\\b(maybe|w)\\b", "", Fruit)) %>%
  # treat a comma plus following whitespace as a single separator
  tidyr::separate_rows(Fruit, sep = ",\\s+|\\s+") %>%
  mutate(Correct = case_when(
    str_detect(Fruit, regex("apples", ignore_case = TRUE)) ~ "good",
    str_detect(Fruit, regex("bananas", ignore_case = TRUE)) ~ "gross",
    str_detect(Fruit, regex("grapes|oranges", ignore_case = TRUE)) ~ "ok",
    str_detect(Fruit, regex("lemon", ignore_case = TRUE)) ~ "sour",
    TRUE ~ "other")) %>%
  group_by(Num) %>%
  summarise(Correct = toString(na.omit(Correct))) %>%
  left_join(data, by = "Num")

#    Num Correct     Fruit                
#  <dbl> <chr>       <fct>                
#1     1 good        Apples               
#2     2 good, gross apples, maybe bananas
#3     3 ok          Oranges              
#4     4 ok, good    grapes w apples      
#5     5 other       pears
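As an alternative that avoids splitting the strings at all, each pattern can be tested against the whole string and the labels of all matching patterns pasted together. The following is only a sketch under stated assumptions, not the answer's method: the lookup vector labels and the helper label_all are hypothetical names introduced here, and the data frame is rebuilt with stringsAsFactors = FALSE so that Fruit is a character column. Labels are ordered by where each pattern first occurs in the string, and a pattern that occurs more than once contributes its label only once.

```r
library(dplyr)
library(stringr)

data <- data.frame(
  Num   = 1:5,
  Fruit = c("Apples", "apples, maybe bananas", "Oranges",
            "grapes w apples", "pears"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: names are regex patterns, values are the labels.
labels <- c("apples" = "good", "bananas" = "gross",
            "grapes|oranges" = "ok", "lemon" = "sour")

# Hypothetical helper: label ONE string with every pattern it contains.
label_all <- function(x, lookup = labels, default = "other") {
  # start position of the first match of each pattern (NA when no match)
  pos <- vapply(names(lookup), function(p) {
    str_locate(x, regex(p, ignore_case = TRUE))[1, "start"]
  }, numeric(1))
  hit <- !is.na(pos)
  if (!any(hit)) return(default)
  # keep the matching labels, ordered by position in the string
  toString(unname(lookup[hit][order(pos[hit])]))
}

result <- data %>%
  mutate(Correct = vapply(Fruit, label_all, character(1), USE.NAMES = FALSE))

result$Correct
# "good" "good, gross" "ok" "ok, good" "other"
```

Because nothing is split, filler words such as maybe and w never become rows of their own, so "other" is only assigned when no pattern matches the whole string. To get "good,gross" without the space, toString() could be swapped for paste(..., collapse = ",").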

edited Nov 30, 2019 at 2:58

answered Nov 28, 2019 at 5:51

Ronak Shah


4 Comments

W148SMH Over a year ago

The only issue is TRUE ~ NA_character_ . I want meaningful non-matching strings to be coded as TRUE ~ other. I edited the data to better reflect my actual data. @RonakShah

Ronak Shah Over a year ago

@W148SMH As mentioned in my post, the problem arises because there is no clear separator between the fruits. Sometimes they are separated by a comma, sometimes by a space. So I have split on both, but there are already some non-matching words such as maybe and w. If we give TRUE ~ 'other', then those words would also be assigned 'other'.

W148SMH Over a year ago

If I remove maybe and w at the beginning with something like str_replace(Fruit,"maybe|w",""), it still wants to add other after those words are removed. @RonakShah

Ronak Shah Over a year ago

@W148SMH yes, if those are the only words occurring then you can remove them. See updated answer.
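One detail of the updated answer that is easy to miss is that the separator changed as well. A possible explanation for the leftover "other" mentioned in the comments (an assumption, since the exact code that was tried is not shown) is that removing a filler word leaves a comma followed by two spaces, and the first separator then produces an empty piece, which matches no pattern. The sketch below uses str_split only to show the pieces, assuming separate_rows splits the same way; the names x, old_sep and new_sep are illustrative.

```r
library(stringr)

# Row 2 after "maybe" has been removed: a comma followed by two spaces.
x <- "apples,  bananas"

# Separator from the first version: the comma and the run of spaces are
# matched separately, leaving an empty piece between them.
old_sep <- str_split(x, ",|\\s+")[[1]]
old_sep
# "apples" "" "bananas"

# Separator from the updated version: the comma plus the following
# whitespace is consumed as a single match, so no empty piece appears.
new_sep <- str_split(x, ",\\s+|\\s+")[[1]]
new_sep
# "apples" "bananas"
```

With TRUE ~ "other", the empty piece from the first separator would be labelled "other", whereas the second separator never creates it.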
