Using dplyr to filter rows which contain partial string of column

Question

Assuming I have a data frame like

term        cnt
apple        10
apples        5
a apple on    3
blue pears    3
pears         1

How can I filter out every term that contains another term of the same column as a partial string, so that the result is

term     cnt
apple     10
pears      1

without specifying the terms to keep (apple|pears) up front, but in a self-referencing manner, i.e. each term is checked against the whole column, and terms that contain another term as a partial match are removed. The number of tokens per term is not limited, and matches need not respect word boundaries (i.e. "mapples" would be matched by "apple"). This would amount to an inverted, generalized, dplyr-based version of

d[grep("^apple$|^pears$", d$term), ]

Additionally, it would be interesting to use this departialisation to get a cumulative sum, e.g.

term     cnt
apple     18
pears      4

I couldn't get it to work with contains() or grep().

Thanks

Please check the updated answer!
– amrrs, Commented Sep 15, 2017 at 14:00

@Karsten Sender Did you try my solution?
– tushaR, Commented Sep 18, 2017 at 7:28

amrrs · Accepted Answer · 2017-09-15 14:00:02Z

2

Hopefully the complete answer. Not very idiomatic (as Pythonistas would say), but maybe someone can suggest improvements:

> library(dplyr)
> 
> ssss <- data.frame(c('apple','red apple','apples','pears','blue pears'),c(15,3,10,4,3))
> 
> names(ssss) <- c('Fruit','Count')
> 
> ssss
       Fruit Count
1      apple    15
2  red apple     3
3     apples    10
4      pears     4
5 blue pears     3
> 
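> # root terms: every term that, used as a regex, matches more than one entry (itself plus at least one other)
> root_list <- as.vector(ssss$Fruit[unlist(lapply(ssss$Fruit,function(x){length(grep(x,ssss$Fruit))>1}))])
> 
> 
> ssss %>% filter(Fruit %in% root_list)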
  Fruit Count
1 apple    15
2 pears     4
> 
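> # one column per root term: the root where it occurs in Fruit, '' elsewhere
> data <- data.frame(lapply(root_list, function(x){y <- stringr::str_extract(ssss$Fruit,x); ifelse(is.na(y),'',y)}))
> 
> cols <- colnames(data)
> 
> #data$x <- do.call(paste0, c(data[cols]))
> #for (co in cols) data[co] <- NULL
> 
> # paste the columns together row by row, so each Fruit is replaced by the root(s) found in it
> ssss$Fruit <- do.call(paste0, c(data[cols]))
> 
> # sum the counts per root term
> ssss %>% group_by(Fruit) %>% summarise(val = sum(Count))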
# A tibble: 2 x 2
  Fruit   val
  <chr> <dbl>
1 apple    28
2 pears     7
>

edited Sep 15, 2017 at 14:00

answered Sep 15, 2017 at 12:56

2 Comments

Karsten Sender Over a year ago

Hi, thanks for your post and sorry for the delay. I see your approach and that it works for the sample data; yet when applied to the real data set (read: about 10k terms), it exhibits strange behavior (e.g. duplicating the column names from "apple" to "applesapplered apple"), and the exponential increase in memory and runtime requirements just does not make it viable. I will accept your answer but need to find a different way to get this to work. Thanks.

amrrs Over a year ago

Sorry about that. If you could share a case where it doesn't work properly, we can try to generalize the code!
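
A possible explanation for the concatenated names, offered as an assumption since the real data was never shared: when two root terms overlap (say both "apple" and "apples" qualify as roots), the pasted `str_extract()` columns each contribute a piece. The sketch below avoids the paste step by mapping every term to the shortest term it contains, using literal rather than regex matching. `shortest_root()` is a helper invented for this illustration, and the data is the sample from the question.

```r
library(dplyr)

d <- data.frame(
  term = c("apple", "apples", "a apple on", "blue pears", "pears"),
  cnt  = c(10, 5, 3, 3, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for one term x, return the shortest entry of `terms`
# that occurs in x as a literal substring. Every term contains itself,
# so there is always at least one hit.
shortest_root <- function(x, terms) {
  hits <- terms[vapply(terms, function(p) grepl(p, x, fixed = TRUE), logical(1))]
  hits[which.min(nchar(hits))]
}

roots  <- unique(d$term)
d$root <- vapply(d$term, shortest_root, character(1),
                 terms = roots, USE.NAMES = FALSE)

# Rows that are their own root, i.e. that contain no shorter term
filtered <- d %>% filter(term == root) %>% select(term, cnt)

# Counts added up per root
totals <- d %>% group_by(root) %>% summarise(cnt = sum(cnt))
```

On the sample this keeps apple and pears and sums to 18 and 4. It still compares every term with every other term, so the number of substring checks grows quadratically with the number of terms, but it never builds a data frame with one column per root.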

Aramis7d · Answer · 2017-09-15 12:49:57Z

1

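You can try using the tidyverse, something like this:

1. Define a vector of the unique terms:

    # needs library(dplyr) and library(tidyr)
    k <- dft %>%
      select(term) %>%
      unlist() %>%
      unique()

2. Operate on the data:

    dft %>%
      separate(term, c('t1', 't2')) %>%   # split each term into its first two tokens
      rowwise() %>%
      mutate(g = sum(t1 %in% k)) %>%      # 1 if the first token is itself a known term
      filter(g > 0) %>%
      select(t1, cnt)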

which gives:

      t1   cnt
   <chr> <int>
1  apple    10
2 apples     5
3  pears     1

This still doesn't handle apple and apples, though. I will keep working on that.

answered Sep 15, 2017 at 12:49

1 Comment

Karsten Sender Over a year ago

Hi, thanks for your idea. Yet terms are not limited to two tokens; they may contain more. I clarified the example above.
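
One way the token idea might be extended to any number of tokens is to put each token on its own row instead of splitting into a fixed `t1`/`t2` pair. This is only a sketch, assuming terms are separated by single spaces and that `dft` holds the sample from the question; it uses `tidyr::separate_rows()`.

```r
library(dplyr)
library(tidyr)

dft <- data.frame(
  term = c("apple", "apples", "a apple on", "blue pears", "pears"),
  cnt  = c(10L, 5L, 3L, 3L, 1L),
  stringsAsFactors = FALSE
)

# Lookup vector of all distinct terms
k <- unique(dft$term)

tokens <- dft %>%
  separate_rows(term, sep = " ") %>%  # one row per token, cnt is repeated
  filter(term %in% k)                 # keep tokens that are themselves terms

totals <- tokens %>%
  group_by(term) %>%
  summarise(cnt = sum(cnt))
```

Like the answer above, this matches whole tokens only, so apple and apples remain separate groups (13, 5 and 4 for apple, apples and pears on the sample). A term containing two known tokens would also be counted once per token.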

tushaR · Answer · 2017-09-15 14:24:58Z

0

Try this:

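library(dplyr)

df=data.frame(term=c('apple','apples','a apple on','blue pears','pears'),cnt=c(10,5,3,3,1),stringsAsFactors=FALSE)

# matches[i, j] is TRUE when term j, used as a regex pattern, is found in term i
matches = sapply(df$term,function(t,terms){grepl(pattern = t,x = terms)},df$term)

# for each term t: rows that match t and also match some other term are relabelled as t
# (note that <<- modifies df in the enclosing environment)
sapply(1:ncol(matches),function(t,mat){
  tempmat = mat[,t]&mat[,-t]
  indices=unlist(apply(tempmat,MARGIN = 2,which))
  df$term[indices]<<-df$term[t]
 },matches)

df%>%group_by(term)%>%summarize(cnt=sum(cnt))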

 # A tibble: 2 x 2
 #  term   cnt
 #  <chr> <dbl>
 #1 apple    18
 #2 pears     4
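
One caveat when terms are used as patterns: `grepl()` treats the pattern as a regular expression by default, so a term containing metacharacters may match more than intended. Whether this played a role in the real data set is not known; the two terms below are hypothetical and only illustrate the difference that `fixed = TRUE` makes.

```r
# Hypothetical terms; "a.c" contains the regex metacharacter "."
terms <- c("a.c", "abc")

# Default: "." matches any single character, so both terms match
as_regex <- grepl("a.c", terms)
# → TRUE TRUE

# fixed = TRUE: the pattern is taken literally, so only "a.c" matches
as_literal <- grepl("a.c", terms, fixed = TRUE)
# → TRUE FALSE
```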

answered Sep 15, 2017 at 14:24

2 Comments

Karsten Sender Over a year ago

Hi, thanks for your idea and sorry for the delay. Please see my comment above which also applies to your solution. Thanks.

tushaR Over a year ago

@KarstenSender Can't help, unless you share a little bigger sample of data to work on and debug.
