5

Assuming I have a data frame like

term     cnt
apple     10
apples     5
a apple on 3
blue pears 3
pears      1

How can I filter out every term in this column that contains another term as a partial match, so that the result is

term     cnt
apple     10
pears      1

I want to do this without specifying the terms to keep (apple|pears), in a self-referencing manner: each term is checked against the whole column, and any term that contains another term as a partial match is removed. The number of tokens per term is not limited, and matches do not have to respect word boundaries (i.e. "mapples" would be matched by "apple"). The result would be an inverted, generalized, dplyr-based version of

d[grep("^apple$|^pears$", d$term), ]
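For reference, here is a minimal sketch of the sample data and of what the hard-coded line above returns. The name `d` comes from that line; building the data frame with `stringsAsFactors = FALSE` is my assumption.

```r
# Sample data from the question (column names as shown above).
d <- data.frame(
  term = c("apple", "apples", "a apple on", "blue pears", "pears"),
  cnt  = c(10, 5, 3, 3, 1),
  stringsAsFactors = FALSE
)

# The anchors ^ and $ force a whole-string match, so only the
# terms that are exactly "apple" or "pears" are kept.
hard_coded <- d[grep("^apple$|^pears$", d$term), ]
hard_coded
```

This only works because the two terms to keep are spelled out in the pattern, which is exactly the part that should be derived from the column itself.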

Additionally, it would be interesting to use this removal of partial matches to get a cumulative sum per remaining term, e.g.

term     cnt
apple     18
pears      4

I couldn't get it to work with contains() or grep().

Thanks

2
  • Please check the updated answer! Commented Sep 15, 2017 at 14:00
  • @Karsten Sender Did you try my solution? Commented Sep 18, 2017 at 7:28

3 Answers

2

Hopefully this is the complete answer. It is not very idiomatic (as Pythonistas would say), but perhaps someone can suggest improvements:

> library(dplyr)
> 
> ssss <- data.frame(c('apple','red apple','apples','pears','blue pears'),c(15,3,10,4,3))
> 
> names(ssss) <- c('Fruit','Count')
> 
> ssss
       Fruit Count
1      apple    15
2  red apple     3
3     apples    10
4      pears     4
5 blue pears     3
> 
> root_list <- as.vector(ssss$Fruit[unlist(lapply(ssss$Fruit,function(x){length(grep(x,ssss$Fruit))>1}))])
> 
> 
> ssss %>% filter(Fruit %in% root_list)
  Fruit Count
1 apple    15
2 pears     4
> 
> data <- data.frame(lapply(root_list, function(x){y <- stringr::str_extract(ssss$Fruit,x); ifelse(is.na(y),'',y)}))
> 
> cols <- colnames(data)
> 
> #data$x <- do.call(paste0, c(data[cols]))
> #for (co in cols) data[co] <- NULL
> 
> ssss$Fruit <- do.call(paste0, c(data[cols]))
> 
> ssss %>% group_by(Fruit) %>% summarise(val = sum(Count))
# A tibble: 2 x 2
  Fruit   val
  <chr> <dbl>
1 apple    28
2 pears     7
> 

2 Comments

Hi, thanks for your post and sorry for the delay. I see your approach and that it works for the sample data; yet when applied to the real data set (read: about 10k terms), it exhibits strange behavior (e.g. terms such as "apple" turning into "applesapplered apple"), and the exponential increase in memory and runtime requirements makes it unviable. I will accept your answer but need to find a different way to get this to work. Thanks.
Sorry about that. If you could share a case where it doesn't work properly, we can try to generalize the code!
1

You can try using the tidyverse, something like this:

1. Define a list of the unique terms:

     k <- dft %>% 
          select(term) %>% 
          unlist() %>% 
          unique()

2. Split each term into tokens and keep the rows whose first token is itself a term:

    dft %>%
      separate(term, c('t1', 't2')) %>%
      rowwise() %>%
      mutate( g = sum(t1 %in% k)) %>%
      filter( g > 0) %>%
      select(t1, cnt)

which gives:

      t1   cnt
   <chr> <int>
1  apple    10
2 apples     5
3  pears     1

This still doesn't collapse "apples" into "apple", though. I will keep working on that.
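One way to cover the apple/apples case might be to skip the token split and compare every term against the whole column as a literal substring. The following is only a minimal base-R sketch under that assumption, using the question's sample data; `contains_other()` is a hypothetical helper name, not a function from any package.

```r
# Sample data from the question.
dft <- data.frame(
  term = c("apple", "apples", "a apple on", "blue pears", "pears"),
  cnt  = c(10, 5, 3, 3, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE for every term that contains some *other*
# term of the column as a literal substring.
contains_other <- function(terms) {
  hit <- logical(length(terms))
  for (o in unique(terms)) {
    # fixed = TRUE treats `o` as plain text, so characters such as "."
    # or "(" in real data are not interpreted as regex syntax.
    hit <- hit | (grepl(o, terms, fixed = TRUE) & terms != o)
  }
  hit
}

# Keep only the terms that do not contain any other term.
roots <- dft[!contains_other(dft$term), ]
roots
```

On the sample data this keeps `apple` and `pears`. It makes one vectorised `grepl()` call per unique term, so the work still grows quadratically with the number of terms.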

1 Comment

Hi, thanks for your idea. However, terms are not limited to two tokens; they may contain more. I clarified the example above.
0

Try this:

library(dplyr)

df=data.frame(term=c('apple','apples','a apple on','blue pears','pears'),cnt=c(10,5,3,3,1))

matches = sapply(df$term,function(t,terms){grepl(pattern = t,x = terms)},df$term)

sapply(1:ncol(matches),function(t,mat){
  tempmat = mat[,t]&mat[,-t]
  indices=unlist(apply(tempmat,MARGIN = 2,which))
  df$term[indices]<<-df$term[t]
 },matches)

df%>%group_by(term)%>%summarize(cnt=sum(cnt))

 # A tibble: 2 x 2
 #  term   cnt
 #  <chr> <dbl>
 #1 apple    18
 #2 pears     4  
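The same relabel-then-sum idea can also be sketched without `<<-`, by first finding the "root" terms and then mapping every term to a root it contains. This is an illustrative base-R sketch on the sample data, not a tested solution for large inputs. The names `has`, `roots` and `root_of` are made up for the example, and a term that contains several roots is simply assigned to the first one, which is an assumption the question does not settle.

```r
df <- data.frame(
  term = c("apple", "apples", "a apple on", "blue pears", "pears"),
  cnt  = c(10, 5, 3, 3, 1),
  stringsAsFactors = FALSE
)

# TRUE when `pattern` occurs literally inside `x` (no regex interpretation).
has <- function(pattern, x) grepl(pattern, x, fixed = TRUE)

u <- unique(df$term)

# A root is a term that contains no unique term other than itself.
roots <- u[vapply(u, function(t) {
  sum(vapply(u, function(o) has(o, t), logical(1))) == 1L
}, logical(1))]

# Map each term to the first root it contains (assumption: first match wins).
root_of <- vapply(df$term, function(t) {
  roots[vapply(roots, function(r) has(r, t), logical(1))][1]
}, character(1), USE.NAMES = FALSE)

# Sum the counts per root.
result <- aggregate(
  cnt ~ term,
  data = data.frame(term = root_of, cnt = df$cnt, stringsAsFactors = FALSE),
  FUN = sum
)
result
```

The final `aggregate()` call could equally be written with `group_by()` and `summarise()` from dplyr; base R is used here only to keep the sketch free of dependencies.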

2 Comments

Hi, thanks for your idea and sorry for the delay. Please see my comment above which also applies to your solution. Thanks.
@KarstenSender I can't help unless you share a slightly larger sample of the data to work on and debug.
