Filter strings across multiple columns with data.table

Question

I have a dataset that looks something like this.

df <- tibble::tribble(
  ~name,           ~x,  ~y,              ~z,  
  "N/A",            1,   "testSmith",    -100, 
  "N A",            3,   "NOt available", -99,
  "test Smith",     NA,  "test Smith",    -98,
  "Not Available", -99, "25",             -101,
  "test Smith",    -98, "28",             -1)

I would like to create a new data.table that keeps all the rows containing the string "test" in any of the character columns.

The final dataset should look something like this

  name           x y              z
  <chr>      <dbl> <chr>      <dbl>
1 N/A            1 testSmith   -100
2 test Smith    NA test Smith   -98
3 test Smith   -98 28            -1

I could do this column by column like this

setDT(df)[name%like%"test"|y%like%"test"]

The problem with this approach is that I have hundreds of string variables, so I would like a more compact approach. I tried the following, but none of these work:

chvar <- keep(df,is.character)%>%names()
setDT(df)[chvar%like%"test"]#error
setDT(df)[(chvar)%like%"test"]#error
setDT(df)[.(chvar)%like%"test"]#empty dt

Does anyone know how I could do this in a quick and efficient way?

Thanks a lot for your help

Ronak Shah · Accepted Answer · 2020-12-22 14:48:16Z

5

In data.table you can do:

library(data.table)

cols <- c('name', 'y')
setDT(df)

df[df[, Reduce(`|`, lapply(.SD, `%like%`, "test")), .SDcols = cols]]

#         name   x          y    z
#1:        N/A   1  testSmith -100
#2: test Smith  NA test Smith  -98
#3: test Smith -98         28   -1

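In base R (run this on the original data frame or tibble; after setDT(df), df[cols] would be interpreted as a join rather than a column selection):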

subset(df, Reduce(`|`, lapply(df[cols], function(x) grepl('test', x))))

In dplyr:

library(dplyr)
df %>% filter(Reduce(`|`, across(all_of(cols), ~grepl('test', .x))))

lapply/across returns a list of logical vectors, one per column, each TRUE where 'test' is present and FALSE where it is not. Combining them with Reduce and | gives TRUE for a row only if at least one column matched; if every value in the row is FALSE, the result is FALSE. We then keep only the rows that have at least one TRUE.
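Since the question mentions hundreds of string columns, the column names do not have to be typed by hand. The following is a minimal sketch, assuming the example data from the question rebuilt directly as a data.table; the names dt, chr_cols, hits and result are illustrative and not part of the original answer.

```r
library(data.table)

# Example data from the question, rebuilt as a data.table (assumption: same values)
dt <- data.table(
  name = c("N/A", "N A", "test Smith", "Not Available", "test Smith"),
  x    = c(1, 3, NA, -99, -98),
  y    = c("testSmith", "NOt available", "test Smith", "25", "28"),
  z    = c(-100, -99, -98, -101, -1)
)

# Collect the names of all character columns instead of listing them manually
chr_cols <- names(dt)[vapply(dt, is.character, logical(1))]

# One logical vector per character column, OR-ed together row by row
hits <- dt[, Reduce(`|`, lapply(.SD, `%like%`, "test")), .SDcols = chr_cols]

# Keep only the rows where at least one character column matched
result <- dt[hits]
print(result)
```

With this data, chr_cols contains name and y, so the filter behaves like the hand-written cols vector above while scaling to any number of character columns.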

edited Dec 22, 2020 at 14:48

answered Dec 22, 2020 at 14:34

Ronak Shah


ThomasIsCoding · Answer · 2020-12-22 15:03:03Z

1

Another data.table option is using grepl with rowSums:

dt <- as.data.table(df)
dt[rowSums(dt[, lapply(.SD, grepl, pattern = "test"), .SDcols = c("name", "y")]) > 0]

which gives

         name   x          y    z
1:        N/A   1  testSmith -100
2: test Smith  NA test Smith  -98
3: test Smith -98         28   -1
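To see why the rowSums test selects the same rows as the Reduce approach, here is a small base R sketch. The flags data frame is a hypothetical stand-in for what lapply(.SD, grepl, pattern = "test") would produce for the name and y columns of the example data.

```r
# Hypothetical per-column match flags for `name` and `y` (one row per data row)
flags <- data.frame(
  name = c(FALSE, FALSE, TRUE, FALSE, TRUE),
  y    = c(TRUE,  FALSE, TRUE, FALSE, FALSE)
)

# TRUE counts as 1, so a positive row sum means at least one column matched
by_rowsums <- rowSums(flags) > 0

# The same row-wise "any" computed by OR-ing the columns together
by_reduce <- Reduce(`|`, flags)

# Both approaches flag the same rows
print(all(by_rowsums == by_reduce))
```

Both forms express "at least one match in the row"; which one to use is largely a matter of taste.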

answered Dec 22, 2020 at 15:03

ThomasIsCoding
