Select data.table columns with grep-like partial matching

Question

At the moment, when trying to select columns not directly by their name, but rather with grepl, one has to add an additional call to the data.table in the grepl command using colnames(). Is it possible to directly implement this functionality into data.table, so that one could use grepl directly and the column names are automatically taken from the current data.table.

dt <- structure(list(Global.Company.Key = c(1380L, 1380L, 1380L, 1380L, 1380L)
                         , Calendar.Data.Year.and.Quarter = structure(c(2000, 2000, 2000, 2000, 2000), class = "yearqtr")
                         , Current.Assets.Total = c(2218, 2218, 2218, 2218, 2218)
                         , DRILL_TYPE = c("U", "D", "V", "H", "U")
                         , DI.Oil.Prod.Quarter = c(18395.6792379842, 1301949.24041659, 235.311086392291, 27261.8049684835, 4719.27956989249)
                         , DI.Gas.Prod.Quarter = c(1600471.27107983, 4882347.22928982, 2611.60215053765, 9634.76418242493, 27648.276603634)), .Names = c("Global.Company.Key", "Calendar.Data.Year.and.Quarter", "Current.Assets.Total", "DRILL_TYPE", "DI.Oil.Prod.Quarter",  "DI.Gas.Prod.Quarter"), row.names = c(NA, -5L), class = c("data.table",  "data.frame"), sorted = c("Global.Company.Key",  "Calendar.Data.Year.and.Quarter"))

One example for a selection based with grepl:

dt[, grepl(glob2rx("Current.Assets*"), colnames(dt)), with = FALSE]

It would be nice, if something like this would be possible instead:

dt[, grepl(glob2rx("Current.Assets*")), with = FALSE]

glob2rx is just my way to get around regex, because it allows to use * as wildcard. — hannes101
– hannes101, Commented Mar 24, 2017 at 12:57
SO is for programming questions, while you're asking for a new feature. The FR is already posted, btw: github.com/Rdatatable/data.table/issues/1878 and github.com/Rdatatable/data.table/issues/1786 — Frank
– Frank, Commented Mar 24, 2017 at 13:12
I didn't find it, sorry. Should I remove the question or keep it as a reference to the feature request? — hannes101
– hannes101, Commented Mar 24, 2017 at 13:15
Up to you. I think it's fine either way. If you want to keep it as a ref, you could drop a comment on one of the issues like "please update / answer this SO post if this is implemented". — Frank
– Frank, Commented Mar 24, 2017 at 13:18

OliverHennhoefer · Accepted Answer · 2022-09-27 09:19:37Z

3

library(data.table)

dt <- data.table(CurrentAssets=rnorm(10),FixedAssets=rnorm(10), CurrentLiabilities=rnorm(10),Capital=rnorm(10))

dt

##    CurrentAssets FixedAssets CurrentLiabilities    Capital
## 1:   -1.27610992  -0.2989316         0.20688252  0.6504636
## 2:    0.01065576   1.3088539         1.22533006  0.7550024
## 3:    0.53308022  -1.3459419        -0.99627142 -0.7589336
## 4:    0.30737237  -0.4291044         2.20328357  0.2157515
## 5:   -1.37391990   0.8581097        -0.08161687  0.7067757
## 6:    0.28664468   0.2308479         0.38675487 -0.3467660
## 7:   -0.22902454   1.3365470         0.10128697  0.3246363
## 8:    0.05159736  -2.0702850         0.78404464 -1.7612696
## 9:    0.51817847  -0.8365225        -0.04778573  0.6170114
##10:    0.50859575   0.5683021        -0.13780167 -0.9243434

Just some random columns. The accounts don't balance. You can define the columns, then do ...

colnames <- c("CurrentAssets","FixedAssets", "CurrentLiabilities","Capital")
dt[,.SD,.SDcols=grep("Assets",colnames,value =TRUE)]

If you don't want to type colnames and value=TRUE all the time you can build your own function like the following.

mygrep <- function(x){
    colnames <- c("CurrentAssets","FixedAssets", "CurrentLiabilities","Capital")
    grep(x,colnames,value=TRUE)
}

Now the drawback is of mygrep is that you need to put the column name manually. An improvement would be to pass the data.table to the function.

mygrep <- function(x,dt){
    colnames <- colnames(dt)
    grep(x,colnames,value=TRUE)
}

dt[,.SD,.SDcols=mygrep("Assets",dt)]

Edit Just found another way to do the same thing using macro in R. You will need the package gtools to use macros.

We define a macro subdt.

library(gtools)
  subdt <- defmacro(dt,pattern,expr={
   dt[,.SD,.SDcols=grep(pattern,colnames(dt),value=TRUE)]
    })

then do

subdt(dt,"Assets")

macros are powerful as they write the code before evaluation.

edited Sep 27, 2022 at 9:19

OliverHennhoefer

1,0292 gold badges13 silver badges29 bronze badges

answered Mar 24, 2017 at 13:29

DJJ

2,5683 gold badges32 silver badges55 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

hannes101 Over a year ago

Made my code much more readable, although it's not the final solution, probably the best available.

gofraidh Over a year ago

I wonder if @DJJ could provide a brief explanation why (example on mtcars) mtcars_dt[,.SD,.SDcols = grep("m", colnames(mtcars_dt), value =TRUE)] needs to be used for data.table while for data.frame the following works mtcars[, grepl("m", colnames(mtcars))] (but not for data.table)? This could helps others understand further the differences in data.table and data.frame. I recently tackled the same issue and your answer was the solution but I still wonder how it works in data.table and why base approach had to be adapted.

DJJ Over a year ago

@gofraidh I would suggest posting a question on SO. There are people with more knowledge than me about data.table. If you have yet read it here is a good starting point. For data.table, a solution would be the following. tmp <- grepl("m", colnames(aa)); aa[,..tmp]

DJJ Over a year ago

Now, data.table has also the patterns key. aa<- as.data.table(mtcars); aa[,.SD,.SDcols=patterns("m")]. see ?data.table for more details.

Collectives™ on Stack Overflow

Select data.table columns with grep-like partial matching

1 Answer 1

4 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

4 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related