Select columns of data.table based on regex

Question

How can I select columns of a data.table based on a regex? Consider a simple example as follows:

library(data.table)
mydt <- data.table(foo=c(1,2), bar=c(2,3), baz=c(3,4))

Is there a way to use columns of bar and baz from the datatable based on a regex? I know that the following solution works but if the table is much bigger and I would like to choose more variables this could easily get cumbersome.

mydt[, .(bar, baz)]

I would like to have something like matches() in dplyr::select() but only by reference.

You could do mydt[, grep(c("bar|baz"), names(mydt)), with = FALSE] but I think with creates a copy. — David Arenburg
– David Arenburg, Commented May 12, 2015 at 11:52
Might be helpful: stackoverflow.com/q/13383840 and stackoverflow.com/q/12603890 — shadowtalker
– shadowtalker, Commented May 12, 2015 at 11:55
@nicola I remember Arun mentioning it several times but never checked really — David Arenburg
– David Arenburg, Commented May 12, 2015 at 12:02

dieguico · Accepted Answer · 2016-10-19 10:54:55Z

37

You can also try to use %like% from data.table package, which is a "convenience function for calling regexpr". However makes code more readable ;)

In this case, answering your question:

mydt[, .SD, .SDcols = names(mydt) %like% "bar|baz"]

As %like% returns a logical vector, whe can use the following to get every column except those which contain "foo":

mydt[, .SD, .SDcols = ! names(mydt) %like% "foo"]

where !negates the logical vector.

answered Oct 19, 2016 at 10:54

dieguico

1,3461 gold badge10 silver badges10 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Henrik · Accepted Answer · 2021-10-03 13:31:56Z

19

Since data.table v1.12.0 (Jan 2019) you can do:

mydt[, .SD, .SDcols = patterns("bar|baz")]

From the official documentation ?data.table, on the .SDcols argument:

[...] you can filter columns to include in .SD based on their names according to regular expressions via .SDcols=patterns(regex1, regex2, ...). The included columns will be the intersection of the columns identified by each pattern; pattern unions can easily be specified with | in a regex. [...] You can also invert a pattern as usual with .SDcols = !patterns(...).

edited Oct 3, 2021 at 13:31

Henrik

68k15 gold badges153 silver badges166 bronze badges

answered Dec 4, 2019 at 14:07

s_baldur

34.6k4 gold badges43 silver badges80 bronze badges

1 Comment

janosdivenyi Over a year ago

Thanks, this is indeed a long-awaited feature of data.table. I updated my benchmark with it. It seems to work slower but in many cases, it does not matter at all.

janosdivenyi · Accepted Answer · 2020-05-12 15:45:25Z

17

David's answer will work. But if your regex is long and you would rather it be done first, try:

cols <- grep("<regex pattern>", names(mydt), value=T)
mydt[, cols, with=FALSE]

It just depends on your preferences and needs. You can also assign the subsetted table to a chosen variable if you need the original intact.

edited May 12, 2020 at 15:45

janosdivenyi

3,2362 gold badges26 silver badges38 bronze badges

answered May 12, 2015 at 12:01

Pierre L

28.5k6 gold badges50 silver badges70 bronze badges

Comments

janosdivenyi · Accepted Answer · 2020-11-18 09:08:36Z

17

UPDATE: I updated the comparison with @sindri_baldur's answer - using version 1.12.6. According to the results, patterns() is a handy shortcut, but if performance matters, one should stick with the .. or with = FALSE solution (see below).

Apparently, there is a new way of achieving this from version 1.10.2 onwards.

library(data.table)
cols <- grep("bar|baz", names(mydt), value = TRUE)
mydt[, ..cols]

It seems to work the fastest out of the posted solutions.

# Creating a large data.table with 100k rows, 32 columns
n <- 100000
foo_cols <- paste0("foo", 1:30)
big_dt <- data.table(bar = rnorm(n), baz = rnorm(n))
big_dt[, (foo_cols) := rnorm(n)]

# Methods
subsetting <- function(dt) {
    subset(dt, select = grep("bar|baz", names(dt)))
}

usingSD <- function(dt) {
    dt[, .SD, .SDcols = names(dt) %like% "bar|baz"]
}

usingWith <- function(dt) {
    cols <- grep("bar|baz", names(dt), value = TRUE)
    dt[, cols, with = FALSE]
}

usingDotDot <- function(dt) {
    cols <- grep("bar|baz", names(dt), value = TRUE)
    dt[, ..cols]
}

usingPatterns <- function(dt) {
  dt[, .SD, .SDcols = patterns("bar|baz")]
}

# Benchmark
microbenchmark(
    subsetting(big_dt), usingSD(big_dt), usingWith(big_dt), usingDotDot(big_dt), usingPatterns(big_dt),
    times = 5000
)

#Unit: microseconds
#                  expr  min   lq  mean median    uq    max neval
#    subsetting(big_dt)  430  759  1672   1309  1563  82934  5000
#       usingSD(big_dt)  547  951  1872   1461  1797  60357  5000
#     usingWith(big_dt)  278  496  1331   1112  1304  62656  5000
#   usingDotDot(big_dt)  289  483  1392   1117  1344  55878  5000
# usingPatterns(big_dt)  596 1019  1984   1518  1913 120331  5000

edited Nov 18, 2020 at 9:08

answered Sep 21, 2017 at 12:31

janosdivenyi

3,2362 gold badges26 silver badges38 bronze badges

4 Comments

janosdivenyi Over a year ago

From the data.table news: "When j is a symbol prefixed with .. it will be looked up in calling scope and its value taken to be column names or numbers. (...) think one-level-up like the directory .. in all operating systems meaning the parent directory (...) It is experimental."

Frank 2 Over a year ago

Fwiw, I think if speed matters for this, then the code must have some organizational problem. A bad but faster way on my system:

microbenchmark(bad = `[.noquote`(big_dt, names(big_dt) %like% "bar|baz"), good = big_dt[, .SD, .SDcols = names(big_dt) %like% "bar|baz"])

Similar setDT(as.list(big_dt)[names(big_dt) %like% "bar|baz"])

emilBeBri Over a year ago

you forgot the last one, usingPatterns, in the microbenchmark test :)

janosdivenyi Over a year ago

thanks @emilBeBri, it was only included in the results. I updated the code.

Community · Accepted Answer · 2017-05-23 12:34:27Z

6

There is also a subset method for "data.table", so you can always use something like the following:

subset(mydt, select = grep("bar|baz", names(mydt)))
#    bar baz
# 1:   2   3
# 2:   3   4

It turns out that creating a startswith type of function for "data.table" is not very straightforward.

edited May 23, 2017 at 12:34

CommunityBot

11 silver badge

answered May 12, 2015 at 12:17

A5C1D2H2I1M1N2O1R2T1

194k31 gold badges417 silver badges497 bronze badges

6 Comments

David Arenburg Over a year ago

Hmm I wonder why OP prefers this over mydt[, grep(c("bar|baz"), names(mydt)), with = FALSE]

A5C1D2H2I1M1N2O1R2T1 Over a year ago

@DavidArenburg, it's marginally faster?

David Arenburg Over a year ago

Have you tested it? Would be interesting find. I never used subset in my life, but if this is faster, I may give it a chance.

A5C1D2H2I1M1N2O1R2T1 Over a year ago

@DavidArenburg, tracemem doesn't show any copies with with.

David Arenburg Over a year ago

So why would it be slower than subset than?

|

Matthew Son · Accepted Answer · 2020-05-09 19:25:21Z

3

I suggest this one-liner code for readability and performance.

mydt[,names(mydt) %like% "bar|baz", with=F]

Following @Janosdivenji's answer: See usingLikeon the last row

Unit: microseconds
                  expr     min        lq     mean    median        uq       max neval
    subsetting(big_dt) 370.582  977.2760 1194.875 1016.4340 1096.9285  25750.94  5000
       usingSD(big_dt) 554.330 1084.8530 1352.039 1133.4575 1226.9060 189905.39  5000
     usingWith(big_dt) 238.481  832.7505 1017.051  866.6515  927.8460  22717.83  5000
   usingDotDot(big_dt) 256.005  844.8770 1101.543  878.9935  936.6040 181855.43  5000
 usingPatterns(big_dt) 569.787 1128.0970 1411.510 1178.2895 1282.2265 177415.23  5000
     usingLike(big_dt) 262.868  852.5805 1059.466  887.3455  948.6665  23971.70  5000

answered May 9, 2020 at 19:25

Matthew Son

1,47513 silver badges42 bronze badges

1 Comment

janosdivenyi Over a year ago

this is the same as the usingWith solution - regarding the question it is irrelevant if the cols are chosen by using %like%or grep()

Collectives™ on Stack Overflow

Select columns of data.table based on regex

6 Answers 6

Comments

1 Comment

Comments

4 Comments

6 Comments

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

6 Answers 6

Comments

1 Comment

Comments

4 Comments

6 Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related