31

How can I select columns of a data.table based on a regex? Consider a simple example as follows:

library(data.table)
mydt <- data.table(foo=c(1,2), bar=c(2,3), baz=c(3,4))

Is there a way to use columns of bar and baz from the datatable based on a regex? I know that the following solution works but if the table is much bigger and I would like to choose more variables this could easily get cumbersome.

mydt[, .(bar, baz)]

I would like to have something like matches() in dplyr::select() but only by reference.

6
  • 3
    You could do mydt[, grep(c("bar|baz"), names(mydt)), with = FALSE] but I think with creates a copy. Commented May 12, 2015 at 11:52
  • Might be helpful: stackoverflow.com/q/13383840 and stackoverflow.com/q/12603890 Commented May 12, 2015 at 11:55
  • @DavidArenburg Don't think it creates a copy. Commented May 12, 2015 at 12:00
  • 1
    @nicola I remember Arun mentioning it several times but never checked really Commented May 12, 2015 at 12:02
  • 1
    @DavidArenburg, that would be pretty strange behavior, no? Commented May 12, 2015 at 12:18

6 Answers 6

37

You can also try to use %like% from data.table package, which is a "convenience function for calling regexpr". However makes code more readable ;)

In this case, answering your question:

mydt[, .SD, .SDcols = names(mydt) %like% "bar|baz"]

As %like% returns a logical vector, whe can use the following to get every column except those which contain "foo":

mydt[, .SD, .SDcols = ! names(mydt) %like% "foo"]

where !negates the logical vector.

Sign up to request clarification or add additional context in comments.

Comments

19

Since data.table v1.12.0 (Jan 2019) you can do:

mydt[, .SD, .SDcols = patterns("bar|baz")]

From the official documentation ?data.table, on the .SDcols argument:

[...] you can filter columns to include in .SD based on their names according to regular expressions via .SDcols=patterns(regex1, regex2, ...). The included columns will be the intersection of the columns identified by each pattern; pattern unions can easily be specified with | in a regex. [...] You can also invert a pattern as usual with .SDcols = !patterns(...).

1 Comment

Thanks, this is indeed a long-awaited feature of data.table. I updated my benchmark with it. It seems to work slower but in many cases, it does not matter at all.
17

David's answer will work. But if your regex is long and you would rather it be done first, try:

cols <- grep("<regex pattern>", names(mydt), value=T)
mydt[, cols, with=FALSE]

It just depends on your preferences and needs. You can also assign the subsetted table to a chosen variable if you need the original intact.

Comments

17

UPDATE: I updated the comparison with @sindri_baldur's answer - using version 1.12.6. According to the results, patterns() is a handy shortcut, but if performance matters, one should stick with the .. or with = FALSE solution (see below).


Apparently, there is a new way of achieving this from version 1.10.2 onwards.

library(data.table)
cols <- grep("bar|baz", names(mydt), value = TRUE)
mydt[, ..cols]

It seems to work the fastest out of the posted solutions.

# Creating a large data.table with 100k rows, 32 columns
n <- 100000
foo_cols <- paste0("foo", 1:30)
big_dt <- data.table(bar = rnorm(n), baz = rnorm(n))
big_dt[, (foo_cols) := rnorm(n)]

# Methods
subsetting <- function(dt) {
    subset(dt, select = grep("bar|baz", names(dt)))
}

usingSD <- function(dt) {
    dt[, .SD, .SDcols = names(dt) %like% "bar|baz"]
}

usingWith <- function(dt) {
    cols <- grep("bar|baz", names(dt), value = TRUE)
    dt[, cols, with = FALSE]
}

usingDotDot <- function(dt) {
    cols <- grep("bar|baz", names(dt), value = TRUE)
    dt[, ..cols]
}

usingPatterns <- function(dt) {
  dt[, .SD, .SDcols = patterns("bar|baz")]
}

# Benchmark
microbenchmark(
    subsetting(big_dt), usingSD(big_dt), usingWith(big_dt), usingDotDot(big_dt), usingPatterns(big_dt),
    times = 5000
)

#Unit: microseconds
#                  expr  min   lq  mean median    uq    max neval
#    subsetting(big_dt)  430  759  1672   1309  1563  82934  5000
#       usingSD(big_dt)  547  951  1872   1461  1797  60357  5000
#     usingWith(big_dt)  278  496  1331   1112  1304  62656  5000
#   usingDotDot(big_dt)  289  483  1392   1117  1344  55878  5000
# usingPatterns(big_dt)  596 1019  1984   1518  1913 120331  5000

4 Comments

From the data.table news: "When j is a symbol prefixed with .. it will be looked up in calling scope and its value taken to be column names or numbers. (...) think one-level-up like the directory .. in all operating systems meaning the parent directory (...) It is experimental."
Fwiw, I think if speed matters for this, then the code must have some organizational problem. A bad but faster way on my system: microbenchmark(bad = `[.noquote`(big_dt, names(big_dt) %like% "bar|baz"), good = big_dt[, .SD, .SDcols = names(big_dt) %like% "bar|baz"]) Similar setDT(as.list(big_dt)[names(big_dt) %like% "bar|baz"])
you forgot the last one, usingPatterns, in the microbenchmark test :)
thanks @emilBeBri, it was only included in the results. I updated the code.
6

There is also a subset method for "data.table", so you can always use something like the following:

subset(mydt, select = grep("bar|baz", names(mydt)))
#    bar baz
# 1:   2   3
# 2:   3   4

It turns out that creating a startswith type of function for "data.table" is not very straightforward.

6 Comments

Hmm I wonder why OP prefers this over mydt[, grep(c("bar|baz"), names(mydt)), with = FALSE]
@DavidArenburg, it's marginally faster?
Have you tested it? Would be interesting find. I never used subset in my life, but if this is faster, I may give it a chance.
@DavidArenburg, tracemem doesn't show any copies with with.
So why would it be slower than subset than?
|
3

I suggest this one-liner code for readability and performance.

mydt[,names(mydt) %like% "bar|baz", with=F] 

Following @Janosdivenji's answer: See usingLikeon the last row

Unit: microseconds
                  expr     min        lq     mean    median        uq       max neval
    subsetting(big_dt) 370.582  977.2760 1194.875 1016.4340 1096.9285  25750.94  5000
       usingSD(big_dt) 554.330 1084.8530 1352.039 1133.4575 1226.9060 189905.39  5000
     usingWith(big_dt) 238.481  832.7505 1017.051  866.6515  927.8460  22717.83  5000
   usingDotDot(big_dt) 256.005  844.8770 1101.543  878.9935  936.6040 181855.43  5000
 usingPatterns(big_dt) 569.787 1128.0970 1411.510 1178.2895 1282.2265 177415.23  5000
     usingLike(big_dt) 262.868  852.5805 1059.466  887.3455  948.6665  23971.70  5000

1 Comment

this is the same as the usingWith solution - regarding the question it is irrelevant if the cols are chosen by using %like%or grep()

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.