4

I was wondering whether the most efficient way to extract text from a column was using the sub function in a data.table.

For example, we have the following data set:

test <- data.table(a = c("Hello world, this is Tom and I am a guy", "Hello world, this is Jack and I am a guy"))

and I would like to extract the names. One way to extract the names are using the substitution function

test[, Name := sub(".*? this is (.*?) and.*", "\\1", a)]

but I was wondering, is this the most efficient way?

1
  • Fyi, better, when asking a performance/efficiency question to provide a scalable example. In this case, the number of distinct values really matters, eg with @akrun's example we get timing near 0 when using by= ... system.time(test1[, Name := sub(".*? this is (.*?) and.*", "\\1", a), by=a]). An example: stackoverflow.com/q/4322219/1191259 Commented Jul 17, 2018 at 15:27

1 Answer 1

6

Combining with str_extract and str_remove reduces the time

library(stringr)
test1 <- test[rep(seq_len(.N), 1e6)]
test2 <- copy(test1)
system.time(test1[, Name := sub(".*? this is (.*?) and.*", "\\1", a)])    
#   user  system elapsed 
#  4.590   0.002   4.597 
system.time(test2[, Name :=  str_remove(str_extract(a, "this is \\w+"), 
                  "this is ")])
#   user  system elapsed 
#   2.259   0.076   2.339 

identical(test1$Name, test2$Name)
#[1] TRUE

library(microbenchmark)
f1 <- function()  sub(".*? this is (.*?) and.*", "\\1", test1$a)
f2 <- function() str_remove(str_extract(test1$a, "this is \\w+"), "this is ")

microbenchmark(f1(), f2(), unit = 'relative', times = 10L)  
#Unit: relative
#expr     min      lq     mean   median       uq      max neval
# f1() 2.12879 2.14592 2.145551 2.173798 2.188693 2.121836    10
# f2() 1.00000 1.00000 1.000000 1.000000 1.000000 1.000000    10
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.