Extracting text strings using data.table in R

Question

I have a data.table similar to the following.

Data

library(data.table)
DT <- structure(list(N = 1:6, VN = c("v1", "v3", "v6", "v7a", "v18", 
"v23"), T1 = c("bigby (wolf)", "white", "red (rose)", "piggy (straw)", 
"(curse) beast", "prince"), T2 = c("jack (bean)", "snow (dwarves)", 
"beard (blue)", "bhageera (jungle) mowgli (book)", "beauty", 
"glass (slipper)"), T3 = c("hk (34)", "VL (r45)", "tg (h5)", 
"tt (HG) (45)", "gh", "vlp"), Val = c(36, 25, 0.84, 12, 78, 258
)), .Names = c("N", "VN", "T1", "T2", "T3", "Val"), class = "data.frame", row.names = c(NA, 
-6L))

setDT(DT)

DT
   N  VN            T1                              T2           T3    Val
1: 1  v1  bigby (wolf)                     jack (bean)      hk (34)  36.00
2: 2  v3         white                  snow (dwarves)     VL (r45)  25.00
3: 3  v6    red (rose)                    beard (blue)      tg (h5)   0.84
4: 4 v7a piggy (straw) bhageera (jungle) mowgli (book) tt (HG) (45)  12.00
5: 5 v18 (curse) beast                          beauty           gh  78.00
6: 6 v23        prince                 glass (slipper)          vlp 258.00

I want to extract all the strings within parentheses from columns T1 and T2 into a new column C.

I can do this for single rows as follows.

Rowwise calculations

setDF(DT)
dtf <- c("T1", "T2")
paste(unique(unlist(regmatches(DT[4,dtf], gregexpr("(?=\\().*?(?<=\\))", DT[4,dtf], perl=T)))), collapse=" ")
[1] "(straw) (jungle) (book)"
paste(unique(unlist(regmatches(DT[3,dtf], gregexpr("(?=\\().*?(?<=\\))", DT[3,dtf], perl=T)))), collapse=" ")
[1] "(rose) (blue)"

I cannot get the same result with data.table.

Try with data.table

setDT(DT)
DT[, C := paste(unique(unlist(regmatches(get(dtf), gregexpr("(?=\\().*?(?<=\\))", get(dtf), perl=T)))), collapse=" ")]

How can I use data.table to get the desired result?

Desired result

out <- structure(list(N = 1:6, VN = c("v1", "v3", "v6", "v7a", "v18", 
"v23"), T1 = c("bigby (wolf)", "white", "red (rose)", "piggy (straw)", 
"(curse) beast", "prince"), T2 = c("jack (bean)", "snow (dwarves)", 
"beard (blue)", "bhageera (jungle) mowgli (book)", "beauty", 
"glass (slipper)"), T3 = c("hk (34)", "VL (r45)", "tg (h5)", 
"tt (HG) (45)", "gh", "vlp"), Val = c(36, 25, 0.84, 12, 78, 258
), C = c("(wolf) (bean)", "(dwarves)", "(rose) (blue)", "(straw) (jungle) (book)", 
"(curse)", "(slipper)")), .Names = c("N", "VN", "T1", "T2", "T3", 
"Val", "C"), class = "data.frame", row.names = c(NA, -6L))
out
  N  VN            T1                              T2           T3    Val                       C
1 1  v1  bigby (wolf)                     jack (bean)      hk (34)  36.00           (wolf) (bean)
2 2  v3         white                  snow (dwarves)     VL (r45)  25.00               (dwarves)
3 3  v6    red (rose)                    beard (blue)      tg (h5)   0.84           (rose) (blue)
4 4 v7a piggy (straw) bhageera (jungle) mowgli (book) tt (HG) (45)  12.00 (straw) (jungle) (book)
5 5 v18 (curse) beast                          beauty           gh  78.00                 (curse)
6 6 v23        prince                 glass (slipper)          vlp 258.00               (slipper)

shadow · Accepted Answer · 2015-05-12 08:11:12Z


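You can use by and .SDcols to do this. Without by, the expression is evaluated once over the whole columns, so the matches from all rows are pasted into a single string. Grouping by N, which is unique per row, evaluates it once per row, and .SDcols restricts .SD to the columns named in dtf.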

setDT(DT)
dtf <- c("T1", "T2")
DT[, C := paste(unique(unlist(regmatches(.SD, gregexpr("(?=\\().*?(?<=\\))", .SD, perl = TRUE)))),
                collapse=" "), 
   by = N, 
   .SDcols = dtf]
DT
##    N  VN            T1                              T2           T3    Val                       C
## 1: 1  v1  bigby (wolf)                     jack (bean)      hk (34)  36.00           (wolf) (bean)
## 2: 2  v3         white                  snow (dwarves)     VL (r45)  25.00               (dwarves)
## 3: 3  v6    red (rose)                    beard (blue)      tg (h5)   0.84           (rose) (blue)
## 4: 4 v7a piggy (straw) bhageera (jungle) mowgli (book) tt (HG) (45)  12.00 (straw) (jungle) (book)
## 5: 5 v18 (curse) beast                          beauty           gh  78.00                 (curse)
## 6: 6 v23        prince                 glass (slipper)          vlp 258.00               (slipper)
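
The grouping above relies on N being unique per row. If a table has no such column, one option is to group by a row index instead. The following is only a sketch under that assumption: the helper name extract_parens is made up for illustration, the table is a cut-down stand-in for the question's data, and the regular expression is the one from the question.

```r
library(data.table)

# Cut-down stand-in for the question's table, without a unique id column
DT <- data.table(
  T1 = c("bigby (wolf)", "white", "piggy (straw)"),
  T2 = c("jack (bean)", "snow (dwarves)", "bhageera (jungle) mowgli (book)")
)
dtf <- c("T1", "T2")

# Hypothetical helper: collect every "(...)" group found in one row
extract_parens <- function(row) {
  x <- unlist(row, use.names = FALSE)   # one-row .SD -> character vector
  hits <- unlist(regmatches(x, gregexpr("(?=\\().*?(?<=\\))", x, perl = TRUE)))
  paste(unique(hits), collapse = " ")
}

# seq_len(nrow(DT)) puts every row in its own group
DT[, C := extract_parens(.SD), by = seq_len(nrow(DT)), .SDcols = dtf]
print(DT)
```

Wrapping the expression in a function also keeps the j argument short, which makes it easier to see that by and .SDcols do the per-row work.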


1 Comment

MichaelChirico Over a year ago

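If there's a large number of rows with no parentheses in either T1 or T2, you may want to subset on the rows that have them first, along the lines of: DT[grepl("(", T1, fixed = TRUE) | grepl("(", T2, fixed = TRUE), C := ...]. Note that fixed = TRUE (or an escaped "\\(") is needed, because a bare "(" is not a valid regular expression.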
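
A minimal sketch of the subsetting idea from this comment, combined with the accepted answer. The name has_paren and the three-row table are assumptions made for illustration; they are not part of the original posts.

```r
library(data.table)

# Small stand-in table; row 2 has no parentheses in T1 or T2
DT <- data.table(
  N  = 1:3,
  T1 = c("bigby (wolf)", "white", "prince"),
  T2 = c("jack (bean)", "beauty", "glass (slipper)")
)
dtf <- c("T1", "T2")

# TRUE for rows where at least one of the dtf columns contains "("
has_paren <- DT[, Reduce(`|`, lapply(.SD, grepl, pattern = "(", fixed = TRUE)),
                .SDcols = dtf]

# Run the per-row extraction only on the rows that can produce a match
DT[has_paren,
   C := {
     x <- unlist(.SD, use.names = FALSE)
     paste(unique(unlist(regmatches(x, gregexpr("(?=\\().*?(?<=\\))", x, perl = TRUE)))),
           collapse = " ")
   },
   by = N, .SDcols = dtf]
print(DT)
```

One difference from the accepted answer: rows skipped by the subset end up with NA in C, whereas the unsubsetted version gives them an empty string.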
