How to efficiently match and combine strings in data.table

Question

Consider a sample dataset:

dt <- data.table(data.frame(V1 = c("C1/R3","M2/R4")))
> dt
      V1
1: C1/R3
2: M2/R4

For each row of dt, I want to extract the characters C, M, or R and concatenate them. For example:

dt[, V2 := stri_join_list(str_match_all(V1, "[CMR]"), sep = "", collapse = ""), by = seq_len(nrow(dt))]
> dt
         V1 V2
1:    C1/R3 CR
2:    M2/R4 MR

However, I have 42 million rows and the above code is not nearly efficient enough. Is there a way to do this without using row-wise operations? When I skip the by argument, I get the entry CRMR for each row.

Aren't these vectorised functions, hence you don't need the by=? - dt[,V2 := stri_join_list(str_match_all(V1,"[CMR]"))] - I'm not sure how you are ending up with NA values, but you might want to include a row that does so in your example.
– thelatemail, Commented Oct 9, 2018 at 3:35
Can you please clarify, in your real data: 1) Are all the letters always uppercase? 2) Is the pattern always simply 1 letter followed by a single-digit number followed by / followed by 1 letter followed by a single-digit number? If not, can you please specify the lengths of the repeats for letters and numbers, e.g. any lengths? Because efficiency is important to you, including those details will help prevent overgeneralized (potentially slower) solutions and avoid oversimplified solutions that (as a fair assumption) solve the problem correctly as you've presented it.
– krads, Commented Oct 9, 2018 at 11:08
@thelatemail, doing it vectorized actually returns [CMR] from all rows combined, i.e., both entries would be "CRMR".
– hipHopMetropolisHastings, Commented Oct 9, 2018 at 14:26
@hipHopMetropolisHastings - using the vectorised code I suggested works. It gives CR for the first row and MR for the second row as per your original dt before the update. You need to remove the collapse="" which is in your code (and not in mine).
– thelatemail, Commented Oct 9, 2018 at 21:12
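
The sep versus collapse point in the comments above can be checked on the two sample strings. This is only a minimal sketch: it assumes the stringi package is installed, and it swaps in stri_extract_all_regex for the str_match_all call from the question so that a single package is needed.

```r
library(stringi)

x <- c("C1/R3", "M2/R4")

# One character vector of matches per input string
matches <- stri_extract_all_regex(x, "[CMR]")

# sep joins the pieces inside each list element: one result per row
per_row <- stri_join_list(matches, sep = "")

# collapse additionally joins the per-row results into a single string,
# which := then recycles to every row
collapsed <- stri_join_list(matches, sep = "", collapse = "")

print(per_row)    # "CR" "MR"
print(collapsed)  # "CRMR"
```

Dropping collapse keeps one value per row, so the call needs no by argument.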

Tim Biegeleisen · Accepted Answer · 2018-10-09 02:00:39Z

1

One option uses sub:

dt <- data.table(data.frame(V1 = c("C1/R3","M2/R4")))
dt$V2 <- sub("^([A-Z]+)[0-9]+/([A-Z]+)[0-9]+", "\\1\\2", dt$V1)
dt
     V1 V2
1 C1/R3 CR
2 M2/R4 MR
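
As a rough check of how this pattern behaves on other input, the sketch below runs it on two extra strings. "CM12/R345" is a made-up value, and "ZMR 14-B" is the designation the asker mentions in a later comment. sub returns a string unchanged when the pattern does not match, so rows in a different format pass through as-is rather than becoming NA.

```r
pattern <- "^([A-Z]+)[0-9]+/([A-Z]+)[0-9]+"
x <- c("C1/R3", "M2/R4", "CM12/R345", "ZMR 14-B")

# Keep only the two captured letter groups
res <- sub(pattern, "\\1\\2", x)
print(res)  # "CR" "MR" "CMR" "ZMR 14-B"

# One possible way to flag non-matching rows as NA instead
res_na <- ifelse(grepl(pattern, x), res, NA_character_)
print(res_na)  # "CR" "MR" "CMR" NA
```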


krads · Accepted Answer · 2018-10-09 11:39:28Z

0

If, as you stated, you only wish to capture the letters C, M and R into a new column in your data.table, then the following should work efficiently by assigning in place:

dt[, V2 := gsub('[^CMR]', '', V1, perl=TRUE, useBytes=TRUE)]

The pattern [^CMR] matches any character that is not C, M, or R; gsub then replaces each such character with the empty string ''.

Per the help from ?gsub: "If you can make use of useBytes = TRUE, the strings will not be checked before matching, and the actual matching will be faster."

Finally, from what I have read, I believe using perl=TRUE should be faster than omitting it. However, perhaps you could test both ways on your real data and reply with the results to confirm?
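
One way to run that comparison is sketched below on synthetic data. The vector x is a made-up stand-in shaped like the sample strings, not the real 42 million rows, and the timings will vary by machine and R version, so treat it only as a starting point.

```r
set.seed(1)

# Synthetic stand-in for the real column: 1e5 strings shaped like "C1/R3"
n <- 1e5
x <- paste0(sample(c("C", "M", "R"), n, TRUE), sample(0:9, n, TRUE), "/",
            sample(c("C", "M", "R"), n, TRUE), sample(0:9, n, TRUE))

# Time the same substitution with the three option combinations
t_default <- system.time(r1 <- gsub("[^CMR]", "", x))
t_perl    <- system.time(r2 <- gsub("[^CMR]", "", x, perl = TRUE))
t_bytes   <- system.time(r3 <- gsub("[^CMR]", "", x, perl = TRUE, useBytes = TRUE))

# All three variants should give the same answer on this ASCII input
stopifnot(identical(r1, r2), identical(r2, r3))

# Elapsed seconds for each variant
print(c(default    = t_default[["elapsed"]],
        perl       = t_perl[["elapsed"]],
        perl_bytes = t_bytes[["elapsed"]]))
```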

1 Comment

hipHopMetropolisHastings Over a year ago

So there are some extra values that include [CMR] but are not in the format described above. For example, there is a designation "ZMR 14-B", for which I'd like to capture "NA". I will update the question later with a more complete example. I'd like to capture all "[CMR](?=\d)", but using "[^[CMR](?=\d)]" only captures the first instance.
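
A base R sketch of the lookahead idea from this comment follows. It assumes the goal is to keep every C, M, or R that is immediately followed by a digit and to return NA when a string has none. The vapply step is an R-level loop, so its speed on 42 million rows is untested here.

```r
x <- c("C1/R3", "M2/R4", "ZMR 14-B")

# Every C, M, or R immediately followed by a digit
# (the lookahead (?=...) requires perl = TRUE)
hits <- regmatches(x, gregexpr("[CMR](?=\\d)", x, perl = TRUE))

# Paste the hits for each string; a string with no hits becomes NA
V2 <- vapply(
  hits,
  function(h) if (length(h)) paste(h, collapse = "") else NA_character_,
  character(1)
)
print(V2)  # "CR" "MR" NA
```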
