Consider a sample dataset:

library(data.table)
dt <- data.table(data.frame(V1 = c("C1/R3","M2/R4")))
> dt
      V1
1: C1/R3
2: M2/R4

For each row of dt, I want to extract and concatenate the characters C, M, or R. For example,

dt[,V2 := stri_join_list(str_match_all(V1,"[CMR]"),sep="",collapse=""),by=seq_len(nrow(dt))]
> dt
         V1 V2
1:    C1/R3 CR
2:    M2/R4 MR

However, I have 42 million rows and the above code is not nearly efficient enough. Is there a way to do this without using row-wise operations? When I skip the by argument I get the entry CRMR for every row.

  • Aren't these vectorised functions, hence you don't need the by=? - dt[,V2 := stri_join_list(str_match_all(V1,"[CMR]"))] - I'm not sure how you are ending up with NA values, but you might want to include a row that does so in your example. Commented Oct 9, 2018 at 3:35
  • Can you please clarify, in your real data: 1) Are all the letters always uppercase? 2) Is the pattern always simply 1 letter followed by a single digit number followed by / followed by 1 letter followed by a single digit number - if not can you please specify lengths of repeats for letters and numbers, e.g. any lengths? Because efficiency is important to you including those details will help prevent overgeneralized (potentially slower) solutions or else avoid oversimplified solutions that (as a fair assumption) solve the problem correctly as you've presented it. Commented Oct 9, 2018 at 11:08
  • @thelatemail, doing it vectorized actually returns [CMR] from all rows combined. IE, both entries would be "CRMR". Commented Oct 9, 2018 at 14:26
  • @krads, updated example. Commented Oct 9, 2018 at 14:31
  • @hipHopMetropolisHastings - using the vectorised code I suggested works. It gives CR for the first row and MR for the second row as per your original dt before the update. You need to remove the collapse="" which is in your code (and not in mine). Commented Oct 9, 2018 at 21:12
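
To illustrate the point made in the comments about collapse="", here is a minimal base R sketch. It uses regmatches/gregexpr as a stand-in for the stringr/stringi calls in the question (an assumption made so that it runs without extra packages), and the names x, matches, per_row and all_rows are made up for the example. Joining within each element keeps the rows separate; joining across all elements gives one string, which is then repeated on every row.

```r
# Base R stand-in for the stringr/stringi calls discussed above.
x <- c("C1/R3", "M2/R4")

# One character vector of matches per input string.
matches <- regmatches(x, gregexpr("[CMR]", x))

# Joining within each element keeps the rows separate.
per_row <- vapply(matches, paste, character(1), collapse = "")

# Joining across all elements yields a single string; assigning it
# to a column would recycle it to every row.
all_rows <- paste(unlist(matches), collapse = "")

print(per_row)   # "CR" "MR"
print(all_rows)  # "CRMR"
```

The second result is the behaviour described in the question when by is dropped but collapse="" is kept.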

2 Answers

One option uses sub:

dt <- data.table(data.frame(V1 = c("C1/R3","M2/R4")))
dt$V2 <- sub("^([A-Z]+)[0-9]+/([A-Z]+)[0-9]+", "\\1\\2", dt$V1)
dt
     V1 V2
1 C1/R3 CR
2 M2/R4 MR
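
As a rough check of how this pattern behaves on inputs that do not follow the letter-digit/letter-digit layout, here is a small sketch. The extra values "AB12/C3" and "ZMR 14-B" are assumptions added for illustration (the second comes from a later comment), and replacing non-matching rows with NA is one possible choice, not something this answer prescribes.

```r
x <- c("C1/R3", "M2/R4", "AB12/C3", "ZMR 14-B")
pattern <- "^([A-Z]+)[0-9]+/([A-Z]+)[0-9]+"

# sub() leaves a string untouched when the pattern does not match.
res <- sub(pattern, "\\1\\2", x)
print(res)  # "CR" "MR" "ABC" "ZMR 14-B"

# Optionally mark the non-matching strings as missing.
res_na <- res
res_na[!grepl(pattern, x)] <- NA_character_
print(res_na)  # "CR" "MR" "ABC" NA
```

Both sub and grepl are vectorised, so no by= grouping should be needed when this is used inside dt[, V2 := ...].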

If, as you stated, you only wish to capture the letters C, M and R into a new column in your data.table, then the following should work efficiently by assigning in place:

dt[, V2 := gsub('[^CMR]', '', V1, perl=TRUE, useBytes=TRUE)]

The pattern [^CMR] matches any character that is not C M or R then we substitute for an empty string ''.

Per the help from ?gsub: "If you can make use of useBytes = TRUE, the strings will not be checked before matching, and the actual matching will be faster."

Finally, from what I have read, I believe perl=TRUE should be faster than omitting it. However, perhaps you could test both ways on your real data and reply with the results to confirm for us?
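
One way to run that comparison is sketched below. The sample vector x is synthetic (an assumption, since the real data is not available), so the timings only hint at what you might see on 42 million rows, and no particular ranking is claimed here.

```r
set.seed(1)
# Synthetic stand-in for the real column.
x <- sample(c("C1/R3", "M2/R4", "R7/C2"), 1e5, replace = TRUE)

t_default <- system.time(a <- gsub("[^CMR]", "", x))
t_perl    <- system.time(b <- gsub("[^CMR]", "", x, perl = TRUE))
t_bytes   <- system.time(d <- gsub("[^CMR]", "", x, perl = TRUE, useBytes = TRUE))

# All three variants should return the same strings.
stopifnot(all(a == b), all(b == d))

# Elapsed seconds for each variant.
print(c(default    = t_default[["elapsed"]],
        perl       = t_perl[["elapsed"]],
        perl_bytes = t_bytes[["elapsed"]]))
```

Repeating the timing a few times, or on a larger sample, would give a more reliable picture than a single run.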

1 Comment

So there are some extra values that include [CMR], but are not in the format described above. For example, there is a designation "ZMR 14-B", for which I'd like to capture "NA". Will update question later on with more complete example. I'd like to capture all "[CMR](?=\d)", but using "[^[CMR](?=\d)]" is only capturing the first instance.
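
A possible way to get what this comment asks for is sketched below, assuming the goal is: keep every C, M or R that is immediately followed by a digit, and return NA when there is none. Wrapping the lookahead expression in square brackets does not negate it as a whole, so this sketch extracts the wanted matches directly instead of deleting everything else. The names x, hits and v2 are made up for the example.

```r
x <- c("C1/R3", "M2/R4", "ZMR 14-B")

# All C/M/R characters that are directly followed by a digit.
hits <- regmatches(x, gregexpr("[CMR](?=\\d)", x, perl = TRUE))

# Join the matches of each string; strings without a match give "".
v2 <- vapply(hits, paste, character(1), collapse = "")

# Turn empty results into NA.
v2[!nzchar(v2)] <- NA_character_

print(v2)  # "CR" "MR" NA
```

This still loops over the rows internally in vapply, so it is worth timing on a sample before applying it to the full table.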
