1

I was wondering if there was a clean solution using data.table to the following problem possibly using other packages such as stringr.

Suppose I have the following data table

DT <- data.table(name = c("Carlos", "Henry", "John"), 
    ID = c("US115115, CH123232, AB155, US4445", "CH112, BB53", "US57677777"))

This looks like:

     name                                ID
1: Carlos US115115, CH123232, AB155, US4445
2:  Henry                       CH112, BB53
3:   John                        US57677777

What I want to do is create another column, ID2, say, that takes the column ID and extracts only the "US identities" and creates a new column so that the final data table should look like:

     name                                ID              ID2
1: Carlos US115115, CH123232, AB155, US4445 US115115, US4445
2:  Henry                       CH112, BB53               NA
3:   John                        US57677777       US57677777                     

and where each element is a string. I've been able to code a version where it takes the first "US identity" and discards the rest, but I haven't been able to find a solution that handles multiplicity.

Any help would be greatly appreciated!

3 Answers 3

4

A possible approach:

DT[, ID2 := sapply(strsplit(ID, ","), 
    function(s) paste(s[grepl("\\s*US", s)], collapse=","))]

output:

     name                                ID              ID2
1: Carlos US115115, CH123232, AB155, US4445 US115115, US4445
2:  Henry                       CH112, BB53                 
3:   John                        US57677777       US57677777
Sign up to request clarification or add additional context in comments.

4 Comments

You could avoid the trimws if you change your regex to ",\\s*" , but this seems pretty good to me.
Also I think sapply(strsplit(DT$ID, ","), function(s) toString(grep("^US", trimws(s), value = TRUE))) would be a bit shorter ?
thanks, @latemail. have taken your comments into the solution
very nice, @RonakShah. You should post that as well, i find it useful as i learned that toString does string <- paste(x, collapse = ", ") as well
2

Here are few suggestions inspired from @thelatemail and @chinsoon12

DT$ID1 <- sapply(strsplit(DT$ID, ",\\s*"), function(x) 
                            toString(grep("^US", x, value = TRUE)))
DT
#     name                                ID              ID1
#1: Carlos US115115, CH123232, AB155, US4445 US115115, US4445
#2:  Henry                       CH112, BB53                 
#3:   John                        US57677777       US57677777

Above we are filtering using grep, we can also use startsWith to do the same

sapply(strsplit(DT$ID, ",\\s*"), function(x) toString(x[startsWith(x, "US")]))

You can incorporate both the above options in dplyr chain however another option using dplyr and tidyr would be using separate_rows which might be an overkill for this particular question. We can use str_subset from stringr which is the same as grep("^US", x, value = TRUE).

library(dplyr)
library(tidyr)

DT %>%
  separate_rows(ID) %>%
  group_by(name) %>%
  summarise(ID1 = toString(ID), 
            ID2 = toString(stringr::str_subset(ID, "^US")))

Comments

0

We can either use str_extract to extract the words that start with "US"

library(stringr)
DT[, ID2 := sapply(str_extract_all(ID, "\\bUS\\S*"), toString)]
DT
#     name                                ID               ID2
#1: Carlos US115115, CH123232, AB155, US4445 US115115,, US4445
#2:  Henry                       CH112, BB53                  
#3:   John                        US57677777        US57677777

Or using gsub

DT[, ID2 := gsub("(\\bUS\\S*)(*SKIP)(*F)|.", "", ID, perl = TRUE)]

Or using tidyverse

library(tidyverse)
DT %>%
    mutate(ID2 = str_extract_all(ID, "\\bUS\\S*") %>%
                   map(toString))

Or with base R using gregexpr

DT$ID2 <- sapply(regmatches(DT$ID, gregexpr("\\bUS\\S*", DT$ID)), toString)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.