Subsetting character string and returning string

Question

I was wondering if there was a clean solution using data.table to the following problem possibly using other packages such as stringr.

Suppose I have the following data table

DT <- data.table(name = c("Carlos", "Henry", "John"), 
    ID = c("US115115, CH123232, AB155, US4445", "CH112, BB53", "US57677777"))

This looks like:

     name                                ID
1: Carlos US115115, CH123232, AB155, US4445
2:  Henry                       CH112, BB53
3:   John                        US57677777

What I want to do is create another column, ID2, say, that takes the column ID and extracts only the "US identities" and creates a new column so that the final data table should look like:

     name                                ID              ID2
1: Carlos US115115, CH123232, AB155, US4445 US115115, US4445
2:  Henry                       CH112, BB53               NA
3:   John                        US57677777       US57677777

and where each element is a string. I've been able to code a version where it takes the first "US identity" and discards the rest, but I haven't been able to find a solution that handles multiplicity.

Any help would be greatly appreciated!

chinsoon12 · Accepted Answer · 2019-06-20 02:01:36Z

4

A possible approach:

DT[, ID2 := sapply(strsplit(ID, ","), 
    function(s) paste(s[grepl("\\s*US", s)], collapse=","))]

output:

     name                                ID              ID2
1: Carlos US115115, CH123232, AB155, US4445 US115115, US4445
2:  Henry                       CH112, BB53                 
3:   John                        US57677777       US57677777

edited Jun 20, 2019 at 2:01

answered Jun 20, 2019 at 0:48

chinsoon12

25.2k4 gold badges27 silver badges35 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

thelatemail Over a year ago

You could avoid the trimws if you change your regex to ",\\s*" , but this seems pretty good to me.

Ronak Shah Over a year ago

Also I think sapply(strsplit(DT$ID, ","), function(s) toString(grep("^US", trimws(s), value = TRUE))) would be a bit shorter ?

chinsoon12 Over a year ago

thanks, @latemail. have taken your comments into the solution

chinsoon12 Over a year ago

very nice, @RonakShah. You should post that as well, i find it useful as i learned that toString does string <- paste(x, collapse = ", ") as well

Ronak Shah · Accepted Answer · 2019-06-20 02:28:13Z

Here are few suggestions inspired from @thelatemail and @chinsoon12

DT$ID1 <- sapply(strsplit(DT$ID, ",\\s*"), function(x) 
                            toString(grep("^US", x, value = TRUE)))
DT
#     name                                ID              ID1
#1: Carlos US115115, CH123232, AB155, US4445 US115115, US4445
#2:  Henry                       CH112, BB53                 
#3:   John                        US57677777       US57677777

Above we are filtering using grep, we can also use startsWith to do the same

sapply(strsplit(DT$ID, ",\\s*"), function(x) toString(x[startsWith(x, "US")]))

You can incorporate both the above options in dplyr chain however another option using dplyr and tidyr would be using separate_rows which might be an overkill for this particular question. We can use str_subset from stringr which is the same as grep("^US", x, value = TRUE).

library(dplyr)
library(tidyr)

DT %>%
  separate_rows(ID) %>%
  group_by(name) %>%
  summarise(ID1 = toString(ID), 
            ID2 = toString(stringr::str_subset(ID, "^US")))

akrun · Accepted Answer · 2019-06-20 03:01:17Z

0

We can either use str_extract to extract the words that start with "US"

library(stringr)
DT[, ID2 := sapply(str_extract_all(ID, "\\bUS\\S*"), toString)]
DT
#     name                                ID               ID2
#1: Carlos US115115, CH123232, AB155, US4445 US115115,, US4445
#2:  Henry                       CH112, BB53                  
#3:   John                        US57677777        US57677777

Or using gsub

DT[, ID2 := gsub("(\\bUS\\S*)(*SKIP)(*F)|.", "", ID, perl = TRUE)]

Or using tidyverse

library(tidyverse)
DT %>%
    mutate(ID2 = str_extract_all(ID, "\\bUS\\S*") %>%
                   map(toString))

Or with base R using gregexpr

DT$ID2 <- sapply(regmatches(DT$ID, gregexpr("\\bUS\\S*", DT$ID)), toString)

edited Jun 20, 2019 at 3:01

answered Jun 20, 2019 at 2:42

akrun

891k38 gold badges590 silver badges700 bronze badges

Collectives™ on Stack Overflow

Subsetting character string and returning string

3 Answers 3

4 Comments

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

4 Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related