Split a string into multiple columns of variable length using R

Question

I am looking for a faster way to to the following, I need to split a column of a data.table object containing strings into separate columns. The strings are of the format "name1=value1;name2=value2;". The strings can be split into a variable number of columns in which case those values will need to be filled with NA. For example I have this:

library(data.table)
dt <- data.table("foo"=c("name=john;id=1234;last=smith", "name=greg;id=5678", "last=picard", "last=jones;number=1234567890"))

I would want this:

name id last number john 1234 smith NA greg 5678 NA NA NA NA picard NA NA NA jones 1234567890

This will work but it is slow given the amount of data to parse and I'm wondering if there is a better way:

x <- strsplit(as.character(dt$foo), ";|=")
a <- function(x){
     name <- x[seq(1, length(x), 2)]
     value <- x[seq(2, length(x), 2)]
     tmp <- transpose(as.data.table(value))
     names(tmp) <- name
     return(tmp)
  }
x <- lapply(x, a)
x <- rbindlist(x, fill=TRUE)

mt1022 · Accepted Answer · 2017-05-07 13:22:39Z

3

We can try:

# split into different fields for each row
res <- lapply(strsplit(dt$foo, ';'), function(x){
    # split the the fields into two vectors of field names and field values
    res <- tstrsplit(x, '=')
    # make a list of field values with the field names as names of the list 
    setNames(as.list(res[[2]]), res[[1]])
})

rbindlist(res, fill = T)
#    name   id   last     number
# 1: john 1234  smith         NA
# 2: greg 5678     NA         NA
# 3:   NA   NA picard         NA
# 4:   NA   NA  jones 1234567890

dplyr::bind_rows(res)

# # A tibble: 4 × 4
#    name    id   last     number
#   <chr> <chr>  <chr>      <chr>
# 1  john  1234  smith       <NA>
# 2  greg  5678   <NA>       <NA>
# 3  <NA>  <NA> picard       <NA>
# 4  <NA>  <NA>  jones 1234567890

According to comment by David Arenburg, we can improve the speed by adding fixed = TRUE to both strsplit. I did a short benchmark with this data, adding fixed = TRUE will increase the speed by about one fold.

library(microbenchmark)

dt <- dt[sample.int(nrow(dt), 100, replace = T)]

microbenchmark(
    noFix = {
        res <- lapply(strsplit(dt$foo, ';'), function(x){
            res <- tstrsplit(x, '=')
            setNames(as.list(res[[2]]), res[[1]])
        })
    },
    Fixed = {
        res <- lapply(strsplit(dt$foo, ';', fixed = TRUE), function(x){
            res <- tstrsplit(x, '=', fixed = TRUE)
            setNames(as.list(res[[2]]), res[[1]])
        })
    },
    times = 1000
)
# Unit: milliseconds
#  expr      min       lq     mean   median       uq       max neval
# noFix 1.921947 1.999386 2.212511 2.064997 2.218706 11.290072  1000
# Fixed 1.026753 1.088712 1.226519 1.131899 1.219558  4.490796  1000

edited May 7, 2017 at 13:22

answered May 5, 2017 at 15:56

mt1022

17.4k6 gold badges53 silver badges77 bronze badges

Sign up to request clarification or add additional context in comments.

11 Comments

Frank Over a year ago

Prolly want type.convert=TRUE inside tstrsplit so that `number is num not char.

mt1022 Over a year ago

@Frank, I think this doesn't work because after tstrsplit, the number will be in same vector with name so that it will be coerced to character. We can convert to numeric after rbindlist if we want.

David Arenburg Over a year ago

It is documented really under grep and such. It's just does an exact match instead of running an regex engine which is much cheaper. With base functions such as grep and al., the speed difference could be even 10 fold. As a rule of thumb, whenever you do an exact match rather some expression, always add fixed = TRUE- it is more typing but usually pays off.

mt1022 Over a year ago

@DavidArenburg, the speed improvement is really about 10 fold. I played with this example and found fixed=T works great for single character, and the improvement will reduce to ~4 fold with longer fixed patterns.

David Arenburg Over a year ago

Yeah, it depends on the use case, but it certainly can't perform worse. It is actually also safer if you separator is a metacharacter and you forgot.

|

Collectives™ on Stack Overflow

Split a string into multiple columns of variable length using R

1 Answer 1

11 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

11 Comments

Your Answer

Sign up or log in

Post as a guest

Related