Splitting text column into ragged multiple new columns in a data table in R

Question

I have a data table containing 20000+ rows and one column. The string in each row has a different number of words. I want to split the words and put each of them in a new column. I know how to do it one word at a time:

Data [ , Word1 := as.character(lapply(strsplit(as.character(Data$complaint), split=" "), "[", 1))]

(Data is my data table and complaint is the name of the column)

Obviously, this is not efficient, because each row has a different number of words and I would have to repeat this for every word position.

Could you please tell me about a more efficient way to do this?

MichaelChirico · Accepted Answer · 2016-02-23 19:41:04Z

12

Two functions, transpose() and tstrsplit(), are available in data.table since version 1.9.6 on CRAN.

With this we can do:

require(data.table)
setDT(tstrsplit(as.character(df$x), " ", fixed=TRUE))[]
#      V1       V2          V3  V4
# 1: This       is interesting  NA
# 2: This actually          is not

tstrsplit is a wrapper for transpose(strsplit(...)).
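
To attach the split words to the original table instead of building a new one, the list returned by tstrsplit() can be assigned by reference with `:=`. This is a minimal sketch, assuming a data.table with a character column named `complaint` as in the question; the sample rows, the name `DT`, and the `Word` column prefix are made up for illustration.

```r
library(data.table)

# Hypothetical sample data; `complaint` mirrors the column name in the question
DT <- data.table(complaint = c("This is interesting", "This actually is not"))

# Split once; the result is a list with one element per word position
pieces <- tstrsplit(DT$complaint, " ", fixed = TRUE)

# Build one column name per word position (the "Word" prefix is an assumption)
new_cols <- paste0("Word", seq_along(pieces))

# Add all new columns by reference; rows with fewer words are padded with NA
DT[, (new_cols) := pieces]

# The same list can be obtained from transpose(strsplit(...))
same <- identical(pieces, transpose(strsplit(DT$complaint, " ", fixed = TRUE)))
```

Because the column names are derived from the length of `pieces`, the same code works regardless of how many words the longest row contains.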

edited Feb 23, 2016 at 19:41

MichaelChirico

answered Jan 27, 2015 at 19:21

Arun

1 Comment

puslet88 Over a year ago

In some cases, simple strsplit seems quicker than the cSplit proposed in another answer. tstrsplit may be worth trying.

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer · 2014-11-13 09:08:51Z

10

Check out cSplit from my "splitstackshape" package. It works on either data.frames or data.tables (but always returns a data.table).

Assuming KFB's sample data is at least slightly representative of your actual data, you can try:

library(splitstackshape)
cSplit(df, "x", " ")
#     x_1      x_2         x_3 x_4
# 1: This       is interesting  NA
# 2: This actually          is not

Another (blazingly fast) option is to use stri_split_fixed with simplify = TRUE from "stringi" (which is likely to make its way into the "splitstackshape" code soon):

library(stringi)
stri_split_fixed(df$x, " ", simplify = TRUE)
#      [,1]   [,2]       [,3]          [,4] 
# [1,] "This" "is"       "interesting" NA   
# [2,] "This" "actually" "is"          "not"
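
Since stri_split_fixed() returns a character matrix rather than a table, a conversion step is needed to get the words next to the original column. The following is a sketch under assumptions: the sample data and the name `result` are illustrative, and it uses simplify = NA because, according to the current stringi documentation, simplify = TRUE pads short rows with empty strings while simplify = NA pads them with NA.

```r
library(stringi)
library(data.table)

# Hypothetical sample data with the same shape as the toy example
df <- data.frame(x = c("This is interesting", "This actually is not"),
                 stringsAsFactors = FALSE)

# simplify = NA returns a character matrix and pads short rows with NA
m <- stri_split_fixed(df$x, " ", simplify = NA)

# Convert the matrix to a data.table (columns V1, V2, ...) and bind it
# to the original column
result <- cbind(as.data.table(df), as.data.table(m))
```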

edited Nov 13, 2014 at 9:08

answered Nov 13, 2014 at 1:47

A5C1D2H2I1M1N2O1R2T1

jlhoward · Accepted Answer · 2014-11-13 16:49:20Z

3

Here is a solution based on rbind.fill.matrix(...) in the plyr package. On a dataset with 20,000 rows it runs in about 3.6 sec.

# create a sample dataset - you have this already
library(data.table)
words <- LETTERS[1:10]     # "words" are just letters in this example
set.seed(1)                # for reproducible example
w  <- sapply(1:2e4,function(i)paste(words[sample(1:10,sample(1:10,1))],collapse=" "))
dt <- data.table(complaint=w)   # column named as in the question
head(dt)
#          complaint
# 1:           D F H
# 2:           I J F
# 3:   A B I E C D H
# 4: J D G H B I A E
# 5:         A D G C
# 6:       F E B J I

# you start here...
library(plyr)
# turn each split row into a 1-row matrix, then stack them, filling gaps with NA
result <- rbind.fill.matrix(lapply(strsplit(dt$complaint, split=" "),matrix,nrow=1))
result <- as.data.table(result)
head(result)
#    1 2 3  4  5  6  7  8  9 10
# 1: D F H NA NA NA NA NA NA NA
# 2: I J F NA NA NA NA NA NA NA
# 3: A B I  E  C  D  H NA NA NA
# 4: J D G  H  B  I  A  E NA NA
# 5: A D G  C NA NA NA NA NA NA
# 6: F E B  J  I NA NA NA NA NA

EDIT: Added some benchmarking based on @Ananda's comment below.

f.rfm    <- function() as.data.table(rbind.fill.matrix(lapply(strsplit(dt$complaint, split=" "),matrix,nr=1)))
library(splitstackshape)
f.csplit <- function() cSplit(dt, "complaint", " ",type.convert=FALSE)
library(stringi)
f.sl2m   <- function() as.data.table(stri_list2matrix(strsplit(dt$complaint, split=" "), byrow = TRUE))
f.ssf    <- function() as.data.table(stri_split_fixed(dt$complaint, " ", simplify = TRUE))

all.equal(f.rfm(),f.csplit(),check.names=FALSE)
# [1] TRUE
all.equal(f.rfm(),f.sl2m(),check.names=FALSE)
# [1] TRUE
all.equal(f.rfm(),f.ssf(),check.names=FALSE)
# [1] TRUE
library(microbenchmark)
microbenchmark(f.rfm(),f.csplit(),f.sl2m(),f.ssf(),times=10)
# Unit: milliseconds
#        expr        min         lq     median        uq        max neval
#     f.rfm() 3566.17724 3589.31203 3606.93303 3665.4087 3719.32299    10
#  f.csplit()   98.05709  102.46456  104.51046  107.9588  117.26945    10
#    f.sl2m()   55.45527   55.58852   56.75406   58.9347   67.44523    10
#     f.ssf()   17.77499   17.98879   18.30831   18.4537   21.62161    10

So it looks like stri_split_fixed(...) is the winner.

edited Nov 13, 2014 at 16:49

answered Nov 13, 2014 at 0:49

jlhoward

4 Comments

A5C1D2H2I1M1N2O1R2T1 Over a year ago

I think it is about time to ditch rbind.fill.matrix. Have you seen stri_list2matrix yet from the "stringi" package? Try: stri_list2matrix(strsplit(dt$words, split=" "), byrow = TRUE). Your time will drop from 3+ seconds to < 0.2 seconds....

jlhoward Over a year ago

@AnandaMahto Yes. It seems to be faster than cSplit(...) but slower than stri_split_fixed(...). See benchmark results above.

A5C1D2H2I1M1N2O1R2T1 Over a year ago

@jlhoward, Hence my comment about the approach's inclusion in "splitstackshape" in the near future :-) I was just waiting for "stringi" 0.3-1 to be on CRAN, which it is now, so I need to rewrite a few of my existing functions....

A5C1D2H2I1M1N2O1R2T1 Over a year ago

+1 for the benchmarks :-) Also, you should get at least a little boost with strsplit if you add fixed = TRUE. Not sure how much it would affect the benchmarks though.

LeoRJorge · Accepted Answer · 2014-11-13 00:13:40Z

2

Some example data would be nice, but if I understand what you want, it is not possible to do properly in a data frame. Given that there are different numbers of words in each row, you will need a list. Even so, it is very simple to split the words in the whole object.

If you run strsplit(as.character(Data[[1]]), " ") you will get a list with each element corresponding to a row in your data frame (Data[[1]] extracts the first column as a vector for both a data.frame and a data.table). From that, there are several different alternatives to rearrange this object, but the best approach will depend on your objective.
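
One possible rearrangement, if a rectangular result padded with NA is acceptable, is to extend every element of the list to the length of the longest one and stack them into a matrix. This is only a base R sketch: the sample data and the names `word_list`, `padded`, and `word_matrix` are assumptions for illustration, and lengths() requires R 3.2.0 or later.

```r
# Hypothetical sample data; `Data` and `complaint` follow the question's naming
Data <- data.frame(complaint = c("This is interesting", "This actually is not"),
                   stringsAsFactors = FALSE)

# One list element (a character vector of words) per row
word_list <- strsplit(as.character(Data$complaint), " ", fixed = TRUE)

# Pad every vector to the longest length; `length<-` fills with NA
n_max <- max(lengths(word_list))
padded <- lapply(word_list, `length<-`, n_max)

# Stack the padded vectors into a character matrix, one row per input row
word_matrix <- do.call(rbind, padded)
```

The matrix can then be passed to as.data.frame() or data.table::as.data.table() if a table is needed.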

edited Nov 13, 2014 at 0:13

answered Nov 13, 2014 at 0:06

LeoRJorge

KFB · Accepted Answer · 2014-11-13 13:38:19Z

2

This works for both data.table and data.frame:

# toy data
df <- structure(list(x = structure(c(2L, 1L), .Label = c("This actually is not", 
"This is interesting"), class = "factor")), .Names = "x", row.names = c(NA, 
-2L), class = "data.frame")

#                      x
# 1  This is interesting
# 2 This actually is not

# the code
split_result <- strsplit(as.character(df$x), " ")
length_n <- sapply(split_result, length)
length_max <- seq_len(max(length_n))
as.data.frame(t(sapply(split_result, "[", i = length_max))) # Or as.data.table(...)

#     V1       V2          V3   V4
# 1 This       is interesting <NA>
# 2 This actually          is  not

edited Nov 13, 2014 at 13:38

answered Nov 13, 2014 at 0:44

KFB
