6

I have a data table with 20,000+ rows and one column. The string in each row has a different number of words. I want to split the strings into words and put each word in its own column. I know how to do it one word at a time:

Data [ , Word1 := as.character(lapply(strsplit(as.character(Data$complaint), split=" "), "[", 1))]

(Data is my data table and complaint is the name of the column)

Obviously, this is not efficient, because each row has a different number of words and I would have to repeat it for every word position.

Could you please tell me about a more efficient way to do this?

5 Answers

12

Two functions, transpose() and tstrsplit(), have been available in data.table since version 1.9.6 on CRAN.

With this we can do:

require(data.table)
setDT(tstrsplit(as.character(df$x), " ", fixed=TRUE))[]
#      V1       V2          V3  V4
# 1: This       is interesting  NA
# 2: This actually          is not

tstrsplit is a wrapper for transpose(strsplit(...)).
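To make the wrapper relationship concrete, here is a minimal sketch that compares the two calls and then writes the pieces back into the table by reference. The sample data and the Word1, Word2, ... column names are assumptions for illustration; only the complaint column name comes from the question.

```r
library(data.table)

# Hypothetical sample data; `complaint` mirrors the column name in the question
Data <- data.table(complaint = c("This is interesting", "This actually is not"))

# Split once with base R so the pieces can be reused below
pieces <- strsplit(Data$complaint, " ", fixed = TRUE)

# tstrsplit() versus an explicit transpose(strsplit(...)) padded with NA
a <- tstrsplit(Data$complaint, " ", fixed = TRUE)
b <- transpose(pieces, fill = NA)
all.equal(a, b)

# Add one column per word position, by reference.
# The column names Word1, Word2, ... are illustrative only.
n_words  <- max(lengths(pieces))
new_cols <- paste0("Word", seq_len(n_words))
Data[, (new_cols) := tstrsplit(complaint, " ", fixed = TRUE)]
```

Rows with fewer words than the longest row get NA in the trailing columns, so the column count is driven by the longest string in the data.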


1 Comment

In some cases, plain strsplit seems quicker than the cSplit approach proposed in another answer. tstrsplit may be worth trying.
10

Check out cSplit from my "splitstackshape" package. It works on either data.frames or data.tables (but always returns a data.table).

Assuming KFB's sample data is at least slightly representative of your actual data, you can try:

library(splitstackshape)
cSplit(df, "x", " ")
#     x_1      x_2         x_3 x_4
# 1: This       is interesting  NA
# 2: This actually          is not

Another (blazing fast) option is stri_split_fixed with simplify = TRUE from "stringi", which will most likely make its way into the "splitstackshape" code soon:

library(stringi)
stri_split_fixed(df$x, " ", simplify = TRUE)
#      [,1]   [,2]       [,3]          [,4] 
# [1,] "This" "is"       "interesting" NA   
# [2,] "This" "actually" "is"          "not"

3

Here is a solution based on rbind.fill.matrix(...) in the plyr package. On a dataset with 20,000 rows it runs in about 3.6 sec.

# create a sample dataset - you have this already
library(data.table)
words <- LETTERS[1:10]     # "words" are just letters in this example
set.seed(1)                # for reproducible example
w  <- sapply(1:2e4,function(i)paste(words[sample(1:10,sample(1:10,1))],collapse=" "))
dt <- data.table(complaint=w)   # column named as in the question
head(dt)
#          complaint
# 1:           D F H
# 2:           I J F
# 3:   A B I E C D H
# 4: J D G H B I A E
# 5:         A D G C
# 6:       F E B J I

# you start here...
library(plyr)
result <- rbind.fill.matrix(lapply(strsplit(dt$complaint, split=" "),matrix,nr=1))
result <- as.data.table(result)
head(result)
#    1 2 3  4  5  6  7  8  9 10
# 1: D F H NA NA NA NA NA NA NA
# 2: I J F NA NA NA NA NA NA NA
# 3: A B I  E  C  D  H NA NA NA
# 4: J D G  H  B  I  A  E NA NA
# 5: A D G  C NA NA NA NA NA NA
# 6: F E B  J  I NA NA NA NA NA

EDIT: Added some benchmarking based on @Ananda's comment below.

f.rfm    <- function() as.data.table(rbind.fill.matrix(lapply(strsplit(dt$complaint, split=" "),matrix,nr=1)))
library(splitstackshape)
f.csplit <- function() cSplit(dt, "complaint", " ",type.convert=FALSE)
library(stringi)
f.sl2m   <- function() as.data.table(stri_list2matrix(strsplit(dt$complaint, split=" "), byrow = TRUE))
f.ssf    <- function() as.data.table(stri_split_fixed(dt$complaint, " ", simplify = TRUE))

all.equal(f.rfm(),f.csplit(),check.names=FALSE)
# [1] TRUE
all.equal(f.rfm(),f.sl2m(),check.names=FALSE)
# [1] TRUE
all.equal(f.rfm(),f.ssf(),check.names=FALSE)
# [1] TRUE
library(microbenchmark)
microbenchmark(f.rfm(),f.csplit(),f.sl2m(),f.ssf(),times=10)
# Unit: milliseconds
#        expr        min         lq     median        uq        max neval
#     f.rfm() 3566.17724 3589.31203 3606.93303 3665.4087 3719.32299    10
#  f.csplit()   98.05709  102.46456  104.51046  107.9588  117.26945    10
#    f.sl2m()   55.45527   55.58852   56.75406   58.9347   67.44523    10
#     f.ssf()   17.77499   17.98879   18.30831   18.4537   21.62161    10

So it looks like stri_split_fixed(...) is the winner.

4 Comments

I think it is about time to ditch rbind.fill.matrix. Have you seen stri_list2matrix yet from the "stringi" package? Try: stri_list2matrix(strsplit(dt$words, split=" "), byrow = TRUE). Your time will drop from 3+ seconds to < 0.2 seconds....
@AnandaMahto Yes. It seems to be faster than cSplit(...) but slower than stri_split_fixed(...). See benchmark results above.
@jihoward, Hence my comment about the approach's inclusion in "splitstackshape" in the near future :-) I was just waiting for "stringi" 0.3-1 to be on CRAN, which it is now, so I need to rewrite a few of my existing functions....
+1 for the benchmarks :-) Also, you should get at least a little boost with strsplit if you add fixed = TRUE. Not sure how much it would affect the benchmarks though.
2

Example data would be nice, but if I understand what you want, it cannot be done cleanly in a data frame. Given that there are different numbers of words in each row, you will need a list. Even so, it is very simple to split the words across the whole object.

If you run strsplit(as.character(Data[[1]]), " ") you will get a list with one element per row of your data frame. From there, there are several ways to rearrange this object, but the best approach will depend on your objective.
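As one possible rearrangement of that list, the sketch below keeps the ragged structure and reshapes it into a long table with one row per word. It uses base R only; the sample vector x and the column names row, position and word are assumptions for illustration.

```r
# Hypothetical sample data standing in for the first column of Data
x <- c("This is interesting", "This actually is not")

# One list element per row, each holding that row's words
words <- strsplit(x, " ", fixed = TRUE)

# Number of words in each row
n_words <- lengths(words)

# Long format: one row per word, tagged with its source row and position
long <- data.frame(
  row      = rep(seq_along(words), n_words),
  position = sequence(n_words),
  word     = unlist(words),
  stringsAsFactors = FALSE
)
```

A long layout like this avoids NA padding entirely, which may suit counting or filtering words better than a wide table does.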

2

This works for both data.table and data.frame:

# toy data
df <- structure(list(x = structure(c(2L, 1L), .Label = c("This actually is not", 
"This is interesting"), class = "factor")), .Names = "x", row.names = c(NA, 
-2L), class = "data.frame")

#                      x
# 1  This is interesting
# 2 This actually is not

# the code
split_result <- strsplit(as.character(df$x), " ")
length_n <- sapply(split_result, length)
length_max <- seq_len(max(length_n))
as.data.frame(t(sapply(split_result, "[", i = length_max))) # Or as.data.table(...)

#     V1       V2          V3   V4
# 1 This       is interesting <NA>
# 2 This actually          is  not
