Split a column of strings into variable number of columns using data.table in R

Question

I want to analyze many years of of Quicken home finance records. I exported a file to qif and used bank2csv program to render a csv. Within Quicken one can use a category (eg automobile, tax), subcategories (eg automobile:service, automobile:fuel) and tags (eg self, spouse, son). bank2csv renders the categories:subcategories/tag as a concatenated string. I want to instead put the category in a category column, subcategory in a subcategory column and put whatever tags in the tag column. I saw a similar question but alas that worked bystrsplit then unlist and then indexing each element so that it could be written to the correct place by assignment. That will not work here since sometimes there is no tag and sometimes there is no subcategory. It is quite easy to split the string into a list and save that list in a column but how on earth does one assign the first element of the list to one column and the second element (if it exists) to a second column. Surely there is an elegant easy way.

simplified sample

library(data.table)
library(stringi)
dt <- data.table(category.tag=c("toys/David", "toys/David", "toys/James", "toys", "toys", "toys/James"), transaction=1:6)

How do I create a third and fourth column: category, tag. Some of tag would be NA

I can do the following but it does not get me very far. I need a way to specify the first or the second element of the resultant list (as opposed to the whole list)

dt[, category:= strsplit(x = category.tag, split = "/") ]

Arun · Accepted Answer · 2015-04-01 15:40:50Z

6

Just pushed two functions transpose() and tstrsplit() in data.table v1.9.5.

With this we can do:

require(data.table)
dt[, c("category", "tag") := tstrsplit(category.tag, "/", fixed=TRUE)]
#    category.tag transaction category   tag
# 1:   toys/David           1     toys David
# 2:   toys/David           2     toys David
# 3:   toys/James           3     toys James
# 4:         toys           4     toys    NA
# 5:         toys           5     toys    NA
# 6:   toys/James           6     toys James

tstrsplit is a wrapper for transpose(strsplit(as.character(x), ...)). And you can also pass fill=. to fill missing values with any other value than NA.

transpose() can also be used on lists, data frames and data tables.

edited Apr 1, 2015 at 15:40

answered Jan 27, 2015 at 19:24

Arun

119k28 gold badges290 silver badges396 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Rich Scriven · Accepted Answer · 2014-11-24 23:33:39Z

4

You could use cSplit

library(splitstackshape)
dt[, c("category", "tag") := cSplit(dt[,.(category.tag)], "category.tag", "/")]
dt
#    category.tag transaction category   tag
# 1:   toys/David           1     toys David
# 2:   toys/David           2     toys David
# 3:   toys/James           3     toys James
# 4:         toys           4     toys    NA
# 5:         toys           5     toys    NA
# 6:   toys/James           6     toys James

edited Nov 24, 2014 at 23:33

answered Nov 24, 2014 at 23:10

Rich Scriven

99.8k11 gold badges190 silver badges252 bronze badges

8 Comments

Farrel Over a year ago

Is cSplit an abbreviation of concat.split?

Rich Scriven Over a year ago

I believe it's similar, with some added functionality

Farrel Over a year ago

I love the one line easy elegance.

A5C1D2H2I1M1N2O1R2T1 Over a year ago

@Farrel, it's an alias, essentially. Several people were complaining that concat.split was too awkward to type, so when I was working on improving the function's efficiency, I changed the name to cSplit.

A5C1D2H2I1M1N2O1R2T1 Over a year ago

@Farrel, The package version for "splitstackshape" at Rdocumentation is 1.2, but the present version is 1.4.2. concat.split is essentially Gabor's read.table approach, while cSplit is a much faster implementation using strsplit. By version 1.6, I plan to have moved over to stri_split, which would be even faster. Here's a (functional) toy implementation of where cSplit should be soon.

|

G. Grothendieck · Accepted Answer · 2014-11-25 13:10:54Z

2

1) Try read.table

read <- function(x) read.table(text = x, sep  = "/", fill = TRUE, na.strings = "")
dt[, c("category", "tag") := read(category.tag)]

No extra packages are needed.

2) An alternative is to use separate in the tidyr package:

library(tidyr)
separate(dt, category.tag, c("category", "tag"), extra = "drop")

The above is with tidyr version 0.1.0.9000 from github. To install it ensure that the devtools R package is installed and issue the command: devtools::install_github("hadley/tidyr") .

Update: Incorporated Richard Scrivens' comment and minor improvements. Added tidyr solution.

edited Nov 25, 2014 at 13:10

answered Nov 24, 2014 at 23:14

G. Grothendieck

273k18 gold badges221 silver badges365 bronze badges

4 Comments

Farrel Over a year ago

Absolutely wonderful. Now how should I decide who to award the "accepted" to. You or @RichardScriven. I like being introduced to new packages that may make my life easier with other problems but then again I assume there is an overhead when I use new packages instead of the functions that are already loaded.

Farrel Over a year ago

getting Error in strsplit(value, sep, ...) : unused argument (extra = "drop")

Farrel Over a year ago

I am using version 0.1

Farrel Over a year ago

Let us continue this discussion in chat.

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer · 2014-11-25 01:22:05Z

1

Since you already have "stringi" loaded, you can also look at stri_split_fixed and the simplify argument:

setnames(cbind(dt, stri_split_fixed(dt$category.tag, "/", simplify = TRUE)), 
         c("V1", "V2"), c("category", "tag"))[]
#    category.tag transaction category   tag
# 1:   toys/David           1     toys David
# 2:   toys/David           2     toys David
# 3:   toys/James           3     toys James
# 4:         toys           4     toys    NA
# 5:         toys           5     toys    NA
# 6:   toys/James           6     toys James

Though I must admit I'm partial to cSplit :-)

answered Nov 25, 2014 at 1:22

A5C1D2H2I1M1N2O1R2T1

194k31 gold badges417 silver badges497 bronze badges

Collectives™ on Stack Overflow

Split a column of strings into variable number of columns using data.table in R

4 Answers 4

Comments

8 Comments

4 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

Comments

8 Comments

4 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related