Parsing Delimited Data In a DataFrame Into Separate Columns in R

Question

I have a data frame that looks like this:

A  B  C
1  3  X1=7;X2=8;X3=9
2  4  X1=10;X2=11;X3=12
5  6  X1=13;X2=14

I would like to parse the C column into separate columns, like this:

A  B  X1  X2  X3
1  3  7   8   9
2  4  10  11  12
5  6  13  14  NA

How would one go about doing this in R?

MrFlick · Accepted Answer · 2014-06-11 00:58:12Z

3

First, here's the sample data in data.frame form

dd <- data.frame(
    A = c(1L, 2L, 5L),
    B = c(3L, 4L, 6L),
    C = c("X1=7;X2=8;X3=9", "X1=10;X2=11;X3=12", "X1=13;X2=14"),
    stringsAsFactors = FALSE
)

Now I define a small helper function that takes vectors like c("A=1","B=2") and changes them into named vectors like c(A="1", B="2").

namev<-function(x) {
    a<-strsplit(x,"=")
    setNames(sapply(a,'[',2), sapply(a,'[',1))
}
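As a quick check of what the helper returns, here is a minimal sketch that calls it on the example vector from the text. The variable name `pairs` is only illustrative. Looking up a name that is absent yields NA, which is what later fills in the missing X3.

```r
namev <- function(x) {
    a <- strsplit(x, "=")
    setNames(sapply(a, `[`, 2), sapply(a, `[`, 1))
}

pairs <- namev(c("A=1", "B=2"))   # illustrative input
pairs["B"]                         # the value stored under the name "B"
pairs[c("A", "Z")]                 # "Z" is absent, so that slot is NA
```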

and now I perform the transformations

#turn each row into a named vector
vv<-lapply(strsplit(dd$C,";"), namev)
#find list of all column names
nm<-unique(unlist(sapply(vv, names)))
#extract data from all rows for every column
nv<-do.call(rbind, lapply(vv, '[', nm))
#convert everything to numeric (optional)
class(nv)<-"numeric"
#rejoin with original data
cbind(dd[,-3], nv)

and that gives you

  A B X1 X2 X3
1 1 3  7  8  9
2 2 4 10 11 12
3 5 6 13 14 NA
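If you need this more than once, the steps above can be wrapped in a function. This is only a sketch: the name split_kv_column and its arguments are made up for illustration, it assumes every value is numeric, and it uses as.character() so that a factor column does not trip up strsplit().

```r
# Hypothetical helper: split a "k1=v1;k2=v2" column into one column per key.
split_kv_column <- function(df, col, pair_sep = ";", kv_sep = "=") {
    # as.character() guards against factor columns, which strsplit() rejects
    pairs <- strsplit(as.character(df[[col]]), pair_sep, fixed = TRUE)
    rows <- lapply(pairs, function(p) {
        kv <- strsplit(p, kv_sep, fixed = TRUE)
        setNames(vapply(kv, `[`, character(1), 2),
                 vapply(kv, `[`, character(1), 1))
    })
    keys <- unique(unlist(lapply(rows, names)))
    # indexing by name gives NA for keys a row does not have
    values <- do.call(rbind, lapply(rows, function(r) unname(r[keys])))
    colnames(values) <- keys
    mode(values) <- "numeric"   # assumes all values are numeric
    cbind(df[, setdiff(names(df), col), drop = FALSE], values)
}

dd <- data.frame(
    A = c(1L, 2L, 5L),
    B = c(3L, 4L, 6L),
    C = factor(c("X1=7;X2=8;X3=9", "X1=10;X2=11;X3=12", "X1=13;X2=14"))
)
res <- split_kv_column(dd, "C")
res
```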

edited Jun 11, 2014 at 0:58

answered Jun 11, 2014 at 0:38


2 Comments

Craig Over a year ago

Thank you for answering. However, when I try to run this code, the first strsplit fails with: Error in strsplit(dd$C, ";") : non-character argument

MrFlick Over a year ago

@Craig Sorry, when I retyped dd I forgot to add stringsAsFactors back in. I've updated my sample data.frame. Just make sure the C column is character (if it's a factor, use as.character() on it).

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer · 2014-06-27 17:18:21Z

My cSplit function makes solving problems like these fun. Here it is in action:

## Load some packages
library(data.table)
library(devtools) ## Just for source_gist, really
library(reshape2)

## Load `cSplit`
source_gist("https://gist.github.com/mrdwab/11380733")

First, split your values up and create a "long" dataset:

ddL <- cSplit(cSplit(dd, "C", ";", "long"), "C", "=")
ddL
#    A B C_1 C_2
# 1: 1 3  X1   7
# 2: 1 3  X2   8
# 3: 1 3  X3   9
# 4: 2 4  X1  10
# 5: 2 4  X2  11
# 6: 2 4  X3  12
# 7: 5 6  X1  13
# 8: 5 6  X2  14

Next, use dcast.data.table (or just dcast) to go from "long" to "wide":

dcast.data.table(ddL, A + B ~ C_1, value.var="C_2")
#    A B X1 X2 X3
# 1: 1 3  7  8  9
# 2: 2 4 10 11 12
# 3: 5 6 13 14 NA
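cSplit is also available in the splitstackshape package, so the gist does not have to be sourced. The following is a sketch of the same two steps, assuming splitstackshape and data.table are installed and that the split columns are still named C_1 and C_2 as in the output above:

```r
library(splitstackshape)  # provides cSplit
library(data.table)       # provides dcast for data.tables

dd <- data.frame(
    A = c(1L, 2L, 5L),
    B = c(3L, 4L, 6L),
    C = c("X1=7;X2=8;X3=9", "X1=10;X2=11;X3=12", "X1=13;X2=14"),
    stringsAsFactors = FALSE
)

# Split on ";" into rows ("long"), then on "=" into two columns
ddL <- cSplit(cSplit(dd, "C", ";", "long"), "C", "=")

# Reshape from long to wide, one column per key
wide <- dcast(ddL, A + B ~ C_1, value.var = "C_2")
wide
```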

Tyler Rinker · Accepted Answer · 2014-06-10 22:49:31Z

1

Here's one possible approach:

dat <- read.table(text="A  B  C
1  3  X1=7;X2=8;X3=9
2  4  X1=10;X2=11;X3=12
5  6  X1=13;X2=14", header=TRUE, stringsAsFactors = FALSE)


library(qdapTools)
dat_C <- strsplit(dat$C, ";")

dat_C2 <- sapply(dat_C, function(x) {
    y <- strsplit(x, "=")
    rep(sapply(y, "[", 1), as.numeric(sapply(y, "[", 2)))
})

data.frame(dat[, -3], mtabulate(dat_C2))

##   A B X1 X2 X3
## 1 1 3  7  8  9
## 2 2 4 10 11 12
## 3 5 6 13 14  0

EDIT: To obtain the NA values:

m <- mtabulate(dat_C2)
m[m==0] <- NA
data.frame(dat[, -3], m)
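One caveat with this count-based approach: a key whose value really is 0 is repeated zero times, so after tabulation it looks the same as a key that was never present. Here is a small base R sketch of that effect, using table() as a stand-in for mtabulate and made-up input:

```r
# Made-up row in which X1 is genuinely 0
y <- strsplit(c("X1=0", "X2=3"), "=")
keys <- sapply(y, "[", 1)
counts <- as.numeric(sapply(y, "[", 2))

expanded <- rep(keys, counts)   # X1 is repeated 0 times and disappears
expanded

# Tabulating over all known keys reports 0 for X1, exactly as it
# would for a key that was missing from the row
tab <- table(factor(expanded, levels = keys))
tab
```

Replacing every 0 with NA afterwards would therefore also turn real zeros into NA.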

edited Jun 10, 2014 at 22:49

answered Jun 10, 2014 at 22:27


6 Comments

Craig Over a year ago

Thank you for your answer. Is there any way of obtaining an NA for the missing X3 value?

Craig Over a year ago

Isn't that going to NA any actual zeros in the original data?

Tyler Rinker Over a year ago

What's the difference between 0 and NA?

Josh O'Brien Over a year ago

@TylerRinker -- NA typically means missing data; if I forgot to weigh a lab rat involved in an experiment, I'd have to put down its weight as NA, which is a totally different thing than saying its weight was 0.

Tyler Rinker Over a year ago

@JoshO'Brien But in the context of count data is there really a difference? I guess I should have phrased my question to the OP, "In your case what is the difference between 0 and NA?"


Rich Scriven · Accepted Answer · 2014-06-11 03:27:58Z

1

Here's a nice, somewhat hacky way to get you there.

## read your data
dat <- read.table(header = TRUE, text = "A  B  C
  1  3  X1=7;X2=8;X3=9
  2  4  X1=10;X2=11;X3=12
  5  6  X1=13;X2=14", stringsAsFactors = FALSE)
## ---
s <- strsplit(dat$C, ";|=")
xx <- unique(unlist(s)[grepl('[A-Z]', unlist(s))])
sap <- t(sapply(seq(s), function(i){
      wh <- which(!xx %in% s[[i]]); n <- suppressWarnings(as.numeric(s[[i]]))
      nn <- n[!is.na(n)]; if(length(wh)){ append(nn, NA, wh-1) } else { nn }
      })) ## see below for explanation
data.frame(dat[1:2], sap)
#   A B X1 X2 X3
# 1 1 3  7  8  9
# 2 2 4 10 11 12
# 3 5 6 13 14 NA

Basically, what's happening in sap is:

1. check which values are missing
2. change each list element of s to numeric
3. remove the NA values from (2)
4. insert NA into the correct position with append
5. transpose the result
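The append step is easiest to see in isolation. In this small sketch, nn stands for the numeric values found in the last row, and after plays the role of wh-1 in the code above:

```r
nn <- c(13, 14)                       # numeric values found in one row

at_end <- append(nn, NA, after = 2)   # X3 missing: NA goes in third place
at_end

in_middle <- append(nn, NA, after = 1)  # X2 missing: NA goes in second place
in_middle
```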

edited Jun 11, 2014 at 3:27

answered Jun 11, 2014 at 1:12


2 Comments

MrFlick Over a year ago

So this wouldn't work if, say, X2 were missing in the last row instead of X3 (i.e., "X1=13;X3=14"). You're assuming all the values are always in order. And you're also assuming that there are the same number of extra columns as rows in the original data? Adding a row seems to create an X4 variable.

Rich Scriven Over a year ago

@MrFlick, got it sorted out. Congrats on 10k, by the way. That was quick. :)
