I have a data frame that looks like this:

A  B  C
1  3  X1=7;X2=8;X3=9
2  4  X1=10;X2=11;X3=12
5  6  X1=13;X2=14

I would like to parse the C column into separate columns, like this:

A  B  X1  X2  X3
1  3  7   8   9
2  4  10  11  12
5  6  13  14  NA

How would one go about doing this in R?

4 Answers

First, here's the sample data in data.frame form:

dd <- data.frame(
    A = c(1L, 2L, 5L),
    B = c(3L, 4L, 6L),
    C = c("X1=7;X2=8;X3=9",
          "X1=10;X2=11;X3=12", "X1=13;X2=14"),
    stringsAsFactors = FALSE
)

Now I define a small helper function that takes vectors like c("A=1","B=2") and changes them into named vectors like c(A="1", B="2").

namev<-function(x) {
    a<-strsplit(x,"=")
    setNames(sapply(a,'[',2), sapply(a,'[',1))
}
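As a quick check of the helper, here is a minimal sketch that calls it on the example vector from the sentence above. The name `res` is only for illustration.

```r
# Same helper as above, repeated so the snippet runs on its own
namev <- function(x) {
    a <- strsplit(x, "=")
    setNames(sapply(a, '[', 2), sapply(a, '[', 1))
}

# Each "name=value" string becomes one named element; values stay character
res <- namev(c("A=1", "B=2"))
print(res)
```

Note that the values are still strings at this point, which is why the conversion to numeric happens later.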

Now I perform the transformations:

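#turn each row into a named vector
vv <- lapply(strsplit(dd$C, ";"), namev)
#find list of all column names
#(lapply rather than sapply, so the result is a list even when every row has the same names)
nm <- unique(unlist(lapply(vv, names)))
#extract data from all rows for every column
nv <- do.call(rbind, lapply(vv, '[', nm))
#label the columns explicitly, in case the first row is missing some names
colnames(nv) <- nm
#convert everything to numeric (optional)
class(nv) <- "numeric"
#rejoin with original data
cbind(dd[,-3], nv)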

and that gives you

  A B X1 X2 X3
1 1 3  7  8  9
2 2 4 10 11 12
3 5 6 13 14 NA

2 Comments

Thank you for answering. However, when I try to run this code, the first strsplit fails with: Error in strsplit(dd$C, ";") : non-character argument.
@Craig Sorry, when I retyped dd I forgot to add stringsAsFactors back in. I've updated my sample data.frame. Just make sure the C column is a character vector (if it's a factor, use as.character() on it).
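To illustrate the fix from the comment above, here is a small sketch. It assumes a hypothetical data frame `dd_f` whose C column was stored as a factor, converts it with as.character(), and then splits it.

```r
# Hypothetical data frame where C ended up as a factor
dd_f <- data.frame(
    A = c(1L, 2L, 5L),
    B = c(3L, 4L, 6L),
    C = factor(c("X1=7;X2=8;X3=9", "X1=10;X2=11;X3=12", "X1=13;X2=14"))
)

# strsplit() only accepts character input, so convert the factor first
dd_f$C <- as.character(dd_f$C)
pieces <- strsplit(dd_f$C, ";")
print(pieces[[3]])
```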

My cSplit function makes solving problems like these fun. Here it is in action:

## Load some packages
library(data.table)
library(devtools) ## Just for source_gist, really
library(reshape2)

## Load `cSplit`
source_gist("https://gist.github.com/mrdwab/11380733")

First, split your values up and create a "long" dataset:

ddL <- cSplit(cSplit(dd, "C", ";", "long"), "C", "=")
ddL
#    A B C_1 C_2
# 1: 1 3  X1   7
# 2: 1 3  X2   8
# 3: 1 3  X3   9
# 4: 2 4  X1  10
# 5: 2 4  X2  11
# 6: 2 4  X3  12
# 7: 5 6  X1  13
# 8: 5 6  X2  14

Next, use dcast.data.table (or just dcast) to go from "long" to "wide":

dcast.data.table(ddL, A + B ~ C_1, value.var="C_2")
#    A B X1 X2 X3
# 1: 1 3  7  8  9
# 2: 2 4 10 11 12
# 3: 5 6 13 14 NA
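If you would rather not source the gist, the same "long, then wide" idea can be sketched in base R. This is not cSplit itself, only a rough stand-in; the names `parts`, `long`, `keys`, `wide` and `result` are made up for the example, and it assumes every piece has the form name=value with a numeric value.

```r
dd <- data.frame(
    A = c(1L, 2L, 5L),
    B = c(3L, 4L, 6L),
    C = c("X1=7;X2=8;X3=9", "X1=10;X2=11;X3=12", "X1=13;X2=14"),
    stringsAsFactors = FALSE
)

# "Long" form: one record per name=value piece, tagged with its source row
parts <- strsplit(dd$C, ";", fixed = TRUE)
long <- data.frame(
    row   = rep(seq_along(parts), lengths(parts)),
    key   = sub("=.*$", "", unlist(parts)),
    value = as.numeric(sub("^[^=]*=", "", unlist(parts))),
    stringsAsFactors = FALSE
)

# "Wide" form: start from an all-NA matrix and fill in the known cells
keys <- unique(long$key)
wide <- matrix(NA_real_, nrow = nrow(dd), ncol = length(keys),
               dimnames = list(NULL, keys))
wide[cbind(long$row, match(long$key, keys))] <- long$value

result <- cbind(dd[, c("A", "B")], wide)
print(result)
```

Because the cells are filled by name rather than by position, the order of the pieces within a row should not matter here.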

Here's one possible approach:

dat <- read.table(text="A  B  C
1  3  X1=7;X2=8;X3=9
2  4  X1=10;X2=11;X3=12
5  6  X1=13;X2=14", header=TRUE, stringsAsFactors = FALSE)


library(qdapTools)
dat_C <- strsplit(dat$C, ";")

dat_C2 <- lapply(dat_C, function(x) {
    y <- strsplit(x, "=")
    ## repeat each name as many times as its value, so that counting names recovers the value
    rep(sapply(y, "[", 1), as.numeric(sapply(y, "[", 2)))
})

data.frame(dat[, -3], mtabulate(dat_C2))

##   A B X1 X2 X3
## 1 1 3  7  8  9
## 2 2 4 10 11 12
## 3 5 6 13 14  0

EDIT: To obtain the NA values:

m <- mtabulate(dat_C2)
m[m==0] <- NA
data.frame(dat[, -3], m)

6 Comments

Thank you for your answer. Is there any way of obtaining an NA for the missing X3 value?
Isn't that going to NA any actual zeros in the original data?
What's the difference between 0 and NA?
@TylerRinker -- NA typically means missing data; if I forgot to weigh a lab rat involved in an experiment, I'd have to put down its weight as NA, which is a totally different thing than saying its weight was 0.
@JoshO'Brien But in the context of count data is there really a difference? I guess I should have phrased my question to the OP, "In your case what is the difference between 0 and NA?"
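The concern raised in these comments can be illustrated with a small base R sketch. It uses table() as a stand-in for the counting step rather than mtabulate, and the names `with_zero`, `without_x3` and `count_keys` are hypothetical. With the repeat-and-count trick, a row where X3 is genuinely 0 and a row where X3 is absent produce the same counts, so replacing 0 with NA afterwards cannot tell them apart.

```r
keys <- c("X1", "X2", "X3")

# Row "X1=13;X2=14;X3=0": X3 is repeated zero times, so it vanishes
with_zero <- rep(keys, c(13, 14, 0))
# Row "X1=13;X2=14": X3 was never there
without_x3 <- rep(c("X1", "X2"), c(13, 14))

# Count occurrences of every known name (stand-in for the tabulation step)
count_keys <- function(x) table(factor(x, levels = keys))

same_counts <- identical(count_keys(with_zero), count_keys(without_x3))
print(same_counts)
```

Whether that matters depends on whether a real 0 can occur in your data.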

Here's a nice, somewhat hacky way to get you there.

## read your data
dat <- read.table(header = TRUE, text = "A  B  C
  1  3  X1=7;X2=8;X3=9
  2  4  X1=10;X2=11;X3=12
  5  6  X1=13;X2=14", stringsAsFactors = FALSE)
## ---
## split on both separators, so names and values alternate within each row
s <- strsplit(dat$C, ";|=")
## distinct names (assumes names contain a capital letter and values do not)
xx <- unique(unlist(s)[grepl('[A-Z]', unlist(s))])
sap <- t(sapply(seq(s), function(i){
      wh <- which(!xx %in% s[[i]]); n <- suppressWarnings(as.numeric(s[[i]]))
      nn <- n[!is.na(n)]; if(length(wh)){ append(nn, NA, wh-1) } else { nn }
      })) ## see below for explanation
data.frame(dat[1:2], sap)
#   A B X1 X2 X3
# 1 1 3  7  8  9
# 2 2 4 10 11 12
# 3 5 6 13 14 NA

Basically what's happening in sap is

  1. check which values are missing
  2. change each list element of s to numeric
  3. remove the NA values from (2)
  4. insert NA into the correct position with append
  5. transpose the result
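Steps 1 to 4 can be traced on the third row alone. This is only a sketch of what happens inside the anonymous function; the variable `row3` and the hard-coded `xx` are assumptions standing in for `s[[3]]` and the computed names.

```r
xx   <- c("X1", "X2", "X3")          # all names seen across the data
row3 <- c("X1", "13", "X2", "14")    # what strsplit gives for "X1=13;X2=14"

wh <- which(!xx %in% row3)                 # 1. position of the missing name
n  <- suppressWarnings(as.numeric(row3))   # 2. names turn into NA, values stay
nn <- n[!is.na(n)]                         # 3. keep only the real values
filled <- append(nn, NA, wh - 1)           # 4. put NA back at the missing position
print(filled)
```

This works when a single name is missing and the names appear in order; it is not meant to cover rows with several gaps.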

2 Comments

So this wouldn't work if, say, X2 were missing in the last row instead of X3 (i.e., "X1=13;X3=14"). You're assuming the values are always in order. And you're also assuming that there are the same number of extra columns as rows in the original data? Adding a row seems to create an X4 variable.
@MrFlick, got it sorted out. Congrats on 10k, by the way. That was quick. :)
