Splitting one column into multiple columns

Question

I have a huge dataset in which there is one column including several values for each subject (row). Here is a simplified sample dataframe:

data <- data.frame(subject = c(1:8), sex = c(1, 2, 2, 1, 2, 1, 1, 2), 
              age = c(35, 29, 31, 46, 64, 57, 49, 58), 
              v1 = c("2", "0", "3,5", "2 1", "A,4", "B,1,C", "A and B,3", "5, 6 A or C"))

> data
  subject sex age          v1
1       1   1  35           2
2       2   2  29           0
3       3   2  31         3,5  # separated by a comma
4       4   1  46         2 1  # separated by a blank space
5       5   2  64         A,4
6       6   1  57       B,1,C
7       7   1  49   A and B,3
8       8   2  58 5, 6 A or C

I first want to remove the letters (A, B, A and B, …) in the fourth column (v1), and then split the fourth column into multiple columns just like this:

  subject sex age x1 x2 x3 x4 x5 x6
1       1   1  35  0  1  0  0  0  0        
2       2   2  29  0  0  0  0  0  0
3       3   2  31  0  0  1  0  1  0  
4       4   1  46  1  1  0  0  0  0
5       5   2  64  0  0  0  1  0  0
6       6   1  57  1  0  0  0  0  0
7       7   1  49  0  0  1  0  0  0
8       8   2  58  0  0  0  0  1  1

where the 1st subject takes 1 at x2 because it takes 2 at v1 in the original dataset, the 3rd subject takes 1 at both x3 and x5 because it takes 3 and 5 at v1 in the original dataset, and so on.

I would appreciate any help on this question. Thanks a lot.

IRTFM · Accepted Answer · 2013-02-07 21:04:51Z

5

You can cbind this result to data[-4] and get what you need:

 0+t(sapply(as.character(data$v1), function(line) 
        sapply(1:6, function(x) x %in% unlist(strsplit(line, split="\\s|\\,"))) ))
#----------------
            [,1] [,2] [,3] [,4] [,5] [,6]
2              0    1    0    0    0    0
0              0    0    0    0    0    0
3,5            0    0    1    0    1    0
2 1            1    1    0    0    0    0
A,4            0    0    0    1    0    0
B,1,C          1    0    0    0    0    0
A and B,3      0    0    1    0    0    0
5, 6 A or C    0    0    0    0    1    1
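
To make the cbind step concrete, here is a minimal sketch that rebuilds the sample data from the question, computes the indicator matrix as above, and binds it to data[-4]. The names `ind` and `result`, and the x1..x6 column labels (borrowed from the desired output in the question), are illustrative choices and not part of the original answer.

```r
# Sample data from the question
data <- data.frame(subject = 1:8, sex = c(1, 2, 2, 1, 2, 1, 1, 2),
                   age = c(35, 29, 31, 46, 64, 57, 49, 58),
                   v1 = c("2", "0", "3,5", "2 1", "A,4", "B,1,C",
                          "A and B,3", "5, 6 A or C"))

# One row per subject, one 0/1 column per value 1..6
ind <- 0 + t(sapply(as.character(data$v1), function(line)
  sapply(1:6, function(x) x %in% unlist(strsplit(line, split = "\\s|,")))))

# Drop the v1 strings that sapply() kept as row names; label columns x1..x6
dimnames(ind) <- list(NULL, paste0("x", 1:6))

# Remove the original v1 column (column 4) and attach the indicators
result <- cbind(data[-4], ind)
```

Clearing the row names before binding keeps the original v1 strings from being carried into the result as row labels.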


Arun · Accepted Answer · 2013-02-07 21:20:16Z

4

One solution:

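# Split on any run of non-digits; empty pieces become NA under as.numeric().
# lapply() always returns a list, even if every row holds the same number of values.
r <- lapply(strsplit(as.character(data$v1), "[^0-9]+"), as.numeric)
m <- as.data.frame(t(sapply(r, function(x) {
        y <- rep(0, 6)          # one slot per possible value 1..6
        y[x[!is.na(x)]] <- 1    # flag the values present; index 0 is ignored
        y
     })))
data <- cbind(data[, c("subject", "sex", "age")], m)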

#   subject sex age V1 V2 V3 V4 V5 V6
# 1       1   1  35  0  1  0  0  0  0
# 2       2   2  29  0  0  0  0  0  0
# 3       3   2  31  0  0  1  0  1  0
# 4       4   1  46  1  1  0  0  0  0
# 5       5   2  64  0  0  0  1  0  0
# 6       6   1  57  1  0  0  0  0  0
# 7       7   1  49  0  0  1  0  0  0
# 8       8   2  58  0  0  0  0  1  1

Following DWin's awesome solution, m could be modified as:

m <- as.data.frame(t(sapply(r, function(x) {
        0 + 1:6 %in% x[!is.na(x)]
     })))

edited Feb 7, 2013 at 21:20

answered Feb 7, 2013 at 21:05

1 Comment

user187454 Over a year ago

Thanks for your update. I have an additional question. Here I gave only a simple dataset in which the maximum value of v1 is “6”. My real data is huge, so is there an easy way to determine the maximum value of v1 automatically instead of specifying it myself? Thanks.
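
One possible way to avoid hard-coding the 6, offered as a sketch and not as the answerer's own method: extract the numbers first, take their maximum, and size the indicator matrix from that. The names `nums`, `k`, and `m` are illustrative, and the code assumes v1 contains at least one number overall.

```r
v1 <- c("2", "0", "3,5", "2 1", "A,4", "B,1,C", "A and B,3", "5, 6 A or C")

# Extract every run of digits from each string (an empty vector if none)
nums <- lapply(regmatches(v1, gregexpr("[0-9]+", v1)), as.integer)

# Largest value found anywhere in the column; stands in for the fixed 6
k <- max(unlist(nums))

# One 0/1 row per subject; rbind() keeps an n-by-k matrix even when k is 1
m <- do.call(rbind, lapply(nums, function(x) as.numeric(seq_len(k) %in% x)))
colnames(m) <- paste0("x", seq_len(k))
```

Because whole runs of digits are matched, multi-digit values such as 10 are treated as a single number.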
