7

I have a large dataset in which one column holds several values for each subject (row). Here is a simplified sample data frame:

data <- data.frame(subject = c(1:8), sex = c(1, 2, 2, 1, 2, 1, 1, 2), 
              age = c(35, 29, 31, 46, 64, 57, 49, 58), 
              v1 = c("2", "0", "3,5", "2 1", "A,4", "B,1,C", "A and B,3", "5, 6 A or C"))

> data
  subject sex age          v1
1       1   1  35           2
2       2   2  29           0
3       3   2  31         3,5  # separated by a comma
4       4   1  46         2 1  # separated by a blank space
5       5   2  64         A,4
6       6   1  57       B,1,C
7       7   1  49   A and B,3
8       8   2  58 5, 6 A or C

I first want to remove the letters (A, B, A and B, …) in the fourth column (v1), and then split the fourth column into multiple columns just like this:

  subject sex age x1 x2 x3 x4 x5 x6
1       1   1  35  0  1  0  0  0  0        
2       2   2  29  0  0  0  0  0  0
3       3   2  31  0  0  1  0  1  0  
4       4   1  46  1  1  0  0  0  0
5       5   2  64  0  0  0  1  0  0
6       6   1  57  1  0  0  0  0  0
7       7   1  49  0  0  1  0  0  0
8       8   2  58  0  0  0  0  1  1

Here subject 1 has a 1 in x2 because its v1 value is 2, subject 3 has a 1 in both x3 and x5 because its v1 contains 3 and 5, and so on.

I would appreciate any help on this question. Thanks a lot.

2 Answers

5

You can cbind this result to data[-4] and get what you need:

 0+t(sapply(as.character(data$v1), function(line) 
        sapply(1:6, function(x) x %in% unlist(strsplit(line, split="\\s|\\,"))) ))
#----------------
            [,1] [,2] [,3] [,4] [,5] [,6]
2              0    1    0    0    0    0
0              0    0    0    0    0    0
3,5            0    0    1    0    1    0
2 1            1    1    0    0    0    0
A,4            0    0    0    1    0    0
B,1,C          1    0    0    0    0    0
A and B,3      0    0    1    0    0    0
5, 6 A or C    0    0    0    0    1    1
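
For completeness, here is a minimal sketch of the cbind step mentioned above, using the sample data from the question. The names `ind` and `result` are only illustrative, and renaming the columns to x1..x6 is my assumption about the desired output, based on the table in the question.

```r
# Sample data from the question
data <- data.frame(subject = 1:8, sex = c(1, 2, 2, 1, 2, 1, 1, 2),
                   age = c(35, 29, 31, 46, 64, 57, 49, 58),
                   v1 = c("2", "0", "3,5", "2 1", "A,4", "B,1,C",
                          "A and B,3", "5, 6 A or C"))

# Indicator matrix: one row per subject, one column per value 1..6
ind <- 0 + t(sapply(as.character(data$v1), function(line)
  sapply(1:6, function(x) x %in% unlist(strsplit(line, split = "\\s|\\,")))))

# Name the columns as in the question and drop the v1 strings
# that sapply attached as row names
colnames(ind) <- paste0("x", 1:6)
rownames(ind) <- NULL

# data[-4] drops the original v1 column before binding
result <- cbind(data[-4], ind)
```

Clearing the row names first keeps the v1 strings from ending up as the row names of the combined data frame.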

4

One solution:

# lapply always returns a list, even if every row has the same number of tokens
r <- lapply(strsplit(as.character(data$v1), "[^0-9]+"), as.numeric)
m <- as.data.frame(t(sapply(r, function(x) {
        y <- rep(0, 6)
        y[x[!is.na(x)]] <- 1
        y
     })))
data <- cbind(data[, c("subject", "sex", "age")], m)

#   subject sex age V1 V2 V3 V4 V5 V6
# 1       1   1  35  0  1  0  0  0  0
# 2       2   2  29  0  0  0  0  0  0
# 3       3   2  31  0  0  1  0  1  0
# 4       4   1  46  1  1  0  0  0  0
# 5       5   2  64  0  0  0  1  0  0
# 6       6   1  57  1  0  0  0  0  0
# 7       7   1  49  0  0  1  0  0  0
# 8       8   2  58  0  0  0  0  1  1

Following DWin's awesome solution, m could be modified as:

m <- as.data.frame(t(sapply(r, function(x) {
        0 + 1:6 %in% x[!is.na(x)]
     })))

1 Comment

Thanks for your update. I have an additional question. The sample dataset here is simple, and the maximum value in v1 is 6. My real data is huge, so is there an easy way to determine the maximum value of v1 automatically instead of specifying it myself? Thanks.
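
One possible way to do this, as a rough sketch rather than a tested solution for the real data: extract the numbers first, take the maximum over all rows, and use that instead of the hard-coded 6. The names `k` and `result` are illustrative. It assumes v1 contains at least one number overall, and it swaps the `t(sapply(...))` step for `do.call(rbind, lapply(...))` so the result is still one row per subject when the maximum happens to be 1.

```r
# Sample data from the question
data <- data.frame(subject = 1:8, sex = c(1, 2, 2, 1, 2, 1, 1, 2),
                   age = c(35, 29, 31, 46, 64, 57, 49, 58),
                   v1 = c("2", "0", "3,5", "2 1", "A,4", "B,1,C",
                          "A and B,3", "5, 6 A or C"))

# Pull the digit runs out of each v1 entry as numbers
r <- lapply(strsplit(as.character(data$v1), "[^0-9]+"), function(tok) {
  tok <- tok[nzchar(tok)]  # drop empty tokens left by leading non-digits
  as.numeric(tok)
})

# Largest value occurring anywhere in v1
# (assumption: at least one number is present in the column)
k <- max(unlist(r))

# One 0/1 row per subject, one column per value 1..k
m <- as.data.frame(do.call(rbind, lapply(r, function(x) 0 + seq_len(k) %in% x)))
names(m) <- paste0("x", seq_len(k))

result <- cbind(data[, c("subject", "sex", "age")], m)
```

On the sample data `k` comes out as 6, so the result has the same columns as before.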
