4

For the data given below,

data1<-structure(list(var1 = c("2 7", "2 6 7", "2 7", "2 7", "1 7", 
"1 7", "1 5", "1 2 7", "1 5", "1 7", "1 2 3 4 5 6 7", "1 2 4 6"
)), .Names = "var1", class = "data.frame", row.names = c(NA, 
-12L))

> data1

            var1
1            2 7
2          2 6 7
3            2 7
4            2 7
5            1 7
6            1 7
7            1 5
8          1 2 7
9            1 5
10           1 7
11 1 2 3 4 5 6 7
12       1 2 4 6

I would like it to split into seven columns (7) as follows:

    v1  v2  v3  v4  v5  v6  v7
1   NA  2   NA  NA  NA  NA  7
2   NA  2   NA  NA  NA  6   7
3   NA  2   NA  NA  NA  NA  7
4   NA  2   NA  NA  NA  NA  7
5   1   NA  NA  NA  NA  NA  7
6   1   NA  NA  NA  NA  NA  7
7   1   NA  NA  NA  5   NA  NA
8   1   2   NA  NA  NA  NA  7
9   1   NA  NA  NA  5   NA  NA
10  1   NA  NA  NA  NA  NA  7
11  1   2   3   4   5   6   7
12  1   2   NA  4   NA  6   NA

I use the tstrsplit from data.table package as follows:

library(data.table)
setDT(data1)[,tstrsplit(var1," ")]



 V1 V2 V3 V4 V5 V6 V7
 1:  2  7 NA NA NA NA NA
 2:  2  6  7 NA NA NA NA
 3:  2  7 NA NA NA NA NA
 4:  2  7 NA NA NA NA NA
 5:  1  7 NA NA NA NA NA
 6:  1  7 NA NA NA NA NA
 7:  1  5 NA NA NA NA NA
 8:  1  2  7 NA NA NA NA
 9:  1  5 NA NA NA NA NA
10:  1  7 NA NA NA NA NA
11:  1  2  3  4  5  6  7
12:  1  2  4  6 NA NA NA

This is different than the expected output. I was wondering how can I get the expected output as described above.

2
  • 1
    Try cSplit_e from my splitstackshape package. Commented Jan 27, 2018 at 12:58
  • Does this function account for NA's in column to be splited?. I am getting an error Error in seq.default(min(vec), max(vec)) : 'from' must be a finite number when I included NA's, but without NA's, it works fine. Commented Feb 2, 2018 at 5:24

2 Answers 2

4

With data.table you may try

library(magrittr)
setDT(data1)[, strsplit(var1," "), by = .(rn = seq_len(nrow(data1)))] %>% 
  dcast(., rn ~ V1)
       rn      1      2      3      4      5      6      7
 1:     1     NA      2     NA     NA     NA     NA      7
 2:     2     NA      2     NA     NA     NA      6      7
 3:     3     NA      2     NA     NA     NA     NA      7
 4:     4     NA      2     NA     NA     NA     NA      7
 5:     5      1     NA     NA     NA     NA     NA      7
 6:     6      1     NA     NA     NA     NA     NA      7
 7:     7      1     NA     NA     NA      5     NA     NA
 8:     8      1      2     NA     NA     NA     NA      7
 9:     9      1     NA     NA     NA      5     NA     NA
10:    10      1     NA     NA     NA     NA     NA      7
11:    11      1      2      3      4      5      6      7
12:    12      1      2     NA      4     NA      6     NA

To get rid of the rn column, we can use

setDT(data1)[, strsplit(var1," "), by = .(rn = 1:nrow(data1))][
  , dcast(.SD, rn ~ V1)][, rn := NULL][]

Explanation

setDT(data1)[, strsplit(var1," "), by = .(rn = seq_len(nrow(data1)))]

creates a data.table directly in long format

    rn V1
 1:  1  2
 2:  1  7
 3:  2  2
 4:  2  6
 5:  2  7
 6:  3  2
 7:  3  7
 8:  4  2
 9:  4  7
10:  5  1
11:  5  7
12:  6  1
13:  6  7
14:  7  1
15:  7  5
16:  8  1
17:  8  2
18:  8  7
19:  9  1
20:  9  5
21: 10  1
22: 10  7
23: 11  1
24: 11  2
25: 11  3
26: 11  4
27: 11  5
28: 11  6
29: 11  7
30: 12  1
31: 12  2
32: 12  4
33: 12  6
    rn V1

which is then reshaped to wide format using dcast().

If we would use tstrsplit() instead of strsplit() we would get a data.table in wide format which needs to be reshaped to long format using melt():

setDT(data1)[,tstrsplit(var1," ")][, rn := .I][
  , melt(.SD, id = "rn", na.rm = TRUE)][
    , dcast(.SD, rn ~ paste0("V", value))][
      , rn := NULL][]
Sign up to request clarification or add additional context in comments.

Comments

3

In base R, we can do this by splitting the string by one or more (\\s+), create a row/column index ('i1') and assign a NA matrix ('m1') to fill up with the unlisted split values

lst <- lapply(strsplit(data1$var1, "\\s+"), as.numeric)
i1 <- cbind(rep(1:nrow(data1), lengths(lst)), unlist(lst))
m1 <- matrix(NA, nrow = max(i1[,1]), ncol = max(i1[,2]))
m1[i1] <- unlist(lst)
as.data.frame(m1)
#   V1 V2 V3 V4 V5 V6 V7
#1  NA  2 NA NA NA NA  7
#2  NA  2 NA NA NA  6  7
#3  NA  2 NA NA NA NA  7
#4  NA  2 NA NA NA NA  7
#5   1 NA NA NA NA NA  7
#6   1 NA NA NA NA NA  7
#7   1 NA NA NA  5 NA NA
#8   1  2 NA NA NA NA  7
#9   1 NA NA NA  5 NA NA
#10  1 NA NA NA NA NA  7
#11  1  2  3  4  5  6  7
#12  1  2 NA  4 NA  6 NA

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.