Split a single column into multiple columns based on a set of values

Question

Edit: I've solved my initial problem using a suggestion from G. Grothendieck; thanks again, that is exactly the clean approach I was after. The initial post is below. In reality, though, my file is a little more subtle than that template.

It actually looks like this:

A1
100
200
txt 
A2
STRING
300
400
txt txt
txt
txt txt txt
A3
STRING
STRING
150
250
A2
.
.
.

A well-known STRING can appear right after an A-something value. Sometimes it does not occur at all, sometimes it occurs once, and sometimes several times. I didn't notice the multiple occurrences at first, so, assuming it occurred at most once, I wrote a loop to handle the problem:

for (i in 1:nrow(raw_data)){
  if (is.na(raw_data[i,2])) {
    raw_data <- raw_data[-c(i)]
  } else if (raw_data[i,2] == "STRING") {
    raw_data[i,2] = raw_data[i,3]
    raw_data[i,3] = raw_data[i,4]
    raw_data[i,4] = raw_data[i,5]
    raw_data[i,5] = raw_data[i,6]
    raw_data[i,6] = raw_data[i,7]
    raw_data[i,7] = raw_data[i,8]
    raw_data[i,8] = raw_data[i,9]
    raw_data[i,9] = raw_data[i,10]
    raw_data[i,10] = raw_data[i,12]
    raw_data[i,11] = "Yes"
    if (is.na(raw_data[i,13])){
      raw_data[i,12] = NA
    } else raw_data[i,12] = raw_data[i,13]
  }
}

Basically, I'm assigning "Yes" in column 11 to flag that the string was found. I should clearly record the number of occurrences here instead of Yes/No (0 by default, then 1, 2, ...). All the other column values are shifted to the left so that they go back to the columns where they are expected.

How can I adapt this, if possible, to the fact that I may actually have several occurrences of STRING? Or do I have to change my approach entirely?

Now, for those of you who like a challenge: I'm starting to wonder whether my processing is really efficient for this file. What about processing each line of the original file, since we know that anything like A1, A2, etc. should go in column 1, and so on?

Anyhow, thanks to those who will look into this and try :)

Initial post: I have a dataset in R that consists of a single column containing variables that I would ideally like in multiple columns. The structure is as follows:

A1
100
200
txt 
A2
300
400
txt txt
txt
txt txt txt
A3
150
250
A2
.
.
.

Ideally this is the result I'm chasing :

A1 | 100 | 200 | txt  
A2 | 300 | 400 | txt txt | txt | txt txt
A3 | 150 | 250
A2 |  .  |  .  |  .

The set {A1;A2;A3} is known. The main difficulty I'm hitting right now is that the number of columns is unknown.

I started by transposing my data, and was thinking of looping over the single row: each time I see one of the values in my set {A1;A2;A3}, I start a new row with this value in column 1, so that column 1 only contains {A1;A2;A3} values.

I'm convinced that there is a cleaner way of doing such a task.

Thanks ahead of time for your assistance with this!

G. Grothendieck · Accepted Answer · 2016-11-09 22:01:52Z

5

Create a grouping variable g and with it use tapply to convert the data from long form to a list, v. Finally, convert each component of v to a "ts" object and cbind the "ts" objects together (since "ts" objects can be bound together and automatically padded with NAs) transposing the result as matrix m. Convert m to a data.frame and apply type.convert to each column to fix the column types. The two lines marked ## can be omitted if a matrix, m, is sufficient as the answer.

No packages are used.

g <- cumsum(DF[[1]] %in% c("A1", "A2", "A3"))
v <- tapply(DF[[1]], g, c, simplify = FALSE)
m <- t(do.call(cbind, lapply(v, ts)))
DFout <- as.data.frame(m, stringsAsFactors = FALSE)   ##
DFout[] <- lapply(DFout, type.convert, as.is = TRUE)  ##

giving:

> DFout
  V1  V2  V3      V4   V5          V6
1 A1 100 200    txt  <NA>        <NA>
2 A2 300 400 txt txt  txt txt txt txt
3 A3 150 250    <NA> <NA>        <NA>
4 A2  NA  NA    <NA> <NA>        <NA>

Note: The input in reproducible form is:

DF <- structure(list(V1 = c("A1", "100", "200", "txt ", "A2", "300", 
"400", "txt txt", "txt", "txt txt txt", "A3", "150", "250", "A2"
)), .Names = "V1", row.names = c(NA, -14L), class = "data.frame")
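
To make the grouping step in this answer more concrete, here is a minimal sketch that prints the intermediate values. The names is_start, sizes and width are only illustrative, and the NA-padding by indexing is an alternative to the "ts"/cbind trick, not what the answer itself does.

```r
# Same input as in the note above.
DF <- data.frame(V1 = c("A1", "100", "200", "txt ", "A2", "300", "400",
                        "txt txt", "txt", "txt txt txt", "A3", "150", "250", "A2"),
                 stringsAsFactors = FALSE)

# TRUE where a row starts a new record; cumsum() turns that into a running group id.
is_start <- DF$V1 %in% c("A1", "A2", "A3")
g <- cumsum(is_start)
print(g)

# Group sizes show how many columns the widest record needs.
sizes <- table(g)
print(sizes)

# Alternative padding (an assumption, not the answer's method):
# indexing past the end of a vector yields NA, so every group gets the same length.
v <- split(DF$V1, g)
width <- max(lengths(v))
m <- t(sapply(v, function(x) x[seq_len(width)]))
print(m)
```

Since the number of columns is taken from the largest group, nothing has to be known about the width in advance.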

edited Nov 9, 2016 at 22:01

answered Nov 9, 2016 at 14:58


1 Comment

homer3018 Over a year ago

Hi G. Grothendieck, I've edited the original post since I've encountered further issues with my file, not sure you've seen it so... Thanks again !

Community · Accepted Answer · 2020-06-20 09:12:55Z

The OP edited the question after the other answers were posted, so those answers do not account for the additional complexity of "STRING" appearing occasionally.

The solution below addresses this issue and counts the number of occurrences of "STRING" before removal.

library(data.table)
setDT(DF)[, rn := cumsum(V1 %like% "^A\\d+")][
  , occurrences := sum(V1 == "STRING"), by = rn][
    V1 != "STRING", 
    dcast(.SD, rn + occurrences ~ rowid(rn, prefix = "V"), value.var = "V1")][
      , lapply(.SD, function(x) if (is.character(x)) type.convert(x, as.is = TRUE) else x)]

   rn occurrences V1  V2  V3      V4  V5          V6
1:  1           0 A1 100 200     txt  NA          NA
2:  2           1 A2 300 400 txt txt txt txt txt txt
3:  3           2 A3 150 250      NA  NA          NA
4:  4           0 A2  NA  NA      NA  NA          NA

Explanation

setDT(DF) coerces DF to class data.table in place, i.e., without copying.
Rows starting with A followed by one or more digits are identified. Each of those rows and all subsequent rows until the next Axx row get a unique group id. When the next Axx row is encountered, the group id is advanced by 1.
The number of occurrences of "STRING" within each group of rows is counted.
After removal of the rows containing "STRING", the remaining rows are reshaped using dcast(). The formula rn + occurrences ~ rowid(rn, prefix = "V") determines the layout of the new table: rn and occurrences go at the front of each line, while the rows of each group form the columns. As the number of rows within each group is not known beforehand, the rowid() function is used to number the rows within each group, thereby creating the new column names.
Finally, all character columns are converted to their appropriate types. The parameter as.is = TRUE prevents coercion of character to factor.

Data

DF <- structure(list(V1 = c("A1", "100", "200", "txt", "A2", "STRING", 
"300", "400", "txt txt", "txt", "txt txt txt", "A3", "STRING", 
"STRING", "150", "250", "A2")), .Names = "V1", row.names = c(NA, 
-17L), class = "data.frame")
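
For readers without data.table, the same steps can be sketched in base R. This is only a rough equivalent under the assumptions of this answer (the marker is exactly "STRING", records start with A followed by digits); the names is_marker, width and out are illustrative, and the rn column is left out.

```r
DF <- data.frame(V1 = c("A1", "100", "200", "txt", "A2", "STRING", "300", "400",
                        "txt txt", "txt", "txt txt txt", "A3", "STRING", "STRING",
                        "150", "250", "A2"),
                 stringsAsFactors = FALSE)

# Step 1: the group id advances at every row that starts with "A" plus digits.
rn <- cumsum(grepl("^A\\d+", DF$V1))

# Step 2: count the marker rows per group before dropping them.
is_marker <- DF$V1 == "STRING"
occurrences <- as.vector(tapply(is_marker, rn, sum))

# Step 3: drop the markers, then pad each group with NA to a common width.
v <- split(DF$V1[!is_marker], rn[!is_marker])
width <- max(lengths(v))
m <- t(sapply(v, function(x) x[seq_len(width)]))

# Step 4: convert the character columns, then put the counts in front.
out <- as.data.frame(m, stringsAsFactors = FALSE)
out[] <- lapply(out, type.convert, as.is = TRUE)
out <- cbind(occurrences = occurrences, out)
print(out)
```

Counting before removal matters: once the "STRING" rows are gone, the number of occurrences per group can no longer be recovered.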

Steven Beaupré · Accepted Answer · 2016-11-10 12:35:01Z

2

Another idea:

library(dplyr)
library(splitstackshape)

DF %>%
  group_by(id = cumsum(V1 %in% c("A1", "A2", "A3"))) %>%
  summarise(col = toString(V1)) %>%
  cSplit('col')

Which gives:

#   id col_1 col_2 col_3   col_4 col_5       col_6
#1:  1    A1   100   200     txt    NA          NA
#2:  2    A2   300   400 txt txt   txt txt txt txt
#3:  3    A3   150   250      NA    NA          NA
#4:  4    A2    NA    NA      NA    NA          NA
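
To see what happens in between, here is a small base R sketch of the collapse-then-split idea. It is not what cSplit does internally, just an illustration; collapsed and parts are made-up names, and the approach assumes that no value contains the separator ", ".

```r
DF <- data.frame(V1 = c("A1", "100", "200", "txt ", "A2", "300", "400",
                        "txt txt", "txt", "txt txt txt", "A3", "150", "250", "A2"),
                 stringsAsFactors = FALSE)

id <- cumsum(DF$V1 %in% c("A1", "A2", "A3"))

# Collapse each group into one comma-separated string, as toString() does.
collapsed <- vapply(split(DF$V1, id), toString, character(1))
print(collapsed)

# Split the strings back apart; only safe when no value contains ", ".
parts <- strsplit(collapsed, ", ", fixed = TRUE)
print(lengths(parts))
```

If the text fields can contain commas, a list-based approach that never pastes the values together would be the safer choice.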

edited Nov 10, 2016 at 12:35

answered Nov 9, 2016 at 16:21
