I have a character vector of about 65k elements in the format below. Each element holds a different number of comma-separated fields, ranging from 3 to 8:

b[1]= "aaaa, bbbb, cccc"
...
b[1000]="aaaa, bbbb, cccc, dddd, eeee, ffff"
...
b[3000]="aaaa, bbbb, cccc, dddd, eeee, ffff, gggg"
b[3001]="aaaa, bbbb, cccc"

I want to convert to a data frame:

row  col1 col2 col3 col4 col5 col6 col7
1    aaaa bbbb cccc
1000 aaaa bbbb cccc dddd eeee ffff
3000 aaaa bbbb cccc dddd eeee ffff gggg

I tried:

 data.frame( do.call( rbind, strsplit( b, ',' ) ) ) 

and got:

Warning message: In (function (..., deparse.level = 1) : number of columns of result is not a multiple of vector length (arg 1)

Any suggestions?

  • What happened to the reproducible examples on StackOverflow? Commented Jun 12, 2019 at 8:45

2 Answers


We can use read.csv after pasting the string together and collapsing with "\n".

read.csv(text = paste0(b, collapse = "\n"), header = FALSE)

#    V1    V2    V3    V4    V5    V6    V7
#1 aaaa  bbbb  cccc                        
#2 aaaa  bbbb  cccc  dddd  eeee  ffff      
#3 aaaa  bbbb  cccc  dddd  eeee  ffff  gggg

If you want empty strings to be read as NA, specify them in na.strings:

read.csv(text = paste0(b, collapse = "\n"), header = FALSE, na.strings = "")
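One caveat with this approach: according to the read.table documentation, the number of columns is inferred from the first five lines of input unless col.names is supplied and is longer. If the widest rows appear later in the vector, it may help to pass col.names explicitly. The following is a minimal sketch under that assumption; b_long, n_cols and wide are illustrative names, not part of the original data.

```r
# Illustrative input: the first six elements have 3 fields,
# and the widest element (7 fields) only appears afterwards.
b_long <- c(rep("aaaa, bbbb, cccc", 6),
            "aaaa, bbbb, cccc, dddd, eeee, ffff, gggg")

# Count the fields in every element and take the maximum.
n_cols <- max(lengths(strsplit(b_long, ",", fixed = TRUE)))

# Supplying col.names of that length fixes the column count up front;
# strip.white = TRUE drops the space that follows each comma.
wide <- read.csv(text = paste0(b_long, collapse = "\n"),
                 header = FALSE,
                 col.names = paste0("V", seq_len(n_cols)),
                 fill = TRUE,
                 strip.white = TRUE)

dim(wide)
```

Computing n_cols from the data avoids hard-coding the maximum width, at the cost of one extra pass over the vector.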

Another option is stri_list2matrix from stringi

data.frame(stringi::stri_list2matrix(strsplit(b, ","), byrow = TRUE))

#   X1    X2    X3    X4    X5    X6    X7
#1 aaaa  bbbb  cccc  <NA>  <NA>  <NA>  <NA>
#2 aaaa  bbbb  cccc  dddd  eeee  ffff  <NA>
#3 aaaa  bbbb  cccc  dddd  eeee  ffff  gggg
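If adding a package is not an option, the same padding idea can be sketched in base R: split each element, extend every piece to the length of the longest one (which fills the tail with NA), and only then bind the rows. This is a sketch rather than a tuned solution; parts, n and padded are illustrative names, and the split pattern ",\\s*" is an assumption that fields are separated by a comma plus optional whitespace.

```r
b <- c("aaaa, bbbb, cccc",
       "aaaa, bbbb, cccc, dddd, eeee, ffff",
       "aaaa, bbbb, cccc, dddd, eeee, ffff, gggg")

# Split on a comma followed by optional whitespace.
parts <- strsplit(b, ",\\s*")

# Width of the widest row.
n <- max(lengths(parts))

# `length<-` extends each vector to n elements, padding with NA.
padded <- lapply(parts, `length<-`, n)

# All rows now have equal length, so rbind no longer recycles values.
df <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)

dim(df)
```

Because every row has the same length before rbind is called, the recycling warning from the original attempt should not occur.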

data

b <- c("aaaa, bbbb, cccc", "aaaa, bbbb, cccc, dddd, eeee, ffff", 
       "aaaa, bbbb, cccc, dddd, eeee, ffff, gggg")

3 Comments

Unfortunately, this works when I run it on a subset of my data, but when I use the entire set, everything is forced into 3 columns.
@alex I'm not sure what happened there, since I can't reproduce the issue. However, I added another option with stri_list2matrix.
Thanks for the second option, it worked. Not sure why the first option did not work.

We can use fread from data.table with fill = TRUE, which pads the shorter rows:

library(data.table)
fread(text = b, fill = TRUE, header = FALSE)
#   V1   V2   V3   V4   V5   V6   V7
#1: aaaa bbbb cccc                    
#2: aaaa bbbb cccc dddd eeee ffff     
#3: aaaa bbbb cccc dddd eeee ffff gggg

data

b <- c("aaaa, bbbb, cccc", "aaaa, bbbb, cccc, dddd, eeee, ffff", 
   "aaaa, bbbb, cccc, dddd, eeee, ffff, gggg")
