split strings and rearrange a data frame

Question

I have data like this:

df <- structure(list(A = structure(c(2L, 3L, 6L, 7L, 5L, 4L, 1L, 1L
), .Label = c("", "NZT1", "O749", "P42I;QJ0;AIH2", "P609;QT7", 
"Q835", "Q854"), class = "factor"), B = structure(c(8L, 6L, 5L, 
7L, 4L, 3L, 2L, 1L), .Label = c("", "P079;P0C7;P0C8", "P641;Q614", 
"Q013", "Q554", "Q749", "Q955", "Q9U0"), class = "factor"), C = structure(c(7L, 
8L, 6L, 5L, 3L, 4L, 1L, 2L), .Label = c("P641;QS14", "P679;P0C7;P048", 
"Q168", "Q413", "Q550", "Q6N9", "Q980", "Q997"), class = "factor")), .Names = c("A", 
"B", "C"), class = "data.frame", row.names = c(NA, -8L))

#              A              B              C
#1          NZT1           Q9U0           Q980
#2          O749           Q749           Q997
#3          Q835           Q554           Q6N9
#4          Q854           Q955           Q550
#5      P609;QT7           Q013           Q168
#6 P42I;QJ0;AIH2      P641;Q614           Q413
#7               P079;P0C7;P0C8      P641;QS14
#8                              P679;P0C7;P048

I am trying to split the values on ";" and then stack the resulting pieces under one another within each column. The expected output looks like this:

#            A              B              C
#1          NZT1           Q9U0           Q980
#2          O749           Q749           Q997
#3          Q835           Q554           Q6N9
#4          Q854           Q955           Q550
#5          P609           Q013           Q168
#6          QT7            P641           Q413
#7          P42I           Q614           P641
#8          QJ0            P079           QS14
#9          AIH2           P0C7           P679    
#10                        P0C8           P0C7      
#11                                       P048

I tried to use strsplit(), but I did not get very far. This is what I tried:

myNewdf <- strsplit(as.character(unlist(df)), ";")

IRTFM · Accepted Answer · 2016-07-05 22:19:55Z

4

The scan function works here, although as.data.frame will fail if the columns do not all yield the same number of items:

as.data.frame(lapply(df, function(x) scan(text = as.character(x), what = "", sep = ";", blank.lines.skip = FALSE)))
Read 11 items
Read 11 items
Read 11 items
      A    B    C
1  NZT1 Q9U0 Q980
2  O749 Q749 Q997
3  Q835 Q554 Q6N9
4  Q854 Q955 Q550
5  P609 Q013 Q168
6   QT7 P641 Q413
7  P42I Q614 P641
8   QJ0 P079 QS14
9  AIH2 P0C7 P679
10      P0C8 P0C7
11           P048
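
As a rough sketch of why this works, the example below runs the same scan call on a cut-down data frame (df_small, a hypothetical subset of the question's values, not the original df) and checks the item counts before calling as.data.frame. It assumes that, with blank.lines.skip = FALSE, an empty cell is read as one empty-string item, which is what makes the counts line up in the output above.

```r
# Hypothetical cut-down data, reusing a few values from the question
df_small <- data.frame(
  A = c("NZT1", "P609;QT7", ""),
  B = c("Q9U0", "Q013", "P641;Q614"),
  stringsAsFactors = FALSE
)

# Read each column as ";"-separated text; quiet = TRUE suppresses "Read n items"
cols <- lapply(df_small, function(x) {
  scan(text = as.character(x), what = "", sep = ";",
       blank.lines.skip = FALSE, quiet = TRUE)
})

# as.data.frame needs every column to have the same number of items
n_items <- lengths(cols)
if (length(unique(n_items)) == 1) {
  result <- as.data.frame(cols, stringsAsFactors = FALSE)
} else {
  result <- NULL  # the columns would need padding first
}
```

If the counts differ, the list has to be padded to a common length before it can become a data frame.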

answered Jul 5, 2016 at 22:19

IRTFM


5 Comments

IRTFM Over a year ago

scan is actually the heart of all the read.* functions. It's a low-level function, but it can handle other tasks, such as multi-line reads, given appropriate values for the what argument.

IRTFM Over a year ago

I'm actually using it here to read individual vectors; that strategy has many examples on SO and R-help. I learned it from answers by G.Grothendieck.

IRTFM Over a year ago

In the old days we gave scan or the read.* functions a textConnection() argument, and you may still need to do so with readLines, since it is not based on scan.

IRTFM Over a year ago

I thought I made that point clear. You would need to construct another method to pad the shorter items with rep("", max.length - length).

IRTFM Over a year ago

textConnection is very simple. It just turns a vector into something that most functions will see as a file. Try: x <- "1\n2\n3\n"; read.table(textConnection(x)). Or: y <- "1 a\n2 b\n3 c\n"; read.table(textConnection(y))

Zheyuan Li · Accepted Answer · 2016-07-05 22:29:07Z

3

I think you can try this:

x <- lapply(df, function (x) unlist(strsplit(as.character(x), ";")))

This gives you a list. If you want a data frame, you need some further work to pad the shorter columns with empty strings (""):

m <- max(lengths(x))
y <- as.data.frame(lapply(x, function (vec) c(vec, character(m - length(vec)))))

#       A    B    C
# 1  NZT1 Q9U0 Q980
# 2  O749 Q749 Q997
# 3  Q835 Q554 Q6N9
# 4  Q854 Q955 Q550
# 5  P609 Q013 Q168
# 6   QT7 P641 Q413
# 7  P42I Q614 P641
# 8   QJ0 P079 QS14
# 9  AIH2 P0C7 P679
# 10      P0C8 P0C7
# 11           P048
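
To unpack the steps above, here is a minimal sketch on a single small factor (v is a made-up example, and the target length of 5 is arbitrary). It shows what each function contributes: strsplit returns a list with one element per input string, an empty string splits to character(0) and so vanishes on unlist, and character(n) produces n empty strings for padding.

```r
# Made-up factor column, similar in shape to one column of the question's df
v <- factor(c("NZT1", "P609;QT7", ""))

# strsplit needs character input and returns a list, one element per string;
# the empty string yields character(0)
parts <- strsplit(as.character(v), ";")

# unlist flattens the list into one vector; the character(0) element disappears
flat <- unlist(parts)

# character(n) is a vector of n empty strings, so this pads flat to length 5
target_len <- 5
padded <- c(flat, character(target_len - length(flat)))
```

In the answer, target_len is max(lengths(x)), so every column is padded to the length of the longest one.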

edited Jul 5, 2016 at 22:29

answered Jul 5, 2016 at 22:12

Zheyuan Li


1 Comment

nik Over a year ago

@Zheyuan Li I accept your answer, thanks, but it would be nice if you could also add some explanation of your script so that I can learn from it.

989 · Accepted Answer · 2016-07-06 08:21:17Z

2

Or using the ts function:

lst <- lapply(df, function(a) unlist(strsplit(as.character(a), split = ";"))) # 1
tsr <- cbind(ts(lst$A), ts(lst$B), ts(lst$C)) # 2
tsr[is.na(tsr)] <- "" # 3
newDF <- as.data.frame(tsr) # 4
colnames(newDF) <- colnames(df) # 5 (if needed)

#       A    B    C
# 1  NZT1 Q9U0 Q980
# 2  O749 Q749 Q997
# 3  Q835 Q554 Q6N9
# 4  Q854 Q955 Q550
# 5  P609 Q013 Q168
# 6   QT7 P641 Q413
# 7  P42I Q614 P641
# 8   QJ0 P079 QS14
# 9  AIH2 P0C7 P679
# 10      P0C8 P0C7
# 11           P048

1. lst is a list holding each column split on ";".
2. tsr is a column-wise binding of time series objects. Time series objects are used because cbind pads the shorter ones with NA, which takes care of the unequal lengths.
3. Find the NAs in tsr and replace them with empty strings.
4. Convert to a data frame.
5. Make the column names of newDF the same as those of df, if necessary.
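
The padding behaviour in step 2 can be seen in isolation with a small sketch. The numeric series below are made up for illustration; the answer applies the same idea to character vectors.

```r
# Two made-up series of different lengths, both starting at time 1
long  <- ts(1:3)
short <- ts(1:2)

# cbind on ts objects aligns them on the union of their time points,
# so the shorter series is filled with NA where it has no observation
m <- cbind(long, short)

n_rows      <- nrow(m)           # 3, the length of the longer series
last_short  <- m[3, "short"]     # NA, the padded cell
```

Plain vectors of unequal length would be recycled by cbind instead of padded, which is presumably why the answer wraps them in ts first.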

edited Jul 6, 2016 at 8:21

answered Jul 5, 2016 at 23:39

989



akrun · Accepted Answer · 2016-07-06 09:08:11Z

2

Here is another option with stri_list2matrix. This returns a matrix with NA as missing values. If we need '', use the fill='' argument in stri_list2matrix. Also, this can be converted to data.frame with as.data.frame.

 library(stringi)
 stri_list2matrix(lapply(df, function(x) unlist(strsplit(as.character(x), ";"))))
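
A sketch of the variant described above, with fill = "" and a conversion to a data frame, might look like this. It uses a hypothetical cut-down data frame (df_small) rather than the question's df, requires the stringi package to be installed, and sets the column names explicitly rather than relying on the matrix to carry them over.

```r
library(stringi)

# Hypothetical cut-down data, reusing a few values from the question
df_small <- data.frame(
  A = c("NZT1", "P609;QT7", ""),
  B = c("Q9U0", "P641;Q614", "P079"),
  stringsAsFactors = FALSE
)

# Split each column on ";"; the columns end up with 3 and 4 pieces
pieces <- lapply(df_small, function(x) unlist(strsplit(as.character(x), ";")))

# One matrix column per list element, shorter ones padded with "" instead of NA
mat <- stri_list2matrix(pieces, fill = "")
colnames(mat) <- names(pieces)

out <- as.data.frame(mat, stringsAsFactors = FALSE)
```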

edited Jul 6, 2016 at 9:08

answered Jul 6, 2016 at 3:21

akrun


4 Comments

akrun Over a year ago

@nik As the question is put on hold, others can't add answers. We will wait for the person (Procrastinatus ) to respond to your comments.

989 Over a year ago

stri_list2matrix will give a character matrix with NA as missing values. Is that what the OP asked for?

akrun Over a year ago

@m0h3n You can change it to '' using the fill argument. Also, the as.data.frame can convert it to data.frame

989 Over a year ago

So better to point them out in the answer itself. I think you would remember your comment from yesterday HERE. :-)
