3

I have a data like this one

df <- structure(list(A = structure(c(2L, 3L, 6L, 7L, 5L, 4L, 1L, 1L
), .Label = c("", "NZT1", "O749", "P42I;QJ0;AIH2", "P609;QT7", 
"Q835", "Q854"), class = "factor"), B = structure(c(8L, 6L, 5L, 
7L, 4L, 3L, 2L, 1L), .Label = c("", "P079;P0C7;P0C8", "P641;Q614", 
"Q013", "Q554", "Q749", "Q955", "Q9U0"), class = "factor"), C = structure(c(7L, 
8L, 6L, 5L, 3L, 4L, 1L, 2L), .Label = c("P641;QS14", "P679;P0C7;P048", 
"Q168", "Q413", "Q550", "Q6N9", "Q980", "Q997"), class = "factor")), .Names = c("A", 
"B", "C"), class = "data.frame", row.names = c(NA, -8L))

#              A              B              C
#1          NZT1           Q9U0           Q980
#2          O749           Q749           Q997
#3          Q835           Q554           Q6N9
#4          Q854           Q955           Q550
#5      P609;QT7           Q013           Q168
#6 P42I;QJ0;AIH2      P641;Q614           Q413
#7               P079;P0C7;P0C8      P641;QS14
#8                              P679;P0C7;P048

I am trying to split them based on ";", and then put them under the other string , the expected output I seek is like this

#            A              B              C
#1          NZT1           Q9U0           Q980
#2          O749           Q749           Q997
#3          Q835           Q554           Q6N9
#4          Q854           Q955           Q550
#5          P609           Q013           Q168
#6          QT7            P641           Q413
#7          P42I           Q614           P641
#8          QJ0            P079           QS14
#9          AIH2           P0C7           P679    
#10                        P0C8           P0C7      
#11                                       P048

I tried to use strsplit() but I did not get that far

This is what I tried

myNewdf <- strsplit(as.character(unlist(df)), ";")
0

4 Answers 4

4

The scan function will succeed here although the as.data.frame will choke if the number of items in each column are not the same:

as.data.frame(lapply( df, function(x) scan( text=as.character(x) , what="", sep=";", blank.lines.skip = FALSE))
+ )
Read 11 items
Read 11 items
Read 11 items
      A    B    C
1  NZT1 Q9U0 Q980
2  O749 Q749 Q997
3  Q835 Q554 Q6N9
4  Q854 Q955 Q550
5  P609 Q013 Q168
6   QT7 P641 Q413
7  P42I Q614 P641
8   QJ0 P079 QS14
9  AIH2 P0C7 P679
10      P0C8 P0C7
11           P048
Sign up to request clarification or add additional context in comments.

5 Comments

scan is actually the heart of all the read.* functions. It's a low level function but it can do other tasks like mulit-line reads with appropriate parameters to the what argument.
Actually using it to read individual vectors, and that use-strategy has many examples on SO and Rhelp. I learned it from answers by G.Grothendieck.
In the old days we gave scan or read.* functions a textConnection() argument, and you may still need to do such with readLines since it is not based on scan.
I thought I made that point clear. You would need to construct another method to pad the shorter items with rep("", length-max.length).
textConnection is very simple. It just turns a vector into something that most functions will see as a file. Try: x <- "1\n2\n3\n"; read.table(textConnection(x)). Or: y <- "1 a\n2 b\n3 c\n"; read.table(textConnection(y))
3

I think you can try this:

x <- lapply(df, function (x) unlist(strsplit(as.character(x), ";")))

This gives you a list. If you want a data frame, you need some further work to pad empty string "":

m <- max(lengths(x))
y <- as.data.frame(lapply(x, function (vec) c(vec, character(m - length(vec)))))

#       A    B    C
# 1  NZT1 Q9U0 Q980
# 2  O749 Q749 Q997
# 3  Q835 Q554 Q6N9
# 4  Q854 Q955 Q550
# 5  P609 Q013 Q168
# 6   QT7 P641 Q413
# 7  P42I Q614 P641
# 8   QJ0 P079 QS14
# 9  AIH2 P0C7 P679
# 10      P0C8 P0C7
# 11           P048

1 Comment

@Zheyuan Li I accept your answer, thanks but would be nice if you could also write some definition to your script so that I will learn from it
2

Or using the ts function:

lst <- lapply(df, function(a) unlist(strsplit(as.character(a), split = ";"))) # 1
tsr <- cbind(ts(lst$A), ts(lst$B), ts(lst$C)) # 2
tsr[is.na(tsr)] <- "" # 3
newDF <- as.data.frame(tsr) # 4
colnames(newDF) <- colnames(df) # 5 (if needed)

      # A    B    C
# 1  NZT1 Q9U0 Q980
# 2  O749 Q749 Q997
# 3  Q835 Q554 Q6N9
# 4  Q854 Q955 Q550
# 5  P609 Q013 Q168
# 6   QT7 P641 Q413
# 7  P42I Q614 P641
# 8   QJ0 P079 QS14
# 9  AIH2 P0C7 P679
# 10      P0C8 P0C7
# 11           P048
  1. lst will give a list of ; separated columns
  2. tsr is a column-wise binding of time series objects. Time series objects are used to take care of unequal lengths.
  3. find NAs in tsr and make them none value.
  4. convert to data frame.
  5. make column names of newDF the same as df, if necessary.

Comments

2

Here is another option with stri_list2matrix. This returns a matrix with NA as missing values. If we need '', use the fill='' argument in stri_list2matrix. Also, this can be converted to data.frame with as.data.frame.

 library(stringi)
 stri_list2matrix(lapply(df, function(x) unlist(strsplit(as.character(x), ";"))))

4 Comments

@nik As the question is put on hold, others can't add answers. We will wait for the person (Procrastinatus ) to respond to your comments.
stri_list2matrix will give a matrix of characters with NA as missing values. Is that what OP is asked for?
@m0h3n You can change it to '' using the fill argument. Also, the as.data.frame can convert it to data.frame
So better to point them out in the answer itself. I think you would remember your comment from yesterday HERE. :-)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.