Splitting a string with a specific pattern

Question

I have a table with a character field that can follow any of these patterns:

input              

97                 # a single number
210 foo            # a number and a word
87 bar 89          # a number, a word, a number
21 23              # two numbers
123 2 fizzbuzz     # two numbers, a word
12 fizz 34 buzz    # a number, a word, a number, a word

I'd like to split each line into up to 4 parts, containing respectively the first number, the first word if it exists, the second number if it exists, and the second word if it exists. My example would give:

input               nb_1    word_1    nb_2    word_2

97                  97
210 foo             210     foo
87 bar 89           87      bar       89
21 23               21                23
123 2 fizzbuzz      123               2       fizzbuzz
12 fizz 34 buzz     12      fizz      34      buzz

Please note the "two numbers, a word" case (the second-to-last example): it has nothing in word_1, as there is no word between the two numbers.

Is there a way to do this without a tedious if / if / else structure?

If it helps, all the words belong to a list of 10 specific words. If there are two words, they can be the same or different. The numbers can be one, two, or three digits long.

Thanks

Please use dput to show the example. Is it a single string or two columns in a data.frame, i.e. "a single number : 97"?
– akrun, Commented May 25, 2016 at 11:31
Sorry, it was just to indicate the pattern. I modified the question in a hopefully clearer way.
– François M., Commented May 25, 2016 at 12:04

Sotos · Accepted Answer · 2016-05-25 12:57:41Z

Here is an idea using gsub and cSplit from the splitstackshape package:

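library(splitstackshape)
# Blank out non-digits to keep only the numbers, and digits to keep only the words
df$num <- gsub('\\D', ' ', df$V1)
df$wrds <- gsub('\\d', ' ', df$V1)
# Split columns 2 and 3 (num, wrds) on spaces into wide columns
newdf <- cSplit(df, 2:3, ' ', 'wide')
newdf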
#                                    V1 num_1 num_2   wrds_1 wrds_2
#1:                                  97    97    NA       NA     NA
#2:                             210 foo   210    NA      foo     NA
#3:                           87 bar 89    87    89      bar     NA
#4:                               21 23    21    23       NA     NA 
#5:                      123 2 fizzbuzz   123     2 fizzbuzz     NA
#6:                     12 fizz 34 buzz    12    34     fizz   buzz

The only problem is row 5, where the word should be in wrds_2 rather than wrds_1. It can be fixed as follows:

# Rows of the form "number number word": move the word from wrds_1 to wrds_2
idx <- grep('[0-9]+\\s+[0-9]+\\s+[A-Za-z]', newdf$V1)
newdf$wrds_1 <- as.character(newdf$wrds_1)
newdf$wrds_2 <- as.character(newdf$wrds_2)
newdf$wrds_2[idx] <- newdf$wrds_1[idx]
newdf$wrds_1[idx] <- NA

which finally gives,

newdf
#                                    V1 num_1 num_2 wrds_1   wrds_2
#1:                                  97    97    NA     NA       NA
#2:                             210 foo   210    NA    foo       NA
#3:                           87 bar 89    87    89    bar       NA
#4:                               21 23    21    23     NA       NA
#5:                      123 2 fizzbuzz   123     2     NA fizzbuzz
#6:                     12 fizz 34 buzz    12    34   fizz     buzz

DATA

dput(df)
structure(list(V1 = c("97", "                  210 foo", "                          87 bar 89", 
"                    21 23", "                    123 2 fizzbuzz", 
"                    12 fizz 34 buzz")), .Names = "V1", row.names = c(NA, 
-6L), class = "data.frame")
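
As a possible alternative that needs no post-hoc fix for row 5, the four parts can be captured in one pass with a regular expression whose word and second-number groups are optional. This is only a sketch in base R, not part of the original answer: split_pattern is a hypothetical helper name, and it assumes words contain only ASCII letters and that tokens are separated by whitespace.

```r
# Hypothetical helper (not from any package): one regex with optional groups.
split_pattern <- function(x) {
  # number, optional word, optional number, optional word
  pattern <- paste0("^\\s*(\\d+)",
                    "(?:\\s+([A-Za-z]+))?",
                    "(?:\\s+(\\d+))?",
                    "(?:\\s+([A-Za-z]+))?\\s*$")
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))

  # Each match is c(full, nb_1, word_1, nb_2, word_2);
  # a string that does not match at all becomes a row of NAs.
  parts <- t(vapply(unname(m), function(p) {
    if (length(p) == 5) p[-1] else rep(NA_character_, 4)
  }, character(4), USE.NAMES = FALSE))

  # Optional groups that did not take part in the match come back as ""
  parts[!is.na(parts) & parts == ""] <- NA
  colnames(parts) <- c("nb_1", "word_1", "nb_2", "word_2")
  data.frame(input = trimws(x), parts, stringsAsFactors = FALSE)
}

# Leading blanks, as in the dput() data, are absorbed by the leading \\s*
x <- c("97", "   210 foo", "  87 bar 89", " 21 23",
       " 123 2 fizzbuzz", " 12 fizz 34 buzz")
res <- split_pattern(x)
res
```

Because the word group sits between the two number groups, "123 2 fizzbuzz" leaves word_1 empty and puts "fizzbuzz" in word_2 without any special-casing.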

Thanks for all three answers. I went with this one as it seemed the simplest to me.

Arun kumar mahesh · Answer · 2016-05-25 12:47:52Z

I tried a different approach:

    library(splitstackshape)
    abc <- data.frame(a=c(97,"210 foo","87 bar 89","21 23","123 2 fizzbuzz","12 fizz 34 buzz"))
    abc1 <- data.frame(cSplit(abc, "a", " ", stripWhite = FALSE))
    abc <- cbind(abc,abc1)
    names(abc) <- c("input","nb_1", "word_1", "nb_2","word_2")
    abc[,1:5] <-apply(abc[,1:5] , 2, as.character)
    for(i in 1:nrow(abc)){
      abc$word_2[i] <- replace(abc$word_2[i] , is.na(abc$word_2[i]),abc$nb_2[grepl("[a-z]",abc$nb_2[i])][i])
      abc$nb_2[i] <- replace(abc$nb_2[i] , is.na(abc$nb_2[i])|grepl("[a-z]",abc$nb_2[i]),abc$word_1[grepl("[0-9]",abc$word_1[i])][i])
      }
    abc$word_1 <- ifelse(grepl("[0-9]",abc$word_1),NA,abc$word_1)
    abc[is.na(abc)] <- ""
    print(abc)
            input nb_1 word_1 nb_2   word_2
1              97   97                     
2         210 foo  210    foo              
3       87 bar 89   87    bar   89         
4           21 23   21          23         
5  123 2 fizzbuzz  123           2 fizzbuzz
6 12 fizz 34 buzz   12   fizz   34     buzz

Mark O'Connell · Answer · 2016-05-25 13:01:28Z

This is a hacky function to do it... although you might have other cases that would break it.

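f <- function(x){
  # Split the string into tokens on single spaces
  string2 <- strsplit(x, " ")[[1]]
  # A lone token is the first number; the other three parts are missing
  if (length(string2) < 2)
    return(c(string2, NA, NA, NA))
  # TRUE for tokens containing a digit, FALSE for words
  arenums <- grepl("\\d", string2)
  # If the second token is a number, word_1 is NA and the first word goes to word_2
  c(string2[which(arenums)[1]], 
   if (arenums[2]) NA else string2[which(!arenums)[1]],    
   string2[which(arenums)[2]], 
   if (arenums[2]) string2[which(!arenums)[1]] else string2[which(!arenums)[2]])
}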

> f("97")
[1] "97" NA   NA   NA  
> f("210 foo")
[1] "210" "foo" NA    NA   
> f("87 bar 89")
[1] "87"  "bar" "89"  NA   
> f("21 23")
[1] "21" NA   "23" NA  
> f("123 2 fizzbuzz")
[1] "123"      NA         "2"        "fizzbuzz"
> f("12 fizz 34 buzz")
[1] "12"   "fizz" "34"   "buzz"
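
To get the table asked for in the question, f can be applied to every element and the results bound row by row. The sketch below repeats f so that it runs on its own. The trimws call is an added assumption to guard against leading blanks such as those in the dput data, and the names inputs, out, and result are hypothetical.

```r
# f() is copied from the answer above so this block is self-contained
f <- function(x){
  string2 <- strsplit(x, " ")[[1]]
  if (length(string2) < 2)
    return(c(string2, NA, NA, NA))
  arenums <- grepl("\\d", string2)
  c(string2[which(arenums)[1]],
    if (arenums[2]) NA else string2[which(!arenums)[1]],
    string2[which(arenums)[2]],
    if (arenums[2]) string2[which(!arenums)[1]] else string2[which(!arenums)[2]])
}

inputs <- c("97", "210 foo", "87 bar 89", "21 23",
            "123 2 fizzbuzz", "12 fizz 34 buzz")

# One length-4 character vector per input, stacked into a 6 x 4 matrix
out <- do.call(rbind, lapply(trimws(inputs), f))
colnames(out) <- c("nb_1", "word_1", "nb_2", "word_2")

result <- data.frame(input = trimws(inputs), out, stringsAsFactors = FALSE)
result
```

Note that f splits on single spaces, so inputs with repeated spaces between tokens would still need to be normalised first.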
