
I have a table with a character field that can follow any of these patterns:

input              

97                 # a single number
210 foo            # a number and a word
87 bar 89          # a number, a word, a number
21 23              # two numbers
123 2 fizzbuzz     # two numbers, a word
12 fizz 34 buzz    # a number, a word, a number, a word 

I'd like to split each line into up to 4 parts, containing respectively the first number, the first word if it exists, the second number if it exists, and the second word if it exists. For my example, this would give:

input               nb_1    word_1    nb_2    word_2

97                  97
210 foo             210     foo
87 bar 89           87      bar       89
21 23               21                23
123 2 fizzbuzz      123               2       fizzbuzz
12 fizz 34 buzz     12      fizz      34      buzz

Please note the case of two numbers followed by a word (the second-to-last example): word_1 is empty because there is no word between the two numbers.

Is there a way to do this without a tedious if / if / else structure?

If it helps, all the words belong to a list of 10 specific words. If there are two words, they can be the same or different. The numbers can be one, two or three digits long.

Thanks

  • Please use dput to show the example. Is it a single string or two columns in a data.frame, i.e. "a single number : 97"? Commented May 25, 2016 at 11:31
  • Sorry, it was just to indicate the pattern. I modified the question in a hopefully clearer way. Commented May 25, 2016 at 12:04

3 Answers


Here is an idea using gsub and cSplit from the splitstackshape package:

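library(splitstackshape)
# Replace every non-digit with a space, leaving only the numbers
df$num <- gsub('\\D', ' ', df$V1)
# Replace every digit with a space, leaving only the words
df$wrds <- gsub('\\d', ' ', df$V1)
# Split columns 2 and 3 on spaces into wide format
newdf <- cSplit(df, 2:3, ' ', 'wide')
newdf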
#                                    V1 num_1 num_2   wrds_1 wrds_2
#1:                                  97    97    NA       NA     NA
#2:                             210 foo   210    NA      foo     NA
#3:                           87 bar 89    87    89      bar     NA
#4:                               21 23    21    23       NA     NA 
#5:                      123 2 fizzbuzz   123     2 fizzbuzz     NA
#6:                     12 fizz 34 buzz    12    34     fizz   buzz

The only problem is row 5, where the word ends up in wrds_1 instead of wrds_2. It can be fixed as follows:

newdf$wrds_1 <- as.character(newdf$wrds_1)
newdf$wrds_2 <- as.character(newdf$wrds_2)
# Rows of the form "number number word": move the word from wrds_1 to wrds_2
idx <- grep('[0-9]+\\s+[0-9]+\\s+[A-Za-z]', newdf$V1)
newdf$wrds_2[idx] <- newdf$wrds_1[idx]
newdf$wrds_1[idx] <- NA

which finally gives,

newdf
#                                    V1 num_1 num_2 wrds_1   wrds_2
#1:                                  97    97    NA     NA       NA
#2:                             210 foo   210    NA    foo       NA
#3:                           87 bar 89    87    89    bar       NA
#4:                               21 23    21    23     NA       NA
#5:                      123 2 fizzbuzz   123     2     NA fizzbuzz
#6:                     12 fizz 34 buzz    12    34   fizz     buzz

DATA

dput(df)
structure(list(V1 = c("97", "                  210 foo", "                          87 bar 89", 
"                    21 23", "                    123 2 fizzbuzz", 
"                    12 fizz 34 buzz")), .Names = "V1", row.names = c(NA, 
-6L), class = "data.frame")

1 Comment

Thanks for all three answers. I went with this one as it seemed the simplest to me.

I tried it in a different way:
library(splitstackshape)
abc <- data.frame(a = c(97, "210 foo", "87 bar 89", "21 23", "123 2 fizzbuzz", "12 fizz 34 buzz"))
# Split on spaces into up to four positional columns
abc1 <- data.frame(cSplit(abc, "a", " ", stripWhite = FALSE))
abc <- cbind(abc, abc1)
names(abc) <- c("input", "nb_1", "word_1", "nb_2", "word_2")
abc[, 1:5] <- apply(abc[, 1:5], 2, as.character)
# Shift values that landed in the wrong positional column
for (i in 1:nrow(abc)) {
  abc$word_2[i] <- replace(abc$word_2[i], is.na(abc$word_2[i]), abc$nb_2[grepl("[a-z]", abc$nb_2[i])][i])
  abc$nb_2[i] <- replace(abc$nb_2[i], is.na(abc$nb_2[i]) | grepl("[a-z]", abc$nb_2[i]), abc$word_1[grepl("[0-9]", abc$word_1[i])][i])
}
abc$word_1 <- ifelse(grepl("[0-9]", abc$word_1), NA, abc$word_1)
abc[is.na(abc)] <- ""
print(abc)
            input nb_1 word_1 nb_2   word_2
1              97   97                     
2         210 foo  210    foo              
3       87 bar 89   87    bar   89         
4           21 23   21          23         
5  123 2 fizzbuzz  123           2 fizzbuzz
6 12 fizz 34 buzz   12   fizz   34     buzz


This is a hacky function to do it... although you might have other cases that would break it.

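f <- function(x){
  string2 <- strsplit(x, " ")[[1]]
  # A single token is just the first number
  if (length(string2) < 2)
    return(c(string2, NA, NA, NA))
  # TRUE for tokens that contain a digit
  arenums <- grepl("\\d", string2)
  # Order: first number, first word, second number, second word.
  # If the second token is a number, there is no first word.
  c(string2[which(arenums)[1]],
    if (arenums[2]) NA else string2[which(!arenums)[1]],
    string2[which(arenums)[2]],
    if (arenums[2]) string2[which(!arenums)[1]] else string2[which(!arenums)[2]])
}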

> f("97")
[1] "97" NA   NA   NA  
> f("210 foo")
[1] "210" "foo" NA    NA   
> f("87 bar 89")
[1] "87"  "bar" "89"  NA   
> f("21 23")
[1] "21" NA   "23" NA  
> f("123 2 fizzbuzz")
[1] "123"      NA         "2"        "fizzbuzz"
> f("12 fizz 34 buzz")
[1] "12"   "fizz" "34"   "buzz"
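
The branching can also be pushed into a single regular expression with optional groups. This is only a sketch in base R and is not taken from any of the answers: the function name split_parts and the pattern are my own assumptions. It assumes words are plain ASCII letters, fields are separated by spaces, and a second word only ever appears after a second number (which holds for the six patterns in the question).

```r
# Hypothetical helper: split each string into nb_1, word_1, nb_2, word_2
# using one regular expression with optional groups.
split_parts <- function(x) {
  # Groups 1, 3, 5 and 7 hold nb_1, word_1, nb_2 and word_2.
  # Groups 2, 4 and 6 only wrap the optional " <token>" parts.
  pattern <- "^ *([0-9]+)( +([A-Za-z]+))?( +([0-9]+)( +([A-Za-z]+))?)? *$"
  m <- regmatches(x, regexec(pattern, x))
  out <- t(vapply(m, function(g) {
    # No match at all: return a row of NA
    if (length(g) == 0) return(rep(NA_character_, 4))
    # g[1] is the full match, so group k sits at position k + 1
    g[c(2, 4, 6, 8)]
  }, character(4)))
  # Optional groups that did not take part in the match come back as ""
  out[!is.na(out) & out == ""] <- NA
  colnames(out) <- c("nb_1", "word_1", "nb_2", "word_2")
  data.frame(input = x, out, stringsAsFactors = FALSE)
}

split_parts(c("97", "210 foo", "87 bar 89", "21 23",
              "123 2 fizzbuzz", "12 fizz 34 buzz"))
```

Since the words come from a known list of 10, the [A-Za-z]+ parts could be replaced by an alternation of those words, which would make the pattern reject anything unexpected instead of silently accepting it.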
