Splitting a string and assigning values to variables in R

Question

I'm looking to answer a question similar to this one, but in R.

I'm working with a dataset including a variable that concatenates 30 values of a string with numeric values in parentheses. The separate combinations of strings and parenthetical numbers are comma-separated.

IMPORTANT: Sometimes, string values may be repeating.

For example, in df the var might be:

id    var
1     Videos (10.1), Music (9.5), Games (8.3), Videos (1)
2     Videos (11.1), Dogs (10.5), Cats (8.4), Dogs (1)
3     Cars (12.1), Music (9.5), Games (8.5), Games (2)
4     Cars (14.1), Music (9.5), Dogs (8.6)
5     Horses (10.1), Antelope (9.5), Music (8.7)
6     Music (10.1), Videos (9.0), Games (8.9)

What I would like to produce is additional columns where each unique string value in var has its own column, and the value for that column is the number in parentheses (when available). Something that is a bit tricky (to me) is that when a string value (e.g. Videos) is repeated, I would like to sum the numeric values.

So, in the dataset I've produced, the ideal output would be:

id   Videos   Music   Games   Dogs   Cats   Cars   Horses   Antelope
1    11.1     9.5     8.3     NA     NA     NA     NA       NA
2    11.1     NA      NA      11.5   8.4    NA     NA       NA
3    NA       9.5     10.5    NA     NA     12.1   NA       NA
4    NA       9.5     NA      8.6    NA     14.1   NA       NA
5    NA       8.7     NA      NA     NA     NA     10.1     9.5
6    9.0      10.1    8.9     NA     NA     NA     NA       NA

Any thoughts on how one would go about doing this in R?

EDIT: Real data included below:

 my_df<-data.frame(id=1:20, var= c("PeopleBlogs(2.88)", "Music(3.90)", "Entertainment(3.05),Music(5.10),Music(2.28)", 
"Sports(1.02)", "NonprofitsActivism(0.20),FilmAnimation(0.58)", 
"Music(3.60),Music(1.42),Music(7.60)", "GadgetsGames(0.52)", 
"Music(9.17),PeopleBlogs(0.33),PeopleBlogs(1.58),Music(8.82),Entertainment(1.38),PeopleBlogs(0.45),PeopleBlogs(0.58),Entertainment(0.92),FilmAnimation(1.60),FilmAnimation(7.57),Music(2.28),Entertainment(3.18),Entertainment(4.98),Music(0.48),FilmAnimation(0.28),FilmAnimation(0.18),Entertainment(5.97),Entertainment(1.35)", 
"FilmAnimation(2.42),GadgetsGames(3.92)", "PeopleBlogs(4.38),GadgetsGames(15.47)", 
"Entertainment(3.52)", "PeopleBlogs(0.22),Music(1.15),PetsAnimals(3.50),PeopleBlogs(2.78),PeopleBlogs(3.27)", 
"Music(2.05),PeopleBlogs(0.20)", "Music(3.48),Music(4.65),Music(0.55)", 
"Entertainment(0.78)", "Entertainment(4.35),PeopleBlogs(2.33),Comedy(7.05),PeopleBlogs(7.27)", 
"Entertainment(0.50)", "Education(1.73)", "Education(0.67)", 
"GadgetsGames(17.35),Education(7.40),NewsPolitics(0.35)"))

So when they are repeated in a row, do you want to sum them?
– Rich Scriven, Commented Sep 25, 2015 at 1:30
@RichardScriven Yes! I will clarify my language to make this more explicit.
– roody, Commented Sep 25, 2015 at 1:37

jazzurro · Accepted Answer · 2015-09-25 23:33:27Z

4

Here is one approach. First, use cSplit() from the splitstackshape package to split the var column on "," and reshape the data to long format, so that each category-value pair gets its own row. Then split var again on the space, this time in wide format, which gives var_1 (the category) and var_2 (the number in parentheses). At this point you have a data.table, not a data.frame. Using the data.table package, remove ( and ) from var_2 and convert it from character to numeric. Then, as you requested, sum the numbers by id and var_1. Finally, dcast() from the data.table package reshapes the totals into the desired output. I hope this helps.

mydf <- data.frame(id = 1:6,
                   var = c("Videos (10.1), Music (9.5), Games (8.3), Videos (1)",
                         "Videos (11.1), Dogs (10.5), Cats (8.4), Dogs (1)",
                         "Cars (12.1), Music (9.5), Games (8.5), Games (2)",
                         "Cars (14.1), Music (9.5), Dogs (8.6)",
                         "Horses (10.1), Antelope (9.5), Music (8.7)",
                         "Music (10.1), Videos (9.0), Games (8.9)"),
                   stringsAsFactors = FALSE)

library(splitstackshape)
library(data.table)
library(magrittr)

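# Split on "," into long format (one pair per row), then split each pair
# on the space into var_1 (category) and var_2 (number in parentheses)
cSplit(mydf, "var", sep = ",", direction = "long") %>%
  cSplit("var", sep = " ", direction = "wide") -> foo

# Strip the parentheses, convert to numeric, sum by id and category,
# then reshape to one column per category
foo[, var_2 := as.numeric(gsub(pattern = "\\(|\\)", replacement = "", x = var_2))][,
    list(total = sum(var_2)), by = list(id, var_1)] %>%
  dcast(id ~ var_1, value.var = "total")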


#  id Antelope Cars Cats Dogs Games Horses Music Videos
#1:  1       NA   NA   NA   NA   8.3     NA   9.5   11.1
#2:  2       NA   NA  8.4 11.5    NA     NA    NA   11.1
#3:  3       NA 12.1   NA   NA  10.5     NA   9.5     NA
#4:  4       NA 14.1   NA  8.6    NA     NA   9.5     NA
#5:  5      9.5   NA   NA   NA    NA   10.1   8.7     NA
#6:  6       NA   NA   NA   NA   8.9     NA  10.1    9.0

EDIT

With your real data, neither Ananda's code nor mine works. This is because there is no space between the name and the number (e.g., Videos(10.1)), whereas your original sample data do have a space between them (e.g., Videos (10.1)). Splitting on " " therefore never creates var_2, which is why you see the "object 'var_2' not found" error. Modifying my original answer to split on "(" instead does the job. Part of the result is shown below.

cSplit(my_df, "var", sep = ",", direction = "long") %>%
  cSplit("var", sep = "(", direction = "wide") -> foo

foo[, var_2 := as.numeric(gsub(pattern = "\\)", replacement = "", x = var_2))][,
    list(total = sum(var_2)), by = list(id, var_1)] %>%
  dcast(id ~ var_1, value.var = "total")

#    id Comedy Education Entertainment FilmAnimation GadgetsGames Music NewsPolitics
#1:  1     NA        NA            NA            NA           NA    NA           NA
#2:  2     NA        NA            NA            NA           NA  3.90           NA
#3:  3     NA        NA          3.05            NA           NA  7.38           NA
#4:  4     NA        NA            NA            NA           NA    NA           NA
#5:  5     NA        NA            NA          0.58           NA    NA           NA
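
If the layout of the pairs may vary between datasets, one option is to avoid depending on the separator altogether. The following is only a minimal base R sketch, not part of the code above: it assumes every item has the form Name(number) with optional whitespace before the parenthesis, and the function name sum_by_category is made up for illustration. It pulls the name and the number out with regular expressions, so both Videos (10.1) and Videos(10.1) are handled.

```r
# Hypothetical helper (an illustration, not from any package):
# returns a matrix with one row per id and one column per category.
sum_by_category <- function(id, var) {
  # One character vector of "Name(number)" items per input row
  items <- strsplit(var, ",", fixed = TRUE)
  long_id <- rep(id, lengths(items))
  item <- trimws(unlist(items))

  # Category: everything before the "(" and any whitespace preceding it
  category <- sub("[[:space:]]*\\(.*$", "", item)
  # Value: the text between "(" and ")"
  value <- as.numeric(sub("^.*\\(([^)]*)\\).*$", "\\1", item))

  # Sum repeated categories within each id; absent combinations become NA
  tapply(value, list(id = long_id, category = category), sum)
}

id <- 1:3
var <- c("Videos (10.1), Music (9.5), Games (8.3), Videos (1)",  # with space
         "Entertainment(3.05),Music(5.10),Music(2.28)",          # without space
         "PeopleBlogs(2.88)")

wide <- sum_by_category(id, var)
wide
```

The result is a matrix rather than a data.table, with the ids as row names. If a data frame is needed, something like data.frame(id = rownames(wide), wide, check.names = FALSE) should work.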

edited Sep 25, 2015 at 23:33

answered Sep 25, 2015 at 4:01

jazzurro


8 Comments

A5C1D2H2I1M1N2O1R2T1 Over a year ago

I had done almost the same:

cSplit(mydf, "var", ",", "long") %>% .[, var := gsub(" (", "|", gsub(")", "", var, fixed = TRUE), fixed = TRUE)] %>% cSplit(., "var", "|") %>% dcast(id ~ var_1, value.var = "var_2", fun.aggregate = sum, fill = NA)

(using magrittr). +1

jazzurro Over a year ago

@AnandaMahto Thank you for sharing your code. Using fun.aggregate = sum, we can make the code shorter. Thanks for your advice. :)

roody Over a year ago

@jazzurro This code is returning the error: Error in gsub(pattern = "\\(|\\)", replacement = "", x = var_2) : object 'var_2' not found

jazzurro Over a year ago

@roody I ran the code above again. On my side, I do not see the error message. Could you check if you have all latest packages above and use mydf and run the code? Could you also check if your data set is identical to your sample?

roody Over a year ago

@jazzurro Hi there! I updated with actual data from my dataset...I am getting the same error message regarding var_2. Thanks so much for your help!


bramtayl · Accepted Answer · 2015-09-25 04:25:31Z

0

library(dplyr)
library(stringi)
library(tidyr)    

mydf %>%
  mutate(both = var %>% stri_split_fixed(", ") ) %>%
  unnest(both) %>%
  separate(both, c("category", "value.string"), sep = " ") %>%
  mutate(value = value.string %>% extract_numeric) %>%
  group_by(id, category) %>%
  summarize(value = sum(value)) %>%
  spread(category, value)

edited Sep 25, 2015 at 4:25

answered Sep 25, 2015 at 4:13

bramtayl


fnd · Accepted Answer · 2015-09-25 04:51:22Z

# Another method without additional libraries
# vector with data to split
X <- c("Videos (10.1), Music (9.5), Games (8.3), Videos (1)",
       "Videos (11.1), Dogs (10.5), Cats (8.4), Dogs (1)",
       "Cars (12.1), Music (9.5), Games (8.5), Games (2)",
       "Cars (14.1), Music (9.5), Dogs (8.6)",
       "Horses (10.1), Antelope (9.5), Music (8.7)")

# custom split function
g <- function(x, ...) {
  x <- chartr(")", " ", x)
  x <- chartr(",", "\n", x)
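  # stringsAsFactors=TRUE is required in R >= 4.0.0 so that levels() below works
  x <- read.table(text=x, sep="(", strip.white=TRUE, stringsAsFactors=TRUE)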
  L <- levels(x$V1)
  V <- numeric(0)
  for (l in L) {
    V <- c(V, sum(x$V2[x$V1==l]))
  }
  names(V) <- L
  return(V)
}

# making a data.frame element by element
Y <- data.frame(case=1:length(X))
for (i in 1:length(X)) {
  rw <- g(X[i]) 
  for (n in names(rw)) {
    Y[i,n] <- rw[n]
  }
}

Y

  case Games Music Videos Cats Dogs Cars Antelope Horses
1    1   8.3   9.5   11.1   NA   NA   NA       NA     NA
2    2    NA    NA   11.1  8.4 11.5   NA       NA     NA
3    3  10.5   9.5     NA   NA   NA 12.1       NA     NA
4    4    NA   9.5     NA   NA  8.6 14.1       NA     NA
5    5    NA   8.7     NA   NA   NA   NA      9.5   10.1
