
I'm looking to answer a question similar to this one, but in R.

I'm working with a dataset that includes a variable concatenating 30 string values, each followed by a numeric value in parentheses. The string-number pairs are comma-separated.

IMPORTANT: Sometimes a string value is repeated within a row.

For example, in df the var might be:

id    var
1     Videos (10.1), Music (9.5), Games (8.3), Videos (1)
2     Videos (11.1), Dogs (10.5), Cats (8.4), Dogs (1)
3     Cars (12.1), Music (9.5), Games (8.5), Games (2)
4     Cars (14.1), Music (9.5), Dogs (8.6)
5     Horses (10.1), Antelope (9.5), Music (8.7)
6     Music (10.1), Videos (9.0), Games (8.9)

What I would like to produce is additional columns where each unique string value in var has its own column, and the value for that column is the number in parentheses (when available). Something that is a bit tricky (to me) is that when a string value (e.g. Videos) is repeated, I would like to sum the numeric values.

So, in the dataset I've produced, the ideal output would be:

id   Videos   Music   Games    Dogs   Cats    Cars   Horses   Antelope
1    11.1     9.5     8.3      NA     NA      NA     NA       NA
2    11.1     NA      NA       11.5   8.4     NA     NA       NA
3    NA       9.5     10.5     NA     NA      12.1   NA       NA
4    NA       9.5     NA       8.6    NA      14.1   NA       NA
5    NA       8.7     NA       NA     NA      NA     10.1     9.5
6    9.0      10.1    8.9      NA     NA      NA     NA       NA

Any thoughts on how one would go about doing this in R?

EDIT: Real data included below:

 my_df<-data.frame(id=1:20, var= c("PeopleBlogs(2.88)", "Music(3.90)", "Entertainment(3.05),Music(5.10),Music(2.28)", 
"Sports(1.02)", "NonprofitsActivism(0.20),FilmAnimation(0.58)", 
"Music(3.60),Music(1.42),Music(7.60)", "GadgetsGames(0.52)", 
"Music(9.17),PeopleBlogs(0.33),PeopleBlogs(1.58),Music(8.82),Entertainment(1.38),PeopleBlogs(0.45),PeopleBlogs(0.58),Entertainment(0.92),FilmAnimation(1.60),FilmAnimation(7.57),Music(2.28),Entertainment(3.18),Entertainment(4.98),Music(0.48),FilmAnimation(0.28),FilmAnimation(0.18),Entertainment(5.97),Entertainment(1.35)", 
"FilmAnimation(2.42),GadgetsGames(3.92)", "PeopleBlogs(4.38),GadgetsGames(15.47)", 
"Entertainment(3.52)", "PeopleBlogs(0.22),Music(1.15),PetsAnimals(3.50),PeopleBlogs(2.78),PeopleBlogs(3.27)", 
"Music(2.05),PeopleBlogs(0.20)", "Music(3.48),Music(4.65),Music(0.55)", 
"Entertainment(0.78)", "Entertainment(4.35),PeopleBlogs(2.33),Comedy(7.05),PeopleBlogs(7.27)", 
"Entertainment(0.50)", "Education(1.73)", "Education(0.67)", 
"GadgetsGames(17.35),Education(7.40),NewsPolitics(0.35)"))
  • So when they are repeated in a row, do you want to sum them? Commented Sep 25, 2015 at 1:30
  • @RichardScriven Yes! I will clarify my language to make this more explicit. Commented Sep 25, 2015 at 1:37

3 Answers

Here is one approach. First, use cSplit() from the splitstackshape package to split the var column on "," and reshape the data to long format. Then split var again, this time on the space, which yields two columns: var_1 (the category) and var_2 (the number, still in parentheses). At this point the result is a data.table, not a data.frame. With data.table, strip "(" and ")" from var_2 and convert it to numeric, then sum the numbers by id and var_1 as you requested. Finally, dcast() from data.table reshapes the result to wide format, which gives the desired output.

mydf <- data.frame(id = 1:6,
                   var = c("Videos (10.1), Music (9.5), Games (8.3), Videos (1)",
                         "Videos (11.1), Dogs (10.5), Cats (8.4), Dogs (1)",
                         "Cars (12.1), Music (9.5), Games (8.5), Games (2)",
                         "Cars (14.1), Music (9.5), Dogs (8.6)",
                         "Horses (10.1), Antelope (9.5), Music (8.7)",
                         "Music (10.1), Videos (9.0), Games (8.9)"),
                   stringsAsFactors = FALSE)

library(splitstackshape)
library(data.table)
library(magrittr)

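# Split on "," into long format (one row per entry), then split each
# entry on the space into var_1 (category) and var_2 ("(number)")
cSplit(mydf, "var", sep = ",", direction = "long") %>%
cSplit("var", sep = " ", direction = "wide") -> foo

# Strip the parentheses, convert to numeric, sum by id and category,
# then reshape to wide format
foo[, var_2 := as.numeric(gsub(pattern = "\\(|\\)", replacement = "", x = var_2))][,
list(total = sum(var_2)), by = list(id, var_1)] %>%
dcast(id ~ var_1, value.var = "total")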


#  id Antelope Cars Cats Dogs Games Horses Music Videos
#1:  1       NA   NA   NA   NA   8.3     NA   9.5   11.1
#2:  2       NA   NA  8.4 11.5    NA     NA    NA   11.1
#3:  3       NA 12.1   NA   NA  10.5     NA   9.5     NA
#4:  4       NA 14.1   NA  8.6    NA     NA   9.5     NA
#5:  5      9.5   NA   NA   NA    NA   10.1   8.7     NA
#6:  6       NA   NA   NA   NA   8.9     NA  10.1    9.0

EDIT

With your real data, neither Ananda's code nor mine works. This is because the real data has no space between the category and the number (e.g., Videos(10.1)), whereas your original sample data does (e.g., Videos (10.1)). Modifying my original answer to split on "(" instead of a space does the job. Part of the result is shown below.

cSplit(my_df, "var", sep = ",", direction = "long") %>%
cSplit("var", sep = "(", direction = "wide") -> foo


foo[, var_2 := as.numeric(gsub(pattern = "\\)", replacement = "", x = var_2))][,
list(total = sum(var_2)), by = list(id, var_1)] %>%
dcast(id ~ var_1, value.var = "total")

#    id Comedy Education Entertainment FilmAnimation GadgetsGames Music NewsPolitics
#1:  1     NA        NA            NA            NA           NA    NA           NA
#2:  2     NA        NA            NA            NA           NA  3.90           NA
#3:  3     NA        NA          3.05            NA           NA  7.38           NA
#4:  4     NA        NA            NA            NA           NA    NA           NA
#5:  5     NA        NA            NA          0.58           NA    NA           NA
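
The space-versus-no-space problem described in the edit can also be sidestepped by parsing each entry with a pattern that tolerates optional whitespace before the parenthesis. What follows is only a minimal base R sketch under that assumption, not a replacement for the cSplit() approach; the helper name sum_by_category is hypothetical, and it assumes every entry has the form "Name(number)" or "Name (number)".

```r
# Hypothetical helper (not from any package): turns strings such as
# "Videos (10.1), Music(9.5)" into a wide table of per-id sums.
sum_by_category <- function(id, var) {
  # Split each string on commas; remember which id each piece came from
  pieces <- strsplit(var, ",", fixed = TRUE)
  long <- data.frame(
    id = rep(id, lengths(pieces)),
    piece = trimws(unlist(pieces)),
    stringsAsFactors = FALSE
  )
  # Category: everything before "(", dropping any whitespace in between
  long$category <- sub("[[:space:]]*\\(.*$", "", long$piece)
  # Value: the text between the parentheses, converted to numeric
  long$value <- as.numeric(sub("^.*\\(([^)]*)\\).*$", "\\1", long$piece))
  # Sum within each id/category pair; combinations that never occur stay NA
  wide <- tapply(long$value, list(long$id, long$category), sum)
  data.frame(id = sort(unique(id)), wide, check.names = FALSE)
}

# Works on both formats, even mixed within one vector
res <- sum_by_category(
  1:2,
  c("Videos (10.1), Music (9.5), Videos (1)", "Music(3.60),Music(1.42)")
)
res
```

With the data frames above, the call would be sum_by_category(mydf$id, mydf$var) or sum_by_category(my_df$id, as.character(my_df$var)); the as.character() guards against var being a factor in older R versions.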

8 Comments

I had done almost the same: cSplit(mydf, "var", ",", "long") %>% .[, var := gsub(" (", "|", gsub(")", "", var, fixed = TRUE), fixed = TRUE)] %>% cSplit(., "var", "|") %>% dcast(id ~ var_1, value.var = "var_2", fun.aggregate = sum, fill = NA) (using magrittr). +1
@AnandaMahto Thank you for sharing your code. Using fun.aggregate = sum, we can make the code shorter. Thanks for your advice. :)
@jazzurro This code is returning the error: Error in gsub(pattern = "\\(|\\)", replacement = "", x = var_2) : object 'var_2' not found
@roody I ran the code above again. On my side, I do not see the error message. Could you check if you have all latest packages above and use mydf and run the code? Could you also check if your data set is identical to your sample?
@jazzurro Hi there! I updated with actual data from my dataset...I am getting the same error message regarding var_2. Thanks so much for your help!
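library(dplyr)
library(stringi)
library(tidyr)

mydf %>%
  # split each string into its comma-separated entries (a list-column)
  mutate(both = var %>% stri_split_fixed(", ")) %>%
  # one row per entry
  unnest(both) %>%
  separate(both, c("category", "value.string"), sep = " ") %>%
  # strip the parentheses; extract_numeric() is deprecated in current tidyr
  mutate(value = as.numeric(gsub("[()]", "", value.string))) %>%
  group_by(id, category) %>%
  summarize(value = sum(value)) %>%
  spread(category, value)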

# Another method without additional libraries
# vector with data to split
X <- c("Videos (10.1), Music (9.5), Games (8.3), Videos (1)",
       "Videos (11.1), Dogs (10.5), Cats (8.4), Dogs (1)",
       "Cars (12.1), Music (9.5), Games (8.5), Games (2)",
       "Cars (14.1), Music (9.5), Dogs (8.6)",
       "Horses (10.1), Antelope (9.5), Music (8.7)")

# custom split function
g <- function(x, ...) {
  x <- chartr(")", " ", x)
  x <- chartr(",", "\n", x)
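  # stringsAsFactors = TRUE is needed in R >= 4.0 so that levels() below works
  x <- read.table(text=x, sep="(", strip.white=TRUE, stringsAsFactors=TRUE)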
  L <- levels(x$V1)
  V <- numeric(0)
  for (l in L) {
    V <- c(V, sum(x$V2[x$V1==l]))
  }
  names(V) <- L
  return(V)
}

# making a data.frame element by element
Y <- data.frame(case=1:length(X))
for (i in 1:length(X)) {
  rw <- g(X[i]) 
  for (n in names(rw)) {
    Y[i,n] <- rw[n]
  }
}

Y

  case Games Music Videos Cats Dogs Cars Antelope Horses
1    1   8.3   9.5   11.1   NA   NA   NA       NA     NA
2    2    NA    NA   11.1  8.4 11.5   NA       NA     NA
3    3  10.5   9.5     NA   NA   NA 12.1       NA     NA
4    4    NA   9.5     NA   NA  8.6 14.1       NA     NA
5    5    NA   8.7     NA   NA   NA   NA      9.5   10.1
