Create dummy variables from string with multiple values

Question

I have a data set with a column that contains multiple values, separated by a ;.

  name    sex     good_at
1 Tom      M   Drawing;Hiking
2 Mary     F   Cooking;Joking
3 Sam      M      Running
4 Charlie  M      Swimming

I would like to create a dummy variable for each unique value in good_at, such that each dummy variable contains TRUE or FALSE to indicate whether or not that individual possesses that particular value.

Desired Output

Drawing   Cooking
True       False
False      True
False      False
False      False

The problem I need to solve is that the existing variable contains more than one piece of information, such as drawing+hiking. In Google Sheets I would use a function like REGEXMATCH, but I have no idea how to code this in R. @CristianE.Nuno
– xxx, Commented Sep 27, 2018 at 1:15
Ah, I see now. Your problem is not the same. Thank you for clarifying.
– Cristian E. Nuno, Commented Sep 27, 2018 at 4:37

Cristian E. Nuno · Accepted Answer · 2018-09-29 20:06:14Z

Overview

Creating a dummy variable for each unique value in good_at requires the following steps:

1. Separate good_at into multiple rows.
2. Generate dummy variables, using dummy::dummy(), for each value in good_at for each name-sex pair.
3. Reshape the data into 4 columns: name, sex, key and value.
   - key contains all the dummy variable column names
   - value contains the values in each dummy variable
4. Keep only records where value is not zero.
5. Reshape the data into one record per name-sex pair, with as many columns as there are distinct values in key.
6. Cast the dummy columns as logical vectors.

Code

# load necessary packages ----
library(dummy)
library(tidyverse)

# load necessary data ----
df <-
  read.table(text = "name    sex     good_at
             1 Tom      M   Drawing;Hiking
             2 Mary     F   Cooking;Joking
             3 Sam      M      Running
             4 Charlie  M      Swimming"
             , header = TRUE
             , stringsAsFactors = FALSE)

# create a longer version of df -----
# where one record represents
# one unique name, sex, good_at value
df_clean <-
  df %>%
  separate_rows(good_at, sep = ";")

# create dummy variables for all unique values in "good_at" column ----
df_dummies <-
  df_clean %>%
  select(good_at) %>%
  dummy() %>%
  bind_cols(df_clean) %>%
  # drop "good_at" column 
  select(-good_at) %>%
  # make the tibble long by reshaping it into 4 columns:
  # name, sex, key and value
  # where key are the all dummy variable column names
  # and value are the values in each dummy variable
  gather(key, value, -name, -sex) %>%
  # keep records where
  # value is not equal to zero
  # note: this is due to "Tom" having both a 
  # "good_at_Drawing" value of 0 and 1. 
  filter(value != 0) %>%
  # make the tibble wide
  # with one record per name-sex pair
  # and as many columns as there are in key
  # with their values from value
  # and filling NA values to 0
  spread(key, value, fill = 0) %>%
  # for each name-sex pair
  # cast the dummy variables into logical vectors
  group_by(name, sex) %>%
  mutate_all(~ as.logical(as.integer(.))) %>%
  ungroup() %>%
  # just for safety let's join
  # the original "good_at" column
  left_join(y = df, by = c("name", "sex")) %>%
  # bring the original "good_at" column to the left-hand side 
  # of the tibble
  select(name, sex, good_at, matches("good_at_"))

# view result ----
df_dummies
# A tibble: 4 x 9
#   name  sex   good_at good_at_Cooking good_at_Drawing good_at_Hiking
#   <chr> <chr> <chr>   <lgl>           <lgl>           <lgl>         
# 1 Char… M     Swimmi… FALSE           FALSE           FALSE         
# 2 Mary  F     Cookin… TRUE            FALSE           FALSE         
# 3 Sam   M     Running FALSE           FALSE           FALSE         
# 4 Tom   M     Drawin… FALSE           TRUE            TRUE          
# ... with 3 more variables: good_at_Joking <lgl>, good_at_Running <lgl>,
#   good_at_Swimming <lgl>

# end of script #
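
For comparison, the same table can be sketched in base R, without dummy::dummy() or the reshaping steps. This is an alternative technique, not the approach used in the script above: it splits good_at with strsplit() and tests each row for each distinct value. The names skills, skill_levels and df_base are illustrative, and the data frame is rebuilt by hand from the question so that the snippet stands alone.

```r
# Example data from the question, rebuilt so the sketch is self-contained
df <- data.frame(
  name    = c("Tom", "Mary", "Sam", "Charlie"),
  sex     = c("M", "F", "M", "M"),
  good_at = c("Drawing;Hiking", "Cooking;Joking", "Running", "Swimming"),
  stringsAsFactors = FALSE
)

# split each good_at string on ";" into a list of character vectors
skills <- strsplit(df$good_at, ";", fixed = TRUE)

# every distinct value, sorted for a stable column order
skill_levels <- sort(unique(unlist(skills)))

# one logical column per value: TRUE when that row's values include it
dummies <- vapply(
  skill_levels,
  function(s) vapply(skills, function(x) s %in% x, logical(1)),
  logical(nrow(df))
)
colnames(dummies) <- paste0("good_at_", skill_levels)

# attach the dummy columns to the original data
df_base <- cbind(df, as.data.frame(dummies))
df_base
```

Because %in% compares whole values, a value that happens to be a substring of another (for example "Run" and "Running") would not produce a false match here.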

Mike Keith · Accepted Answer · 2018-09-27 20:25:32Z

I've created a function that gives the desired output:

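dum <- function(kw, col, type = c(TRUE, FALSE)) {
  # note: this errors if kw matches no element (or every element) of col
  # positions in col that match the keyword, labelled with type[1]
  hits <- as.data.frame(grep(as.character(kw), col, ignore.case = TRUE))
  hits$one <- type[1]
  colnames(hits) <- c("col1", "dummy")
  # positions that do not match, labelled with type[2]
  misses <- as.data.frame(grep(as.character(kw), col, ignore.case = TRUE,
                               invert = TRUE))
  misses$zero <- type[2]
  colnames(misses) <- c("col1", "dummy")
  # recombine and restore the original row order
  out <- rbind(hits, misses)
  out <- out[order(out$col1), ]
  return(out$dummy)
}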

It may not be super elegant, but it works. Using your example, your data frame is df and the column you are referencing is df$good_at (R is case-sensitive, so the name must match the column exactly):

Drawing <- dum("drawing", df$good_at)
> Drawing
  TRUE
  FALSE
  ...

Cooking <- dum("cooking", df$good_at)
> Cooking
  FALSE
  TRUE
  ...

edited Sep 27, 2018 at 20:25

answered Sep 27, 2018 at 1:58

2 Comments

xxx Over a year ago

This function works for the first three columns, but it does not work for the fourth and later columns. It shows: Error in `$<-.data.frame`(`*tmp*`, "one", value = TRUE) : replacement has 1 row, data has 0 @mike

Mike Keith Over a year ago

If you get that error, it means the keyword you're searching for does not appear in that column.
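
The error in this thread comes from grep() returning an empty vector when nothing matches, which leaves a zero-row data frame that cannot take a new column. A minimal sketch of a variant that avoids it is below, assuming grepl() is an acceptable substitute; dum_safe is a hypothetical name, not part of the answer above, and good_at is the question's column retyped as a plain vector.

```r
# hypothetical variant of dum(): grepl() returns one logical per element
# of col, so the result has the same length as col even when kw matches
# nothing (or everything)
dum_safe <- function(kw, col, type = c(TRUE, FALSE)) {
  hit <- grepl(as.character(kw), col, ignore.case = TRUE)
  ifelse(hit, type[1], type[2])
}

# the good_at column from the question, as a plain character vector
good_at <- c("Drawing;Hiking", "Cooking;Joking", "Running", "Swimming")

dum_safe("drawing", good_at)  # keyword present in the first element
dum_safe("dancing", good_at)  # keyword absent: all type[2], no error

# build one column per distinct value in a single call
keywords <- unique(unlist(strsplit(good_at, ";", fixed = TRUE)))
all_dummies <- sapply(keywords, dum_safe, col = good_at)
all_dummies
```

Like the original function, this treats the keyword as a case-insensitive regular expression, so a keyword that is a substring of another value would match both.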
