
I have a data set with a column that contains multiple values, separated by a ;.

  name    sex     good_at
1 Tom      M   Drawing;Hiking
2 Mary     F   Cooking;Joking
3 Sam      M      Running
4 Charlie  M      Swimming

I would like to create a dummy variable for each unique value in good_at, such that each dummy variable contains TRUE or FALSE to indicate whether or not that individual possesses that particular value.

Desired Output

Drawing   Cooking
True       False
False      True
False      False
False      False
  • The problem I need to solve is that the existing variable contains more than one piece of information, such as drawing+hiking. In Google Sheets I would use a function like REGEXMATCH, but I have no idea how to code this in R. @CristianE.Nuno Commented Sep 27, 2018 at 1:15
  • Ah I see now. Your problem is not the same. Thank you for clarifying. Commented Sep 27, 2018 at 4:37

2 Answers


Overview

Creating dummy variables for each unique value in good_at requires the following steps:

  • Separate good_at into multiple rows
  • Generate dummy variables - using dummy::dummy() - for each value in good_at for each name-sex pair
  • Reshape data into 4 columns: name, sex, key and value
    • key contains all the dummy variable column names
    • value contains the values in each dummy variable
  • Keep only records where value is not zero
  • Reshape data into one record per name-sex pair and as many columns as there are in key
  • Cast the dummy columns as logical vectors

Code

# load necessary packages ----
library(dummy)
library(tidyverse)

# load necessary data ----
df <-
  read.table(text = "name    sex     good_at
1 Tom      M   Drawing;Hiking
             2 Mary     F   Cooking;Joking
             3 Sam      M      Running
             4 Charlie  M      Swimming"
             , header = TRUE
             , stringsAsFactors = FALSE)

# create a longer version of df -----
# where one record represents
# one unique name, sex, good_at value
df_clean <-
  df %>%
  separate_rows(good_at, sep = ";")

# create dummy variables for all unique values in "good_at" column ----
df_dummies <-
  df_clean %>%
  select(good_at) %>%
  dummy() %>%
  bind_cols(df_clean) %>%
  # drop "good_at" column 
  select(-good_at) %>%
  # make the tibble long by reshaping it into 4 columns:
  # name, sex, key and value
  # where key are the all dummy variable column names
  # and value are the values in each dummy variable
  gather(key, value, -name, -sex) %>%
  # keep records where
  # value is not equal to zero
  # note: this is due to "Tom" having both a 
  # "good_at_Drawing" value of 0 and 1. 
  filter(value != 0) %>%
  # make the tibble wide
  # with one record per name-sex pair
  # and as many columns as there are in key
  # with their values from value
  # and filling NA values to 0
  spread(key, value, fill = 0) %>%
  # for each name-sex pair
  # cast the dummy variables into logical vectors
  group_by(name, sex) %>%
  mutate_all(~ as.logical(as.integer(.))) %>%
  ungroup() %>%
  # just for safety let's join
  # the original "good_at" column
  left_join(y = df, by = c("name", "sex")) %>%
  # bring the original "good_at" column to the left-hand side 
  # of the tibble
  select(name, sex, good_at, matches("good_at_"))

# view result ----
df_dummies
# A tibble: 4 x 9
#   name  sex   good_at good_at_Cooking good_at_Drawing good_at_Hiking
#   <chr> <chr> <chr>   <lgl>           <lgl>           <lgl>         
# 1 Char… M     Swimmi… FALSE           FALSE           FALSE         
# 2 Mary  F     Cookin… TRUE            FALSE           FALSE         
# 3 Sam   M     Running FALSE           FALSE           FALSE         
# 4 Tom   M     Drawin… FALSE           TRUE            TRUE          
# ... with 3 more variables: good_at_Joking <lgl>, good_at_Running <lgl>,
#   good_at_Swimming <lgl>

# end of script #
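
For comparison, here is a minimal base R sketch of the same idea that avoids the dummy and tidyverse packages. It is an alternative, not the method used above: it splits good_at on ";" with strsplit() and tests each row for each distinct value. The names skills, all_skills, dummies and result are made up for illustration, and the data frame is rebuilt inline so the snippet runs on its own.

```r
# same example data as above, rebuilt inline (row names omitted)
df <- data.frame(
  name    = c("Tom", "Mary", "Sam", "Charlie"),
  sex     = c("M", "F", "M", "M"),
  good_at = c("Drawing;Hiking", "Cooking;Joking", "Running", "Swimming"),
  stringsAsFactors = FALSE
)

# split each cell on ";" into a character vector of values
skills <- strsplit(df$good_at, ";", fixed = TRUE)

# every distinct value, sorted for a stable column order
all_skills <- sort(unique(unlist(skills)))

# one logical column per value: TRUE where the row's values contain it
dummies <- vapply(
  all_skills,
  function(s) vapply(skills, function(x) s %in% x, logical(1)),
  logical(nrow(df))
)
colnames(dummies) <- paste0("good_at_", all_skills)

# attach the logical columns to the original data
result <- cbind(df, dummies)
result
```

Because the match is done with %in% on the split values rather than with a pattern, a value such as "Joking" would not be confused with a longer value that merely contains it.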

I've created a function that gives the desired output:

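dum <- function(kw, col, type = c(TRUE, FALSE)) {
  # row indices where the keyword matches, flagged with type[1]
  t <- as.data.frame(grep(as.character(kw), col, ignore.case = TRUE))
  t$one <- type[1]
  colnames(t) <- c("col1", "dummy")
  # row indices where the keyword does not match, flagged with type[2]
  t2 <- as.data.frame(grep(as.character(kw), col, ignore.case = TRUE,
                           invert = TRUE))
  t2$zero <- type[2]
  colnames(t2) <- c("col1", "dummy")
  # stack both sets and restore the original row order
  t3 <- rbind(t, t2)
  t3 <- t3[order(t3$col1), ]
  return(t3$dummy)
}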

It may not be super elegant, but it works. Using your example, your data frame is df and the column you are referencing is df$good_at (R is case-sensitive, so the name must match the column exactly):

Drawing <- dum("drawing", df$good_at)
> Drawing
  TRUE
  FALSE
  ...

Cooking <- dum("cooking", df$good_at)
> Cooking
  FALSE
  TRUE
  ...

2 Comments

This function works on the first three columns, but it does not work on the fourth and later columns. It shows: Error in $<-.data.frame(*tmp*, "one", value = TRUE) : replacement has 1 row, data has 0 @mike
If you get that error, it means the keyword you're searching for does not appear in that column.
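
One way to avoid that error is sketched below, assuming the goal is simply a TRUE/FALSE vector per keyword. dum_safe is a hypothetical helper, not part of the answer above. It uses grepl(), which always returns one logical value per element, so a keyword with no matches yields all FALSE instead of an empty data frame.

```r
# hypothetical variant of dum(); not the original author's function
dum_safe <- function(kw, col, type = c(TRUE, FALSE)) {
  # grepl() returns TRUE/FALSE for every element, even when nothing matches
  hit <- grepl(as.character(kw), col, ignore.case = TRUE)
  # map matches to type[1] and non-matches to type[2]
  ifelse(hit, type[1], type[2])
}

good_at <- c("Drawing;Hiking", "Cooking;Joking", "Running", "Swimming")

dum_safe("drawing", good_at)  # TRUE FALSE FALSE FALSE
dum_safe("singing", good_at)  # FALSE FALSE FALSE FALSE (no error)
```

Note that kw is still treated as a regular expression and matched anywhere in the string, so a short keyword can also match inside a longer value.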
