R - Setting Value Based on Matching to Column Name

Question

Afternoon clever people.

I have a decent sized data set (>800k rows) and as an example I have pulled out a tiny sample of 10 columns by 2 rows. At the outset only the "Topics" column is populated, with a comma-separated string of topic codes; all other columns are set to FALSE.

This will recreate the data as it sits currently...

  Topics <- c("E11,E31,E313,ECAT" , "E1,E20") 
  E1     <- c(FALSE, FALSE)
  E11    <- c(FALSE, FALSE)
  E20    <- c(FALSE, FALSE)
  E30    <- c(FALSE, FALSE)
  E31    <- c(FALSE, FALSE)
  E100   <- c(FALSE, FALSE)
  E300   <- c(FALSE, FALSE)
  E313   <- c(FALSE, FALSE)
  ECAT   <- c(FALSE, FALSE)
  df     <- data.frame(Topics,E1,E11,E20,E30,E31,E100,E300,E313,ECAT)

Which will give something like...

Topics              E1    E11   E20   E30   E31   E100  E300  E313  ECAT
E11,E31,E313,ECAT   FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
E1,E20              FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

I want to set the relevant row,column to TRUE where there is a match for each of the items in the topic vector. So it should look something like...

Topics              E1    E11   E20   E30   E31   E100  E300  E313  ECAT
E11,E31,E313,ECAT   FALSE TRUE  FALSE FALSE TRUE  FALSE TRUE  FALSE TRUE
E1,E20              TRUE  FALSE TRUE  FALSE FALSE FALSE FALSE FALSE FALSE

So far I have failed UTTERLY to work this one out but I suspect it is something like:

split the topic into a vector using strsplit
for each item in vector try to match to names(df)
when matched set row,column == TRUE

BUT I have tried all sorts and cannot fathom the logic. Can anyone break this down for me please?

In the expected result, E313 should be TRUE instead of E300.
– akrun, Commented Mar 2, 2015 at 17:26

akrun · Accepted Answer · 2015-03-02 18:13:19Z

Try

df[-1] <-  t(vapply(strsplit(as.character(df$Topics), ','),
                 function(x) names(df)[-1] %in% x, logical(ncol(df)-1)))
df
#             Topics    E1   E11   E20   E30   E31  E100  E300  E313  ECAT
#1 E11,E31,E313,ECAT FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE
#2            E1,E20  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE

Or

 df[-1] <- t(vapply(strsplit(as.character(df$Topics), ","), function(x)
         !!table(factor(x, levels=names(df)[-1])), logical(ncol(df)-1)))
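
To unpack what the first one-liner is doing, here is a rough step-by-step sketch of the same idea. The intermediate names (topic_cols, topic_list, hits) are only illustrative and do not appear in the answer; the sample data is rebuilt from the question.

```r
# Rebuild the sample data from the question (length-1 FALSE values recycle)
df <- data.frame(
  Topics = c("E11,E31,E313,ECAT", "E1,E20"),
  E1 = FALSE, E11 = FALSE, E20 = FALSE, E30 = FALSE, E31 = FALSE,
  E100 = FALSE, E300 = FALSE, E313 = FALSE, ECAT = FALSE,
  stringsAsFactors = FALSE
)

# Every column except the first (Topics) is a flag column
topic_cols <- names(df)[-1]

# One character vector of topic codes per row
topic_list <- strsplit(as.character(df$Topics), ",", fixed = TRUE)

# For each row, ask "is this column name among the row's topics?"
# vapply() returns one column per row, so the result is 9 x 2 here
hits <- vapply(topic_list,
               function(x) topic_cols %in% x,
               logical(length(topic_cols)))

# Transpose to rows x columns and overwrite the flag columns
df[-1] <- t(hits)
df
```

The %in% test is an exact match on whole codes, so "E1" does not accidentally match "E11" or "E100".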

edited Mar 2, 2015 at 18:13

answered Mar 2, 2015 at 17:24

akrun


4 Comments

BarneyC Over a year ago

Epic. So I spent a whole day and probably put down 50-60 lines of code trying to get this to work. You do it in ONE! Outstanding, and it just goes to show how much further I have to go with R. Cheers.

akrun Over a year ago

@BarneyC Glad to help you. It is only based on experience.

BarneyC Over a year ago

May I ask what I would change if there was another column before "Topics"? I guess I'm asking what each of the -1 indexes means and what I would set them to. Cheers

akrun Over a year ago

@BarneyC Here I used -1 to select all the columns except the first one, which here is Topics. If you had another column before Topics, you would drop the first two columns instead: use df[-c(1,2)] on the left-hand side, names(df)[-(1:2)] for the column names, and logical(ncol(df)-2) for the vapply template.
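
One way to avoid counting leading columns at all is to pick the flag columns by name instead of by position. This is only a sketch of that alternative: the ID column is hypothetical, added here to stand in for "another column before Topics", and only a few of the flag columns are included to keep it short.

```r
# Sample data with a hypothetical extra column (ID) before Topics
df <- data.frame(
  ID = 1:2,
  Topics = c("E11,E31,E313,ECAT", "E1,E20"),
  E1 = FALSE, E11 = FALSE, E20 = FALSE, E31 = FALSE,
  E300 = FALSE, E313 = FALSE, ECAT = FALSE,
  stringsAsFactors = FALSE
)

# Flag columns are "everything that is not ID or Topics"
topic_cols <- setdiff(names(df), c("ID", "Topics"))

topic_list <- strsplit(as.character(df$Topics), ",", fixed = TRUE)

# Same membership test as before, but assigned by column name
df[topic_cols] <- t(vapply(topic_list,
                           function(x) topic_cols %in% x,
                           logical(length(topic_cols))))
df
```

With this form nothing needs changing if more non-flag columns are added later, as long as their names are listed in the setdiff() call.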

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer · 2015-03-02 17:49:30Z

Here's almost a step-by-step approach to the logic you describe:

## make note of the column names
Colnames <- names(df[-1])

## Create an empty FALSE matrix to modify later
Mat <- matrix(FALSE, nrow = nrow(df), 
              ncol = length(Colnames), 
              dimnames = list(NULL, Colnames))

## Use strsplit to split the "Topics" column
L <- strsplit(as.character(df[[1]]), ",", fixed = TRUE)

## Figure out which values match with which columns
## I'm using matrix indexing here to set those values to TRUE
Mat[cbind(rep(seq_along(L), vapply(L, length, 1L)),
          match(unlist(L), Colnames))] <- TRUE

## Replacement in the original dataset
df[-1] <- Mat
df
#              Topics    E1   E11   E20   E30   E31  E100  E300  E313  ECAT
# 1 E11,E31,E313,ECAT FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE
# 2            E1,E20  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
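
The matrix-indexing step is probably the least familiar part, so here is a toy sketch of just that piece. The list L and the column names a, b, c are made up for illustration; lengths(L) is used as shorthand for vapply(L, length, 1L) and needs R 3.2.0 or later.

```r
# Toy input: row 1 mentions "b" and "c", row 2 mentions "a"
L <- list(c("b", "c"), "a")
cols <- c("a", "b", "c")

# Row number for each individual topic: 1 1 2
rows <- rep(seq_along(L), lengths(L))

# Two-column index matrix of (row, column) pairs: (1,2), (1,3), (2,1)
idx <- cbind(rows, match(unlist(L), cols))

# Start from all FALSE and flip only the cells named in idx
m <- matrix(FALSE, nrow = length(L), ncol = length(cols),
            dimnames = list(NULL, cols))
m[idx] <- TRUE
m
```

Indexing a matrix with a two-column matrix treats each row of the index as one (row, column) cell, which is why all the matches can be set in a single assignment with no loop.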

If you were just starting with the "Topics" column, here are a few variations you can consider:

mtabulate from "qdapTools"

> library(qdapTools)
> mtabulate(strsplit(as.character(df$Topics), ",", TRUE))
  E1 E11 E20 E31 E313 ECAT
1  0   1   0   1    1    1
2  1   0   1   0    0    0

cSplit_e from my "splitstackshape" package

library(splitstackshape)
cSplit_e(df[1], "Topics", ",", type = "character", fill = 0)
#              Topics Topics_E1 Topics_E11 Topics_E20 Topics_E31 Topics_E313 Topics_ECAT
# 1 E11,E31,E313,ECAT         0          1          0          1           1           1
# 2            E1,E20         1          0          1          0           0           0

Both would require a little bit of extra work to make sure that all of the columns you expect to have are included (and to convert from 1 and 0 to TRUE and FALSE).
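
As a rough idea of what that extra work might look like, here is a base-R sketch. The counts data frame is typed in by hand to mirror the mtabulate output shown above, so the example runs without either package; expected, missing_cols and flags are illustrative names.

```r
# Counts in the shape shown above for mtabulate(), entered by hand
counts <- data.frame(E1 = c(0, 1), E11 = c(1, 0), E20 = c(0, 1),
                     E31 = c(1, 0), E313 = c(1, 0), ECAT = c(1, 0))

# The full set of columns from the question, in the original order
expected <- c("E1", "E11", "E20", "E30", "E31", "E100", "E300", "E313", "ECAT")

# Add any expected column that never occurred, filled with 0
missing_cols <- setdiff(expected, names(counts))
counts[missing_cols] <- 0

# Reorder to the expected layout and convert 1/0 to TRUE/FALSE
# (comparing a data frame with > returns a logical matrix)
flags <- counts[expected] > 0
flags
```

For cSplit_e the same approach should apply after stripping the "Topics_" prefix from the column names.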

That walkthrough is great and is far more in line with how I was thinking it could be approached. Thanks for that.
