
Afternoon clever people.

I have a decent-sized data set (>800k rows), and as an example I have pulled out a tiny sample of 10 columns by 2 rows. At the outset only the "Topics" column is populated, with a comma-separated string of topic codes; all other columns are set to FALSE.

This will recreate the data as it sits currently...

  Topics <- c("E11,E31,E313,ECAT" , "E1,E20") 
  E1     <- c(FALSE, FALSE)
  E11    <- c(FALSE, FALSE)
  E20    <- c(FALSE, FALSE)
  E30    <- c(FALSE, FALSE)
  E31    <- c(FALSE, FALSE)
  E100   <- c(FALSE, FALSE)
  E300   <- c(FALSE, FALSE)
  E313   <- c(FALSE, FALSE)
  ECAT   <- c(FALSE, FALSE)
  df     <- data.frame(Topics,E1,E11,E20,E30,E31,E100,E300,E313,ECAT)

Which will give something like...

Topics              E1    E11   E20   E30   E31   E100  E300  E313  ECAT
E11,E31,E313,ECAT   FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
E1,E20              FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

I want to set the relevant row,column to TRUE where there is a match for each of the items in the topic vector. So it should look something like...

Topics              E1    E11   E20   E30   E31   E100  E300  E313  ECAT
E11,E31,E313,ECAT   FALSE TRUE  FALSE FALSE TRUE  FALSE TRUE  FALSE TRUE
E1,E20              TRUE  FALSE TRUE  FALSE FALSE FALSE FALSE FALSE FALSE

So far I have failed UTTERLY to work this one out but I suspect it is something like:

  • split the topic into a vector using strsplit
  • for each item in vector try to match to names(df)
  • when matched set row,column == TRUE

BUT I have tried all sorts and cannot fathom the logic. Can anyone break this down for me please?

Comment (Mar 2, 2015 at 17:26): In the expected result, E313 should be TRUE instead of E300.

2 Answers


Try

df[-1] <-  t(vapply(strsplit(as.character(df$Topics), ','),
                 function(x) names(df)[-1] %in% x, logical(ncol(df)-1)))
df
#             Topics    E1   E11   E20   E30   E31  E100  E300  E313  ECAT
#1 E11,E31,E313,ECAT FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE
#2            E1,E20  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE

Or

 df[-1] <- t(vapply(strsplit(as.character(df$Topics), ","), function(x)
         !!table(factor(x, levels=names(df)[-1])), logical(ncol(df)-1)))
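
To unpack what the first one-liner is doing, here is a minimal sketch that spells out each intermediate step on the two-row sample from the question. The intermediate names (`target`, `parts`, `hits`) are not from the answer; they are added purely for illustration.

```r
# Rebuild the two-row sample from the question (FALSE is recycled to both rows)
df <- data.frame(
  Topics = c("E11,E31,E313,ECAT", "E1,E20"),
  E1 = FALSE, E11 = FALSE, E20 = FALSE, E30 = FALSE, E31 = FALSE,
  E100 = FALSE, E300 = FALSE, E313 = FALSE, ECAT = FALSE,
  stringsAsFactors = FALSE
)

# Every column name except the first one (Topics)
target <- names(df)[-1]

# One character vector of topic codes per row
parts <- strsplit(as.character(df$Topics), ",", fixed = TRUE)

# For each row, test every target column name for membership in that row's codes.
# vapply() returns one matrix column per row of df, hence the t() below.
hits <- vapply(parts, function(x) target %in% x, logical(length(target)))

# Write the transposed logical matrix back over the flag columns
df[-1] <- t(hits)
```

Because `%in%` compares whole strings, "E1" in a row's topics does not switch on `E11` or `E100`, which is a risk with pattern-based approaches such as `grepl`.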

4 Comments

Epic. So I spent a whole day and probably put down 50 - 60 lines of code trying to get this to work. You do it in ONE! Outstanding, and it just goes to show how much further I have to go with R. Cheers.
@BarneyC Glad to help you. It is only based on experience.
May I ask what would I change if there was another column before "Topic" ? I guess I'm asking what each of the -1 indexes means and to what would I set them. Cheers
@BarneyC Here I used -1 to select all the columns except the first one, which is Topics. If you had another column before Topics, the index would become df[-c(1,2)], and you would also change the other two pieces to names(df)[-(1:2)] and logical(ncol(df)-2).
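
One way to avoid counting leading columns by hand is to pick the flag columns by name rather than by position. The following is only a sketch and not part of the answer above; the extra `ID` column and the `flag_cols` name are made up for illustration.

```r
df <- data.frame(
  ID = 1:2,                                   # hypothetical extra leading column
  Topics = c("E11,E31,E313,ECAT", "E1,E20"),
  E1 = FALSE, E11 = FALSE, E20 = FALSE, E30 = FALSE, E31 = FALSE,
  E100 = FALSE, E300 = FALSE, E313 = FALSE, ECAT = FALSE,
  stringsAsFactors = FALSE
)

# Everything that is not an identifier or the Topics string is a flag column
flag_cols <- setdiff(names(df), c("ID", "Topics"))

# Same idea as the answer, but indexed by column name instead of -1 / -(1:2)
df[flag_cols] <- t(vapply(
  strsplit(as.character(df$Topics), ",", fixed = TRUE),
  function(x) flag_cols %in% x,
  logical(length(flag_cols))
))
```

With this form, adding or removing non-flag columns only means editing the vector passed to `setdiff()`.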

Here's almost a step-by-step approach to the logic you describe:

## make note of the column names
Colnames <- names(df[-1])

## Create an empty FALSE matrix to modify later
Mat <- matrix(FALSE, nrow = nrow(df), 
              ncol = length(Colnames), 
              dimnames = list(NULL, Colnames))

## Use strsplit to split the "Topics" column
L <- strsplit(as.character(df[[1]]), ",", fixed = TRUE)

## Figure out which values match with which columns
## I'm using matrix indexing here to set those values to TRUE
Mat[cbind(rep(seq_along(L), vapply(L, length, 1L)),
          match(unlist(L), Colnames))] <- TRUE

## Replacement in the original dataset
df[-1] <- Mat
df
#              Topics    E1   E11   E20   E30   E31  E100  E300  E313  ECAT
# 1 E11,E31,E313,ECAT FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE
# 2            E1,E20  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE

If you were just starting with the "Topics" column, here are a few variations you can consider:

  1. mtabulate from "qdapTools"

    library(qdapTools)
    mtabulate(strsplit(as.character(df$Topics), ",", fixed = TRUE))
    #   E1 E11 E20 E31 E313 ECAT
    # 1  0   1   0   1    1    1
    # 2  1   0   1   0    0    0
    
  2. cSplit_e from my "splitstackshape" package

    library(splitstackshape)
    cSplit_e(df[1], "Topics", ",", type = "character", fill = 0)
    #              Topics Topics_E1 Topics_E11 Topics_E20 Topics_E31 Topics_E313 Topics_ECAT
    # 1 E11,E31,E313,ECAT         0          1          0          1           1           1
    # 2            E1,E20         1          0          1          0           0           0
    

Both would require a little bit of extra work to make sure that all of the columns you expect to have are included (and to convert from 1 and 0 to TRUE and FALSE).
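
As a rough illustration of that extra work, the sketch below pads a 0/1 table with any missing columns and converts it to TRUE/FALSE. To keep it self-contained, `tab` is built by hand to mimic the 0/1 output shown above, and `expected` is the column list from the question; both names are assumptions made for this example.

```r
# Hand-built stand-in for the 0/1 table (only topics that actually occur)
tab <- data.frame(E1 = c(0, 1), E11 = c(1, 0), E20 = c(0, 1),
                  E31 = c(1, 0), E313 = c(1, 0), ECAT = c(1, 0))

# Full set of columns wanted in the final result, in the desired order
expected <- c("E1", "E11", "E20", "E30", "E31", "E100", "E300", "E313", "ECAT")

# Start from an all-FALSE matrix so absent topics are already covered
out <- matrix(FALSE, nrow = nrow(tab), ncol = length(expected),
              dimnames = list(NULL, expected))

# Fill in only the columns that are present, converting 1/0 to TRUE/FALSE
present <- intersect(expected, names(tab))
out[, present] <- as.matrix(tab[present]) > 0

out <- as.data.frame(out)
```

For the `cSplit_e` output, the `Topics_` prefix would need stripping from the column names first (for example with `sub()`) before matching against `expected`.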

1 Comment

That walkthrough is great and is far more in line with how I was thinking it could be approached. Thanks for that.
