4

I have a data set that contains some logical columns and would like to replace values that are 'TRUE' with the corresponding column name. I asked a similar question here and was able to identify an appropriate solution with the help of some suggestions from other S/O users. However, the solution does not use data.table syntax and copies the whole dataset instead of replacing by reference, which is time consuming.

What is the most appropriate way to do this using data.table syntax?

I tried this:

# Load library    
library(data.table)

# Create dummy data.table:
mydt <- data.table(id = c(1,2,3,4,5), 
                   ptname = c("jack", "jill", "jo", "frankie", "claire"), 
                   sex = c("m", "f", "f", "m", "f"), apple = c(T,F,F,T,T), 
                   orange = c(F,T,F,T,F), 
                   pear = c(T,T,T,T,F))

# View dummy data:
> mydt
   id  ptname sex apple orange  pear
1:  1    jack   m  TRUE  FALSE  TRUE
2:  2    jill   f FALSE   TRUE  TRUE
3:  3      jo   f FALSE  FALSE  TRUE
4:  4 frankie   m  TRUE   TRUE  TRUE
5:  5  claire   f  TRUE  FALSE FALSE

# Function to recode values in a data.table:
recode.multi <- function(datacol, oldval, newval) {
  trans <- setNames(newval, oldval)
  trans[ match(datacol, names(trans)) ]
}

# Get a list of all the logical columns in the data set:
logicalcols <- names(which(mydt[, sapply(mydt, is.logical)] == TRUE))

# Apply the function to convert 'TRUE' to the relevant column names:
mydt[, (logicalcols) := lapply(.SD, recode.multi, 
                               oldval = c(FALSE, TRUE), 
                               newval = c("FALSE", names(.SD))), .SDcols = logicalcols]

# View the result:
> mydt
   id  ptname sex apple orange  pear
1:  1    jack   m apple  FALSE apple
2:  2    jill   f FALSE  apple apple
3:  3      jo   f FALSE  FALSE apple
4:  4 frankie   m apple  apple apple
5:  5  claire   f apple  FALSE FALSE

This isn't correct as instead of iterating through each column name for the replacement values, it just recycles the first one ("apple" in this case).

Moreover, if I reverse the order of old and new values, the function ignores my character string replacement for the second value and uses the first two column names as replacements in all cases:

# Apply the function with order of old and new values reversed:
mydt[, (logicalcols) := lapply(.SD, recode.multi, 
                               oldval = c(TRUE, FALSE), 
                               newval = c(names(.SD), "FALSE")), .SDcols = logicalcols]

# View the result:
> mydt
   id  ptname sex  apple orange   pear
1:  1    jack   m  apple orange  apple
2:  2    jill   f orange  apple  apple
3:  3      jo   f orange orange  apple
4:  4 frankie   m  apple  apple  apple
5:  5  claire   f  apple orange orange

I'm sure I'm probably missing something simple but does anyone know why the function does not iterate through the column names (and how to edit it to do this)?

My expected output would be as follows:

> mydt
   id  ptname sex apple orange  pear
1:  1    jack   m apple  FALSE  pear
2:  2    jill   f FALSE orange  pear
3:  3      jo   f FALSE  FALSE  pear
4:  4 frankie   m apple orange  pear
5:  5  claire   f apple  FALSE FALSE

Alternatively any other suggestions of concise data.table syntax to achieve this would be much appreciated.

2
  • 2
    Using chars instead of logical will be painful in any later analysis I guess. Regarding why your way doesn't work, lapply iterates on one thing at a time (.SD here). If you need it to iterate over .SD and names(.SD), try Map. Commented Apr 24, 2017 at 17:20
  • Thanks - couldn't find a syntax example for 'Map' but the help says it is a wrapper for mapply - doing this mydt[, (logicalcols) := mapply(recode.multi, datacol = .SD, oldval = c(TRUE, FALSE), newval = c(names(.SD), "FALSE"), SIMPLIFY = FALSE), .SDcols = logicalcols] almost gets me there except that the FALSE values are converted to NAs. Commented Apr 24, 2017 at 17:50

1 Answer 1

3

We can use a melt/dcast approach

dcast(melt(mydt, id.var = c("id", "ptname", "sex"))[,
     value1 := as.character(value)][(value), value1 := variable], 
            id + ptname + sex~variable, value.var = "value1")
#   id  ptname sex apple orange  pear
#1:  1    jack   m apple  FALSE  pear
#2:  2    jill   f FALSE orange  pear
#3:  3      jo   f FALSE  FALSE  pear
#4:  4 frankie   m apple orange  pear
#5:  5  claire   f apple  FALSE FALSE

Or another option is with set which would be more efficient

nm1 <- which(unlist(mydt[, lapply(.SD, class)])=="logical")
for(j in nm1){
    i1 <- which(mydt[[j]])
    set(mydt, i=NULL, j=j, value = as.character(mydt[[j]]))
    set(mydt, i = i1, j=j, value = names(mydt)[j])
}

mydt
#   id  ptname sex apple orange  pear
#1:  1    jack   m apple  FALSE  pear
#2:  2    jill   f FALSE orange  pear
#3:  3      jo   f FALSE  FALSE  pear
#4:  4 frankie   m apple orange  pear
#5:  5  claire   f apple  FALSE FALSE

Or another option mentioned in the comments is

mydt[, (nm1) := Map(function(x,y) replace(x, x, y), .SD, names(mydt)[nm1]), .SDcols = nm1]
mydt
#   id  ptname sex apple orange  pear
#1:  1    jack   m apple  FALSE  pear
#2:  2    jill   f FALSE orange  pear
#3:  3      jo   f FALSE  FALSE  pear
#4:  4 frankie   m apple orange  pear
#5:  5  claire   f apple  FALSE FALSE

UPDATE: Comparing options two and three (one is not possible due to the number of non-logical columns) with a data set comprising 18573 rows and 650 columns, of which 252 columns are logical runs with the following timings:

# Option 2:
  nm1 <- which(unlist(mydt[, lapply(.SD, is.logical)])) 
  system.time( 
   for(j in nm1){ 
     i1 <- which(mydt[[j]]) 
     set(mydt, i=NULL, j=j, value = as.character(mydt[[j]])) 
     set(mydt, i = i1, j=j, value = names(mydt)[j]) 
     } 
   ) 
 # user system elapsed 
 #  0.61 0.00 0.61

# Option 3:
system.time( 
  mydt[, (nm1) := Map(function(x,y) replace(x, x, y), .SD, names(mydt)[nm1]), .SDcols = nm1] 

   ) 
#user system elapsed 
#0.65 0.00 0.66

Both are significantly faster than the original approach not using data.table syntax:

# Original approach:
logitrue <- which(mydt == TRUE, arr.ind = T)
 system.time(
   mydt[logitrue, ] <- colnames(mydt)[logitrue[,2]]
 )
  # user  system elapsed 
  # 1.22    0.03    4.22 
Sign up to request clarification or add additional context in comments.

9 Comments

Thanks, but am after something that would be applicable when I don't know where in my data set the columns to change are (but can identify which ones to change by creating a list of all the logical columns, have edited my post to include this). Is it possible to use .SD in melt/dcast? Also in the recode.multi function above the transformation from logic to character is already taken care of, is it faster to use set for this?
@AmyM You have to change the logical to character before recoding it to character values
@AmyM There should be some pattern or something in column names to specify. If there are no patterns, how do you know which columns to change?
Curiously applying the recode.multi function above does not result in a warning about the replacement values not matching the column type - I think it is automatically converting them to character?
@AmyM Updated the post based on the value of logical column
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.