I have two data frames:

  • A lookup table, lookup, with columns varName (variable name), key, and value.
  • A data frame, df, whose columns are named exactly after the values in varName and whose cells hold keys from lookup. This data frame is much bigger than the lookup table (e.g. 1e6 rows).

I would like to recode the data in df by appending a new column for every variable, in which each key in df is replaced by the corresponding value from the lookup data frame. Note that the keys are of type double.

Sample data:

# Generate sample data
lookup <- data.frame(
  varName = rep(LETTERS[1:3], each = 3),
  key     = runif(9),
  value   = runif(9)
  )

df <- expand.grid(
  A = lookup[lookup$varName == 'A', 'key'],
  B = lookup[lookup$varName == 'B', 'key'],
  C = lookup[lookup$varName == 'C', 'key']
  )

My current solution temporarily renames the key column and uses join from plyr:

require(plyr)

for (varName in unique(lookup$varName)) {
  tmpLookup <- rename(lookup, replace = c(key = varName))
  df[paste0(varName, '_value')] <- join(df[varName], tmpLookup[c(varName, 'value')], 
                                        by = varName)['value']  
}

df

Questions:

  • Is this safe? I cannot find any information on whether join always matches keys of type double correctly.
  • Is there a safer and faster way to accomplish the same thing?
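
To make the first question concrete, here is a small base R sketch of the concern. The vectors keys, copied, and computed are made up for illustration, and the sketch only shows how exact matching on doubles behaves in base R; it does not confirm how plyr's join compares keys internally.

```r
# Hypothetical keys, for illustration only
keys <- c(0.1, 0.2, 0.3)

# Keys copied straight from the lookup vector are bit-identical, so they match
copied <- keys[c(3, 1)]
match(copied, keys)

# A key recomputed by arithmetic can differ in the last bits and fail to match
computed <- 0.1 + 0.2
computed == 0.3
match(computed, keys)

# A tolerance-based comparison treats the two numbers as equal
isTRUE(all.equal(computed, 0.3))

# One possible workaround: match on keys rounded to a fixed number of digits
match(round(computed, 10), round(keys, 10))
```

In the sample data above, df is built directly from lookup$key, so its keys are exact copies. The risk applies mainly when the keys in df are computed or read separately from those in lookup.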
1 Answer

You could try data.table. I used set.seed(20) when creating "df" (for reproducibility). Rather than working on the "wide" format, I reshape "df" to "long" with melt, convert it to a "data.table" (as.data.table), set the key columns (setkey(..)), join the "lookup" dataset, convert the result back to "wide" format with dcast.data.table, and finally join it to the original dataset so that both the new and the old columns are present. This could also be done with a for loop, without reshaping.

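library(data.table)
library(reshape2)
# Reshape df to long format: Var1 = row index, Var2 = variable name, value = key
DT <- as.data.table(melt(as.matrix(df)))
# Join lookup on (variable name, key), then reshape back to wide;
# i.value holds the looked-up value
DT1 <- dcast.data.table(setkey(DT, Var2,
           value)[lookup], Var1~Var2, value.var='i.value')
# Add a row index to df, join the looked-up columns, then drop the index
DT2 <- setkey(setDT(df)[,Var1:=1:.N], Var1)[DT1][,Var1:=NULL]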

head(DT2,2)
#          A         B          C       i.A         i.B       i.C
#1: 0.8775214 0.5291637 0.09133259 0.3700745 0.001927939 0.4520996
#2: 0.7685332 0.5291637 0.09133259 0.7155276 0.001927939 0.4520996
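
The for-loop variant mentioned above could look roughly like the following base R sketch. It uses match rather than data.table, so it is a swapped-in technique and not the code that produced the output shown. The sample data is regenerated here so the block is self-contained; the seed and the "_value" suffix follow the question, while the names v and sub are just illustrative.

```r
# Sample data in the same shape as the question (seed chosen for reproducibility)
set.seed(20)
lookup <- data.frame(
  varName = rep(LETTERS[1:3], each = 3),
  key     = runif(9),
  value   = runif(9),
  stringsAsFactors = FALSE
)
df <- expand.grid(
  A = lookup$key[lookup$varName == "A"],
  B = lookup$key[lookup$varName == "B"],
  C = lookup$key[lookup$varName == "C"]
)

# For each variable, locate every key of df within that variable's slice of
# the lookup table and pull the corresponding value
for (v in unique(lookup$varName)) {
  sub <- lookup[lookup$varName == v, ]
  df[[paste0(v, "_value")]] <- sub$value[match(df[[v]], sub$key)]
}

head(df, 2)
```

match compares doubles exactly, so any key in df that is not an exact copy of a key in lookup yields NA in the new column. That makes mismatches easy to detect with anyNA.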

2 Comments

The second line (DT1) fails with error: Error in setkeyv(x, cols, verbose = verbose, physical = physical) : some columns are not in the data.table: Var2.
@TomasGreif I am not able to reproduce the error using data.table_1.9.5
