Replace values of data.table with values from another data.table

Question

I have a dataset with 300 columns and 1000 rows and a corresponding code book in data.table format. For simplicity I am going to give 3 columns for both.

dt <- data.table(id = 1:10,
                 a  = sample(c(1,2,3),10, replace = T),
                 b  = sample(c(1,2)  ,10, replace = T),
                 c  = sample(c(1:5)  ,10, replace = T))

    id a b c
 1:  1 2 1 2
 2:  2 2 1 1
 3:  3 3 1 1
 4:  4 3 1 1
 5:  5 1 2 5
 6:  6 2 1 3
 7:  7 1 2 3
 8:  8 1 1 2
 9:  9 2 1 5
10: 10 3 2 4

cb <- data.table(var = c(rep("a", 3), rep("b", 2), rep("c", 5)),
                 val = c(1,2,3,1,2,1,2,3,4,5),
                 des = c("red", "blue", "yellow", "yes","no","K", "Na","Ag","Au","Si"))

    var val    des
 1:   a   1    red
 2:   a   2   blue
 3:   a   3 yellow
 4:   b   1    yes
 5:   b   2     no
 6:   c   1      K
 7:   c   2     Na
 8:   c   3     Ag
 9:   c   4     Au
10:   c   5     Si

In cb, var names the corresponding column in dt, val is a value that occurs in that column, and des is the description of that value. I want to edit dt by replacing each value with the matching des from cb. It should look like

    id      a   b  c
 1:  1    red yes Na
 2:  2 yellow  no Ag
 3:  3   blue yes Ag
 4:  4    red yes Au
 5:  5   blue yes Ag
 6:  6   blue  no Au
 7:  7 yellow yes Si
 8:  8   blue  no Ag
 9:  9    red  no  K
10: 10 yellow  no Ag

How do I perform an operation like this efficiently, in a way that doesn't make my computer sound like it has a built-in piston?

The reason is that I have pre-written code to analyze the data, and it needs the actual values in order to run. This may also prove useful in general, because I am often given data along with a code book, though usually there aren't this many variables.

I feel like your first row in the final output should be 'a' = 'blue', correct?
– bshelt141, Commented May 10, 2017 at 22:16
Please use set.seed when using sample, so the results are repeatable.
– Andrew Lavers, Commented May 10, 2017 at 22:19

lukeA · Accepted Answer · 2017-05-10 22:15:09Z

3

You could try

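library(data.table)

# Melt dt to long form (id, var, val), join the code book on (var, val),
# then cast back to wide form with the descriptions as cell values.
dcast(melt(dt, id.vars = "id", variable.name = "var", value.name = "val")[cb, on = c("var", "val")],
      id ~ var, value.var = "des")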
#     id      a   b  c
#  1:  1    red yes  K
#  2:  2 yellow  no Si
#  3:  3    red yes Si
#  4:  4    red  no Au
#  5:  5    red  no Ag
#  6:  6   blue yes  K
#  7:  7   blue  no Si
#  8:  8 yellow yes Na
#  9:  9   blue yes Ag
# 10: 10 yellow yes Si
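
For anyone who wants to reproduce this, here is a minimal sketch of the same melt/join/dcast idea with a fixed seed, as suggested in the comments on the question. It is not the answer's exact code: it joins in the other direction, cb[long, ...], so that every row of dt is kept even when some code book entry never occurs in the data. The names long and res are illustrative only.

```r
library(data.table)

set.seed(1)  # make sample() repeatable
dt <- data.table(id = 1:10,
                 a  = sample(c(1, 2, 3), 10, replace = TRUE),
                 b  = sample(c(1, 2),    10, replace = TRUE),
                 c  = sample(1:5,        10, replace = TRUE))
cb <- data.table(var = c(rep("a", 3), rep("b", 2), rep("c", 5)),
                 val = c(1, 2, 3, 1, 2, 1, 2, 3, 4, 5),
                 des = c("red", "blue", "yellow", "yes", "no",
                         "K", "Na", "Ag", "Au", "Si"))

# Long form: one row per (id, var, val). var is kept as character to match
# cb$var. melt() may warn that the integer column c is coerced to double to
# match a and b; that is harmless here.
long <- melt(dt, id.vars = "id", variable.name = "var", value.name = "val",
             variable.factor = FALSE)

# cb[long, ...] returns one row per row of long, with des looked up from cb
# (NA where a value has no code book entry).
res <- dcast(cb[long, on = c("var", "val")], id ~ var, value.var = "des")
```

If a value in dt has no entry in cb, the corresponding cell in res comes out as NA rather than the row being dropped.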

3 Comments

Andrew Lavers Over a year ago

This is the more robust solution because it works for any number of columns.

akash87 Over a year ago

By any chance do you know how long this would take with 1M obs?

lukeA Over a year ago

@akash87 Why not try it out? 300 variables with 100,000 observations took about 20 seconds on my rather old PC:

cols <- paste0("V", 1:300)
cb <- setDT(expand.grid(var = cols, val = 1:3))[, des := sample(LETTERS, .N, TRUE)]
dt <- as.data.table(replicate(length(cols), sample(1:3, 100000, TRUE)))[, id := 1:.N]
system.time(
  dcast(melt(dt, id = "id", var = "var", val = "val")[cb, on = c("var", "val"), allow.cartesian = TRUE][!is.na(id)],
        id ~ var, value.var = "des")
)

That is, if I did not make a mistake.

Mike H. · Accepted Answer · 2017-05-10 22:18:18Z

3

Another option would be to do multiple merge + updates:

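# Reshape the code book to wide form: one row per des, one column per variable
cb_dc <- data.table::dcast(cb, des ~ var, value.var = "val")
cols <- c("a", "b", "c")
# For each column, join dt to cb_dc on that column and overwrite it with des
dt[, (cols) := lapply(cols, function(x) cb_dc[dt, des, on = x])]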

 #  id      a   b  c
 #1:  1    red yes Si
 #2:  2   blue yes Na
 #3:  3   blue  no Au
 #4:  4 yellow yes  K
 #5:  5    red  no Na
 #6:  6 yellow yes Na
 #7:  7 yellow  no  K
 #8:  8   blue  no Na
 #9:  9   blue yes Si
#10: 10    red  no Na

Data:

set.seed(1)
dt <- data.table(id = 1:10,
                 a  = sample(c(1,2,3),10, replace = TRUE),
                 b  = sample(c(1,2)  ,10, replace = TRUE),
                 c  = sample(c(1:5)  ,10, replace = TRUE))
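
With 300 columns, listing cols by hand gets tedious. Here is a minimal sketch of one way around that, assuming every column to recode is named in cb$var and each (var, val) pair appears only once in cb. It derives cols from the code book and looks each column up in its own slice of cb, so the dcast step is skipped. This is a variation on the approach above, not the original code, and the loop variable v is just an illustrative name.

```r
library(data.table)

set.seed(1)
dt <- data.table(id = 1:10,
                 a  = sample(c(1, 2, 3), 10, replace = TRUE),
                 b  = sample(c(1, 2),    10, replace = TRUE),
                 c  = sample(1:5,        10, replace = TRUE))
cb <- data.table(var = c(rep("a", 3), rep("b", 2), rep("c", 5)),
                 val = c(1, 2, 3, 1, 2, 1, 2, 3, 4, 5),
                 des = c("red", "blue", "yellow", "yes", "no",
                         "K", "Na", "Ag", "Au", "Si"))

# Columns to recode: those present in both dt and the code book
cols <- intersect(names(dt), unique(cb$var))

for (v in cols) {
  # Take the code book rows for this variable and join them to dt,
  # matching cb's val against dt's column v. The join yields one des
  # per row of dt, in dt's row order (NA where nothing matches).
  set(dt, j = v, value = cb[var == v][dt, on = c(val = v), des])
}
```

Because set() modifies dt by reference, take a copy() first if the numeric codes are still needed afterwards.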

Andrew Lavers · Accepted Answer · 2017-05-10 22:25:35Z

1

This dplyr answer essentially joins dt with a subset of the code book once for each of the three columns.

library(dplyr)

dt %>% 
  left_join(cb %>% filter(var == "a"), by=c("a" = "val")) %>% 
  left_join(cb %>% filter(var == "b"), by=c("b" = "val")) %>% 
  left_join(cb %>% filter(var == "c"), by=c("c" = "val")) %>%
  select(id, des.x, des.y, des) %>%
  rename(a = des.x, b = des.y, c = des)
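
The three joins above are written out by hand, which does not scale to 300 columns. One possible generalization, sketched here under the assumption that tidyr (1.0.0 or later) is installed, is to reshape to long form, join the code book once, and reshape back. It uses pivot_longer() and pivot_wider(), which are not part of the original answer, and the data are built as plain data frames to keep the sketch independent of data.table.

```r
library(dplyr)
library(tidyr)

set.seed(1)
dt <- data.frame(id = 1:10,
                 a  = sample(c(1, 2, 3), 10, replace = TRUE),
                 b  = sample(c(1, 2),    10, replace = TRUE),
                 c  = sample(1:5,        10, replace = TRUE))
cb <- data.frame(var = c(rep("a", 3), rep("b", 2), rep("c", 5)),
                 val = c(1, 2, 3, 1, 2, 1, 2, 3, 4, 5),
                 des = c("red", "blue", "yellow", "yes", "no",
                         "K", "Na", "Ag", "Au", "Si"),
                 stringsAsFactors = FALSE)

res <- dt %>%
  # one row per (id, var, val); every column except id is recoded
  pivot_longer(-id, names_to = "var", values_to = "val") %>%
  # a single join against the whole code book
  left_join(cb, by = c("var", "val")) %>%
  select(-val) %>%
  # back to one row per id, with descriptions as cell values
  pivot_wider(names_from = var, values_from = des)
```

The result has the same id, a, b, c layout as the hand-written version, without naming any column other than id.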
