Chaining multiple data.table::merge operations with data.tables

Question

Is it possible to chain multiple merge operations one after another with data.tables?

The functionality would be similar to joining multiple data.frames in a dplyr pipe but would be used for data.tables in a similar chained fashion as merging two data.tables in the below and then manipulating the data.table as required. But only you would be then able to merge another data.table. I am acknowledging this SO question here may be very similar, that is after @chinsoon12 posted the comment.

Thanks for any help!

library(dplyr)
library(data.table)

# data.frame
df1 = data.frame(food = c("apples", "bananas", "carrots", "dates"),
                 quantity = c(1:4))

df2 = data.frame(food = c("apples", "bananas", "carrots", "dates"),
                 status = c("good", "bad", "rotten", "raw"))

df3 = data.frame(food = c("apples", "bananas", "carrots", "dates"),
                 rank = c("okay", "good", "better", "best"))

df4 = left_join(df1,
                df2,
                by = "food") %>% 
  mutate(new_col = NA) %>%  # this is just to hold a position of mutation in the data.frame
  left_join(.,
            df3,
            by = "food")



# data.table
dt1 = data.table(food = c("apples", "bananas", "carrots", "dates"),
                 quantity = c(1:4))

dt2 = data.table(food = c("apples", "bananas", "carrots", "dates"),
                 status = c("good", "bad", "rotten", "raw"))

dt3 = data.table(food = c("apples", "bananas", "carrots", "dates"),
                 rank = c("okay", "good", "better", "best"))

# this is what I am not sure how to implement
dt4 = merge(dt1,
            dt2,
            by = "food")[
              food == "apples"](merge(dt4))

you can search for Reduce(function(x, y) merge(x, y), list(dt2,dt3,dt4)) — chinsoon12
– chinsoon12, Commented Sep 20, 2018 at 1:13
You might have to give a more concrete example. Guidance is here, if you're interested: stackoverflow.com/questions/5963269/… Your pseudocode in the second chunk looks fine to me as long as the commas and parentheses are balanced and in the right places. — Frank
– Frank, Commented Sep 20, 2018 at 1:16
You are right, I thought it was more of a general question but it would be easier to provide a MWE. Let me edit to provide one. — Prevost
– Prevost, Commented Sep 20, 2018 at 1:24

Chris Holbrook · Accepted Answer · 2018-09-20 02:26:50Z

6

Multiple data.table joins with the on argument can be chained. Note that without an update operator (":=") in j, this would be a right join, but with ":=" (i.e., adding columns), this becomes a left outer join. A useful post on left joins here Left join using data.table.

Example using example data above with a subset between joins:

dt4 <- dt1[dt2, on="food", `:=`(status = i.status)][
            food == "apples"][dt3, on="food", rank := i.rank]

##> dt4
## food quantity status rank
##1: apples        1   good okay

Example adding new column between joins

dt4 <- dt1[dt2, on="food", `:=`(status = i.status)][
            , new_col := NA][dt3, on="food", rank := i.rank]

##> dt4
##      food quantity status new_col   rank
##1:  apples        1   good      NA   okay
##2: bananas        2    bad      NA   good
##3: carrots        3 rotten      NA better
##4:   dates        4    raw      NA   best

Example using merge and magrittr pipes:

dt4 <-  merge(dt1, dt2, by = "food") %>%
           set( , "new_col", NA) %>% 
             merge(dt3, by = "food")

##> dt4
##      food quantity status new_col   rank
##1:  apples        1   good      NA   okay
##2: bananas        2    bad      NA   good
##3: carrots        3 rotten      NA better
##4:   dates        4    raw      NA   best

edited Sep 20, 2018 at 2:26

answered Sep 20, 2018 at 1:46

Chris Holbrook

2,6461 gold badge21 silver badges34 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Prevost Over a year ago

This may sound like a dumb question, but what is the i. in front of i.status and i.rank for?

Chris Holbrook Over a year ago

it's a good question and in hindsight probably not needed here. In joins of the form x[i] it can be used to specify that the column is taken from the i (right) table. similarly, you can specify from the x (left) table using something like x.name. BUT, in hindsight this may only be useful (needed?) in non-equi joins (see the accepted answer to this stackoverflow.com/questions/44282044/…).

AdamKarol · Accepted Answer · 2023-01-23 08:53:38Z

See no other way than this (unfortunately). You need to define vectors with column names and then You may chain joining by reference like this: 

cols_dt1 <- colnames(dt_dt1)[!colnames(dt_dt1) %in% 'join_column1']
cols_dt2 <- colnames(dt_dt2)[!colnames(dt_dt2) %in% ' join_column2']
cols_dt3 <- colnames(dt_dt3)[!colnames(dt_dt3) %in% ' join_column3']
cols_dt4 <- colnames(dt_dt4)[!colnames(dt_dt4) %in% ' join_column4']
cols_dt5 <- colnames(dt_dt5)[!colnames(dt_dt5) %in% ' join_column5']

data_dt[dt_dt1, on=.( join_column1), (cols_dt1) := mget(cols_dt1)][
  dt_dt2, on=.( join_column2), (cols_dt2) := mget(cols_dt2)][
    dt_dt3, on=.( join_column3), (cols_dt3) := mget(cols_dt3)][
      dt_dt4, on=.( join_column4), (cols_dt4) := mget(cols_dt4)][
        dt_dt5, on=.( join_column5), (cols_dt5) := mget(cols_dt5)]

Collectives™ on Stack Overflow

Chaining multiple data.table::merge operations with data.tables

2 Answers 2

2 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related