3

Is it possible to chain multiple merge operations one after another with data.tables?

The functionality would be similar to joining multiple data.frames in a dplyr pipe but would be used for data.tables in a similar chained fashion as merging two data.tables in the below and then manipulating the data.table as required. But only you would be then able to merge another data.table. I am acknowledging this SO question here may be very similar, that is after @chinsoon12 posted the comment.

Thanks for any help!

library(dplyr)
library(data.table)

# data.frame
df1 = data.frame(food = c("apples", "bananas", "carrots", "dates"),
                 quantity = c(1:4))

df2 = data.frame(food = c("apples", "bananas", "carrots", "dates"),
                 status = c("good", "bad", "rotten", "raw"))

df3 = data.frame(food = c("apples", "bananas", "carrots", "dates"),
                 rank = c("okay", "good", "better", "best"))

df4 = left_join(df1,
                df2,
                by = "food") %>% 
  mutate(new_col = NA) %>%  # this is just to hold a position of mutation in the data.frame
  left_join(.,
            df3,
            by = "food")



# data.table
dt1 = data.table(food = c("apples", "bananas", "carrots", "dates"),
                 quantity = c(1:4))

dt2 = data.table(food = c("apples", "bananas", "carrots", "dates"),
                 status = c("good", "bad", "rotten", "raw"))

dt3 = data.table(food = c("apples", "bananas", "carrots", "dates"),
                 rank = c("okay", "good", "better", "best"))

# this is what I am not sure how to implement
dt4 = merge(dt1,
            dt2,
            by = "food")[
              food == "apples"](merge(dt4))
4
  • you can search for Reduce(function(x, y) merge(x, y), list(dt2,dt3,dt4)) Commented Sep 20, 2018 at 1:13
  • 1
    You might have to give a more concrete example. Guidance is here, if you're interested: stackoverflow.com/questions/5963269/… Your pseudocode in the second chunk looks fine to me as long as the commas and parentheses are balanced and in the right places. Commented Sep 20, 2018 at 1:16
  • You are right, I thought it was more of a general question but it would be easier to provide a MWE. Let me edit to provide one. Commented Sep 20, 2018 at 1:24
  • see also stackoverflow.com/questions/31480615/… Commented Sep 20, 2018 at 2:41

2 Answers 2

6

Multiple data.table joins with the on argument can be chained. Note that without an update operator (":=") in j, this would be a right join, but with ":=" (i.e., adding columns), this becomes a left outer join. A useful post on left joins here Left join using data.table.

Example using example data above with a subset between joins:

dt4 <- dt1[dt2, on="food", `:=`(status = i.status)][
            food == "apples"][dt3, on="food", rank := i.rank]

##> dt4
## food quantity status rank
##1: apples        1   good okay

Example adding new column between joins

dt4 <- dt1[dt2, on="food", `:=`(status = i.status)][
            , new_col := NA][dt3, on="food", rank := i.rank]

##> dt4
##      food quantity status new_col   rank
##1:  apples        1   good      NA   okay
##2: bananas        2    bad      NA   good
##3: carrots        3 rotten      NA better
##4:   dates        4    raw      NA   best

Example using merge and magrittr pipes:

dt4 <-  merge(dt1, dt2, by = "food") %>%
           set( , "new_col", NA) %>% 
             merge(dt3, by = "food")

##> dt4
##      food quantity status new_col   rank
##1:  apples        1   good      NA   okay
##2: bananas        2    bad      NA   good
##3: carrots        3 rotten      NA better
##4:   dates        4    raw      NA   best
Sign up to request clarification or add additional context in comments.

2 Comments

This may sound like a dumb question, but what is the i. in front of i.status and i.rank for?
it's a good question and in hindsight probably not needed here. In joins of the form x[i] it can be used to specify that the column is taken from the i (right) table. similarly, you can specify from the x (left) table using something like x.name. BUT, in hindsight this may only be useful (needed?) in non-equi joins (see the accepted answer to this stackoverflow.com/questions/44282044/…).
0
See no other way than this (unfortunately). You need to define vectors with column names and then You may chain joining by reference like this: 

cols_dt1 <- colnames(dt_dt1)[!colnames(dt_dt1) %in% 'join_column1']
cols_dt2 <- colnames(dt_dt2)[!colnames(dt_dt2) %in% ' join_column2']
cols_dt3 <- colnames(dt_dt3)[!colnames(dt_dt3) %in% ' join_column3']
cols_dt4 <- colnames(dt_dt4)[!colnames(dt_dt4) %in% ' join_column4']
cols_dt5 <- colnames(dt_dt5)[!colnames(dt_dt5) %in% ' join_column5']

data_dt[dt_dt1, on=.( join_column1), (cols_dt1) := mget(cols_dt1)][
  dt_dt2, on=.( join_column2), (cols_dt2) := mget(cols_dt2)][
    dt_dt3, on=.( join_column3), (cols_dt3) := mget(cols_dt3)][
      dt_dt4, on=.( join_column4), (cols_dt4) := mget(cols_dt4)][
        dt_dt5, on=.( join_column5), (cols_dt5) := mget(cols_dt5)]

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.