data.table merge by multiple columns

Question

I'm trying to match 4 variables pairwise and add a column with the lookup value. In base, I would do merge(df1,df2, by.x=c("lsr","ppr"),by.y=c("li","pro")), where df1 has 9 cols and df2 (2 being lsr and pro) df2 has only 3, li, pro, and the "value" I'm interested in, alpha.

This works fine, but as I'm beginning to be a huge fan of data.table, I would like to do this in the data.table way - and because I have some millions of rows - so base merge is slow (I saw, that the by.x, and by.y feature is pending for data.table, but maybe there is a workaround). See some sample data below:

df2:
         alpha         li        pro
      1: 0.5000000 0.01666667 0.01666667
      2: 0.3295455 0.03333333 0.01666667
      3: 0.2435897 0.05000000 0.01666667
      4: 0.1917808 0.06666667 0.01666667
      5: 0.1571429 0.08333333 0.01666667
df1:     
          demand rtime    mcv         mck        ppr       mlv         mlk        lsr
      1:    0.3     1 357.57700 0.099326944 0.01666667 558.27267 0.155075741 0.01666667
      2:    0.3    10 548.75433 0.152431759 0.01666667 614.30667 0.170640741 0.03333333
      3:    0.3    11 314.55767 0.087377130 0.01666667 636.48100 0.176800278 0.03333333
      4:    0.3     2 312.15033 0.086708426 0.01666667 677.48100 0.188189167 0.06666667
      5:    0.3     3 454.47867 0.126244074 0.01666667 608.92067 0.169144630 0.01666667
     ---                                                                               
6899196:    0.6     5 537.92673 0.149424093 1.00000000 537.92673 0.149424093 1.00000000
6899197:    0.6     6 277.34732 0.077040923 1.00000000 277.34732 0.077040923 1.00000000
6899198:    0.6     7  73.31484 0.020365235 1.00000000  73.31484 0.020365235 1.00000000
6899199:    0.6     8  32.04197 0.008900546 1.00000000  32.04197 0.008900546 1.00000000
6899200:    0.6     9  14.59008 0.004052799 1.00000000  14.59008 0.004052799 1.00000000

Last, maybe of interest is, that in df2 I have unique rows, and in df1, I have lots of duplicates in respect to lsr and ppr. I also tried to set two keys and join them the data.table way, and adding a new column with alpha. But without success.

If you don't need an outer join, you can just do a left, a right join or an inner join using binary join. The question is which one you want. Though judging by your base R code it seems like you wan an inner join? Also, which column you want to keep? You also don't have pro column in df1 — David Arenburg
– David Arenburg, Commented May 21, 2015 at 10:34
Thanks. Actually I believe, in this case the inner and the left joins are the same, as all combinations of lsr and ppr, or of li and pro are included in both df1 and df2. And not having pro in df1 is a typo!! Sorry for that, I meant, ppr in df1, I'll edit and correct it. I'd like to keep all columns from df1 and add alpha as a new column — Puki Luki
– Puki Luki, Commented May 21, 2015 at 11:27
If all you need is a left join to df1 while updating alpha column, it is a simple binary join. Try setkey(df1, lsr, ppr) ; setkey(df2, li, pro) ; df1[df2, alpha := i.alpha] ; df1 — David Arenburg
– David Arenburg, Commented May 21, 2015 at 11:44
So if you already have an alpha column in your df1 set, i will tell data.table to get the information from the alpha in df2. It is always safer to put i, before the column names on the RHS in order to avoid possible unexpected results. — David Arenburg
– David Arenburg, Commented May 21, 2015 at 12:00

Arun · Accepted Answer · 2015-07-18 03:36:22Z

29

You can use the statement provided by David Arenburg in comment:

setkey(df1, lsr, ppr)
setkey(df2, li, pro)
df1[df2, alpha := i.alpha]

From the current devel version, 1.9.5, we can perform joins directly without having to set keys using the on argument:

df1[df2, alpha := i.alpha, on = c(lsr="li", ppr="pro")]

If you don't want to install the devel version, then you can wait until this is pushed as v1.9.6 on CRAN.

edited Jul 18, 2015 at 3:36

Arun

119k28 gold badges290 silver badges396 bronze badges

answered Jul 18, 2015 at 0:08

jangorecki

16.8k5 gold badges86 silver badges169 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Herman Toothrot Over a year ago

Is df1[df2, alpha := i.alpha] supposed to add a column to df1? It's not doing anything for me.

jangorecki Over a year ago

yes, please submit bug report to data.table repository providing reproducible example and session info

Collectives™ on Stack Overflow

data.table merge by multiple columns

1 Answer 1

2 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related