
I need to combine two dataframes: one with detection data and another with the metadata for those detections.

The column names of the first dataframe, which is called rt_det are:

tag_id, power, rec_ser_num, date_time

The column names of the second dataframe, which is called range_testing are:

date, tag_id, tag_type, tag_int, location, rec_num, rec_ser_num, rec_type, start_time, datetime_start, end_time, datetime_end, distance_m, depth_ft, depth_loc, group, notes

They need to be combined based on:

  1. tag_id matching between the two dataframes

  2. rec_ser_num matching between the two dataframes

  3. date_time of detections dataframe being greater than or equal to datetime_start and less than or equal to datetime_end from the metadata dataframe

  4. There will be extra rows in the detections dataframe that do not match the metadata dataframe; these rows should not transfer to the new dataframe when merged

  5. All columns from both dataframes need to be kept in the new dataframe

I tried the code below, based on answer 6 in the post R: merge based on multiple conditions (non-equal criteria), but I get the error shown. My question also differs from that post because of points 4 and 5 above, and because that post adds data to a single new column based on the other dataframe, whereas I want the data in all the columns that correspond to all the conditions.

head(
  setDT(rt_det)[
    setDT(range_testing), 
    on = c(
      "tag_id", 
      "rec_ser_num", 
      "date_time>=datetime_start", 
      "date_time<=datetime_end"
    ), 
    data := data
  ]
)

Error:

Error in `[.data.table`(setDT(rt_det), setDT(range_testing), on = c("tag_id",  : 
  invalid type/length (closure/27856) in vector allocation

Any help would be most appreciated!


1 Answer


First, a simplified example (which you can easily adapt):

df_points <- data.table::fread("
tag_id  date_time other_cols
   100 2025-06-30       'P1'
   200 2025-01-31       'P2'
   200 2025-04-01       'P3'
   200 2025-06-01       'P4'
   300 2025-10-01       'P5'
", data.table=FALSE)

df_ranges <- data.table::fread("
tag_id date_time_start date_time_end other_cols
   100      2025-01-01    2025-02-28       'R1'
   200      2025-03-01    2025-06-30       'R2'
   200      2025-05-01    2025-08-31       'R3'
   400      2025-09-01    2025-11-30       'R4'
", data.table=FALSE)

It seems that what you are describing is a left range join. A join is defined by two things:

  1. a rule or combination of rules for deciding which pairs of rows from either side form a match, and therefore join together in a combined row. You're describing an inequality join, and specifically what is known as a range join, where a match exists if a point value on one side falls between an upper and lower bound on the other. You also have additional equality conditions (represented by only one column in this simplified example).
  2. which sets of joining/non-joining rows to reflect in the result. You want to keep the unjoined rows from the left-hand table but not the right-hand table (point 4), which makes this a left join.

Regarding (1), there is one additional consideration: whether you want to allow multiple matches when a time point falls into more than one interval. Notice that I've rigged the example so that row P4 on the left matches two intervals on the right, at rows R2 and R3. You need to decide whether this matters for your data and, if so, what the policy should be.

(Your use of := (implying an update join) suggests you don't want multiple matches, because an update join has to preserve the dimension of the left-hand table without recycling/expanding any of its rows. On the other hand you might just be echoing the code in the question you reference. If you do want an update join, then there are some complications with multiple matches. But if what I've just said means nothing to you, ignore it!)

Okay, so assuming you don't need an update join, you have a couple of convenient options for what to use. The first I will mention is (my) utility package {fjoin}, which writes and runs {data.table} code while adding lots of bells and whistles, and works directly on non-data.tables.

install.packages("fjoin", repos = c("https://trobx.r-universe.dev")) # on CRAN soon

library(fjoin)
fjoin_left(df_points,
           df_ranges,
           on=c("tag_id", "date_time>=date_time_start", "date_time<=date_time_end"),
           indicate=TRUE)
  .join tag_id  date_time other_cols date_time_start date_time_end R.other_cols
1     1    100 2025-06-30         P1            <NA>          <NA>         <NA>
2     1    200 2025-01-31         P2            <NA>          <NA>         <NA>
3     3    200 2025-04-01         P3      2025-03-01    2025-06-30           R2
4     3    200 2025-06-01         P4      2025-03-01    2025-06-30           R2
5     3    200 2025-06-01         P4      2025-05-01    2025-08-31           R3
6     1    300 2025-10-01         P5            <NA>          <NA>         <NA>

What is nice here is that you can set indicate=TRUE to add an upfront column showing which input each row came from (1 for left, 2 for right, 3 for both). This simple but useful feature has existed in Stata since its release in January 1985. It's also been adopted in R by the excellent {collapse} package, but {collapse} doesn't do inequality joins.

If you only want one match per row of the left input (say, the first) then you can set mult.x = "first". That will lose the second match with P4 above (I'll leave you to run it).

However, the mainstream answer is to use {dplyr}, which has supported inequality joins since version 1.1.0. This is the solution people will naturally point you to because it is such a widely used package. You will lose the indicator column (and other options that might be relevant here), and it won't be quite as fast on large data, though that's very unlikely to matter.

library(dplyr)
left_join(df_points,
          df_ranges,
          join_by(tag_id, date_time>=date_time_start, date_time<=date_time_end))

{dplyr} has a shorthand for range joins (though the join it does is the same):

left_join(df_points,
          df_ranges,
          join_by(tag_id, between(date_time, date_time_start, date_time_end)))

NB If needed, the {dplyr} equivalent of {fjoin}'s mult.x is multiple. ({fjoin} also has a mult.y but I don't think it comes into play here.)
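For completeness, a sketch of that in {dplyr} (assuming dplyr >= 1.1.0, where multiple accepts "first"):

```r
library(dplyr)

# Keep only the first matching interval for each left-hand row,
# so P4 would join to R2 only (analogous to {fjoin}'s mult.x = "first")
left_join(df_points,
          df_ranges,
          join_by(tag_id, between(date_time, date_time_start, date_time_end)),
          multiple = "first")
```

Unmatched left-hand rows (P1, P2, P5 in the example) are still kept, with NA in the right-hand columns, since this remains a left join.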

You've used {data.table} in your attempt but, again, you might just be reflecting the code you saw in an answer you consulted. If you really do mean to write the join directly in {data.table}, be warned that it doesn't automatically represent all the join columns - it garbles them in a certain way that I'm not going to explain here. Patching that up is a bit of a pain, but you can use {fjoin} to ghostwrite the code for you by setting do=FALSE, which shows the generated code without running it:

library(fjoin)
fjoin_left(df_points,
           df_ranges,
           on=c("tag_id", "date_time>=date_time_start", "date_time<=date_time_end"),
           indicate=TRUE,
           do=FALSE)
.DT : y = df_ranges (cast as data.table)
.i  : x = df_points (cast as data.table)
Join: .DT[, fjoin.ind.DT := TRUE][.i, on = c("tag_id", "date_time_start <= 
date_time", "date_time_end >= date_time"), data.frame(.join = 
fifelse(is.na(fjoin.ind.DT), 1L, 3L), tag_id = i.tag_id, date_time, other_cols = 
i.other_cols, date_time_start = x.date_time_start, date_time_end = 
x.date_time_end, R.other_cols = other_cols)]