fuzzy_join of two dataframes based on ZIP codes

Question

I am trying to fuzzy-join two data frames. Both contain a column with ZIP codes and some other columns. However, the primary data frame contains more ZIP codes than the secondary one. I would like to match on the first 3 digits of a ZIP code and then on the closest numerical value of the ZIP. The Jaro-Winkler distance seems well suited for that.

I have tried the solution given here, which uses the comparator package, adapted to my case as follows:

library(dplyr)
library(fuzzyjoin)
library(tidyverse)
library(comparator)

df1 <- tribble(
  ~colA, ~colB,
  3000,   1,
  3001,   2,
  3007,   3
)


df2 <- tribble(
  ~colA, ~colC,
  3000,   200,
  3004,   22,
  3012,   55
)


jw <- comparator::JaroWinkler()

df3 <- fuzzyjoin::fuzzy_left_join(
  x = df1, y = df2, by = "colA",
  match_fun = function(x, y) { jw(x, y) > 0.62}
)

But the output is a data frame with 9 rows, whereas I want a new data frame that looks like this:

df3 <- tribble(
  ~colA, ~colB, ~colC,
  3000,  1,  200,
  3001,  2,  200,
  3007,  3,   22
)

i.e. taking into account both the JW distance and the fact that 3001 is numerically closer to 3000 than to 3004, and that 3007 is closer to 3004 than to 3012.

Any hints on how I should modify my code? Thank you!!

I suspect it might make sense to split your ZIP codes into two pieces: the first three digits, which specify the “sectional center facility,” and the last two digits, for which the “main town in a region (if applicable) often gets the first ZIP Codes for that region; afterward, the numerical order often follows the alphabetical order.” en.wikipedia.org/wiki/ZIP_Code#Geographic_hierarchy
– Jon Spring, Commented Jan 18 at 17:38
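
As a rough illustration of the split suggested in this comment, here is a minimal base-R sketch. The ZIP values and the names `zips` and `zip_parts` are made up for the example, and it assumes 5-digit US ZIP codes stored as character strings (so that leading zeros survive); the question's own codes have 4 digits and would need different cut points.

```r
# Hypothetical 5-digit ZIP codes, kept as character to preserve leading zeros
zips <- c("30301", "30318", "02139")

zip_parts <- data.frame(
  zip    = zips,
  prefix = substr(zips, 1, 3),              # first three digits (sectional center facility)
  suffix = as.integer(substr(zips, 4, 5)),  # last two digits as a number
  stringsAsFactors = FALSE
)

zip_parts
```

With the two pieces in separate columns, the prefix could be matched exactly and the suffix compared numerically.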

margusl · Accepted Answer · 2025-01-18 23:09:07Z

With fuzzyjoin::fuzzy_join() & co we can use custom matching functions. In this case we could aim for an exact match for the first digit(s) and numeric difference for the rest.

There will likely be more than one match per row, but we can pass the numeric difference returned by the matching function to slice_min() and keep just a single match for each df1$colA value.
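
To show the "keep only the closest match" step in isolation, here is a small sketch on a hand-written table of candidate pairs. The `candidates` data is hypothetical and only mimics the shape a fuzzy join might return; it assumes dplyr 1.1.0 or later for the `by` argument of slice_min().

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical candidate pairs: two possible matches per colA.x value
candidates <- tribble(
  ~colA.x, ~colA.y, ~num_dist,
  3001,    3000,    1,
  3001,    3004,    3,
  3007,    3004,    3,
  3007,    3012,    5
)

# For each colA.x, keep the row with the smallest num_dist
closest <- candidates |>
  slice_min(order_by = num_dist, by = colA.x)

closest
```

Note that slice_min() keeps ties by default (with_ties = TRUE), so two equally distant candidates would both be returned; with_ties = FALSE keeps only one row per group.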

library(dplyr, warn.conflicts = FALSE)
library(fuzzyjoin)

df1 <- tribble(
  ~colA, ~colB,
  3000,   1,
  3001,   2,
  3007,   3,
  2820,   4,
)

df2 <- tribble(
  ~colA, ~colC,
  2999,   100,
  3000,   200,
  3004,   22,
  3012,   55,
)

# if match_fun returns a data.frame / tibble, its first column is
# used as match indicator; fuzzy_join adds extra columns to resulting frame
zip_match <- function(x, y, match_digits = 1, max_num_dist = 999){
  tibble(
    start_match = stringi::stri_sub(x, to = match_digits) == stringi::stri_sub(y, to = match_digits),
    num_dist    = abs(as.numeric(stringi::stri_sub(x, from = match_digits + 1)) - 
                      as.numeric(stringi::stri_sub(y, from = match_digits + 1))),
    match = start_match & num_dist <= max_num_dist
  ) |> 
  select(match, start_match, num_dist)
}

Usage examples:

# call match_fun with defaults (match_digits = 1, max_num_dist = 999),
# for 4-digit codes only first digit must match, "2820" matches "2999" 
fuzzy_left_join(df1, df2, by = "colA", match_fun = zip_match) |> 
  # keep only single closest match
  slice_min(order_by = num_dist, by = colA.x)
#> # A tibble: 4 × 6
#>   colA.x  colB colA.y  colC start_match num_dist
#>    <dbl> <dbl>  <dbl> <dbl> <lgl>          <dbl>
#> 1   3000     1   3000   200 TRUE               0
#> 2   3001     2   3000   200 TRUE               1
#> 3   3007     3   3004    22 TRUE               3
#> 4   2820     4   2999   100 TRUE             179

# match_digits = 2, "2820" does not match "2999"
fuzzy_left_join(df1, df2, by = "colA", match_fun = zip_match, match_digits = 2) |> 
  slice_min(order_by = num_dist, by = colA.x)
#> # A tibble: 4 × 6
#>   colA.x  colB colA.y  colC start_match num_dist
#>    <dbl> <dbl>  <dbl> <dbl> <lgl>          <dbl>
#> 1   3000     1   3000   200 TRUE               0
#> 2   3001     2   3000   200 TRUE               1
#> 3   3007     3   3004    22 TRUE               3
#> 4   2820     4     NA    NA NA                NA

Created on 2025-01-18 with reprex v2.1.1
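
To make the matching rule easier to inspect outside of a join, here is a base-R sketch of the same idea: an exact match on the leading digit(s) plus a numeric difference on the remaining digits. `prefix_dist` is a hypothetical helper written for this illustration, not part of fuzzyjoin, and it assumes all codes have the same number of digits.

```r
# Hypothetical helper: compares codes pairwise (x[i] against y[i])
prefix_dist <- function(x, y, match_digits = 1) {
  x <- as.character(x)
  y <- as.character(y)
  # TRUE where the first `match_digits` characters are identical
  same_prefix <- substr(x, 1, match_digits) == substr(y, 1, match_digits)
  # numeric value of the remaining digits
  rest_x <- as.numeric(substr(x, match_digits + 1, nchar(x)))
  rest_y <- as.numeric(substr(y, match_digits + 1, nchar(y)))
  data.frame(x = x, y = y,
             same_prefix = same_prefix,
             num_dist = abs(rest_x - rest_y))
}

res <- prefix_dist(c(3001, 3001, 2820), c(2999, 3004, 2999))
res
```

Here 3001 and 2999 differ by only 2 as plain numbers, but they do not share a leading digit, so the pair is rejected regardless of distance; 2820 and 2999 share the leading "2" and end up 179 apart on the remaining digits.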

1st revision with difference_join() for reference - https://stackoverflow.com/revisions/79367476/1

edited Jan 18 at 23:09

answered Jan 18 at 16:07


3 Comments

Kass Jan 18 at 16:27

Thank you very much! The problem is that it is not robust across "thousands". In my case, ZIP 3000 is geographically far from, say, 2999 but close to 3002. So if the first data frame is df1 <- tribble( ~colA, ~colB, 2820, 1, 3001, 2, 3007, 3 ), the second is df2 <- tribble( ~colA, ~colC, 2999, 200, 3004, 22, 3012, 55 ), and I set the max distance to, say, 200, then 3001 in the main df is matched to 2999 in the second df. But I need it to disregard values outside of ZIPs starting with 3.

Kass Jan 18 at 16:29

Sorry if my question given your answer is obvious, I have started using R only this week...

margusl Jan 18 at 17:31

@Kass, valid point, and it makes perfect sense. Edited and changed to a slightly different approach; 1st revision for reference - stackoverflow.com/revisions/79367476/1
