
I am trying to fuzzy-join two data frames. Both contain a column with ZIP codes plus some other columns. However, the parent data frame contains more ZIP codes than the secondary one. I would like to match on the first 3 digits of a ZIP code and then on the closest numerical value of the ZIP. Jaro-Winkler distance seems perfect for that.

I have tried the solution given here, which uses the comparator package, adapted to my case as follows:

library(dplyr)
library(fuzzyjoin)
library(tidyverse)
library(comparator)

df1 <- tribble(
  ~colA, ~colB,
  3000,   1,
  3001,   2,
  3007,   3
)


df2 <- tribble(
  ~colA, ~colC,
  3000,   200,
  3004,   22,
  3012,   55
)


jw <- comparator::JaroWinkler()

df3 <- fuzzyjoin::fuzzy_left_join(
  x = df1, y = df2, by = "colA",
  match_fun = function(x, y) { jw(x, y) > 0.62}
) 

But the output is a data frame with 9 rows, whereas I just want a new data frame that looks like this:

df3 <- tribble(
  ~colA, ~colB, ~colC,
  3000,  1,  200,
  3001,  2,  200,
  3007,  3,  22
)

i.e. taking into account both the JW distance and the fact that 3001 is numerically closer to 3000 than to 3004, and that 3007 is closer to 3004 than to 3012.

Any hints on how I should modify my code? Thank you!!

  • I suspect it might make sense to split out your zip codes into two pieces, the first part reflecting the first three digits, which specify the “sectional center facility,” and then the last two digits, for which “main town in a region (if applicable) often gets the first ZIP Codes for that region; afterward, the numerical order often follows the alphabetical order.” en.wikipedia.org/wiki/ZIP_Code#Geographic_hierarchy Commented Jan 18 at 17:38
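The split suggested in this comment could be sketched in base R as below. The sample codes and the column names `scf` and `local` are made up for illustration, and the 3 + 2 split assumes 5-digit US ZIP codes stored as character strings (so leading zeros survive); the 4-digit codes in the question would need different cut points.

```r
# Illustrative 5-digit codes, kept as strings so "02139" keeps its leading zero
zips <- c("30301", "30312", "02139")

zip_parts <- data.frame(
  zip   = zips,
  scf   = substr(zips, 1, 3),              # first three digits, compared exactly
  local = as.integer(substr(zips, 4, 5)),  # last two digits, compared numerically
  stringsAsFactors = FALSE
)

print(zip_parts)
```

With the two pieces in separate columns, the prefix can be matched exactly and the remainder compared by numeric distance.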

1 Answer


With fuzzyjoin::fuzzy_join() & co we can use custom matching functions. In this case we could aim for an exact match for the first digit(s) and numeric difference for the rest.

There will likely be more than one match per row, but the matching function also returns the numeric difference, so we can pass it to slice_min() and keep only the closest match for each df1$colA value.

library(dplyr, warn.conflicts = FALSE)
library(fuzzyjoin)

df1 <- tribble(
  ~colA, ~colB,
  3000,   1,
  3001,   2,
  3007,   3,
  2820,   4,
)

df2 <- tribble(
  ~colA, ~colC,
  2999,   100,
  3000,   200,
  3004,   22,
  3012,   55,
)

# if match_fun returns a data.frame / tibble, its first column is
# used as the match indicator; fuzzy_join adds the extra columns to the resulting frame
zip_match <- function(x, y, match_digits = 1, max_num_dist = 999){
  tibble(
    start_match = stringi::stri_sub(x, to = match_digits) == stringi::stri_sub(y, to = match_digits),
    num_dist    = abs(as.numeric(stringi::stri_sub(x, from = match_digits + 1)) - 
                      as.numeric(stringi::stri_sub(y, from = match_digits + 1))),
    match = start_match & num_dist <= max_num_dist
  ) |> 
  select(match, start_match, num_dist)
}

Usage examples:

# call match_fun with defaults (match_digits = 1, max_num_dist = 999),
# for 4-digit codes only first digit must match, "2820" matches "2999" 
fuzzy_left_join(df1, df2, by = "colA", match_fun = zip_match) |> 
  # keep only single closest match
  slice_min(order_by = num_dist, by = colA.x)
#> # A tibble: 4 × 6
#>   colA.x  colB colA.y  colC start_match num_dist
#>    <dbl> <dbl>  <dbl> <dbl> <lgl>          <dbl>
#> 1   3000     1   3000   200 TRUE               0
#> 2   3001     2   3000   200 TRUE               1
#> 3   3007     3   3004    22 TRUE               3
#> 4   2820     4   2999   100 TRUE             179

# match_digits = 2, "2820" does not match "2999"
fuzzy_left_join(df1, df2, by = "colA", match_fun = zip_match, match_digits = 2) |> 
  slice_min(order_by = num_dist, by = colA.x)
#> # A tibble: 4 × 6
#>   colA.x  colB colA.y  colC start_match num_dist
#>    <dbl> <dbl>  <dbl> <dbl> <lgl>          <dbl>
#> 1   3000     1   3000   200 TRUE               0
#> 2   3001     2   3000   200 TRUE               1
#> 3   3007     3   3004    22 TRUE               3
#> 4   2820     4     NA    NA NA                NA

Created on 2025-01-18 with reprex v2.1.1
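One detail worth checking with slice_min(): by default it keeps ties, so a code that is equally far from two candidates would come back with two rows. A minimal sketch with made-up candidate matches (the `matches` table below is hypothetical, not output from the code above):

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical candidate matches: 3002 is equally far from 3000 and 3004
matches <- tribble(
  ~colA.x, ~colA.y, ~num_dist,
  3002,    3000,    2,
  3002,    3004,    2,
  3007,    3004,    3,
  3007,    3012,    5
)

# Default with_ties = TRUE keeps both tied rows for 3002
with_ties_kept <- slice_min(matches, order_by = num_dist, by = colA.x)

# with_ties = FALSE keeps exactly one row per colA.x (the first of the tied rows)
one_per_key <- slice_min(matches, order_by = num_dist, by = colA.x,
                         with_ties = FALSE)

nrow(with_ties_kept)
nrow(one_per_key)
```

If exactly one match per code is required, adding with_ties = FALSE to the slice_min() calls above is one way to enforce it; which of the tied candidates should win is a separate decision.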


1st revision with difference_join() for reference - https://stackoverflow.com/revisions/79367476/1
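For comparison, the same idea (exact prefix match, then the nearest number) can be sketched with dplyr alone. This is an alternative under stated assumptions, not the approach used above: `prefix_len` is a hypothetical parameter, the codes are assumed to have the same number of digits, and cross_join() builds every pair, so it only suits small tables.

```r
library(dplyr, warn.conflicts = FALSE)

df1 <- tribble(~colA, ~colB, 3000, 1, 3001, 2, 3007, 3, 2820, 4)
df2 <- tribble(~colA, ~colC, 2999, 100, 3000, 200, 3004, 22, 3012, 55)

prefix_len <- 2  # hypothetical: number of leading digits that must agree

# All df1 x df2 pairs, reduced to pairs sharing a prefix, then the nearest one.
# With equal prefixes and equal lengths, the difference of the full codes
# equals the difference of the remaining digits.
closest <- cross_join(df1, df2, suffix = c(".x", ".y")) |>
  filter(substr(as.character(colA.x), 1, prefix_len) ==
         substr(as.character(colA.y), 1, prefix_len)) |>
  mutate(num_dist = abs(colA.x - colA.y)) |>
  slice_min(order_by = num_dist, by = colA.x, with_ties = FALSE) |>
  select(colA.x, colA.y, colC, num_dist)

# Left join back so that codes without any candidate (2820 here) are kept as NA
result <- left_join(df1, closest, by = c("colA" = "colA.x"))
print(result)
```

Because the prefix filter runs before the distance is computed, a code such as 2999 can never be picked for 3001, however small the numeric gap.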


3 Comments

Thank you very much! The problem is that it is not robust across "thousands". In my case ZIP 3000 is geographically far away from say 2999 but close to 3002 and so for the case where the first df is df1 <- tribble( ~colA, ~colB, 2820, 1, 3001, 2, 3007, 3 ) and the second is df2 <- tribble( ~colA, ~colC, 2999, 200, 3004, 22, 3012, 55 ) and I let the max distance be say 200, I will get that 3001 in the main df will be matched to the values of 2999 in the second df. But I need that it disregards values outside of ZIPs starting with 3.
Sorry if my question given your answer is obvious, I have started using R only this week...
@Kass , valid point and makes perfect sense. Edited and changed to a bit different approach, 1st revision for reference - stackoverflow.com/revisions/79367476/1
