I am trying to fuzzy-join two data frames. Both contain a column with ZIP codes plus some other columns. However, the parent data frame has more ZIP codes than the secondary one. I would like to match first on the first 3 digits of the ZIP code and then on the numerically closest ZIP. The Jaro-Winkler distance seems to be a good fit for that.
I have tried the solution given here, which uses the comparator package, adapted to my case as follows:
library(dplyr)
library(fuzzyjoin)
library(tidyverse)
library(comparator)
df1 <- tribble(
  ~colA, ~colB,
  3000, 1,
  3001, 2,
  3007, 3
)
df2 <- tribble(
  ~colA, ~colC,
  3000, 200,
  3004, 22,
  3012, 55
)
jw <- comparator::JaroWinkler()
df3 <- fuzzyjoin::fuzzy_left_join(
  x = df1, y = df2, by = "colA",
  match_fun = function(x, y) { jw(x, y) > 0.62 }
)
However, the output is a data frame with 9 rows, whereas I want a new data frame that looks like this:
df3 <- tribble(
  ~colA, ~colB, ~colC,
  3000, 1, 200,
  3001, 2, 200,
  3007, 3, 22
)
i.e. taking into account both the JW distance and the fact that 3001 is numerically closer to 3000 than to 3004, and that 3007 is closer to 3004 than to 3012.
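For reference, the target rule on its own (same first 3 digits, then the numerically closest ZIP) can be written down without fuzzyjoin or Jaro-Winkler. This is only a sketch of the expected behaviour, not a fix for the join above. The names `zip_prefix` and `nearest_idx` are made up for illustration, and it assumes the ZIP codes are stored as numbers without leading zeros, as in the example data.

```r
library(tibble)

df1 <- tribble(
  ~colA, ~colB,
  3000, 1,
  3001, 2,
  3007, 3
)
df2 <- tribble(
  ~colA, ~colC,
  3000, 200,
  3004, 22,
  3012, 55
)

# Hypothetical helper: first three digits of a ZIP stored as a number
# (assumes no leading zeros were lost when the ZIP was read in).
zip_prefix <- function(z) substr(sprintf("%.0f", z), 1, 3)

# For each ZIP in df1, find the row index of the numerically closest
# df2 ZIP among those sharing the same 3-digit prefix (NA if none).
nearest_idx <- vapply(df1$colA, function(z) {
  candidates <- which(zip_prefix(df2$colA) == zip_prefix(z))
  if (length(candidates) == 0) return(NA_integer_)
  candidates[which.min(abs(df2$colA[candidates] - z))]
}, integer(1))

# Attach the matched colC; rows without a candidate get NA.
df3 <- df1
df3$colC <- df2$colC[nearest_idx]
df3
```

On the example data this pairs 3000 and 3001 with 3000, and 3007 with 3004, which is the output described above.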
Any hints on how I should modify my code? Thank you!