0

I have two dataframes, both have a Last_Name column. First dataframe has a column Contains_First_Name and the second has a column called First_Name. I want to join the two on the exact spelling of Last_Name and a substring match of Contains_First_Name and First_Name (where First_Name is a substring of Contains_First_Name.) Please see an example below.

library(dplyr)
library(stringr)

# Create df1
Last_Name <- c("Smith", "Jones", "Adams", "Rogers", "Lee", "Lee", "Lee")
Contains_First_Name <- c("Kimberly Nicole", "Patrick L", "Johnson Ann", "Rick", "McAdams Jennifer Marie", "Kirk", "Kirk B")
Account_Number <- c("123", "345", "678", "901", "234", "567", "890")

df1 <- data.frame(Last_Name, Contains_First_Name, Account_Number)

# Create df2
Last_Name <- c("Smith", "Jones", "Adams", "Lee", "Lee")
First_Name <- c("Kimberly", "Patrick", "Ann", "Jennifer", "Kirk")

df2 <- data.frame(Last_Name, First_Name)

Resulting dataframes:

> df1
  Last_Name    Contains_First_Name Account_Number
1     Smith        Kimberly Nicole            123
2     Jones              Patrick L            345
3     Adams            Johnson Ann            678
4    Rogers                   Rick            901
5       Lee McAdams Jennifer Marie            234
6       Lee                   Kirk            567
7       Lee                 Kirk B            890
> df2
  Last_Name First_Name
1     Smith   Kimberly
2     Jones    Patrick
3     Adams        Ann
4       Lee   Jennifer
5       Lee       Kirk

What I want to end up with:

> df3
  Last_Name    Contains_First_Name Account_Number First_Name
1     Smith        Kimberly Nicole            123 Kimberly
2     Jones              Patrick L            345 Patrick
3     Adams            Johnson Ann            678 Ann
4       Lee McAdams Jennifer Marie            234 Jennifer
5       Lee                   Kirk            567 Kirk
6       Lee                 Kirk B            890 Kirk

I tried this:

df3 <-
  filter(df1,
         Last_Name %in% df2$Last_Name,
         str_detect(Contains_First_Name, paste(df2$First_Name, collapse = "|")))

Getting the following error:

Error in match.arg(method) : 'arg' must be NULL or a character vector

I also explored fuzzyjoin library but could not figure out how to join on two variables with two different join types (exact and substring.) I saw a similar question which does not appear to have an answer: Merge two dataframes based on an exact match in one column and match within an error in another column in R. Any advice is greatly appreciated. Thank you.

1 Answer 1

1

I'd say you have two options: either use an equi-join on only the first column and filter later or use fuzzyjoin, as you described:

# Approach 1: Match all, filter later

inner_join(df1, df2, join_by(Last_Name), relationship = "many-to-many") |> 
  filter(str_detect(Contains_First_Name, First_Name))
#> # A tibble: 6 × 4
#>   Last_Name Contains_First_Name    Account_Number First_Name
#>   <chr>     <chr>                  <chr>          <chr>     
#> 1 Smith     Kimberly Nicole        123            Kimberly  
#> 2 Jones     Patrick L              345            Patrick   
#> 3 Adams     Johnson Ann            678            Ann       
#> 4 Lee       McAdams Jennifer Marie 234            Jennifer  
#> 5 Lee       Kirk                   567            Kirk      
#> 6 Lee       Kirk B                 890            Kirk

# Approach 2: fuzzyjoin

fuzzyjoin::fuzzy_inner_join(
  df1,
  df2,
  by = c("Last_Name" = "Last_Name", "Contains_First_Name" = "First_Name"),
  match_fun = list(`==`, \(x, y) str_detect(x, y))
) |> 
  select(!Last_Name.y) |> 
  rename(Last_Name = Last_Name.x)
#> # A tibble: 6 × 4
#>   Last_Name Contains_First_Name    Account_Number First_Name
#>   <chr>     <chr>                  <chr>          <chr>     
#> 1 Smith     Kimberly Nicole        123            Kimberly  
#> 2 Jones     Patrick L              345            Patrick   
#> 3 Adams     Johnson Ann            678            Ann       
#> 4 Lee       McAdams Jennifer Marie 234            Jennifer  
#> 5 Lee       Kirk                   567            Kirk      
#> 6 Lee       Kirk B                 890            Kirk

Created on 2024-01-06 with reprex v2.0.2

Sign up to request clarification or add additional context in comments.

3 Comments

One more question. How would you change the fuzzyjoin solution to work in R 3.6.0? I tried replacing the shorthand lambda function \ with explicit {function(x, y) str_detect(x, y)} but am getting Error in which(m) : argument to 'which' is not logical. Thank you.
Replacing \(x, y) with function(x, y) should indeed do the trick... Try it without the curly braces or defining the function with a name outside of fuzzy_inner_join() and see if it returns a logical vector. And of course update, ideally ;)
Same error unfortunately.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.