-1

I would like to do exact joins for the columns state and name, but a fuzzy join for the "name" and "versus" columns:

year <- c("2002", "2002", "1999", "1999", "1997", "2002")
state <- c("TN", "TN", "AL", "AL", "CA", "TN")
name <- c("George", "Sally", "David", "Laura", "John", "Kate")

df1 <- data.frame(year, state, name)

year <- c("2002", "1999")
state <- c("TN", "AL")
versus <- c("@ george v. SALLY", "@laura v. dAvid")

df2 <- data.frame(year, state, versus)

My preferred output would be the following:

year <- c("2002", "2002", "1999", "1999", "1997", "2002")
state <- c("TN", "TN", "AL", "AL", "CA", "TN")
name <- c("George", "Sally", "David", "Laura", "John", "Kate")
versus <- c("@ george v. SALLY", "@ george v. SALLY", "@laura v. dAvid", "@laura v. dAvid", NA, NA)

df3 <- data.frame(year, state, name, versus)

I've tried variations of the following:

library(fuzzyjoin)

stringdist_left_join(df1, df2, by = c("year", "state", "name" = "versus"), method = "hamming")

stringdist_left_join(df1, df2, by = c("year", "state"), method = "hamming")

And they don't seem to get close to what I want.

I'm wondering if I'll need to spit up the "versus" column (remove all special characters and delimit the names) or if there's a way for me to accomplish this with something within fuzzyjoin. Any guidance would be appreciated.

2 Answers 2

1

A simple approach, which depends somewhat on the structure of df2$versus, would be this:

library(dplyr)

left_join(df1,df2, by=c("year","state")) %>% 
  rowwise() %>% 
  mutate(versus:=if_else(grepl(name,versus,ignore.case=T), versus,as.character(NA)))

Output:

  year  state name   versus           
  <chr> <chr> <chr>  <chr>            
1 2002  TN    George @ george v. SALLY
2 2002  TN    Sally  @ george v. SALLY
3 1999  AL    David  @laura v. dAvid  
4 1999  AL    Laura  @laura v. dAvid  
5 1997  CA    John   NA               
6 2002  TN    Kate   NA             

Update/Jul 14 2022:

If name has more complicated pattern, rather than a single word (say Molly Home, Jane Doe), we need a way to retrieve the series of whole words, and check if any of them appear (case-insensitive) within the versus column. Here is one simple way to do this:

  1. Create function (f(n,v)), which takes strings n and v, extracts the whole words (wrds) from n, and then counts how many of them are found in v. Returns TRUE if this count exceeds 0
f <- function(n,v) {
  wrds = stringr::str_extract_all(n, "\\b\\w*\\b")[[1]]
  sum(sapply(wrds[which(nchar(wrds)>1)], grepl,x=v,ignore.case=T))>0
}

Left join the original frames, and apply f() by row

left_join(df1,df2, by=c("year","state")) %>% 
  rowwise() %>% 
  mutate(versus:=if_else(f(name, versus), versus,NA_character_))

Output:

1 2002  TN    Molly Homes, Jane Doe Homes (v. Vista)
2 2002  TN    Sally                 NA              
3 1999  AL    David                 @laura v. dAvid 
4 1999  AL    Laura                 @laura v. dAvid 
5 1997  CA    John                  NA              
6 2002  TN    Kate                  NA              

Input:

df1 = structure(list(year = c("2002", "2002", "1999", "1999", "1997", 
"2002"), state = c("TN", "TN", "AL", "AL", "CA", "TN"), name = c("Molly Homes, Jane Doe", 
"Sally", "David", "Laura", "John", "Kate")), class = "data.frame", row.names = c(NA, 
-6L))

df2 = structure(list(year = c("2002", "1999"), state = c("TN", "AL"
), versus = c("Homes (v. Vista)", "@laura v. dAvid")), class = "data.frame", row.names = c(NA, 
-2L))
Sign up to request clarification or add additional context in comments.

2 Comments

What if I care about any whole word? I changed George to "Molly Homes, Jane Doe" and @ george v. SALLY to "Homes (v. Vista)" and they're not merging. What's going on here?
it definitely requires a different approach if you are trying to match a subset of the name (say a whole word within name.. Perhaps post a new question, with some example data?
1

Update 15/07:

See comment. In such case one would want to check for a match in versus for each individual name in name. This could be done like this (using @langtang's 'new' data):

df1 |>
    left_join(df2, by = c("year", "state")) |>
    rowwise() |>
    mutate(versus = if_else(str_detect(tolower(versus), paste0(unlist(str_extract_all(tolower(name), "\\w+")), collapse = "|")), versus, NA_character_)) |>
    ungroup()

Output:

# A tibble: 6 × 4 
  year  state name                  versus          
  <chr> <chr> <chr>                 <chr>           
1 2002  TN    Molly Homes, Jane Doe Homes (v. Vista)
2 2002  TN    Sally                 NA              
3 1999  AL    David                 @laura v. dAvid 
4 1999  AL    Laura                 @laura v. dAvid 
5 1997  CA    John                  NA              
6 2002  TN    Kate                  NA              

Old answer:

An approach could be:

library(tidyverse)

df1 |>
  left_join(df2) |>
  group_by(state) |>
  mutate(versus = if_else(str_detect(tolower(versus), tolower(name)), versus, NA_character_)) |>
  ungroup()

Output:

# A tibble: 6 × 4
  year  state name   versus           
  <chr> <chr> <chr>  <chr>            
1 2002  TN    George @ george v. SALLY
2 2002  TN    Sally  @ george v. SALLY
3 1999  AL    David  @laura v. dAvid  
4 1999  AL    Laura  @laura v. dAvid  
5 1997  CA    John   NA               
6 2002  TN    Kate   NA               

2 Comments

What if I care about any whole word? I changed George to "Molly Homes, Jane Doe" and @ george v. SALLY to "Homes (v. Vista)" and they're not merging. What's going on here?
Answer updated.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.