0

I am trying to find data entry errors in the names and locations of my dataset by fuzzy matching. I am have a unique key from the original data, siterow_id, and have made a new key, pi_key, where I already identified some hard matches. (no fuzzy matching). After running the fuzzy matching I get duplicate values. The matches from both the left and right side of the join for some of the siterow_id's. I can manually look at the data and see where this occurs and hard code to remove the rows. I want a more algorithmic way of doing this as I go to a larger dataset with many more matches.

I tried doing it this way but it removes the matches on the left and the right. If possible I would love a tidyverse way to do this and not a loop.

The table output is included below. You can see a duplicate in row 8 and 9.

for(site in three_letter_matches$siterow_id.x){ 
  if (any(three_letter_matches$siterow_id.y == site)) {
    three_letter_matches <- three_letter_matches[!three_letter_matches$siterow_id.y == site,]
  }
}




  pi_key.x siterow_id.x last_name.x first_name.x city.x country.x pi_key.y siterow_id.y
  <chr>    <chr>        <chr>       <chr>        <chr>  <chr>     <chr>    <chr>       
1 6309     1-9CH29M     kim         kevin        san f~ united s~ 11870    1-HC3YY6    
2 7198     1-CJGRSZ     kim         jinseok      seoul  korea re~ 2952     1-2QBRZ2    
3 7198     1-CJGRSZ     kim         jinseok      seoul  korea re~ 2952     1-3AHHSU    
4 7198     1-CJGRSZ     kim         jinseok      seoul  korea re~ 2952     1-3JYF8V    
5 7567     1-CW4DXI     bar         jair         ramat~ israel    8822     1-E3UILG    
6 8822     1-E3UILG     bar         jair         ramat~ israel    7567     1-CW4DXI    
7 11870    1-HC3YY6     kim         kevin        san f~ united s~ 6309     1-9CH29M    
8 12357    1-HUUEA6     lee         hyojin       daeje~ korea re~ 13460    1-IGKCPP    
9 13460    1-IGKCPP     lee         hyo jin      daeje~ korea re~ 12357    1-HUUEA6

I found another way to do it

update <- three_letter_matches[!is.na(match(three_letter_matches$siterow_id.x, three_letter_matches$siterow_id.y)),]
update %<>% arrange(last_name.x, first_name.x) %>% 
  filter(row_number() %% 2 != 0)
three_letter_matches_update <- three_letter_matches %>% 
  anti_join(update)

Still open to suggestions.

0

1 Answer 1

1

Not the easiest problem, but there are a few ways to do this. The first that comes to mind for me is a bit slow (because it uses rowwise() which is equivalent to using map() or lapply()) is this one:

NOTE: This only works if siterow_id.x/y are character vectors. Won't work for factors.

three_letter_matches <- three_letter_matches %>%
  rowwise() %>%
  mutate(both_values = paste0(sort(c(siterow_id.x,siterow_id.y)),collapse = ",")) %>%
  ungroup() %>%
  distinct(both_values,.keep_all = TRUE) %>%
  select(-both_values)


# pi_key.x   siterow_id.x last_name.x first_name.x city.x country.x pi_key.y siterow_id.y
# 6309  1-9CH29M    kim kevin   san f~  united s~   11870   1-HC3YY6
# 7198  1-CJGRSZ    kim jinseok seoul   korea re~   2952    1-2QBRZ2
# 7198  1-CJGRSZ    kim jinseok seoul   korea re~   2952    1-3AHHSU
# 7198  1-CJGRSZ    kim jinseok seoul   korea re~   2952    1-3JYF8V
# 7567  1-CW4DXI    bar jair    ramat~  israel      8822    1-E3UILG
# 12357 1-HUUEA6    lee hyojin  daeje~  korea re~   13460   1-IGKCPP

Basically what I'm doing here is doing rowwise so that I work on one row at a time, then I take the site_row ids and sort them so that every row will have the same order, then I paste them together into a single string that is easy to compare for equivalence. Next I ungroup so that you are looking at all rows again (get rid of that rowwise). Then run a distinct to only keep the first row for each value in the new column but with the .keep_all option to keep all the columns. Then I cleanup by removing my extra column.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.