0

I am trying to join several messy datasets together without using "fuzzy matching".

In the core dataset (example dataset1 below), I have simple names for companies. In the datasets I would like to join on (that contain additional information about these companies), these names are accompanied by a variety of suffixes, prefixes, and other complications. These suffixes, prefixes are a wide variety of different lengths, making fuzzy matching less appropriate.

I would like to join based on whether strings in "exporter_group" in dataset1 are contained within strings in "company" in dataset2 as a first step, retaining the "company" column from dataset2 so that I can check the match manually.

Is this possible? Am I taking the right approach? Another way I've thought is creating a map of the simple company names and using a string matching mutate in dataset2 to create a column with the target simple name, then joining based on that new column..

Any help appreciated! The examples below are one company, in reality I will have several hundred so this needs to scale to that.

dataset1 <- tibble::tribble(
  ~exporter_group, ~exporter,
  "LOUIS DREYFUS", "LDC INDONESIA",
  "LOUIS DREYFUS", "LDC TRADING INDONESIA",
  "LOUIS DREYFUS", "LDC EAST INDONESIA",
  "LOUIS DREYFUS", "LOUIS DREYFUS"
)

dataset2 <- tibble::tribble(
  ~company, ~parent_company, ~subsidiares, ~market_cap_usd, ~bloomberg_ticker, ~thomson_reuters_ticker,
  "LOUIS DREYFUS COMPANY", NA, NA, NA, "0308213D NA EQUITY", NA
)

I've tried "fuzzy matching" and filtering based off "str_detect" but I've not quite got anywhere.

2
  • Have you looked at this? stackoverflow.com/questions/68182139/… Commented Dec 7, 2023 at 10:35
  • Hi, the present code of your second data cannot go through. Did you notice you miss another NA? Commented Dec 7, 2023 at 11:26

2 Answers 2

1

You can use regex_join(), from the package:

pacman::p_load(fuzzyjoin)
regex_join(dataset2, dataset1, by = c(company = "exporter_group"))

Output: (Note: there's something wrong with dataset2 in your example, so I had to make my own!)

 A tibble: 4 × 8
  company             parent_company subsidiares market_cap_usd bloomberg_ticker
  <chr>               <chr>                <dbl>          <dbl> <chr>           
1 LOUIS DREYFUS COMP… Amazon                   4           1000 GOOG            
2 LOUIS DREYFUS COMP… Amazon                   4           1000 GOOG            
3 LOUIS DREYFUS COMP… Amazon                   4           1000 GOOG            
4 LOUIS DREYFUS COMP… Amazon                   4           1000 GOOG            
# ℹ 3 more variables: thomson_reuters_ticker <chr>, exporter_group <chr>,
#   exporter <chr>

regex_join() by default does an inner join, but you may want to change this to be a full join or a left join, using either the mode argument, or by changing it to be regex_left_join() or regex_full_join() respectively.

Data:

dataset2 <- tibble::tribble(
  ~company, ~parent_company, ~subsidiares, ~market_cap_usd, ~bloomberg_ticker, ~thomson_reuters_ticker,
  "LOUIS DREYFUS COMPANY", "Amazon", 4, 1000, "GOOG", "idk")
Sign up to request clarification or add additional context in comments.

Comments

0

You said you dont't want "fuzzy matching" because it is "less appropriate", but "these names are accompanied by a variety of suffixes, prefixes, and other complications". It sounds conflit to me, because fuzzy matching is design to deal with varing string.

Here is a simple method using the general packages, but it is limited to the fixed additional string - 'COMPANY'. Since company in dataset2 is about having an extra ' COMPANY' string at the end of exporter_group in dataset1, we can create the company column for dataset1, and then there is no problem for a simple inner join operation.

dataset1 %>%
  mutate(company = paste0(exporter_group, " COMPANY") ) %>%
  inner_join(dataset2)

Joining, by = "company"
# A tibble: 4 x 8
  exporter_group exporter              company               parent_company subsidiares market_cap_usd bloomberg_t~1 thoms~2
  <chr>          <chr>                 <chr>                 <lgl>          <lgl>       <lgl>          <chr>         <lgl>  
1 LOUIS DREYFUS  LDC INDONESIA         LOUIS DREYFUS COMPANY NA             NA          NA             0308213D NA ~ NA     
2 LOUIS DREYFUS  LDC TRADING INDONESIA LOUIS DREYFUS COMPANY NA             NA          NA             0308213D NA ~ NA     
3 LOUIS DREYFUS  LDC EAST INDONESIA    LOUIS DREYFUS COMPANY NA             NA          NA             0308213D NA ~ NA     
4 LOUIS DREYFUS  LOUIS DREYFUS         LOUIS DREYFUS COMPANY NA             NA          NA             0308213D NA ~ NA   

This also keeps all the columns from two datasets. However, if it is not 'COMPANY' in the dataset2, I am afraid you have to do fuzzy matching like the other answer suggested.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.