I am trying to join several messy datasets together without using "fuzzy matching".
In the core dataset (dataset1 in the example below), I have simple names for companies. In the datasets I would like to join to it, which contain additional information about these companies, the names are accompanied by a variety of suffixes, prefixes, and other complications. These suffixes and prefixes vary widely in length, which makes fuzzy matching less appropriate.
I would like to join based on whether strings in "exporter_group" in dataset1 are contained within strings in "company" in dataset2 as a first step, retaining the "company" column from dataset2 so that I can check the match manually.
Is this possible? Am I taking the right approach? Another option I've considered is creating a map of the simple company names, using a string-matching mutate on dataset2 to create a column with the target simple name, and then joining on that new column.
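To make that second idea concrete, here is a minimal sketch of the "map" approach. The `simple_names` vector, the `companies` tibble, the "ACME" entries, and the `first_match()` helper are all made up for illustration; in practice the names would come from the distinct values of "exporter_group". It uses `fixed()` so that names are matched literally rather than as regular expressions.

```r
library(dplyr)
library(stringr)

# Hypothetical lookup of simple names (assumption: in practice this would be
# the distinct values of exporter_group in dataset1).
simple_names <- c("LOUIS DREYFUS", "ACME")

# Toy stand-in for dataset2 with messy company names.
companies <- tibble::tibble(
  company = c("LOUIS DREYFUS COMPANY", "PT ACME INDONESIA", "UNRELATED LTD")
)

# Return the first simple name contained in `company`, or NA if none is.
first_match <- function(company, candidates) {
  hit <- candidates[str_detect(company, fixed(candidates))]
  if (length(hit) == 0) NA_character_ else hit[1]
}

# Add the target simple name as a new column that can be used as a join key.
companies_tagged <- companies %>%
  mutate(
    exporter_group = vapply(
      company, first_match, character(1),
      candidates = simple_names, USE.NAMES = FALSE
    )
  )
```

The new "exporter_group" column could then serve as an ordinary key for `left_join()`. Note that this sketch keeps only the first hit, so a company name containing two simple names would need separate handling.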
Any help appreciated! The example below covers one company; in reality I will have several hundred, so this needs to scale.
dataset1 <- tibble::tribble(
  ~exporter_group, ~exporter,
  "LOUIS DREYFUS", "LDC INDONESIA",
  "LOUIS DREYFUS", "LDC TRADING INDONESIA",
  "LOUIS DREYFUS", "LDC EAST INDONESIA",
  "LOUIS DREYFUS", "LOUIS DREYFUS"
)
dataset2 <- tibble::tribble(
  ~company, ~parent_company, ~subsidiares, ~market_cap_usd, ~bloomberg_ticker, ~thomson_reuters_ticker,
  "LOUIS DREYFUS COMPANY", NA, NA, NA, "0308213D NA EQUITY", NA
)
I've tried "fuzzy matching" and filtering based on "str_detect", but I haven't quite got anywhere.
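For reference, the containment join described above could be sketched roughly as follows. This is one possible approach under stated assumptions, not a definitive solution: it assumes dplyr 1.1.0 or later for `cross_join()`, and it redefines trimmed copies of the example data so the block runs on its own. The `groups`, `matches`, and `joined` names are just illustrative.

```r
library(dplyr)
library(stringr)

# Trimmed copies of the example data, so the block is self-contained.
dataset1 <- tibble::tribble(
  ~exporter_group, ~exporter,
  "LOUIS DREYFUS", "LDC INDONESIA",
  "LOUIS DREYFUS", "LOUIS DREYFUS"
)
dataset2 <- tibble::tribble(
  ~company, ~bloomberg_ticker,
  "LOUIS DREYFUS COMPANY", "0308213D NA EQUITY"
)

# One row per simple name, so each name is compared only once.
groups <- distinct(dataset1, exporter_group)

# Pair every simple name with every company, then keep the pairs where the
# company string contains the simple name. fixed() matches the name literally.
matches <- cross_join(groups, dataset2) %>%
  filter(str_detect(company, fixed(exporter_group)))

# Attach the matched rows to dataset1; "company" is retained for manual checks.
joined <- left_join(dataset1, matches, by = "exporter_group")
```

The cross join compares every simple name with every company, so a few hundred names against a modest dataset2 should be manageable, but the intermediate table grows as the product of the two row counts. If one simple name matches several companies, `joined` will contain one row per match, which is where the manual check of "company" comes in.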