Let's suppose we have two dataframes:
df1 = pd.DataFrame({
0: 'ETERNITON',
1: 'CIELOON',
2: 'M.DIASBRANCOON',
3: 'IRBBRASIL REON',
4: '01/00 ATACADÃO S.A ON',
5: 'AMBEV S/A ON',
6: '01/00 RUMO S.A. ON',
7: 'COGNA ONON',
8: 'CURY S/A'}.items(), columns=['index', 'name']).set_index('index')
df2 = pd.DataFrame({'name': {0: 'ALLIARON', 1: 'M.DIASBRANCOON', 2: 'AMBEVS/AON', 3: 'CIELOON',
4: 'AESBRASILON', 5: 'BRASILAGROON', 6: 'IRBBRASILREON', 7: 'ATACADÃOS.AON', 8: 'ALPARGATASON',
9: 'RUMOS.A.ON', 10: 'COGNAONON'},
'yf_ticker': {0: 'AALR3.SA', 1: 'MDIA3.SA', 2: 'ABEV3.SA', 3: 'CIEL3.SA', 4: 'AESB3.SA',
5: 'AGRO3.SA', 6: 'IRBR3.SA', 7: 'CRFB3.SA', 8: 'ALPA3.SA', 9: 'RAIL3.SA', 10: 'COGN3.SA'}})
I'd like to create a new column ('ticker') in df1 using the column 'yf_ticker' from df2. If a name/string in df2['yf_ticker'] is in df1['name'] (even if it is not an exactly match), then add the yf_ticker from df2 to that row in df1['ticker']. To make it clear, the expected output would be something like:
print(df1)
name ticker
ETERNITON Missing or N/A or Nan
CIELOON CIEL3.SA
M.DIASBRANCOON MDIA3.SA
IRBBRASIL REON IRBR3.SA
01/00 ATACADÃO S.A ON CRFB3.SA
AMBEV S/A ON ABEV3.SA
01/00 RUMO S.A. ON RAIL3.SA
COGNA ONON COGN3.SA
CURY S/A Missing or N/A or Nan
The solution I tried:
df1['name'] = df1['name'].str.replace(" ","")
for i in range(len(df1)):
for j in range(len(df2)):
if df2.iloc[j,0] in df1.iloc[i,0]:
df1.loc[i, 'ticker'] = df2.iloc[j,1]
Although it worked, it seems to me that such for loop for a larger dataset is inefficient. Is there a faster (or 'vectorized') way to do that?
"not an exactly match"because standard functions may work with"exactly match""not an exactly match"it may need to use"fuzzy matching"which frist may need to calculatesimilaritybetween all pairsnamein bothDataFrameand later get value with the biggestsimilarity. But all this may need external module for"fuzzy matching"and work infor-loop.