0

Let's suppose we have two dataframes:

df1 = pd.DataFrame({
0: 'ETERNITON',
1: 'CIELOON',
2: 'M.DIASBRANCOON',
3: 'IRBBRASIL REON',
4: '01/00 ATACADÃO S.A ON',
5: 'AMBEV S/A ON',
6: '01/00 RUMO S.A. ON',
7: 'COGNA ONON',
8: 'CURY S/A'}.items(), columns=['index', 'name']).set_index('index')

df2 = pd.DataFrame({'name': {0: 'ALLIARON', 1: 'M.DIASBRANCOON', 2: 'AMBEVS/AON', 3: 'CIELOON',
  4: 'AESBRASILON', 5: 'BRASILAGROON', 6: 'IRBBRASILREON', 7: 'ATACADÃOS.AON', 8: 'ALPARGATASON',
  9: 'RUMOS.A.ON', 10: 'COGNAONON'},
 'yf_ticker': {0: 'AALR3.SA', 1: 'MDIA3.SA', 2: 'ABEV3.SA', 3: 'CIEL3.SA', 4: 'AESB3.SA',
  5: 'AGRO3.SA', 6: 'IRBR3.SA', 7: 'CRFB3.SA', 8: 'ALPA3.SA', 9: 'RAIL3.SA', 10: 'COGN3.SA'}})

I'd like to create a new column ('ticker') in df1 using the column 'yf_ticker' from df2. If a name/string in df2['yf_ticker'] is in df1['name'] (even if it is not an exactly match), then add the yf_ticker from df2 to that row in df1['ticker']. To make it clear, the expected output would be something like:

print(df1)
name                    ticker
ETERNITON               Missing or N/A or Nan
CIELOON                 CIEL3.SA
M.DIASBRANCOON          MDIA3.SA
IRBBRASIL REON          IRBR3.SA
01/00 ATACADÃO S.A ON   CRFB3.SA
AMBEV S/A ON            ABEV3.SA
01/00 RUMO S.A. ON      RAIL3.SA
COGNA ONON              COGN3.SA
CURY S/A                Missing or N/A or Nan

The solution I tried:


df1['name'] = df1['name'].str.replace(" ","")

for i in range(len(df1)):
    for j in range(len(df2)):
        if df2.iloc[j,0] in df1.iloc[i,0]:
            df1.loc[i, 'ticker'] = df2.iloc[j,1]

Although it worked, it seems to me that such for loop for a larger dataset is inefficient. Is there a faster (or 'vectorized') way to do that?

4
  • Hi @Chris. I edited my question to add the solution I tried. Would you have a different approach? Commented Dec 18, 2021 at 17:59
  • Could u try left join or mapping? Commented Dec 18, 2021 at 18:07
  • problem is "not an exactly match" because standard functions may work with "exactly match" Commented Dec 18, 2021 at 19:26
  • for "not an exactly match" it may need to use "fuzzy matching" which frist may need to calculate similarity between all pairs name in both DataFrame and later get value with the biggest similarity. But all this may need external module for "fuzzy matching" and work in for-loop. Commented Dec 18, 2021 at 19:37

1 Answer 1

1

I suggest fuzzy matching on the name columns, then get the yf_ticker from the matching row. Here is an example with python's built-in difflib:

import difflib

df1['yf_ticker'] = df1['name'].apply(lambda x: df2.loc[df2['name'] == y[0], 'yf_ticker'].iloc[0] if (y := (difflib.get_close_matches(x, df2.name))) else None)

Output:

index name yf_ticker
0 ETERNITON
1 CIELOON CIEL3.SA
2 M.DIASBRANCOON MDIA3.SA
3 IRBBRASIL REON IRBR3.SA
4 01/00 ATACADÃO S.A ON CRFB3.SA
5 AMBEV S/A ON ABEV3.SA
6 01/00 RUMO S.A. ON RAIL3.SA
7 COGNA ONON COGN3.SA
8 CURY S/A
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.