Pandas: Creating a new column based on if part of a string is anywhere in another column

Question

Let's suppose we have two dataframes:

df1 = pd.DataFrame({
0: 'ETERNITON',
1: 'CIELOON',
2: 'M.DIASBRANCOON',
3: 'IRBBRASIL REON',
4: '01/00 ATACADÃO S.A ON',
5: 'AMBEV S/A ON',
6: '01/00 RUMO S.A. ON',
7: 'COGNA ONON',
8: 'CURY S/A'}.items(), columns=['index', 'name']).set_index('index')

df2 = pd.DataFrame({'name': {0: 'ALLIARON', 1: 'M.DIASBRANCOON', 2: 'AMBEVS/AON', 3: 'CIELOON',
  4: 'AESBRASILON', 5: 'BRASILAGROON', 6: 'IRBBRASILREON', 7: 'ATACADÃOS.AON', 8: 'ALPARGATASON',
  9: 'RUMOS.A.ON', 10: 'COGNAONON'},
 'yf_ticker': {0: 'AALR3.SA', 1: 'MDIA3.SA', 2: 'ABEV3.SA', 3: 'CIEL3.SA', 4: 'AESB3.SA',
  5: 'AGRO3.SA', 6: 'IRBR3.SA', 7: 'CRFB3.SA', 8: 'ALPA3.SA', 9: 'RAIL3.SA', 10: 'COGN3.SA'}})

I'd like to create a new column ('ticker') in df1 using the column 'yf_ticker' from df2. If a name/string in df2['yf_ticker'] is in df1['name'] (even if it is not an exactly match), then add the yf_ticker from df2 to that row in df1['ticker']. To make it clear, the expected output would be something like:

print(df1)
name                    ticker
ETERNITON               Missing or N/A or Nan
CIELOON                 CIEL3.SA
M.DIASBRANCOON          MDIA3.SA
IRBBRASIL REON          IRBR3.SA
01/00 ATACADÃO S.A ON   CRFB3.SA
AMBEV S/A ON            ABEV3.SA
01/00 RUMO S.A. ON      RAIL3.SA
COGNA ONON              COGN3.SA
CURY S/A                Missing or N/A or Nan

The solution I tried:


df1['name'] = df1['name'].str.replace(" ","")

for i in range(len(df1)):
    for j in range(len(df2)):
        if df2.iloc[j,0] in df1.iloc[i,0]:
            df1.loc[i, 'ticker'] = df2.iloc[j,1]

Although it worked, it seems to me that such for loop for a larger dataset is inefficient. Is there a faster (or 'vectorized') way to do that?

Hi @Chris. I edited my question to add the solution I tried. Would you have a different approach? — Allan
– Allan, Commented Dec 18, 2021 at 17:59
problem is "not an exactly match" because standard functions may work with "exactly match" — furas
– furas, Commented Dec 18, 2021 at 19:26
for "not an exactly match" it may need to use "fuzzy matching" which frist may need to calculate similarity between all pairs name in both DataFrame and later get value with the biggest similarity. But all this may need external module for "fuzzy matching" and work in for-loop. — furas
– furas, Commented Dec 18, 2021 at 19:37

RJ Adriaansen · Accepted Answer · 2021-12-18 20:44:29Z

1

I suggest fuzzy matching on the name columns, then get the yf_ticker from the matching row. Here is an example with python's built-in difflib:

import difflib

df1['yf_ticker'] = df1['name'].apply(lambda x: df2.loc[df2['name'] == y[0], 'yf_ticker'].iloc[0] if (y := (difflib.get_close_matches(x, df2.name))) else None)

Output:

index	name	yf_ticker
0	ETERNITON
1	CIELOON	CIEL3.SA
2	M.DIASBRANCOON	MDIA3.SA
3	IRBBRASIL REON	IRBR3.SA
4	01/00 ATACADÃO S.A ON	CRFB3.SA
5	AMBEV S/A ON	ABEV3.SA
6	01/00 RUMO S.A. ON	RAIL3.SA
7	COGNA ONON	COGN3.SA
8	CURY S/A

edited Dec 18, 2021 at 20:44

answered Dec 18, 2021 at 20:28

RJ Adriaansen

9,7192 gold badges16 silver badges29 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

Pandas: Creating a new column based on if part of a string is anywhere in another column

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related