Merge two dataframes if one string column is contained in another column in Pandas

Question

I need to merge df1 and df2 below on one condition: a row of df1 matches a row of df2 when its address contains that row's state.

df1:

                                                                           address  \
0      Cecilia Chapman 711-2880 Nulla St. Mankato Mississippi 96522 (257) 563-7401   
1  Iris Watson P.O. Box 283 8562 Fusce Rd. Frederick Nebraska 20620 (372) 587-2335   
2    Celeste Slater 606-3727 Ullamcorper. Street Roseville NH 11523 (786) 713-8616   
3            Theodore Lowe Ap #867-859 Sit Rd. Azusa New York 39531 (793) 151-6230   
4                 Calista Wise 7292 Dictum Av. San Antonio MI 47096 (492) 709-6392   

   quantity  price  
0         2     20  
1         3     13  
2         5     23  
3         3     32  
4         5     45

df2:

   id        state
0   1  Mississippi
1   2     Nebraska
2   3     New York

My expected output looks like this. How can I do that? Thank you.

                                                                           address  \
0      Cecilia Chapman 711-2880 Nulla St. Mankato Mississippi 96522 (257) 563-7401   
1  Iris Watson P.O. Box 283 8562 Fusce Rd. Frederick Nebraska 20620 (372) 587-2335   
2    Celeste Slater 606-3727 Ullamcorper. Street Roseville NH 11523 (786) 713-8616   
3            Theodore Lowe Ap #867-859 Sit Rd. Azusa New York 39531 (793) 151-6230   
4                 Calista Wise 7292 Dictum Av. San Antonio MI 47096 (492) 709-6392   

   quantity  price   id        state  
0         2     20  1.0  Mississippi  
1         3     13  2.0     Nebraska  
2         5     23  NaN          NaN  
3         3     32  3.0     New York  
4         5     45  NaN          NaN

Update: the output of pat = '|'.join(r"\b{}\b".format(x) for x in df2['state']); print(df1['address'].str.extract('('+ pat + ')', expand=False))

      0    1    2    3    4    5    6    7    8    9  ...    40   41   42  \
0    NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN ...   NaN  NaN  NaN   
1    NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN ...   NaN  NaN  NaN   
2    NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN ...   NaN  NaN  NaN   
3    NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN ...   NaN  NaN  NaN    
..   ...  ...  ...  ...  ...  ...  ...  ...  ...  ... ...   ...  ...  ...  
158  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN ...   NaN  NaN  NaN   
159  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN ...   NaN  NaN  NaN

Sorry, I don't think so. :(

– ah bon, Commented Feb 6, 2020 at 10:33

jezrael · Accepted Answer · 2020-02-06 10:04:52Z

1

You can extract every matching state into a new column with Series.str.extract, wrapping each name in \b...\b word boundaries so that only whole words match, and then merge on that column with a left join:

pat = '|'.join(r"\b{}\b".format(x) for x in df2['state'])
df1['state'] = df1['address'].str.extract('(' + pat + ')', expand=False)
print(df1)
                                             address  quantity  price  \
0  Cecilia Chapman 711-2880 Nulla St. Mankato Mis...         2     20   
1  Iris Watson P.O. Box 283 8562 Fusce Rd. Freder...         3     13   
2  Celeste Slater 606-3727 Ullamcorper. Street Ro...         5     23   
3  Theodore Lowe Ap #867-859 Sit Rd. Azusa New Yo...         3     32   
4  Calista Wise 7292 Dictum Av. San Antonio MI 47...         5     45   

         state  
0  Mississippi  
1     Nebraska  
2          NaN  
3     New York  
4          NaN  

df = df1.merge(df2, on='state', how='left')
print(df)
                                             address  quantity  price  \
0  Cecilia Chapman 711-2880 Nulla St. Mankato Mis...         2     20   
1  Iris Watson P.O. Box 283 8562 Fusce Rd. Freder...         3     13   
2  Celeste Slater 606-3727 Ullamcorper. Street Ro...         5     23   
3  Theodore Lowe Ap #867-859 Sit Rd. Azusa New Yo...         3     32   
4  Calista Wise 7292 Dictum Av. San Antonio MI 47...         5     45   

         state   id  
0  Mississippi  1.0  
1     Nebraska  2.0  
2          NaN  NaN  
3     New York  3.0  
4          NaN  NaN
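
For anyone who wants to run the approach end to end, here is a minimal self-contained sketch. It rebuilds df1 and df2 from the question, and it assumes two things that are not in the answer's code: the helper name merge_on_contained_state is made up for illustration, and each state name is passed through re.escape (as suggested in the comments) so that special characters are treated literally.

```python
import re

import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "address": [
        "Cecilia Chapman 711-2880 Nulla St. Mankato Mississippi 96522 (257) 563-7401",
        "Iris Watson P.O. Box 283 8562 Fusce Rd. Frederick Nebraska 20620 (372) 587-2335",
        "Celeste Slater 606-3727 Ullamcorper. Street Roseville NH 11523 (786) 713-8616",
        "Theodore Lowe Ap #867-859 Sit Rd. Azusa New York 39531 (793) 151-6230",
        "Calista Wise 7292 Dictum Av. San Antonio MI 47096 (492) 709-6392",
    ],
    "quantity": [2, 3, 5, 3, 5],
    "price": [20, 13, 23, 32, 45],
})
df2 = pd.DataFrame({"id": [1, 2, 3],
                    "state": ["Mississippi", "Nebraska", "New York"]})


def merge_on_contained_state(left, right):
    """Hypothetical helper: left-join `right` onto `left` where address contains state."""
    # Alternation of escaped state names, wrapped in a single capture group.
    alternatives = "|".join(r"\b{}\b".format(re.escape(s)) for s in right["state"])
    pattern = "(" + alternatives + ")"
    out = left.copy()  # leave the caller's frame untouched
    # First matching state per address, NaN when no state is found.
    out["state"] = out["address"].str.extract(pattern, expand=False)
    return out.merge(right, on="state", how="left")


result = merge_on_contained_state(df1, df2)
print(result[["quantity", "price", "state", "id"]])
```

The id column comes back as float because the unmatched rows hold NaN, which matches the 1.0 / 2.0 / 3.0 values in the expected output.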


9 Comments

ah bon Over a year ago

Thank you, but I don't understand the two parentheses in str.extract('(' + ... + ')'). Could you explain more?

jezrael Over a year ago

@ahbon - it is because str.extract needs a capture group in the regex pattern, i.e. (regex), so I wrapped pat in ().
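
A small sketch of that point, using two made-up address fragments rather than the question's data: with a capture group str.extract returns what the group matched, and without one pandas raises a ValueError.

```python
import pandas as pd

# Two illustrative address fragments (not the full data from the question).
s = pd.Series(["Azusa New York 39531", "Roseville NH 11523"])

# One capture group plus expand=False gives a Series holding the text
# matched by that group, with NaN where nothing matched.
with_group = s.str.extract(r"(\bNew York\b|\bNebraska\b)", expand=False)
print(with_group)

# The same pattern without the outer parentheses has no capture group,
# so str.extract refuses it.
try:
    s.str.extract(r"\bNew York\b|\bNebraska\b", expand=False)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```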

ah bon Over a year ago

Yes, I think so.

jezrael Over a year ago

@ahbon - Could you test whether it works if you add import re and change pat = '|'.join(r"\b{}\b".format(x) for x in df2['state']) to pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in df2['state'])?
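
A sketch of why re.escape can matter here. The name "St. (Saint) Louis" and both addresses are invented for illustration, and whether this is what produced the many-column output in the question's update is only a guess, since the thread does not confirm it. Unescaped parentheses inside a name become extra capture groups, and str.extract then returns a DataFrame with one column per group even with expand=False.

```python
import re

import pandas as pd

# Hypothetical names; the first contains regex metacharacters.
names = ["St. (Saint) Louis", "Nebraska"]
addresses = pd.Series([
    "12 Main Rd. St. (Saint) Louis 63101",
    "Fusce Rd. Frederick Nebraska 20620",
])

# Unescaped: "(Saint)" is read as a second capture group, not literal text.
raw_pat = "|".join(r"\b{}\b".format(x) for x in names)
raw = addresses.str.extract("(" + raw_pat + ")", expand=False)

# Escaped: every name is matched literally, so only the outer group remains.
safe_pat = "|".join(r"\b{}\b".format(re.escape(x)) for x in names)
safe = addresses.str.extract("(" + safe_pat + ")", expand=False)

print(type(raw).__name__, raw.shape)   # two groups, so a two-column DataFrame
print(safe.tolist())
```

Note that \b only matches next to a letter, digit, or underscore, so a name that starts or ends with punctuation would need a different boundary check.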

ah bon Over a year ago

No problem. Thank you. :) But I think the logic should be the same, except that English text has more spaces to separate words.
