
I have two pandas DataFrames, df1 and df2. I need to fill a new column df1['B'] by checking, for each value of df1['A'], whether it is a substring of any value in df2['B']. If there is a match, the new column should hold the corresponding value of df2['A'].

Below are sample DataFrames:

df1

      A                  B           
9.female.ceo.,ceo,       ?
9.female.ned.,ned,
9.female.ned.,chair,
2.female.ed.,ned,
2.female.ned.,ed,
9.female.chair.,ceo,
2.female.chair.,chair,

df2

     A                B
,ceo,ned,          2.male.chair.,ceo,ned,
,chair,ned,        2.male.ned.,chair,ned,  
,ned,              2.female.ed.,ned,
,ceo,chair,        6.female.ed.,ceo,chair,
,ed,ceo,           6.male.chair.,ed,ceo,
,ceo,chair,        9.female.ed.,ceo,chair,
,ceo,ned,          9.female.chair.,ceo,ned,
,chair,(in ft10),  9.male.ceo.,chair,(in ft10),

A plain merge won't work here, because the values in df1['A'] are only substrings of the values in df2['B'], not exact matches.

Any help pointing me in the right direction would be very much appreciated.

Expected results

df1

      A                    B           
9.female.ceo.,ceo,       
9.female.ned.,ned,
9.female.ned.,chair,
2.female.ed.,ned,         ,ned,
2.female.ned.,ed,
9.female.chair.,ceo,      ,ceo,ned,
2.female.chair.,chair,  
Comments

  • What is the size of both DataFrames? Commented Feb 24, 2019 at 13:56
  • df1 is 941 and df2 is 66. Commented Feb 24, 2019 at 14:09

1 Answer

The idea is to split each string on ',' into a set of tokens and match with issubset:

d = {k: set(v.split(',')) for k, v in df2.set_index('A')['B'].items()}
df1['B'] = [next(iter([k for k, v in d.items() if set(x.split(',')).issubset(v)]), '') 
                      for x in df1['A']]
print (df1)
                        A          B
0      9.female.ceo.,ceo,           
1      9.female.ned.,ned,           
2    9.female.ned.,chair,           
3       2.female.ed.,ned,      ,ned,
4       2.female.ned.,ed,           
5    9.female.chair.,ceo,  ,ceo,ned,
6  2.female.chair.,chair,           
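
One thing worth checking in the dictionary-based version: df2['A'] contains repeated values (',ceo,ned,' and ',ceo,chair,' each appear twice), and a dict keeps only the last entry per key. The sketch below uses plain lists in place of the DataFrame columns and a hypothetical df1['A'] value that is not in the sample data, '2.male.chair.,ceo,', to show how an earlier row can be lost. The helper name first_match is made up for illustration.

```python
# Plain lists stand in for df2['A'] and df2['B'] (rows 1, 3 and 7 of the sample).
labels = [",ceo,ned,", ",ned,", ",ceo,ned,"]
values = ["2.male.chair.,ceo,ned,", "2.female.ed.,ned,", "9.female.chair.,ceo,ned,"]

# Dict keyed by label: the second ",ceo,ned," entry overwrites the first.
by_label = {k: set(v.split(",")) for k, v in zip(labels, values)}

# List of (label, token set) pairs: every row survives.
pairs = [(k, set(v.split(","))) for k, v in zip(labels, values)]


def first_match(needle, candidates):
    """Return the first label whose token set contains every token of needle, else ''."""
    tokens = set(needle.split(","))
    return next((k for k, v in candidates if tokens.issubset(v)), "")


needle = "2.male.chair.,ceo,"  # hypothetical df1['A'] value, not in the sample
from_dict = first_match(needle, by_label.items())  # '' (the matching row was overwritten)
from_pairs = first_match(needle, pairs)            # ',ceo,ned,'
```

With the sample data this makes no difference to the result shown above, but if df2['A'] has duplicates in the real data, iterating over pairs (for example zip(df2['A'], df2['B'])) avoids silently dropping rows.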

A solution that tests for a substring with in:

d = df2.set_index('A')['B']
df1['B'] = [next(iter([k for k, v in d.items() if x in v]), '')  for x in df1['A']]
print (df1)
                        A          B
0      9.female.ceo.,ceo,           
1      9.female.ned.,ned,           
2    9.female.ned.,chair,           
3       2.female.ed.,ned,      ,ned,
4       2.female.ned.,ed,           
5    9.female.chair.,ceo,  ,ceo,ned,
6  2.female.chair.,chair,           
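
The issubset and in tests give the same result on the sample data, but they are not equivalent in general: issubset compares whole comma-separated tokens regardless of order, while in requires one contiguous run of characters. A minimal sketch of the difference, using the second sample row of df2['B'] and two hypothetical df1['A'] values that are not in the question:

```python
haystack = "2.male.ned.,chair,ned,"  # second sample row of df2['B']

# Hypothetical value whose tokens all occur in haystack, but not contiguously.
scattered = "2.male.ned.,ned,"
scattered_substring = scattered in haystack                              # False
scattered_subset = set(scattered.split(",")).issubset(haystack.split(","))  # True

# Hypothetical value that is a contiguous slice but cuts a token in half.
partial = "male.ned.,chair,"
partial_substring = partial in haystack                                  # True
partial_subset = set(partial.split(",")).issubset(haystack.split(","))      # False
```

Since the question asks for a substring match, the in test is the more literal reading; the issubset version is the one to pick if token order should not matter.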

Another solution: build a cross join with merge, then keep the rows that pass a substring test with in:

df3 = df1.assign(tmp=1).merge(df2.assign(tmp=1), on='tmp', suffixes=('','_'))
df3 = df3.loc[[a in b for a, b in zip(df3['A'], df3['B_'])], ['A','A_']]

df = df1[['A']].merge(df3.rename(columns={'A_':'B'}), on='A', how='left')
print (df)
                        A          B
0      9.female.ceo.,ceo,        NaN
1      9.female.ned.,ned,        NaN
2    9.female.ned.,chair,        NaN
3       2.female.ed.,ned,      ,ned,
4       2.female.ned.,ed,        NaN
5    9.female.chair.,ceo,  ,ceo,ned,
6  2.female.chair.,chair,        NaN
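
On pandas 1.2 or later, the temporary tmp key can probably be replaced by merge(how='cross'). The following is a sketch under that assumption, rebuilding the sample frames so it runs on its own. The names pairs, hits and result are illustrative, and keeping only the first match per df1['A'] value via drop_duplicates is an added assumption about what should happen when several df2 rows match.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [
    "9.female.ceo.,ceo,", "9.female.ned.,ned,", "9.female.ned.,chair,",
    "2.female.ed.,ned,", "2.female.ned.,ed,", "9.female.chair.,ceo,",
    "2.female.chair.,chair,",
]})
df2 = pd.DataFrame({
    "A": [",ceo,ned,", ",chair,ned,", ",ned,", ",ceo,chair,",
          ",ed,ceo,", ",ceo,chair,", ",ceo,ned,", ",chair,(in ft10),"],
    "B": ["2.male.chair.,ceo,ned,", "2.male.ned.,chair,ned,",
          "2.female.ed.,ned,", "6.female.ed.,ceo,chair,",
          "6.male.chair.,ed,ceo,", "9.female.ed.,ceo,chair,",
          "9.female.chair.,ceo,ned,", "9.male.ceo.,chair,(in ft10),"],
})

# Cross join (pandas >= 1.2): every df1 row paired with every df2 row.
# df2's 'A' becomes 'A_' because of the suffixes; 'B' comes from df2 only.
pairs = df1[["A"]].merge(df2, how="cross", suffixes=("", "_"))

# Keep the pairs where df1['A'] is a substring of df2['B'].
mask = [a in b for a, b in zip(pairs["A"], pairs["B"])]

# Assumption: if several df2 rows match, keep only the first one.
hits = pairs.loc[mask, ["A", "A_"]].drop_duplicates("A")

# Left merge back so unmatched df1 rows are kept, then blank out the NaNs.
result = df1[["A"]].merge(hits.rename(columns={"A_": "B"}), on="A", how="left")
result["B"] = result["B"].fillna("")
```

A cross join materialises len(df1) * len(df2) rows, which is small for the sizes mentioned in the comments (941 by 66) but grows quickly for larger frames.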

2 Comments

Once again, thank you so much for your help. I always learn a lot from your solutions. Any recommendations for online materials or books to dive deep into pandas? Thanks!
@rescot - Hard question. I think Modern Pandas is nice :)
