import pandas as pd

df = pd.DataFrame({'company': ['ABC', 'ABC', 'XYZ', 'XYZ'],
                   'tin': ['5555', '1111', '5555', '2222']})

I don't know how to build a new column by grouping on the 'tin' column where its values are equal, working from a large dataset.

Desired result:

df = pd.DataFrame({'company': ['ABC', 'ABC', 'XYZ', 'XYZ'],
                   'tin': ['5555', '1111', '5555', '2222'],
                   'column': ['text ABC and XYZ', None, 'text ABC and XYZ', None]})
  • "if values is equal from the large dataset" - What does the large DataFrame look like? Commented Nov 12, 2020 at 9:15
  • shanelynn.ie/… Commented Nov 12, 2020 at 11:59

1 Answer


I believe you need:

df1 = pd.DataFrame({ 'tin': ['5555', '5555'], 
                   'name' : 'AAA,BBB'.split(',')})

print (df1)
    tin name
0  5555  AAA
1  5555  BBB

df2 = pd.DataFrame({'company' : 'ABC,ABC,XYZ,XYZ,ABC,ABC,XYZ,XYZ'.split(','), 
                   'tin': ['5555', '1111', '5555', '2222', '5555', '1111', '5555', '2222'], 
                   'name' : 'AAA,AAA,AAA,AAA,BBB,BBB,BBB,BBB'.split(',')})

print (df2)
  company   tin name
0     ABC  5555  AAA
1     ABC  1111  AAA
2     XYZ  5555  AAA
3     XYZ  2222  AAA
4     ABC  5555  BBB
5     ABC  1111  BBB
6     XYZ  5555  BBB
7     XYZ  2222  BBB

First, use DataFrame.merge with a left join (how='left') and indicator=True to test which rows of df2 match the first DataFrame, df1:

df = df2.merge(df1, on=['tin','name'], how='left', indicator=True)
print (df)
  company   tin name     _merge
0     ABC  5555  AAA       both
1     ABC  1111  AAA  left_only
2     XYZ  5555  AAA       both
3     XYZ  2222  AAA  left_only
4     ABC  5555  BBB       both
5     ABC  1111  BBB  left_only
6     XYZ  5555  BBB       both
7     XYZ  2222  BBB  left_only
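
One caveat, offered as a hedged sketch rather than as part of the original answer: a left join repeats rows of df2 if df1 holds the same ('tin', 'name') pair more than once. Assuming that can happen in the real data, the key pairs can be de-duplicated first and the join checked with the validate parameter of merge. The names keys, merged and n_matched are introduced here only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'tin': ['5555', '5555'], 'name': ['AAA', 'BBB']})
df2 = pd.DataFrame({
    'company': ['ABC', 'ABC', 'XYZ', 'XYZ', 'ABC', 'ABC', 'XYZ', 'XYZ'],
    'tin': ['5555', '1111', '5555', '2222', '5555', '1111', '5555', '2222'],
    'name': ['AAA', 'AAA', 'AAA', 'AAA', 'BBB', 'BBB', 'BBB', 'BBB'],
})

# Keep each ('tin', 'name') pair once so the join cannot multiply rows of df2.
keys = df1[['tin', 'name']].drop_duplicates()

# validate='many_to_one' makes merge raise if the right-hand keys are not unique.
merged = df2.merge(keys, on=['tin', 'name'], how='left',
                   indicator=True, validate='many_to_one')

# Number of rows of df2 whose key pair was found in df1.
n_matched = int(merged['_merge'].eq('both').sum())
print(len(merged), n_matched)
```

With unique keys on the right, the merged frame has exactly as many rows as df2, so the _merge column lines up row for row with the original data.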

Then keep only the rows marked both, using boolean indexing:

df = df[df['_merge'].eq('both')]
print (df)
  company   tin name _merge
0     ABC  5555  AAA   both
2     XYZ  5555  AAA   both
4     ABC  5555  BBB   both
6     XYZ  5555  BBB   both
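
As an alternative to merge plus filtering, the same rows can probably be selected with a two-column membership test. This is a different technique from the one above (MultiIndex.isin instead of a join), shown as a minimal sketch; the names wanted, mask and matched are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({'tin': ['5555', '5555'], 'name': ['AAA', 'BBB']})
df2 = pd.DataFrame({
    'company': ['ABC', 'ABC', 'XYZ', 'XYZ', 'ABC', 'ABC', 'XYZ', 'XYZ'],
    'tin': ['5555', '1111', '5555', '2222', '5555', '1111', '5555', '2222'],
    'name': ['AAA', 'AAA', 'AAA', 'AAA', 'BBB', 'BBB', 'BBB', 'BBB'],
})

# Build one MultiIndex of ('tin', 'name') pairs per frame.
wanted = pd.MultiIndex.from_frame(df1[['tin', 'name']])
pairs = pd.MultiIndex.from_frame(df2[['tin', 'name']])

# Boolean array: True where the row's pair appears in df1.
mask = pairs.isin(wanted)

matched = df2[mask]
print(matched)
```

Unlike the merge, this keeps the original index of df2 and adds no helper column, which may be convenient when the mask is reused later.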

Finally, aggregate company by both key columns and assign the result back with DataFrame.join:

s = df.groupby(['tin','name'])['company'].agg(' and '.join).rename('new')
df = df2.join(s, on=['tin','name'])
print (df)
  company   tin name          new
0     ABC  5555  AAA  ABC and XYZ
1     ABC  1111  AAA          NaN
2     XYZ  5555  AAA  ABC and XYZ
3     XYZ  2222  AAA          NaN
4     ABC  5555  BBB  ABC and XYZ
5     ABC  1111  BBB          NaN
6     XYZ  5555  BBB  ABC and XYZ
7     XYZ  2222  BBB          NaN
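
The three steps can also be condensed, as a sketch under stated assumptions: GroupBy.transform broadcasts the joined string to every row of its group, and Series.where blanks out the rows that did not match df1. The helper join_unique is hypothetical and assumes a company name repeated within one group should be listed only once; is_match, joined and result are illustrative names as well.

```python
import pandas as pd

df1 = pd.DataFrame({'tin': ['5555', '5555'], 'name': ['AAA', 'BBB']})
df2 = pd.DataFrame({
    'company': ['ABC', 'ABC', 'XYZ', 'XYZ', 'ABC', 'ABC', 'XYZ', 'XYZ'],
    'tin': ['5555', '1111', '5555', '2222', '5555', '1111', '5555', '2222'],
    'name': ['AAA', 'AAA', 'AAA', 'AAA', 'BBB', 'BBB', 'BBB', 'BBB'],
})


def join_unique(companies):
    # Hypothetical helper: join names with ' and ', keeping each name once
    # in first-seen order (dict.fromkeys preserves insertion order).
    return ' and '.join(dict.fromkeys(companies))


# Flag rows of df2 whose ('tin', 'name') pair occurs in df1.
keys = df1[['tin', 'name']].drop_duplicates()
is_match = (df2.merge(keys, on=['tin', 'name'], how='left', indicator=True)
               ['_merge'].eq('both').to_numpy())

# One joined string per group, repeated on every row of that group.
joined = df2.groupby(['tin', 'name'])['company'].transform(join_unique)

# Keep the joined string only on matched rows; the rest become NaN.
result = df2.assign(new=joined.where(is_match))
print(result)
```

Passing the mask as a NumPy array avoids any index alignment between the merged frame and df2; this relies on the left join preserving the row order of df2.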

4 Comments

Thanks. In dfbig, 'column' ends up with duplicated 'company' values. How can I add a second column for the "unique" rows? df['column'] = (df['tin'].map(df[df['tin'].isin([vals, vals_2])].groupby('tin')['company'].agg(' and '.join)))
If we have two columns, 'tin' and 'name', to group by: import pandas as pd df = pd.DataFrame({'company': ['ABC', 'ABC', 'XYZ', 'XYZ', 'ABC', 'ABC', 'XYZ', 'XYZ'], 'tin': ['5555', '1111', '5555', '2222', '5555', '1111', '5555', '2222'], 'name': ['AAA', 'AAA', 'AAA', 'AAA', 'BBB', 'BBB', 'BBB', 'BBB']})
I think 'column' : [ABC and XYZ, NaN, NaN, ABC and XYZ, ABC and XYZ, NaN, NaN, ABC and XYZ]. Thus, ABC with 5555 AAA doesn't intersect XYZ with 5555 BBB.
Dear jezrael, yes. But, more precisely, it's not duplicated; it's just not the target. We should keep every row and add a new column holding the aggregated info from the 'company' column wherever 'tin' and 'name' both match. I tried df['column'] = (df['tin','name'].map(df[df['tin','name'].isin{'tin':vals ,'name': vals2}] .groupby('tin','name')['company'].agg(' and '.join))) but it does not work.
