1

I have two pandas data frames (see below).I want to merge them based on the id (Dataframe1) and localid(Dataframe2). This code is not working; it creates additional rows in dfmerged as Dataframe2 may contains multiple same localid(e.g., D3). How can I merge these two dataframes and set the value of the 'color' column as NaN if the localid does not exists in the first dataframe (DataFrame1)?

dfmerged = pd.merge(df1, df2, left_on='id', right_on='localid')

enter image description here

2
  • Can you add desired output? Commented Oct 6, 2016 at 9:15
  • 1
    You'll first need to de-duplicate the ids in df2 by combining the colours into a single list, secondly you need to pass how='outer' if you want all ids in the final merged df by default it's inner so only ids that are present in both will be merged Commented Oct 6, 2016 at 9:15

2 Answers 2

2

I think you need groupby and sum values in list in df2 and then use merge with drop column localid:

df1 = pd.DataFrame({'id':['D1','D2','D3','D4','D5','D6'],
                   'Field1':[12,15,11,7,55,8.8]})

print (df1)
   Field1  id
0    12.0  D1
1    15.0  D2
2    11.0  D3
3     7.0  D4
4    55.0  D5
5     8.8  D6

df2 = pd.DataFrame({'localid':['D1','D2','D3','D3','D9'],
                   'color':[['b'],['a'],['a','b'],['s','d'], ['a']]})

print (df2)
    color localid
0     [b]      D1
1     [a]      D2
2  [a, b]      D3
3  [s, d]      D3
4     [a]      D9
df2 = df2.groupby('localid', as_index=False)['color'].sum()
print (df2)
  localid         color
0      D1           [b]
1      D2           [a]
2      D3  [a, b, s, d]
3      D9           [a]


dfmerged = pd.merge(df1, 
                    df2, 
                    left_on='id', 
                    right_on='localid', 
                    how='left')
             .drop('localid', axis=1)

print (dfmerged)
   Field1  id         color
0    12.0  D1           [b]
1    15.0  D2           [a]
2    11.0  D3  [a, b, s, d]
3     7.0  D4           NaN
4    55.0  D5           NaN
5     8.8  D6           NaN
Sign up to request clarification or add additional context in comments.

4 Comments

the D3 color values should be [a,b,s,d].
I am just about to add the groupby statement, you are quick! thanks :)
Thank you for accepting! And small advice - check How to make good reproducible pandas examples and dont use pictures, because then is impossible copy data. ;)
just started with python programming, thanks for the examples ^O^
0

You should probably simplify df2 to have no repeating keys, and then tell pd.merge to use union of keys from both frames (with how:'outer'):

import pandas as pd
df1 = pd.DataFrame({    'id':['D1','D2','D3','D4','D5','D6'],
                    'Field1':[  12,  15,  11,   7,  55, 8.8]})
df2 = pd.DataFrame({'localid':['D1','D2','D3','D3','D9'],
                      'color':[['blue','grey'],
                               ['yellow'],
                               ['black','red','green'],
                               ['white'],
                               ['blue']]})
dfmerged = pd.merge(df1, df2, left_on='id', right_on='localid')
dfmerged2 = pd.merge(df1, df2, left_on='id', right_on='localid', how='outer')

Which results in:

>>> dfmerged2
   Field1   id                color localid
0    12.0   D1         [blue, grey]      D1
1    15.0   D2             [yellow]      D2
2    11.0   D3  [black, red, green]      D3
3    11.0   D3              [white]      D3
4     7.0   D4                  NaN     NaN
5    55.0   D5                  NaN     NaN
6     8.8   D6                  NaN     NaN
7     NaN  NaN               [blue]      D9

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.