Python: How to merge two data frames where the values are not unique

Question

I have two data frames:

import pandas as pd
a = pd.DataFrame( { 'port':[1,1,0,1,0], 'cd':[1,2,3,2,1], 'date':["2014-02-26","2014-02-25","2014-02-26","2014-02-26","2014-02-25"] } )
b = pd.DataFrame( { 'port':[0,1,0,1,0], 'fac':[2,1,2,2,3], 'date': ["2014-02-25","2014-02-25","2014-02-26","2014-02-26","2014-02-27"] } )

What I need to do is take every date-port pair (say, port 0 and date 2014-02-25), look up the fac value in b, and fill it into a new column in a. The output should therefore look like:

port cd date         fac 
1    1  "2014-02-26" 2
1    2  "2014-02-25" 1
... (so on) ...

I tried merging the frames on both date and port but got an error, which I think is because the data frames have different sizes, and I didn't really expect it to work anyway.

Abhi · Accepted Answer · 2018-09-26 14:08:30Z

If you want to combine the two dataframes, use merge. Called without the on argument, it joins on the columns both frames share (here, port and date):

import pandas as pd
a = pd.DataFrame( { 'port':[1,1,0,1,0], 'cd':[1,2,3,2,1], 
         'date':["2014-02-26","2014-02-25","2014-02-26","2014-02-26","2014-02-25"]})

b = pd.DataFrame( { 'port':[0,1,0,1,0], 'fac':[2,1,2,2,3], 
         'date': ["2014-02-25","2014-02-25","2014-02-26","2014-02-26","2014-02-27"]})

df = a.merge(b)
print(df)

Output:

   port  cd        date  fac
0     1   1  2014-02-26    2
1     1   2  2014-02-26    2
2     1   2  2014-02-25    1
3     0   3  2014-02-26    2
4     0   1  2014-02-25    2
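
Since the question asks for a new fac column on a, a left merge may be closer to the goal than the default inner merge. The following is a minimal sketch, assuming each (port, date) pair occurs at most once in b. The explicit on and how arguments and the name left are additions for illustration, not part of the answer above.

```python
import pandas as pd

a = pd.DataFrame({
    'port': [1, 1, 0, 1, 0],
    'cd': [1, 2, 3, 2, 1],
    'date': ["2014-02-26", "2014-02-25", "2014-02-26", "2014-02-26", "2014-02-25"],
})
b = pd.DataFrame({
    'port': [0, 1, 0, 1, 0],
    'fac': [2, 1, 2, 2, 3],
    'date': ["2014-02-25", "2014-02-25", "2014-02-26", "2014-02-26", "2014-02-27"],
})

# how='left' keeps every row of a, in a's original order.
# A (port, date) pair with no match in b would get NaN in fac.
left = a.merge(b, on=['port', 'date'], how='left')
print(left)
```

With the sample data every pair in a has a match in b, so left has the same five rows as a plus the fac column.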

edited Sep 26, 2018 at 14:08

answered Aug 11, 2018 at 17:13

Abhi


nandoquintana · Accepted Answer · 2018-08-11 18:05:17Z

I recommend creating a new column in dataframe A and populating it with numpy.vectorize:

import pandas as pd
import numpy as np

A = pd.DataFrame({'port': [1, 1, 0, 1, 0], 'cd': [1, 2, 3, 2, 1], 'date': ["2014-02-26", "2014-02-25", "2014-02-26", "2014-02-26", "2014-02-25"]})
B = pd.DataFrame({'port': [0, 1, 0, 1, 0], 'fac': [2, 1, 2, 2, 3], 'date': ["2014-02-25", "2014-02-25", "2014-02-26", "2014-02-26", "2014-02-27"]})

Set up a MultiIndex on dataframe B so that rows can be looked up by "date" and "port":

C = B.set_index(['date', 'port'])

Then, create the function that will be applied to each row in dataframe A:

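def get_fac(date, port):
    # Look up fac for one (date, port) pair in the indexed frame C.
    try:
        return C.loc[date].loc[port]['fac']
    except KeyError:
        # No matching pair in B. Note that np.vectorize infers its output
        # dtype from the first result, so this '' fallback may fail to
        # convert if the first row returned an integer.
        return ''

# Apply get_fac element-wise over the two columns of A.
A['fac'] = np.vectorize(get_fac)(A['date'], A['port'])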

This is the output:

   cd        date  port  fac
0   1  2014-02-26     1    2
1   2  2014-02-25     1    1
2   3  2014-02-26     0    2
3   2  2014-02-26     1    2
4   1  2014-02-25     0    2
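
The same indexed lookup can probably be done without a per-row Python function. This is a sketch of an alternative, not the method described above: it swaps np.vectorize for DataFrame.join, which aligns columns of A against the MultiIndex of C. The name joined is hypothetical.

```python
import pandas as pd

A = pd.DataFrame({
    'port': [1, 1, 0, 1, 0],
    'cd': [1, 2, 3, 2, 1],
    'date': ["2014-02-26", "2014-02-25", "2014-02-26", "2014-02-26", "2014-02-25"],
})
B = pd.DataFrame({
    'port': [0, 1, 0, 1, 0],
    'fac': [2, 1, 2, 2, 3],
    'date': ["2014-02-25", "2014-02-25", "2014-02-26", "2014-02-26", "2014-02-27"],
})

# Index B by the lookup keys, as in the answer above.
C = B.set_index(['date', 'port'])

# join matches A's 'date' and 'port' columns against C's index levels.
# It is a left join by default, so unmatched pairs would get NaN in fac.
joined = A.join(C, on=['date', 'port'])
print(joined)
```

Because join is a left join, A keeps all of its rows and their order, and a missing pair yields NaN instead of an empty string.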

jezrael · Accepted Answer · 2018-08-11 17:17:46Z

I believe you need drop_duplicates with merge, so that each port/date pair in a is kept only once:

cols = ['port','date']
df = a.drop_duplicates(cols).merge(b, on=cols)
print(df)
   port  cd        date  fac
0     1   1  2014-02-26    2
1     1   2  2014-02-25    1
2     0   3  2014-02-26    2
3     0   1  2014-02-25    2

But if you want all combinations of the duplicated pairs, merge without dropping duplicates:

cols = ['port','date']
df1 = a.merge(b, on=cols)
print(df1)
   port  cd        date  fac
0     1   1  2014-02-26    2
1     1   2  2014-02-26    2
2     1   2  2014-02-25    1
3     0   3  2014-02-26    2
4     0   1  2014-02-25    2
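
Before choosing between the two variants, it may help to check which frame actually repeats a (port, date) pair. The sketch below assumes the same a and b as in the question. The names a_dupes, b_dupes and checked are hypothetical, and the validate argument is an addition that is not used in the answer above.

```python
import pandas as pd

a = pd.DataFrame({
    'port': [1, 1, 0, 1, 0],
    'cd': [1, 2, 3, 2, 1],
    'date': ["2014-02-26", "2014-02-25", "2014-02-26", "2014-02-26", "2014-02-25"],
})
b = pd.DataFrame({
    'port': [0, 1, 0, 1, 0],
    'fac': [2, 1, 2, 2, 3],
    'date': ["2014-02-25", "2014-02-25", "2014-02-26", "2014-02-26", "2014-02-27"],
})
cols = ['port', 'date']

# duplicated(cols) flags rows whose key pair already appeared earlier.
a_dupes = bool(a.duplicated(cols).any())
b_dupes = bool(b.duplicated(cols).any())
print(a_dupes, b_dupes)

# validate='many_to_one' makes merge raise pandas.errors.MergeError
# if the keys in b turn out not to be unique.
checked = a.merge(b, on=cols, validate='many_to_one')
print(checked)
```

In the sample data only a repeats a pair (port 1 on 2014-02-26), so each row of a matches at most one row of b and the merge does not multiply rows.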

answered Aug 11, 2018 at 17:17

jezrael
