Outer join in python Pandas

Question

I have two data sets as follows:

A         B
IDs      IDs
1        1
2        2
3        5
4        7

How can I apply a join in pandas or NumPy that gives me all the data from B that is not present in A? Something like the following:

B
IDs
5
7

I know it can be done with a for loop, but I don't want that, since my real data has millions of rows. I am not sure how to use pandas or NumPy here. I tried something like the following:

pd.merge(A, B, on='ids', how='right')

Thanks

What is the expected output? The column names seem to be A and B and not IDs ... this is misleading.
– Colonel Beauvel, Commented Jun 7, 2016 at 13:08

Divakar · Accepted Answer · 2016-06-07 13:26:53Z

3

You can use NumPy's setdiff1d, like so -

np.setdiff1d(B['IDs'],A['IDs'])

Also, np.in1d could be used for the same effect, like so -

B[~np.in1d(B['IDs'],A['IDs'])]

Note that np.setdiff1d returns a sorted NumPy array of the unique values, whereas the np.in1d mask keeps the matching rows of B in their original order.

Sample run -

>>> A = pd.DataFrame([1,2,3,4],columns=['IDs'])
>>> B = pd.DataFrame([1,7,5,2],columns=['IDs'])
>>> np.setdiff1d(B['IDs'],A['IDs'])
array([5, 7])
>>> B[~np.in1d(B['IDs'],A['IDs'])]
   IDs
1    7
2    5
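On recent NumPy versions, np.in1d is deprecated in favor of np.isin, so the same mask can be built as in the minimal sketch below. It reuses the A and B frames from the sample run; the names mask and only_in_B are only illustrative.

```python
import numpy as np
import pandas as pd

A = pd.DataFrame([1, 2, 3, 4], columns=['IDs'])
B = pd.DataFrame([1, 7, 5, 2], columns=['IDs'])

# Boolean mask: True where an ID from B does not appear in A
mask = ~np.isin(B['IDs'], A['IDs'])

# Keep only the rows of B whose ID is missing from A (original order preserved)
only_in_B = B[mask]
print(only_in_B)
```

As with np.in1d, the rows come back in B's order and keep B's index labels.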

edited Jun 7, 2016 at 13:26

answered Jun 7, 2016 at 13:15

Divakar


3 Comments

Manu Sharma Over a year ago

Thank you so much! But despite several attempts, I am receiving the error "List indices must be integers not lists".

Divakar Over a year ago

@manusharma So, do you have anything else apart from integers in that column of IDs, like strings maybe or integers as strings?

Manu Sharma Over a year ago

I have two large lists/DataFrames; some of the values are longs and some are integers. I tried map(int, dataset) to convert them all to one type, but I still get the same error: "List indices must be integers not lists".

jezrael · Accepted Answer · 2016-06-07 13:08:55Z

2

You can use merge with the indicator parameter and then boolean indexing. Finally, you can drop the _merge column:

A = pd.DataFrame({'IDs':[1,2,3,4],
                   'B':[4,5,6,7],
                   'C':[1,8,9,4]})
print (A)
   B  C  IDs
0  4  1    1
1  5  8    2
2  6  9    3
3  7  4    4

B = pd.DataFrame({'IDs':[1,2,5,7],
                   'A':[1,8,3,7],
                   'D':[1,8,9,4]})

print (B)
   A  D  IDs
0  1  1    1
1  8  8    2
2  3  9    5
3  7  4    7

df = (pd.merge(A, B, on='IDs', how='outer', indicator=True))
df = df[df._merge == 'right_only']

df = df.drop('_merge', axis=1)
print (df)
    B   C  IDs    A    D
4 NaN NaN  5.0  3.0  9.0
5 NaN NaN  7.0  7.0  4.0
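In the output above, the outer merge fills A's columns with NaN and turns the integer columns into floats. If only B's rows and columns are wanted, one possible variant (a sketch, not part of the original answer) is to left-merge B against just the key column of A and keep the 'left_only' rows. The names merged and only_in_B are illustrative.

```python
import pandas as pd

A = pd.DataFrame({'IDs': [1, 2, 3, 4],
                  'B': [4, 5, 6, 7],
                  'C': [1, 8, 9, 4]})
B = pd.DataFrame({'IDs': [1, 2, 5, 7],
                  'A': [1, 8, 3, 7],
                  'D': [1, 8, 9, 4]})

# Merge B against A's key column only, so no NaN-filled columns are introduced;
# drop_duplicates guards against repeated keys in A multiplying rows of B
merged = B.merge(A[['IDs']].drop_duplicates(), on='IDs', how='left', indicator=True)

# 'left_only' marks rows whose ID exists in B but not in A
only_in_B = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(only_in_B)
```

Because no missing values are created, the IDs column should keep its integer dtype.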

answered Jun 7, 2016 at 13:08

jezrael


Kurt Peek · Accepted Answer · 2016-06-07 13:28:39Z

1

You could convert the data series to sets and take the difference:

import pandas as pd

df=pd.DataFrame({'A' : [1,2,3,4], 'B' : [1,2,5,7]})
A=set(df['A'])  
B=set(df['B'])
C=pd.DataFrame({'C' : list(B-A)})   # Take difference and convert back to DataFrame

The variable "C" then yields

   C
0  5
1  7
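One caveat: Python sets are unordered and drop duplicates, so the row order of the result is not guaranteed in general. A small sketch that sorts the difference to get a deterministic result, assuming sorted output is acceptable:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [1, 2, 5, 7]})

# Values present in column B but not in column A
diff = set(df['B']) - set(df['A'])

# Sort before building the DataFrame so the row order is deterministic
C = pd.DataFrame({'C': sorted(diff)})
print(C)
```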

answered Jun 7, 2016 at 13:28

Kurt Peek


Alex Petralia · Accepted Answer · 2016-06-07 14:23:28Z

1

You can simply use pandas' .isin() method:

df = pd.DataFrame({'A' : [1,2,3,4], 'B' : [1,2,5,7]})
df[~df['B'].isin(df['A'])]

If these are separate DataFrames:

a = pd.DataFrame({'IDs' : [1,2,3,4]})
b = pd.DataFrame({'IDs' : [1,2,5,7]})
b[~b['IDs'].isin(a['IDs'])]

Output:

   IDs
2    5
3    7

answered Jun 7, 2016 at 14:23

Alex Petralia
