1

I have two data sets as following

A         B
IDs      IDs
1        1
2        2
3        5
4        7

How in Pandas, Numpy we can apply a join which can give me all the data from B, which is not present in A Something like Following

B
Ids
5
7

I know it can be done with for loop, but that I don't want, since my real data is in millions, and I am really not sure how to use Panda Numpy here, something like following

pd.merge(A, B, on='ids', how='right')

Thanks

2
  • try instead of right you can specify outer. Commented Jun 7, 2016 at 13:04
  • what is the expected output? The column names seem to be A and B and not IDs ... this is misleading. Commented Jun 7, 2016 at 13:08

4 Answers 4

3

You can use NumPy's setdiff1d, like so -

np.setdiff1d(B['IDs'],A['IDs'])

Also, np.in1d could be used for the same effect, like so -

B[~np.in1d(B['IDs'],A['IDs'])]

Please note that np.setdiff1d would give us a sorted NumPy array as output.

Sample run -

>>> A = pd.DataFrame([1,2,3,4],columns=['IDs'])
>>> B = pd.DataFrame([1,7,5,2],columns=['IDs'])
>>> np.setdiff1d(B['IDs'],A['IDs'])
array([5, 7])
>>> B[~np.in1d(B['IDs'],A['IDs'])]
   IDs
1    7
2    5
Sign up to request clarification or add additional context in comments.

3 Comments

Thank You so much! But despite of my several attempts: "I am receiving error, List indices must be integers not lists"
@manusharma So, do you have anything else apart from integers in that column of IDs, like strings maybe or integers as strings?
I have two large Lists/ Dataframe, some of them are long, Integers, I tried to use Map(int, dataset) to convert all in one, still the same error List Indices must be integers not lists
2

You can use merge with parameter indicator and then boolean indexing. Last you can drop column _merge:

A = pd.DataFrame({'IDs':[1,2,3,4],
                   'B':[4,5,6,7],
                   'C':[1,8,9,4]})
print (A)
   B  C  IDs
0  4  1    1
1  5  8    2
2  6  9    3
3  7  4    4

B = pd.DataFrame({'IDs':[1,2,5,7],
                   'A':[1,8,3,7],
                   'D':[1,8,9,4]})

print (B)
   A  D  IDs
0  1  1    1
1  8  8    2
2  3  9    5
3  7  4    7

df = (pd.merge(A, B, on='IDs', how='outer', indicator=True))
df = df[df._merge == 'right_only']

df = df.drop('_merge', axis=1)
print (df)
    B   C  IDs    A    D
4 NaN NaN  5.0  3.0  9.0
5 NaN NaN  7.0  7.0  4.0

Comments

1

You could convert the data series to sets and take the difference:

import pandas as pd

df=pd.DataFrame({'A' : [1,2,3,4], 'B' : [1,2,5,7]})
A=set(df['A'])  
B=set(df['B'])
C=pd.DataFrame({'C' : list(B-A)})   # Take difference and convert back to DataFrame 

The variable "C" then yields

   C
0  5
1  7

Comments

1

You can simply use pandas' .isin() method:

df = pd.DataFrame({'A' : [1,2,3,4], 'B' : [1,2,5,7]})
df[~df['B'].isin(df['A'])]

If these are separate DataFrames:

a = pd.DataFrame({'IDs' : [1,2,3,4]})
b = pd.DataFrame({'IDs' : [1,2,5,7]})
b[~b['IDs'].isin(a['IDs'])]

Output:

   IDs
2    5
3    7

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.