DataFrame remove rows existing in another DataFrame

Question

I have two data frames:

df1:

+----------+-------------+-------------+--------------+---------------+
|customerId|     fullName|   telephone1|    telephone2|          email|
+----------+-------------+-------------+--------------+---------------+
|    201534|MARIO JIMENEZ|01722-3500391|+5215553623333|[email protected]|
|    879535|  MARIO LOPEZ|01722-3500377|+5215553623333| [email protected]|
+----------+-------------+-------------+--------------+---------------+

df2:

+----------+-------------+-------------+--------------+---------------+
|customerId|     fullName|   telephone1|    telephone2|          email|
+----------+-------------+-------------+--------------+---------------+
|    201534|MARIO JIMENEZ|01722-3500391|+5215553623333|[email protected]|
|    201536|  ROBERT MITZ|01722-3500377|+5215553623333| [email protected]|
|    201537|     MARY ENG|01722-3500127|+5215553623111|[email protected]|
|    201538|    RICK BURT|01722-3500983|+5215553623324|[email protected]|
|    201539|     JHON DOE|01722-3502547|+5215553621476|[email protected]|
+----------+-------------+-------------+--------------+---------------+

I need a third DataFrame containing the rows from df1 that do not exist in df2.

Like this:

+----------+-------------+-------------+--------------+---------------+
|customerId|     fullName|   telephone1|    telephone2|          email|
+----------+-------------+-------------+--------------+---------------+
|    879535|  MARIO LOPEZ|01722-3500377|+5215553623333| [email protected]|
+----------+-------------+-------------+--------------+---------------+

What is the correct way of doing this?

I've already tried the following:

diff = df2.join(df1, df2['customerId'] != df1['customerId'],"left")

diff = df1.subtract(df2)

diff = df1[~ df1['customerId'].isin(df2['customerId'])]

But none of them work. Any suggestions?

In general, it will be easier for people to help if you can provide code to generate your dataframes.
– ASGM, Commented Sep 17, 2021 at 21:58
your "like this" example is of the ones that do exist in df2 however you say your "need" is "that does not exist in df2" Please resolve the contradiction or we cannot like this.
– Abel, Commented Sep 17, 2021 at 22:00

Corralien · Accepted Answer · 2021-09-17 22:08:36Z

3

With pandas, you can use merge with indicator=True:

df3 = df1.merge(df2, on=df1.columns.tolist(), how='left', indicator=True)
df3 = df3[df3['_merge'] == 'left_only'].drop(columns='_merge')

Output:

>>> df3
   customerId     fullName     telephone1     telephone2           email
1      879535  MARIO LOPEZ  01722-3500377  5215553623333  [email protected]
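
For anyone who wants to run this end to end, here is a minimal sketch that rebuilds cut-down versions of the two frames and applies the merge above. It assumes pandas is installed; the email column is left out and df2 is shortened to two rows, so the data is illustrative only. The `by_key` variant is an addition, not part of the answer above: it compares on customerId alone instead of on every column.

```python
import pandas as pd

# Cut-down versions of df1 and df2 from the question (email column omitted).
columns = ["customerId", "fullName", "telephone1", "telephone2"]
df1 = pd.DataFrame(
    [
        (201534, "MARIO JIMENEZ", "01722-3500391", "+5215553623333"),
        (879535, "MARIO LOPEZ", "01722-3500377", "+5215553623333"),
    ],
    columns=columns,
)
df2 = pd.DataFrame(
    [
        (201534, "MARIO JIMENEZ", "01722-3500391", "+5215553623333"),
        (201536, "ROBERT MITZ", "01722-3500377", "+5215553623333"),
    ],
    columns=columns,
)

# Full-row comparison: a df1 row is dropped only if every column matches a df2 row.
merged = df1.merge(df2, on=columns, how="left", indicator=True)
by_row = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

# Key-only comparison: a df1 row is dropped if its customerId appears in df2.
by_key = df1[~df1["customerId"].isin(df2["customerId"])]

print(by_row)
print(by_key)
```

The two variants agree on this data, but they differ whenever a customerId appears in both frames with different values in the other columns: the full-row merge keeps such a row, the key-only filter drops it.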


sophocles · Accepted Answer · 2021-09-17 22:18:47Z

2

Using pyspark:

You can create a list containing the customerId from DF2 with collect():

from pyspark.sql.functions import col  # col() is used in the filter below
id_df2 = [row[0] for row in df2.select('customerId').distinct().collect()]

And then filter your DF1 customerId using isin with negation ~:

diff = df1.where(~col('customerId').isin(id_df2))

edited Sep 17, 2021 at 22:18

answered Sep 17, 2021 at 22:05


4 Comments

TurboAza Over a year ago

I only had to import from pyspark.sql import functions as F and use it as F.col() but this worked thanks

Kafels Over a year ago

Your solution is a recipe to kill Spark's driver node. There's a straightforward solution: df1.join(df2, on='customerId', how='left_anti')

sophocles Over a year ago

Thanks for your feedback @Kafels. I tend to use the above command a lot, so could you elaborate on why it's not good practice?

Kafels Over a year ago

@sophocles collect() moves all the data from the workers to the driver node. If you have many GB of data, this can cause an OOM exception, which can take down the driver and, with it, your whole job. Cases where you really need to collect data, or move it from Spark into the Python context to apply some specific logic, should be rare. One exception is plotting, where there's no way to avoid it.
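
The left_anti join suggested in the comments can be sketched as follows. This is an illustration under assumptions, not code from either answer: `rows_missing_from` and `demo` are hypothetical names, the sample frames keep only two columns, and `demo()` needs a working local PySpark install.

```python
def rows_missing_from(left, right, key="customerId"):
    """Return the rows of `left` whose `key` value has no match in `right`.

    Both arguments are expected to be pyspark.sql.DataFrame objects. The join
    is evaluated by Spark itself, so no rows are collected to the driver.
    """
    return left.join(right, on=key, how="left_anti")


def demo():
    # Imported here so the helper above can be defined without a Spark install.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("anti-join-demo")
        .getOrCreate()
    )
    columns = ["customerId", "fullName"]
    df1 = spark.createDataFrame(
        [(201534, "MARIO JIMENEZ"), (879535, "MARIO LOPEZ")], columns
    )
    df2 = spark.createDataFrame(
        [(201534, "MARIO JIMENEZ"), (201536, "ROBERT MITZ")], columns
    )
    # Only the row for customerId 879535 has no match in df2.
    rows_missing_from(df1, df2).show()
    spark.stop()
```

Call `demo()` to try it locally. Unlike the collect() plus isin approach, the list of ids never has to fit in the driver's memory.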
