1

I have two data frames:

df1:

+----------+-------------+-------------+--------------+---------------+
|customerId|     fullName|   telephone1|    telephone2|          email|
+----------+-------------+-------------+--------------+---------------+
|    201534|MARIO JIMENEZ|01722-3500391|+5215553623333|[email protected]|
|    879535|  MARIO LOPEZ|01722-3500377|+5215553623333| [email protected]|
+----------+-------------+-------------+--------------+---------------+

df2:

+----------+-------------+-------------+--------------+---------------+
|customerId|     fullName|   telephone1|    telephone2|          email|
+----------+-------------+-------------+--------------+---------------+
|    201534|MARIO JIMENEZ|01722-3500391|+5215553623333|[email protected]|
|    201536|  ROBERT MITZ|01722-3500377|+5215553623333| [email protected]|
|    201537|     MARY ENG|01722-3500127|+5215553623111|[email protected]|
|    201538|    RICK BURT|01722-3500983|+5215553623324|[email protected]|
|    201539|     JHON DOE|01722-3502547|+5215553621476|[email protected]|
+----------+-------------+-------------+--------------+---------------+

And I need to get a third DataFrame with the ones from df1 that does not exist in df2.

like this:

+----------+-------------+-------------+--------------+---------------+
|customerId|     fullName|   telephone1|    telephone2|          email|
+----------+-------------+-------------+--------------+---------------+
|    879535|  MARIO LOPEZ|01722-3500377|+5215553623333| [email protected]|
+----------+-------------+-------------+--------------+---------------+

Whats is the correct way of doing this?

I've already tried the following:

diff = df2.join(df1, df2['customerId'] != df1['customerId'],"left")
diff = df1.subtract(df2)
diff = df1[~ df1['customerId'].isin(df2['customerId'])]

But they do not work, any suggestions?

2
  • In general, it will be easier for people to help if you can provide code to generate your dataframes. Commented Sep 17, 2021 at 21:58
  • 1
    your "like this" example is of the ones that do exist in df2 however you say your "need" is "that does not exist in df2" Please resolve the contradiction or we cannot like this. Commented Sep 17, 2021 at 22:00

2 Answers 2

3

You can use merge with indicator=True:

df3 = df1.merge(df2, on=df1.columns.tolist(), how='left', indicator=True)
df3 = df3[df3['_merge'] == 'left_only'].drop(columns='_merge')

Output:

>>> df3
   customerId     fullName     telephone1     telephone2           email
1      879535  MARIO LOPEZ  01722-3500377  5215553623333  [email protected]
Sign up to request clarification or add additional context in comments.

Comments

2

Using pyspark:

You can create a list containing the customerId from DF2 with collect():

from pyspark.sql.types import *
id_df2 = [id[0] for id in df2.select('customerId').distinct().collect()]

And then filter your DF1 customerId using isin with negation ~:

diff = df1.where(~col('customerId').isin(id_df2))

4 Comments

I only had to import from pyspark.sql import functions as F and use it as F.col() but this worked thanks
Your solution is a recipe to kill Spark's driver node. There's a straightforward solution: df1.join(df2, on='customerId', how='left_anti')
Thanks for your feedback @Kafels. I tend to use the above command a lot, so could you elaborate on why it's not a good practise?
@sophocles collect() will move all data from Workers to Driver node, in an example where you have a lot of GB of data, this could cause an OOM exception, which means your cluster entire will go down. Usually, should be rare the use cases where you really need to collect or move some data from spark to python context to apply a specific logic. Otherwise, if you are working with plotting data, there's no way to avoid it

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.