
I have a dataframe (df1) with 3 columns: fname, lname, zip.

 fname  lname zip
 ty      zz   123
 rt      kk   345
 yu      pp   678

another master_df with only a list of zip_codes.

 zip_codes
 123
 345
 555
 667

I want to write PySpark SQL code to check whether the zip codes present in df1 are among those in the master list. Whichever rows are not present in master should go into another dataframe.

I tried :

df3 = df1.filter(df1["zip"]!=master["zip_codes"])

My required output_df should show 678, as it's not present in master_df.
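To make the goal concrete: the attempted filter compares column expressions from two different DataFrames, which Spark can't evaluate without a join; what's being asked for is an anti-join (rows of df1 with no match in master). A minimal pure-Python sketch of that semantics, using the sample data above:

```python
# Illustrative sketch only; the answers below use actual PySpark.
df1 = [("ty", "zz", 123), ("rt", "kk", 345), ("yu", "pp", 678)]
master_zips = {123, 345, 555, 667}

# Keep rows whose zip is NOT in the master list (anti-join semantics)
missing = [row for row in df1 if row[2] not in master_zips]
print(missing)  # [('yu', 'pp', 678)]
```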


3 Answers

df2=df1.join(master,(df1.zip==master.zip_codes),'left_outer').where(master['zip_codes'].isNull())
df2.show()
+-----+-----+---+---------+
|fname|lname|zip|zip_codes|
+-----+-----+---+---------+
|   yu|   pp|678|     null|
+-----+-----+---+---------+

1 Comment

I have a doubt: why are you adding the isNull condition on master's zip_codes?
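To answer that doubt: after a left outer join, rows of df1 that have no match in master get null in the right-side zip_codes column, so filtering on `isNull()` keeps exactly the unmatched rows. A pure-Python sketch of that two-step logic (illustrative only, not PySpark):

```python
df1 = [("ty", "zz", 123), ("rt", "kk", 345), ("yu", "pp", 678)]
master = [123, 345, 555, 667]

# Left outer join on zip == zip_codes: unmatched rows get None
joined = []
for fname, lname, z in df1:
    match = z if z in master else None
    joined.append((fname, lname, z, match))

# where(master['zip_codes'].isNull()) keeps only the unmatched rows
unmatched = [row for row in joined if row[3] is None]
print(unmatched)  # [('yu', 'pp', 678, None)]
```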

Let me know if this helps:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Collect the master zip codes to the driver as a Python list
zip_codes = master_df.select(['zip_codes']).rdd.flatMap(lambda x: x).collect()

@F.udf(StringType())
def increment(x):
    if x in zip_codes:
        return "True"
    else:
        return "False"

# Tag each row with whether its zip appears in the master list
TableA = TableA.withColumn('zip_presence', increment('zip'))

df_with_zipcode_match = TableA.filter(TableA['zip_presence'] == "True").drop('zip_presence')
df_without_zipcode_match = TableA.filter(TableA['zip_presence'] == "False").drop('zip_presence')


df_with_zipcode_match.show()
df_without_zipcode_match.show()


#### Input DFs ####
+---------+-----+---+
|    fname|lname|zip|
+---------+-----+---+
|       ty|   zz|123|
|   Monkey|   kk|345|
|    Ninja|   pp|678|
|Spaghetti|  pgp|496|
+---------+-----+---+


+---------+
|zip_codes|
+---------+
|      123|
|      345|
|      555|
|      667|
+---------+


#### Output DFs ####
+------+-----+---+
| fname|lname|zip|
+------+-----+---+
|    ty|   zz|123|
|Monkey|   kk|345|
+------+-----+---+


+---------+-----+---+
|    fname|lname|zip|
+---------+-----+---+
|    Ninja|   pp|678|
|Spaghetti|  pgp|496|
+---------+-----+---+
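As a side note, PySpark's built-in `Column.isin` can express the same membership test without a Python UDF (e.g. `F.col('zip').isin(zip_codes)`), which avoids serializing rows through Python. The partition itself is just a membership split, sketched here in plain Python with the sample rows above:

```python
# The UDF tags each row True/False and then filters twice; the same
# partition in plain Python is a set-membership split.
rows = [("ty", "zz", 123), ("Monkey", "kk", 345),
        ("Ninja", "pp", 678), ("Spaghetti", "pgp", 496)]
zip_codes = {123, 345, 555, 667}

with_match = [r for r in rows if r[2] in zip_codes]
without_match = [r for r in rows if r[2] not in zip_codes]

print(with_match)     # [('ty', 'zz', 123), ('Monkey', 'kk', 345)]
print(without_match)  # [('Ninja', 'pp', 678), ('Spaghetti', 'pgp', 496)]
```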



You can make use of the subtract method here. Here's my code snippet.

from pyspark.sql import SparkSession
SS = SparkSession.builder.getOrCreate()

data_1 = [
    {"fname": "ty", "lname": "zz", "zip": 123},
    {"fname": "rt", "lname": "kk", "zip": 345},
    {"fname": "yu", "lname": "pp", "zip": 678}]

data_2 = [
    {"zip": 123},
    {"zip": 345},
    {"zip": 555},
    {"zip": 667},]

# Creating dataframes
df_1 = SS.createDataFrame(data_1)
df_2 = SS.createDataFrame(data_2)

# Creating dataframe with only zip
df_1_sliced = df_1.select("zip")

# Finding the difference
df_diff = df_1_sliced.subtract(df_2)
df_diff.show()  # Shows the zips from df_1 missing in df_2

+---+
|zip|
+---+
|678|
+---+

This will create a new dataframe containing all the zips from df_1 that are not present in the master zip codes. Note it returns only the zip column; to recover the full fname/lname rows you would join this result back to df_1 (or use a left anti join on df_1 directly).
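Since both inputs here hold a single column, `subtract` behaves like a set difference on the zip values, which can be sketched in plain Python:

```python
# subtract() removes every row of the second frame from the first;
# with one column, that's a set difference on the zips.
df1_zips = {123, 345, 678}
master_zips = {123, 345, 555, 667}

diff = sorted(df1_zips - master_zips)
print(diff)  # [678]
```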
