
I have a dataframe (df1) with 3 columns: fname, lname, zip.

 fname  lname zip
 ty      zz   123
 rt      kk   345
 yu      pp   678

another master_df with only a list of zip_codes.

 zip_codes
 123
 345
 555
 667

I want to write PySpark SQL code to check whether the zip codes present in df1 are among those in the master list. Whichever rows are not present in master should go into another dataframe.

I tried :

df3 = df1.filter(df1["zip"]!=master["zip_codes"])

My required output_df should show 678, as it's not present in master_df.
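To make the goal concrete: the attempted filter compares column expressions from two different DataFrames, which Spark can't evaluate without a join; what's being asked for is an anti-join (rows of df1 with no match in master). A minimal pure-Python sketch of that semantics, using the sample data above:

```python
# Illustrative sketch only; the answers below use actual PySpark.
df1 = [("ty", "zz", 123), ("rt", "kk", 345), ("yu", "pp", 678)]
master_zips = {123, 345, 555, 667}

# Keep rows whose zip is NOT in the master list (anti-join semantics)
missing = [row for row in df1 if row[2] not in master_zips]
print(missing)  # [('yu', 'pp', 678)]
```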


3 Answers

df2=df1.join(master,(df1.zip==master.zip_codes),'left_outer').where(master['zip_codes'].isNull())
df2.show()
+-----+-----+---+---------+
|fname|lname|zip|zip_codes|
+-----+-----+---+---------+
|   yu|   pp|678|     null|
+-----+-----+---+---------+

1 Comment

I have a doubt: why are you adding the isNull condition on master's zip_codes?
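To answer that doubt: after a left outer join, rows of df1 that have no match in master get null in the right-side zip_codes column, so filtering on `isNull()` keeps exactly the unmatched rows. A pure-Python sketch of that two-step logic (illustrative only, not PySpark):

```python
df1 = [("ty", "zz", 123), ("rt", "kk", 345), ("yu", "pp", 678)]
master = [123, 345, 555, 667]

# Left outer join on zip == zip_codes: unmatched rows get None
joined = []
for fname, lname, z in df1:
    match = z if z in master else None
    joined.append((fname, lname, z, match))

# where(master['zip_codes'].isNull()) keeps only the unmatched rows
unmatched = [row for row in joined if row[3] is None]
print(unmatched)  # [('yu', 'pp', 678, None)]
```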

Let me know if this helps:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Collect the master zip codes to the driver as a Python list
zip_codes = master_df.select(['zip_codes']).rdd.flatMap(lambda x: x).collect()

@F.udf(StringType())
def increment(x):
    if x in zip_codes:
        return "True"
    else:
        return "False"

# Tag each row with whether its zip appears in the master list
TableA = TableA.withColumn('zip_presence', increment('zip'))

df_with_zipcode_match = TableA.filter(TableA['zip_presence'] == "True").drop('zip_presence')
df_without_zipcode_match = TableA.filter(TableA['zip_presence'] == "False").drop('zip_presence')


df_with_zipcode_match.show()
df_without_zipcode_match.show()


#### Input DFs ####
+---------+-----+---+
|    fname|lname|zip|
+---------+-----+---+
|       ty|   zz|123|
|   Monkey|   kk|345|
|    Ninja|   pp|678|
|Spaghetti|  pgp|496|
+---------+-----+---+


+---------+
|zip_codes|
+---------+
|      123|
|      345|
|      555|
|      667|
+---------+


#### Output DFs ####
+------+-----+---+
| fname|lname|zip|
+------+-----+---+
|    ty|   zz|123|
|Monkey|   kk|345|
+------+-----+---+


+---------+-----+---+
|    fname|lname|zip|
+---------+-----+---+
|    Ninja|   pp|678|
|Spaghetti|  pgp|496|
+---------+-----+---+
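As a side note, PySpark's built-in `Column.isin` can express the same membership test without a Python UDF (e.g. `F.col('zip').isin(zip_codes)`), which avoids serializing rows through Python. The partition itself is just a membership split, sketched here in plain Python with the sample rows above:

```python
# The UDF tags each row True/False and then filters twice; the same
# partition in plain Python is a set-membership split.
rows = [("ty", "zz", 123), ("Monkey", "kk", 345),
        ("Ninja", "pp", 678), ("Spaghetti", "pgp", 496)]
zip_codes = {123, 345, 555, 667}

with_match = [r for r in rows if r[2] in zip_codes]
without_match = [r for r in rows if r[2] not in zip_codes]

print(with_match)     # [('ty', 'zz', 123), ('Monkey', 'kk', 345)]
print(without_match)  # [('Ninja', 'pp', 678), ('Spaghetti', 'pgp', 496)]
```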



You can make use of the subtract method here. Here's my code snippet.

from pyspark.sql import SparkSession
SS = SparkSession.builder.getOrCreate()

data_1 = [
    {"fname": "ty", "lname": "zz", "zip": 123},
    {"fname": "rt", "lname": "kk", "zip": 345},
    {"fname": "yu", "lname": "pp", "zip": 678}]

data_2 = [
    {"zip": 123},
    {"zip": 345},
    {"zip": 555},
    {"zip": 667},]

# Creating dataframes
df_1 = SS.createDataFrame(data_1)
df_2 = SS.createDataFrame(data_2)

# Creating dataframe with only zip
df_1_sliced = df_1.select("zip")

# Finding the difference
df_diff = df_1_sliced.subtract(df_2)
df_diff.show()  # Shows the zips from df_1 missing in df_2

+---+
|zip|
+---+
|678|
+---+

This will create a new dataframe containing all the zips from df_1 that are not present in the master zip codes. Note it returns only the zip column; to recover the full fname/lname rows you would join this result back to df_1 (or use a left anti join on df_1 directly).
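Since both inputs here hold a single column, `subtract` behaves like a set difference on the zip values, which can be sketched in plain Python:

```python
# subtract() removes every row of the second frame from the first;
# with one column, that's a set difference on the zips.
df1_zips = {123, 345, 678}
master_zips = {123, 345, 555, 667}

diff = sorted(df1_zips - master_zips)
print(diff)  # [678]
```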
