Pyspark Dataframe Difference - Where param != null not returning?

Question

I'm writing a function to outputting a dataframe with the difference between two dataframes. Simplified, this looks like this:

differences = df1.join(df2, df1['id'] == df2['id'], how='full') \
  .select(F.coalesce(df1['id'], df2['id']).alias('id'), df1['name'], df2['name'])
  .where(df1['name'] != df2['name'])

With the following 2 datasets, I expect the 3rd to be the output:

+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Bob|
|  3|Carol|
|  4|  Dan|
|  5|  Eve|
+---+-----+

+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Ben|
|  4|  Dan|
|  5|  Eve|
|  6| Finn|
+---+-----+

+---+-------+-------+
|age|   name|   name|
+---+-------+-------+
|  2|    Bob|    Ben|
|  3|  Carol|   null|
|  6|   null|   Finn|
+---+-------+-------+

But when I run it in databricks, the null columns are omitted from the result dataframe.

+---+-------+-------+
|age|   name|   name|
+---+-------+-------+
|  2|    Bob|    Ben|
+---+-------+-------+

Are they not considered != by the where clause? Is there hidden logic when these frames are created?

Luiz Viola · Accepted Answer · 2021-11-05 21:31:48Z

2

From Why [table].[column] != null is not working?:

"NULL in a database is not a value. It means something like "unknown" or "data missing".

You cannot tell if something where you don't have any information about is equal to something else where you also don't have any information about (=, != operators). But you can say whether there is any information available (IS NULL, IS NOT NULL)."

So you will have to add more conditions:

differences = df1.join(df2, df1['id'] == df2['id'], how='full')\
  .select(F.coalesce(df1['id'], df2['id']).alias('id'), df1['Name'], df2['Name'])\
  .where((df1['Name'] != df2['Name']) | ((df1['Name']).isNull() | (df2['Name']).isNull()))


+---+-----+----+
|id |Name |Name|
+---+-----+----+
|2  |Bob  |Ben |
|3  |Carol|null|
|6  |null |Finn|
+---+-----+----+

answered Nov 5, 2021 at 21:31

Luiz Viola

2,4662 gold badges17 silver badges36 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

blackbishop Over a year ago

There is also a null-safe equal operator in Spark <=>. using DataFrame API: .where(~df1["Name"].eqNullSafe(df2["Name"])). Or in SQL expression: F.expr("df1.Name <=> df2.name"). See sql-ref-null-semantics

Collectives™ on Stack Overflow

Pyspark Dataframe Difference - Where param != null not returning?

1 Answer 1

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related