3

I'm writing a function to outputting a dataframe with the difference between two dataframes. Simplified, this looks like this:

differences = df1.join(df2, df1['id'] == df2['id'], how='full') \
  .select(F.coalesce(df1['id'], df2['id']).alias('id'), df1['name'], df2['name'])
  .where(df1['name'] != df2['name'])

With the following 2 datasets, I expect the 3rd to be the output:

+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Bob|
|  3|Carol|
|  4|  Dan|
|  5|  Eve|
+---+-----+

+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Ben|
|  4|  Dan|
|  5|  Eve|
|  6| Finn|
+---+-----+

+---+-------+-------+
|age|   name|   name|
+---+-------+-------+
|  2|    Bob|    Ben|
|  3|  Carol|   null|
|  6|   null|   Finn|
+---+-------+-------+

But when I run it in databricks, the null columns are omitted from the result dataframe.

+---+-------+-------+
|age|   name|   name|
+---+-------+-------+
|  2|    Bob|    Ben|
+---+-------+-------+

Are they not considered != by the where clause? Is there hidden logic when these frames are created?

0

1 Answer 1

2

From Why [table].[column] != null is not working?:

"NULL in a database is not a value. It means something like "unknown" or "data missing".

You cannot tell if something where you don't have any information about is equal to something else where you also don't have any information about (=, != operators). But you can say whether there is any information available (IS NULL, IS NOT NULL)."

So you will have to add more conditions:

differences = df1.join(df2, df1['id'] == df2['id'], how='full')\
  .select(F.coalesce(df1['id'], df2['id']).alias('id'), df1['Name'], df2['Name'])\
  .where((df1['Name'] != df2['Name']) | ((df1['Name']).isNull() | (df2['Name']).isNull()))


+---+-----+----+
|id |Name |Name|
+---+-----+----+
|2  |Bob  |Ben |
|3  |Carol|null|
|6  |null |Finn|
+---+-----+----+
Sign up to request clarification or add additional context in comments.

1 Comment

There is also a null-safe equal operator in Spark <=>. using DataFrame API: .where(~df1["Name"].eqNullSafe(df2["Name"])). Or in SQL expression: F.expr("df1.Name <=> df2.name"). See sql-ref-null-semantics

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.