
I am trying to join two DataFrames in PySpark. My problem is that I want the inner join to treat NULLs as equal, so rows match even when the join columns are NULL on both sides. I can see that in Scala there is the <=> operator for this, but <=> is not working in PySpark.

from pyspark.sql import Row

userLeft = sc.parallelize([
    Row(id=u'1',
        first_name=u'Steve',
        last_name=u'Kent',
        email=u'[email protected]'),
    Row(id=u'2',
        first_name=u'Margaret',
        last_name=u'Peace',
        email=u'[email protected]'),
    Row(id=u'3',
        first_name=None,
        last_name=u'hh',
        email=u'[email protected]')]).toDF()

userRight = sc.parallelize([
    Row(id=u'2',
        first_name=u'Margaret',
        last_name=u'Peace',
        email=u'[email protected]'),
    Row(id=u'3',
        first_name=None,
        last_name=u'hh',
        email=u'[email protected]')]).toDF()

Current working version:

userLeft.join(userRight, (userLeft.last_name==userRight.last_name) & (userLeft.first_name==userRight.first_name)).show()

Current Result:

    +--------------------+----------+---+---------+--------------------+----------+---+---------+
    |               email|first_name| id|last_name|               email|first_name| id|last_name|
    +--------------------+----------+---+---------+--------------------+----------+---+---------+
    |marge.peace@email...|  Margaret|  2|    Peace|marge.peace@email...|  Margaret|  2|    Peace|
    +--------------------+----------+---+---------+--------------------+----------+---+---------+

Expected Result:

    +--------------------+----------+---+---------+--------------------+----------+---+---------+
    |               email|first_name| id|last_name|               email|first_name| id|last_name|
    +--------------------+----------+---+---------+--------------------+----------+---+---------+
    |  [email protected]|      null|  3|       hh|  [email protected]|      null|  3|       hh|
    |marge.peace@email...|  Margaret|  2|    Peace|marge.peace@email...|  Margaret|  2|    Peace|
    +--------------------+----------+---+---------+--------------------+----------+---+---------+

2 Answers

For PySpark < 2.3.0 you can still build the <=> operator with an expression column like this:

import pyspark.sql.functions as F
df1.alias("df1").join(df2.alias("df2"), on = F.expr('df1.column <=> df2.column'))

For PySpark >= 2.3.0, you can use Column.eqNullSafe, or IS NOT DISTINCT FROM in SQL.
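
For example, a minimal sketch using the question's DataFrames (eqNullSafe is available on Column from Spark 2.3.0 onwards):

# null-safe inner join using Column.eqNullSafe (PySpark >= 2.3.0)
userLeft.join(
    userRight,
    userLeft["first_name"].eqNullSafe(userRight["first_name"]) &
    userLeft["last_name"].eqNullSafe(userRight["last_name"])
).show()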

Use another value instead of null:

userLeft = userLeft.na.fill("unknown")
userRight = userRight.na.fill("unknown")

userLeft.join(userRight, ["last_name", "first_name"])

    +---------+----------+--------------------+---+--------------------+---+
    |last_name|first_name|               email| id|               email| id|
    +---------+----------+--------------------+---+--------------------+---+
    |    Peace|  Margaret|marge.peace@email...|  2|marge.peace@email...|  2|
    |       hh|   unknown|  [email protected]|  3|  [email protected]|  3|
    +---------+----------+--------------------+---+--------------------+---+

3 Comments

I tried this approach. For string and date columns I was able to pick a placeholder to stand in for null values, like "NULLCUSTOM" for strings and "8888-01-01" for dates. But I couldn't think of a value for integer or float columns. Do you have any idea?
You can use float("inf"): if the column is of type int or long it will be cast to long, so it ends up not as infinity but as 9223372036854775807.
Or -1 for id columns (see the fill sketch below).
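
If you go the fill route, na.fill also accepts a per-column dictionary, so each type can get its own sentinel. A rough sketch, where df stands for either side of the join and the columns score and balance are hypothetical (the question's DataFrames only have string columns); pick sentinel values that cannot occur in your real data:

# fill nulls with per-column sentinel values before joining
filled = df.na.fill({
    "first_name": "NULLCUSTOM",  # string column
    "score": -1,                 # hypothetical integer column
    "balance": -1.0,             # hypothetical double column
})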
