
I am trying to join two DataFrames in PySpark. My problem is that I want the inner join to treat NULLs as equal, so rows match even when the join columns are NULL on both sides. I can see that in Scala there is the <=> operator for this, but <=> is not working in PySpark.

from pyspark.sql import Row

userLeft = sc.parallelize([
    Row(id=u'1',
        first_name=u'Steve',
        last_name=u'Kent',
        email=u'[email protected]'),
    Row(id=u'2',
        first_name=u'Margaret',
        last_name=u'Peace',
        email=u'[email protected]'),
    Row(id=u'3',
        first_name=None,
        last_name=u'hh',
        email=u'[email protected]')]).toDF()

userRight = sc.parallelize([
    Row(id=u'2',
        first_name=u'Margaret',
        last_name=u'Peace',
        email=u'[email protected]'),
    Row(id=u'3',
        first_name=None,
        last_name=u'hh',
        email=u'[email protected]')]).toDF()

Current working version:

userLeft.join(userRight, (userLeft.last_name==userRight.last_name) & (userLeft.first_name==userRight.first_name)).show()

Current Result:

    +--------------------+----------+---+---------+--------------------+----------+---+---------+
    |               email|first_name| id|last_name|               email|first_name| id|last_name|
    +--------------------+----------+---+---------+--------------------+----------+---+---------+
    |marge.peace@email...|  Margaret|  2|    Peace|marge.peace@email...|  Margaret|  2|    Peace|
    +--------------------+----------+---+---------+--------------------+----------+---+---------+

Expected Result:

    +--------------------+----------+---+---------+--------------------+----------+---+---------+
    |               email|first_name| id|last_name|               email|first_name| id|last_name|
    +--------------------+----------+---+---------+--------------------+----------+---+---------+
    |  [email protected]|      null|  3|       hh|  [email protected]|      null|  3|       hh|
    |marge.peace@email...|  Margaret|  2|    Peace|marge.peace@email...|  Margaret|  2|    Peace|
    +--------------------+----------+---+---------+--------------------+----------+---+---------+

2 Answers

For PySpark < 2.3.0 you can still build the <=> operator with an expression column like this:

import pyspark.sql.functions as F
df1.alias("df1").join(df2.alias("df2"), on = F.expr('df1.column <=> df2.column'))

For PySpark >= 2.3.0, you can use Column.eqNullSafe, or IS NOT DISTINCT FROM in SQL.
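
For example, a minimal sketch using the question's DataFrames (eqNullSafe is available on Column from Spark 2.3.0 onwards):

# null-safe inner join using Column.eqNullSafe (PySpark >= 2.3.0)
userLeft.join(
    userRight,
    userLeft["first_name"].eqNullSafe(userRight["first_name"]) &
    userLeft["last_name"].eqNullSafe(userRight["last_name"])
).show()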

Use another value instead of null:

userLeft = userLeft.na.fill("unknown")
userRight = userRight.na.fill("unknown")

userLeft.join(userRight, ["last_name", "first_name"])

    +---------+----------+--------------------+---+--------------------+---+
    |last_name|first_name|               email| id|               email| id|
    +---------+----------+--------------------+---+--------------------+---+
    |    Peace|  Margaret|marge.peace@email...|  2|marge.peace@email...|  2|
    |       hh|   unknown|  [email protected]|  3|  [email protected]|  3|
    +---------+----------+--------------------+---+--------------------+---+

3 Comments

I tried this approach. For string and date columns I was able to pick a placeholder to stand in for null values, like "NULLCUSTOM" for strings and "8888-01-01" for dates. But I couldn't think of a value for integer or float columns. Do you have any idea?
You can use float("inf"): if the column is of type int or long it will be cast to long, so it ends up not as infinity but as 9223372036854775807.
Or -1 for id columns (see the fill sketch below).
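
If you go the fill route, na.fill also accepts a per-column dictionary, so each type can get its own sentinel. A rough sketch, where df stands for either side of the join and the columns score and balance are hypothetical (the question's DataFrames only have string columns); pick sentinel values that cannot occur in your real data:

# fill nulls with per-column sentinel values before joining
filled = df.na.fill({
    "first_name": "NULLCUSTOM",  # string column
    "score": -1,                 # hypothetical integer column
    "balance": -1.0,             # hypothetical double column
})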
