PySpark and SQL : join and null values

Question

I have pyspark dataframe consisting of two columns, each named input and target. These two are crossJoin of two single-column dataframes. Below is an example of how such dataframe would look like.

input	target
A	Voigt.
A	Leica
A	Zeiss
B	Voigt.
B	Leica
B	Zeiss
C	Voigt.
C	Leica
C	Zeiss

Then I have another dataframe which provides a number which describes relation between input and target column. However, it is not guaranteed that each input-target has this numerical value. For example, A - Voigt may have 2 as its relational value but A-Leica may have not have this value at all. Below is an example

input	target	val
A	Voigt.	2
A	Zeiss	1
B	Leica	3
C	Zeiss	5
C	Leica	2

Now I want a dataframe that is congregate of these two that looks like this.

input	target	val
A	Voigt.	2
A	Leica	null
A	Zeiss	1
B	Voigt.	null
B	Leica	3
B	Zeiss	null
C	Voigt.	null
C	Leica	5
C	Zeiss	2

I tried to join left these two columns, and tried to filter these out, but I've had problem completing in this form.

result = input_target.join(input_target_w_val, (input_target.input == input_target_w_val.input) & (input_target.target == input_target_w_val.target), 'left')

How should I put a filter from this point, or is there another way I can achieve this?

Dipanjan Mallick · Accepted Answer · 2022-04-09 20:04:34Z

1

Try using it as below -

Input DataFrames

df1 = spark.createDataFrame(data=[("A","Voigt.") ,("A","Leica") ,("A","Zeiss") ,("B","Voigt.") ,("B","Leica") ,("B","Zeiss") ,("C","Voigt.") ,("C","Leica") ,("C","Zeiss")], schema = ["input", "target"])
df1.show()

+-----+------+
|input|target|
+-----+------+
|    A|Voigt.|
|    A| Leica|
|    A| Zeiss|
|    B|Voigt.|
|    B| Leica|
|    B| Zeiss|
|    C|Voigt.|
|    C| Leica|
|    C| Zeiss|
+-----+------+

df2 = spark.createDataFrame(data=[("A","Voigt.",2) ,("A","Zeiss",1 ) ,("B","Leica",3 ) ,("C","Zeiss",5 ) ,("C","Leica",2 )], schema = ["input", "target", "val"])
df2.show()

+-----+------+---+
|input|target|val|
+-----+------+---+
|    A|Voigt.|  2|
|    A| Zeiss|  1|
|    B| Leica|  3|
|    C| Zeiss|  5|
|    C| Leica|  2|
+-----+------+---+

Required Output

df1.join(df2, on = ["input", "target"], how = "left_outer").select(df1["input"], df1["target"], df2["val"]).show(truncate=False)

+-----+------+----+
|input|target|val |
+-----+------+----+
|A    |Leica |null|
|A    |Voigt.|2   |
|A    |Zeiss |1   |
|B    |Leica |3   |
|B    |Voigt.|null|
|B    |Zeiss |null|
|C    |Leica |2   |
|C    |Voigt.|null|
|C    |Zeiss |5   |
+-----+------+----+

answered Apr 9, 2022 at 20:04

Dipanjan Mallick

1,7792 gold badges10 silver badges23 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Dipanjan Mallick Over a year ago

If the answer helped to solve the problem please check the ✓ symbol next to the answer. Upvote too, if you like.

过过招 · Accepted Answer · 2022-04-10 01:13:36Z

1

You can simply specify a list of join column names.

df = df1.join(df2, ['input', 'target'], 'left')

answered Apr 10, 2022 at 1:13

过过招

4,3372 gold badges7 silver badges14 bronze badges

Collectives™ on Stack Overflow

PySpark and SQL : join and null values

2 Answers 2

1 Comment

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Related