
I have a PySpark dataframe with two columns, named input and target. It is the crossJoin of two single-column dataframes. Below is an example of what such a dataframe looks like.

input target
A Voigt.
A Leica
A Zeiss
B Voigt.
B Leica
B Zeiss
C Voigt.
C Leica
C Zeiss

Then I have another dataframe that provides a number describing the relation between the input and target columns. However, it is not guaranteed that every input-target pair has this numerical value. For example, A-Voigt. may have 2 as its relational value, but A-Leica may not have a value at all. Below is an example.

input target val
A Voigt. 2
A Zeiss 1
B Leica 3
C Zeiss 5
C Leica 2

Now I want a dataframe that combines these two and looks like this.

input target val
A Voigt. 2
A Leica null
A Zeiss 1
B Voigt. null
B Leica 3
B Zeiss null
C Voigt. null
C Leica 2
C Zeiss 5

I tried to left join these two dataframes and then filter the result, but I had trouble getting it into this form.

result = input_target.join(input_target_w_val, (input_target.input == input_target_w_val.input) & (input_target.target == input_target_w_val.target), 'left')

How should I filter from this point, or is there another way to achieve this?

2 Answers


Try a left outer join on both columns, as below.

Input DataFrames

df1 = spark.createDataFrame(data=[("A","Voigt.") ,("A","Leica") ,("A","Zeiss") ,("B","Voigt.") ,("B","Leica") ,("B","Zeiss") ,("C","Voigt.") ,("C","Leica") ,("C","Zeiss")], schema = ["input", "target"])
df1.show()

+-----+------+
|input|target|
+-----+------+
|    A|Voigt.|
|    A| Leica|
|    A| Zeiss|
|    B|Voigt.|
|    B| Leica|
|    B| Zeiss|
|    C|Voigt.|
|    C| Leica|
|    C| Zeiss|
+-----+------+

df2 = spark.createDataFrame(data=[("A","Voigt.",2) ,("A","Zeiss",1 ) ,("B","Leica",3 ) ,("C","Zeiss",5 ) ,("C","Leica",2 )], schema = ["input", "target", "val"])
df2.show()

+-----+------+---+
|input|target|val|
+-----+------+---+
|    A|Voigt.|  2|
|    A| Zeiss|  1|
|    B| Leica|  3|
|    C| Zeiss|  5|
|    C| Leica|  2|
+-----+------+---+

Required Output

df1.join(df2, on = ["input", "target"], how = "left_outer").select(df1["input"], df1["target"], df2["val"]).show(truncate=False)

+-----+------+----+
|input|target|val |
+-----+------+----+
|A    |Leica |null|
|A    |Voigt.|2   |
|A    |Zeiss |1   |
|B    |Leica |3   |
|B    |Voigt.|null|
|B    |Zeiss |null|
|C    |Leica |2   |
|C    |Voigt.|null|
|C    |Zeiss |5   |
+-----+------+----+
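
To make it clearer what the left outer join computes here, below is a minimal plain-Python sketch of the same lookup, without Spark. The names `pairs`, `vals` and `result` are only illustrative: every (input, target) pair from the first dataframe is kept, and `val` is filled in when the pair exists in the second dataframe, otherwise it is `None` (shown as null by Spark).

```python
# All (input, target) pairs, i.e. the cross-joined left side.
pairs = [(i, t) for i in ("A", "B", "C") for t in ("Voigt.", "Leica", "Zeiss")]

# The sparse values from the second dataframe, keyed by (input, target).
vals = {
    ("A", "Voigt."): 2,
    ("A", "Zeiss"): 1,
    ("B", "Leica"): 3,
    ("C", "Zeiss"): 5,
    ("C", "Leica"): 2,
}

# Left-join semantics: keep every pair, use None where no value exists.
result = [(i, t, vals.get((i, t))) for i, t in pairs]

for row in result:
    print(row)
```

No filtering step is needed after the join: pairs with no match simply keep a null `val`.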

1 Comment

If the answer helped to solve the problem please check the ✓ symbol next to the answer. Upvote too, if you like.

You can simply specify a list of join column names.

df = df1.join(df2, ['input', 'target'], 'left')
