I have a PySpark dataframe consisting of two columns named input and target. These two columns are the crossJoin of two single-column dataframes. Below is an example of what such a dataframe looks like.
| input | target |
|---|---|
| A | Voigt. |
| A | Leica |
| A | Zeiss |
| B | Voigt. |
| B | Leica |
| B | Zeiss |
| C | Voigt. |
| C | Leica |
| C | Zeiss |
Then I have another dataframe which provides a number describing the relation between the input and target columns. However, it is not guaranteed that every input-target pair has this numerical value. For example, A - Voigt. may have 2 as its relational value, but A - Leica may not have a value at all. Below is an example.
| input | target | val |
|---|---|---|
| A | Voigt. | 2 |
| A | Zeiss | 1 |
| B | Leica | 3 |
| C | Zeiss | 5 |
| C | Leica | 2 |
Now I want a dataframe that combines these two and looks like this.
| input | target | val |
|---|---|---|
| A | Voigt. | 2 |
| A | Leica | null |
| A | Zeiss | 1 |
| B | Voigt. | null |
| B | Leica | 3 |
| B | Zeiss | null |
| C | Voigt. | null |
| C | Leica | 2 |
| C | Zeiss | 5 |
I tried a left join of these two dataframes and then tried to filter the result, but I had trouble completing it in this form.
```python
result = input_target.join(input_target_w_val, (input_target.input == input_target_w_val.input) & (input_target.target == input_target_w_val.target), 'left')
```
How should I apply a filter from this point, or is there another way I can achieve this?