I have a pyspark dataframe called df1 that looks like this:
| ID1 | ID2 |
|---|---|
| aaaa | a1 |
| bbbb | a2 |
| aaaa | a3 |
| bbbb | a4 |
| cccc | a2 |
And I have another dataframe called df2 that looks like this:
| ID2_1 | ID2_2 |
|---|---|
| a2 | a1 |
| a3 | a2 |
| a2 | a3 |
| a2 | a1 |
where the values of ID2 in the first dataframe match the values in columns ID2_1 and ID2_2 of the second dataframe.
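For reference, this is roughly how the sample data can be built (a minimal sketch, assuming an active `SparkSession` named `spark`):

```python
# Minimal reconstruction of the sample data shown above
df1 = spark.createDataFrame(
    [("aaaa", "a1"), ("bbbb", "a2"), ("aaaa", "a3"), ("bbbb", "a4"), ("cccc", "a2")],
    ["ID1", "ID2"],
)

df2 = spark.createDataFrame(
    [("a2", "a1"), ("a3", "a2"), ("a2", "a3"), ("a2", "a1")],
    ["ID2_1", "ID2_2"],
)
```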
So the resultant dataframe will look like this:
| ID1 | ID2 |
|---|---|
| aaaa | a1 |
| bbbb | a2 |
| aaaa | a3 |
| cccc | a2 |
(the fourth row, `bbbb | a4`, was filtered out)
I want to filter the column ID2 so that only values that appear in either ID2_1 or ID2_2 are kept. I tried:

```python
import pyspark.sql.functions as f

filtered = df1.filter(
    (f.col("ID2").isin(df2.ID2_1)) | (f.col("ID2").isin(df2.ID2_2))
)
```
But this doesn't seem to work.
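As far as I understand, `isin` expects literal values (e.g. a Python list), not a column from another dataframe, so `df2.ID2_1` can't be resolved against `df1` at all. With hard-coded literals the filter behaves as expected, but of course the real values have to come from df2:

```python
# Works only because the allowed values are hard-coded literals,
# not taken from df2 at runtime
df1.filter(f.col("ID2").isin(["a1", "a2", "a3"])).show()
```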
I have seen suggestions to join the two dataframes on these columns, but that operation is too expensive and I'm trying to avoid it. Any suggestions on how to do this?