I have two dataframes, df1 and df2:
df1.show()
+---+--------+-----+----+--------+
|cA | cB | cC | cD | cE |
+---+--------+-----+----+--------+
| A| abc | 0.1 | 0.0| 0 |
| B| def | 0.15| 0.5| 0 |
| C| ghi | 0.2 | 0.2| 1 |
| D| jkl | 1.1 | 0.1| 0 |
| E| mno | 0.1 | 0.1| 0 |
+---+--------+-----+----+--------+
df2.show()
+---+--------+-----+----+--------+
|cA | cB | cH | cI | cJ |
+---+--------+-----+----+--------+
| A| abc | a | b | ? |
| C| ghi | a | c | ? |
+---+--------+-----+----+--------+
I would like to update cE column in df1 and set it to 1, if the row is referenced in df2. Each record is identified by cA and cB columns.
Below is the desired output; Note that the cE value of the first record was updated to 1:
+---+--------+-----+----+--------+
|cA | cB | cC | cD | cE |
+---+--------+-----+----+--------+
| A| abc | 0.1 | 0.0| 1 |
| B| def | 0.15| 0.5| 0 |
| C| ghi | 0.2 | 0.2| 1 |
| D| jkl | 1.1 | 0.1| 0 |
| E| mno | 0.1 | 0.1| 0 |
+---+--------+-----+----+--------+
df1.alias('a').join(df2, (df1.cA == df2.cA) & (df1.cB==df2.cB), how='left').select('a.*')but I am not sure how to make the last stepwithColumnapi in spark. It will create new column and you can use values from other columns to fill in values. After that you can do anotherselectwith columns you requirewithColumnapi you create new column based on value from old columncEand the result of join. After that you doselectfor all relevant columns including new one, but excluding old one.