I have a dataframe with multiple columns:
| a | b | c | d |
-----------------
| 0 | 4 | 3 | 6 |
| 1 | 7 | 0 | 4 |
| 2 | 4 | 3 | 6 |
| 3 | 9 | 5 | 9 |
I would now like to combine [b, c, d] into a single column. However, I do not know in advance how many columns the list will contain; otherwise I could just use a UDF3 to combine the three.
So the desired outcome is:
| a | combined  |
-----------------
| 0 | [4, 3, 6] |
| 1 | [7, 0, 4] |
| 2 | [4, 3, 6] |
| 3 | [9, 5, 9] |
How can I achieve this?
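Non-working pseudo-code (collectAsList is a made-up placeholder for the function I am looking for):
public static Dataset<Row> mergeColumns(Dataset<Row> ds, List<String> columns) {
    // collectAsList does not exist; it stands for "combine these columns into one list per row"
    return ds.withColumn("combined", collectAsList(columns));
}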
The worst-case workaround would be a switch statement on the number of input columns, with a separate UDF for each case (e.g. 2 to 20 input columns), throwing an error if more input columns are supplied.
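For reference, here is a minimal sketch of the shape I have in mind, assuming the built-in functions.array (which accepts a variable number of Column arguments) is suitable here. The class name MergeColumns is only a placeholder, and the sketch assumes that all merged columns share a compatible type and that the original columns should be dropped afterwards, as in the desired outcome above.

```java
import java.util.List;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

// Hypothetical helper class; the name is an assumption for illustration only.
public class MergeColumns {

    public static Dataset<Row> mergeColumns(Dataset<Row> ds, List<String> columns) {
        // Turn each column name into a Column reference.
        Column[] cols = columns.stream()
                .map(functions::col)
                .toArray(Column[]::new);

        // functions.array(Column...) packs the row's values into one array column.
        // Assumption: the input columns have a common (or coercible) type.
        Dataset<Row> combined = ds.withColumn("combined", functions.array(cols));

        // Drop the inputs so that only the remaining columns and "combined" are left.
        return combined.drop(columns.toArray(new String[0]));
    }
}
```

Calling mergeColumns(ds, Arrays.asList("b", "c", "d")) on the dataframe above should leave the columns a and combined, with no dependence on how many names the list holds.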