I would like to change the values of multiple fields in a row of a dataframe df. Normally, I would do a row-to-row transformation using a map. Something like:
+---+--------+
|num|name    |
+---+--------+
|  1|Hydrogen|
|  2|Helium  |
+---+--------+
df.map(row => {
  // Read "name" as a String and upper-case it
  val name = row.getAs[String]("name").toUpperCase
  // Keep the first column unchanged
  (row(0), name)
})
But now I have a dataframe with a very elaborate schema of many columns, and I want to change the values of only some of them. The change in the value of one column depends on other columns. How can I avoid writing out all the column values (like row.get(0), row.get(1) ... row.get(30)) in the tuple and write only the ones that have changed? Consider a df with this schema:
case class DFSchema(id: String, name: String, map1: Map[String, String], ... , map30: Map[String, String])
I want to update the keys and values of df.select("map30") and modify "name" only if id is "city". Of course, there are more such transformations to be made in other columns (represented in the schema as mapX).
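To make the goal concrete, here is a minimal sketch of the row-level version in plain Scala, relying on the copy method that every case class gets. Rec is a hypothetical, trimmed-down stand-in for DFSchema, and RowUpdate, updateMap and transform are illustrative names. Upper-casing the map is only a placeholder rule, and the sketch assumes both changes apply only when id is "city".

```scala
// Hypothetical, trimmed-down stand-in for DFSchema (only three fields shown).
final case class Rec(id: String, name: String, map30: Map[String, String])

object RowUpdate {
  // Placeholder rule (an assumption): upper-case every key and value.
  def updateMap(m: Map[String, String]): Map[String, String] =
    m.map { case (k, v) => (k.toUpperCase, v.toUpperCase) }

  // Only the fields that change are named; copy() carries the others over.
  def transform(r: Rec): Rec =
    if (r.id == "city") r.copy(name = r.name.toUpperCase, map30 = updateMap(r.map30))
    else r
}
```

If the dataframe can be viewed as a typed Dataset (for example df.as[DFSchema] with spark.implicits._ in scope), passing a function like transform to .map should leave every unnamed field untouched.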
I did not consider using a UDF for this problem because, even if the UDF returns a struct of many columns, I do not know how to change multiple columns using withColumn(), as it only accepts "one" column name. However, solutions using a UDF are just as welcome as ones using .map over rows.
How would when help to modify multiple columns? Could you be more explicit? Thanks
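To make the when suggestion more concrete, one possible sketch is below. It assumes Spark SQL is on the classpath and chains one withColumn call per changed column, so columns that are never named pass through unchanged. MultiColumnUpdate, updateMap and update are hypothetical names, and upper-casing the map is again just a placeholder for the real rule.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf, upper, when}

object MultiColumnUpdate {
  // Placeholder UDF (an assumption): upper-case every key and value of a map column.
  val updateMap = udf { (m: Map[String, String]) =>
    if (m == null) null
    else m.map { case (k, v) => (k.toUpperCase, v.toUpperCase) }
  }

  def update(df: DataFrame): DataFrame = {
    // The condition may reference any other column of the same row.
    val isCity = col("id") === "city"

    // "name" changes only when the condition holds; otherwise it keeps its value.
    val withName =
      df.withColumn("name", when(isCity, upper(col("name"))).otherwise(col("name")))

    // Extend this list with the other mapX columns that need the same treatment.
    val mapCols = Seq("map30")

    // One withColumn per changed column; all remaining columns are left as they are.
    mapCols.foldLeft(withName) { (acc, c) =>
      acc.withColumn(c, when(isCity, updateMap(col(c))).otherwise(col(c)))
    }
  }
}
```

Calling MultiColumnUpdate.update(df) would return a dataframe with the same schema. withColumn does take one column name at a time, but the calls can be chained (or folded over a list, as here), so the columns that do not change never have to be written out.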