1

I would like to change the value of multiple fields in a row of a dataframe df. Normally, I would do a row to row transformation using a map. Something like:

+---+---------+
|num|name     |
+---+---------+
|  1|Hydrogen |
|  2|Helium   |
+---+---------+
df.map(row=>{
      val name = row.getAs("name").toString.toUpperCase
      (row(0),name)
    })

But now I have a dataframe which has a very elaborate schema of many columns, out of which I would want to change the value of only some columns. The change in the value of one column is dependent on other columns. How can I avoid writing all the column values (like row.get(0), row.get(1) ... row.get(30)) in the tuple but only write the ones which have changed? Consider a df with this schema:

case class DFSchema(id: String, name: String, map1: Map[String, String], ... , map30[Sting, String])

I want to update the keys and values of df.select("map30") and modify "name" only if id is "city". Of course, there are more such transformations to be made in other columns (represented in schema as mapX.

I did not consider using UDF for this problem as even if the UDF returns a struct of many columns, I do not know how to change multiple columns using withColumn() as it only accepts "one" column name. However, solutions using UDF are equally welcome as using .map over rows.

3
  • 1
    Can you add an example of the desired output based on your input? i understand the complexity part, but don't visualize the challenge... Commented Feb 8, 2019 at 0:49
  • 3
    would a combination of withColumn and when function not work for your case? Commented Feb 8, 2019 at 1:03
  • @jpg, I have added a schema and sample problem. @urug I don't understand how using when would help to modify multiple columns. Could you be more explicit. Thanks Commented Feb 8, 2019 at 14:48

2 Answers 2

3

You can try something like this:

val rules = Seq(
  "columnA" -> lit(20),
  "columnB" -> col("columnB").plus(col("columnC")),
  "columnC" -> col("columnC").minus(col("columnD")),
  "columnN" -> col("columnA").plus(col("columnB")).plus(col("columnC"))
)

def (inputDf: DataFrame): DataFrame = {
  rules.foldLeft(inputDf) {
    case (df, (columnName, ruleColumn)) => df.withColumn(columnName, ruleColumn)
  }
}

Here we have rules which is a sequence of pairs where the first value is a name of the target column we want to change/add and the second one is a rule which should be applied using dependent columns.

Using the foldLeft operation we apply all the rules to the input DataFrame.

Sign up to request clarification or add additional context in comments.

Comments

2

You can try this :

   df.show(false)

    val newColumns = df.columns.map { x =>
      if (x == "name") {
        when(col("id") === "city", lit("miami")).otherwise(col("name")).as("name")
      } else if (x == "map30") {
        when(col("id") === "city", map(lit("h"), lit("update"), lit("n"), lit("new"))).otherwise(col("map30")).as("map30")
      } else {
        col(x).as(x)
      }
    }

    val cleanDf = df.select(newColumns: _*)

    cleanDf.show(false)

2 Comments

Hi there! This may solve the OP's problem, but it would be an even better answer if it explained why it solves the OP's problem. What does your answer do that the OP didn't think of, and why is it the right thing?
This solves the problem ,because select all atributes of your dataframe and when select for example the column 'name' , I verify with the when method thad this column contain city and if this meets the conditicion aply update otherwise return the same column without transformation

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.