
I currently have code that repeatedly applies the same procedure to multiple DataFrame columns via chains of .withColumn calls, and I want to write a function to streamline the procedure. In my case, I am computing cumulative sums over columns, aggregated by key:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val newDF = oldDF
  .withColumn("cumA", sum("A").over(Window.partitionBy("ID").orderBy("time")))
  .withColumn("cumB", sum("B").over(Window.partitionBy("ID").orderBy("time")))
  .withColumn("cumC", sum("C").over(Window.partitionBy("ID").orderBy("time")))
  //.withColumn(...)

What I would like is either something like:

def createCumulativeColumns(cols: Array[String], df: DataFrame): DataFrame = {
  // Implement the above cumulative sums, partitioning, and ordering
}

or better yet:

def withColumns(cols: Array[String], df: DataFrame, f: function): DataFrame = {
  // Implement a udf/arbitrary function on all the specified columns
}

3 Answers


You can use select with varargs including *:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
import spark.implicits._

df.select($"*" +: Seq("A", "B", "C").map(c =>
  sum(c).over(Window.partitionBy("ID").orderBy("time")).alias(s"cum$c")
): _*)

This:

  • Maps column names to window expressions with Seq("A", ...).map(...)
  • Prepends all pre-existing columns with $"*" +: ....
  • Unpacks the combined sequence with ... : _*.
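For the three columns above, the mapped select expands to the equivalent of the following (a sketch, with the window factored into a val for readability):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
import spark.implicits._

val w = Window.partitionBy("ID").orderBy("time")

df.select(
  $"*",
  sum("A").over(w).alias("cumA"),
  sum("B").over(w).alias("cumB"),
  sum("C").over(w).alias("cumC")
)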

The pattern can be generalized as:

import org.apache.spark.sql.{Column, DataFrame}

/**
 * @param cols a sequence of columns to transform
 * @param df an input DataFrame
 * @param f a function to be applied on each col in cols
 */
def withColumns(cols: Seq[String], df: DataFrame, f: String => Column) =
  df.select($"*" +: cols.map(c => f(c)): _*)
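For example, the cumulative sums from the question could then be written as (a sketch, reusing the same window spec as above):

val w = Window.partitionBy("ID").orderBy("time")

val newDF = withColumns(
  Seq("A", "B", "C"),
  oldDF,
  c => sum(c).over(w).alias(s"cum$c")
)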

If you find withColumn syntax more readable, you can use foldLeft:

Seq("A", "B", "C").foldLeft(df)((df, c) =>
  df.withColumn(s"cum$c",  sum(c).over(Window.partitionBy("ID").orderBy("time")))
)

which can be generalized, for example, to:

/**
 * @param cols a sequence of columns to transform
 * @param df an input DataFrame
 * @param f a function to be applied on each col in cols
 * @param name a function mapping from input to output name.
 */
def withColumns(cols: Seq[String], df: DataFrame,
    f: String => Column, name: String => String = identity) =
  cols.foldLeft(df)((df, c) => df.withColumn(name(c), f(c)))
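A call that reproduces the original chain would then look like this (a sketch; the name argument prefixes each output column with "cum"):

val w = Window.partitionBy("ID").orderBy("time")

val newDF = withColumns(
  Seq("A", "B", "C"),
  oldDF,
  c => sum(c).over(w),
  c => s"cum$c"
)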

2 Comments

is there a way to apply a list of functions to those same columns dynamically? so if there are N columns and M functions, there would be N*M new columns.
@jgaw Just use a nested loop: either for { f <- funs; c <- cols } yield f(c) or funs.flatMap(f => cols.map(c => f(c)))
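A minimal sketch of that nested-loop expansion, assuming a few illustrative aggregate functions (sum, avg, max) and the same window as above:

import org.apache.spark.sql.Column
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, col, max, sum}
import spark.implicits._

val w = Window.partitionBy("ID").orderBy("time")
val cols = Seq("A", "B", "C")

// Illustrative (name, function) pairs; any Column => Column works here.
val funs: Seq[(String, Column => Column)] = Seq(
  ("sum", (c: Column) => sum(c)),
  ("avg", (c: Column) => avg(c)),
  ("max", (c: Column) => max(c))
)

// N columns x M functions => N*M new columns.
val exprs = for { (fname, f) <- funs; c <- cols }
  yield f(col(c)).over(w).alias(s"${fname}_$c")

df.select($"*" +: exprs: _*)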

The question is a bit old, but I thought it would be useful (perhaps for others) to note that folding over the list of columns with the DataFrame as the accumulator and mapping the columns into a single select have substantially different performance outcomes when the number of columns is not trivial (see here for the full explanation). Long story short: for a few columns foldLeft is fine; otherwise map is better.
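A quick way to see the difference is to compare the plans the two variants generate with explain (a sketch; the wide column list below is hypothetical):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
import spark.implicits._

val w = Window.partitionBy("ID").orderBy("time")
val manyCols = (1 to 100).map(i => s"col_$i")  // hypothetical wide schema

// foldLeft: every withColumn adds another projection to the unoptimized plan
manyCols.foldLeft(df)((acc, c) => acc.withColumn(s"cum$c", sum(c).over(w))).explain(true)

// map + single select: all new columns go through one projection
df.select($"*" +: manyCols.map(c => sum(c).over(w).alias(s"cum$c")): _*).explain(true)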

2 Comments

Is this still valid with Spark 2.4?
As far as I know, yes.

In PySpark:

from pyspark.sql import Window
import pyspark.sql.functions as F

window = Window.partitionBy("ID").orderBy("time")

df.select(
    "*", # selects all existing columns
    *[
        F.sum(col).over(window).alias(col_name)
        for col, col_name in zip(["A", "B", "C"], ["cumA", "cumB", "cumC"])
    ]
)

