In plain R, this is how I apply a function to multiple columns at once, using data.table:

d <- data.table(V1 = rep(1:2, 4:3), V2 = c(1, 2, 4, 5, 2, 3, 4), V3 = 1:7, V4 = sample(letters, 7))
Cols <- c("V2", "V3")
d[, (Cols) := lapply(.SD, function(x) x * 100), .SDcols = Cols]

But now, I'm trying to replicate the same on a SparkDataFrame, in Azure Databricks with SparkR.

I looked into dapply, ..., and spark.lapply, but I can't figure out how to apply the same function to several columns of a SparkDataFrame.

1 Answer

You can extract the column names as a list with the SparkR::colnames function, then use base::lapply over that list. Note that the function passed to lapply has to treat each column as a Spark Column object (SparkR::column). Example below:

df <- data.frame(v1 = c(1:3), v2 = c(3:5), v3 = c(8:10))
sdf <- SparkR::createDataFrame(df)
cols <- SparkR::colnames(sdf)
modify_cols <- c("v2", "v3")
# Build one Column expression per column: untouched columns pass through,
# the others are multiplied by 100 and re-aliased to their original names.
spark_cols_new <- lapply(cols, function(x) {
    if (!x %in% modify_cols) {
      SparkR::column(x)
    } else {
      SparkR::alias(SparkR::column(x) * SparkR::lit(100), x)
    }
})
sdf_new <- SparkR::select(sdf, spark_cols_new)

Note that if you are multiplying by a constant, you can provide it directly instead of wrapping it in SparkR::lit, but using lit is the safer choice.
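As an alternative to building the full select list, the same per-column transformation can be done with SparkR::withColumn in a loop, since withColumn replaces a column when the name already exists. A minimal sketch, assuming a SparkR session has already been started with sparkR.session():

```r
library(SparkR)

# Assumes sparkR.session() is already running (e.g. in Databricks).
sdf <- createDataFrame(data.frame(v1 = 1:3, v2 = 3:5, v3 = 8:10))
modify_cols <- c("v2", "v3")

for (colname in modify_cols) {
  # withColumn overwrites the existing column of the same name
  sdf <- withColumn(sdf, colname, column(colname) * lit(100))
}

showDF(sdf)
```

This keeps the untouched columns as-is without listing them, at the cost of issuing one withColumn per modified column.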

Comment:
Explanations + example = a very big step for me, especially the concept of columns and the way they should be used. Thank you very much!
