In plain R, this is how I apply a function to multiple columns at once, using data.table:

d <- data.table(V1 = rep(1:2, 4:3), V2 = c(1, 2, 4, 5, 2, 3, 4), V3 = 1:7, V4 = sample(letters, 7))
Cols <- c("V2", "V3")
d[, (Cols) := lapply(.SD, function(x) x * 100), .SDcols = Cols]

But now, I'm trying to replicate the same on a SparkDataFrame, in Azure Databricks with SparkR.

I looked into dapply, ..., and spark.lapply, but I can't figure out how to apply the same function to several columns of a SparkDataFrame.

1 Answer

You can extract the column names as a list with the SparkR::colnames function, then use base::lapply over that list. Note that the function passed to lapply has to treat each column as a Spark Column object (SparkR::column). Example below:

df <- data.frame(v1 = c(1:3), v2 = c(3:5), v3 = c(8:10))
sdf <- SparkR::createDataFrame(df)
cols <- SparkR::colnames(sdf)
modify_cols <- c("v2", "v3")
# Build one Column expression per column: untouched columns pass through,
# the others are multiplied by 100 and re-aliased to their original names.
spark_cols_new <- lapply(cols, function(x) {
    if (!x %in% modify_cols) {
      SparkR::column(x)
    } else {
      SparkR::alias(SparkR::column(x) * SparkR::lit(100), x)
    }
})
sdf_new <- SparkR::select(sdf, spark_cols_new)

Note that if you are multiplying by a constant, you can provide it directly instead of wrapping it in SparkR::lit, but using lit is the safer choice.
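As an alternative to building the full select list, the same per-column transformation can be done with SparkR::withColumn in a loop, since withColumn replaces a column when the name already exists. A minimal sketch, assuming a SparkR session has already been started with sparkR.session():

```r
library(SparkR)

# Assumes sparkR.session() is already running (e.g. in Databricks).
sdf <- createDataFrame(data.frame(v1 = 1:3, v2 = 3:5, v3 = 8:10))
modify_cols <- c("v2", "v3")

for (colname in modify_cols) {
  # withColumn overwrites the existing column of the same name
  sdf <- withColumn(sdf, colname, column(colname) * lit(100))
}

showDF(sdf)
```

This keeps the untouched columns as-is without listing them, at the cost of issuing one withColumn per modified column.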

Comment:
Explanations + example = a very big step for me, especially the concept of columns and the way they should be used. Thank you very much!
