0

Here is my toy dataframe.

library(tibble); library(SparkR)

df <- tibble::tribble(
  ~var1, ~var2, ~maxofvar1var2,
  1L,    1L,    1L,
  2L,    1L,    2L,
  2L,    3L,    3L,
  NA,    2L,    2L,
  1L,    4L,    4L,
  8L,    5L,    8L)

df <- df %>% as.DataFrame()

How can I calculate the rowwise calculation using SparkR to get the max value of var1 and var2 as shown in the 3rd variable in the df above? If there is no rowwise function in SparkR, how can I get the desired output?

1 Answer 1

1

To get a maximum value from a set of columns use SparkR::greatest:

df %>% withColumn("maxOfVars", greatest(df$var1, df$var2))

and in general case higher order functions, like aggregate (Spark 2.4 or later), on assembled data .

df %>% withColumn("theLastVar", expr("aggregate(array(var1, var2), (x, y) -> y)"))

or (version independent) composition of expressions:

scols <- c("var1", "var2") %>% purrr::map(column)

sumOfVars <- scols %>%
  purrr::map(function(x) coalesce(x, lit(0)))  %>%
  purrr::reduce(function(x, y) x + y, .init=lit(0))

countOfVars <- scols %>% 
  purrr::map(function(x) ifelse(isNotNull(x), lit(1), lit(0))) %>%
  purrr::reduce(
    function(x, y) x + y, .init=lit(0))

df %>% withColumn("meanOfVars", sumOfVars / countOfVars)
Sign up to request clarification or add additional context in comments.

1 Comment

Greatest is an interesting name for a function! Thanks!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.