SparkR. How to count distinct values for all columns in a Spark DataFrame?

Question

Is there a way to count the number of distinct values in each column of a Spark DataFrame? For example, given this dataset:

set.seed(123)
df <- data.frame(ColA = rep(c("dog", "cat", "fish", "shark"), 4),
                 ColB = rnorm(16),
                 ColC = rep(1:8, 2))
df

I do this in R to get the counts:

sapply(df, function(x) length(unique(x)))

# ColA ColB ColC
#    4   16    8

How would I go about doing the same thing for this Spark DataFrame?

sdf <- SparkR::createDataFrame(df)

Any help is greatly appreciated. Thank you in advance. -nate

akuiper · Accepted Answer · 2017-09-22 21:29:38Z


This works for me in SparkR:

# Build one countDistinct() expression per column, aliased to the column name
exprs <- lapply(names(sdf), function(x) alias(countDistinct(sdf[[x]]), x))
# Use do.call to splice the list of expressions into agg() as separate arguments
head(do.call(agg, c(x = sdf, exprs)))

#  ColA ColB ColC
#1    4   16    8
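To get a named vector like the `sapply` result in the question, the same approach can be wrapped in a small helper that collects the one-row result locally. This is only a sketch: `count_distinct_all` is a hypothetical name, not part of SparkR, and the code assumes a local Spark installation that `sparkR.session()` can start against.

```r
library(SparkR)

# Assumption: Spark is installed locally and SparkR can start a session.
sparkR.session(master = "local[1]")

set.seed(123)
df <- data.frame(ColA = rep(c("dog", "cat", "fish", "shark"), 4),
                 ColB = rnorm(16),
                 ColC = rep(1:8, 2))
sdf <- createDataFrame(df)

# Hypothetical helper (not part of SparkR): distinct count per column,
# returned as a named vector like the sapply() result.
count_distinct_all <- function(sdf) {
  # One aliased countDistinct() expression per column
  exprs <- lapply(columns(sdf), function(x) alias(countDistinct(sdf[[x]]), x))
  # Splice the expressions into agg() and bring the single row back to R
  res <- collect(do.call(agg, c(list(x = sdf), exprs)))
  unlist(res[1, ])
}

result <- count_distinct_all(sdf)
print(result)
```

One difference to keep in mind: `countDistinct` follows SQL `COUNT(DISTINCT ...)` semantics and ignores nulls, whereas `length(unique(x))` in base R counts `NA` as a value, so the two can disagree on columns with missing data.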

