
The question is pretty much in the title: Is there an efficient way to count the distinct values in every column in a DataFrame?

The describe method provides only the count but not the distinct count, and I wonder if there is a way to get the distinct count for all (or some selected) columns.


6 Answers


In pySpark you could do something like this, using countDistinct():

from pyspark.sql.functions import col, countDistinct

df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns))

Similarly in Scala:

import org.apache.spark.sql.functions.countDistinct
import org.apache.spark.sql.functions.col

df.select(df.columns.map(c => countDistinct(col(c)).alias(c)): _*)

If you want to speed things up at the potential loss of accuracy, you could also use approxCountDistinct().
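For example, a minimal sketch of the approximate variant, using the newer approx_count_distinct name (approxCountDistinct is its older, deprecated alias) and assuming the same df as above:

from pyspark.sql.functions import approx_count_distinct, col

# Approximate distinct count per column; faster, at the cost of a small error.
df.agg(*(approx_count_distinct(col(c)).alias(c) for c in df.columns)).show()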


2 Comments

can you please explain what's the * for here (in your pyspark solution)?
The star operator in Python can be used to unpack the arguments from the iterator for the function call, also see here.
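A tiny sketch of what that unpacking does (exprs is just an illustrative name):

exprs = [countDistinct(col(c)).alias(c) for c in df.columns]
df.agg(*exprs)   # equivalent to df.agg(exprs[0], exprs[1], ...)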

Multiple aggregations would be quite expensive to compute. I suggest that you use approximation methods instead. In this case, approximating the distinct count:

val df = Seq((1,3,4),(1,2,3),(2,3,4),(2,3,5)).toDF("col1","col2","col3")

val exprs = df.columns.map((_ -> "approx_count_distinct")).toMap
df.agg(exprs).show()
// +---------------------------+---------------------------+---------------------------+
// |approx_count_distinct(col1)|approx_count_distinct(col2)|approx_count_distinct(col3)|
// +---------------------------+---------------------------+---------------------------+
// |                          2|                          2|                          3|
// +---------------------------+---------------------------+---------------------------+

The approx_count_distinct method relies on HyperLogLog under the hood.

The HyperLogLog algorithm and its variant HyperLogLog++ (implemented in Spark) rely on the following clever observation.

If the numbers are spread uniformly across a range, then the count of distinct elements can be approximated from the largest number of leading zeros in the binary representation of the numbers.

For example, if we observe a number whose digits in binary form are of the form 0…(k times)…01…1, then we can estimate that there are on the order of 2^k elements in the set. This is a very crude estimate, but it can be refined to great precision with a sketching algorithm.
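To make that observation concrete, here is a toy Python sketch of the crude estimator (only an illustration of the idea, not Spark's HyperLogLog++ implementation; the function name and hash choice are arbitrary):

import hashlib

def crude_distinct_estimate(values, bits=32):
    # Hash each value to a pseudo-uniform integer in [0, 2^bits) and track
    # the largest number of leading zeros observed across all hashes.
    max_leading_zeros = 0
    for v in values:
        h = int(hashlib.md5(str(v).encode()).hexdigest(), 16) & ((1 << bits) - 1)
        max_leading_zeros = max(max_leading_zeros, bits - h.bit_length())
    # Crude estimate: roughly 2^k distinct elements for k leading zeros.
    return 2 ** max_leading_zeros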

A thorough explanation of the mechanics behind this algorithm can be found in the original paper.

Note: Starting with Spark 1.6, when Spark executes SELECT SOME_AGG(DISTINCT foo), SOME_AGG(DISTINCT bar) FROM df, each DISTINCT clause triggers a separate aggregation. This differs from SELECT SOME_AGG(foo), SOME_AGG(bar) FROM df, where we aggregate once. Thus the performance won't be comparable between count(distinct _) and approxCountDistinct (or approx_count_distinct).

It's one of the behavior changes since Spark 1.6:

With the improved query planner for queries having distinct aggregations (SPARK-9241), the plan of a query having a single distinct aggregation has been changed to a more robust version. To switch back to the plan generated by Spark 1.5’s planner, please set spark.sql.specializeSingleDistinctAggPlanning to true. (SPARK-12077)

Reference: Approximate Algorithms in Apache Spark: HyperLogLog and Quantiles.
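A minimal way to see this difference yourself (a sketch, assuming a DataFrame df with columns foo and bar) is to compare the physical plans:

from pyspark.sql.functions import countDistinct, approx_count_distinct

# Exact distinct counts: the planner handles each DISTINCT aggregate separately.
df.agg(countDistinct("foo"), countDistinct("bar")).explain()

# Approximate distinct counts: regular aggregates, typically a single aggregation pass.
df.agg(approx_count_distinct("foo"), approx_count_distinct("bar")).explain()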

3 Comments

Just a caveat: Note that for columns where almost every value is unique, approx_count_distinct might give up to 10% error in the default configuration and might actually take the same time as count_distinct. It might even return a value that is higher than the actual row count.
That's correct but the bigger your dataset is, the lower that error is.
how do you do this in PySpark?
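In PySpark, a sketch of the same per-column approximate aggregation (mirroring the Scala Map above) is the dict form of agg:

df.agg({c: "approx_count_distinct" for c in df.columns}).show()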

If you just want the count for a particular column, then the following could help. Although it's a late answer, it might help someone. (Tested with PySpark 2.2.0.)

from pyspark.sql.functions import col, countDistinct
df.agg(countDistinct(col("colName")).alias("count")).show()

1 Comment

If you want the answer in a variable, rather than displayed to the user, replace the .show() with .collect()[0][0]

Adding to desaiankitb's answer, this would provide you a more intuitive answer:

from pyspark.sql.functions import count

df.groupBy(colname).count().show()

2 Comments

Great answer for those who wish to display the counts of each unique value occurring within a column of choice.
To get # of unique values, just len(df.groupBy(colname).count().show()) ?
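Note on the last comment: .show() only prints the table and returns None, so len(...) on it won't work. A small sketch of getting the number of unique values instead (colname is still a placeholder):

df.select(colname).distinct().count()
# or, equivalently, count the rows of the grouped result
df.groupBy(colname).count().count()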

You can use SQL's count(DISTINCT column_name) function.

Alternatively, if you are doing data analysis and want a rough estimate rather than the exact count of each and every column, you can use the approx_count_distinct function: approx_count_distinct(expr[, relativeSD]).
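For example, a minimal Spark SQL sketch (my_table, col1, and the 0.05 relative error are illustrative; it assumes the DataFrame has been registered as a temp view):

df.createOrReplaceTempView("my_table")

spark.sql("""
    SELECT count(DISTINCT col1)              AS exact_distinct,
           approx_count_distinct(col1, 0.05) AS approx_distinct
    FROM my_table
""").show()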



This is one way to create a DataFrame with the distinct count of every column:

df = df.to_pandas_on_spark()

collect_df = []
for i in df.columns:
    collect_df.append({"field_name": i, "unique_count": df[i].nunique()})

uniquedf = spark.createDataFrame(collect_df)

The output is a DataFrame with a field_name and a unique_count column. I used this with another DataFrame to compare values where the column names are the same; the other DataFrame was created the same way and then joined:

df_prod_merged = uniquedf1.join(uniquedf2, on='field_name', how="left")

This is an easy way to do it. It might be expensive to process on very huge data (like 1 TB), but it is still very efficient when using to_pandas_on_spark().


