Pyspark - Calculate number of null values in each dataframe column

Question

I have a dataframe with many columns. My aim is to produce a dataframe thats lists each column name, along with the number of null values in that column.

Example:

+-------------+-------------+
| Column_Name | NULL_Values |
+-------------+-------------+
|  Column_1   |      15     |
|  Column_2   |      56     |
|  Column_3   |      18     |
|     ...     |     ...     |
+-------------+-------------+

I have managed to get the number of null values for ONE column like so:

df.agg(F.count(F.when(F.isnull(c), c)).alias('NULL_Count'))

where c is a column in the dataframe. However, it does not show the name of the column. The output is:

+------------+
| NULL_Count |
+------------+
|     15     |
+------------+

Any ideas?

pault · Accepted Answer · 2018-09-12 16:46:55Z

17

You can use a list comprehension to loop over all of your columns in the agg, and use alias to rename the output column:

import pyspark.sql.functions as F

df_agg = df.agg(*[F.count(F.when(F.isnull(c), c)).alias(c) for c in df.columns])

However, this will return the results in one row as shown below:

df_agg.show()
#+--------+--------+--------+
#|Column_1|Column_2|Column_3|
#+--------+--------+--------+
#|      15|      56|      18|
#+--------+--------+--------+

If you wanted the results in one column instead, you could union each column from df_agg using functools.reduce as follows:

from functools import reduce
df_agg_col = reduce(
    lambda a, b: a.union(b),
    (
        df_agg.select(F.lit(c).alias("Column_Name"), F.col(c).alias("NULL_Count")) 
        for c in df_agg.columns
    )
)
df_agg_col.show()
#+-----------+----------+
#|Column_Name|NULL_Count|
#+-----------+----------+
#|   Column_1|        15|
#|   Column_2|        56|
#|   Column_3|        18|
#+-----------+----------+

Or you can skip the intermediate step of creating df_agg and do:

df_agg_col = reduce(
    lambda a, b: a.union(b),
    (
        df.agg(
            F.count(F.when(F.isnull(c), c)).alias('NULL_Count')
        ).select(F.lit(c).alias("Column_Name"), "NULL_Count")
        for c in df.columns
    )
)

edited Sep 12, 2018 at 16:46

answered Sep 12, 2018 at 16:27

pault

43.7k17 gold badges121 silver badges161 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

LEJ Over a year ago

Works perfectly, thank you! Could you explain what the asterisk does in your first aggregate function?

pault Over a year ago

@LEJ the * is for argument unpacking. This syntax "unpacks" the contents of the list so they can be passed as arguments to the function.

ravi malhotra · Accepted Answer · 2020-07-21 14:57:12Z

0

Scala alternative could be

case class Test(id:Int, weight:Option[Int], age:Int, gender: Option[String])

val df1 = Seq(Test(1, Some(100), 23, Some("Male")), Test(2, None, 25, None), Test(3, None, 33, Some("Female"))).toDF()

df1.show()

+---+------+---+------+
| id|weight|age|gender|
+---+------+---+------+
|  1|   100| 23|  Male|
|  2|  null| 25|  null|
|  3|  null| 33|Female|
+---+------+---+------+

val s = df1.columns.map(c => sum(col(c).isNull.cast("integer")).alias(c))

val df2 = df1.agg(s.head, s.tail:_*)

val t = df2.columns.map(c => df2.select(lit(c).alias("col_name"), col(c).alias("null_count")))

val df_agg_col = t.reduce((df1, df2) => df1.union(df2))

df_agg_col.show()

answered Jul 21, 2020 at 14:57

ravi malhotra

7335 silver badges14 bronze badges

Collectives™ on Stack Overflow

Pyspark - Calculate number of null values in each dataframe column

2 Answers 2

2 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related