5

I have a dataframe with many columns. My aim is to produce a dataframe thats lists each column name, along with the number of null values in that column.

Example:

+-------------+-------------+
| Column_Name | NULL_Values |
+-------------+-------------+
|  Column_1   |      15     |
|  Column_2   |      56     |
|  Column_3   |      18     |
|     ...     |     ...     |
+-------------+-------------+

I have managed to get the number of null values for ONE column like so:

df.agg(F.count(F.when(F.isnull(c), c)).alias('NULL_Count'))

where c is a column in the dataframe. However, it does not show the name of the column. The output is:

+------------+
| NULL_Count |
+------------+
|     15     |
+------------+

Any ideas?

2 Answers 2

17

You can use a list comprehension to loop over all of your columns in the agg, and use alias to rename the output column:

import pyspark.sql.functions as F

df_agg = df.agg(*[F.count(F.when(F.isnull(c), c)).alias(c) for c in df.columns])

However, this will return the results in one row as shown below:

df_agg.show()
#+--------+--------+--------+
#|Column_1|Column_2|Column_3|
#+--------+--------+--------+
#|      15|      56|      18|
#+--------+--------+--------+

If you wanted the results in one column instead, you could union each column from df_agg using functools.reduce as follows:

from functools import reduce
df_agg_col = reduce(
    lambda a, b: a.union(b),
    (
        df_agg.select(F.lit(c).alias("Column_Name"), F.col(c).alias("NULL_Count")) 
        for c in df_agg.columns
    )
)
df_agg_col.show()
#+-----------+----------+
#|Column_Name|NULL_Count|
#+-----------+----------+
#|   Column_1|        15|
#|   Column_2|        56|
#|   Column_3|        18|
#+-----------+----------+

Or you can skip the intermediate step of creating df_agg and do:

df_agg_col = reduce(
    lambda a, b: a.union(b),
    (
        df.agg(
            F.count(F.when(F.isnull(c), c)).alias('NULL_Count')
        ).select(F.lit(c).alias("Column_Name"), "NULL_Count")
        for c in df.columns
    )
)
Sign up to request clarification or add additional context in comments.

2 Comments

Works perfectly, thank you! Could you explain what the asterisk does in your first aggregate function?
@LEJ the * is for argument unpacking. This syntax "unpacks" the contents of the list so they can be passed as arguments to the function.
0

Scala alternative could be

case class Test(id:Int, weight:Option[Int], age:Int, gender: Option[String])

val df1 = Seq(Test(1, Some(100), 23, Some("Male")), Test(2, None, 25, None), Test(3, None, 33, Some("Female"))).toDF()

df1.show()

+---+------+---+------+
| id|weight|age|gender|
+---+------+---+------+
|  1|   100| 23|  Male|
|  2|  null| 25|  null|
|  3|  null| 33|Female|
+---+------+---+------+

val s = df1.columns.map(c => sum(col(c).isNull.cast("integer")).alias(c))

val df2 = df1.agg(s.head, s.tail:_*)

val t = df2.columns.map(c => df2.select(lit(c).alias("col_name"), col(c).alias("null_count")))

val df_agg_col = t.reduce((df1, df2) => df1.union(df2))

df_agg_col.show()

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.