I have a large dataset from which I would like to drop every column that contains null values, returning a new DataFrame. How can I do that?

The following attempts either handle only a single named column or drop rows instead of columns:

df.where(col("dt_mvmt").isNull())  # doesn't work: I don't know all the column names, and there are thousands of columns
df.filter(df.dt_mvmt.isNotNull())  # same reason as above
df.na.drop() #drops rows that contain null, instead of columns that contain null

For example

a |  b  | c
1 |     | 0
2 |  2  | 3

In the above case, the whole column b should be dropped because one of its values is empty.

2 Answers

Here is one possible approach for dropping all columns that contain NULL values. See here for the source of the code that counts NULL values per column.

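import pandas as pd
import pyspark.sql.functions as F

# Sample data
df = pd.DataFrame({'x1': ['a', '1', '2'],
                   'x2': ['b', None, '2'],
                   'x3': ['c', '0', '3'] })
# Assumes an existing sqlContext; with a SparkSession, use spark.createDataFrame(df)
df = sqlContext.createDataFrame(df)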
df.show()

def drop_null_columns(df):
    """
    This function drops all columns which contain null values.
    :param df: A PySpark DataFrame
    """
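    # Count nulls per column in one pass: when() yields a value only for null cells,
    # and count() counts only non-null values
    null_counts = df.select(
        [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
    ).collect()[0].asDict()
    # Columns that contain at least one null
    to_drop = [k for k, v in null_counts.items() if v > 0]
    return df.drop(*to_drop)

# Drops column x2, because it contains null values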
drop_null_columns(df).show()

Before:

+---+----+---+
| x1|  x2| x3|
+---+----+---+
|  a|   b|  c|
|  1|null|  0|
|  2|   2|  3|
+---+----+---+

After:

+---+---+
| x1| x3|
+---+---+
|  a|  c|
|  1|  0|
|  2|  3|
+---+---+
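
The comments on this answer mention a threshold variant that is no longer shown. The sketch below is one way such a variant might look; it is not the answerer's original code. The names `columns_over_threshold` and `threshold` are hypothetical, and the assumption is that a column should be dropped when its fraction of nulls exceeds `threshold`. With the default of 0.0 it behaves like `drop_null_columns` above.

```python
def columns_over_threshold(null_counts, total_rows, threshold=0.0):
    """Return the column names whose fraction of nulls exceeds `threshold`.

    null_counts: dict mapping column name -> number of null values
    total_rows:  total number of rows in the DataFrame
    threshold:   allowed null fraction in [0, 1]; 0.0 drops any column with a null
    """
    if total_rows == 0:
        return []
    return [c for c, n in null_counts.items() if n / total_rows > threshold]


def drop_null_columns_with_threshold(df, threshold=0.0):
    """Drop columns of a PySpark DataFrame whose null fraction exceeds `threshold`."""
    # Imported here so the pure-Python helper above can be used without Spark.
    import pyspark.sql.functions as F

    # Single pass over the data: total row count first, then one null count per column.
    row = df.select(
        F.count(F.lit(1)),
        *[F.count(F.when(F.col(c).isNull(), 1)) for c in df.columns]
    ).collect()[0]
    total_rows = row[0]
    null_counts = dict(zip(df.columns, row[1:]))

    return df.drop(*columns_over_threshold(null_counts, total_rows, threshold))
```

On the sample data, `drop_null_columns_with_threshold(df, threshold=0.5)` would keep x2, since only one of its three values is null.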

Hope this helps!

3 Comments

Yes sir! It did help. How beautiful! The other 3 earlier lines also worked perfectly.
Glad I could help! I removed the threshold-part, maybe a bit confusing to future people who stumble upon this question.
@Florian You should keep the threshold part, it makes it a complete answer! It would be really helpful, thanks :)

If you need to keep only the rows in which at least one inspected column is not null, use the following. Its execution time is very low.

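import pyspark.sql.functions as F
from operator import or_
from functools import reduce

# Columns to inspect; here, all of them
inspected = df.columns
# Keep rows where at least one inspected column is not null
df = df.where(reduce(or_, (F.col(c).isNotNull() for c in inspected), F.lit(False)))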
