PySpark DataFrame: how to drop rows with nulls in all columns?

Question

Given a DataFrame that looks like this:

+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|null|null|
|null|   B|  X1|
+----+----+----+

I would like the result to look like this:

+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|   B|  X1|
+----+----+----+

I would prefer a general method that still works when df.columns is very long. Thanks!

zero323 · Accepted Answer · 2018-01-12 22:05:52Z

23

Passing a strategy to na.drop through the how argument is all you need:

df = spark.createDataFrame([
    (1, "B", "X1"), (None, None, None), (None, "B", "X1"), (None, "C", None)],
    ("ID", "TYPE", "CODE")
)

df.na.drop(how="all").show()

+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|   B|  X1|
|null|   C|null|
+----+----+----+

An alternative formulation uses thresh, the minimum number of non-null values a row must have to be kept:

df.na.drop(thresh=1).show()

+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|   B|  X1|
|null|   C|null|
+----+----+----+
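
To see why how="all" and thresh=1 keep the same rows, the rule can be modelled in plain Python without a Spark session. This is only a sketch of the semantics described above, not PySpark code; the helpers non_null_count and keep_row are hypothetical names introduced for illustration.

```python
# Plain-Python model of the row-keeping rule; this does not call PySpark.
rows = [
    (1, "B", "X1"),
    (None, None, None),
    (None, "B", "X1"),
    (None, "C", None),
]

def non_null_count(row):
    """Number of values in the row that are not None."""
    return sum(value is not None for value in row)

def keep_row(row, thresh):
    """Keep a row when it has at least `thresh` non-null values."""
    return non_null_count(row) >= thresh

# how="all": drop a row only when every value in it is null.
kept_how_all = [row for row in rows if not all(v is None for v in row)]

# thresh=1: keep a row when it has at least one non-null value.
kept_thresh_1 = [row for row in rows if keep_row(row, thresh=1)]

# Both rules select the same rows.
assert kept_how_all == kept_thresh_1
print(kept_thresh_1)
```

In this model, raising thresh to 2 would also drop the (None, "C", None) row, since it has only one non-null value.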

answered Jan 12, 2018 at 22:05

zero323

akuiper · Accepted Answer · 2018-01-12 15:24:20Z

6

One option is to use functools.reduce to construct the conditions:

from functools import reduce
df.filter(~reduce(lambda x, y: x & y, [df[c].isNull() for c in df.columns])).show()
+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|   B|  X1|
+----+----+----+

where reduce produces a condition like the following:

~reduce(lambda x, y: x & y, [df[c].isNull() for c in df.columns])
# Column<b'(NOT (((ID IS NULL) AND (TYPE IS NULL)) AND (CODE IS NULL)))'>
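
The left-to-right nesting in that expression comes from how reduce folds a list. As a rough illustration, the same fold can be run on plain strings instead of Spark Column objects; the string formatting here is an assumption chosen to mimic the representation shown above and is not how Spark builds it internally.

```python
from functools import reduce

# Stand-ins for df.columns and for df[c].isNull(); plain strings, not Columns.
columns = ["ID", "TYPE", "CODE"]
conditions = [f"({c} IS NULL)" for c in columns]

# reduce combines the first two items, then combines that result with the
# third, and so on, so the parentheses nest to the left.
combined = reduce(lambda x, y: f"({x} AND {y})", conditions)

# Negating the combined condition mirrors the leading ~ in the filter.
negated = f"(NOT {combined})"

print(negated)
```

Because the fold works over whatever list it is given, the same one-liner applies unchanged when df.columns is very long.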

answered Jan 12, 2018 at 15:24

akuiper

Abhishek Rai · Accepted Answer · 2020-11-24 07:19:48Z

0

You can also use dropna, which is an alias for na.drop:

df = df.dropna(how='all')

edited Nov 24, 2020 at 7:19

Abhishek Rai

answered Nov 24, 2020 at 4:27

Venkateswara Rao Rajanala
