
I have a pyspark dataframe like:

A    B     C
1    NA    9
4    2     5
6    4     2
5    1     NA

I want to delete rows which contain the value "NA"; in this case, the first and the last row. How can I implement this using Python and Spark?


Update based on comment: Looking for a solution that removes rows that have the string "NA" in any of the many columns.

  • What does NA mean? Does it mean a missing value for you? Or is it like that in your DataFrame? In that case your column B will be a String! Kindly confirm. Commented Feb 23, 2019 at 17:38
  • NA is not a missing value. It's a string keyword. I want to drop all the rows that contain string "NA". Commented Feb 23, 2019 at 20:04
  • Also, NA could also be present in another column, not necessarily in column B, so that row should also be dropped. Commented Feb 23, 2019 at 20:06
  • Yes, Spark has labeled them as String because of the "NA" values present there. I want to remove "NA" so that the columns can be typed as Integer. One way would be to replace "NA" with 0 everywhere, but I am not able to get the syntax right. Commented Feb 23, 2019 at 22:59
  • You should add a minimal example and what you've tried so far. A simple filter will do the job. Commented Feb 24, 2019 at 0:50

3 Answers


Just use a dataframe filter expression:

l = [('1','NA','9')
    ,('4','2', '5')
    ,('6','4','2')
    ,('5','NA','1')]
df = spark.createDataFrame(l,['A','B','C'])
#The following command requires that the checked columns are strings!
df = df.filter((df.A != 'NA') & (df.B != 'NA') & (df.C != 'NA'))
df.show()

+---+---+---+ 
|  A|  B|  C| 
+---+---+---+ 
|  4|  2|  5| 
|  6|  4|  2| 
+---+---+---+

@bluephantom: In case you have hundreds of columns, just generate a string expression via list comprehension:

# In my example all columns need to be checked
listOfRelevantStringColumns = df.columns
expr = ' and '.join('(%s != "NA")' % col_name for col_name in listOfRelevantStringColumns)
df.filter(expr).show()

7 Comments

What if hundreds of columns?
I have updated my post. I think the best way is to use a string expression for the filter method.
Agree, for sure.
What I am thinking is how we get this to work and stop as soon as the first match is found. I am looking into it from the Scala side.
Do you want to stop after the first row-based or the first column-based match? Maybe it makes sense to open a new question to discuss this.

In Scala I would have done this differently, but arrived at the following using pyspark. It is not my favourite answer, which is down to my more limited pyspark knowledge; things seem easier in Scala. Unlike with an array, there is no global match against all columns that can stop as soon as one is found. The approach is dynamic in terms of the number of columns.

Assumption made: the data does not contain ~~ as part of any value. I could have split to an array instead but decided not to here. Using None instead of NA.

from pyspark.sql import functions as f

data = [(1,    None,    4,    None),
        (2,    'c',     3,    'd'),
        (None, None,    None, None),
        (3,    None,    None, 'z')]
df = spark.createDataFrame(data, ['k', 'v1', 'v2', 'v3'])

columns = df.columns
columns_Count = len(df.columns)

# colCompare is a String; concat_ws skips nulls, so a row containing any null
# produces fewer than columns_Count elements when colCompare is split again
df2 = df.select(df['*'], f.concat_ws('~~', *columns).alias('colCompare'))
df3 = df2.filter(f.size(f.split(f.col("colCompare"), r"~~")) == columns_Count).drop("colCompare")
df3.show()

returns:

+---+---+---+---+
|  k| v1| v2| v3|
+---+---+---+---+
|  2|  c|  3|  d|
+---+---+---+---+

Comments


In case you want to remove the row:

df = df.filter((df.A != 'NA') | (df.B != 'NA'))

But sometimes we need to replace missing values with the mean (in the case of a numeric column) or the most frequent value (in the case of a categorical one). For that you need to add a column with the same name, which replaces the original column, i.e. "A":

from pyspark.sql.functions import mean, col, when
# mean() is an aggregate, so compute it first rather than using it directly inside withColumn
mean_A = df.filter(df.A != 'NA').agg(mean(col('A'))).first()[0]
df = df.withColumn("A", when(df.A == "NA", mean_A).otherwise(df.A))

2 Comments

Instead of | it should be &
& would be used if both columns had 'NA' values in a row; here | is used to filter if either of the two has an 'NA' value.
