
I have a pyspark dataframe like:

A    B     C
1    NA    9
4    2     5
6    4     2
5    1     NA

I want to delete rows which contain the value "NA"; in this case, the first and the last row. How can I implement this using Python and Spark?


Update based on comment: Looking for a solution that removes rows that have the string "NA" in any of the many columns.

  • What does NA mean? Does it mean a missing value for you? Or is it like that in your DataFrame? In that case your column B will be a String! Kindly confirm. Commented Feb 23, 2019 at 17:38
  • NA is not a missing value. It's a string keyword. I want to drop all the rows that contain string "NA". Commented Feb 23, 2019 at 20:04
  • Also, NA could also be present in another column, not necessarily in column B, so that row should also be dropped. Commented Feb 23, 2019 at 20:06
  • Yes, Spark has labeled them as String because of the "NA" values present there. I want to remove "NA" so that the columns can be typed as Integer. One way would be to replace "NA" with 0 everywhere, but I am not able to get the syntax right. Commented Feb 23, 2019 at 22:59
  • You should add a minimal example and what you've tried so far. A simple filter will do the job. Commented Feb 24, 2019 at 0:50

3 Answers


Just use a dataframe filter expression:

l = [('1','NA','9')
    ,('4','2', '5')
    ,('6','4','2')
    ,('5','NA','1')]
df = spark.createDataFrame(l,['A','B','C'])
#The following command requires that the checked columns are strings!
df = df.filter((df.A != 'NA') & (df.B != 'NA') & (df.C != 'NA'))
df.show()

+---+---+---+ 
|  A|  B|  C| 
+---+---+---+ 
|  4|  2|  5| 
|  6|  4|  2| 
+---+---+---+

@bluephantom: In case you have hundreds of columns, just generate a string expression via list comprehension:

# In my example all columns need to be checked
listOfRelevantStringColumns = df.columns
expr = ' and '.join('(%s != "NA")' % col_name for col_name in listOfRelevantStringColumns)
df.filter(expr).show()

7 Comments

What if hundreds of columns?
I have updated my post. I think the best way is to use a string expression for the filter method.
Agree, for sure.
What I am thinking is how we get this to work and stop as soon as the first match is found. I am looking into it from the Scala side.
Do you want to stop after the first row-based or the first column-based match? Maybe it makes sense to open a new question to discuss this.

In Scala I would have done this differently, but arrived at the following using pyspark. It is not my favourite answer, which is down to my more limited pyspark knowledge; things seem easier in Scala. Unlike with an array, there is no global match against all columns that can stop as soon as one is found. The approach is dynamic in terms of the number of columns.

Assumption made: the data does not contain ~~ as part of any value. I could have split to an array instead but decided not to here. Using None instead of NA.

from pyspark.sql import functions as f

data = [(1,    None,    4,    None),
        (2,    'c',     3,    'd'),
        (None, None,    None, None),
        (3,    None,    None, 'z')]
df = spark.createDataFrame(data, ['k', 'v1', 'v2', 'v3'])

columns = df.columns
columns_Count = len(df.columns)

# colCompare is a String; concat_ws skips nulls, so a row containing any null
# produces fewer than columns_Count elements when colCompare is split again
df2 = df.select(df['*'], f.concat_ws('~~', *columns).alias('colCompare'))
df3 = df2.filter(f.size(f.split(f.col("colCompare"), r"~~")) == columns_Count).drop("colCompare")
df3.show()

returns:

+---+---+---+---+
|  k| v1| v2| v3|
+---+---+---+---+
|  2|  c|  3|  d|
+---+---+---+---+

Comments


In case you want to remove the row:

df = df.filter((df.A != 'NA') | (df.B != 'NA'))

But sometimes we need to replace missing values with the mean (in the case of a numeric column) or the most frequent value (in the case of a categorical one). For that you need to add a column with the same name, which replaces the original column, i.e. "A":

from pyspark.sql.functions import mean, col, when
# mean() is an aggregate, so compute it first rather than using it directly inside withColumn
mean_A = df.filter(df.A != 'NA').agg(mean(col('A'))).first()[0]
df = df.withColumn("A", when(df.A == "NA", mean_A).otherwise(df.A))

2 Comments

Instead of | it should be &
& would be used if both columns had 'NA' values in a row; here | is used to filter if either of the two has an 'NA' value.
