
My DataFrame looks like the one below:

ID,FirstName,LastName
1,Navee,Srikanth
2,,Srikanth
3,Naveen,

My problem statement: I have to remove row number 2, since FirstName is null.

I am using the PySpark script below:

join_Df1= Name.filter(Name.col(FirstName).isnotnull()).show()

I am getting this error:

  File "D:\0\NameValidation.py", line 13, in <module>
join_Df1= filter(Name.FirstName.isnotnull()).show()

TypeError: 'Column' object is not callable

Can anyone please help me resolve this?


3 Answers


It looks like your DataFrame's FirstName column contains empty strings instead of nulls. Below are some options to try:

df = sqlContext.createDataFrame([[1,'Navee','Srikanth'], [2,'','Srikanth'] , [3,'Naveen','']], ['ID','FirstName','LastName'])
df.show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  2|         |Srikanth|
|  3|   Naveen|        |
+---+---------+--------+

df.where(df.FirstName.isNotNull()).show() # This doesn't remove the row because FirstName holds an empty string, not a null
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  2|         |Srikanth|
|  3|   Naveen|        |
+---+---------+--------+

df.where(df.FirstName != '').show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  3|   Naveen|        |
+---+---------+--------+

df.filter(df.FirstName != '').show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  3|   Naveen|        |
+---+---------+--------+

df.where("FirstName != ''").show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  3|   Naveen|        |
+---+---------+--------+



You should be doing this instead:

join_Df1.filter(join_Df1.FirstName.isNotNull()).show()

Hope this helps!



I think what you need is notnull().

So this is your input in the CSV file my_test.csv:

ID,FirstName,LastName
1,Navee,Srikanth
2,,Srikanth
3,Naveen

The code:

import pandas as pd
df = pd.read_csv("my_test.csv")

print(df[df['FirstName'].notnull()])

Output:

   ID FirstName  LastName
0   1     Navee  Srikanth
2   3    Naveen       NaN

This is what you want: df[df['FirstName'].notnull()]

Output of df['FirstName'].notnull():

0     True
1    False
2     True

This selects the rows of df where df['FirstName'].notnull() is True.

How is this checked? df['FirstName'].notnull() returns True when the FirstName value is present and False when it is NaN.
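Note that, as in the accepted answer, notnull() only catches genuine NaN values; if the CSV instead contained empty strings, you would have to filter those explicitly as well. A small sketch (the DataFrame below recreates the question's data with an empty string in row 2, an assumption for illustration):

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'FirstName': ['Navee', '', 'Naveen'],
                   'LastName': ['Srikanth', 'Srikanth', None]})

# Keep rows whose FirstName is neither NaN nor an empty string
cleaned = df[df['FirstName'].notnull() & (df['FirstName'] != '')]
print(cleaned)
```

Here notnull() alone would keep all three rows, because '' is a valid string; the extra comparison drops row 2.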

1 Comment

The question is about PySpark, not Pandas.
