
My DataFrame looks like the one below:

ID,FirstName,LastName
1,Navee,Srikanth
2,,Srikanth
3,Naveen,

My problem statement: I have to remove row number 2, since FirstName is null.

I am using the PySpark script below:

join_Df1= Name.filter(Name.col(FirstName).isnotnull()).show()

I am getting this error:

  File "D:\0\NameValidation.py", line 13, in <module>
join_Df1= filter(Name.FirstName.isnotnull()).show()

TypeError: 'Column' object is not callable

Can anyone please help me resolve this?


3 Answers


It looks like your DataFrame's FirstName column contains empty strings instead of nulls. Below are some options to try:

df = sqlContext.createDataFrame([[1,'Navee','Srikanth'], [2,'','Srikanth'] , [3,'Naveen','']], ['ID','FirstName','LastName'])
df.show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  2|         |Srikanth|
|  3|   Naveen|        |
+---+---------+--------+

df.where(df.FirstName.isNotNull()).show() # This doesn't remove the row because FirstName holds an empty string, not a null
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  2|         |Srikanth|
|  3|   Naveen|        |
+---+---------+--------+

df.where(df.FirstName != '').show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  3|   Naveen|        |
+---+---------+--------+

df.filter(df.FirstName != '').show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  3|   Naveen|        |
+---+---------+--------+

df.where("FirstName != ''").show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  3|   Naveen|        |
+---+---------+--------+



You should be doing this instead:

join_Df1.filter(join_Df1.FirstName.isNotNull()).show()

Hope this helps!



I think what you need is notnull().

So this is your input in the CSV file my_test.csv:

ID,FirstName,LastName
1,Navee,Srikanth
2,,Srikanth
3,Naveen

The code:

import pandas as pd
df = pd.read_csv("my_test.csv")

print(df[df['FirstName'].notnull()])

Output:

   ID FirstName  LastName
0   1     Navee  Srikanth
2   3    Naveen       NaN

This is what you want: df[df['FirstName'].notnull()]

Output of df['FirstName'].notnull():

0     True
1    False
2     True

This selects the rows of df where df['FirstName'].notnull() is True.

How is this checked? df['FirstName'].notnull() returns True when the FirstName value is present and False when it is NaN.
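Note that, as in the accepted answer, notnull() only catches genuine NaN values; if the CSV instead contained empty strings, you would have to filter those explicitly as well. A small sketch (the DataFrame below recreates the question's data with an empty string in row 2, an assumption for illustration):

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'FirstName': ['Navee', '', 'Naveen'],
                   'LastName': ['Srikanth', 'Srikanth', None]})

# Keep rows whose FirstName is neither NaN nor an empty string
cleaned = df[df['FirstName'].notnull() & (df['FirstName'] != '')]
print(cleaned)
```

Here notnull() alone would keep all three rows, because '' is a valid string; the extra comparison drops row 2.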

1 Comment

The question is about PySpark, not Pandas.
