
Suppose we have a simple dataframe:

from pyspark.sql.types import *

schema = StructType([
    StructField('id', LongType(), False),
    StructField('name', StringType(), False),
    StructField('count', LongType(), True),
])
df = spark.createDataFrame([(1, 'Alice', None), (2, 'Bob', 1)], schema)

The question is: how do I detect null values? I tried the following:

df.where(df.count == None).show()
df.where(df.count is 'null').show()
df.where(df.count == 'null').show()

It results in the error:

condition should be string or Column

I know the following works:

df.where("count is null").show()

But is there a way to achieve this without the full string, i.e. with df.count...?

2 Answers


Another way of doing the same is by using the filter API:

from pyspark.sql import functions as F
df.filter(F.isnull("count")).show()

2 Comments

Is there any significant difference between where and filter? I mean generally, not only in this case.
@MiroslavStola, where is an alias for filter. filter is the standard name in functional programming, whereas where is for those who prefer the SQL way.
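
A minimal sketch of that equivalence, using the example dataframe from the question (the column name count is taken from its schema):

from pyspark.sql import functions as F

# where and filter accept the same arguments and return the same rows
df.filter(F.col("count").isNull()).show()
df.where(F.col("count").isNull()).show()

# both also accept a SQL expression string
df.filter("count IS NULL").show()
df.where("count IS NULL").show()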

You can use the Spark function isnull:

from pyspark.sql import functions as F
df.where(F.isnull(F.col("count"))).show()

or directly with the Column method isNull:

df.where(F.col("count").isNull()).show()
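For reference, a small sketch of the expected result with the question's example dataframe (only the row where count is null should come back):

from pyspark.sql import functions as F

df.where(F.col("count").isNull()).collect()
# [Row(id=1, name='Alice', count=None)]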

2 Comments

And for those unfamiliar with pyspark syntax like me, .isNotNull() gives you every row that is not null.
~F.col("count").isNull() also works to provide the negation
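
A short sketch of the two negation forms mentioned in these comments, again assuming the question's example dataframe:

from pyspark.sql import functions as F

# keep rows where count is NOT null; both lines are equivalent
df.where(F.col("count").isNotNull()).show()
df.where(~F.col("count").isNull()).show()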
