
Calling filter on basic Scala collections containing null values has the following (and quite intuitive) behaviour:

scala> List("a", "b", null).filter(_ != "a")
res0: List[String] = List(b, null)

However, I was very surprised to discover that the following filter removes nulls from a Spark DataFrame:

scala> val df = List(("a", null), ("c", "d")).toDF("A", "B")
scala> df.show
+---+----+
|  A|   B|
+---+----+
|  a|null|
|  c|   d|
+---+----+
scala> df.filter('B =!= "d").show
+---+---+
|  A|  B|
+---+---+
+---+---+

If I want to keep the null values, I have to add an explicit null check:

scala> df.filter('B =!= "d" || 'B.isNull).show
+---+----+
|  A|   B|
+---+----+
|  a|null|
+---+----+
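Equivalently (a sketch, assuming the overload of filter that accepts a SQL expression string), the same condition can be written in SQL form, where the null check must still be explicit:

scala> df.filter("B != 'd' OR B IS NULL").show
+---+----+
|  A|   B|
+---+----+
|  a|null|
+---+----+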

Personally, I think that removing nulls by default is very error-prone. Why this choice? And why isn't it explicitly stated in the API documentation? Am I missing something?


1 Answer


This is because standard SQL comparisons are not null-safe: comparing NULL to anything yields NULL rather than true or false. Spark SQL follows the SQL standard here (Scala collections do not).
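A minimal sketch of these semantics, using the df from the question: with ordinary (non-null-safe) equality, any comparison against a null evaluates to null, and filter only keeps rows whose predicate is true.

scala> df.select('B === "d").show  // the B = null row evaluates to null, not false
scala> df.filter('B === "d").show  // keeps only ("c", "d"); null is not true, so the null row is dropped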

Spark DataFrames do have a null-safe equality operator, though: <=>.

scala> df.filter($"B" <=> null).show
+---+----+
|  A|   B|
+---+----+
|  a|null|
+---+----+


scala> df.filter(not($"B" <=> "d")).show
+---+----+
|  A|   B|
+---+----+
|  a|null|
+---+----+

Note (added in an edit): the point of not being null-safe by default is to allow null as the result of a test. Is a missing value equal to "c"? We don't know. Is a missing value equal to another missing value? We don't know either. But in a filter, null counts as false.
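A small illustration of this three-valued logic (a sketch using the same df): selecting the comparison instead of filtering on it makes the "unknown" results visible.

scala> df.select('B === null).show  // both rows: null (the comparison is unknown)
scala> df.select('B <=> null).show  // null row: true, "d" row: false (null-safe)

In recent Spark versions (2.3+, I believe) the <=> operator is also exposed as the Column method eqNullSafe.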
