
Calling filter on basic Scala collections containing null values has the following (and quite intuitive) behaviour:

scala> List("a", "b", null).filter(_ != "a")
res0: List[String] = List(b, null)

However, I was very surprised to discover that the following filter removes nulls from a Spark DataFrame:

scala> val df = List(("a", null), ("c", "d")).toDF("A", "B")
scala> df.show
+---+----+
|  A|   B|
+---+----+
|  a|null|
|  c|   d|
+---+----+
scala> df.filter('B =!= "d").show
+---+---+
|  A|  B|
+---+---+
+---+---+

If I want to keep the null values, I have to add an explicit null check:

scala> df.filter('B =!= "d" || 'B.isNull).show
+---+----+
|  A|   B|
+---+----+
|  a|null|
+---+----+
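Equivalently (a sketch, assuming the overload of filter that accepts a SQL expression string), the same condition can be written in SQL form, where the null check must still be explicit:

scala> df.filter("B != 'd' OR B IS NULL").show
+---+----+
|  A|   B|
+---+----+
|  a|null|
+---+----+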

Personally, I think that removing nulls by default is very error-prone. Why this choice? And why isn't it explicitly stated in the API documentation? Am I missing something?


1 Answer


This is because standard SQL comparisons are not null-safe: comparing NULL to anything yields NULL rather than true or false. Spark SQL follows the SQL standard here (Scala collections do not).
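A minimal sketch of these semantics, using the df from the question: with ordinary (non-null-safe) equality, any comparison against a null evaluates to null, and filter only keeps rows whose predicate is true.

scala> df.select('B === "d").show  // the B = null row evaluates to null, not false
scala> df.filter('B === "d").show  // keeps only ("c", "d"); null is not true, so the null row is dropped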

Spark DataFrames do have a null-safe equality operator, though: <=>.

scala> df.filter($"B" <=> null).show
+---+----+
|  A|   B|
+---+----+
|  a|null|
+---+----+


scala> df.filter(not($"B" <=> "d")).show
+---+----+
|  A|   B|
+---+----+
|  a|null|
+---+----+

Note (added in an edit): the point of not being null-safe by default is to allow null as the result of a test. Is a missing value equal to "c"? We don't know. Is a missing value equal to another missing value? We don't know either. But in a filter, null counts as false.
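A small illustration of this three-valued logic (a sketch using the same df): selecting the comparison instead of filtering on it makes the "unknown" results visible.

scala> df.select('B === null).show  // both rows: null (the comparison is unknown)
scala> df.select('B <=> null).show  // null row: true, "d" row: false (null-safe)

In recent Spark versions (2.3+, I believe) the <=> operator is also exposed as the Column method eqNullSafe.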
