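import org.apache.spark.sql.functions.col
import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell

val someDF = Seq(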
  (8, "bat"),
  (64, "mouse"),
  (-27, "horse"),
  (10, null),
  (11, "")
).toDF("number", "word")


Using the data frame above, I'm trying to filter out rows where the word column is null or an empty string.

Trial 1

someDF.filter(col("word") =!= "" || col("word").isNotNull).show(false)
+------+-----+
|number|word |
+------+-----+
|8     |bat  |
|64    |mouse|
|-27   |horse|
|11    |     |
+------+-----+

I used an OR condition, but it still does not remove the row whose word value is an empty string.


Trial 2

someDF.filter(col("word") =!= "").filter(col("word").isNotNull).show(false)
+------+-----+
|number|word |
+------+-----+
|8     |bat  |
|64    |mouse|
|-27   |horse|
+------+-----+


In trial 2 I chained two filters, and that removed both the null and the empty values from the data frame.


Trial 3

someDF.filter(col("word") =!= "" && col("word").isNotNull).show(false)
+------+-----+
|number|word |
+------+-----+
|8     |bat  |
|64    |mouse|
|-27   |horse|
+------+-----+


In trial 3 I used an AND condition, and that also removed the null and empty values.

Can anyone please explain why it doesn't work with the OR condition? Is something wrong with my code?

  • Your code doesn't have any issue. I think you should check this article, which explains it in more detail. Commented Nov 3, 2020 at 6:37
  • I'm trying to filter out both null values and empty-string column values, but it's not working. Commented Nov 3, 2020 at 6:39
  • I'm trying it like this: res35.filter(col("word") =!= "" || col("word").isNotNull).show(false), but it does not filter out the empty-string value. Commented Nov 3, 2020 at 6:41
  • Run each of your trials with explain and see what the difference is. Commented Nov 3, 2020 at 7:02

1 Answer

In general Spark SQL (including SQL and the DataFrame and Dataset API) does not guarantee the order of evaluation of subexpressions. In particular, the inputs of an operator or function are not necessarily evaluated left-to-right or in any other fixed order. For example, logical AND and OR expressions do not have left-to-right “short-circuiting” semantics.

Therefore, it is dangerous to rely on the side effects or order of evaluation of Boolean expressions, and the order of WHERE and HAVING clauses, since such expressions and clauses can be reordered during query optimization and planning. Specifically, if a UDF relies on short-circuiting semantics in SQL for null checking, there’s no guarantee that the null check will happen before invoking the UDF.

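Now let's look at your examples.

Trial 1: someDF.filter(col("word") =!= "" || col("word").isNotNull).show(false)

This is a logical OR, so it is enough for one side to be true. For the empty-string row: "" =!= "" -> false, "".isNotNull -> true, so the whole condition is true and the row is kept rather than filtered out. For the null row: null =!= "" -> null, null.isNotNull -> false, and null OR false is null, which filter does not treat as true, so that row is dropped.

Trials 2 and 3 are equivalent: both apply a logical AND (chaining two filters has the same effect as &&). For the empty-string row, "" =!= "" -> false, which is enough to make the whole expression false, so the row is filtered out.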
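
To make the null handling concrete, here is a minimal sketch in plain Scala (no Spark required) that models SQL three-valued logic with Option[Boolean], where None stands for NULL. The object ThreeValued and its helpers are hypothetical names made up for illustration, not Spark APIs. The sketch assumes, as Spark SQL does, that a comparison with NULL yields NULL and that filter keeps a row only when the predicate is true.

```scala
// Plain-Scala model of SQL three-valued logic; None stands for SQL NULL.
// All names here are hypothetical helpers, not part of the Spark API.
object ThreeValued {
  type Tri = Option[Boolean]

  // Models col("word") =!= "": NULL when word is null, else a normal comparison
  def notEqualEmpty(word: String): Tri = Option(word).map(_ != "")

  // Models col("word").isNotNull: always true or false, never NULL
  def isNotNull(word: String): Tri = Some(word != null)

  // SQL OR: true if either side is true, false if both are false, else NULL
  def or(a: Tri, b: Tri): Tri = (a, b) match {
    case (Some(true), _) | (_, Some(true)) => Some(true)
    case (Some(false), Some(false))        => Some(false)
    case _                                 => None
  }

  // SQL AND: false if either side is false, true if both are true, else NULL
  def and(a: Tri, b: Tri): Tri = (a, b) match {
    case (Some(false), _) | (_, Some(false)) => Some(false)
    case (Some(true), Some(true))            => Some(true)
    case _                                   => None
  }

  // filter keeps a row only when the predicate is exactly true
  def keeps(p: Tri): Boolean = p.contains(true)

  def main(args: Array[String]): Unit = {
    // The three kinds of word value from the question: non-empty, null, empty
    for (w <- Seq("bat", null, "")) {
      val orKept  = keeps(or(notEqualEmpty(w), isNotNull(w)))
      val andKept = keeps(and(notEqualEmpty(w), isNotNull(w)))
      val neOnly  = keeps(notEqualEmpty(w))
      println(s"word=$w OR=$orKept AND=$andKept neOnly=$neOnly")
    }
  }
}
```

Under this model the OR predicate keeps the empty string but drops null, which matches the output of trial 1, while the AND predicate drops both. The neOnly column suggests why col("word") =!= "" on its own already removes null rows: the comparison yields NULL, and filter does not treat NULL as true.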


1 Comment

One clarification: how come someDF.filter(col("word") =!= "") removes both null and empty-string values from the data frame?
