I am looking for a way to filter on a field in a DataFrame that contains null data. Below is my sample DataFrame with two fields: id and value. The value field has a null in one row.

val testData = Array((1,"actualstring1"),(2,null),(3,"actualstring2"),(4,"testString1"))
val testDataDF = sc.parallelize(testData).toDF("id", "value")

I used the code snippet below to filter out the test strings, expecting the output to contain three records. To my surprise, I got only the two records below:

testDataDF.filter(!col("value").contains("test")).show

which gives the below result:

+---+-------------+
| id|        value|
+---+-------------+
|  1|actualstring1|
|  3|actualstring2|
+---+-------------+

Here we see that the record with id=2 is dropped by this filter. I'm now stuck on how to include the row for id=2 in the output as well, along with the two rows we are already getting.

Appreciate any help

2 Answers

1

You can replace the current condition with one that defaults to FALSE when the value is NULL:

not(coalesce(col("value").contains("test"), lit(false)))

where

lit(false)

is a boolean literal, and

coalesce(_, _)

returns the first NOT NULL element, counting from the left, or NULL if such element doesn't exist.
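
Put together with the sample data from the question, the whole filter might look like the sketch below. It assumes a local SparkSession; the object name, the helper name `withoutTestStrings`, and the app name are placeholders chosen for illustration, not something from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, lit, not}

object CoalesceFilterExample {
  // Hypothetical helper: keeps rows whose "value" does not contain "test".
  // contains(...) yields NULL for a NULL value; coalesce turns that NULL
  // into false, so not(...) evaluates to true and the NULL row is kept.
  def withoutTestStrings(df: DataFrame): DataFrame =
    df.filter(not(coalesce(col("value").contains("test"), lit(false))))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("coalesce-filter-example")
      .getOrCreate()
    import spark.implicits._

    val testDataDF = Seq[(Int, String)](
      (1, "actualstring1"), (2, null), (3, "actualstring2"), (4, "testString1")
    ).toDF("id", "value")

    // Keeps the rows with id 1, 2 and 3; only id 4 is filtered out.
    withoutTestStrings(testDataDF).show()

    spark.stop()
  }
}
```

Wrapping the condition in a small function is only a convenience here; in spark-shell the `filter(...)` expression can be applied to `testDataDF` directly.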

1

You can check for null explicitly with isNull in the filter, so that rows with a null value are kept:

testDataDF.filter(col("value").isNull || !col("value").contains("test")).show
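
The NULL row needs this special handling because of SQL's three-valued logic: `contains` on a NULL value yields NULL rather than true or false, negating NULL is still NULL, and `filter` keeps only the rows where the condition is true. As a rough illustration, the sketch below exposes the intermediate conditions as columns. The object name, the helper `withConditionColumns`, and the column aliases are made up for this example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object NullLogicExample {
  // Hypothetical helper: adds the two filter conditions as boolean columns
  // so their values can be inspected row by row.
  def withConditionColumns(df: DataFrame): DataFrame =
    df.select(
      col("id"),
      col("value"),
      // NULL for a NULL value, so filter(...) on this alone drops the row.
      (!col("value").contains("test")).as("not_contains"),
      // true OR NULL evaluates to true, so the NULL row is kept.
      (col("value").isNull || !col("value").contains("test")).as("null_or_not_contains")
    )

  def main(args: Array[String]): Unit = {
    // Assumption: a local session; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("null-logic-example")
      .getOrCreate()
    import spark.implicits._

    val testDataDF = Seq[(Int, String)](
      (1, "actualstring1"), (2, null), (3, "actualstring2"), (4, "testString1")
    ).toDF("id", "value")

    withConditionColumns(testDataDF).show()

    spark.stop()
  }
}
```

For the row with id=2, `not_contains` is NULL while `null_or_not_contains` is true, which is why only the second condition keeps that row.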
