8

The issues:

1) Spark doesn't call UDF if input is column of primitive type that contains null:

inputDF.show()

+-----+
|  x  |
+-----+
| null|
|  1.0|
+-----+

inputDF
  .withColumn("y",
     udf { (x: Double) => 2.0 }.apply($"x") // will not be invoked if $"x" == null
  )
  .show()

+-----+-----+
|  x  |  y  |
+-----+-----+
| null| null|
|  1.0|  2.0|
+-----+-----+

2) Can't produce null from UDF as a column of primitive type:

udf { (x: String) => null: Double } // compile error

3 Answers 3

12

Accordingly to the docs:

Note that if you use primitive parameters, you are not able to check if it is null or not, and the UDF will return null for you if the primitive input is null. Use boxed type or [[Option]] if you wanna do the null-handling yourself.


So, the easiest solution is just to use boxed types if your UDF input is nullable column of primitive type OR/AND you need to output null from UDF as a column of primitive type:

inputDF
  .withColumn("y",
     udf { (x: java.lang.Double) => 
       (if (x == null) 1 else null): java.lang.Integer
     }.apply($"x")
  )
  .show()

+-----+-----+
|  x  |  y  |
+-----+-----+
| null| null|
|  1.0|  2.0|
+-----+-----+
Sign up to request clarification or add additional context in comments.

2 Comments

Note that the docs you cited are a bit misleading: Although Option can be used to return "nullable" primitives, Option cannot be used as input for the UDF (at least in Spark 1.6, not sure about 2.0)
@RaphaelRoth Agree, they had to mention that Option[] is useful only to return nullable values.
2

I would also use Artur's solution, but there is also another way without using javas wrapper classes by using struct:

import org.apache.spark.sql.functions.struct
import org.apache.spark.sql.Row

inputDF
  .withColumn("y",
     udf { (r: Row) => 
       if (r.isNullAt(0)) Some(1) else None
     }.apply(struct($"x"))
  )
  .show()

1 Comment

That's nice too!
0

Based on the solution provided at SparkSQL: How to deal with null values in user defined function? by @zero323, an alternative way to achieve the requested result is:

import scala.util.Try
val udfHandlingNulls udf((x: Double) => Try(2.0).toOption)
inputDF.withColumn("y", udfHandlingNulls($"x")).show()

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.