
I am reading two different .csv files, each of which has only one column, as below:

    val dF1 = sqlContext.read.csv("some.csv").select($"ID")
    val dF2 = sqlContext.read.csv("other.csv").select($"PID")

I am trying to check whether each dF2("PID") exists in dF1("ID"):

    val getIdUdf = udf((x:String)=>{dF1.collect().map(_(0)).toList.contains(x)})
    val dfFinal = dF2.withColumn("hasId", getIdUdf($"PID"))

This gives me a NullPointerException, but if I collect dF1 into a list outside the UDF and use that list inside it, it works:

    val dF1 = sqlContext.read.csv("some.csv").select($"ID").collect().map(_(0)).toList
    val getIdUdf = udf((x:String)=>{dF1.contains(x)})
    val dfFinal = dF2.withColumn("hasId", getIdUdf($"PID"))

I know I can use a join to get this done, but I want to know the reason for the NullPointerException here.

Thanks.

  • I think putting a collect inside a UDF is not good practice and may be the source of the error. Consider that the function will be called many times, so it would perform a collect on each call. Have you thought about extracting that collect outside the UDF and broadcasting the data (see the sketch below)? Commented Nov 3, 2017 at 9:15
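
A minimal sketch of the broadcast approach the comment suggests, assuming the dF1 and dF2 from the question and a SparkContext available as sc:

    import org.apache.spark.sql.functions.udf

    // Collect the IDs once on the driver...
    val idSet = dF1.collect().map(_.getString(0)).toSet
    // ...and broadcast them so each executor gets a single read-only copy.
    val idSetBc = sc.broadcast(idSet)

    // The UDF closes over the broadcast handle, not over a DataFrame.
    val getIdUdf = udf((x: String) => idSetBc.value.contains(x))
    val dfFinal = dF2.withColumn("hasId", getIdUdf($"PID"))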

1 Answer


Please check this question about accessing a dataframe inside the transformation of another dataframe. This is exactly what you are doing with your UDF, and it is not possible in Spark: a DataFrame only exists on the driver, so the dF1 reference captured in your UDF's closure is null by the time the UDF runs on the executors, hence the NullPointerException. The solution is either to use a join, or to collect outside the transformation and broadcast the result.
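
For reference, a minimal sketch of the join alternative, assuming the dF1 and dF2 from the question (a left outer join against the distinct IDs, then a null check on the joined column):

    import org.apache.spark.sql.functions.col

    // Left-join against the distinct IDs; a non-null ID after the join
    // means the PID was found in dF1.
    val dfFinal = dF2
      .join(dF1.distinct(), dF2("PID") === dF1("ID"), "left_outer")
      .withColumn("hasId", col("ID").isNotNull)
      .drop("ID")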
