
Several people (1, 2, 3) have discussed using a Scala UDF in a PySpark application, usually for performance reasons. I am interested in the opposite - using a Python UDF in a Scala Spark project.

I am particularly interested in building a model using sklearn (and MLflow) and then efficiently applying that model to records in a Spark streaming job. I know I could also host the Python model behind a REST API and call that API from mapPartitions in the Spark streaming application, but managing concurrency for that task and setting up the API for the hosted model isn't something I'm super excited about.

Is this possible without too much custom development with something like Py4J? Is this just a bad idea?

Thanks!

  • It is possible, though definitely not supported nor straightforward. So the question really is why you would even try. It is really hard to find a reasonable justification for such a process. Commented Aug 19, 2018 at 11:02
  • @user6910411 Thanks for the response. I explained the use case in the question - I'd like to use a model I trained using sklearn to evaluate individual rows in a structured streaming application. Commented Aug 20, 2018 at 13:20
  • I guess the question is - if you already want to pay a price of inter-language communication, why not go with PySpark all the way? Commented Aug 20, 2018 at 16:10
  • In this case, because 1) the python operation will be a small piece of a larger Spark job, and I'd rather not pay the PySpark penalty for the whole thing and 2) I already have a mature Scala project, and just want to add in a bit of python w/o needing a rewrite. Commented Aug 20, 2018 at 18:18
  • Please, have a look at stackoverflow.com/q/76802912/6380624 Commented Aug 2, 2023 at 11:20

1 Answer

Maybe I'm late to the party, but at least I can help with this for posterity. This is actually achievable by creating your Python UDF and registering it with spark.udf.register("my_python_udf", foo). You can view the docs here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.UDFRegistration.register

This function can then be called through the sqlContext in Python, Scala, Java, R, or really any language, because you're accessing the sqlContext directly (where the UDF is registered). For example, you would call something like

spark.sql("SELECT my_python_udf(...)").show()
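
For posterity, here is a minimal sketch of what the Python side of that registration could look like for the sklearn use case in the question. The model path, the feature column, and the score helper are illustrative assumptions, not anything from the original answer or the docs:

import joblib
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("python-udf-from-scala").getOrCreate()

# Hypothetical model file; anything sklearn-compatible that joblib can load works.
model = joblib.load("/models/my_sklearn_model.joblib")

def score(feature):
    # sklearn expects a 2-D array of samples; return a plain Python float.
    return float(model.predict([[feature]])[0])

# Register under the name the Scala side will reference in SQL.
spark.udf.register("my_python_udf", score, DoubleType())

Once that registration has run, Scala code sharing the same session can issue the spark.sql call above and the UDF resolves by name. As the comments below point out, this presumes a Python process is attached to the same session (as in a Databricks notebook), not a Scala-only spark-submit job.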

PROS - You get to call your sklearn model from Scala.

CONS - You have to use the sqlContext and write SQL-style queries.

I hope this helps, at least for any future visitors.


3 Comments

  • Thanks for this. It looks like we should be able to submit Python zips alongside a primary jar for a Spark job and use those Python zips as dependencies.
  • I think you're speaking from a situation where you have a context in a Python process, register the UDF, and then reuse that context in a JVM where you could access it. That would be possible in a Databricks notebook, but not when I have a single job that I start with spark-submit.
  • I'm looking for a way to create a PySpark UDF for a SparkContext created using Scala. The link you gave is now broken.
