
Several people (1, 2, 3) have discussed using a Scala UDF in a PySpark application, usually for performance reasons. I am interested in the opposite - using a Python UDF in a Scala Spark project.

I am particularly interested in building a model using sklearn (and MLflow) and then efficiently applying that model to records in a Spark streaming job. I know I could also host the Python model behind a REST API and call that API from mapPartitions in the Spark streaming application, but managing concurrency for that task and setting up the API for the hosted model isn't something I'm super excited about.

Is this possible without too much custom development with something like Py4J? Is this just a bad idea?

Thanks!

  • It is possible, though definitely not supported nor straightforward. So the question really is why you would even try. It is really hard to find a reasonable justification for such a process. Commented Aug 19, 2018 at 11:02
  • @user6910411 Thanks for the response. I explained the use case in the question - I'd like to use a model I trained using sklearn to evaluate individual rows in a structured streaming application. Commented Aug 20, 2018 at 13:20
  • I guess the question is - if you already want to pay a price of inter-language communication, why not go with PySpark all the way? Commented Aug 20, 2018 at 16:10
  • In this case, because 1) the python operation will be a small piece of a larger Spark job, and I'd rather not pay the PySpark penalty for the whole thing and 2) I already have a mature Scala project, and just want to add in a bit of python w/o needing a rewrite. Commented Aug 20, 2018 at 18:18
  • Please, have a look at stackoverflow.com/q/76802912/6380624 Commented Aug 2, 2023 at 11:20

1 Answer

Maybe I'm late to the party, but at least I can help with this for posterity. This is actually achievable by creating your Python UDF and registering it with spark.udf.register("my_python_udf", foo). You can view the docs here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.UDFRegistration.register

This function can then be called through the sqlContext in Python, Scala, Java, R, or really any language, because you're accessing the sqlContext directly (where the UDF is registered). For example, you would call something like

spark.sql("SELECT my_python_udf(...)").show()
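
For posterity, here is a minimal sketch of what the Python side of that registration could look like for the sklearn use case in the question. The model path, the feature column, and the score helper are illustrative assumptions, not anything from the original answer or the docs:

import joblib
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("python-udf-from-scala").getOrCreate()

# Hypothetical model file; anything sklearn-compatible that joblib can load works.
model = joblib.load("/models/my_sklearn_model.joblib")

def score(feature):
    # sklearn expects a 2-D array of samples; return a plain Python float.
    return float(model.predict([[feature]])[0])

# Register under the name the Scala side will reference in SQL.
spark.udf.register("my_python_udf", score, DoubleType())

Once that registration has run, Scala code sharing the same session can issue the spark.sql call above and the UDF resolves by name. As the comments below point out, this presumes a Python process is attached to the same session (as in a Databricks notebook), not a Scala-only spark-submit job.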

PROS - You get to call your sklearn model from Scala.

CONS - You have to use the sqlContext and write SQL-style queries.

I hope this helps, at least for any future visitors.


3 Comments

  • Thanks for this. It looks like we should be able to submit Python zips alongside a primary jar for a Spark job and use those Python zips as dependencies.
  • I think you're speaking from a situation where you have a context in a Python process, register the UDF, and then reuse that context in a JVM where you could access it. That would be possible in a Databricks notebook, but not when I have a single job that I start with spark-submit.
  • I'm looking for a way to create a PySpark UDF for a SparkContext created using Scala. The link you gave is now broken.
