
I am using a Jupyter notebook with PySpark, with the following Docker image: Jupyter all-spark-notebook.

Now I would like to write a PySpark streaming application which consumes messages from Kafka. In the Spark-Kafka Integration guide they describe how to deploy such an application using spark-submit (it requires linking an external jar; the explanation is in 3. Deploying). But since I am using a Jupyter notebook, I never actually run the spark-submit command; I assume it gets run in the background when I press execute.

In the spark-submit command you can specify some parameters, one of them being --jars, but it is not clear to me how I can set this parameter from the notebook (or externally via environment variables?). I am assuming I can link this external jar dynamically via the SparkConf or the SparkContext object. Does anyone have experience with how to perform the linking properly from the notebook?

5 Answers


I've managed to get it working from within the Jupyter notebook which is running from the all-spark container.

I start a python3 notebook in JupyterHub and override the PYSPARK_SUBMIT_ARGS environment variable as shown below. The Kafka consumer library was downloaded from the Maven repository and put in my home directory /home/jovyan:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = \
    '--jars /home/jovyan/spark-streaming-kafka-assembly_2.10-1.6.1.jar pyspark-shell'

import pyspark
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming import StreamingContext

sc = pyspark.SparkContext()
ssc = StreamingContext(sc, 1)  # batch interval of 1 second

broker = "<my_broker_ip>"  # host:port of a Kafka broker, e.g. localhost:9092
directKafkaStream = KafkaUtils.createDirectStream(ssc, ["test1"],
                        {"metadata.broker.list": broker})
directKafkaStream.pprint()
ssc.start()

Note: Don't forget the trailing pyspark-shell in the environment variable! PySpark hands these arguments to spark-submit, and pyspark-shell is the application name it expects at the end.

Extension: If you want to include code from spark-packages you can use the --packages flag instead, as in the sketch below. An example of how to do this in the all-spark-notebook can be found here.
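A minimal sketch of the --packages variant, assuming the Maven coordinates that correspond to the jar name used above (adjust the Scala and Spark versions to match your setup):

import os

# Let spark-submit resolve the dependency from Maven instead of a local jar.
# The coordinates are inferred from the assembly jar used above.
os.environ['PYSPARK_SUBMIT_ARGS'] = \
    '--packages org.apache.spark:spark-streaming-kafka-assembly_2.10:1.6.1 pyspark-shell'

import pyspark
sc = pyspark.SparkContext()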


4 Comments

Thanks. Just want to say that broker should be of a format like "localhost:9092".
Were you ever able to do the same thing without downloading the jar, using the --packages option (mentioned here: spark.apache.org/docs/latest/submitting-applications.html) instead?
I am surprised that this actually worked for you. I had to set PYSPARK_SUBMIT_ARGS in the Dockerfile before the container starts.
Thanks, it worked. I am using the .NET (C#) language in a Jupyter notebook, and the above-mentioned way of setting the jar file for the pyspark submit arguments via an environment variable worked.

Indeed, there is a way to link it dynamically via the SparkConf object when you create the SparkSession, as explained in this answer:

spark = SparkSession \
    .builder \
    .appName("My App") \
    .config("spark.jars", "/path/to/jar.jar,/path/to/another/jar.jar") \
    .getOrCreate()
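If you would rather have Spark resolve the dependency from Maven than point at a local file, the spark.jars.packages config option is the SparkSession equivalent of the --packages flag. A sketch, reusing the same assumed coordinates as above:

from pyspark.sql import SparkSession

# spark.jars.packages takes Maven coordinates and lets Spark download the
# dependency itself; like spark.jars, it must be set before the first
# SparkSession/SparkContext is created.
spark = SparkSession \
    .builder \
    .appName("My App") \
    .config("spark.jars.packages",
            "org.apache.spark:spark-streaming-kafka-assembly_2.10:1.6.1") \
    .getOrCreate()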



You can run your Jupyter notebook with the pyspark command by setting the relevant environment variables:

export PYSPARK_DRIVER_PYTHON=jupyter
export IPYTHON=1
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --port=XXX --ip=YYY"

with XXX being the port you want to use to access the notebook and YYY being the IP address.

Now simply run pyspark and add --jars as a switch, the same as you would with spark-submit. For example:
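A sketch of the launch command, using the jar path from the accepted answer as a placeholder:

pyspark --jars /home/jovyan/spark-streaming-kafka-assembly_2.10-1.6.1.jar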

1 Comment

That's interesting. Docker can set environment variables with docker run -e, but they can also get clobbered somewhere. The Dockerfile for all-spark-notebook uses env SPARK_OPTS, but I have noticed that the all-spark-notebook Toree (Scala) kernel was clobbering a --driver-memory setting as well as --master, using local[2], in a particular kernel.json file. See, e.g., my post about some manual testing in github.com/jupyter/docker-stacks/pull/144.

In case someone is in the same situation as me: I tried all of the above solutions and none of them worked for me. What I'm trying to do is use Delta Lake in a Jupyter notebook.

Finally I can use from delta.tables import * by calling SparkContext.addPyFile("/path/to/your/jar.jar") first. Though the official Spark docs only mention adding .zip or .py files, I tried a .jar and it worked perfectly.
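A minimal sketch of that approach; the jar path is the same placeholder as above, and the app name is hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-notebook").getOrCreate()

# addPyFile ships the file to the executors and adds it to the Python search
# path, which is what makes the Python package bundled inside the jar importable.
spark.sparkContext.addPyFile("/path/to/your/jar.jar")

from delta.tables import *  # resolvable only after the jar is on the path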



For working in a Jupyter notebook with Spark, you need to give the location of the external jars before the SparkContext object is created. Running pyspark --jars yourJar will create a SparkContext with the location of the external jars.

