
This page inspired me to try out spark-csv for reading .csv files in PySpark. I found a couple of posts, such as this one, describing how to use spark-csv.

But I am not able to start the IPython instance with either the .jar file or the package included at start-up, the way it can be done through spark-shell.

That is, instead of

ipython notebook --profile=pyspark

I tried out

ipython notebook --profile=pyspark --packages com.databricks:spark-csv_2.10:1.0.3

but it is not supported.

Please advise.

2 Answers


You can simply pass it in the PYSPARK_SUBMIT_ARGS variable. For example:

export PACKAGES="com.databricks:spark-csv_2.11:1.3.0"
export PYSPARK_SUBMIT_ARGS="--packages ${PACKAGES} pyspark-shell"

These properties can also be set dynamically in your code, before the SparkContext / SparkSession and the corresponding JVM have been started:

import os

packages = "com.databricks:spark-csv_2.11:1.3.0"

os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages {0} pyspark-shell".format(packages)
)
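If you need more than one dependency, --packages accepts a comma-separated list of Maven coordinates. A minimal sketch (the second coordinate is only illustrative; substitute whatever artifacts and versions match your Spark/Scala build):

```python
import os

# Illustrative coordinates -- match the Scala version suffix (_2.11)
# and the artifact versions to your own Spark build.
packages = [
    "com.databricks:spark-csv_2.11:1.3.0",
    "com.databricks:spark-avro_2.11:2.0.1",
]

# --packages takes one comma-separated list of groupId:artifactId:version,
# followed by pyspark-shell, which must come last.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages {0} pyspark-shell".format(",".join(packages))
)
```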

2 Comments

Wouldn't this override everything that is already in os.environ["PYSPARK_SUBMIT_ARGS"]? I think this needs to be mentioned, because I spent a lot of time figuring out what happened.
This is not working for Kafka; I still get this error: java.lang.ClassNotFoundException: Failed to find data source: kafka. Code: import os; os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 pyspark-shell'
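As the first comment points out, assigning to os.environ["PYSPARK_SUBMIT_ARGS"] overwrites anything already there. A small helper that prepends the new flag instead (a sketch; add_packages is a hypothetical name, and it assumes that any existing value already ends with pyspark-shell, which must stay last):

```python
import os

def add_packages(coords, env=os.environ):
    # Hypothetical helper: prepend a --packages flag so that any existing
    # arguments -- including the trailing "pyspark-shell" -- are preserved.
    existing = env.get("PYSPARK_SUBMIT_ARGS", "pyspark-shell")
    env["PYSPARK_SUBMIT_ARGS"] = "--packages {0} {1}".format(coords, existing)

add_packages("com.databricks:spark-csv_2.11:1.3.0")
```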

I believe you can also add this as a variable to your spark-defaults.conf file. So something like:

spark.jars.packages    com.databricks:spark-csv_2.10:1.3.0

This will load the spark-csv library into PySpark every time you launch the driver.

Obviously zero's answer is more flexible because you can add these lines to your PySpark app before you import the PySpark package:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-csv_2.10:1.3.0 pyspark-shell'

from pyspark import SparkContext, SparkConf

This way you are only importing the packages you actually need for your script.

2 Comments

If you are running a notebook, this is by far the most portable option: I'm running the all-spark-notebook version, and this unlocks CSV parsing for all three languages at once.
I am trying to import the package mmlspark using the following in my notebook, but I get an error that mmlspark is not found: import os, sys; os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages Azure:mmlspark:0.13 pyspark-shell"; import findspark; findspark.add_packages(["Azure:mmlspark:0.13"]); findspark.init()
