
This page inspired me to try out spark-csv for reading .csv files in PySpark. I found a couple of posts, such as this one, describing how to use spark-csv.

But I am not able to start the IPython instance with either the .jar file or the package included at start-up, the way it can be done through spark-shell.

That is, instead of

ipython notebook --profile=pyspark

I tried out

ipython notebook --profile=pyspark --packages com.databricks:spark-csv_2.10:1.0.3

but it is not supported.

Please advise.

2 Answers


You can simply pass it in the PYSPARK_SUBMIT_ARGS variable. For example:

export PACKAGES="com.databricks:spark-csv_2.11:1.3.0"
export PYSPARK_SUBMIT_ARGS="--packages ${PACKAGES} pyspark-shell"

These properties can also be set dynamically in your code, before the SparkContext / SparkSession and the corresponding JVM have been started:

import os

packages = "com.databricks:spark-csv_2.11:1.3.0"

os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages {0} pyspark-shell".format(packages)
)
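If you need more than one dependency, --packages accepts a comma-separated list of Maven coordinates. A minimal sketch (the second coordinate is only illustrative; substitute whatever artifacts and versions match your Spark/Scala build):

```python
import os

# Illustrative coordinates -- match the Scala version suffix (_2.11)
# and the artifact versions to your own Spark build.
packages = [
    "com.databricks:spark-csv_2.11:1.3.0",
    "com.databricks:spark-avro_2.11:2.0.1",
]

# --packages takes one comma-separated list of groupId:artifactId:version,
# followed by pyspark-shell, which must come last.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages {0} pyspark-shell".format(",".join(packages))
)
```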

2 Comments

Wouldn't this override everything that is already in os.environ["PYSPARK_SUBMIT_ARGS"]? I think this needs to be mentioned, because I spent a lot of time figuring out what happened.
This is not working for Kafka; I still get this error: java.lang.ClassNotFoundException: Failed to find data source: kafka. Code: import os; os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 pyspark-shell'
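As the first comment points out, assigning to os.environ["PYSPARK_SUBMIT_ARGS"] overwrites anything already there. A small helper that prepends the new flag instead (a sketch; add_packages is a hypothetical name, and it assumes that any existing value already ends with pyspark-shell, which must stay last):

```python
import os

def add_packages(coords, env=os.environ):
    # Hypothetical helper: prepend a --packages flag so that any existing
    # arguments -- including the trailing "pyspark-shell" -- are preserved.
    existing = env.get("PYSPARK_SUBMIT_ARGS", "pyspark-shell")
    env["PYSPARK_SUBMIT_ARGS"] = "--packages {0} {1}".format(coords, existing)

add_packages("com.databricks:spark-csv_2.11:1.3.0")
```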

I believe you can also add this as a variable to your spark-defaults.conf file. So something like:

spark.jars.packages    com.databricks:spark-csv_2.10:1.3.0

This will load the spark-csv library into PySpark every time you launch the driver.

Obviously zero's answer is more flexible because you can add these lines to your PySpark app before you import the PySpark package:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-csv_2.10:1.3.0 pyspark-shell'

from pyspark import SparkContext, SparkConf

This way you are only importing the packages you actually need for your script.

2 Comments

If you are running a notebook, this is by far the most portable option: I'm running the all-spark-notebook version, and this unlocks CSV parsing for all three languages at once.
I am trying to import the package mmlspark using the following in my notebook, but I get an error that mmlspark is not found: import os, sys; os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages Azure:mmlspark:0.13 pyspark-shell"; import findspark; findspark.add_packages(["Azure:mmlspark:0.13"]); findspark.init()
