
I need to run an Apache Spark script on Amazon EC2. The script uses libraries such as numpy, pandas, etc. The trouble is that I have numpy installed in /usr/local/lib64/python2.7/site-packages, and this folder isn't in PYTHONPATH by default. So when I run export PYTHONPATH=$PYTHONPATH:/usr/local/lib64/python2.7/site-packages, plain Python picks it up (import numpy causes no problems), but when I try to import it in the pyspark shell it shows:

>>> import numpy
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named numpy
>>> exit()

Is there any way to change pyspark's PYTHONPATH?

  • +1 on Joe's answer, I will refrain from a -1 on your question but it would be nice to know if this worked, to improve SO. Also, I can't edit just one character but it's "Apache" for Google not "Apace" haha :) This question shows very high in Google's search results, would help if you either delete it or complete it please? Commented Oct 27, 2016 at 14:10

2 Answers


Can you try setting

export PYTHONPATH=$PYTHONPATH:/usr/local/lib64/python2.7/site-packages

in $SPARK_CONF_DIR/spark-env.sh?
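
As a minimal sketch, the line added to spark-env.sh could look like this (assuming SPARK_CONF_DIR points to its usual default, $SPARK_HOME/conf, and using the site-packages path from the question):

# $SPARK_CONF_DIR/spark-env.sh
# Make the extra site-packages directory visible to PySpark
export PYTHONPATH=$PYTHONPATH:/usr/local/lib64/python2.7/site-packages

spark-env.sh is sourced when Spark programs are launched, so start a fresh pyspark shell after editing it.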


1 Comment

Ditto on this answer; it would be nice if the OP had responded, since this was surely the problem and the correct answer!

Joe Young's answer is good if you want to set the path "permanently." If you want to set it on a per-job basis, Continuum (the Anaconda folks) has this page about setting your PYTHONPATH job by job on the command line:

https://www.continuum.io/blog/developer-blog/using-anaconda-pyspark-distributed-language-processing-hadoop-cluster

For example (written for a Cloudera install; substitute your own Spark location):

Configuring the spark-submit command with your Hadoop Cluster

To use Python from Anaconda along with PySpark, you can set the PYSPARK_PYTHON environment variable on a per-job basis along with the spark-submit command. If you’re using the Anaconda parcel for CDH, you can run a PySpark script (e.g., spark-job.py) using the following command:

$ PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/bin/python spark-submit spark-job.py
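The same per-job pattern could carry the extra site-packages directory from the question as well. A hedged sketch (spark-job.py is a placeholder, and whether the executors also pick up PYTHONPATH depends on your cluster setup, so this only illustrates the command-line form):

$ PYTHONPATH=$PYTHONPATH:/usr/local/lib64/python2.7/site-packages spark-submit spark-job.py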

1 Comment

Looking at this more closely, I think Cloudera has a typo? Or it's an interesting way to form a command. I would normally put PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/bin/python and spark-submit spark-job.py on two lines, or separate them with a ;. But it works! Been using UNIX/Linux 20 years, learn something new every day! I tried the same VAR=value prefix in front of ls -l and it works too.
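
It isn't a typo: in POSIX shells, VAR=value command runs command with VAR set in its environment for that one invocation only. A quick illustration (the FOO variable is just for demonstration):

$ FOO=bar python -c 'import os; print(os.environ.get("FOO"))'   # prints bar
$ python -c 'import os; print(os.environ.get("FOO"))'           # prints None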
