
I need to run an Apache Spark script on Amazon EC2. The script uses libraries such as numpy, pandas, etc. The trouble is that I have numpy installed in /usr/local/lib64/python2.7/site-packages, and this folder isn't in PYTHONPATH by default. So when I run export PYTHONPATH=$PYTHONPATH:/usr/local/lib64/python2.7/site-packages, plain Python picks it up (import numpy causes no problems), but when I try to import it in the pyspark shell it shows:

>>> import numpy
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named numpy
>>> exit()

Is there any way to change pyspark's PYTHONPATH?

  • +1 on Joe's answer, I will refrain from a -1 on your question but it would be nice to know if this worked, to improve SO. Also, I can't edit just one character but it's "Apache" for Google not "Apace" haha :) This question shows very high in Google's search results, would help if you either delete it or complete it please? Commented Oct 27, 2016 at 14:10

2 Answers


Can you try setting

export PYTHONPATH=$PYTHONPATH:/usr/local/lib64/python2.7/site-packages

in $SPARK_CONF_DIR/spark-env.sh?
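
As a minimal sketch, the line added to spark-env.sh could look like this (assuming SPARK_CONF_DIR points to its usual default, $SPARK_HOME/conf, and using the site-packages path from the question):

# $SPARK_CONF_DIR/spark-env.sh
# Make the extra site-packages directory visible to PySpark
export PYTHONPATH=$PYTHONPATH:/usr/local/lib64/python2.7/site-packages

spark-env.sh is sourced when Spark programs are launched, so start a fresh pyspark shell after editing it.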


1 Comment

Ditto on this answer; it would be nice if the OP had responded, since this was surely the problem and the correct answer!

Joe Young's answer is good if you want to set the path "permanently." If you want to set it on a per-job basis, Continuum (the Anaconda folks) has this page about setting your PYTHONPATH job by job on the command line:

https://www.continuum.io/blog/developer-blog/using-anaconda-pyspark-distributed-language-processing-hadoop-cluster

For example (written for a Cloudera install; substitute your own Spark location):

Configuring the spark-submit command with your Hadoop Cluster

To use Python from Anaconda along with PySpark, you can set the PYSPARK_PYTHON environment variable on a per-job basis along with the spark-submit command. If you’re using the Anaconda parcel for CDH, you can run a PySpark script (e.g., spark-job.py) using the following command:

$ PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/bin/python spark-submit spark-job.py
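The same per-job pattern could carry the extra site-packages directory from the question as well. A hedged sketch (spark-job.py is a placeholder, and whether the executors also pick up PYTHONPATH depends on your cluster setup, so this only illustrates the command-line form):

$ PYTHONPATH=$PYTHONPATH:/usr/local/lib64/python2.7/site-packages spark-submit spark-job.py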

1 Comment

Looking at this more closely, I think Cloudera has a typo? Or it's an interesting way to form a command. I would normally put PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/bin/python and spark-submit spark-job.py on two lines, or separate them with a ;. But it works! Been using UNIX/Linux 20 years, learn something new every day! I tried the same VAR=value prefix in front of ls -l and it works too.
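
It isn't a typo: in POSIX shells, VAR=value command runs command with VAR set in its environment for that one invocation only. A quick illustration (the FOO variable is just for demonstration):

$ FOO=bar python -c 'import os; print(os.environ.get("FOO"))'   # prints bar
$ python -c 'import os; print(os.environ.get("FOO"))'           # prints None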
