
I’m spark-submitting a Python file that imports numpy, but I’m getting a "No module named numpy" error.

$ spark-submit --py-files projects/other_requirements.egg projects/jobs/my_numpy_als.py
Traceback (most recent call last):
  File "/usr/local/www/my_numpy_als.py", line 13, in <module>
    from pyspark.mllib.recommendation import ALS
  File "/usr/lib/spark/python/pyspark/mllib/__init__.py", line 24, in <module>
    import numpy
ImportError: No module named numpy

I was thinking I would pull in an egg for numpy with --py-files, but I'm having trouble figuring out how to build that egg. Then it occurred to me that pyspark itself uses numpy, so it would be silly to pull in my own version of numpy.

Any idea on the appropriate thing to do here?

4 Answers


It looks like Spark is using a version of Python that does not have numpy installed. It could be because you are working inside a virtual environment.

Try this:

import os
import sys

# The following is for specifying a Python version for PySpark. Here we
# use the Python interpreter that is currently running this script.
# This is handy for when we are using a virtualenv, for example, because
# otherwise Spark would choose the default system Python version.
os.environ['PYSPARK_PYTHON'] = sys.executable
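
As a minimal sketch of where that assignment might go (not part of the original answer; the app name below is a placeholder), set it before the SparkContext is created:

import os
import sys

# Point PySpark at the interpreter that is currently running this script
# (e.g. the virtualenv's python) before any Spark machinery starts up.
os.environ['PYSPARK_PYTHON'] = sys.executable

from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS

sc = SparkContext(appName="my_numpy_als")
# ... build the ratings RDD and call ALS.train(...) as in the original script ...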

1 Comment

Try installing the full SciPy stack or a standalone NumPy package for the Python binary you're currently using: scipy.org/install.html
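
If it is unclear which binary that is, one quick way to check (a sketch of my own, not from the comment) is to print sys.executable on both the driver and the executors:

import sys
from pyspark import SparkContext

sc = SparkContext(appName="which-python")

# Interpreter running the driver.
print("driver: %s" % sys.executable)

def worker_python(_):
    import sys  # imported on the executor, not the driver
    return sys.executable

# Interpreter(s) running the executors.
print("executors: %s" % sc.parallelize(range(4)).map(worker_python).distinct().collect())

sc.stop()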

I got this to work by installing numpy on all the EMR nodes with a small bootstrap script that contains the following (among other things):

#!/bin/bash -xe
sudo yum install python-numpy python-scipy -y

Then configure the bootstrap script to be executed when you start your cluster by adding the following option to the aws emr command (this example also passes an argument to the bootstrap script):

--bootstrap-actions Path=s3://some-bucket/keylocation/bootstrap.sh,Name=setup_dependencies,Args=[s3://some-bucket]

This can be used when setting up a cluster automatically from DataPipeline as well.
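
The same bootstrap action can also be attached when the cluster is launched programmatically; the sketch below uses boto3 rather than Data Pipeline, and the release label, instance types and IAM roles are placeholders, not values from this answer:

import boto3

emr = boto3.client("emr")

# Launch a cluster with the numpy/scipy bootstrap action attached.
emr.run_job_flow(
    Name="cluster-with-numpy",
    ReleaseLabel="emr-5.30.0",
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    BootstrapActions=[
        {
            "Name": "setup_dependencies",
            "ScriptBootstrapAction": {
                "Path": "s3://some-bucket/keylocation/bootstrap.sh",
                "Args": ["s3://some-bucket"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)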



Sometimes, when you import certain libraries, your namespace gets polluted with numpy functions. Built-ins such as min, max and sum are especially prone to being shadowed this way. Whenever in doubt, locate calls to these functions and replace them with __builtin__.sum etc. Doing so will sometimes be faster than locating the source of the pollution.
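
A small sketch of what that shadowing looks like (Python 2 syntax, since the answer refers to __builtin__; on Python 3 the module is called builtins):

from numpy import *   # the star import shadows built-ins such as sum

import __builtin__

values = [1, 2, 3]

print(sum(values))              # now resolves to numpy.sum
print(__builtin__.sum(values))  # explicitly calls the original built-in sum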



Make sure your spark-env.sh has PYSPARK_PYTHON pointing to the correct Python executable by adding the following line to the /conf/spark-env.sh file:

export PYSPARK_PYTHON=/your_python_exe_path
