
I’m spark-submitting a Python file that imports numpy, but I’m getting a "No module named numpy" error.

$ spark-submit --py-files projects/other_requirements.egg projects/jobs/my_numpy_als.py
Traceback (most recent call last):
  File "/usr/local/www/my_numpy_als.py", line 13, in <module>
    from pyspark.mllib.recommendation import ALS
  File "/usr/lib/spark/python/pyspark/mllib/__init__.py", line 24, in <module>
    import numpy
ImportError: No module named numpy

I was thinking I would pull in an egg for numpy with --py-files, but I'm having trouble figuring out how to build that egg. Then it occurred to me that pyspark itself uses numpy, so it would be silly to pull in my own version of numpy.

Any idea on the appropriate thing to do here?

4 Answers


It looks like Spark is using a version of Python that does not have numpy installed. It could be because you are working inside a virtual environment.

Try this:

import os
import sys

# The following is for specifying a Python version for PySpark. Here we
# use the Python interpreter that is currently running this script.
# This is handy for when we are using a virtualenv, for example, because
# otherwise Spark would choose the default system Python version.
os.environ['PYSPARK_PYTHON'] = sys.executable
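
As a minimal sketch of where that assignment might go (not part of the original answer; the app name below is a placeholder), set it before the SparkContext is created:

import os
import sys

# Point PySpark at the interpreter that is currently running this script
# (e.g. the virtualenv's python) before any Spark machinery starts up.
os.environ['PYSPARK_PYTHON'] = sys.executable

from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS

sc = SparkContext(appName="my_numpy_als")
# ... build the ratings RDD and call ALS.train(...) as in the original script ...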

1 Comment

Try installing the full SciPy stack or a standalone NumPy package for the Python binary you're currently using: scipy.org/install.html
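
If it is unclear which binary that is, one quick way to check (a sketch of my own, not from the comment) is to print sys.executable on both the driver and the executors:

import sys
from pyspark import SparkContext

sc = SparkContext(appName="which-python")

# Interpreter running the driver.
print("driver: %s" % sys.executable)

def worker_python(_):
    import sys  # imported on the executor, not the driver
    return sys.executable

# Interpreter(s) running the executors.
print("executors: %s" % sc.parallelize(range(4)).map(worker_python).distinct().collect())

sc.stop()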

I got this to work by installing numpy on all the EMR nodes with a small bootstrap script that contains the following (among other things):

#!/bin/bash -xe
sudo yum install python-numpy python-scipy -y

Then configure the bootstrap script to be executed when you start your cluster by adding the following option to the aws emr command (this example also passes an argument to the bootstrap script):

--bootstrap-actions Path=s3://some-bucket/keylocation/bootstrap.sh,Name=setup_dependencies,Args=[s3://some-bucket]

This can be used when setting up a cluster automatically from DataPipeline as well.
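
The same bootstrap action can also be attached when the cluster is launched programmatically; the sketch below uses boto3 rather than Data Pipeline, and the release label, instance types and IAM roles are placeholders, not values from this answer:

import boto3

emr = boto3.client("emr")

# Launch a cluster with the numpy/scipy bootstrap action attached.
emr.run_job_flow(
    Name="cluster-with-numpy",
    ReleaseLabel="emr-5.30.0",
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    BootstrapActions=[
        {
            "Name": "setup_dependencies",
            "ScriptBootstrapAction": {
                "Path": "s3://some-bucket/keylocation/bootstrap.sh",
                "Args": ["s3://some-bucket"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)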



Sometimes, when you import certain libraries, your namespace gets polluted with numpy functions. Built-ins such as min, max and sum are especially prone to being shadowed this way. Whenever in doubt, locate calls to these functions and replace them with __builtin__.sum etc. Doing so will sometimes be faster than locating the source of the pollution.
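
A small sketch of what that shadowing looks like (Python 2 syntax, since the answer refers to __builtin__; on Python 3 the module is called builtins):

from numpy import *   # the star import shadows built-ins such as sum

import __builtin__

values = [1, 2, 3]

print(sum(values))              # now resolves to numpy.sum
print(__builtin__.sum(values))  # explicitly calls the original built-in sum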



Make sure your spark-env.sh has PYSPARK_PYTHON pointing to the correct Python executable by adding the following line to the /conf/spark-env.sh file:

export PYSPARK_PYTHON=/your_python_exe_path
