
On my Hadoop cluster, the Anaconda package is installed in a path other than the default Python path. I get the error below when I try to import numpy in PySpark:

ImportError: No module named numpy

I am invoking PySpark through Oozie.

I tried to set this custom Python path using the approaches below.

Using Oozie tags:

<property>
  <name>oozie.launcher.mapreduce.map.env</name>
  <value>PYSPARK_PYTHON=/var/opt/teradata/anaconda2/bin/python2.7</value>
</property>

Using the spark-opts tag:

<spark-opts>--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/var/opt/teradata/anaconda2/bin/python2.7 --conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=/var/opt/teradata/anaconda2/bin/python2.7 --conf spark.pyspark.python=/var/opt/teradata/anaconda2/bin/python2.7 --conf spark.pyspark.driver.python=/var/opt/teradata/anaconda2/bin/python2.7</spark-opts>

Nothing works.

When I run a plain Python script it works fine. The problem only appears when the code runs through PySpark.

I even set this in the PySpark script's shebang:

#! /usr/bin/env /var/opt/teradata/anaconda2/bin/python2.7

When I print sys.path in my PySpark code, it still shows the default paths below:

['/usr/lib/python27.zip', '/usr/lib64/python2.7', '/usr/lib64/python2.7/plat-linux2', '/usr/lib64/python2.7/lib-tk', '/usr/lib64/python2.7/lib-old', '/usr/lib64/python2.7/lib-dynload', '/usr/lib64/python2.7/site-packages', '/usr/local/lib64/python2.7/site-packages', '/usr/local/lib/python2.7/site-packages', '/usr/lib/python2.7/site-packages']
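
For reference, this is roughly how I am checking the interpreter and module path inside the job. It is a minimal sketch; in the real job the SparkContext already exists, and the app name here is made up:

import sys
from pyspark import SparkContext

sc = SparkContext(appName="python-path-check")

# Interpreter and search path on the driver
print(sys.executable)
print(sys.path)

# Interpreter and search path on an executor
info = sc.parallelize([0], 1).map(lambda _: (sys.executable, sys.path)).collect()
print(info)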

1 Answer

I'm having this same issue. In my case it seems the pyspark.ml classes (e.g. Vector) import numpy behind the scenes, but don't look for it in the standard installation places. Even though the Python 2 and Python 3 versions of numpy are installed on all nodes of the cluster, PySpark still complains that it can't find it.
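
For what it's worth, something this small is enough to trigger it for me; pyspark.ml.linalg imports numpy when the module loads, so the import line alone fails (the specific class below is just an example):

# pyspark.ml.linalg does "import numpy" at module load, so this line by
# itself raises ImportError: No module named numpy on a bad interpreter.
from pyspark.ml.linalg import Vectors

# If the import succeeds, numpy is visible and everything else works.
v = Vectors.dense([1.0, 2.0, 3.0])
print(v)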

I've tried a lot of suggestions that have not worked.

Two things that have been suggested that I haven't tried (rough sketches of both follow the list):

1) Use the bashrc for the user that PySpark runs as (ubuntu for me) to set the preferred python path. Do this on all nodes.

2) Have the PySpark script attempt to install the module in question as part of its functionality (e.g. by shelling out to pip/pip3).
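
For 1), my guess is a couple of export lines in ~/.bashrc for the job user on every node, reusing the Anaconda path from the question (the path itself is the questioner's, not something I've verified):

export PYSPARK_PYTHON=/var/opt/teradata/anaconda2/bin/python2.7
export PYSPARK_DRIVER_PYTHON=/var/opt/teradata/anaconda2/bin/python2.7

For 2), a minimal sketch, assuming pip is available to the interpreter that actually runs the job:

import subprocess
import sys

# Best-effort install using whatever interpreter is executing this script.
# Installing "numpy" via "-m pip" is an assumption about the environment.
subprocess.check_call([sys.executable, "-m", "pip", "install", "numpy"])

import numpy  # should import now if the install succeeded
print(numpy.__version__)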

I'll keep an eye here and if I find an answer I'll post it.
