I'm trying to run the following Python script locally, using the spark-submit command:

import sys
sys.path.insert(0, '.')
from pyspark import SparkContext, SparkConf
from commons.Utils import Utils

def splitComma(line):
    splits = Utils.COMMA_DELIMITER.split(line)
    return "{}, {}".format(splits[1], splits[2])

if __name__ == "__main__":
    conf = SparkConf().setAppName("airports").setMaster("local[2]")
    sc = SparkContext(conf=conf)

    airports = sc.textFile("in/airports.text")
    airportsInUSA = airports \
        .filter(lambda line: Utils.COMMA_DELIMITER.split(line)[3] == "\"United States\"")

    airportsNameAndCityNames = airportsInUSA.map(splitComma)
    airportsNameAndCityNames.saveAsTextFile("out/airports_in_usa.text")

The command used (while inside the project directory):

spark-submit rdd/AirportsInUsaSolution.py

I keep getting this error:

Traceback (most recent call last):
  File "/home/gustavo/Documentos/TCC/python_spark_yt/python-spark-tutorial/rdd/AirportsInUsaSolution.py", line 4, in <module>
    from commons.Utils import Utils
ImportError: No module named commons.Utils

Even though commons/Utils.py exists and defines a Utils class.

It seems the only imports it accepts are the ones that come with Spark itself, because the error persists whenever I try to import any other class or module from my project.


4 Answers

3
from pyspark import SparkContext, SparkConf

def splitComma(line):
    splits = Utils.COMMA_DELIMITER.split(line)
    return "{}, {}".format(splits[1], splits[2])

if __name__ == "__main__":
    conf = SparkConf().setAppName("airports").setMaster("local[2]")
    sc = SparkContext(conf=conf)

    # make the zipped commons package importable on the driver and executors
    sc.addPyFile('.../path/to/commons.zip')
    from commons.Utils import Utils

    airports = sc.textFile("in/airports.text")
    airportsInUSA = airports \
        .filter(lambda line: Utils.COMMA_DELIMITER.split(line)[3] == "\"United States\"")

    airportsNameAndCityNames = airportsInUSA.map(splitComma)
    airportsNameAndCityNames.saveAsTextFile("out/airports_in_usa.text")

Yes, out of the box it only resolves the imports that ship with Spark, because spark-submit sends just the main script to the workers. You can zip the required files (your commons package, numpy, etc.) and either add the archive at runtime with sc.addPyFile as above, or pass it to spark-submit with the --py-files parameter:

spark-submit --py-files rdd/file.zip rdd/AirportsInUsaSolution.py
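
For reference, one way to build such an archive from the project root (the archive name and path are just placeholders matching the command above; adjust them to your project):

zip -r rdd/file.zip commons/

The archive should contain the commons/ directory at its top level, so that from commons.Utils import Utils resolves once Spark puts the zip on the Python path.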


0

For Python to treat a directory as a package, you need to create an __init__.py file in that directory. The __init__.py file doesn't need to contain anything.

In this case, once you create __init__.py in the commons directory, you will be able to import from that package, as sketched below.
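
Assuming the layout from the question, the project would then look something like this (Utils.py being the module that defines the Utils class):

python-spark-tutorial/
    commons/
        __init__.py          <- can be empty
        Utils.py
    rdd/
        AirportsInUsaSolution.py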


0

I think the problem is in the Spark configuration. Add the PYSPARK_PYTHON environment variable to your ~/.bashrc. In my case it looks like this:

export PYSPARK_PYTHON=/home/comrade/environments/spark/bin/python3

where PYSPARK_PYTHON points to the Python executable of my "spark" environment.
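
For a one-off run you can also set the variable just for that invocation instead of editing ~/.bashrc (the interpreter name here is only an example):

PYSPARK_PYTHON=python3 spark-submit rdd/AirportsInUsaSolution.py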

Hope it helps!


-1

Create a Python script named Utils.py containing:

import re

class Utils:

    # matches commas that are outside of double-quoted fields
    COMMA_DELIMITER = re.compile(''',(?=(?:[^"]*"[^"]*")*[^"]*$)''')

Put this Utils.py script in a commons folder, and put that folder in your working directory (type pwd to find it). You can then import the Utils class:

from commons.Utils import Utils
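
As a quick sanity check, continuing from the import above, the regex splits on commas that sit outside of quotes (the sample line is made up for illustration, not taken from the real airports.text):

line = '1,"Goroka, PNG","Goroka","Papua New Guinea"'
print(Utils.COMMA_DELIMITER.split(line))
# ['1', '"Goroka, PNG"', '"Goroka"', '"Papua New Guinea"']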

Hope it will help you.

