
I've written Python code that sums all the numbers in the first column of each csv file, as follows:

import os, sys, inspect, csv

### Current directory path.
curr_dir = os.path.split(inspect.getfile(inspect.currentframe()))[0]

### Setup the environment variables
spark_home_dir = os.path.realpath(os.path.abspath(os.path.join(curr_dir, "../spark")))
python_dir = os.path.realpath(os.path.abspath(os.path.join(spark_home_dir, "./python")))
os.environ["SPARK_HOME"] = spark_home_dir
os.environ["PYTHONPATH"] = python_dir

### Setup pyspark directory path
pyspark_dir = python_dir
sys.path.append(pyspark_dir)

### Import the pyspark
from pyspark import SparkConf, SparkContext

### Specify the data file directory, and load the data files
data_path = os.path.realpath(os.path.abspath(os.path.join(curr_dir, "./test_dir")))

### myfunc adds up all the numbers in the first column of a csv file.
def myfunc(s):
    total = 0
    if s.endswith(".csv"):
        cr = csv.reader(open(s, "rb"))
        for row in cr:
            total += int(row[0])
    return total

def main():
### Initialize the SparkConf and SparkContext
    conf = SparkConf().setAppName("ruofan").setMaster("spark://ec2-52-26-177-197.us-west-2.compute.amazonaws.com:7077")
    sc = SparkContext(conf = conf)
    datafile = sc.wholeTextFiles(data_path)

    ### Send the summation job out to the slave nodes
    temp = datafile.map(lambda (path, content): myfunc(str(path).strip('file:')))

    ### Collect the result and print it out.
    for x in temp.collect():
        print x

if __name__ == "__main__":
    main()
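
A note on what the map step actually does (a sketch, not the code above): sc.wholeTextFiles already hands each task the file's path and its full contents as a string, so a variant of myfunc could sum the first column straight from that string instead of re-opening the path on each worker's local disk. The helper name sum_first_column below is made up for illustration.

### Sketch: sum the first column from the content string that wholeTextFiles
### returns, so the csv files do not need to exist on every worker's local disk.
def sum_first_column(content):
    total = 0
    for line in content.splitlines():
        line = line.strip()
        if line:
            total += int(line.split(",")[0])
    return total

### Inside main(), the map step would then read the content directly:
### temp = datafile.map(lambda (path, content): sum_first_column(content))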

I would like to use Apache Spark to parallelize the summation over several csv files using the same Python code. I've already done the following steps:

  1. I've created one master and two slave nodes on AWS.
  2. I've used the bash command $ scp -r -i my-key-pair.pem my_dir [email protected] to upload the directory my_dir, which contains my Python code and the csv files, to the cluster's master node.
  3. I've logged into my master node, and from there used the bash command $ ./spark/copy-dir my_dir to copy my Python code and the csv files to all slave nodes.
  4. I've set up the environment variables on the master node (one more path that can matter is sketched right after these steps):

    $ export SPARK_HOME=~/spark

    $ export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
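
One hedged note on step 4: many Spark distributions ship their own copy of py4j as a zip under $SPARK_HOME/python/lib rather than as a pip-installed package, which can explain an ImportError like the one below. A minimal sketch of extending the bootstrap at the top of sum.py, assuming a file named like py4j-*-src.zip exists there (spark_home_dir is the variable already defined in sum.py; skip this if your layout differs):

import glob, os, sys

### Sketch: put Spark's bundled py4j (if present) on the import path before
### "from pyspark import ..." runs. Assumes python/lib/py4j-<version>-src.zip
### exists under the Spark home directory; adjust or drop if it doesn't.
py4j_zips = glob.glob(os.path.join(spark_home_dir, "python", "lib", "py4j-*-src.zip"))
if py4j_zips:
    sys.path.append(py4j_zips[0])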

However, when I run the Python code on the master node with $ python sum.py, it shows the following error:

Traceback (most recent call last):
  File "sum.py", line 18, in <module>
    from pyspark import SparkConf, SparkContext
  File "/root/spark/python/pyspark/__init__.py", line 41, in <module>
    from pyspark.context import SparkContext
  File "/root/spark/python/pyspark/context.py", line 31, in <module>
    from pyspark.java_gateway import launch_gateway
  File "/root/spark/python/pyspark/java_gateway.py", line 31, in <module>
    from py4j.java_gateway import java_import, JavaGateway, GatewayClient
ImportError: No module named py4j.java_gateway

I have no idea what is causing this error. Also, I am wondering whether the master node automatically makes all the slave nodes run the job in parallel. I'd really appreciate it if anyone can help me.

2 Answers


Here is how I would debug this particular import error (a small check script is sketched after the steps).

  1. ssh to your master node
  2. Run the python REPL with $ python
  3. Try the failing import line >>> from py4j.java_gateway import java_import, JavaGateway, GatewayClient
  4. If it fails, try simply running >>> import py4j
  5. If that fails, it means that your system either does not have py4j installed or cannot find it.
  6. Exit the REPL >>> exit()
  7. Try installing py4j with $ pip install py4j (you'll need to have pip installed).
  8. Open the REPL $ python
  9. Try importing again >>> from py4j.java_gateway import java_import, JavaGateway, GatewayClient
  10. If that works, then >>> exit() and try running your $ python sum.py again.
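
To make steps 3 through 5 (and step 9) quick to repeat, a tiny throwaway script can report whether py4j is importable and where Python found it. This is only a sketch; the filename check_py4j.py is made up.

### check_py4j.py (hypothetical): report whether py4j is importable and from where.
try:
    import py4j
    print "py4j found at:", py4j.__file__
except ImportError:
    print "py4j is not importable; install it (e.g. $ pip install py4j) or put"
    print "Spark's bundled copy on your PYTHONPATH."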

2 Comments

Thanks for the comment! I just figured out why my setup does not work: it seems that the AMI running Spark on AWS is not Ubuntu based, and py4j is not installed at all.
I'm glad you got it worked out. Would you be willing to mark this as the correct answer?

I think you are asking two separate questions. It looks like you have an import error. Is it possible that you have a different version of the package py4j installed on your local computer that you haven't installed on your master node?

I can't help with running this in parallel.

3 Comments

Thanks for the comment! I don't think this is the issue, as my code should run on any version of Spark with its packages. Also, Spark on the master node was already set up when I created it.
Well, it's definitely an import error. You need to debug with a focus on helping Python find the installed module. It could be related to your PATH: stackoverflow.com/questions/26533169/…
Spark on the AWS master node is installed automatically. Also, as I stated in my question, I exported the environment variables correctly, so I have no idea what is causing the problem...
