
I've written Python code that sums all the numbers in the first column of each csv file, as follows:

import os, sys, inspect, csv

### Current directory path.
curr_dir = os.path.split(inspect.getfile(inspect.currentframe()))[0]

### Setup the environment variables
spark_home_dir = os.path.realpath(os.path.abspath(os.path.join(curr_dir, "../spark")))
python_dir = os.path.realpath(os.path.abspath(os.path.join(spark_home_dir, "./python")))
os.environ["SPARK_HOME"] = spark_home_dir
os.environ["PYTHONPATH"] = python_dir

### Setup pyspark directory path
pyspark_dir = python_dir
sys.path.append(pyspark_dir)

### Import the pyspark
from pyspark import SparkConf, SparkContext

### Specify the data file directory, and load the data files
data_path = os.path.realpath(os.path.abspath(os.path.join(curr_dir, "./test_dir")))

### myfunc adds up all the numbers in the first column of a csv file.
def myfunc(s):
    total = 0
    if s.endswith(".csv"):
        cr = csv.reader(open(s, "rb"))
        for row in cr:
            total += int(row[0])
    return total

def main():
### Initialize the SparkConf and SparkContext
    conf = SparkConf().setAppName("ruofan").setMaster("spark://ec2-52-26-177-197.us-west-2.compute.amazonaws.com:7077")
    sc = SparkContext(conf = conf)
    datafile = sc.wholeTextFiles(data_path)

    ### Send the summation job out to the slave nodes
    temp = datafile.map(lambda (path, content): myfunc(str(path).strip('file:')))

    ### Collect the result and print it out.
    for x in temp.collect():
        print x

if __name__ == "__main__":
    main()
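
A note on what the map step actually does (a sketch, not the code above): sc.wholeTextFiles already hands each task the file's path and its full contents as a string, so a variant of myfunc could sum the first column straight from that string instead of re-opening the path on each worker's local disk. The helper name sum_first_column below is made up for illustration.

### Sketch: sum the first column from the content string that wholeTextFiles
### returns, so the csv files do not need to exist on every worker's local disk.
def sum_first_column(content):
    total = 0
    for line in content.splitlines():
        line = line.strip()
        if line:
            total += int(line.split(",")[0])
    return total

### Inside main(), the map step would then read the content directly:
### temp = datafile.map(lambda (path, content): sum_first_column(content))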

I would like to use Apache Spark to parallelize the summation over several csv files using the same Python code. I've already done the following steps:

  1. I've created one master and two slave nodes on AWS.
  2. I've used the bash command $ scp -r -i my-key-pair.pem my_dir [email protected] to upload the directory my_dir, which contains my Python code and the csv files, to the cluster's master node.
  3. I've logged into my master node, and from there used the bash command $ ./spark/copy-dir my_dir to copy my Python code and the csv files to all slave nodes.
  4. I've set up the environment variables on the master node (one more path that can matter is sketched right after these steps):

    $ export SPARK_HOME=~/spark

    $ export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
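
One hedged note on step 4: many Spark distributions ship their own copy of py4j as a zip under $SPARK_HOME/python/lib rather than as a pip-installed package, which can explain an ImportError like the one below. A minimal sketch of extending the bootstrap at the top of sum.py, assuming a file named like py4j-*-src.zip exists there (spark_home_dir is the variable already defined in sum.py; skip this if your layout differs):

import glob, os, sys

### Sketch: put Spark's bundled py4j (if present) on the import path before
### "from pyspark import ..." runs. Assumes python/lib/py4j-<version>-src.zip
### exists under the Spark home directory; adjust or drop if it doesn't.
py4j_zips = glob.glob(os.path.join(spark_home_dir, "python", "lib", "py4j-*-src.zip"))
if py4j_zips:
    sys.path.append(py4j_zips[0])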

However, when I run the Python code on the master node with $ python sum.py, it shows the following error:

Traceback (most recent call last):
  File "sum.py", line 18, in <module>
    from pyspark import SparkConf, SparkContext
  File "/root/spark/python/pyspark/__init__.py", line 41, in <module>
    from pyspark.context import SparkContext
  File "/root/spark/python/pyspark/context.py", line 31, in <module>
    from pyspark.java_gateway import launch_gateway
  File "/root/spark/python/pyspark/java_gateway.py", line 31, in <module>
    from py4j.java_gateway import java_import, JavaGateway, GatewayClient
ImportError: No module named py4j.java_gateway

I have no idea what is causing this error. Also, I am wondering whether the master node automatically makes all the slave nodes run the job in parallel. I'd really appreciate it if anyone can help me.

2 Answers


Here is how I would debug this particular import error (a small check script is sketched after the steps).

  1. ssh to your master node
  2. Run the python REPL with $ python
  3. Try the failing import line >>> from py4j.java_gateway import java_import, JavaGateway, GatewayClient
  4. If it fails, try simply running >>> import py4j
  5. If that fails, it means that your system either does not have py4j installed or cannot find it.
  6. Exit the REPL >>> exit()
  7. Try installing py4j with $ pip install py4j (you'll need to have pip installed).
  8. Open the REPL $ python
  9. Try importing again >>> from py4j.java_gateway import java_import, JavaGateway, GatewayClient
  10. If that works, then >>> exit() and try running your $ python sum.py again.
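
To make steps 3 through 5 (and step 9) quick to repeat, a tiny throwaway script can report whether py4j is importable and where Python found it. This is only a sketch; the filename check_py4j.py is made up.

### check_py4j.py (hypothetical): report whether py4j is importable and from where.
try:
    import py4j
    print "py4j found at:", py4j.__file__
except ImportError:
    print "py4j is not importable; install it (e.g. $ pip install py4j) or put"
    print "Spark's bundled copy on your PYTHONPATH."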

2 Comments

Thanks for the comment! I just figured out why my setup does not work: it seems that the AMI running Spark on AWS is not Ubuntu based, and py4j is not installed at all.
I'm glad you got it worked out. Would you be willing to mark this as the correct answer?

I think you are asking two separate questions. It looks like you have an import error. Is it possible that you have a different version of the package py4j installed on your local computer that you haven't installed on your master node?

I can't help with running this in parallel.

3 Comments

Thanks for the comment! I don't think this is the issue, as my code should run on any version of Spark with its packages. Also, Spark on the master node was already set up when I created it.
Well, it's definitely an import error. You need to debug with a focus on helping Python find the installed module. It could be related to your PATH: stackoverflow.com/questions/26533169/…
Spark on the AWS master node is installed automatically. Also, as I stated in my question, I exported the environment variables correctly, so I have no idea what is causing the problem...
