
I am learning Apache Spark and how it interfaces with AWS. I've already created a master node on AWS with 6 slave nodes. I also have the following Python code written with Spark:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("print_num").setMaster("AWS_master_url")
sc = SparkContext(conf=conf)

# Distribute the list across the cluster
rdd = sc.parallelize([1, 2, 3, 4, 5])

# I want each of the 5 slave nodes to do the mapping work.
temp = rdd.map(lambda x: x + 1)

# I also want another slave node to do the reducing work.
for x in temp.sample(False, 1).collect():
    print(x)

My question is how I can set up the 6 slave nodes on AWS so that 5 of them do the mapping work, as indicated in the code, and the remaining slave node does the reducing work. I would really appreciate any help.

1 Answer


From what I understand, you cannot designate five nodes as map nodes and one as a reduce node within a single Spark cluster.

You could run two clusters: one with five nodes for the map tasks and one for the reduce tasks. You could then split your code into two jobs, submit them to the two clusters sequentially, and write the intermediate results to storage in between. However, this is likely to be less efficient than letting Spark handle the shuffle communication itself.
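A rough sketch of that two-job split, assuming S3 is used as the intermediate storage. The master URLs, app names, and bucket path are hypothetical placeholders, and in practice each half would be its own script submitted to its own cluster with spark-submit:

from pyspark import SparkConf, SparkContext

# Job 1 -- submitted to the five-node "map" cluster: apply the
# transformation and persist the intermediate result.
conf = SparkConf().setAppName("map_job").setMaster("spark://map-cluster-master:7077")
sc = SparkContext(conf=conf)
sc.parallelize([1, 2, 3, 4, 5]).map(lambda x: x + 1).saveAsTextFile("s3a://my-bucket/intermediate")
sc.stop()

# Job 2 -- submitted to the single-node "reduce" cluster: read the
# intermediate result back and collect it on the driver for printing.
conf = SparkConf().setAppName("reduce_job").setMaster("spark://reduce-cluster-master:7077")
sc = SparkContext(conf=conf)
for x in sc.textFile("s3a://my-bucket/intermediate").collect():
    print(x)
sc.stop()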

In Spark, the call to .map() is "lazy" in the sense that nothing executes until an "action" is called; in your code, that action is the call to .collect().
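As a small self-contained illustration of that laziness (a minimal sketch run with a local master rather than your AWS cluster):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("lazy_demo").setMaster("local[*]")
sc = SparkContext(conf=conf)

rdd = sc.parallelize([1, 2, 3, 4, 5])
temp = rdd.map(lambda x: x + 1)  # nothing runs here: .map() only records the transformation
result = temp.collect()          # the action triggers the actual computation on the executors
print(result)                    # [2, 3, 4, 5, 6]
sc.stop()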

See https://spark.apache.org/docs/latest/programming-guide.html

Out of curiosity, is there a reason you want one node to handle all reductions?

Also, based on the documentation, the .sample() function takes three parameters (withReplacement, fraction, and an optional seed). Could you post the stderr and stdout from this code?
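For reference, RDD.sample(withReplacement, fraction, seed=None) looks like this in isolation (a minimal local sketch; the app name and numbers are arbitrary):

from pyspark import SparkContext

sc = SparkContext("local[*]", "sample_demo")
rdd = sc.parallelize(range(100))
# sample(withReplacement, fraction, seed): the seed is optional but makes the sample reproducible
subset = rdd.sample(False, 0.5, 42)
print(subset.count())  # roughly 50 -- fraction is a per-element probability, not an exact count
sc.stop()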


3 Comments

Thanks! Actually, I did this successfully on my local computer, and I just want it to run on AWS clusters using Spark. The reason I want 5 nodes as mappers and 1 node as a reducer is that I want to place one data file on each mapping node (in this example, one number per mapping node, which is a simplified version of what I actually want to do) and have my code, such as x = x + 1, run on each of those data files. I also need an additional slave node to collect and print all the results. I am just wondering how to make my code work on AWS using Spark.
In that case, why not just use the .foreach() function? This is an "action" that will apply a function to each element in the RDD. If that is what you want to do, the .map() followed by .sample() may be unnecessarily complicated (see the sketch after these comments).
Yeah, you are right; I can use foreach() instead of map(). But I think my biggest problem is how to make the code run on each of the slave nodes in AWS.
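A minimal sketch of the .foreach() approach suggested in the comments above, assuming the goal is simply for each executor to process and print its own elements. Note that the output appears in the executors' stdout, not on the driver; the local master here is a placeholder:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("foreach_demo").setMaster("local[*]")
sc = SparkContext(conf=conf)

def process(x):
    # Runs on whichever executor holds this element; prints to that
    # executor's stdout rather than the driver's console.
    print(x + 1)

sc.parallelize([1, 2, 3, 4, 5]).foreach(process)
sc.stop()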
