
I am learning Apache Spark and how it interfaces with AWS. I've already created a master node on AWS with 6 slave nodes. I also have the following Python code written with Spark:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("print_num").setMaster("AWS_master_url")
sc = SparkContext(conf=conf)

# Distribute the list across the cluster
rdd = sc.parallelize([1, 2, 3, 4, 5])

# I want each of the 5 slave nodes to do the mapping work.
temp = rdd.map(lambda x: x + 1)

# I also want another slave node to do the reducing work.
for x in temp.sample(False, 1).collect():
    print(x)

My question is how I can set up the 6 slave nodes on AWS so that 5 of them do the mapping work, as indicated in the code, and the remaining slave node does the reducing work. I would really appreciate any help.

1 Answer


From what I understand, you cannot designate five nodes as map nodes and one as a reduce node within a single Spark cluster.

You could run two clusters: one with five nodes for the map tasks and one for the reduce tasks. You could then split your code into two jobs, submit them to the two clusters sequentially, and write the intermediate results to storage in between. However, this is likely to be less efficient than letting Spark handle the shuffle communication itself.
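A rough sketch of that two-job split, assuming S3 is used as the intermediate storage. The master URLs, app names, and bucket path are hypothetical placeholders, and in practice each half would be its own script submitted to its own cluster with spark-submit:

from pyspark import SparkConf, SparkContext

# Job 1 -- submitted to the five-node "map" cluster: apply the
# transformation and persist the intermediate result.
conf = SparkConf().setAppName("map_job").setMaster("spark://map-cluster-master:7077")
sc = SparkContext(conf=conf)
sc.parallelize([1, 2, 3, 4, 5]).map(lambda x: x + 1).saveAsTextFile("s3a://my-bucket/intermediate")
sc.stop()

# Job 2 -- submitted to the single-node "reduce" cluster: read the
# intermediate result back and collect it on the driver for printing.
conf = SparkConf().setAppName("reduce_job").setMaster("spark://reduce-cluster-master:7077")
sc = SparkContext(conf=conf)
for x in sc.textFile("s3a://my-bucket/intermediate").collect():
    print(x)
sc.stop()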

In Spark, the call to .map() is "lazy" in the sense that nothing executes until an "action" is called; in your code, that action is the call to .collect().
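As a small self-contained illustration of that laziness (a minimal sketch run with a local master rather than your AWS cluster):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("lazy_demo").setMaster("local[*]")
sc = SparkContext(conf=conf)

rdd = sc.parallelize([1, 2, 3, 4, 5])
temp = rdd.map(lambda x: x + 1)  # nothing runs here: .map() only records the transformation
result = temp.collect()          # the action triggers the actual computation on the executors
print(result)                    # [2, 3, 4, 5, 6]
sc.stop()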

See https://spark.apache.org/docs/latest/programming-guide.html

Out of curiosity, is there a reason you want one node to handle all reductions?

Also, based on the documentation, the .sample() function takes three parameters (withReplacement, fraction, and an optional seed). Could you post the stderr and stdout from this code?
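For reference, RDD.sample(withReplacement, fraction, seed=None) looks like this in isolation (a minimal local sketch; the app name and numbers are arbitrary):

from pyspark import SparkContext

sc = SparkContext("local[*]", "sample_demo")
rdd = sc.parallelize(range(100))
# sample(withReplacement, fraction, seed): the seed is optional but makes the sample reproducible
subset = rdd.sample(False, 0.5, 42)
print(subset.count())  # roughly 50 -- fraction is a per-element probability, not an exact count
sc.stop()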


3 Comments

Thanks! Actually, I did this successfully on my local computer, and I just want it to run on AWS clusters using Spark. The reason I want 5 nodes as mappers and 1 node as a reducer is that I want to place one data file on each mapping node (in this example, one number per mapping node, which is a simplified version of what I actually want to do) and have my code, such as x = x + 1, run on each of those data files. I also need an additional slave node to collect and print all the results. I am just wondering how to make my code work on AWS using Spark.
In that case, why not just use the .foreach() function? This is an "action" that will apply a function to each element in the RDD. If that is what you want to do, the .map() followed by .sample() may be unnecessarily complicated (see the sketch after these comments).
Yeah, you are right; I can use foreach() instead of map(). But I think my biggest problem is how to make the code run on each of the slave nodes in AWS.
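A minimal sketch of the .foreach() approach suggested in the comments above, assuming the goal is simply for each executor to process and print its own elements. Note that the output appears in the executors' stdout, not on the driver; the local master here is a placeholder:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("foreach_demo").setMaster("local[*]")
sc = SparkContext(conf=conf)

def process(x):
    # Runs on whichever executor holds this element; prints to that
    # executor's stdout rather than the driver's console.
    print(x + 1)

sc.parallelize([1, 2, 3, 4, 5]).foreach(process)
sc.stop()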
