
We use Python with the PySpark API to run simple code on a Spark cluster.

from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName('appName').setMaster('spark://clusterip:7077')
sc = SparkContext(conf=conf)

rdd = sc.parallelize([1, 2, 3, 4])
rdd.map(lambda x: x**2).collect()

It works when we set up a Spark cluster locally and with Docker.

We would now like to start an EMR cluster and test the same code, but it seems that PySpark can't connect to the Spark cluster on EMR.

We opened ports 8080 and 7077 from our machine to the Spark master.

We are getting past the firewall, but it seems that nothing is listening on port 7077, and we get connection refused.
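For reference, the check can be reproduced with something like the following (a sketch; it assumes nc/netcat is available on the local machine, and the IP is a placeholder for the real master address):

EMR_MASTER_IP='255.16.17.13'      # placeholder, replace with the master's address
nc -vz ${EMR_MASTER_IP} 7077      # fails with 'connection refused' -- nothing listening
nc -vz ${EMR_MASTER_IP} 8080      # same check for the web UI port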

We found this explaining how to submit a job using the CLI, but we need to run it directly from the PySpark API on the driver.

What are we missing here?

How can one start an EMR cluster and actually run PySpark code locally in Python against this cluster?

Edit: running this code from the master itself works. As opposed to what was suggested, when connecting to the master using SSH and running Python from the terminal, the very same code (with proper adjustments for the master IP, given it's the same machine) works. No issues, no problems. How does this make sense given the documentation that clearly states otherwise?


2 Answers


You are trying to run pyspark (which calls spark-submit) from a remote computer outside the Spark cluster. This is technically possible, but it is not the intended way of deploying applications. In YARN mode, it makes your computer participate in the Spark protocol as a client, so it would require opening several ports and installing exactly the same Spark JARs as on the AWS EMR cluster.

From the spark-submit docs:

 A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster)

A simple deploy strategy is:

  • sync code to master node via rsync, scp or git
cd ~/projects/spark-jobs # on local machine
EMR_MASTER_IP='255.16.17.13'
TARGET_DIR=spark_jobs
rsync -avze "ssh -i ~/dataScienceKey.pem" --rsync-path="mkdir -p ${TARGET_DIR} && rsync" --delete ./ hadoop@${EMR_MASTER_IP}:${TARGET_DIR}
  • ssh to the master node
ssh -i ~/dataScienceKey.pem hadoop@${EMR_MASTER_IP}
  • run spark-submit on the master node
cd spark_jobs
spark-submit --master yarn --deploy-mode cluster my-job.py
# my-job.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("my-job-py").getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize([1, 2, 3, 4])
res = rdd.map(lambda x: x**2).collect()
print(res)

There is a way to submit the job directly to Spark on EMR without syncing. Spark EMR runs Apache Livy on port 8998 by default. It is a REST web service which allows you to submit jobs via a REST API. You can pass the same spark-submit parameters with a curl call from your machine. See the doc.
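As a sketch (the S3 path and the master address below are placeholders, and port 8998 must be reachable from your machine), submitting the same job through Livy's batches endpoint could look roughly like this:

EMR_MASTER_IP='255.16.17.13'      # placeholder, replace with the master's address
# submit the job file (must be reachable by the cluster, e.g. on S3)
curl -X POST \
  -H 'Content-Type: application/json' \
  -d '{"file": "s3://my-bucket/spark_jobs/my-job.py", "name": "my-job-py"}' \
  http://${EMR_MASTER_IP}:8998/batches
# list batches to check the state of the submitted job
curl http://${EMR_MASTER_IP}:8998/batches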

For interactive development we have also configured locally running Jupyter notebooks which automatically submit cell runs to Livy. This is done via the sparkmagic project.
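A minimal local setup is roughly the following (a sketch based on sparkmagic's example config; the keys shown and the placeholder address are assumptions to adapt to your cluster):

pip install sparkmagic
mkdir -p ~/.sparkmagic
cat > ~/.sparkmagic/config.json <<'EOF'
{
  "kernel_python_credentials": {
    "username": "",
    "password": "",
    "url": "http://255.16.17.13:8998",
    "auth": "None"
  }
}
EOF
# then, in a notebook running the regular Python kernel:
#   %load_ext sparkmagic.magics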


1 Comment

We use the setup described here from a Jupyter Sagemaker notebook with the Sparkmagic kernel and it works very well. Livy makes it transparent to the notebook user that the code is running on a full-blown always-on EMR cluster provisioned via EMR.

According to this Amazon Doc, you can't do that:

Common errors

Standalone mode

Amazon EMR doesn't support standalone mode for Spark. It's not possible to submit a Spark application to a remote Amazon EMR cluster with a command like this:

SparkConf conf = new SparkConf().setMaster("spark://master_url:7077").setAppName("WordCount");

Instead, set up your local machine as explained earlier in this article. Then, submit the application using the spark-submit command.

You can follow the above linked resource to configure your local machine in order to submit Spark jobs to the EMR cluster. Or, more simply, use the SSH key you specified when you created your cluster to connect to the master node and submit Spark jobs from there:

ssh -i ~/path/ssh_key hadoop@<master_ip_address>
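If you do want the first option, here is a rough sketch of what configuring the local machine involves (the linked article has the authoritative steps; the paths and the placeholder address below are assumptions): install the same Spark version as the EMR release locally, copy the Hadoop/YARN client configuration from the master, and point spark-submit at YARN.

# copy the cluster's Hadoop/YARN client config from the master (placeholder address)
mkdir -p ~/emr-conf
scp -i ~/path/ssh_key hadoop@<master_ip_address>:/etc/hadoop/conf/* ~/emr-conf/
export HADOOP_CONF_DIR=~/emr-conf
# submit against YARN from the local machine
spark-submit --master yarn --deploy-mode cluster my-job.py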

6 Comments

The answer is true, so I upvote. However, a Spark install always ships with pyspark, and it can be launched in YARN mode. @thebeancounter I will post some snippets once I'm at my notebook.
So if I connect to the master, I do manage to run this code from Python. How is that?
I mean that when connected to the master, you can submit jobs, not run that piece of code as it is.
@thebeancounter The difference is that when you execute it on the master node you are inside the EMR cluster and not on your local machine. And what is said in the link I provided is that you cannot use Spark EMR in standalone mode from outside the EMR cluster. In other words, when you are on the master node it's the same thing as when you execute it on your local Spark cluster. I don't know if I'm clear in my explanation.
The article had been updated before it was removed. Here is the updated version: web.archive.org/web/20210827045733/https://aws.amazon.com/… And here is the original version: web.archive.org/web/20210820041635/https://aws.amazon.com/…
