
We use Python with the PySpark API to run simple code on a Spark cluster.

from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName('appName').setMaster('spark://clusterip:7077')
sc = SparkContext(conf=conf)

rdd = sc.parallelize([1, 2, 3, 4])
rdd.map(lambda x: x**2).collect()

It works when we set up a Spark cluster locally and with Docker.

We would now like to start an EMR cluster and test the same code, but it seems that PySpark can't connect to the Spark cluster on EMR.

We opened ports 8080 and 7077 from our machine to the Spark master.

We are getting past the firewall, but it seems that nothing is listening on port 7077, and we get connection refused.
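For reference, the check can be reproduced with something like the following (a sketch; it assumes nc/netcat is available on the local machine, and the IP is a placeholder for the real master address):

EMR_MASTER_IP='255.16.17.13'      # placeholder, replace with the master's address
nc -vz ${EMR_MASTER_IP} 7077      # fails with 'connection refused' -- nothing listening
nc -vz ${EMR_MASTER_IP} 8080      # same check for the web UI port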

We found this explaining how to submit a job using the CLI, but we need to run it directly from the PySpark API on the driver.

What are we missing here?

How can one start an EMR cluster and actually run PySpark code locally in Python against this cluster?

Edit: running this code from the master itself works. As opposed to what was suggested, when connecting to the master using SSH and running Python from the terminal, the very same code (with proper adjustments for the master IP, given it's the same machine) works. No issues, no problems. How does this make sense given the documentation that clearly states otherwise?


2 Answers


You are trying to run pyspark (which calls spark-submit) from a remote computer outside the Spark cluster. This is technically possible, but it is not the intended way of deploying applications. In YARN mode, it makes your computer participate in the Spark protocol as a client, so it would require opening several ports and installing exactly the same Spark JARs as on the AWS EMR cluster.

From the spark-submit docs:

 A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster)

A simple deploy strategy is:

  • sync code to master node via rsync, scp or git
cd ~/projects/spark-jobs # on local machine
EMR_MASTER_IP='255.16.17.13'
TARGET_DIR=spark_jobs
rsync -avze "ssh -i ~/dataScienceKey.pem" --rsync-path="mkdir -p ${TARGET_DIR} && rsync" --delete ./ hadoop@${EMR_MASTER_IP}:${TARGET_DIR}
  • ssh to the master node
ssh -i ~/dataScienceKey.pem hadoop@${EMR_MASTER_IP}
  • run spark-submit on the master node
cd spark_jobs
spark-submit --master yarn --deploy-mode cluster my-job.py
# my-job.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("my-job-py").getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize([1, 2, 3, 4])
res = rdd.map(lambda x: x**2).collect()
print(res)

There is a way to submit the job directly to Spark on EMR without syncing. Spark EMR runs Apache Livy on port 8998 by default. It is a REST web service which allows you to submit jobs via a REST API. You can pass the same spark-submit parameters with a curl call from your machine. See the doc.
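As a sketch (the S3 path and the master address below are placeholders, and port 8998 must be reachable from your machine), submitting the same job through Livy's batches endpoint could look roughly like this:

EMR_MASTER_IP='255.16.17.13'      # placeholder, replace with the master's address
# submit the job file (must be reachable by the cluster, e.g. on S3)
curl -X POST \
  -H 'Content-Type: application/json' \
  -d '{"file": "s3://my-bucket/spark_jobs/my-job.py", "name": "my-job-py"}' \
  http://${EMR_MASTER_IP}:8998/batches
# list batches to check the state of the submitted job
curl http://${EMR_MASTER_IP}:8998/batches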

For interactive development we have also configured locally running Jupyter notebooks which automatically submit cell runs to Livy. This is done via the sparkmagic project.
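A minimal local setup is roughly the following (a sketch based on sparkmagic's example config; the keys shown and the placeholder address are assumptions to adapt to your cluster):

pip install sparkmagic
mkdir -p ~/.sparkmagic
cat > ~/.sparkmagic/config.json <<'EOF'
{
  "kernel_python_credentials": {
    "username": "",
    "password": "",
    "url": "http://255.16.17.13:8998",
    "auth": "None"
  }
}
EOF
# then, in a notebook running the regular Python kernel:
#   %load_ext sparkmagic.magics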


1 Comment

We use the setup described here from a Jupyter Sagemaker notebook with the Sparkmagic kernel and it works very well. Livy makes it transparent to the notebook user that the code is running on a full-blown always-on EMR cluster provisioned via EMR.

According to this Amazon Doc, you can't do that:

Common errors

Standalone mode

Amazon EMR doesn't support standalone mode for Spark. It's not possible to submit a Spark application to a remote Amazon EMR cluster with a command like this:

SparkConf conf = new SparkConf().setMaster("spark://master_url:7077").setAppName("WordCount");

Instead, set up your local machine as explained earlier in this article. Then, submit the application using the spark-submit command.

You can follow the above linked resource to configure your local machine in order to submit Spark jobs to the EMR cluster. Or, more simply, use the SSH key you specified when you created your cluster to connect to the master node and submit Spark jobs from there:

ssh -i ~/path/ssh_key hadoop@<master_ip_address>
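If you do want the first option, here is a rough sketch of what configuring the local machine involves (the linked article has the authoritative steps; the paths and the placeholder address below are assumptions): install the same Spark version as the EMR release locally, copy the Hadoop/YARN client configuration from the master, and point spark-submit at YARN.

# copy the cluster's Hadoop/YARN client config from the master (placeholder address)
mkdir -p ~/emr-conf
scp -i ~/path/ssh_key hadoop@<master_ip_address>:/etc/hadoop/conf/* ~/emr-conf/
export HADOOP_CONF_DIR=~/emr-conf
# submit against YARN from the local machine
spark-submit --master yarn --deploy-mode cluster my-job.py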

6 Comments

The answer is true, so I upvote. However, a Spark install always ships with pyspark, and it can be launched in YARN mode. @thebeancounter I will post some snippets once I'm at my notebook.
So if I connect to the master, I do manage to run this code from Python. How is that?
I mean that when connected to the master, you can submit jobs, not run that piece of code as it is.
@thebeancounter The difference is that when you execute it on the master node you are inside the EMR cluster and not on your local machine. And what is said in the link I provided is that you cannot use Spark EMR in standalone mode from outside the EMR cluster. In other words, when you are on the master node it's the same thing as when you execute it on your local Spark cluster. I don't know if I'm clear in my explanation.
The article had been updated before it was removed. Here is the updated version: web.archive.org/web/20210827045733/https://aws.amazon.com/… And here is the original version: web.archive.org/web/20210820041635/https://aws.amazon.com/…
