We use python with pyspark api in order to run simple code on spark cluster.
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName('appName').setMaster('spark://clusterip:7077')
sc = SparkContext(conf=conf)
rdd = sc.parallelize([1, 2, 3, 4])
rdd.map(lambda x: x**2).collect()
It works when we setup a spark cluster locally and with dockers.
We would now like to start an emr cluster and test the same code. And seems that pyspark can't connect to the spark cluster on emr
We opened ports 8080 and 7077 from our machine to the spark master
We are getting past the firewall and just seems that nothing is listening on port 7077 and we get connection refused.
We found this explaining how to serve a job using the cli but we need to run it directly from pyspark api on the driver.
What are we missing here?
How can one start an emr cluster and actually run pyspark code locally on python using this cluster?
edit: running this code from the master itself works As opposed to what was suggested, when connecting to the master using ssh, and running python from the terminal, the very same code (with proper adjustments for the master ip, given it's the same machine) works. No issues no problems. How does this make sense given the documentation that clearly states otherwise?