4

I am new to Spark and AWS. I am trying to install Jupyter on my Spark cluster (EMR), but in the end I am not able to open the Jupyter Notebook in my browser.

Context: Because of firewall restrictions at my workplace, I can't reach the IP addresses of the EMR clusters I create on a day-to-day basis. I have a dedicated EC2 instance (its IP address is whitelisted) that I use as a client to connect to the EMR clusters I create as needed.

I have access to the IP address of the EC2 instance and the ports 22 and 8080. I do not have access to the IP address of EMR cluster.

These are the steps I am following:

  1. Open PuTTY and connect to the EC2 instance.
  2. Establish a connection between my EC2 instance and the EMR cluster: ssh -i publickey.pem ec2-user@host name of the EMR cluster
  3. Install Jupyter on the Spark cluster using the following command: pip install jupyter

  4. Connect to Spark: PYSPARK_DRIVER_PYTHON=/usr/local/bin/jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=7777" pyspark --packages com.databricks:spark-csv_2.10:1.1.0 --master spark://127.0.0.1:7077 --executor-memory 6400M --driver-memory 6400M

  5. Establish a tunnel to the browser: ssh -L 0.0.0.0:8080:127.0.0.1:7777 ip-172-31-34-209 -i publickey.pem

  6. Open Jupyter in the browser:

http://host name of EMR cluster:8080

I am able to run the first 5 steps, but I am not able to open the Jupyter notebook in my browser.
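
One way to narrow this down is to check each hop separately: whether Jupyter is actually listening on port 7777 on the EMR master (step 4), and whether the forwarded port 8080 is open on the machine where the ssh -L command from step 5 was run. A rough sketch, assuming bash is available on both machines; `port_open` is a hypothetical helper written for this illustration, not an existing tool:

```shell
# Hypothetical helper (bash-only): succeeds if a TCP connection to host:port opens.
port_open() {
  local host="$1"
  local port="$2"
  # /dev/tcp/HOST/PORT is a bash redirection feature, not a real file.
  (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null
}

# Run on the EMR master: is Jupyter listening on the port from step 4?
if port_open 127.0.0.1 7777; then
  echo "7777: listening"
else
  echo "7777: nothing listening"
fi

# Run on the machine where the ssh -L command was started: is the forward open?
if port_open 127.0.0.1 8080; then
  echo "8080: listening"
else
  echo "8080: nothing listening"
fi
```

If 7777 is listening but 8080 is not, the problem is likely in the tunnel rather than in Jupyter itself.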

2 Answers

1

Didn't test it, as it involves setting up a test EMR server, but here's what should work:

Step 5:

ssh -i publickey.pem -L 8080:127.0.0.1:7777 HOSTNAME

Step 6:

Open the Jupyter notebook in your browser at http://127.0.0.1:8080
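
To make the pieces of the Step 5 command explicit, here is a minimal sketch. `build_tunnel_cmd` is a hypothetical helper (not part of ssh) that only prints the command instead of running it, and HOSTNAME is the same placeholder as above. The `-N` flag is an addition of mine: it tells ssh to only forward ports and not open a remote shell.

```shell
# Hypothetical helper: prints (does not run) an ssh local-forward command.
# -L LOCAL_PORT:127.0.0.1:REMOTE_PORT means: listen on LOCAL_PORT on the
# machine where ssh runs, and forward each connection to REMOTE_PORT on the
# remote host's loopback interface, where Jupyter is listening.
build_tunnel_cmd() {
  local key="$1"
  local local_port="$2"
  local remote_port="$3"
  local host="$4"
  printf 'ssh -i %s -N -L %s:127.0.0.1:%s %s\n' \
    "$key" "$local_port" "$remote_port" "$host"
}

build_tunnel_cmd publickey.pem 8080 7777 HOSTNAME
```

Because the listening side of `-L` is on the machine where ssh is started, the browser URL has to point at that machine's port 8080, which is why Step 6 uses 127.0.0.1 here.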


1 Comment

Thank you @user1669710. I had a similar problem. From reading your answer I realized that I had skipped Step 5 because I was already ssh'd into my server, and I forgot that I needed to ssh with this port forward.
-4

You can use an EMR notebook with Amazon EMR clusters running Apache Spark to remotely run queries and code. An EMR notebook is a "serverless" Jupyter notebook. It sits outside the cluster and takes care of cluster attachment without you having to worry about it.

More information here: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks.html
