
I'm new to Spark and am using PySpark on JupyterHub. I understand that before creating an EMR cluster I can use a bootstrap action to set up the environment on each node, for example to install Python packages/libraries. But if the EMR cluster is already running, how can I install more Python packages/libraries without restarting it?

I searched and found answers saying that I can install packages from a cell in JupyterHub. For example:

%%spark
sc.install_pypi_package("matplotlib")

I tried this and got the following error:

RuntimeError: install_pypi_packages can only use called when spark.pyspark.virtualenv.enabled is set to true

So I tried to set that config in /usr/lib/spark/conf/spark-defaults.conf on the master node by adding this line to the file:

"spark.pyspark.virtualenv.enabled": "true"

But it doesn't work; JupyterHub still returns the same error.
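
For comparison, the Spark documentation describes each line of spark-defaults.conf as a key and a value separated by whitespace, without the JSON-style quotes and colon used in the line above. A minimal sketch of that layout follows; `render_spark_defaults` is a hypothetical helper name, and whether setting the property in this file is enough for a JupyterHub session is an assumption that is not confirmed here.

```python
def render_spark_defaults(conf):
    """Render settings as spark-defaults.conf lines: key, whitespace, value."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())


# The property from the error message, written in the file's own format.
print(render_spark_defaults({"spark.pyspark.virtualenv.enabled": "true"}))
# prints: spark.pyspark.virtualenv.enabled true
```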

So I want to know:

  1. What is the best practice for installing more Python packages/libraries on the cluster nodes when the EMR cluster is already running?

  2. How do I set "spark.pyspark.virtualenv.enabled": "true", or can I set it in the software settings before creating the EMR cluster?

Thank you in advance.

  • I think the correct way, if the package is to be installed on all machines in the cluster, is to get an admin to add it as a bootstrap action: docs.aws.amazon.com/emr/latest/ManagementGuide/… Commented May 23, 2020 at 23:59
  • @chappers That's quite clear for me, thanks a lot! Commented May 26, 2020 at 4:22
  • I am seeing the same error on a SparkMagic (PySpark) kernel associated with an EMR cluster instance. Commented Jul 24, 2020 at 4:26

1 Answer


I ran into the same issue. Here's what I had to insert as a new block above the sc.install_pypi_package() call.

%%configure -f
{
    "conf": {
        [other configs relevant to your situation],
        "spark.pyspark.python": "python3",
        "spark.pyspark.virtualenv.enabled": "true",
        "spark.pyspark.virtualenv.type": "native",
        "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv"
    }
}

Inspired by https://aws.amazon.com/blogs/big-data/install-python-libraries-on-a-running-cluster-with-emr-notebooks/
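
Note that the "[other configs relevant to your situation]" line is a placeholder: the body of the %%configure cell has to be valid JSON, so it needs to be replaced with real key/value pairs or removed. As a rough sketch, the body can be generated in plain Python to make sure it is well formed. `build_configure_body` and the `spark.executor.memory` entry are illustrative assumptions, not part of the answer or of any EMR API.

```python
import json

# The four settings listed in the answer above.
VIRTUALENV_CONF = {
    "spark.pyspark.python": "python3",
    "spark.pyspark.virtualenv.enabled": "true",
    "spark.pyspark.virtualenv.type": "native",
    "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv",
}


def build_configure_body(extra_conf=None):
    """Return JSON text to paste under a `%%configure -f` line.

    extra_conf is a hypothetical dict standing in for the
    "[other configs relevant to your situation]" placeholder.
    """
    conf = dict(extra_conf or {})
    conf.update(VIRTUALENV_CONF)  # virtualenv settings win on key clashes
    return json.dumps({"conf": conf}, indent=4)


if __name__ == "__main__":
    # spark.executor.memory is only an illustrative extra setting.
    print(build_configure_body({"spark.executor.memory": "4g"}))
```

Run this locally, paste the printed JSON under `%%configure -f` in its own cell, and then retry the sc.install_pypi_package("matplotlib") call from the question in a later cell.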
