I'm new to using Spark with PySpark on JupyterHub. I understand that before creating an EMR cluster I can use a bootstrap action to set up the environment on each node, e.g. installing Python packages/libraries. But if I have already started the EMR cluster, how can I install more Python packages/libraries without restarting it?
I searched and found some answers saying I can install packages from a cell in JupyterHub. For example,
%%spark
sc.install_pypi_package("matplotlib")
which I tried, and got this error:
RuntimeError: install_pypi_packages can only use called when spark.pyspark.virtualenv.enabled is set to true
So I tried to set that config in /usr/lib/spark/conf/spark-defaults.conf on the master node by adding this line to the file:
"spark.pyspark.virtualenv.enabled": "true"
But it didn't work; JupyterHub still returns the same error.
So I want to know:

1. What is the best practice for installing more Python packages/libraries on a cluster after the EMR cluster has already started?
2. How do I correctly set "spark.pyspark.virtualenv.enabled": "true"? Or can I set it in the software settings before creating the EMR cluster?
Thank you in advance.