I'm new to using Spark with PySpark on JupyterHub. I understand that before creating an EMR cluster I can use a bootstrap action to set up the environment on each node, e.g. installing Python packages/libraries. But if I have already started the EMR cluster, how can I install more Python packages/libraries without restarting it?
I searched and found some answers saying I can install packages from a cell in JupyterHub. For example,
%%spark
sc.install_pypi_package("matplotlib")
which I tried, and got this error:
RuntimeError: install_pypi_packages can only use called when spark.pyspark.virtualenv.enabled is set to true
So I tried to set that config in /usr/lib/spark/conf/spark-defaults.conf on the master node by adding this line to the file:
"spark.pyspark.virtualenv.enabled": "true"
But it didn't work; JupyterHub still returns the same error.
So I want to know:

1. What is the best practice for installing more Python packages/libraries on a cluster after the EMR cluster has already started?
2. How do I correctly set "spark.pyspark.virtualenv.enabled": "true"? Or can I set it in the software settings before creating the EMR cluster?
Thank you in advance.