I'm currently trying to analyze some data in a notebook on EMR. The problem is that I can't figure out how to include specific artifacts when I'm using the PySpark kernel. Specifically, I'm trying to include org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0, which I would normally do from the command line when starting the PySpark environment by simply passing the --packages argument. Do I have to include a bootstrap action, maybe? I'm not entirely certain what I would even put there. Any help would be much appreciated.
1 Answer
I asked on reddit and someone from the EMR team answered:
You can use a %%configure block as the first cell in your notebook to specify additional packages. In your case, it would look like this:
%%configure
{ "conf": {"spark.jars.packages": "org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0" }}
Here's a screenshot of an example notebook that loads spark-avro.
(Disclaimer: AWS employee on the EMR team 👋)
1 Comment
Eugenia Castilla
Hi Victor! Thanks for your answer, it works great with generic libraries. However, I am having some issues using this approach with, for example, a library called Clustering4ever (github.com/Clustering4Ever/Clustering4Ever). Does your answer work with this type of library as well? Sorry if the question is dumb, but I am new to this and hitting my head against the wall!
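The same %%configure approach should extend to libraries that aren't on Maven Central, provided Spark is told where to resolve the artifact from. A sketch, assuming the library publishes a Scala 2.11 artifact to some Maven-compatible repository (the coordinates and repository URL below are illustrative placeholders, not the project's real ones; check the Clustering4ever docs for the actual values):

%%configure
{ "conf": {
    "spark.jars.packages": "org.example:clustering4ever_2.11:0.9.0",
    "spark.jars.repositories": "https://repo.example.com/maven/releases"
}}

Alternatively, if you have the jar itself (for example built from source and uploaded to a bucket), pointing "spark.jars" at its path (e.g. an s3:// URI on EMR) should add it to the session directly.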