
I'm currently trying to analyze some data in a notebook on EMR. The problem I'm having is that I can't figure out how to include specific artifacts when I'm using the PySpark kernel. Specifically, I'm trying to include org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0, which I would normally do on the command line when starting the PySpark environment by simply passing the --packages argument. Do I have to include a bootstrap action, maybe? I'm not entirely certain what I would even put there. Any help would be most appreciated.
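By the command-line approach I mean something like this (the exact invocation is only illustrative):

pyspark --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0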


1 Answer


I asked on Reddit and someone from the EMR team answered:

You can use a %%configure block as the first cell in your notebook to specify additional packages. In your case, it would look like this:

%%configure
{ "conf": {"spark.jars.packages": "org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0" }}

Here's a screenshot of an example notebook that loads spark-avro.
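That configure cell follows the same pattern; the spark-avro coordinates below are an assumption (the Spark 2.4 / Scala 2.11 build), not copied from the screenshot:

%%configure
{ "conf": {"spark.jars.packages": "org.apache.spark:spark-avro_2.11:2.4.0" }}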

(Disclaimer: AWS employee on the EMR team 👋)


1 Comment

Hi Victor! Thanks for your answer, it works great with generic libraries. However, I'm having some issues using it with a library called Clustering4ever (github.com/Clustering4Ever/Clustering4Ever). Does your answer work with this type of library as well? Sorry if the question is dumb, but I'm new to this and hitting my head against the wall!
