I'm currently trying to analyze some data in a notebook on EMR. The problem is that I can't figure out how to include specific artifacts when I'm using the PySpark kernel. Specifically, I'm trying to include org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0, which I would normally do from the command line when starting the PySpark environment by simply passing the --packages argument. Do I have to include a bootstrap action, maybe? I'm not entirely certain what I would even put there. Any help would be much appreciated.
1 Answer
I asked on reddit and someone from the EMR team answered:
You can use a %%configure block as the first cell in your notebook to specify additional packages. In your case, it would look like this:
%%configure
{ "conf": {"spark.jars.packages": "org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0" }}
Here's a screenshot of an example notebook that loads spark-avro.
(Disclaimer: AWS employee on the EMR team 👋)
1 Comment
Eugenia Castilla
Hi Victor! Thanks for your answer, it works great with generic libraries. However, I am having some issues using this approach with, for example, a library called Clustering4ever (github.com/Clustering4Ever/Clustering4Ever). Does your answer work with this type of library as well? Sorry if the question is dumb, but I am new to this and hitting my head against the wall!
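The same %%configure approach should extend to libraries that aren't on Maven Central, provided Spark is told where to resolve the artifact from. A sketch, assuming the library publishes a Scala 2.11 artifact to some Maven-compatible repository (the coordinates and repository URL below are illustrative placeholders, not the project's real ones; check the Clustering4ever docs for the actual values):

%%configure
{ "conf": {
    "spark.jars.packages": "org.example:clustering4ever_2.11:0.9.0",
    "spark.jars.repositories": "https://repo.example.com/maven/releases"
}}

Alternatively, if you have the jar itself (for example built from source and uploaded to a bucket), pointing "spark.jars" at its path (e.g. an s3:// URI on EMR) should add it to the session directly.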