
I currently have a Spark cluster of 1 Driver and 2 Workers on version 2.4.5.

I would like to optimize parallelism further to get better throughput when loading and processing data. While doing this, I often see the following message on the console:

WARN scheduler.TaskSetManager: Stage contains a task of very large size (728 KB). The maximum recommended task size is 100 KB.

How does this work? I am fairly new to Spark but understand the basics of it. I would like to know how to optimize this, but I'm not sure whether it involves configuring the workers to have more executors (and get more parallelism that way), or whether I need to partition my DataFrames with either the coalesce or repartition functions.

Thank you guys in advance!

1 Answer

The general gist here is that you need to repartition into more, but smaller, partitions, so as to get more parallelism and higher throughput. The 728 KB is not a fixed threshold; it is the serialized size of a task in your Stage. I hit this sometimes when I first started out with Scala and Spark.
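As a sketch of the "more, smaller partitions" idea: a common rule of thumb is to target partitions of roughly 100–200 MB each. The dataset size, path, and variable names below are illustrative assumptions, not taken from the question, and the Spark calls are shown as comments because they need a live session:

```python
import math

def target_partitions(total_size_bytes, partition_size_bytes=128 * 1024 * 1024):
    """Rule-of-thumb partition count: total data size over a ~128 MB target."""
    return max(1, math.ceil(total_size_bytes / partition_size_bytes))

# Hypothetical 10 GB dataset -> 80 partitions of ~128 MB each.
n = target_partitions(10 * 1024 ** 3)

# With a live SparkSession (names and path are illustrative):
# df = spark.read.parquet("/data/events.parquet")
# df = df.repartition(n)   # full shuffle; can raise or lower the count
# df = df.coalesce(n)      # no full shuffle; can only *reduce* the count
```

Note the asymmetry: repartition triggers a full shuffle and can either increase or decrease the partition count, while coalesce avoids the shuffle but can only merge partitions down.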

I cannot see your code, so I will leave it at this. But searching here on SO also points to a lack of parallelism. In all honesty, it is a quite well-known issue.


2 Comments

There is not much code to be honest; I was asking about the general picture. All I do is read a parquet file with a sparkContext and then start working on it to get information. How can I partition this dataframe further, or how do I get more executors on my workers?
repartition or coalesce. It is a common enough issue.
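On the executor side of that question: in a standalone cluster like this one (1 driver, 2 workers), the resources per executor are mostly set at submit time. A hypothetical spark-submit sketch; the host, memory, core counts, and job file are placeholders to tune to your workers' actual hardware, not values from the question:

```shell
# Hypothetical values; adjust to your workers' real cores and RAM.
spark-submit \
  --master spark://driver-host:7077 \
  --executor-memory 4g \
  --executor-cores 2 \
  --total-executor-cores 8 \
  --conf spark.default.parallelism=16 \
  --conf spark.sql.shuffle.partitions=16 \
  my_job.py
```

On a standalone master, --total-executor-cores caps the cores used across the whole cluster, and spark.sql.shuffle.partitions controls how many partitions DataFrame shuffles produce (the default of 200 is often wrong for a small cluster).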
