
I currently have a Spark cluster of 1 Driver and 2 Workers on version 2.4.5.

I would like to optimize parallelism further to get better throughput when loading and processing data. While doing this, I often see the following message on the console:

WARN scheduler.TaskSetManager: Stage contains a task of very large size (728 KB). The maximum recommended task size is 100 KB.

How does this work? I am fairly new to Spark but understand the basics of it. I would like to know how to optimize this, but I'm not sure whether it involves configuring the workers to have more executors (and get more parallelism that way), or whether I need to partition my DataFrames with either the coalesce or repartition functions.

Thank you guys in advance!

1 Answer

The general gist here is that you need to repartition into more, but smaller, partitions, so as to get more parallelism and higher throughput. The 728 KB is not a fixed threshold; it is the serialized size of a task in your Stage. I hit this sometimes when I first started out with Scala and Spark.
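As a sketch of the "more, smaller partitions" idea: a common rule of thumb is to target partitions of roughly 100–200 MB each. The dataset size, path, and variable names below are illustrative assumptions, not taken from the question, and the Spark calls are shown as comments because they need a live session:

```python
import math

def target_partitions(total_size_bytes, partition_size_bytes=128 * 1024 * 1024):
    """Rule-of-thumb partition count: total data size over a ~128 MB target."""
    return max(1, math.ceil(total_size_bytes / partition_size_bytes))

# Hypothetical 10 GB dataset -> 80 partitions of ~128 MB each.
n = target_partitions(10 * 1024 ** 3)

# With a live SparkSession (names and path are illustrative):
# df = spark.read.parquet("/data/events.parquet")
# df = df.repartition(n)   # full shuffle; can raise or lower the count
# df = df.coalesce(n)      # no full shuffle; can only *reduce* the count
```

Note the asymmetry: repartition triggers a full shuffle and can either increase or decrease the partition count, while coalesce avoids the shuffle but can only merge partitions down.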

I cannot see your code, so I will leave it at this. But searching here on SO also points to a lack of parallelism. In all honesty, it is a quite well-known issue.


2 Comments

There is not much code to be honest; I was asking about the general picture. All I do is read a parquet file with a sparkContext and then start working on it to get information. How can I partition this dataframe further, or how do I get more executors on my workers?
repartition or coalesce. It is a common enough issue.
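On the executor side of that question: in a standalone cluster like this one (1 driver, 2 workers), the resources per executor are mostly set at submit time. A hypothetical spark-submit sketch; the host, memory, core counts, and job file are placeholders to tune to your workers' actual hardware, not values from the question:

```shell
# Hypothetical values; adjust to your workers' real cores and RAM.
spark-submit \
  --master spark://driver-host:7077 \
  --executor-memory 4g \
  --executor-cores 2 \
  --total-executor-cores 8 \
  --conf spark.default.parallelism=16 \
  --conf spark.sql.shuffle.partitions=16 \
  my_job.py
```

On a standalone master, --total-executor-cores caps the cores used across the whole cluster, and spark.sql.shuffle.partitions controls how many partitions DataFrame shuffles produce (the default of 200 is often wrong for a small cluster).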
