
Below are the configurations:

  1. Hadoop 2.x (1 master, 2 slaves)
    yarn.nodemanager.resource.memory-mb = 7096
    yarn.scheduler.maximum-allocation-mb = 2560
  2. Spark - 1.5.1
    spark/conf settings on all three nodes:
    spark.driver.memory 4g
    spark.executor.memory 2g
    spark.executor.instances 2

    spark-sql>CREATE TABLE demo USING org.apache.spark.sql.json OPTIONS path

This path has 32 GB of compressed data. It is taking 25 minutes to create the table demo. Is there any way to optimize this and bring it down to a few minutes? Am I missing something here?
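
For reference, the statement above is abbreviated; written out in full it looks roughly like the example below, where the path is only a placeholder for the real location of the 32 GB dataset:

    spark-sql> CREATE TABLE demo USING org.apache.spark.sql.json OPTIONS (path "/data/demo_json");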

1 Answer


Usually each executor should correspond to one CPU core. Also note that the master is the least relevant of all your machines, because it only assigns tasks to the slaves, which do the actual data processing. Your setup is therefore correct if your slaves are single-core machines, but in most cases you would do something like:

spark.driver.memory      // may be close to the whole memory of your master
spark.executor.instances // total number of CPU cores across all your slaves
spark.executor.memory    // (total memory of all slaves) / (spark.executor.instances)

That's the easiest formula, and it will work for the vast majority of Spark jobs.
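
As a concrete sketch of that formula applied to the cluster in the question: the per-slave core count is not stated, so assume 4 cores per slave, roughly 7 GB per slave available to YARN (matching yarn.nodemanager.resource.memory-mb above), and an 8 GB master. With those assumptions, spark-defaults.conf would look roughly like this:

    # most of the master's memory (assuming an 8 GB master)
    spark.driver.memory      6g
    # 2 slaves x 4 cores each (assumed core count)
    spark.executor.instances 8
    # (2 x ~7 GB) / 8 executors, rounded down to leave room for YARN overhead
    spark.executor.memory    1600m

Keep in mind that yarn.scheduler.maximum-allocation-mb (2560 MB in the question) caps each YARN container, so spark.executor.memory plus the YARN memory overhead (about 10%, with a 384 MB minimum in Spark 1.5) must stay below that limit.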
