
Below are the configurations:

  1. Hadoop 2.x (1 master, 2 slaves)
    yarn.nodemanager.resource.memory-mb = 7096
    yarn.scheduler.maximum-allocation-mb = 2560
  2. Spark - 1.5.1
    spark/conf settings on all three nodes:
    spark.driver.memory 4g
    spark.executor.memory 2g
    spark.executor.instances 2

    spark-sql>CREATE TABLE demo USING org.apache.spark.sql.json OPTIONS path

This path has 32 GB of compressed data. It is taking 25 minutes to create the table demo. Is there any way to optimize this and bring it down to a few minutes? Am I missing something here?
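
For reference, the statement above is abbreviated; written out in full it looks roughly like the example below, where the path is only a placeholder for the real location of the 32 GB dataset:

    spark-sql> CREATE TABLE demo USING org.apache.spark.sql.json OPTIONS (path "/data/demo_json");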

1 Answer


Usually each executor should correspond to one CPU core. Also note that the master is the least relevant of all your machines, because it only assigns tasks to the slaves, which do the actual data processing. Your setup is therefore correct if your slaves are single-core machines, but in most cases you would do something like:

spark.driver.memory      // may be close to the whole memory of your master
spark.executor.instances // total number of CPU cores across all your slaves
spark.executor.memory    // (total memory of all slaves) / (spark.executor.instances)

That's the easiest formula, and it will work for the vast majority of Spark jobs.
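
As a concrete sketch of that formula applied to the cluster in the question: the per-slave core count is not stated, so assume 4 cores per slave, roughly 7 GB per slave available to YARN (matching yarn.nodemanager.resource.memory-mb above), and an 8 GB master. With those assumptions, spark-defaults.conf would look roughly like this:

    # most of the master's memory (assuming an 8 GB master)
    spark.driver.memory      6g
    # 2 slaves x 4 cores each (assumed core count)
    spark.executor.instances 8
    # (2 x ~7 GB) / 8 executors, rounded down to leave room for YARN overhead
    spark.executor.memory    1600m

Keep in mind that yarn.scheduler.maximum-allocation-mb (2560 MB in the question) caps each YARN container, so spark.executor.memory plus the YARN memory overhead (about 10%, with a 384 MB minimum in Spark 1.5) must stay below that limit.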
