
I have two Linux machines, each with a different configuration:

Machine 1: 16 GB RAM, 4 Virtual Cores and 40 GB HDD (Master and Slave Machine)

Machine 2: 8 GB RAM, 2 Virtual Cores and 40 GB HDD (Slave machine)

I have set up a Hadoop cluster across these two machines.
I am using Machine 1 as both master and slave,
and Machine 2 as a slave.

I want to run my Spark application and utilise as many virtual cores and as much memory as possible, but I am unable to figure out the right settings.

My spark code looks something like:

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext, SQLContext, SparkSession

conf = SparkConf().setAppName("Simple Application")
sc = SparkContext('spark://master:7077')
hc = HiveContext(sc)
sqlContext = SQLContext(sc)
spark = SparkSession.builder.appName("SimpleApplication").master("yarn-cluster").getOrCreate()
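For reference, the contexts above can be collapsed into a single `SparkSession`; this is a minimal sketch of that consolidation, assuming the same standalone master URL as in my code (since Spark 2.x, `SparkSession` replaces `HiveContext` and `SQLContext`):

```python
from pyspark.sql import SparkSession

# One builder, one app name, one master. enableHiveSupport() replaces
# HiveContext; spark.sql(...) replaces SQLContext queries.
spark = (
    SparkSession.builder
    .appName("Simple Application")
    .master("spark://master:7077")   # standalone master, as in the code above
    .enableHiveSupport()
    .getOrCreate()
)
sc = spark.sparkContext              # the SparkContext is reachable from the session
```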

So far, I have tried the following:

  1. When I process my 2 GB file on Machine 1 alone (in local mode, as a single-node cluster), it uses all 4 cores of the machine and completes in about 8 minutes.

  2. When I process my 2 GB file with the cluster configuration above, it takes slightly longer than 8 minutes, though I expected it to take less time.

What number of executors, cores, and memory do I need to set to maximize usage of the cluster?
I have referred to the article below, but because my machines have different configurations, I'm not sure which parameters would fit best.

Apache Spark: The number of cores vs. the number of executors

Any help will be greatly appreciated.

  • Well, obviously you can't use more resources than the smallest node in the cluster... Also, you do not need a HiveContext and SQLContext; both are deprecated in favor of SparkSession.sql. And you set your app name twice. Pass the conf into the session builder, or only use the session builder. Commented Jan 31, 2018 at 13:28
  • Hi, I tried using --num-executors 1 --executor-cores 2, after which I could see all the virtual cores being used during processing. That seems to be the best configuration. And thanks for the correction about the appName being set twice. Commented Jan 31, 2018 at 14:08
  • Do I not need HiveContext to query a Hive table? Or do you mean I can use SparkSession.sql to query a Hive table? Commented Jan 31, 2018 at 14:10
  • It's not needed. You will need to enable Hive support, but yes spark.apache.org/docs/latest/… Commented Jan 31, 2018 at 14:18
  • You could increase the executor memory as well, if necessary. Commented Jan 31, 2018 at 14:20
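The flags mentioned in the comments are spark-submit options. A hypothetical submit command matching those settings might look like the sketch below (the script name and memory value are illustrative; note that --num-executors only applies under YARN, while standalone mode uses --total-executor-cores instead):

```shell
# Illustrative only: adjust flags to your cluster manager.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 1 \
  --executor-cores 2 \
  --executor-memory 4g \
  simple_application.py
```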

1 Answer


When I process my 2 GB file with the cluster configuration above, it takes slightly longer than 8 minutes, though I expected it to take less time.

It's not clear where your file is stored.

I see you're using Spark Standalone mode, so I'll assume the file isn't split on HDFS into about 16 blocks (given a block size of 128 MB).

In that scenario, your entire file will be processed at least once in whole, plus the overhead of shuffling that data across the network.

If you used YARN as the Spark master with HDFS as the filesystem, and a splittable file format, then the computation would go "to the data", and you could expect quicker run times.

As far as optimal settings go, there are trade-offs between cores, memory, and the number of executors, but there's no magic number for a particular workload, and you'll always be limited by the smallest node in the cluster. Keep in mind that the memory of the Spark driver and other processes on the OS should be accounted for when calculating sizes.
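To make that concrete, here's a back-of-envelope sizing sketch for the two nodes in the question. The reservation of roughly 1 core and 1 GB per node for the OS and Hadoop daemons is a common rule of thumb, not a Spark setting, so treat the numbers as a starting point:

```python
# Rough per-node sizing: subtract an assumed OS/daemon reservation
# (1 core, 1 GB -- a rule of thumb, not a Spark API) from each node.
def usable_resources(node_mem_gb, node_cores, reserve_mem_gb=1, reserve_cores=1):
    """Return (cores, mem_gb) left over for executors on one node."""
    return node_cores - reserve_cores, node_mem_gb - reserve_mem_gb

machine1 = usable_resources(16, 4)  # also hosts the driver, so budget for it too
machine2 = usable_resources(8, 2)   # the smallest node caps a uniform executor size

print(machine1)  # (3, 15)
print(machine2)  # (1, 7)
```

Since Machine 2 only has 1 usable core under this assumption, a uniform executor sized for both nodes would be small; asymmetric setups often instead run more executors on the larger node.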
