
Here is my cluster configuration:

Master nodes: 1 (16 vCPU, 64 GB memory)

Worker nodes: 2 (total of 64 vCPU, 256 GB memory)

Here is the Hive query I'm trying to run on the Spark SQL shell:

select a.*, b.name as name
from small_tbl b
join (select *
      from large_tbl
      where date = '2019-01-01') a
  on a.id = b.id;
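
For reference, the same join can also be written with an explicit broadcast hint on the small table (a sketch only; the plan below is for the original, un-hinted query):

select /*+ BROADCAST(b) */ a.*, b.name as name
from small_tbl b
join (select *
      from large_tbl
      where date = '2019-01-01') a
  on a.id = b.id;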

Here is the query execution plan as shown on the Spark UI:

[screenshot of the query execution plan from the Spark UI]

The configuration properties set while launching the shell are as follows:

spark-sql --conf spark.driver.maxResultSize=30g \
--conf spark.broadcast.compress=true \
--conf spark.rdd.compress=true \
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=304857600 \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.executor.instances=12 \
--conf spark.executor.memory=16g \
--conf spark.executor.cores=5 \
--conf spark.driver.memory=32g \
--conf spark.yarn.executor.memoryOverhead=512 \
--conf spark.executor.extraJavaOptions=-Xms20g \
--conf spark.executor.heartbeatInterval=30s \
--conf spark.shuffle.io.preferDirectBufs=true \
--conf spark.memory.fraction=0.5
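
For reference, a rough sizing check under these settings (my arithmetic only, not validated against the actual YARN container limits):

12 executors x 16 GB heap      = 192 GB
12 executors x 512 MB overhead ≈   6 GB   -> ~198 GB requested, on 256 GB of total worker memory
12 executors x 5 cores         =  60 cores, on 64 total worker vCPUs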

I have tried most of the solutions suggested here and here, which is evident in the properties set above. As far as I know, it's not a good idea to keep increasing the maxResultSize property on the driver, since datasets may grow beyond the driver's memory and the driver shouldn't be used to hold data at this scale.
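
For what it's worth, if the result doesn't need to come back to the shell, the join could be written out to a table instead, so the driver never has to hold the rows (a sketch; joined_result is a hypothetical table name):

create table joined_result as
select a.*, b.name as name
from small_tbl b
join (select *
      from large_tbl
      where date = '2019-01-01') a
  on a.id = b.id;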

I have executed the query successfully on the Tez engine, where it took around 4 minutes, whereas Spark runs for more than 15 minutes and then terminates abruptly for lack of heap space.

I strongly believe there must be a way to speed up this query on Spark. Please suggest a solution that works for this kind of query.

  • try following this answer Commented Nov 3, 2019 at 17:46
  • I'm no Java expert but I'm puzzled by the combination of executor.extraJavaOptions=-Xms20g and executor.memory=16g (plus/minus the YARN overhead, off-heap and of course stack+code size...) Commented Nov 3, 2019 at 20:40
  • Did you check for nulls in both the fields? Add `--conf "spark.sql.crossJoin.enabled=false"` to your spark-sql launch and give it a try. Commented Nov 4, 2019 at 9:11
  • @Gsquare Let me validate the suggested solutions against my query. Code issues: 1. I can't try checkpoint() since I'm on the SQL shell. 2. Broadcasting shouldn't be an issue; I tried disabling autoBroadcastJoinThreshold, which produced a sort-merge join that ended in the same error. 3. Data size per task is in the ideal range. Data issues: 1. There isn't much difference between the 25th percentile, median, and 75th percentile of stage execution time. 2. The input format is ORC, and the large table is partitioned on the date column; I tried setting spark.default.parallelism to 5, but no luck. Please let me know if there's something wrong here. Commented Nov 4, 2019 at 17:43
  • @SarathChandraVema, I tried running the query with the property you suggested, but there was no improvement; I got a GC-related error: GC overhead limit exceeded. By the way, did you mean I need to check for null values in the join keys on both tables (something like the check sketched below)? Please clarify. Commented Nov 4, 2019 at 18:00
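
Sketch of the null check on the join keys referred to in the last comment (a quick sanity query, not a fix by itself):

select count(*) from small_tbl where id is null;
select count(*) from large_tbl where date = '2019-01-01' and id is null;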
