
Here is my cluster configuration:

Master nodes: 1 (16 vCPU, 64 GB memory)

Worker nodes: 2 (total of 64 vCPU, 256 GB memory)

Here is the Hive query I'm trying to run on the Spark SQL shell:

select a.*, b.name as name
from small_tbl b
join (select *
      from large_tbl
      where date = '2019-01-01') a
  on a.id = b.id;
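
For reference, the same join can also be written with an explicit broadcast hint on the small table (a sketch only; the plan below is for the original, un-hinted query):

select /*+ BROADCAST(b) */ a.*, b.name as name
from small_tbl b
join (select *
      from large_tbl
      where date = '2019-01-01') a
  on a.id = b.id;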

Here is the query execution plan as shown on the Spark UI:

[screenshot of the query execution plan from the Spark UI]

The configuration properties set while launching the shell are as follows:

spark-sql --conf spark.driver.maxResultSize=30g \
--conf spark.broadcast.compress=true \
--conf spark.rdd.compress=true \
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=304857600 \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.executor.instances=12 \
--conf spark.executor.memory=16g \
--conf spark.executor.cores=5 \
--conf spark.driver.memory=32g \
--conf spark.yarn.executor.memoryOverhead=512 \
--conf spark.executor.extraJavaOptions=-Xms20g \
--conf spark.executor.heartbeatInterval=30s \
--conf spark.shuffle.io.preferDirectBufs=true \
--conf spark.memory.fraction=0.5
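
For reference, a rough sizing check under these settings (my arithmetic only, not validated against the actual YARN container limits):

12 executors x 16 GB heap      = 192 GB
12 executors x 512 MB overhead ≈   6 GB   -> ~198 GB requested, on 256 GB of total worker memory
12 executors x 5 cores         =  60 cores, on 64 total worker vCPUs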

I have tried most of the solutions suggested here and here, which is evident in the properties set above. As far as I know, it's not a good idea to keep increasing the maxResultSize property on the driver, since datasets may grow beyond the driver's memory and the driver shouldn't be used to hold data at this scale.
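
For what it's worth, if the result doesn't need to come back to the shell, the join could be written out to a table instead, so the driver never has to hold the rows (a sketch; joined_result is a hypothetical table name):

create table joined_result as
select a.*, b.name as name
from small_tbl b
join (select *
      from large_tbl
      where date = '2019-01-01') a
  on a.id = b.id;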

I have executed the query successfully on the Tez engine, where it took around 4 minutes, whereas Spark runs for more than 15 minutes and then terminates abruptly for lack of heap space.

I strongly believe there must be a way to speed up this query on Spark. Please suggest a solution that works for this kind of query.

  • try following this answer Commented Nov 3, 2019 at 17:46
  • I'm no Java expert but I'm puzzled by the combination of executor.extraJavaOptions=-Xms20g and executor.memory=16g (plus/minus the YARN overhead, off-heap and of course stack+code size...) Commented Nov 3, 2019 at 20:40
  • Did you check for nulls in both the fields? Add `--conf "spark.sql.crossJoin.enabled=false"` to your spark-sql launch and give it a try. Commented Nov 4, 2019 at 9:11
  • @Gsquare Let me validate the suggested solutions against my query. Code issues: 1. I can't try checkpoint() since I'm on the SQL shell. 2. Broadcasting shouldn't be an issue; I tried disabling autoBroadcastJoinThreshold, which produced a sort-merge join that ended in the same error. 3. Data size per task is in the ideal range. Data issues: 1. There isn't much difference between the 25th percentile, median, and 75th percentile of stage execution time. 2. The input format is ORC, and the large table is partitioned on the date column; I tried setting spark.default.parallelism to 5, but no luck. Please let me know if there's something wrong here. Commented Nov 4, 2019 at 17:43
  • @SarathChandraVema, I tried running the query with the property you suggested, but there was no improvement; I got a GC-related error: GC overhead limit exceeded. By the way, did you mean I need to check for null values in the join keys on both tables (something like the check sketched below)? Please clarify. Commented Nov 4, 2019 at 18:00
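
Sketch of the null check on the join keys referred to in the last comment (a quick sanity query, not a fix by itself):

select count(*) from small_tbl where id is null;
select count(*) from large_tbl where date = '2019-01-01' and id is null;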
