
I am importing 10 million records daily from MySQL to Hive using a Spark Scala program, and comparing yesterday's dataset with today's.

val yesterdayDf = sqlContext.sql("select * from t_yesterdayProducts")
val todayDf = sqlContext.sql("select * from t_todayProducts")
val diffDf = todayDf.except(yesterdayDf) // rows present today but not yesterday

I am using a 3-node cluster, and the program works fine for up to 4 million records. Beyond 4 million records we hit an out-of-memory error because the RAM is not sufficient.

I would like to know the best way to compare two large datasets.

3 Comments
  • Does the DataFrame have any kind of unique key? Commented Aug 9, 2016 at 19:42
  • Yes Thiago, the table has one unique key. Commented Aug 10, 2016 at 2:14
  • Actually, the Spark SQL except call is implicitly a call to subtract in the Spark API. If you have the key, can you try todayDf.subtractByKey(yesterdayDf)? Commented Aug 10, 2016 at 15:38

4 Answers

2

Have you tried finding out how many partitions you have? yesterdayDf.rdd.partitions.size will give you that information for the yesterdayDf DataFrame, and you can do the same for the other DataFrames too.

You can also use yesterdayDf.repartition(1000) (a large number) to see if the OOM problem goes away.
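A minimal sketch of how those two suggestions fit together (the partition count of 1000 is only an illustrative starting point; the table names are taken from the question):

// Load both snapshots (table names from the question)
val yesterdayDf = sqlContext.sql("select * from t_yesterdayProducts")
val todayDf = sqlContext.sql("select * from t_todayProducts")

// Inspect how many partitions each DataFrame currently has
println(s"yesterday partitions: ${yesterdayDf.rdd.partitions.size}")
println(s"today partitions: ${todayDf.rdd.partitions.size}")

// Repartition to spread the work over more, smaller tasks before the except
val diffDf = todayDf.repartition(1000).except(yesterdayDf.repartition(1000))

Note that the shuffle performed by except itself is governed by spark.sql.shuffle.partitions, which may also be worth raising.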


6 Comments

Thanks for your quick response. I tried it this way: val yesterdayDf = sqlContext.sql("select * from t_yesterdayProducts"); val yesterdayPartDf = yesterdayDf.repartition(1000); val todayDf = sqlContext.sql("select * from t_todayProducts"); val todayPartDf = todayDf.repartition(1000); val diffDf = todayPartDf.except(yesterdayPartDf); but after 993 tasks of the first DataFrame completed, it throws the OOM error.
@ArvindKumarAnugula try 1200 or 1500 instead of 1000. Also specify --executor-memory 32G or higher
The exact memory error is "GC overhead limit exceeded".
Each node has only 16 GB of RAM. Can I use --executor-memory 16G?
You can see the config I am passing here: --master yarn-client --executor-memory 16G --num-executors 12 --executor-cores 4 --driver-memory 8G
0

The reason for this issue is hard to say, but it could be that the workers are pulling too much data. Try slimming down the DataFrames before doing the except. In answer to my question in the comments, you said you have a key column, so select only that column, like this:

val yesterdayDfKey = yesterdayDf.select("key-column")
val todayDfKey = todayDf.select("key-column")
val diffDf = todayDfKey.except(yesterdayDfKey)

That will give you a DataFrame containing only the differing keys. Then you can use a join to filter the full rows for those keys, as in this post.
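A sketch of that approach end to end, joining the differing keys back to today's rows ("key-column" remains a placeholder for the real unique key column):

// Keep only the key column from each snapshot ("key-column" is a placeholder)
val yesterdayDfKey = yesterdayDf.select("key-column")
val todayDfKey = todayDf.select("key-column")

// Keys present today but not yesterday; a much smaller DataFrame to shuffle
val newKeys = todayDfKey.except(yesterdayDfKey)

// Join back to today's data to recover the full rows for those keys
val diffDf = todayDf.join(newKeys, "key-column")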


0

You also need to make sure your yarn.nodemanager.resource.memory-mb is larger than your --executor-memory.
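For reference, that property is set in yarn-site.xml on each node; the value below is only illustrative for the 16 GB nodes mentioned in the comments, and the executor memory plus the YARN memory overhead must fit under it:

<!-- yarn-site.xml (illustrative value for a 16 GB node) -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>14336</value>
</property>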

1 Comment

This kind of suggestion should be kept in the comments
0

You can also try joining the two DataFrames on the keys with a left_anti join, and then check the count of records, as in the sketch below.
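A minimal sketch, assuming Spark 2.x (where "left_anti" is a supported join type) and using "key-column" as a placeholder for the real unique key:

// Rows in today's snapshot whose key has no match in yesterday's snapshot
val diffDf = todayDf.join(yesterdayDf, Seq("key-column"), "left_anti")

// Number of new records
println(diffDf.count())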

