I am importing 10 million records daily from MySQL into Hive using a Spark Scala program, and comparing yesterday's dataset with today's:
val yesterdayDf = sqlContext.sql("select * from t_yesterdayProducts")
val todayDf = sqlContext.sql("select * from t_todayProducts")
val diffDf = todayDf.except(yesterdayDf)
I am running this on a 3-node cluster, and the program works fine for up to 4 million records. Beyond 4 million records we hit an out-of-memory error because the available RAM is not sufficient.
I would like to know the best way to compare two large datasets like this.
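One variant I have been considering (a sketch only, not verified at full scale): since `except` shuffles and compares entire rows, it may help to hash the non-key columns and diff on (key, hash) pairs instead, so the shuffle carries far less data. This assumes each row has a unique key column; `product_id` below is an assumed name, and the anti/semi join types require Spark 2.x:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, concat_ws, sha2}

// Reduce each row to (key, hash-of-remaining-columns) so the shuffle
// moves small pairs instead of full rows. "product_id" is an assumed key.
def withRowHash(df: DataFrame): DataFrame = {
  val dataCols = df.columns.filter(_ != "product_id").sorted.map(col)
  df.select(
    col("product_id"),
    sha2(concat_ws("||", dataCols: _*), 256).as("row_hash")
  )
}

val yesterdayHashed = withRowHash(yesterdayDf)
val todayHashed     = withRowHash(todayDf)

// Keys that are new or changed today: anti join on (key, hash)
val changedKeys = todayHashed
  .join(yesterdayHashed, Seq("product_id", "row_hash"), "left_anti")
  .select("product_id")

// Pull back the full changed rows from today's data with a semi join
val diffDf = todayDf.join(changedKeys, Seq("product_id"), "left_semi")
```

If this is still memory-bound, raising `spark.sql.shuffle.partitions` so each shuffle partition fits in executor memory might also be worth trying, but I am unsure whether that alone would be enough.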