We have a use case where, in a Spark job, we:
- Iterate over the partitions of an external table
- Load the data for that partition (each partition holds roughly the same data volume)
- Transform the data (self joins, no UDFs)
- Save the result to an external location
- Repeat for the next partition
We do not cache any DataFrames in the process.
We are seeing a steady increase in memory usage after each iteration, and after some time the job exits with an out-of-heap-memory error.
I am looking for pointers that might help in debugging this issue.
The code looks very similar to this:
while (date < endDate) {
  val df = spark.sql(s"SELECT * FROM tbl JOIN tbl2 ON tbl.date = '${date}' AND tbl2.date = '${date}'")
  df.write.partitionBy("date").mode("overwrite").parquet("s3://bucket/path")
  date = increment(date)
}