I have a Spark application that reads a file with 100 million lines (each line contains a code, such as US1.234.567B1) and extracts the patterns that occur in it, as follows:
import org.apache.spark.rdd.RDD

val codes = sc.textFile("/data/codes.txt")

// Replace every digit with the letter 'd', e.g. "US1.234.567B1" -> "USd.ddd.dddBd"
def getPattern(code: String): String = code.replaceAll("\\d", "d")

// Group the codes by pattern, count each group, and order by descending count
val patterns: RDD[(String, Int)] = codes
  .groupBy(getPattern)
  .mapValues(_.size)
  .sortBy(-_._2)

// Write one "<count>\t<pattern>" line per pattern
patterns
  .map { case (pattern, size) => s"$size\t$pattern" }
  .saveAsTextFile("/tmp/patterns")
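For reference, sc above comes from a plain local context created roughly like this (a minimal sketch; the app name is just a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

// Minimal local setup assumed by the snippet above ("PatternCount" is a placeholder name)
val conf = new SparkConf()
  .setAppName("PatternCount")
  .setMaster("local[*]")  // use as many worker threads as there are cores
val sc = new SparkContext(conf)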
I am running this on master=local[*], and it fails with java.lang.OutOfMemoryError: GC overhead limit exceeded.
Why is that?
I thought that Spark could handle input of any size, as long as it had enough hard disk space.