I have a Spark application that reads a file with 100 million lines (each line has a code, such as US1.234.567B1) and extracts some patterns from it, as follows:

  import org.apache.spark.rdd.RDD

  val codes = sc.textFile("/data/codes.txt")

  def getPattern(code: String) = code.replaceAll("\\d", "d")

  val patterns: RDD[(String, Int)] = codes
    .groupBy(getPattern)
    .mapValues(_.size)
    .sortBy(- _._2)

  patterns
    .map { case (pattern, size) => s"$size\t$pattern" }
    .saveAsTextFile("/tmp/patterns")

I am running this on master=local[*], and it fails with java.lang.OutOfMemoryError: GC overhead limit exceeded.

Why is that?

I thought that Spark can handle any size of input, as long as it has enough hard disk space.

2 Answers

Long story short, you're using a Spark anti-pattern:

.groupBy(getPattern)
.mapValues(_.size)

which can easily be expressed, for example, as:

codes.keyBy(getPattern).mapValues(_ => 1L).reduceByKey(_ + _).sortBy(_._2, false)

I thought that Spark can handle any size of input.

It usually can scale out, as long as you don't make that impossible. group / groupByKey on RDDs builds a local collection of values for each key, and each of those collections has to fit in the memory of a single executor.
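
For completeness, here is roughly how that fix could be wired into the original job (a sketch reusing the paths and the getPattern helper from the question):

  import org.apache.spark.rdd.RDD

  val codes = sc.textFile("/data/codes.txt")

  def getPattern(code: String) = code.replaceAll("\\d", "d")

  // reduceByKey combines partial counts map-side, so no per-key
  // collection of values is ever materialized in memory.
  val patterns: RDD[(String, Long)] = codes
    .keyBy(getPattern)
    .mapValues(_ => 1L)
    .reduceByKey(_ + _)
    .sortBy(_._2, ascending = false)

  patterns
    .map { case (pattern, count) => s"$count\t$pattern" }
    .saveAsTextFile("/tmp/patterns")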

Yes, Spark can process very large files, but the unit of parallelism is the executor. The "out of memory" error occurs when the Spark executor memory or the Spark driver memory is insufficient. Try increasing spark.executor.memory and spark.driver.memory, and also tune the number of executors, before you submit the job.

You can set these values in a properties file, in SparkConf, or directly on the command line during spark-submit. See http://spark.apache.org/docs/latest/configuration.html
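
In code it could look roughly like this (a sketch: the app name, the 2g/4g values, and app.jar are placeholders to tune for your data and machine, and driver memory in particular usually has to be set before the driver JVM starts, e.g. via spark-submit):

  import org.apache.spark.{SparkConf, SparkContext}

  // Placeholder memory settings; adjust them for your workload.
  val conf = new SparkConf()
    .setAppName("patterns")
    .set("spark.executor.memory", "2g")
    // spark.driver.memory set here only takes effect if the driver JVM
    // has not started yet; otherwise pass it to spark-submit instead.
    .set("spark.driver.memory", "4g")

  val sc = new SparkContext(conf)

  // Equivalent command-line flags during spark-submit, for example:
  //   spark-submit --driver-memory 4g --executor-memory 2g --num-executors 4 app.jar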
