5

Consider the following Set benchmark:

import scala.collection.immutable._

object SetTest extends App {
  def time[a](f: => a): (a,Double) = {
    val start = System.nanoTime()
    val result: a = f
    val end = System.nanoTime()
    (result, 1e-9*(end-start))
  }

  for (n <- List(1000000,10000000)) {
    println("n = %d".format(n))
    val (s2,t2) = time((Set() ++ (1 to n)).sum)
    println("sum %d, time %g".format(s2,t2))
  }
}

Compiling and running produces

tile:scalafab% scala SetTest
n = 1000000
sum 1784293664, time 0.982045
n = 10000000
Exception in thread "Poller SunPKCS11-Darwin" java.lang.OutOfMemoryError: Java heap space
...

I.e., Scala is unable to represent a set of 10 million Ints on a machine with 8 GB of memory. Is this expected behavior? Is there some way to reduce the memory footprint?

6
  • 3
    If I remember right, by default, Java uses much less than the total available memory. Commented Aug 20, 2011 at 18:58
  • 1
    The Int values are all being boxed there I believe. @specialized to the rescue? Also could use fold to count and only use stream vs. forcing entire evaluation of 1 to n sequence (required in the ++), but that is different semantics in some cases (not here) :) Commented Aug 20, 2011 at 19:02
  • ++ should be using Range.foreach, so there should be no sequence expansion going on. How would one use @specialized here? Commented Aug 20, 2011 at 19:07
  • specialized is no use here unless you write your own set library. Commented Aug 20, 2011 at 19:10
  • 1
    @Geoffrey Irving How can the elements live in the new set without expansion? Commented Aug 20, 2011 at 19:10

3 Answers 3

10

Generic immutable sets do take a lot of memory. The default is for only 256M heap, which leaves only 26 bytes per object. The hash trie for immutable sets generally takes one to two hundred bytes per object an extra 60 or so bytes per element. If you add -J-Xmx2G on your command line to increase the heap space to 2G, you should be fine.

(This level of overhead is one reason why there are bitsets, for example.)

Sign up to request clarification or add additional context in comments.

9 Comments

One of the reasons I like functional programming is that it demonstrates that immutable collections can be faster and more memory efficient, if designed to take advantage of their immutability (eg. by sharing memory)
@Ken Wayne VanderLinde - Sure. But sometimes you just want one big un-shared set, so you either need more memory, or a reasonably implemented HashSet.
Hash tries take O(n log c) time, but only O(n) space, where c is either the size of a machine word or n (I'm not sure what arity the trie is).
To confirm: c is the size of a machine word.
@Geoffrey Irving - Oops, you're right; it is a hash trie in the trie-in-a-hash-table-like-structure sense, not in the i-have-hash-values-in-a-trie sense. Fixed.
|
3

I'm not that familiar with Scala, but here's what I think is happening:

First off, the integers are being stored on the heap (as the must be, since the data structure is stored on the heap). So we are talking about available heap memory, not stack memory at all (just to clarify the validity of what I'm about to say next).

The real kicker is that Java's default heap size is pretty small - I believe its only 128 megabytes (this is probably an really old number, but the point is that the number exists, and it's quite small).

So it's not that your program uses too much memory - it's more like Java just doesn't give you enough in the first place. There is a solution, though: the minimum and maximum heap sizes can be set with the -Xms and -Xmx command line options. They can be used like:

java -Xms32m -Xmx128m MyClass   (starts MyClass with a minimum heap of 32 megabytes, maximum of 128 megabytes)

java -Xms1g -Xmx3g MyClass (executes MyClass with a minimum heap of 1 gigabytes, maximum of 3 gigabytes)

If you use an IDE, there are probably options in there to change the heap size as well.

Comments

-1

This should always overflow. Holding such large values is not required in this case. If you want to sum use an iterator or a range.

val (s2,t2) = time( (1 to n).sum)

The above line completes in a second with no overflow.

You can always increase memory allocation using other answers.

1 Comment

I think the OP is deliberately creating big Sets, just to test their performance. If he cared about efficiency, he could write val (s2,t2) = time( (n * (n+1)) / 2)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.