Scala: IndexedSeq.newBuilder vs. ArrayBuffer

Question

I had accepted that building an IndexedSeq in a loop should use an ArrayBuffer, followed by a conversion to a Vector via ".toVector()".

In an example profiled showed the CPU hotspot was in this section, and so I tried an alternative: use IndexedSeq.newBuilder() followed by conversion to immutable via ".result()".

This change gave a significance performance improvement. The code looks almost the same. So it seems using IndexedSeq.newBuilder() is best practice. Is this correct? The example method is shown below, with the ArrayBuffer difference commented out.

def interleave[T](a: IndexedSeq[T], b: IndexedSeq[T]): IndexedSeq[T] = {

  val al = a.length
  val bl = b.length

  val buffer = IndexedSeq.newBuilder[T]
  //---> val buffer = new ArrayBuffer[T](al + bl)
  val commonLength = Math.min(al, bl)
  val aExtra = al - commonLength
  val bExtra = bl - commonLength

  var i = 0
  while (i < commonLength) {
    buffer += a(i)
    buffer += b(i)
    i += 1
  }

  if (aExtra > 0) {
    while (i < al) {
      buffer += a(i)
      i += 1
    }
  } else if (bExtra > 0) {
    while (i < bl) {
      buffer += b(i)
      i += 1
    }
  }

  buffer.result()
  //---> buffer.toVector()
}

Why not just return ArrayBuffer itself? It implements IndexedSeq. — Alexey Romanov
– Alexey Romanov, Commented Oct 29, 2017 at 21:21
@AlexeyRomanov ArrayBuffer implements a mutable IndexedSeq, OP seems to want to return an immutable version. — Mike Allen
– Mike Allen, Commented Oct 30, 2017 at 2:09
If you return scala.collection.IndexedSeq, you still can't mutate it (without casting or pattern-matching). It's less safe in that adversarial or just badly written code can cast it, but this is often not an important concern. — Alexey Romanov
– Alexey Romanov, Commented Oct 30, 2017 at 4:41
Did my answer address your question? If not, let me know what I'm missing. Thanks1 — Mike Allen
– Mike Allen, Commented Nov 2, 2017 at 13:14
Thanks for the prompting for better metrics. I spent quite a bit of time fiddling with ScalaMeter to get consistent results -- getting accurate performance numbers is not trivial. Returning the ArrayBuffer directly does give you the best performance, which would be best practice in some circumstances -- certainly keep the ArrayBuffer until is handed off to code outside of your control. — Mike Hanafey
– Mike Hanafey, Commented Nov 5, 2017 at 15:39

Mike Allen · Accepted Answer · 2017-10-30 02:37:41Z

3

As to which is best practice, I guess it depends upon your requirements. Both approaches are acceptable and understandable. All things being equal, in this particular case, I would favor the IndexedSeq.newBuilder over ArrayBuilder (since the latter targets the creation of an Array, while the former's result is a Vector).

Just one point on benchmarking: this is a real art form, due to caching, JIT & HotSpot performance, garbage collection, etc. One piece of software you might consider using to do this is ScalaMeter. You will need to write both versions of the function to populate the final vector, and ScalaMeter will give you accurate statistics on both. ScalaMeter allows the code to warm-up before taking measurements, and can also look at memory requirements as well as CPU time.

edited Oct 30, 2017 at 2:37

answered Oct 29, 2017 at 18:19

Mike Allen

8,3773 gold badges26 silver badges46 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Mike Hanafey · Accepted Answer · 2017-11-12 15:41:18Z

In this example informal testing did not deceive, but ScalaMeter does provide a clearer picture of performance. Building the result in an ArrayBuffer (top orange line) is definitely slower than the more direct newBuilder (blue line).

Returning the ArrayBuffer as an IndexedSeq is the fastest (green line), but of course it does not give you the true protection of an immutable collection.

Building the intermediate result in an Array (red line) is intermediate between ArrayBuffer and newBuilder.

The "zipAll" collection method allows the interleave to be done in a more functional style:

  def interleaveZipAllBuilderPat[T](a: IndexedSeq[T], b: IndexedSeq[T]): IndexedSeq[T] = {
    a.zipAll(b, null, null).foldLeft(Vector.newBuilder[T]) { (z, tp) =>
      tp match {
        case ((x:T, null)) => z += x
        case ((x:T,y:T)) => z += x += y
      }
    }.result()
  }

The slowest are the functional method, with the top two almost the same and these differ only in that one does the pattern match, and the other an if statement, so the pattern is not slow.

Functional is marginally worse than the direct loop method if an ArrayBuffer is used to accumulate the result, but the direct loop using the newBuilder is significantly faster.

If "zipAll" could return a builder, and if the builder were iterable, the functional style could be faster - no need to produce the immutable result if the next step just requires an iteration over elements.

So for me newBuilder is the clear winner.

Collectives™ on Stack Overflow

Scala: IndexedSeq.newBuilder vs. ArrayBuffer

2 Answers 2

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related