The problem is boxing. It takes a lot longer to create an object than to add two numbers, but if you use generic (non-specialized) folds, you have to create an object every time. The problem with just specializing everything is that you make the entire library 100x larger since you need every combination of two primitive parameters (including with non-primitives), plus the original no-type-parameter version. (100x because there are 8 primitives plus Unit plus AnyRef/non-specialized T.) This is untenable, and since there is no readily available alternate solution, the collections are presently unspecialized.
Also, specialization itself is relatively new and thus still has some deficits in its implementation. In particular, you seem to have hit one with SpecializedIterator: the function in foreach doesn't end up specialized (I collapsed the trait/object thing into a single class to make it easier to track down):
public class Main$SpecializedArrayIterator$mcJ$sp extends Main$SpecializedArrayIterator{
public final void foreach$mcJ$sp(scala.Function1);
Code:
0: aload_0
1: invokevirtual #39; //Method Main$SpecializedArrayIterator.hasNext:()Z
4: ifeq 24
7: aload_1
8: aload_0
9: invokevirtual #14; //Method next$mcJ$sp:()J
12: invokestatic #45; //Method scala/runtime/BoxesRunTime.boxToLong:(J)Ljava/lang/Long;
15: invokeinterface #51, 2; //InterfaceMethod scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;
20: pop
21: goto 0
24: return
See the box at line 12, followed by a call to un-specialized Function1? Oops. (The tuple (A, (A,A) => A) used in sum also messes up specialization.) An implementation like this is full speed:
class SpecializedArrayIterator[@specialized A](src: Array[A]) {
var i = 0
val l = src.length
@inline final def hasNext: Boolean = i < l
@inline final def next(): A = { val res = src(i); i += 1; res }
@inline final def foldLeft[@specialized B](z: B)(op: (B, A) => B): B = {
var result = z
while (hasNext) result = op(result,next)
result
}
}
...
measure((new SpecializedArrayIterator[Long](test)).foldLeft(0L)(_ + _))
...
With results like so:
Launched 51298 times in 2000 milliseconds, ratio = 25.649 // New impl
Launched 51614 times in 2000 milliseconds, ratio = 25.807 // While loop
SpecializedIteratorclass in the example? The difference is from JIT being unable to inline code? In that case, can it be improved or using any kind ofIterators inherently sacrifices much speed?