Let's say you are doing some computation over a large set of large float vectors, e.g. calculating the average of each:
    public static float avg(float[] data, int offset, int length) {
        float sum = 0;
        for (int i = offset; i < offset + length; i++) {
            sum += data[i];
        }
        return sum / length;
    }
If you have all your vectors stored in an in-memory float[], you can implement the loop like this:
    float[] data; // <-- vectors here, stored back to back
    float sum = 0;
    for (int i = 0; i < nVectors; i++) {
        // vector i occupies data[i * vectorSize] .. data[(i + 1) * vectorSize - 1]
        sum += avg(data, i * vectorSize, vectorSize);
    }
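For anyone who wants to run the in-memory baseline on its own, here is a self-contained sketch of the loop above. The sizes and the random fill are hypothetical and chosen only for illustration; the result is printed so that the JIT cannot discard the loop as dead code.

```java
import java.util.Random;

// In-memory baseline: all vectors live back to back in one float[],
// and vector i starts at index i * vectorSize.
class InMemoryAvg {

    // Same averaging routine as in the question.
    static float avg(float[] data, int offset, int length) {
        float sum = 0;
        for (int i = offset; i < offset + length; i++) {
            sum += data[i];
        }
        return sum / length;
    }

    // Sums the averages of nVectors consecutive vectors of vectorSize floats.
    static float sumOfAverages(float[] data, int nVectors, int vectorSize) {
        float sum = 0;
        for (int i = 0; i < nVectors; i++) {
            sum += avg(data, i * vectorSize, vectorSize);
        }
        return sum;
    }

    public static void main(String[] args) {
        int nVectors = 1000;   // hypothetical sizes
        int vectorSize = 100;
        float[] data = new float[nVectors * vectorSize];
        Random random = new Random(42);
        for (int i = 0; i < data.length; i++) {
            data[i] = random.nextFloat();
        }
        // Printing the result keeps the computation observable.
        System.out.println(sumOfAverages(data, nVectors, vectorSize));
    }
}
```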
If the vectors are stored in a file instead, memory-mapping it should, in theory, be as fast as the first solution once the OS has cached the whole thing:
    RandomAccessFile file; // <-- vectors here, opened in "rw" mode (required for READ_WRITE)
    MappedByteBuffer buffer = file.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, 4L * nVectors * vectorSize);
    FloatBuffer floatBuffer = buffer.asFloatBuffer();
    buffer.load(); // <-- asks the OS to load the whole mapping into physical memory (best effort)
    float[] vector = new float[vectorSize];
    float sum = 0;
    for (int i = 0; i < nVectors; i++) {
        floatBuffer.get(vector); // bulk get: copies the next vectorSize floats into vector
        sum += avg(vector, 0, vector.length);
    }
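To reproduce the memory-mapped variant end to end, here is a self-contained sketch. It assumes Java 7 or later, writes the test data through a hypothetical writeVectors helper into a temporary file, and maps the file READ_ONLY because the loop only reads (the question maps it READ_WRITE). As in the question, floatBuffer.get(vector) copies each vector into a float[] before it is averaged.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.FloatBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

class MmapBulkAvg {

    static float avg(float[] data, int offset, int length) {
        float sum = 0;
        for (int i = offset; i < offset + length; i++) {
            sum += data[i];
        }
        return sum / length;
    }

    // Hypothetical helper: writes the floats to a temp file in big-endian
    // order, which is ByteBuffer's default (and the mapped buffer's default).
    static Path writeVectors(float[] data) throws IOException {
        Path path = Files.createTempFile("vectors", ".bin");
        ByteBuffer bytes = ByteBuffer.allocate(4 * data.length);
        bytes.asFloatBuffer().put(data); // the view writes into bytes' backing array
        Files.write(path, bytes.array());
        return path;
    }

    // Maps the file and copies each vector out with the bulk get.
    static float sumOfAverages(Path path, int nVectors, int vectorSize) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path.toFile(), "r")) {
            MappedByteBuffer buffer = file.getChannel()
                    .map(FileChannel.MapMode.READ_ONLY, 0, 4L * nVectors * vectorSize);
            buffer.load(); // best-effort hint to bring the pages into RAM
            FloatBuffer floatBuffer = buffer.asFloatBuffer();
            float[] vector = new float[vectorSize];
            float sum = 0;
            for (int i = 0; i < nVectors; i++) {
                floatBuffer.get(vector); // copies vectorSize floats, advancing the position
                sum += avg(vector, 0, vector.length);
            }
            return sum;
        }
    }

    public static void main(String[] args) throws IOException {
        int nVectors = 1000;   // hypothetical sizes
        int vectorSize = 100;
        float[] data = new float[nVectors * vectorSize];
        for (int i = 0; i < data.length; i++) {
            data[i] = i % 7;
        }
        Path path = writeVectors(data);
        path.toFile().deleteOnExit();
        System.out.println(sumOfAverages(path, nVectors, vectorSize));
    }
}
```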
However, my tests show that the memory-mapped version is ~5 times slower than the in-memory one. I know that FloatBuffer.get(float[]) is copying memory, and I guess that's the reason for the slowdown. Can it get any faster? Is there a way to avoid any memory copying at all and just get my data from the OS' buffer?
I've uploaded my full benchmark to this gist; if you want to try it, just run:

    $ java -Xmx1024m ArrayVsMMap 100 100000 100
Edit:
In the end, the best I have been able to get out of a MappedByteBuffer in this scenario is still slower than using a regular float[] by ~35%. The tricks so far are:
- use the native byte order to avoid conversion: buffer.order(ByteOrder.nativeOrder()), set before creating the view
- wrap the MappedByteBuffer with a FloatBuffer using buffer.asFloatBuffer()
- use the simple floatBuffer.get(int index) instead of the bulk version, which avoids memory copying.
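The three tricks above can be combined roughly as follows. This is a sketch, not the benchmark from the gist: it assumes Java 7 or later, a temporary file written by a hypothetical writeVectors helper in the platform's native byte order, and a READ_ONLY mapping. The order is set on the MappedByteBuffer before asFloatBuffer() is called, because a view buffer takes the byte order its parent has at the moment the view is created.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

class MmapAbsoluteAvg {

    // Averages `length` floats starting at `offset` with the absolute get(int),
    // reading from the buffer directly instead of copying into a float[] first.
    static float avg(FloatBuffer data, int offset, int length) {
        float sum = 0;
        for (int i = offset; i < offset + length; i++) {
            sum += data.get(i);
        }
        return sum / length;
    }

    // Hypothetical helper: writes the floats in native byte order, so reading
    // them back through a native-order view needs no byte swapping.
    static Path writeVectors(float[] data) throws IOException {
        Path path = Files.createTempFile("vectors", ".bin");
        ByteBuffer bytes = ByteBuffer.allocate(4 * data.length).order(ByteOrder.nativeOrder());
        bytes.asFloatBuffer().put(data); // the view inherits bytes' native order
        Files.write(path, bytes.array());
        return path;
    }

    static float sumOfAverages(Path path, int nVectors, int vectorSize) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path.toFile(), "r")) {
            MappedByteBuffer buffer = file.getChannel()
                    .map(FileChannel.MapMode.READ_ONLY, 0, 4L * nVectors * vectorSize);
            buffer.order(ByteOrder.nativeOrder()); // trick 1: must precede asFloatBuffer()
            buffer.load();                         // best-effort hint to bring pages into RAM
            FloatBuffer floatBuffer = buffer.asFloatBuffer(); // trick 2
            float sum = 0;
            for (int i = 0; i < nVectors; i++) {
                sum += avg(floatBuffer, i * vectorSize, vectorSize); // trick 3
            }
            return sum;
        }
    }

    public static void main(String[] args) throws IOException {
        int nVectors = 1000;   // hypothetical sizes
        int vectorSize = 100;
        float[] data = new float[nVectors * vectorSize];
        for (int i = 0; i < data.length; i++) {
            data[i] = i % 7;
        }
        Path path = writeVectors(data);
        path.toFile().deleteOnExit();
        System.out.println(sumOfAverages(path, nVectors, vectorSize));
    }
}
```

The position of floatBuffer never moves here, since only absolute reads are used; each vector is addressed by its index i * vectorSize, just like the float[] version.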
You can see the new benchmark and results at this gist.
A slowdown of 1.35 is much better than one of 5, but it's still far from 1. I'm probably still missing something, or else it's something in the JVM that should be improved.
Output of java -version:

    java version "1.6.0_33"
    Java(TM) SE Runtime Environment (build 1.6.0_33-b03-424-11M3720)
    Java HotSpot(TM) 64-Bit Server VM (build 20.8-b03-424, mixed mode)

The command line is:

    java -Xmx1024m ArrayVsMMap 100 100000 100

res+=avg(...); System.out.println(res)) otherwise it's a clear NOP. I'd advise you to read some articles on microbenchmarks and avoid microbenchmarks overall unless you know how the JVM optimizes - including constant folding, loop unrolling, callsites optimizations (and deoptimizations), etc etc...
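The advice above (accumulate the result into something like res and print it, otherwise the JIT may treat the work as a no-op) can be illustrated with a very rough timing sketch. It assumes Java 8 or later; FloatTask and timeAndConsume are hypothetical names, and this is only a minimal illustration of warming up and consuming results, not a replacement for a proper benchmark harness.

```java
// Rough timing sketch: warm up untimed, then time, and fold every
// result into `res` so the work stays observable to the JIT.
class TimedRun {

    // Hypothetical functional interface for a workload returning a float.
    interface FloatTask {
        float run();
    }

    // Runs the task `warmups` times untimed, then `runs` times timed.
    // The caller must use the returned value (e.g. print it).
    static float timeAndConsume(String label, FloatTask task, int warmups, int runs) {
        float res = 0;
        for (int i = 0; i < warmups; i++) {
            res += task.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            res += task.run();
        }
        long elapsed = System.nanoTime() - start;
        System.out.println(label + ": " + (elapsed / runs) + " ns/run");
        return res;
    }

    public static void main(String[] args) {
        final float[] data = new float[100_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i % 10;
        }
        FloatTask sumTask = () -> {
            float sum = 0;
            for (float f : data) {
                sum += f;
            }
            return sum;
        };
        float res = timeAndConsume("array sum", sumTask, 1_000, 1_000);
        // Printing res keeps the measured computation from being eliminated.
        System.out.println(res);
    }
}
```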