
Let's say you are doing some computation over a large set of large float vectors, e.g. calculating the average of each:

public static float avg(float[] data, int offset, int length) {
  float sum = 0;
  for (int i = offset; i < offset + length; i++) {
    sum += data[i];
  }
  return sum / length;
}

If you have all your vectors stored in an in-memory float[], you can implement the loop like this:

float[] data; // <-- vectors here
float sum = 0;
for (int i = 0; i < nVectors; i++) {
  sum += avg(data, i * vectorSize, vectorSize);
}

If the vectors are stored in a file instead, memory-mapping it should, in theory, be as fast as the first solution once the OS has cached the whole thing:

RandomAccessFile file; // <-- vectors here
MappedByteBuffer buffer = file.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, 4 * data.length);
FloatBuffer floatBuffer = buffer.asFloatBuffer();
buffer.load(); // <-- this forces the OS to cache the file

float[] vector = new float[vectorSize];
float sum = 0;
for (int i = 0; i < nVectors; i++) {
  floatBuffer.get(vector);
  sum += avg(vector, 0, vector.length);
}

However, my tests show that the memory-mapped version is ~5 times slower than the in-memory one. I know that FloatBuffer.get(float[]) is copying memory, and I guess that's the reason for the slowdown. Can it get any faster? Is there a way to avoid any memory copying at all and just get my data from the OS' buffer?

I've uploaded my full benchmark to this gist; if you want to try it, just run:

$ java -Xmx1024m ArrayVsMMap 100 100000 100

Edit:

In the end, the best I have been able to get out of a MappedByteBuffer in this scenario is still slower than using a regular float[] by ~35%. The tricks so far are:

  • use the native byte order to avoid conversion: buffer.order(ByteOrder.nativeOrder())
  • wrap the MappedByteBuffer with a FloatBuffer using buffer.asFloatBuffer()
  • use the simple floatBuffer.get(int index) instead of the bulk version; this avoids the memory copy (all three tricks are combined in the sketch below).
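
Roughly, the combined approach looks like this (a sketch, not the exact benchmark code; the file name "vectors.bin" is a placeholder, and nVectors/vectorSize are the same variables as in the snippets above):

RandomAccessFile file = new RandomAccessFile("vectors.bin", "r");
MappedByteBuffer buffer = file.getChannel()
    .map(FileChannel.MapMode.READ_ONLY, 0, file.length());
buffer.order(ByteOrder.nativeOrder()); // native order: no byte-swapping on reads
FloatBuffer floatBuffer = buffer.asFloatBuffer(); // the view inherits the byte order set above

float sum = 0;
for (int i = 0; i < nVectors; i++) {
  int base = i * vectorSize;
  float vectorSum = 0;
  for (int j = 0; j < vectorSize; j++) {
    vectorSum += floatBuffer.get(base + j); // indexed get: no bulk copy into a float[]
  }
  sum += vectorSum / vectorSize;
}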

You can see the new benchmark and results at this gist.

A slowdown of 1.35 is much better than one of 5, but it's still far from 1. I'm probably still missing something, or else it's something in the JVM that should be improved.

  • What JVM? What command-line parameters? Commented Aug 26, 2012 at 18:10
  • I'm using the default Java on Mac OS X 10.7.4: java version "1.6.0_33", Java(TM) SE Runtime Environment (build 1.6.0_33-b03-424-11M3720), Java HotSpot(TM) 64-Bit Server VM (build 20.8-b03-424, mixed mode). The command line is: java -Xmx1024m ArrayVsMMap 100 100000 100 Commented Aug 26, 2012 at 20:19
  • Your avg returns a result and that's ok; HOWEVER, you have to use the result (like res += avg(...); System.out.println(res)), otherwise it's a clear NOP. I'd advise you to read some articles on microbenchmarks, and to avoid microbenchmarks altogether unless you know how the JVM optimizes: constant folding, loop unrolling, callsite optimizations (and deoptimizations), etc. Commented Aug 26, 2012 at 23:06
  • The second part w/ the ByteBuffer cannot be optimized away since it relies on unknown data (outside the JVM heap) and also modifies the internal state of the ByteBuffer. Commented Aug 26, 2012 at 23:10
  • As a 3rd remark, if you use mapped files/direct buffers, just use them as a ByteBuffer/FloatBuffer and avoid the memory copy. Commented Aug 26, 2012 at 23:13

2 Answers


Your array-based time is ridiculously fast! I get 0.0002 nanoseconds per float. The JVM is probably optimizing the loop away.

This is the problem:

    void iterate() {
        for (int i = 0; i < nVectors; i++) {
            calc(data, i * vectorSize, vectorSize);
        }
    }

The JVM realizes that calc has no side effects, so iterate doesn't either, so the whole thing can be replaced with a NOP. A simple fix is to accumulate the results from calc and return them. You also need to do the same with the results of iterate in the timing loop, and print the final value. That prevents the optimizer from deleting all of your code.
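
For example, something along these lines (a sketch: calc, data, nVectors and vectorSize are from your code, while the reps count and the test variable in the timing loop are just assumed names):

    float iterate() {
      float sum = 0;
      for (int i = 0; i < nVectors; i++) {
        sum += calc(data, i * vectorSize, vectorSize);
      }
      return sum; // returning the accumulated value keeps calc() live
    }

    // ... and in the timing loop, consume and print the result so the JIT
    // cannot discard the whole run:
    float total = 0;
    for (int rep = 0; rep < reps; rep++) {
      total += test.iterate();
    }
    System.out.println(total);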

Edit:

This looks like it is probably just overhead on the Java side, nothing to do with memory mapping itself, just the interface to it. Try the following test, which just wraps a FloatBuffer around a ByteBuffer around a byte[]:

  private static final class ArrayByteBufferTest extends IterationTest {
    private final FloatBuffer floatBuffer;
    private final int vectorSize;
    private final int nVectors;

    ArrayByteBufferTest(float[] data, int vectorSize, int nVectors) {
      // Copy the float[] into a heap ByteBuffer so reads go through the
      // same Buffer interface as the memory-mapped version.
      ByteBuffer bb = ByteBuffer.wrap(new byte[data.length * 4]);
      for (int i = 0; i < data.length; i++) {
        bb.putFloat(data[i]);
      }
      bb.rewind();
      this.floatBuffer = bb.asFloatBuffer();
      this.vectorSize = vectorSize;
      this.nVectors = nVectors;
    }

    float iterate() {
      float sum = 0;
      floatBuffer.rewind();
      float[] vector = new float[vectorSize];
      for (int i = 0; i < nVectors; i++) {
        floatBuffer.get(vector); // bulk get still copies one vector into the array
        sum += calc(vector, 0, vector.length);
      }
      return sum;
    }
  }

Since you're doing so little work on each float (just one add, probably 1 cycle), the cost of reading 4 bytes, building a float, and copying it into an array adds up. I noticed the overhead shrinks a bit with fewer, bigger vectors, at least until a vector no longer fits in (L1?) cache.


7 Comments

Thanks! I've updated the question and the gist to reflect that it's still 5 times slower after avoiding the NOP replacement.
@MartinBlech: updated with more reasons why it might be slow.
@KeithRandall: that's very insightful, thanks. So the problem basically boils down to the JVM not letting my code see the underlying OS buffer directly as a float[] (as one could with void *mmap() in C, for example). The cost of building floats on the fly and copying them to the heap is not at all negligible. Would you agree?
@MartinBlech: I would agree, although it doesn't have to be that way. asFloatBuffer could do something smarter for memory-mapped ByteBuffers.
@MartinBlech, mapping in Java is internally implemented via mmap; DirectBuffer has the address (i.e. void*) of the mapped structure. Any ByteBuffer has 2+1 methods of interest: getFloat(), asFloatBuffer(), and order(ByteOrder). If you use the native byte order on x86 instead of the default big-endian, you'd be almost as good as the float[], although the JVM has to optimize the code by removing the bounds checking (arrays are more easily optimized).

There is no reason in theory they should perform the same. The mapped solution implies page faults and disk I/O to an entirely unpredictable extent. The float[] array doesn't. You should expect the latter to be faster, except in the special case where the entire file is mapped into memory and you never change it and it stays mapped and is never paged out. Most of these factors you cannot control or predict.

2 Comments

That's why I said "once the OS has cached the whole thing". Is it very unreasonable to expect a relatively small file to still be in memory after iterating over it once, on a machine with lots of free memory, using a file I know no other process is using?
@MartinBlech It's reasonable if you can control the factors I mentioned (and any I left out).
