int steps = 256 * 1024 * 1024;
int[] a = new int[2];

// Loop 1
for (int i=0; i<steps; i++) { a[0]++; a[0]++; }

// Loop 2
for (int i=0; i<steps; i++) { a[0]++; a[1]++; }

Can someone explain why the second loop is more than 10x slower than the first (19 ms vs 232 ms)?

This is how I'm timing it:

long start_time = System.currentTimeMillis();

// Loop

long end_time = System.currentTimeMillis();
System.out.println(end_time - start_time);
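The comments below ask for a full MVCE, so here is a minimal, self-contained sketch of what such a benchmark could look like. The loop bodies and constants are taken from the question; the class and method names are my own, and the timings it prints will of course vary by machine.

```java
// Hypothetical minimal benchmark for the two loops from the question.
// Run each loop several times so the JIT has a chance to compile it.
public class LoopBench {
    static int steps = 256 * 1024 * 1024;

    static long timeLoop1() {
        int[] a = new int[2];
        long start = System.currentTimeMillis();
        for (int i = 0; i < steps; i++) { a[0]++; a[0]++; }
        return System.currentTimeMillis() - start;
    }

    static long timeLoop2() {
        int[] a = new int[2];
        long start = System.currentTimeMillis();
        for (int i = 0; i < steps; i++) { a[0]++; a[1]++; }
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) {
        for (int run = 0; run < 5; run++) {
            System.out.println("loop1: " + timeLoop1() + " ms, "
                             + "loop2: " + timeLoop2() + " ms");
        }
    }
}
```

Note that hand-rolled timing like this is fragile (warm-up, dead-code elimination); a harness such as JMH is the usual way to benchmark JIT-compiled code.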
  • Look at the bytecode; the first one probably got optimized out completely. Commented Jul 21, 2017 at 19:28
  • It would help if you provided your timing code in a full MVCE. Commented Jul 21, 2017 at 19:28
  • Actually, I just timed it (System.currentTimeMillis() before, between, and after each loop) and got roughly what they did. Commented Jul 21, 2017 at 19:31
  • How many times did you time it? You should do at least 10,000 iterations to be sure you have an accurate number. Commented Jul 21, 2017 at 19:31
  • Timed it about 20 times, got almost exactly the same for both timings on each iteration. Commented Jul 21, 2017 at 19:32

1 Answer


Summary

The JIT compiler is converting the first loop into a multiply, but not optimizing the second loop very much.

Discussion

The bytecode for both loops is basically the same (you can view this with javap -c test.class).

In Java, the bytecode is converted into x86 instructions by a JIT compiler, which can perform additional optimizations.

You can actually view the assembly produced by the JIT via java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly ... if you have the hsdis plugin.

I changed the value you add to each element to 0xbad to make it easier to spot the relevant code and changed the loop counter to long.

The first loop produces:

  mov     r11d,dword ptr [r13+10h]    Load from memory a[0]
  ...
  add     r11d,175ah                  Add 2 * 0xbad to the value
  mov     dword ptr [r13+10h],r11d    Store to memory a[0]

The second loop produces:

   mov     ebx,dword ptr [rax+10h]    Load from memory a[0]
   add     ebx,0badh                  Add 0xbad
   ...
   mov     dword ptr [rax+10h],ebx    Store to memory
   ...
   mov     ebx,dword ptr [rax+14h]    Load from memory a[1]
   add     ebx,0badh                  Add 0xbad
   ...
   mov     dword ptr [rax+14h],ebx    Store to memory a[1]

So you can see that the compiler is already able to optimize the first loop into fewer instructions.

In particular, it has spotted that the two additions to the same array element can be coalesced into a single addition of twice the value.
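In source terms, the JIT has effectively rewritten the first loop's body as a single `+= 2`. This is only a sketch of the transformation, not literal compiler output:

```java
public class CoalesceSketch {
    // Two increments of the same element collapse into one add of 2,
    // mirroring how the JIT emitted a single add of 2 * 0xbad above.
    static int run(int steps) {
        int[] a = new int[2];
        for (int i = 0; i < steps; i++) {
            a[0] += 2;   // equivalent to: a[0]++; a[0]++;
        }
        return a[0];
    }

    public static void main(String[] args) {
        System.out.println(run(10)); // prints 20
    }
}
```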

When I changed the loop counter back to int I noticed that the compiler manages to do even better with your first loop:

mov     r10d,dword ptr [r14+10h]
imul    ecx,r13d,175ah     This line converts lots of adds of 0xbad into a single multiply  
mov     r11d,r10d
sub     r11d,ecx
add     r10d,175ah
mov     dword ptr [r14+10h],r10d

In this case it has spotted that it can actually implement several iterations of your loop in a single pass by using a multiplication! This explains how the first loop can be an order of magnitude faster than the second.
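To make the strength reduction concrete, here is a sketch of why the multiply is legal: since each iteration adds exactly 2, after `steps` iterations the element has simply gained `2 * steps`, so the whole loop can be replaced by one multiply-and-add. The class and method names here are illustrative, not anything the JIT actually generates:

```java
public class MultiplySketch {
    // The loop as written: 2 * steps increments of a[0].
    static int loop(int steps) {
        int[] a = new int[2];
        for (int i = 0; i < steps; i++) { a[0]++; a[0]++; }
        return a[0];
    }

    // What the JIT reduces it to: a single multiply
    // (this is the idea behind the imul in the assembly above).
    static int reduced(int steps) {
        return 2 * steps;
    }

    public static void main(String[] args) {
        System.out.println(loop(1000) == reduced(1000)); // prints true
    }
}
```

The second loop cannot be reduced this way as easily because each iteration touches two different elements, so the JIT leaves the per-iteration loads and stores in place.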


1 Comment

Can you explain the conclusion further? What exactly is the difference, and why is it relevant for the speed? I think that would help the OP even more than just showing him things he has possibly never seen before.
