Let's inspect the assembly. I modified the main function slightly so that the output is a bit clearer (the performance remains identical). I used GHC 7.8.2 with -O2.
import Control.Monad (when)
import Data.Word

main :: IO ()
main = do
    loop (maxBound :: Word32) $ \i -> do
        when (i `rem` 100000000 == 0) $
            putStrLn "foo"
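The loop combinator itself comes from the question; a minimal definition consistent with how it is used above would be the following (my reconstruction, reusing the imports above, not necessarily the question's exact code):

-- Assumed shape of the question's 'loop': run the action for i = 0 .. n-1.
loop :: Word32 -> (Word32 -> IO ()) -> IO ()
loop n f = go 0
  where
    go i | i == n    = return ()
         | otherwise = f i >> go (i + 1)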
There is a lot of clutter, so I include only the interesting parts:
Native Codegen
Main_zdwa_info:
.Lc3JD:                      /* check if there's enough space for stack growth */
    leaq -16(%rbp),%rax
    cmpq %r15,%rax
    jb .Lc3JO                /* this jumps to some GC code that grows the stack,
                                then reenters the main closure */
.Lc3JP:
    movl $4294967295,%eax    /* issue: loading the bound on every iteration */
    cmpq %rax,%r14
    jne .Lc3JB
.Lc3JC:
    /* Return from main. Code omitted */
.Lc3JB:                      /* test the index for modulus */
    movl $100000000,%eax     /* issue: unnecessary moves */
    movq %rax,%rbx
    movq %r14,%rax
    xorq %rdx,%rdx
    divq %rbx                /* issue: doing the division (llvm and gcc avoid this) */
    testq %rdx,%rdx
    jne .Lc3JU
.Lc3JV:
    /* do the printing. Code omitted. */
.Lc3JN:
    /* increment index and (I guess) restore registers messed up by the printing */
    movq 8(%rbp),%rax
    incq %rax
    movl %eax,%r14d
    addq $16,%rbp
    jmp Main_zdwa_info
.Lc3JU:
    leaq 1(%r14),%rax        /* issue: why not just increment r14? */
    movl %eax,%r14d
    jmp Main_zdwa_info
LLVM
Main_zdwa_info:
    /* code omitted: the same stack-checking stuff as in native */
.LBB1_1:
    movl $4294967295, %esi               /* load the bound */
    movabsq $-6067343680855748867, %rdi  /* load a magic number for the modulus */
    jmp .LBB1_2
.LBB1_4:
    incl %ecx
.LBB1_2:
    cmpq %rsi, %rcx
    je .LBB1_6                           /* check bound */
    /* do the modulus with two multiplications, a shift and a magic number */
    /* note: gcc does the same reduction */
    movq %rcx, %rax
    mulq %rdi
    shrq $26, %rdx
    imulq $100000000, %rdx, %rax
    cmpq %rax, %rcx
    jne .LBB1_4
    /* Code omitted: print, then return to loop beginning */
.LBB1_6:
    /* Code omitted: return from main */
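To make the magic number concrete, here is a small Haskell reconstruction of what the LLVM code computes (my own sketch; the names magic, divMagic and remMagic are not from the original). The constant is the unsigned 64-bit value 12379400392853802749 = ceil(2^90 / 10^8), which prints as -6067343680855748867 when read as a signed integer.

import Data.Bits (shiftR)
import Data.Word

-- Reconstruction of the reciprocal-multiplication trick in the LLVM output.
-- magic = ceil (2^90 / 10^8); read as a signed 64-bit value it is the
-- -6067343680855748867 loaded by the movabsq above.
magic :: Integer
magic = (2 ^ 90 + 10 ^ 8 - 1) `div` 10 ^ 8   -- 12379400392853802749

-- mulq %rdi leaves the high 64 bits of i * magic in %rdx; shrq $26 then
-- shifts by another 26 bits, for a total shift of 90.
divMagic :: Word32 -> Integer
divMagic i = (fromIntegral i * magic) `shiftR` 90

-- imulq and cmpq then recover the remainder as i - 10^8 * (i `div` 10^8).
remMagic :: Word32 -> Integer
remMagic i = fromIntegral i - 10 ^ 8 * divMagic i

For every i :: Word32, remMagic i should equal fromIntegral (i `rem` 100000000); the approximation error in the magic constant is small enough that the shifted product is the exact quotient for this input range.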
Observations
IO overhead is nonexistent in both outputs: the RealWorld state token has no runtime representation, so it never appears in the generated code.
Native codegen doesn't do much strength reduction, in contrast to LLVM, which readily converts the modulus into multiplication, shift and magic numbers.
Native codegen redoes the stack space checking on each iteration, while LLVM doesn't. It doesn't seem to be a significant overhead, however.
Native codegen is just plain bad at looping and register allocation here: it shuffles registers around and reloads the bound on every iteration. LLVM, in contrast, emits code about as tidy as hand-written assembly.
As to your question:
Is there a way to improve my loop so that it is fast also without -fllvm, or is this already the fastest IO loop over Word32 one can achieve?
The best you can do here, I think, is manual strength reduction (see the sketch at the end of this answer), though I personally find that option unacceptable, and even after doing it your code will still be significantly slower than the LLVM version. I also ran the following trivial loop, and it is twice as fast with LLVM as with the native codegen:
import Data.Word

main = go 0 where
  go :: Word32 -> IO ()
  go i | i == maxBound = return ()
  go i = go (i + 1)
The culprit is again the unnecessary register shuffling and bound loading. There isn't really any way to remedy these kinds of low-level issues, aside from switching to LLVM.
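For reference, here is one way the manual strength reduction mentioned above could look (my own sketch, not code from the question): keep a second counter that tracks i `rem` 100000000 incrementally, so the loop never divides.

import Control.Monad (when)
import Data.Word

-- Sketch of manual strength reduction: the counter k maintains the invariant
-- k == i `rem` 100000000, so no division is needed inside the loop.
main :: IO ()
main = go 0 0
  where
    go :: Word32 -> Word32 -> IO ()
    go i k = do
        when (k == 0) (putStrLn "foo")
        if i == maxBound
            then return ()
            else go (i + 1) (if k == 99999999 then 0 else k + 1)

This removes the divq from the inner loop, but the register shuffling and bound reloading described above remain, which is why I still expect it to trail the LLVM version.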