16

Edit: The GHC bug I filed about this 10 years ago has was fixed in 2024. I expect GHC >= 9.12 to no longer have the performance problem described in this question.


In order to compare performance with lists being slow in this GHC bug I'm trying to get the following loop as fast as possible:

{-# LANGUAGE BangPatterns #-}

module Main (main) where

import Control.Monad
import Data.Word


main :: IO ()
main = do
  loop (maxBound :: Word32) $ \i -> do
    when (i `rem` 100000000 == 0) $
      print (fromIntegral i / fromIntegral (maxBound :: Word32))


loop :: Word32 -> (Word32 -> IO ()) -> IO ()
loop n f = go 0
  where
    go !i | i == n = return ()
    go !i          = f i >> go (i + 1)

compiled with ghc -O loop.hs.

However, running this takes 50 seconds on my computer - 10 times slower than the equivalent C program:

#include "limits.h"
#include "stdint.h"
#include "stdio.h"

int main(int argc, char const *argv[])
{
  for (uint32_t i = 0; i < UINT_MAX; ++i)
  {
    if (i % 100000000 == 0) printf("%f\n", (float) i / (float) UINT_MAX );
  }
  return 0;
}

compiled with gcc -O2 -std=c99 -o testc test.c.


Using the freshly released GHC 7.8 or using -O2 did not improve the performance.

However, compiling with the -fllvm flag (on either ghc version) brought a 10x speed improvement, bringing the performance on par with C.

Questions:

  1. Why is GHC's native codegen so much slower for my loop?
  2. Is there a way to improve my loop so that it is fast also without -fllvm, or is this already the fastest IO loop over Word32 one can achive?
12
  • Wild guess: add ::Float after the fromIntegrals to be sure they are not defaulted to something else. This is unrelated to the LLVM options, though. Commented Apr 26, 2014 at 18:43
  • For the record, I don't think those strictness annotations are necessary since (+) is already strict in both arguments. As soon as the argument is evaluated, it will be fully evaluated. Commented Apr 26, 2014 at 18:52
  • I first replaced the print with just putStrLn "Foo", and similarly in the C code to remove any time spent formatting the numbers. I got about a 30% speed increase with main = sequence_ $ every 100000000 $ replicate (fromIntegral (maxBound :: Word32)) $ putStrLn "Foo" where every gets each nth element (not quite equivalent since it skips the first element, but that won't have a significant affect on performance). It looks like a portion of your problem is that you've tried to write a C-style loop in Haskell, and you're doing the loop logic inside a monad which requires extra constructors. Commented Apr 26, 2014 at 20:26
  • 1
    @bheklilr Your idea using replicate is interesting, but I don't think explains the 1000% speed difference I observe. Regarding your point about constructors: I do not think code in the IO monad allocates you any extra constructors - in contrary, the list you propose usually does (especially in the linked bug)! I believe the 30% speedup you see is from avoiding the rem. Commented Apr 26, 2014 at 21:08
  • 2
    @leventov It is clearly not a formatting / printing IO issue, since it also happens when you don't print anything at all. Of course the link to the GHC bug is "irrelevant" as that one is about lists, which I do not use here; I solely link to the bug to motivate why I'm interested writing in such a loop. I will gladly accept an answer that technically explains why and how any of the things you suggest the problem. Commented Apr 27, 2014 at 2:16

1 Answer 1

13

Let's inspect the assembly. I modified the main function a bit so that the output becomes a bit clearer (but the performance remains identical). I used GHC 7.8.2 with -O2.

main :: IO ()
main = do
  loop (maxBound :: Word32) $ \i -> do
    when (i `rem` 100000000 == 0) $
      putStrLn "foo"

There is a lot of clutter, so I try to only include the interesting parts:

Native Codegen

Main_zdwa_info:
.Lc3JD: /* check if there's enough space for stack growth */
    leaq -16(%rbp),%rax
    cmpq %r15,%rax
    jb .Lc3JO /* this jumps to some GC code that grows the stack, then
                 reenters the main closure */
.Lc3JP:
    movl $4294967295,%eax /* issue: loading the bound on every iteration */
    cmpq %rax,%r14
    jne .Lc3JB
.Lc3JC:
   /* Return from main. Code omitted */
.Lc3JB: /* test the index for modulus */
    movl $100000000,%eax /* issue: unnecessary moves */
    movq %rax,%rbx      
    movq %r14,%rax
    xorq %rdx,%rdx
    divq %rbx /* issue: doing the division (llvm and gcc avoid this) */
    testq %rdx,%rdx
    jne .Lc3JU
.Lc3JV: 
   /* do the printing. Code omitted. */
.Lc3JN:
   /* increment index and (I guess) restore registers messed up by the printing */
    movq 8(%rbp),%rax
    incq %rax  
    movl %eax,%r14d
    addq $16,%rbp
    jmp Main_zdwa_info
.Lc3JU:
    leaq 1(%r14),%rax   /*issue: why not just increment r14? */
    movl %eax,%r14d     
    jmp Main_zdwa_info

LLVM

 Main_zdwa_info:
/* code omitted: the same stack-checking stuff as in native */
.LBB1_1:
    movl    $4294967295, %esi /* load the bound */
    movabsq $-6067343680855748867, %rdi /*load a magic number for the modulus */
    jmp .LBB1_2
.LBB1_4:              
    incl    %ecx
.LBB1_2:  
    cmpq    %rsi, %rcx
    je  .LBB1_6 /* check bound */

    /* do the modulus with two multiplications, a shift and a magic number */
    /* note : gcc does the same reduction */ 
    movq    %rcx, %rax
    mulq    %rdi
    shrq    $26, %rdx
    imulq   $100000000, %rdx, %rax  
    cmpq    %rax, %rcx
    jne .LBB1_4 
    /* Code omitted: print, then return to loop beginning */
.LBB1_6:                       
    /* Code omitted: return from main */

Observations

  • IO overhead is nonexistent in both assemblies. The zero-byte RealWorld state token is conspicuously absent.

  • Native codegen doesn't do much strength reduction, in contrast to LLVM, which readily converts the modulus into multiplication, shift and magic numbers.

  • Native codegen redoes the stack space checking on each iteration, while LLVM doesn't. It doesn't seem to be a significant overhead, however.

  • Native codegen is just plain bad here at looping and register allocation. It shuffles around registers and loads the bound on each iteration. LLVM emits code comparable to hand-written code in tidiness.

As to your question:

Is there a way to improve my loop so that it is fast also without -fllvm, or is this >already the fastest IO loop over Word32 one can achieve?

The best you can do here is manual strength reduction, I think, though I personally find that option unacceptable. However, after doing that your code will be still significantly slower. I also ran the following trivial loop, and it's twice as fast with LLVM than with native:

import Data.Word
main = go 0 where
    go :: Word32 -> IO ()
    go i | i == maxBound = return ()
    go i = go (i + 1)

The culprit is again unnecessary register-shuffling and bound-loading. There isn't really any way to remedy these kind of low level issues, aside from switching to LLVM.

Sign up to request clarification or add additional context in comments.

6 Comments

Thanks for the excellent analysis. One remark: I quite sure arity checks are static business only; rwbarton on IRC hinted at the leaq -16(%rbp),%rax being the stack overflow check (another thing that should really get optimized out of the loop).
@nh2 You're right! I just looked at the STG tutorial on GHC pages and it indeed seems to be checking for stack space. I'll edit the answer correspondingly.
I have created this GHC bug based on your answer.
@Sarah I suspect forM_ is slow because of the bug I mentioned in my question.
The bug I filed 10 years ago has just been fixed. The next GHC release should solve this issue!
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.