Optimizing Haskell Inner Loops

Question

Still working on my SHA1 implementation in Haskell. I've now got a working implementation and this is the inner loop:

iterateBlock' :: Int -> [Word32] -> Word32 -> Word32 -> Word32 -> Word32 -> Word32 -> [Word32]
iterateBlock' 80 ws a b c d e    = [a, b, c, d, e]
iterateBlock' t (w:ws) a b c d e = iterateBlock' (t+1) ws a' b' c' d' e'
    where
    a' = rotate a 5 + f t b c d + e + w + k t
    b' = a
    c' = rotate b 30
    d' = c
    e' = d

The profiler tells me that this function takes 1/3 of the runtime of my implementation. I can think of no way to further optimize it other than maybe inlining the temp variables but I believe -O2 will do that for me anyway.

Can anyone see a significant optimization that can be further applied?

FYI the k and f calls are below. They are so simple I don't think there is a way to optimize these other. Unless the Data.Bits module is slow?

f :: Int -> Word32 -> Word32 -> Word32 -> Word32
f t b c d
    | t <= 19   = (b .&. c) .|. ((complement b) .&. d)
    | t <= 39   = b `xor` c `xor` d
    | t <= 59   = (b .&. c) .|. (b .&. d) .|. (c .&. d)
    | otherwise = b `xor` c `xor` d

k :: Int -> Word32
k t
    | t <= 19   = 0x5A827999
    | t <= 39   = 0x6ED9EBA1
    | t <= 59   = 0x8F1BBCDC
    | otherwise = 0xCA62C1D6

Without trying, I'm guessing a lot of the issue is keeping your block data in a list (too much point/memory traffic). I'd try hard to move to a unboxed vectors of Word32 and manually unroll the loop. Short of that, try it with a strict/unpacked structure holding a, b, c, d, and e; then you'd only have one variable needing passed (and you'd be sure to put a bang pattern on it, right?). — Thomas M. DuBuisson
– Thomas M. DuBuisson, Commented Nov 14, 2011 at 21:56
I would also try to replace all the (<=) with a table lookup, though I'm not sure it will help much. — n. m. could be an AI
– n. m. could be an AI, Commented Nov 14, 2011 at 22:27
Another thing: It is often a good idea to write tight arithmetic functions in C and call that using the FFI. If you are careful to introduce no side-effects, the runtime can use a fast call into C that gives good performance. — fuz
– fuz, Commented Nov 14, 2011 at 22:39
But the C-calls still take a lot of cycles, so be sure you do enough work in C-land. For small things, it's better to smoke a few MagicHashes and use the GHC primops. — Daniel Fischer
– Daniel Fischer, Commented Nov 14, 2011 at 23:15

Daniel Fischer · Accepted Answer · 2011-11-14 23:05:40Z

11

Looking at the core produced by ghc-7.2.2, the inlining works out well. What doesn't work so well is that in each iteration a couple of Word32 values are first unboxed, to perform the work, and then reboxed for the next iteration. Unboxing and re-boxing can cost a surprisingly large amount of time (and allocation). You can probably avoid that by using Word instead of Word32. You couldn't use rotate from Data.Bits then, but would have to implement it yourself (not hard) to have it work also on 64-bit systems. For a' you would have to manually mask out the high bits.

Another point that looks suboptimal is that in each iteration t is compared to 19, 39 and 59 (if it's large enough), so that the loop body contains four branches. It will probably be faster if you split iterateBlock' into four loops (0-19, 20-39, 40-59, 60-79) and use constants k1, ..., k4, and four functions f1, ..., f4 (without the t parameter) to avoid branches and have smaller code-size for each loop.

And, as Thomas said, using a list for the block data isn't optimal, an unboxed Word array/vector would probably help too.

With the bang patterns, the core looks much better. Two or three less-than-ideal points remain.

                      (GHC.Prim.narrow32Word#
                         (GHC.Prim.plusWord#
                            (GHC.Prim.narrow32Word#
                               (GHC.Prim.plusWord#
                                  (GHC.Prim.narrow32Word#
                                     (GHC.Prim.plusWord#
                                        (GHC.Prim.narrow32Word#
                                           (GHC.Prim.plusWord#
                                              (GHC.Prim.narrow32Word#
                                                 (GHC.Prim.or#
                                                    (GHC.Prim.uncheckedShiftL# sc2_sEn 5)
                                                    (GHC.Prim.uncheckedShiftRL# sc2_sEn 27)))
                                              y#_aBw))
                                        sc6_sEr))
                                  y#1_XCZ))
                            y#2_XD6))

See all these narrow32Word#? They're cheap, but not free. Only the outermost is needed, there may be a bit to harvest by hand-coding the steps and using Word.

Then the comparisons of t with 19, ..., they appear twice, once to determine the k constant, and once for the f transform. The comparisons alone are cheap, but they cause branches and without them, further inlining may be possible. I expect a bit could be gained here too.

And still, the list. That means w can't be unboxed, the core could be simpler if w were unboxable.

edited Nov 14, 2011 at 23:05

answered Nov 14, 2011 at 22:33

Daniel Fischer

184k19 gold badges319 silver badges436 bronze badges

Sign up to request clarification or add additional context in comments.

8 Comments

fuz Over a year ago

I added bang patterns to all (!) parameters of all functions (except ws), that made unboxing work.

Daniel Fischer Over a year ago

Good find. You don't need bang patterns on all parameters, though, with bangs on a, b, c, d, e, a' everything's roses, k and f are inlined, everything unboxable unboxed.

fuz Over a year ago

yeah. It's generally a good idea to put bang-patterns on arguments that are supposed to be strict.

Daniel Fischer Over a year ago

Yes. I was fooled by GHC unboxing everything immediately in the loop body; with the traditional weakness in treating the fixed-width types, I didn't test.

Louis Wasserman Over a year ago

Are you sure that LLVM won't optimize most of those narrows away? I'd expect that to be likely.

|

Collectives™ on Stack Overflow

Optimizing Haskell Inner Loops

1 Answer 1

8 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

8 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related