10

Still working on my SHA1 implementation in Haskell. I've now got a working implementation and this is the inner loop:

iterateBlock' :: Int -> [Word32] -> Word32 -> Word32 -> Word32 -> Word32 -> Word32 -> [Word32]
iterateBlock' 80 ws a b c d e    = [a, b, c, d, e]
iterateBlock' t (w:ws) a b c d e = iterateBlock' (t+1) ws a' b' c' d' e'
    where
    a' = rotate a 5 + f t b c d + e + w + k t
    b' = a
    c' = rotate b 30
    d' = c
    e' = d

The profiler tells me that this function takes 1/3 of the runtime of my implementation. I can think of no way to further optimize it other than maybe inlining the temp variables but I believe -O2 will do that for me anyway.

Can anyone see a significant optimization that can be further applied?

FYI the k and f calls are below. They are so simple I don't think there is a way to optimize these other. Unless the Data.Bits module is slow?

f :: Int -> Word32 -> Word32 -> Word32 -> Word32
f t b c d
    | t <= 19   = (b .&. c) .|. ((complement b) .&. d)
    | t <= 39   = b `xor` c `xor` d
    | t <= 59   = (b .&. c) .|. (b .&. d) .|. (c .&. d)
    | otherwise = b `xor` c `xor` d

k :: Int -> Word32
k t
    | t <= 19   = 0x5A827999
    | t <= 39   = 0x6ED9EBA1
    | t <= 59   = 0x8F1BBCDC
    | otherwise = 0xCA62C1D6
4
  • Without trying, I'm guessing a lot of the issue is keeping your block data in a list (too much point/memory traffic). I'd try hard to move to a unboxed vectors of Word32 and manually unroll the loop. Short of that, try it with a strict/unpacked structure holding a, b, c, d, and e; then you'd only have one variable needing passed (and you'd be sure to put a bang pattern on it, right?). Commented Nov 14, 2011 at 21:56
  • 1
    I would also try to replace all the (<=) with a table lookup, though I'm not sure it will help much. Commented Nov 14, 2011 at 22:27
  • 1
    Another thing: It is often a good idea to write tight arithmetic functions in C and call that using the FFI. If you are careful to introduce no side-effects, the runtime can use a fast call into C that gives good performance. Commented Nov 14, 2011 at 22:39
  • 2
    But the C-calls still take a lot of cycles, so be sure you do enough work in C-land. For small things, it's better to smoke a few MagicHashes and use the GHC primops. Commented Nov 14, 2011 at 23:15

1 Answer 1

11

Looking at the core produced by ghc-7.2.2, the inlining works out well. What doesn't work so well is that in each iteration a couple of Word32 values are first unboxed, to perform the work, and then reboxed for the next iteration. Unboxing and re-boxing can cost a surprisingly large amount of time (and allocation). You can probably avoid that by using Word instead of Word32. You couldn't use rotate from Data.Bits then, but would have to implement it yourself (not hard) to have it work also on 64-bit systems. For a' you would have to manually mask out the high bits.

Another point that looks suboptimal is that in each iteration t is compared to 19, 39 and 59 (if it's large enough), so that the loop body contains four branches. It will probably be faster if you split iterateBlock' into four loops (0-19, 20-39, 40-59, 60-79) and use constants k1, ..., k4, and four functions f1, ..., f4 (without the t parameter) to avoid branches and have smaller code-size for each loop.

And, as Thomas said, using a list for the block data isn't optimal, an unboxed Word array/vector would probably help too.

With the bang patterns, the core looks much better. Two or three less-than-ideal points remain.

                      (GHC.Prim.narrow32Word#
                         (GHC.Prim.plusWord#
                            (GHC.Prim.narrow32Word#
                               (GHC.Prim.plusWord#
                                  (GHC.Prim.narrow32Word#
                                     (GHC.Prim.plusWord#
                                        (GHC.Prim.narrow32Word#
                                           (GHC.Prim.plusWord#
                                              (GHC.Prim.narrow32Word#
                                                 (GHC.Prim.or#
                                                    (GHC.Prim.uncheckedShiftL# sc2_sEn 5)
                                                    (GHC.Prim.uncheckedShiftRL# sc2_sEn 27)))
                                              y#_aBw))
                                        sc6_sEr))
                                  y#1_XCZ))
                            y#2_XD6))

See all these narrow32Word#? They're cheap, but not free. Only the outermost is needed, there may be a bit to harvest by hand-coding the steps and using Word.

Then the comparisons of t with 19, ..., they appear twice, once to determine the k constant, and once for the f transform. The comparisons alone are cheap, but they cause branches and without them, further inlining may be possible. I expect a bit could be gained here too.

And still, the list. That means w can't be unboxed, the core could be simpler if w were unboxable.

Sign up to request clarification or add additional context in comments.

8 Comments

I added bang patterns to all (!) parameters of all functions (except ws), that made unboxing work.
Good find. You don't need bang patterns on all parameters, though, with bangs on a, b, c, d, e, a' everything's roses, k and f are inlined, everything unboxable unboxed.
yeah. It's generally a good idea to put bang-patterns on arguments that are supposed to be strict.
Yes. I was fooled by GHC unboxing everything immediately in the loop body; with the traditional weakness in treating the fixed-width types, I didn't test.
Are you sure that LLVM won't optimize most of those narrows away? I'd expect that to be likely.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.