
Another Haskell optimization question, following on from my previous one. I need to generate a list recursively, similar to the fibs function found in many introductory Haskell articles:

import Data.Bits (rotate, xor)
import Data.List (zipWith4)
import Data.Word (Word32)

generateSchedule :: [Word32] -> [Word32]
generateSchedule blkw = take 80 ws
    where
    ws          = blkw ++ zipWith4 gen (drop 13 ws) (drop 8 ws) (drop 2 ws) ws
    gen a b c d = rotate (a `xor` b `xor` c `xor` d) 1

The above function has overtaken everything else as the most time- and allocation-consuming function in my program. The profiler gives me the following statistics:

COST CENTRE        MODULE             %time %alloc  ticks     bytes
generateSchedule   Test.Hash.SHA1     22.1   40.4   31        702556640

I thought of applying unboxed vectors to calculate the list but cannot figure out a way to do it, since the list is recursive. This would have a natural implementation in C, but I do not see a way to make it faster in Haskell (other than unrolling it into 80 lines of variable declarations). Any help?
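For reference, the closest vector idiom I could find is constructN, where each new element may read the prefix built so far. This is a sketch I have not benchmarked; it assumes a vector version that exports U.constructN and, like the list version, an input of at least 14 words:

import qualified Data.Vector.Unboxed as U
import Data.Bits (rotate, xor)
import Data.Word (Word32)

-- Build the 80-slot schedule in one pass; step may index any element
-- of the already-constructed prefix v (all indices used are < length v).
generateScheduleV :: [Word32] -> U.Vector Word32
generateScheduleV blkw = U.constructN 80 step
  where
    blk = U.fromList blkw
    n   = U.length blk
    step v
      | i < n     = blk U.! i                         -- copy the input words
      | otherwise = rotate (a `xor` b `xor` c `xor` d) 1
      where
        i = U.length v
        k = i - n                                     -- offset into the generated tail
        a = v U.! (k + 13)
        b = v U.! (k + 8)
        c = v U.! (k + 2)
        d = v U.! k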

Update: I actually did unroll it quickly to see if it helps. The code is here. It is ugly, and in fact it was slower.

COST CENTRE        MODULE             %time %alloc  ticks     bytes
generateSchedule   GG.Hash.SHA1       22.7   27.6   40        394270592
  • I still think you need to ditch the lists entirely. Not just as an intermediary: take your input as a ByteString and use Data.Vector.Storable or some such. I don't see much point in optimizing when the input is a list of words. If you unroll partcb entirely then you won't even need this generateSchedule function (partab); it would be explicit (inlined) in the unrolling. Also: you're putting enough effort into this that I'm curious as to the goal - is it education, or are you wanting to use the implementation in production code? If the latter, is there a reason for avoiding cryptohash? Commented Nov 15, 2011 at 21:32
  • Education. I want to know whether I can write clean, idiomatic Haskell that is comparable in speed without resorting to optimization tricks that make me wish I had just written it in C instead. In the case of SHA1, I'm starting to feel like that is the case. I also looked at the source for Data.Digest.Pure.SHA. It's great -- within 2-3x the speed of a C implementation. But it is not the kind of code I would want to write. Commented Nov 16, 2011 at 1:07
  • FYI my full, latest code here. Commented Nov 16, 2011 at 1:42
  • You can get a cheap 23% boost just by unpacking your Vec160 structure (using the {-# UNPACK #-} pragma before each Word32 field); see the sketch after these comments. Commented Nov 16, 2011 at 4:36
  • Are you sure that generateSchedule is actually the bottleneck here? If blkw isn't already evaluated, the cost of evaluating it will probably be attributed to generateSchedule. If it is evaluated, then representing it with a list is probably not the right thing anyway. Commented Nov 16, 2011 at 9:23
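A minimal sketch of the UNPACK suggestion above, assuming Vec160 is the 160-bit hash state (five Word32 fields, as the name suggests):

import Data.Word (Word32)

-- Strict, unpacked fields: GHC stores each Word32 directly in the
-- constructor instead of as a pointer to a boxed heap object.
data Vec160 = Vec160
    {-# UNPACK #-} !Word32
    {-# UNPACK #-} !Word32
    {-# UNPACK #-} !Word32
    {-# UNPACK #-} !Word32
    {-# UNPACK #-} !Word32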

2 Answers

import Data.Array.Base
import Data.Array.ST
import Data.Array.Unboxed

generateSchedule :: [Word32] -> UArray Int Word32
generateSchedule ws0 = runSTUArray $ do
    arr <- unsafeNewArray_ (0,79)         -- uninitialised 80-element array
    let fromList i [] = fill i 0          -- input exhausted: start generating
        fromList i (w:ws) = do            -- copy the input words into the array
            unsafeWrite arr i w
            fromList (i+1) ws
        fill i j                          -- j trails i by the length of the input
          | i == 80 = return arr
          | otherwise = do
              d <- unsafeRead arr j
              c <- unsafeRead arr (j+2)
              b <- unsafeRead arr (j+8)
              a <- unsafeRead arr (j+13)
              unsafeWrite arr i (gen a b c d)
              fill (i+1) (j+1)
    fromList 0 ws0

will create an unboxed array corresponding to your list. It relies on the assumption that the list argument contains at least 14 and at most 80 items, otherwise it'll misbehave badly. I think it'll always be 16 items (64 bytes), so that should be safe for you. (But it's probably better to start filling directly from the ByteString rather than to construct an intermediate list.)
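For that ByteString route, a sketch of the word-extraction step (the helper name readWord32BE is mine; SHA-1 reads its words big-endian, and a full 64-byte block is assumed):

import qualified Data.ByteString as B
import Data.Bits (shiftL, (.|.))
import Data.Word (Word32)

-- Read the big-endian Word32 at word index i (byte offset 4*i).
readWord32BE :: B.ByteString -> Int -> Word32
readWord32BE bs i =
      fromIntegral (B.index bs (4*i))     `shiftL` 24
  .|. fromIntegral (B.index bs (4*i + 1)) `shiftL` 16
  .|. fromIntegral (B.index bs (4*i + 2)) `shiftL` 8
  .|. fromIntegral (B.index bs (4*i + 3))

With this, fromList would be replaced by writing readWord32BE block i for i in [0..15] and then calling fill 16 0.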

By strictly evaluating this before doing the hashing rounds, you avoid the switching between list construction and hashing that you get with the lazily constructed list; that should reduce the time needed. By using an unboxed array, we also avoid the allocation overhead of lists, which may further improve speed (but GHC's allocator is very fast, so don't expect too much impact from that).
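Concretely: because the array is unboxed, forcing it to weak head normal form forces all of its elements, so a single seq (or a bang pattern) before the rounds suffices. hashRounds here is a stand-in name for your round driver:

let sched = generateSchedule ws
in sched `seq` hashRounds sched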

In your hashing rounds, get the needed Word32 via unsafeAt array t to avoid unnecessary bounds-checking.
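For instance, a stand-in round driver (consumeSchedule and its trivial step are mine, just to show the access pattern):

import Data.Array.Base (unsafeAt)
import Data.Array.Unboxed (UArray)
import Data.List (foldl')
import Data.Word (Word32)

-- Stand-in for the 80 hashing rounds: fetch each W_t without a bounds
-- check and feed it to the round step.
consumeSchedule :: UArray Int Word32 -> Word32
consumeSchedule sched = foldl' step 0 [0 .. 79]
  where
    step acc t = acc + (sched `unsafeAt` t)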

Addendum: Unrolling the creation of the list might be faster if you put a bang on each wn, though I'm not sure. Since you already have the code, adding bangs and checking isn't too much work, is it? I'm curious.
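For concreteness, a fragment of what that would look like (next3 is a made-up wrapper; the real unrolled code continues through w79):

{-# LANGUAGE BangPatterns #-}
import Data.Bits (rotate, xor)
import Data.Word (Word32)

gen :: Word32 -> Word32 -> Word32 -> Word32 -> Word32
gen a b c d = rotate (a `xor` b `xor` c `xor` d) 1

-- The bangs force each w_n when the binding group is evaluated,
-- instead of building a chain of thunks.
next3 :: [Word32] -> (Word32, Word32, Word32)
next3 [w0,w1,w2,w3,w4,w5,w6,w7,w8,w9,w10,w11,w12,w13,w14,w15] =
    (w16, w17, w18)
  where
    !w16 = gen w13 w8 w2 w0
    !w17 = gen w14 w9 w3 w1
    !w18 = gen w15 w10 w4 w2
next3 _ = error "next3: needs exactly 16 words"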


3 Comments

Hi Daniel. I did as you suggested. The numbers are 13.4%, 15.7% after the bangs and 30%, 27% before. So it cuts down on about half of the runtime and alloc. Great idea.
Re the vector solution. Thanks. But I'm not getting the warm fuzzy from it with all the unsafe* calls. Plus, it's getting to the point where the advantage of Haskell, i.e. clarity, conciseness, is overshadowed by wrangling code for speed. If I have to do that, I'd simply write it in C as originally suggested, I think.
Here, the 'unsafe' just means no bounds checks (and no initialisation in unsafeNewArray_). Since the bounds checking is done by the calling code, it's actually safe (if the precondition on the length of ws0 holds). It's unfortunate naming; it would have been better to call the functions uncheckedXXX. As for clarity and conciseness, well, it's a tradeoff. Until we get more compiler and fusion magic, sometimes you have to write ugly imperative code. But it's more localised in Haskell: once the low level is up to snuff, the calling code can be nice and compositional.

We can use lazy arrays to get a halfway house between going straight mutable and using pure lists. You get the benefits of a recursive definition, but for that reason still pay the price of laziness and boxing -- though less so than with lists. The following code uses criterion to test two lazy array solutions (using standard arrays, and vectors) as well as the original list code and Daniel's mutable uarray code above:

module Main where
import Data.Bits
import Data.List
import Data.Word
import qualified Data.Vector as LV
import Data.Array.ST
import Data.Array.Unboxed
import qualified Data.Array as A
import Data.Array.Base
import Criterion.Main

gen :: Word32 -> Word32 -> Word32 -> Word32 -> Word32
gen a b c d = rotate (a `xor` b `xor` c `xor` d) 1

gss :: [Word32] -> [Word32]
gss blkw = LV.toList v
    where v = LV.fromList $ blkw ++ rest
          rest = map (\i -> gen (LV.unsafeIndex v (i + 13))
                                (LV.unsafeIndex v (i + 8))
                                (LV.unsafeIndex v (i + 2))
                                (LV.unsafeIndex v i)
                     )
                 [0..79 - 14]

gss' :: [Word32] -> [Word32]
gss' blkw = A.elems v
    where v = A.listArray (0,79) $ blkw ++ rest
          rest = map (\i -> gen (unsafeAt v (i + 13))
                                (unsafeAt v (i + 8))
                                (unsafeAt v (i + 2))
                                (unsafeAt v i)
                     )
                 [0..79 - 14]

generateSchedule :: [Word32] -> [Word32]
generateSchedule blkw = take 80 ws
    where
    ws          = blkw ++ zipWith4 gen (drop 13 ws) (drop 8 ws) (drop 2 ws) ws

gs :: [Word32] -> [Word32]
gs ws = elems (generateSched ws)

generateSched :: [Word32] -> UArray Int Word32
generateSched ws0 = runSTUArray $ do
    arr <- unsafeNewArray_ (0,79)
    let fromList i [] = fill i 0
        fromList i (w:ws) = do
            unsafeWrite arr i w
            fromList (i+1) ws
        fill i j
          | i == 80 = return arr
          | otherwise = do
              d <- unsafeRead arr j
              c <- unsafeRead arr (j+2)
              b <- unsafeRead arr (j+8)
              a <- unsafeRead arr (j+13)
              unsafeWrite arr i (gen a b c d)
              fill (i+1) (j+1)
    fromList 0 ws0

args = [0..13]   -- 14 input words, matching the hard-coded [0..79 - 14] above

main = defaultMain [
        bench "list"   $ whnf (sum . generateSchedule) args
       ,bench "vector" $ whnf (sum . gss) args
       ,bench "array"  $ whnf (sum . gss') args
       ,bench "uarray" $ whnf (sum . gs) args
       ]

I compiled the code with -O2 and -funfolding-use-threshold=256 to force lots of inlining.
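That is, roughly (the exact invocation is my guess):

ghc -O2 -funfolding-use-threshold=256 --make Main.hs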

The criterion benchmarks demonstrate that the vector solution is slightly better, and the array solution slightly better still, but that the unboxed mutable solution still wins by a landslide:

benchmarking list
mean: 8.021718 us, lb 7.720636 us, ub 8.605683 us, ci 0.950
std dev: 2.083916 us, lb 1.237193 us, ub 3.309458 us, ci 0.950

benchmarking vector
mean: 6.829923 us, lb 6.725189 us, ub 7.226799 us, ci 0.950
std dev: 882.3681 ns, lb 76.20755 ns, ub 2.026598 us, ci 0.950

benchmarking array
mean: 6.212669 us, lb 5.995038 us, ub 6.635405 us, ci 0.950
std dev: 1.518521 us, lb 946.8826 ns, ub 2.409086 us, ci 0.950

benchmarking uarray
mean: 2.380519 us, lb 2.147896 us, ub 2.715305 us, ci 0.950
std dev: 1.411092 us, lb 1.083180 us, ub 1.862854 us, ci 0.950

I ran some basic profiling too, and noticed that the lazy/boxed array solutions did slightly better than the list solution, but again significantly worse than the unboxed array approach.

2 Comments

Strictly speaking, that should be [0..79 - length blkw] (i.e. [0..63], because Ana will always be giving it lists of length 16), but you used LV.take as a backup (smart). I'm curious if there's any measurable difference. At the same time, I'm too lazy to find out.
@Thomas: yeah, I dashed off the code and tested it rather quickly and was therefore lazy about a few things. I'll run some benchmarks and tune and update over the next day or so.
