We can use lazy arrays to get a halfway house between going straight mutable and using pure lists. You get the benefits of a recursive definition, but for that reason still pay the price of laziness and boxing -- though less so than with lists. The following code uses criterion to test two lazy array solutions (using standard arrays, and vectors) as well as the original list code and Daniel's mutable uarray code above:
module Main where
import Data.Bits
import Data.List
import Data.Word
import qualified Data.Vector as LV
import Data.Array.ST
import Data.Array.Unboxed
import qualified Data.Array as A
import Data.Array.Base
import Criterion.Main
gen :: Word32 -> Word32 -> Word32 -> Word32 -> Word32
gen a b c d = rotate (a `xor` b `xor` c `xor` d) 1
gss blkw = LV.toList v
where v = LV.fromList $ blkw ++ rest
rest = map (\i -> gen (LV.unsafeIndex v (i + 13))
(LV.unsafeIndex v (i + 8))
(LV.unsafeIndex v (i + 2))
(LV.unsafeIndex v i)
)
[0..79 - 14]
gss' blkw = A.elems v
where v = A.listArray (0,79) $ blkw ++ rest
rest = map (\i -> gen (unsafeAt v (i + 13))
(unsafeAt v (i + 8))
(unsafeAt v (i + 2))
(unsafeAt v i)
)
[0..79 - 14]
generateSchedule :: [Word32] -> [Word32]
generateSchedule blkw = take 80 ws
where
ws = blkw ++ zipWith4 gen (drop 13 ws) (drop 8 ws) (drop 2 ws) ws
gs :: [Word32] -> [Word32]
gs ws = elems (generateSched ws)
generateSched :: [Word32] -> UArray Int Word32
generateSched ws0 = runSTUArray $ do
arr <- unsafeNewArray_ (0,79)
let fromList i [] = fill i 0
fromList i (w:ws) = do
unsafeWrite arr i w
fromList (i+1) ws
fill i j
| i == 80 = return arr
| otherwise = do
d <- unsafeRead arr j
c <- unsafeRead arr (j+2)
b <- unsafeRead arr (j+8)
a <- unsafeRead arr (j+13)
unsafeWrite arr i (gen a b c d)
fill (i+1) (j+1)
fromList 0 ws0
args = [0..13]
main = defaultMain [
bench "list" $ whnf (sum . generateSchedule) args
,bench "vector" $ whnf (sum . gss) args
,bench "array" $ whnf (sum . gss') args
,bench "uarray" $ whnf (sum . gs) args
]
I compiled the code with -O2 and -funfolding-use-threshold=256 to force lots of inlining.
The criterion benchmarks demonstrate that the vector solution is slightly better, and the array solution slightly better still, but that the unboxed mutable solution still wins by a landslide:
benchmarking list
mean: 8.021718 us, lb 7.720636 us, ub 8.605683 us, ci 0.950
std dev: 2.083916 us, lb 1.237193 us, ub 3.309458 us, ci 0.950
benchmarking vector
mean: 6.829923 us, lb 6.725189 us, ub 7.226799 us, ci 0.950
std dev: 882.3681 ns, lb 76.20755 ns, ub 2.026598 us, ci 0.950
benchmarking array
mean: 6.212669 us, lb 5.995038 us, ub 6.635405 us, ci 0.950
std dev: 1.518521 us, lb 946.8826 ns, ub 2.409086 us, ci 0.950
benchmarking uarray
mean: 2.380519 us, lb 2.147896 us, ub 2.715305 us, ci 0.950
std dev: 1.411092 us, lb 1.083180 us, ub 1.862854 us, ci 0.950
I ran some basic profiling too, and noticed that the lazy/boxed array solutions did slightly better than the list solution, but again significantly worse than the unboxed array approach.
ByteStringand use Data.Vector.Storable or some such. I don't see much point in optimizing when the input is a list of words. If you unrollpartcbentirely then you won't even need thisgenerateSchedulefunction (partab); it would be explicit (inlined) in the unrolling. Also: you're putting enough effort into this that I'm curious as to the goal - is it education or are you wanting to use the implementation in production code? If #2, is there a reason for avoidingcryptohash?{-# UNPACK #-}pragma before eachWord32field)generateScheduleis actually the bottleneck here? Ifblkwisn't already evaluated, the cost of evaluating it will probably be attributed togenerateSchedule. If it is evaluated, then representing it with a list is probably not the right thing anyway.