Optimizing ByteString escaping

Question

I wrote a string escaping function in C, and I'm trying to rewrite it in Haskell. The C version, along with comments explaining why it does what it does, can be found on GitHub.

Here's a naïve implementation based on Data.ByteString.concatMap:

{-# LANGUAGE ViewPatterns #-}
import Data.Bits
import Data.ByteString          (ByteString)
import Data.ByteString.Char8    ()
import Data.ByteString.Internal (c2w, w2c)

import qualified Data.ByteString        as B
import qualified Data.ByteString.Lazy   as L

escapeCopyBytea :: ByteString -> ByteString
escapeCopyBytea = B.concatMap f
    where
        f (w2c -> '\\') = B.replicate 4 (c2w '\\')
        f c | c >= 32 && c <= 126 = B.singleton c
        f c = B.pack
                [ c2w '\\'
                , c2w '\\'
                , c2w '0' + ((c `shiftR` 6) .&. 0x7)
                , c2w '0' + ((c `shiftR` 3) .&. 0x7)
                , c2w '0' + (c .&. 0x7)
                ]

mapChunks :: (ByteString -> ByteString) -> L.ByteString -> L.ByteString
mapChunks f = L.fromChunks . map f . L.toChunks

main :: IO ()
main = L.getContents >>= L.putStr . mapChunks escapeCopyBytea

I'd expect this to be a few times slower than the C version. Nope, it is 125 times slower than the C version.
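To make the escaping rule itself concrete, here is a minimal list-based reference sketch. The names `escapeByteRef` and `escapeRef` are hypothetical helpers introduced only for illustration; the sketch is meant for checking outputs on small inputs, not for speed.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Word (Word8)

-- Hypothetical reference helper: escape one byte into its output bytes.
escapeByteRef :: Word8 -> [Word8]
escapeByteRef 92 = [92, 92, 92, 92]        -- a backslash becomes four backslashes
escapeByteRef c
  | c >= 32 && c <= 126 = [c]              -- printable ASCII passes through
  | otherwise           = [92, 92, digit 6, digit 3, digit 0]
  where
    -- One octal digit of c as an ASCII character ('0' is 48).
    digit n = 48 + ((c `shiftR` n) .&. 0x7)

-- Escape a whole input by escaping every byte in turn.
escapeRef :: [Word8] -> [Word8]
escapeRef = concatMap escapeByteRef

main :: IO ()
main = do
  print (escapeRef [65])   -- [65]
  print (escapeRef [92])   -- [92,92,92,92]
  print (escapeRef [10])   -- [92,92,48,49,50], i.e. \\012
```

A byte such as 10 (newline) is octal 012, so it comes out as two backslashes followed by the ASCII digits 0, 1 and 2.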

Then, I tried using blaze-builder:

import Blaze.ByteString.Builder
import Data.Bits
import Data.Monoid              (mappend, mconcat, mempty)
import Data.ByteString          (ByteString)
import Data.Word                (Word8)

import qualified Data.ByteString        as B
import qualified Data.ByteString.Lazy   as L

writeBackslash :: Write
writeBackslash = writeWord8 92

escape1 :: Word8 -> Builder
escape1 92 = fromWrite $ writeBackslash
               `mappend` writeBackslash
               `mappend` writeBackslash
               `mappend` writeBackslash
escape1 c | c >= 32 && c <= 126 = fromWrite $ writeWord8 c
          | otherwise = fromWrite $ writeBackslash
                          `mappend` writeBackslash
                          `mappend` writeWord8 (48 + ((c `shiftR` 6) .&. 0x7))
                          `mappend` writeWord8 (48 + ((c `shiftR` 3) .&. 0x7))
                          `mappend` writeWord8 (48 + (c .&. 0x7))

escapeCopyBytea2 :: ByteString -> Builder
escapeCopyBytea2 = B.foldl' f mempty
    where
        f b c = b `mappend` escape1 c

main :: IO ()
main = L.getContents >>= L.putStr . toLazyByteString . mconcat
                       . map escapeCopyBytea2 . L.toChunks

This made a difference. It's even slower, 300 times slower than the C version. I thought blaze-builder was supposed to be really fast!

Simply summing bytes, by folding over the input in a similar fashion as I do above, is reasonably fast (takes 5 times longer than the C version of the escaping code):

import qualified Data.ByteString      as B
import qualified Data.ByteString.Lazy as L

import Data.List (foldl')

main :: IO ()
main = L.getContents >>= print . foldl' (+) 0 . map (B.foldl' (+) 0) . L.toChunks

What can I do to make this escaping function faster? Is it possible to get anywhere near the performance of C here, or is C (or perhaps working with Ptr or Addr# directly and doing it the C way) the only viable option to make this efficient?

Edit: I wrote a much faster implementation and put it on GitHub. However, it uses a lot of ugly buffer manipulation. I'd still like to know if there's a simpler way to escape bytes in Haskell that isn't slow.

Peter Wortmann · Accepted Answer · 2012-03-16 19:09:43Z

Whew, this is quite a tricky one. The main problem here is that you really, really need to keep Haskell from allocating anything in the relatively tight inner loop. Otherwise, you are looking at a dozen bytes going through the garbage collector for every byte you escape.

As you note, blaze-builder has a reputation for being fast, mainly for being pretty good at this. Let's start off with an explanation why that is. Here's roughly how a Builder is defined (see Blaze.ByteString.Builder.Internal.Types):

newtype Builder = Builder { unBuilder :: forall r. BuildStep r -> BuildStep r }
newtype BuildStep a = BuildStep { runBuildStep :: BufRange -> IO (BuildSignal a) }
data BufRange = BufRange !(Ptr Word8) !(Ptr Word8)

The compiler will replace all newtype wrappers with simple "casts", and un-box BufRange where possible. So this means that, say, your escape1 function would actually have an after-optimization type closer to:

escape1 :: Word8 -> (BufRange -> IO (BuildSignal a)) -> BufRange -> IO (BuildSignal a)

So this actually just takes the pointers from BufRange to write data using the IO monad, then finally calls the continuation. Ideally, we even know the continuation in question and can jump directly to the code. At no point does this require heap allocation; everything gets nicely optimized out.
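The continuation-passing idea can be sketched with a toy model. Everything below (`ToyBuilder`, `Step`, `byte`, `append`, `runToy`) is an assumption made for illustration and is not blaze-builder's real API; in particular the toy has no buffer-full signalling and does no bounds checking.

```haskell
{-# LANGUAGE RankNTypes #-}
import Data.Word (Word8)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Marshal.Array (peekArray)
import Foreign.Ptr (Ptr, minusPtr, plusPtr)
import Foreign.Storable (poke)

-- Toy stand-in for BuildStep: receives the current write position.
type Step r = Ptr Word8 -> IO r

-- Toy stand-in for Builder: transforms "what happens next" into
-- "write my bytes, then do what happens next".
newtype ToyBuilder = ToyBuilder (forall r. Step r -> Step r)

-- Write one byte, then call the continuation with the advanced pointer.
byte :: Word8 -> ToyBuilder
byte w = ToyBuilder (\k p -> poke p w >> k (p `plusPtr` 1))

-- Sequencing is plain function composition; the left builder writes first.
append :: ToyBuilder -> ToyBuilder -> ToyBuilder
append (ToyBuilder f) (ToyBuilder g) = ToyBuilder (f . g)

-- Run into a scratch buffer of the given capacity and read back what was written.
runToy :: Int -> ToyBuilder -> IO [Word8]
runToy cap (ToyBuilder f) =
  allocaBytes cap $ \start ->
    f (\end -> peekArray (end `minusPtr` start) start) start

main :: IO ()
main = runToy 16 (byte 92 `append` byte 92 `append` byte 48) >>= print
```

In this model a builder never returns a data structure describing its output; it only pokes bytes and passes the new pointer along, which is why the optimized code can avoid heap allocation when the continuation is known.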

So, what's the problem in your case? The trouble is the "glue code". Let us look at the worker loop of ByteString's foldl' (the function your code calls), with some additional type annotations:

lgo :: Builder -> Addr# -> Addr# -> IO Builder
lgo !z !p !q | p == q    = return z
             | otherwise = do c <- peek p
                              lgo (f z c) (p `plusPtr` 1) q

For every byte, this worker loop forces f in order to get the new "accumulated value". Unfortunately, that's a pretty bad idea here, as our accumulator is a Builder - a closure (as we saw above). Therefore this will actually allocate a huge list of closures, but not start producing anything of value until the very end.

So, how to improve this? Remember again that our f has type ... -> BuildStep r -> BuildStep r somewhere deep down. We would actually like to pass lgo as the BuildStep continuation, so after inlining it becomes a simple jump. We also don't really need the accumulation parameter, as the Builder is taking care of accumulating its result by itself.

Going through with this, we actually get a lazy right fold:

lgo :: Addr# -> Addr# -> BuildStep r -> BuildStep r
lgo !p !q | p == q    = v
          | otherwise = f (inlinePerformIO (peek p)) (lgo (p `plusPtr` 1) q)

(Note I also removed the IO by using inlinePerformIO, which might be unsafe in other usage scenarios, as any addresses escaping due to lazy evaluation might point to a garbage-collected buffer later!)
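The difference between the two fold shapes can be seen in a small model. Here `Out` is a hypothetical stand-in for a Builder (a function that prepends its output to whatever follows), and `leftBuild` and `rightBuild` are illustrative names, not functions from any library.

```haskell
import Data.List (foldl')
import Data.Word (Word8)

-- Hypothetical stand-in for a Builder: prepend output to "the rest".
type Out = [Word8] -> [Word8]

emit :: Word8 -> Out
emit c = (c :)

-- Left fold: the accumulator is a function, so forcing it only yields a
-- longer chain of compositions. Nothing is produced until the whole input
-- has been consumed and the final function is applied.
leftBuild :: [Word8] -> Out
leftBuild = foldl' (\acc c -> acc . emit c) id

-- Right fold: each element's output is followed directly by the rest,
-- so the first bytes are available before the remaining input is touched.
rightBuild :: [Word8] -> Out
rightBuild = foldr (\c rest -> emit c . rest) id

main :: IO ()
main = do
  print (leftBuild  [1, 2, 3] [])              -- [1,2,3]
  print (rightBuild [1, 2, 3] [])              -- [1,2,3]
  -- Only the right fold can start producing from an unbounded input.
  print (take 3 (rightBuild (cycle [7]) []))   -- [7,7,7]
```

Both folds give the same result on finite input; the point of the sketch is only when the work happens and how much has to be held in memory before any output appears.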

The type signature looks a lot better now - and from my testing this actually performs roughly twice as fast as the original version. Where does the rest of the performance get lost? Well, a look at the Core shows code like

lgo = \p q -> case p == q of
                True -> \r -> let ... in \stp -> ...
                False ->

This means that GHC is generating a lot of partially-applied functions - primarily because the optimizer doesn't want to lose sharing in case we want to call lgo with the same p and q, but different BuildSteps (which we never will, but GHC can't prove that).

So we need to force GHC to use higher arity for this function's implementation. Unfortunately, I am somewhat out of ideas how to accomplish that in an elegant way. So here's the manual solution -- writing our lambdas out at the top of the expression:

foldBuilder :: (Word8 -> Builder) -> B.ByteString -> Builder
foldBuilder f (PS x s l) =
  fromBuildStepCont $ \cont range ->
    withForeignPtr x $ \ptr -> do
      let lgo !p !q !range'
            | p == q    = cont range'
            | otherwise = do
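                c <- peek p
                -- Keep the input buffer alive: the continuation may run
                -- outside withForeignPtr (see the comments below).
                touchForeignPtr x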
                let p' = p `plusPtr` 1
                    step = unBuilder (f c) (BuildStep $ lgo p' q)
                runBuildStep step range'
      lgo (ptr `plusPtr` s) (ptr `plusPtr` (s+l)) range

I get decent performance out of this version here, and I suppose it's slightly better than implementing it completely by hand. It also happens to not require inlinePerformIO at all any more, which is a nice plus.

Code can be found at GitHub. I hope this helps in some way :)
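For comparison, a different technique from the one above: the bytestring package ships its own builder, and its Data.ByteString.Builder.Prim module is designed for per-byte encodings like this one. The following is a minimal sketch assuming a reasonably recent bytestring; `escapeByte` and `escapeCopyBytea3` are hypothetical names, and I have not benchmarked it against the versions discussed here.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.ByteString (ByteString)
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L
import Data.ByteString.Builder (Builder, toLazyByteString)
import Data.ByteString.Builder.Prim
  ( BoundedPrim, condB, liftFixedToBounded
  , primMapByteStringBounded, word8, (>$<), (>*<) )
import Data.Word (Word8)

-- Per-byte encoder: every branch writes a fixed number of bytes (4, 1 or 5).
escapeByte :: BoundedPrim Word8
escapeByte =
  condB (== 92) (liftFixedToBounded (backslashes >$< four)) $
  condB (\c -> c >= 32 && c <= 126) (liftFixedToBounded word8) $
  liftFixedToBounded (octal >$< five)
  where
    four = word8 >*< word8 >*< word8 >*< word8
    five = word8 >*< four
    -- A backslash turns into four backslashes.
    backslashes _ = (92, (92, (92, 92)))
    -- Anything else non-printable turns into \\ooo with octal digits.
    octal c = (92, (92, (digit c 6, (digit c 3, digit c 0))))
    digit c n = 48 + ((c `shiftR` n) .&. 0x7)

-- Hypothetical name, following the question's numbering.
escapeCopyBytea3 :: ByteString -> Builder
escapeCopyBytea3 = primMapByteStringBounded escapeByte

main :: IO ()
main = print (L.unpack (toLazyByteString (escapeCopyBytea3 (B.pack [65, 92, 0]))))
```

To process standard input as in the question, the same function can be mapped over `L.toChunks` and the resulting builders combined with `mconcat` before calling `toLazyByteString`.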

Neat, thanks! It runs in only 2 seconds on my test case (slower than my version, but at least it's simpler). However, it produces the wrong output when fed a hundred megabytes of data from /dev/urandom (I'm still troubleshooting). Also, I don't think using c2w instead of magic numbers like 92 and 48 impacts performance, but I could be wrong.
– Joey Adams, Commented Mar 16, 2012 at 20:04
The problem appears to be premature garbage collection. If I touch all the input chunks at the end (leaking memory), the program produces the correct result.
– Joey Adams, Commented Mar 16, 2012 at 20:50
Just add touchForeignPtr x after the peek. It fixes the problem, and does not affect performance.
– Joey Adams, Commented Mar 16, 2012 at 20:56
Interesting. I thought withForeignPtr was a bit more robust than that... I'll have to investigate more sometime.
– Peter Wortmann, Commented Mar 17, 2012 at 13:36
The problem is that withForeignPtr simply touches the foreign pointer after the inner computation completes. Accesses to ptr afterward may be invalidated by garbage collection. runBuildStep sometimes returns a continuation (BufferFull or InsertByteString) which is called outside of withForeignPtr.
– Joey Adams, Commented Mar 17, 2012 at 14:33
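The pattern described in these comments can be sketched with base alone. `deferredReader` is a hypothetical helper standing in for a continuation that runs after the enclosing withForeignPtr has already returned; the sketch shows where the touch goes, and is not a reproduction of the original failure.

```haskell
import Data.Word (Word8)
import Foreign.ForeignPtr
  (ForeignPtr, mallocForeignPtrBytes, touchForeignPtr, withForeignPtr)
import Foreign.ForeignPtr.Unsafe (unsafeForeignPtrToPtr)
import Foreign.Storable (peekByteOff, pokeByteOff)

-- Hypothetical helper: an action that reads byte i whenever it is run,
-- possibly long after any withForeignPtr block has finished.
deferredReader :: ForeignPtr Word8 -> Int -> IO Word8
deferredReader fp i = do
  c <- peekByteOff (unsafeForeignPtrToPtr fp) i
  -- The action references fp, and touching it after the read keeps the
  -- buffer alive at least until this point.
  touchForeignPtr fp
  return c

main :: IO ()
main = do
  fp <- mallocForeignPtrBytes 4
  -- Fill the buffer with 10, 11, 12, 13 inside withForeignPtr.
  withForeignPtr fp $ \p ->
    mapM_ (\i -> pokeByteOff p i (fromIntegral (i + 10) :: Word8)) [0 .. 3]
  let readLater = deferredReader fp 2   -- nothing is read yet
  readLater >>= print                   -- 12
```

The raw pointer obtained from a ForeignPtr carries no liveness information of its own, so any code that uses it outside withForeignPtr has to keep the ForeignPtr reachable and touch it after the last access.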
