Skip to main content
added 282 characters in body
Source Link
Joey Adams
  • 808
  • 5
  • 13

Edit: I wrote a much faster implementation and put it on GitHub. However, it uses a lot of ugly buffer manipulation. I'd still like to know if there's a simpler way to escape bytes in Haskell that isn't slow.

Edit: I wrote a much faster implementation and put it on GitHub. However, it uses a lot of ugly buffer manipulation. I'd still like to know if there's a simpler way to escape bytes in Haskell that isn't slow.

Tweeted twitter.com/#!/StackCodeReview/status/179939991089192960
Source Link
Joey Adams
  • 808
  • 5
  • 13

Optimizing ByteString escaping

I wrote a string escaping function in C, and I'm trying to rewrite it to Haskell. The C version, along with comments explaining why it does what it does, can be found on GitHub.

Here's a naïve implementation based on Data.ByteString.concatMap:

{-# LANGUAGE ViewPatterns #-}
import Data.Bits
import Data.ByteString          (ByteString)
import Data.ByteString.Char8    ()
import Data.ByteString.Internal (c2w, w2c)

import qualified Data.ByteString        as B
import qualified Data.ByteString.Lazy   as L

escapeCopyBytea :: ByteString -> ByteString
escapeCopyBytea = B.concatMap f
    where
        f (w2c -> '\\') = B.replicate 4 (c2w '\\')
        f c | c >= 32 && c <= 126 = B.singleton c
        f c = B.pack
                [ c2w '\\'
                , c2w '\\'
                , c2w '0' + ((c `shiftR` 6) .&. 0x7)
                , c2w '0' + ((c `shiftR` 3) .&. 0x7)
                , c2w '0' + (c .&. 0x7)
                ]

mapChunks :: (ByteString -> ByteString) -> L.ByteString -> L.ByteString
mapChunks f = L.fromChunks . map f . L.toChunks

main :: IO ()
main = L.getContents >>= L.putStr . mapChunks escapeCopyBytea

I'd expect this to be a few times slower than the C version. Nope, it is 125 times slower than the C version.

Then, I tried using blaze-builder:

import Blaze.ByteString.Builder
import Data.Bits
import Data.Monoid              (mappend, mconcat, mempty)
import Data.ByteString          (ByteString)
import Data.Word                (Word8)

import qualified Data.ByteString        as B
import qualified Data.ByteString.Lazy   as L

writeBackslash :: Write
writeBackslash = writeWord8 92

escape1 :: Word8 -> Builder
escape1 92 = fromWrite $ writeBackslash
               `mappend` writeBackslash
               `mappend` writeBackslash
               `mappend` writeBackslash
escape1 c | c >= 32 && c <= 126 = fromWrite $ writeWord8 c
          | otherwise = fromWrite $ writeBackslash
                          `mappend` writeBackslash
                          `mappend` writeWord8 (48 + ((c `shiftR` 6) .&. 0x7))
                          `mappend` writeWord8 (48 + ((c `shiftR` 3) .&. 0x7))
                          `mappend` writeWord8 (48 + (c .&. 0x7))

escapeCopyBytea2 :: ByteString -> Builder
escapeCopyBytea2 = B.foldl' f mempty
    where
        f b c = b `mappend` escape1 c

main :: IO ()
main = L.getContents >>= L.putStr . toLazyByteString . mconcat
                       . map escapeCopyBytea2 . L.toChunks

This made a difference. It's even slower, 300 times slower than the C version. I thought blaze-builder was supposed to be really fast!

Simply summing bytes, by folding over the input in a similar fashion as I do above, is reasonably fast (takes 5 times longer than the C version of the escaping code):

import qualified Data.ByteString      as B
import qualified Data.ByteString.Lazy as L

import Data.List (foldl')

main :: IO ()
main = L.getContents >>= print . foldl' (+) 0 . map (B.foldl' (+) 0) . L.toChunks

What can I do to make this escaping function faster? Is it possible to get anywhere near the performance of C here, or is C (or perhaps working with Ptr or Addr# directly and doing it the C way) the only viable option to make this efficient?