I wrote a string escaping function in C, and I'm trying to rewrite it to Haskell. The C version, along with comments explaining why it does what it does, can be found on GitHub.
Here's a naïve implementation based on Data.ByteString.concatMap:
{-# LANGUAGE ViewPatterns #-}
import Data.Bits
import Data.ByteString (ByteString)
import Data.ByteString.Char8 ()
import Data.ByteString.Internal (c2w, w2c)
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L
escapeCopyBytea :: ByteString -> ByteString
escapeCopyBytea = B.concatMap f
where
f (w2c -> '\\') = B.replicate 4 (c2w '\\')
f c | c >= 32 && c <= 126 = B.singleton c
f c = B.pack
[ c2w '\\'
, c2w '\\'
, c2w '0' + ((c `shiftR` 6) .&. 0x7)
, c2w '0' + ((c `shiftR` 3) .&. 0x7)
, c2w '0' + (c .&. 0x7)
]
mapChunks :: (ByteString -> ByteString) -> L.ByteString -> L.ByteString
mapChunks f = L.fromChunks . map f . L.toChunks
main :: IO ()
main = L.getContents >>= L.putStr . mapChunks escapeCopyBytea
I'd expect this to be a few times slower than the C version. Nope, it is 125 times slower than the C version.
Then, I tried using blaze-builder:
import Blaze.ByteString.Builder
import Data.Bits
import Data.Monoid (mappend, mconcat, mempty)
import Data.ByteString (ByteString)
import Data.Word (Word8)
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L
writeBackslash :: Write
writeBackslash = writeWord8 92
escape1 :: Word8 -> Builder
escape1 92 = fromWrite $ writeBackslash
`mappend` writeBackslash
`mappend` writeBackslash
`mappend` writeBackslash
escape1 c | c >= 32 && c <= 126 = fromWrite $ writeWord8 c
| otherwise = fromWrite $ writeBackslash
`mappend` writeBackslash
`mappend` writeWord8 (48 + ((c `shiftR` 6) .&. 0x7))
`mappend` writeWord8 (48 + ((c `shiftR` 3) .&. 0x7))
`mappend` writeWord8 (48 + (c .&. 0x7))
escapeCopyBytea2 :: ByteString -> Builder
escapeCopyBytea2 = B.foldl' f mempty
where
f b c = b `mappend` escape1 c
main :: IO ()
main = L.getContents >>= L.putStr . toLazyByteString . mconcat
. map escapeCopyBytea2 . L.toChunks
This made a difference. It's even slower, 300 times slower than the C version. I thought blaze-builder was supposed to be really fast!
Simply summing bytes, by folding over the input in a similar fashion as I do above, is reasonably fast (takes 5 times longer than the C version of the escaping code):
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L
import Data.List (foldl')
main :: IO ()
main = L.getContents >>= print . foldl' (+) 0 . map (B.foldl' (+) 0) . L.toChunks
What can I do to make this escaping function faster? Is it possible to get anywhere near the performance of C here, or is C (or perhaps working with Ptr or Addr# directly and doing it the C way) the only viable option to make this efficient?
Edit: I wrote a much faster implementation and put it on GitHub. However, it uses a lot of ugly buffer manipulation. I'd still like to know if there's a simpler way to escape bytes in Haskell that isn't slow.