Storing large structured binary data with Haskell

Question

I'm writing an application that interacts with a large (10-1000 GB) memory-mapped binary file, holding essentially a bunch of objects that refer to each other. I've come up with a mechanism to read/write this data which is effective, but ugly and verbose (imo).

Q: Is there a more elegant way to achieve what I've done?

I have a typeclass for structured data, with one method that reads the structure into a Haskell datatype (DataOp is a ReaderT around IO).

class DBStruct a where
    structRead :: Addr a -> DataOp a

To make this more readable, I have another typeclass that defines what structure members go where:

class DBStruct st => StructMem structTy valTy name | structTy name -> valTy where
    offset :: structTy -> valTy -> name -> Int64

I have a few helper functions that use the offset method for reading/writing structure elements, for reading structures from stored references, and for lazily deferring structure reads (to allow lazy reading of the entire file).

The problem with this is that it involves a lot of repetition to use. For one structure, I first have to define the Haskell type:

data RowBlock = RowBlock {rbNext :: Maybe RowBlock
                         ,rbPrev :: Maybe RowBlock
                         ,rbRows :: [RowTy]
                         }

Then the name types:

data Next = Next
data Prev = Prev
data Count = Count
newtype Row = Row Int64

Then the instances for each structure member:

instance StructMem RowBlock (Maybe (Addr RowBlock)) Next where offset _ _ _ = 0
instance StructMem RowBlock (Maybe (Addr RowBlock)) Prev where offset _ _ _ = 8
instance StructMem RowBlock Int64 Count where offset _ _ _ = 16
instance StructMem RowBlock RowTy Row where offset _ _ (Row n) = 24 + n * 8

Then the structure read method:

instance DBStruct RowBlock where
    structRead a = do
        n <- elemMaybePtr a Next
        p <- elemMaybePtr a Prev
        c <- elemRead a Count
        rs <- mapM (elemRead a . Row) [0 .. c-1]
        return $ RowBlock n p rs

So all I've really accomplished is to re-implement C structs in a much more verbose (and slow) way. I would be happier if this were more concise while preserving type safety. Surely this is a commonly encountered problem.

A few possible alternatives I can think of are:

Ditch memory-mapped files and use Data.Binary, writing the ByteStrings to disk the normal way.
Use deriving Generic to create generic read and write functions
Overload Functional References
Do something magical with monadic lenses.

EDIT: SSCCE as requested

Could you provide a simplified, self-contained example that can be compiled, something to start with? — Petr
– Petr, Commented Aug 7, 2013 at 15:21
@PetrPudlák I can do that later today if you think it'll help people come up with an answer. However, this was meant to be more of an architectural question than a why-isn't-my-code-working question; the code I did post is meant to be more illustrative of my current architecture than anything else. — Dan
– Dan, Commented Aug 7, 2013 at 15:59
Yes, SSCCE would help me to better understand your current design. — Petr
– Petr, Commented Aug 7, 2013 at 16:28
What operations you need to perform on the files? Could you give an example of such a sequence of operations? — Petr
– Petr, Commented Aug 7, 2013 at 16:53

Levi Pearson · Accepted Answer · 2013-08-08 17:24:38Z

1

You might try using Data.Binary with your Ptrs.

For writing:

Use Data.Binary to build a ByteString. A ByteString is a tuple (ForeignPtr Word8, Int, Int) that holds the address, offset, and length where the data is stored. You can use the Data.ByteString.Internal package to get toForeignPtr, which will unpack the tuple for you. Foreign.ForeignPtr provides withForeignPtr, which takes a function that performs an IO action via the pointer. In there you can memcpy (a binding for this is also provided in Data.ByteString.Internal) the bytestring storage to the mmapped Ptr you got from mmap.

For reading:

You can use Data.ByteString.Internal's fromForiegnPtr to turn a Ptr into a bytestring. This is basically what the mmap libraries do, but you could do it a record at a time instead of with the entire region. Once you have a ByteString view on the memory, you can unpack it with Data.Binary.

Another option is to take advantage of the fact that ByteString has an alternative implementation in Data.Vector.Storable.ByteString, which would let you use the Storable interface you're using now to read/write them to the mmaped Ptrs. The interface and basic type is isomorphic to the Data.ByteString one, but it's got the Storable instances.

edited Aug 8, 2013 at 17:24

answered Aug 7, 2013 at 20:29

Levi Pearson

4,9941 gold badge19 silver badges16 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Dan Over a year ago

I need to be able to write to the file as well though.

Collectives™ on Stack Overflow

Storing large structured binary data with Haskell

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related