I'm writing an application that interacts with a large (10-1000 GB) memory-mapped binary file, holding essentially a bunch of objects that refer to each other. I've come up with a mechanism to read/write this data which is effective, but ugly and verbose (imo).
Q: Is there a more elegant way to achieve what I've done?
I have a typeclass for structured data, with one method that reads the structure into a Haskell datatype (DataOp is a ReaderT around IO).
class DBStruct a where
structRead :: Addr a -> DataOp a
To make this more readable, I have another typeclass that defines what structure members go where:
class DBStruct st => StructMem structTy valTy name | structTy name -> valTy where
offset :: structTy -> valTy -> name -> Int64
I have a few helper functions that use the offset method for reading/writing structure elements, for reading structures from stored references, and for lazily deferring structure reads (to allow lazy reading of the entire file).
The problem with this is that it involves a lot of repetition to use. For one structure, I first have to define the Haskell type:
data RowBlock = RowBlock {rbNext :: Maybe RowBlock
,rbPrev :: Maybe RowBlock
,rbRows :: [RowTy]
}
Then the name types:
data Next = Next
data Prev = Prev
data Count = Count
newtype Row = Row Int64
Then the instances for each structure member:
instance StructMem RowBlock (Maybe (Addr RowBlock)) Next where offset _ _ _ = 0
instance StructMem RowBlock (Maybe (Addr RowBlock)) Prev where offset _ _ _ = 8
instance StructMem RowBlock Int64 Count where offset _ _ _ = 16
instance StructMem RowBlock RowTy Row where offset _ _ (Row n) = 24 + n * 8
Then the structure read method:
instance DBStruct RowBlock where
structRead a = do
n <- elemMaybePtr a Next
p <- elemMaybePtr a Prev
c <- elemRead a Count
rs <- mapM (elemRead a . Row) [0 .. c-1]
return $ RowBlock n p rs
So all I've really accomplished is to re-implement C structs in a much more verbose (and slow) way. I would be happier if this were more concise while preserving type safety. Surely this is a commonly encountered problem.
A few possible alternatives I can think of are:
- Ditch memory-mapped files and use
Data.Binary, writing theByteStringsto disk the normal way. - Use
deriving Genericto create generic read and write functions - Overload Functional References
- Do something magical with monadic lenses.
EDIT: SSCCE as requested