I am switching some of my DataContractSerializer usage over to protocol-buffers serialization (specifically using protobuf-net) with the goal of faster serialization and smaller serialized data size for storing in a database blob.
I found that changing my object model has a big impact on the message size. I take this to mean that my serialized data is being artificially inflated due to my choice of object model, and I'd like to fix that.
Specifically, my question is: could I change my protobuf-net usage, or possibly switch serialization libraries, to get a smaller message size? I'll give the object model and what I've been able to figure out so far below.
In my case I'm serializing OCR data... here is a simplified object model:
[ProtoContract(SkipConstructor = true, UseProtoMembersOnly = true)]
public class OcrTable
{
[ProtoMember(1)]
public List<OcrTableCell> Cells;
}
[ProtoContract(SkipConstructor = true, UseProtoMembersOnly = true)]
public class OcrTableCell
{
[ProtoMember(1)]
public int Row;
[ProtoMember(2)]
public int Column;
[ProtoMember(3)]
public int RowSpan;
//...
[ProtoMember(10)]
public int Height;
[ProtoMember(11)]
public List<OcrCharacter> Characters;
}
[ProtoContract(SkipConstructor = true, UseProtoMembersOnly = true)]
public class OcrCharacter
{
[ProtoMember(1)]
public int Code;
[ProtoMember(2)]
public int Data;
[ProtoMember(3)]
public int Confidence;
//...
[ProtoMember(11)]
public int Width;
}
Since the data is ultimately just a bunch of associated primitives (mostly ints), I assume packed encoding would be a big win, but in the current class structure all of the actual lists are lists of custom types, which can't be packed.
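For context, my understanding of the wire format: an unpacked repeated field repeats the field header for every element, while a packed field writes a single header and length followed by the raw varints back-to-back. A minimal sketch of the two declarations (type names are mine):

[ProtoContract]
public class UnpackedInts
{
    [ProtoMember(1)] // one field header per element: roughly 2 bytes per small int
    public List<int> Values;
}
[ProtoContract]
public class PackedInts
{
    [ProtoMember(1, IsPacked = true)] // one header + length, then raw varints: roughly 1 byte per small int
    public List<int> Values;
}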
To allow for packed encoding, I tinkered with dropping the custom types altogether and keeping multiple lists of primitives, correlated by position. For example:
[ProtoContract(SkipConstructor = true, UseProtoMembersOnly = true)]
public class OcrTableCell
{
[ProtoMember(1)]
public int Row;
//...
[ProtoMember(10)]
public int Height;
[ProtoMember(11, IsPacked=true)]
public List<int> CharacterCode;
[ProtoMember(12, IsPacked=true)]
public List<int> CharacterData;
//...
[ProtoMember(21, IsPacked=true)]
public List<int> CharacterWidth;
}
Here you can see I replaced List<OcrCharacter> with multiple lists: one per field of OcrCharacter. This has a fairly large impact on serialized data size, in some cases reducing it by two-thirds (even after gzipping).
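The mapping between the two shapes is mechanical; a rough sketch of the transposition (FlatOcrTableCell is my name for the flattened OcrTableCell above, renamed here so both types can coexist):

static FlatOcrTableCell Transpose(OcrTableCell cell)
{
    var flat = new FlatOcrTableCell
    {
        Row = cell.Row,
        //...
        Height = cell.Height,
        CharacterCode = new List<int>(cell.Characters.Count),
        CharacterData = new List<int>(cell.Characters.Count),
        CharacterWidth = new List<int>(cell.Characters.Count),
    };
    foreach (var ch in cell.Characters)
    {
        // the i-th entry of every list describes the i-th character
        flat.CharacterCode.Add(ch.Code);
        flat.CharacterData.Add(ch.Data);
        flat.CharacterWidth.Add(ch.Width);
    }
    return flat;
}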
I don't think it's practical to make changes like these to my object model just to support serialization, and keeping a second "helper" model purely for serialization seems undesirable.
Still, it bugs me that my serialized data size is artificially inflated just because of the shape of the object model.
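For reference, the sizes quoted above come from something like the following (a rough sketch; it assumes protobuf-net's Serializer.Serialize and System.IO.Compression.GZipStream):

using System.IO;
using System.IO.Compression;
using ProtoBuf;

static (long raw, long gzipped) MeasureSizes<T>(T obj)
{
    using var ms = new MemoryStream();
    Serializer.Serialize(ms, obj);            // protobuf-net serialization
    using var compressed = new MemoryStream();
    using (var gz = new GZipStream(compressed, CompressionMode.Compress, leaveOpen: true))
    {
        ms.Position = 0;
        ms.CopyTo(gz);                        // gzip the serialized bytes
    }
    return (ms.Length, compressed.Length);
}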
Is there a better choice of serialization parameters, or a different library, to serialize this type of object graph more compactly? I did try setting DataFormat=DataFormat.Group on the ProtoMember attributes applied to the lists, but saw no change in the message size, which confused me.
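Concretely, that attempt looked like this:

[ProtoMember(11, DataFormat = DataFormat.Group)]
public List<OcrCharacter> Characters;

(My guess at why the size didn't move: group encoding just swaps the length prefix for an end-group tag, so for small sub-messages the totals come out about the same - but I'd welcome confirmation.)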
From the comment thread on this question:

"I suspect I can get it working - basically, the rule would be: 'to enable packed transposition, all the members on the type must be of the same type, and that type must be a primitive that is compatible with packed encoding; the field-number specified on the field would map to the 1-based position in the transposed sequence' - does that sound reasonable? I'm also tempted to add 'field numbers of the members must be contiguous'. This would all seem to describe your OcrCharacter type quite well. Thoughts?"

"TransposePacked=true on the [ProtoContract] ... would that work?"

"[...] Point objects or similar. Depending on the data (like in my case) the packed bits encoding has just outstanding performance (lots of runs in the data) and I hated to miss out on it."
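If that option ever materializes, my reading of the proposed rule is that usage would look something like this (purely hypothetical - TransposePacked is not an existing protobuf-net option):

[ProtoContract(TransposePacked = true)] // hypothetical option from the comments above
public class OcrCharacter
{
    // Per the proposed rule: every member is the same packed-compatible
    // primitive, field numbers are contiguous, and each field number gives
    // the member's 1-based position in the transposed sequence.
    [ProtoMember(1)]
    public int Code;
    [ProtoMember(2)]
    public int Data;
    [ProtoMember(3)]
    public int Confidence;
    //...
    [ProtoMember(11)]
    public int Width;
}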