
I am switching some of my DataContractSerializer usage over to protocol-buffers serialization (specifically using protobuf-net) with the goal of faster serialization and smaller serialized data size for storing in a database blob.

I found that changing my object model has a big impact on the message size. I take this to mean that my serialized data is being artificially inflated due to my choice of object model, and I'd like to fix that.

Specifically my question is: could I change my protobuf-net usage, or possibly serialization library, to get a smaller message size? I'll give an object model and what I have been able to figure out so far below.

In my case I'm serializing OCR data... here is a simplified object model:

[ProtoContract(SkipConstructor = true, UseProtoMembersOnly = true)]
public class OcrTable
{
    [ProtoMember(1)]        
    public List<OcrTableCell> Cells;
}

[ProtoContract(SkipConstructor = true, UseProtoMembersOnly = true)]
public class OcrTableCell
{
    [ProtoMember(1)]
    public int Row;
    [ProtoMember(2)]
    public int Column;
    [ProtoMember(3)]
    public int RowSpan;

    //...

    [ProtoMember(10)]
    public int Height;

    [ProtoMember(11)]
    public List<OcrCharacter> Characters;
}

[ProtoContract(SkipConstructor = true, UseProtoMembersOnly = true)]
public class OcrCharacter
{
    [ProtoMember(1)]
    public int Code;
    [ProtoMember(2)]
    public int Data;
    [ProtoMember(3)]
    public int Confidence;

    //...

    [ProtoMember(11)]
    public int Width;
}
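For reference, the serialization itself is just the standard protobuf-net round-trip to a byte[] for the blob column; a minimal sketch (the table variable here is assumed to be a populated OcrTable):

// requires System.IO and the ProtoBuf namespace
byte[] blob;
using (var ms = new MemoryStream())
{
    Serializer.Serialize(ms, table);
    blob = ms.ToArray();   // stored in the database blob
}

OcrTable roundTripped;
using (var ms = new MemoryStream(blob))
{
    roundTripped = Serializer.Deserialize<OcrTable>(ms);
}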

Since the data is ultimately just a bunch of associated primitives (mostly ints), I assume packed-bits serialization would be helpful, but in the current class structure all the actual lists are of custom types.

To allow for packed bits serialization, I tinkered with dropping the custom types altogether, and having multiple lists of primitives, correlated by their sequence. For example:

[ProtoContract(SkipConstructor = true, UseProtoMembersOnly = true)]
public class OcrTableCell
{
    [ProtoMember(1)]
    public int Row;

    //...

    [ProtoMember(10)]
    public int Height;

    [ProtoMember(11, IsPacked=true)]
    public List<int> CharacterCode;

    [ProtoMember(12, IsPacked=true)]
    public List<int> CharacterData;

    //...

    [ProtoMember(21, IsPacked=true)]
    public List<int> CharacterWidth;
}

Here you can see I replaced List<OcrCharacter> with multiple lists: one for each field of OcrCharacter. This has a fairly large impact on serialized data size, in some cases reducing it by two-thirds (even after gzipping).

I don't think it's practical to make changes like these to my object model just to support serialization ... and keeping a second "helper" model to prepare for serialization seems undesirable.
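For what it's worth, if I did keep a helper model, the transposition itself is mechanical. A rough sketch, with all names my own invention (nothing here comes from protobuf-net):

// Hypothetical hand-rolled transposer from the original model into the
// packed-friendly model shown above.
public static void TransposeCharacters(List<OcrCharacter> source, OcrTableCell target)
{
    target.CharacterCode  = new List<int>(source.Count);
    target.CharacterData  = new List<int>(source.Count);
    target.CharacterWidth = new List<int>(source.Count);
    foreach (var c in source)
    {
        target.CharacterCode.Add(c.Code);
        target.CharacterData.Add(c.Data);
        target.CharacterWidth.Add(c.Width);
    }
}

The code is simple enough; it's the duplication and upkeep of a parallel model that bothers me.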

Still, it bugs me that I have an artificially inflated serialized data size just because of the object model for the data.

Is there a better choice of serialization parameters, or a different library, to serialize this type of object graph? I did try setting DataFormat=DataFormat.Group on the ProtoMember attributes applied to lists, but saw no change in the message size, which confused me.
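Concretely, the Group attempt looked like this, on the list members of the original model:

[ProtoMember(11, DataFormat = DataFormat.Group)]
public List<OcrCharacter> Characters;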

  • I confess, I'm intrigued - it isn't impossible to write some kind of automatic transposer, mapping properties by index to "packed" elements by position. Can I think more on this and get back to you? Commented Jun 7, 2013 at 23:11
  • For OcrCharacter I suspect I can get it working - basically, the rule would be "to enable packed transposition, all the members on the type must be of the same type, and that type must be a primitive that is compatible with packed encoding; the field-number specified on the field would map to the 1-based position in the transposed sequence" - does that sound reasonable? I'm also tempted to add "field numbers of the members must be contiguous". This would all seem to describe your OcrCharacter type quite well. Thoughts? Commented Jun 7, 2013 at 23:26
  • Also: it would need to be elective (opt-in) - maybe some new TransposePacked=true on the [ProtoContract] (a sketch of this shape follows these comments)… would that work? Commented Jun 7, 2013 at 23:27
  • Heck yes that would work :) glad you're intrigued. I didn't think it was too pie-in-the-sky of an idea ... was thinking surely people serialize a lot of integer Point objects or similar. Depending on the data (like in my case) the packed bits encoding has just outstanding performance (lots of runs in the data) and I hated to miss out on it. Commented Jun 8, 2013 at 4:51
  • As for the type restrictions, you would know what is required for the implementation you have in mind. I was thinking the containing class would essentially pick up one list per field of the original class... that should maximize likelihood of runs hence maximum packed bits benefit. But then not sure why they would need to be of similar type, or even a primitive type (some types wouldn't be packed, but oh well). If I were to implement it manually this is what I had in mind. You might have something very different in mind, and indeed the restrictions you mention would work in my case. Commented Jun 8, 2013 at 5:01
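For reference, a sketch of the opt-in shape proposed in the comments above. To be clear, TransposePacked is hypothetical and does not exist in protobuf-net; this only illustrates the proposed rule that the field number maps to the 1-based position in the transposed sequence:

// Hypothetical opt-in; TransposePacked is NOT a real protobuf-net option.
[ProtoContract(SkipConstructor = true, TransposePacked = true)]
public class OcrCharacter
{
    [ProtoMember(1)]   // field number = 1-based position in transposed sequence
    public int Code;
    [ProtoMember(2)]
    public int Data;
    [ProtoMember(3)]
    public int Confidence;
    // ... contiguous field numbers, all members of the same packed-compatible type
}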

1 Answer


There is nothing inside protobuf-net that is going to magically rearrange your object model to exploit specific features; that requires detailed knowledge of the data, which is obvious to a human but pretty hard to generalize. Without investing significant time, the answer here is simply: it is going to serialize the data as it is laid out in the model - and if that isn't the perfect scenario, so be it.

As for the Group data-format not helping: grouped sub-messages only apply to things like List<OcrCharacter>. Since the field-number is 11, a group is guaranteed to need 2 bytes of overhead: 1 byte for the start-group marker and 1 byte for the end-group marker. The alternative is length-prefixed encoding, which needs 1 byte for the field-header plus a variable number of bytes for the length of the sub-message, encoded as a varint. If a sub-message is less than 128 bytes, the length still only requires one byte to encode (so 2 bytes overall) - which is probably why it isn't making any difference: each individual OcrCharacter is small enough (less than 128 bytes) that Group can't help.
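To make the arithmetic concrete: a protobuf field header byte is (fieldNumber << 3) | wireType, so for field 11 the two encodings cost the same while the payload is small:

const int field = 11;
int startGroup = (field << 3) | 3;   // 0x5B: start-group marker
int endGroup   = (field << 3) | 4;   // 0x5C: end-group marker
int lenHeader  = (field << 3) | 2;   // 0x5A: length-prefixed field header
// Group:           [0x5B] payload [0x5C]          -> always 2 bytes overhead
// Length-prefixed: [0x5A] [varint length] payload -> also 2 bytes overhead
//                  while the payload stays under 128 bytes (1-byte varint)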


1 Comment

Thanks for the Group explanation, that makes good sense. It also answers my question about the available options in protocol-buffers (and protobuf-net). I agree that a serializer can't know which optimizations are ideal without domain knowledge or some fancy heuristics, but it could expose a few options. I think of this case as the serializer just changing how it stores fairly simple structured data, comparable to the existing reference-tracking or packed-bits options.
