
I need to write millions of Java POJOs to disk, and read them from disk, and I need to do it fast.

I would prefer to avoid having to define a separate template file, as I believe is required with Thrift and Google Protocol Buffers. Rather, it would be preferable if the Java class itself were the authoritative specification for the object (as with Java Serialization, Gson, and other serialization protocols). I realize that there may be a bit of a performance hit here, but that's OK provided it's not an order of magnitude slower.

The classes to be serialized consist of several simple long and String fields, and a single Map (where the values in this map are all either Numbers or Strings).

Can anyone suggest some libraries that I should look at for this?

  • Have you measured native Java serialization and seen that it wasn't fast enough? What time did you get, and what time do you want? Commented Sep 14, 2011 at 18:38
  • There isn't really a threshold above which it's good and below which it's bad. Faster is better. Native serialization may be fine; I'm just wondering whether there are some commonly understood faster approaches. Commented Sep 14, 2011 at 20:02
  • Re your "it would be preferable..." - I have a .NET version of protobuf that would work that way (code-first), but not Java; mentioned in case it applies to some later reader (see: protobuf-net) Commented Sep 14, 2011 at 20:42

1 Answer


Test first with Java serialization, and see if it's fast enough. It's built in, and is competent enough to handle graphs and multiple versions.

There is no reason to look for alternatives until you know you need them.

Edit: You will need to call reset() on the ObjectOutputStream so that its lookup table does not fill up with references to already-written objects. If you are writing relatively independent objects, resetting after every "top-level" object is probably not a problem, but if your data has complex relations, I suggest you try JPA or something else.
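As a rough illustration of the approach above, here is a minimal sketch of writing and reading such objects with built-in serialization, resetting the stream after each top-level object. The class `DataRecord`, its field names, and the helpers `writeAll`/`readAll` are hypothetical and only mirror the shape described in the question (long and String fields plus one Map whose values are Numbers or Strings). The leading record count is also just one assumed way to mark the end of the data.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PojoSerializationSketch {

    // Hypothetical POJO: the class itself is the only "schema".
    static class DataRecord implements Serializable {
        private static final long serialVersionUID = 1L;
        final long id;
        final String name;
        final Map<String, Object> attributes; // values are Numbers or Strings

        DataRecord(long id, String name, Map<String, Object> attributes) {
            this.id = id;
            this.name = name;
            this.attributes = attributes;
        }
    }

    // Writes a record count followed by the records themselves.
    static void writeAll(File file, List<DataRecord> records) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            out.writeInt(records.size());
            for (DataRecord r : records) {
                out.writeObject(r);
                // Forget previously written objects so the stream's
                // reference table does not grow with every record.
                out.reset();
            }
        }
    }

    // Reads back exactly as many records as the header says were written.
    static List<DataRecord> readAll(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            int count = in.readInt();
            List<DataRecord> result = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                result.add((DataRecord) in.readObject());
            }
            return result;
        }
    }

    public static void main(String[] args) throws Exception {
        File file = File.createTempFile("records", ".ser");
        file.deleteOnExit();

        List<DataRecord> records = new ArrayList<>();
        for (long i = 0; i < 1000; i++) {
            Map<String, Object> attrs = new HashMap<>();
            attrs.put("count", i * 2);
            attrs.put("label", "item-" + i);
            records.add(new DataRecord(i, "record-" + i, attrs));
        }

        writeAll(file, records);
        List<DataRecord> back = readAll(file);
        System.out.println("Read back " + back.size() + " records");
    }
}
```

Note that after `reset()` an object shared between two records is written twice and comes back as two separate copies, which is why this only suits relatively independent objects. The buffered streams are an assumption added here to keep small writes from hitting the disk one at a time.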


4 Comments

For a simple object, native serialization is good enough. +1 for a simple, direct answer.
There are lots of faster approaches but the faster you go the more complex it gets for the developer. Your time is important too. ;)
It's not blazingly fast: my laptop wrote 100,000 data objects in 29.853 seconds. Each object contained a map with 10 strings, plus 5 additional strings, for a total of 1,500,000 objects or so. Reading is faster: it took 5 seconds to read everything back.
Sorry, that would be 3,000,000 objects... The map contains 10 keys and 10 values... The file is about 230 MB, and that's about 73 bytes per String, which isn't much overhead, actually.
