
We have a class called Row which represents a row in a result set. We need to write a List<Row> to file so that it can be retrieved much later.

One way to do this is by using Java's serialization support.

I imagine the best way is to implement serialization inside the Row class. Then we would use the serialize method of List<Row> to write it to file.

I wanted to know how efficient this would be. Would it take up far more space than simply writing a CSV file adapter that converts our List<Row> object to a CSV file?

  • That's an interesting question. There is a long serialVersionUID (so every row would be padded by a minimum of 64 bits). You could use Externalizable (and you could pack the data). If you want a more complete answer, I suggest posting some code (and a benchmark). Commented Jun 25, 2016 at 0:09
  • You may be interested in this question. (Possible dupe?) It deals more with speed (which IMO is more important), but they mention file size too. Commented Jun 25, 2016 at 0:12
  • @ElliotFrisch The serialVersionUID isn't transmitted with every object. It is transmitted once per newClassDesc, which is transmitted once per class per stream. Commented Jun 25, 2016 at 0:20
  • Am I right to assume that if I implement serialization in the Row class, then the List<Row>.serialize() method will do the trick? Commented Jun 25, 2016 at 0:26
  • @Dici That's an accepted shorthand to name a method in documentation. It doesn't have to compile. Socket.close() is another example. Commented Jun 25, 2016 at 1:03

2 Answers


Would it take up far more space than simply writing a CSV file adapter that converts our List object to a CSV file?

It depends on the type of Row, and also on the size and other aspects of the data you are saving1.

On the one hand, the Java serialization protocol includes metadata for every class mentioned in the serialization. This takes significant space.

On the other hand:

  • Java serialization only includes the metadata once per serialization. So if you serialize lots of instances of the same class, the metadata cost becomes insignificant.
  • In a CSV file, all non-textual data has to be converted to text. In some cases (e.g. large numbers, floating point numbers, booleans) the textual representation will be larger than the binary representation used in Java serialization.

1 - For example, an array of random numbers versus an array of zeros and ones. Java serialization will be better in the first case, and CSV will be better in the second.
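On the mechanics, for concreteness: there is no serialize() method on List. The standard approach is a single writeObject call on an ObjectOutputStream, which works because ArrayList is itself Serializable and recursively serializes its elements. A minimal sketch, using a hypothetical Row class (yours will differ) to show the round trip:

```java
import java.io.*;
import java.util.*;

public class RowSerializationDemo {
    // Hypothetical stand-in for the asker's Row class: the only requirement
    // is that it (and everything it holds) implements Serializable.
    static class Row implements Serializable {
        private static final long serialVersionUID = 1L;
        final int id;
        final String name;
        Row(int id, String name) { this.id = id; this.name = name; }
        @Override public boolean equals(Object o) {
            return o instanceof Row && ((Row) o).id == id && ((Row) o).name.equals(name);
        }
        @Override public int hashCode() { return Objects.hash(id, name); }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException {
        List<Row> rows = Arrays.asList(new Row(1, "a"), new Row(2, "b"));

        // One writeObject call covers the whole list.
        File file = new File("rows.ser");
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(new ArrayList<>(rows));
        }

        // Read it back and confirm the round trip preserves the data.
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            @SuppressWarnings("unchecked")
            List<Row> readBack = (List<Row>) in.readObject();
            System.out.println(rows.equals(readBack));
        }
    }
}
```

The equals/hashCode overrides are only there so the round-trip check is meaningful; serialization itself does not need them.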


But I think that you are probably focusing on the wrong thing here:

  • Unless the files you are generating are enormous, the size probably doesn't matter. Disk space is cheap.
  • The files are likely to be compressible in either case, with the less dense form probably being more compressible.
  • What matters more is whether the representation is suitable for purpose; e.g.
    • Do you want it to be human readable?
    • Do you want it to be readable by non-Java programs, including shell scripts?
    • Do you need to worry about changes to your Java code introducing class version / serialized-form compatibility problems?
    • Do you want to be able to stream the data? (When writing or reading.)
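On the streaming point: Java serialization does support writing and reading one object at a time, with the reader detecting the end of the stream via EOFException rather than a sentinel value. A rough sketch (Strings stand in for rows here; any Serializable type works the same way):

```java
import java.io.*;

public class StreamingRowsDemo {
    public static void main(String[] args) throws IOException, ClassNotFoundException {
        // Write rows one at a time instead of as one big list, so neither
        // side ever has to hold the whole data set in memory.
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new FileOutputStream("stream.ser"))) {
            for (int i = 0; i < 3; i++) {
                out.writeObject("row-" + i);
            }
        }

        // Read until the stream runs out; ObjectInputStream signals the end
        // of the file by throwing EOFException.
        int count = 0;
        try (ObjectInputStream in =
                 new ObjectInputStream(new FileInputStream("stream.ser"))) {
            while (true) {
                System.out.println(in.readObject());
                count++;
            }
        } catch (EOFException endOfStream) {
            // normal termination
        }
        System.out.println("read " + count + " objects");
    }
}
```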

9 Comments

@Tom actually no. Check this answer stackoverflow.com/a/769675 also take a look at Csv Schema
@Tom - Well-formed CSV deals with that case by quoting. However, there is another problem that there are many variants of CSV, and you need to know which one you are using to read it correctly.
@Tom - Because the files were written by applications coded by people who didn't know what they were doing :-) Hint: don't code CSV readers / writers by hand. Use a library, if you can.
Hint: @Tom - you are not the only person reading these comments.
The @Tom doesn't mean that I am >>only<< talking to you. That is not how it is used.

Java serialization will be less space efficient in certain cases than simply writing to a CSV file because it stores extra metadata to identify class-types.

I verified such a scenario with two simple test programs. The first one writes an array of ints to a .csv file.

import java.io.*;

public class CSVDemo {
  public static void main(String[] args) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 1000; i++) {
      sb.append(1);
      sb.append(",");
    }

    // try-with-resources closes the writer even if the write fails
    try (PrintWriter pw = new PrintWriter(new File("dummy.csv"))) {
      pw.write(sb.toString());
      System.out.println("Data is saved in dummy.csv");
    } catch (FileNotFoundException e) {
      e.printStackTrace();
    }
  }
}

The second one serializes an object containing an array of ints to a .ser file.

import java.io.*;

public class SerializeDemo {
  public static void main(String[] args) {
    DummyData dummy = new DummyData();

    // try-with-resources closes both streams even if writeObject fails
    try (FileOutputStream fileOut = new FileOutputStream("dummy.ser");
         ObjectOutputStream out = new ObjectOutputStream(fileOut)) {
      out.writeObject(dummy);
      System.out.println("Serialized data is saved in dummy.ser");
    } catch (IOException i) {
      i.printStackTrace();
    }
  }

  public static class DummyData implements java.io.Serializable {
    int[] data = new int[1000];

    public DummyData() {
      for (int i = 0; i < 1000; i++) {
        data[i] = 1;
      }
    }
  }
}

The .ser file took 4079 bytes. The .csv file took 2000 bytes. Granted, this is a slight simplification of your use case (I'm equating an int to your Row type), but the general trend should be the same.

Trying with larger numbers yields the same result: using 100,000 ints results in ~400KB for .ser and ~200KB for .csv.

However, as the comments below point out, if the ints are chosen at random, the .csv file actually grows larger than the .ser file.
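To see that effect directly, here is a rough sketch (not the original benchmark above) that writes the same full-range random ints both as raw 4-byte binary via DataOutputStream and as decimal text, the way a CSV column would store them. Most random 32-bit values need ten digits plus a possible sign as text, so the text file comes out substantially larger than the 400,000-byte binary file:

```java
import java.io.*;
import java.util.Random;

public class RandomIntSizeDemo {
    public static void main(String[] args) throws IOException {
        int[] data = new int[100_000];
        Random rnd = new Random(42); // fixed seed so the run is repeatable
        for (int i = 0; i < data.length; i++) {
            data[i] = rnd.nextInt(); // full 32-bit range
        }

        // Binary: exactly 4 bytes per int, no per-object metadata.
        File bin = new File("random.bin");
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(bin)))) {
            for (int v : data) out.writeInt(v);
        }

        // Text: one decimal number per line.
        File csv = new File("random.csv");
        try (PrintWriter pw = new PrintWriter(csv)) {
            for (int v : data) pw.println(v);
        }

        System.out.println("binary: " + bin.length() + " bytes");
        System.out.println("csv:    " + csv.length() + " bytes");
        System.out.println(csv.length() > bin.length());
    }
}
```

DataOutputStream is used here instead of ObjectOutputStream to isolate the text-versus-binary comparison from serialization's own metadata overhead.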

6 Comments

Actually you have a slight error in your csv file. CSV uses "," to separate columns; you need to use "\r\n" for rows. So sb.append(","); should be sb.append("\r\n"); the result will be a 3000-byte file instead of 2000 bytes.
Note that the properties of this particular example make serialization worse. If instead you serialized a large (enough) array of int values chosen using Random.nextInt(), the CSV form would be larger.
To prove @StephenC's comment: test with random values.
@StephenC That's interesting, I had no idea. Why is that? I've modified my answer to reflect your comment. Thanks!
@adao7000 I'm assuming it's because in the serialized format an int always takes 32 bits, while stored as a character a single-digit integer only weighs 8 bits. As soon as you start storing numbers strictly bigger than 999, the CSV format becomes more wasteful than binary. Considering there are many more numbers between 1000 and 2^31 - 1 than between 0 and 999, it is indeed way better to use a binary format to store an integer array.
