
We have a class called Row which represents a row in a result set. We need to write a List<Row> to file so that it can be retrieved much later.

One way to do this is by using Java's serialization support.

I imagine the best way is to implement serialization inside the Row class. Then we would use the serialize method of List<Row> to write it to file.

I wanted to know how efficient this would be. Would it take up far more space than simply writing a CSV file adapter that converts our List<Row> object to a CSV file?

  • That's an interesting question. There is a long serialVersionUID (so every row would be padded by a minimum of 64 bits). You could use Externalizable (and you could pack the data). If you want a more complete answer, I suggest posting some code (and a benchmark). Commented Jun 25, 2016 at 0:09
  • You may be interested in this question. (Possible dupe?) It deals more with speed (which IMO is more important), but they mention file size too. Commented Jun 25, 2016 at 0:12
  • @ElliotFrisch The serialVersionUID isn't transmitted with every object. It is transmitted once per newClassDesc, which is transmitted once per class per stream. Commented Jun 25, 2016 at 0:20
  • Am I right to assume that if I implement serialization in the Row class, then the List<Row>.serialize() method will do the trick? Commented Jun 25, 2016 at 0:26
  • @Dici That's an accepted shorthand to name a method in documentation. It doesn't have to compile. Socket.close() is another example. Commented Jun 25, 2016 at 1:03

2 Answers


Would it take up far more space than simply writing a CSV file adapter that converts our List object to a CSV file?

It depends on the type of Row, and also on the size and other aspects of the data you are saving1.

On the one hand, the Java serialization protocol includes metadata for every class mentioned in the serialization. This takes significant space.

On the other hand:

  • Java serialization only includes the metadata once per serialization. So if you serialize lots of instances of the same class, the metadata cost becomes insignificant.
  • In a CSV file, all non-textual data has to be converted to text. In some cases (e.g. large numbers, floating point numbers, booleans) the textual representation will be larger than the binary representation used in Java serialization.

1 - For example, an array of random numbers versus an array of zeros and ones. Java serialization will be better in the first case, and CSV will be better in the second.
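On the mechanics, for concreteness: there is no serialize() method on List. The standard approach is a single writeObject call on an ObjectOutputStream, which works because ArrayList is itself Serializable and recursively serializes its elements. A minimal sketch, using a hypothetical Row class (yours will differ) to show the round trip:

```java
import java.io.*;
import java.util.*;

public class RowSerializationDemo {
    // Hypothetical stand-in for the asker's Row class: the only requirement
    // is that it (and everything it holds) implements Serializable.
    static class Row implements Serializable {
        private static final long serialVersionUID = 1L;
        final int id;
        final String name;
        Row(int id, String name) { this.id = id; this.name = name; }
        @Override public boolean equals(Object o) {
            return o instanceof Row && ((Row) o).id == id && ((Row) o).name.equals(name);
        }
        @Override public int hashCode() { return Objects.hash(id, name); }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException {
        List<Row> rows = Arrays.asList(new Row(1, "a"), new Row(2, "b"));

        // One writeObject call covers the whole list.
        File file = new File("rows.ser");
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(new ArrayList<>(rows));
        }

        // Read it back and confirm the round trip preserves the data.
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            @SuppressWarnings("unchecked")
            List<Row> readBack = (List<Row>) in.readObject();
            System.out.println(rows.equals(readBack));
        }
    }
}
```

The equals/hashCode overrides are only there so the round-trip check is meaningful; serialization itself does not need them.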


But I think that you are probably focusing on the wrong thing here:

  • Unless the files you are generating are enormous, the size probably doesn't matter. Disk space is cheap.
  • The files are likely to be compressible in either case, with the less dense form probably being more compressible.
  • What matters more is whether the representation is suitable for purpose; e.g.
    • Do you want it to be human readable?
    • Do you want it to be readable by non-Java programs, including shell scripts?
    • Do you need to worry about changes to your Java code introducing class version / serialized-form compatibility problems?
    • Do you want to be able to stream the data? (When writing or reading.)
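On the streaming point: Java serialization does support writing and reading one object at a time, with the reader detecting the end of the stream via EOFException rather than a sentinel value. A rough sketch (Strings stand in for rows here; any Serializable type works the same way):

```java
import java.io.*;

public class StreamingRowsDemo {
    public static void main(String[] args) throws IOException, ClassNotFoundException {
        // Write rows one at a time instead of as one big list, so neither
        // side ever has to hold the whole data set in memory.
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new FileOutputStream("stream.ser"))) {
            for (int i = 0; i < 3; i++) {
                out.writeObject("row-" + i);
            }
        }

        // Read until the stream runs out; ObjectInputStream signals the end
        // of the file by throwing EOFException.
        int count = 0;
        try (ObjectInputStream in =
                 new ObjectInputStream(new FileInputStream("stream.ser"))) {
            while (true) {
                System.out.println(in.readObject());
                count++;
            }
        } catch (EOFException endOfStream) {
            // normal termination
        }
        System.out.println("read " + count + " objects");
    }
}
```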

9 Comments

@Tom actually no. Check this answer stackoverflow.com/a/769675 also take a look at Csv Schema
@Tom - Well-formed CSV deals with that case by quoting. However, there is another problem that there are many variants of CSV, and you need to know which one you are using to read it correctly.
@Tom - Because the files were written by applications coded by people who didn't know what they were doing :-) Hint: don't code CSV readers / writers by hand. Use a library, if you can.
Hint: @Tom - you are not the only person reading these comments.
The @Tom doesn't mean that I am >>only<< talking to you. That is not how it is used.

Java serialization will be less space efficient in certain cases than simply writing to a CSV file because it stores extra metadata to identify class-types.

I verified such a scenario with two simple test programs. The first one writes an array of ints to a .csv file.

import java.io.*;

public class CSVDemo {
  public static void main(String[] args) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 1000; i++) {
      sb.append(1);
      sb.append(",");
    }

    // try-with-resources closes the writer even if the write fails
    try (PrintWriter pw = new PrintWriter(new File("dummy.csv"))) {
      pw.write(sb.toString());
      System.out.println("Data is saved in dummy.csv");
    } catch (FileNotFoundException e) {
      e.printStackTrace();
    }
  }
}

The second one serializes an object containing an array of ints to a .ser file.

import java.io.*;

public class SerializeDemo {
  public static void main(String[] args) {
    DummyData dummy = new DummyData();

    // try-with-resources closes both streams even if writeObject fails
    try (FileOutputStream fileOut = new FileOutputStream("dummy.ser");
         ObjectOutputStream out = new ObjectOutputStream(fileOut)) {
      out.writeObject(dummy);
      System.out.println("Serialized data is saved in dummy.ser");
    } catch (IOException i) {
      i.printStackTrace();
    }
  }

  public static class DummyData implements java.io.Serializable {
    int[] data = new int[1000];

    public DummyData() {
      for (int i = 0; i < 1000; i++) {
        data[i] = 1;
      }
    }
  }
}

The .ser file took 4079 bytes. The .csv file took 2000 bytes. Granted, this is a slight simplification of your use case (I'm equating an int to your Row type), but the general trend should be the same.

Trying with larger numbers yields the same result: using 100,000 ints results in ~400KB for .ser and ~200KB for .csv.

However, as the comments below point out, if the ints are chosen at random, the .csv file actually grows larger than the .ser file.
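To see that effect directly, here is a rough sketch (not the original benchmark above) that writes the same full-range random ints both as raw 4-byte binary via DataOutputStream and as decimal text, the way a CSV column would store them. Most random 32-bit values need ten digits plus a possible sign as text, so the text file comes out substantially larger than the 400,000-byte binary file:

```java
import java.io.*;
import java.util.Random;

public class RandomIntSizeDemo {
    public static void main(String[] args) throws IOException {
        int[] data = new int[100_000];
        Random rnd = new Random(42); // fixed seed so the run is repeatable
        for (int i = 0; i < data.length; i++) {
            data[i] = rnd.nextInt(); // full 32-bit range
        }

        // Binary: exactly 4 bytes per int, no per-object metadata.
        File bin = new File("random.bin");
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(bin)))) {
            for (int v : data) out.writeInt(v);
        }

        // Text: one decimal number per line.
        File csv = new File("random.csv");
        try (PrintWriter pw = new PrintWriter(csv)) {
            for (int v : data) pw.println(v);
        }

        System.out.println("binary: " + bin.length() + " bytes");
        System.out.println("csv:    " + csv.length() + " bytes");
        System.out.println(csv.length() > bin.length());
    }
}
```

DataOutputStream is used here instead of ObjectOutputStream to isolate the text-versus-binary comparison from serialization's own metadata overhead.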

6 Comments

Actually you have a slight error in your csv file. CSV uses "," to separate columns; you need to use "\r\n" for rows. So sb.append(","); should be sb.append("\r\n"); the result will be a 3000-byte file instead of 2000 bytes.
Note that the properties of this particular example make serialization worse. If instead you serialized a large (enough) array of int values chosen using Random.nextInt(), the CSV form would be larger.
To prove @StephenC's comment: test with random values.
@StephenC That's interesting, I had no idea. Why is that? I've modified my answer to reflect your comment. Thanks!
@adao7000 I'm assuming it's because in the serialized format an int always takes 32 bits, while stored as a character a single-digit integer only weighs 8 bits. As soon as you start storing numbers strictly bigger than 999, the CSV format becomes more wasteful than binary. Considering there are many more numbers between 1000 and 2^31 - 1 than between 0 and 999, it is indeed way better to use a binary format to store an integer array.
