Differing sizes of String representation in Java

Question

I'm comparing the various ways of storing a String in Java by breaking a String down into its constituent parts. I have this code snippet:

final String message = "ABCDEFGHIJ";
System.out.println("As String " + RamUsageEstimator.humanSizeOf(message));
System.out.println("As byte[] " + RamUsageEstimator.humanSizeOf(message.getBytes()));
System.out.println("As char[] " + RamUsageEstimator.humanSizeOf(message.toCharArray()));

This uses sizeof (its RamUsageEstimator class) to measure the size of the objects. The results of the above are:

As String 64 bytes
As byte[] 32 bytes
As char[] 40 bytes

Given that a byte is 8 bits and a char is 16 bits, why are the results not 10 bytes and 20 bytes respectively?

Also, what is the overhead of the String object that makes it twice the size of the underlying byte[]?

This is using

java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)

On OSX

To get a good answer, you need to specify the exact JVM you're using.
– biziclop, Commented Feb 11, 2016 at 11:07
A String also contains int offset, int hashcode and int length in Oracle Java; those make 12 bytes, leaving 2 bytes for the pointer to the array, I assume.
– Ferrybig, Commented Feb 11, 2016 at 11:08
@Ferrybig The three int fields only add up to 12 bytes, but there is an object header as well. The size of the array reference depends on the VM (32/64 bit) and its settings (whether it uses compressed pointers or not). Then there's the question of alignment and of course it would be nice to know how RamUsageEstimator actually works. (And Java 8 did away with the offset and length fields.)
– biziclop, Commented Feb 11, 2016 at 11:09

assylias · Accepted Answer · 2016-02-11 14:46:14Z

The data below is for HotSpot / Java 8 - numbers will vary for other JVMs/Java versions (for example, in Java 7, String has two additional int fields).

A new Object() has a 12-byte object header and no fields of its own; once padded to a multiple of 8 bytes it occupies 16 bytes of memory.

A String has (number of bytes in brackets):

an object header (12),
a reference to a char[] (4 - assuming compressed OOPs on a 64-bit JVM),
an int hash (4).

That's 20 bytes but objects get padded to multiples of 8 bytes => 24. So that's already 24 bytes on top of the actual content of the array.
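The 12-byte header and the 4-byte reference both assume compressed OOPs and compressed class pointers. If you want to check what your own JVM is doing, something along these lines should work on HotSpot. It is only a sketch: the class name CompressedOopsCheck and the vmOption helper are made up for illustration, and it relies on the HotSpot-specific com.sun.management.HotSpotDiagnosticMXBean, so on other JVMs it simply reports "unknown".

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper class, not part of any library.
class CompressedOopsCheck {

    /** Returns the current value of a HotSpot VM option, or "unknown" if it cannot be read. */
    static String vmOption(String name) {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            return bean.getVMOption(name).getValue();
        } catch (RuntimeException | LinkageError e) {
            // Not a HotSpot JVM, or this JVM does not have the option (e.g. a 32-bit VM).
            return "unknown";
        }
    }

    public static void main(String[] args) {
        // The layout in this answer assumes the first two are "true" and the third is "8".
        System.out.println("UseCompressedOops          = " + vmOption("UseCompressedOops"));
        System.out.println("UseCompressedClassPointers = " + vmOption("UseCompressedClassPointers"));
        System.out.println("ObjectAlignmentInBytes     = " + vmOption("ObjectAlignmentInBytes"));
    }
}
```

If the first option prints false (for example with a very large heap or -XX:-UseCompressedOops), expect the reference to take 8 bytes and the sizes to grow accordingly.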

The char[] has a header (12), a length (4) and each char (10 x 2 = 20) = 36, padded to the next multiple of 8 = 40.

The byte[] has a header (12), a length (4) and each byte (10 x 1 = 10) = 26, padded to the next multiple of 8 = 32.

The String itself takes 24 bytes and refers to the 40-byte char[], so 24 + 40 = 64 bytes in total. So we get to your numbers.
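The arithmetic above can be written down as a small sketch. It does not measure anything: the constants are the assumptions stated in this answer (64-bit HotSpot, Java 8, compressed OOPs, 8-byte alignment), and the class and method names are made up for illustration.

```java
// Back-of-the-envelope layout arithmetic; all constants are assumptions, not measurements.
class LayoutArithmetic {
    static final int OBJECT_HEADER = 12; // mark word (8) + compressed class pointer (4)
    static final int ARRAY_LENGTH = 4;   // int length slot that every array carries
    static final int REFERENCE = 4;      // compressed reference to the char[]
    static final int INT_FIELD = 4;      // String.hash
    static final int ALIGNMENT = 8;      // default object alignment

    /** Rounds a raw size up to the next multiple of the object alignment. */
    static long align(long rawBytes) {
        return (rawBytes + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
    }

    /** Size of an array with the given length and element width. */
    static long arraySize(int length, int bytesPerElement) {
        return align(OBJECT_HEADER + ARRAY_LENGTH + (long) length * bytesPerElement);
    }

    /** Size of the String object itself, excluding the char[] it points to. */
    static long stringShallowSize() {
        return align(OBJECT_HEADER + REFERENCE + INT_FIELD);
    }

    /** Size of the String plus its backing char[] (Java 8 layout). */
    static long stringDeepSize(int length) {
        return stringShallowSize() + arraySize(length, 2);
    }

    public static void main(String[] args) {
        System.out.println("String (shallow): " + stringShallowSize()); // 24
        System.out.println("char[10]:         " + arraySize(10, 2));    // 40
        System.out.println("byte[10]:         " + arraySize(10, 1));    // 32
        System.out.println("String (deep):    " + stringDeepSize(10));  // 64
    }
}
```

The rounding in align is why none of the results are the "natural" 10 or 20 bytes: every object pays for a header, and its total is then rounded up to a multiple of 8.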

Also note that the number of bytes depends on the encoding you use - if you retry with message.getBytes(StandardCharsets.UTF_16) for example, you will see that the byte array uses 40 bytes instead of 32.
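To see the effect of the encoding in isolation, you can compare the raw array lengths (before header and padding) for a few charsets. This is a minimal sketch; the class name EncodedLengths and its helper are hypothetical. Note that Java's UTF_16 encoder prepends a 2-byte byte order mark, which is why it yields 22 bytes rather than 20.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
class EncodedLengths {

    /** Number of bytes the string occupies once encoded with the given charset. */
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String message = "ABCDEFGHIJ";
        System.out.println("UTF-8:    " + encodedLength(message, StandardCharsets.UTF_8));    // 10
        System.out.println("UTF-16:   " + encodedLength(message, StandardCharsets.UTF_16));   // 22 (2-byte BOM + 20)
        System.out.println("UTF-16LE: " + encodedLength(message, StandardCharsets.UTF_16LE)); // 20 (no BOM)
    }
}
```

With 22 bytes of content, the array is 12 + 4 + 22 = 38 bytes, padded to 40, which matches the figure above.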

You can use JOL (Java Object Layout) to visualise the memory usage and confirm the calculation above. The output for the char[] is:

 OFFSET  SIZE  TYPE DESCRIPTION                    VALUE
      0     4       (object header)                01 00 00 00 (00000001 00000000 00000000 00000000) (1)
      4     4       (object header)                00 00 00 00 (00000000 00000000 00000000 00000000) (0)
      8     4       (object header)                41 00 00 f8 (01000001 00000000 00000000 11111000) (-134217663)
     12     4       (object header)                0a 00 00 00 (00001010 00000000 00000000 00000000) (10)
     16    20  char [C.<elements>                  N/A
     36     4       (loss due to the next object alignment)
Instance size: 40 bytes (reported by Instrumentation API)

So you can see the header of 12 (first 3 lines), the length (line 4), the chars (line 5) and the padding (line 6).

Similarly for the String (note that this excludes the size of the array itself):

 OFFSET  SIZE   TYPE DESCRIPTION                    VALUE
      0     4        (object header)                01 00 00 00 (00000001 00000000 00000000 00000000) (1)
      4     4        (object header)                00 00 00 00 (00000000 00000000 00000000 00000000) (0)
      8     4        (object header)                da 02 00 f8 (11011010 00000010 00000000 11111000) (-134216998)
     12     4 char[] String.value                   [A, B, C, D, E, F, G, H, I, J]
     16     4    int String.hash                    0
     20     4        (loss due to the next object alignment)
Instance size: 24 bytes (reported by Instrumentation API)

Jean-Baptiste Yunès · Answer · 2016-02-11 11:27:49Z

Each of your tests estimates the size of an object: a String object in the first case, a byte array object in the second, and a char array object in the third. Every object, as an instance of a class, may contain private fields and other bookkeeping data, so the best you can expect is a lower bound: a String of 10 chars contains at least the 10 chars, each 2 bytes long, so the whole size should be ≥ 20 bytes, which is consistent with your results.

Your byte/char comparison is off, because the byte array obtained from a String contains all the bytes for a given encoding. Your current encoding may use more than one byte per char.

You may have a look at the Java source code for the Object and String classes, and at the array support in the JVM, to understand exactly what happens.

edited Feb 11, 2016 at 11:27

answered Feb 11, 2016 at 11:12

3 Comments

Ferrybig Over a year ago

Where can the source code of the array class you mention be found? I never knew this class existed in Java.

user207421 Over a year ago

@Ferrybig There is no source code. All array classes are synthesised by the JVM as required.

Jean-Baptiste Yunès Over a year ago

This is a built-in type, with special support.
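You can at least observe these synthesised array classes through reflection. The sketch below (the class ArrayClasses and its helper are made up for illustration) prints their runtime names; "[C" is the same name that shows up in the JOL output for the char[] above.

```java
// Hypothetical demo class: array classes have no source file, but they are visible via reflection.
class ArrayClasses {

    /** Runtime class name of any object, e.g. "[C" for a char[]. */
    static String className(Object o) {
        return o.getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println(className(new char[0]));                 // [C
        System.out.println(className(new byte[0]));                 // [B
        System.out.println(className(new String[0]));               // [Ljava.lang.String;
        System.out.println(char[].class.isArray());                 // true
        System.out.println(char[].class.getSuperclass().getName()); // java.lang.Object
    }
}
```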
