Does it make sense to "waste" 8 bytes per String instance for offset/count?

Question

Strings in Java support structural sharing for some methods like substring, which means that the immutable character data doesn't need to be copied. A side effect is that a small substring can unexpectedly keep a large char array alive that would otherwise have been garbage-collected.

This feature is implemented with two fields, offset and count, which are set accordingly when a String is substringed in Java.
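To make the sharing concrete, here is a minimal sketch of how offset and count can let a substring reuse its parent's array. The class name SharedChars and the helper retainedChars are hypothetical and exist only for illustration; this is not the JDK's actual source.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a string with offset/count fields; not the JDK's code. */
final class SharedChars {
    private final char[] value; // backing array, possibly shared with other instances
    private final int offset;   // index in value where this string starts
    private final int count;    // number of characters that belong to this string

    /** Copies the input once so later changes by the caller cannot leak in. */
    SharedChars(char[] chars) {
        this(Arrays.copyOf(chars, chars.length), 0, chars.length);
    }

    private SharedChars(char[] value, int offset, int count) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }

    int length() {
        return count;
    }

    char charAt(int index) {
        if (index < 0 || index >= count) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
        return value[offset + index];
    }

    /** O(1): shares the backing array and only adjusts offset and count. */
    SharedChars substring(int begin, int end) {
        if (begin < 0 || end > count || begin > end) {
            throw new IndexOutOfBoundsException("begin " + begin + ", end " + end);
        }
        return new SharedChars(value, offset + begin, end - begin);
    }

    /** Size of the array this instance keeps reachable, however short the string is. */
    int retainedChars() {
        return value.length;
    }

    @Override
    public String toString() {
        return new String(value, offset, count);
    }
}
```

With this sketch, a five-character substring of an eleven-character string still reports retainedChars() == 11, which is the retention effect described above.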

Considering that .NET doesn't do this and claims that "O(n) is O(1) if n does not grow large", would a slightly different String design that accommodates both requirements make sense?

For example, would it make sense to have a sealed, memory-efficient, general-purpose version of String that doesn't have these superfluous fields, plus a subclass "SubString" that is returned only by the substring methods and carries the additional fields to avoid copying?

Rough sketch:

sealed class String {
  val codeunits: Array[Char] = ...
  def length = codeunits.length

  def substring(begin: Int, end: Int): SubString = ...

  ...
}

final class SubString extends String {
  val offset: Int = ...
  val count: Int = ...
  override def length = count /* and so on */

  ...
}
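The same idea can be written out as a rough, runnable Java sketch. The names CompactString, of, and checkRange are hypothetical, the SubString here carries both an offset and a count, and sealing is only approximated by package-private constructors. It is meant as an illustration of the proposal, not as a proposed replacement for java.lang.String.

```java
/** Hypothetical compact string: just the array, no offset/count fields. */
class CompactString {
    final char[] codeUnits; // treated as immutable and never handed out

    CompactString(char[] codeUnits) {
        this.codeUnits = codeUnits;
    }

    /** Copies the characters of a java.lang.String into a new compact string. */
    static CompactString of(String s) {
        return new CompactString(s.toCharArray());
    }

    int length() {
        return codeUnits.length;
    }

    char charAt(int index) {
        return codeUnits[index];
    }

    /** Returns a view that shares the array; only the view pays for the extra fields. */
    SubString substring(int begin, int end) {
        checkRange(begin, end, length());
        return new SubString(codeUnits, begin, end - begin);
    }

    @Override
    public String toString() {
        return new String(codeUnits);
    }

    static void checkRange(int begin, int end, int length) {
        if (begin < 0 || end > length || begin > end) {
            throw new IndexOutOfBoundsException(
                "begin " + begin + ", end " + end + ", length " + length);
        }
    }
}

/** Hypothetical view onto a shared array, returned only by substring. */
final class SubString extends CompactString {
    private final int offset;
    private final int count;

    SubString(char[] codeUnits, int offset, int count) {
        super(codeUnits);
        this.offset = offset;
        this.count = count;
    }

    @Override
    int length() {
        return count;
    }

    @Override
    char charAt(int index) {
        checkRange(index, index + 1, count);
        return codeUnits[offset + index];
    }

    @Override
    SubString substring(int begin, int end) {
        checkRange(begin, end, count);
        return new SubString(codeUnits, offset + begin, end - begin);
    }

    @Override
    public String toString() {
        return new String(codeUnits, offset, count);
    }
}
```

One trade-off this makes visible: length and charAt now have two implementations, so calls on the base type are no longer guaranteed to resolve to a single method, which may or may not matter depending on how well the JIT handles such call sites.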

The field is called count, not length, at least in the Oracle JVM.
– Michael Borgwardt, Commented Sep 16, 2011 at 14:35
I think you want to limit the tags to java/scala, since C#, as you mention, does not have the thing you propose to work around/replace.
– sehe, Commented Sep 16, 2011 at 14:36
@sehe: I'm especially interested in what experiences .NET devs have had with their choice; that's why I left the tag in.
– soc, Commented Sep 16, 2011 at 14:38
In theory it should be possible for a sufficiently smart JVM to automatically shrink the array behind your back, but it might not be practical or worthwhile.
– Stuart Cook, Commented Sep 16, 2011 at 14:38
Are there any good benchmarks showing how each model stacks up under similar load in terms of memory and processor usage? That seems like the starting place to see whether there's enough difference to warrant a mixed model. It might also show where each model ought to be used. (Keep in mind that Java currently allows both models; you just have to choose which one to use. I'm also not so sure it's wasting as many bytes as you think: a length is already required by C# strings, and there's going to be a starting memory address for any string.)
– user645280, Commented Sep 16, 2011 at 14:51

Peter Lawrey · Accepted Answer · 2011-09-16 15:25:10Z

2

What you suggest could make the common case more efficient in terms of memory and CPU.

You may be interested to know that the JVM can change this without a code change. With -XX:+UseCompressedStrings, the Sun/Oracle JVM can already use a byte[] automatically when the characters fit into bytes without loss.

In any case, it's the sort of thing you would want the JVM to do for you transparently, like -XX:+UseCompressedStrings does.
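As a rough illustration of what "fit into bytes without loss" can mean, the following sketch stores one byte per character when every char is at most 0xFF and otherwise refuses to compress. The class ByteCompaction and its methods are hypothetical, and the 0xFF threshold is an assumption made here; the JVM's internal implementation may use different rules.

```java
/** Hypothetical helper illustrating lossless char-to-byte compaction. */
final class ByteCompaction {
    private ByteCompaction() {
    }

    /**
     * Returns one byte per char if every char fits into 8 bits,
     * or null if compaction would lose information.
     */
    static byte[] tryCompress(char[] chars) {
        byte[] out = new byte[chars.length];
        for (int i = 0; i < chars.length; i++) {
            if (chars[i] > 0xFF) {
                return null; // assumed threshold: does not fit into a single byte
            }
            out[i] = (byte) chars[i];
        }
        return out;
    }

    /** Restores the original chars; the mask undoes the sign extension of byte. */
    static char[] inflate(byte[] bytes) {
        char[] out = new char[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            out[i] = (char) (bytes[i] & 0xFF);
        }
        return out;
    }
}
```

A caller would keep the char[] whenever tryCompress returns null, so the representation changes without the visible contents of the string changing.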


2 Comments

soc Over a year ago

Thanks, but isn't that something completely different? Having a byte[] instead of a char[] doesn't change the behavior in question ...

Peter Lawrey Over a year ago

It shows that a change for performance reasons is possible, has happened relatively recently, and is even desirable, provided it's transparent to the Java application.
