Why doesn't string.Substring share memory with the source string?

Question

As we all know, strings in .NET are immutable. (Well, not 100% totally immutable, but immutable by design and used as such by any reasonable person, anyway.)

This makes it basically OK that, for example, the following code just stores a reference to the same string in two variables:

string x = "shark";
string y = x.Substring(0);

// Proof:
fixed (char* c = y)
{
    c[4] = 'p';
}

Console.WriteLine(x);
Console.WriteLine(y);

The above outputs:

sharp
sharp

Clearly x and y refer to the same string object. So here's my question: why wouldn't Substring always share state with the source string? A string is essentially a char* pointer with a length, right? So it seems to me the following should at least in theory be allowed to allocate a single block of memory to hold 5 characters, with two variables simply pointing to different locations within that (immutable) block:

string x = "shark";
string y = x.Substring(1);

// Does c[0] point to the same location as x[1]?
fixed (char* c = y)
{
    c[0] = 'p';
}

// Apparently not...
Console.WriteLine(x);
Console.WriteLine(y);

The above outputs:

shark
park

In the Substring documentation: "This method does not modify the value of the current instance. Instead, it returns a new string that begins at the startIndex position in the current string." I would say that it should never behave the way it does in your first example. If you use Substring, you should expect it to create a different instance for further modification.
– Piotr Auguscik, Commented Jun 8, 2011 at 5:30
Just to ask... do you really expect anything to work when you're sneaking around class invariants?
– cHao, Commented Jun 8, 2011 at 5:31
Related: msdn.microsoft.com/en-us/library/system.string.intern.aspx
– Andrew Savinykh, Commented Jun 8, 2011 at 5:44
Why doesn't the .NET Framework store all permutations of the alphabet in memory and we just reference a pointer to the part we need? :-)
– benPearce, Commented Jun 8, 2011 at 5:53

Guffa · Accepted Answer · edited 2023-11-15 09:38:18Z

28

For two reasons:

1. The string metadata (e.g. the length) is stored in the same memory block as the characters. Allowing one string to use part of another string's character data would mean allocating two memory blocks for most strings instead of one. As most strings are not substrings of other strings, that extra allocation would cost more memory than reusing parts of strings could save.
2. There is an extra NUL character stored after the last character of the string, so that the string is also usable by system functions that expect a null-terminated string. You can't put that extra NUL character after a substring that is part of another string.
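The second reason can be illustrated with a small model. This is only a sketch: a plain char[] stands in for the runtime's internal string buffer, and the TerminatorSketch class and its members are hypothetical names, not part of .NET.

```csharp
using System;

// Hypothetical model: a char[] stands in for the internal buffer of a string.
public static class TerminatorSketch
{
    // Returns the character stored directly after a slice of a shared buffer.
    public static char CharAfterSlice(char[] buffer, int offset, int length)
    {
        return buffer[offset + length];
    }

    public static void Demo()
    {
        // "shark" followed by the trailing NUL described above.
        char[] buffer = { 's', 'h', 'a', 'r', 'k', '\0' };

        // The full string is followed by the terminator.
        Console.WriteLine(CharAfterSlice(buffer, 0, 5) == '\0'); // True

        // A shared "har" (offset 1, length 3) is followed by 'k', so code that
        // expects a NUL-terminated string would read "hark" instead of "har".
        Console.WriteLine(CharAfterSlice(buffer, 1, 3)); // k
    }
}
```

Placing a NUL after the shared slice would overwrite a character that the source string still needs, which is why a copy is required in this model.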

edited Nov 15, 2023 at 9:38 by Joey Sabey
answered Jun 8, 2011 at 5:30 by Guffa

2 Comments

Dan Tao Over a year ago

I suspected there would be some very good reasons for this; and sure enough, there are! Thanks for the insight.

Joey Sabey Over a year ago

Somewhat of a disappointing technical reason to lose out on potentially pretty nice CoW benefits in the core design. Interestingly, of course, the trailing NUL part is rather moot with the single-parameter substring call in the question examples, I presume.

to StackOverflow · Answer · edited 2023-11-15 05:29:42Z

12

C# strings are both null-terminated and length-prefixed. This is an implementation detail that shouldn't concern managed consumers, but there are some cases (e.g. marshaling) where it matters.

Also, if a substring shared a buffer with a much longer string, a reference to the short substring would prevent the longer string from being collected. It would also open up the possibility of a rat's nest of string references that all refer to the same buffer.
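The retention effect can be observed with the slicing types that later .NET versions provide. The following is a minimal sketch, assuming a runtime where string.AsMemory and MemoryMarshal.TryGetString are available (.NET Core 2.1 or later); the SharedSliceDemo class and its method names are made up for illustration.

```csharp
using System;
using System.Runtime.InteropServices;

public static class SharedSliceDemo
{
    // Returns true when the slice is still backed by the original string object,
    // which means the whole source stays reachable for as long as the slice is.
    public static bool SliceKeepsSource(string source, int start, int length)
    {
        ReadOnlyMemory<char> slice = source.AsMemory(start, length);
        return MemoryMarshal.TryGetString(slice, out var owner, out _, out _)
            && ReferenceEquals(owner, source);
    }

    // Copies the characters into a newly allocated string, so the result
    // no longer references the source buffer (for non-empty slices).
    public static string Detach(ReadOnlyMemory<char> slice)
    {
        return new string(slice.Span);
    }
}
```

With a 1000-character source, SliceKeepsSource(source, 10, 5) reports that the five-character slice still references the full string, while Detach returns an independent five-character copy.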

edited Nov 15, 2023 at 5:29 by Joey Sabey
answered Jun 8, 2011 at 5:32 by to StackOverflow

3 Comments

Dan Tao Over a year ago

This was also a great answer; thanks! Makes perfect sense after considering those points.

wischi Over a year ago

C# strings are NOT null terminated and it's very easy to prove that. "abc\0def".Length is 7 and not 3 (what it would be if they were null terminated)

to StackOverflow Over a year ago

@wischi - What I meant by "null terminated" is that I think there is a null ('\0') character following the string's characters in the underlying memory buffer. Not that it is "null terminated" in the classic C sense, i.e. the string is terminated by the first null character it contains in its buffer. Guffa's answer says the same thing, but more clearly, and is rightly the accepted answer.

sleske · Answer · 2011-06-09 09:05:17Z

6

To add to the other answers:

Apparently, the Java standard classes do this: The string returned by String.substring() reuses the internal character array of the original string (source, or look at the JDK sources by Sun).

The problem is that this means that the original String cannot be GCed until all the substrings are eligible for GC as well (as they share the backing character array). This can lead to wasted memory if you start out with a large string, and extract some smaller strings out of it, then discard the big string. That would be common when parsing an input file, for example.

Of course, a clever GC might work around this by copying the character array when it is worth it (the Sun JVM may do this, I don't know), but the added complexity might be a reason not to implement this sharing behaviour at all.
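The "copy when it is worth it" idea can also be approximated in library code rather than in the GC. This is a rough sketch and not how any runtime is known to behave: the SliceOrCopySketch class and the one-quarter threshold are assumptions chosen purely for illustration.

```csharp
using System;

public static class SliceOrCopySketch
{
    // Hypothetical threshold: share the buffer only when the slice covers
    // at least 1/ShareThreshold of the source string.
    public const int ShareThreshold = 4;

    public static bool ShouldShare(int sourceLength, int sliceLength)
    {
        return (long)sliceLength * ShareThreshold >= sourceLength;
    }

    public static ReadOnlyMemory<char> Slice(string source, int start, int length)
    {
        if (ShouldShare(source.Length, length))
        {
            // Large slice: sharing wastes little memory, so reuse the source buffer.
            return source.AsMemory(start, length);
        }

        // Small slice of a large string: copy, so the source can be collected.
        return source.Substring(start, length).AsMemory();
    }
}
```

The extra branch and the tuning question behind the threshold hint at the added complexity that the answer mentions.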

answered Jun 9, 2011 at 9:05 by sleske

2 Comments

Dan Tao Over a year ago

+1 to avoiding the added complexity. This is something that's been on my mind a lot lately: I think in many cases I prefer the "dumb, obvious" solution over clever, less easily provable ideas, more so than I used to.

sleske Over a year ago

@Dan Tao: Yes, just my thoughts. "Clever" is often something bad when programming.

supercat · Answer · 2011-07-27 01:48:28Z

There are a number of ways something like String could be implemented:

1. Have a "String" object effectively contain an array, with the implication that all characters in the array are in the string. This is what .NET actually does.
2. Have every "String" be a class which contains an array reference along with a starting offset and length. Problem: Creating most strings would require instantiating two objects rather than one.
3. Have every "String" be a structure which contains an array reference along with a starting offset and length. Problem: Assignments to string type fields would no longer be atomic.
4. Have two or more types of "String" objects: those which contain all the characters in an array, and those which contain a reference to another string along with an offset and length. Problem: This would require many methods of string to be virtual.
5. Have every "String" be a special class which includes a starting offset and length, an object reference to what may or may not be the same object, and a built-in array of characters. This would waste a little space in the common case where a string contains its own characters (because all of them), but would allow the same code to work with strings that contain their own characters or strings that 'borrow' from others.
6. Have a general-purpose ImmutableArray<T> type (which would inherit ReadableArray<T>), and have an ImmutableArray<Char> be interchangeable with String. There are many uses for immutable arrays; String is probably the most common usage case, but hardly the only one.
7. Have a general-purpose ImmutableArray<T> type as above, but also an ImmutableArraySegment<T> class, both inheriting from ImmutableArrayBase<T>. This would require many methods to be virtual, and would probably be my favorite possibility.

Note that most of these approaches have significant limitations in at least some usage scenarios.
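As an illustration of the second option in the list above (a class holding an array reference plus a starting offset and length), here is a minimal sketch. StringSlice is a hypothetical type written for this example; it is not a .NET type and implements only what is needed to show buffer sharing.

```csharp
using System;

// Hypothetical type: an array reference plus a starting offset and a length.
public sealed class StringSlice
{
    private readonly char[] _buffer; // a second heap object next to the slice itself
    private readonly int _offset;
    private readonly int _length;

    public StringSlice(string text) : this(text.ToCharArray(), 0, text.Length) { }

    private StringSlice(char[] buffer, int offset, int length)
    {
        _buffer = buffer;
        _offset = offset;
        _length = length;
    }

    public int Length => _length;

    public char this[int index]
    {
        get
        {
            if (index < 0 || index >= _length) throw new IndexOutOfRangeException();
            return _buffer[_offset + index];
        }
    }

    // No characters are copied: the result points into the same buffer.
    public StringSlice Substring(int startIndex)
    {
        if (startIndex < 0 || startIndex > _length)
            throw new ArgumentOutOfRangeException(nameof(startIndex));
        return new StringSlice(_buffer, _offset + startIndex, _length - startIndex);
    }

    public bool SharesBufferWith(StringSlice other) => ReferenceEquals(_buffer, other._buffer);

    public override string ToString() => new string(_buffer, _offset, _length);
}
```

In this sketch, new StringSlice("shark").Substring(1) yields "hark" without copying, but every slice costs two objects (the StringSlice and its char[]), which is the drawback named for that option.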

BobTurbo · Answer · 2011-06-08 05:34:55Z

0

I believe these are CLR optimisations that have nothing to do with programmers as you shouldn't be doing the things you are doing. You should assume it is a new string every time (as a programmer).

answered Jun 8, 2011 at 5:34 by BobTurbo

2 Comments

Dan Tao Over a year ago

Well, sure... I never said anything about should. I'm just curious, from a technical standpoint, why this decision was made. I think Guffa and Joe have given some great reasons.

Guffa Over a year ago

You are right that these are details that you shouldn't normally bother yourself with. However, there is still value in discussing how the internals of the language are constructed, for the sake of gaining a better understanding of how it's meant to be used, so that you can avoid things that are inherently ineffective.

vityanya · Answer · 2011-06-08 05:47:14Z

0

After reviewing the Substring method with Reflector, I found that if you pass 0 as the start index (so that the requested length equals the length of the whole string), it returns the same object.

[SecurityCritical]
private unsafe string InternalSubString(int startIndex, int length, bool fAlwaysCopy)
{
    if (((startIndex == 0) && (length == this.Length)) && !fAlwaysCopy)
    {
        return this;
    }
    string str = FastAllocateString(length);
    fixed (char* chRef = &str.m_firstChar)
    {
        fixed (char* chRef2 = &this.m_firstChar)
        {
            wstrcpy(chRef, chRef2 + startIndex, length);
        }
    }
    return str;
}
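The fast path in the decompiled method can be checked without unsafe code by comparing references. This is a small sketch; SubstringIdentityDemo and ReturnsSameObject are names invented for the example.

```csharp
using System;

public static class SubstringIdentityDemo
{
    // True when Substring hands back the very same object instead of a copy.
    public static bool ReturnsSameObject(string source, int startIndex, int length)
    {
        return ReferenceEquals(source, source.Substring(startIndex, length));
    }
}
```

For "shark", a start index of 0 with length 5 returns the same object, while (1, 4) or (0, 4) produce newly allocated strings.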

answered Jun 8, 2011 at 5:47 by vityanya

2 Comments

Dan Tao Over a year ago

Yeah... this is basically what I was trying to show with my first example. The question is why when you pass a non-zero value, the string object returned does not share the same char values in memory with the original.

vityanya Over a year ago

maybe this link can help stackoverflow.com/questions/636932/…

Stuart · Answer · 2011-06-08 06:31:37Z

0

This would add complexity (or at least more smarts) to the intern table. Imagine you already have two entries in the intern table, "pending" and "bending", and the following code:

var x = "pending";
var y = x.Substring(1);

Which entry in the intern table would be considered a hit?

answered Jun 8, 2011 at 6:31 by Stuart

1 Comment

Guffa Over a year ago

Neither. Strings created at runtime are not automatically interned.
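A short sketch of what that means in practice: two substrings with equal content are separate objects until they are interned explicitly with string.Intern. The InternDemo class and MakeAtRuntime method are hypothetical names used only for this example.

```csharp
using System;

public static class InternDemo
{
    // Builds "ending" at run time; every call allocates a new string object.
    public static string MakeAtRuntime()
    {
        return "pending".Substring(1);
    }

    // Equal content, but the two results are distinct objects.
    public static bool RuntimeStringsAreDistinct()
    {
        string a = MakeAtRuntime();
        string b = MakeAtRuntime();
        return a == b && !ReferenceEquals(a, b);
    }

    // Explicit interning maps equal content to a single shared reference.
    public static bool InternedStringsAreShared()
    {
        string a = string.Intern(MakeAtRuntime());
        string b = string.Intern(MakeAtRuntime());
        return ReferenceEquals(a, b);
    }
}
```

Both methods return true: the intern table is consulted only when string.Intern is called, so Substring never has to search it.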
