40

As we all know, strings in .NET are immutable. (Well, not 100% totally immutable, but immutable by design and used as such by any reasonable person, anyway.)

This makes it basically OK that, for example, the following code just stores a reference to the same string in two variables:

string x = "shark";
string y = x.Substring(0);

// Proof (needs an unsafe context, i.e. compile with /unsafe):
fixed (char* c = y)
{
    c[4] = 'p';
}

Console.WriteLine(x);
Console.WriteLine(y);

The above outputs:

sharp
sharp

Clearly x and y refer to the same string object. So here's my question: why wouldn't Substring always share state with the source string? A string is essentially a char* pointer with a length, right? So it seems to me the following should at least in theory be allowed to allocate a single block of memory to hold 5 characters, with two variables simply pointing to different locations within that (immutable) block:

string x = "shark";
string y = x.Substring(1);

// Does c[0] point to the same location as x[1]?
fixed (char* c = y)
{
    c[0] = 'p';
}

// Apparently not...
Console.WriteLine(x);
Console.WriteLine(y);

The above outputs:

shark
park
9
  • 1
    Substring creates a new instance of the base string, doesn't it? Commented Jun 8, 2011 at 5:28
  • From the Substring documentation: "This method does not modify the value of the current instance. Instead, it returns a new string that begins at the startIndex position in the current string." I would say it should never behave as in your first example. If you use Substring, you should expect it to create a different instance for any further modification. Commented Jun 8, 2011 at 5:30
  • Just to ask...do you really expect anything to work when you're sneaking around class invariants? Commented Jun 8, 2011 at 5:31
  • Related: msdn.microsoft.com/en-us/library/system.string.intern.aspx Commented Jun 8, 2011 at 5:44
  • 1
    Why doesn't the .net framework store all permutations of the alphabet in memory and we just reference a pointer to the part we need? :-) Commented Jun 8, 2011 at 5:53

7 Answers

28

For two reasons:

  • The string metadata (e.g. the length) is stored in the same memory block as the characters. Allowing one string to use part of another string's character data would mean allocating two memory blocks for most strings instead of one. As most strings are not substrings of other strings, that extra allocation would cost more memory than you could gain by reusing parts of strings.

  • There is an extra NUL character stored after the last character of the string, to make the string also usable by system functions that expect a null terminated string. You can't put that extra NUL character after a substring that is part of another string.
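The second point can be checked directly. Below is a minimal sketch, assuming an unsafe context (compile with /unsafe, as the question's snippets also require); the names TerminatorDemo and CharAfterEnd are made up for illustration. It reads the character one past the end of a pinned string, which the C# language specification says is a null terminator for a pointer obtained by fixing a string.

```csharp
using System;

static class TerminatorDemo
{
    // Returns the char stored immediately after the last character of s.
    // Hypothetical helper; s must not be null.
    public static unsafe char CharAfterEnd(string s)
    {
        fixed (char* p = s)
        {
            return p[s.Length];
        }
    }

    static void Main()
    {
        // True if a NUL follows the five characters of "shark".
        Console.WriteLine(CharAfterEnd("shark") == '\0');

        // Length comes from the stored length, not from scanning for NUL,
        // so an embedded NUL does not end the string.
        Console.WriteLine("abc\0def".Length); // 7
    }
}
```

A substring that pointed into the middle of "shark" could not offer the same guarantee, because the slot after its last character would hold the next character of the source string.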


2 Comments

I suspected there would be some very good reasons for this; and sure enough, there are! Thanks for the insight.
Somewhat of a disappointing technical reason to lose out on potentially pretty nice CoW benefits in the core design. Interestingly, of course, the trailing NUL part is rather moot with the single-parameter substring call in the question examples, I presume.
12

C# 'strings are both null-terminated and length-prefixed' - while this is an implementation detail that shouldn't concern managed consumers, there are some cases (e.g. marshaling) where it's important.

Also, if a substring shared a buffer with a much longer string, a reference to the short substring would prevent the longer string from being collected. It would also open up the possibility of a rat's nest of string references that all refer to the same buffer.
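To make the retention problem concrete, here is a rough sketch of what a buffer-sharing substring could look like. SharedSubstring is a hypothetical type invented for this illustration, not anything in .NET. The point is only that the small view has to hold a reference to the entire source string.

```csharp
using System;

// Hypothetical type, not part of .NET: a substring that borrows
// the characters of its source instead of copying them.
sealed class SharedSubstring
{
    public string Source { get; }   // keeps the WHOLE source string reachable
    public int Offset { get; }
    public int Length { get; }

    public SharedSubstring(string source, int offset, int length)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (offset < 0 || length < 0 || offset > source.Length - length)
            throw new ArgumentOutOfRangeException();
        Source = source;
        Offset = offset;
        Length = length;
    }

    public char this[int index]
    {
        get
        {
            if (index < 0 || index >= Length) throw new IndexOutOfRangeException();
            return Source[Offset + index];
        }
    }

    // Copies the characters out into an ordinary string.
    public override string ToString() => Source.Substring(Offset, Length);
}

static class RetentionDemo
{
    static void Main()
    {
        string big = new string('x', 1000000) + "shark";
        var view = new SharedSubstring(big, 1000000, 5);
        big = null; // our own reference is gone...

        Console.WriteLine(view.Length);        // 5
        Console.WriteLine(view.Source.Length); // 1000005 (...but the view still holds all of it)
        Console.WriteLine(view);               // shark
    }
}
```

As long as view is reachable, the garbage collector cannot reclaim the million-character buffer, even though only five characters are of interest.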

3 Comments

This was also a great answer; thanks! Makes perfect sense after considering those points.
C# strings are NOT null terminated and it's very easy to prove that. "abc\0def".Length is 7 and not 3 (what it would be if they were null terminated)
@wischi - What I meant by "null terminated" is that I think there is a null ('\0') character following the string's characters in the underlying memory buffer. Not that it is "null terminated" in the classic C sense, i.e. the string is terminated by the first null character it contains in its buffer. Guffa's answer says the same thing, but more clearly, and is rightly the accepted answer.
6

To add to the other answers:

Apparently, the Java standard classes do this: The string returned by String.substring() reuses the internal character array of the original string (see the JDK sources by Sun).

The problem is that this means that the original String cannot be GCed until all the substrings are eligible for GC as well (as they share the backing character array). This can lead to wasted memory if you start out with a large string, and extract some smaller strings out of it, then discard the big string. That would be common when parsing an input file, for example.

Of course, a clever GC might work around this by copying the character array when it is worth it (the Sun JVM may do this, I don't know), but the added complexity might be a reason not to implement this sharing behaviour at all.

2 Comments

+1 to avoiding the added complexity. This is something that's been on my mind a lot lately: I think in many cases I prefer the "dumb, obvious" solution over clever, less easily provable ideas, more so than I used to.
@Dan Tao: Yes, just my thoughts. "Clever" is often something bad when programming.
1

There are a number of ways something like String could be implemented:

  1. Have a "String" object effectively contain an array, with the implication that all characters in the array are in the string. This is what .net actually does.
  2. Have every "String" be a class which contains an array reference along with a starting offset and length. Problem: Creating most strings would require instantiating two objects rather than one.
  3. Have every "String" be a structure which contains an array reference along with a starting offset and length. Problem: Assignments to string type fields would no longer be atomic.
  4. Have two or more types of "String" objects--those which contain all the characters in an array, and those which contain a reference to another string along with an offset and length. Problem: This would require many methods of string to be virtual.
  5. Have every "String" be a special class which includes a starting offset and length, an object reference to what may or may not be the same object, and a built-in array of characters. This would waste a little space in the common case where a string contains all of its own characters, but would allow the same code to work with strings that contain their own characters or strings that 'borrow' from others.
  6. Have a general-purpose ImmutableArray<T> type (which would inherit ReadableArray<T>), and have an ImmutableArray<Char> be interchangeable with String. There are many uses for immutable arrays; String is probably the most common usage case, but hardly the only one.
  7. Have a general-purpose ImmutableArray<T> type as above, but also an ImmutableArraySegment<T> class, both inheriting from ImmutableArrayBase<T>. This would require many methods to be virtual, and would probably be my favorite possibility.

Note that most of these approaches have significant limitations in at least some usage scenarios.
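For what it's worth, later versions of .NET (.NET Core 2.1 and up, or the System.Memory package) ship a type that looks a lot like option 3: ReadOnlyMemory<char> is a structure carrying a reference plus an offset and a length, and it sits alongside String rather than replacing it. A minimal sketch, assuming one of those runtimes; SliceDemo and Tail are names made up for the example:

```csharp
using System;

static class SliceDemo
{
    // Hypothetical helper: a view over s starting at 'start'; no characters are copied.
    public static ReadOnlyMemory<char> Tail(string s, int start) => s.AsMemory(start);

    static void Main()
    {
        string x = "shark";
        ReadOnlyMemory<char> tail = Tail(x, 1); // shares x's characters

        Console.WriteLine(tail.Length);  // 4
        Console.WriteLine(tail.Span[0]); // h

        // Turning the view back into a string is where the copy happens.
        string y = tail.ToString();
        Console.WriteLine(y);            // hark
    }
}
```

The trade-offs listed above still apply: the view keeps the whole source string alive, and being a multi-field structure it cannot be assigned atomically.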

Comments

0

I believe these are CLR optimisations that have nothing to do with programmers as you shouldn't be doing the things you are doing. You should assume it is a new string every time (as a programmer).

2 Comments

Well, sure... I never said anything about should. I'm just curious, from a technical standpoint, why this decision was made. I think Guffa and Joe have given some great reasons.
You are right that these are details that you shouldn't normally bother yourself with. However, there is still value in discussing how the internals of the language are constructed, for the sake of gaining a better knowledge of how it's meant to be used, so that you can avoid things that are inherently ineffective.
0

After reviewing the Substring method with Reflector, I figured out that if you pass 0 as the start index (so that the requested length equals the length of the whole string), it will return the same object.

[SecurityCritical]
private unsafe string InternalSubString(int startIndex, int length, bool fAlwaysCopy)
{
    if (((startIndex == 0) && (length == this.Length)) && !fAlwaysCopy)
    {
        return this;
    }
    string str = FastAllocateString(length);
    fixed (char* chRef = &str.m_firstChar)
    {
        fixed (char* chRef2 = &this.m_firstChar)
        {
            wstrcpy(chRef, chRef2 + startIndex, length);
        }
    }
    return str;
}
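The same thing can be observed without pointers by comparing references. A small sketch (IdentityDemo is just an illustrative name); the first two results depend on the runtime keeping the shortcut shown in the decompiled code above:

```csharp
using System;

static class IdentityDemo
{
    static void Main()
    {
        string x = "shark";

        // Whole-string requests: the decompiled check above returns 'this',
        // so these are expected to be the very same object as x.
        Console.WriteLine(object.ReferenceEquals(x, x.Substring(0)));
        Console.WriteLine(object.ReferenceEquals(x, x.Substring(0, x.Length)));

        // A proper substring is always a freshly allocated copy.
        Console.WriteLine(object.ReferenceEquals(x, x.Substring(1))); // False
        Console.WriteLine(x.Substring(1) == "hark");                  // True
    }
}
```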

2 Comments

Yeah... this is basically what I was trying to show with my first example. The question is why when you pass a non-zero value, the string object returned does not share the same char values in memory with the original.
0

This would add complexity (or at least more smarts) to the intern table. Imagine you already have two entries in the intern table "pending" and "bending" and the following code:

var x = "pending";
var y = x.Substring(1);

Which entry in the intern table would be considered a hit?

1 Comment

Neither. Strings created at runtime are not automatically interned.
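A quick way to poke at this, as a sketch only: string.IsInterned returns null when a string is not in the intern pool, and string.Intern adds it. InternDemo is a made-up name, and I have left out expected output because whether a given literal ends up in the pool can depend on the runtime and on how the assembly was compiled.

```csharp
using System;

static class InternDemo
{
    static void Main()
    {
        string x = "pending";      // a literal, normally interned by the runtime
        string y = x.Substring(1); // built at run time

        // Is each string currently in the intern pool?
        Console.WriteLine(string.IsInterned(x) != null);
        Console.WriteLine(string.IsInterned(y) != null);

        // Interning is opt-in for runtime strings: Intern returns the pooled
        // instance, which is y itself if no equal string was pooled before.
        string z = string.Intern(y);
        Console.WriteLine(object.ReferenceEquals(y, z));
        Console.WriteLine(z);
    }
}
```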
