I am reading the documentation on indexing Delphi strings:

http://docwiki.embarcadero.com/RADStudio/Tokyo/en/String_Types_(Delphi)

One passage says:

You can index a string variable just as you would an array. If S is a non-UnicodeString string variable and i, an integer expression, S[i] represents the ith byte in S, which may not be the ith character or an entire character at all for a multibyte character string (MBCS). Similarly, indexing a UnicodeString variable results in an element that may not be an entire character. If the string contains characters in the Basic Multilingual Plane (BMP), all characters are 2 bytes, so indexing the string gets characters. However, if some characters are not in the BMP, an indexed element may be a surrogate pair - not an entire character.

If I understand correctly, S[i] indexes the i-th byte of the string. If S is a UnicodeString, then S[1] is the first byte of the first character, S[2] is the second byte of the first character, S[3] is the first byte of the second character, and so on. If that is the case, how do I index a character rather than a byte inside a string? I need to index characters, not bytes.

  • No, a Unicode "character" in Delphi is two bytes, and if S is a string (=UnicodeString in Delphi 2009 or later), S[i] is such a two-byte "character". But only Unicode characters in the BMP can be represented as such a two-byte unit, so S[i] might indeed be only one of the two parts in a surrogate pair. Commented Oct 24, 2018 at 9:22
  • (In the vast majority of all applications, you only need the BMP. It contains tens of thousands of characters. I don't know your application, though.) Commented Oct 24, 2018 at 9:25
  • See Detecting and Retrieving codepoints and surrogates from a Delphi String. Commented Oct 24, 2018 at 9:26
  • So in a simple string like "Test ∫⌬dx ᚭᛘᚠ ቚ꡵씒ᱶⵞꮙ៚ㆯ", S[i] is the complete character. Commented Oct 24, 2018 at 9:33
  • Please, when adding tags, add the correct one. Do not tag with delphi-xe2 but with delphi-xe3 since you actually are using Delphi XE3. Commented Oct 24, 2018 at 10:54

1 Answer

In Delphi, S[i] is a Char, aka WideChar. But this is not a Unicode "character"; it is a UTF-16 code unit, 16 bits (2 bytes) wide. In the previous century, i.e. until 1996, Unicode was 16-bit, but that is no longer the case! Please read the Unicode FAQ carefully.

You may need two WideChars (a surrogate pair) to form a whole Unicode code point, which is more or less what we usually call a "character". And even that may be wrong if combining diacritics are used, since one visible character can then span several code points.
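As a rough illustration of the point above, here is a minimal sketch that walks a string one code point at a time instead of one WideChar at a time. The helper names (`IsHighSurrogateUnit`, `IsLowSurrogateUnit`, `NextCodePoint`) are made up for this example and are not part of the RTL. The sketch assumes 1-based string indexing, as on the desktop compilers, and it does not handle combining diacritics.

```delphi
program CodePointsDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: True if C is a high (leading) surrogate, U+D800..U+DBFF.
function IsHighSurrogateUnit(C: Char): Boolean;
begin
  Result := (Ord(C) >= $D800) and (Ord(C) <= $DBFF);
end;

// Hypothetical helper: True if C is a low (trailing) surrogate, U+DC00..U+DFFF.
function IsLowSurrogateUnit(C: Char): Boolean;
begin
  Result := (Ord(C) >= $DC00) and (Ord(C) <= $DFFF);
end;

// Hypothetical helper: returns the code point that starts at S[Index]
// and advances Index past it (by 1 or 2 code units).
// Assumes 1-based strings and 1 <= Index <= Length(S).
function NextCodePoint(const S: string; var Index: Integer): Integer;
begin
  if IsHighSurrogateUnit(S[Index]) and (Index < Length(S)) and
     IsLowSurrogateUnit(S[Index + 1]) then
  begin
    // Combine the two 10-bit halves and add the $10000 offset.
    Result := $10000 + ((Ord(S[Index]) - $D800) shl 10) +
              (Ord(S[Index + 1]) - $DC00);
    Inc(Index, 2);
  end
  else
  begin
    // BMP character (or an unpaired surrogate, passed through as-is).
    Result := Ord(S[Index]);
    Inc(Index);
  end;
end;

var
  S: string;
  I, CP: Integer;
begin
  // 'A', then U+1F600 written as its surrogate pair, then 'B'.
  S := 'A' + #$D83D#$DE00 + 'B';
  Writeln('Code units: ', Length(S));
  I := 1;
  while I <= Length(S) do
  begin
    CP := NextCodePoint(S, I);
    Writeln('U+', IntToHex(CP, 4));
  end;
end.
```

Here `Length(S)` reports 4 code units, while the loop visits only three code points, so `S[2]` and `S[3]` are the two halves of a single character.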

UTF-16 uses a single 16-bit code unit to encode the most common 63K characters, and a pair of 16-bit code units, called surrogates, to encode the 1M less commonly used characters in Unicode.

Originally, Unicode was designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.)

Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16-bits were not sufficient for the user community. Out of this arose UTF-16.

See the UTF-16 FAQ.
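The surrogate mechanism described in the FAQ excerpt can be sketched in the other direction as well: code points below $10000 fit in one code unit, and everything above is split into two 10-bit halves (1024 × 1024 = 1,048,576 combinations, the "1M" mentioned above). `CodePointToUtf16` is a hypothetical helper written for this example, not an RTL function, and it assumes a valid, non-surrogate code point in the range 0..$10FFFF.

```delphi
program SurrogateEncodeDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: encodes one code point as a UTF-16 string.
// Assumes CP is in 0..$10FFFF and is not itself a surrogate value.
function CodePointToUtf16(CP: Integer): string;
begin
  if CP < $10000 then
    Result := Char(CP)  // BMP: a single 16-bit code unit
  else
  begin
    Dec(CP, $10000);    // leaves a 20-bit value
    // Upper 10 bits go into the high surrogate, lower 10 bits into the low one.
    Result := Char($D800 + (CP shr 10)) + Char($DC00 + (CP and $3FF));
  end;
end;

var
  S: string;
begin
  S := CodePointToUtf16($1F600);  // outside the BMP
  Writeln(Length(S));
  Writeln(IntToHex(Ord(S[1]), 4), ' ', IntToHex(Ord(S[2]), 4));
  Writeln(Length(CodePointToUtf16($222B)));  // U+222B is in the BMP
end.
```

For U+1F600 this yields the two code units D83D and DE00, while U+222B (the integral sign from the comment thread) stays a single code unit.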

For proper decoding of Unicode code points in Delphi, see Detecting and Retrieving codepoints and surrogates from a Delphi String (linked by @LURD in the comments).
