Index character instead of byte in the Delphi string

Question

I am reading the documentation on indexing Delphi strings:

http://docwiki.embarcadero.com/RADStudio/Tokyo/en/String_Types_(Delphi)

One passage says:

You can index a string variable just as you would an array. If S is a non-UnicodeString string variable and i, an integer expression, S[i] represents the ith byte in S, which may not be the ith character or an entire character at all for a multibyte character string (MBCS). Similarly, indexing a UnicodeString variable results in an element that may not be an entire character. If the string contains characters in the Basic Multilingual Plane (BMP), all characters are 2 bytes, so indexing the string gets characters. However, if some characters are not in the BMP, an indexed element may be a surrogate pair - not an entire character.

If I understand correctly, S[i] indexes the i-th byte of the string. If S is a UnicodeString, then S[1] is the first byte of the first character, S[2] is the second byte of the first character, S[3] is the first byte of the second character, and so on. If that is the case, how do I index a character instead of a byte inside a string? I need to index characters, not bytes.

No, a Unicode "character" in Delphi is two bytes, and if S is a string (=UnicodeString in Delphi 2009 or later), S[i] is such a two-byte "character". But only Unicode characters in the BMP can be represented as such a two-byte unit, so S[i] might indeed be only one of the two parts in a surrogate pair.
– Andreas Rejbrand, Commented Oct 24, 2018 at 9:22
(In the vast majority of all applications, you only need the BMP. It contains tens of thousands of characters. I don't know your application, though.)
– Andreas Rejbrand, Commented Oct 24, 2018 at 9:25
See Detecting and Retrieving codepoints and surrogates from a Delphi String.
– LU RD, Commented Oct 24, 2018 at 9:26
So in a simple string like "Test ∫⌬dx ᚭᛘᚠ ቚ꡵씒ᱶⵞꮙ៚ㆯ", S[i] is the complete character.
– Andreas Rejbrand, Commented Oct 24, 2018 at 9:33
Please, when adding tags, add the correct one. Do not tag with delphi-xe2 but with delphi-xe3 since you actually are using Delphi XE3.
– Tom Brunberg, Commented Oct 24, 2018 at 10:54

Arnaud Bouchez · Accepted Answer · 2018-10-24 09:42:26Z

In Delphi, S[i] is a Char, also known as WideChar. But this is not a Unicode "character"; it is a UTF-16 code unit of 16 bits (2 bytes). In the previous century, i.e. until 1996, Unicode was 16-bit, but that is no longer the case! Please read the Unicode FAQ carefully.

You may need two WideChar values (a surrogate pair) to form a whole Unicode code point, which is more or less what we usually call a "character". And even this may be wrong if diacritics are used, since one visible character can then span several code points.
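To make the point concrete, here is a minimal console sketch, not taken from the original answer. It assumes a Unicode Delphi (2009 or later) with the default 1-based string indexing, and it writes U+1F600 as its two surrogate code units so the result does not depend on the source file encoding. The program name is arbitrary.

```delphi
program CodeUnitsDemo;
{$APPTYPE CONSOLE}

var
  S: string; // UnicodeString in Delphi 2009 and later
  I: Integer;
begin
  // 'A' followed by U+1F600, spelled out as the surrogate pair $D83D $DE00
  S := 'A' + #$D83D#$DE00;

  // Length counts 16-bit code units, not code points: this prints 3
  Writeln(Length(S));

  // S[I] is one code unit; S[2] and S[3] are the two halves of one code point
  for I := 1 to Length(S) do
    Writeln(I, ': ', Ord(S[I]));
end.
```

The string holds two code points but three indexable elements, so S[2] alone is only half of a character.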

UTF-16 uses a single 16-bit code unit to encode the most common 63K characters, and a pair of 16-bit code units, called surrogates, to encode the 1M less commonly used characters in Unicode.

Originally, Unicode was designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.)

Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16-bits were not sufficient for the user community. Out of this arose UTF-16.

See the UTF-16 FAQ.

For proper decoding of Unicode code points in Delphi, see "Detecting and Retrieving codepoints and surrogates from a Delphi String" (link by @LURD in the comments).
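As a rough illustration of what such decoding involves, the sketch below walks a string code point by code point using only range checks and arithmetic. It is not the code from the linked question: NextCodePoint is a hypothetical helper name, the string is assumed to be 1-based, and unpaired surrogates are simply returned unchanged rather than reported as errors.

```delphi
program CodePointsDemo;
{$APPTYPE CONSOLE}

// Hypothetical helper: returns the code point that starts at S[Index]
// and advances Index past the one or two code units it occupies.
function NextCodePoint(const S: string; var Index: Integer): Cardinal;
var
  HighUnit, LowUnit: Cardinal;
begin
  HighUnit := Ord(S[Index]);
  Inc(Index);
  // A high surrogate ($D800..$DBFF) must be followed by a low one ($DC00..$DFFF)
  if (HighUnit >= $D800) and (HighUnit <= $DBFF) and (Index <= Length(S)) then
  begin
    LowUnit := Ord(S[Index]);
    if (LowUnit >= $DC00) and (LowUnit <= $DFFF) then
    begin
      Inc(Index);
      Result := $10000 + ((HighUnit - $D800) shl 10) + (LowUnit - $DC00);
      Exit;
    end;
  end;
  // BMP code unit, or an unpaired surrogate passed through as-is
  Result := HighUnit;
end;

var
  S: string;
  I, Count: Integer;
begin
  S := 'A' + #$D83D#$DE00 + 'B'; // 'A', U+1F600, 'B'
  I := 1;
  Count := 0;
  while I <= Length(S) do
  begin
    Writeln(NextCodePoint(S, I)); // prints 65, 128512, 66
    Inc(Count);
  end;
  Writeln('code units: ', Length(S), ', code points: ', Count); // 4 and 3
end.
```

This counts code points, not user-perceived characters: a base letter followed by a combining diacritic is still reported as two code points, which is the caveat raised above.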
