
To escape a code point that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".

ECMA-404: The JSON Data Interchange Format

I believe that there is no need to encode this character at all, so it could be represented directly as "𝄞". However, should one wish to encode it, it must, per spec, be encoded as "\uD834\uDD1E", not (as would seem reasonable) as "\u1d11e". Why is this?
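For reference, the surrogate pair in the quoted example can be derived mechanically: subtract 0x10000 from the code point, then add the top 10 of the remaining 20 bits to 0xD800 (the high surrogate) and the bottom 10 bits to 0xDC00 (the low surrogate). A minimal JavaScript sketch; the helper name toSurrogatePair is illustrative, not a standard API:

    // Derive the UTF-16 surrogate pair for a supplementary code point
    // (one above U+FFFF). Illustrative helper, not part of any spec.
    function toSurrogatePair(codePoint) {
      const offset = codePoint - 0x10000;      // 20 significant bits remain
      const high = 0xD800 + (offset >> 10);    // top 10 bits -> high surrogate
      const low  = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits -> low surrogate
      return [high, low];
    }

    const [hi, lo] = toSurrogatePair(0x1D11E);     // U+1D11E, the G clef
    console.log(hi.toString(16), lo.toString(16)); // d834 dd1e
    console.log(String.fromCharCode(hi, lo));      // 𝄞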

  • For hysteric raisins. Sorry, historic reasons :-) And yes, you have the choice of directly using a Unicode code point or using an encoded surrogate pair. Commented Jul 19, 2016 at 16:19
  • The \u escape format is limited to 4 hex digits, and thus can only encode 16-bit values. 0x1D11E does not fit in 16 bits, thus the use of a surrogate pair. This goes back to the days of UCS-2, when all Unicode code points at the time fit nicely in 16 bits. When Unicode outgrew 16 bits, UTF-16 was invented to overcome the encoding limitation, but it still needed to be backwards compatible with existing UCS-2 data and formats. Commented Jul 19, 2016 at 19:41
  • Supporting "\u1d11e" to mean "𝄞" would make it difficult to encode the string "ᴑe" using \u1d11 for the ᴑ. A better question would be why support for a sequence like "\U0001d11e" hasn't been added to the language (such as in Python and C++). Commented Jul 20, 2016 at 0:49
  • @一二三 ECMAScript 6 added a \u{1d11e} syntax (similar to Perl). Commented Jul 24, 2016 at 13:44
  • ECMAScript 6 has neat syntax, but JSON is already widely implemented; allowing a new syntax would require upgrading all parsers, which is not worth the pain. Commented Sep 4, 2017 at 14:56
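To illustrate the last comment's point: a surrogate-pair escape is valid JSON for every parser, while the ES6 \u{...} form is not part of the JSON grammar, so existing parsers reject it. A quick check in any modern JavaScript runtime:

    // The four-hex-digit escape form is valid JSON:
    console.log(JSON.parse('"\\uD834\\uDD1E"')); // 𝄞

    // The ES6 \u{...} form is not in the JSON grammar and fails to parse:
    try {
      JSON.parse('"\\u{1D11E}"');
    } catch (e) {
      console.log(e.name); // SyntaxError
    }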

1 Answer

One of the key architectural features of JSON is that JSON-encoded objects are valid JavaScript literals that can be evaluated using the eval function, for example. Unfortunately, older JavaScript implementations only support 16-bit Unicode escape sequences with four hex characters in string literals, so the only portable way to escape code points above 0xFFFF is with UTF-16 surrogate pairs. (The \u{...} syntax that allows arbitrary code points was only introduced in ECMAScript 6.)

But as you mentioned, there's no need to use escape sequences if your application supports Unicode JSON text. Simply encode the characters directly, in whatever Unicode encoding the JSON text uses.
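A quick demonstration of that equivalence in JavaScript (assuming the JSON text itself is in a Unicode encoding such as UTF-8):

    // The character may appear unescaped in JSON text; both spellings
    // parse to the same string of two UTF-16 code units:
    const direct  = JSON.parse('"𝄞"');
    const escaped = JSON.parse('"\\uD834\\uDD1E"');
    console.log(direct === escaped); // true
    console.log(direct.length);      // 2 (one code point, two code units)

    // JSON.stringify also leaves well-formed pairs unescaped by default:
    console.log(JSON.stringify("𝄞")); // "𝄞" (including the quotes)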
