
To escape a code point that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".

ECMA-404: The JSON Data Interchange Format

I believe that there is no need to encode this character at all, so it could be represented directly as "𝄞". However, should one wish to encode it, it must, per spec, be encoded as "\uD834\uDD1E", not (as would seem reasonable) as "\u1d11e". Why is this?
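For reference, the surrogate pair in the quoted example can be derived mechanically: subtract 0x10000 from the code point, then add the top 10 of the remaining 20 bits to 0xD800 (the high surrogate) and the bottom 10 bits to 0xDC00 (the low surrogate). A minimal JavaScript sketch; the helper name toSurrogatePair is illustrative, not a standard API:

    // Derive the UTF-16 surrogate pair for a supplementary code point
    // (one above U+FFFF). Illustrative helper, not part of any spec.
    function toSurrogatePair(codePoint) {
      const offset = codePoint - 0x10000;      // 20 significant bits remain
      const high = 0xD800 + (offset >> 10);    // top 10 bits -> high surrogate
      const low  = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits -> low surrogate
      return [high, low];
    }

    const [hi, lo] = toSurrogatePair(0x1D11E);     // U+1D11E, the G clef
    console.log(hi.toString(16), lo.toString(16)); // d834 dd1e
    console.log(String.fromCharCode(hi, lo));      // 𝄞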

  • For hysteric raisins. Sorry, historic reasons :-) And yes, you have the choice of directly using a Unicode code point or using an encoded surrogate pair. Commented Jul 19, 2016 at 16:19
  • The \u escape format is limited to 4 hex digits, and thus can only encode 16-bit values. 0x1D11E does not fit in 16 bits, thus the use of a surrogate pair. This goes back to the days of UCS-2, when all Unicode code points at the time fit nicely in 16 bits. When Unicode outgrew 16 bits, UTF-16 was invented to overcome the encoding limitation, but it still needed to be backwards compatible with existing UCS-2 data and formats. Commented Jul 19, 2016 at 19:41
  • Supporting "\u1d11e" to mean "𝄞" would make it difficult to encode the string "ᴑe" using \u1d11 for the ᴑ. A better question would be why support for a sequence like "\U0001d11e" hasn't been added to the language (such as in Python and C++). Commented Jul 20, 2016 at 0:49
  • @一二三 ECMAScript 6 added a \u{1d11e} syntax (similar to Perl). Commented Jul 24, 2016 at 13:44
  • ECMAScript 6 has neat syntax, but JSON is already widely implemented; allowing a new syntax would require upgrading all parsers, which is not worth the pain. Commented Sep 4, 2017 at 14:56
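To illustrate the last comment's point: a surrogate-pair escape is valid JSON for every parser, while the ES6 \u{...} form is not part of the JSON grammar, so existing parsers reject it. A quick check in any modern JavaScript runtime:

    // The four-hex-digit escape form is valid JSON:
    console.log(JSON.parse('"\\uD834\\uDD1E"')); // 𝄞

    // The ES6 \u{...} form is not in the JSON grammar and fails to parse:
    try {
      JSON.parse('"\\u{1D11E}"');
    } catch (e) {
      console.log(e.name); // SyntaxError
    }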

1 Answer

One of the key architectural features of JSON is that JSON-encoded objects are valid JavaScript literals that can be evaluated using the eval function, for example. Unfortunately, older JavaScript implementations only support 16-bit Unicode escape sequences with four hex characters in string literals, so the only portable way to escape code points above 0xFFFF is with UTF-16 surrogate pairs. (The \u{...} syntax that allows arbitrary code points was only introduced in ECMAScript 6.)

But as you mentioned, there's no need to use escape sequences if your application supports Unicode JSON text. Simply encode the characters directly, in whatever Unicode encoding the JSON text uses.
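A quick demonstration of that equivalence in JavaScript (assuming the JSON text itself is in a Unicode encoding such as UTF-8):

    // The character may appear unescaped in JSON text; both spellings
    // parse to the same string of two UTF-16 code units:
    const direct  = JSON.parse('"𝄞"');
    const escaped = JSON.parse('"\\uD834\\uDD1E"');
    console.log(direct === escaped); // true
    console.log(direct.length);      // 2 (one code point, two code units)

    // JSON.stringify also leaves well-formed pairs unescaped by default:
    console.log(JSON.stringify("𝄞")); // "𝄞" (including the quotes)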
