JavaScript strings - UTF-16 vs UCS-2?

Question

I've read in some places that JavaScript strings are UTF-16, and in other places they're UCS-2. I did some searching around to try to figure out the difference and found this:

Q: What is the difference between UCS-2 and UTF-16?

A: UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1, before surrogate code points and UTF-16 were added to Version 2.0 of the standard. This term should now be avoided.

UCS-2 does not define a distinct data format, because UTF-16 and UCS-2 are identical for purposes of data exchange. Both are 16-bit, and have exactly the same code unit representation.

Sometimes in the past an implementation has been labeled "UCS-2" to indicate that it does not support supplementary characters and doesn't interpret pairs of surrogate code points as characters. Such an implementation would not handle processing of character properties, code point boundaries, collation, etc. for supplementary characters.

via: http://www.unicode.org/faq/utf_bom.html#utf16-11

So my question is: do some people consider JavaScript strings UCS-2 because the string object's methods and indexes act on 16-bit data values instead of characters? And if so, would a JavaScript string object oriented around characters instead of 16-bit data chunks be considered UTF-16? Or is there something else I'm missing?

Edit: As requested, here are some sources saying JavaScript strings are UCS-2:

http://blog.mozilla.com/nnethercote/2011/07/01/faster-javascript-parsing/ http://terenceyim.wordpress.com/tag/ucs2/

EDIT: For anyone who may come across this, be sure to check out this link:

http://mathiasbynens.be/notes/javascript-encoding

AmigoJack · Accepted Answer · 2022-10-29 02:58:28Z

20

JavaScript (strictly speaking, ECMAScript) pre-dates Unicode 2.0, so in some cases you may find references to UCS-2 simply because that was correct at the time the reference was written. Can you point us to specific citations of JavaScript being "UCS-2"?

Specifications for ECMAScript versions 3 and 5 at least both explicitly declare a String to be a collection of unsigned 16-bit integers and that if those integer values are meant to represent textual data, then they are UTF-16 code units. See

section 8.4 of the ECMAScript Language Specification in version 5.1
or section 6.1.4 in version 13.0.

EDIT: I'm no longer sure my answer is entirely correct. See the excellent article mentioned above, which in essence says that while a JavaScript engine may use UTF-16 internally, and most do, the language itself effectively exposes those characters as if they were UCS-2.
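To make the distinction concrete, here is a minimal sketch (not taken from the spec or the linked article) that builds a string from two raw 16-bit values and inspects it both ways. The variable names are only illustrative; `codePointAt` and string iteration assume an ES2015 or later engine.

```javascript
// A string built from raw 16-bit values; nothing checks that they form valid UTF-16.
const grin = String.fromCharCode(0xD83D, 0xDE00); // surrogate pair for U+1F600

// Code-unit view: what length, indexing and charCodeAt expose.
const unitCount = grin.length;          // 2
const firstUnit = grin.charCodeAt(0);   // 0xD83D, a high surrogate
const secondUnit = grin.charCodeAt(1);  // 0xDE00, a low surrogate

// Code-point view: available through codePointAt and the string iterator.
const codePoint = grin.codePointAt(0);  // 0x1F600
const pointCount = [...grin].length;    // 1

console.log(unitCount, firstUnit.toString(16), secondUnit.toString(16));
console.log(codePoint.toString(16), pointCount);
```

The older methods report two independent 16-bit values, which is the behaviour people tend to label "UCS-2", while the newer ones pair the surrogates as UTF-16 would.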

edited Oct 29, 2022 at 2:58

AmigoJack


answered Jan 3, 2012 at 17:39

dgvid



3 Comments

patorjk Over a year ago

Thank you for the link, the language of the spec seems pretty clear. I think then that UCS-2 talk is either old or based on the method and indexing support for surrogate pairs.

Jay Freeman -saurik- Over a year ago

So, the specification states "Each integer value in the sequence usually represents a single 16-bit unit of UTF-16 text. However, ECMAScript does not place any restrictions or requirements on the values except that they must be 16-bit unsigned integers.", which is equivalent to saying that in modern C programs each character value in a character array "usually" represents a single 8-bit unit of UTF-8 text, but obviously stating that C strings "are" UTF-8 would be wrong. The semantics JavaScript provides are only UCS-2; if you want UTF-16 support you must do so yourself, as per DMoses's answer.

Philip Over a year ago

UCS is the thing with the numbers, and yes UCS 2 is outdated, the current version is UCS 4. UTF-8/-16/-32 are ways to represent arrays of UCS thingies in bits. ;)

katspaugh · Accepted Answer · 2012-01-03 18:19:54Z

6

It's UTF-16/UCS-2. It can handle surrogate pairs, but charAt/charCodeAt return a 16-bit code unit and not the Unicode code point. If you want to have it handle surrogate pairs, I suggest a quick read through this.
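As a rough illustration of handling surrogate pairs yourself (this is a sketch, not the code from the linked article), a helper can combine two code units read with `charCodeAt`. The name `codePointFromUnits` is hypothetical.

```javascript
// Hypothetical helper: decode the code point that starts at code-unit index i.
function codePointFromUnits(str, i) {
  const hi = str.charCodeAt(i);
  // High surrogates occupy 0xD800..0xDBFF.
  if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < str.length) {
    const lo = str.charCodeAt(i + 1);
    // Low surrogates occupy 0xDC00..0xDFFF.
    if (lo >= 0xDC00 && lo <= 0xDFFF) {
      // Each surrogate carries 10 bits of the value above 0x10000.
      return (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
    }
  }
  return hi; // BMP character, or an unpaired surrogate returned as-is
}

const sample = "a😀";
console.log(sample.charCodeAt(1).toString(16));           // d83d
console.log(codePointFromUnits(sample, 1).toString(16));  // 1f600
```

On ES2015 and later engines the built-in `codePointAt` does the same job, so a helper like this is mainly of interest for older environments.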

edited Jan 3, 2012 at 18:19

katspaugh


answered Jan 3, 2012 at 17:25

Daniel Moses


3 Comments

cubuspl42 Over a year ago

What do you mean by "it can handle surrogate pairs"?

Daniel Moses Over a year ago

If you read the article linked it will describe how to have it handle surrogate pairs. My point is that it doesn't error out by default, and there are ways to handle surrogate pairs as shown in the code on the link provided.

doug65536 Over a year ago

@cubuspl42 UTF-16 isn't limited to 0x0-0xFFFF; it can use pairs of 16-bit code units to represent the entire Unicode range from 0x0 to 0x10FFFF, over a million code points. These pairs are called "surrogate pairs".
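For anyone curious how such a pair is formed, here is a small sketch of the standard UTF-16 arithmetic for a code point above 0xFFFF. The function name `toSurrogatePair` is made up for this example.

```javascript
// Hypothetical helper: split a supplementary code point into its two UTF-16 code units.
function toSurrogatePair(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("expected a supplementary code point (0x10000..0x10FFFF)");
  }
  const offset = codePoint - 0x10000;     // a 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F4A9); // 💩
console.log(high.toString(16), low.toString(16)); // d83d dca9
console.log(String.fromCharCode(high, low) === String.fromCodePoint(0x1F4A9)); // true
```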

Daniel A. White · Accepted Answer · 2012-01-03 17:17:55Z

3

It's just a 16-bit value with no encoding specified in the ECMAScript standard.

See section 7.8.4 String Literals in this document: http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-262.pdf

answered Jan 3, 2012 at 17:17

Daniel A. White


1 Comment

Константин Ван Over a year ago

This is still true.

Константин Ван · Accepted Answer · 2023-05-26 15:06:08Z

You need to differentiate how it is stored and how it is interpreted.

In Javascript, a string is a sequence of 16-bit unsigned integers that is, usually but not necessarily, interpreted as a UTF-16-encoded character sequence. It is encodingless, and your code, standard Javascript methods, or REPL terminals, may interpret it in whatever encodings they want.

The thirteenth edition of ECMA-262 (ECMAScript® 2022 language specification)

§4.4.20 String value

primitive value that is a finite ordered sequence of zero or more 16-bit unsigned integer values

NOTE A String value is a member of the String type. Each integer value in the sequence usually represents a single 16-bit unit of UTF-16 text. However, ECMAScript does not place any restrictions or requirements on the values except that they must be 16-bit unsigned integers.

Because of this, Javascript strings can contain, with no problems, a value sequence that is invalid in UTF-16, such as lone (“unmatched”) surrogates.

const javascript_string = "\uDF06"; // a lone surrogate
javascript_string.isWellFormed(); // false
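`isWellFormed()` is a fairly recent addition (ES2024), so older engines may not have it. A minimal sketch of the same check done by hand, assuming only `charCodeAt`, could look like this; the name `hasLoneSurrogate` is hypothetical.

```javascript
// Hypothetical fallback: report whether a string contains an unpaired surrogate.
function hasLoneSurrogate(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: it must be followed by a low surrogate.
      const next = str.charCodeAt(i + 1); // NaN when i + 1 is past the end
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low half of a valid pair
        continue;
      }
      return true;
    }
    if (unit >= 0xDC00 && unit <= 0xDFFF) {
      return true; // low surrogate with no preceding high surrogate
    }
  }
  return false;
}

console.log(hasLoneSurrogate("\uDF06"));       // true
console.log(hasLoneSurrogate("\uD834\uDF06")); // false
```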

alextgordon · Accepted Answer · 2022-04-08 15:51:53Z

1

Things have changed since 2012. JavaScript strings are now UTF-16 for real. Yes, the old string methods still work on 16-bit code units, but the language is now aware of UTF-16 surrogates and knows what to do about them if you use the string iterator. There's also Unicode regex support.

// Before
"😀😂💩".length // 6

// Now
[..."😀😂💩"].length // 3
[..."😀😂💩"]  // [ '😀', '😂', '💩' ]
[... "😀😂💩".matchAll(/./ug) ] // 3 matches as above

// Regexes support unicode character classes
"café".normalize("NFD").match(/\p{L}\p{M}/ug) // [ 'é' ]

// Extract code points
[..."😀😂💩"].map(char => char.codePointAt(0).toString(16)) // [ '1f600', '1f602', '1f4a9' ]

answered Apr 8, 2022 at 15:51

alextgordon


2 Comments

AmigoJack Over a year ago

Without at least naming example versions/engines this is not helping in terms of avoiding implementations that couldn't/still can't do this.

Константин Ван Over a year ago

The way you put it is misleading. While it's true that the @@iterator iterates over code points, it is not that JavaScript string literals are stored as code points. The .length is still 6.
