I am looking at this:

function encodeCodePoint(codePoint) {
  if ((codePoint & 0xFFFFFF80) == 0) { // 1-byte sequence
    return stringFromCharCode(codePoint);
  }
  var symbol = '';
  if ((codePoint & 0xFFFFF800) == 0) { // 2-byte sequence
    symbol = stringFromCharCode(((codePoint >> 6) & 0x1F) | 0xC0);
  }
  else if ((codePoint & 0xFFFF0000) == 0) { // 3-byte sequence
    checkScalarValue(codePoint);
    symbol = stringFromCharCode(((codePoint >> 12) & 0x0F) | 0xE0);
    symbol += createByte(codePoint, 6);
  }
  else if ((codePoint & 0xFFE00000) == 0) { // 4-byte sequence
    symbol = stringFromCharCode(((codePoint >> 18) & 0x07) | 0xF0);
    symbol += createByte(codePoint, 12);
    symbol += createByte(codePoint, 6);
  }
  symbol += stringFromCharCode((codePoint & 0x3F) | 0x80);
  return symbol;
}
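
(For context, the helpers this calls aren't shown above; reconstructing from the same library's style, I believe they look roughly like this, so treat them as a sketch:)

var stringFromCharCode = String.fromCharCode;

// Rejects lone surrogates (U+D800..U+DFFF), which are not Unicode scalar values
function checkScalarValue(codePoint) {
  if (codePoint >= 0xD800 && codePoint <= 0xDFFF) {
    throw Error('Lone surrogate U+' + codePoint.toString(16).toUpperCase() +
      ' is not a scalar value');
  }
}

// Builds one continuation byte: 6 payload bits tagged with a leading 10
function createByte(codePoint, shift) {
  return stringFromCharCode(((codePoint >> shift) & 0x3F) | 0x80);
}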

The encode function seems to be taking advantage of the fact that numbers in JavaScript are (I think) 32 bits wide, at least as far as the bitwise operators are concerned. So it does some bit manipulation I am unfamiliar with and gets the encoded value. Same with the decode function:

function decodeSymbol() {
  var byte1;
  var byte2;
  var byte3;
  var byte4;
  var codePoint;

  if (byteIndex > byteCount) {
    throw Error('Invalid byte index');
  }

  if (byteIndex == byteCount) {
    return false;
  }

  // Read first byte
  byte1 = byteArray[byteIndex] & 0xFF;
  byteIndex++;

  // 1-byte sequence (no continuation bytes)
  if ((byte1 & 0x80) == 0) {
    return byte1;
  }

  // 2-byte sequence
  if ((byte1 & 0xE0) == 0xC0) {
    byte2 = readContinuationByte();
    codePoint = ((byte1 & 0x1F) << 6) | byte2;
    if (codePoint >= 0x80) {
      return codePoint;
    } else {
      throw Error('Invalid continuation byte');
    }
  }

  // 3-byte sequence (may include unpaired surrogates)
  if ((byte1 & 0xF0) == 0xE0) {
    byte2 = readContinuationByte();
    byte3 = readContinuationByte();
    codePoint = ((byte1 & 0x0F) << 12) | (byte2 << 6) | byte3;
    if (codePoint >= 0x0800) {
      checkScalarValue(codePoint);
      return codePoint;
    } else {
      throw Error('Invalid continuation byte');
    }
  }

  // 4-byte sequence
  if ((byte1 & 0xF8) == 0xF0) {
    byte2 = readContinuationByte();
    byte3 = readContinuationByte();
    byte4 = readContinuationByte();
    codePoint = ((byte1 & 0x07) << 18) | (byte2 << 12) |
      (byte3 << 6) | byte4;
    if (codePoint >= 0x010000 && codePoint <= 0x10FFFF) {
      return codePoint;
    }
  }

  throw Error('Invalid UTF-8 detected');
}
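
(Likewise, readContinuationByte isn't shown; it presumably checks that the next byte starts with the bits 10 and returns its low 6 payload bits, roughly like this sketch:)

function readContinuationByte() {
  if (byteIndex >= byteCount) {
    throw Error('Invalid byte index');
  }
  var continuationByte = byteArray[byteIndex] & 0xFF;
  byteIndex++;
  // Continuation bytes have the form 10XXXXXX
  if ((continuationByte & 0xC0) == 0x80) {
    return continuationByte & 0x3F; // keep the 6 payload bits
  }
  throw Error('Invalid continuation byte');
}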

Basically, I can't quite read this code and can't really tell what's going on. I'm wondering if someone with better bit-manipulation chops or UTF-8 encoding knowledge could describe, at a high level, what the input and output are for encoding and for decoding, and very roughly how each goes from input to output. I am trying to build a UTF-8 encoder/decoder and don't see exactly how an 8-bit stream is chunked into 1-to-4-byte sequences, partly because JavaScript's 32-bit integer behavior is getting in the way, I think. But to me it seems like this is what happens:

Decoding:

  • We have an 8-bit (1-byte) stream of data.
  • We get a byte
  • We check whether that byte falls within certain ranges (which ranges, I don't know)
  • If it's in some range, then we know one or more extra bytes follow, or something like that.
  • We then collect all the bytes for the character...
  • And in the case of JavaScript, combine them into an integer and then do something like String.fromCharCode(integer).

What I'm missing is how exactly it knows whether a sequence is 1 byte or up to 4 bytes; how does it do that part?

Encoding:

  • This is language/architecture dependent, since some architectures will have integers be 16, 32, or 64 bits (...I'm guessing...).
  • In the case of JavaScript, take the 32-ish-bit integer and do some bit-manipulation magic to extract the 1 to 4 bytes for this character. How does it know how many bytes to produce?
  • Repeat until you have an array of bytes.

I'm wondering if someone could fill in the gaps in my understanding. I'm not looking for every bit-manipulation step, as there are a lot; I'm just looking for answers to the questions I highlighted in my analysis above.

1 Answer

JavaScript's bitwise operators work on 32-bit integers, so you can safely work with 4 × 8 bits (4 bytes) in one single number. That's what your encoder receives as a parameter: a single code point.
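
To see those 32-bit masks in action, here is a small sketch (my own, not part of the library) classifying a sample code point:

  var cp = 0x20AC; // '€' (EURO SIGN), which needs 3 bytes in UTF-8
  console.log((cp & 0xFFFFFF80) == 0); // false: doesn't fit in 7 bits
  console.log((cp & 0xFFFFF800) == 0); // false: doesn't fit in 11 bits
  console.log((cp & 0xFFFF0000) == 0); // true:  fits in 16 bits, so 3 bytes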


UTF-8 is a variable-width encoding. If the code point takes only 7 bits (= ASCII), then it fits into one byte, which has a leading zero bit to indicate that the sequence is only one byte:

  0XXXXXXX

Now, to check whether the code point fits in one byte, one can check whether any bit is set in the upper bits. That can be done by ANDing the code point with 0xFFFFFF80, which has all bits set except the lowest 7. Thus, if the bitwise AND results in something nonzero, there is a bit set somewhere above the lowest 7 bits, and more than one byte is needed:

  11111111111111111111111110000000 &
  0000000000000000000000000XXXXXXX
  = 0
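
This also answers "how does it know how many bytes": the encoder just cascades the same check with wider masks. As a rough sketch (utf8Length is my name, not the library's):

  function utf8Length(codePoint) {
    if ((codePoint & 0xFFFFFF80) == 0) return 1; // fits in 7 bits
    if ((codePoint & 0xFFFFF800) == 0) return 2; // fits in 11 bits
    if ((codePoint & 0xFFFF0000) == 0) return 3; // fits in 16 bits
    return 4;                                    // up to 21 bits
  }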

Now if the code point needs more than 7 bits, the leading bits of the first byte encode the total number of bytes in the sequence (as many leading 1s as there are bytes, followed by a 0), and all the following bytes start with a 10 sequence. For 4 bytes that would be:

  11110XXX 10XXXXXX 10XXXXXX 10XXXXXX

Now, to get the top 3 payload bits, which go into the first byte, one can right-shift by 18, since the three continuation bytes carry 6 bits each (3 × 6 = 18), then mask with 0x07 and OR in the 11110 marker:

  XXX XXXXXX XXXXXX XXXXXX       all 21 payload bits
  >> 18           ->  XXX        only the top 3 remain
  & 0x07, | 0xF0  ->  11110XXX   the finished first byte
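
To make the shifting concrete, here is a worked round trip (my own example, using U+1F600) for a 4-byte sequence:

  var cp = 0x1F600; // 😀, needs 4 bytes

  // Encode: peel off 3 + 6 + 6 + 6 payload bits, top to bottom
  var b1 = ((cp >> 18) & 0x07) | 0xF0; // 0xF0
  var b2 = ((cp >> 12) & 0x3F) | 0x80; // 0x9F
  var b3 = ((cp >> 6)  & 0x3F) | 0x80; // 0x98
  var b4 = ( cp        & 0x3F) | 0x80; // 0x80

  // Decode: strip the 10 markers and shift the payloads back together
  var back = ((b1 & 0x07) << 18) | ((b2 & 0x3F) << 12) |
             ((b3 & 0x3F) << 6)  |  (b4 & 0x3F);
  console.log(back === cp); // true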