
In C++ on Windows, how do you convert an XML character reference of the form &#xhhhh; to a UTF-16 little-endian string?

I'm thinking that if the hhhh part is 4 hex digits or fewer, then it's at most 2 bytes, which fits into one UTF-16 code unit. But this wiki page has a table of character references, and some near the bottom are 5-digit hex numbers which won't fit into two bytes. How can those be converted to UTF-16?

I'm wondering if the MultiByteToWideChar function is capable of doing the job.

My understanding of how a code point that's bigger than 2 bytes gets converted to UTF-16 is lacking! (Or, for that matter, I'm not too sure how a code point that's bigger than 1 byte gets converted to UTF-8, but that's another question.)

Thanks.

  • MultiByteToWideChar is totally inappropriate for this task. Commented Mar 17, 2021 at 19:37
  • Related: MultiByteToWideChar for Unicode code pages 1200, 1201, 12000, 12001. Commented Mar 17, 2021 at 19:48
  • The algorithm to convert a codepoint into UTF-16 is described on Wikipedia, see UTF-16. Commented Mar 17, 2021 at 20:02
  • @RemyLebeau but the bigger problem in this question is to convert each string &#xhhhh; to a codepoint in the first place. Once you've done that your advice might be helpful. Commented Mar 19, 2021 at 4:15
  • @MarkRansom it is trivial to parse XML character references into numeric codepoint values, especially if you use an actual XML parser and let it do the work for you (see the parsing sketch after this list). Commented Mar 19, 2021 at 4:20
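As the last comment notes, turning the reference text into a numeric code point is the easy part. Below is a minimal sketch of that step, assuming the reference (&#xhhhh; or &#dddd;) has already been isolated from the surrounding text; the helper name parse_char_reference is made up here, and a real XML parser would do this (plus entity validation) for you:

#include <cstddef>
#include <string>

// Hypothetical helper: parses a numeric character reference such as
// "&#x1F600;" or "&#128512;" into a Unicode code point. Returns false if the
// text is not a well-formed reference; validating the resulting code point
// (e.g. rejecting surrogates) is left to the caller.
bool parse_char_reference(const std::string& ref, char32_t& codepoint)
{
    if (ref.size() < 4 || ref[0] != '&' || ref[1] != '#' || ref.back() != ';')
        return false;

    std::size_t pos = 2;
    unsigned base = 10;
    if (ref[pos] == 'x' || ref[pos] == 'X')
    {
        base = 16;
        ++pos;
    }
    if (pos >= ref.size() - 1)          // no digits at all
        return false;

    char32_t value = 0;
    for (; pos < ref.size() - 1; ++pos)
    {
        char c = ref[pos];
        unsigned digit;
        if (c >= '0' && c <= '9')                    digit = c - '0';
        else if (base == 16 && c >= 'a' && c <= 'f') digit = c - 'a' + 10;
        else if (base == 16 && c >= 'A' && c <= 'F') digit = c - 'A' + 10;
        else return false;

        value = value * base + digit;
        if (value > 0x10FFFF)           // beyond the Unicode range
            return false;
    }

    codepoint = value;
    return true;
}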

1 Answer


Unicode code points (UTF-32) are 4 bytes wide and can be converted into a UTF-16 character (and possible surrogate) using the following code (that I happen to have lying around).

It is not heavily tested so bug reports gratefully accepted:

#include <array>

/**
 * Converts a UTF-32 code point to UTF-16 (and optional surrogate)
 * @param utf32 - UTF-32 code point
 * @param utf16 - returned UTF-16 character
 * @return - The number of code units in the UTF-16 result (1 or 2).
 */
unsigned utf32_to_utf16(char32_t utf32, std::array<char16_t, 2>& utf16)
{
    if(utf32 < 0xD800 || (utf32 > 0xDFFF && utf32 < 0x10000))
    {
        utf16[0] = char16_t(utf32);
        utf16[1] = 0;
        return 1;
    }

    utf32 -= 0x010000;

    utf16[0] = char16_t(((0b1111'1111'1100'0000'0000 & utf32) >> 10) + 0xD800);
    utf16[1] = char16_t(((0b0000'0000'0011'1111'1111 & utf32) >> 00) + 0xDC00);

    return 2;
}
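For example, a caller might use it along these lines to build a UTF-16 string (a hypothetical usage sketch, assuming the function above is in scope; on Windows x86/x64 the char16_t units of a std::u16string are stored little-endian, so the result is already UTF-16LE in memory):

#include <array>
#include <string>

int main()
{
    std::u16string out;
    std::array<char16_t, 2> units{};

    // U+1F600 (a 5-hex-digit code point, "&#x1F600;") needs a surrogate pair.
    unsigned n = utf32_to_utf16(char32_t{0x1F600}, units);  // yields 0xD83D 0xDE00
    out.append(units.data(), n);

    return out.size() == 2 ? 0 : 1;  // out now holds the surrogate pair
}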

3 Comments

You might consider treating the range 0xd800 to 0xdfff specially, since those might be malformed input.
@MarkRansom Yes, I was wondering about the lack of error checking (I wrote this ages ago). But looking again at the Wikipedia article, it says that even though that range is technically bad code points, a lot of software allows them anyway... so I am going to have to mull on that for a bit.
It might not be malformed input either, if the codepoints are paired to make a valid UTF-16 character. JSON is encoded this way for example, see e.g. Why does JSON encode UTF-16 surrogate pairs instead of Unicode code points directly?
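If the stricter behaviour discussed in these comments is wanted, one possible variant is an upfront range check that rejects lone surrogates and out-of-range values. This is only a sketch, not part of the original answer, and it assumes the answer's utf32_to_utf16 is visible:

#include <array>

// Sketch: reject surrogate code points and anything beyond U+10FFFF before
// encoding, so malformed input is reported instead of silently mis-encoded.
// Returns 0 on invalid input, otherwise the number of code units written.
unsigned utf32_to_utf16_checked(char32_t utf32, std::array<char16_t, 2>& utf16)
{
    if ((utf32 >= 0xD800 && utf32 <= 0xDFFF) || utf32 > 0x10FFFF)
        return 0;                        // not a valid Unicode scalar value
    return utf32_to_utf16(utf32, utf16); // safe: input is now a valid scalar
}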
