
In C++ on Windows, how do you convert an XML character reference of the form &#xhhhh; to a UTF-16 little-endian string?

I'm thinking that if the hhhh part is 4 hex digits or fewer, then it's at most 2 bytes, which fits into one UTF-16 code unit. But this wiki page has a table of character references, and some near the bottom are 5-digit hex numbers which won't fit into two bytes. How can those be converted to UTF-16?

I'm wondering if the MultiByteToWideChar function is capable of doing the job.

My understanding of how a code point that's bigger than 2 bytes gets converted to UTF-16 is lacking! (Or, for that matter, I'm not too sure how a code point that's bigger than 1 byte gets converted to UTF-8, but that's another question.)

Thanks.

  • MultiByteToWideChar is totally inappropriate for this task. Commented Mar 17, 2021 at 19:37
  • Related: MultiByteToWideChar for Unicode code pages 1200, 1201, 12000, 12001. Commented Mar 17, 2021 at 19:48
  • The algorithm to convert a codepoint into UTF-16 is described on Wikipedia, see UTF-16. Commented Mar 17, 2021 at 20:02
  • @RemyLebeau but the bigger problem in this question is to convert each string &#xhhhh; to a codepoint in the first place. Once you've done that your advice might be helpful. Commented Mar 19, 2021 at 4:15
  • @MarkRansom it is trivial to parse XML character references into numeric codepoint values, especially if you use an actual XML parser and let it do the work for you (see the parsing sketch after this list). Commented Mar 19, 2021 at 4:20
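As the last comment notes, turning the reference text into a numeric code point is the easy part. Below is a minimal sketch of that step, assuming the reference (&#xhhhh; or &#dddd;) has already been isolated from the surrounding text; the helper name parse_char_reference is made up here, and a real XML parser would do this (plus entity validation) for you:

#include <cstddef>
#include <string>

// Hypothetical helper: parses a numeric character reference such as
// "&#x1F600;" or "&#128512;" into a Unicode code point. Returns false if the
// text is not a well-formed reference; validating the resulting code point
// (e.g. rejecting surrogates) is left to the caller.
bool parse_char_reference(const std::string& ref, char32_t& codepoint)
{
    if (ref.size() < 4 || ref[0] != '&' || ref[1] != '#' || ref.back() != ';')
        return false;

    std::size_t pos = 2;
    unsigned base = 10;
    if (ref[pos] == 'x' || ref[pos] == 'X')
    {
        base = 16;
        ++pos;
    }
    if (pos >= ref.size() - 1)          // no digits at all
        return false;

    char32_t value = 0;
    for (; pos < ref.size() - 1; ++pos)
    {
        char c = ref[pos];
        unsigned digit;
        if (c >= '0' && c <= '9')                    digit = c - '0';
        else if (base == 16 && c >= 'a' && c <= 'f') digit = c - 'a' + 10;
        else if (base == 16 && c >= 'A' && c <= 'F') digit = c - 'A' + 10;
        else return false;

        value = value * base + digit;
        if (value > 0x10FFFF)           // beyond the Unicode range
            return false;
    }

    codepoint = value;
    return true;
}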

1 Answer


Unicode code points (UTF-32) are 4 bytes wide and can be converted into a UTF-16 character (and possible surrogate) using the following code (that I happen to have lying around).

It is not heavily tested so bug reports gratefully accepted:

#include <array>

/**
 * Converts a UTF-32 code point to UTF-16 (and optional surrogate)
 * @param utf32 - UTF-32 code point
 * @param utf16 - returned UTF-16 character
 * @return - The number of code units in the UTF-16 result (1 or 2).
 */
unsigned utf32_to_utf16(char32_t utf32, std::array<char16_t, 2>& utf16)
{
    if(utf32 < 0xD800 || (utf32 > 0xDFFF && utf32 < 0x10000))
    {
        utf16[0] = char16_t(utf32);
        utf16[1] = 0;
        return 1;
    }

    utf32 -= 0x010000;

    utf16[0] = char16_t(((0b1111'1111'1100'0000'0000 & utf32) >> 10) + 0xD800);
    utf16[1] = char16_t(((0b0000'0000'0011'1111'1111 & utf32) >> 00) + 0xDC00);

    return 2;
}
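For example, a caller might use it along these lines to build a UTF-16 string (a hypothetical usage sketch, assuming the function above is in scope; on Windows x86/x64 the char16_t units of a std::u16string are stored little-endian, so the result is already UTF-16LE in memory):

#include <array>
#include <string>

int main()
{
    std::u16string out;
    std::array<char16_t, 2> units{};

    // U+1F600 (a 5-hex-digit code point, "&#x1F600;") needs a surrogate pair.
    unsigned n = utf32_to_utf16(char32_t{0x1F600}, units);  // yields 0xD83D 0xDE00
    out.append(units.data(), n);

    return out.size() == 2 ? 0 : 1;  // out now holds the surrogate pair
}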

3 Comments

You might consider treating the range 0xd800 to 0xdfff specially, since those might be malformed input.
@MarkRansom Yes, I was wondering about the lack of error checking (I wrote this ages ago). But looking again at the Wikipedia article, it says that even though that range is technically bad code points, a lot of software allows them anyway... so I am going to have to mull on that for a bit.
It might not be malformed input either, if the codepoints are paired to make a valid UTF-16 character. JSON is encoded this way for example, see e.g. Why does JSON encode UTF-16 surrogate pairs instead of Unicode code points directly?
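If the stricter behaviour discussed in these comments is wanted, one possible variant is an upfront range check that rejects lone surrogates and out-of-range values. This is only a sketch, not part of the original answer, and it assumes the answer's utf32_to_utf16 is visible:

#include <array>

// Sketch: reject surrogate code points and anything beyond U+10FFFF before
// encoding, so malformed input is reported instead of silently mis-encoded.
// Returns 0 on invalid input, otherwise the number of code units written.
unsigned utf32_to_utf16_checked(char32_t utf32, std::array<char16_t, 2>& utf16)
{
    if ((utf32 >= 0xD800 && utf32 <= 0xDFFF) || utf32 > 0x10FFFF)
        return 0;                        // not a valid Unicode scalar value
    return utf32_to_utf16(utf32, utf16); // safe: input is now a valid scalar
}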
