
I'm trying to convert the decimal values of Unicode characters to their actual characters using C++, and I don't want to use any libraries. I was kindly given the function below by a user on Stack Overflow; it converts the decimal representation of a code point into a UTF-8 character.

This solved all my problems when I was testing my code on OS X, but sadly when I tested it on Windows the output characters were completely incorrect. I understand now that Windows uses UTF-16, which would explain why the wrong characters were output on Windows.

The problem is, since I didn't write the function myself, I have no idea how it works. I've tried Googling each part of the function, and I understand it implements the UTF-8 encoding algorithm and I know it's using bitwise operations, but I don't have a clue how it all fits together. Here's the function:

void GetUnicodeChar(unsigned int code, char chars[5]) {
    if (code <= 0x7F) {
        chars[0] = (code & 0x7F); chars[1] = '\0';
    } else if (code <= 0x7FF) {
        // one continuation byte
        chars[1] = 0x80 | (code & 0x3F); code = (code >> 6);
        chars[0] = 0xC0 | (code & 0x1F); chars[2] = '\0';
    } else if (code <= 0xFFFF) {
        // two continuation bytes
        chars[2] = 0x80 | (code & 0x3F); code = (code >> 6);
        chars[1] = 0x80 | (code & 0x3F); code = (code >> 6);
        chars[0] = 0xE0 | (code & 0xF); chars[3] = '\0';
    } else if (code <= 0x10FFFF) {
        // three continuation bytes
        chars[3] = 0x80 | (code & 0x3F); code = (code >> 6);
        chars[2] = 0x80 | (code & 0x3F); code = (code >> 6);
        chars[1] = 0x80 | (code & 0x3F); code = (code >> 6);
        chars[0] = 0xF0 | (code & 0x7); chars[4] = '\0';
    } else {
        // unicode replacement character U+FFFD, encoded as EF BF BD
        chars[0] = 0xEF; chars[1] = 0xBF; chars[2] = 0xBD;
        chars[3] = '\0';
    }
}

So here's my question: does anyone know of a way to convert that UTF-8 encoding function into a UTF-16 one? I have done some research on both algorithms, but the truth is I don't really understand either.

Alternatively, I've seen people use the function MultiByteToWideChar, but I couldn't get that to work either. Can anyone please provide a method or function that would let me display the correct Unicode characters on Windows, without the user having to change their console code page?

  • Why do you want to do this yourself? Commented May 6, 2014 at 18:26
  • @AndrewMedico I'm making a virtual machine that will be open source and I don't want to have to rely on any external libraries. Commented May 6, 2014 at 18:28
  • This is at least the third time you've asked the same question. And you still don't understand the basic problem: you're using the Windows console, which doesn't work with Unicode in any way! Not UTF-8, not UTF-16. Commented May 6, 2014 at 18:29
  • @MarkRansom If the Windows console won't work with Unicode, then how do Python, Ruby, etc. output non-ASCII characters on Windows? #perseverance Commented May 6, 2014 at 18:33
  • I can't speak to Ruby, but in Python there are two ways. First is to use something which isn't the console, such as Idle. Second is to print Unicode characters directly, which Python will automatically convert to the current code page, usually 'cp437' - you can find this with sys.stdout.encoding. Commented May 6, 2014 at 18:38

1 Answer


Read the descriptions of UTF-8 and UTF-16 on Wikipedia; they describe the encoding algorithms.

Try something like this:

void GetUnicodeCharAsUtf8(unsigned int code, char chars[5])
{
    if (code <= 0x7F) {
        chars[0] = (code & 0x7F);
        chars[1] = '\0';
    } else if (code > 0x10FFFF) {
        // unicode replacement character
        chars[0] = 0xEF;
        chars[1] = 0xBF;
        chars[2] = 0xBD;
        chars[3] = '\0';
    } else {
        int count;
        if (code <= 0x7FF) {
            // one continuation byte
            count = 1;
        } else if (code <= 0xFFFF) {
            // two continuation bytes
            count = 2;
        } else {
            // three continuation bytes
            count = 3;
        }
        for (int i = 0; i < count; ++i) {
            chars[count-i] = 0x80 | (code & 0x3F);
            code >>= 6;
        }
        chars[0] = (0x1E << (6-count)) | (code & (0x3F >> count));
        chars[1+count] = '\0';
    }
}

void GetUnicodeCharAsUtf16(unsigned int code, unsigned short chars[2])
{
    if ( ((code >= 0x0000) && (code <= 0xD7FF)) ||
        ((code >= 0xE000) && (code <= 0xFFFF)) )
    {
        chars[0] = 0x0000;
        chars[1] = (unsigned short) code;
    }
    else if ((code >= 0xD800) && (code <= 0xDFFF))
    {
        // unicode replacement character
        chars[0] = 0x0000;
        chars[1] = 0xFFFD;
    }
    else
    {
        // surrogate pair
        code -= 0x010000;
        chars[0] = 0xD800 + (unsigned short)((code >> 10) & 0x3FF);
        chars[1] = 0xDC00 + (unsigned short)(code & 0x3FF);
    }
}

4 Comments

In theory C++ provides this as part of the standard library; however, library support is not universal yet (last I checked, g++ did not yet have the header).
Only if you are using a C++11 compiler, though.
@RemyLebeau Thanks for your help, I just have one question. The function I provided had a "char" as the second parameter. Why does your function use an "unsigned short" instead? Is there any way to make the unsigned short a char? Because when I use "cout" it just shows me the location of the short in memory.
UTF-8 encodes to 8-bit values, which char handles (unsigned char would be better). UTF-16 encodes to 16-bit values, which unsigned short handles. Hence the names of the UTF formats - UTF-8 for 8-bit, UTF-16 for 16-bit. std::cout does not support UTF-16, though, so use std::wcout or even WriteConsoleW() instead.
