
I'm trying to convert the decimal values of Unicode characters to their actual characters using C++, and I don't want to use any libraries. I was kindly given the function below by a user on Stack Overflow; it converts the decimal representation of a code point into a UTF-8 character.

This solved all my problems when I was testing my code on OS X, but sadly when I tested it on Windows the output characters were completely incorrect. I understand now that Windows uses UTF-16, which would explain why the wrong characters were output on Windows.

The problem is, since I didn't write the function myself, I have no idea how it works. I've tried Googling each part of the function, and I understand it implements the UTF-8 encoding algorithm and I know it's using bitwise operations, but I don't have a clue how it all fits together. Here's the function:

void GetUnicodeChar(unsigned int code, char chars[5]) {
    if (code <= 0x7F) {
        chars[0] = (code & 0x7F); chars[1] = '\0';
    } else if (code <= 0x7FF) {
        // one continuation byte
        chars[1] = 0x80 | (code & 0x3F); code = (code >> 6);
        chars[0] = 0xC0 | (code & 0x1F); chars[2] = '\0';
    } else if (code <= 0xFFFF) {
        // two continuation bytes
        chars[2] = 0x80 | (code & 0x3F); code = (code >> 6);
        chars[1] = 0x80 | (code & 0x3F); code = (code >> 6);
        chars[0] = 0xE0 | (code & 0xF); chars[3] = '\0';
    } else if (code <= 0x10FFFF) {
        // three continuation bytes
        chars[3] = 0x80 | (code & 0x3F); code = (code >> 6);
        chars[2] = 0x80 | (code & 0x3F); code = (code >> 6);
        chars[1] = 0x80 | (code & 0x3F); code = (code >> 6);
        chars[0] = 0xF0 | (code & 0x7); chars[4] = '\0';
    } else {
        // unicode replacement character U+FFFD, encoded as EF BF BD
        chars[0] = 0xEF; chars[1] = 0xBF; chars[2] = 0xBD;
        chars[3] = '\0';
    }
}

So here's my question: does anyone know of a way to convert that UTF-8 encoding function into a UTF-16 one? I have done some research on both algorithms, but the truth is I don't really understand either.

Alternatively, I've seen people use the function MultiByteToWideChar, but I couldn't get that to work either. Can anyone please provide a method or function that would let me display the correct Unicode characters on Windows, without the user having to change their console code page?

  • Why do you want to do this yourself? Commented May 6, 2014 at 18:26
  • @AndrewMedico I'm making a virtual machine that will be open source and I don't want to have to rely on any external libraries. Commented May 6, 2014 at 18:28
  • This is at least the third time you've asked the same question. And you still don't understand the basic problem: you're using the Windows console, which doesn't work with Unicode in any way! Not UTF-8, not UTF-16. Commented May 6, 2014 at 18:29
  • @MarkRansom If the Windows console won't work with Unicode, then how do Python, Ruby, etc. output non-ASCII characters on Windows? #perseverance Commented May 6, 2014 at 18:33
  • I can't speak to Ruby, but in Python there are two ways. First is to use something which isn't the console, such as Idle. Second is to print Unicode characters directly, which Python will automatically convert to the current code page, usually 'cp437' - you can find this with sys.stdout.encoding. Commented May 6, 2014 at 18:38

1 Answer


Read the descriptions of UTF-8 and UTF-16 on Wikipedia; they describe the encoding algorithms.

Try something like this:

void GetUnicodeCharAsUtf8(unsigned int code, char chars[5])
{
    if (code <= 0x7F) {
        chars[0] = (code & 0x7F);
        chars[1] = '\0';
    } else if (code > 0x10FFFF) {
        // unicode replacement character
        chars[0] = 0xEF;
        chars[1] = 0xBF;
        chars[2] = 0xBD;
        chars[3] = '\0';
    } else {
        int count;
        if (code <= 0x7FF) {
            // one continuation byte
            count = 1;
        } else if (code <= 0xFFFF) {
            // two continuation bytes
            count = 2;
        } else {
            // three continuation bytes
            count = 3;
        }
        for (int i = 0; i < count; ++i) {
            chars[count-i] = 0x80 | (code & 0x3F);
            code >>= 6;
        }
        chars[0] = (0x1E << (6-count)) | (code & (0x3F >> count));
        chars[1+count] = '\0';
    }
}

void GetUnicodeCharAsUtf16(unsigned int code, unsigned short chars[2])
{
    if ( ((code >= 0x0000) && (code <= 0xD7FF)) ||
        ((code >= 0xE000) && (code <= 0xFFFF)) )
    {
        chars[0] = 0x0000;
        chars[1] = (unsigned short) code;
    }
    else if ((code >= 0xD800) && (code <= 0xDFFF))
    {
        // unicode replacement character
        chars[0] = 0x0000;
        chars[1] = 0xFFFD;
    }
    else
    {
        // surrogate pair
        code -= 0x010000;
        chars[0] = 0xD800 + (unsigned short)((code >> 10) & 0x3FF);
        chars[1] = 0xDC00 + (unsigned short)(code & 0x3FF);
    }
}

4 Comments

In theory C++ provides this as part of the standard library; however, library support is not universal yet (last I checked, g++ did not yet have the header).
Only if you are using a C++11 compiler, though.
@RemyLebeau Thanks for your help, I just have one question. The function I provided had a "char" as the second parameter. Why does your function use an "unsigned short" instead? Is there any way to make the unsigned short a char? Because when I use "cout" it just shows me the location of the short in memory.
UTF-8 encodes to 8-bit values, which char handles (unsigned char would be better). UTF-16 encodes to 16-bit values, which unsigned short handles. Hence the names of the UTF formats - UTF-8 for 8-bit, UTF-16 for 16-bit. std::cout does not support UTF-16, though, so use std::wcout or even WriteConsoleW() instead.
