I need to convert a 4-byte UTF-8 sequence to UTF-16 in C.
I am not allowed to use any external libraries for this. I already have a macro for converting a 3-byte UTF-8 sequence to UTF-16:

#define UTF8_3BYTE_TO_UCS16(char1, char2, char3) ((((char1) & 0x0F) << 12) | (((char2) & 0x3F) << 6) | ((char3) & 0x3F))

I am looking for a similar implementation for 4-byte UTF-8 sequences as well.

  • I think your macro should be #define UTF8_4BYTE_TO_UCS16(char1, char2, char3, char4) ((((char1) & 0x07) << 18) | (((char2) & 0x3F) << 12) | (((char3) & 0x3F) << 6) | ((char4) & 0x3F))...? Commented Jan 16, 2024 at 10:51
  • Your suggestion does decode it, but the problem is that a Unicode code point derived from a 4-byte UTF-8 sequence might be outside the BMP (which I mentioned down in my answer), meaning it can't be represented by a single UTF-16 code unit. In those cases the code point has to be represented as a surrogate pair in UTF-16, i.e. two 16-bit code units, because a single 16-bit code unit can't represent characters beyond U+FFFF. @YoushaAleayoub Commented Jan 16, 2024 at 10:54
  • A valid 4-byte UTF-8 sequence will always be outside the BMP. OP, are you allowed to write two macros to generate the high and low surrogates? Commented Jan 16, 2024 at 16:24
  • The requirement is to support 4-byte emoji. Will those fall outside the BMP? @MarkTolonen Commented Jan 17, 2024 at 6:23
  • @JInuThomas Yes, always. 4-byte sequences encode code points above U+FFFF. Commented Jan 17, 2024 at 9:20
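
To make the point from the comments above concrete, here is a minimal sketch that decodes the smallest and largest valid 4-byte sequences and checks whether the result fits in one UTF-16 code unit. The function names `utf8_4byte_decode` and `fits_in_one_utf16_unit` are hypothetical and introduced only for illustration; the decoder performs no validation.

```c
#include <stdint.h>

/* Hypothetical helper: combine the payload bits of a 4-byte UTF-8
 * sequence (3 + 6 + 6 + 6 bits) into a code point. No validation. */
uint32_t utf8_4byte_decode(uint8_t b1, uint8_t b2, uint8_t b3, uint8_t b4)
{
    return ((uint32_t)(b1 & 0x07) << 18) |
           ((uint32_t)(b2 & 0x3F) << 12) |
           ((uint32_t)(b3 & 0x3F) << 6)  |
            (uint32_t)(b4 & 0x3F);
}

/* Hypothetical helper: nonzero if the code point is in the BMP,
 * i.e. representable by a single 16-bit UTF-16 code unit. */
int fits_in_one_utf16_unit(uint32_t cp)
{
    return cp <= 0xFFFF;
}
```

The smallest valid 4-byte sequence, F0 90 80 80, decodes to U+10000, and the largest, F4 8F BF BF, decodes to U+10FFFF, so neither end of the range fits in a single code unit.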

1 Answer

Here are separate macros to generate the high and low surrogates. It would be better to use a function, so that errors can be returned for invalid byte sequences, or to use an existing conversion library such as ICU.

#include <stdio.h>
#include <stdint.h>

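// Combine the payload bits of a 4-byte UTF-8 sequence (3 + 6 + 6 + 6 bits) into a code point.
#define UTF8_4BYTE_TO_UNICODE(char1, char2, char3, char4) ((((char1) & 0x07) << 18) | (((char2) & 0x3F) << 12) | (((char3) & 0x3F) << 6) | ((char4) & 0x3F))
// High surrogate: top 10 bits of (code point - 0x10000), offset by 0xD800.
#define UNICODE_TO_UTF16_HI(uni) ((((uni) - 0x10000) >> 10) + 0xD800)
// Low surrogate: bottom 10 bits of (code point - 0x10000), offset by 0xDC00.
#define UNICODE_TO_UTF16_LO(uni) ((((uni) - 0x10000) & 0x3FF) + 0xDC00)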

int main(void)
{
    // U+1F50C ELECTRIC PLUG 🔌
    uint32_t uni = UTF8_4BYTE_TO_UNICODE(0xf0, 0x9f, 0x94, 0x8c);
    uint16_t hi = UNICODE_TO_UTF16_HI(uni);
    uint16_t lo = UNICODE_TO_UTF16_LO(uni);
    printf("%04X %04X\n", hi, lo);
    return 0;
}

Output:

D83D DD0C
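
The function-based approach recommended in this answer might look roughly like the sketch below. It uses no external libraries. The name `utf8_4byte_to_utf16`, its signature, and the 0 / -1 return convention are assumptions made for illustration, not something given in the answer.

```c
#include <stdint.h>

/* Hypothetical helper (name and return convention are assumptions).
 * Converts one 4-byte UTF-8 sequence to a UTF-16 surrogate pair.
 * Returns 0 on success, -1 if the bytes are not a valid 4-byte sequence. */
int utf8_4byte_to_utf16(const uint8_t in[4], uint16_t *hi, uint16_t *lo)
{
    uint32_t cp;

    /* The lead byte must match the bit pattern 11110xxx. */
    if ((in[0] & 0xF8) != 0xF0)
        return -1;

    /* Each continuation byte must match the bit pattern 10xxxxxx. */
    if ((in[1] & 0xC0) != 0x80 ||
        (in[2] & 0xC0) != 0x80 ||
        (in[3] & 0xC0) != 0x80)
        return -1;

    cp = ((uint32_t)(in[0] & 0x07) << 18) |
         ((uint32_t)(in[1] & 0x3F) << 12) |
         ((uint32_t)(in[2] & 0x3F) << 6)  |
          (uint32_t)(in[3] & 0x3F);

    /* Reject overlong encodings (below U+10000) and values above U+10FFFF. */
    if (cp < 0x10000 || cp > 0x10FFFF)
        return -1;

    cp -= 0x10000;                            /* leaves a 20-bit value  */
    *hi = (uint16_t)(0xD800 + (cp >> 10));    /* top 10 bits            */
    *lo = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* bottom 10 bits         */
    return 0;
}
```

Called with the bytes F0 9F 94 8C from the example above, this sketch returns 0 and stores D83D and DD0C in the two output parameters; for a sequence such as F0 80 80 80 (an overlong encoding) it returns -1 and leaves the outputs untouched.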

6 Comments

@str1ng Note the code was updated and I said it was better to use a library, not that OP had to :) For your answer, declaring variables in the macro means the macro can only be used once in a function or duplicate declarations would occur.
@MarkTolonen Yes, I see that you say it's better because of error handling, and I totally agree with you. But since OP stated he wants it without libraries and you hadn't provided code that works without a function, I just followed up on the comment and asked for feedback on my code (certainly not to brag, but so that I can compare my code with someone else's). I can't compare myself to you, a senior dev, or to Remy, as you both have a lot of experience while I am only scratching the surface, and I'd like to learn from my mistakes.
@str1ng I'm not using any functions in the macros. Are you referring to printf? That's just for display of the macro results.
Thanks for the clarification. I obviously misunderstood this: I missed the #defines for the surrogate conversion, and because of the sentence before the code I assumed you would be using a function. Regarding the code that I posted: is modifying the macro to accept variables as arguments, instead of declaring them internally, a good approach? I'd assume that this way it wouldn't lead to duplicate-declaration errors if it's used multiple times in the same function. Or is there a better way? @MarkTolonen
@str1ng The better way would be to write a function instead of a macro. It doesn't have to use external libraries but could then return multiple values in output parameters and return errors for invalid sequences.
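
If a macro is still required, the idea raised in the comments above (passing the output variables as arguments) could be sketched as follows. The macro discussed in these comments is not shown in this thread, so `UTF8_4BYTE_TO_UTF16_PAIR` is a hypothetical name and this is only one possible shape. Wrapping the body in `do { ... } while (0)` gives the temporary its own block scope, so the macro can be expanded more than once in the same function without duplicate declarations. It still performs no validation, which is why a function remains the better option.

```c
#include <stdint.h>

/* Hypothetical macro: writes the surrogate pair into the caller-supplied
 * variables hi and lo. The temporary cp_ is declared inside the
 * do { } while (0) block, so each expansion has its own copy. */
#define UTF8_4BYTE_TO_UTF16_PAIR(c1, c2, c3, c4, hi, lo)               \
    do {                                                               \
        uint32_t cp_ = (((uint32_t)((c1) & 0x07) << 18) |              \
                        ((uint32_t)((c2) & 0x3F) << 12) |              \
                        ((uint32_t)((c3) & 0x3F) << 6)  |              \
                         (uint32_t)((c4) & 0x3F)) - 0x10000;           \
        (hi) = (uint16_t)(0xD800 + (cp_ >> 10));                       \
        (lo) = (uint16_t)(0xDC00 + (cp_ & 0x3FF));                     \
    } while (0)
```

Expanding it twice in one function, for example once for F0 9F 94 8C and once for F0 9F 98 80, compiles without conflict because `cp_` never leaks into the enclosing scope.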