How to convert Unicode code points to UTF-8 in C++?

Question

I have an array of Unicode code points:

unsigned short array[3]={0x20ac,0x20ab,0x20ac};

I want to convert these to UTF-8 and write them to a file byte by byte using C++.

Example: 0x20ac should be converted to e2 82 ac.

Or is there another method that can write Unicode characters to a file directly?

You could use Boost.Locale from the Boost libraries: boost.org/doc/libs/1_55_0/libs/locale/doc/html/index.html
– Nick Louloudakis, Commented Dec 6, 2013 at 8:53
Use a Unicode library like ICU. Even Windows itself has enough to do this.
– Mooing Duck, Commented Dec 7, 2013 at 17:27
I assume that's an array of code points from the question. Can you confirm that you're going to ignore code points that don't fit in a short, and that it's not actually UTF-16 or UCS-2 encoded?
– Mooing Duck, Commented Dec 7, 2013 at 17:28
In order to achieve this goal Boost.Locale uses the-state-of-the-art Unicode and Localization library: ICU - International Components for Unicode.
– Chris, Commented Feb 6, 2016 at 23:01

sms · Accepted Answer · 2016-09-04 20:20:25Z

11

Finally! With C++11!

#include <string>
#include <locale>
#include <codecvt>
#include <cassert>

int main()
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
    std::string u8str = converter.to_bytes(0x20ac);
    assert(u8str == "\xe2\x82\xac");
}

answered Sep 4, 2016 at 20:20

sms


3 Comments

Matthew Over a year ago

This is good, except with the Visual Studio 2015 and 2017 compilers, which do not support std::codecvt with char32_t. You can use uint32_t instead: std::wstring_convert< std::codecvt_utf8<uint32_t>, uint32_t > converter;

handicraftsman Over a year ago

To everyone who will read this now: this is deprecated in C++17.

Kidsunbo Over a year ago

To everyone who will read the comment above: there is no replacement in the standard library in C++17.

stenliis · Accepted Answer · 2022-12-31 16:08:23Z

5

The term Unicode refers to a standard for encoding and handling of text. This incorporates encodings like UTF-8, UTF-16, UTF-32, UCS-2, ...

I guess you are programming in a Windows environment, where Unicode typically refers to UTF-16.

When working with Unicode in C++, I would recommend the ICU library.

If you are programming on Windows, don't want to use an external library, and have no constraints regarding platform dependencies, you can use WideCharToMultiByte.

Example for ICU:

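#include <iostream>
#include <string>
#include <unicode/ustream.h>  // forward slashes work on every platform

using icu::UnicodeString;

int main(int, char**) {
    //
    // Convert from UTF-16 to UTF-8.
    // The wchar_t constructor exists only where wchar_t is 16 bits wide (e.g. Windows).
    //
    std::wstring utf16 = L"foobar";
    UnicodeString str(utf16.c_str());
    std::string utf8;
    str.toUTF8String(utf8);

    std::cout << utf8 << std::endl;
}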

To do exactly what you want:

// Assuming you have ICU\include in your include path
// and ICU\lib(64) in your library path.
#include <cstdint>
#include <iostream>
#include <fstream>
#include <string>
#include <unicode/ustream.h>
#pragma comment(lib, "icuio.lib")
#pragma comment(lib, "icuuc.lib")

using icu::UnicodeString;

void writeUtf16ToUtf8File(char const* fileName, wchar_t const* arr, size_t arrSize) {
    // UnicodeString takes the length as int32_t.
    UnicodeString str(arr, static_cast<int32_t>(arrSize));
    std::string utf8;
    str.toUTF8String(utf8);

    // Binary mode keeps newline translation from altering the bytes.
    std::ofstream out(fileName, std::ofstream::binary);
    out << utf8;
    out.close();
}

edited Dec 31, 2022 at 16:08

stenliis


answered Dec 6, 2013 at 9:03

Max Truxa


2 Comments

Venkatesan Over a year ago

How do I download it and set it up?

Max Truxa Over a year ago

Go here, scroll down to ICU4C Binary Download, and download the version you need. Extract the ZIP file and put the extracted directory somewhere you can access it from your project. Add 'path-where-you-put-it/icu/include' to your project's include path and 'path-where-you-put-it/icu/lib' (or lib64) to your project's library path.

True Vision _ Zunna Berry · Accepted Answer · 2013-12-06 08:55:38Z

2

The following code may help you:

#include <atlconv.h>
#include <atlstr.h>

#define ASSERT ATLASSERT

int main()
{
    const CStringW unicode1 = L"\x0391 and \x03A9"; // 'Alpha' and 'Omega'

    const CStringA utf8 = CW2A(unicode1, CP_UTF8);

    ASSERT(utf8.GetLength() > unicode1.GetLength());

    const CStringW unicode2 = CA2W(utf8, CP_UTF8);

    ASSERT(unicode1 == unicode2);
}

answered Dec 6, 2013 at 8:55

True Vision _ Zunna Berry


stenliis · Accepted Answer · 2022-12-31 20:44:02Z

1

This code uses WideCharToMultiByte (I assume that you are using Windows):

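#include <windows.h>
#include <cstdlib>

wchar_t wide_str[3] = {0x20ac, 0x20ab, 0x20ac};
// The first call returns the required byte count; +1 leaves room for a terminating NUL.
int utf8_size = WideCharToMultiByte(CP_UTF8, 0, wide_str, 3, NULL, 0, NULL, NULL) + 1;
char* utf8_str = static_cast<char*>(calloc(utf8_size, 1)); // zero-filled buffer
WideCharToMultiByte(CP_UTF8, 0, wide_str, 3, utf8_str, utf8_size, NULL, NULL);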

You need to call it twice: the first time to get the number of output bytes, and the second time to actually perform the conversion. If you already know the output buffer size, you can skip the first call. Alternatively, you can simply allocate a buffer twice the size in bytes of the original, plus 1 byte (in your case 12 + 1 bytes). That is always enough, because each 2-byte UTF-16 code unit produces at most 3 bytes of UTF-8.

edited Dec 31, 2022 at 20:44

stenliis


answered Dec 6, 2013 at 9:11

mvp


1 Comment

Venkatesan Over a year ago

Nice one, but I am using a Linux machine.

user2249683 · Accepted Answer · 2013-12-06 10:08:09Z

0

With standard C++:

#include <cwchar>
#include <iostream>
#include <locale>
#include <string>
#include <vector>

int main()
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> Convert;
    std::wstring w = L"\u20ac\u20ab\u20ac";
    // Throws std::runtime_error if this locale is not installed on the system.
    std::locale locale("en_GB.utf8");
    const Convert& convert = std::use_facet<Convert>(locale);

    std::mbstate_t state = std::mbstate_t(); // start in the initial conversion state
    const wchar_t* from_ptr;
    char* to_ptr;
    // Up to 4 UTF-8 bytes per wchar_t, plus a zero byte so the buffer prints as a C string.
    std::vector<char> result(4 * w.size() + 1, 0);
    Convert::result convert_result = convert.out(state,
          w.c_str(), w.c_str() + w.size(), from_ptr,
          result.data(), result.data() + result.size(), to_ptr);

    if (convert_result == Convert::ok)
        std::cout << result.data() << std::endl;
    else std::cout << "Failure: " << convert_result << std::endl;
}

answered Dec 6, 2013 at 10:08

user2249683

1 Comment

galactica Over a year ago

Having an STL solution is always nice, but the caveat here is that codecvt is to be deprecated in C++17 and, surprisingly, no alternative is available in C++17 and onwards.

Remy Lebeau · Accepted Answer · 2013-12-07 17:25:37Z

0

Iconv is a popular library used on many platforms.

answered Dec 7, 2013 at 17:25

Remy Lebeau


user3820843 · Accepted Answer · 2021-05-28 17:13:39Z

I had a similar but slightly different problem. I had strings that contained Unicode code points written out as escape sequences, e.g. "F\u00f3\u00f3 B\u00e1r". I needed to convert those escape sequences to the characters they represent.

Here is my C# solution:

using System;
using System.Globalization;
using System.Text.RegularExpressions;

static void Main(string[] args)
{
    // Match a literal backslash, 'u', and four hex digits.
    Regex CodePoint = new Regex(@"\\u(?<UTF32>[0-9A-Fa-f]{4})");
    Match Letter;
    // Verbatim string: the \u sequences stay as literal text instead of
    // being turned into characters by the compiler.
    string s = @"F\u00f3\u00f3 B\u00e1r";
    string utf32;
    Letter = CodePoint.Match(s);
    while (Letter.Success)
    {
        utf32 = Letter.Groups[1].Value;
        if (Int32.TryParse(utf32, NumberStyles.HexNumber, CultureInfo.GetCultureInfoByIetfLanguageTag("en-US"), out int HexNum))
            s = s.Replace("\\u" + utf32, Char.ConvertFromUtf32(HexNum));
        Letter = Letter.NextMatch();
    }
    Console.WriteLine(s);
}

Output: Fóó Bár
