10

I have an array of Unicode code points:

unsigned short array[3]={0x20ac,0x20ab,0x20ac};

I just want to convert this to UTF-8 and write it to a file byte by byte, using C++.

Example: 0x20ac should be converted to e2 82 ac.

Or is there another method that can write Unicode characters to a file directly?

5
  • You could use Boost.Locale of Boost libraries: boost.org/doc/libs/1_55_0/libs/locale/doc/html/index.html Commented Dec 6, 2013 at 8:53
  • 1
    Use a Unicode library like ICU. Even Windows itself has enough to do this. Commented Dec 7, 2013 at 17:27
  • I assume that's an array of codepoints from the question. Can you affirm that you're going to ignore codepoints that don't fit in a short, and that it's not actually UTF-16 nor UCS-2 encoded? Commented Dec 7, 2013 at 17:28
  • Could you make it specific? Commented Apr 17, 2014 at 18:00
  • In order to achieve this goal Boost.Locale uses the-state-of-the-art Unicode and Localization library: ICU - International Components for Unicode. Commented Feb 6, 2016 at 23:01

7 Answers

11

Finally! With C++11!

#include <string>
#include <locale>
#include <codecvt>
#include <cassert>

int main()
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
    std::string u8str = converter.to_bytes(0x20ac);
    assert(u8str == "\xe2\x82\xac");
}

3 Comments

This is good, except with the Visual Studio 2015 and 2017 compilers, which do not support std::codecvt with char32_t. But you can use uint32_t instead: std::wstring_convert< std::codecvt_utf8<uint32_t>, uint32_t > converter;
To everyone who will read this now: this is deprecated in C++17.
To everyone who will read the comment above: there is no replacement in STL in C++17.
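Since the comments point out that the codecvt route is deprecated with no standard replacement, one option is to encode by hand. Below is a minimal sketch, not a definitive implementation. It assumes the array holds Unicode code points (not UTF-16 surrogate pairs), and the names append_utf8, to_utf8 and write_utf8_file are made up for illustration. Surrogate values and anything above U+10FFFF are skipped.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>

// Append the UTF-8 encoding of one code point to `out`.
// Returns false (and appends nothing) for surrogates and values above U+10FFFF.
bool append_utf8(std::uint32_t cp, std::string& out) {
    if (cp <= 0x7F) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {              // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate, not a code point
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        return false;
    }
    return true;
}

// Encode an array of code points; invalid values are silently skipped.
std::string to_utf8(const unsigned short* cps, std::size_t n) {
    std::string out;
    for (std::size_t i = 0; i < n; ++i) {
        append_utf8(cps[i], out);
    }
    return out;
}

// Write the encoded bytes to a file opened in binary mode.
bool write_utf8_file(const char* path, const unsigned short* cps, std::size_t n) {
    std::ofstream file(path, std::ios::binary);
    const std::string bytes = to_utf8(cps, n);
    file.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    return static_cast<bool>(file);
}
```

For the array from the question, to_utf8(array, 3) yields the nine bytes e2 82 ac e2 82 ab e2 82 ac. If the input is really UTF-16, surrogate pairs would have to be combined into a single code point before calling append_utf8.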
5

The term Unicode refers to a standard for encoding and handling of text. This incorporates encodings like UTF-8, UTF-16, UTF-32, UCS-2, ...

I guess you are programming in a Windows environment, where Unicode typically refers to UTF-16.

When working with Unicode in C++, I would recommend the ICU library.

If you are programming on Windows, don't want to use an external library, and have no constraints regarding platform dependencies, you can use WideCharToMultiByte.

Example for ICU:

#include <iostream>
#include <string>
#include <unicode/ustream.h>

using icu::UnicodeString;

int main(int, char**) {
    //
    // Convert from UTF-16 to UTF-8
    //
    std::wstring utf16 = L"foobar";
    UnicodeString str(utf16.c_str());
    std::string utf8;
    str.toUTF8String(utf8);

    std::cout << utf8 << std::endl;
}

To do exactly what you want:

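// Assuming you have ICU/include in your include path
// and ICU/lib(64) in your library path.
#include <cstdint>
#include <iostream>
#include <fstream>
#include <string>
#include <unicode/ustream.h>
#pragma comment(lib, "icuio.lib")
#pragma comment(lib, "icuuc.lib")

using icu::UnicodeString;

// Note: as far as I know, ICU only provides the wchar_t constructor used here
// where wchar_t is 16 bits wide (e.g. on Windows).
void writeUtf16ToUtf8File(char const* fileName, wchar_t const* arr, size_t arrSize) {
    UnicodeString str(arr, static_cast<int32_t>(arrSize));
    std::string utf8;
    str.toUTF8String(utf8);

    // Binary mode so that the bytes are written unchanged.
    std::ofstream out(fileName, std::ofstream::binary);
    out << utf8;
    out.close();
}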

2 Comments

How do I download it and set it up?
Go here, scroll down to ICU4C Binary Download and download the version you need. Extract the ZIP file and put the extracted directory somewhere you can access it from your project. Add 'path-where-you-put-it/icu/include' to your project's include path and 'path-where-you-put-it/icu/lib' (or lib64) to your project's library path.
2

The following code may help you:

#include <atlconv.h>
#include <atlstr.h>

#define ASSERT ATLASSERT

int main()
{
    const CStringW unicode1 = L"\x0391 and \x03A9"; // 'Alpha' and 'Omega'

    const CStringA utf8 = CW2A(unicode1, CP_UTF8);

    ASSERT(utf8.GetLength() > unicode1.GetLength());

    const CStringW unicode2 = CA2W(utf8, CP_UTF8);

    ASSERT(unicode1 == unicode2);
}

1

This code uses WideCharToMultiByte (I assume that you are using Windows):

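unsigned short wide_str[3] = {0x20ac, 0x20ab, 0x20ac};
// On Windows wchar_t is 16 bits wide, so the array can be viewed as a wide string.
const wchar_t* wide = reinterpret_cast<const wchar_t*>(wide_str);
// First call: ask for the required size in bytes (+1 for a terminating NUL).
int utf8_size = WideCharToMultiByte(CP_UTF8, 0, wide, 3, NULL, 0, NULL, NULL) + 1;
char* utf8_str = static_cast<char*>(calloc(utf8_size, 1));
// Second call: perform the actual conversion.
WideCharToMultiByte(CP_UTF8, 0, wide, 3, utf8_str, utf8_size, NULL, NULL);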

You need to call it twice: the first time to get the number of output bytes, and the second time to actually convert. If you already know the output buffer size, you can skip the first call. Alternatively, you can simply allocate a buffer twice the size of the original in bytes, plus 1 byte (in your case 12 + 1 bytes). That should always be enough, since each UTF-16 code unit needs at most 3 bytes in UTF-8.

1 Comment

Nice one, but I am using a Linux machine.
0

With standard C++:

#include <cwchar>
#include <iostream>
#include <locale>
#include <string>
#include <vector>

int main()
{
    typedef std::codecvt<wchar_t, char, mbstate_t> Convert;
    std::wstring w = L"\u20ac\u20ab\u20ac";
    std::locale locale("en_GB.utf8");
    const Convert& convert = std::use_facet<Convert>(locale);

    std::mbstate_t state = std::mbstate_t(); // must start in the initial conversion state
    const wchar_t* from_ptr;
    char* to_ptr;
    std::vector<char> result(3 * w.size() + 1, 0);
    Convert::result convert_result = convert.out(state,
          w.c_str(), w.c_str() + w.size(), from_ptr,
          result.data(), result.data() + result.size(), to_ptr);

    if (convert_result == Convert::ok)
        std::cout << result.data() << std::endl;
    else std::cout << "Failure: " << convert_result << std::endl;
}

1 Comment

Having an STL solution is always nice, but the caveat here is that codecvt is to be deprecated in C++17 and, surprisingly, no alternative is available in C++17 and onwards.
0

Iconv is a popular library used on many platforms.
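A rough sketch of what that could look like for the array in the question. It assumes the glibc-style iconv signature (a non-const char** input pointer) and treats the values as UTF-16 code units; the function name utf16_to_utf8_iconv is made up for illustration. The code units are serialized as little-endian bytes first, so the result does not depend on the byte order of the machine.

```cpp
#include <iconv.h>

#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Convert UTF-16 code units to a UTF-8 byte string using iconv.
std::string utf16_to_utf8_iconv(const unsigned short* units, std::size_t n) {
    // Serialize the code units as explicit little-endian bytes.
    std::vector<char> in;
    in.reserve(n * 2);
    for (std::size_t i = 0; i < n; ++i) {
        in.push_back(static_cast<char>(units[i] & 0xFF));
        in.push_back(static_cast<char>((units[i] >> 8) & 0xFF));
    }

    // iconv_open takes the target encoding first, then the source encoding.
    iconv_t cd = iconv_open("UTF-8", "UTF-16LE");
    if (cd == (iconv_t)-1) {
        throw std::runtime_error("iconv_open failed");
    }

    // Each UTF-16 code unit needs at most 3 bytes in UTF-8.
    std::string out(n * 3, '\0');
    char* in_ptr = in.data();
    std::size_t in_left = in.size();
    char* out_ptr = &out[0];
    std::size_t out_left = out.size();

    const std::size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (std::size_t)-1) {
        throw std::runtime_error("iconv conversion failed");
    }

    // Trim the unused tail of the output buffer.
    out.resize(out.size() - out_left);
    return out;
}
```

The returned string can then be written to a std::ofstream opened in binary mode. On platforms where iconv is not part of the C library, linking with -liconv may be needed.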

0

I had a similar but slightly different problem. I had strings containing Unicode code points written out as escape sequences, for example: "F\u00f3\u00f3 B\u00e1r". I needed to convert these escaped code points to their Unicode characters.

Here is my C# solution:

using System.Globalization;
using System.Text.RegularExpressions;

static void Main(string[] args)
{
    Regex CodePoint = new Regex(@"\\u(?<UTF32>....)");
    Match Letter;
    // Verbatim string, so that the \uXXXX sequences stay as literal text for the regex to find.
    string s = @"F\u00f3\u00f3 B\u00e1r";
    string utf32;
    Letter = CodePoint.Match(s);
    while (Letter.Success)
    {
        utf32 = Letter.Groups[1].Value;
        if (Int32.TryParse(utf32, NumberStyles.HexNumber, CultureInfo.GetCultureInfoByIetfLanguageTag("en-US"), out int HexNum))
            s = s.Replace("\\u" + utf32, Char.ConvertFromUtf32(HexNum));
        Letter = Letter.NextMatch();
    }
    Console.WriteLine(s);
}

Output: Fóó Bár
