
Consider:

STDMETHODIMP CFileSystemAPI::setRRConfig( BSTR config_str, VARIANT* ret )
{
    mReportReaderFactory.reset( new sbis::report_reader::ReportReaderFactory() );

    USES_CONVERSION;
    std::string configuration_str = W2A( config_str ); // W2A converts to the ANSI code page, not UTF-8

But in config_str I receive a UTF-16 string. How can I convert it to UTF-8 in this piece of code?


4 Answers


I implemented two variants of conversion between UTF-8, UTF-16, and UTF-32. The first variant implements all conversions from scratch; the second uses the standard std::codecvt and std::wstring_convert facilities (both deprecated since C++17, but still present, and guaranteed to exist in C++11/C++14).

If you don't like my code, you may instead use the almost-single-header C++ library utfcpp, which is well tested and widely used.

To convert UTF-8 to UTF-16, just call Utf32To16(Utf8To32(str)); to convert UTF-16 to UTF-8, call Utf32To8(Utf16To32(str)). Or you may use my handy helper function: UtfConv<std::wstring>(std::string("abc")) for UTF-8 to UTF-16, or UtfConv<std::string>(std::wstring(L"abc")) for UTF-16 to UTF-8. In fact UtfConv can convert from any UTF-encoded string to any other. See examples of these and other usages inside the Test(cs) macro.
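For instance, a quick sketch using the functions defined below (the non-ASCII literal requires the source-encoding options described later):

std::string    u8   = "Привет";                  // UTF-8
std::u16string u16  = Utf32To16(Utf8To32(u8));   // UTF-8  -> UTF-16
std::string    back = Utf32To8(Utf16To32(u16));  // UTF-16 -> UTF-8
std::wstring   w    = UtfConv<std::wstring>(u8); // UTF-8  -> wchar_t-based string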

Both variants are C++11 compliant. They compile under Clang, GCC, and MSVC (see the "Try it online!" links below) and have been tested on Windows and Linux.

You have to save both of my code snippets in files with UTF-8 encoding, and pass the options -finput-charset=UTF-8 -fexec-charset=UTF-8 to Clang/GCC, or /utf-8 to MSVC. The UTF-8 encoding and these options are needed only because I put string literals with non-ASCII characters into the code, purely for testing. To use the functions themselves you need neither the UTF-8 encoding nor the options.

The inclusions of <windows.h>, <clocale>, and <iostream>, as well as the calls to SetConsoleOutputCP(65001) and std::setlocale(LC_ALL, "en_US.UTF-8"), are needed only for testing, to set up the console for correct UTF-8 output. The conversion functions themselves don't need them.

Part of the code is not strictly necessary: the UtfHelper-related structure and functions are just convenience wrappers, created mainly to handle std::wstring in a cross-platform way, because wchar_t is usually 32-bit on Linux and 16-bit on Windows. The low-level functions Utf8To32, Utf32To8, Utf16To32, and Utf32To16 are the only things really needed for conversion.
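For example (a sketch; which specialization runs depends on the platform's wchar_t width):

std::u32string u32 = UtfTo32(std::wstring(L"abc"));
// With a 16-bit wchar_t (Windows) this dispatches to Utf16To32;
// with a 32-bit wchar_t (Linux) it is a plain pass-through copy.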

Variant 1 was written from the Wikipedia descriptions of the UTF-8 and UTF-16 encodings.

If you find bugs or possible improvements (especially in Variant 1), please tell me and I'll fix them.


Variant 1

Try it online!

#include <string>
#include <iostream>
#include <stdexcept>
#include <type_traits>
#include <cstdint>

#ifdef _WIN32
    #include <windows.h>
#else
    #include <clocale>
#endif

#define ASSERT_MSG(cond, msg) { if (!(cond)) throw std::runtime_error("Assertion (" #cond ") failed at line " + std::to_string(__LINE__) + "! Msg: " + std::string(msg)); }
#define ASSERT(cond) ASSERT_MSG(cond, "")

template <typename U8StrT = std::string>
inline static U8StrT Utf32To8(std::u32string const & s) {
    static_assert(sizeof(typename U8StrT::value_type) == 1, "Char byte-size should be 1 for UTF-8 strings!");
    typedef typename U8StrT::value_type VT;
    typedef uint8_t u8;
    U8StrT r;
    for (auto c: s) {
        // Number of UTF-8 bytes needed for this code point (up to 7, so any 32-bit value round-trips).
        size_t nby = c <= 0x7FU ? 1 : c <= 0x7FFU ? 2 : c <= 0xFFFFU ? 3 : c <= 0x1FFFFFU ? 4 : c <= 0x3FFFFFFU ? 5 : c <= 0x7FFFFFFFU ? 6 : 7;
        // First byte: nby leading 1-bits as the prefix, then the high bits of the code point.
        r.push_back(VT(
            nby <= 1 ? u8(c) : (
                (u8(0xFFU) << (8 - nby)) |
                u8(c >> (6 * (nby - 1)))
            )
        ));
        // Continuation bytes: 10xxxxxx with 6 payload bits each.
        for (size_t i = 1; i < nby; ++i)
            r.push_back(VT(u8(0x80U | (u8(0x3FU) & u8(c >> (6 * (nby - 1 - i)))))));
    }
    return r;
}

template <typename U8StrT>
inline static std::u32string Utf8To32(U8StrT const & s) {
    static_assert(sizeof(typename U8StrT::value_type) == 1, "Char byte-size should be 1 for UTF-8 strings!");
    typedef uint8_t u8;
    std::u32string r;
    auto it = (u8 const *)s.c_str(), end = (u8 const *)(s.c_str() + s.length());
    while (it < end) {
        char32_t c = 0;
        if (*it <= 0x7FU) {
            c = *it;
            ++it;
        } else {
            ASSERT((*it & 0xC0U) == 0xC0U); // Must be a lead byte (11xxxxxx).
            size_t nby = 0;
            // Count the lead byte's leading 1-bits to get the sequence length.
            for (u8 b = *it; (b & 0x80U) != 0; b <<= 1, ++nby) {(void)0;}
            ASSERT(nby <= 7);
            ASSERT(size_t(end - it) >= nby); // Enough input bytes must remain.
            c = *it & (u8(0xFFU) >> (nby + 1));
            for (size_t i = 1; i < nby; ++i) {
                ASSERT((it[i] & 0xC0U) == 0x80U);
                c = (c << 6) | (it[i] & 0x3FU);
            }
            it += nby;
        }
        r.push_back(c);
    }
    return r;
}


template <typename U16StrT = std::u16string>
inline static U16StrT Utf32To16(std::u32string const & s) {
    static_assert(sizeof(typename U16StrT::value_type) == 2, "Char byte-size should be 2 for UTF-16 strings!");
    typedef typename U16StrT::value_type VT;
    typedef uint16_t u16;
    U16StrT r;
    for (auto c: s) {
        if (c <= 0xFFFFU)
            r.push_back(VT(c));
        else {
            ASSERT(c <= 0x10FFFFU); // Code points above U+10FFFF are not representable in UTF-16.
            c -= 0x10000U;
            r.push_back(VT(u16(0xD800U | ((c >> 10) & 0x3FFU)))); // High surrogate.
            r.push_back(VT(u16(0xDC00U | (c & 0x3FFU))));         // Low surrogate.
        }
    }
    return r;
}

template <typename U16StrT>
inline static std::u32string Utf16To32(U16StrT const & s) {
    static_assert(sizeof(typename U16StrT::value_type) == 2, "Char byte-size should be 2 for UTF-16 strings!");
    typedef uint16_t u16;
    std::u32string r;
    auto it = (u16 const *)s.c_str(), end = (u16 const *)(s.c_str() + s.length());
    while (it < end) {
        char32_t c = 0;
        if (*it < 0xD800U || *it > 0xDFFFU) {
            c = *it; // Not a surrogate: the unit is the code point itself.
            ++it;
        } else if (*it >= 0xDC00U) { // A low surrogate may not appear first.
            ASSERT_MSG(false, "Unallowed UTF-16 sequence!");
        } else {
            ASSERT(end - it >= 2);
            c = (*it & 0x3FFU) << 10;
            if ((it[1] < 0xDC00U) || (it[1] > 0xDFFFU)) {
                ASSERT_MSG(false, "Unallowed UTF-16 sequence!");
            } else {
                c |= it[1] & 0x3FFU;
                c += 0x10000U;
            }
            it += 2;
        }
        r.push_back(c);
    }
    return r;
}


template <typename StrT, size_t NumBytes = sizeof(typename StrT::value_type)> struct UtfHelper;
template <typename StrT> struct UtfHelper<StrT, 1> {
    inline static std::u32string UtfTo32(StrT const & s) { return Utf8To32(s); }
    inline static StrT UtfFrom32(std::u32string const & s) { return Utf32To8<StrT>(s); }
};
template <typename StrT> struct UtfHelper<StrT, 2> {
    inline static std::u32string UtfTo32(StrT const & s) { return Utf16To32(s); }
    inline static StrT UtfFrom32(std::u32string const & s) { return Utf32To16<StrT>(s); }
};
template <typename StrT> struct UtfHelper<StrT, 4> {
    inline static std::u32string UtfTo32(StrT const & s) {
        return std::u32string((char32_t const *)(s.c_str()), (char32_t const *)(s.c_str() + s.length()));
    }
    inline static StrT UtfFrom32(std::u32string const & s) {
        return StrT((typename StrT::value_type const *)(s.c_str()),
            (typename StrT::value_type const *)(s.c_str() + s.length()));
    }
};
template <typename StrT> inline static std::u32string UtfTo32(StrT const & s) {
    return UtfHelper<StrT>::UtfTo32(s);
}
template <typename StrT> inline static StrT UtfFrom32(std::u32string const & s) {
    return UtfHelper<StrT>::UtfFrom32(s);
}
template <typename StrToT, typename StrFromT> inline static StrToT UtfConv(StrFromT const & s) {
    return UtfFrom32<StrToT>(UtfTo32(s));
}

#define Test(cs) \
    std::cout << Utf32To8(Utf8To32(std::string(cs))) << ", "; \
    std::cout << Utf32To8(Utf16To32(Utf32To16(Utf8To32(std::string(cs))))) << ", "; \
    std::cout << Utf32To8(Utf16To32(std::u16string(u##cs))) << ", "; \
    std::cout << Utf32To8(std::u32string(U##cs)) << ", "; \
    std::cout << UtfConv<std::string>(UtfConv<std::u16string>(UtfConv<std::u32string>(UtfConv<std::u32string>(UtfConv<std::u16string>(std::string(cs)))))) << ", "; \
    std::cout << UtfConv<std::string>(UtfConv<std::wstring>(UtfConv<std::string>(UtfConv<std::u32string>(UtfConv<std::u32string>(std::string(cs)))))) << ", "; \
    std::cout << UtfFrom32<std::string>(UtfTo32(std::string(cs))) << ", "; \
    std::cout << UtfFrom32<std::string>(UtfTo32(std::u16string(u##cs))) << ", "; \
    std::cout << UtfFrom32<std::string>(UtfTo32(std::wstring(L##cs))) << ", "; \
    std::cout << UtfFrom32<std::string>(UtfTo32(std::u32string(U##cs))) << std::endl; \
    std::cout << "UTF-8 num bytes: " << std::dec << Utf32To8(std::u32string(U##cs)).size() << ", "; \
    std::cout << "UTF-16 num bytes: " << std::dec << (Utf32To16(std::u32string(U##cs)).size() * 2) << std::endl;

int main() {
    #ifdef _WIN32
        SetConsoleOutputCP(65001);
    #else
        std::setlocale(LC_ALL, "en_US.UTF-8");
    #endif
    try {
        Test("World");
        Test("Привет");
        Test("𐐷𤭢");
        Test("𝞹");
        return 0;
    } catch (std::exception const & ex) {
        std::cout << "Exception: " << ex.what() << std::endl;
        return -1;
    }
}

Output:

World, World, World, World, World, World, World, World, World, World
UTF-8 num bytes: 5, UTF-16 num bytes: 10
Привет, Привет, Привет, Привет, Привет, Привет, Привет, Привет, Привет, Привет
UTF-8 num bytes: 12, UTF-16 num bytes: 12
𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢
UTF-8 num bytes: 8, UTF-16 num bytes: 8
𝞹, 𝞹, 𝞹, 𝞹, 𝞹, 𝞹, 𝞹, 𝞹, 𝞹, 𝞹
UTF-8 num bytes: 4, UTF-16 num bytes: 4

Variant 2

Try it online!

#include <string>
#include <iostream>
#include <stdexcept>
#include <type_traits>
#include <locale>
#include <codecvt>
#include <cstdint>

#ifdef _WIN32
    #include <windows.h>
#else
    #include <clocale>
#endif

#define ASSERT(cond) { if (!(cond)) throw std::runtime_error("Assertion (" #cond ") failed at line " + std::to_string(__LINE__) + "!"); }

// Workaround for some of MSVC compilers.
#if defined(_MSC_VER) && (!_DLL) && (_MSC_VER >= 1900 /* VS 2015*/) && (_MSC_VER <= 1914 /* VS 2017 */)
std::locale::id std::codecvt<char16_t, char, _Mbstatet>::id;
std::locale::id std::codecvt<char32_t, char, _Mbstatet>::id;
#endif

template <typename U8StrT>
inline static std::u32string Utf8To32(U8StrT const & s) {
    static_assert(sizeof(typename U8StrT::value_type) == 1, "Char byte-size should be 1 for UTF-8 strings!");
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> utf_8_32_conv_;
    return utf_8_32_conv_.from_bytes((char const *)s.c_str(), (char const *)(s.c_str() + s.length()));
}

template <typename U8StrT = std::string>
inline static U8StrT Utf32To8(std::u32string const & s) {
    static_assert(sizeof(typename U8StrT::value_type) == 1, "Char byte-size should be 1 for UTF-8 strings!");
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> utf_8_32_conv_;
    std::string res = utf_8_32_conv_.to_bytes(s.c_str(), s.c_str() + s.length());
    return U8StrT(
        (typename U8StrT::value_type const *)(res.c_str()),
        (typename U8StrT::value_type const *)(res.c_str() + res.length()));
}

template <typename U16StrT>
inline static std::u32string Utf16To32(U16StrT const & s) {
    static_assert(sizeof(typename U16StrT::value_type) == 2, "Char byte-size should be 2 for UTF-16 strings!");
    // The std::little_endian flag means the char16_t buffer's bytes are read as
    // little-endian, so this assumes a little-endian host.
    std::wstring_convert<std::codecvt_utf16<char32_t, 0x10ffff, std::little_endian>, char32_t> utf_16_32_conv_;
    return utf_16_32_conv_.from_bytes((char const *)s.c_str(), (char const *)(s.c_str() + s.length()));
}

template <typename U16StrT = std::u16string>
inline static U16StrT Utf32To16(std::u32string const & s) {
    static_assert(sizeof(typename U16StrT::value_type) == 2, "Char byte-size should be 2 for UTF-16 strings!");
    std::wstring_convert<std::codecvt_utf16<char32_t, 0x10ffff, std::little_endian>, char32_t> utf_16_32_conv_;
    std::string res = utf_16_32_conv_.to_bytes(s.c_str(), s.c_str() + s.length());
    return U16StrT(
        (typename U16StrT::value_type const *)(res.c_str()),
        (typename U16StrT::value_type const *)(res.c_str() + res.length()));
}


template <typename StrT, size_t NumBytes = sizeof(typename StrT::value_type)> struct UtfHelper;
template <typename StrT> struct UtfHelper<StrT, 1> {
    inline static std::u32string UtfTo32(StrT const & s) { return Utf8To32(s); }
    inline static StrT UtfFrom32(std::u32string const & s) { return Utf32To8<StrT>(s); }
};
template <typename StrT> struct UtfHelper<StrT, 2> {
    inline static std::u32string UtfTo32(StrT const & s) { return Utf16To32(s); }
    inline static StrT UtfFrom32(std::u32string const & s) { return Utf32To16<StrT>(s); }
};
template <typename StrT> struct UtfHelper<StrT, 4> {
    inline static std::u32string UtfTo32(StrT const & s) {
        return std::u32string((char32_t const *)(s.c_str()), (char32_t const *)(s.c_str() + s.length()));
    }
    inline static StrT UtfFrom32(std::u32string const & s) {
        return StrT((typename StrT::value_type const *)(s.c_str()),
            (typename StrT::value_type const *)(s.c_str() + s.length()));
    }
};
template <typename StrT> inline static std::u32string UtfTo32(StrT const & s) {
    return UtfHelper<StrT>::UtfTo32(s);
}
template <typename StrT> inline static StrT UtfFrom32(std::u32string const & s) {
    return UtfHelper<StrT>::UtfFrom32(s);
}
template <typename StrToT, typename StrFromT> inline static StrToT UtfConv(StrFromT const & s) {
    return UtfFrom32<StrToT>(UtfTo32(s));
}

#define Test(cs) \
    std::cout << Utf32To8(Utf8To32(std::string(cs))) << ", "; \
    std::cout << Utf32To8(Utf16To32(Utf32To16(Utf8To32(std::string(cs))))) << ", "; \
    std::cout << Utf32To8(Utf16To32(std::u16string(u##cs))) << ", "; \
    std::cout << Utf32To8(std::u32string(U##cs)) << ", "; \
    std::cout << UtfConv<std::string>(UtfConv<std::u16string>(UtfConv<std::u32string>(UtfConv<std::u32string>(UtfConv<std::u16string>(std::string(cs)))))) << ", "; \
    std::cout << UtfConv<std::string>(UtfConv<std::wstring>(UtfConv<std::string>(UtfConv<std::u32string>(UtfConv<std::u32string>(std::string(cs)))))) << ", "; \
    std::cout << UtfFrom32<std::string>(UtfTo32(std::string(cs))) << ", "; \
    std::cout << UtfFrom32<std::string>(UtfTo32(std::u16string(u##cs))) << ", "; \
    std::cout << UtfFrom32<std::string>(UtfTo32(std::wstring(L##cs))) << ", "; \
    std::cout << UtfFrom32<std::string>(UtfTo32(std::u32string(U##cs))) << std::endl; \
    std::cout << "UTF-8 num bytes: " << std::dec << Utf32To8(std::u32string(U##cs)).size() << ", "; \
    std::cout << "UTF-16 num bytes: " << std::dec << (Utf32To16(std::u32string(U##cs)).size() * 2) << std::endl;

int main() {
    #ifdef _WIN32
        SetConsoleOutputCP(65001);
    #else
        std::setlocale(LC_ALL, "en_US.UTF-8");
    #endif
    try {
        Test("World");
        Test("Привет");
        Test("𐐷𤭢");
        Test("𝞹");
        return 0;
    } catch (std::exception const & ex) {
        std::cout << "Exception: " << ex.what() << std::endl;
        return -1;
    }
}

Output:

World, World, World, World, World, World, World, World, World, World
UTF-8 num bytes: 5, UTF-16 num bytes: 10
Привет, Привет, Привет, Привет, Привет, Привет, Привет, Привет, Привет, Привет
UTF-8 num bytes: 12, UTF-16 num bytes: 12
𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢
UTF-8 num bytes: 8, UTF-16 num bytes: 8
𝞹, 𝞹, 𝞹, 𝞹, 𝞹, 𝞹, 𝞹, 𝞹, 𝞹, 𝞹
UTF-8 num bytes: 4, UTF-16 num bytes: 4



You can do something like this:

#include <string>
#include <vector>
#include <sstream>
#include <stdexcept>
#include <windows.h>

std::string WstrToUtf8Str(const std::wstring& wstr)
{
  std::string retStr;
  if (!wstr.empty())
  {
    // First call computes the required buffer size in bytes; passing -1 as the
    // source length makes it include the terminating null.
    int sizeRequired = WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), -1, NULL, 0, NULL, NULL);

    if (sizeRequired > 0)
    {
      std::vector<char> utf8String(sizeRequired);
      int bytesConverted = WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(),
                           -1, &utf8String[0], (int)utf8String.size(), NULL,
                           NULL);
      if (bytesConverted != 0)
      {
        retStr = &utf8String[0];
      }
      else
      {
        // Note: don't stream the wide string into a narrow std::stringstream --
        // that would print the pointer value, not the text.
        std::stringstream err;
        err << __FUNCTION__ << " failed to convert wstring to UTF-8";
        throw std::runtime_error( err.str() );
      }
    }
  }
  return retStr;
}

You can pass your BSTR to the function as a std::wstring.
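For example, in the question's method it could be used like this (a sketch; SysStringLen returns the BSTR's length in characters):

STDMETHODIMP CFileSystemAPI::setRRConfig( BSTR config_str, VARIANT* ret )
{
    // A BSTR points to a length-prefixed wide-character string,
    // so it can be wrapped in a std::wstring directly.
    std::string configuration_str =
        WstrToUtf8Str( std::wstring( config_str, SysStringLen( config_str ) ) );
    // ...
}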


Another option is a small function that UTF-8-encodes one code point at a time:

// Appends the UTF-8 encoding of one code point to buffer, advancing *offset.
// Note: on Windows wchar_t is only 16 bits, so the branches above 0xFFFF
// are reachable only on platforms with a 32-bit wchar_t.
void encode_unicode_character(char* buffer, int* offset, wchar_t ucs_character)
{
    if (ucs_character <= 0x7F)
    {
        // Plain single-byte ASCII.
        buffer[(*offset)++] = (char) ucs_character;
    }
    else if (ucs_character <= 0x7FF)
    {
        // Two bytes.
        buffer[(*offset)++] = (char)(0xC0 | (ucs_character >> 6));
        buffer[(*offset)++] = (char)(0x80 | ((ucs_character >> 0) & 0x3F));
    }
    else if (ucs_character <= 0xFFFF)
    {
        // Three bytes.
        buffer[(*offset)++] = (char)(0xE0 | (ucs_character >> 12));
        buffer[(*offset)++] = (char)(0x80 | ((ucs_character >> 6) & 0x3F));
        buffer[(*offset)++] = (char)(0x80 | ((ucs_character >> 0) & 0x3F));
    }
    else if (ucs_character <= 0x1FFFFF)
    {
        // Four bytes.
        buffer[(*offset)++] = (char)(0xF0 | (ucs_character >> 18));
        buffer[(*offset)++] = (char)(0x80 | ((ucs_character >> 12) & 0x3F));
        buffer[(*offset)++] = (char)(0x80 | ((ucs_character >> 6) & 0x3F));
        buffer[(*offset)++] = (char)(0x80 | ((ucs_character >> 0) & 0x3F));
    }
    else if (ucs_character <= 0x3FFFFFF)
    {
        // Five bytes.
        buffer[(*offset)++] = (char)(0xF8 | (ucs_character >> 24));
        buffer[(*offset)++] = (char)(0x80 | ((ucs_character >> 18) & 0x3F));
        buffer[(*offset)++] = (char)(0x80 | ((ucs_character >> 12) & 0x3F));
        buffer[(*offset)++] = (char)(0x80 | ((ucs_character >> 6) & 0x3F));
        buffer[(*offset)++] = (char)(0x80 | ((ucs_character >> 0) & 0x3F));
    }
    else if (ucs_character <= 0x7FFFFFFF)
    {
        // Six bytes.
        buffer[(*offset)++] = (char)(0xFC | (ucs_character >> 30));
        buffer[(*offset)++] = (char)(0x80 | ((ucs_character >> 24) & 0x3F));
        buffer[(*offset)++] = (char)(0x80 | ((ucs_character >> 18) & 0x3F));
        buffer[(*offset)++] = (char)(0x80 | ((ucs_character >> 12) & 0x3F));
        buffer[(*offset)++] = (char)(0x80 | ((ucs_character >> 6) & 0x3F));
        buffer[(*offset)++] = (char)(0x80 | ((ucs_character >> 0) & 0x3F));
    }
    else
    {
        // Invalid code point; don't encode anything.
    }
}

ISO/IEC 10646:2012 is all you need to understand UCS.
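Note that this function takes a whole code point (effectively UTF-32), not UTF-16 code units; as the comments below point out, to use it with UTF-16 input a surrogate pair must first be combined into one code point. A minimal sketch (the helper name decode_utf16_pair is mine):

#include <stdint.h>

// Combine a UTF-16 surrogate pair into a single code point.
// Assumes 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF.
uint32_t decode_utf16_pair(uint16_t high, uint16_t low)
{
    return 0x10000u + ((uint32_t)(high - 0xD800u) << 10) + (uint32_t)(low - 0xDC00u);
}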

Comments:

UCS is not in the question. And UCS is not UTF-16. Is your code valid for UTF-16?
I did not say that UTF-16 is UCS; rather, it is part of it, like UTF-8 is.
@kw I think you should answer the question that was asked rather than the question that you happen to have some code for. Take another read of the question title and observe that the question concerns conversion from UTF-16 to UTF-8.
We don't understand why you mention UCS when the question is about UTF-16. We also fail to understand why you present code that converts UTF-32 to UTF-8 when the question is about UTF-16. It is also a mistake to assume that wchar_t can be used to hold a UTF-32 character element. On some systems it can, but not all.
@kw No. The other way round. The question asks, and I feel like a stuck record, to convert from UTF-16. That means that the input is UTF-16. Your code converts from UTF-32. So to use it one would need to convert from UTF-16 to UTF-32, and then on to UTF-8.
2

If you are using C++11, you may check this out:

http://www.cplusplus.com/reference/codecvt/codecvt_utf8_utf16/
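For example, here is a minimal sketch using std::wstring_convert with that facet (both deprecated since C++17; the helper name Utf16ToUtf8 is just for illustration):

#include <codecvt>
#include <locale>
#include <string>

std::string Utf16ToUtf8(const std::u16string& s)
{
    // codecvt_utf8_utf16<char16_t> converts between UTF-16 code units and UTF-8 bytes.
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(s); // throws std::range_error on invalid input
}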

Comments:

Could you show me an example? I do not understand how to work with it. The BSTR input parameter is UTF-16LE.
I haven't the time to create one, but the linked page covers that very explicitly. I hope that helps.
@user3252635 there is an example in the linked documentation. There are better examples at cppreference.com. Also, look at std::wstring_convert.
Deprecated in C++17, removed in C++26.
