How to work with non-ascii characters in strings in C++?

Question

While writing a program, I'm having trouble combining special characters with regular ones. When I print either kind to the console on its own, it works fine, but when I print a special character and a normal character on the same line, I get garbled characters instead of the expected output. My code:

#include <fstream>
#include <iostream>
#include <string>

using namespace std;

void initCharacterMap(){
    const string normal = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890!@#$%^&*()-_[]{};':\",.<>/?";
    const string inverse = "∀𐐒Ↄ◖ƎℲ⅁HIſ⋊⅂WᴎOԀΌᴚS⊥∩ᴧMX⅄Zɐqɔpǝɟƃɥıɾʞʃɯuodbɹsʇnʌʍxʎz12Ɛᔭ59Ɫ860¡@#$%^⅋*)(-‾][}{؛,:„'˙></¿";

    cout << normal << endl;

    for(int i=0;i<normal.length();i++){
        cout << normal[i];
    }
    cout << endl;

    cout << inverse << endl;

    for(int i=0;i<inverse.length();i++){
        cout << inverse[i];
    }
    cout << endl;

    for(int i=0;i<inverse.length();i++){
        cout << normal[i] << inverse[i] << endl;
    }
}

int main() {
    initCharacterMap();
    return 0;
}

And the console output: https://paste.ubuntu.com/p/H9bqh67WPZ/

When viewed in the console, the \XX characters show up as the unknown-character symbol, and when I opened that log, I was warned that some characters couldn't be displayed and that editing could corrupt the file.

If anyone has any advice on how I can fix this, it would be greatly appreciated.

EDIT: After following the suggestion in Marek R's answer, the situation greatly improved, but this still isn't quite giving me the results I want. New code:

#include <fstream>
#include <iostream>
#include <string>

using namespace std;

void initCharacterMap(){
    const wchar_t normal[] = L"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890!@#$%^&*()-_[]{};':\",.<>/?";
    const wchar_t inverse[] = L"∀𐐒Ↄ◖ƎℲ⅁HIſ⋊⅂WᴎOԀΌᴚS⊥∩ᴧMX⅄Zɐqɔpǝɟƃɥıɾʞʃɯuodbɹsʇnʌʍxʎz12Ɛᔭ59Ɫ860¡@#$%^⅋*)(-‾][}{؛,:„'˙></¿";

    wcout << normal << endl;

    for(int i=0;i<sizeof(normal)/sizeof(normal[0]);i++){
        wcout << normal[i];
    }
    wcout << endl;

    wcout << inverse << endl;

    for(int i=0;i<sizeof(inverse)/sizeof(inverse[0]);i++){
        wcout << inverse[i];
    }
    wcout << endl;

    for(int i=0;i<sizeof(inverse)/sizeof(inverse[0]);i++){
        wcout << normal[i] << inverse[i] << endl;
    }
}

int main() {
    initCharacterMap();
    return 0;
}

New console output: https://paste.ubuntu.com/p/hcM7JB99zj/

So, I no longer have problems printing the contents of the two strings together, but now all non-ASCII characters are replaced with question marks in the output. Is there any way to make those characters print properly?

Take a look at std::string vs std::wstring. The latter is made especially to represent characters outside the ASCII range (wchar_t is larger than char).
– Fureeish, Commented Feb 16, 2018 at 23:25
First of all you will need to stop calling them "special characters" and find out what you are actually storing ;)
– Lightness Races in Orbit, Commented Feb 16, 2018 at 23:36
Thanks for the advice, guys. I've followed it and updated the post accordingly.
– The_Fireplace, Commented Feb 17, 2018 at 2:37

Marek R · Accepted Answer · 2018-02-17 17:03:33Z

Most probably your code is using UTF-8 encoding. This means that a single character can occupy from one to four bytes. Note that the value of inverse.size() is bigger than you are expecting.

std::string doesn't know anything about encoding, so it treats each byte as a character. The output console interprets the sequence of bytes according to its encoding and shows the proper characters.

When you print each string separately, byte by byte, it works because the byte sequence stays intact. When you print one byte from one string and then one byte from the other, the multi-byte sequences get split apart and things get messy.

The easiest way to fix it is to use std::wstring, wchar_t and L"some literal". It should work in your case, but as pointed out in the comments below, on some platforms some characters may not fit into a single wide character. If you want to know more, read about the different character encodings.

The other way to solve your problem is to use a map that transforms one sequence of bytes (a string) into another sequence (a string). C++11:

auto dictionary = std::unordered_map<std::string, std::string> {
    { "A", "∀" },
    { "B", "𐐒" },
    { "C", "Ↄ" },
    { "D", "◖" },
    … … …
};

Edit: I've tested your new code, and you should add code that configures the locale for the output stream.

On my Mac (with a Polish locale), when building with clang, the application ignores the inverted values (wcout goes into an invalid state), but when the locale is set everything works as you expect.

#include <fstream>
#include <iostream>
#include <string>
#include <locale>

using namespace std;

void initCharacterMap(){
    wcout.imbue(locale(""));

    const auto normal = L"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890!@#$%^&*()-_[]{};':\",.<>/?"s;
    const auto inverse = L"∀𐐒Ↄ◖ƎℲ⅁HIſ⋊⅂WᴎOԀΌᴚS⊥∩ᴧMX⅄Zɐqɔpǝɟƃɥıɾʞʃɯuodbɹsʇnʌʍxʎz12Ɛᔭ59Ɫ860¡@#$%^⅋*)(-‾][}{؛,:„'˙></¿"s;

    wcout << normal << endl;

    for(auto ch : normal){
        wcout << ch;
    }
    wcout << endl;

    wcout << inverse << endl;

    for(auto ch : inverse){
        wcout << ch;
    }
    wcout << endl;

    for(size_t i=0; i < inverse.length(); ++i){
        wcout << normal[i] << inverse[i] << endl;
    }
}

int main() {
    initCharacterMap();
    return 0;
}

https://wandbox.org/permlink/nTYi5RbZgZXclE5r

I suspect that the standard library in your compiler also doesn't know how to perform the conversion with the default locale, so it prints question marks instead of the actual characters. So add these two lines (the include and the imbue call) and it should work. If not, then provide information about your platform and compiler.

edited Feb 17, 2018 at 17:03

answered Feb 16, 2018 at 23:33

Marek R


9 Comments

Mooing Duck Over a year ago

The std::wstring suggestion will only work for some characters, depending on your compiler. On most Windows compilers, std::wstring will still need more than one wchar_t for emoji and some Chinese characters.

Lightness Races in Orbit Over a year ago

This answer was great until the last line, which is wrong.

Lightness Races in Orbit Over a year ago

I'm actually flipping my downvote to an upvote because the bulk is right and useful. But please fix that last line.

Marek R Over a year ago

fixed and provided an alternative

Lightness Races in Orbit Over a year ago

Yeah like I alluded to before, switching from encoding-unaware one-byte "characters" to encoding-unaware two-byte "characters" doesn't solve the problem. It simply changes the problem. Assuming UTF-8, what you actually need is a library that can actually recognise and deal with UTF-8. Fortunately, this is not hard to find. I don't remember exactly whether the featureset has what you need, but utfcpp.sourceforge.net is really good for quick lightweight stuff. (Unfortunately half of SourceForge has been offline all bloody day -.-)
