Unicode and std::string in C++

Question

If I write a random string to file in C++ consisting of some unicode characters, I am told by my text editor that I have not created a valid UTF-8 file.

// Code example
const std::string charset = "abcdefgàèíüŷÀ";
file << random_string(charset); // using std::fstream

What can I do to solve this? Do I have to do lots of additional manual encoding? The way I understand it, std::string does not care about the encoding, only the bytes, so when I pass it a unicode string and write it to file, surely that file should contain the same bytes and be recognized as a UTF-8 encoded file?

Just a wild guess: could it be your random_string function is accidentally inserting nulls due to an off-by-one error with the charset string?
– Charles Salvia, Commented Oct 29, 2010 at 12:40
@Charles: That would be just like me :) But I doubt it, as the std::string constructor discards the null from the string literal, and the random_string function just picks a random character from the charset string.
– user1481860, Commented Oct 29, 2010 at 12:42
std::string doesn't necessarily discard the null from the string literal. Usually, it internally represents the string as a null-terminated C-string, in order to easily implement the std::string::c_str() function.
– Charles Salvia, Commented Oct 29, 2010 at 12:44
The characters 'àèíüŷÀ' are not UTF-8. Note that UTF-8 is a multibyte character set. This means that charset[x] is not guaranteed to get you a whole character, as it may be split across more than one char.
– Loki Astari, Commented Oct 29, 2010 at 14:54

Community · Accepted Answer · 2021-10-07 05:49:19Z

14

random_string is likely to be the culprit; I wonder how it's implemented. If your string is indeed UTF-8-encoded and random_string looks like

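#include <cstdlib>  // rand
#include <string>

std::string random_string(std::string const &charset)
{
    const int N = 10;
    std::string result(N, '\0');  // N placeholder bytes, overwritten below
    for (int i=0; i<N; i++)
        result[i] = charset[rand() % charset.size()];  // picks one byte, not one character
    return result;
}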

then it will take random chars from charset, which in UTF-8 (as other posters have pointed out) are not Unicode code points but single bytes. If it selects a byte from the middle of a UTF-8 multibyte character as the first byte of the result (or puts one after a 7-bit ASCII-compatible character), then your output will not be valid UTF-8. See Wikipedia and RFC 3629.

The solution might be to transform to and from UTF-32 in random_string. I believe wchar_t and std::wstring use UTF-32 on Linux. UTF-16 would also be safe, as long as you stay within the Basic Multilingual Plane.
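A minimal sketch of that idea, assuming charset holds well-formed UTF-8: decode it to UTF-32 code points, pick whole code points at random, and encode each pick back to UTF-8. The helper names decode_utf8 and append_utf8 and the extra length parameter n are hypothetical, not from the original code, and the decoder does no validation.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Decode UTF-8 into code points. Assumes well-formed input; no validation.
std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;
        int extra;  // number of continuation bytes that follow the lead byte
        if (b < 0x80)      { cp = b;        extra = 0; }
        else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }
        else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }
        else               { cp = b & 0x07; extra = 3; }
        ++i;
        for (; extra > 0 && i < in.size(); --extra, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        out.push_back(cp);
    }
    return out;
}

// Append one code point to a string as UTF-8 (1 to 4 bytes).
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Pick n random code points (not bytes) from a UTF-8 charset.
std::string random_string(const std::string& charset_utf8, int n = 10) {
    const std::u32string code_points = decode_utf8(charset_utf8);
    std::string result;
    for (int i = 0; i < n && !code_points.empty(); ++i)
        append_utf8(result, code_points[std::rand() % code_points.size()]);
    return result;
}
```

Because every pick is a whole code point, the result is valid UTF-8 whenever the charset is, and it can be written to the std::fstream unchanged.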

edited Oct 7, 2021 at 5:49

CommunityBot


answered Oct 29, 2010 at 12:47

Fred Foo


11 Comments

user1481860 Over a year ago

So if a std::string named "str" contains "àỳ", str[0] won't return "à"? And str[1] won't return "ỳ"?

Fred Foo Over a year ago

No, it will return the first byte in the multi-byte encoding for these characters. C++ is a 1980s invention, designed to be compatible with C (1970s) and ASCII (1960s), while Unicode and UTF-8 were introduced in the early 90s. UTF-8 was designed to keep most old programs and algorithms working, but it looks like you used one of the algorithms that break, if this is more or less what random_string does.

user1481860 Over a year ago

It is. I guess this means that whenever I want to manipulate a unicode string I must use a wstring. I'll read up on portability issues and such. Anyway, answer accepted.

Fred Foo Over a year ago

Correction to my previous comment: str[1] will return the second byte in the encoding for à.

user1481860 Over a year ago

Is there anything wrong with using UTF-8 with wstring to solve the problem? Any particular reason why I'd have to convert to UTF-32 (or UTF-16)?


Charles Salvia · Accepted Answer · 2010-10-29 12:53:18Z

11

What can I do to solve this? Do I have to do lots of additional manual encoding? The way I understand it, std::string does not care about the encoding, only the bytes, so when I pass it a unicode string and write it to file, surely that file should contain the same bytes and be recognized as a UTF-8 encoded file?

You are correct that std::string is encoding agnostic. It simply holds an array of char elements. How these char elements are interpreted as text depends on the environment. If your locale is not set to some form of Unicode (e.g. UTF-8 or UTF-16), then when you output a string it will not be displayed/interpreted as Unicode.

Are you sure your string literal "abcdefgàèíüŷÀ" is actually Unicode and not, for example, Latin-1 (ISO-8859-1, or possibly Windows-1252)? You need to determine what locale your platform is currently configured to use.

-----------EDIT-----------

I think I know your problem: some of those Unicode characters in your charset string literal, like the accented character "À", are two-byte characters (assuming a UTF-8 encoding). When you address the character-set string using the [] operator in your random_string function, you are returning half of a Unicode character. Thus the random-string function creates an invalid character string.

For example, consider the following code:

std::string s = "À";
std::cout << s.length() << std::endl;

In an environment where the string literal is interpreted as UTF-8, this program will output 2. Therefore, the first character of the string (s[0]) is only half of a Unicode character, and therefore not valid. Since your random_string function is addressing the string by single bytes using the [] operator, you're creating invalid random strings.
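A small sketch of how to tell the two bytes apart, assuming well-formed UTF-8. The helper names is_continuation_byte and count_code_points are hypothetical. In UTF-8, every byte after the first byte of a multibyte character has the bit pattern 10xxxxxx, so counting the bytes that do not match that pattern gives the number of code points.

```cpp
#include <cstddef>
#include <string>

// True if the byte is a UTF-8 continuation byte (bit pattern 10xxxxxx).
bool is_continuation_byte(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Count code points by counting every byte that starts a character.
// Assumes well-formed UTF-8; does not validate.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (char c : utf8)
        if (!is_continuation_byte(c)) ++n;
    return n;
}
```

For the string "\xC3\x80" (the UTF-8 bytes of "À", written as escapes so the result does not depend on the source file's encoding), length() is 2 while count_code_points gives 1. s[0] is the lead byte and s[1] is a continuation byte, so neither is a valid character on its own.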

So yes, you need to use std::wstring, and create your charset string-literal using the L prefix.
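A rough sketch of the wide-string variant, with the hypothetical names wide_charset and random_wstring. It assumes every character in the charset fits in one wchar_t, which holds here because all of them are in the Basic Multilingual Plane. The \u escapes stand for àèíüŷÀ and keep the literal independent of the source file's encoding.

```cpp
#include <cstdlib>
#include <string>

// "abcdefgàèíüŷÀ" as a wide literal; each element is one whole character.
const std::wstring wide_charset = L"abcdefg\u00E0\u00E8\u00ED\u00FC\u0177\u00C0";

// Pick n random characters from a wide charset using operator[].
std::wstring random_wstring(const std::wstring& charset, int n = 10) {
    std::wstring result;
    for (int i = 0; i < n && !charset.empty(); ++i)
        result += charset[std::rand() % charset.size()];
    return result;
}
```

Here wide_charset.size() is 13, one element per character, so indexing never splits a character. Note that the resulting std::wstring still has to be converted to UTF-8 before it is written to a file that should be UTF-8.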

edited Oct 29, 2010 at 12:53

answered Oct 29, 2010 at 12:27

Charles Salvia


6 Comments

user1481860 Over a year ago

This is probably the issue, as I have earlier been able to read a unicode string from a file (encoded in UTF-8) into a std::string and output it to a different file. I'll look into it.

Šimon Tóth Over a year ago

And this is exactly why I said that you can't store multi-byte encodings in a std::string. But for some reason I got downvoted to oblivion.

Charles Salvia Over a year ago

@Let_Me_Be, because you can store multi-byte encodings in a std::string. I just did so in the example above. You simply can't address a single multi-byte character of the string using the [] operator.

Šimon Tóth Over a year ago

@Charles Yeah the same way I can use a linked list for random access.

Charles Salvia Over a year ago

@Let_Me_Be, well I didn't downvote you. But regardless, your suggestion of using std::vector<char> would result in the same problem. You couldn't address a single complete multibyte character.


Diego Sevilla · Accepted Answer · 2010-10-29 12:24:20Z

1

In your code sample, the std::string charset stores exactly the bytes you wrote. That is, if you used a UTF-8 text editor to write the source, what ends up in the output file will be exactly that UTF-8 text.

UTF-8 is just an encoding scheme in which different characters take different numbers of bytes. If you use a UTF-8 editor, it will encode, say, 'ñ' with two bytes, and when you write it to file, the file will contain those two bytes (again being valid UTF-8).

The problem may be the editor you used to create the source C++ file. It may use latin1 or some other encoding.
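One way to check which case applies is to test whether the bytes of the charset string form valid UTF-8. The function below is a sketch under the rules of RFC 3629, and the name is_valid_utf8 is hypothetical. A Latin-1 "à" is the single byte 0xE0, which fails the check, while the UTF-8 form 0xC3 0xA0 passes.

```cpp
#include <cstddef>
#include <string>

// Returns true if s is well-formed UTF-8 according to RFC 3629.
bool is_valid_utf8(const std::string& s) {
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
    static const char32_t min_cp[] = {0x0, 0x80, 0x800, 0x10000};
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t extra;  // continuation bytes expected after the lead byte
        char32_t cp;
        if (b < 0x80)                { extra = 0; cp = b; }
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; }
        else return false;  // stray continuation byte or invalid lead byte
        if (extra > s.size() - i - 1) return false;  // sequence is cut off
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, UTF-16 surrogates and values above U+10FFFF.
        if (cp < min_cp[extra] || (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            return false;
        i += extra + 1;
    }
    return true;
}
```

Running this on charset itself shows whether the editor saved the source as UTF-8. Running it on the output of random_string shows whether the function split a character.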

answered Oct 29, 2010 at 12:24

Diego Sevilla


Marcelo Cantos · Accepted Answer · 2010-10-29 12:23:16Z

0

To write UTF-8, you need to use a codecvt facet like this one. An example of how to use it can be seen here.

edited Oct 29, 2010 at 12:23

answered Oct 29, 2010 at 12:17

Marcelo Cantos


2 Comments

Loki Astari Over a year ago

Those are used to convert wchar_t (UTF-16/UTF-32) into UTF-8. Since the string is already UTF-8 no conversion is required.

Marcelo Cantos Over a year ago

@Martin: There is no guarantee that the string is UTF-8. If the source file was saved using codepage 437, the character à will be a single byte with the value 133. (In Unicode, à is represented by the code point U+00E0, which UTF-8 encodes as the byte sequence [0xc3, 0xa0].)
