Reading Unicode characters from a file in C++

Question

I want to read a Unicode (UTF-8) file character by character, but I don't know how to read a file one character at a time.

Can anyone tell me how to do that?

You want to read individual Unicode characters or UTF-8 bytes?
– Dietmar Kühl, Commented Jan 7, 2012 at 2:24
Read the file, then convert UTF-8 to UTF-32. You can either use iconv(), libicu, or C++11.
– Kerrek SB, Commented Jan 7, 2012 at 2:27
@Kerrek SB does C++11 include this? What class or function should we look for?
– user142019, Commented Jan 7, 2012 at 2:33
@WTP: It should be in <cuchar>, and it's actually coming in from the C99 support. There's definitely UTF16 <-> UTF32 support; I'm not 100% sure right now if there's also UTF8 support.
– Kerrek SB, Commented Jan 7, 2012 at 2:37
C++11 does have UTF-8 support. codecvt<char32_t,char,mbstate_t> converts between UTF-8 and UTF-32. You can use it with wstring_convert like so: wstring_convert<codecvt<char32_t,char,mbstate_t>,char32_t> convert; u32string s = convert.from_bytes("foo");
– bames53, Commented Jan 7, 2012 at 8:35
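
The conversion described in the last comment can be sketched roughly as below. This is only an illustration under some assumptions: the helper name utf8_to_utf32 is made up here, and the facet is swapped for std::codecvt_utf8<char32_t>, because std::codecvt itself has a protected destructor and cannot be handed to std::wstring_convert directly. Both std::wstring_convert and <codecvt> are deprecated since C++17, so treat this as a C++11/14-era approach.

```cpp
#include <codecvt>   // std::codecvt_utf8 (deprecated since C++17)
#include <locale>    // std::wstring_convert (deprecated since C++17)
#include <string>

// Hypothetical helper: decodes a UTF-8 byte string into UTF-32 code points.
// from_bytes throws std::range_error if the input is not valid UTF-8.
std::u32string utf8_to_utf32(const std::string& bytes) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> convert;
    return convert.from_bytes(bytes);
}
```

Each element of the returned std::u32string is then one Unicode code point, so iterating over it goes character by character rather than byte by byte.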

Hossein · Accepted Answer · 2012-01-07 15:32:05Z

4

First, look at how UTF-8 encodes characters: http://en.wikipedia.org/wiki/UTF-8#Description

Each Unicode character is encoded as one or more UTF-8 bytes. After you read the next byte from the file, check it against that table:

(Row 1) If the most significant bit is 0 ((byte & 0x80) == 0), that byte is your character.

(Row 2) If the three most significant bits are 110 ((byte & 0xE0) == 0xC0), you have to read one more byte. Bits 4, 3 and 2 of the first UTF-8 byte (110YYYyy) make the first byte of the Unicode character (00000YYY), and its two least significant bits together with the six least significant bits of the next byte (10xxxxxx) make the second byte of the Unicode character (yyxxxxxx). Note the parentheses around the mask: in C/C++, == binds tighter than &. You can do the bit arithmetic easily with the shift and bitwise operators of C/C++:

UnicodeByte1 =   (UTF8Byte1 >> 2) & 0x07;
UnicodeByte2 = ( (UTF8Byte1 << 6) & 0xC0 ) | (UTF8Byte2 & 0x3F);

And so on...

Sounds a bit complicated, but it's not difficult if you know how to modify the bits to put them in proper place to decode a UTF-8 string.
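
Extending the same idea to the remaining rows of the table, a decoder for one character might look like the sketch below. The function name read_code_point and its interface are assumptions made for illustration, not part of the answer above. It assembles the whole code point in a single 32-bit value instead of separate bytes, and it does not reject overlong encodings or surrogate code points.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>

// Hypothetical helper: reads one UTF-8 encoded code point from `in`.
// Returns false at end of input or on a malformed sequence.
bool read_code_point(std::istream& in, std::uint32_t& cp) {
    char c;
    if (!in.get(c)) return false;
    const unsigned char lead = static_cast<unsigned char>(c);

    int extra;  // number of continuation bytes that must follow
    if ((lead & 0x80) == 0x00)      { cp = lead;        extra = 0; }  // 0xxxxxxx
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }  // 110xxxxx
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }  // 1110xxxx
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }  // 11110xxx
    else return false;  // stray continuation byte or invalid lead byte

    for (int i = 0; i < extra; ++i) {
        if (!in.get(c)) return false;              // sequence was truncated
        const unsigned char b = static_cast<unsigned char>(c);
        if ((b & 0xC0) != 0x80) return false;      // not of the form 10xxxxxx
        cp = (cp << 6) | (b & 0x3F);               // append the 6 payload bits
    }
    return true;
}
```

With a std::ifstream opened in binary mode, a loop such as while (read_code_point(fs, cp)) { ... } would then visit the file one character at a time.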

answered Jan 7, 2012 at 15:32

Hossein

1 Comment

Remy Lebeau Over a year ago

To take it a step farther, the first byte in a UTF-8 byte sequence tells you how many additional bytes are in the sequence.

informatik01 · Accepted Answer · 2019-06-30 06:06:44Z

3

UTF-8 is ASCII compatible, so you can read a UTF-8 file like you would an ASCII file. The C++ way to read a whole file into a string is:

#include <iostream>
#include <string>
#include <fstream>
#include <iterator>  // std::istreambuf_iterator

std::ifstream fs("my_file.txt");
std::string content((std::istreambuf_iterator<char>(fs)), std::istreambuf_iterator<char>());

The resulting string holds one char per UTF-8 byte, not one per Unicode character. You could loop through it like so:

for (std::string::iterator i = content.begin(); i != content.end(); ++i) {
    char nextChar = *i;
    // do stuff here.
}
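
If the goal is whole characters rather than single bytes, one possible approach is to group each lead byte with the continuation bytes (10xxxxxx) that follow it. The sketch below assumes the input is already valid UTF-8 and does no further validation; the name split_code_points is made up for this example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits a UTF-8 string into one substring per code point.
// Assumes valid UTF-8; continuation bytes are simply attached to the
// preceding lead byte.
std::vector<std::string> split_code_points(const std::string& content) {
    std::vector<std::string> chars;
    for (std::size_t i = 0; i < content.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(content[i]);
        const bool continuation = (b & 0xC0) == 0x80;
        if (continuation && !chars.empty()) {
            chars.back().push_back(content[i]);          // extend current character
        } else {
            chars.push_back(std::string(1, content[i])); // start a new character
        }
    }
    return chars;
}
```

Each element of the result is still UTF-8, so it can be printed or compared directly without converting to UTF-32.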

Alternatively, you could open the file in binary mode, and then move through each byte that way:

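std::ifstream fs("my_file.txt", std::ifstream::binary);
if (fs.is_open()) {
    char nextChar;
    // get() returns every byte, including whitespace (operator>> would skip it),
    // and the loop ends cleanly once no more bytes can be read.
    while (fs.get(nextChar)) {
        // do stuff here.
    }
}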

If you want to do more complicated things, I suggest you take a peek at Qt. I've found it rather useful for this sort of stuff. At least, less painful than ICU, for doing largely practical things.

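#include <QFile>
#include <QTextStream>

QFile file("my_file.txt");
if (file.open(QIODevice::ReadOnly | QIODevice::Text)) {
    QTextStream in(&file);
    in.setCodec("UTF-8");  // Qt 5 API; Qt 6 uses in.setEncoding(QStringConverter::Utf8)
    QString contents = in.readAll();
    // do stuff with contents here.
}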

edited Jun 30, 2019 at 6:06

informatik01

answered Jan 7, 2012 at 2:41

Liam M

2 Comments

Jindra Helcl Over a year ago

Your solution does not output letters, but bytes. This works only for the ASCII part of the utf-8 character set.

Liam M Over a year ago

@JindraHelcl My solution doesn't output anything: it reads a file and makes the data in that file available for further processing. The asker never specified whether he wanted to read the bytes in the file (which my solution answers) or read the characters in file (which I've shown how to do, using Qt). Keep in mind, this answer is 3 years old.

zessx · Accepted Answer · 2014-10-22 06:50:10Z

1

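In theory, stdlib.h has a function mblen which should return the length of a multibyte character. But in my case it returned -1 for the first byte of a multibyte character and kept returning -1 afterwards (possibly because mblen depends on the current locale, and the program was not running under a UTF-8 locale). So I wrote the following: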

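// Returns the length in bytes of the UTF-8 sequence that starts at i_ch,
// or -1 if that byte cannot start a sequence.
// (The original signature was not shown; utf8_char_length is a placeholder name.)
int utf8_char_length(const char* i_ch)
{
    if(i_ch == nullptr) return -1;
    int l = 0;
    unsigned char ch = static_cast<unsigned char>(*i_ch);
    int mask = 0x80;
    while(ch & mask) {  // count the leading 1 bits
        l++;
        mask = (mask >> 1);
    }
    if (l == 0) return 1;            // 0xxxxxxx: single-byte (ASCII) character
    if (l == 1 || l > 4) return -1;  // continuation byte or invalid lead byte
    return l;
}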

It took less time than researching how to use mblen.

edited Oct 22, 2014 at 6:50

zessx

answered Oct 22, 2014 at 6:31

Andrey

pxp · Accepted Answer · 2012-01-07 02:29:53Z

-2

Try this: get the file's contents and then loop through the text based on its length.

Pseudocode:

String s = file.toString();
int len = s.length();
for(int i=0; i < len; i++)
{
    String the_character = s[i];

    // TODO : Do your thing :o)
}

answered Jan 7, 2012 at 2:29

pxp

2 Comments

user142019 Over a year ago

That won't work for UTF-8 strings (assuming String is std::string).

jmucchiello Over a year ago

There is no "file.toString()" in C++.
