2

I want to read a Unicode (UTF-8) file character by character, but I don't know how to read a file one character at a time.

Can anyone tell me how to do that?

7
  • You want to read individual Unicode characters or utf-8 bytes? Commented Jan 7, 2012 at 2:24
  • Read the file, then convert UTF-8 to UTF-32. You can either use iconv(), libicu, or C++11. Commented Jan 7, 2012 at 2:27
  • 1
    @Kerrek SB does C++11 include this? What class or function should we look for? Commented Jan 7, 2012 at 2:33
  • @WTP: It should be in <cuchar>, and it's actually coming in from the C99 support. There's definitely UTF16 <-> UTF32 support; I'm not 100% sure right now if there's also UTF8 support. Commented Jan 7, 2012 at 2:37
  • C++11 does have UTF-8 support. codecvt<char32_t,char,mbstate_t> converts between UTF-8 and UTF-32. You can use it with wstring_convert like so: wstring_convert<codecvt<char32_t,char,mbstate_t>,char32_t> convert; u32string s = convert.from_bytes("foo"); Commented Jan 7, 2012 at 8:35

4 Answers

4

First, look at how UTF-8 encodes characters: http://en.wikipedia.org/wiki/UTF-8#Description

Each Unicode character is encoded as one or more UTF-8 bytes. After you read the next byte from the file, decide what to do according to that table:

(Row 1) If the most significant bit is 0 ((char & 0x80) == 0), that byte is your character. Note the extra parentheses: in C/C++, == binds more tightly than &.

(Row 2) If the three most significant bits are 110 ((char & 0xE0) == 0xC0), you have to read another byte. Bits 4, 3 and 2 of the first UTF-8 byte (110YYYyy) make the first byte of the Unicode character (00000YYY), and its two least significant bits together with the six least significant bits of the next byte (10xxxxxx) make the second byte of the Unicode character (yyxxxxxx). You can do the bit arithmetic easily with the shift and bitwise operators of C/C++:

UnicodeByte1 =   (UTF8Byte1 >> 2) & 0x07;
UnicodeByte2 = ( (UTF8Byte1 << 6) & 0xC0 ) | (UTF8Byte2 & 0x3F);

And so on...

Sounds a bit complicated, but it's not difficult if you know how to modify the bits to put them in proper place to decode a UTF-8 string.


1 Comment

To take it a step farther, the first byte in a UTF-8 byte sequence tells you how many additional bytes are in the sequence.
3

UTF-8 is ASCII compatible, so you can read a UTF-8 file like you would an ASCII file. The C++ way to read a whole file into a string is:

#include <iostream>
#include <string>
#include <fstream>
#include <iterator>  // std::istreambuf_iterator

std::ifstream fs("my_file.txt");
std::string content((std::istreambuf_iterator<char>(fs)), std::istreambuf_iterator<char>());

The resulting string holds one char per UTF-8 byte, not one per Unicode character. You could loop through it like so:

for (std::string::iterator i = content.begin(); i != content.end(); ++i) {
    char nextChar = *i;
    // do stuff here.
}

Alternatively, you could open the file in binary mode, and then move through each byte that way:

std::ifstream fs("my_file.txt", std::ifstream::binary);
if (fs.is_open()) {
    char nextChar;
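    // get() reads raw bytes; operator>> would skip whitespace,
    // and testing good() before reading processes a stale byte at EOF.
    while (fs.get(nextChar)) {
        // do stuff here.
    }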
}
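
Either way, what you get back is bytes. If you only need to know how many characters those bytes represent, a common trick is to skip continuation bytes (those of the form 10xxxxxx) and count everything else. A minimal sketch, assuming the input is valid UTF-8; the name count_code_points is hypothetical:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-8 string by counting
// every byte that is NOT a continuation byte (10xxxxxx).
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // lead byte or plain ASCII byte
        }
    }
    return count;
}
```

Note that this counts code points, which is not always the same as what a user perceives as one character (combining marks are counted separately).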

If you want to do more complicated things, I suggest you take a peek at Qt. I've found it rather useful for this sort of stuff. At least, less painful than ICU, for doing largely practical things.

QFile file("my_file.text");
if (file.open(QIODevice::ReadOnly | QIODevice::Text)) {
    QTextStream in(&file);
    in.setCodec("UTF-8");  // Qt 4/5; Qt 6 uses setEncoding(QStringConverter::Utf8)
    QString contents = in.readAll();

    return;
}

2 Comments

Your solution does not output letters, but bytes. This works only for the ASCII part of the utf-8 character set.
@JindraHelcl My solution doesn't output anything: it reads a file and makes the data in that file available for further processing. The asker never specified whether he wanted to read the bytes in the file (which my solution answers) or read the characters in file (which I've shown how to do, using Qt). Keep in mind, this answer is 3 years old.
1

In theory, stdlib.h has a function mblen which should return the length of a multibyte character. But in my case it returns -1 for the first byte of a multibyte character, and it keeps returning -1 after that. So I wrote the following:

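// Returns the length in bytes of the UTF-8 sequence whose first byte is *i_ch,
// or -1 if that byte cannot start a sequence.
// (The original post omitted the signature; this name is illustrative.)
int utf8_char_length(const char* i_ch)
{
    if (i_ch == nullptr) return -1;
    int l = 0;
    unsigned char ch = static_cast<unsigned char>(*i_ch);
    int mask = 0x80;
    while (ch & mask) {  // count the leading 1 bits
        l++;
        mask = (mask >> 1);
    }
    if (l == 0) return 1;            // 0xxxxxxx: single-byte (ASCII)
    if (l == 1 || l > 4) return -1;  // continuation byte or invalid lead byte
    return l;
}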

It took less time than researching how mblen is supposed to be used.
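
A length function of this kind is enough to walk a buffer one character at a time. The following is a minimal sketch under that assumption; sequence_length and split_characters are hypothetical names, and the sketch simply stops at the first invalid lead byte or truncated sequence rather than trying to recover.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: sequence length implied by a lead byte, or 0 if the
// byte cannot start a UTF-8 sequence.
std::size_t sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;            // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // continuation or invalid byte
}

// Splits a UTF-8 string into one substring per encoded character.
// Stops early at the first invalid lead byte or truncated sequence.
std::vector<std::string> split_characters(const std::string& utf8)
{
    std::vector<std::string> chars;
    std::size_t pos = 0;
    while (pos < utf8.size()) {
        std::size_t len = sequence_length(static_cast<unsigned char>(utf8[pos]));
        if (len == 0 || pos + len > utf8.size()) break;
        chars.push_back(utf8.substr(pos, len));
        pos += len;
    }
    return chars;
}
```

Each element of the returned vector is still UTF-8 encoded, so it can be printed or compared directly without converting to UTF-32.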


-2

Try this: read the file, then loop through the text based on its length.

Pseudocode:

String s = file.toString();
int len = s.length();
for(int i=0; i < len; i++)
{
    String the_character = s[i];

    // TODO : Do your thing :o)
}

2 Comments

That won't work for UTF-8 strings (assuming String is std::string).
There is no "file.toString()" in C++.
