UTF8 processing in C

Question

I have a basic understanding of UTF-8: code points are encoded with a variable number of bytes, so a "character" can be 8 bits, 16 bits, or even longer.

What I'm wondering is whether there is some sample code, library, etc. in C that does for a UTF-8 string what the C standard library does for ordinary strings, e.g. tell the length of the string.

Thanks,

Keep in mind that e.g. strlen() works perfectly well on UTF-8 encoded data: it gives you the length of the UTF-8 string. It does not give you the number of Unicode characters in that string, though.
– nos, Commented Jun 8, 2012 at 11:52
@nos This is wrong, in several ways. Certainly strlen doesn’t work at all if there are U+0000 code points in the string, which are completely legal. It is disingenuous to say that it tells you the “length” of the string. It doesn’t. It tells you the number of bytes only, and not the number of code points, which is what you would want.
– tchrist, Commented Jun 10, 2012 at 2:23
@tchrist Remember that we are talking about UTF-8 encoded strings here. In C code, a UTF-8 string ends when you hit a null byte. The length of the UTF-8 string might or might not be what you want. E.g. you do need the number of bytes if you're copying the string into a new buffer, or if you need to determine whether the string fits in a limited-length database field.
– nos, Commented Jun 10, 2012 at 8:55
@tchrist strlen doesn't work for ASCII strings that contain the ASCII code NUL either. But we don't go around saying it doesn't work for ASCII strings, do we?
– bames53, Commented Apr 4, 2015 at 19:15
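
To illustrate the byte-length use case raised in these comments (fitting a string into a size-limited field), here is a minimal sketch. It assumes the input is valid, NUL-terminated UTF-8, and `utf8_fit_bytes` is a hypothetical helper name, not part of any library. It uses strlen for the byte length and then backs up so the cut never lands inside a multi-byte sequence.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the largest byte count <= max_bytes at which
 * the NUL-terminated UTF-8 string s can be cut without splitting a
 * multi-byte sequence. Assumes s is valid UTF-8. */
size_t utf8_fit_bytes(const char *s, size_t max_bytes)
{
    size_t len = strlen(s);      /* length in bytes, not in code points */
    if (len <= max_bytes)
        return len;              /* the whole string fits */

    size_t cut = max_bytes;
    /* s[cut] would be the first dropped byte. If it is a continuation
     * byte (bit pattern 10xxxxxx), back up to the lead byte of that
     * sequence so the sequence is dropped as a whole. */
    while (cut > 0 && ((unsigned char)s[cut] & 0xC0) == 0x80)
        cut--;
    return cut;
}
```

For "na\xC3\xAFve" (6 bytes, 5 code points), a limit of 3 bytes yields 2, because cutting at 3 would split the two-byte encoding of U+00EF. Note that this only protects code point boundaries; it may still separate a base character from a following combining mark.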

tchrist · Accepted Answer · 2012-06-10 02:06:27Z

GNU does have a Unicode string library, called libunistring, but it handles nothing nearly as well as ICU does.

For example, the GNU library doesn’t even give you access to collation, which is the basis for all string comparison. In contrast, ICU does. Another thing ICU has that the GNU library doesn’t appear to have is Unicode regexes. For regexes, you might also like Philip Hazel’s excellent PCRE library for C, which can be compiled with UTF-8 support.

However, it might be that the GNU library is enough for what you need. I don’t like its API much. Very messy. If you like C programming, you might try the Go programming language, which has excellent Unicode support. It’s a new language, but small and clean and fun to use.

On the other hand, the major interpreted languages — Perl, Python, and Ruby — all have varying support for Unicode that is better than you’ll ever get in C. Of those, Perl’s Unicode support is the most developed and robust.

Remember: it isn’t enough to support more characters. Without the rules that go with them, you don’t have Unicode. At most, you might have ISO 10646: a large character repertoire but no rules. My mantra is “Unicode isn’t just more characters; it’s more characters plus a whole bunch of rules for handling them.”

Mr Lister · Accepted Answer · 2012-06-08 11:58:27Z

The foremost library for handling Unicode is IBM's ICU.

But if all you need to do is determine the number of code points in a UTF-8 encoded string, count the number of chars (bytes) with values between \x01 and \x7F or between \xC2 and \xFF.
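
A minimal sketch of that counting idea, assuming the input is valid, NUL-terminated UTF-8 (`utf8_count_codepoints` is an illustrative name, not a library function). Instead of testing the two byte ranges directly, it counts every byte that is not a continuation byte (bit pattern 10xxxxxx); for valid UTF-8 that selects exactly the ASCII bytes and the lead bytes of multi-byte sequences.

```c
#include <stddef.h>

/* Counts code points in a NUL-terminated UTF-8 string by counting every
 * byte that is not a continuation byte (10xxxxxx).
 * Assumes valid UTF-8; malformed input is not detected. */
size_t utf8_count_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        /* Cast to unsigned char so the mask behaves the same whether
         * plain char is signed or unsigned on this platform. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For example, utf8_count_codepoints("na\xC3\xAFve") returns 5 while strlen on the same string returns 6. Because the string is NUL-terminated, an embedded U+0000 ends the count early, just as it does for strlen.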

6 Comments

ecatmur Over a year ago

\xC2 to \xF4, actually: Unicode stops at U+10FFFF. It's probably easier just to discount continuation bytes, and you can do that with a single operation: (c & 0xC0) != 0x80. (The parentheses matter, since != binds tighter than & in C.)

Mr Lister Over a year ago

Sure, or, assuming that chars are signed, c >= '\xC2'

Mr Lister Over a year ago

Also, Unicode is more than a character set. You must also account for things like canonical equivalence (where you should treat a string containing, for example, U+0178 as identical to one containing U+0059 U+0308 even though the first one is 2 bytes long in UTF-8 and the second one 3 bytes). But that might be outside the scope of this question.
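
The example in this comment can be made concrete with a small sketch. It only shows that a bytewise comparison treats the two spellings as different; actually testing canonical equivalence requires normalizing both strings first (for instance to NFC) with a Unicode library such as ICU, which is not shown here. The names `precomposed`, `decomposed`, and `bytes_equal` are made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Two canonically equivalent spellings of the same character,
 * written as raw UTF-8 bytes. */
static const char precomposed[] = "\xC5\xB8";   /* U+0178, 2 bytes        */
static const char decomposed[]  = "Y\xCC\x88";  /* U+0059 U+0308, 3 bytes */

/* Bytewise equality: returns 1 only if both strings consist of the same
 * bytes. It knows nothing about Unicode normalization. */
int bytes_equal(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

Here strlen(precomposed) is 2, strlen(decomposed) is 3, and bytes_equal(precomposed, decomposed) returns 0, even though a Unicode-aware comparison should treat the two as identical.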

MarcusJ Over a year ago

Code Units* a codepoint is basically a character or glyph (with asterisks, but that's the general idea)

Mr Lister Over a year ago

@Marcus Nope. In UTF-8, a code unit is an 8-bit byte. That was the whole problem! We needed to count code points rather than code units! I'm not sure what you mean by asterisks though.

Grzegorz Adam Hankiewicz · Accepted Answer · 2021-03-11 11:11:03Z

If you are interested in a library which doesn't allocate memory and uses the stack you could try utf8rewind.

edited Mar 11, 2021 at 11:11

answered Apr 28, 2018 at 23:31

1 Comment

lang2 Over a year ago

this page 404ed.
