8

I have a basic understanding of UTF-8: code points are encoded with a variable number of bytes, so a "character" can take 8 bits, 16 bits, or even more.

What I'm wondering is whether there is some sample code, a library, etc. in C that does for a UTF-8 string the kinds of things the C standard library does for ordinary strings, e.g. tell the length of the string.

Thanks,

13
  • 1
    For length, see e.g. stackoverflow.com/q/5117393/440558 Commented Jun 8, 2012 at 11:49
  • 2
    Keep in mind that e.g. strlen() works perfectly well on UTF-8 encoded data: it gives you the length of the UTF-8 string in bytes. It does not give you the number of Unicode characters in that string, though. Commented Jun 8, 2012 at 11:52
  • 1
    @nos This is wrong, in several ways. Certainly strlen doesn’t work at all if there are U+0000 code points in the string, which are completely legal. It is disingenuous to say that it tells you the “length” of the string. It doesn’t. It tells you the number of bytes only, and not the number of code points, which is what you would want. Commented Jun 10, 2012 at 2:23
  • 1
    @tchrist Remember that we are talking about UTF-8 encoded strings here. In C code, a UTF-8 string ends when you hit a null byte. The length of the UTF-8 string in bytes might or might not be what you want, e.g. you do need the number of bytes if you're copying the string into a new buffer, or if you need to determine whether the string fits in a limited-length database field. Commented Jun 10, 2012 at 8:55
  • 1
    @tchrist strlen doesn't work for ASCII strings that contain the ASCII code NUL either. But we don't go around saying it doesn't work for ASCII strings, do we? Commented Apr 4, 2015 at 19:15
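
To make the byte-length case from the comments above concrete, here is a minimal sketch of copying a UTF-8 string into a new buffer. The helper name utf8_dup is hypothetical (it is not from any library), and the sketch assumes the input is NUL-terminated and contains no embedded U+0000.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: duplicate a NUL-terminated UTF-8 string.
 * strlen() counts bytes (UTF-8 code units), not code points, and
 * the byte count is exactly what the allocation and copy need. */
char *utf8_dup(const char *s)
{
    size_t nbytes = strlen(s);         /* bytes, not characters */
    char *copy = malloc(nbytes + 1);   /* +1 for the terminating NUL */
    if (copy != NULL)
        memcpy(copy, s, nbytes + 1);   /* copies the NUL as well */
    return copy;                       /* caller must free() */
}
```

For the string "na\xC3\xAFve" (the word "naïve"), strlen() reports 6 bytes although the text has only 5 code points, so the copy is sized correctly but the result is not a character count.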

3 Answers

4

GNU does have a Unicode string library, called libunistring, but it doesn’t handle anything nearly as well as ICU does.

For example, the GNU library doesn’t even give you access to collation, which is the basis for all string comparison. In contrast, ICU does. Another thing that ICU has that the GNU library doesn’t appear to have is Unicode regexes. For that, you might like to use Philip Hazel’s excellent PCRE library for C, which can be compiled with UTF-8 support.

However, it might be that the GNU library is enough for what you need. I don’t like its API much. Very messy. If you like C programming, you might try the Go programming language, which has excellent Unicode support. It’s a new language, but small and clean and fun to use.

On the other hand, the major interpreted languages — Perl, Python, and Ruby — all have varying support for Unicode that is better than you’ll ever get in C. Of those, Perl’s Unicode support is the most developed and robust.

Remember: it isn’t enough to support more characters. Without the rules that go with them, you don’t have Unicode. At most, you might have ISO 10646: a large character repertoire but no rules. My mantra is “Unicode isn’t just more characters; it’s more characters plus a whole bunch of rules for handling them.”


1

The foremost library for handling Unicode is IBM's ICU.

But if all you need to do is determine the number of code points in a UTF-8 encoded string, count the number of chars with values between \x01 and \x7F or between \xC2 and \xFF.
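
As a rough sketch of that counting rule, assuming the input is valid, NUL-terminated UTF-8: the function name utf8_codepoints is made up for illustration, and the upper bound is written as 0xF4 because bytes 0xF5 to 0xFF never occur in valid UTF-8, so for valid input this matches the ranges given above.

```c
#include <stddef.h>

/* Count code points in a NUL-terminated string assumed to be valid UTF-8.
 * Each code point starts with exactly one byte that is either ASCII
 * (0x01..0x7F) or a lead byte (0xC2..0xF4); continuation bytes
 * (0x80..0xBF) are skipped. Invalid input is not detected. */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != 0; ++p) {
        unsigned char c = *p;
        if (c <= 0x7F || (c >= 0xC2 && c <= 0xF4))
            ++n;
    }
    return n;
}
```

Working on unsigned char avoids depending on whether plain char is signed. Note that this counts code points, not user-perceived characters: a base letter followed by a combining mark counts as two.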

6 Comments

\xC2 to \xF4, actually - Unicode stops at U+10FFFF. It's probably easier just to discount continuation bytes, and you can do that with a single test: (c & 0xC0) != 0x80. The parentheses are required in C, because != binds more tightly than &.
Sure, or, assuming that chars are signed, c >= '\xC2'
Also, Unicode is more than a character set. You must also account for things like canonical equivalence (where you should treat a string containing, for example, U+0178 as identical to one containing U+0059 U+0308 even though the first one is 2 bytes long in UTF-8 and the second one 3 bytes). But that might be outside the scope of this question.
Code Units* a codepoint is basically a character or glyph (with asterisks, but that's the general idea)
@Marcus Nope. In UTF-8, a code unit is an 8-bit byte. That was the whole problem! We needed to count code points rather than code units! I'm not sure what you mean by asterisks though.
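
To illustrate the canonical-equivalence point raised in the comments above, here is a small sketch using only the C standard library. The names precomposed, decomposed, and bytes_equal are hypothetical. It only shows that a byte-wise comparison treats the two forms as different; it does not perform any normalization.

```c
#include <string.h>

/* U+0178 LATIN CAPITAL LETTER Y WITH DIAERESIS (precomposed): 2 bytes */
static const char precomposed[] = "\xC5\xB8";
/* U+0059 'Y' followed by U+0308 COMBINING DIAERESIS: 1 + 2 = 3 bytes */
static const char decomposed[] = "Y\xCC\x88";

/* Byte-wise equality: returns 1 only when both strings hold identical
 * bytes. It knows nothing about canonical equivalence. */
int bytes_equal(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

Here strlen(precomposed) is 2, strlen(decomposed) is 3, and bytes_equal() returns 0, even though both strings display as the same character. Treating them as equal requires normalizing both to the same form first, which is the kind of rule a Unicode library such as ICU implements.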
1

If you are interested in a library that doesn't allocate memory and uses only the stack, you could try utf8rewind.

1 Comment

this page 404ed.
