Say I have a UTF-8 encoded string in which words are delimited by ";". Every byte in the string other than ";" has a value of 128 or greater. Say I store this string in an unsigned char array:

unsigned char buff[]="someutf8string;separated;with;";

Is it safe to pass this buff to the strtok function, if I just want to extract the words using the ";" delimiter?

My concern is that strtok (and also strcpy) expect char pointers, but some bytes in my string will have values of 128 or more. So is this behaviour defined?

6
  • strtok is locale independent, so it shouldn't have any trouble with what you want it for. Commented Jul 15, 2014 at 20:27
  • @AntonH: yes but what about strcpy for example? Commented Jul 15, 2014 at 20:28
  • strcpy looks for the null-terminator, and UTF-8 encoding doesn't have a 0 character, AFAIK. So it shouldn't be an issue either. But if you want, wait a bit and someone else will come along and either confirm or disprove what I said. Some info here: java-samples.com/showtutorial.php?tutorialid=806 Commented Jul 15, 2014 at 20:30
  • UTF-8 does have a 0 character - the same 0 character that ASCII has, since ASCII is a subset of UTF-8 - Unicode U+0000, encoded as byte octet 0x00 just like in ASCII. Commented Jul 15, 2014 at 21:01
  • @AntonH: The null byte is the UTF-8 encoding of U+0000. It does have that, but its purpose is unchanged, so most string functions that look for a zero byte will be OK with UTF-8 (and the null byte never appears in UTF-8 as part of another character). Commented Jul 15, 2014 at 21:03
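To illustrate the point made in these comments: in UTF-8, every byte of a multi-byte sequence has its high bit set (0x80 to 0xFF), so an ASCII byte such as ";" (0x3B) or the terminating 0x00 only ever appears as itself. A byte-oriented search therefore cannot match in the middle of a character. The following is a minimal sketch, not a definitive implementation; count_delimiter is a hypothetical helper name, and the cast to const char * relies on the usual interchangeability of the character types.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts occurrences of an ASCII delimiter in a
   NUL-terminated UTF-8 string by scanning raw bytes with strchr. */
size_t count_delimiter(const unsigned char *s, char delim)
{
    /* Only ASCII delimiters (1..127) are safe to search for byte-wise. */
    assert(s != NULL && delim != '\0' && (unsigned char)delim < 0x80);

    size_t count = 0;
    const char *p = (const char *)s;
    while ((p = strchr(p, delim)) != NULL) {
        ++count;
        ++p; /* continue searching after the match */
    }
    return count;
}
```

For example, for the bytes of "é;€;" (0xC3 0xA9, ';', 0xE2 0x82 0xAC, ';') a call such as count_delimiter(buff, ';') finds the two separators and nothing else.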

2 Answers

2

No, it is not safe -- but if it compiles it will almost certainly work as expected.

unsigned char buff[]="someutf8string;separated;with;";

This is fine; the standard specifically permits arrays of character type (including unsigned char) to be initialized with a string literal. Successive bytes of the string literal initialize the elements of the array.

strtok(buff, ";")

This is a constraint violation, requiring a compile-time diagnostic. (That's about as close as the C standard gets to saying that something is illegal.)

The first parameter of strtok is of type char*, but you're passing an argument of type unsigned char*. These two pointer types are not compatible, and there is no implicit conversion between them. A conforming compiler may reject your program if it contains a call like this (and, for example, gcc -std=c99 -pedantic-errors does reject it.)

Many C compilers are somewhat lax about strict enforcement of the standard's requirements. In many cases, compilers issue warnings for code that contains constraint violations -- which is perfectly valid. But once a compiler has diagnosed a constraint violation and proceeded to generate an executable, the behavior of that executable is not defined by the C standard.

As far as I know, any actual compiler that doesn't reject this call will generate code that behaves just as you expect it to. The pointer types char* and unsigned char* almost certainly have the same representation and are passed the same way as arguments, and the types char and unsigned char are explicitly required to have the same representation for non-negative values. Even for values exceeding CHAR_MAX, like the ones you're using, a compiler would have to go out of its way to generate misbehaving code. You could have problems on a system that doesn't use 2's-complement for signed integers, but you're not likely to encounter such a system.

Adding an explicit cast:

strtok((char*)buff, ";")

removes the constraint violation and will probably silence any warning -- but the behavior is still strictly undefined.

In practice, though, most compilers try to treat char, signed char, and unsigned char almost interchangeably, partly to cater to code like yours, and partly because they'd have to go out of their way to do anything else.
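As a minimal sketch of the cast approach, assuming a typical two's-complement implementation, the tokenizing loop might look like the following. split_semicolons is a hypothetical helper name, not something from the question, and the usual strtok caveats apply: it modifies the buffer in place and keeps hidden static state.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: splits buf in place on ';' and stores up to
   max_tokens token pointers in out. Returns the number stored. */
size_t split_semicolons(unsigned char *buf, unsigned char **out,
                        size_t max_tokens)
{
    assert(buf != NULL && out != NULL);

    size_t count = 0;
    /* The cast removes the constraint violation; strtok only examines
       bytes, looking for ';' and the terminating 0. */
    char *tok = strtok((char *)buf, ";");
    while (tok != NULL && count < max_tokens) {
        out[count++] = (unsigned char *)tok;
        tok = strtok(NULL, ";");
    }
    return count;
}
```

Each pointer in out refers into the original buffer, where strtok has overwritten the ";" separators with null bytes, so the buffer must outlive the tokens.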


3 Comments

@userq: No, there's no contradiction. A constraint violation is an error that must be diagnosed by the compiler. Undefined behavior is an error that the compiler is not required to detect.
@userq: If a C program violates a constraint, a compiler must diagnose it, and may reject it. If it doesn't reject it, its behavior is undefined. A cast removes the constraint violation, but not the UB. "and if user already has stored her string in an unsigned char array what solution do you recommend if she wants to apply string functions to this buffer/string?" -- That's a good question. C doesn't seem to guarantee that char and unsigned char objects are interchangeable, but it still seems to assume that they are.
Plain char is commonly signed; UTF-8 requires unsigned 8-bit quantities. IMHO C has not done a good and consistent job of reconciling them. Using the cast: strtok((char*)buff, ";") is probably the best approach; it will almost certainly work, even though the language standard doesn't guarantee it.
1

According to the C11 Standard (ISO/IEC 9899:2011 §7.24.1 String function conventions, ¶3, emphasis added):

For all functions in this subclause, each character shall be interpreted as if it had the type unsigned char (and therefore every possible object representation is valid and has a different value).

Note: this paragraph was not present in the C99 standard.

So I do not see a problem.
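One observable consequence, shown here as a small sketch rather than a proof: strcmp determines the sign of its result from the first differing pair of characters, both interpreted as unsigned char, so a byte such as 0xC3 compares greater than any ASCII byte even where plain char is signed. sorts_before is a hypothetical helper name introduced only for this illustration.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: returns nonzero if a sorts before b under strcmp.
   strcmp compares the differing bytes as unsigned char values. */
int sorts_before(const unsigned char *a, const unsigned char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp((const char *)a, (const char *)b) < 0;
}
```

With a holding "z" (0x7A) and b holding "é" (0xC3 0xA9), sorts_before(a, b) is nonzero, because 0x7A is less than 0xC3 when both are read as unsigned char.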

18 Comments

When the UTF-8 encoding scheme was invented, one of the goals was to allow C string library routines to handle such strings safely, even though the routines do not know about Unicode. In particular, you can safely copy them, compare them, sort them, and work with 7-bit ASCII subset characters as is. (From a paper by Dennis Ritchie?)
It was Ken Thompson, not Dennis Ritchie, who devised UTF8. Here's a memo about the design objectives: cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
Is that clause saying that the characters are read as (unsigned char)*p, or as *(unsigned char *)p ? (for non-2's complement negative values, these are different)
@MattMcNabb I think the cast after the dereference makes the most sense. Is there anything of particular concern to you in the subtlety of the difference between the two?
@ChronoKitsune I agree that makes more sense, however if it is true then unsigned char buf[] = "e"; strchr( (char *)buf, 'e' ) could fail if 'e' is negative. The value of 'e' is converted to unsigned char when stored in buf , but aliasing it back to signed char could result in a different character than 'e'.
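The two readings discussed in these comments can be written out side by side. This is only a sketch: reads_agree is a hypothetical helper, and the claim that the two results match assumes a two's-complement implementation, where converting a negative char to unsigned char and reinterpreting its storage give the same value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if both ways of reading the byte at p
   as unsigned char yield the same value, 0 otherwise. */
int reads_agree(const char *p)
{
    assert(p != NULL);
    /* (unsigned char)*p : read as char, then convert the value. */
    unsigned char by_conversion = (unsigned char)*p;
    /* *(unsigned char *)p : reinterpret the stored object directly. */
    unsigned char by_aliasing = *(const unsigned char *)p;
    return by_conversion == by_aliasing;
}
```

On a two's-complement system the function returns 1 for every byte value; the difference raised above could only show up on a sign-magnitude or ones'-complement implementation.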