How to handle non-ASCII strings properly in C?

Question

My idea was to code a Hangman-like game in C. I want it to be able to use German words with umlauts (e.g. ä, ü, ö) and also Greek words (entirely non-ASCII characters).

My compiler and my terminal can handle Unicode well. Displaying the strings works well.

But how should I do operations on these strings? For German I could maybe handle the 6 upper- and lowercase accented characters by special-casing them in my functions. But for Greek that seems impossible.

I wrote this test code. It outputs the string, the length of the string (wrong, of course, because strlen counts bytes and each of these non-ASCII characters is a two-byte UTF-8 sequence), and the value of each individual char of the string, both as a character and in hex.

#include <stdio.h>
#include <string.h>

int main(void) {
    printf("123456789\n");
    char aTestString[] = "cheese";
    printf("%s has %zu characters\n", aTestString, strlen(aTestString));

    for (size_t i = 0; i < strlen(aTestString); i++) {
        printf("( %c )", aTestString[i]);   // char as a character
        printf("[ %02X ]", aTestString[i]); // char in hexadecimal
    }

    printf("\n123456789\n");
    char aTestString2[] = "Käse";
    printf("%s has %zu characters\n", aTestString2, strlen(aTestString2));

    for (size_t i = 0; i < strlen(aTestString2); i++) {
        printf("( %c )", aTestString2[i]);   // char as a character
        printf("[ %02X ]", aTestString2[i]); // char in hexadecimal
    }

    printf("\n123456789\n");
    char aTestString3[] = "λόγος";
    printf("%s has %zu characters\n", aTestString3, strlen(aTestString3));

    for (size_t i = 0; i < strlen(aTestString3); i++) {
        printf("( %c )", aTestString3[i]);   // char as a character
        printf("[ %02X ]", aTestString3[i]); // char in hexadecimal
    }
}

For example, what is the recommended way to count the Unicode characters, or to see whether a specific Unicode character (that is, code point) is in the string? I am quite sure there must be some simple solution, because such characters are often used in passwords, for example.

Here is the output of the test program:

123456789
cheese has 6 characters
( c )[ 63 ]( h )[ 68 ]( e )[ 65 ]( e )[ 65 ]( s )[ 73 ]( e )[ 65 ]
123456789
Käse has 5 characters
( K )[ 4B ](  )[ FFFFFFC3 ](  )[ FFFFFFA4 ]( s )[ 73 ]( e )[ 65 ]
123456789
λόγος has 10 characters
(  )[ FFFFFFCE ](  )[ FFFFFFBB ](  )[ FFFFFFCF ](  )[ FFFFFF8C ](  )[ FFFFFFCE ](  )[ FFFFFFB3 ](  )[ FFFFFFCE ](  )[ FFFFFFBF ](  )[ FFFFFFCF ](  )[ FFFFFF82 ]

To get the number of code points in a Unicode string you need a third-party library, such as the ICU library.
– Some programmer dude, Commented Jul 2, 2023 at 17:49
Your code would be easier to understand if you translated the output to English.
– Andreas Wenzel, Commented Jul 2, 2023 at 18:07
I'm sorry about that. "cheese ist 6 Zeichen lang" just means "cheese has 6 characters". I'll fix this in my code above.
– ᛉᛉᛉ ᛉᛉᛉ, Commented Jul 2, 2023 at 18:10
@Someprogrammerdude No you don't. It is a couple of lines of plain C code.
– n. m. could be an AI, Commented Jul 2, 2023 at 18:38
@ᛉᛉᛉᛉᛉᛉ No. Use wchar_t and functions that work with wide strings, and just count wchar_ts. You don't need anything more complicated than that until you start handling exotic scripts and rare special characters. For German and Greek it's plenty.
– n. m. could be an AI, Commented Jul 2, 2023 at 18:52

mediocrevegetable1 · Accepted Answer · 2023-07-02 19:05:50Z

C's multi-byte string utilities are useful in this case. Using mbrlen, for example, one way to find the number of characters in a string (albeit probably a very naive one that I just bodged together right now) is this:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

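// Counts the multibyte characters in s according to the current locale.
// Returns (size_t)-1 if s contains an invalid or truncated sequence.
size_t string_size(const char *s)
{
    mbstate_t state = {0};
    size_t len = 0;
    for (; *s != '\0'; ++len)
    {
        size_t r;
        // Feed one byte at a time until the character is complete;
        // (size_t)-2 means "valid so far, but incomplete".
        while ((r = mbrlen(s, 1, &state)) == (size_t)-2)
            ++s;
        if (r == (size_t)-1)
            return (size_t)-1; // invalid byte sequence in this locale
        ++s; // step past the byte that completed the character
    }
    return len;
}

int main(void)
{
    // Locale names are platform-specific; "en_US.utf8" works with glibc.
    if (setlocale(LC_ALL, "en_US.utf8") == NULL)
    {
        fputs("UTF-8 locale not available\n", stderr);
        return 1;
    }
    const char *s = "zß水🍌";
    printf("%zu\n", string_size(s));
}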

// Output: 4
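
If the input is known to be valid UTF-8, the count can also be done without any locale setup. The following is only a minimal sketch under that assumption and is not part of the code above; utf8_count is a hypothetical helper name. It relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx, so every other byte starts a new code point.

```c
#include <stddef.h>

/* Counts code points in s, ASSUMING s is valid UTF-8 (no validation is done).
   Continuation bytes look like 10xxxxxx, so every byte that does not
   match that pattern begins a new code point. */
size_t utf8_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s)
    {
        /* The cast avoids sign extension on platforms where char is signed. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For the byte sequences of "Käse" and "λόγος" from the question this yields 4 and 5. It counts code points, not user-perceived characters, so a base letter followed by a combining accent counts as two.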

With mbrlen you can also extract individual characters by finding their byte lengths. There are also functions to convert between multibyte characters and wide characters (for example mbrtowc and mbstowcs) if you want to work with those.
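
For the Hangman case of checking whether a guessed letter occurs in a word, one option is to decode the string character by character with mbrtowc and compare each result. This is a rough sketch, not a definitive implementation: mb_contains and use_utf8_locale are hypothetical helper names, the locale names tried are assumptions that differ between systems, and comparing against a code point value assumes wchar_t holds Unicode code points (true with glibc, not guaranteed elsewhere).

```c
#include <locale.h>
#include <stdbool.h>
#include <string.h>
#include <wchar.h>

/* Tries a few common UTF-8 locale names (an assumption; adjust for your
   system). Returns true if one of them could be selected. */
bool use_utf8_locale(void)
{
    static const char *const names[] = { "C.UTF-8", "en_US.UTF-8", "en_US.utf8" };
    for (size_t i = 0; i < sizeof names / sizeof names[0]; ++i)
    {
        if (setlocale(LC_CTYPE, names[i]) != NULL)
            return true;
    }
    return false;
}

/* Returns true if the multibyte string s contains the wide character
   target. Decoding follows the current locale, so a UTF-8 locale must be
   active for UTF-8 input. Invalid or truncated input yields false. */
bool mb_contains(const char *s, wchar_t target)
{
    mbstate_t state = {0};
    size_t left = strlen(s);
    while (left > 0)
    {
        wchar_t wc;
        size_t r = mbrtowc(&wc, s, left, &state);
        if (r == (size_t)-1 || r == (size_t)-2 || r == 0)
            return false; /* invalid, truncated, or unexpected terminator */
        if (wc == target)
            return true;
        s += r;    /* r is the number of bytes this character occupied */
        left -= r;
    }
    return false;
}
```

After a successful use_utf8_locale() call, mb_contains("Käse", L'ä') should report a match on a system where wchar_t stores code points. The same loop can record the position of each match if the game needs to reveal letters.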
