
My idea was to code a Hangman-like game in C. I want it to handle German words with umlauts (e.g. ä, ü, ö) and also Greek words (which consist entirely of non-ASCII characters).

My compiler and my terminal both handle Unicode well, and displaying the strings works fine.

But how should I do operations on these strings? For German I could perhaps handle the 6 upper- and lowercase umlauts by special-casing them in my functions. For Greek, though, that seems impossible.

I wrote this test code. It outputs the string, the length of the string (wrong, of course, because each non-ASCII character here is a UTF-8 sequence that takes up two chars), and the value of each individual char of the string, both as a character and in hex.

#include <stdio.h>
#include <string.h>

int main(void) {
    printf("123456789\n");
    char aTestString[] = "cheese";
    printf("%s has %zu characters\n", aTestString, strlen(aTestString));

    for (size_t i = 0; i < strlen(aTestString); i++) {
        printf("( %c )", aTestString[i]);   // char as a character
        printf("[ %02X ]", aTestString[i]); // char in hexadecimal
    }

    printf("\n123456789\n");
    char aTestString2[] = "Käse";
    printf("%s has %zu characters\n", aTestString2, strlen(aTestString2));

    for (size_t i = 0; i < strlen(aTestString2); i++) {
        printf("( %c )", aTestString2[i]);   // char as a character
        printf("[ %02X ]", aTestString2[i]); // char in hexadecimal
    }

    printf("\n123456789\n");
    char aTestString3[] = "λόγος";
    printf("%s has %zu characters\n", aTestString3, strlen(aTestString3));

    for (size_t i = 0; i < strlen(aTestString3); i++) {
        printf("( %c )", aTestString3[i]);   // char as a character
        printf("[ %02X ]", aTestString3[i]); // char in hexadecimal
    }
}

For example, what is the recommended way to count the Unicode characters, or to check whether a specific Unicode character (that is, code point) is in the string? I am quite sure there must be some simple solution, because such characters are often used in passwords.

Here is the output of the test program:

123456789
cheese has 6 characters
( c )[ 63 ]( h )[ 68 ]( e )[ 65 ]( e )[ 65 ]( s )[ 73 ]( e )[ 65 ]
123456789
Käse has 5 characters
( K )[ 4B ](  )[ FFFFFFC3 ](  )[ FFFFFFA4 ]( s )[ 73 ]( e )[ 65 ]
123456789
λόγος has 10 characters
(  )[ FFFFFFCE ](  )[ FFFFFFBB ](  )[ FFFFFFCF ](  )[ FFFFFF8C ](  )[ FFFFFFCE ](  )[ FFFFFFB3 ](  )[ FFFFFFCE ](  )[ FFFFFFBF ](  )[ FFFFFFCF ](  )[ FFFFFF82 ]
  • To get the number of code points in a Unicode string you need a third-party library, such as the ICU library. Commented Jul 2, 2023 at 17:49
  • Your code would be easier to understand if you translated the output to English. Commented Jul 2, 2023 at 18:07
  • I'm sorry about that. "cheese ist 6 Zeichen lang" just means "cheese has 6 characters". I'll fix this in my code above. Commented Jul 2, 2023 at 18:10
  • @Someprogrammerdude No you don't. It is a couple of lines of plain C code. Commented Jul 2, 2023 at 18:38
  • @ᛉᛉᛉᛉᛉᛉ No. Use wchar_t and functions that work with wide strings, and just count wchar_ts. You don't need anything more complicated than that until you start handling exotic scripts and rare special characters. For German and Greek it's plenty. Commented Jul 2, 2023 at 18:52

1 Answer


C's multi-byte string utilities are useful in this case. Using mbrlen, for example, one way to find the number of characters in a string (albeit probably a very naive one that I just bodged together right now) is this:

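#include <stdio.h>
#include <wchar.h>
#include <locale.h>

// Counts the multibyte characters in s according to the current locale.
// Returns (size_t)-1 if s contains an invalid or truncated sequence.
size_t string_size(const char *s)
{
    mbstate_t state = {0};
    size_t len = 0;
    for (; *s != '\0'; ++len)
    {
        size_t c_len = 1;
        size_t r;
        // Feed one byte at a time until mbrlen no longer reports an incomplete character.
        while ((r = mbrlen(s + c_len - 1, 1, &state)) == (size_t)-2)
            ++c_len;
        if (r == (size_t)-1 || r == 0)
            return (size_t)-1; // invalid sequence, or the string ended mid-character
        s += c_len;
    }
    return len;
}

int main(void)
{
    // The locale name is system-dependent; this spelling is common on Linux.
    if (setlocale(LC_ALL, "en_US.utf8") == NULL)
    {
        fputs("UTF-8 locale not available\n", stderr);
        return 1;
    }
    const char *s = "zß水🍌";
    printf("%zu\n", string_size(s));
}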

// Output: 4

Using the same function, mbrlen, you can also extract individual characters by finding their lengths. There are also functions to convert between multibyte characters and wide characters (such as mbrtowc and wcrtomb) if you would rather work with those.
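
As a rough sketch of that last point for the Hangman case, the following uses mbrtowc to walk a multibyte string one character at a time and check whether a guessed letter occurs in it. The helper names use_utf8_locale and mb_contains are made up for illustration, the locale names it tries are just common ones and may not exist on every system, and comparing wchar_t values as code points assumes a platform where wchar_t holds Unicode code points (as it does with glibc).

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

// Hypothetical helper: try a few common UTF-8 locale names.
// Returns 1 on success, 0 if none of them is available.
int use_utf8_locale(void)
{
    static const char *names[] = { "C.UTF-8", "en_US.UTF-8", "en_US.utf8" };
    for (size_t i = 0; i < sizeof names / sizeof names[0]; ++i)
        if (setlocale(LC_ALL, names[i]) != NULL)
            return 1;
    return 0;
}

// Hypothetical helper: returns 1 if the multibyte string s contains the
// character target, 0 if it does not, and -1 if s is not valid in the
// current locale's encoding.
int mb_contains(const char *s, wchar_t target)
{
    mbstate_t state;
    memset(&state, 0, sizeof state);
    size_t remaining = strlen(s);

    while (remaining > 0)
    {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, remaining, &state);
        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;      // invalid or truncated sequence
        if (n == 0)
            break;          // null character; not expected within strlen(s) bytes
        if (wc == target)
            return 1;
        s += n;             // advance by the bytes this character used
        remaining -= n;
    }
    return 0;
}
```

Once use_utf8_locale() has succeeded, mb_contains("Käse", L'ä') should return 1 and mb_contains("Käse", L'ö') should return 0, provided the source file is saved as UTF-8. This compares single code points only, so a letter written with a separate combining accent would not match its precomposed form.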
