-2

I'm working on a project that processes UTF-8 strings character by character, but I was unable to find a way to work with UTF-8 strings in that manner in C++.

What I need is:

  • The strings need to be UTF-8, since they won't be limited to the English alphabet.
  • Storing and retrieving them as-is is insufficient, since I'll work on them character by character and process them.
  • Accessing them character by character, and being able to compare them with other UTF-8 characters is a requirement.

Suggestions of any C++ feature (regardless of 98/11/14) or library are very welcome.

Additional points for not using Boost. I have a tendency to develop tools without external dependencies.

13
  • 1
    Have you heard of ICU? Commented Oct 21, 2018 at 19:04
  • 1
    This answer (and the one it references) should provide what you need: stackoverflow.com/questions/37989081/… Commented Oct 21, 2018 at 19:06
  • 1
    Possible duplicate stackoverflow.com/questions/43302279/… Commented Oct 21, 2018 at 19:08
  • 1
    Standard C++ already has UTF-8 to UTF-16/UTF-32 converters; no need for an external library. Commented Oct 21, 2018 at 19:20
  • 1
    @KubaOber I want the characters, not the bytes of that particular character. Commented Oct 21, 2018 at 19:35

3 Answers

1

C++ is notorious for having very poor Unicode support out of the box, so the best option is to use a library such as ICU or Boost.

Friendly advice:

"I have a tendency to develop tools without external dependencies."

You need to justify this statement; otherwise, if it's an arbitrary rule of yours, you are limiting yourself. Libraries, like languages, are tools. Choosing which tools to use requires analysis, with the benefits weighed against the downsides.


4 Comments

Thanks for the advice! I like to use libraries that I can embed completely into my source tree, for several reasons. First, it removes the burden of installing development packages of big libraries just to compile a small utility (of mine). Second, it removes the burden of code maintenance to keep up with the library as it evolves. Lastly, it makes the tool more portable, since I don't always have the luxury of installing lots of dev packages to compile my tool. However, if the best way is to use Boost or another so-called big library, I'll happily use it at the end of the day.
@bayindirh You can use vcpkg (github.com/Microsoft/vcpkg) for building and integrating almost any major C++ library nowadays. It's been getting a lot of traction and it is equally good for both rapid prototyping with usage of third-party libs, and for enterprise scenarios (see their export command)
@bayindirh solid argument. If you know what you are doing, which it looks like you do, you are the only judge who can tell if implementing utf8 support yourself is worth it or not.
@bolov, thanks. I'll take a look at <codecvt>, libICU, and others, and if I can find an embeddable library (like eigen, easylogging++, etc.) I'll use it without hesitation. This is a personal project, so there is no time pressure. I'll try to strike a healthy balance between challenge, not-invented-here and pragmatism.
1

You mean working with code points (as opposed to the actual chars, i.e. bytes)? A small addition to the answer above: I recommend first reading the spec on how UTF-8 works, then probably the "UTF-8 Everywhere" manifesto, and also looking here; it is a nice example of how to build a UTF-8 code point iterator. It is always good to know how stuff actually works, especially if it is an important part of your software. Though you will most certainly end up using ICU :-)
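To make the idea of iterating over code points concrete, here is a minimal sketch of a hand-written decoder. It is not the example linked above and not ICU's API; the names next_code_point and decode_utf8 are hypothetical, made up for illustration. It assumes the input is std::string holding UTF-8 bytes, and it only checks lead and continuation bytes; it does not reject overlong encodings or surrogate code points, which a production decoder should.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of any library): decodes the UTF-8 sequence
// that starts at s[i], advances i past it, and returns the code point.
char32_t next_code_point(const std::string& s, std::size_t& i) {
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    std::size_t len;
    char32_t cp;
    if (lead < 0x80)              { len = 1; cp = lead; }         // 0xxxxxxx
    else if ((lead >> 5) == 0x06) { len = 2; cp = lead & 0x1F; }  // 110xxxxx
    else if ((lead >> 4) == 0x0E) { len = 3; cp = lead & 0x0F; }  // 1110xxxx
    else if ((lead >> 3) == 0x1E) { len = 4; cp = lead & 0x07; }  // 11110xxx
    else throw std::runtime_error("invalid UTF-8 lead byte");

    if (i + len > s.size())
        throw std::runtime_error("truncated UTF-8 sequence");

    // Each continuation byte has the form 10xxxxxx and contributes 6 bits.
    for (std::size_t k = 1; k < len; ++k) {
        const unsigned char c = static_cast<unsigned char>(s[i + k]);
        if ((c & 0xC0) != 0x80)
            throw std::runtime_error("invalid UTF-8 continuation byte");
        cp = (cp << 6) | (c & 0x3F);
    }
    i += len;
    return cp;
}

// Decodes a whole string into a sequence of code points.
std::vector<char32_t> decode_utf8(const std::string& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size();)
        out.push_back(next_code_point(s, i));
    return out;
}
```

Once decoded, code points can be compared directly with == against char32_t literals. Note that a code point is not always what a user perceives as one character (combining marks, for example), which is one reason a library such as ICU is usually the safer choice.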

2 Comments

Actually, I need to access the characters themselves. The text I'm going to process is guaranteed to contain two-byte Unicode characters, and I need to access them without seeing their individual bytes. Since C++ can store and read Unicode strings in std::string by splitting them into bytes internally and treating those byte values opaquely, I used the term "code points" to make explicit that I need to access two-byte characters as characters, not as the individual bytes of those characters.
Yeah, sorry, I understood what you meant; I just didn't express my idea very well. By actual chars I meant chars, i.e. byte values. Edited the answer.
0

You can use wide characters (or multibyte characters) to handle Unicode.

https://www.geeksforgeeks.org/wide-char-and-library-functions-in-c/ gives a summary of the C++ library functions for wide characters.

Also see material on internationalization (i18n), and cf. https://www.cprogramming.com/tutorial/unicode.html
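As a rough illustration of the fixed-width approach, the sketch below converts UTF-8 to UTF-32 with the standard library so that each element of the result is one code point. It deliberately uses char32_t rather than wchar_t, because wchar_t is only 16 bits wide on some platforms. The function names utf8_to_utf32 and utf32_to_utf8 are hypothetical, and std::wstring_convert and <codecvt> are deprecated since C++17, so treat this as a sketch for C++11/14 toolchains rather than a recommendation.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: converts UTF-8 bytes to UTF-32, one element per
// code point. from_bytes throws std::range_error on invalid input.
std::u32string utf8_to_utf32(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(utf8);
}

// Hypothetical helper: converts UTF-32 back to UTF-8 for storage or output.
std::string utf32_to_utf8(const std::u32string& utf32) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(utf32);
}
```

With the text in a std::u32string, indexing and comparison work per code point, for example s[0] == U'a'.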
