Working with UTF-8 strings and characters in C++ [closed]

Question

Closed. This question is seeking recommendations for software libraries, tutorials, tools, books, or other off-site resources. It does not meet Stack Overflow guidelines. It is not currently accepting answers.


Closed 7 years ago.


I'm working on a project that processes UTF-8 strings character by character; however, I was unable to find a way to work with UTF-8 strings in that manner in C++.

What I need is:

The strings need to be UTF-8, since they won't be limited to the English alphabet.
Storing and retrieving them as-is is insufficient, since I'll work on them character by character and process them.
Accessing them character by character, and being able to compare them with other UTF-8 characters is a requirement.

Suggestions of any C++ feature or library (regardless of 98/11/14) are very welcome.

Additional points for not using Boost. I have a tendency to develop tools without external dependencies.

This answer (and the one it references) should provide what you need: stackoverflow.com/questions/37989081/…
– Galik, Commented Oct 21, 2018 at 19:06
Standard C++ already has UTF-8 to UTF-16/UTF-32 converters, so there is no need for an external library.
– Galik, Commented Oct 21, 2018 at 19:20
@KubaOber I want the characters, not the bytes of that particular character.
– bayindirh, Commented Oct 21, 2018 at 19:35
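
For illustration, the standard converters mentioned in the comments above are presumably the ones in <codecvt> (an assumption; the comment does not name the header). A minimal sketch of converting UTF-8 to UTF-32, so that each element of the result is one code point, might look like the following. The function names utf8_to_utf32 and utf32_to_utf8 are hypothetical. Note that std::wstring_convert and <codecvt> are deprecated since C++17, so compilers may warn about this code.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-32 code points.
// from_bytes throws std::range_error if the input is not valid UTF-8.
std::u32string utf8_to_utf32(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(utf8);
}

// Hypothetical helper: encode UTF-32 code points back into UTF-8 bytes.
std::string utf32_to_utf8(const std::u32string& utf32) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(utf32);
}
```

With this, utf8_to_utf32("a\xC4\x9F") yields two elements, the second being U+011F, and each element can be compared directly against a char32_t literal.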

bolov · Accepted Answer · 2018-10-21 19:05:09Z

1

C++ is notorious for having very poor Unicode support out of the box, so the best option is to use a library like ICU or Boost.

Friendly advice:

I have a tendency to develop tools without external dependencies

You need to justify this statement; otherwise, if it's an arbitrary rule of yours, you limit yourself. Libraries, like languages, are tools. The choice of which tools to use needs to be analyzed, with the benefits weighed against the downsides.

answered Oct 21, 2018 at 19:05

bolov


4 Comments

bayindirh Over a year ago

Thanks for the advice! I like to use libraries that I can embed completely into my source tree, for various reasons. First, it removes the burden of installing the development packages of big libraries just to compile a small utility (of mine). Second, it removes the maintenance burden of keeping the code compatible as the library evolves. Lastly, it makes the tool more portable, since I don't always have the luxury of installing lots of dev packages to compile my tool. However, if the best way is to use Boost or another so-called big library, I'll happily use it at the end of the day.

ivanmoskalev Over a year ago

@bayindirh You can use vcpkg (github.com/Microsoft/vcpkg) for building and integrating almost any major C++ library nowadays. It's been getting a lot of traction and it is equally good for both rapid prototyping with usage of third-party libs, and for enterprise scenarios (see their export command)

bolov Over a year ago

@bayindirh solid argument. If you know what you are doing, which it looks like you do, you are the only judge who can tell if implementing utf8 support yourself is worth it or not.

bayindirh Over a year ago

@bolov, thanks. I'll take a look at <codecvt>, libICU, and others, and if I can find an embeddable library (like eigen, easylogging++, etc.) I'll use it without hesitation. This is a personal project, so there is no time pressure. I'll try to strike a healthy balance between challenge, not-invented-here and pragmatism.

Community · Accepted Answer · 2021-10-07 11:39:35Z

1

You mean working with code points (as opposed to the actual chars, i.e. bytes)? A small addition to the answer above: I would recommend that you first read the specification of how UTF-8 works, then probably read the "UTF-8 Everywhere" manifesto, and also look here; it is a nice example of how to build a UTF-8 code point iterator. It is always good to know how stuff actually works, especially if it is an important part of your software. Though you will most certainly end up using ICU :-)
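
To give a rough idea of what such a code point iterator involves, here is a minimal hand-written sketch (not the linked example, and not a replacement for ICU). The names next_code_point and decode_utf8 are hypothetical. It checks lead and continuation bytes but, as a simplifying assumption, does not reject overlong encodings, surrogates, or values above U+10FFFF.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode the code point that starts at s[i]
// and advance i past it. Requires i < s.size().
char32_t next_code_point(const std::string& s, std::size_t& i) {
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    std::size_t len;
    char32_t cp;
    // The lead byte's high bits give the sequence length.
    if (lead < 0x80)                { len = 1; cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
    else throw std::runtime_error("invalid UTF-8 lead byte");

    if (i + len > s.size())
        throw std::runtime_error("truncated UTF-8 sequence");

    // Each continuation byte has the form 10xxxxxx and adds 6 bits.
    for (std::size_t k = 1; k < len; ++k) {
        const unsigned char c = static_cast<unsigned char>(s[i + k]);
        if ((c & 0xC0) != 0x80)
            throw std::runtime_error("invalid UTF-8 continuation byte");
        cp = (cp << 6) | (c & 0x3F);
    }
    i += len;
    return cp;
}

// Hypothetical helper: decode a whole string into code points.
std::u32string decode_utf8(const std::string& s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size();)
        out.push_back(next_code_point(s, i));
    return out;
}
```

For example, decode_utf8("\xC4\x9F") gives a single element equal to 0x011F. Keep in mind that one code point is not always one user-perceived character, since combining sequences span several code points; that is the part where a library like ICU earns its keep.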

edited Oct 7, 2021 at 11:39

CommunityBot


answered Oct 21, 2018 at 19:08

ivanmoskalev


2 Comments

bayindirh Over a year ago

Actually, I need to access the characters themselves. The text I'm going to process is guaranteed to contain two-byte Unicode characters, and I need to access them without seeing their individual bytes. Since C++ can store and read Unicode strings in std::string by treating them as opaque sequences of bytes, I used the term "code points" to make explicit that I need to access two-byte characters as characters, not as the individual bytes of those characters.

ivanmoskalev Over a year ago

Yeah, sorry, I understood what you meant, just expressed my idea not too well. By actual chars I meant chars – byte values. Edited the answer.
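
To make the byte versus character distinction in these comments concrete, here is a small sketch that splits a std::string into one substring per code point, so that each piece can be compared with another UTF-8 character using ordinary string comparison. The name split_utf8 is hypothetical, and the sketch assumes the input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split a UTF-8 string into one std::string per code point.
// Assumes valid UTF-8. A byte is a continuation byte when its top two bits are 10.
std::vector<std::string> split_utf8(const std::string& s) {
    std::vector<std::string> chars;
    for (std::size_t i = 0; i < s.size();) {
        std::size_t j = i + 1;
        // Extend the current piece over all following continuation bytes.
        while (j < s.size() &&
               (static_cast<unsigned char>(s[j]) & 0xC0) == 0x80)
            ++j;
        chars.push_back(s.substr(i, j - i));
        i = j;
    }
    return chars;
}
```

For the string "da\xC4\x9F" ("dağ"), size() reports 4 bytes, while split_utf8 returns 3 pieces, the last of which compares equal to "\xC4\x9F".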

ralf htp · Accepted Answer · 2018-10-21 19:24:21Z

0

You can use wide characters (or multibyte characters) to handle Unicode.

A summary of the C++ library functions for wide characters is at https://www.geeksforgeeks.org/wide-char-and-library-functions-in-c/

Also see material on internationalization (i18n), for example https://www.cprogramming.com/tutorial/unicode.html

answered Oct 21, 2018 at 19:24

ralf htp

