I have a string output that is not necessarily valid UTF-8. I have to pass it to a method that only accepts valid UTF-8 strings.
Therefore I need to convert output to the closest valid UTF-8 string by removing the invalid bytes or parts. How can I do that in C++? I would prefer not to use a third-party library.

  • I think this is not safe. If your string is not UTF-8, the only safe thing is to abort entirely. Otherwise you're opening yourself up to attacks. Commented Oct 23, 2012 at 12:54
  • What do "invalid bytes" mean to you? Do you want a valid UTF-8 stream (with maybe invalid code points or nonsensical combinations) or a valid Unicode UTF-8 encoded stream? Commented Oct 23, 2012 at 12:55
  • I need a valid Unicode UTF-8 encoded stream... remove everything that is not valid. Commented Oct 23, 2012 at 13:09
  • The Wikipedia page for UTF-8 (en.wikipedia.org/wiki/Utf-8) contains a lot of information that should easily help you accomplish this. Commented Oct 23, 2012 at 13:18
  • @SteveJessop: No no, the attack isn't in the valid string, but in the way you attempt to recover from invalid data. This has happened before, and as a result, the Unicode standard now says that an application should give up immediately upon encountering an invalid byte. Commented Oct 23, 2012 at 13:50

2 Answers

You should use the icu::UnicodeString methods fromUTF8(const StringPiece &utf8) or toUTF8String(StringClass &result).

If you're sure your string is mostly valid UTF-8 with only a few corrupt bytes, the UTF8-CPP library (http://utfcpp.sourceforge.net/) can fix that. From the page:

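#include <iterator>  // std::back_inserter
#include <string>
#include "utf8.h"    // UTF8-CPP

// Rewrites str in place as valid UTF-8; utf8::replace_invalid substitutes
// U+FFFD (the replacement character) for each invalid sequence.
void fix_utf8_string(std::string& str) {
    std::string temp;
    utf8::replace_invalid(str.begin(), str.end(), std::back_inserter(temp));
    str = temp;
}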

Your requirement of not using a third-party library is pretty much impossible to meet when dealing with Unicode data in general, but the UTF8-CPP library is header-only, which is as light as you can get.
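
For the narrow task in the question (dropping bytes that are not part of a well-formed UTF-8 sequence), a standard-library-only approach might look roughly like the sketch below. It is not taken from either answer: the function names `valid_utf8_length` and `strip_invalid_utf8` are hypothetical, and the byte ranges follow the well-formed sequence table in the Unicode standard, so overlong encodings, surrogates and code points above U+10FFFF are treated as invalid. It only checks the encoding; it does not judge whether the resulting code points are assigned or make sense together.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the length (1-4) of the well-formed UTF-8
// sequence starting at s[i], or 0 if the bytes there do not form one.
std::size_t valid_utf8_length(const std::string& s, std::size_t i) {
    const auto byte = [&s](std::size_t k) {
        return static_cast<unsigned char>(s[k]);
    };
    const unsigned char b0 = byte(i);
    if (b0 <= 0x7F) return 1;  // ASCII

    std::size_t len = 0;
    unsigned char lo = 0x80, hi = 0xBF;  // allowed range for the second byte
    if (b0 >= 0xC2 && b0 <= 0xDF)      { len = 2; }
    else if (b0 == 0xE0)               { len = 3; lo = 0xA0; }  // no overlong forms
    else if (b0 == 0xED)               { len = 3; hi = 0x9F; }  // no surrogates
    else if (b0 >= 0xE1 && b0 <= 0xEF) { len = 3; }
    else if (b0 == 0xF0)               { len = 4; lo = 0x90; }  // no overlong forms
    else if (b0 == 0xF4)               { len = 4; hi = 0x8F; }  // at most U+10FFFF
    else if (b0 >= 0xF1 && b0 <= 0xF3) { len = 4; }
    else return 0;  // 0x80..0xC1 and 0xF5..0xFF never start a sequence

    if (i + len > s.size()) return 0;  // sequence truncated by end of input
    if (byte(i + 1) < lo || byte(i + 1) > hi) return 0;
    for (std::size_t k = 2; k < len; ++k) {
        if (byte(i + k) < 0x80 || byte(i + k) > 0xBF) return 0;
    }
    return len;
}

// Hypothetical helper: copies only the well-formed sequences of `in`.
std::string strip_invalid_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size();) {
        const std::size_t n = valid_utf8_length(in, i);
        if (n == 0) {
            ++i;  // drop one offending byte, then try to resynchronize
        } else {
            out.append(in, i, n);
            i += n;
        }
    }
    return out;
}
```

Calling `strip_invalid_utf8(output)` before passing the string on would then yield well-formed UTF-8. Note that silently dropping bytes joins text that was previously separated, which is the kind of recovery the comments on the question warn about; appending "\xEF\xBF\xBD" (U+FFFD) in the `n == 0` branch instead of skipping is one way to keep the damage visible.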
