I have a string, output, that is not necessarily valid UTF-8. I have to pass it to a method that accepts only valid UTF-8 strings.
Therefore I need to convert output to the closest valid UTF-8 string by removing invalid bytes or sequences. How can I do that in C++? I would prefer not to use a third-party library.
I think this is not safe. If your string is not UTF-8, the only safe thing is to abort entirely. Otherwise you're opening yourself up to attacks. – Kerrek SB, Oct 23, 2012 at 12:54
What do invalid bytes mean for you? Do you want a valid UTF-8 stream (with maybe invalid code points or nonsensical combinations) or a valid Unicode UTF-8 encoded stream? – Matthieu M., Oct 23, 2012 at 12:55
I need a valid Unicode UTF-8 encoded stream... remove everything that is not valid. – Alex Schneider, Oct 23, 2012 at 13:09
The Wikipedia page for UTF-8 (en.wikipedia.org/wiki/Utf-8) contains a lot of information that should easily help you accomplish this. – Component 10, Oct 23, 2012 at 13:18
@SteveJessop: No no, the attack isn't in the valid string, but in the way you attempt to recover from invalid data. This has happened before, and as a result, the Unicode standard now says that an application should give up immediately upon encountering an invalid byte. – Kerrek SB, Oct 23, 2012 at 13:50
2 Answers
If you're sure your string is valid UTF-8 with only a few corrupt bytes, the UTF8-CPP library (http://utfcpp.sourceforge.net/) can fix that. From the page:
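#include <iterator>  // std::back_inserter
#include <string>
#include "utf8.h"    // UTF8-CPP

// Copies str into temp, substituting a replacement character for each invalid sequence.
void fix_utf8_string(std::string& str) {
    std::string temp;
    utf8::replace_invalid(str.begin(), str.end(), std::back_inserter(temp));
    str = temp;
}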
Your requirement of not using a third-party library is pretty much impossible to meet when dealing with Unicode data, but the UTF8-CPP library is header-only, which is as light as you can get.
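For the narrower task in the question (dropping ill-formed bytes, as opposed to general Unicode handling), a hand-rolled filter is feasible. The following is only a sketch, not UTF8-CPP's implementation: the names valid_sequence_length and drop_invalid_utf8 are hypothetical, and it assumes the well-formed byte ranges from the Unicode Standard, so overlong forms, surrogates (U+D800..U+DFFF) and values above U+10FFFF are treated as invalid. Note that it silently removes bytes, which is exactly the recovery behaviour the comments warn about; substituting U+FFFD or rejecting the input may be the safer choice.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: length of the well-formed UTF-8 sequence that starts
// at s[i], or 0 if no well-formed sequence starts there.
static std::size_t valid_sequence_length(const std::string& s, std::size_t i) {
    const std::size_t n = s.size();
    const unsigned char b0 = static_cast<unsigned char>(s[i]);

    // True if s[k] exists and lies in [lo, hi].
    auto in_range = [&](std::size_t k, unsigned char lo, unsigned char hi) {
        if (k >= n) return false;
        const unsigned char c = static_cast<unsigned char>(s[k]);
        return c >= lo && c <= hi;
    };
    // True if s[k] is an ordinary continuation byte (10xxxxxx).
    auto cont = [&](std::size_t k) { return in_range(k, 0x80, 0xBF); };

    if (b0 <= 0x7F) return 1;                                   // ASCII
    if (b0 >= 0xC2 && b0 <= 0xDF) return cont(i + 1) ? 2 : 0;   // C0/C1 are overlong
    if (b0 == 0xE0)                                             // excludes overlong 3-byte forms
        return (in_range(i + 1, 0xA0, 0xBF) && cont(i + 2)) ? 3 : 0;
    if (b0 == 0xED)                                             // excludes surrogates
        return (in_range(i + 1, 0x80, 0x9F) && cont(i + 2)) ? 3 : 0;
    if (b0 >= 0xE1 && b0 <= 0xEF)
        return (cont(i + 1) && cont(i + 2)) ? 3 : 0;
    if (b0 == 0xF0)                                             // excludes overlong 4-byte forms
        return (in_range(i + 1, 0x90, 0xBF) && cont(i + 2) && cont(i + 3)) ? 4 : 0;
    if (b0 >= 0xF1 && b0 <= 0xF3)
        return (cont(i + 1) && cont(i + 2) && cont(i + 3)) ? 4 : 0;
    if (b0 == 0xF4)                                             // excludes values above U+10FFFF
        return (in_range(i + 1, 0x80, 0x8F) && cont(i + 2) && cont(i + 3)) ? 4 : 0;
    return 0;  // stray continuation byte, or a lead byte that is never valid
}

// Hypothetical helper: returns a copy of input with every byte that is not
// part of a well-formed UTF-8 sequence removed.
std::string drop_invalid_utf8(const std::string& input) {
    std::string out;
    out.reserve(input.size());
    std::size_t i = 0;
    while (i < input.size()) {
        const std::size_t len = valid_sequence_length(input, i);
        if (len == 0) {
            ++i;  // skip one offending byte, then try again at the next one
            continue;
        }
        out.append(input, i, len);
        i += len;
    }
    return out;
}
```

Calling output = drop_invalid_utf8(output); before passing the string on would then guarantee well-formed UTF-8, under the assumptions above.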