I have to use a Unicode range in a regex in C++. Basically, I need a regex that accepts all valid Unicode characters. I tried the test expression below and am running into issues with it.


std::regex reg("^[\\u0080-\\uDB7Fa-z0-9!#$%&'*+/=?^_`{|}~-]+$");

Is the issue with \\u?

  • Remove \\u0080-\\uDB7F and try to match 124. If it matches, yes, the problem is with \\u0080-\\uDB7F. Commented Jun 23, 2016 at 10:32
  • The problem is C++ having no usable Unicode support. Use something like ICU. Commented Jun 23, 2016 at 10:34
  • Or Boost is also a good alternative. BTW, check this: "UnicodeEscapeSequence is the letter u followed by exactly four HexDigits. This character escape matches the character whose code unit equals the numeric value of this four-digit hexadecimal number. If the value does not fit in this std::basic_regex's CharT, std::regex_error is thrown (C++ only)." Commented Jun 23, 2016 at 10:35
  • @WiktorStribiżew \uDB7F and most of the range before it definitely do not fit into a char. Commented Jun 23, 2016 at 10:43
  • @BaummitAugen: That is why perhaps wregex could help. I have no time to check that now. Commented Jun 23, 2016 at 10:44

1 Answer

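This should work fine, but you need to use std::wregex and std::wsmatch. A narrow std::regex works on char, and most values in the \\u0080-\\uDB7F range do not fit into a char. You will need to convert both the source string and the regular expression to wide-character Unicode (UTF-32 on Linux, UTF-16(ish) on Windows) to make it work.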
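If the pattern and the input are already wide strings, no conversion step is needed. The following is a minimal sketch of that case, applying the full character class from the question with std::wregex. The function name matches_question_pattern is hypothetical and only serves the illustration, and behavior may differ between standard library implementations.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from the original post): returns true when the
// whole string consists of characters from the question's character class.
bool matches_question_pattern(const std::wstring& s)
{
    // The pattern is a wide literal, so each \uXXXX escape is interpreted
    // against wchar_t rather than char.
    static const std::wregex re(
        L"^[\\u0080-\\uDB7Fa-z0-9!#$%&'*+/=?^_`{|}~-]+$");

    // regex_match requires the entire input to match the pattern.
    return std::regex_match(s, re);
}
```

Note that the class contains no '.' or '@', so a full address such as the one in the example below would be rejected by this anchored pattern; it only validates a single run of allowed characters.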

This works for me where source text is UTF-8:

#include <iostream>
#include <regex>
#include <string>

inline std::wstring from_utf8(const std::string& utf8)
{
    // code to convert from utf8 to utf32/utf16
}

inline std::string to_utf8(const std::wstring& ws)
{
    // code to convert from utf32/utf16 to utf8
}

int main()
{
    std::string test = "john.doe@神谕.com"; // utf8
    std::string expr = "[\\u0080-\\uDB7F]+"; // utf8

    std::wstring wtest = from_utf8(test);
    std::wstring wexpr = from_utf8(expr);

    std::wregex we(wexpr);
    std::wsmatch wm;
    if(std::regex_search(wtest, wm, we))
    {
        std::cout << to_utf8(wm.str(0)) << '\n';
    }
}

Output:

神谕

Note: If you need a UTF conversion library I used THIS ONE in the example above.

Edit: Or, you could use the functions given in this answer:

Any good solutions for C++ string code point and code unit?
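For readers who prefer to avoid an extra dependency, the two conversion helpers could also be sketched with the standard library alone. This is an assumption on my part and not the library used for the output shown earlier: it relies on std::wstring_convert and std::codecvt_utf8, which exist since C++11 but are deprecated since C++17, and it assumes a 32-bit wchar_t as on Linux.

```cpp
#include <codecvt>  // std::codecvt_utf8 (deprecated since C++17)
#include <locale>   // std::wstring_convert (deprecated since C++17)
#include <string>

// Assumption: wchar_t is 32 bits wide, so each code point maps to one
// wchar_t. With a 16-bit wchar_t (Windows), std::codecvt_utf8_utf16<wchar_t>
// would be the facet to use for UTF-16 instead.
inline std::wstring from_utf8(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    // from_bytes throws std::range_error if the input is not valid UTF-8.
    return conv.from_bytes(utf8);
}

inline std::string to_utf8(const std::wstring& ws)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    // to_bytes throws std::range_error if a character cannot be encoded.
    return conv.to_bytes(ws);
}
```

Compilers may emit deprecation warnings for these facilities in C++17 and later, which is one reason a dedicated library such as ICU or Boost can be the better long-term choice.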

2 Comments

Great answer, thanks! What does the [\\u0080-\\uDB7F]+ range cover? A-Z? In that vein, what would be a regex for [a-zA-Z0-9]?
@SexyBeast I just copied that range out of the OP's question. But you can see what it covers here: idevelopment.info/data/Programming/character_encodings/… Also, what you have written should work fine in a regex.
