I'm getting a strange behaviour running this code:

import regex
regex.search(ur'([^\p{IsAlnum}\s\.\'\`\,\-])', u'\U0001f618')

This should match \U0001f618, which is the Unicode escape for a kissing emoji (code point U+1F618). The result, however, is the following:

<regex.Match object; span=(0, 1), match=u'\ud83d'>

This doesn't make sense to me, because u'\ud83d' is not a valid Unicode character on its own.

I expected this instead:

<regex.Match object; span=(0, 1), match=u'\U0001f618'>

What is happening here?

I'm running Python 2.7.13 on macOS Sierra 10.12.6, regex.__version__ is 2.4.130.

  • Cannot reproduce. Same python version, same regex version, output is <regex.Match object; span=(0, 1), match=u'\U0001f618'>. I'm on Manjaro instead of Mac, but not sure how that would make a difference. Maybe try reinstalling the regex module? Commented Oct 4, 2017 at 10:32
  • Same Python and regex versions, however, on Linux platform. Works as you expect. Commented Oct 4, 2017 at 10:35
  • Same Python and regex versions, can reproduce on macOS Sierra 10.12.6: <regex.Match object; span=(0, 1), match=u'\ud83d'> Commented Oct 4, 2017 at 10:41
  • I'm also running macOS Sierra 10.12.6 Commented Oct 4, 2017 at 11:01
  • This may help: Python returns length of 2 for single Unicode character string Commented Oct 4, 2017 at 11:13

1 Answer

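As mentioned by @PM 2Ring, this happens because this Python was compiled as a narrow (UCS-2) build instead of a wide (UCS-4) build. A narrow build cannot store a code point above U+FFFF in a single unit, so it represents u'\U0001f618' internally as a UTF-16 surrogate pair, i.e. the two code units u'\ud83d' and u'\ude18'. The regex module sees those as two separate characters and matches only the first one, which explains the span=(0, 1) and the lone u'\ud83d' in the result. On a wide build the string has length 1 and the match is the whole emoji, which is consistent with the comments that could not reproduce the problem on Linux.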

More information here: https://stackoverflow.com/a/29109996/4111012
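
To check which kind of build you have, and to see where u'\ud83d' comes from, something like the following rough sketch may help. The helper names is_narrow_build and surrogate_pair are made up for illustration and are not part of Python or the regex module; the arithmetic is the standard UTF-16 surrogate encoding.

```python
import sys


def is_narrow_build():
    # Narrow (UCS-2) builds report 0xFFFF here.
    # Wide builds (and Python 3.3+) report 0x10FFFF.
    return sys.maxunicode == 0xFFFF


def surrogate_pair(code_point):
    """Return the (high, low) UTF-16 code units for a code point above U+FFFF.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not above U+FFFF")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


high, low = surrogate_pair(0x1F618)
print("narrow build: %s" % is_narrow_build())
print("U+1F618 as UTF-16 code units: %04x %04x" % (high, low))
```

If is_narrow_build() returns True, len(u'\U0001f618') will be 2 and any regex that matches "one character" can stop after the first half of the pair, as in the question.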
