I have a large file where any unicode character that wasn't in UTF-8 got replaced by its code point in angle brackets (e.g. the "👍" was converted to "<U+0001F44D>"). Now I want to revert this with a regex substitution.

I've tried to accomplish this with

re.sub(r'<U\+([A-F0-9]+)>',r'\U\1', str)

but this doesn't work, because the captured group cannot be inserted into a \U escape in the replacement string. What's the best/easiest way to do this? I found many questions about doing the exact opposite, but nothing useful for turning these code points back into actual characters.

1 Answer

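When you have the number (code point) of a character, chr(number) returns the character for that number. (ord does the reverse: it takes a character and returns its number.)

Because the code point arrives as a string of hexadecimal digits, we first need to parse it as an int with base 16.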

Both of those together:

>>> chr(int("0001F44D", 16))
'👍'
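
The same two steps can be wrapped in a small helper, which may make the conversion easier to reuse and test. This is only a sketch; the name code_point_to_char is made up for illustration and is not part of any library.

```python
def code_point_to_char(hex_digits: str) -> str:
    """Convert a hexadecimal code point such as '0001F44D' to its character.

    Hypothetical helper name, chosen for this example only.
    """
    code_point = int(hex_digits, 16)  # parse the digits as a base-16 integer
    return chr(code_point)            # chr() maps a code point to a one-character str


print(code_point_to_char("0041"))  # prints: A
```

Leading zeros are harmless, because int() ignores them when parsing. Note that chr() rejects values above 0x10FFFF, so malformed input would raise an exception here.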

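However, the replacement is now the result of a function call, not a fixed string we can simply substitute. Fortunately, re.sub also accepts a function as the replacement: it is called with each match object, and its return value is used as the replacement text.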

Now we get:

re.sub(r'<U\+([A-F0-9]+)>', lambda x: chr(int(x.group(1), 16)), my_str)
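
For a large file it may be worth compiling the pattern once and using a named function instead of a lambda. The following is a minimal sketch under a few assumptions: the names PLACEHOLDER, restore_code_points, and to_char are made up for this example, the placeholders are assumed to use uppercase hex digits as in the question, and out-of-range values are assumed to be best left unchanged rather than raising.

```python
import re

# Same pattern as above: "<U+", one or more uppercase hex digits, ">".
PLACEHOLDER = re.compile(r'<U\+([A-F0-9]+)>')


def restore_code_points(text: str) -> str:
    """Replace every <U+XXXX> placeholder in text with the character it names."""

    def to_char(match):
        try:
            return chr(int(match.group(1), 16))
        except (ValueError, OverflowError):
            # Not a valid code point (above U+10FFFF): keep the placeholder as-is.
            # This fallback is an assumption, not something the question requires.
            return match.group(0)

    return PLACEHOLDER.sub(to_char, text)


sample = "Well done <U+0001F44D> caf<U+00E9>"
restored = restore_code_points(sample)  # both placeholders become real characters
```

If the file might also contain lowercase digits such as <U+0001f44d>, the character class would need to be widened (for example to [A-Fa-f0-9]).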

PS: Don't name your string str. Doing so shadows the built-in str type.

1 Comment

Thanks, I was not aware that it is possible to pass a function to re.sub! I was thinking of using the chr() function here, but didn't know how to combine it with the regex substitution. The str was only for debugging purposes; I will use this with a dataset mapping function. Thanks a lot!
