Replace Unicode code point with actual character using regex

Question

I have a large file where any unicode character that wasn't in UTF-8 got replaced by its code point in angle brackets (e.g. the "👍" was converted to "<U+0001F44D>"). Now I want to revert this with a regex substitution.

I've tried to accomplish this with

re.sub(r'<U\+([A-F0-9]+)>',r'\U\1', str)

but obviously this won't work because we cannot insert the group into this unicode escape. What's the best/easiest way to do this? I found many questions trying to do the exact opposite but nothing useful to 're-encode' these code points as actual characters...

h4z3 · Accepted Answer · 2021-05-12 16:05:39Z

3

When you have the number (code point) of a character, you can call chr(number) to get the character for that number.

Because the regex captures the code point as a string of hex digits, we first need to parse it as an int with base 16.

Both of those together:

>>> chr(int("0001F44D", 16))
'👍'

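However, the replacement is now a small function call, not a fixed string we can simply substitute! A quick search shows that re.sub also accepts a function as the replacement: it is called with each match object, and its return value is used as the replacement text.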

Now we get:

re.sub(r'<U\+([A-F0-9]+)>', lambda x: chr(int(x.group(1), 16)), my_str)
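For reference, here is the same idea as a small self-contained sketch. The names restore_code_points, CODE_POINT_PATTERN and the sample string are made up for illustration, and it assumes the placeholders always use uppercase hex digits, as in the question.

```python
import re

# Matches placeholders such as <U+0001F44D>; the hex digits are captured in group 1.
# Assumption: digits are uppercase hex, as in the question's example.
CODE_POINT_PATTERN = re.compile(r"<U\+([0-9A-F]+)>")


def restore_code_points(text):
    """Replace every <U+XXXX> placeholder with the character it names."""

    def to_char(match):
        # Parse the captured hex digits as a base-16 integer,
        # then turn that code point into the actual character.
        return chr(int(match.group(1), 16))

    # re.sub / Pattern.sub calls to_char once per match and inserts its return value.
    return CODE_POINT_PATTERN.sub(to_char, text)


sample = "Nice work <U+0001F44D> see you at the caf<U+00E9>"
print(restore_code_points(sample))  # Nice work 👍 see you at the café
```

Note that chr() raises ValueError for numbers above 0x10FFFF, so if the file might contain malformed placeholders you may want to catch that inside to_char and return match.group(0) unchanged.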

PS: Don't name your string str. Doing so shadows the built-in str type.


1 Comment

Jeremias Bohn Over a year ago

Thanks, I was not aware that there is any possibility to use a function with re.sub! I was thinking of using the chr() function here, but didn't know how to use it with the regex substitution. The str was only for debugging purposes, I will use this with a dataset mapping function! Thanks a lot!
