How do I regex search for weird non-ASCII characters in Python?

Question

I'm using the following regular expression to search for and delete these characters:

invalid_unicode = re.compile(ur'(Û|²|°|±|É|¹|Í)')

My source code is ASCII-encoded, and whenever I try to run the script it spits out:

SyntaxError: Non-ASCII character '\xdb' in file ./release.py on line 273, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

If I follow the instructions at that page and put a utf-8 encoding declaration on the second line, my script still doesn't run. Instead it gives me this error:

SyntaxError: (unicode error) 'utf8' codec can't decode byte 0xdb in position 0: unexpected end of data

If someone could tell me how to get this one regular expression running in an ASCII-encoded script, that'd be great.

I just figured out that these characters aren't Unicode but extended ASCII.
– Incognito, Commented Jan 11, 2010 at 3:00
I highly recommend reading Joel's article on Unicode and character sets: joelonsoftware.com/articles/Unicode.html
– Greg Hewgill, Commented Jan 11, 2010 at 3:33

Alex Martelli · Accepted Answer · 2010-01-11 02:52:49Z

3

You need to find out what encoding your editor is using and declare it per PEP 263. Alternatively, to make things more stable and portable (though alas perhaps a bit less readable), use escape sequences in your string literal, i.e., pass u'(\xdb|\xb2|\xb0|\xb1|\xc9|\xb9|\xcd)' as the argument to the re.compile call.
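As a minimal sketch of the escape-sequence approach, the snippet below builds the pattern from escapes only, so the source file stays pure ASCII and needs no encoding declaration. The helper name strip_invalid and the sample string are hypothetical, added only for illustration.

```python
import re

# Escapes keep the source file pure ASCII, so no PEP 263 declaration is needed.
# Code points: U+00DB, U+00B2, U+00B0, U+00B1, U+00C9, U+00B9, U+00CD.
invalid_unicode = re.compile(u'(\xdb|\xb2|\xb0|\xb1|\xc9|\xb9|\xcd)')


def strip_invalid(text):
    """Delete every character matched by invalid_unicode (hypothetical helper)."""
    return invalid_unicode.sub(u'', text)


# Hypothetical input containing U+00B0 and U+00B1.
sample = u'25\xb0C \xb1 2'
cleaned = strip_invalid(sample)
```

The pattern only matches these characters in Unicode strings, so any bytes read from a file would need to be decoded with the file's actual encoding first.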


2 Comments

Paulo Santos Over a year ago

I wouldn't use the byte representation of a Unicode character, as it's really locale-dependent. I'd escape it as Unicode, thus avoiding any complication.

Alex Martelli Over a year ago

@Paulo, Unicode encodings are not necessarily locale-dependent -- you can use 'utf-8' or other locale-independent encodings. But in any case, the form I give above is what print repr(thestring) emits; it does not rely on any encoding and cannot possibly cause any complications (it's just the same as using \u00db and so on, guaranteed to produce absolutely identical Unicode objects, just more concise as it saves you typing the unchanging 00 parts!-).
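The equivalence claimed in this comment can be checked directly. This is a small sketch; the variable names are illustrative and not from the thread. Note that it applies to Unicode literals: in a Python 2 byte string (no u prefix), '\xdb' would be a raw byte rather than the character U+00DB.

```python
# Inside a Unicode literal, \xNN and \u00NN name the same code point.
short_form = u'\xdb\xb2\xb0'
long_form = u'\u00db\u00b2\u00b0'

same = short_form == long_form

# Each character is a single code point in the U+0080..U+00FF range.
code_points = [hex(ord(ch)) for ch in short_form]
```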

Greg Hewgill · Accepted Answer · 2010-01-11 02:51:10Z

1

After telling Python that your source file uses UTF-8 encoding, did you actually make sure that your editor is saving the file using UTF-8 encoding? The error you get indicates that your editor is probably not using UTF-8.

What text editor are you using?
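To illustrate why the declaration alone is not enough, the sketch below assumes the editor saved the character as the single byte 0xDB, as a single-byte encoding such as Latin-1 would. That is an assumption: the thread does not say which encoding the editor actually used. Decoded as UTF-8, that lone byte fails in the same way as the error in the question.

```python
# Assumption: the editor wrote U+00DB as the single byte 0xDB (Latin-1 style).
raw = b'\xdb'

try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    # 0xDB is a UTF-8 lead byte that must be followed by a continuation byte.
    utf8_ok = False

# The same byte decodes cleanly under a single-byte encoding.
as_latin1 = raw.decode('latin-1')

# A file really saved as UTF-8 would contain two bytes for this character.
as_utf8_bytes = u'\xdb'.encode('utf-8')
```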


1 Comment

Greg Hewgill Over a year ago

Here's how to configure UTF-8 in Notepad++: superuser.com/questions/21135/…

Anon. · Accepted Answer · 2010-01-11 02:51:32Z

0

\x{c0de}

In a regex will match the Unicode character at code point c0de.

Python uses PCRE, right? (If it doesn't, it's probably \uC0DE instead...)
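For what it's worth, Python's re module is its own engine rather than PCRE, and as far as I know it does not accept the \x{...} brace form. The sketch below assumes Python 3.3 or later. It checks that the brace form is rejected and that writing the character with a \u escape in the string literal works.

```python
import re

# The PCRE-style brace form is rejected by Python's re module.
try:
    re.compile(r'\x{c0de}')
    braces_supported = True
except re.error:
    braces_supported = False

# A \u escape in the string literal produces the character U+C0DE itself,
# which the regex then matches literally.
pattern = re.compile(u'\uc0de')
matched = pattern.search(u'abc\uc0dexyz') is not None
```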

