2

I'm using the following regular expression to search for and delete these characters.

invalid_unicode = re.compile(ur'(Û|²|°|±|É|¹|Í)')

My source code is ASCII-encoded, and whenever I try to run the script it fails with:

SyntaxError: Non-ASCII character '\xdb' in file ./release.py on line 273, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

If I follow the instructions at that page and declare utf-8 as the encoding on the second line, my script still doesn't run. Instead it gives me this error:

SyntaxError: (unicode error) 'utf8' codec can't decode byte 0xdb in position 0: unexpected end of data

How do I get this one regular expression running in a script saved as ASCII?

2
  • 1
    I just figured out that these characters aren't Unicode but extended ASCII codes. Commented Jan 11, 2010 at 3:00
  • 2
    I highly recommend reading Joel's article on Unicode and character sets: joelonsoftware.com/articles/Unicode.html Commented Jan 11, 2010 at 3:33

3 Answers

3

You need to find out what encoding your editor is using, and set that per PEP263; or, make things more stable and portable (though alas perhaps a bit less readable) and use escape sequences in your string literal, i.e., use u'(\xdb|\xb2|\xb0|\xb1|\xc9|\xb9|\xcd)' as the parameter to the re.compile call.
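A minimal sketch of the escape-sequence approach, assuming the goal is to delete those characters from a Unicode string. The sample text and the name `cleaned` are made up for illustration; the `u'...'` prefix is accepted by Python 2 and by Python 3.3+.

```python
import re

# Escapes keep the source file pure ASCII, so no encoding declaration is needed.
invalid_unicode = re.compile(u'(\xdb|\xb2|\xb0|\xb1|\xc9|\xb9|\xcd)')

# Hypothetical sample input, also written with escapes.
sample = u'\xdbabc\xb2 def\xcd'

# Replace every match with the empty string, i.e. delete it.
cleaned = invalid_unicode.sub(u'', sample)
print(repr(cleaned))
```

Since every alternative is a single character, a character class such as u'[\xdb\xb2\xb0\xb1\xc9\xb9\xcd]' should match the same input.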


2 Comments

I wouldn't use the byte representation of a Unicode char, as it's really locale-dependent. I'd escape as Unicode, thus avoiding any complication.
@Paulo, Unicode encodings are not necessarily locale-dependent -- you can use 'utf-8' or other locale-independent encodings. But in any case, the form I give above is what a print repr(thestring) emits, does not rely on any encoding, and cannot possibly cause any complications (it's just the same as using \u00db and so on, guaranteed to produce absolutely identical Unicode objects, just more concise as it saves you typing the unchanging 00 parts!-).
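The equivalence claimed in that comment can be checked directly. This is a small sketch that only applies to Unicode literals (u'...' in Python 2, or any str in Python 3); in a Python 2 byte string, '\xdb' would be a raw byte instead. The names `short_form` and `long_form` are illustrative.

```python
# Both spellings name code points U+00DB, U+00B2, and so on; no encoding is involved.
short_form = u'(\xdb|\xb2|\xb0|\xb1|\xc9|\xb9|\xcd)'
long_form = u'(\u00db|\u00b2|\u00b0|\u00b1|\u00c9|\u00b9|\u00cd)'

# The two literals produce identical Unicode strings.
assert short_form == long_form

# Index 1 is the first character after the opening parenthesis.
assert ord(short_form[1]) == 0xDB
```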
1

After telling Python that your source file uses UTF-8 encoding, did you actually make sure that your editor is saving the file using UTF-8 encoding? The error you get indicates that your editor is probably not using UTF-8.

What text editor are you using?
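To see why the UTF-8 declaration fails, here is a sketch under the assumption that the editor saved the file in a single-byte encoding such as Latin-1 (the question does not confirm which one). It decodes the byte from the error message both ways; `raw`, `as_latin1` and `utf8_ok` are illustrative names.

```python
# The byte reported in the SyntaxError message.
raw = b'\xdb'

# Latin-1 maps every byte to the code point with the same number.
as_latin1 = raw.decode('latin-1')

try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    # 0xDB is the lead byte of a two-byte UTF-8 sequence,
    # so a lone 0xDB is an incomplete sequence.
    utf8_ok = False

print(repr(as_latin1))
print(utf8_ok)
```

If the file really is saved as Latin-1, the matching PEP 263 declaration would name latin-1 rather than utf-8; otherwise the editor has to be switched to save as UTF-8.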

1 Comment

Here's how to configure UTF-8 in Notepad++: superuser.com/questions/21135/…
0
\x{c0de}

In a regex will match the Unicode character at code point c0de.

Python uses PCRE, right? (If it doesn't, it's probably \uC0DE instead...)
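For what it's worth, Python's re module is its own engine rather than PCRE, and as far as I can tell it rejects the braces form. A quick sketch to check this, using the \uXXXX escape in a Unicode literal instead; the sample string and variable names are made up.

```python
import re

# Try the PCRE-style braces form; Python's re treats \x as a two-digit hex escape.
try:
    re.compile(r'\x{c0de}')
    braces_supported = True
except re.error:
    braces_supported = False

# A \uXXXX escape in a Unicode literal names the code point directly.
pattern = re.compile(u'\uc0de')
match = pattern.search(u'abc\uc0dedef')

print(braces_supported)
print(match.start())
```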
