
When parsing an HTML file with Requests and Beautiful Soup, the following line is throwing an exception on some web pages:

if 'var' in str(tag.string):

Here is the context:

response = requests.get(url)  
soup = bs4.BeautifulSoup(response.text.encode('utf-8'))

for tag in soup.findAll('script'):
    if 'var' in str(tag.string):    # This is the line throwing the exception
        print(tag.string)

Here is the exception:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 15: ordinal not in range(128)

I have tried both with and without the encode('utf-8') call in the BeautifulSoup line; it makes no difference. I do note that on the pages throwing the exception there is an à character in a comment in the JavaScript, even though response.encoding reports ISO-8859-1. I realise that I could remove the offending characters with unicodedata.normalize, but I would prefer to convert the tag variable to utf-8 and keep the characters. None of the following methods changes the variable to utf-8:

tag.encode('utf-8')
tag.decode('ISO-8859-1').encode('utf-8')
tag.decode(response.encoding).encode('utf-8')

What must I do to this string in order to transform it into usable utf-8?

Comments:

  • You tried those methods, but kept doing `if 'var' in str(tag.string):`? Commented Jun 10, 2013 at 15:15
  • @PauloBu: No, of course I used the output of the conversion! Commented Jun 10, 2013 at 15:34

2 Answers


OK, so basically you're getting an HTTP response reported as Latin-1. The character giving you trouble is indeed à: the byte 0xC3 is 'Ã' in Latin-1, and it is also the lead byte of the two-byte UTF-8 encoding of 'à', which hints that the body may really be UTF-8 even though the headers say ISO-8859-1.
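A quick stdlib-only check of what that byte means under each codec (the sample bytes here are invented for illustration):

```python
raw = b'\xc3\xa0'  # hypothetical sample: the UTF-8 encoding of 'à'

print(raw.decode('utf-8'))    # one character: 'à'
print(raw.decode('latin-1'))  # two characters: 'Ã' plus a non-breaking space
try:
    raw.decode('ascii')
except UnicodeDecodeError as e:
    print(e)  # the same "can't decode byte 0xc3" error as in the question
```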

I think you blindly tested every decode/encode combination you could imagine on the request. First of all: with if 'var' in str(tag.string):, Python 2 will complain whenever tag.string contains non-ASCII characters, because str() implicitly converts through the ascii codec.

Looking at the code you've shared with us, the right approach IMHO would be:

response = requests.get(url)
# response.text is already a unicode string; to let Beautiful Soup do the
# decoding itself, hand it the raw bytes plus the declared encoding
soup = bs4.BeautifulSoup(response.content, from_encoding=response.encoding)

for tag in soup.findAll('script'):
    # the soup now holds unicode strings, so compare against unicode;
    # tag.string can be None (e.g. for external scripts), so guard for it
    if tag.string and u'var' in tag.string:
        # encode to utf-8 only at the output boundary
        print(tag.string.encode('utf-8'))

EDIT: It will be useful for you to take a look at the encoding section of the BeautifulSoup 4 docs.

Basically, the logic is:

  1. You receive bytes encoded in some encoding X
  2. You decode them with bytes.decode('X'), which returns a unicode string
  3. You do all your work on the unicode string
  4. You encode the result to some output encoding Y with unicode_string.encode('Y')
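The four steps above can be sketched as a round trip, using made-up Latin-1 bytes:

```python
# Step 1: bytes arrive in encoding X (here Latin-1 -- sample data)
raw = b'// po\xe8me\nvar x = 1;'
# Step 2: decode X to get a unicode string
text = raw.decode('latin-1')
# Step 3: work with unicode
assert u'var' in text
# Step 4: encode to Y (utf-8) only for output
print(text.encode('utf-8'))
```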

Hope this brings some light to the problem.


Comments:

Thanks. Instead of response.text.decode('latin-1') I am trying response.text.decode(response.encoding), because this application needs to work with other sites as well. That very line is now throwing the error (albeit with a different position, of course). Is there no generic way to work with any encoding?
What's the error now? This is the way to work with whatever encoding: get the response's encoding, decode with it, work in unicode, and encode to utf-8 for output. What error is it throwing now, and what does response.encoding look like?
Same error: UnicodeEncodeError: 'ascii' codec can't encode characters in position 5837-5838: ordinal not in range(128), now on this line: soup = bs4.BeautifulSoup(response.text.decode(response.encoding)) (all copied from the CLI error message). The page that I'm parsing in this example is poemhunter.com/poems/hate (not my site, just an example that I stumbled across).
I edited the code in my answer where the BeautifulSoup object is instantiated. I also gave you a link to the docs, which will be useful. I'll take a look at that page. Notify me if it worked.
Thank you, passing the encoding with from_encoding= as you mention does seem to help! I'm testing now. Thank you for the link to the relevant part of the documentation.

You can also try Unicode, Dammit (it is part of BS4) to parse pages. Detailed description here: http://scriptcult.com/subcategory_176/article_852-use-beautifulsoup-unicodedammit-with-lxml-html.html
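A minimal sketch of using Unicode, Dammit directly (the sample bytes are invented; the candidate-encoding list is an assumption for this example):

```python
from bs4 import UnicodeDammit

raw = b'<script>// po\xe8me\nvar x = 1;</script>'  # sample Latin-1 bytes
dammit = UnicodeDammit(raw, ['latin-1', 'utf-8'])
print(dammit.original_encoding)  # the encoding it settled on
print(dammit.unicode_markup)     # the markup as a unicode string
```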

