
When parsing an HTML file with Requests and Beautiful Soup, the following line is throwing an exception on some web pages:

if 'var' in str(tag.string):

Here is the context:

response = requests.get(url)  
soup = bs4.BeautifulSoup(response.text.encode('utf-8'))

for tag in soup.findAll('script'):
    if 'var' in str(tag.string):    # This is the line throwing the exception
        print(tag.string)

Here is the exception:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 15: ordinal not in range(128)

I have tried both with and without the encode('utf-8') call in the BeautifulSoup line; it makes no difference. I do note that on the pages throwing the exception there is an à character in a comment in the JavaScript, even though response.encoding reports ISO-8859-1. I realise that I could remove the offending characters with unicodedata.normalize, but I would prefer to convert the tag variable to utf-8 and keep the characters. None of the following methods changes the variable to utf-8:

tag.encode('utf-8')
tag.decode('ISO-8859-1').encode('utf-8')
tag.decode(response.encoding).encode('utf-8')

What must I do to this string in order to transform it into usable utf-8?

Comments:

  • You tried those methods, but kept doing `if 'var' in str(tag.string):`? Commented Jun 10, 2013 at 15:15
  • @PauloBu: No, of course I used the output of the conversion! Commented Jun 10, 2013 at 15:34

2 Answers


OK, so basically you're getting an HTTP response reported as Latin-1. The character giving you trouble is indeed à: the byte 0xC3 is 'Ã' in Latin-1, and it is also the lead byte of the two-byte UTF-8 encoding of 'à', which hints that the body may really be UTF-8 even though the headers say ISO-8859-1.
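A quick stdlib-only check of what that byte means under each codec (the sample bytes here are invented for illustration):

```python
raw = b'\xc3\xa0'  # hypothetical sample: the UTF-8 encoding of 'à'

print(raw.decode('utf-8'))    # one character: 'à'
print(raw.decode('latin-1'))  # two characters: 'Ã' plus a non-breaking space
try:
    raw.decode('ascii')
except UnicodeDecodeError as e:
    print(e)  # the same "can't decode byte 0xc3" error as in the question
```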

I think you blindly tested every decode/encode combination you could imagine on the request. First of all: with if 'var' in str(tag.string):, Python 2 will complain whenever tag.string contains non-ASCII characters, because str() implicitly converts through the ascii codec.

Looking at the code you've shared with us, the right approach IMHO would be:

response = requests.get(url)
# response.text is already a unicode string; to let Beautiful Soup do the
# decoding itself, hand it the raw bytes plus the declared encoding
soup = bs4.BeautifulSoup(response.content, from_encoding=response.encoding)

for tag in soup.findAll('script'):
    # the soup now holds unicode strings, so compare against unicode;
    # tag.string can be None (e.g. for external scripts), so guard for it
    if tag.string and u'var' in tag.string:
        # encode to utf-8 only at the output boundary
        print(tag.string.encode('utf-8'))

EDIT: It will be useful for you to take a look at the encoding section of the BeautifulSoup 4 docs.

Basically, the logic is:

  1. You receive bytes encoded in some encoding X
  2. You decode them with bytes.decode('X'), which returns a unicode string
  3. You do all your work on the unicode string
  4. You encode the result to some output encoding Y with unicode_string.encode('Y')
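The four steps above can be sketched as a round trip, using made-up Latin-1 bytes:

```python
# Step 1: bytes arrive in encoding X (here Latin-1 -- sample data)
raw = b'// po\xe8me\nvar x = 1;'
# Step 2: decode X to get a unicode string
text = raw.decode('latin-1')
# Step 3: work with unicode
assert u'var' in text
# Step 4: encode to Y (utf-8) only for output
print(text.encode('utf-8'))
```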

Hope this brings some light to the problem.


Comments:

Thanks. Instead of response.text.decode('latin-1') I am trying response.text.decode(response.encoding), because this application needs to work with other sites as well. That very line is now throwing the error (albeit with a different position, of course). Is there no generic way to work with any encoding?
What's the error now? This is the way to work with whatever encoding: get the response's encoding, decode with it, work in unicode, and encode to utf-8 for output. What error is it throwing now, and what does response.encoding look like?
Same error: UnicodeEncodeError: 'ascii' codec can't encode characters in position 5837-5838: ordinal not in range(128), now on this line: soup = bs4.BeautifulSoup(response.text.decode(response.encoding)) (all copied from the CLI error message). The page that I'm parsing in this example is poemhunter.com/poems/hate (not my site, just an example that I stumbled across).
I edited the code in my answer where the BeautifulSoup object is instantiated. I also gave you a link to the docs, which will be useful. I'll take a look at that page. Notify me if it worked.
Thank you, passing the encoding with from_encoding= as you mention does seem to help! I'm testing now. Thank you for the link to the relevant part of the documentation.

You can also try Unicode, Dammit (it is part of BS4) to parse pages. Detailed description here: http://scriptcult.com/subcategory_176/article_852-use-beautifulsoup-unicodedammit-with-lxml-html.html
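A minimal sketch of using Unicode, Dammit directly (the sample bytes are invented; the candidate-encoding list is an assumption for this example):

```python
from bs4 import UnicodeDammit

raw = b'<script>// po\xe8me\nvar x = 1;</script>'  # sample Latin-1 bytes
dammit = UnicodeDammit(raw, ['latin-1', 'utf-8'])
print(dammit.original_encoding)  # the encoding it settled on
print(dammit.unicode_markup)     # the markup as a unicode string
```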

