Beautiful Soup doesn't seem to work properly (for me) when the HTML contains non-ASCII characters, i.e. characters whose code point is 128 or above. What decoding or encoding should be used in this case?

import BeautifulSoup

raw = open('index.html').read()
BeautifulSoup.BeautifulSoup(raw)

Error

...stacktrace...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8094: ordinal not in range(128)

1 Answer

Parsing the file is not the problem. With the link you gave in your comment to Marco, soup = BeautifulSoup(urllib.urlopen(your_link)) works absolutely fine.

The problem only appears when you try to output the parsed data to the console: it has been converted to Unicode, and Python will try to output it as ASCII unless you tell it otherwise. So running print soup, rather than just typing soup, in your console will work.
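To illustrate the point, here is a minimal sketch that reproduces the failing step without Beautiful Soup at all. The sample string and the helper names (to_bytes, ascii_fails) are made up for illustration and are not part of any library; the only assumption is that the parsed text contains the u'\xe9' character from the traceback.

```python
# Hypothetical sample text containing the character from the traceback (e-acute).
text = u'caf\xe9'


def to_bytes(value, encoding='utf-8'):
    """Encode text explicitly instead of relying on an implicit ASCII default."""
    return value.encode(encoding)


def ascii_fails(value):
    """Return True if encoding the text as ASCII raises UnicodeEncodeError."""
    try:
        value.encode('ascii')
    except UnicodeEncodeError:
        return True
    return False
```

Here ascii_fails(text) returns True, which is the same UnicodeEncodeError as in the question, while to_bytes(text) succeeds because UTF-8 can represent the character. The parsed data itself is fine; only the conversion to ASCII on output breaks.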

2 Comments

How would you resolve this if you can't use the print statement? (see more here: stackoverflow.com/questions/7769745/…)
You don't need to; that's the whole point. It's only a problem when you're outputting to the console.
