Trouble parsing HTML containing Unicode characters with Beautiful Soup

Question

Beautiful Soup doesn't seem to work properly (for me) when the HTML contains non-ASCII characters, i.e. Unicode characters whose code point is 128 or above. What decoding or encoding should I use for this?

import BeautifulSoup

raw = open('index.html').read()
BeautifulSoup.BeautifulSoup(raw)

Error

...stacktrace... UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8094: ordinal not in range(128)

Daniel Roseman · Accepted Answer · 2011-10-14 15:24:31Z

The problem is not with parsing the file. Using the link you gave in your comment to Marco, doing soup = BeautifulSoup(urllib.urlopen(your_link)) works absolutely fine.

The problem only appears when you try to display the parsed data in the console: it has been converted to Unicode, and Python will try to output it as ASCII unless you tell it otherwise. So running print soup, rather than just typing soup at the console prompt, will work.
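To illustrate the distinction between the text itself and the encoding used to output it, here is a minimal sketch that uses only the standard library. The sample string and the helper name to_ascii_or_none are hypothetical stand-ins for text pulled out of a parsed page; they are not part of the original question or of Beautiful Soup. The sketch assumes the failure comes from an implicit conversion to ASCII, as the error message suggests.

```python
import io

# Hypothetical stand-in for text extracted from a parsed page;
# u'\xe9' is the character named in the error message.
text = u'caf\xe9'


def to_ascii_or_none(s):
    """Return s encoded as ASCII, or None if it holds non-ASCII characters."""
    try:
        return s.encode('ascii')
    except UnicodeEncodeError:
        return None


# An ASCII conversion of this text fails, which mirrors the reported error.
ascii_bytes = to_ascii_or_none(text)

# Encoding explicitly to UTF-8 succeeds for any Unicode text.
utf8_bytes = text.encode('utf-8')

# Writing through a stream with an explicit encoding avoids relying on
# whatever default the console would apply.
buffer = io.BytesIO()
writer = io.TextIOWrapper(buffer, encoding='utf-8')
writer.write(text)
writer.flush()
written = buffer.getvalue()
```

The parsed data is fine in every case here. Only the step that turns Unicode text into bytes needs an encoding that can represent the characters.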

2 Comments

Marco L. Over a year ago

How would you resolve this if you can't use the print statement? (See more here: stackoverflow.com/questions/7769745/…)

Daniel Roseman Over a year ago

You don't need to, that's the whole point. It's only a problem when you're outputting in the console.
