Python beautiful soup encoding

Question

I have a problem with Beautiful Soup. I am trying to get rid of HTML tags in a string, so I have the following function:

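def cleanHtml(self, html):
    try:
        soup = BeautifulSoup(html)
        # Collect every text node, dropping the tags.
        content = soup.findAll(text=True)
        return ''.join(content)
    except Exception:
        # Parsing failed: show the offending input (the method then returns None).
        print html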

When I now do:

print {'title' : string_with_german_umlauts}
print {'title' : self.cleanHtml(string_with_german_umlauts)}

I get the following output for the string 'Leder Gürtel' (meaning leather belt):

{'title': 'Leder G\xc3\xbcrtel'}
{'title': u'Leder G\xfcrtel'}

The right encoding for the umlaut 'ü' is of course \xc3\xbc. After trying the whole day to get this working, I give up and ask ;-)

I appreciate any help. Thanks.

In case this helps you: 'G\xc3\xbcrtel' is a byte-string and u'G\xfcrtel' is a codepoint-string ("Unicode string") and is equivalent to u'G\u00fcrtel'. 'G\xc3\xbcrtel'.decode('UTF-8') returns u'G\u00fcrtel'. While debugging, consider at each step whether the data is in the form of bytes or codepoints, and when converting between one and the other, consider what encoding is being used.
– wberry, Commented Jan 31, 2012 at 19:15
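
The distinction in this comment can be checked directly. The following is a minimal sketch written for Python 3, which is an assumption on my part: the thread uses Python 2, where the same values are spelled 'G\xc3\xbcrtel' and u'G\xfcrtel'. The variable names are illustrative only.

```python
# Python 3 sketch: bytes vs. code points for the word "Guertel" with an umlaut.
utf8_bytes = b'G\xc3\xbcrtel'         # UTF-8 encoded byte string
text = utf8_bytes.decode('utf-8')     # decode bytes -> text (code points)

print(ascii(text))                    # 'G\xfcrtel'
print(text == 'G\u00fcrtel')          # True: \xfc and \u00fc name the same code point
print(text.encode('utf-8') == utf8_bytes)  # True: encoding gives the bytes back
print(len(utf8_bytes), len(text))     # 7 6 (the umlaut takes two bytes in UTF-8)
```

So \xc3\xbc and \xfc are not competing encodings of the same kind: the first is the UTF-8 byte sequence, the second is the code point itself.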

inspectorG4dget · Accepted Answer · 2012-01-30 17:54:30Z


The umlaut in your result is expected behavior: Beautiful Soup returns Unicode strings. What is the problem here? Is it that you are not seeing the umlaut when you print the dictionary? If so, that is not an issue at all. Printing a dictionary shows the escaped representation of its values, and the umlaut is properly visible when you print the string itself:

>>> d = {'title': u'Leder G\xfcrtel'}
>>> for k in d:
...     print k, d[k]
...
title Leder Gürtel

Hope this helps
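
The same point, that the escape is only a display form, can be sketched in Python 3. This is an illustration under an assumption: Python 3's repr() no longer escapes ü, so ascii() stands in here for the Python 2 repr() that produced the output in the question.

```python
d = {'title': 'Leder G\xfcrtel'}

# Escaped display form, comparable to what Python 2 shows inside a dict.
escaped = ascii(d['title'])
print(escaped)            # 'Leder G\xfcrtel'

# The string itself holds a single code point for the umlaut.
print(len(d['title']))    # 12
```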


5 Comments

inspectorG4dget Over a year ago

I'm no expert on MongoDB, but you might want to mention that in your question, so that someone who /is/ well versed in Mongo will pick up on it and help you out. Also, you don't mention what the actual problem is.

thesonix Over a year ago

I'm inserting the dict into my MongoDB. This is why I see the strange behaviour, e.g. the 端 character (\u7aef) instead of an 'ü'.

thesonix Over a year ago

Thanks. But what kind of encoding is \u7aef? UTF-8 is \xc3\xbc.

inspectorG4dget Over a year ago

Not quite sure what \u7aef is. Perhaps you're manipulating the string somewhere else - that could lead to a misinterpretation of what's going into mongo

thesonix Over a year ago

Yes, I guess it's definitely a Python => pymongo problem. Python itself handles the encoding correctly.
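
On the open question of where \u7aef comes from: one possible explanation, not confirmed anywhere in this thread, is that the UTF-8 bytes \xc3\xbc were decoded as EUC-JP somewhere along the way, because in that encoding the same two bytes map to 端. A Python 3 sketch of that hypothesis follows; the EUC-JP step is a guess about the pipeline, not an established cause.

```python
# Hypothesis only: UTF-8 bytes of the umlaut misread as EUC-JP.
utf8_bytes = '\u00fc'.encode('utf-8')    # b'\xc3\xbc'
misread = utf8_bytes.decode('euc_jp')    # same bytes, wrong codec
print(ascii(misread))                    # '\u7aef'

# Undoing the wrong decode recovers the original character.
repaired = misread.encode('euc_jp').decode('utf-8')
print(ascii(repaired))                   # '\xfc'
```

If this is what happens, the place to look is whichever step decodes or transcodes the byte string before it reaches the database, for example a locale or terminal setting.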
