I have a problem with Beautiful Soup. I am trying to get rid of HTML tags in a string, so I wrote the following function:
    def cleanHtml(self, html):
        try:
            soup = BeautifulSoup(html)
            # Collect every text node and join them into one string
            content = soup.findAll(text=True)
            return ''.join(content)
        except:
            print html
When I now do:

    print {'title' : string_with_german_umlauts}
    print {'title' : self.cleanHtml(string_with_german_umlauts)}
I get the following output for the string 'Leder Gürtel' (meaning "leather belt"):

    {'title': 'Leder G\xc3\xbcrtel'}
    {'title': u'Leder G\xfcrtel'}
The right encoding is of course \xc3\xbc for the umlaut 'ü'. After trying all day to get this working, I give up and ask ;-)
I appreciate any help. Thanks!
'G\xc3\xbcrtel' is a byte string and u'G\xfcrtel' is a codepoint string ("Unicode string"), equivalent to u'G\u00fcrtel'. 'G\xc3\xbcrtel'.decode('UTF-8') returns u'G\u00fcrtel'. While debugging, consider at each step whether the data is in the form of bytes or codepoints, and when converting between one and the other, consider which encoding is being used.
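The round trip between the two forms can be sketched as follows. This is only an illustration, written so that it should run on both Python 2 and Python 3 (where the two types are called `bytes` and `str`). It assumes the input really is UTF-8, as in the question, and `to_utf8_bytes` is a hypothetical helper, not part of Beautiful Soup.

```python
# Byte string: the UTF-8 encoding of the word from the question.
raw = b'G\xc3\xbcrtel'

# Decoding turns bytes into codepoints; U+00FC is the single
# codepoint for the u-umlaut.
text = raw.decode('utf-8')
assert text == u'G\u00fcrtel'

# Two bytes became one codepoint, so the lengths differ.
assert len(raw) == 7
assert len(text) == 6

# Encoding goes the other way and recovers the original bytes.
assert text.encode('utf-8') == raw


def to_utf8_bytes(value):
    """Hypothetical helper: return UTF-8 bytes for text or bytes input."""
    if isinstance(value, bytes):
        return value  # already bytes, assumed to be UTF-8
    return value.encode('utf-8')
```

Applied to the question: the value returned by `cleanHtml` is a Unicode string, so calling `.encode('utf-8')` on it would give back the byte string 'Leder G\xc3\xbcrtel' if bytes are what the rest of the program expects.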