I am using urllib2.urlopen to fetch a URL and get header information like 'charset', 'content-length'.
But some pages set their charset in a meta tag like
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
And urllib2 doesn't parse this for me.
Is there any built-in tool I can use to get http-equiv information?
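One possible direction, as a rough sketch only: the standard library's HTMLParser can pick out the meta tag without any third-party package. The names MetaCharsetParser and find_meta_charset are made up for illustration, and the sketch assumes the page source has already been decoded to a string.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class MetaCharsetParser(HTMLParser):
    """Hypothetical helper: remembers the charset of the first matching meta tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not attribute values.
        if tag != 'meta' or self.charset is not None:
            return
        attrs = dict(attrs)
        if (attrs.get('http-equiv') or '').lower() != 'content-type':
            return
        for frag in (attrs.get('content') or '').split(';'):
            name, sep, value = frag.strip().partition('=')
            if sep and name.strip().lower() == 'charset':
                self.charset = value.strip().lower()
                return


def find_meta_charset(page_source):
    """Return the charset declared via http-equiv, or None if there is none."""
    parser = MetaCharsetParser()
    parser.feed(page_source)
    parser.close()
    return parser.charset
```

Self-closing tags such as the one above also go through handle_starttag, because the default handle_startendtag forwards to it.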
EDIT:
This is what I do to parse the charset from a page:

    import lxml.html

    def get_meta_charset(page_source):  # function name is illustrative
        elem = lxml.html.fromstring(page_source)
        # Lowercase @http-equiv with translate() so the match is case-insensitive.
        content_type = elem.xpath(
            ".//meta[translate(@http-equiv, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz')='content-type']/@content")
        if content_type:
            content_type = content_type[0]
            for frag in content_type.split(';'):
                frag = frag.strip().lower()
                i = frag.find('charset=')
                if i > -1:
                    return frag[i+8:]  # 8 == len('charset=')
        return None
How can I improve this? Can I precompile the xpath query?
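On precompiling: as far as I can tell, lxml.etree.XPath compiles an expression once into a callable object that can then be applied to any element. Below is a tentative sketch of how that might look here. The names FIND_CONTENT_TYPE and charset_from_meta are hypothetical, and I swapped find() for startswith(), which is a small change in behavior rather than part of the original code.

```python
import lxml.etree
import lxml.html

# Compiled once at import time, then reused for every page.
FIND_CONTENT_TYPE = lxml.etree.XPath(
    ".//meta[translate(@http-equiv, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', "
    "'abcdefghijklmnopqrstuvwxyz')='content-type']/@content")


def charset_from_meta(page_source):
    """Return the charset from a meta http-equiv tag, or None if absent."""
    elem = lxml.html.fromstring(page_source)
    # Calling the compiled XPath object evaluates it against the element.
    for content_type in FIND_CONTENT_TYPE(elem):
        for frag in content_type.split(';'):
            frag = frag.strip().lower()
            if frag.startswith('charset='):
                return frag[len('charset='):]
    return None
```

Whether this is noticeably faster probably depends on how many pages are processed, since parsing the HTML likely dominates the cost of compiling one short expression.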