
I am using urllib2.urlopen to fetch a URL and read header information such as 'charset' and 'content-length', roughly like this (example.com standing in for the real URL):
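
import urllib2

f = urllib2.urlopen("http://example.com")
info = f.info()  # the HTTP response headers (an httplib.HTTPMessage)
print(info.getheader('content-type'))    # e.g. 'text/html; charset=utf-8'
print(info.getheader('content-length'))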

But some pages set their charset in the markup instead, with something like

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

And urllib2 doesn't parse this for me.

Is there any built-in tool I can use to get http-equiv information?

EDIT:

This is what I do to parse the charset from a page:

import lxml.html

def parse_charset(page_source):
    elem = lxml.html.fromstring(page_source)
    content_type = elem.xpath(
        ".//meta[translate(@http-equiv, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz')='content-type']/@content")
    if content_type:
        content_type = content_type[0]
        for frag in content_type.split(';'):
            frag = frag.strip().lower()
            i = frag.find('charset=')
            if i > -1:
                return frag[i+8:]  # 8 == len('charset=')
    return None

How can I improve this? Can I precompile the xpath query?

1 Comment

BeautifulSoup could handle this, but there should be a better way.

4 Answers


Find 'http-equiv' using BeautifulSoup

import urllib2
from BeautifulSoup import BeautifulSoup

f = urllib2.urlopen("http://example.com")
soup = BeautifulSoup(f)  # trust BeautifulSoup to detect the encoding
for meta in soup.findAll('meta', attrs={
    'http-equiv': lambda x: x and x.lower() == 'content-type'}):
    print("content-type: %r" % meta['content'])
    break
else:
    print('no content-type found')

# NOTE: strings in the soup are Unicode, but we can ask about the charset
#       declared in the HTML
print("encoding: %s" % (soup.declaredHTMLEncoding,))

Yeah! Any HTML parsing library would help.

BeautifulSoup is a pure-Python library based on sgmllib; lxml is a more efficient alternative written in C.

Try either one. They will solve your problem.


I need to parse this as well (among other things) for my online http fetcher. I use lxml to parse pages and get the meta equiv headers, roughly as follows:

    from lxml.html import parse

    doc = parse(url)
    nodes = doc.findall("//meta")
    for node in nodes:
        name = node.attrib.get('name')
        id = node.attrib.get('id')
        equiv = node.attrib.get('http-equiv')
        # guard: not every <meta> tag has an http-equiv attribute
        if equiv is not None and equiv.lower() == 'content-type':
            ... do your thing ... 

You can do a much fancier query to directly fetch the appropriate tag (by specifying the name= in the query), but in my case I'm parsing all meta tags. I'll leave this as an exercise for you; here is the relevant lxml documentation.
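
As for the "Can I precompile the xpath query?" part of the question: yes, lxml lets you compile an XPath expression once with lxml.etree.XPath and reuse the compiled object. A minimal sketch of the question's lookup in that style (the helper name is just for illustration):

import lxml.html
from lxml.etree import XPath

# compiled once, reusable across pages
_content_type_xpath = XPath(
    ".//meta[translate(@http-equiv, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz')='content-type']/@content")

def charset_from_source(page_source):
    elem = lxml.html.fromstring(page_source)
    for content in _content_type_xpath(elem):
        for frag in content.split(';'):
            frag = frag.strip().lower()
            if frag.startswith('charset='):
                return frag[len('charset='):]
    return None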

BeautifulSoup is considered somewhat deprecated and is no longer actively developed.

1 Comment

I'm using lxml as you do, but I use fromstring instead of parse, so I don't have to decode the page source. When I use parse on a page encoded as gb2312, it raises a UnicodeDecodeError.
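
A minimal sketch of that approach (the URL is a placeholder): passing the raw, undecoded bytes lets lxml pick up the encoding declared in the page itself:

import urllib2
import lxml.html

raw = urllib2.urlopen("http://example.com/gb2312-page").read()  # undecoded bytes
elem = lxml.html.fromstring(raw)  # lxml detects the declared encoding itself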

Building your own HTML parser is much harder than you think, so like the previous answers I suggest using a library to do it. But instead of BeautifulSoup or lxml I would suggest html5lib. It is the parser that best mimics how a browser parses the page, for instance with respect to encoding:

Parsed trees are always Unicode. However a large variety of input encodings are supported. The encoding of the document is determined in the following way:

The encoding may be explicitly specified by passing the name of the encoding as the encoding parameter to HTMLParser.parse

If no encoding is specified, the parser will attempt to detect the encoding from a <meta> element in the first 512 bytes of the document (this is only a partial implementation of the current HTML 5 specification)

If no encoding can be found and the chardet library is available, an attempt will be made to sniff the encoding from the byte pattern

If all else fails, the default encoding (usually Windows-1252) will be used

From: http://code.google.com/p/html5lib/wiki/UserDocumentation

1 Comment

The examples are the same as the lxml example above; you just parse with html5lib first, and then use lxml on the result. The reason for this is that the html5lib folks didn't want to "hardcode" a specific API, so they supply several parallel ones (called treebuilders); lxml is one of them.
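
A minimal sketch of that combination, assuming html5lib's lxml treebuilder is available (example.com is a placeholder):

import urllib2
import html5lib
from html5lib import treebuilders

# html5lib does the browser-style encoding detection described above;
# the lxml treebuilder hands back an lxml tree, so the lxml-based
# queries from the other answers work on the result
parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("lxml"))
doc = parser.parse(urllib2.urlopen("http://example.com"))

# html5lib puts elements in the XHTML namespace, hence local-name()
for meta in doc.xpath("//*[local-name()='meta']"):
    equiv = meta.get('http-equiv')
    if equiv is not None and equiv.lower() == 'content-type':
        print(meta.get('content'))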
