
I am using urllib2.urlopen to fetch a URL and read header information such as 'charset' and 'content-length', roughly like this (example.com standing in for the real URL):
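
import urllib2

f = urllib2.urlopen("http://example.com")
info = f.info()  # the HTTP response headers (an httplib.HTTPMessage)
print(info.getheader('content-type'))    # e.g. 'text/html; charset=utf-8'
print(info.getheader('content-length'))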

But some pages set their charset in the markup instead, with something like

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

And urllib2 doesn't parse this for me.

Is there any built-in tool I can use to get http-equiv information?

EDIT:

This is what I do to parse the charset from a page:

import lxml.html

def parse_charset(page_source):
    elem = lxml.html.fromstring(page_source)
    content_type = elem.xpath(
        ".//meta[translate(@http-equiv, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz')='content-type']/@content")
    if content_type:
        content_type = content_type[0]
        for frag in content_type.split(';'):
            frag = frag.strip().lower()
            i = frag.find('charset=')
            if i > -1:
                return frag[i+8:]  # 8 == len('charset=')
    return None

How can I improve this? Can I precompile the xpath query?

1 Comment

BeautifulSoup could handle this, but there should be a better way.

4 Answers


Find 'http-equiv' using BeautifulSoup

import urllib2
from BeautifulSoup import BeautifulSoup

f = urllib2.urlopen("http://example.com")
soup = BeautifulSoup(f)  # trust BeautifulSoup to detect the encoding
for meta in soup.findAll('meta', attrs={
    'http-equiv': lambda x: x and x.lower() == 'content-type'}):
    print("content-type: %r" % meta['content'])
    break
else:
    print('no content-type found')

# NOTE: strings in the soup are Unicode, but we can ask about the charset
#       declared in the HTML
print("encoding: %s" % (soup.declaredHTMLEncoding,))

Yeah! Any HTML parsing library would help.

BeautifulSoup is a pure-Python library based on sgmllib; lxml is a more efficient alternative written in C.

Try either one. They will solve your problem.


I need to parse this as well (among other things) for my online http fetcher. I use lxml to parse pages and get the meta equiv headers, roughly as follows:

    from lxml.html import parse

    doc = parse(url)
    nodes = doc.findall("//meta")
    for node in nodes:
        name = node.attrib.get('name')
        id = node.attrib.get('id')
        equiv = node.attrib.get('http-equiv')
        # guard: not every <meta> tag has an http-equiv attribute
        if equiv is not None and equiv.lower() == 'content-type':
            ... do your thing ... 

You can do a much fancier query to directly fetch the appropriate tag (by specifying the name= in the query), but in my case I'm parsing all meta tags. I'll leave this as an exercise for you; here is the relevant lxml documentation.
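
As for the "Can I precompile the xpath query?" part of the question: yes, lxml lets you compile an XPath expression once with lxml.etree.XPath and reuse the compiled object. A minimal sketch of the question's lookup in that style (the helper name is just for illustration):

import lxml.html
from lxml.etree import XPath

# compiled once, reusable across pages
_content_type_xpath = XPath(
    ".//meta[translate(@http-equiv, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz')='content-type']/@content")

def charset_from_source(page_source):
    elem = lxml.html.fromstring(page_source)
    for content in _content_type_xpath(elem):
        for frag in content.split(';'):
            frag = frag.strip().lower()
            if frag.startswith('charset='):
                return frag[len('charset='):]
    return None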

BeautifulSoup is considered somewhat deprecated and is no longer actively developed.

1 Comment

I'm using lxml as you do, but I use fromstring instead of parse, so I don't have to decode the page source. When I use parse on a page encoded as gb2312, it raises a UnicodeDecodeError.
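
A minimal sketch of that approach (the URL is a placeholder): passing the raw, undecoded bytes lets lxml pick up the encoding declared in the page itself:

import urllib2
import lxml.html

raw = urllib2.urlopen("http://example.com/gb2312-page").read()  # undecoded bytes
elem = lxml.html.fromstring(raw)  # lxml detects the declared encoding itself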

Building your own HTML parser is much harder than you think, so like the previous answers I suggest using a library to do it. But instead of BeautifulSoup or lxml I would suggest html5lib. It is the parser that best mimics how a browser parses the page, for instance with respect to encoding:

Parsed trees are always Unicode. However a large variety of input encodings are supported. The encoding of the document is determined in the following way:

The encoding may be explicitly specified by passing the name of the encoding as the encoding parameter to HTMLParser.parse

If no encoding is specified, the parser will attempt to detect the encoding from a <meta> element in the first 512 bytes of the document (this is only a partial implementation of the current HTML 5 specification)

If no encoding can be found and the chardet library is available, an attempt will be made to sniff the encoding from the byte pattern

If all else fails, the default encoding (usually Windows-1252) will be used

From: http://code.google.com/p/html5lib/wiki/UserDocumentation

1 Comment

The examples are the same as the lxml example above; you just parse with html5lib first, and then use lxml on the result. The reason for this is that the html5lib folks didn't want to "hardcode" a specific API, so they supply several parallel ones (called treebuilders); lxml is one of them.
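
A minimal sketch of that combination, assuming html5lib's lxml treebuilder is available (example.com is a placeholder):

import urllib2
import html5lib
from html5lib import treebuilders

# html5lib does the browser-style encoding detection described above;
# the lxml treebuilder hands back an lxml tree, so the lxml-based
# queries from the other answers work on the result
parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("lxml"))
doc = parser.parse(urllib2.urlopen("http://example.com"))

# html5lib puts elements in the XHTML namespace, hence local-name()
for meta in doc.xpath("//*[local-name()='meta']"):
    equiv = meta.get('http-equiv')
    if equiv is not None and equiv.lower() == 'content-type':
        print(meta.get('content'))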
