
I am trying to write a web crawler in Python. I want to check that the page I am about to crawl is an HTML page and not a file like .pdf/.doc/.docx etc. I do not want to check for the .html extension, because asp/aspx pages, or pages like http://bing.com/travel/, do not have .html extensions explicitly but are still HTML pages. Is there a good way to do this in Python?

  • MIME types are a good bet, I think.
  • Without loading any page data? Sounds hard. Otherwise, why not just check the Content-Type header, or read the first few bytes and see if it looks like HTML? (starts with a <!DOCTYPE> or <html>, for instance)
  • It is OK to load page data. I just tried the regular expression .*<.*html.*>.* to check the first few bytes, since a page can start with <!DOCTYPE html>... or <html>, but re.match just goes into an infinite loop for some pages. (See the prefix-check sketch after these comments.)
  • How accurate do you want to be? You could choose to trust the Content-Type header or not. You could try to parse the HTML, or not. There is no "right" way to do this; it depends on how accurate/fast you want the HTML check to actually be.
  • I would prefer something that works accurately. If I can do it without reading the content, that would be even better.
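
A note on that regex: .*<.*html.*>.* can backtrack heavily on large pages with no match, which can look like an infinite loop. A plain prefix check on the first bytes avoids regexes entirely (a sketch; starts_like_html is a hypothetical helper name):

def starts_like_html(first_bytes):
    # Cheap sniff: does the body begin with an HTML doctype or an <html> tag?
    head = first_bytes.lstrip().lower()
    return head.startswith(b'<!doctype html') or head.startswith(b'<html')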

2 Answers


This gets the header only from the server:

import urllib2  # Python 2; in Python 3 this became urllib.request
url = 'http://www.kernel.org/pub/linux/kernel/v3.0/testing/linux-3.7-rc6.tar.bz2'
req = urllib2.Request(url)
req.get_method = lambda: 'HEAD'  # send a HEAD request instead of a GET
response = urllib2.urlopen(req)
content_type = response.headers.getheader('Content-Type')
print(content_type)

prints

application/x-bzip2

From which you could conclude this is not HTML. You could use

'html' in content_type

to programmatically test if the content is HTML (or possibly XHTML). If you wanted to be even more sure the content is HTML you could download the contents and try to parse it with an HTML parser like lxml or BeautifulSoup.
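
A rough sketch of that stricter check, assuming Python 3 (where urllib2 became urllib.request) and that beautifulsoup4 is installed; looks_like_html is a hypothetical helper name:

import urllib.request

from bs4 import BeautifulSoup

def looks_like_html(url):
    # HEAD request first: cheap, no body transferred
    req = urllib.request.Request(url, method='HEAD')
    with urllib.request.urlopen(req) as response:
        content_type = response.headers.get('Content-Type', '')
    if 'html' not in content_type.lower():
        return False
    # Stricter check: download the body and confirm it parses to a
    # document with an <html> root element
    with urllib.request.urlopen(url) as response:
        body = response.read()
    return BeautifulSoup(body, 'html.parser').find('html') is not None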

Beware of using requests.get like this:

import requests
r = requests.get(url)
print(r.headers['content-type'])

This takes a long time, and my network monitor shows a sustained load, leading me to believe it is downloading the entire file, not just the header.
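
If you do need requests.get, passing stream=True should avoid that: the headers arrive immediately, and the body is not downloaded until you access r.content (a sketch, relying on requests' documented streaming behavior):

import requests

r = requests.get(url, stream=True)
# Headers are available here; the body has not been fetched yet
print(r.headers['content-type'])
r.close()  # release the connection without downloading the body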

On the other hand,

import requests
r = requests.head(url)
print(r.headers['content-type'])

gets the header only.
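
One caveat: requests' head() does not follow redirects by default, unlike get(), so a URL that redirects will give you the redirect response's headers rather than the final page's. Passing allow_redirects=True restores the usual behavior:

import requests

# head() defaults to allow_redirects=False, unlike get()
r = requests.head(url, allow_redirects=True)
print(r.headers['content-type'])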



Don't bother with what the standard library throws at you; rather, try requests.

>>> import requests
>>> r = requests.get("http://www.google.com")
>>> r.headers['content-type']
'text/html; charset=ISO-8859-1'
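
To turn this into the kind of check the question asks for, a HEAD request is enough, since only the headers matter (a sketch; is_html_page is a hypothetical helper name):

import requests

def is_html_page(url):
    # Treat anything whose Content-Type mentions 'html' as an HTML page
    r = requests.head(url, allow_redirects=True)
    return 'html' in r.headers.get('content-type', '').lower()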

