How do I search for text in a page using regular expressions in Python?

Question

I'm trying to create a simple module for phenny, a simple IRC bot framework in Python. The module is supposed to go to http://www.isup.me/websitetheuserrequested to check whether a website is up or down. I assumed I could use regex for the module, seeing as other built-in modules use it too, so I tried creating this simple script, although I don't think I did it right.

import re, urllib
import web

isupuri = 'http://www.isup.me/%s'
check = re.compile(r'(?ims)<span class="body">.*?</span>')

def isup(phenny, input):
    global isupuri
    global cleanup

    bytes = web.get(isupuri)
    quote = check.findall(bytes)
    result = re.sub(r'<[^>]*?>', '', str(quote[0]))
    phenny.say(result)

isup.commands = ['isup']
isup.priority = 'low'
isup.example = '.isup google.com'

It imports the required web packages (I think), and defines the string and the text to look for within the page. I really don't know what I did in those four lines, I kinda just ripped the code off another phenny module.

Here is an example of a quotes module that grabs a random quote from some webpage, I kinda tried to use that as a base: http://pastebin.com/vs5ypHZy

Does anyone know what I am doing wrong? If something needs to be clarified, I can explain further; I don't think I explained this well enough.

Here is the error I get:

Traceback (most recent call last):
  File "C:\phenny\bot.py", line 189, in call
    try: func(phenny, input)
  File "C:\phenny\modules\isup.py", line 18, in isup
    result = re.sub(r'<[^>]*?>', '', str(quote[0]))
IndexError: list index out of range

What exactly isn't working for you? Does the program not run? Is the result wrong?
– João Portela, Commented Jan 3, 2012 at 15:12
Also, why do you need isup.me? Why don't you do an HTTP HEAD request to check if the site is up?
– João Portela, Commented Jan 3, 2012 at 15:12
I added the error that I get when the command is executed. And I never knew I could use HTTP HEAD, even though I'm not sure what it is.
– Alex, Commented Jan 3, 2012 at 15:15
You don't need the global statements, so long as you're not assigning to those names within the function. I'd also recommend that you capitalize your constants (e.g., ISUPURI instead of isupuri), so people (and you) know not to mess with them.
– Edwin, Commented Jan 3, 2012 at 15:30

João Portela · Accepted Answer · 2012-01-03 15:49:18Z

1

Try this (from http://docs.python.org/release/2.6.7/library/httplib.html#examples):

import httplib
conn = httplib.HTTPConnection("www.python.org")
conn.request("HEAD","/index.html")
res = conn.getresponse()
if res.status >= 200 and res.status < 300:
    print "up"
else:
    print "down"

You will also need to add code to follow redirects before checking the response status.
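
One way the redirect handling could look is sketched below. It is written for Python 3, where httplib is named http.client, and it is only an illustration: the helper names head_status and is_up, the list of redirect codes, and the limit of five redirects are assumptions, not part of the original example.

```python
import http.client
from urllib.parse import urljoin, urlsplit

# Status codes treated as redirects in this sketch (an assumption).
REDIRECT_CODES = (301, 302, 303, 307, 308)


def is_up(status):
    """Treat any 2xx status as "up", matching the check shown above."""
    return 200 <= status < 300


def head_status(url, max_redirects=5, timeout=10):
    """Send HEAD requests, following Location headers, and return the final status."""
    for _ in range(max_redirects + 1):
        parts = urlsplit(url)
        conn_class = (http.client.HTTPSConnection if parts.scheme == "https"
                      else http.client.HTTPConnection)
        conn = conn_class(parts.netloc, timeout=timeout)
        try:
            # Request "/" when the URL has no path, and keep any query string.
            path = parts.path or "/"
            if parts.query:
                path += "?" + parts.query
            conn.request("HEAD", path)
            res = conn.getresponse()
            status, location = res.status, res.getheader("Location")
        finally:
            conn.close()
        if status in REDIRECT_CODES and location:
            url = urljoin(url, location)  # Location may be a relative URL
            continue
        return status
    raise RuntimeError("too many redirects")
```

It could then be called as is_up(head_status("http://www.python.org/")). Requesting "/" rather than "/index.html" avoids reporting a site as down just because it has no index.html page.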

edit

Alternative that does not need to handle redirects but uses exceptions for logic:

import urllib2
request = urllib2.Request('http://google.com')
request.get_method = lambda : 'HEAD'

try:
    response = urllib2.urlopen(request)
    print "up"
    print response.code
except urllib2.URLError, e:
    # failure
    print "down"
    print e

You should do your own tests and choose the best one.

edited Jan 3, 2012 at 15:49

answered Jan 3, 2012 at 15:34

João Portela

6 Comments

Alex Over a year ago

This kinda works, I edited the "www.python.org" to whatever the user said, although now it says everything is down, I think because of the /index.html as some sites may not have this. How would I go about just checking the final page it redirects to?

Aaron Digulla Over a year ago

@Alex: Use the exact same URL which your browser uses (just copy it from the location bar).

Alex Over a year ago

Whenever I include http I get this error: InvalidURL: nonnumeric port:'//stackoverflow.com/questions/8714093/how-do-i-search-for-text-in-a-page-using-regular-expressions-in-python' (source unknown)

Alex Over a year ago

I tried the new edit, and it works... however when I don't include http:// it throws an error, I was going to add http:// to the user's query however if they already had http:// in their query it would cause another error... and also if the website is down it doesn't say anything at all.

João Portela Over a year ago

I assumed you need to check for HTTP availability, hence the need for the http prefix. You should do more tests to see whether you should also check that the response.code value is >= 200 and < 300.
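
A possible way to handle both problems raised in these comments (a missing http:// prefix, and no output when the site is down) is sketched below for Python 3, where urllib2 is split into urllib.request and urllib.error. The function names normalize_url and check_site are made up for illustration, and the 2xx range is the one suggested above.

```python
import urllib.error
import urllib.request


def normalize_url(site):
    """Prepend http:// only when the user did not already supply a scheme."""
    site = site.strip()
    if site.lower().startswith(("http://", "https://")):
        return site
    return "http://" + site


def check_site(site, timeout=10):
    """Return ('up' or 'down', detail) for the given site using a HEAD request."""
    request = urllib.request.Request(normalize_url(site), method="HEAD")
    try:
        response = urllib.request.urlopen(request, timeout=timeout)
    except urllib.error.URLError as exc:
        # URLError also covers HTTPError, which urlopen raises for 4xx/5xx replies.
        return "down", str(exc)
    code = response.getcode()
    response.close()
    return ("up" if 200 <= code < 300 else "down"), str(code)
```

Because check_site always returns a value, the bot has something to say in the "down" case as well, for example phenny.say("%s (%s)" % check_site(query)), where query stands for whatever text the user typed after the command.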

Aaron Digulla · Accepted Answer · 2012-01-03 15:51:09Z

0

The error means your regexp wasn't found anywhere on the page (the list quote has no element 0).
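
A small sketch of guarding against that case, reusing the two patterns from the question. It is written for Python 3, and the function name extract_body is an assumption for illustration.

```python
import re

# Same patterns as in the question.
CHECK = re.compile(r'(?ims)<span class="body">.*?</span>')
TAG = re.compile(r'<[^>]*?>')


def extract_body(html):
    """Return the text of the first matching span, or None if nothing matched."""
    matches = CHECK.findall(html)
    if not matches:  # findall returns an empty list when the pattern is absent
        return None
    return TAG.sub('', matches[0]).strip()
```

With a check like this the module can report "no result found" instead of raising IndexError. It may also be worth printing the downloaded page: the code in the question passes isupuri to web.get with the %s placeholder still in it, so the page that comes back may not be the one expected.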

answered Jan 3, 2012 at 15:51

Aaron Digulla

2 Comments

Alex Over a year ago

I thought r'(?ims)<span class="body">.*?</span>' would be valid regex, seeing as the result is found inside that HTML tag...

Aaron Digulla Over a year ago

It's valid (or you would have gotten an error compiling it). It just doesn't match anywhere on the page. That can mean the page is an empty string (nothing was downloaded or you got an error page) or that the regexp doesn't do what you think it should.
