0

I was wondering how I would extract the value of an HTML element using a regular expression (preferably in Python).

For example, <a href="http://google.com"> Hello World! </a>

What regex would I use to extract Hello World! from the above HTML?

6
  • 10
    Use python's html parsing abilities. (Right tool for the job.) Commented Oct 7, 2010 at 17:57
  • 6
    Please do not use regex to parse *ML. Anyone that suggests that you should is wrong. Commented Oct 7, 2010 at 17:58
  • You stand in grave danger of getting your IDLE privileges revoked if you so much as touch XML/HTML with a regular expression. Commented Oct 7, 2010 at 18:03
  • @Nick: Yes, ML derivatives (en.wikipedia.org/wiki/Category:ML_programming_language_family) are, like just about all programming languages, way too complex to be parsed by regexes ;) Commented Oct 7, 2010 at 18:27
  • thanks @Mark :) you're the only person who gave a straight answer! Commented Oct 7, 2010 at 20:13

3 Answers

8

Using regex to parse HTML has been covered extensively on SO. The consensus is that it shouldn't be done.

Here are some related links worth reading:

One trick I have used in the past to parse HTML files is to convert them to XHTML, treat the result as an XML file, and query it with XPath. If this is an option, look at:
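
As a rough sketch of the XHTML-then-XPath idea (an illustration only, not one of the linked resources): the snippet below assumes the markup has already been converted to well-formed XHTML by some other tool, and uses the standard library's xml.etree.ElementTree, which supports only a limited subset of XPath. The sample document and the second link are made up for the example, and it carries no XML namespace; real XHTML usually declares one, in which case the tag names in the query would need to be namespace-qualified.

```python
import xml.etree.ElementTree as ET

# Assumption: this string is already well-formed XHTML (every tag closed,
# attributes quoted). Raw real-world HTML would usually fail to parse here.
xhtml = (
    "<html><body>"
    '<a href="http://google.com"> Hello World! </a>'
    '<a href="http://example.com">Second link</a>'
    "</body></html>"
)

root = ET.fromstring(xhtml)

# ".//a" selects every <a> element at any depth below the root.
# itertext() also picks up text nested in child elements such as <b>.
anchor_texts = ["".join(a.itertext()).strip() for a in root.findall(".//a")]

print(anchor_texts)  # ['Hello World!', 'Second link']
```

A library with full XPath support would allow richer queries; the limited subset is enough for a simple "all anchors" lookup like this one.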

7

Regex + HTML...

But BeautifulSoup is a handy library.

>>> from BeautifulSoup import BeautifulSoup
>>> html = '<a href="http://google.com"> Hello World! </a>'
>>> soup = BeautifulSoup(html)
>>> soup.a.string
u' Hello World! '

This, for instance, would print out links on this page:

import urllib2
from BeautifulSoup import BeautifulSoup

q = urllib2.urlopen('https://stackoverflow.com/questions/3884419/')
soup = BeautifulSoup(q.read())

for link in soup.findAll('a'):
    if link.has_key('href'):
        print str(link.string) + " -> " + link['href']
    elif link.has_key('id'):
        print "ID: " + link['id']
    else:
        print "???"

Output:

Stack Exchange -> http://stackexchange.com
log in -> /users/login?returnurl=%2fquestions%2f3884419%2f
careers -> http://careers.stackoverflow.com
meta -> http://meta.stackoverflow.com
...
ID: flag-post-3884419
None -> /posts/3884419/revisions
...

3 Comments

If I had multiple links (<a href=""> blah blah </a>), that only seems to output the first link it comes across?
There are other methods. soup.findAll('a') for instance. See the documentation: crummy.com/software/BeautifulSoup/documentation.html
I keep hearing about BeautifulSoup but I didn't realize it actually had such a nice API... there are so many tools out there, but a lot of them are just atrocious to use. This is nice :) I've been doing my parsing in C# though.

0

Ideally you wouldn't use a regular expression - they are unsuitable for most parsing tasks, including HTML. Use a parsing library instead - I'm not an expert Python user, but I'm sure there's one to be had.
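
For what it's worth, Python's standard library does ship one: html.parser. The following is only a minimal sketch of how it could be used for the question's example, not a definitive implementation. The class name AnchorTextParser and the helper extract_links are hypothetical names invented for this illustration, and the sketch assumes <a> elements are not nested inside one another.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collect (href, text) pairs for every <a> element fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []       # finished (href, text) pairs
        self._href = None     # href of the <a> currently open, if any
        self._chunks = None   # text pieces seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._chunks = []

    def handle_data(self, data):
        # Only record text while we are inside an <a> element.
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._chunks is not None:
            text = "".join(self._chunks).strip()
            self.links.append((self._href, text))
            self._href = None
            self._chunks = None


def extract_links(html):
    """Hypothetical helper: return a list of (href, text) for all anchors."""
    parser = AnchorTextParser()
    parser.feed(html)
    parser.close()
    return parser.links


html = '<a href="http://google.com"> Hello World! </a>'
print(extract_links(html))  # [('http://google.com', 'Hello World!')]
```

Because the parser keeps appending to links, the same helper also returns every anchor when the input contains several of them, rather than just the first.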
