How do I use regular expressions to parse HTML tags?

Question

I was wondering how I would extract the value of an HTML element using a regular expression (preferably in Python).

For example, <a href="http://google.com"> Hello World! </a>

What regex would I use to extract Hello World! from the above HTML?

Use Python's HTML parsing abilities. (Right tool for the job.)
– Chris, Commented Oct 7, 2010 at 17:57
Please do not use regex to parse *ML. Anyone that suggests that you should is wrong.
– Nick T, Commented Oct 7, 2010 at 17:58
You stand in grave danger of getting your IDLE privileges revoked if you so much as touch XML/HTML with a regular expression.
– Manoj Govindan, Commented Oct 7, 2010 at 18:03
@Nick: Yes, ML derivatives (en.wikipedia.org/wiki/Category:ML_programming_language_family) are, like nearly all programming languages, way too complex to be parsed by regexes ;)
– user395760, Commented Oct 7, 2010 at 18:27
Thanks @Mark :) You're the only person who gave a straight answer!
– user179169, Commented Oct 7, 2010 at 20:13

Community · Accepted Answer · 2017-05-23 12:26:48Z

8

Using regex to parse HTML has been covered extensively on SO. The consensus is that it shouldn't be done.
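To make the objection concrete, here is a small sketch of how a typical "good enough" pattern breaks on perfectly valid markup. The pattern and the sample strings are illustrative assumptions, not taken from any of the linked discussions.

```python
import re

# A naive pattern of the kind often suggested for this task (assumed for illustration).
LINK_RE = re.compile(r'<a[^>]*>(.*?)</a>')

# The simple case from the question works as hoped.
simple = '<a href="http://google.com"> Hello World! </a>'
simple_result = LINK_RE.findall(simple)
print(simple_result)

# A '>' inside a quoted attribute value ends the tag match too early,
# so part of the attribute leaks into the captured "text".
tricky = '<a href="/x" title="a > b">Hello</a>'
tricky_result = LINK_RE.findall(tricky)
print(tricky_result)

# A link inside an HTML comment is not part of the rendered page,
# but the regex still reports it.
commented = '<!-- <a href="/old">Old link</a> -->'
commented_result = LINK_RE.findall(commented)
print(commented_result)
```

Each failure can be patched with a more elaborate pattern, but every patch tends to expose another case, which is why a real parser is usually the safer choice.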

Here are some related links worth reading:

One trick I have used in the past to parse HTML files is to convert them to XHTML, treat the result as an XML file, and query it with XPath. If this is an option, look at:
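As a rough illustration of this XHTML-plus-XPath approach (not the resource the answer pointed to), the sketch below uses Python's standard `xml.etree.ElementTree`, which supports a limited subset of XPath. It assumes the markup is already well-formed XHTML with no XML namespace declared; the sample document and variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Assumed input: markup that has already been converted to well-formed XHTML.
xhtml = (
    '<html><body>'
    '<p>Search: <a href="http://google.com"> Hello World! </a></p>'
    '<p>Other: <a href="http://example.com">Second link</a></p>'
    '</body></html>'
)

root = ET.fromstring(xhtml)

# ".//a" selects every <a> element at any depth below the root.
# itertext() also gathers text from nested inline elements, if any.
link_texts = ["".join(a.itertext()).strip() for a in root.findall('.//a')]
print(link_texts)

# Attribute predicates are part of ElementTree's XPath subset.
google = root.find('.//a[@href="http://google.com"]')
print(google.text.strip())
```

Real-world HTML is often not well-formed, so `ET.fromstring` would raise a `ParseError` on it; that is why the conversion step to XHTML matters for this approach.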

edited May 23, 2017 at 12:26

answered Oct 7, 2010 at 17:59

Abe Miessler

Community · Accepted Answer · 2017-05-23 12:04:10Z

7

Regex + HTML...

But BeautifulSoup is a handy library.

>>> from BeautifulSoup import BeautifulSoup
>>> html = '<a href="http://google.com"> Hello World! </a>'
>>> soup = BeautifulSoup(html)
>>> soup.a.string
u' Hello World! '

This, for instance, would print out links on this page:

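# Written for Python 2 with BeautifulSoup 3 (note urllib2 and the print statements).
import urllib2
from BeautifulSoup import BeautifulSoup

# Download the page and parse it into a tree.
q = urllib2.urlopen('https://stackoverflow.com/questions/3884419/')
soup = BeautifulSoup(q.read())

# Walk every <a> tag; not all of them carry an href.
for link in soup.findAll('a'):
    if link.has_key('href'):
        # link.string is None when the tag contains nested markup.
        print str(link.string) + " -> " + link['href']
    elif link.has_key('id'):
        print "ID: " + link['id']
    else:
        print "???"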

Output:

Stack Exchange -> http://stackexchange.com
log in -> /users/login?returnurl=%2fquestions%2f3884419%2f
careers -> http://careers.stackoverflow.com
meta -> http://meta.stackoverflow.com
...
ID: flag-post-3884419
None -> /posts/3884419/revisions
...

edited May 23, 2017 at 12:04

answered Oct 7, 2010 at 18:01

Nick T

3 Comments

user179169 Over a year ago

If I had multiple links (<a href=""> blah blah </a>), it only seems to output the first link it comes across?

Manoj Govindan Over a year ago

There are other methods. soup.findAll('a') for instance. See the documentation: crummy.com/software/BeautifulSoup/documentation.html

mpen Over a year ago

I keep hearing about BeautifulSoup but I didn't realize it actually had such a nice API... there are so many tools out there, but a lot of them are just atrocious to use. This is nice :) I've been doing my parsing in C# though.

Eamon Nerbonne · Accepted Answer · 2010-10-07 17:58:02Z

0

Ideally you wouldn't use a regular expression; they are unsuitable for most parsing tasks, including HTML. Use a parsing library. I'm not an expert Python user, but I'm sure there's one to be had.
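For what it's worth, Python does ship such a library: `html.parser.HTMLParser` in the standard library. Below is a minimal sketch of how it could pull the text out of every link; the `LinkTextParser` class and the sample markup are hypothetical names invented for this example.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect the text content of every <a> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []        # finished link texts, in document order
        self._in_link = False  # True while between <a> and </a>
        self._chunks = []      # text fragments of the current link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._chunks = []

    def handle_data(self, data):
        # Text inside nested inline tags such as <b> arrives here too.
        if self._in_link:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.links.append("".join(self._chunks).strip())
            self._in_link = False


parser = LinkTextParser()
parser.feed(
    '<a href="http://google.com"> Hello World! </a>'
    '<a href="http://example.com"><b>Second</b> link</a>'
)
parser.close()
links = parser.links
print(links)
```

The parser is event-driven, so it handles multiple links and nested inline markup without any pattern matching against the raw string.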

answered Oct 7, 2010 at 17:58

Eamon Nerbonne
