I was wondering how I would extract the value of an HTML element using a regular expression (preferably in Python).
For example, <a href="http://google.com"> Hello World! </a>
What regex would I use to extract Hello World! from the above HTML?
Using regex to parse HTML has been covered extensively on SO. The consensus is that it shouldn't be done.
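That said, for the single, well-formed tag in the question a regex can pull the text out. The following is only a rough sketch: the pattern and the name ANCHOR_TEXT are my own, and it assumes no nested <a> tags and no '>' inside attribute values, which real-world HTML does not guarantee.

```python
import re

html = '<a href="http://google.com"> Hello World! </a>'

# Non-greedy match of everything between an opening <a ...> and the next </a>.
# Assumption: no nested <a> tags and no '>' inside attribute values.
ANCHOR_TEXT = re.compile(r'<a\b[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)

match = ANCHOR_TEXT.search(html)
text = match.group(1).strip() if match else None
print(text)  # Hello World!
```

Anything beyond this toy case (nested tags, comments, malformed markup) is where the regex approach tends to fall apart, which is why a parser is usually recommended instead.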
Here are some related links worth reading:
One trick I have used in the past to parse HTML files is to convert them to XHTML, treat the result as an XML file, and query it with XPath. If this is an option, look at:
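A minimal sketch of the XML-plus-XPath idea using only Python's standard library. The choice of xml.etree.ElementTree is my assumption, not necessarily the tool meant above, and the sample document is made up. It also assumes the markup is already well-formed and carries no XML namespace; real XHTML usually declares one, in which case the tag names in the path need a namespace prefix.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already well-formed input (no namespace declared).
xhtml = (
    '<html><body>'
    '<p><a href="http://google.com"> Hello World! </a></p>'
    '</body></html>'
)

root = ET.fromstring(xhtml)

# ElementTree supports a limited XPath subset; './/a[@href]' selects every
# <a> descendant that has an href attribute.
link_texts = [(a.text or '').strip() for a in root.findall('.//a[@href]')]
print(link_texts)  # ['Hello World!']
```

ET.fromstring raises a ParseError on markup that is not well-formed, so this only works after the HTML-to-XHTML conversion step has succeeded.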
But BeautifulSoup is a handy library.
>>> from BeautifulSoup import BeautifulSoup
>>> html = '<a href="http://google.com"> Hello World! </a>'
>>> soup = BeautifulSoup(html)
>>> soup.a.string
u' Hello World! '
This, for instance, would print out links on this page:
import urllib2
from BeautifulSoup import BeautifulSoup

# Python 2 with BeautifulSoup 3, as in the session above.
q = urllib2.urlopen('https://stackoverflow.com/questions/3884419/')
soup = BeautifulSoup(q.read())

for link in soup.findAll('a'):
    if link.has_key('href'):
        # link.string is None unless the tag contains exactly one text node.
        print str(link.string) + " -> " + link['href']
    elif link.has_key('id'):
        print "ID: " + link['id']
    else:
        print "???"
Output:
Stack Exchange -> http://stackexchange.com
log in -> /users/login?returnurl=%2fquestions%2f3884419%2f
careers -> http://careers.stackoverflow.com
meta -> http://meta.stackoverflow.com
...
ID: flag-post-3884419
None -> /posts/3884419/revisions
...
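For readers on Python 3 without BeautifulSoup installed, roughly the same link listing can be sketched with the standard library's html.parser. The LinkCollector class and the sample markup below are hypothetical, written for illustration; unlike the loop above, it parses an inline string rather than fetching the page, and it simply skips <a> tags that have no href.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (text, href) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (text, href) pairs
        self._href = None   # href of the <a> currently open, if any
        self._chunks = []   # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._chunks = []

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((''.join(self._chunks).strip(), self._href))
            self._href = None


# Made-up sample markup, not the real page.
sample = (
    '<p><a href="http://stackexchange.com">Stack Exchange</a> '
    '<a id="flag-post">flag</a> '
    '<a href="/users/login"> log in </a></p>'
)

collector = LinkCollector()
collector.feed(sample)
collector.close()

for text, href in collector.links:
    print(text + ' -> ' + href)
```

Because the text is joined from every data chunk inside the tag, links with nested markup yield their combined text instead of None.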
For instance, soup.findAll('a') returns every <a> tag in the document. See the documentation: crummy.com/software/BeautifulSoup/documentation.html