Possible Duplicate:
Parsing HTML in Python

I have a long string of HTML similar to the following:

<ul>
<li><a href="/a/long/link">Class1</a></li>
<li><a href="/another/link">Class2</a></li>
<li><img src="/image/location" border="0">Class3</a></li>
</ul>

It has several list entries (Class1 to Class8). I'd like to turn this into a list in Python with only the class names, as in

["Class1", "Class2", "Class3"]

and so on.

How would I go about doing this? I've tried using REs, but I haven't been able to find a method that works. Of course, with only 8 classes I could easily do it manually, but I have several more HTML documents to extract data from.

Thanks! :)

3 Answers

2

Check out lxml (pip install lxml). You'll want to do a little more research, but effectively it comes down to something like this:

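from lxml import etree

# page_source is assumed to hold the HTML document as a string
tree = etree.HTML(page_source)

def parse_list(xpath):
    # xpath() returns a list of matches; take the first matching <ul>
    ul = tree.xpath(xpath)[0]
    # itertext() gathers text from nested tags such as <a>, where child.text would be None
    return ["".join(child.itertext()).strip() for child in ul]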

0

This should work, but take it just as a quick and ugly hack: do not parse HTML with regular expressions.

>>> hdata = """<ul>
... <li><a href="/a/long/link">Class1</a></li>
... <li><a href="/another/link">Class2</a></li>
... <li><img src="/image/location" border="0">Class3</a></li>
... </ul>"""
>>> import re
>>> lire = re.compile(r'<li>.*?>(.*?)<.*')
>>> [lire.search(x).groups()[0] for x in hdata.splitlines() if lire.search(x)]
['Class1', 'Class2', 'Class3']

You could try ElementTree if your source is valid XML; otherwise, look into Beautiful Soup.
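As a rough sketch of the ElementTree route, assuming the markup is well-formed XML: the third item in the question's sample (an unclosed `<img>` followed by a stray `</a>`) is not, so the `xml_data` string below is a hypothetical cleaned-up variant and not the original input. With the original string, `ET.fromstring` would raise a `ParseError`.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed variant of the sample from the question;
# the <img> tag is self-closed and the stray </a> is dropped.
xml_data = """<ul>
<li><a href="/a/long/link">Class1</a></li>
<li><a href="/another/link">Class2</a></li>
<li><img src="/image/location" border="0" />Class3</li>
</ul>"""

root = ET.fromstring(xml_data)  # root is the <ul> element

# itertext() yields the text and tail strings of everything inside each <li>,
# so it picks up text nested in <a> as well as text following <img>.
names = ["".join(li.itertext()).strip() for li in root.findall("li")]
print(names)  # ['Class1', 'Class2', 'Class3']
```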

1 Comment

Thanks! I actually did use Beautiful Soup to isolate the list from the rest of the HTML document, but wasn't sure how to go further than that. I'll have a look at it :)
0

If every line ends the same way (with </a></li>, as in your example), you could try matching each line against a regular expression like

re.compile(r'^<li><.*>(.*)</a></li>$')
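One way to apply that pattern to the whole string at once, as a sketch: the `re.MULTILINE` flag is an addition here (not part of the pattern above) so that `^` and `$` match at every line break, and `hdata` is just the sample from the question.

```python
import re

# Sample input copied from the question
hdata = """<ul>
<li><a href="/a/long/link">Class1</a></li>
<li><a href="/another/link">Class2</a></li>
<li><img src="/image/location" border="0">Class3</a></li>
</ul>"""

# re.MULTILINE makes ^ and $ anchor at each line instead of the whole string
pattern = re.compile(r'^<li><.*>(.*)</a></li>$', re.MULTILINE)

# findall() returns the captured group for every line that matches
names = pattern.findall(hdata)
print(names)  # ['Class1', 'Class2', 'Class3']
```

This only holds up while each item sits on its own line and ends in `</a></li>`; anything else is silently skipped.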

If you're expecting much more variability in the file than in your example, then something like an HTML parser would probably be better.
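For the parser route, here is a minimal sketch using only the standard library's `html.parser`. The class name `ListItemText` and its attributes are made up for illustration, and it assumes the `<li>` elements are not nested inside each other.

```python
from html.parser import HTMLParser


class ListItemText(HTMLParser):
    """Collects the visible text of each <li> element."""

    def __init__(self):
        super().__init__()
        self.items = []      # text of every completed <li>
        self._parts = None   # fragments of the <li> currently open, else None

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._parts = []

    def handle_data(self, data):
        # Only keep text that appears while an <li> is open
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._parts is not None:
            self.items.append("".join(self._parts).strip())
            self._parts = None


# Sample input copied from the question
hdata = """<ul>
<li><a href="/a/long/link">Class1</a></li>
<li><a href="/another/link">Class2</a></li>
<li><img src="/image/location" border="0">Class3</a></li>
</ul>"""

parser = ListItemText()
parser.feed(hdata)
parser.close()
print(parser.items)  # ['Class1', 'Class2', 'Class3']
```

Because the parser only reacts to `<li>` tags and text, the unclosed `<img>` and the stray `</a>` in the third item do not matter, and the items no longer need to sit one per line.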
