How do I extract HTML list entries into a Python list? [duplicate]

Question

Possible Duplicate:
Parsing HTML in Python

I have a long string of HTML similar to the following:

<ul>
<li><a href="/a/long/link">Class1</a></li>
<li><a href="/another/link">Class2</a></li>
<li><img src="/image/location" border="0">Class3</a></li>
</ul>

It has several list entries (Class1 to Class8). I'd like to turn this into a list in Python with only the class names, as in

["Class1", "Class2", "Class3"]

and so on.

How would I go about doing this? I've tried using REs, but I haven't been able to find a method that works. Of course, with only 8 classes I could easily do it manually, but I have several more HTML documents to extract data from.

Thanks! :)

Check out the documentation for HTMLParser: docs.python.org/library/htmlparser.html
– Alex Churchill, Commented Aug 9, 2011 at 21:20
For an example of HTMLParser, see stackoverflow.com/questions/3276040/…
– Alex Churchill, Commented Aug 9, 2011 at 21:23
Try BeautifulSoup: soup = BeautifulSoup(html); soup.findAll("li", text=True); it'll return all the class names.
– kenorb, Commented Jun 22, 2014 at 14:19
See also "Only extracting text from this element, not its children".
– kenorb, Commented Jun 22, 2014 at 14:21

Ceasar · Accepted Answer · 2011-08-09 21:43:50Z

2

Check out lxml (pip install lxml). You'll want to do a little more research, but effectively it comes down to something like this:

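from lxml import etree

# page_source is assumed to hold the HTML document as a string
tree = etree.HTML(page_source)

def parse_list(xpath):
    # xpath() returns a list of matches, so take the first <ul> it finds
    ul = tree.xpath(xpath)[0]
    # The names sit inside nested tags (<a>, or after <img>), so gather all
    # text under each <li> rather than reading li.text, which is None here
    return ["".join(li.itertext()).strip() for li in ul]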


Facundo Casco · Accepted Answer · 2011-08-09 21:42:53Z

0

This should work, but treat it as a quick and ugly hack: in general, do not parse HTML with regular expressions.

>>> hdata = """<ul>
... <li><a href="/a/long/link">Class1</a></li>
... <li><a href="/another/link">Class2</a></li>
... <li><img src="/image/location" border="0">Class3</a></li>
... </ul>"""
>>> import re
>>> lire = re.compile(r'<li>.*?>(.*?)<.*')
>>> [lire.search(x).groups()[0] for x in hdata.splitlines() if lire.search(x)]
['Class1', 'Class2', 'Class3']

You could try ElementTree if your source is valid XML; otherwise, look at Beautiful Soup.
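Neither of those libraries is shown here. As a third option that needs no extra install, here is a minimal sketch using the standard library's html.parser (the HTMLParser mentioned in the comments on the question). The class name ListItemCollector and its attributes are made up for illustration, and the sketch assumes the list items are not nested inside one another.

```python
from html.parser import HTMLParser


class ListItemCollector(HTMLParser):
    """Collect the text found inside each <li> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.items = []      # finished list entries
        self._parts = None   # text pieces of the <li> being read, or None

    def handle_starttag(self, tag, attrs):
        # Start a fresh buffer whenever a list item opens
        if tag == "li":
            self._parts = []

    def handle_data(self, data):
        # Keep text only while we are inside a list item
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        # Close the current item and store its stripped text
        if tag == "li" and self._parts is not None:
            self.items.append("".join(self._parts).strip())
            self._parts = None


hdata = """<ul>
<li><a href="/a/long/link">Class1</a></li>
<li><a href="/another/link">Class2</a></li>
<li><img src="/image/location" border="0">Class3</a></li>
</ul>"""

collector = ListItemCollector()
collector.feed(hdata)
collector.close()
class_names = collector.items
print(class_names)
```

Because the parser reacts to tags rather than to line layout, it copes with the stray </a> in the third entry and would still work if several entries shared one line.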


1 Comment

user886767 Over a year ago

Thanks! I actually did use Beautiful Soup to isolate the list from the rest of the HTML document, but wasn't sure how to go further than that. I'll have a look at it :)

dpitch40 · Accepted Answer · 2011-08-09 21:22:36Z

0

If all the line endings are the same, you could try a regular expression like

re.compile(r'^<li><.*>(.*)</a></li>$')

If you're expecting much more variability in the file than in your example, then something like an HTML parser would probably be better.
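To apply that pattern to the whole HTML string at once, one option is to compile it with re.MULTILINE so that ^ and $ match at every line, then call findall. This is only a sketch: the variable name hdata and the sample string are borrowed from the question for illustration, and it assumes each entry sits on its own line and ends in </a></li>.

```python
import re

hdata = """<ul>
<li><a href="/a/long/link">Class1</a></li>
<li><a href="/another/link">Class2</a></li>
<li><img src="/image/location" border="0">Class3</a></li>
</ul>"""

# re.MULTILINE lets ^ and $ anchor at the start and end of each line
pattern = re.compile(r'^<li><.*>(.*)</a></li>$', re.MULTILINE)

# findall returns the captured group for every matching line
class_names = pattern.findall(hdata)
print(class_names)
```

The <ul> and </ul> lines are skipped because they do not match the pattern. Any entry formatted differently (extra whitespace, attributes on <li>, no closing </a>) would be skipped too, which is the variability the answer warns about.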
