Python parsing HTML Using Regular Expressions

Question

I am trying to go through the HTML of a website and parse it looking for the max enrollment of a class. I tried checking for a substring in each line of the HTML file, but that would try to parse the wrong lines. So I am now using Regular Expressions. I have \t\t\t\t\t\t\t<td class="odd">([0-9])|([0-9][0-9])|([0-9][0-9][0-9])<\/td>\r\n as my regular expression right now, but this regular expression matches the max enrollment as well as the section number. Is there another way to go about what I am trying to extract from the webpage? The HTML code snippet is below:

<tr>
    <td class="tableHeader">Section</td>
    <td class="odd">001</td>
</tr>

<tr>
    <td class="tableHeader">Credits</td>
    <td class="even" align="left">  4.00</td>
</tr>

<tr>
<td class="tableHeader">Title</td>
<td class="odd">Linear Algebra</td>
</tr>

<tr>
    <td class="tableHeader">Campus</td>
    <td class="even" align="left">University City</td>
</tr>

<tr>
    <td class="tableHeader">Instructor(s)</td>
    <td class="odd">Guang  Yang</td>
</tr>
<tr>
    <td class="tableHeader">Instruction Type</td>
    <td class="even">Lecture</td>
</tr>

<tr>
    <td class="tableHeader">Max Enroll</td>
    <td class="odd">30</td>
</tr>

I do not agree about the dupe: it's not asking whether it can be done with a regex, it's wrongly trying to do that.
– zmo, Commented May 8, 2014 at 17:36
This is not a duplicate. That OP is trying to actually match the tag name, class name, etc. I am just trying to extract the contents in such a way that I don't get the section number AND the max enroll number. I just need help with getting only the Max Enroll number.
– heinst, Commented May 8, 2014 at 17:41
Well then, instead of sitting there insulting the way I approached this problem, maybe it would be more productive to point me in the right direction, wouldn't it?
– heinst, Commented May 8, 2014 at 17:49
Which is why I'm giving a link in my all-caps disclaimer. I could also write it using <blink></blink> using toilet?
– zmo, Commented May 8, 2014 at 18:07

Community · Accepted Answer · 2017-05-23 12:13:11Z

5

DO NOT PARSE HTML USING REGEXP.

Use the right tool for the right job.

An analogy for why it's wrong: it's like asking a five-year-old to understand Hamlet. The child does not yet have the vocabulary and grammar to understand Shakespeare, and will only acquire them once able to process more abstract concepts.
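
For a more concrete view of the problem, here is a minimal sketch of how the pattern from the question behaves. It assumes a two-cell excerpt of the table and drops the leading tabs and trailing \r\n from the pattern for brevity; the variable names are only illustrative.

```python
import re

# Two cells from the question's table, reduced to the part the pattern targets.
html = '<td class="odd">001</td>\n<td class="odd">30</td>\n'

# The | operators split the WHOLE pattern, so the <td ...> prefix belongs
# only to the first branch and the </td> suffix only to the last one.
original = r'<td class="odd">([0-9])|([0-9][0-9])|([0-9][0-9][0-9])</td>'
original_matches = [m.group(0) for m in re.finditer(original, html)]
print(original_matches)  # ['<td class="odd">0', '01', '<td class="odd">3']

# Grouping the digits repairs the alternation, but both cells still match:
# nothing in the pattern knows which header the value belongs to.
grouped = r'<td class="odd">(\d{1,3})</td>'
grouped_matches = re.findall(grouped, html)
print(grouped_matches)  # ['001', '30']
```

Even the repaired pattern cannot tell the section number from the max enrollment, because that distinction lives in the neighbouring header cell, which is exactly the kind of structure a parser exposes.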

Use either lxml or BeautifulSoup to parse the HTML instead.

As an example: to get a list of all the evens and all the odds:

>>> from lxml import etree
>>> tree = etree.HTML(your_html_text)
>>> odds = tree.xpath('//td[@class="odd"]/text()')
>>> evens = tree.xpath('//td[@class="even"]/text()')
>>> odds
['001', 'Linear Algebra', 'Guang  Yang', '30']
>>> evens
['  4.00', 'University City', 'Lecture']

edit:

I am just trying to extract the contents in such a way where I don't get the section number AND max enroll number. I just need help with getting only the Max Enroll number.

ok, now I'm getting what you want, so here's the solution using lxml:

>>> for elt in tree.xpath('//tr'):
...     if elt.xpath('td[@class="tableHeader"]')[0].text == "Max Enroll":
...         elt.xpath('td[@class="odd"]|td[@class="even"]')[0].text
... 
'30'

There you have only the max enroll number.
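
The loop above can probably also be collapsed into a single XPath expression. This is a sketch under the assumption that the value cell is always the first td sibling after the header cell, as in the snippet from the question; the shortened table below is just sample data.

```python
from lxml import etree

html = """
<table>
<tr>
    <td class="tableHeader">Section</td>
    <td class="odd">001</td>
</tr>
<tr>
    <td class="tableHeader">Max Enroll</td>
    <td class="odd">30</td>
</tr>
</table>
"""

tree = etree.HTML(html)

# Pick the header cell by its (whitespace-normalized) text, then step to the
# first <td> that follows it inside the same row.
values = tree.xpath(
    '//td[@class="tableHeader"][normalize-space()="Max Enroll"]'
    '/following-sibling::td[1]/text()'
)
print(values)  # ['30']
```

Because the lookup is keyed on the header text rather than on the odd/even class, it should keep working if the row striping changes.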

Using BeautifulSoup it's a bit easier:

>>> from bs4 import BeautifulSoup
>>> bs = BeautifulSoup(your_html_text, 'html.parser')
>>> for t in bs.find_all('td', attrs={'class': 'tableHeader'}):
...     if t.text == "Max Enroll":
...         print(t.find_next('td').text)
...
30

edited May 23, 2017 at 12:13

answered May 8, 2014 at 17:25

zmo

2 Comments

alecxe Over a year ago

soup.find('td', text="Max Enroll").find_next_sibling('td').text would be easier.

zmo Over a year ago

indeed, though I'm giving the more general approach here, so the OP can adapt to his dataset.
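
One way the general approach might be adapted to other fields is to collect every header/value pair into a dictionary first. This is only a sketch: it assumes each header cell is immediately followed by its value cell, and the `fields` name and the shortened table are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr>
    <td class="tableHeader">Section</td>
    <td class="odd">001</td>
</tr>
<tr>
    <td class="tableHeader">Credits</td>
    <td class="even" align="left">  4.00</td>
</tr>
<tr>
    <td class="tableHeader">Max Enroll</td>
    <td class="odd">30</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Map each header label to the text of the cell that follows it.
fields = {}
for header in soup.find_all("td", class_="tableHeader"):
    value = header.find_next_sibling("td")
    if value is not None:
        fields[header.get_text(strip=True)] = value.get_text(strip=True)

print(fields["Max Enroll"])  # 30
```

With the dictionary in hand, any other field (for example `fields["Credits"]`) can be read the same way without another pass over the document.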

alecxe · Accepted Answer · 2014-05-08 18:27:36Z

3

Use a tool that is specialized for parsing HTML, like BeautifulSoup:

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

For example, here's how you can get what you want:

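from bs4 import BeautifulSoup

data = """your html here"""

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(data, 'html.parser')
# string= matches on the cell's text (older bs4 releases spell this argument text=).
print(soup.find('td', string="Max Enroll").find_next_sibling('td').text)

Prints:

30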

edited May 8, 2014 at 18:27

answered May 8, 2014 at 17:37

alecxe

2 Comments

heinst Over a year ago

If I choose this method, I will not be able to give this script to friends very easily for them to use because it will use a library that they (most likely) will not have installed on their computer initially, correct?

alecxe Over a year ago

@heinst well, BeautifulSoup is a third-party library that can be easily installed. Just include requirements.txt file with script dependencies and give it to your friends.
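
If installing a third-party package really is a problem, a parser-based approach may still be possible with only the standard library. The following is a rough sketch using `html.parser`; the class name `CellAfterHeaderParser` is made up for this example, and it assumes the value sits in the td that directly follows the header cell.

```python
from html.parser import HTMLParser


class CellAfterHeaderParser(HTMLParser):
    """Remember the text of the <td> that follows a <td> with a given label."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.in_td = False       # currently inside a <td>?
        self.buffer = []         # text pieces of the current cell
        self.want_next = False   # previous cell was the label
        self.value = None        # result, once found

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_td:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag != "td" or not self.in_td:
            return
        self.in_td = False
        text = "".join(self.buffer).strip()
        if self.want_next and self.value is None:
            self.value = text
            self.want_next = False
        elif text == self.label:
            self.want_next = True


html = """
<tr>
    <td class="tableHeader">Section</td>
    <td class="odd">001</td>
</tr>
<tr>
    <td class="tableHeader">Max Enroll</td>
    <td class="odd">30</td>
</tr>
"""

parser = CellAfterHeaderParser("Max Enroll")
parser.feed(html)
print(parser.value)  # 30
```

This is more code than the BeautifulSoup one-liner, but the script then runs on a stock Python 3 install with nothing extra to distribute.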

Community · Accepted Answer · 2017-05-23 12:05:37Z

1

An alternative to zmo's answer, using BeautifulSoup:

from bs4 import BeautifulSoup

data = """
<snipped html>
"""

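# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(data, 'html.parser')

# Find the header cell labelled "Max Enroll" and print the value cell next to it.
for tableHeaders in soup.find_all('td', class_="tableHeader"):
    if tableHeaders.get_text() == "Max Enroll":
        print(tableHeaders.find_next_siblings('td', class_="odd")[0].get_text())

Output:

30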

edited May 23, 2017 at 12:05

answered May 8, 2014 at 17:56

admdrew
