3

I am trying to parse the HTML of a website to find the max enrollment of a class. I first tried checking each line of the HTML file for a substring, but that picked up the wrong lines, so I am now using regular expressions. My current regular expression is \t\t\t\t\t\t\t<td class="odd">([0-9])|([0-9][0-9])|([0-9][0-9][0-9])<\/td>\r\n, but it matches the section number as well as the max enrollment. Is there another way to extract just the max enrollment from the webpage? The HTML snippet is below:

<tr>
    <td class="tableHeader">Section</td>
    <td class="odd">001</td>
</tr>

<tr>
    <td class="tableHeader">Credits</td>
    <td class="even" align="left">  4.00</td>
</tr>

<tr>
<td class="tableHeader">Title</td>
<td class="odd">Linear Algebra</td>
</tr>

<tr>
    <td class="tableHeader">Campus</td>
    <td class="even" align="left">University City</td>
</tr>

<tr>
    <td class="tableHeader">Instructor(s)</td>
    <td class="odd">Guang  Yang</td>
</tr>
<tr>
    <td class="tableHeader">Instruction Type</td>
    <td class="even">Lecture</td>
</tr>

<tr>
    <td class="tableHeader">Max Enroll</td>
    <td class="odd">30</td>
</tr>
  • 3
    Read this: stackoverflow.com/a/1732454/3001761 Commented May 8, 2014 at 17:25
  • 2
    do not agree about the dupe, it's not asking whether it can be done with a regex, it's wrongly trying to do that. Commented May 8, 2014 at 17:36
  • 1
    This is not a duplicate. That OP is trying to actually match the tag name, class name, etc. I am just trying to extract the contents in such a way where I don't get the section number AND max enroll number. I just need help with getting only the Max Enroll number. Commented May 8, 2014 at 17:41
  • 1
    Well then instead of sitting there insulting the way I approached this problem, maybe it would be more productive to point me in the right direction, wouldn't it? Commented May 8, 2014 at 17:49
  • 2
    which is why I'm giving a link in my all-caps disclaimer. I could also write it using <blink></blink> using toilet? Commented May 8, 2014 at 18:07

3 Answers

5

DO NOT PARSE HTML USING REGEXP.

Use the right tool for the right job.

Here's an analogy for why it's wrong: it's like asking a five-year-old to understand Hamlet. He does not yet have the vocabulary and grammar to follow Shakespeare, and he will only get them once he can process more abstract concepts.
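
To make the problem concrete, here is a minimal sketch of what goes wrong with the regex approach. The pattern below is a simplified stand-in for the one in the question (an assumption, not the OP's exact expression), and `html_text` is a trimmed copy of the question's snippet. Nothing in the pattern ties a cell to its "Max Enroll" header, so every short numeric "odd" cell matches:

```python
import re

# Trimmed copy of the question's HTML: two rows whose value cells look alike.
html_text = """
<tr>
    <td class="tableHeader">Section</td>
    <td class="odd">001</td>
</tr>
<tr>
    <td class="tableHeader">Max Enroll</td>
    <td class="odd">30</td>
</tr>
"""

# Simplified stand-in for the question's pattern:
# one to three digits inside a cell with class "odd".
cell_pattern = re.compile(r'<td class="odd">(\d{1,3})</td>')

matches = cell_pattern.findall(html_text)
print(matches)  # ['001', '30']
```

The regex sees only text, not rows, so it cannot tell the section number from the max enrollment without a lot of fragile extra context.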

Use either lxml or BeautifulSoup to parse the HTML.

For example, to get lists of the text in all the odd and all the even cells:

>>> from lxml import etree
>>> tree = etree.HTML(your_html_text)
>>> odds = tree.xpath('//td[@class="odd"]/text()')
>>> evens = tree.xpath('//td[@class="even"]/text()')
>>> odds
['001', 'Linear Algebra', 'Guang  Yang', '30']
>>> evens
['  4.00', 'University City', 'Lecture']

edit:

I am just trying to extract the contents in such a way where I don't get the section number AND max enroll number. I just need help with getting only the Max Enroll number.

OK, now I see what you want, so here's a solution using lxml:

>>> for elt in tree.xpath('//tr'):
...     if elt.xpath('td[@class="tableHeader"]')[0].text == "Max Enroll":
...         elt.xpath('td[@class="odd"]|td[@class="even"]')[0].text
... 
'30'

There you have only the max enroll number.

Using BeautifulSoup it's a bit easier:

>>> from bs4 import BeautifulSoup
>>> bs = BeautifulSoup(your_html_text, "html.parser")
>>> for t in bs.find_all('td', attrs={'class': 'tableHeader'}):
...     if t.text == "Max Enroll":
...         print(t.find_next('td').text)
...
30

2 Comments

soup.find('td', text="Max Enroll").find_next_sibling('td').text would be easier.
indeed, though I'm giving the more general approach here, so the OP can adapt to his dataset.
3

Use a tool that specializes in parsing HTML, like BeautifulSoup:

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

For example, here's how you can get what you want:

from bs4 import BeautifulSoup

data = """your html here"""

soup = BeautifulSoup(data, "html.parser")
print(soup.find('td', string="Max Enroll").find_next_sibling('td').text)

Prints:

30

2 Comments

If I choose this method, I will not be able to give this script to friends very easily for them to use because it will use a library that they (most likely) will not have installed on their computer initially, correct?
@heinst well, BeautifulSoup is a third-party library that can be easily installed. Just include a requirements.txt file listing the script's dependencies and give it to your friends.
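
If installing a third-party library really is a blocker, one possible alternative is the standard library's `html.parser` module. The sketch below is only an illustration under stated assumptions: the class `HeaderValueParser` and the helper `parse_rows` are hypothetical names, and the code assumes every row of interest holds a header cell followed by a value cell, as in the question's snippet. It is less forgiving than BeautifulSoup on messy markup.

```python
from html.parser import HTMLParser


class HeaderValueParser(HTMLParser):
    """Collect {header text: value text} from rows with at least two <td> cells."""

    def __init__(self):
        super().__init__()
        self.rows = {}       # header text -> value text
        self._cells = []     # text of each <td> seen in the current <tr>
        self._in_td = False  # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag == "td":
            self._in_td = True
            self._cells.append("")

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current cell.
        if self._in_td:
            self._cells[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr" and len(self._cells) >= 2:
            header, value = self._cells[0].strip(), self._cells[1].strip()
            self.rows[header] = value


def parse_rows(html_text):
    """Hypothetical helper: return a dict mapping each row header to its value."""
    parser = HeaderValueParser()
    parser.feed(html_text)
    parser.close()
    return parser.rows


sample = """
<tr>
    <td class="tableHeader">Section</td>
    <td class="odd">001</td>
</tr>
<tr>
    <td class="tableHeader">Max Enroll</td>
    <td class="odd">30</td>
</tr>
"""

print(parse_rows(sample)["Max Enroll"])  # prints 30
```

Because the lookup is keyed on the header text, the section number and the max enrollment can no longer be confused, and the script runs on a stock Python 3 install.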
1

An alternative to zmo's answer, using BeautifulSoup:

from bs4 import BeautifulSoup

data = """
<snipped html>
"""

soup = BeautifulSoup(data, "html.parser")

for table_header in soup.find_all('td', class_="tableHeader"):
    if table_header.get_text() == "Max Enroll":
        print(table_header.find_next_siblings('td', class_="odd")[0].get_text())

Output:

30
