Problem extracting text out of html file using python regex

Question

I'm working on a project that requires me to write some Python code to pull some text out of an HTML file.

<tr>
<td>Target binary file name:</td>
<td class="right">Doc1.docx</td>
</tr>

^Small portion of the html file that I'm interested in.

#! /usr/bin/python
import os
import re    

if __name__ == '__main__':
    f = open('./results/sample_result.html')
    soup = f.read()
    p = re.compile("binary")
    for line in soup:
        m = p.search(line)
        if m:
            print "finally"
            break

^Sample code I wrote to test whether I could extract the data. I've written several almost identical programs to extract text from txt files, and they have worked just fine. Is there something I'm missing out with regards to regex and html?

codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
– deinst, Commented Jul 31, 2010 at 13:05
@deinst, great link, I laughed reading it. @OP, that's a very good point. You just should NOT parse HTML with regex. Try the magic of lxml or BeautifulSoup, and you will never want to go back to regex again.
– Daniel Kluev, Commented Jul 31, 2010 at 13:30
A recent question covers how to do something very similar with BeautifulSoup: stackoverflow.com/questions/3376803/…
– bobince, Commented Jul 31, 2010 at 13:35
Thanks for suggesting lxml, Daniel, I'll take a look at it. @bobince: Thanks for the link!
– M Rubern C, Commented Jul 31, 2010 at 14:29

S.Lott · Accepted Answer · 2010-07-31 13:22:54Z

4

Is there something I'm missing out with regards to regex and html?

Yes. You're missing the fact that some HTML cannot be parsed with a simple regex.


2 Comments

M Rubern C Over a year ago

Ouch. I was thinking that the above would simply match, since the only thing I was searching for was the word "binary". I understand that it isn't a good idea to use regex to process HTML, but in this scenario I don't understand why the regex does not match, because I'm not dealing with the tags at all.

S.Lott Over a year ago

@M Rubern C: You can't ignore the tags. What if your "binary" is <b>b</b>inary to make the "b" bold?

Katriel · Accepted Answer · 2010-07-31 14:29:43Z

0

Is this actually what you're trying to do, or just a simple example for a more complicated regex later? If the latter, listen to everyone else. If the former:

for line in file:
    if "binary" in line:
        # do stuff

If that doesn't work, are you sure "binary" is in the file? Not, I don't know, "<i>b</i>inary"?
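One thing worth checking in the code as posted in the question: `f.read()` returns the whole file as a single string, and looping over a string yields one character at a time rather than one line. No single character can contain "binary", so the search would never succeed regardless of the HTML. A minimal sketch of the difference, assuming the file contains the snippet from the question (the `html` string here stands in for the file contents):

```python
import re

# Stand-in for the contents of sample_result.html (taken from the question).
html = (
    "<tr>\n"
    "<td>Target binary file name:</td>\n"
    '<td class="right">Doc1.docx</td>\n'
    "</tr>\n"
)
pattern = re.compile("binary")

# Looping over a str yields single characters, which is what
# "for line in soup" does after soup = f.read().
char_hits = [item for item in html if pattern.search(item)]

# Looping over lines (as "for line in f" would) gives the regex
# something long enough to match.
line_hits = [line for line in html.splitlines() if pattern.search(line)]

print(char_hits)
print(line_hits)
```

Iterating over the file object itself (`for line in f:`) instead of over the result of `f.read()` gives the line-by-line behaviour.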


1 Comment

M Rubern C Over a year ago

I was planning to use regex to parse the file and tried to write a simple example to test it, but I've been convinced otherwise. I'm sure it appears as <td>Target binary file name:</td>. I'm just puzzled why it doesn't pick it up.

PaulMcG · Accepted Answer · 2010-07-31 19:58:57Z

0

HTML as understood by browsers is waaaay too flexible for regular expressions. Attributes can pop up in any tag, in any order, in upper or lower case, and with or without quotation marks around the value. Special emphasis tags can show up anywhere. Whitespace is significant in regex, but not so much in HTML, so your regex has to be littered with \s*'s everywhere. There is no requirement that opening tags be matched with closing tags. Some opening tags include a trailing '/', meaning that they are empty tags (no body, no closing tag). Lastly, HTML is often nested, which is pretty much off the chart as far as regex is concerned.
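To illustrate the points above, here is a rough sketch that leaves the tag handling to a parser instead of a regex. It uses Python 3's standard-library `html.parser` purely to stay self-contained; it is not the lxml or BeautifulSoup approach suggested in the comments. The `CellText` class, the `sample` markup (upper-case tags, an unquoted attribute, a bold letter inside the label) and the `label` string are all made up for the example.

```python
from html.parser import HTMLParser


class CellText(HTMLParser):
    """Hypothetical helper: collect the text of each <td>, ignoring inline tags."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._buf = None  # None means we are not inside a <td>

    def handle_starttag(self, tag, attrs):
        # The parser passes tag names in lower case, so <TD> and <td> look alike.
        if tag == "td":
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "td" and self._buf is not None:
            # Join the text pieces and collapse runs of whitespace.
            self.cells.append(" ".join("".join(self._buf).split()))
            self._buf = None

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)


# Made-up variation of the snippet from the question.
sample = """
<TR>
<td>Target <b>b</b>inary file name:</td>
<TD CLASS=right>  Doc1.docx
</TD>
</TR>
"""

parser = CellText()
parser.feed(sample)
parser.close()

cells = parser.cells
label = "Target binary file name:"
# The value is assumed to sit in the cell right after the label cell.
value = cells[cells.index(label) + 1] if label in cells else None
print(value)
```

A plain search for "binary" on the raw markup would miss this label because of the `<b>` tag, while the collected cell text still reads "Target binary file name:".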

edited Jul 31, 2010 at 19:58

answered Jul 31, 2010 at 14:26
