1

I'm trying to find a string inside an HTML page that follows a known pattern. For example, in the following HTML code:

<TABLE WIDTH="100%">
<TR><TD ALIGN="LEFT" width="50%">&nbsp;</TD>
<TD ALIGN=RIGHT VALIGN=BOTTOM WIDTH=50%><FONT SIZE=-1>( <STRONG>1</STRONG></FONT> <FONT SIZE=-2>of</FONT> <STRONG><FONT SIZE=-1>1</STRONG> )</FONT></TD></TR></TABLE>
<HR>
<TABLE WIDTH="100%">
<TR>    <TD ALIGN="LEFT" WIDTH="50%"><B>String 1</B></TD>
    <TD ALIGN="RIGHT" WIDTH="50%"><B><A Name=h1 HREF=#h0></A><A  HREF=#h2></A><B><I></I></B>String</B></TD>
</TR>
<TR><TD ALIGN="LEFT" WIDTH="50%"><b>String 2.</B>
</TD>
<TD ALIGN="RIGHT" WIDTH="50%"> <B>
String 3
</B></TD>
</TR>
</TABLE>
<HR>
<font size="+1">String 4</font><BR>
...

I want to find String 4, and I know that it will always be between

<HR><font size="+1">
and </font><BR>

How can I search for the string using a regular expression?

Edit:

I've tried the following, without success:

p = re.match('<HR><font size="+1">(.*?)</font><BR>',html)

Thanks.

4
  • Have you considered using xpath instead of regex? Commented Jul 2, 2012 at 13:14
  • I've tried using BeautifulSoup. It didn't work because I'm running the parser on multiple pages and there are slight changes between them. Commented Jul 2, 2012 at 13:19
  • @Rgo: XPath queries (with lxml) can take care of pages with slight differences. Commented Jul 2, 2012 at 13:23
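  • Your attempt with re.match did not work because re.match only matches at the beginning of the string; re.search or re.findall scan the whole string. Also, the + has a special meaning in a regex (it is a quantifier), so it should be escaped as \+. But you were on the right track. Commented Jul 2, 2012 at 14:24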
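
To illustrate the last comment, here is a minimal sketch. The `html` value is a shortened stand-in for the page in the question (an assumption for illustration), and the corrected pattern is only one possible fix:

```python
import re

# Shortened stand-in for the page in the question (assumption for illustration).
html = '<TABLE></TABLE>\n<HR>\n<font size="+1">String 4</font><BR>\n'

# Original attempt: re.match only tries position 0, and the unescaped "+"
# acts as a quantifier on the preceding quote character.
original = re.match('<HR><font size="+1">(.*?)</font><BR>', html)

# re.search scans the whole string; "\+" matches a literal plus sign,
# and \s* allows the newline between <HR> and <font>.
fixed = re.search(r'<HR>\s*<font size="\+1">(.*?)</font><BR>', html)

print(original)        # None
print(fixed.group(1))  # String 4
```

Note that the sample page also has a newline between <HR> and <font>, which the original pattern did not allow for.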

3 Answers 3

4
re.findall(r'<HR>\s*<font size="\+1">(.*?)</font><BR>', html, re.DOTALL)

findall returns a list of everything captured by the parentheses (the capturing group) in the regular expression. I used re.DOTALL so that the dot also matches newlines.

I used \s* because I was not sure whether there would be any whitespace.
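
A small sketch of what re.DOTALL changes here, assuming two hypothetical page fragments (the second one, with the text split over two lines, is invented for illustration):

```python
import re

# Hypothetical fragments: in the second one the wanted text spans two lines.
pages = [
    '<HR>\n<font size="+1">String 4</font><BR>',
    '<HR><font size="+1">String\n4</font><BR>',
]
pattern = r'<HR>\s*<font size="\+1">(.*?)</font><BR>'

# With re.DOTALL the dot also matches "\n", so both fragments match.
with_dotall = [re.findall(pattern, p, re.DOTALL) for p in pages]

# Without it, the dot stops at the newline and the second fragment fails.
without_dotall = [re.findall(pattern, p) for p in pages]

print(with_dotall)     # [['String 4'], ['String\n4']]
print(without_dotall)  # [['String 4'], []]
```

If the text between the tags never contains a newline, the flag makes no difference.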


2

This works, but may not be very robust:

import re
# Raw string, so \s and \+ reach the regex engine unchanged
r = re.compile(r'<HR>\s?<font size="\+1">(.+?)</font>\s?<BR>', re.IGNORECASE)
r.findall(html)

You will be better off using a proper HTML parser. BeautifulSoup is excellent and easy to use. Look it up.
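
As a rough illustration of the parser route, here is a sketch that swaps in the standard library's html.parser for BeautifulSoup, only to stay dependency-free. The class name FontAfterHr and the sample `html` string are hypothetical, and the rule "a font tag with size +1 that is the first tag after an HR" is my reading of the question, not a general solution:

```python
from html.parser import HTMLParser


class FontAfterHr(HTMLParser):
    """Collect the text of each <font size="+1"> that is the first tag after an <hr>."""

    def __init__(self):
        super().__init__()
        self.after_hr = False   # True if the most recent tag seen was <hr>
        self.capturing = False  # True while inside the target <font>
        self.results = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag == "font" and self.after_hr and dict(attrs).get("size") == "+1":
            self.capturing = True
            self.results.append("")
        self.after_hr = (tag == "hr")

    def handle_endtag(self, tag):
        if tag == "font":
            self.capturing = False
        if tag != "hr":  # tolerate a self-closing <hr/>
            self.after_hr = False

    def handle_data(self, data):
        if self.capturing:
            self.results[-1] += data


# Shortened stand-in for the page in the question (assumption for illustration).
html = ('<TABLE><TR><TD><FONT SIZE=-1>x</FONT></TD></TR></TABLE>\n'
        '<HR>\n<font size="+1">String 4</font><BR>\n')

parser = FontAfterHr()
parser.feed(html)
parser.close()
print(parser.results)  # ['String 4']
```

Whitespace and tag case do not matter to this approach, which is the main advantage over a literal regex.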


0
re.findall(r'<HR>\n<font size="\+1">([^<]*)</font><BR>', html)
