Search an HTML page using regex patterns with Python

Question

I'm trying to find a string inside an HTML page with known patterns. For example, in the following HTML code:

<TABLE WIDTH="100%">
<TR><TD ALIGN="LEFT" width="50%">&nbsp;</TD>
<TD ALIGN=RIGHT VALIGN=BOTTOM WIDTH=50%><FONT SIZE=-1>( <STRONG>1</STRONG></FONT> <FONT SIZE=-2>of</FONT> <STRONG><FONT SIZE=-1>1</STRONG> )</FONT></TD></TR></TABLE>
<HR>
<TABLE WIDTH="100%">
<TR>    <TD ALIGN="LEFT" WIDTH="50%"><B>String 1</B></TD>
    <TD ALIGN="RIGHT" WIDTH="50%"><B><A Name=h1 HREF=#h0></A><A  HREF=#h2></A><B><I></I></B>String</B></TD>
</TR>
<TR><TD ALIGN="LEFT" WIDTH="50%"><b>String 2.</B>
</TD>
<TD ALIGN="RIGHT" WIDTH="50%"> <B>
String 3
</B></TD>
</TR>
</TABLE>
<HR>
<font size="+1">String 4</font><BR>
...

I want to find String 4, and I know that it will always be between

<HR><font size="+1">
and </font><BR>

How can I search for the string using a regular expression?

Edit:

I've tried the following, without success:

p = re.match('<HR><font size="+1">(.*?)</font><BR>',html)

Thanks.

I've tried using BeautifulSoup. It didn't work because I'm running the parser on multiple pages and there are slight changes between them. — Rgo
– Rgo, Commented Jul 2, 2012 at 13:19
@Rgo: XPath queries (with lxml) can take care of pages with slight differences. — Martijn Pieters
– Martijn Pieters, Commented Jul 2, 2012 at 13:23
Your attempt with re.match did not work because re.match only tries to match at the beginning of the string. Also, + has a special meaning in a regular expression, so it should be escaped. But you were on the right track. — Marco de Wit
– Marco de Wit, Commented Jul 2, 2012 at 14:24
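
The two points in the last comment can be checked directly. The snippet below is a minimal sketch: the `html` string is a shortened stand-in for the page in the question, and the variable names are only illustrative.

```python
import re

# Shortened stand-in for the page: it starts with a table, not with <HR>.
html = '<TABLE></TABLE>\n<HR>\n<font size="+1">String 4</font><BR>\n'

# Unescaped, "+" is a quantifier: '"+' means "one or more double quotes",
# so the literal text size="+1" can never match.
unescaped = r'<HR>\s*<font size="+1">(.*?)</font><BR>'
escaped = r'<HR>\s*<font size="\+1">(.*?)</font><BR>'

# re.match only tries position 0, where the page begins with <TABLE>.
print(re.match(escaped, html))      # None

# re.search scans the whole string, but the unescaped "+" still fails.
print(re.search(unescaped, html))   # None

found = re.search(escaped, html)
print(found.group(1))               # String 4
```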

Marco de Wit · Accepted Answer · 2012-07-02 14:27:45Z

4

re.findall(r'<HR>\s*<font size="\+1">(.*?)</font><BR>', html, re.DOTALL)

findall returns a list of everything captured by the parentheses (the capture group) in the regular expression. I used re.DOTALL so that the dot also matches newline characters.

I used \s* because I was not sure whether there would be any whitespace.
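
To see what re.DOTALL and \s* each contribute, here is a small sketch. The sample `html` is made up for illustration: it is modeled on the end of the page in the question, with a second, hypothetical entry whose text spans two lines.

```python
import re

# Made-up sample: the first entry has a newline after <HR>,
# the second has text that spans two lines.
html = """</TABLE>
<HR>
<font size="+1">String 4</font><BR>
<HR><font size="+1">spans
two lines</font><BR>
"""

pattern = r'<HR>\s*<font size="\+1">(.*?)</font><BR>'

with_dotall = re.findall(pattern, html, re.DOTALL)
without_dotall = re.findall(pattern, html)

print(with_dotall)     # ['String 4', 'spans\ntwo lines']
print(without_dotall)  # ['String 4']
```

Without re.DOTALL the second entry is missed, because the dot stops at the newline inside the text. The \s* is what lets the first entry match despite the newline between the two tags.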

edited Jul 2, 2012 at 14:27

answered Jul 2, 2012 at 14:07

Marco de Wit

Jordan Dimov · Accepted Answer · 2012-07-02 12:57:20Z

2

This works, but may not be very robust:

import re
# Raw string, so that \s and \+ reach the regex engine unchanged
r = re.compile(r'<HR>\s?<font size="\+1">(.+?)</font>\s?<BR>', re.IGNORECASE)
r.findall(html)

You will be better off using a proper HTML parser. BeautifulSoup is excellent and easy to use. Look it up.
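
As a rough illustration of the parser-based approach, here is a sketch that swaps in the standard library's html.parser for BeautifulSoup so that it runs without third-party packages. The class name `FontAfterHrParser` and the sample `html` are hypothetical, and it assumes the target is always a font tag with size="+1" that directly follows an HR tag.

```python
from html.parser import HTMLParser


class FontAfterHrParser(HTMLParser):
    """Collect the text of each <font size="+1"> that directly follows an <hr>."""

    def __init__(self):
        super().__init__()
        self.after_hr = False   # True while the last tag seen was <hr>
        self.capturing = False  # True while inside a matching <font>
        self.results = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, so case does not matter.
        if tag == "font" and self.after_hr and dict(attrs).get("size") == "+1":
            self.capturing = True
            self.results.append("")
        self.after_hr = (tag == "hr")

    def handle_data(self, data):
        if self.capturing:
            self.results[-1] += data
        elif data.strip():
            # Non-whitespace text between <hr> and <font> breaks the adjacency.
            self.after_hr = False

    def handle_endtag(self, tag):
        if tag == "font":
            self.capturing = False
        if tag != "hr":
            self.after_hr = False


# Shortened sample modeled on the page in the question.
html = """<TABLE WIDTH="100%"><TR><TD><FONT SIZE=-1>1</FONT></TD></TR></TABLE>
<HR>
<TABLE WIDTH="100%"><TR><TD><B>String 1</B></TD></TR></TABLE>
<HR>
<font size="+1">String 4</font><BR>
"""

parser = FontAfterHrParser()
parser.feed(html)
parser.close()
print(parser.results)  # ['String 4']
```

Because the parser normalizes tag case and ignores whitespace between tags, small differences between pages matter less than they do with a literal regex. Nested font tags inside the target are not handled by this sketch.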

answered Jul 2, 2012 at 12:57

Jordan Dimov

solarc · Accepted Answer · 2012-07-02 13:02:06Z

0

re.findall(r'<HR>\n<font size="\+1">([^<]*)</font><BR>', html)

answered Jul 2, 2012 at 13:02

solarc
