Regular Expression HTML Tag Exclusion

Question

Yes, yes, I've weighed using an XML parser instead of regular expressions, but this seems like a simple enough case that a regex is suitable:

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)
from urllib import urlopen

tempSite = 'http://www.sumkindawebsiterighthur.com'
theTempSite = urlopen(tempSite).read()
currentTempSite = BeautifulSoup(theTempSite)
# Find every table row with valign="top" and print the first one
Email = currentTempSite.findAll('tr', valign="top")
print Email[0]

This currently prints:

<tr valign="top">
<td><p>Phone Number:</p></td>
<td>&nbsp;</td>
<td><p>706-878-8888</p></td>
</tr>

I'm trying to remove all the markup (tr, td, and ideally p as well) so that the result is:

Phone Number: 706-878-8888

My problem is that my regex excludes too much, and that the input spans multiple lines. I'm looking for an answer that outputs the result on a single line.

You don't need an XML parser if you already have a DOM with BeautifulSoup. Surely you can recursively iterate over the subnodes and concatenate the inner text of each? (I've never used BeautifulSoup)
– Cameron, Commented Jan 26, 2012 at 19:17
I'm getting an empty list (Email = []); is that the correct URL?
– juliomalegria, Commented Jan 26, 2012 at 19:18
Haha no, not the correct site. Keeping someone's information private. There's got to be a simple solution for this though.
– Hikalea, Commented Jan 26, 2012 at 19:22
+1 for @Cameron. Don't use regex for this; try a bit further with BeautifulSoup. You'll get a better result and learn "the right way" to do this sort of stuff.
– heltonbiker, Commented Jan 26, 2012 at 19:25
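
The approach suggested in the comments above (walk the parsed markup and concatenate the text) can be sketched roughly as follows. This is only an illustration under some assumptions: it is written for Python 3 and uses the standard library's html.parser as a stand-in for BeautifulSoup, and the names TextCollector, strip_markup, and row are made up for this example rather than taken from the question.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the text found between tags and ignore the tags themselves."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &nbsp; into characters
        # (here U+00A0) before they reach handle_data().
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_markup(fragment):
    """Return the text of an HTML fragment on one line (hypothetical helper)."""
    parser = TextCollector()
    parser.feed(fragment)
    parser.close()  # flush any text the parser is still buffering
    # str.split() with no arguments splits on any run of whitespace,
    # including newlines and the non-breaking space from &nbsp;.
    return " ".join(" ".join(parser.chunks).split())


row = """<tr valign="top">
<td><p>Phone Number:</p></td>
<td>&nbsp;</td>
<td><p>706-878-8888</p></td>
</tr>"""

print(strip_markup(row))
```

Because nothing in this sketch depends on the label text, the same helper should also work for rows with other labels, at the cost of keeping every piece of text in the row.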

Andrew Clark · Accepted Answer · 2012-01-26 19:22:49Z

If your results are really always that simple, the following regex will put 'Phone Number:' in capture group 1 and the number in capture group 2, as long as the re.DOTALL flag is set (so that . also matches newlines):

.*(Phone Number:).*?([-\d]+).*

You can then call re.sub() on your string with the replacement \1 \2.

Here is a complete example that returns what you want:

>>> import re
>>> s = """<tr valign="top">
... <td><p>Phone Number:</p></td>
... <td>&nbsp;</td>
... <td><p>706-878-8888</p></td>
... </tr>"""
>>> regex = re.compile(r'.*(Phone Number:).*?([-\d]+).*', re.DOTALL)
>>> regex.sub(r'\1 \2', s)
'Phone Number: 706-878-8888'
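
One thing to keep in mind with the re.sub() approach is that it returns the input unchanged when the pattern does not match, so a row without the label would come back with all of its markup intact. A minimal sketch of a variant that makes the no-match case explicit, assuming Python 3; the names PHONE_RE, extract_phone, and row are hypothetical and not part of the answer:

```python
import re

# Same pattern as in the answer; re.DOTALL lets "." match newlines too.
PHONE_RE = re.compile(r'.*(Phone Number:).*?([-\d]+).*', re.DOTALL)


def extract_phone(fragment):
    """Return 'Phone Number: <number>', or None if the pattern does not match."""
    match = PHONE_RE.search(fragment)
    if match is None:
        return None
    # Group 1 is the label, group 2 is the run of digits and hyphens.
    return "{} {}".format(match.group(1), match.group(2))


row = """<tr valign="top">
<td><p>Phone Number:</p></td>
<td>&nbsp;</td>
<td><p>706-878-8888</p></td>
</tr>"""

print(extract_phone(row))
print(extract_phone("<tr><td>no label here</td></tr>"))
```

Returning None lets the caller decide what to do with rows that do not contain a phone number, instead of silently passing the raw HTML along.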
