Delete rest of HTML file after some text

Question

I am scraping an HTML file using BeautifulSoup in Python. I want to delete everything that comes after a particular piece of text.

Ex:

<div class="content">

<p> Page 1 </p>
<p> Page 2 </p>
<p> Page 3 </p>
<p> Page 4 </p>
<p> Page 5 </p>

</div>

I want to delete everything after Page 3, so that the result is:

<div class="content">

<p> Page 1 </p>
<p> Page 2 </p>
<p> Page 3 </p>

</div>

I have tried the following:

p = soup.findAll('p')
if len(p) > 3 :
   d = p[3]
   while d:
       e = d.next
       d.extract()
       d = e

Replacing d.extract() with del(d) does not work either. Please help.

Exactly how do you want to delete this? Just that section, or everything down the rest of the page, including closing tags?
– Spencer Rathbun, Commented Apr 27, 2011 at 19:48
The rest of the HTML page, but I want to keep the closing tags.
– vikesh, Commented Apr 27, 2011 at 19:51

Brian O'Dell · Accepted Answer · 2011-04-27 20:06:54Z

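Try this. It collects every <p> tag and removes them from the end until only the first three remain:

p = soup.findAll('p')
# Pop paragraphs off the end of the list and detach each one from the tree
# until only the first three are left.
while len(p) > 3:
    last_p = p.pop()
    last_p.extract()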
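
The loop above goes by position (the first three paragraphs), while the question asks to cut after a given piece of text and keep the closing tags. A minimal sketch of that variant follows, assuming the bs4 package and the sample markup from the question. The helper name truncate_after and the marker string are hypothetical, chosen only for illustration. It finds the first <p> containing the marker, then removes the following siblings of that tag and of each of its ancestors, so the enclosing elements stay in the tree.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = """
<div class="content">
<p> Page 1 </p>
<p> Page 2 </p>
<p> Page 3 </p>
<p> Page 4 </p>
<p> Page 5 </p>
</div>
"""


def truncate_after(soup, marker):
    """Hypothetical helper: drop everything after the first <p> containing marker.

    Returns True if the marker was found, False otherwise.
    """
    anchor = soup.find(lambda tag: tag.name == "p" and marker in tag.get_text())
    if anchor is None:
        return False
    node = anchor
    # Walk up from the anchor; at each level remove the nodes that follow.
    # Ancestors themselves are kept, so their closing tags survive.
    while node is not None and node.parent is not None:
        # Copy to a list first, since extracting while iterating the live
        # generator would cut the walk short.
        for sibling in list(node.next_siblings):
            sibling.extract()
        node = node.parent
    return True


soup = BeautifulSoup(html, "html.parser")
found = truncate_after(soup, "Page 3")
remaining = [p.get_text(strip=True) for p in soup.find_all("p")]
print(remaining)
```

Because whole siblings are removed rather than only <p> tags, any other elements that follow the marker paragraph are dropped as well, which appears to match the "rest of the html page" request in the comments.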
