How can I remove everything after specific text in HTML, using Python and BeautifulSoup 4?

Question

I'm trying to scrape Wikipedia. I want to keep only the desired data and discard everything unnecessary, such as the See also and References sections.

<h2>
     <span class="mw-headline" id="See_also">See also</span>
</h2>
<ul>
     <li><a href="/wiki/List_of_adaptations_of_works_by_Stephen_King" title="List of adaptations of works by Stephen King">List of adaptations of works by Stephen King</a></li>
     <li><a href="/wiki/Castle_Rock_(Stephen_King)" title="Castle Rock (Stephen King)">Castle Rock (Stephen King)</a></li>
     <li><a href="/wiki/Charles_Scribner%27s_Sons" title="Charles Scribner&#39;s Sons">Charles Scribner's Sons</a> (aka Scribner)</li>
     <li><a href="/wiki/Derry_(Stephen_King)" title="Derry (Stephen King)">Derry (Stephen King)</a></li>
     <li><a href="/wiki/Dollar_Baby" title="Dollar Baby">Dollar Baby</a></li>
     <li><a href="/wiki/Jerusalem%27s_Lot_(Stephen_King)" title="Jerusalem&#39;s Lot (Stephen King)">Jerusalem's Lot (Stephen King)</a></li>
     <li><i><a href="/wiki/Haven_(TV_series)" title="Haven (TV series)">Haven</a></i></li>
</ul>

In the HTML above, if I find "See also" in an h2 tag, I want to delete everything that follows it (the unordered list, in this case).

It would be simpler to get it as text, find the position, and slice it with html = html[:position].
– furas, Commented May 13, 2021 at 10:05
With BeautifulSoup or lxml you can use extract() to remove an element, so you would need to remove them one by one.
– furas, Commented May 13, 2021 at 10:07
Maybe you should use a better method to get only the desired data instead of removing the other data.
– furas, Commented May 13, 2021 at 10:08
The suggestion to use position logic is good, but it is not efficient.
– Noob, Commented May 13, 2021 at 12:11
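
For reference, the text-slicing idea from the first comment could look roughly like the sketch below. The function name truncate_before_heading and the default marker string are assumptions made for illustration; the marker is taken from the id attribute in the question's HTML. The sketch returns the input unchanged when the marker or an enclosing <h2 is not found.

```python
def truncate_before_heading(html: str, marker: str = 'id="See_also"') -> str:
    """Return html cut just before the <h2> that contains marker (hypothetical helper)."""
    marker_pos = html.find(marker)
    if marker_pos == -1:
        return html  # marker absent: leave the input unchanged
    # Walk backwards from the marker to the start of the enclosing <h2 tag.
    h2_pos = html.rfind("<h2", 0, marker_pos)
    if h2_pos == -1:
        return html  # no heading found before the marker
    return html[:h2_pos]


sample = (
    '<div>This I want to keep</div>\n'
    '<h2><span class="mw-headline" id="See_also">See also</span></h2>\n'
    '<ul><li>Dollar Baby</li></ul>\n'
)
result = truncate_before_heading(sample)
print(result)
```

Note that slicing raw text can leave parent tags unclosed when the heading is nested inside other elements, so the result may need to be re-parsed before further use.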

Andrej Kesely · Accepted Answer · 2021-05-13 11:08:53Z

You can use a CSS selector with the ~ (general sibling) combinator to select the elements to extract:

from bs4 import BeautifulSoup

txt = '''
<div>This I want to keep</div>
<h2>
     <span class="mw-headline" id="See_also">See also</span>
</h2>
<ul>
     <li><a href="/wiki/List_of_adaptations_of_works_by_Stephen_King" title="List of adaptations of works by Stephen King">List of adaptations of works by Stephen King</a></li>
     <li><a href="/wiki/Castle_Rock_(Stephen_King)" title="Castle Rock (Stephen King)">Castle Rock (Stephen King)</a></li>
     <li><a href="/wiki/Charles_Scribner%27s_Sons" title="Charles Scribner&#39;s Sons">Charles Scribner's Sons</a> (aka Scribner)</li>
     <li><a href="/wiki/Derry_(Stephen_King)" title="Derry (Stephen King)">Derry (Stephen King)</a></li>
     <li><a href="/wiki/Dollar_Baby" title="Dollar Baby">Dollar Baby</a></li>
     <li><a href="/wiki/Jerusalem%27s_Lot_(Stephen_King)" title="Jerusalem&#39;s Lot (Stephen King)">Jerusalem's Lot (Stephen King)</a></li>
     <li><i><a href="/wiki/Haven_(TV_series)" title="Haven (TV series)">Haven</a></i></li>
</ul>
'''

soup = BeautifulSoup(txt, 'html.parser')

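# Select every sibling that follows the "See also" <h2>, plus the <h2> itself,
# and remove each one from the tree.
for tag in soup.select('h2:contains("See also") ~ *, h2:contains("See also")'):
    tag.extract()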

print(soup)

Prints:

<div>This I want to keep</div>

NOTE: Newer versions of bs4 (which delegates CSS selection to Soup Sieve) use :-soup-contains instead of :contains, which is deprecated.
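
As a rough sketch of the same approach written with the newer pseudo-class, assuming an installation where Soup Sieve supports :-soup-contains. The function name remove_from_heading and its heading_text parameter are hypothetical, introduced only for this example:

```python
from bs4 import BeautifulSoup


def remove_from_heading(html: str, heading_text: str = "See also") -> str:
    """Remove the matching <h2> and every sibling after it (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # First part: siblings that follow the heading. Second part: the heading itself.
    selector = (
        f'h2:-soup-contains("{heading_text}") ~ *, '
        f'h2:-soup-contains("{heading_text}")'
    )
    for tag in soup.select(selector):
        tag.extract()
    return str(soup)


sample = (
    '<div>This I want to keep</div>'
    '<h2><span class="mw-headline" id="See_also">See also</span></h2>'
    '<ul><li>Dollar Baby</li></ul>'
)
cleaned = remove_from_heading(sample)
print(cleaned)
```

Only siblings of the heading are removed, so this relies on the content to discard sharing a parent with the h2, as it does in the question's HTML.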

2 Comments

Noob Over a year ago

This works. Can you explain what exactly for tag in soup.select('h2:contains("See also") ~ *, h2:contains("See also")'): does?

Andrej Kesely Over a year ago

@HemantSirsat h2:contains("See also") ~ * is a CSS selector that selects all sibling tags that are preceded by an <h2> element containing "See also". The second part, after the comma, selects that <h2> itself.
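
To make the explanation concrete, here is a small sketch showing which tags the ~ combinator picks up. The sample HTML is made up for illustration, and the plain selector h2 ~ * is used without the text filter so that it does not depend on the Soup Sieve version. For comparison, find_next_siblings() is called on the heading and gives the same tags:

```python
from bs4 import BeautifulSoup

html = (
    '<p>before</p>'
    '<h2><span>See also</span></h2>'
    '<ul><li>item</li></ul>'
    '<p>after</p>'
)
soup = BeautifulSoup(html, "html.parser")

# "h2 ~ *" matches any tag that shares a parent with an <h2> and comes after it.
via_css = [tag.name for tag in soup.select("h2 ~ *")]

# The same set, found by locating the heading and walking its later siblings.
heading = soup.find(lambda tag: tag.name == "h2" and "See also" in tag.get_text())
names = [tag.name for tag in heading.find_next_siblings()]

print(via_css)
print(names)
```

The <p> before the heading is not matched, and neither is the nested <li>, because it is a child of the <ul> and not a sibling of the <h2>. It still disappears along with its parent when the <ul> is extracted.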
