I'm trying to scrape Wikipedia. I want to keep only the data I need and discard everything unnecessary, such as See also, References, etc.

<h2>
     <span class="mw-headline" id="See_also">See also</span>
</h2>
<ul>
     <li><a href="/wiki/List_of_adaptations_of_works_by_Stephen_King" title="List of adaptations of works by Stephen King">List of adaptations of works by Stephen King</a></li>
     <li><a href="/wiki/Castle_Rock_(Stephen_King)" title="Castle Rock (Stephen King)">Castle Rock (Stephen King)</a></li>
     <li><a href="/wiki/Charles_Scribner%27s_Sons" title="Charles Scribner&#39;s Sons">Charles Scribner's Sons</a> (aka Scribner)</li>
     <li><a href="/wiki/Derry_(Stephen_King)" title="Derry (Stephen King)">Derry (Stephen King)</a></li>
     <li><a href="/wiki/Dollar_Baby" title="Dollar Baby">Dollar Baby</a></li>
     <li><a href="/wiki/Jerusalem%27s_Lot_(Stephen_King)" title="Jerusalem&#39;s Lot (Stephen King)">Jerusalem's Lot (Stephen King)</a></li>
     <li><i><a href="/wiki/Haven_(TV_series)" title="Haven (TV series)">Haven</a></i></li>
</ul>

As shown in the HTML above, if I find See also in an h2 tag, I want to delete that heading and everything that follows it (the unordered list in this case).

  • It would be simpler to get the page as text, find the position of the heading, and slice it with html = html[:position]. Commented May 13, 2021 at 10:05
  • With BeautifulSoup or lxml you can use extract() to remove an element, but you would need to remove the elements one by one. Commented May 13, 2021 at 10:07
  • Maybe you should use a better method that gets only the desired data instead of removing the other data. Commented May 13, 2021 at 10:08
  • The suggestion to use position logic is good, but it is not efficient. Commented May 13, 2021 at 12:11
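
The slicing idea from the first comment could look roughly like the sketch below. It assumes the heading is marked by the id="See_also" attribute shown in the question and that the nearest preceding <h2 opens that heading; the names marker and position are illustrative only.

```python
html = '''<div>This I want to keep</div>
<h2>
     <span class="mw-headline" id="See_also">See also</span>
</h2>
<ul><li>Dollar Baby</li></ul>'''

# Locate the "See also" heading by its id attribute (assumption: the id is present)
marker = html.find('id="See_also"')
if marker != -1:
    # Back up to the start of the <h2> tag that encloses the marker
    position = html.rfind('<h2', 0, marker)
    if position != -1:
        # Keep only the markup that comes before the heading
        html = html[:position]

print(html)
```

This treats the page as a plain string, so it only works when the markup matches the assumed pattern exactly.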

1 Answer


You can use a CSS selector with ~ (the subsequent-sibling combinator) to select the elements to extract:

from bs4 import BeautifulSoup

txt = '''
<div>This I want to keep</div>
<h2>
     <span class="mw-headline" id="See_also">See also</span>
</h2>
<ul>
     <li><a href="/wiki/List_of_adaptations_of_works_by_Stephen_King" title="List of adaptations of works by Stephen King">List of adaptations of works by Stephen King</a></li>
     <li><a href="/wiki/Castle_Rock_(Stephen_King)" title="Castle Rock (Stephen King)">Castle Rock (Stephen King)</a></li>
     <li><a href="/wiki/Charles_Scribner%27s_Sons" title="Charles Scribner&#39;s Sons">Charles Scribner's Sons</a> (aka Scribner)</li>
     <li><a href="/wiki/Derry_(Stephen_King)" title="Derry (Stephen King)">Derry (Stephen King)</a></li>
     <li><a href="/wiki/Dollar_Baby" title="Dollar Baby">Dollar Baby</a></li>
     <li><a href="/wiki/Jerusalem%27s_Lot_(Stephen_King)" title="Jerusalem&#39;s Lot (Stephen King)">Jerusalem's Lot (Stephen King)</a></li>
     <li><i><a href="/wiki/Haven_(TV_series)" title="Haven (TV series)">Haven</a></i></li>
</ul>
'''

soup = BeautifulSoup(txt, 'html.parser')

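# Select the "See also" <h2> and every sibling element that follows it,
# then remove each one from the tree
for tag in soup.select('h2:contains("See also") ~ *, h2:contains("See also")'):
    tag.extract()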

print(soup)

Prints:

<div>This I want to keep</div>

NOTE: Newer versions of bs4 (through its soupsieve dependency) deprecate :contains; use :-soup-contains instead.
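
A minimal sketch of the same approach with the newer pseudo-class, assuming a bs4 installation whose soupsieve supports :-soup-contains. The shortened HTML, the extra References section, and the names selector and remaining are made up for illustration.

```python
from bs4 import BeautifulSoup

html = '''
<div>This I want to keep</div>
<h2><span class="mw-headline" id="See_also">See also</span></h2>
<ul><li>Dollar Baby</li></ul>
<h2><span class="mw-headline" id="References">References</span></h2>
<ol><li>Some reference</li></ol>
'''

soup = BeautifulSoup(html, 'html.parser')

# The "See also" heading itself plus every sibling element after it
selector = 'h2:-soup-contains("See also") ~ *, h2:-soup-contains("See also")'
for tag in soup.select(selector):
    tag.decompose()  # remove the element and its contents from the tree

remaining = soup.get_text(strip=True)
print(remaining)
```

Because ~ matches all later siblings, the References heading and its list are removed along with the See also section.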


2 Comments

This works. Can you explain what exactly for tag in soup.select('h2:contains("See also") ~ *, h2:contains("See also")'): does?
@HemantSirsat h2:contains("See also") ~ * is a CSS selector that selects all sibling elements that follow an <h2> element containing "See also". The second part, h2:contains("See also"), selects that <h2> itself.
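
To see what each half of the selector matches, a small sketch such as the following can help. The one-line HTML is invented for the demonstration, and :-soup-contains is used on the assumption that a recent bs4/soupsieve is installed.

```python
from bs4 import BeautifulSoup

html = '<p>before</p><h2>See also</h2><ul><li>item</li></ul><p>after</p>'
soup = BeautifulSoup(html, 'html.parser')

# Siblings that come after the matching <h2>; nested tags such as <li> are not siblings
following = [t.name for t in soup.select('h2:-soup-contains("See also") ~ *')]
# The matching <h2> itself
heading = [t.name for t in soup.select('h2:-soup-contains("See also")')]

print(following)  # ['ul', 'p']
print(heading)    # ['h2']
```

The <p>before</p> element is not matched because ~ only looks at siblings that come after the heading.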
