How can I remove everything except a selected tag from an HTML file with Python?

Question

If I have this division:

<div class="wikicontent" id="wikicontentid">

How can I use Python to print just that tag and its contents?

m.wasowski · Accepted Answer · 2014-03-25 23:01:57Z

import bs4

# html_content is the page source as a string
soup = bs4.BeautifulSoup(html_content, "html.parser")
result = soup.find("div", {"class": "wikicontent", "id": "wikicontentid"})
print(result)
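
Note that find returns None when nothing matches, so it is worth checking the result before using it. Here is a minimal sketch of that check, assuming the page source is already in a string. The helper name extract_div and the sample html_content are made up for illustration.

```python
import bs4


def extract_div(html_content):
    """Return the matching div as a string, or None if it is absent."""
    soup = bs4.BeautifulSoup(html_content, "html.parser")
    result = soup.find("div", {"class": "wikicontent", "id": "wikicontentid"})
    # find() returns None when no tag matches
    return None if result is None else str(result)


# Hypothetical input, not taken from the question
html_content = (
    '<html><body><p>intro</p>'
    '<div class="wikicontent" id="wikicontentid"><p>kept</p></div>'
    '<p>outro</p></body></html>'
)
print(extract_div(html_content))
print(extract_div("<p>no such div here</p>"))
```

The first call prints the div with its nested paragraph; the second prints None because the input has no matching div.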


ramcdougal · Accepted Answer · 2014-03-25 23:14:41Z

Use the Beautiful Soup module.

>>> import bs4

Suppose we have a document that contains a number of divs, some of which match the class, some of which match the id, and one that matches both:

>>> html = '<div class="wikicontent">blah1</div><div class="wikicontent" id="wikicontentid">blah2</div><div id="wikicontentid">blah3</div>'

We can parse with Beautiful Soup:

>>> soup = bs4.BeautifulSoup(html, 'html.parser')

To find all the divs:

>>> soup.find_all('div')
[<div class="wikicontent">blah1</div>, <div class="wikicontent" id="wikicontentid">blah2</div>, <div id="wikicontentid">blah3</div>]

This is a bs4.element.ResultSet containing three bs4.element.Tag objects, which you can access by index with the [] operator.
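
As a rough illustration of working with that result, the sketch below indexes and iterates over it like an ordinary list. It reuses the same three-div document as above; the variable names divs, second and texts are only for this example.

```python
import bs4

html = (
    '<div class="wikicontent">blah1</div>'
    '<div class="wikicontent" id="wikicontentid">blah2</div>'
    '<div id="wikicontentid">blah3</div>'
)
soup = bs4.BeautifulSoup(html, "html.parser")

divs = soup.find_all("div")
second = divs[1]  # index into the ResultSet like a list
print(len(divs))
print(second)
print(second.get("id"))  # read an attribute from a Tag

# Iterate to collect the text inside each div
texts = [d.get_text() for d in divs]
print(texts)
```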

To find everything matching a given id, use the id keyword argument:

>>> soup.find_all(id='wikicontentid')
[<div class="wikicontent" id="wikicontentid">blah2</div>, <div id="wikicontentid">blah3</div>]

To match a class, use the class_ keyword argument (note the underscore):

>>> soup.find_all(class_='wikicontent')
[<div class="wikicontent">blah1</div>, <div class="wikicontent" id="wikicontentid">blah2</div>]

You can combine these selectors in a single call:

>>> soup.find_all('div', class_='wikicontent', id='wikicontentid')
[<div class="wikicontent" id="wikicontentid">blah2</div>]
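
If you prefer CSS selector syntax, Beautiful Soup also offers select and select_one. This is an alternative to the keyword filters shown above, sketched here on the same sample document:

```python
import bs4

html = (
    '<div class="wikicontent">blah1</div>'
    '<div class="wikicontent" id="wikicontentid">blah2</div>'
    '<div id="wikicontentid">blah3</div>'
)
soup = bs4.BeautifulSoup(html, "html.parser")

# "div.wikicontent#wikicontentid" means: a div with that class AND that id
matches = soup.select("div.wikicontent#wikicontentid")
first = soup.select_one("div.wikicontent#wikicontentid")

print(len(matches))
print(first)
```

select returns every match, while select_one returns only the first match, or None if there is none.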

If you know there is only one match or if you are only interested in the first match, use soup.find:

>>> soup.find(class_='wikicontent', id='wikicontentid')
<div class="wikicontent" id="wikicontentid">blah2</div>

As before, this is not a string,

>>> type(soup.find('div', class_='wikicontent', id='wikicontentid'))
<class 'bs4.element.Tag'>

but you can turn it into one:

>>> str(soup.find('div', class_='wikicontent', id='wikicontentid'))
'<div class="wikicontent" id="wikicontentid">blah2</div>'
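
Putting this together for the original question, here is a hedged sketch that reads an HTML file and writes out only the selected div. The helper keep_only and the file names are hypothetical, and the demo creates its own temporary input file so it can run on its own.

```python
import os
import tempfile

import bs4


def keep_only(src_path, dst_path):
    """Write only the selected div from src_path to dst_path (hypothetical helper)."""
    with open(src_path, encoding="utf-8") as f:
        soup = bs4.BeautifulSoup(f, "html.parser")
    tag = soup.find("div", class_="wikicontent", id="wikicontentid")
    if tag is None:
        raise ValueError("no matching div found")
    with open(dst_path, "w", encoding="utf-8") as f:
        f.write(str(tag))  # the tag and its contents, nothing else


# Demo using a temporary directory and made-up file names
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "page.html")
    dst = os.path.join(tmp, "only_div.html")
    with open(src, "w", encoding="utf-8") as f:
        f.write('<p>before</p>'
                '<div class="wikicontent" id="wikicontentid">blah2</div>'
                '<p>after</p>')
    keep_only(src, dst)
    with open(dst, encoding="utf-8") as f:
        output = f.read()

print(output)
```

If you only want the text without the markup, tag.get_text() returns it as a plain string.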

cheekybastard · Accepted Answer · 2014-03-26 01:52:01Z

To download the page source, use requests (http://docs.python-requests.org/en/latest/); to parse the HTML, use lxml (http://lxml.de/).

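import requests
import lxml.html

# Download the page and parse it into an element tree
dom = lxml.html.fromstring(requests.get('http://theurlyourscraping.com').content)
# xpath() already returns a list; /text() selects the div's direct text nodes
wikicontent = dom.xpath('//div[@class="wikicontent"]/text()')
print(wikicontent)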
