If I have this division:

<div class="wikicontent" id="wikicontentid">

How can I use Python to print just that tag and its contents?

3 Answers

You can use BeautifulSoup:

import bs4

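# Parse the page source; naming the parser explicitly avoids a bs4 warning
soup = bs4.BeautifulSoup(html_content, "html.parser")
result = soup.find("div", {"class": "wikicontent", "id": "wikicontentid"})
print(result)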
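
For a version that runs on its own, here is a minimal sketch. The `html_content` string is made-up sample markup standing in for the real page, and `html.parser` is only one possible parser choice (the one bundled with Python). Since `find` returns `None` when nothing matches, the sketch checks for that before printing.

```python
import bs4

# Hypothetical sample markup standing in for the real page source
html_content = (
    '<html><body>'
    '<div class="other">skip me</div>'
    '<div class="wikicontent" id="wikicontentid"><p>Hello</p></div>'
    '</body></html>'
)

soup = bs4.BeautifulSoup(html_content, "html.parser")
result = soup.find("div", {"class": "wikicontent", "id": "wikicontentid"})

if result is None:
    # find() returns None when no element matches
    print("no matching div found")
else:
    # Printing a Tag writes out the tag itself plus everything nested in it
    print(result)
```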

Use the Beautiful Soup module.

>>> import bs4

Suppose we have a document that contains a number of divs, some of which match the class, some of which match the id, and one that matches both:

>>> html = '<div class="wikicontent">blah1</div><div class="wikicontent" id="wikicontentid">blah2</div><div id="wikicontentid">blah3</div>'

We can parse with Beautiful Soup:

>>> soup = bs4.BeautifulSoup(html, 'html.parser')

To find all the divs:

>>> soup.find_all('div')
[<div class="wikicontent">blah1</div>, <div class="wikicontent" id="wikicontentid">blah2</div>, <div id="wikicontentid">blah3</div>]

This is a bs4.element.ResultSet containing three bs4.element.Tag objects, which you can extract by index with the [] operator.
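
As a rough illustration of that indexing, the sketch below pulls individual tags out of the result. The variable names `divs`, `second`, and `div_ids` are made up for this example, and `html.parser` is an assumed parser choice.

```python
import bs4

html = (
    '<div class="wikicontent">blah1</div>'
    '<div class="wikicontent" id="wikicontentid">blah2</div>'
    '<div id="wikicontentid">blah3</div>'
)
soup = bs4.BeautifulSoup(html, 'html.parser')

divs = soup.find_all('div')

# A ResultSet behaves like a list, so it supports indexing and iteration
second = divs[1]
second_text = second.get_text()

# Tag.get() returns None when the attribute is missing
div_ids = [d.get('id') for d in divs]

print(second)
print(div_ids)
```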

To find everything matching a given id, use the id keyword argument:

>>> soup.find_all(id='wikicontentid')
[<div class="wikicontent" id="wikicontentid">blah2</div>, <div id="wikicontentid">blah3</div>]

To match a class, use the class_ keyword argument (note the underscore):

>>> soup.find_all(class_='wikicontent')
[<div class="wikicontent">blah1</div>, <div class="wikicontent" id="wikicontentid">blah2</div>]

You can combine these selectors in a single call:

>>> soup.find_all('div', class_='wikicontent', id='wikicontentid')
[<div class="wikicontent" id="wikicontentid">blah2</div>]
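
The same combined match can also be spelled in other ways. The sketch below compares the keyword form with the `attrs` dictionary form and with a CSS selector passed to `select`. It assumes a reasonably recent Beautiful Soup 4 install in which `select` handles compound selectors such as `div.wikicontent#wikicontentid`; the variable names are made up for the example.

```python
import bs4

html = (
    '<div class="wikicontent">blah1</div>'
    '<div class="wikicontent" id="wikicontentid">blah2</div>'
    '<div id="wikicontentid">blah3</div>'
)
soup = bs4.BeautifulSoup(html, 'html.parser')

# Keyword arguments, as in the call above
by_keywords = soup.find_all('div', class_='wikicontent', id='wikicontentid')

# The same filters passed as an attrs dictionary
by_attrs = soup.find_all('div', attrs={'class': 'wikicontent', 'id': 'wikicontentid'})

# A CSS selector: a div with that class and that id
by_css = soup.select('div.wikicontent#wikicontentid')

matches = [str(tag) for tag in by_css]
print(matches)
```

Which form to use is mostly a matter of taste; the CSS form tends to be shorter when the filter grows more complex.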

If you know there is only one match or if you are only interested in the first match, use soup.find:

>>> soup.find(class_='wikicontent', id='wikicontentid')
<div class="wikicontent" id="wikicontentid">blah2</div>

As before, this is not a string,

>>> type(soup.find('div', class_='wikicontent', id='wikicontentid'))
<class 'bs4.element.Tag'>

but you can turn it into one:

>>> str(soup.find('div', class_='wikicontent', id='wikicontentid'))
'<div class="wikicontent" id="wikicontentid">blah2</div>'
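
Depending on what "the tag and its contents" should mean, a few different string forms may be useful. The sketch below is an illustration only: the sample markup (with a nested `<p>`) and the variable names are made up, and `html.parser` is an assumed parser choice.

```python
import bs4

html = '<div class="wikicontent" id="wikicontentid"><p>blah2</p></div>'
soup = bs4.BeautifulSoup(html, 'html.parser')
tag = soup.find('div', class_='wikicontent', id='wikicontentid')

outer_html = str(tag)               # the tag itself plus everything inside it
inner_html = tag.decode_contents()  # only the markup inside the tag
text_only = tag.get_text()          # only the text, with all markup stripped

print(outer_html)
print(inner_html)
print(text_only)
print(tag.prettify())               # the outer markup again, indented for reading
```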

To download the page source, use Requests (http://docs.python-requests.org/en/latest/); to parse the HTML, use lxml (http://lxml.de/).

import requests
import lxml.html

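# Download the page and parse it into an element tree
dom = lxml.html.fromstring(requests.get('http://theurlyourscraping.com').content)
# Collect the text directly inside every div whose class is exactly "wikicontent"
wikicontent = dom.xpath('//div[@class="wikicontent"]/text()')
print(wikicontent)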
