Extract data from HTML content

Question

I want to download some HTML pages and extract information from them. Each HTML page contains this table:

<table class="sobi2Details" style='background-image: url(http://www.imd.ir/components/com_sobi2/images/backgrounds/grey.gif);border-style: solid; border-color: #808080' >
    <tr>
        <td><h1>Dr Jhon Doe</h1></td>
    </tr>
    <tr>
        <td></td>
    </tr>
    <tr>
        <td></td>
    </tr>
    <tr>
        <td>
          <div id="sobi2outer">
             <br/>
             <span id="sobi2Details_field_name" ><span id="sobi2Listing_field_name_label">name:</span>Jhon</span><br/>
             <span id="sobi2Details_field_family" ><span id="sobi2Listing_field_family_label">family:</span> Doe</span><br/>
             <span id="sobi2Details_field_tel1" ><span id="sobi2Listing_field_tel1_label">tel:</span> 33727464</span><br/>
          </div>
        </td>
    </tr>
</table>

I want to access the name (Jhon), family (Doe) and tel (33727464). I've used Beautiful Soup to access these span tags by id:

name=soup.find(id="sobi2Details_field_name").__str__()
family=soup.find(id="sobi2Details_field_family").__str__()
tel=soup.find(id="sobi2Details_field_tel1").__str__()

but I don't know how to extract the data inside these tags. I tried to use the children and contents attributes, but when I use the result as a tag it is None:

name=soup.find(id="sobi2Details_field_name")
for child in name.children:
    pass  # process content inside

but I get this error:

'NoneType' object has no attribute 'children'

whereas when I use str() on it, it is not None! Any idea?

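Edit: My final solution

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(page, from_encoding="utf-8")
name_span = str(soup.find(id="sobi2Details_field_name"))
# Keep what follows the last ':' and drop the closing tag
name = name_span.split(':')[-1]
result = re.sub('</span>', '', name)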

What version of Beautiful Soup are you using? What does type(name) return? For me it returned <class 'bs4.element.Tag'>. I just installed BS4 with easy_install on Python 2.7.2 on OS X 10.8.
– Martin Kenny, Commented Jul 28, 2012 at 13:56
I've installed BS4 on Python 2.6. I don't know what type(name) is; I haven't used it!
– Asma Gheisari, Commented Jul 28, 2012 at 14:13
type(value) will return the type of value, so you could use it to help troubleshoot your problem. If you put print type(name) after the name=soup.find(...) line, you will be able to tell what type BS has returned for the result of the find method.
– Martin Kenny, Commented Jul 28, 2012 at 14:21
I'm off to bed, but note that when I tested your code, I saved your HTML fragment to a file and loaded it into BeautifulSoup from there. You mentioned that the HTML page includes this <table> element; maybe there's something else in the rest of the HTML that's throwing BS off. Perhaps try starting with just this fragment and check if it works; then work your way out, adding more of the complete HTML until it fails.
– Martin Kenny, Commented Jul 28, 2012 at 14:34
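
The troubleshooting suggested in these comments can be sketched roughly as below. This is only an illustration in Python 3 with bs4, not a diagnosis of the asker's actual page: it shows that find() gives back a Tag when the id exists and None when it does not, which is the situation that produces the 'NoneType' error. The id "no_such_id" and the helper children_of are made up for the example.

```python
from bs4 import BeautifulSoup

# One span copied from the question's fragment.
html = (
    '<span id="sobi2Details_field_name">'
    '<span id="sobi2Listing_field_name_label">name:</span>Jhon</span>'
)
soup = BeautifulSoup(html, "html.parser")

found = soup.find(id="sobi2Details_field_name")
missing = soup.find(id="no_such_id")  # hypothetical id that is not in the page

print(type(found))    # a bs4 Tag
print(type(missing))  # NoneType: calling .children on this raises the error


def children_of(soup, element_id):
    """Return the children of the element with this id, failing loudly if absent.

    Hypothetical helper, not part of bs4.
    """
    tag = soup.find(id=element_id)
    if tag is None:
        raise LookupError("no element with id %r in this document" % element_id)
    return list(tag.children)


print(children_of(soup, "sobi2Details_field_name"))
```

Checking for None right after find() makes it clear which page or id is the problem, instead of failing later on an attribute access.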

Balthazar Rouberol · Accepted Answer · 2012-07-28 15:13:42Z

3

I found a couple of ways to do it.

from bs4 import BeautifulSoup

# path_to_html_file is a placeholder for the path to the saved page
with open(path_to_html_file) as html_file:
    soup = BeautifulSoup(html_file)

name_span = soup.find(id="sobi2Details_field_name")

# First way: split the text on ':'
# This only works because there's always a ':' before the target field
name = name_span.text.split(':')[1]

# Second way: iterate over the span's strings
# The string you are looking for is always the last one
name = list(name_span.strings)[-1]

# Third way: step through the 'next' elements (label tag, label string, value)
name = name_span.next_element.next_element.next_element  # ugly; you can wrap it in a function

Tell me if it helps.
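
For reference, here is a self-contained sketch that runs the three approaches against the spans from the question, without needing a file on disk. It assumes Python 3 and bs4 with the built-in "html.parser"; the helper name field_value and the HTML constant are introduced only for this example.

```python
from bs4 import BeautifulSoup

# The three spans from the question, inlined so the example runs on its own.
HTML = """
<div id="sobi2outer">
  <span id="sobi2Details_field_name" ><span id="sobi2Listing_field_name_label">name:</span>Jhon</span><br/>
  <span id="sobi2Details_field_family" ><span id="sobi2Listing_field_family_label">family:</span> Doe</span><br/>
  <span id="sobi2Details_field_tel1" ><span id="sobi2Listing_field_tel1_label">tel:</span> 33727464</span><br/>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
name_span = soup.find(id="sobi2Details_field_name")

# First way: split the visible text on the first ':'.
by_split = name_span.get_text().split(":", 1)[1].strip()

# Second way: the value is the last string inside the span.
by_strings = list(name_span.strings)[-1].strip()

# Third way: label tag -> label string -> value string.
by_next = str(name_span.next_element.next_element.next_element).strip()


def field_value(soup, field):
    """Return the text after the label for one field, or None if it is absent.

    Hypothetical helper built on the second way.
    """
    span = soup.find(id="sobi2Details_field_" + field)
    if span is None:
        return None
    return list(span.strings)[-1].strip()


name = field_value(soup, "name")
family = field_value(soup, "family")
tel = field_value(soup, "tel1")
print(name, family, tel)  # Jhon Doe 33727464
```

The strip() calls are there because the family and tel values in the sample markup start with a space.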


1 Comment

Asma Gheisari Over a year ago

Thank you. Your first way sounds really good, and it worked. But my HTML contains Unicode, and I get an error when I test the code on it. Do you have any suggestions?
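
The comment does not say which error occurred, so the following is only a guess at a workable pattern, written for Python 3 with bs4: pass the raw bytes to BeautifulSoup together with from_encoding (the asker's own final solution uses from_encoding="utf-8"), keep the extracted values as text, and encode explicitly only when writing them out. The Persian sample value is made up for the example.

```python
from bs4 import BeautifulSoup

# Made-up non-ASCII value, written with escapes so the source stays ASCII.
value = "\u0639\u0644\u06cc"

# Simulate a downloaded page: raw UTF-8 bytes, as a network read would return.
raw = (
    '<span id="sobi2Details_field_name">'
    '<span id="sobi2Listing_field_name_label">name:</span> '
    + value
    + "</span>"
).encode("utf-8")

# Tell Beautiful Soup how the bytes are encoded instead of letting it guess.
soup = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")

span = soup.find(id="sobi2Details_field_name")
name = span.get_text().split(":", 1)[1].strip()  # text, not bytes

# Encode only at the boundary, for example when writing to a file or socket.
encoded = name.encode("utf-8")
print(name == value, len(encoded))
```

If the page is not actually UTF-8, the from_encoding argument would need to match the real encoding of the downloaded bytes.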

Joey · Accepted Answer · 2012-07-28 21:22:01Z

1

If you are familiar with XPath, use lxml's etree instead:

import urllib2
from lxml import etree

opener = urllib2.build_opener()
root = etree.HTML(opener.open("myUrl").read())

print root.xpath("//span[@id='sobi2Details_field_name']/text()")[0]
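
The snippet above is Python 2 and fetches a live URL. A rough Python 3 sketch of the same XPath idea follows, run against an inline copy of the question's markup so it does not need network access. The helper xpath_field and the HTML constant are introduced for illustration; the XPath expression is the one from the answer.

```python
from lxml import etree

# Spans copied from the question, inlined instead of being downloaded.
HTML = """
<div id="sobi2outer">
  <span id="sobi2Details_field_name" ><span id="sobi2Listing_field_name_label">name:</span>Jhon</span><br/>
  <span id="sobi2Details_field_family" ><span id="sobi2Listing_field_family_label">family:</span> Doe</span><br/>
  <span id="sobi2Details_field_tel1" ><span id="sobi2Listing_field_tel1_label">tel:</span> 33727464</span><br/>
</div>
"""

root = etree.HTML(HTML)


def xpath_field(root, field):
    """Return the outer span's own text for one field, or None if not found.

    Hypothetical helper. text() selects only the text nodes that are direct
    children of the outer span, so the label inside the nested span is skipped.
    """
    hits = root.xpath("//span[@id='sobi2Details_field_%s']/text()" % field)
    return hits[0].strip() if hits else None


name = xpath_field(root, "name")
family = xpath_field(root, "family")
tel = xpath_field(root, "tel1")
print(name, family, tel)  # Jhon Doe 33727464
```

Returning None for an empty result avoids the IndexError that indexing with [0] would raise when a page lacks the field.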
