I want to download some HTML pages and extract information from them. Each HTML page contains this table:

<table class="sobi2Details" style='background-image: url(http://www.imd.ir/components/com_sobi2/images/backgrounds/grey.gif);border-style: solid; border-color: #808080' >
    <tr>
        <td><h1>Dr Jhon Doe</h1></td>
    </tr>
    <tr>
        <td></td>
    </tr>
    <tr>
        <td></td>
    </tr>
    <tr>
        <td>
          <div id="sobi2outer">
             <br/>
             <span id="sobi2Details_field_name" ><span id="sobi2Listing_field_name_label">name:</span>Jhon</span><br/>
             <span id="sobi2Details_field_family" ><span id="sobi2Listing_field_family_label">family:</span> Doe</span><br/>
             <span id="sobi2Details_field_tel1" ><span id="sobi2Listing_field_tel1_label">tel:</span> 33727464</span><br/>
          </div>
        </td>
    </tr>
</table>

I want to get the name (Jhon), family (Doe), and tel (33727464). I've used Beautiful Soup to access these span tags by id:

name=soup.find(id="sobi2Details_field_name").__str__()
family=soup.find(id="sobi2Details_field_family").__str__()
tel=soup.find(id="sobi2Details_field_tel1").__str__()

but I don't know how to extract the data inside these tags. I tried to use the children and contents attributes, but when I use the result as a tag, it turns out to be None:

name=soup.find(id="sobi2Details_field_name")
for child in name.children:
    pass  # process content inside

but I get this error:

'NoneType' object has no attribute 'children'

Yet when I call str() on it, the result is not None. Any ideas?

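Edit: My final solution

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(page, from_encoding="utf-8")
name_span = str(soup.find(id="sobi2Details_field_name"))
# Keep the part after the label's ':' and drop the closing tag
name = name_span.split(':')[-1]
result = re.sub('</span>', '', name)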
  • What version of Beautiful Soup are you using? What does type(name) return -- for me it returned <class 'bs4.element.Tag'>. I just installed BS4 with easy_install on Python 2.7.2 on OS X 10.8. Commented Jul 28, 2012 at 13:56
  • I've installed BS4 on Python 2.6. I don't know what type(name) returns; I haven't used it. Commented Jul 28, 2012 at 14:13
  • type(value) will return the type of value, so you could use it to help troubleshoot your problem. If you put print type(name) after the name=soup.find(...) line, you will be able to tell what type BS has returned for the result of the find method. Commented Jul 28, 2012 at 14:21
  • I'm off to bed, but note that when I tested your code, I saved your HTML fragment to a file, and loaded it into BeautifulSoup from there. You mentioned that the HTML page includes this <table> element; maybe there's something else in the rest of the HTML that's throwing BS off. Perhaps try starting with just this fragment; check if it works; then work your way out, adding more of the complete HTML until it fails. Commented Jul 28, 2012 at 14:34

2 Answers

I found a couple of ways to do it.

from bs4 import BeautifulSoup
soup = BeautifulSoup(open(path_to_html_file))

name_span = soup.find(id="sobi2Details_field_name")

# First way: split text over ':'
# This only works because there's always a ':' before the target field
name = name_span.text.split(':')[1]

# Second way: iterate over the span strings
# The element you look for is always the last one
name = list(name_span.strings)[-1]

# Third way: iterate over 'next' elements
name = name_span.next.next.next # you can create a function to do that, it looks ugly :)
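
The approaches above could be wrapped in a small helper, sketched here under a few assumptions: `field_value` is a hypothetical name, the markup is the fragment from the question, and the built-in "html.parser" backend is picked only to keep the example free of extra dependencies. It also guards against `find()` returning None, which is what produces the 'NoneType' error when the id is not present in the parsed page.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question, trimmed to the relevant spans.
HTML = """
<div id="sobi2outer">
  <span id="sobi2Details_field_name" ><span id="sobi2Listing_field_name_label">name:</span>Jhon</span><br/>
  <span id="sobi2Details_field_family" ><span id="sobi2Listing_field_family_label">family:</span> Doe</span><br/>
  <span id="sobi2Details_field_tel1" ><span id="sobi2Listing_field_tel1_label">tel:</span> 33727464</span><br/>
</div>
"""


def field_value(soup, field_id):
    """Hypothetical helper: return the text that follows the label span, or None."""
    span = soup.find(id=field_id)
    if span is None:
        # find() returns None when no element matches, so check before using the tag.
        return None
    # stripped_strings yields each text node with surrounding whitespace removed.
    strings = list(span.stripped_strings)
    # Assumes the label comes first and the value last; a lone label means no value.
    return strings[-1] if len(strings) > 1 else None


soup = BeautifulSoup(HTML, "html.parser")
name = field_value(soup, "sobi2Details_field_name")
family = field_value(soup, "sobi2Details_field_family")
tel = field_value(soup, "sobi2Details_field_tel1")
missing = field_value(soup, "no_such_id")
print(name, family, tel, missing)
```

If the helper returns None for an id you expect to exist, the element is probably missing from what the parser actually produced, so printing the parsed document is a reasonable next step.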

Tell me if it helps.

1 Comment

Thank you. Your first way sounds really good, and it worked. But my HTML contains Unicode, and I get an error when I test the code on it. Do you have any suggestions?

If you are familiar with XPath, use lxml's etree instead:

import urllib2
from lxml import etree

opener = urllib2.build_opener()
root = etree.HTML(opener.open("myUrl").read())

print root.xpath("//span[@id='sobi2Details_field_name']/text()")[0]
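
The snippet above is Python 2 (urllib2 and the print statement). As a rough Python 3 sketch of the same XPath idea, the example below parses an in-memory string instead of fetching a URL. The `xpath_field` helper name is hypothetical and the markup is the fragment from the question. The id is passed as an XPath variable rather than formatted into the expression.

```python
from lxml import etree

# Sample markup copied from the question, trimmed to the relevant spans.
HTML = """
<div id="sobi2outer">
  <span id="sobi2Details_field_name" ><span id="sobi2Listing_field_name_label">name:</span>Jhon</span><br/>
  <span id="sobi2Details_field_family" ><span id="sobi2Listing_field_family_label">family:</span> Doe</span><br/>
  <span id="sobi2Details_field_tel1" ><span id="sobi2Listing_field_tel1_label">tel:</span> 33727464</span><br/>
</div>
"""


def xpath_field(root, field_id):
    """Hypothetical helper: first non-blank direct text node of the span, or None."""
    # text() selects only the span's own text nodes, so the nested label is skipped.
    texts = root.xpath("//span[@id=$fid]/text()", fid=field_id)
    values = [t.strip() for t in texts if t.strip()]
    return values[0] if values else None


root = etree.HTML(HTML)
name = xpath_field(root, "sobi2Details_field_name")
family = xpath_field(root, "sobi2Details_field_family")
tel = xpath_field(root, "sobi2Details_field_tel1")
missing = xpath_field(root, "no_such_id")
print(name, family, tel, missing)
```

Returning None for an empty result avoids the IndexError that indexing `[0]` on an empty XPath result would raise.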
