3
import urllib2

from BeautifulSoup import *

resp = urllib2.urlopen("file:///D:/sample.html")

rawhtml = resp.read()

resp.close()
print rawhtml

I am using this code to get text from a html document, but it also gives me html code. What should i do to fetch only text from the html document?

1

5 Answers 5

4

Note that your example makes no use of Beautifulsoup. See the doc, and follow examples.

The following example, taken from the link above, searches the soup for <td> elements.

import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://www.icc-ccs.org/prc/piracyreport.php")
soup = BeautifulSoup(page)
for incident in soup('td', width="90%"):
    where, linebreak, what = incident.contents[:3]
    print where.strip()
    print what.strip()
    print
Sign up to request clarification or add additional context in comments.

Comments

3

The very module documentation has a way to extract all strings from a document. @ http://www.crummy.com/software/BeautifulSoup/

from BeautifulSoup import BeautifulSoup
import urllib2

resp = urllib2.urlopen("http://www.google.com")
rawhtml = resp.read()
soup = BeautifulSoup(rawhtml)

all_strings = [e for e in soup.recursiveChildGenerator() 
         if isinstance(e,unicode)])
print all_strings

Comments

1

Adapted from Tony Segaran's Programming Collective Intelligence (page 60):

def gettextonly(soup):
    v=soup.string
    if v == None:
        c=soup.contents
        resulttext=''
        for t in c:
            subtext=gettextonly(t)
            resulttext+=subtext+'\n'
        return resulttext
    else:
        return v.strip()

Example usage:

>>>from BeautifulSoup import BeautifulSoup

>>>doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
>>>''.join(doc)
'<html><head><title>Page title</title></head><body><p id="firstpara" align="center">
This is paragraph <b>one</b>.<p id="secondpara" align="blah">This is
paragraph<b>two</b>.</html>'

>>>soup = BeautifulSoup(''.join(doc))
>>>gettextonly(soup)
u'Page title\n\nThis is paragraph\none\n.\n\nThis is paragraph\ntwo\n.\n\n\n\n'

Note that the result is a single string, with text from inside different tags separated by newline (\n) characters.

If you would like to extract all of the words of the text as a list of words, you can use the following function, also adapted from Tony Segaran's Programming Collective Intelligence (pg. 61):

import re
def separatewords(text):
    splitter=re.compile('\\W*')
    return [s.lower() for s in splitter.split(text) if s!='']

Example usage:

>>>separatewords(gettextonly(soup))
[u'page', u'title', u'this', u'is', u'paragraph', u'one', u'this', u'is', 
u'paragraph', u'two']

Comments

1

There's also html2text.

Another option is to pipe it to "lynx -dump"

Comments

0

I've been using html2text package with beautiful soup to fix some problems of the package. e.g. html2text did not understand auml or ouml literals, only Auml and Ouml with uppercase first letter.

unicode_coded_entities_html = unicode(BeautifulStoneSoup(html,convertEntities=BeautifulStoneSoup.HTML_ENTITIES))
text = html2text.html2text(unicode_coded_entities_html)

html2text does conversion to markdown text syntax, so converted text can be rendered back to html format as well (of course some information will be lost in transformation).

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.