HTML to text conversion using Python

Question

import urllib2
from BeautifulSoup import *

resp = urllib2.urlopen("file:///D:/sample.html")
rawhtml = resp.read()
resp.close()
print rawhtml

I am using this code to get the text from an HTML document, but it also gives me the HTML markup. What should I do to fetch only the text from the HTML document?

You can convert HTML to text with html2text: pypi.python.org/pypi/html2text/2.35
– shahjapan, Commented Aug 29, 2010 at 12:06

gimel · Accepted Answer · 2010-08-29 06:55:29Z

4

Note that your example makes no use of BeautifulSoup. See the documentation and follow its examples.

The following example, taken from the link above, searches the soup for <td> elements.

import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://www.icc-ccs.org/prc/piracyreport.php")
soup = BeautifulSoup(page)
for incident in soup('td', width="90%"):
    where, linebreak, what = incident.contents[:3]
    print where.strip()
    print what.strip()
    print



pyfunc · Accepted Answer · 2010-08-29 07:07:09Z

3

The module's own documentation shows a way to extract all strings from a document: http://www.crummy.com/software/BeautifulSoup/

from BeautifulSoup import BeautifulSoup
import urllib2

resp = urllib2.urlopen("http://www.google.com")
rawhtml = resp.read()
soup = BeautifulSoup(rawhtml)

# Keep only the text nodes (unicode strings), dropping the tag objects.
all_strings = [e for e in soup.recursiveChildGenerator()
               if isinstance(e, unicode)]
print all_strings
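
For readers on Python 3 without BeautifulSoup installed, the same idea (walk the document and keep only the text nodes) can be sketched with the standard library's html.parser module. This is an illustrative alternative, not the approach from the BeautifulSoup documentation: the TextCollector class and html_to_strings function are hypothetical names, and skipping the contents of script and style elements is an added assumption.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect non-empty text nodes, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore whitespace-only nodes and anything inside a skipped element.
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_strings(markup):
    """Return the text nodes of an HTML string as a list of stripped strings."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.chunks
```

Joining the returned list with spaces or newlines gives a single plain-text string.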


Bryce Thomas · Accepted Answer · 2010-08-29 07:28:46Z

Adapted from Toby Segaran's Programming Collective Intelligence (page 60):

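def gettextonly(soup):
    # For a text node, or a tag whose only child is a text node, .string is that text.
    v = soup.string
    if v is None:
        # Otherwise recurse into the children and join their text with newlines.
        c = soup.contents
        resulttext = ''
        for t in c:
            subtext = gettextonly(t)
            resulttext += subtext + '\n'
        return resulttext
    else:
        return v.strip()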

Example usage:

>>> from BeautifulSoup import BeautifulSoup
>>> doc = ['<html><head><title>Page title</title></head>',
...        '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
...        '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
...        '</html>']
>>> ''.join(doc)
'<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is paragraph <b>one</b>.<p id="secondpara" align="blah">This is paragraph <b>two</b>.</html>'
>>> soup = BeautifulSoup(''.join(doc))
>>> gettextonly(soup)
u'Page title\n\nThis is paragraph\none\n.\n\nThis is paragraph\ntwo\n.\n\n\n\n'

Note that the result is a single string, with text from inside different tags separated by newline (\n) characters.
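
Because every nesting level adds its own newline, the result contains runs of empty lines. If that is unwanted, a small post-processing step can tidy it up. The following is a sketch, not part of the book's code; collapse_blank_lines is a hypothetical helper name, and the sample string is the output shown above.

```python
def collapse_blank_lines(text):
    """Strip each line and drop the ones that end up empty."""
    stripped = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in stripped if line)


sample = "Page title\n\nThis is paragraph\none\n.\n\nThis is paragraph\ntwo\n.\n\n\n\n"
print(collapse_blank_lines(sample))
```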

If you would like to extract the words of the text as a list, you can use the following function, also adapted from Toby Segaran's Programming Collective Intelligence (page 61):

import re

def separatewords(text):
    # Split on runs of non-word characters.
    splitter = re.compile(r'\W+')
    return [s.lower() for s in splitter.split(text) if s != '']

Example usage:

>>> separatewords(gettextonly(soup))
[u'page', u'title', u'this', u'is', u'paragraph', u'one', u'this', u'is',
 u'paragraph', u'two']
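
When porting this to Python 3, note that from Python 3.7 onward re.split also splits on empty matches, so a pattern that can match the empty string (such as \W*) breaks the text up between every character. A minimal Python 3 sketch, assuming a pattern that requires at least one non-word character; separate_words is a hypothetical name for the ported function:

```python
import re

# '+' requires at least one non-word character, so the pattern never
# matches the empty string and words are not split into letters.
_SPLITTER = re.compile(r"\W+")


def separate_words(text):
    """Return the lowercased words of text, dropping empty strings."""
    return [word.lower() for word in _SPLITTER.split(text) if word]
```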

Miki Tebeka · Accepted Answer · 2010-08-29 16:48:01Z

1

There's also html2text.

Another option is to pipe the HTML to "lynx -dump".


Mikael Lepistö · Accepted Answer · 2011-07-03 07:02:51Z

0

I've been using the html2text package together with BeautifulSoup to work around some of the package's problems. For example, html2text did not understand the auml or ouml entities, only Auml and Ouml with an uppercase first letter.

import html2text
from BeautifulSoup import BeautifulStoneSoup
unicode_coded_entities_html = unicode(BeautifulStoneSoup(html, convertEntities=BeautifulStoneSoup.HTML_ENTITIES))  # html: the raw HTML string
text = html2text.html2text(unicode_coded_entities_html)

html2text converts to Markdown syntax, so the converted text can be rendered back to HTML as well (though some information is lost in the transformation).
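
On Python 3, the entity-decoding step by itself can be sketched with the standard library's html.unescape, which handles both lowercase and uppercase named entities as well as numeric references. This is an assumption-laden illustration rather than a replacement for the code above: decode_entities is a hypothetical helper name, and the sketch does not check whether current html2text releases still need this pre-processing.

```python
import html


def decode_entities(markup):
    """Replace named and numeric character references with Unicode text."""
    return html.unescape(markup)


print(decode_entities("&auml;&ouml;&Auml;&Ouml;"))
```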
