I know about utilities like html2text and BeautifulSoup, but they also extract JavaScript and mix it into the text, which makes the two hard to separate.

htmlDom = BeautifulSoup(webPage)

htmlDom.findAll(text=True)

Alternatively,

from stripogram import html2text
extract = html2text(webPage)

Both of these extract all the JavaScript on the page as well, which is undesired.

I just want to extract the readable text, the kind you could copy from your browser.

4 Answers

6

If you want to avoid extracting any of the contents of script tags with BeautifulSoup,

nonscripttags = htmlDom.findAll(lambda t: t.name != 'script', recursive=False)

will do that for you, getting the root's immediate children which are non-script tags (and a separate htmlDom.findAll(recursive=False, text=True) will get strings that are immediate children of the root). You need to do this recursively; e.g., as a generator:

def nonScript(tag):
    return tag.name != 'script'

def getStrings(root):
    for s in root.childGenerator():
        if hasattr(s, 'name'):      # then it's a tag
            if s.name == 'script':  # skip it!
                continue
            for x in getStrings(s):
                yield x
        else:                       # it's a string!
            yield s

I'm using childGenerator (in lieu of findAll) so that I can just get all the children in order and do my own filtering.
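
For comparison, the same "skip the whole script subtree" idea can be sketched without BeautifulSoup, using Python 3's standard-library html.parser. This swaps in a different parser from the one used in this answer. The names VisibleTextParser and visible_text are made up for illustration, and skipping style tags as well is an assumption that goes beyond the question:

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = ("script", "style")  # assumption: <style> is unwanted too

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text found outside skipped elements.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the visible text chunks of an HTML string, in document order."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Something like "\n".join(visible_text(webPage)) would then give one text chunk per line. Unlike getStrings, this sketch also drops whitespace-only strings.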

2 Comments

@demos, you're welcome, glad to hear this! BTW, why the accept (and btw tx for it!) without an upvote? Seems strange!-)
@Alex Martelli The first upvote is from me. What a pity that there hasn't been any upvote on this answer for 19 months !
0

Using BeautifulSoup, something along these lines:

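def _extract_text(t):
    # Empty strings (or None) contribute nothing.
    if not t:
        return ""
    # Strings: turn newlines into spaces and collapse runs of spaces.
    if isinstance(t, (unicode, str)):
        return " ".join(filter(None, t.replace("\n", " ").split(" ")))
    # <br> becomes a line break; <script> contents are dropped.
    if t.name.lower() in ("br", "script"):
        return "\n"
    # Any other tag: concatenate the text of its children.
    return "".join(_extract_text(c) for c in t)

def extract_text(t):
    # Strip leading and trailing whitespace from every line of the result.
    return '\n'.join(x.strip() for x in _extract_text(t).split('\n'))

print extract_text(htmlDom)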
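
The whitespace handling applied to each string above can be tried on its own, without a parsed document. A minimal sketch, where collapse_whitespace is a hypothetical helper name that does not appear in the answer:

```python
def collapse_whitespace(s):
    # Same expression as in the answer: newlines become spaces,
    # then the empty pieces produced by repeated spaces are dropped.
    return " ".join(filter(None, s.replace("\n", " ").split(" ")))
```

Note that this only treats spaces and newlines as separators; tabs are left in place. If tabs should collapse too, " ".join(s.split()) is a common alternative.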

0

You can remove script tags in BeautifulSoup, something like:

for script in soup("script"):
    script.extract()

See "Removing Elements" in the BeautifulSoup documentation.

1 Comment

Looks like a quick solution, but what is the penalty for tag extraction?
0

Try it out:

http://code.google.com/p/boilerpipe/

http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/
