I am trying to use Python to extract certain information from HTML code. For example:

<a href="#tips">Visit the Useful Tips Section</a> 
and I would like to get result : Visit the Useful Tips Section

<div id="menu" style="background-color:#FFD700;height:200px;width:100px;float:left;">
<b>Menu</b><br />
HTML<br />
CSS<br />
and I would like to get Menu HTML CSS

In other words, I want to get everything between the tags (the text between > and <). I am trying to write a Python function that takes the HTML code as a string and extracts the text from it. I am stuck at string.split('<').

  • Have you tried using an HTML parsing library? Alternatively, you can process the file by removing all the tags (though that is a bit tricky with <script> tags). Commented Jun 1, 2012 at 13:24

5 Answers

You should use a proper HTML parsing library, such as the standard library's HTMLParser module (html.parser in Python 3).
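As a rough illustration of that suggestion, here is a minimal sketch assuming Python 3, where the module lives at html.parser. The names TextExtractor and extract_text are made up for this example and are not part of the library; the sketch collects text nodes and skips anything inside <script> or <style>.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Hypothetical helper: return the visible text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.text()


print(extract_text('<a href="#tips">Visit the Useful Tips Section</a>'))
```

Joining the chunks with a single space is a choice made for this sketch; it is what turns the menu snippet from the question into Menu HTML CSS.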

import re

string = '<a href="#tips">Visit the Useful Tips Section</a>'
re.findall('<[^>]*>(.*)<[^>]*>', string)  # returns ['Visit the Useful Tips Section']

2 Comments

@lazyr: It depends on the context. If you know enough about the markup structure and there is no ambiguity, a simple regexp can just work with far less overhead than a full-blown HTML parser. But you do have to know when a regexp is OK and when it is time to reach for an HTML parser.
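To make the "no ambiguity" condition concrete, here is a small sketch that runs the pattern from the answer above on two inputs: the single anchor tag from the question, and a fragment of the menu snippet. The variable names are illustrative only. On the second input the greedy (.*) runs up to the last tag, so a closing tag leaks into the result.

```python
import re

# Same pattern as in the answer above: one tag, captured text, one tag.
PATTERN = re.compile(r'<[^>]*>(.*)<[^>]*>')

simple = '<a href="#tips">Visit the Useful Tips Section</a>'
nested = '<b>Menu</b><br />'

simple_result = PATTERN.findall(simple)
nested_result = PATTERN.findall(nested)

print(simple_result)  # ['Visit the Useful Tips Section']
print(nested_result)  # ['Menu</b>'] because (.*) is greedy
```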
You can use the lxml HTML parser.

>>> import lxml.html as lh
>>> st = ''' load your above html content into a string '''
>>> d = lh.fromstring(st)
>>> d.text_content()

'Visit the Useful Tips Section \nand I would like to get result : Visit the Useful Tips Section\n\n\nMenu\nHTML\nCSS\nand I would like to get Menu HTML CSS\n'

Or you can split the text into lines and print only the non-empty ones:

>>> for content in d.text_content().split("\n"):
...     if content:
...         print(content)
...
Visit the Useful Tips Section
and I would like to get result : Visit the Useful Tips Section
Menu
HTML
CSS
and I would like to get Menu HTML CSS
>>>

I understand you are trying to strip out the HTML tags and keep only the text.

You can define a regular expression that represents the tags. Then substitute all matches with the empty string.

Example:

import re

def remove_html_tags(data):
    # Non-greedy match of anything that looks like a tag, e.g. <b> or </b>.
    p = re.compile(r'<.*?>')
    return p.sub('', data)
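A plain tag-stripping regex leaves the contents of <script> and <style> blocks in the output and does not decode entities such as &amp;. The following is only a sketch of one way to handle those two cases, assuming Python 3 and reasonably well-formed markup; strip_tags and the two compiled patterns are names invented for this example.

```python
import html
import re

# Whole <script>...</script> and <style>...</style> blocks, contents included.
_SCRIPT_STYLE = re.compile(r'<(script|style)\b.*?</\1\s*>', re.IGNORECASE | re.DOTALL)
# Anything that looks like a single tag.
_TAG = re.compile(r'<[^>]*>')


def strip_tags(data):
    """Hypothetical helper: drop script/style blocks, then tags, then tidy up."""
    data = _SCRIPT_STYLE.sub('', data)
    data = _TAG.sub(' ', data)       # a space keeps adjacent words apart
    data = html.unescape(data)       # &amp; -> &, &lt; -> <, and so on
    return ' '.join(data.split())    # collapse runs of whitespace


print(strip_tags('<b>Menu</b><br />\nHTML<br />\nCSS<br />'))
```

This still breaks on input where a literal > appears inside an attribute value or where tags are left unclosed in odd ways, which is where a real parser earns its overhead.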

References:

Example

Docs about python regular expressions

I'd use BeautifulSoup; it is much more tolerant of malformed HTML.
