How can I extract certain information from a string in python?

Question

I am trying to use Python to extract certain information from HTML code. For example:

<a href="#tips">Visit the Useful Tips Section</a> 
and I would like to get the result: Visit the Useful Tips Section

<div id="menu" style="background-color:#FFD700;height:200px;width:100px;float:left;">
<b>Menu</b><br />
HTML<br />
CSS<br />
and I would like to get: Menu HTML CSS

In other words, I want to get all the text that sits between one <...> tag and the next. I am trying to write a Python function that takes the HTML code as a string and extracts that text from it. I am stuck at string.split('<').

Have you tried using an HTML parsing library? Alternatively, you can process the file by removing all the tags (though that is a bit tricky with the <script> tag).
– nhahtdh, Commented Jun 1, 2012 at 13:24

unwind · Accepted Answer · 2012-06-01 13:24:03Z

3

You should use a proper HTML parsing library, such as the HTMLParser module.
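As a minimal sketch of that suggestion: in Python 3 the HTMLParser module lives at html.parser. The class name TextExtractor and the helper extract_text below are hypothetical names chosen for illustration, and skipping <script> and <style> bodies is an assumption about what "the text" should include.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text between tags, skipping <script> and <style> bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []       # non-empty pieces of text, in document order
        self._skip_depth = 0   # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of an HTML fragment, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(parser.chunks)


link = '<a href="#tips">Visit the Useful Tips Section</a>'
menu = '''<div id="menu" style="background-color:#FFD700;height:200px;width:100px;float:left;">
<b>Menu</b><br />
HTML<br />
CSS<br />'''

print(extract_text(link))  # Visit the Useful Tips Section
print(extract_text(menu))  # Menu HTML CSS
```

The parser copes with the unclosed <div> in the second snippet because it only reports the tags and text it sees and does not require a well-formed tree.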

user278064 · Accepted Answer · 2012-06-01 13:26:25Z

1

import re

string = '<a href="#tips">Visit the Useful Tips Section</a>'
re.findall(r'<[^>]*>(.*)<[^>]*>', string)  # returns ['Visit the Useful Tips Section']

2 Comments

Lauritz V. Thaulow Over a year ago

I wouldn't recommend a regexp-based solution.

bruno desthuilliers Over a year ago

@lazyr: depends on the context... If you know enough about the markup structure and there's no ambiguity, a mere regexp can JustWork with way less overhead than a full blown HTML parser. But you indeed have to know when the regexp is ok and when it's time to go for your HTML parser...
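The trade-off discussed in these comments can be illustrated with the pattern from the answer. This is only a sketch: the nested sample string is made up for the demonstration and is not taken from the question.

```python
import re

# Same pattern as in the answer: opening tag, captured text, closing tag.
PATTERN = re.compile(r'<[^>]*>(.*)<[^>]*>')

# Flat markup with a single pair of tags: the pattern works as intended.
simple = '<a href="#tips">Visit the Useful Tips Section</a>'
print(PATTERN.findall(simple))  # ['Visit the Useful Tips Section']

# Nested markup (hypothetical input): the greedy group swallows the inner tags.
nested = '<p>Read <b>this</b> first</p>'
print(PATTERN.findall(nested))  # ['Read <b>this</b> first']
```

When the markup is known to be flat, the regexp is enough; once tags can nest, the captured text still contains markup, and a parser is the safer choice.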

RanRag · Accepted Answer · 2012-06-01 13:32:55Z

1

You can use the lxml HTML parser.

>>> import lxml.html as lh
>>> st = ''' load your above html content into a string '''
>>> d = lh.fromstring(st)
>>> d.text_content()

'Visit the Useful Tips Section \nand I would like to get result : Visit the Useful Tips Section\n\n\nMenu\nHTML\nCSS\nand I would like to get Menu HTML CSS\n'

or you can do

>>> for content in d.text_content().split("\n"):
...     if content:
...         print(content)
...
Visit the Useful Tips Section
and I would like to get result : Visit the Useful Tips Section
Menu
HTML
CSS
and I would like to get Menu HTML CSS
>>>

RumburaK · Accepted Answer · 2012-06-01 13:29:12Z

0

I understand you are trying to strip out the HTML tags and keep only the text.

You can define a regular expression that represents the tags. Then substitute all matches with the empty string.

Example:

import re

def remove_html_tags(data):
    # Non-greedy match, so each <...> tag is removed on its own.
    p = re.compile(r'<.*?>')
    return p.sub('', data)
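A usage sketch for the function above, applied to the two snippets from the question. The helper strip_and_normalize is a hypothetical addition that collapses the leftover newlines into single spaces. Note that without re.DOTALL the pattern will not remove a tag that spans several lines.

```python
import re

TAG = re.compile(r'<.*?>')  # compiled once, same pattern as in the answer


def remove_html_tags(data):
    """Replace every <...> tag with the empty string."""
    return TAG.sub('', data)


def strip_and_normalize(data):
    """Hypothetical helper: strip tags, then collapse runs of whitespace."""
    return ' '.join(remove_html_tags(data).split())


link = '<a href="#tips">Visit the Useful Tips Section</a>'
menu = '''<div id="menu" style="background-color:#FFD700;height:200px;width:100px;float:left;">
<b>Menu</b><br />
HTML<br />
CSS<br />'''

print(strip_and_normalize(link))  # Visit the Useful Tips Section
print(strip_and_normalize(menu))  # Menu HTML CSS
```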

Pamela McA'Nulty · Accepted Answer · 2012-06-01 13:44:20Z

0

I'd use BeautifulSoup - it gets much less cranky with malformed HTML.
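A minimal sketch of how this might look, assuming the third-party beautifulsoup4 package is installed (imported as bs4) and using Python's built-in "html.parser" backend. The helper name extract_text is hypothetical.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4


def extract_text(html):
    """Return the text of an HTML fragment, pieces joined by single spaces."""
    soup = BeautifulSoup(html, "html.parser")
    # strip=True trims each text node and drops the ones that end up empty.
    return soup.get_text(" ", strip=True)


link = '<a href="#tips">Visit the Useful Tips Section</a>'
menu = '''<div id="menu" style="background-color:#FFD700;height:200px;width:100px;float:left;">
<b>Menu</b><br />
HTML<br />
CSS<br />'''

print(extract_text(link))  # Visit the Useful Tips Section
print(extract_text(menu))  # Menu HTML CSS
```

The second snippet never closes its <div>, which is the kind of malformed input this answer has in mind.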
