Python regex to strip html a tags without href attribute

Question

I have a string that has been cleaned with lxml's Cleaner, so all links are now in the form <a href="...">Content</a>. Now I'd like to strip out all links that have no href attribute, e.g.

<a rel="nofollow">Link to be removed</a>

should become

Link to be removed

The same for:

<a>Other link to be removed</a>

should become:

Other link to be removed

In short, every link without an href attribute should be removed while its content is kept. It doesn't have to be regex, but since lxml returns a clean markup structure, a regex should be feasible. What I need is the source string with such non-functional <a> tags stripped out.

Don't use regex to read/manipulate HTML. Use an HTML/XML library instead.
– gefei, Commented Jun 21, 2013 at 6:11
Which one does that, and how? I can't find this feature in lxml, FilterHTML or bleach. Additionally, the string has already been parsed by lxml.
– Simon Steinberger, Commented Jun 21, 2013 at 6:12

TerryA · Answer · 2013-06-21 06:21:11Z

2

You can use BeautifulSoup, which makes it easy to find <a> tags without an href:

>>> from bs4 import BeautifulSoup as BS
>>> html = """
... <a rel="nofollow">Link to be removed</a>
... <a href="alink">This should not be included</a>
... <a>Other link to be removed</a>
... """
>>> soup = BS(html, "lxml")  # name the parser explicitly; lxml adds the <html><body> wrapper seen below
>>> for i in soup.find_all('a', href=False):
...     i.replace_with(i.text)
...
>>> print(soup)
<html><body>Link to be removed
<a href="alink">This should not be included</a>
Other link to be removed</body></html>

edited Jun 21, 2013 at 6:21

answered Jun 21, 2013 at 6:12


6 Comments

Simon Steinberger Over a year ago

That outputs the text, but I'd like to strip the html tags only inside the source string. I'll edit my question to clear that up.

TerryA Over a year ago

@Nasmon Oh, so something like Hello. <a>Test</a>. Yay. should be Hello. Test. Yay.?

Simon Steinberger Over a year ago

Exactly, that would be great!

Simon Steinberger Over a year ago

Thank you Haidro! Since I'm already using lxml and don't have BeautifulSoup installed, I've accepted falsetru's answer. But it's great to have an alternative with BeautifulSoup!

falsetru Over a year ago

@Nasmon, if an <a> tag contains other tags, they will be lost.
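
The loss of nested tags mentioned above comes from replace_with(i.text), which keeps only the text. A minimal sketch of one way around it within BeautifulSoup, assuming bs4 is installed and using its unwrap() method instead; the function name unwrap_hrefless_links and the sample string are illustrative and not from the thread:

```python
from bs4 import BeautifulSoup


def unwrap_hrefless_links(html):
    """Return html with href-less <a> tags removed but their children kept."""
    # "html.parser" is the stdlib-backed parser; it does not add <html><body>
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=False):
        # unwrap() replaces the tag with its contents, nested tags included
        a.unwrap()
    return str(soup)


sample = 'Test <a rel="nofollow">Link to be <b>removed</b></a>. <a href="#">link</a>'
print(unwrap_hrefless_links(sample))
# prints: Test Link to be <b>removed</b>. <a href="#">link</a>
```

The <b> element survives here because unwrap() moves the children of the tag into its parent rather than flattening them to text.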


falsetru · Accepted Answer · 2013-06-21 06:25:54Z

1

Use the drop_tag method.

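import lxml.html

root = lxml.html.fromstring('<div>Test <a rel="nofollow">Link to be <b>removed</b></a>. <a href="#">link</a>')
# 'a[not(@href)]' matches only <a> elements that are direct children of root
for a in root.xpath('a[not(@href)]'):
    a.drop_tag()  # removes the tag but keeps its text and children

# encoding='unicode' makes tostring return text rather than bytes on Python 3
assert lxml.html.tostring(root, encoding='unicode') == '<div>Test Link to be <b>removed</b>. <a href="#">link</a></div>'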

http://lxml.de/lxmlhtml.html

.drop_tag(): Drops the tag, but keeps its children and text.

answered Jun 21, 2013 at 6:25


1 Comment

Simon Steinberger Over a year ago

Thanks!! That works nicely. It works for me if I use this xpath: '//a[not(@href)]'. Without "//" it doesn't find all nested links.
