0

I have a string that has been cleaned with lxml's Cleaner, so all links are now simple a tags wrapping their content. Now I'd like to strip out all links that have no href attribute, e.g.

<a rel="nofollow">Link to be removed</a>

should become

Link to be removed

The same for:

<a>Other link to be removed</a>

should become

Other link to be removed

In short, every link with a missing href attribute should be reduced to its content. It doesn't have to be a regex, although since lxml returns clean markup, a regex should be feasible. What I need is the source string stripped of such non-functional a tags.

2
  • Don't use regex to read/manipulate HTML. Use an HTML/XML library instead Commented Jun 21, 2013 at 6:11
  • Which one does that and how? Can't find this feature in lxml, FilterHTML or bleach. Additionally, the string has already been parsed by lxml. Commented Jun 21, 2013 at 6:12

2 Answers

2

You can use BeautifulSoup, which makes it easy to find <a> tags without an href attribute:

>>> from bs4 import BeautifulSoup as BS
>>> html = """
... <a rel="nofollow">Link to be removed</a>
... <a href="alink">This should not be included</a>
... <a>Other link to be removed</a>
... """
>>> soup = BS(html, 'lxml')  # name the parser explicitly; the output below assumes lxml
>>> for i in soup.find_all('a', href=False):
...     i.replace_with(i.text)
... 
>>> print(soup)
<html><body>Link to be removed
<a href="alink">This should not be included</a>
Other link to be removed</body></html>

6 Comments

That outputs the text, but I'd like to strip the html tags only inside the source string. I'll edit my question to clear that up.
@Nasmon Oh, so something like Hello. <a>Test</a>. Yay. should be Hello. Test. Yay.?
Exactly, that would be great!
Thank you Haidro! Since I'm already using lxml and don't have BeautifulSoup installed, I've accepted falsetru's answer. But it's great to have an alternative with BeautifulSoup!
@Nasmon, if the a tag contains other tags, those will be lost.
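The loss of nested tags mentioned in the last comment comes from replacing the link with i.text. A minimal sketch of one way around it, assuming BeautifulSoup 4 and its unwrap() method, which removes a tag but keeps its children. The function name unwrap_links_without_href is hypothetical, and html.parser is chosen here only so that no html/body wrapper is added around a fragment:

```python
from bs4 import BeautifulSoup


def unwrap_links_without_href(html):
    """Return html with every <a> that lacks an href replaced by its contents."""
    # html.parser does not wrap fragments in <html><body>.
    soup = BeautifulSoup(html, "html.parser")
    # href=False selects <a> tags that have no href attribute at all.
    for a in soup.find_all("a", href=False):
        a.unwrap()  # drop the tag itself, keep nested tags and text
    return str(soup)


result = unwrap_links_without_href(
    'Hello. <a rel="nofollow">Link to be <b>removed</b></a>. <a href="#">kept</a>'
)
print(result)
```

Unlike replace_with(i.text), this should leave the nested <b> element in place while the surrounding text stays untouched.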
1

Use the drop_tag method.

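import lxml.html

root = lxml.html.fromstring('<div>Test <a rel="nofollow">Link to be <b>removed</b></a>. <a href="#">link</a>')
# 'a[not(@href)]' matches only direct children of root that lack an href.
for a in root.xpath('a[not(@href)]'):
    a.drop_tag()  # remove the tag itself, keep its text and children

# encoding='unicode' makes tostring return str rather than bytes on Python 3.
assert lxml.html.tostring(root, encoding='unicode') == '<div>Test Link to be <b>removed</b>. <a href="#">link</a></div>'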

From the lxml.html documentation (http://lxml.de/lxmlhtml.html):

.drop_tag(): Drops the tag, but keeps its children and text.

1 Comment

Thanks!! That works nicely. It works for me if I use this xpath: '//a[not(@href)]'. Without "//" it doesn't find all nested links.
