Python filter list to remove certain links from html source code

Question

I have HTML source code from which I want to filter out one or more links while keeping the others.

I have set up my filter with "*" as the wildcard:

<a*>Link1</a>, <a*>Link2</a>, or <a*>Link3</a>
<a*>A bad link*</a>
some text* <a*>update*</a>
other text right before link <a*>click here</a>

I would like to filter out every instance of these links from the HTML source code using Python. I'm OK with loading the list into an array, but I need some help with the filter. Each line break signifies a separate filter, and I only want to remove the link(s) and not the text.

I am still very new to Python and regex/BeautifulSoup. Even if you could just point me in the right direction, it would be greatly appreciated.

So a bad link is when you just have <a>wqeqweq</a> on a single line and nothing else?
– damir, Commented Dec 20, 2010 at 23:49
You should be using an HTML parser, like HTMLParser or BeautifulSoup. HTML shouldn't be parsed with regex.
– Rafe Kettler, Commented Dec 20, 2010 at 23:55
I believe this link from a previous StackOverflow question is appropriate: stackoverflow.com/questions/1732348/… I agree with Ryan - use an HTML parser like BeautifulSoup.
– kejadlen, Commented Dec 20, 2010 at 23:59
@damir Yes, each line would be a separate filter and I only want to remove the link (<a>) and not the text.
– Ryan, Commented Dec 21, 2010 at 0:00
BeautifulSoup can be an option for me if regex isn't the right application.
– Ryan, Commented Dec 21, 2010 at 0:00

mechanical_meat · Accepted Answer · 2010-12-21 01:05:20Z

3

To remove <a> tags and keep only the text not contained within those tags:

>>> from BeautifulSoup import BeautifulSoup as bs
>>> markup = """<a*>Link1</a> <a*>Link2</a> or <a*>Link3</a>
... <a*>A bad link*</a>
... some text* <a*>update*</a>
... other text right before link <a*>click here</a>"""
>>> soup = bs(markup)
>>> TAGS_TO_EXTRACT = ('a',)
>>> for tag in soup.findAll():
...   if tag.name in TAGS_TO_EXTRACT:
...     tag.extract()
...
>>> soup
  or

some text*
other text right before link

It's not clear to me whether you want the text within the tags or not. If you want to keep the text contained within the tags, do something like this instead:

>>> for tag in soup.findAll():
...   if tag.name in TAGS_TO_EXTRACT:
...     tag.replaceWith(tag.text)
...
>>> soup
Link1 Link2 or Link3
A bad link*
some text* update*
other text right before link click here
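The session above uses the BeautifulSoup 3 import (`from BeautifulSoup import ...`). As a rough sketch for Python 3, assuming the `bs4` package is installed, the same idea can be limited to "certain" links by matching the link text against a list of wildcard patterns. The sample markup, the `bad_link_texts` list, and the use of `fnmatch` for the `*` wildcard are assumptions made for illustration, not part of the original answer.

```python
from fnmatch import fnmatchcase

from bs4 import BeautifulSoup  # third-party: Beautiful Soup 4

# Hypothetical sample markup and link-text patterns ("*" matches any text).
markup = (
    '<p><a href="/1">Link1</a> or <a href="/2">A bad link!</a></p>\n'
    '<p>some text <a href="/3">update now</a></p>'
)
bad_link_texts = ["A bad link*", "update*"]

soup = BeautifulSoup(markup, "html.parser")

# find_all returns a list, so modifying the tree inside the loop is safe.
for tag in soup.find_all("a"):
    text = tag.get_text()
    if any(fnmatchcase(text, pattern) for pattern in bad_link_texts):
        # unwrap() drops the <a> tag but keeps its text;
        # decompose() would drop the tag and its text together.
        tag.unwrap()

result = str(soup)
```

Here `Link1` keeps its `<a>` tag because its text matches none of the patterns. Note that `fnmatch` also treats `?` and `[...]` as special characters, which may or may not be wanted for a filter list like the one in the question.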

edited Dec 21, 2010 at 1:05

answered Dec 21, 2010 at 0:39


jsbueno · Accepted Answer · 2010-12-21 01:32:39Z

Parsing it with the only purpose of reassembling the whole document, discarding just a part of the information, would yield a lot of unneeded code.

So, I think this is better as a job for regular expressions. Python's regular expressions can have a callback function that allows one to customize the substitution string. In this case, it is a simple matter of creating a regexp that matches the "bad link", the text in between, and the end link mark-up, and preserves only the text in between.

import re

markup = """<a*>Link1</a>, <a*>Link2</a>, or <a*>Link3</a>
<a*>A bad link*</a>
some text* <a*>update*</a>
other text right before link <a*>click here</a>"""

# Group 2 is the text between the opening and closing tags; keep only that.
filtered = re.sub(r"(<a.*?>)(.*?)(</a\s*>)", lambda match: match.group(2), markup)
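The substitution above strips every link. To apply the question's filter list instead, one possible approach is to turn each filter line into its own regular expression and reuse the callback idea. The following is only a sketch under stated assumptions: the helper names `compile_filter` and `strip_links` are hypothetical, `<a*>` is taken to mean "an opening `<a>` tag with any attributes", a bare `*` is taken to mean "any text up to the next tag", and literal text (including whitespace) must match the HTML exactly.

```python
import re

# Pieces of a filter line that have special meaning; the rest is literal text.
TOKEN = re.compile(r"(<a\*>|</a>|\*)")


def compile_filter(line):
    """Turn one wildcard filter line into a compiled regex (hypothetical helper)."""
    parts = []
    for token in TOKEN.split(line):
        if token == "<a*>":
            parts.append(r"<a\b[^>]*>")   # opening tag, any attributes, not captured
        elif token == "</a>":
            parts.append(r"</a\s*>")      # closing tag, not captured
        elif token == "*":
            parts.append(r"([^<]*?)")     # wildcard: text up to the next tag, captured
        elif token:
            parts.append("(" + re.escape(token) + ")")  # literal text, captured
    return re.compile("".join(parts))


def strip_links(html, filter_lines):
    """Remove the <a> tags matched by each filter line, keeping the text."""
    for line in filter_lines:
        line = line.strip()
        if not line:
            continue
        # Only text was captured, so joining the groups drops the tags.
        html = compile_filter(line).sub(lambda m: "".join(m.groups()), html)
    return html


# Hypothetical input: the first two links match a filter, the last does not.
html = (
    'other text right before link <a href="/x">click here</a>, '
    '<a href="/b">A bad link!!</a> and <a href="/keep">Good link</a>'
)
filters = [
    "other text right before link <a*>click here</a>",
    "<a*>A bad link*</a>",
]
cleaned = strip_links(html, filters)
```

Links that match no filter line, such as `Good link` here, are left untouched. The caveats raised in the comments still apply: a regex like this can be confused by nested tags, comments, or unusual attribute values, so it is best suited to markup whose shape is known in advance.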
