1

I have HTML source code from which I want to filter out one or more links while keeping the others.

I have set up my filter with "*" as the wildcard:

<a*>Link1</a>, <a*>Link2</a>, or <a*>Link3</a>
<a*>A bad link*</a>
some text* <a*>update*</a>
other text right before link <a*>click here</a>

I would like to remove every matching link from the HTML source using Python. I'm OK with loading the list of filters into an array; I need help with the filtering itself. Each line is a separate filter, and I only want to remove the link tag(s), not the text.

I am still very new to Python and regex/BeautifulSoup. Even a pointer in the right direction would be greatly appreciated.

  • so a bad link is if you just have <a>wqeqweq</a> in single line and nothing else? Commented Dec 20, 2010 at 23:49
  • You should be using an HTML parser, like HTMLParser or BeautifulSoup. HTML shouldn't be parsed with regex. Commented Dec 20, 2010 at 23:55
  • I believe this link from a previous StackOverflow question is appropriate: stackoverflow.com/questions/1732348/… I agree with Ryan - use an HTML parser like BeautifulSoup. Commented Dec 20, 2010 at 23:59
  • @damir Yes, each line would be a separate filter and I only want to remove the link (<a>) and not the text Commented Dec 21, 2010 at 0:00
  • BeautifulSoup can be an option for me if regex isn't the right application Commented Dec 21, 2010 at 0:00

2 Answers

3

To remove <a> tags and keep only the text not contained within those tags:

>>> from BeautifulSoup import BeautifulSoup as bs
>>> markup = """<a*>Link1</a> <a*>Link2</a> or <a*>Link3</a>
... <a*>A bad link*</a>
... some text* <a*>update*</a>
... other text right before link <a*>click here</a>"""
>>> soup = bs(markup)
>>> TAGS_TO_EXTRACT = ('a',)
>>> for tag in soup.findAll():
...   if tag.name in TAGS_TO_EXTRACT:
...     tag.extract()
...
>>> soup
  or

some text*
other text right before link

It's not clear to me whether you want to keep the text within the tags. If you do, replace each tag with its text instead:

>>> for tag in soup.findAll():
...   if tag.name in TAGS_TO_EXTRACT:
...     tag.replaceWith(tag.text)
...
>>> soup
Link1 Link2 or Link3
A bad link*
some text* update*
other text right before link click here
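
The BeautifulSoup 3 import above only works on Python 2. On Python 3, the same "drop the tag, keep its text" idea can be sketched with the standard library's html.parser module, which an earlier comment also suggested. This is only a minimal sketch: the class name LinkUnwrapper, the helper unwrap_links, and the sample string are made up for illustration, and the sample uses real href attributes rather than the <a*> placeholders from the question.

```python
from html.parser import HTMLParser


class LinkUnwrapper(HTMLParser):
    """Re-emit the document unchanged except that <a> tags are dropped."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            # get_starttag_text() returns the tag exactly as it appeared.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, as written.
        if tag != "a":
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag != "a":
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def unwrap_links(html):
    """Return html with every <a ...> and </a> removed, text kept."""
    parser = LinkUnwrapper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = '<p>other text &amp; <a href="/c">click <b>here</b></a></p>'
print(unwrap_links(sample))
```

Unlike a regex, this also copes with other tags nested inside the link, since the inner <b> is passed through untouched. If installing a third-party package is an option, Beautiful Soup 4 offers an unwrap() method on tags that serves the same purpose.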

0

Parsing the document only to reassemble it with one part discarded would take a lot of unneeded code.

So I think this is a better job for regular expressions. Python's re.sub can take a callback function that computes the substitution string for each match. In this case, it is a simple matter of writing a regexp that matches the opening tag of the "bad link", the text in between, and the closing tag, and keeps only the text in between.

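import re

markup = """<a*>Link1</a>, <a*>Link2</a>, or <a*>Link3</a>
<a*>A bad link*</a>
some text* <a*>update*</a>
other text right before link <a*>click here</a>"""

# Group 1 is the opening tag, group 2 the link text, group 3 the closing tag.
# \b after "a" keeps the pattern from matching other tags such as <abbr>.
filtered = re.sub(r"(<a\b[^>]*>)(.*?)(</a\s*>)", lambda match: match.group(2), markup)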
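
The substitution above unwraps every link. The question asks for only the links named by a list of wildcard filters to be removed, so here is one possible sketch of that step. It assumes each filter line uses "<a*>" for an opening tag with any attributes, "</a>" for the closing tag, and "*" for any run of text that does not cross into another tag. The function names compile_filter and apply_filters and the sample HTML are hypothetical, and the approach will not handle links that contain other tags.

```python
import re

# Tokens the filter syntax uses for the link's opening and closing tags.
_TAG_TOKENS = re.compile(r"(<a\*>|</a>)")


def compile_filter(filter_line):
    """Translate one wildcard filter line into a compiled regex.

    Literal text is captured in groups so it can be kept; the <a ...> and
    </a> tags are matched outside any group so they can be dropped.
    """
    parts = []
    for piece in _TAG_TOKENS.split(filter_line):
        if piece == "<a*>":
            parts.append(r"<a\b[^>]*>")      # opening tag with any attributes
        elif piece == "</a>":
            parts.append(r"</a\s*>")         # closing tag
        else:
            # Escape the text, then turn each escaped "*" into a lazy
            # wildcard that cannot cross into another tag.
            text = re.escape(piece).replace(r"\*", "[^<]*?")
            parts.append("(" + text + ")")
    return re.compile("".join(parts))


def apply_filters(html, filter_lines):
    """Unwrap every link matched by any filter, keeping all text."""
    for line in filter_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the filter list
        pattern = compile_filter(line)
        # The callback rebuilds the match from the captured text only.
        html = pattern.sub(lambda m: "".join(m.groups()), html)
    return html


filters = [
    "<a*>A bad link*</a>",
    "some text* <a*>update*</a>",
]
html = (
    '<p><a href="/x">A bad link here</a></p>\n'
    '<p>some text first <a href="/u">update now</a></p>\n'
    '<p><a href="/ok">A good link</a></p>'
)
cleaned = apply_filters(html, filters)
print(cleaned)
```

In this sample the first two links are unwrapped and the third, which no filter matches, is left as it was. Reading the filters from a file is then just a matter of passing its lines to apply_filters.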
