Use Python re to get rid of links

Question

Say I have a string that looks like <a href="/wiki/Greater_Boston" title="Greater Boston">Boston–Cambridge–Quincy, MA–NH MSA</a>

How can I use re to get rid of links and get only the Boston–Cambridge–Quincy, MA–NH MSA part?

I tried something like match = re.search(r'<.+>(\w+)<.+>', name_tmp), but it does not work.

Community · Accepted Answer · 2017-05-23 11:49:18Z

3

re.sub(r'<a[^>]+>(.*?)</a>', r'\1', text)

Note that parsing HTML with regular expressions is generally fragile. However, it seems that you are parsing MediaWiki-generated links, where it is safe to assume that the links are always formatted similarly, so you should be fine with that regular expression.
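For illustration, here is a minimal sketch of that substitution applied to the string from the question. The names LINK_RE and strip_links are made up for this example. It also shows why the original attempt finds nothing: \w+ cannot match the spaces, comma, or en dashes inside the link text.

```python
import re

# Sample string from the question.
name_tmp = ('<a href="/wiki/Greater_Boston" title="Greater Boston">'
            'Boston–Cambridge–Quincy, MA–NH MSA</a>')

# The attempt from the question: \w+ only matches word characters, so it
# cannot span the spaces, comma, and en dashes in the link text.
failed = re.search(r'<.+>(\w+)<.+>', name_tmp)

# Replace every <a ...>...</a> element with its inner text (group 1).
# The non-greedy (.*?) stops at the first closing </a>.
LINK_RE = re.compile(r'<a[^>]+>(.*?)</a>')  # hypothetical helper name

def strip_links(text):
    """Return text with simple <a> elements replaced by their contents."""
    return LINK_RE.sub(r'\1', text)

plain = strip_links(name_tmp)
print(failed)  # None
print(plain)   # Boston–Cambridge–Quincy, MA–NH MSA
```

This assumes links never nest and never span multiple lines; for link text containing newlines, re.DOTALL would be needed.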

edited May 23, 2017 at 11:49

answered Feb 23, 2013 at 23:43

poke

Jonathan Vanasco · Accepted Answer · 2013-02-24 00:21:20Z

3

You can also use the bleach module (https://pypi.python.org/pypi/bleach), which wraps HTML sanitizing tools and lets you quickly strip HTML from text.
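If adding a dependency is not an option, a similar tag-stripping effect can be sketched with the standard library's html.parser. To be clear, this is not bleach's API and does no sanitizing; the TextOnly class and strip_tags function are names invented for this example.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collects only the text nodes of a document, dropping all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; tags themselves are ignored.
        self.parts.append(data)

def strip_tags(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)

text = ('<a href="/wiki/Greater_Boston" title="Greater Boston">'
        'Boston–Cambridge–Quincy, MA–NH MSA</a>')
print(strip_tags(text))  # Boston–Cambridge–Quincy, MA–NH MSA
```

Unlike the regular expression, this removes every tag rather than only links, which may or may not be what you want.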

answered Feb 24, 2013 at 0:21

Jonathan Vanasco
