
Say I have a string that looks like <a href="/wiki/Greater_Boston" title="Greater Boston">Boston–Cambridge–Quincy, MA–NH MSA</a>

How can I use re to get rid of links and get only the Boston–Cambridge–Quincy, MA–NH MSA part?

I tried something like match = re.search(r'<.+>(\w+)<.+>', name_tmp), but it does not match anything.

2 Answers

re.sub(r'<a[^>]+>(.*?)</a>', r'\1', text)

Note that parsing HTML with regular expressions is fragile in general. However, it seems that you are parsing MediaWiki-generated links, where it is safe to assume that the links are always formatted similarly, so you should be fine with that regular expression.
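As a minimal sketch of how the substitution behaves on the string from the question (the variable names text, link_pattern and stripped are just illustrative, and the pattern assumes simple, non-nested <a> tags on a single line):

```python
import re

# Sample input taken from the question.
text = ('<a href="/wiki/Greater_Boston" title="Greater Boston">'
        'Boston–Cambridge–Quincy, MA–NH MSA</a>')

# Match an opening <a ...> tag, lazily capture everything up to the
# closing </a>, and replace the whole element with the captured text.
link_pattern = re.compile(r'<a[^>]+>(.*?)</a>')
stripped = link_pattern.sub(r'\1', text)

print(stripped)  # Boston–Cambridge–Quincy, MA–NH MSA
```

The original attempt with (\w+) finds no match because \w does not cover the spaces, the comma, or the en dashes in the link text, which is why the capture group here uses .*? instead.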


You can also use the bleach module (https://pypi.python.org/pypi/bleach), which wraps HTML sanitizing tools and lets you quickly strip HTML from text.
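With bleach, the call is typically along the lines of bleach.clean(text, tags=[], strip=True), though the exact arguments depend on the installed version, so treat that as an assumption to check against its documentation. If adding a dependency is not an option, a similar tag-stripping effect can be sketched with the standard library's html.parser. This is a stand-in, not bleach's API, and the names TextExtractor and strip_tags are hypothetical:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text content and ignores all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text found between tags.
        self.chunks.append(data)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.chunks)


text = ('<a href="/wiki/Greater_Boston" title="Greater Boston">'
        'Boston–Cambridge–Quincy, MA–NH MSA</a>')
print(strip_tags(text))  # Boston–Cambridge–Quincy, MA–NH MSA
```

Unlike the regular expression, this removes every tag rather than only links, and it performs no sanitizing, so it is not a substitute for bleach on untrusted input.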
