
I have a string '<span>TEST1</span> <span>TEST2</span> <a href="#">TEST3</a>'

I need to remove the HTML tags and keep only the text:

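import re
# Raw string, so that \s reaches the regex engine as-is and is not treated as an invalid string escape
p = re.compile(r'\s*<[^>]+>\s*')
test = p.sub('', '<span>TEST1</span> <span>TEST2</span> <a href="#">TEST3</a>')
print(test)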

OUTPUT: TEST1TEST2TEST3

But this removes every HTML tag. How should I change the regex so that the output looks like this:

OUTPUT: TEST1 TEST2 <a href="#">TEST3</a>
  • 3
    A better solution is to use an HTML parser. Are you open to using BeautifulSoup? Commented Apr 27, 2022 at 10:37
  • 1
    The pattern matches all elements because it matches everything from < to >. You could change it to <(?!\/a>|a )[^>]+> (regex101.com/r/YxTzLr/1). Commented Apr 27, 2022 at 10:37
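
The parser route suggested in the first comment can be sketched without third-party packages, using the standard library's html.parser instead of BeautifulSoup. This is only a minimal sketch: the class name KeepAnchors and the helper strip_tags_except_anchors are made up for illustration, and it assumes that only <a> elements should survive.

```python
from html.parser import HTMLParser


class KeepAnchors(HTMLParser):
    """Collect text content, dropping every tag except <a ...> and </a>."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # get_starttag_text() returns the start tag as written in the input
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "a":
            self.parts.append("</a>")

    def handle_data(self, data):
        # Text between tags, including the spaces between elements
        self.parts.append(data)


def strip_tags_except_anchors(html):
    # Hypothetical helper name, not part of any library
    parser = KeepAnchors()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


print(strip_tags_except_anchors('<span>TEST1</span> <span>TEST2</span> <a href="#">TEST3</a>'))
# TEST1 TEST2 <a href="#">TEST3</a>
```

One caveat: with the parser's default settings, character references in the text (such as &amp;) are decoded and not re-escaped on output.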

1 Answer


You can use negative lookaheads.

In your case, you can exclude the opening <a ...> tag and the closing </a> tag from the match:

(?!<a )(?!<\/a>)<[^>]+>

Note the space after <a and the closing angle bracket in </a>. They ensure that only the opening and closing tags of an <a> element are excluded from the match, while other tags whose names begin with a (such as <abbr>) are still removed.
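
Applied to the string from the question, the pattern can be tried out as follows. This is a small sketch: the name TAG_EXCEPT_ANCHOR is made up for illustration, and the \s* parts of the original pattern are left out on the assumption that the spaces between elements should be kept.

```python
import re

# Match any tag unless it starts with "<a " or is exactly "</a>".
# The forward slash needs no escaping in Python, so \/ is written as /.
TAG_EXCEPT_ANCHOR = re.compile(r'(?!<a )(?!</a>)<[^>]+>')

html = '<span>TEST1</span> <span>TEST2</span> <a href="#">TEST3</a>'
result = TAG_EXCEPT_ANCHOR.sub('', html)
print(result)  # TEST1 TEST2 <a href="#">TEST3</a>
```

Because the first lookahead requires a space after <a, a bare <a> tag without attributes would still be removed.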
