How to remove html elements from a string but exclude a specific element with regex

Question

I have a string '<span>TEST1</span> <span>TEST2</span> <a href="#">TEST3</a>'

I need to remove the HTML tags and keep the text:

import re
# Raw string so that \s reaches the regex engine unchanged
p = re.compile(r'\s*<[^>]+>\s*')
test = p.sub('', '<span>TEST1</span> <span>TEST2</span> <a href="#">TEST3</a>')
print(test)

OUTPUT: TEST1TEST2TEST3

But this removes every HTML element. How should I change the regex so that the output looks like this?

OUTPUT: TEST1 TEST2 <a href="#">TEST3</a>

A better solution is to use an HTML parser. Are you open to using BeautifulSoup?
– Andrej Kesely, Commented Apr 27, 2022 at 10:37
The pattern removes every element because <[^>]+> matches any tag. You could change it to <(?!\/a>|a )[^>]+> (regex101.com/r/YxTzLr/1)
– The fourth bird, Commented Apr 27, 2022 at 10:37
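
For the parser route raised in the first comment, here is a minimal sketch. It swaps in the standard library's html.parser for BeautifulSoup, so it needs no extra install. The names KeepAnchors and strip_tags_except_a are made up for illustration and do not come from the thread.

```python
from html.parser import HTMLParser


class KeepAnchors(HTMLParser):
    """Collect text and <a> tags; drop every other tag (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=False so entities such as &amp; are passed through as-is
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # get_starttag_text() returns the tag exactly as written in the input
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == 'a':
            self.parts.append('</a>')

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f'&{name};')

    def handle_charref(self, name):
        self.parts.append(f'&#{name};')


def strip_tags_except_a(html):
    parser = KeepAnchors()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)


print(strip_tags_except_a('<span>TEST1</span> <span>TEST2</span> <a href="#">TEST3</a>'))
```

Unlike the regex, this keeps a bare <a> with no attributes and does not depend on the exact spelling of the tag.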

Sebastian · Accepted Answer · 2022-04-27 12:08:12Z


You can use negative lookaheads.

In your case, exclude <a and </a> from the match:

(?!<a )(?!<\/a>)<[^>]+>

Note the trailing space in "<a " and the closing angle bracket in </a>: they ensure that only the opening and closing tags of an <a> element are excluded, while other tags whose names begin with a (such as <abbr>) are still removed.
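
As a quick check, the pattern above can be dropped into the code from the question. This is only a sketch: the name TAG_EXCEPT_A is made up here, and the \/ escape is written as a plain / because Python does not need it.

```python
import re

# Pattern from the answer: refuse to match at "<a " or "</a>", match any other tag
TAG_EXCEPT_A = re.compile(r'(?!<a )(?!</a>)<[^>]+>')

html = '<span>TEST1</span> <span>TEST2</span> <a href="#">TEST3</a>'
result = TAG_EXCEPT_A.sub('', html)
print(result)  # TEST1 TEST2 <a href="#">TEST3</a>

# Tags that merely start with "a" are still removed
abbr_result = TAG_EXCEPT_A.sub('', '<abbr>HTML</abbr>')
print(abbr_result)  # HTML
```

One limitation worth knowing: a bare <a> without attributes has no space after the a, so the pattern removes it. Writing the first lookahead as (?!<a[\s>]) is one possible way to cover that case, though it goes beyond what the answer proposes.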

edited Apr 27, 2022 at 12:08

answered Apr 27, 2022 at 10:45
