Java regex to retain specific closing tags

Question

I'm trying to write a regex to remove all but a handful of closing xml tags.

The code seems simple enough:

String stringToParse = "<body><xml>some stuff</xml></body>";
Pattern pattern = Pattern.compile("</[^(a|em|li)]*?>");
Matcher matcher = pattern.matcher(stringToParse);
stringToParse = matcher.replaceAll("");

However, when this runs, it skips the "xml" closing tag. It seems to skip any tag where there is a matching character in the compiled group (a|em|li), i.e. if I remove the "l" from "li", it works.

I would expect this to return the following string: "<body><xml>some stuff" (I am doing additional parsing to remove the opening tags but keeping it simple for the example).

Could you please explicitly state what you wish the final value of stringToParse to be, and what you get instead?
– Christopher Bruns, Commented Feb 2, 2010 at 22:41
This seems to be part of some security-sensitive task. I would strongly recommend forgetting the regex idea and going for a real parser instead. Even though you named the variable "stringToParse", using regex is not parsing.
– BalusC, Commented Feb 2, 2010 at 22:44

Mark Byers · Accepted Answer · 2010-02-02 22:52:06Z

4

You probably shouldn't use regex for this task, but let's see what happens...

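Your problem is that you are using a negated character class, and inside character classes you can't write complex expressions, only characters. [^(a|em|li)] therefore matches any single character other than (, a, |, e, m, l, i, or ), which is why </xml> is skipped: it contains m and l. You could try a negative lookahead instead: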

"</(?!a|em|li).*?>"
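As a rough check, the lookahead pattern can be run against the string from the question. This is only a sketch: the class name LookaheadDemo and the helper strip are made up for illustration, and the second input is an invented example.

```java
import java.util.regex.Pattern;

class LookaheadDemo {
    // "</", then a check that the name does not START with a, em, or li,
    // then lazily everything up to the next ">".
    static final Pattern CLOSING = Pattern.compile("</(?!a|em|li).*?>");

    // Hypothetical helper: removes every closing tag the pattern matches.
    static String strip(String input) {
        return CLOSING.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // prints: <body><xml>some stuff
        System.out.println(strip("<body><xml>some stuff</xml></body>"));
        // prints: <li><a>link</a></li><p>text
        System.out.println(strip("<li><a>link</a></li><p>text</p>"));
    }
}
```

The closing tags for a and li survive in the second call, while every other closing tag is dropped.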

But this won't handle a number of cases correctly:

Comments containing things that look like tags.
Tags as strings in attributes.
Tags that start with a, em, or li but are actually other tags.
Capital letters.
etc...
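Two of the cases above (prefix collisions and capital letters) can be addressed by making the lookahead require the whole tag name and by compiling with CASE_INSENSITIVE. The following is a sketch under those assumptions, not a complete fix: comments and attribute values that contain tag-like text are still mishandled. The names TightenedDemo, stripLoose, and stripTight are illustrative only.

```java
import java.util.regex.Pattern;

class TightenedDemo {
    // Pattern from the answer, for comparison.
    static final Pattern LOOSE = Pattern.compile("</(?!a|em|li).*?>");

    // The lookahead now demands the full name followed by optional whitespace
    // and ">", so </article> no longer counts as an "a" tag.
    // CASE_INSENSITIVE makes </EM> count as </em>.
    static final Pattern TIGHT =
            Pattern.compile("</(?!(?:a|em|li)\\s*>)[^>]*>", Pattern.CASE_INSENSITIVE);

    static String stripLoose(String input) {
        return LOOSE.matcher(input).replaceAll("");
    }

    static String stripTight(String input) {
        return TIGHT.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "<article><EM>x</EM></article>";
        // prints: <article><EM>x</article>   (wrong tag kept, wrong tag removed)
        System.out.println(stripLoose(input));
        // prints: <article><EM>x</EM>
        System.out.println(stripTight(input));
    }
}
```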

You can probably fix these problems, but you need to consider whether or not it is worth it, or if it would be better to look for a solution based on a proper HTML parser.

answered Feb 2, 2010 at 22:52

Mark Byers


1 Comment

Chris B Over a year ago

Awesome, Mark, thanks for the explanation. I did not understand that aspect of character classes.

Brian Agnew · Answer · 2010-02-02 23:10Z (edited 2017-05-23 12:13:27Z)

1

I would really use a proper parser for this (e.g. JTidy). You can't reliably parse XML/HTML using regular expressions because the language is not regular, and edge cases abound. I would rather use the XML parsing available in the standard JDK (JAXP) or a suitable third-party library (see above) and configure your output accordingly.
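A parser-based version might look like the sketch below. It makes several assumptions that are not in the question: the input is well-formed XML (real-world HTML would need something like JTidy first), the actual goal is to keep the a, em, and li elements and drop all other tags while keeping their text, and comments can be discarded. The class name KeepTagsDemo and its methods are hypothetical.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class KeepTagsDemo {
    // Assumed whitelist, taken from the tag names in the question.
    static final Set<String> KEEP = Set.of("a", "em", "li");

    static String filter(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations, a common precaution against XXE.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            write(doc.getDocumentElement(), out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    // Emits text as-is (re-escaped), emits tags only for whitelisted elements.
    static void write(Node node, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            out.append(escape(node.getNodeValue()));
            return;
        }
        if (type != Node.ELEMENT_NODE) {
            return; // comments and processing instructions are dropped
        }
        boolean keep = KEEP.contains(node.getNodeName());
        if (keep) {
            out.append('<').append(node.getNodeName());
            NamedNodeMap attrs = node.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Node attr = attrs.item(i);
                out.append(' ').append(attr.getNodeName())
                   .append("=\"").append(escape(attr.getNodeValue())).append('"');
            }
            out.append('>');
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            write(child, out);
        }
        if (keep) {
            out.append("</").append(node.getNodeName()).append('>');
        }
    }

    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        // prints: some <em>stuff</em> and <a href="x">link</a>
        System.out.println(filter(
                "<body><xml>some <em>stuff</em> and <a href=\"x\">link</a></xml></body>"));
    }
}
```

Because the parser has already separated tags from comments, attribute values, and text, the prefix and tag-in-attribute problems of the regex approach do not arise here. Name matching is still case-sensitive, as XML element names are.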

See this answer for more passionate info re. parsing XML/HTML via regexps.

edited May 23, 2017 at 12:13


answered Feb 2, 2010 at 23:10

Brian Agnew


Anon. · Answer · 2010-02-02 22:52:58Z

0

You cannot use an alternation inside a character class. A character class always matches a single character.
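To make that concrete, here is a small sketch showing that the class from the question behaves like a plain set of excluded characters. The class name CharClassDemo and its helpers are invented for this example.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Inside [...], the characters ( | ) are literals, so this is the same set
    // as [^()|aeilm]: one character that is none of those eight.
    static final Pattern SINGLE = Pattern.compile("[^(a|em|li)]");

    // The full pattern from the question.
    static final Pattern QUESTION = Pattern.compile("</[^(a|em|li)]*?>");

    static boolean allowed(String oneChar) {
        return SINGLE.matcher(oneChar).matches();
    }

    static String strip(String input) {
        return QUESTION.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(allowed("x")); // true
        System.out.println(allowed("m")); // false: m is in the excluded set
        System.out.println(allowed("|")); // false: | is a literal here
        // </body> has no excluded characters and is removed; </xml> contains m and l.
        // prints: <body><xml>some stuff</xml>
        System.out.println(strip("<body><xml>some stuff</xml></body>"));
    }
}
```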

You likely want to use a negative lookahead or lookbehind instead:

"</(?!a|em|li).*?>"

answered Feb 2, 2010 at 22:52

Anon.
