0

I'm trying to write a regex to remove all but a handful of closing xml tags.

The code seems simple enough:

String stringToParse = "<body><xml>some stuff</xml></body>";
Pattern pattern = Pattern.compile("</[^(a|em|li)]*?>");
Matcher matcher = pattern.matcher(stringToParse);
stringToParse = matcher.replaceAll("");

However, when this runs, it skips the "xml" closing tag. It seems to skip any tag whose name contains a character that also appears in the compiled group (a|em|li); for example, if I remove the "l" from "li", it works.

I would expect this to return the following string: "<body><xml>some stuff" (I am doing additional parsing to remove the opening tags but keeping it simple for the example).

2
  • Could you please explicitly state what you wish the final value of stringToParse to be, and what you get instead? Commented Feb 2, 2010 at 22:41
  • 1
    This seems part of some security-sensitive task. I would strongly recommend to forget the regex idea and go for a real parser instead. Even though you named the variable "stringToParse", using regex is not parsing. Commented Feb 2, 2010 at 22:44

3 Answers

4

You probably shouldn't use regex for this task, but let's see what happens...

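Your problem is that you are using a negated character class, and a character class can only list individual characters, not complex expressions. The class [^(a|em|li)] matches any single character other than (, a, |, e, m, l, i and ), so the match fails as soon as a tag name contains one of those letters, such as the m and l in xml. You could try a negative lookahead instead: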

"</(?!a|em|li).*?>"

But this won't handle a number of cases correctly:

  • Comments containing things that look like tags.
  • Tags as strings in attributes.
  • Tags that start with a, em, or li but are actually other tags (for example </address>, </embed>, or </link> would be kept).
  • Capital letters.
  • etc...

You can probably fix these problems, but you need to consider whether or not it is worth it, or if it would be better to look for a solution based on a proper HTML parser.
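If you do stay with regex, here is a minimal sketch of how two of those gaps (prefix matches and capital letters) might be closed. The class and method names are made up for illustration, and it still does nothing about comments or tags inside attribute values:

```java
import java.util.regex.Pattern;

public class ClosingTagStripper {
    // Matches a closing tag unless its whole name is a, em, or li.
    // The \s*> inside the lookahead pins down the end of the tag name,
    // so </article> is still removed while </a> is kept.
    private static final Pattern UNWANTED_CLOSING_TAG = Pattern.compile(
            "</(?!(?:a|em|li)\\s*>)[^>]*>", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: removes every closing tag except </a>, </em>, </li>.
    static String stripClosingTags(String input) {
        return UNWANTED_CLOSING_TAG.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripClosingTags("<body><xml>some stuff</xml></body>"));
        System.out.println(stripClosingTags("<ul><li>x</li></ul><a>y</A></article>"));
    }
}
```

With the input from the question this should leave "<body><xml>some stuff", which is what you said you expected.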

1 Comment

Awesome, Mark, thanks for the explanation. I did not understand that aspect of character classes.
1

I would really use a proper parser for this (e.g. JTidy). XML and HTML are not regular languages, so regular expressions cannot parse them in general, and the edge cases never end. I would rather use the XML parsing available in the standard JDK (JAXP) or a suitable third-party library such as JTidy, and configure your output accordingly.

See this answer for more passionate info re. parsing XML/HTML via regexps.
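As a rough sketch of the JAXP route, assuming the input is well-formed XML and that the end goal is to keep only the a, em, and li elements plus all text: the class name, the allowed set, and the output format below are my own assumptions, attributes are dropped, and text is written back without re-escaping, so treat it as a starting point rather than a finished sanitizer.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class AllowedTagFilter {
    // Element names to keep; everything else is unwrapped (assumed from the question).
    private static final Set<String> ALLOWED =
            new HashSet<>(Arrays.asList("a", "em", "li"));

    static String filter(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Ask the parser to apply its built-in processing limits.
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        StringBuilder out = new StringBuilder();
        append(doc.getDocumentElement(), out);
        return out.toString();
    }

    // Depth-first walk: emit text as-is, emit tags only for allowed elements.
    private static void append(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue());
            return;
        }
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return; // comments, processing instructions, etc. are skipped
        }
        String name = node.getNodeName().toLowerCase();
        boolean keep = ALLOWED.contains(name);
        if (keep) {
            out.append('<').append(name).append('>');
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            append(children.item(i), out);
        }
        if (keep) {
            out.append("</").append(name).append('>');
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(filter("<body><xml>some <em>stuff</em></xml><li>x</li></body>"));
    }
}
```

Because the parser has already worked out where each element starts and ends, comments and attribute values that merely look like tags never reach the filtering step.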

0

You cannot use an alternation inside a character class. A character class always matches a single character.
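A small check makes this visible. The class and method names here are only for illustration; the pattern strings are the ones from the question:

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    // The class from the question, on its own: the parentheses and pipes
    // are literal characters inside the brackets, not grouping or alternation.
    private static final Pattern SINGLE = Pattern.compile("[^(a|em|li)]");

    // The full pattern from the question.
    private static final Pattern ORIGINAL = Pattern.compile("</[^(a|em|li)]*?>");

    static boolean classMatches(String s) {
        return SINGLE.matcher(s).matches();
    }

    static boolean originalFinds(String s) {
        return ORIGINAL.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(classMatches("x"));       // true: x is not in the set
        System.out.println(classMatches("l"));       // false: l is excluded
        System.out.println(classMatches("|"));       // false: | is excluded too
        System.out.println(originalFinds("</body>")); // true: b, o, d, y are all allowed
        System.out.println(originalFinds("</xml>"));  // false: m and l are excluded
    }
}
```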

You likely want to use a negative lookahead or lookbehind instead:

"</(?!a|em|li).*?>"
