I want to write a method in Java that splits a String by XML tags, as follows:

"Lorem ipsum <b>dolor</b> sit amet consetetur <b>diam</b> nonumy."

It should return the array:

["Lorem ipsum ", "<b>dolor</b>", " sit amet consetetur ", "<b>diam</b>", " nonumy."]

This should work for every XML tag, including self-closing tags like <element />.

Is there a library that does something similar in a simple way?

Thanks!

1 Answer

Using lookaround in your split should do the trick:

String[] splits = input.split("\\s+(?=<b>)|(?<=</b>)\\s+");

Example:

String input = "Lorem ipsum <b>dolor</b> sit amet consetetur <b>diam</b> nonumy.";
for(String s : input.split("\\s+(?=<b>)|(?<=</b>)\\s+")){
    System.out.println(s);
}

If you want to keep the spaces intact in the split array (as in the expected output from the question), remove the \\s+ from both alternatives of the regex.
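
As a minimal sketch of that variant: with only the zero-width lookarounds left, the split consumes no characters, so the surrounding spaces stay in the neighbouring pieces. The class and method names (`SplitKeepSpaces`, `splitOnBold`) are made up for illustration, and the pattern still only knows about `<b>`.

```java
import java.util.Arrays;

class SplitKeepSpaces {
    // Split before every "<b>" and after every "</b>".
    // Both alternatives are zero-width, so no character is removed.
    static String[] splitOnBold(String input) {
        return input.split("(?=<b>)|(?<=</b>)");
    }

    public static void main(String[] args) {
        String input = "Lorem ipsum <b>dolor</b> sit amet consetetur <b>diam</b> nonumy.";
        // Print each piece in brackets so the kept spaces are visible.
        for (String s : splitOnBold(input)) {
            System.out.println("[" + s + "]");
        }
    }
}
```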


4 Comments

Great! As a follow-up question: do you know a generic regexp that works for every XML tag? I cannot know in advance which tags will come. The regexp should also match self-closing tags like <element/>, with or without attributes.
Tag names of arbitrary length will not work directly, because a lookbehind (?<=...) in Java needs a bounded maximum length.
I solved that with a bounded quantifier: (?=<mynamespace:.*?>)|(?<=</mynamespace:.{1,20}>).
OK, and with self-closing tags the regex looks as follows: ((?=<mynamespace:.*?>)|(?<=</mynamespace:.{1,20}>))|((?=<mynamespace:.*?)|(?<=/>)).
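
For tags that are not known in advance, one alternative to split() is to match whole elements with a Matcher and collect the text in between. The following is only a rough sketch under stated assumptions: `TagSplitter`, `NAME`, and `ELEMENT` are hypothetical names, attribute values must not contain `<` or `>`, and nested elements with the same name are not handled (a real XML parser would be needed for that).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagSplitter {
    // Simplified tag name (assumption): letter or underscore, then word chars, ':', '.', '-'.
    private static final String NAME = "[A-Za-z_][\\w:.-]*";

    // Either a self-closing tag, or an opening tag up to the first closing tag
    // with the same name (backreference \1). Attributes may not contain '<' or '>'.
    private static final Pattern ELEMENT = Pattern.compile(
            "<" + NAME + "(?:\\s[^<>]*)?/>"
            + "|<(" + NAME + ")(?:\\s[^<>]*)?>.*?</\\1\\s*>",
            Pattern.DOTALL);

    static List<String> split(String input) {
        List<String> parts = new ArrayList<>();
        Matcher m = ELEMENT.matcher(input);
        int last = 0;
        while (m.find()) {
            // Text between the previous element and this one.
            if (m.start() > last) {
                parts.add(input.substring(last, m.start()));
            }
            parts.add(m.group());
            last = m.end();
        }
        // Trailing text after the last element.
        if (last < input.length()) {
            parts.add(input.substring(last));
        }
        return parts;
    }

    public static void main(String[] args) {
        String input = "Lorem ipsum <b>dolor</b> sit amet <element /> nonumy.";
        for (String s : split(input)) {
            System.out.println("[" + s + "]");
        }
    }
}
```

Matching elements instead of splitting around them avoids the lookbehind length limit, since the closing tag is found with an ordinary backreference.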
