I'm new to regular expressions, and I'm trying to use one to split a string into tokens separated by "(", ")", and whitespace. This is my attempt:
String str = "(test (_bit1 _bit2 |bit3::&92;test#4|))";
String[] tokens = str.split("[\\s*[()]]");
for (int i = 0; i < tokens.length; i++)
    System.out.println(i + " : " + tokens[i]);
I expect the following output:
0 : test
1 : _bit1
2 : _bit2
3 : |bit3::&92;test#4|
However, two empty tokens appear in the actual output:
0 :
1 : test
2 :
3 : _bit1
4 : _bit2
5 : |bit3::&92;test#4|
I don't understand why I get two empty tokens, at positions 0 and 2. Could anyone give me a hint? Thank you.
===== Update ====
Alan Moore posted an answer and later deleted it. I liked the answer, so I'm copying it here for my own reference.
Your regex, [\s*[()]], matches exactly one character: a whitespace character (\s) or one of the characters *, (, or ). The delimiter at the very beginning of the string, the opening (, is why you get the empty first token. There's no way around that; you just have to check for an empty first token and ignore it.
The second empty token is between the first space and the ( that follows it. That one's on you, because you used * (zero or more) instead of + (one or more), and inside the character class the * is a literal anyway, so the regex only ever matches one delimiter character at a time. But fixing it isn't that simple. You want to split on spaces, parens, or both, but you have to make sure there's at least one character, whichever it is. This might do it:
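To see both effects in isolation, here is a small sketch that runs the original pattern against a few short inputs. The class name CharClassSplitDemo and the helper split are made up for illustration; only String.split and Arrays.toString come from the JDK.

```java
import java.util.Arrays;

class CharClassSplitDemo {
    // The pattern from the question: ONE character that is whitespace, '*', '(' or ')'.
    static final String ORIGINAL = "[\\s*[()]]";

    // Hypothetical helper, just to keep the calls below short.
    static String[] split(String input) {
        return input.split(ORIGINAL);
    }

    public static void main(String[] args) {
        // '*' is a literal inside the class, so it acts as a delimiter.
        System.out.println(Arrays.toString(split("a*b")));   // [a, b]
        // A delimiter at the start of the input yields an empty first token.
        System.out.println(Arrays.toString(split("(a")));    // [, a]
        // Two adjacent delimiters yield an empty token between them.
        System.out.println(Arrays.toString(split("a (b")));  // [a, , b]
    }
}
```

The last two calls correspond to the empty tokens at positions 0 and 2 in the question's output.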
\s*[()]+\s*|\s+
But you probably should allow for spaces between parens, too:
\s*(?:[()]+\s*)+|\s+
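The difference between the two patterns only shows up when whitespace sits between parens. A rough comparison, assuming a made-up input "a) (b" and hypothetical names (DelimiterComparison, SIMPLE, WITH_INNER_SPACES, split):

```java
import java.util.Arrays;

class DelimiterComparison {
    // First suggestion: parens with optional surrounding whitespace, or whitespace alone.
    static final String SIMPLE = "\\s*[()]+\\s*|\\s+";
    // Second suggestion: also allows whitespace between groups of parens.
    static final String WITH_INNER_SPACES = "\\s*(?:[()]+\\s*)+|\\s+";

    // Hypothetical helper wrapping String.split.
    static String[] split(String input, String regex) {
        return input.split(regex);
    }

    public static void main(String[] args) {
        String input = "a) (b";
        // ") " and "(" are matched as two separate delimiters, leaving an empty token.
        System.out.println(Arrays.toString(split(input, SIMPLE)));            // [a, , b]
        // ") (" is matched as a single delimiter.
        System.out.println(Arrays.toString(split(input, WITH_INNER_SPACES))); // [a, b]
    }
}
```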
As a Java string literal, that would be:
"\\s*(?:[()]+\\s*)+|\\s+"
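Putting it together, a minimal sketch of how that regex might be used on the string from the question. Each backslash is doubled, as a Java string literal requires, and the empty first token is skipped as the answer suggests. The class name ParenTokenizer and the method tokenize are assumptions for illustration, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;

class ParenTokenizer {
    // The regex from the answer, with backslashes doubled for a Java string literal.
    static final String DELIMITERS = "\\s*(?:[()]+\\s*)+|\\s+";

    // Hypothetical helper: split on the delimiters and drop empty tokens.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        for (String token : input.split(DELIMITERS)) {
            // split() keeps an empty first token when the input starts with a delimiter.
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("(test (_bit1 _bit2 |bit3::&92;test#4|))");
        for (int i = 0; i < tokens.size(); i++) {
            System.out.println(i + " : " + tokens.get(i));
        }
    }
}
```

Run on the input from the question, this prints the four expected tokens, numbered 0 to 3. The trailing "))" causes no trouble because String.split discards trailing empty strings by default.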