
The code:

String s = "a12ij";

System.out.println(Arrays.toString(s.split("\\d?")));

The output is [a, , , i, j], which confuses me. If the expression is greedy, shouldn't it try and match as much as possible, thereby splitting on each digit? I would assume that the output should be [a, , i, j] instead. Where is that extra empty character coming from?

  • Yes, ? is greedy, but only up to 1 character. So this happens twice and you'll need 2. This is something that is pretty standardized. Commented Apr 17, 2016 at 4:33
  • @Obicere, Using two ? would equate to matching between zero and one time, as few times as possible. Commented Apr 17, 2016 at 5:03
  • What do you want the output array to be? Commented Apr 17, 2016 at 7:50

2 Answers


The pattern you're using only matches one digit at a time:

\d    match a digit [0-9]
 ?    matches between zero and one time (greedy)

Since you have more than one digit, it's going to split on each of them individually. There are a few other quantifiers you could put on \d; here are a couple:

\d    match a digit [0-9]
+?    matches between one and unlimited times (lazy)

In a split, though, the lazy form stops after the first digit, so it still splits on each digit separately and gives [a, , ij]. To match a whole run of digits at once, you could just do:

\d    match a digit [0-9]
 +    matches between one and unlimited times (greedy)

This gives [a, ij], which is likely the closest to what you want, although that's unclear from the question.
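To compare the three quantifiers side by side, here is a minimal sketch. The class name SplitQuantifierDemo and the helper splitToString are illustrative, not from the question, and the output for \d? assumes Java 8 or later.

```java
import java.util.Arrays;

class SplitQuantifierDemo {
    // Hypothetical helper: splits the input and renders the result
    // the same way the question does, via Arrays.toString.
    static String splitToString(String input, String regex) {
        return Arrays.toString(input.split(regex));
    }

    public static void main(String[] args) {
        String s = "a12ij";
        System.out.println(splitToString(s, "\\d?"));   // optional single digit
        System.out.println(splitToString(s, "\\d+?"));  // lazy: stops after one digit
        System.out.println(splitToString(s, "\\d+"));   // greedy: takes the whole run of digits
    }
}
```

On Java 8 or later this prints [a, , , i, j], then [a, , ij], then [a, ij].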

Explanation:

Since the token \d is using the ? quantifier, the pattern matches a digit between zero and one time. So a zero-width match is possible at every position where there is no digit (zero), and each digit is matched on its own (once).

You can picture it something like this:

    a,1,2,i,j    // each character represents (zero) and is split
      | |
    a, , ,i,j    // digit 1 and 2 are each matched (once)

Digits 1 and 2 were matched as delimiters, so they are tossed out. However, the split points around them remain, which basically produces two empty strings.


If you're specifically looking to have your result as a, ,i,j then I'll give you a hint: you'll want to (capture the \digits as a group between one and unlimited times+), followed by the greedy quantifier ?. I recommend visiting one of the popular regex sites that let you experiment with patterns and quantifiers; it's also a great way to learn and can teach you a lot!

The solution can be found here
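One reading of the hint above is the pattern (\d+)?, which makes a whole run of digits optional. The sketch below is an interpretation rather than the linked solution; the class and method names are hypothetical, and the result assumes Java 8 or later.

```java
import java.util.Arrays;

class OptionalGroupSplit {
    // Hypothetical helper illustrating one interpretation of the hint.
    static String[] splitOnOptionalDigitRun(String input) {
        // (\d+)? : one or more digits captured as a group, and the group is optional.
        return input.split("(\\d+)?");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnOptionalDigitRun("a12ij")));
    }
}
```

This prints [a, , i, j]: the digits "12" are consumed as one delimiter, and the zero-width match that follows right before i contributes the single empty string.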


12 Comments

I'm not convinced this really answers the question. Judging from what the OP expected, I think he already knew that it was splitting on each digit individually, but didn't understand the extra empty string in the output.
@ajb: The second example of my answer should produce the 'expected' output. I'm not sure why anyone would want an empty string in their array, but perhaps there's a good reason.
@I'L'I answer for extra empty strings is probably here stackoverflow.com/questions/18870699/…
@11thdimension: From that question it looks like they don't want the empty strings, which seems logical. So still unclear why anyone would prefer keeping them (as the OP shows).
@11thdimension That answer is from 2013, and I strongly suspect it's no longer valid.

The javadoc for split() is not clear on what happens when a pattern can match the empty string. My best guess here is the delimiters found by split() are what would be found by successive find() calls of a Matcher. The javadoc for find() says:

This method starts at the beginning of this matcher's region, or, if a previous invocation of the method was successful and the matcher has not since been reset, at the first character not matched by the previous match.

So if the string is "a12ij" and the pattern matches either a single digit or an empty string, then find() should find the following:

  • Empty string starting at position 0 (before a)
  • The string "1"
  • The string "2"
  • Empty string starting at position 3 (before i). This is because "the first character not matched by the previous match" is the i.
  • Empty string starting at position 4 (before j).
  • Empty string starting at position 5 (at the end of the string).
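The list above can be checked directly with a Matcher. This is a small sketch; the class name FindMatchesDemo and the "start-end:text" format are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindMatchesDemo {
    // Hypothetical helper: records every successive find() result
    // as "start-end:matchedText".
    static List<String> listMatches(String input, String regex) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.start() + "-" + m.end() + ":" + m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        for (String match : listMatches("a12ij", "\\d?")) {
            System.out.println(match);
        }
    }
}
```

For "a12ij" and \d? it prints 0-0:, 1-2:1, 2-3:2, 3-3:, 4-4: and 5-5:, one per line, which lines up with the six matches listed.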

So the matches found are the substrings marked with an x below, where an x under a blank means the match is an empty string:

  a   1   2   i   j
x     x   x x   x   x

Now if we look at the substrings between the x's, they are "a", "", "", "i", "j" as you are seeing. (The substring before the first empty string is not returned, because the split() javadoc says "A zero-width match at the beginning however never produces such empty leading substring." [Note that this may be new behavior with Java 8.] Also, split() doesn't return empty trailing substrings.)

I'd have to look at the code for split() to confirm this behavior. But it makes sense looking at the Matcher javadoc and it is consistent with the behavior you're seeing.

MORE: I've confirmed from the source that split() does rely on Matcher and find(), except for an optimization for the common case of splitting on a one-known-character delimiter. So that explains the behavior.
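To make the mechanics concrete, here is a simplified re-creation of split() with no limit, built on find(). It is a sketch written for this explanation, not the JDK source; the names SplitSketch and splitLikeJava8 are hypothetical, and it assumes the Java 8 rule about a leading zero-width match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitSketch {
    // Simplified model of String.split(regex) with limit 0 (assumption: Java 8+ rules).
    static List<String> splitLikeJava8(String input, String regex) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> parts = new ArrayList<>();
        int index = 0; // start of the next piece to emit
        while (m.find()) {
            // A zero-width match at the very beginning produces no leading empty string.
            if (index == 0 && m.start() == 0 && m.end() == 0) {
                continue;
            }
            parts.add(input.substring(index, m.start()));
            index = m.end();
        }
        if (index == 0) {
            // Nothing was split off: the result is the whole input.
            parts.add(input);
            return parts;
        }
        parts.add(input.substring(index)); // remaining segment after the last match
        // With no limit, trailing empty strings are discarded.
        while (!parts.isEmpty() && parts.get(parts.size() - 1).isEmpty()) {
            parts.remove(parts.size() - 1);
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitLikeJava8("a12ij", "\\d?"));
    }
}
```

For "a12ij" and \d? the sketch yields [a, , , i, j], the same pieces the real split() returns.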

4 Comments

Can you please explain the same with the input 1234 and the same regex \\d?. I'm getting an empty array [] as output, but your answer suggests it should be [,,,,].
@11thdimension I wonder if they rewrote this for Java 8? I've found some source online, but it appears to be Java 6. There's a phrase in the Java 8 javadoc that isn't in the Java 7 javadoc, and the code I found won't obey this new requirement. So some tweaking had to be done, apparently. I should have the source somewhere so I'll try to take a look at it.
OK, it looks like they added a little logic to suppress the initial empty string, but other than that, and other than an optimization in case we're splitting on one simple character, it should be the same. I'm getting the same output as you, but I don't understand it.
@11thdimension Sigh... I should have spotted this right away. The reason your output array is empty is because trailing empty strings are always discarded, and all the empty strings are trailing here. If I try "1234".split("\\d?", 1000), now the resulting array is 6 empty strings, since the "trailing empty string" rule doesn't apply if there's a limit.
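The effect of the limit argument described in the last comment can be reproduced with a short sketch. The class name SplitLimitDemo and the helper countParts are illustrative, and the counts assume Java 8 or later.

```java
class SplitLimitDemo {
    // Hypothetical helper: number of pieces split() returns for a given limit.
    static int countParts(String input, String regex, int limit) {
        return input.split(regex, limit).length;
    }

    public static void main(String[] args) {
        // Limit 0 behaves like split(regex): trailing empty strings are discarded.
        System.out.println(countParts("1234", "\\d?", 0));
        // A positive limit that is never reached keeps the trailing empty strings.
        System.out.println(countParts("1234", "\\d?", 1000));
        // A negative limit also keeps them.
        System.out.println(countParts("1234", "\\d?", -1));
    }
}
```

This prints 0, 6 and 6.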
