2

Why do an empty regex and an empty capturing group regex each return string length plus one matches?

Code

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmptyRegexDemo {
    public static void main(String... args) {
        {
            // Case 1: the pattern is the empty string.
            System.out.format("Pattern - empty string\n");
            String input = "abc";
            Pattern pattern = Pattern.compile("");
            Matcher matcher = pattern.matcher(input);
            while (matcher.find()) {
                String s = matcher.group();
                // Print the matched text with its start and end offsets.
                System.out.format("[%s]: %d / %d\n", s, matcher.start(),
                        matcher.end());
            }
        }
        {
            // Case 2: the pattern is a single empty capturing group.
            System.out.format("Pattern - empty capturing group\n");
            String input = "abc";
            Pattern pattern = Pattern.compile("()");
            Matcher matcher = pattern.matcher(input);
            while (matcher.find()) {
                String s = matcher.group();
                System.out.format("[%s]: %d / %d\n", s, matcher.start(),
                        matcher.end());
            }
        }
    }
}

Output

Pattern - empty string
[]: 0 / 0
[]: 1 / 1
[]: 2 / 2
[]: 3 / 3
Pattern - empty capturing group
[]: 0 / 0
[]: 1 / 1
[]: 2 / 2
[]: 3 / 3
2
  • Could you try to print the start and end offset of each captured group? Commented Apr 19, 2013 at 11:18
  • Thanks, the answers are corroborated by the provided start and ends. Commented Apr 19, 2013 at 11:34

2 Answers

5

The regex engine is hardcoded to advance one position after a zero-length match (otherwise it would loop forever at the same position). Your regex matches a zero-length substring. There is a zero-length substring at every position in the input: in the gap between each pair of adjacent characters, and also at the start and the end of the string, which the regex engine treats as valid match positions. A string of length N has N+1 such positions (N-1 gaps between characters, plus the start and the end), so you get N+1 matches.
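To check the N+1 claim for a few lengths, here is a minimal sketch. The class name `EmptyMatchCount` and the helper `countMatches` are made up for this illustration; only `Pattern`, `Matcher` and `find()` come from `java.util.regex`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmptyMatchCount {
    // Hypothetical helper: counts how many times find() succeeds.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // For the empty pattern, each input should report length + 1 matches.
        for (String input : new String[] {"", "a", "abc", "hello"}) {
            System.out.format("length %d -> %d matches%n",
                    input.length(), countMatches("", input));
        }
    }
}
```

Even the empty input yields one match, at position 0, which is both the start and the end of the string.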


4

Regex engines consider positions before and after characters, too. You can see this from the fact that they have constructs like ^ (start of string), $ (end of string) and \b (word boundary), which match at certain positions without consuming any characters (and therefore between, before or after characters). So there are N-1 positions between characters to consider, plus the first and last positions (because ^ and $ match there, respectively), which gives you N+1 candidate positions. An empty pattern places no restriction on the match, so it matches at every one of them.
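As a rough illustration of where the zero-width constructs match, the sketch below lists the start offsets found for each pattern on "abc". The class `AnchorPositions` and the helper `matchStarts` are names invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorPositions {
    // Hypothetical helper: collects the start offset of every match.
    static List<Integer> matchStarts(String regex, String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        String input = "abc";
        // ^ matches only before the first character, $ only after the last,
        // \b at both edges of the word, and the empty pattern everywhere.
        for (String regex : new String[] {"^", "$", "\\b", ""}) {
            System.out.format("%-4s -> %s%n", regex, matchStarts(regex, input));
        }
    }
}
```

The anchors pick out a subset of the positions 0 to 3, while the empty pattern reports all four.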

So here are your matches:

" a b c "
 ^ ^ ^ ^

Which is obviously N+1 for N characters.

You will get the same behavior with other patterns that allow zero-length matches and don't find anything longer in your input. For instance, try \d*. It cannot find any digits in your input string, but * will gladly return zero-length matches.
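The \d* case can be tried with a short sketch like the following. The class `OptionalPatternMatches` and the helper `spans` are hypothetical names, and `x?` is an extra pattern added here only as a second example of something that can match nothing in "abc".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OptionalPatternMatches {
    // Hypothetical helper: records each match as "start/end".
    static List<String> spans(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.start() + "/" + m.end());
        }
        return result;
    }

    public static void main(String[] args) {
        // "abc" has no digits and no 'x', so both patterns can only
        // succeed with zero-length matches, one per position.
        System.out.println(spans("\\d*", "abc"));
        System.out.println(spans("x?", "abc"));
    }
}
```

Both calls report the same four zero-length spans as the empty pattern in the question.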

2 Comments

There are only two positions "between characters".
@Vitaly Sorry, that was not formulated accurately then. But the positions before the first and after the last character are obviously also considered, since the anchors ^ and $ match at these positions.
