Why empty regex and empty capturing group regex return string length plus one results

Question

How would you explain that an empty regex and an empty capturing group regex each return string length plus one matches?

Code

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmptyRegexDemo {
    public static void main(String... args) {
        {
            // Case 1: a completely empty pattern.
            System.out.format("Pattern - empty string\n");
            String input = "abc";
            Pattern pattern = Pattern.compile("");
            Matcher matcher = pattern.matcher(input);
            while (matcher.find()) {
                String s = matcher.group();
                // Print the matched text with its start and end offsets.
                System.out.format("[%s]: %d / %d\n", s, matcher.start(),
                        matcher.end());
            }
        }
        {
            // Case 2: a pattern consisting only of an empty capturing group.
            System.out.format("Pattern - empty capturing group\n");
            String input = "abc";
            Pattern pattern = Pattern.compile("()");
            Matcher matcher = pattern.matcher(input);
            while (matcher.find()) {
                String s = matcher.group();
                System.out.format("[%s]: %d / %d\n", s, matcher.start(),
                        matcher.end());
            }
        }
    }
}

Output

Pattern - empty string
[]: 0 / 0
[]: 1 / 1
[]: 2 / 2
[]: 3 / 3
Pattern - empty capturing group
[]: 0 / 0
[]: 1 / 1
[]: 2 / 2
[]: 3 / 3

Could you try to print the start and end offset of each captured group?
– YMomb, Commented Apr 19, 2013 at 11:18
Thanks, the answers are corroborated by the provided start and ends.
– YMomb, Commented Apr 19, 2013 at 11:34

michaelb958--GoFundMonica · Accepted Answer · 2013-04-19 11:35:48Z

5

The regex engine is hardcoded to advance one position after a zero-length match (otherwise it would loop forever at the same position). Your regex matches a zero-length substring. There is a zero-length substring between every pair of adjacent characters (think of the "gaps" between characters); in addition, the regex engine treats the start and end of the string as valid match positions. Because a string of length N contains N+1 such gaps (counting the start and end, which the regex engine does), you get N+1 matches.
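As a quick check of the N+1 claim, here is a minimal sketch that counts the matches of the empty pattern for inputs of several lengths. The class name `EmptyMatchCount` and the helper `countMatches` are made up for this illustration; only `Pattern`, `Matcher` and `find()` come from `java.util.regex`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmptyMatchCount {
    // Hypothetical helper: counts successive find() hits of regex in input.
    static int countMatches(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // For each input, the empty pattern should match length + 1 times.
        for (String input : new String[] {"", "a", "abc", "hello"}) {
            System.out.format("\"%s\": length %d, matches %d%n",
                    input, input.length(), countMatches("", input));
        }
    }
}
```

The loop terminates only because `find()` moves on by one position after each zero-length match; the count is therefore one per position from 0 to the string length inclusive.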

edited Apr 19, 2013 at 11:35

answered Apr 19, 2013 at 11:19


Martin Ender · Accepted Answer · 2013-04-19 11:34:47Z

4

Regex engines consider positions before and after characters, too. You can see this from the fact that they have things like ^ (start of string), $ (end of string) and \b (word boundary), which match at certain positions without matching any characters (and therefore between/before/after characters). Therefore we have the N-1 positions between characters to consider, as well as the first and last positions (because ^ and $ would match there, respectively), which gives N+1 candidate positions, all of which match a completely unrestrictive empty pattern.

So here are your matches:

" a b c "
 ^ ^ ^ ^

That is N+1 matches for N characters.
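To make the positions concrete, the following sketch lists the offsets at which a few zero-width patterns match in "abc". The class `ZeroWidthPositions` and the helper `matchStarts` are names invented for this example, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ZeroWidthPositions {
    // Hypothetical helper: collects the start offset of every match.
    static List<Integer> matchStarts(String regex, String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            starts.add(matcher.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        String input = "abc";
        // Anchors and the empty pattern all match positions, not characters.
        for (String regex : new String[] {"^", "$", "\\b", ""}) {
            System.out.format("%-3s -> %s%n", regex, matchStarts(regex, input));
        }
    }
}
```

With the default flags, ^ matches only at offset 0 and $ only at offset 3, while the empty pattern matches at every offset from 0 to 3.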

You will get the same behavior with other patterns that allow zero-length matches and don't actually find longer ones in your input. For instance, try \d*. It cannot find any digits in your input string, but * will happily return zero-length matches.
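A small sketch of the \d* case, assuming the same input "abc" as in the question. The class `StarZeroLength` and the helper `describeMatches` are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StarZeroLength {
    // Hypothetical helper: renders each match as "[text]@start-end".
    static List<String> describeMatches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            found.add("[" + matcher.group() + "]@"
                    + matcher.start() + "-" + matcher.end());
        }
        return found;
    }

    public static void main(String[] args) {
        // No digits in "abc", so every match of \d* is zero-length.
        System.out.println(describeMatches("\\d*", "abc"));
        // With digits present, * prefers the longer match where one exists.
        System.out.println(describeMatches("\\d*", "a12b"));
    }
}
```

On "abc" this produces four empty matches, one per position, just like the empty pattern.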

edited Apr 19, 2013 at 11:34

answered Apr 19, 2013 at 11:21


2 Comments

Vitaly Over a year ago

There are only two positions "between characters".

Martin Ender Over a year ago

@Vitaly Sorry, that was not accurately formulated. But the positions before the first and after the last character are obviously also considered, since the anchors ^ and $ match at these positions.
