Where does the split() method in Java begin matching regex to a string?

Question

I was messing around with the split() method in Java when I came across a problem which I couldn't seem to understand. I was curious as to where exactly the split method starts to search for regex matches: at the first character, before, or after?

Given String "test":

If the split method starts before the first character then there should be an empty string before the string "test", and splitting at an empty string should return an array of length 6, but it is of length 5.

System.out.println("test".split("",-1).length);

So clearly the split method does not start before the given string.

If the split method starts at the first character of the given string, then shouldn't splitting with a regex of "Z*" return an array of length 6 with a leading empty string, since the first character is indeed not Z (hence 0 or more times)? However, it returns an array of length 5.

System.out.println("test".split("Z*",-1).length);

So by elimination the split method starts after the first character... but clearly it does not, since the following code works as expected:

System.out.println("test".split("t",-1).length);
Output: 3

So where exactly does the split method start searching for regex matches? Or what exactly is the gap in my reasoning?

The array length is the number of matches + 1. 'test' has 2 t's, so splitting on t gives 3. 'test' has 4 characters, so matching nothing gives 5. Is that what you got?
– user557597, Commented Mar 18, 2018 at 17:05
Also be aware that in more recent versions of the JDK, the split method was optimised so that a single-character pattern which is not a regex special character will not actually engage the regex engine. So splitting on just the character "t" will not cause regex to be engaged.
– Bobulous, Commented Mar 18, 2018 at 17:09
You could always set the limit to 0, which will remove any trailing empty strings from the array.
– user557597, Commented Mar 18, 2018 at 17:15
@sln Yes, I do believe that might be what happened. So essentially matching at "Z*" is synonymous with matching at an empty string, and since the first empty string comes after the first character of "test", there were only 4 matches, giving the second example a length of 5?
– NoobsPwnU, Commented Mar 18, 2018 at 17:16
Z* will match nothing as well, equivalent to "" if there are no Z's in the sample. However, if you use Z+ on a string without Z's you should get an array of 1 element, the original string.
– user557597, Commented Mar 18, 2018 at 17:19
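
The claims in these comments can be checked with a short sketch. The class name SplitCountDemo and the helper parts are made up for illustration; only String.split itself comes from the thread.

```java
import java.util.Arrays;

class SplitCountDemo {
    // Hypothetical helper: a thin wrapper around String.split so that
    // every case below is exercised the same way.
    static String[] parts(String input, String regex, int limit) {
        return input.split(regex, limit);
    }

    public static void main(String[] args) {
        // Two matches of "t" -> three pieces: "", "es", ""
        String[] keepAll = parts("test", "t", -1);
        System.out.println(keepAll.length + " " + Arrays.toString(keepAll));

        // limit 0 drops trailing empty strings but keeps the leading one
        String[] trimmed = parts("test", "t", 0);
        System.out.println(trimmed.length + " " + Arrays.toString(trimmed));

        // Z+ needs at least one Z, so it never matches and the
        // original string comes back as the only element
        String[] noMatch = parts("test", "Z+", -1);
        System.out.println(noMatch.length + " " + Arrays.toString(noMatch));
    }
}
```

The three lengths should come out as 3, 2 and 1, which is consistent with the "number of matches + 1" rule whenever trailing empty strings are kept.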

everett1992 · Accepted Answer · 2018-03-18 17:22:18Z


You can read the JDK source code online. Here is split from OpenJDK 8.

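String.split has a fast-path optimization for single-character patterns, but most work is delegated to Pattern.split. Pattern.split has a special case for a zero-width match at the beginning of the string: since Java 8, such a match never produces an empty leading substring. So the regex engine does match at index 0 for both "" and "Z*", but that first match is skipped, which leaves 4 effective matches and an array of length 5. A positive-width match at the beginning, as with "t", still produces the leading empty string.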
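
To see that the engine really does match before the first character, one can list the match positions with a Matcher and compare them with what split returns. This is a minimal sketch that assumes Java 8 or later; ZeroWidthDemo and matchStarts are hypothetical names, not part of the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroWidthDemo {
    // Hypothetical helper: collects the start index of every match
    // the regex engine finds in the input.
    static List<Integer> matchStarts(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<Integer> starts = new ArrayList<>();
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // The engine reports five zero-width matches, the first at index 0.
        System.out.println(matchStarts("Z*", "test"));

        // split discards the zero-width match at index 0 (Java 8+),
        // so only four of those matches act as separators.
        System.out.println("test".split("Z*", -1).length);
    }
}
```

On Java 8 or later this should list the indices 0 through 4 and then print 5: five matches, but only four of them used as split points.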
