I was messing around with the split() method in Java when I came across a problem which I couldn't seem to understand. I was curious as to where exactly the split method starts to search for regex matches: at the first character, before, or after?

Given String "test":

If the split method starts before the first character then there should be an empty string before the string "test", and splitting at an empty string should return an array of length 6, but it is of length 5.

System.out.println("test".split("",-1).length);
Output: 5

So clearly the split method does not start before the given string.

If the split method starts at the first character of the given string, then shouldn't splitting with a regex of "Z*" return an array of length 6 with a leading empty string? The first character is indeed not Z, so "Z*" matches there zero times. However, it returns an array of length 5.

System.out.println("test".split("Z*",-1).length);
Output: 5

So by elimination the split method starts after the first character... but clearly it does not, since the following code works as expected:

System.out.println("test".split("t",-1).length);
Output: 3

So where exactly does the split method start searching for regex matches? Or what exactly is the gap in my reasoning?

  • Number of matches + 1. 'test' has 2 t's, so splitting on t gives 3. 'test' has 4 characters, so matching nothing gives 5. Is that what you got? Commented Mar 18, 2018 at 17:05
  • Also be aware that in more recent versions of the JDK, the split method was optimised so that a single-character pattern which is not a regex special character will not actually engage the regex engine. So splitting on just the character "t" will not cause regex to be engaged. Commented Mar 18, 2018 at 17:09
  • You could always set the limit to 0, which will remove any trailing empty strings from the array. Commented Mar 18, 2018 at 17:15
  • @sln Yes, I do believe that might be what happened. So essentially matching at "Z*" is synonymous with matching at an empty string, and since the first empty string comes after the first character of "test", there were only 4 matches, giving the second example a length of 5? Commented Mar 18, 2018 at 17:16
  • Z* will also match nothing, so it is equivalent to "" when the sample contains no Z's. However, if you use Z+ on a string without Z's, you should get an array of 1 element: the original string. Commented Mar 18, 2018 at 17:19
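
The points raised in the comments above (matches + 1, the effect of a limit of 0, and Z+ versus Z*) can be checked with a short sketch. It assumes Java 8 or later; the class name SplitCommentsDemo and the helper pieces are made up for illustration and are not part of the JDK.

```java
import java.util.Arrays;

class SplitCommentsDemo {
    // Hypothetical helper: returns the pieces so they can be inspected.
    static String[] pieces(String input, String regex, int limit) {
        return input.split(regex, limit);
    }

    public static void main(String[] args) {
        // Two matches of "t" give three pieces; a negative limit keeps empty strings.
        System.out.println(Arrays.toString(pieces("test", "t", -1)));

        // A limit of 0 drops trailing empty strings but keeps the leading one.
        System.out.println(Arrays.toString(pieces("test", "t", 0)));

        // Z+ needs at least one Z, so nothing matches and the input comes back whole.
        System.out.println(Arrays.toString(pieces("test", "Z+", -1)));

        // Z* can match the empty string, so it behaves like "" on this input.
        System.out.println(Arrays.toString(pieces("test", "Z*", -1)));
    }
}
```

Printing the arrays rather than only their lengths makes it easier to see which positions produced empty strings.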

1 Answer

You can read the JDK source code online. Here is split from OpenJDK 8.

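String.split has a happy-path optimization for single-character strings, but most work is delegated to Pattern.split. Pattern.split has a special case for a zero-width match at the beginning of the string: as of Java 8, such a match never produces an empty leading substring. The search does start at index 0, but the zero-width match found there is skipped, which is why "" and "Z*" give 5 elements rather than 6.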
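
As a rough way to see this special case in action, the sketch below lists where the regex engine finds matches and compares that with what Pattern.split returns. It assumes Java 8 or later; the class name ZeroWidthSplitDemo and the helper matchStarts are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroWidthSplitDemo {
    // Hypothetical helper: collects the start index of every match in the input.
    static List<Integer> matchStarts(String regex, String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // The empty pattern matches at every position, including index 0 and the end.
        System.out.println(matchStarts("", "test"));

        // Five matches would normally mean six pieces, but the zero-width
        // match at index 0 does not create a leading empty string.
        String[] parts = Pattern.compile("").split("test", -1);
        System.out.println(parts.length);

        // A non-zero-width match at index 0 still yields a leading empty string.
        System.out.println(Pattern.compile("t").split("test", -1).length);
    }
}
```

The matcher reports a match at index 0, so the engine does look at the very start of the input; it is the split logic, not the search, that discards that first zero-width match.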
