Something worked for me today, but I'm not sure I understand it well enough to be certain that it will keep working in future versions of JavaScript.

I wanted something like string.split() on whitespace, but that would also return the delimiter strings. In other words:

f("abc   def ghi")
 => ["abc", "   ", "def", " ", "ghi"] 

My first attempt was a dozen lines of ugly regexp searches and loops.

Then I had a crazy idea that I figured had low odds of working, but was worth a quick test: do a .split that would match on either delimiter and non-delimiter ranges. To my joy and surprise, this basically worked:

"abc   def ghi".split(/([^\s]+|[\s]+)/)
  => ["", "abc", "", "   ", "", "def", "", " ", "", "ghi", ""]

With one more small tweak, I have exactly what I need:

"abc   def ghi".split(/([^\s]+|[\s]+)/).filter(s=>s.length)
 => ["abc", "   ", "def", " ", "ghi"]

The problem, of course, is that I can imagine JavaScript implementations that would behave differently on this somewhat pathological regexp.

Can I depend on this behavior always working? Why? Where is the spec documented?

For "extra credit", can you give an intuitive argument for why this behavior is the most reasonable?

  • 3
    The behavior seems reasonable--I wouldn't call this regex or split's behavior pathological, but I also can't explain why it's "the most reasonable" any more than any other JS feature. You can use "abc def ghi".match(/\s+|\S+/g) if you don't care for the empty strings. While this seems too broad, I don't think the dupe is accurate, since OP realizes that JS engines exist and they change over time. Commented May 22, 2019 at 18:56
  • 1
    Simply split it by word boundary: str.split(/\b/). Commented May 22, 2019 at 18:59
  • 2
    I don't understand what is the relation between the duplicate and this question. I'm sure there is a duplicate of this, but is not the chosen one. Commented May 22, 2019 at 19:01
  • 1
    Yes. That's why (\s+). The + will match multiple spaces. The () wrapper will make sure the delimiters are also added in the output. Commented May 22, 2019 at 19:17
  • 1
    A better, well defined method that gets exactly the result you want (without needing to filter the result) is to do: string.match(/([^\s]+|[\s]+)/g) Commented May 22, 2019 at 19:40

1 Answer

If the argument to split is a regex with capturing groups, the text matched by each group is included in the returned array as its own element, placed between the pieces it separates. If the regex contains several capturing groups, each group contributes one element per match.

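const input = 'a 8_b 0_c';
console.log(input.split(/ \d_/));         // ["a", "b", "c"] (no groups: delimiters are dropped)
console.log(input.split(/ (\d)_/));       // ["a", "8", "b", "0", "c"] (includes numbers)
console.log(input.split(/( )(\d)_/));     // ["a", " ", "8", "b", " ", "0", "c"] (includes spaces and numbers)
console.log(input.split(/( )(\d)(_)/));   // ["a", " ", "8", "_", "b", " ", "0", "_", "c"] (spaces, numbers, and underscores)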

So for your use case, you can simplify your solution to

let x = "abc   def ghi".split(/(\s+)/);
console.log(x); // ["abc", "   ", "def", " ", "ghi"]
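
For comparison, here is a rough sketch that wraps the split approach and the match approach suggested in the comments as two small functions. The names tokenizeWithSplit and tokenizeWithMatch are made up for illustration and are not part of any library. The sketch also shows one edge case worth knowing about: with leading (or trailing) whitespace, split leaves an empty string at that end, so a filter is still useful there.

```javascript
// Hypothetical helper names, chosen only for this example.

// split() with a capturing group keeps the whitespace runs in the output.
// Leading or trailing whitespace leaves an empty string at that end,
// so those are filtered out.
function tokenizeWithSplit(str) {
  return str.split(/(\s+)/).filter(s => s.length > 0);
}

// match() with the g flag returns every run directly. It returns null
// when nothing matches (for example, on the empty string), hence the fallback.
function tokenizeWithMatch(str) {
  return str.match(/\s+|\S+/g) || [];
}

console.log(tokenizeWithSplit("abc   def ghi")); // ["abc", "   ", "def", " ", "ghi"]
console.log(tokenizeWithMatch("abc   def ghi")); // ["abc", "   ", "def", " ", "ghi"]
console.log("  abc".split(/(\s+)/));             // ["", "  ", "abc"]
console.log(tokenizeWithSplit("  abc"));         // ["  ", "abc"]
console.log(tokenizeWithMatch(""));              // []
```

Both functions should agree on any input, since one collects the pieces between and including the whitespace runs and the other collects every maximal run of whitespace or non-whitespace.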

From the MDN reference for String.prototype.split():

If separator is a regular expression that contains capturing parentheses, then each time separator is matched, the results (including any undefined results) of the capturing parentheses are spliced into the output array.
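
The "including any undefined results" part of that quote matters when the regex has several capturing groups in an alternation. A small sketch, using a made-up input string, of what that looks like in practice:

```javascript
// Each alternative has its own capturing group, so only one group
// participates in any given match; the other is reported as undefined.
const parts = "a1b-c".split(/(\d)|(-)/);
console.log(parts); // ["a", "1", undefined, "b", undefined, "-", "c"]

// If the undefined entries (and any empty strings) are unwanted, filter them out.
const cleaned = parts.filter(s => s !== undefined && s !== "");
console.log(cleaned); // ["a", "1", "b", "-", "c"]
```

With a single group such as (\s+), the group always participates in the match, so no undefined entries appear.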
