I have the following expression:

where('publishedDate').gte(new Date('2015-08-21T20:37:45.176Z')).sort({ field: 'asc', test: -1 });

that I would like to parse into the following tokens using JavaScript's String.prototype.split(regex):

where
('publishedDate')
gte
(new Date('2015-08-21T20:37:45.176Z'))
sort
({ field: 'asc', test: -1 })

but I'm having a hard time coming up with a suitable expression that does it. Any help is greatly appreciated!

  • You won't be able to use JS's regexes to parse text containing nested braces, because they have no suitable mechanism for this. This would be possible with, for instance, PCRE's recursive regexes or .NET's balancing groups, but JS has no such feature. Write a tokenizer by hand; it's not complicated at all. Commented Aug 19, 2015 at 22:34
  • A regex impossibility? Hard to believe:-) I'll take a second or third opinion and if that also suggests it can't be done then I'll find a different approach. Thanks! Commented Aug 19, 2015 at 22:39
  • @webteckie: JavaScript's regular expressions are relatively weak compared with PCRE and such. (And this from a JavaScript fanboi.) Lucas is quite right about not having a means of handling arbitrarily nested things (like the parens in your example). Regular expressions can be part of the solution, but not the entire solution. Commented Aug 19, 2015 at 22:47
  • @webteckie don't get me wrong - because you can do it with non-regular PCRE "regexes" for sure (I just showed a dirty example here). Just don't expect to be able to do the same thing with JS's braindead regexes ;) Commented Aug 19, 2015 at 22:49
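
For reference, the hand-written tokenizer suggested in the comments above could look roughly like the sketch below. It is not from any of the commenters; `tokenizeChain` is a hypothetical name, and the sketch assumes the input is a simple chain of `name(args)` calls separated by dots, with string literals in single or double quotes. It counts parenthesis depth instead of relying on a regex, so nested calls inside an argument list stay in one token.

```javascript
// Hypothetical helper (not from the original thread): splits a call chain
// such as "a(x).b(y);" into alternating name / argument-list tokens by
// counting parenthesis depth.
function tokenizeChain(input) {
  var tokens = [];
  var i = 0;

  while (i < input.length) {
    var ch = input[i];

    if (/\w/.test(ch)) {
      // Read an identifier such as "where" or "gte".
      var start = i;
      while (i < input.length && /\w/.test(input[i])) i++;
      tokens.push(input.slice(start, i));
    } else if (ch === '(') {
      // Read a balanced argument list, skipping over quoted strings so
      // that parentheses inside string literals are not counted.
      var begin = i;
      var depth = 0;
      var quote = null;
      for (; i < input.length; i++) {
        var c = input[i];
        if (quote) {
          if (c === '\\') i++;                 // skip the escaped character
          else if (c === quote) quote = null;  // end of the string literal
        } else if (c === '"' || c === "'") {
          quote = c;
        } else if (c === '(') {
          depth++;
        } else if (c === ')' && --depth === 0) {
          i++;                                 // include the closing paren
          break;
        }
      }
      if (depth !== 0) {
        throw new Error('Unbalanced parentheses starting at index ' + begin);
      }
      tokens.push(input.slice(begin, i));
    } else {
      i++; // skip separators such as ".", ";" and whitespace
    }
  }
  return tokens;
}

// A nested call inside an argument list stays in a single token:
console.log(tokenizeChain("where(a(b).c).sort()"));
// [ 'where', '(a(b).c)', 'sort', '()' ]
```

On the expression from the question this should yield the six tokens listed there. The trade-off is a few more lines than a regex in exchange for not depending on what character happens to follow a closing parenthesis.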

1 Answer

You can use this one:

/(\w+)(\(.+?\)(?=\.|;|$))/g

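It works on your input. Note that it relies on a lazy match plus a lookahead rather than on real parenthesis balancing, so it can split in the wrong place if an argument itself contains a ")" immediately followed by "." or ";".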

Demo: https://regex101.com/r/kV9lS4/1

How to get the matching tokens only:

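// Group 1: the method name; group 2: its parenthesised argument list,
// which must be followed by ".", ";" or the end of the input.
var re = /(\w+)(\(.+?\)(?=\.|;|$))/g,
    str = "where('publishedDate').gte(new Date('2015-08-21T20:37:45.176Z'))" +
          ".sort({ field: 'asc', test: -1 });",
    result = [],
    arr;

// exec() with the g flag resumes after the previous match on each call
// and returns null once there are no more matches.
while ((arr = re.exec(str)) !== null) {
  result.push(arr[1]);
  result.push(arr[2]);
}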

5 Comments

Thanks. However, for some reason string.split in node is returning the following so the dots are being captured unlike in your demo: [ '', 'where', '(\'publishedDate\')', '.', 'gte', '(new Date(\'2015-08-21T20:37:45.176Z\'))', '.sort({ field: \'asc\', test: -1 })' ]
Weird. If I do the same thing in the Node console I get the same thing as you. But if I do it programmatically via string split I get ..., ".sort({ field: 'asc', test: -1 })" where the .sort is not right.
The input will be split by the matching tokens, so what you got is correct. If you want to extract those tokens only, see the code snippet I added.
What would I do to the regex to make the semicolon in the input expression optional? If I change it to /(\w+)((.+?)(?=\.|;?))/g then I lose the parentheses around the 'new Date' expression.
Change it to (?=\.|;|$), which means: dot or ; or $ (end of the input, since the m flag is not set). Regex updated in the answer to account for this case.
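
To illustrate the split behaviour discussed in these comments: when the pattern contains capturing groups, split() returns the captured groups together with the text between matches, which is where the extra '', '.' and ';' entries come from. A minimal sketch, assuming the input is the expression from the question; the filter step is an addition for illustration, not part of the original answer:

```javascript
// Same pattern as in the answer; the g flag is omitted because split()
// does not need it.
var re = /(\w+)(\(.+?\)(?=\.|;|$))/;
var str = "where('publishedDate').gte(new Date('2015-08-21T20:37:45.176Z'))" +
          ".sort({ field: 'asc', test: -1 });";

// The raw result interleaves the text between matches ('', '.', ';')
// with the two captured groups of every match.
var parts = str.split(re);

// Keep only the captured tokens by dropping empty and separator-only entries.
var tokens = parts.filter(function (part) {
  return part !== undefined && !/^[.;\s]*$/.test(part);
});

console.log(tokens);
```

This keeps the split-based approach from the question, but it inherits the same limitation as the regex itself: it depends on each closing parenthesis of a call being followed by ".", ";" or the end of the input.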
