Regex separate binary operators

Question

I need to divide tokens: = == <= >= < > and ~= neatly in separate regexes. Currently I have:

(=)    for = 
[=]{2} for == 
(<=)   for <=
(<)    for <
(>=)   for >=
(>)    for >
\~=    for ~=

But I am afraid these will interfere with each other (for example, = could also match the equals sign inside <=).

Any recommendations? I am new to regex so if you have an answer please explain a bit :-)

Use an alternation that lists the most specific option first: (abc|ab|a) matches abc or ab or a ...
– Alex K., Commented Feb 6, 2017 at 19:02
They will not interfere. If you specify a capturing group such as (<=), it must match everything inside it that isn't marked as optional, and therefore it will not match = or >= or ~= etc...
– m_callens, Commented Feb 6, 2017 at 19:04
But (=) and (<=) would both match "a <= b", which is the concern here, I think.
– Alex K., Commented Feb 6, 2017 at 19:06
But I mean that = will match <=, for example. Or will it automatically pick <= since it's the longest match?
– Louise, Commented Feb 6, 2017 at 19:07
The rule for any regex engine is that the leftmost match wins. In an alternation a|b|c..., backtracking regex engines take the first branch that succeeds, while POSIX regex engines (sed, grep...) take the longest match in the alternation.
– Casimir et Hippolyte, Commented Feb 6, 2017 at 19:08

jakeehoffmann · Accepted Answer · 2017-02-06 19:29:25Z

It kinda depends on your environment and regex engine! If it is a DFA or POSIX NFA engine, then you are always going to match the longest, left-most possible pattern. You can determine if your engine works this way by trying to match

nfa|nfa not

against the string "nfa not". If the entire string matches, then you know you're working with a longest, left-most engine, i.e. DFA or POSIX NFA.
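As a rough illustration of that probe, here is a minimal sketch in Python, whose built-in re module is a backtracking (traditional NFA) engine. The names PROBE_PATTERN, PROBE_TEXT and probe_engine are made up for this example; other engines would need their own equivalent.

```python
import re

# Hypothetical names, used only for this illustration.
PROBE_PATTERN = r"nfa|nfa not"
PROBE_TEXT = "nfa not"


def probe_engine() -> str:
    """Return the text the engine matched for the probe pattern."""
    match = re.search(PROBE_PATTERN, PROBE_TEXT)
    return match.group(0)


matched = probe_engine()
# A longest, left-most engine would match the whole probe text.
is_longest_leftmost = matched == PROBE_TEXT
print(repr(matched), is_longest_leftmost)  # 'nfa' False
```

Python stops at the first alternative that succeeds, so only "nfa" is matched and the probe reports that this is not a longest, left-most engine.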

However, the most common engine type is the Traditional NFA, which grants you a lot of expressive power and control over your regexes but, as usual, that power comes with responsibility. In a traditional NFA, longest left-most is not guaranteed.

I will explain Alex K's abc|ab|a solution. The '|' (called OR or alternation) is a way of saying match abc OR ab OR a. You may wonder, "what if the text is 'abc'? Any of them works in that case!". That is true! In a traditional NFA, the options are tried from left to right and the first one that succeeds wins. So searching for ab|abc in the text "abc" will match "ab", while searching for abc|ab will match the whole "abc". You can take advantage of this by searching for <=|= in your text to ensure you always get '<=' rather than just the '='.
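To see the left-to-right behaviour concretely, here is a small sketch using Python's re module (a backtracking engine). The variable names are only illustrative. The ordering matters most when one operator is a prefix of another, such as < and <=, so that pair is used for the second check.

```python
import re

# Alternatives are tried left to right; the first one that succeeds wins.
short_first = re.search(r"ab|abc", "abc").group(0)
long_first = re.search(r"abc|ab", "abc").group(0)
print(short_first, long_first)  # ab abc

# The same effect with operators: '<' is a prefix of '<='.
lt_first = re.search(r"<|<=", "a <= b").group(0)
lte_first = re.search(r"<=|<", "a <= b").group(0)
print(lt_first, lte_first)  # < <=
```

Putting the longer operator first is what keeps '<=' from being split into '<' followed by '='.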

It turns out that Alex K's solution will work regardless of engine because '<=' is also the longest, left-most match. I thought I'd give a deeper explanation to provide some understanding and maybe arouse your interest. Check out 'Mastering Regular Expressions' by J. Friedl if you want to learn more!
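Putting this together, one possible way to keep the operators as separate rules while avoiding interference is to list the two-character operators before the one-character ones. The sketch below uses Python's re module; the token names (EQEQ, LTEQ, NEQ and so on) and the operators helper are assumptions made for the example, not something from the question.

```python
import re

# Hypothetical token names; two-character operators are listed first so
# they are tried before their one-character prefixes.
TOKEN_SPEC = [
    ("EQEQ", r"=="),
    ("LTEQ", r"<="),
    ("GTEQ", r">="),
    ("NEQ", r"~="),
    ("LT", r"<"),
    ("GT", r">"),
    ("EQ", r"="),
]

# One named group per operator, joined into a single ordered alternation.
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def operators(text):
    """Return (token_name, lexeme) pairs for every operator found in text."""
    return [(m.lastgroup, m.group(0)) for m in TOKEN_RE.finditer(text)]


print(operators("a <= b == c ~= d = e < f"))
# [('LTEQ', '<='), ('EQEQ', '=='), ('NEQ', '~='), ('EQ', '='), ('LT', '<')]
```

Each operator still has its own pattern and its own name, and the order of the list is the only thing that decides which rule wins when two of them could match at the same position.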

1 Comment

Louise Over a year ago

Thanks for the explanation! Before, I got "rule can not be matched", but when I read this I realized I had to put them in a different order. Since I want them separated (GT, GTEQ, LT etc.), I just put them in this order: [=]{2} (<=) (>=) \~= (>) (<) (=)
