
I need to match the tokens =, ==, <=, >=, <, > and ~= cleanly, each with its own regex. Currently I have:

(=)    for = 
[=]{2} for == 
(<=)   for <=
(<)    for <
(>=)   for >=
(>)    for >
\~=    for ~=

But I am afraid these will interfere with each other (for example, = must not match the equals sign in <=).

Any recommendations? I am new to regex, so if you have an answer, please explain a bit :-)

  • Alternation listing most specific to least specific: (abc|ab|a) matches abc or ab or a ... Commented Feb 6, 2017 at 19:02
  • They will not interfere. If you specify a capturing group such as (<=), it must match everything inside it that isn't marked as optional, and therefore it will not match = or >= or ~= etc. Commented Feb 6, 2017 at 19:04
  • But (=) and (<=) would both match "a <= b", which is his concern, I think. Commented Feb 6, 2017 at 19:06
  • But I mean that = will match <=, for example. Or will it automatically pick <= since it's the longest match? Commented Feb 6, 2017 at 19:07
  • The rule for any regex engine is that the leftmost match wins. In an alternation a|b|c..., the first branch that succeeds wins in backtracking regex engines, while the longest match in the alternation wins in POSIX regex engines (sed, grep...). Commented Feb 6, 2017 at 19:08

1 Answer

It kinda depends on your environment and regex engine! If it is a DFA or POSIX NFA engine, then you are always going to match the longest, left-most possible pattern. You can determine if your engine works this way by trying to match

nfa|nfa not 

against the string "nfa not". If the entire string matches, then you know you're working with a longest, left-most engine, i.e. a DFA or POSIX NFA. If only "nfa" matches, you have a traditional NFA.
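As a quick illustration of that test, here is a minimal sketch in Python. It assumes Python's standard `re` module as the engine under test (the question does not say which engine is in use), and the variable name `engine_test_match` is only for illustration.

```python
import re

# Python's re module is a backtracking (traditional NFA) engine,
# so the first alternative that succeeds wins, not the longest one.
m = re.match(r"nfa|nfa not", "nfa not")
engine_test_match = m.group(0)

print(engine_test_match)  # prints "nfa", so this is not a longest-match engine
```

Running the same pattern through a POSIX tool should show the other behaviour, where the whole string "nfa not" is matched.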

However, the most common engine type is the traditional NFA, which gives you a lot of expressive power and control over your regexes but, as usual, that power comes with responsibility. In a traditional NFA, the longest, left-most match is not guaranteed. I will explain Alex K's abc|ab|a solution. The '|' (called OR or alternation) is a way of saying "match abc OR ab". You may wonder, "What if the text is 'abc'? Either one works in that case!" That is true, and in a traditional NFA the options are tried from left to right. So searching for ab|abc in the text "abc" will match "ab", while searching for abc|ab will match the whole "abc". You can take advantage of this by searching for <=|= in your text to ensure you always get '<=' rather than just the '='. The same ordering handles tokens that share a prefix: ==|= and >=|>.
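The effect of ordering is easiest to see with two operators that share a prefix. The following sketch again assumes Python's `re` module as a stand-in for a traditional NFA engine; the variable names are illustrative only.

```python
import re

text = "a <= b"

# Most specific alternative first: the two-character operator is found whole.
specific_first = re.findall(r"<=|<|=", text)

# Least specific first: "<" succeeds at the same position and wins,
# and the "=" that follows is then picked up as a separate match.
general_first = re.findall(r"<|=|<=", text)

print(specific_first)  # ['<=']
print(general_first)   # ['<', '=']
```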

It turns out that Alex K's solution will work regardless of engine because '<=' is also the longest, left-most match. I thought I'd give a deeper explanation to provide some understanding and maybe arouse your interest. Check out 'Mastering Regular Expressions' by J. Friedl if you want to learn more!
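To keep the tokens separate while still controlling priority, one option is a single ordered alternation of named groups. This is only a sketch under assumptions: it uses Python's `re` module, and the token names (`EQ`, `LTEQ`, `NEQ`, `ASSIGN`, and so on) and the helper `operator_tokens` are hypothetical, not taken from the question.

```python
import re

# Hypothetical token names. Longer operators are listed first so that
# "==" is not split into two "=" tokens and "<=" is not split into "<" and "=".
TOKEN_SPEC = [
    ("EQ",     r"=="),
    ("LTEQ",   r"<="),
    ("GTEQ",   r">="),
    ("NEQ",    r"~="),
    ("GT",     r">"),
    ("LT",     r"<"),
    ("ASSIGN", r"="),
]

# Build one pattern such as (?P<EQ>==)|(?P<LTEQ><=)|...
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)

def operator_tokens(text):
    """Return (token_name, lexeme) pairs for every operator found in text."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]

print(operator_tokens("a <= b == c ~= d = e < f"))
```

Each match reports which named group fired through `m.lastgroup`, so every operator still gets its own token type even though the patterns live in one regex.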


1 Comment

Thanks for the explanation! Before, I got "rule can not be matched", but when I read this I realized I had to put them in a different order. Since I want them separated (GT, GTEQ, LT, etc.), I just put them in this order: [=]{2} (<=) (>=) \~= (>) (<) (=)
