How a RegEx engine works [closed]

Question

Learning regular expressions has me wondering how the underlying engine works. More specifically, I'd like to know more about how it evaluates, prioritizes, and parses the expression. The RegEx engine feels like a black box to me, and I would really enjoy deciphering it.

So I'd like to ask whether there are some great resources I could read that discuss RegEx engine theory.

*Note: I am not interested in building an engine, just learning the inner workings of it.

Regular expression engines are based on finite-state machines. A nice article about how fast regular expression matching works is http://swtch.com/~rsc/regexp/regexp1.html.
– Giuseppe Cardone, Commented Sep 1, 2010 at 21:56
Pick up a book on automata theory. Good articles can also be found here: swtch.com/~rsc/regexp
– Vadim Shender, Commented Sep 1, 2010 at 21:57
Mastering Regular Expressions is a great book. Though it's not focused on that subject, it does have several chapters dealing with how each regex engine behaves (more from a practical angle than by analyzing the details of the engine itself).
– NorthGuard, Commented Sep 1, 2010 at 22:56
I've actually been poking around that book but didn't know about those chapters. Thanks!
– Robb, Commented Sep 2, 2010 at 0:26

Markus Jarderot · Accepted Answer · 2010-09-03 14:28:21Z

49

There are two main classes of regex engines.

Those based on finite-state automata (FSA). These are generally the fastest. They work by building a state machine from the pattern and feeding it the characters of the input string one at a time. Some of the more advanced features, such as backreferences, are difficult, if not impossible, to implement in engines like this.

Examples of FSA-based engines:
- POSIX/GNU ERE/BRE: Used in most Unix utilities, such as grep, sed and awk.
- RE2: A relatively new project that tries to give more power to the automata-based method.
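To make the state-machine idea concrete, here is a minimal sketch in Python. It assumes a hand-built deterministic automaton for the pattern `a(b|c)*d`; the names `TRANSITIONS`, `START`, `ACCEPTING`, and `dfa_fullmatch` are made up for this illustration and do not come from any of the engines listed above. A real engine would construct the table from the pattern automatically.

```python
# Hand-built DFA for the pattern a(b|c)*d (illustrative only).
# Keys are (current_state, input_character); values are the next state.
TRANSITIONS = {
    (0, "a"): 1,  # the leading 'a'
    (1, "b"): 1,  # loop on 'b'
    (1, "c"): 1,  # loop on 'c'
    (1, "d"): 2,  # the closing 'd'
}
START = 0
ACCEPTING = {2}


def dfa_fullmatch(text):
    """Return True if the whole of `text` is accepted by the DFA."""
    state = START
    for ch in text:
        # Each character is looked at exactly once; there is no going back.
        state = TRANSITIONS.get((state, ch))
        if state is None:  # no transition defined: the input is rejected
            return False
    return state in ACCEPTING


if __name__ == "__main__":
    for sample in ["abcbd", "ad", "abx", ""]:
        print(repr(sample), dfa_fullmatch(sample))
```

Because every input character triggers exactly one table lookup, the running time grows linearly with the length of the input, which is where the speed of this class of engines comes from.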

Those based on back-tracking. These often compile the pattern into byte-code that resembles machine instructions. The engine then executes the code, jumping from instruction to instruction. When an instruction fails, the engine back-tracks to find another way to match the input.

Examples of engines based on back-tracking:
- Perl: The original. Most other engines of this type try to replicate the functionality of regexes in the Perl language.
- PCRE: The most successful and most widely used implementation. It has a rich set of features, some of which can't be considered "regular" any more.
- Python, Ruby, Java, .NET: Other implementations I don't intend to describe further.
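The byte-code-and-back-track approach can be sketched roughly as follows. This is only an illustration: the four-instruction set (`char`, `split`, `jmp`, `match`), the hand-assembled `PROGRAM` for the pattern `a*ab`, and the function `backtrack_match` are assumptions made up for this example, not the actual byte-code of Perl, PCRE, or any other engine named above.

```python
# Hand-assembled "byte-code" for the pattern a*ab (illustrative only).
PROGRAM = [
    ("split", 1, 3),  # 0: try the loop body first (greedy), else continue at 3
    ("char", "a"),    # 1: consume one 'a'
    ("jmp", 0),       # 2: go back and try the loop again
    ("char", "a"),    # 3: the literal 'a' after the star
    ("char", "b"),    # 4: the literal 'b'
    ("match",),       # 5: success
]


def backtrack_match(program, text, pc=0, pos=0):
    """Return True if `program` matches a prefix of `text` starting at `pos`."""
    while True:
        op = program[pc]
        if op[0] == "char":
            if pos >= len(text) or text[pos] != op[1]:
                return False  # instruction failed: the caller will back-track
            pc += 1
            pos += 1
        elif op[0] == "jmp":
            pc = op[1]
        elif op[0] == "split":
            # Try the preferred branch; if it fails, fall back to the other
            # branch from the same input position. This is the back-tracking.
            if backtrack_match(program, text, op[1], pos):
                return True
            pc = op[2]
        elif op[0] == "match":
            return True
        else:
            raise ValueError(f"unknown instruction: {op!r}")


if __name__ == "__main__":
    for sample in ["aaab", "ab", "aaa", "b"]:
        print(repr(sample), backtrack_match(PROGRAM, sample))
```

On the input `aaab`, the greedy loop first consumes all three `a` characters, the following `char a` then fails on `b`, and the matcher retreats one step to give an `a` back. That retrying is what makes these engines flexible, and it is also why some patterns can take a very long time on certain inputs.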

For more information:

- regular-expressions.info: Tutorial
- regular-expressions.info: Flavor comparison
- swtch.com: Implementing Regular Expressions. A good set of articles about efficient, automata-based regular expression matching.

If you want me to expand on something, post a comment.

edited Sep 3, 2010 at 14:28

answered Sep 2, 2010 at 11:17


2 Comments

Robb Over a year ago

It looks like I have my work cut out for me with the posted links, but I believe this is more what I was looking for. Beyond that, if you know of an actual book that could be purchased, that would be fantastic.

Markus Jarderot Over a year ago

I haven't read many books on the subject, but one I liked is "Introduction to the Theory of Computation" by Michael Sipser. It is not just about Regular Expressions, but goes all the way to Turing Machines and NP-completeness, etc.
