Regular expressions, how do they really work? Automata theory for programmers part 1

I know, automata theory doesn’t sound like the most exiting thing on earth, but there’s more than meets the eye here. I’m going to assume you have at least a working knowledge of regular expressions. You don’t need to worry about knowing the intricate little details to appreciate this article. So, let’s get going.

If you think regular expressions are some sort of glorified wildcard, you’re wrong. They are more akin to mathematical expressions than anything else, which is also why they’re called expressions. For the sake of completeness they are called regular because they can implement any regular (type 3) grammar on the Chomsky language hierarchy. To understand how they actually work, you have to understand what a state machine is.

State machines

In the most common sense, a state machine is something that holds a given state, and have transitions to other states based on some conditional, usually input. It’s a very useful concept. It’s commonly implemented as a loop with a switch statement at the base, switching on the current state. For each state, the machine performs given actions which might result in the machine changing state. This in turn will reflect in it’s next iteration. Regular expressions are typically parsed by state machines. Each iteration takes an input, and evaluates it based on the current state of the machine – advancing it to either a new state, an error state or an accept state (i.e. the string has matched the expression). State machines for regular expressions are typically driven by tables.

This is pretty much all you need to know to implement the actual engine. It looks a bit like this:

bool Matches(Reader r)
{
   int state = 0;
   while (r.Peek() != -1)
   {
      char input = (char)r.Read();
      if (table[state, input] == Accept)
          return true; // Table indicates that for this state, we accept the input given
      if (table[state, input] == Error)
          return false; // Table indicates that this is not a valid input
      // Advance to next state.
      state = table[state, input];
   }
}

Really, that is it. The engine that drives regular expressions is pretty much this. As you can see, all the complexity are in the construction of the tables. Because the tables can be constructed beforehand the actual regular expression evaluation can be extremely fast. So, what we're going to be mainly discussing are how to construct these tables. To do this we'll have to define our state machines in a different way, and introduce a few cool new terms.

Nondeterministic Finite automata

It sounds fancy, but it's really our friend the state machine all over again but this time with a twist. A nondeterministic finite automata is a state machine that may advance it's state without an input and be in more than one state at the same time. Like a quantum state machine. A transition between two states in this fashion is called an Epsilon transition. I like to think that E stands for empty. The thing about these state machines is that it's a lot easier to construct them than it is to construct their deterministic cousins. It's especially easy to combine them, as we shall see.

Deterministic finite automata

State machines which require some sort of input on every state, and where the same input in one given state can never lead to two different places. This has two effects, it's always clear what to do given an input and you're always in only one state at a time. Two highly desirable characteristics indeed.

Constructing nondeteministic finite automata

It's easy enough to show this by some drawing. A circle is a state, identified by a number. A double circle is an accept state - a state where we consider the expression to be matched if we are at the end of the input. An arrow is a transition between states. The text on the arrow is the required input to transfer the machine from one state to another.


The simplest one of them all, this machine accepts a single "x" character.


This machine first takes an "x" then an "y" then accepts the input.

Both of these machines are by nature deterministic, they cannot be in more than one state at a time, since there are no epsilon transitions. For an example of that, take a look at this machine

Note that there is an epsilon transition in this one, represented by the funny-looking e character. This means that this transition is automatic. This machine will match the strings "y" or "xy". When starting the machine will be state 1 from which it detects an epsilon transition and follows it automatically while still being in state 1. So, now we've got a machine that is in both state 1 and 2. Depending on the input different things will now happen. If we get an "y" the machine that is in state 1 will have nowhere to go, and since machines must always advance on input this machine will die - leaving only the machine in state 2 which can continue to state 3 and accept the input. The reverse thing happens if an "x" were inputted. In that case the machine in state 1 dies since it can't go anywhere, leaving the other machine to advance to state 3.

It's not that hard to figure out how to translate different elements in regular expressions to these small machines. Here's a few more examples:

Accepts "x" zero or more times. Equivalent to the expression x*

Accepts "x" one or more times. Equivalent to the expression x+

Accepts either "x" or "y". Equivalent to the expression x|y

The amazing thing about these machines is that you can string them together into a huge machine, they are infitely combineable. As long as you stick to the rule about a single starting state and a single accept state. You've already seen two examples of this, the machine that accepted "xy" was two machines each accepting a single character linked together. The other example is the last example which accepted either "x" or "y", liked together by alternation. To construct these compound machines the easiest way is to convert the regular expression into postfix notation and apply a stack based algorithm. Both of these are really not that difficult algorithms, but this post is long enough as it is. It's briefly explained in this document. It's also quite well documented in the source code of my parser generator Piglet, look at the classes NFA and PostFixConverter. If there's interest in further explanation of this, I'll see if I can write something up.

Turns out there aren't that many ways to link machines together for basic use, most have already been shown here. So, as a final example here is a machine that accepts the regexp (a|b)+bcd

Try following the edges and keep track of machines that are alive for the various forms of inputs. It will do exactly what you want given the regexp. And it's all built on combinations of the same building blocks shown above. So, given these tools you could actually construct a machine that accepts any regular expression. However, these machines are not very efficient since they need to keep track of multiple states which given a large enough expression could be quite cumbersome. We still want to have that desireable deterministic effect, where we could just keep track of one state and know where to go for each given input.

That is probably going to be the next part of this algorithmic exploration!

Edit: Next part can be viewed here

About these ads
Tagged , , ,

7 thoughts on “Regular expressions, how do they really work? Automata theory for programmers part 1

  1. kausik says:

    Thanks for the post. It helped me a lot.

    It would be great to know how do you create NFA for negation e.g. for a regex like [^bk] which means any character other than b and k

    • Per Dervall says:

      Thank you. As for negation you simply create a Nfa that accepts every chat except the one you specify. There’s no concept of negation for a Nfa, so you make one that accepts the inverse instead.

      • kausik says:

        Well, creating an NFA with rest of the alphabets except the ones mentioned within the square bracket could be a solution but it could get real messy. For example considering only lower case English alphabets as allowable, an expression like [^a] would require an NFA for REGEX (b|c|d…y|z). Just wondering if there is any better solution for this!!

      • Per Dervall says:

        That’s typically how it’s done. What you want to do is to not make separate transitions on all the characters but instead to keep the transitions into ranges of characters instead if that makes sense. So, a transition is no longer on a single character but instead on a series. This will be needed later on as well if you intend to make a unicode compatible regex engine since there’s no way to have transitions on all possible varieties of characters in the unicode charset.

  2. kausik says:

    Thanks Par for your response however am not sure how keeping a range will help because finding a transition will no more be in constant time; especially because based on regex a character class can produce variable number of ranges causing variable time for finding such a transition. Nonetheless I’m working on creating an alternate solution which might be generic, uses minimum characters for negation char class and finds a transition in constant time.

    In the meantime, I’m wondering how a regex with occurrence support [i.e. "(abc){1,3}" to indicate "abc" can occur min once and max thrice] is represented in NFA? It would be great if you could throw some light on it.

  3. Per Dervall says:

    Yes, you are correct. You’re sacrificing the constant access time if you’re using ranges. I’m sure there’s more efficient ways of achieving the same access times. Of course, if you don’t intend to support the full unicode charset then you can get constant access times without any issues. If you come up with a better solution, feel free to share your findings.

    As for occurrence support, what I’ve done is to repeat the expression that can be repeated, linked without a en epsilon to the end X number of times for the mandatory number of repeats, and linked Y times for the optional number of repeats. Basically, it repeats parts of the NFA before the conversion to the DFA, if that makes any sense.

    • kausik says:

      About character class, I think I’ve a better idea and will share it once I’ve more clarity on the algorithm.
      On min & max repetition, your solution definitely works but I’m afraid it will contribute to the state explosion and again I’ll share my my idea if I can think of any.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: