Tag Archives: nfa

Converting a NFA to a DFA by subset construction, Regular expressions part 2

Welcome again to the understanding regular expressions series, the previous part is located here if you missed it. Much like the dragon book you probably really can’t read these things out of order. Last time we ended with a really nice non deterministic finite automata, but in order to make it work in a deterministic fashion we will need to transform it. Again, a deterministic machine is the mirror image of a nondeterministic machine, it will achieve exactly the same thing but in a different fashion.

For reference, a nondeterministic finite automata:

  • One entry state
  • May contain epsilon transitions
  • One input may lead to several new states
  • One accepting state

It’s cousin the deterministic finite automata (DFA) has these properties:

  • One entry state
  • May not contain epsilon transition
  • One input may only lead to one state
  • May have several accepting states

The key thing about the DFA is that we can easily convert it to a table and run it through our very simple engine, giving us very nice run-times of O(n). So how to convert this machine? The algorithm used here is called closure and it is very cool indeed. It is also used in a different variation for parser generation, which is why it’s almost a must to understand regular expressions before starting to truly understand parser generators. Here it is

Given a list of states the closure set is all the sets that can be reached without any input, including the input states themselves

The function will have an input of one or more states, and return a set of states which will at least contain the input states. In pseudo C# code it looks like this. I myself much prefer pseudocode that is much closer to real code – it leaves much less to the imagination of the reader.

Set<State> Closure(List<State> inputStates)
    var output = new Set<State>();

    // Keeps states we are going to add later
    while (true) 
        var statesToAdd = new Set<State>();
        foreach (var state in output)
            foreach (var edge in output.EdgesGoingOut)
                if (edge.IsEpsilonEdge)
                    if (!output.Contains(statesToAdd)) 
        if (!statesToAdd.Any())
            break; // Exit loop if there are no states to add
        output.UnionWith(statesToAdd); // Add all states to output
    return output;

It’s helpful to visualize this using graphs. If we return to the regular expression (a|b)+bcd , the corresponding NFA looks like this

If we are going to compute closure for state 0 we will traverse the edges to 1 and 3, but not to 5, because that edge goes the wrong way. Then there are nowhere else to go, since the remaining transitions require inputs “a” or “b” respectively. So the closure of state 0 is {0, 1, 3} . Note that the input state is always included in the closure list. I like to think of this as a flood fill, because it looks like that if you have many epsilon transitions. This new set is actually the start of our deterministic machine! Yay. Lets begin to draw the machine to watch it grow.

Our first state

We now need to introduce a new operation called goto . I know, goto is a bad word these days. The algorithm apologizes profoundly for the association. The goto algorithm is real easy.

Given a set of states and an input symbol find all the states reachable by traversing the edges that require the given input symbol. Then apply the closure algorithm on the output

Think of this algorithm as move one step then flood fill again. In C# pseudocode it looks like this

List<State> Goto(List<State> inputState, char inputCharacter)
     var output = new Set<State>();
    foreach (var state in inputState)
        foreach (var edge in state.EdgesGoingOut)
            if (edge.Label == inputCharacter)
    return Closure(output);

So, given our first closure set {0, 1, 3} lets apply Goto on it for every input character. Nothing will happen except for characters “a” and “b”. Lets explore character “a”. The only transition we get to cross is the one going from state 3 to 4. So 4 is the only result of the Goto. We now apply closure to it, which gives us the list {4, 5, 6, 0, 1, 3} when we follow every arrow which is an epsilon. The input for “b” is very similar and yields state set {2, 5, 6, 0, 1, 3}. The next step is adding our new states to the new diagram with the input we gave to the Goto as the edge label. Now the graph looks like this:

You then repeat this again. Applying Goto on the state set {4, 5, 6, 0, 1, 3} and input “a” gives us exacly the same result as previously. This means that this state has an edge going to itself. Applying Goto on the same states with input “b” gives a new state set that we haven’t seen before {2, 5, 6, 7, 0, 1, 3}. Lets add this to our graph:

If we apply the same algorithm to the state list {2, 5, 6, 0, 1, 3} we find that it will only yield sets that we have already added to the graph, but it will give us new transitions. The graph now looks like this:

There is one unexplored set to apply goto on {2, 5, 6, 7, 0, 1, 3}. Doing this with “a” gets a transition back to a previously known state. “b” gives a transition back to the same state list. “c” gives a transition to a brand new set containing only state 8. Our graph now looks like this:

Still one new set to explore! Only one input gives a new set “d” gives a set containing only state 9. State 9 is an accepting state, so we add this to the graph.

Applying goto on this new set gives us nothing at all. We are done. Almost by magic we have transformed our non-deteministic machine into a deterministic machine. The reader is invited to try the same input as on the nondeterministic variant. It will respond the same. There is really only one bit left to do, we need to make this into a table that will run with our regular expression engine.

Next part can be found here

Tagged , , ,

Regular expressions, how do they really work? Automata theory for programmers part 1

I know, automata theory doesn’t sound like the most exiting thing on earth, but there’s more than meets the eye here. I’m going to assume you have at least a working knowledge of regular expressions. You don’t need to worry about knowing the intricate little details to appreciate this article. So, let’s get going.

If you think regular expressions are some sort of glorified wildcard, you’re wrong. They are more akin to mathematical expressions than anything else, which is also why they’re called expressions. For the sake of completeness they are called regular because they can implement any regular (type 3) grammar on the Chomsky language hierarchy . To understand how they actually work, you have to understand what a state machine is.

State machines

In the most common sense, a state machine is something that holds a given state, and have transitions to other states based on some conditional, usually input. It’s a very useful concept. It’s commonly implemented as a loop with a switch statement at the base, switching on the current state. For each state, the machine performs given actions which might result in the machine changing state. This in turn will reflect in it’s next iteration. Regular expressions are typically parsed by state machines. Each iteration takes an input, and evaluates it based on the current state of the machine – advancing it to either a new state, an error state or an accept state (i.e. the string has matched the expression). State machines for regular expressions are typically driven by tables.

This is pretty much all you need to know to implement the actual engine. It looks a bit like this:

bool Matches(Reader r)
   int state = 0;
   while (r.Peek() != -1)
      char input = (char)r.Read();
      if (table[state, input] == Accept)
          return true; // Table indicates that for this state, we accept the input given
      if (table[state, input] == Error)
          return false; // Table indicates that this is not a valid input
      // Advance to next state.
      state = table[state, input];

Really, that is it. The engine that drives regular expressions is pretty much this. As you can see, all the complexity are in the construction of the tables. Because the tables can be constructed beforehand the actual regular expression evaluation can be extremely fast. So, what we’re going to be mainly discussing are how to construct these tables. To do this we’ll have to define our state machines in a different way, and introduce a few cool new terms.

Nondeterministic Finite automata

It sounds fancy, but it’s really our friend the state machine all over again but this time with a twist. A nondeterministic finite automata is a state machine that may advance it’s state without an input and be in more than one state at the same time. Like a quantum state machine. A transition between two states in this fashion is called an Epsilon transition. I like to think that E stands for empty. The thing about these state machines is that it’s a lot easier to construct them than it is to construct their deterministic cousins. It’s especially easy to combine them, as we shall see.

Deterministic finite automata

State machines which require some sort of input on every state, and where the same input in one given state can never lead to two different places. This has two effects, it’s always clear what to do given an input and you’re always in only one state at a time. Two highly desirable characteristics indeed.

Constructing nondeteministic finite automata

It’s easy enough to show this by some drawing. A circle is a state, identified by a number. A double circle is an accept state – a state where we consider the expression to be matched if we are at the end of the input. An arrow is a transition between states. The text on the arrow is the required input to transfer the machine from one state to another.

The simplest one of them all, this machine accepts a single “x” character.

This machine first takes an “x” then an “y” then accepts the input.

Both of these machines are by nature deterministic, they cannot be in more than one state at a time, since there are no epsilon transitions. For an example of that, take a look at this machine

Note that there is an epsilon transition in this one, represented by the funny-looking e character. This means that this transition is automatic. This machine will match the strings “y” or “xy”. When starting the machine will be state 1 from which it detects an epsilon transition and follows it automatically while still being in state 1. So, now we’ve got a machine that is in both state 1 and 2. Depending on the input different things will now happen. If we get an “y” the machine that is in state 1 will have nowhere to go, and since machines must always advance on input this machine will die – leaving only the machine in state 2 which can continue to state 3 and accept the input. The reverse thing happens if an “x” were inputted. In that case the machine in state 1 dies since it can’t go anywhere, leaving the other machine to advance to state 3.

It’s not that hard to figure out how to translate different elements in regular expressions to these small machines. Here’s a few more examples:

Accepts “x” zero or more times. Equivalent to the expression x*

Accepts “x” one or more times. Equivalent to the expression x+

Accepts either “x” or “y”. Equivalent to the expression x|y

The amazing thing about these machines is that you can string them together into a huge machine, they are infitely combineable. As long as you stick to the rule about a single starting state and a single accept state. You’ve already seen two examples of this, the machine that accepted “xy” was two machines each accepting a single character linked together. The other example is the last example which accepted either “x” or “y”, liked together by alternation. To construct these compound machines the easiest way is to convert the regular expression into postfix notation and apply a stack based algorithm. Both of these are really not that difficult algorithms, but this post is long enough as it is. It’s briefly explained in this document . It’s also quite well documented in the source code of my parser generator Piglet , look at the classes NFA and PostFixConverter. If there’s interest in further explanation of this, I’ll see if I can write something up.

Turns out there aren’t that many ways to link machines together for basic use, most have already been shown here. So, as a final example here is a machine that accepts the regexp (a|b)+bcd

Try following the edges and keep track of machines that are alive for the various forms of inputs. It will do exactly what you want given the regexp. And it’s all built on combinations of the same building blocks shown above. So, given these tools you could actually construct a machine that accepts any regular expression. However, these machines are not very efficient since they need to keep track of multiple states which given a large enough expression could be quite cumbersome. We still want to have that desireable deterministic effect, where we could just keep track of one state and know where to go for each given input.

That is probably going to be the next part of this algorithmic exploration!

Edit: Next part can be viewed here

Tagged , , ,