Tag Archives: regular expressions

New versions of Piglet and Regexvisualizer are out

It’s been a busy few weeks now, but I finally managed to get some OSS work done again. Here’s an overview of the new stuff

Piglet 1.1.0

Piglet has been updated with a few new features.

  • DFAs generated by piglet can now be minimized, and are minimized by default. This is in essence an optimization that will reduce Piglets memory footprint and make it run faster. As alwyas, this is a tradeoff and you can if you really want to turn off the minimization. The result will be a larger memory footprint but a faster parser generation time. The correctness of the thing remains the same.
  • Ability to get the graph string for the lexer has been added to the public API. Really a feature coming from the regex visualizer, it has been improved with a better API for access from outside the piglet internals. This has also been enhanced with new features, outlined below in the updates to the regex visualizer.
  • Changed test framework to NUnit. This decision was made since there had been some issues with MsTest, mainly that it screwed up building on a machine which did not have Visual Studio installed. This was not an acceptable situation so we decided to change it.

All in all, this release should be fully backwards compatible with earlier versions of Piglet. If you are using it and finding any problems, please do let me know. Piglet is, as always, available on NuGet and on GitHub

Regex visualizer

The regular expression visualizer has been given a new feature, making it really useful for debugging. It will now highlight given an input string which are the active states in both the NFA and the DFA. And if your expression isn’t matched, it will show you what part of your expression is matched and the last state where the machine was in a legal state. All in all this is turning out to be a pretty useful tool. I hope to do some work on the rendering soon, to avoid the pretty annoying flickering – going to look into canviz to turn the rendering into client-side instead of torturing google with constant requests.

Tagged ,

DFA state minimization

If you follow the generic instruction for converting a NFA to a DFA presented in my earlier post , the DFA you generate will work, but it will not be optimal. It will have too many states opposed to the bare minimum needed to do it’s job properly. This tends to happen more frequently if you give it an input of an expression that in some respect have redundancy. A prime example of this is the expression a+b+|ab . The |ab part is of course, totally useless. But the subset construction algorithm doesn’t know that.

To compare, here is the NFA generated by thompsons construction:

Here is the non-minimized DFA that the subset construction algorithm will generate:

And here is the minimum DFA required to do the job properly:

Quite a difference! So, how do we do this? There are several ways of accomplishing this, some more complex than others. I’m going to give you a working algorithm which admittedly isn’t the fastest. The faster versions are more complex, and would ruin the idea of explaining the algorithm.

For programmers

I’m willing to bet you’re like me, a programmer and not a mathematician. You can describe this algorithm in math terms to no end, and I’m willing to bet that most people are going to miss out on what is really not the most complex algorithm in the world. If you google this you’re going to get the algorithm described as some function which partitions states so that this and that is true. Perhaps this is jealousy on my part since I feel left out of the cool crowd with the greek symbols and all, but I think there are better ways of explaining this stuff.

The algorithm

It’s a bit backwards, but what we’re going to do is to assume that all states can be merged together and then rule out those that can’t. This is going to be represented by a table, where empty squares are those that we cannot tell apart. So, for now, this table looks like this:

+-----+
|s1|  |
+--+--+--+
|s2|  |  |
+--+--+--+--+
|s3|  |  |  |
+--+--+--+--+--+
|s4|  |  |  |  |
+--+--+--+--+--+
   |s0|s1|s2|s3|
   +--+--+--+--+

This table is going to be filled later on. There’s no reason for a full table, we only want the distinct pairs of states.

What we are interested in is to sort the states into groups that may be merged into a single state without altering the functionality of the DFA. So, what we really need to do is to find out how to find these state groups normally referred to as nondistinguishable states. The first rule of this is:

A state that is an accepting state may never be in the same group as a state which is not an accepting state.

So, in our graphs we may never merge a double-circle state with a single circle state. Let’s apply this to our table. We end up with this:

+-----+
|s1|  |
+--+--+--+
|s2|X |X |
+--+--+--+--+
|s3|  |  |X |
+--+--+--+--+--+
|s4|X |X |  |X |
+--+--+--+--+--+
   |s0|s1|s2|s3|
   +--+--+--+--+

We’re not out of the woods yet, still more work to do. There is another rule at work here. It’s easiest to express this in pseudocode. You will apply the algorithm repeatedly until the table stops changing. If you complete a full iteration without any entry getting filled in, you are done.

do {
for (int i = 0; i < numStates; ++i) 
{
  for (int j = i + 1; j < numStates; ++j)
  {
    if (table[i,j].IsEmpty) 
    {
      foreach (var inputsymbol in allValidInputs)
      {
        if (one of the states has a transition but the other one doesn't)
        {
          table[i,j] = X;
        }
        else if (both states have a transition) 
        {
          int iTarget = state at other end of transition from state i;
          int jTarget = state at other end of transition from state j;
          if (!table[iTarget, jTarget].IsEmpty)
          {
            table[i,j] = X;
          }
         }
        else
        {
           // For clarity, if none of the states have transitions
           // on this symbol, don't change the table
        }
      }
    }
  } 
}
} while (table.WasChanged)

What this does is to check if there is any significant difference by checking where the transitions go, and filling in the table as it goes along. Applying this changes the table to look like this:

+-----+
|s1|X |
+--+--+--+
|s2|X |X |
+--+--+--+--+
|s3|X |  |X |
+--+--+--+--+--+
|s4|X |X |  |X |
+--+--+--+--+--+
   |s0|s1|s2|s3|
   +--+--+--+--+

In this example, iterating the algorithm won’t change the table which means we are done. The states s1 and s3 can be merged into one state and s2 and s4 can be merged into another. To illustrate, here is the two nondistinct state groups marked out:

A final caveat is that you might get long chains of states. Say that s0 and s1 will be merged along with s1 and s2, which really means you need to merge s0, s1 and s2 into another state. This is pretty trivial, and I will not elaborate further on this, an example of this can be found in the class DFA of my parser library Piglet , which is available on GitHub.

As a final note, it can be proven that this algorithm generates the smallest possible DFA that accepts the given grammar. All in a few lines of code. Try the algorithm out yourself at the regex visualizer – make a redundant expression and uncheck the Minimize checkbox.

Tagged , , ,

A regular expression visualizer

Sometimes writing software gives you byproducts, and this is one. When implementing Piglet i ran into the issue that I couldn’t use the built in System.Text.Regex classes that I would have liked to use to avoid implementing another regular expression engine. No matter, so I built one of my own. Now when implementing one of those bad boys I figured that the best way of debugging them was to draw the same graphs that you find in the books – for the simple reason that getting an overview of a directed graph by investigating a bunch of references is tricky to say the least!

So, wise from experience I figured that you can’t spend too much time writing a good debugging facility, and it turned out quite nicely. So nice in fact that I decided to share this with the rest of the world.

It’s a simple tool that allows you to input your regular expression and see the deterministic and nondeterministic machines that implements your expression. Useful for understanding regular expressions and finding problems in your expressions.

Deployed on Appharbor, which was a truly pleasant experience indeed, it can be found here . Open sourced, if you feel it can be improved, please do so! The code is on github , as usual. The app itself is a teeny-tiny MVC application, with some javascript written by someone who is notoriously bad at it (me). If you find any issues, please do tell me using the communications channel of your choice.

Hope you find it useful!

Tagged , ,

Converting a NFA to a DFA by subset construction, Regular expressions part 2

Welcome again to the understanding regular expressions series, the previous part is located here if you missed it. Much like the dragon book you probably really can’t read these things out of order. Last time we ended with a really nice non deterministic finite automata, but in order to make it work in a deterministic fashion we will need to transform it. Again, a deterministic machine is the mirror image of a nondeterministic machine, it will achieve exactly the same thing but in a different fashion.

For reference, a nondeterministic finite automata:

  • One entry state
  • May contain epsilon transitions
  • One input may lead to several new states
  • One accepting state

It’s cousin the deterministic finite automata (DFA) has these properties:

  • One entry state
  • May not contain epsilon transition
  • One input may only lead to one state
  • May have several accepting states

The key thing about the DFA is that we can easily convert it to a table and run it through our very simple engine, giving us very nice run-times of O(n). So how to convert this machine? The algorithm used here is called closure and it is very cool indeed. It is also used in a different variation for parser generation, which is why it’s almost a must to understand regular expressions before starting to truly understand parser generators. Here it is

Given a list of states the closure set is all the sets that can be reached without any input, including the input states themselves

The function will have an input of one or more states, and return a set of states which will at least contain the input states. In pseudo C# code it looks like this. I myself much prefer pseudocode that is much closer to real code – it leaves much less to the imagination of the reader.

Set<State> Closure(List<State> inputStates)
{
    var output = new Set<State>();
    output.AddRange(inputStates);

    // Keeps states we are going to add later
    while (true) 
    {
        var statesToAdd = new Set<State>();
        foreach (var state in output)
        {
            foreach (var edge in output.EdgesGoingOut)
            {
                if (edge.IsEpsilonEdge)
                {
                    if (!output.Contains(statesToAdd)) 
                    {
                        statesToAdd.Add(edge.DestinationState);
                    }
                }
            }
        }
        if (!statesToAdd.Any())
            break; // Exit loop if there are no states to add
        output.UnionWith(statesToAdd); // Add all states to output
    }
    return output;
}

It’s helpful to visualize this using graphs. If we return to the regular expression (a|b)+bcd , the corresponding NFA looks like this

If we are going to compute closure for state 0 we will traverse the edges to 1 and 3, but not to 5, because that edge goes the wrong way. Then there are nowhere else to go, since the remaining transitions require inputs “a” or “b” respectively. So the closure of state 0 is {0, 1, 3} . Note that the input state is always included in the closure list. I like to think of this as a flood fill, because it looks like that if you have many epsilon transitions. This new set is actually the start of our deterministic machine! Yay. Lets begin to draw the machine to watch it grow.


Our first state

We now need to introduce a new operation called goto . I know, goto is a bad word these days. The algorithm apologizes profoundly for the association. The goto algorithm is real easy.

Given a set of states and an input symbol find all the states reachable by traversing the edges that require the given input symbol. Then apply the closure algorithm on the output

Think of this algorithm as move one step then flood fill again. In C# pseudocode it looks like this

List<State> Goto(List<State> inputState, char inputCharacter)
{
     var output = new Set<State>();
    foreach (var state in inputState)
    {
        foreach (var edge in state.EdgesGoingOut)
        {
            if (edge.Label == inputCharacter)
            {
                output.Add(edge.DestinationState);
            }
        }
    }
    return Closure(output);
}

So, given our first closure set {0, 1, 3} lets apply Goto on it for every input character. Nothing will happen except for characters “a” and “b”. Lets explore character “a”. The only transition we get to cross is the one going from state 3 to 4. So 4 is the only result of the Goto. We now apply closure to it, which gives us the list {4, 5, 6, 0, 1, 3} when we follow every arrow which is an epsilon. The input for “b” is very similar and yields state set {2, 5, 6, 0, 1, 3}. The next step is adding our new states to the new diagram with the input we gave to the Goto as the edge label. Now the graph looks like this:

You then repeat this again. Applying Goto on the state set {4, 5, 6, 0, 1, 3} and input “a” gives us exacly the same result as previously. This means that this state has an edge going to itself. Applying Goto on the same states with input “b” gives a new state set that we haven’t seen before {2, 5, 6, 7, 0, 1, 3}. Lets add this to our graph:

If we apply the same algorithm to the state list {2, 5, 6, 0, 1, 3} we find that it will only yield sets that we have already added to the graph, but it will give us new transitions. The graph now looks like this:

There is one unexplored set to apply goto on {2, 5, 6, 7, 0, 1, 3}. Doing this with “a” gets a transition back to a previously known state. “b” gives a transition back to the same state list. “c” gives a transition to a brand new set containing only state 8. Our graph now looks like this:

Still one new set to explore! Only one input gives a new set “d” gives a set containing only state 9. State 9 is an accepting state, so we add this to the graph.

Applying goto on this new set gives us nothing at all. We are done. Almost by magic we have transformed our non-deteministic machine into a deterministic machine. The reader is invited to try the same input as on the nondeterministic variant. It will respond the same. There is really only one bit left to do, we need to make this into a table that will run with our regular expression engine.

Next part can be found here

Tagged , , ,
Follow