If you follow the generic instruction for converting a NFA to a DFA presented in my earlier post, the DFA you generate will work, but it will not be optimal. It will have too many states opposed to the bare minimum needed to do it’s job properly. This tends to happen more frequently if you give it an input of an expression that in some respect have redundancy. A prime example of this is the expression `a+b+|ab`

. The `|ab`

part is of course, totally useless. But the subset construction algorithm doesn’t know that.

To compare, here is the NFA generated by thompsons construction:

Here is the non-minimized DFA that the subset construction algorithm will generate:

And here is the minimum DFA required to do the job properly:

Quite a difference! So, how do we do this? There are several ways of accomplishing this, some more complex than others. I’m going to give you a working algorithm which admittedly isn’t the fastest. The faster versions are more complex, and would ruin the idea of explaining the algorithm.

## For programmers

I’m willing to bet you’re like me, a programmer and not a mathematician. You can describe this algorithm in math terms to no end, and I’m willing to bet that most people are going to miss out on what is really not the most complex algorithm in the world. If you google this you’re going to get the algorithm described as some function which partitions states so that this and that is true. Perhaps this is jealousy on my part since I feel left out of the cool crowd with the greek symbols and all, but I think there are better ways of explaining this stuff.

## The algorithm

It’s a bit backwards, but what we’re going to do is to assume that all states can be merged together and then rule out those that can’t. This is going to be represented by a table, where empty squares are those that we cannot tell apart. So, for now, this table looks like this:

+-----+ |s1| | +--+--+--+ |s2| | | +--+--+--+--+ |s3| | | | +--+--+--+--+--+ |s4| | | | | +--+--+--+--+--+ |s0|s1|s2|s3| +--+--+--+--+

This table is going to be filled later on. There’s no reason for a full table, we only want the distinct pairs of states.

What we are interested in is to sort the states into groups that may be merged into a single state without altering the functionality of the DFA. So, what we really need to do is to find out how to find these state groups normally referred to as nondistinguishable states. The first rule of this is:

A state that is an accepting state may never be in the same group as a state which is not an accepting state.

So, in our graphs we may never merge a double-circle state with a single circle state. Let’s apply this to our table. We end up with this:

+-----+ |s1| | +--+--+--+ |s2|X |X | +--+--+--+--+ |s3| | |X | +--+--+--+--+--+ |s4|X |X | |X | +--+--+--+--+--+ |s0|s1|s2|s3| +--+--+--+--+

We’re not out of the woods yet, still more work to do. There is another rule at work here. It’s easiest to express this in pseudocode. You will apply the algorithm * repeatedly *until the table stops changing. If you complete a full iteration without any entry getting filled in, you are done.

do { for (int i = 0; i < numStates; ++i) { for (int j = i + 1; j < numStates; ++j) { if (table[i,j].IsEmpty) { foreach (var inputsymbol in allValidInputs) { if (one of the states has a transition but the other one doesn't) { table[i,j] = X; } else if (both states have a transition) { int iTarget = state at other end of transition from state i; int jTarget = state at other end of transition from state j; if (!table[iTarget, jTarget].IsEmpty) { table[i,j] = X; } } else { // For clarity, if none of the states have transitions // on this symbol, don't change the table } } } } } } while (table.WasChanged)

What this does is to check if there is any significant difference by checking where the transitions go, and filling in the table as it goes along. Applying this changes the table to look like this:

+-----+ |s1|X | +--+--+--+ |s2|X |X | +--+--+--+--+ |s3|X | |X | +--+--+--+--+--+ |s4|X |X | |X | +--+--+--+--+--+ |s0|s1|s2|s3| +--+--+--+--+

In this example, iterating the algorithm won’t change the table which means we are done. The states s1 and s3 can be merged into one state and s2 and s4 can be merged into another. To illustrate, here is the two nondistinct state groups marked out:

A final caveat is that you might get long chains of states. Say that s0 and s1 will be merged along with s1 and s2, which really means you need to merge s0, s1 and s2 into another state. This is pretty trivial, and I will not elaborate further on this, an example of this can be found in the class DFA of my parser library Piglet, which is available on GitHub.

As a final note, it can be proven that this algorithm generates the smallest possible DFA that accepts the given grammar. All in a few lines of code. Try the algorithm out yourself at the regex visualizer – make a redundant expression and uncheck the Minimize checkbox.

Hi, thank you for a wonderful explanation! What is the original name for this minimization algorithm?

Thank you! I believe this is Hopcroft’s algorithm. It’s not the most efficient way of doing this, there are refinements to this method that can improve the execution time.

Cool, I know that this is not most efficient one, I’m trying to implement it in JavaScript… I represent my DFA as JSON object, so it looks like this:

{

0: [{

from: 97,

to: 97,

shift: 1

}],

1: [{

from: 97,

to: 97,

shift: 3

}, {

from: 98,

to: 98,

shift: 2

}],

2: [{

from: 98,

to: 98,

shift: 4

}],

3: [{

from: 97,

to: 97,

shift: 3

}, {

from: 98,

to: 98,

shift: 4

}],

4: [{

from: 98,

to: 98,

shift: 4

}]

}

What I do not understand is what is the difference between an algorithm described in your article (with all this complex table construction and so on) and the following:

1. For each state (numbered as 0, 1, 2, 3, 4 in my example) obtain unique hash that identifies such state (for example for state0 this will be: from=97,to=97,shift=1, for state1 this will be: from=97,to=97,shift=3&from=98,to=98,shift=2 and so on…)

2. Compare obtained hashes and if we find two identical ones, then merge it. In my example, state2 hash will be: from=98&shift=4&to=98, and state4 hash will be: from=98&shift=4&to=98, they are the same, so we can merge them into the new state5, after that DFA will look like this:

{

0: [{

from: 97,

to: 97,

shift: 1

}],

1: [{

from: 97,

to: 97,

shift: 3

}, {

from: 98,

to: 98,

shift: 5

}],

3: [{

from: 97,

to: 97,

shift: 3

}, {

from: 98,

to: 98,

shift: 5

}],

5: [{

from: 98,

to: 98,

shift: 5

}]

}

3. Continue this ’till we get no changes, so the next step (merging states 1 and 3) will transform DFA into the following form:

{

0: [{

from: 97,

to: 97,

shift: 6

}],

6: [{

from: 97,

to: 97,

shift: 6

}, {

from: 98,

to: 98,

shift: 5

}],

5: [{

from: 98,

to: 98,

shift: 5

}]

}

4. There are no more identical states, that means we’re done.

What do you think about it? Is this identical to the algorithm described by you?

Hard to tell, without seeing it in action. I didn’t try to invent a new algorithm, basically what I did was to read the dragon book VERY closely (this stuff is described in about a paragraph of incomprehensibility) and implement what I found there.

If you want my opinion, I think your algorithm might not produce the optimal DFA, though it looks like it won’t induce errors from a cursory glance. I suggest running some more tests. If you take a look at http://regexvisualizer.apphb.com/ and compare the minimized DFA to what your algorithm generate with some more complex samples you should be able to see if there are any differences. The sample shown here is a bit simple, since it finishes generating in just one pass.

What is guaranteed with the algorithm i present is that it will generate the mathematical optimal DFA with the least amount of states. The inefficiencies it contains is that the generation is not as optimal as it could be, though it is quite good. The result is guaranteed to be optimal though.

With these algorithms, they are often from the 60s or 70s, which means that the speed and memory impact was of huge concern to the people at that time. Today it might be possible to cheat a bit with the obscene amount of computing power available and do something simpler code-wise but more expensive in regards to computing power.

I tried to use your regexvisualizer to understand whether my solution produces something closer to the minimized DFA produced by your tool and it seems that my algorithm makes it even smaller. I’m disappointed… Here is the regexp that I tested: (a | b)+ (a | b)*, minimized DFA produced by your tool contains 8 states, in the mean time my algorithm produces DFA that contains only 2 states:

{

“START”: [{

"type": "range",

"from": 97,

"to": 98,

"shift": "1"

}],

“1″: [{

“type”: “range”,

“from”: 97,

“to”: 98,

“shift”: “1″

}, {

“shift”: ["ACCEPTED_TOKEN"]

}]

}

I understand that there is differences in the DFA representation which shows redundant circles on the diagram produced by your tool, but still I see that minimized DFA in your case is more complex than mine…

Remove the spaces, the spaces are significant in a regular expression. You’ll see that the it ends up with two states just as you do. (a|b)+(a|b)* is not the same as (a | b)+ (a | b)*

Btw it produces the same result on the last drawing from this post (results in merging s1 + s3 and s2 + s4)

You may like to check for unreachable and dead states also during minimization.

Normally yes, but the method that is used to generate the DFA that is going to be minimized never gives me any unreachable or dead states, so I decided to skip this part of the problem.

Since the DFA always comes from a NFA constructed by thompsons construction from a regular expression which is then converted to a DFA using subset construction you’re not going to see these sort of issues as far as I can tell. I might be wrong though, if there are any examples of a regular expression which generates unreachable states or trap states I would certainly be interested in knowing about it!

I agree, I have the same situation, this algorithm (goto / closure) never produces DFA that has unreachable states.

Answer to your comment about removing the spaces (not sure why Reply button doesn’t appear right there):

You’re absolutely right! The thing is that I’m using my own grammar format which ignores the spaces and forces literals to be enclosed in “…”, that was the thing. Now I see no difference, i. e. DFA minimized using my algorithm is absolutely the same as yours, and it’s a good news of course. The bad thing is that I cannot prove that those are equivalent :(

Grammars are NOT input to DFA recognizers! Grammars are input to DFA generators.

Certainly! Was I unclear anywhere in the post?