Piglet 1.3.0 is out, now with unicode support!

Piglet has been updated today, with version 1.3.0. Here’s a list of the most exiting changes:

Unicode support

You can now use the full character range available in your regular expressions and parser tokens. This means that the parser will correctly be able to lex things such as Arabic, Chinese and Japanese. When using regular expressions for these, all the normal rules apply but the characters will not be included in any of the shorthand notation. For instance, the traditional Japanese numeral Kanjis are not part of the \d construct.

Nothing in existing code needs to be altered to enable this support. The runtime of the lexer has been slightly altered and is very slightly slower, but it should not even be noticeable.

Choice of lexer runtime

The most costly thing by far in the parser and lexer construction algorithms is the lexer table compression. Though this has been alleviated somewhat by the unicode functionality which actually served to reduce the size of the lexing tables, it can still be quite expensive.

If a faster construction time but a slower lexer is desired, you now have other options. When constructing, set the LexerRuntime in the LexerSettings variable of your ParserConfigurator. Or if constructing just a lexer with no accompanying parser, set the LexerRuntime property.

The available values are:

  • Tabular. Tabular is the slowest to construct but the fastest to run. Lexers built this way will use an internal table to perform lookups. Use this method if you only construct your lexer once and reuse it continually or parsing very large texts. Time complexity is O(1) – regardless of input size or grammar size. Memory usage is constant, but might incur a larger memory use than other methods for small grammars.
  • Nfa. Nfa means that the lexer will run as a non-finite automata. This method of constructing is VERY fast but slower to run. Also, the lexing performance is not linear and will vary based on the configuration of your grammar. Initially uses less memory than a tabular approach, but might increase memory usage as the lexing proceeds.
  • Dfa. Runs the lexing algorithm as a deterministic finite automata. This method of construction is slower than NFA, but faster than Tabular. It runs in finite memory and has a complexity of O(1) but a slower run time and more memory usage than a tabular lexer. It provides a middle road between the two other options.

The tabular option is still the default option.

Hope you find it useful, and please report and bugs or problems either directly to me or file an issue on github.

Leave a Reply

%d bloggers like this: