perlancar's blog: Getopt modules: Epilogue

perlancar's blog

About this mini-article series. For each of the past 23 days, I have reviewed a module that parses command-line options (such module is usually under the Getopt::* namespace). First article is here.

This series was born out of my experimentations with option parsing and tab completion, and more broadly out of my interest in doing CLI with Perl. Aside from writing this series, I've also released numerous modules related to option parsing, some of them purely experimental in nature and some already used in production.

It has been interesting evaluating the various modules: the sometimes unconventional or seemingly odd approach that they take, or the specific features that they offer. Not all of them are worth using, but at least they provide perspectives and some lessons for us to learn.

Of course, not all modules got reviewed. There are simply far more than 24 of them (lcpan tells me that there are 180 packages in the Getopt:: namespace alone, with 94 distributions having a name starting with Getopt-). I tried to cover at least the must-know ones, the core ones, and the popular ones. Beyond that, frankly, the selection is pretty much random: I picked whatever was interesting to me or whatever I could make some point about, whether negative or positive.

I have skipped many modules that are just yet another Getopt::Long wrapper adding per-option usage or some other feature already found in Getopt::Long::Descriptive (GLD). It's not that they are worse than GLD; for one reason or another they just didn't get adopted widely, or at all. A couple of examples: Getopt::Helpful, Getopt::Fancy.

Modules which use Moose, except MooseX::Getopt, automatically got skipped, because their applicability is severely limited by their large dependency footprint and high startup overhead (200-500ms, or even more on slower computers). These include: Getopt::Flex, Getopt::Alt, Getopt::Chain.

Some others are simply too weird or high in "WTF number", but I won't name names here.

Except for App::Cmd and App::Spec, I haven't really touched CLI frameworks in general. There is no shortage of CLI frameworks on CPAN either; perhaps that's material for another series?

I've avoided reviewing my own modules, which include Getopt::Long::Complete (Getopt::Long wrapper which adds tab completion), Getopt::Long::Subcommand (Getopt::Long wrapper, with support for subcommands), Getopt::Long::More (my most recent Getopt::Long wrapper which adds tab completion and other features), Getopt::Long::Less & Getopt::Long::EvenLess (two leaner versions of Getopt::Long for the specific goal of reducing startup overhead), Getopt::Panjang (a break from Getopt::Long interface compatibility to explore new possibilities), and a CLI framework Perinci::CmdLine (which currently uses Getopt::Long but plans to switch backend in the long run; I've written a whole series of tutorial posts for this module).

In general, I'd say that you should probably stick with Getopt::Long first. As far as option parsing is concerned, it's already packed with features, and it has the advantage of being a core module. But as soon as you want automatic help/usage message generation, subcommands, or tab completion, you should begin looking elsewhere.
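For reference, here is the kind of simple case where plain Getopt::Long already suffices (a minimal, self-contained sketch):

#!/usr/bin/env perl
use strict;
use warnings;
use Getopt::Long;

# Parse two options into lexical variables; die with a usage line on error.
GetOptions(
    'verbose!' => \my $verbose,   # --verbose / --noverbose flag
    'output=s' => \my $output,    # --output FILE (string value)
) or die "Usage: $0 [--verbose] [--output FILE] [ARG]...\n";

print "verbose mode on\n"        if $verbose;
print "writing to $output\n"     if defined $output;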

Unfortunately, except for evaluating Perl ports of some option parsing libraries (like Smart::Options, Getopt::ArgParse, Getopt::Kingpin), I haven't had the chance to look deeply into how option parsing is done in other languages. Among those languages is Perl's own sister, Perl 6, which offers built-in command-line option parsing. Researching option parsing in other languages could potentially offer more lessons and perspectives.

I hope this series is of use to some people. Merry Christmas and happy holidays to everybody.


perlancar's blog: Getopt modules 23: Getopt::Complete

perlancar's blog

About this mini-article series. Each day for 24 days, I will be reviewing a module that parses command-line options (such module is usually under the Getopt::* namespace). First article is here.

Getopt::Complete (GC) is a module written by Scott Smith (SAKOHT) in 2009 and co-maintained by Nathan Nutter (NNUTTER). The last release was in 2011. So far one CPAN distribution depends on it, although that distribution is also written by Scott himself.

Shell tab completion is a topic I have been interested in since around 2012. I've released numerous modules related to completion, including two option parsing modules, Getopt::Long::Complete (GLC) and Getopt::Long::More (GLM), which sport completion as (one of) their selling points, so it's natural that I want to compare them to Getopt::Complete. Throughout the article I'll repeatedly make those comparisons; I hope it doesn't become too annoying.

Interface

GC, like GLC and GLM, is a Getopt::Long (GL) wrapper that adds tab completion. To let the module detect tab completion mode and return a completion answer as soon as possible, GC offers this interface:

use Getopt::Complete (
    'frog'        => ['ribbit','urp','ugh'],
    'fraggle'     => sub { return ['rock','roll'] },
    'quiet!'      => undef,
    'name'        => undef,
    'age=n'       => undef,
    'outfile=s@'  => 'files',
    'outdir'      => 'directories',
    'runthis'     => 'commands',
    'username'    => 'users',
    '<>'        => 'directories',
);

That is, it accepts the options specification as import arguments. This looks simple but presents its own inconveniences.

The second thing you'll notice is that the options specification differs from GL's. While GLC and GLM choose an interface that is backward-compatible with GL, GC focuses on tab completion. The value of each pair in the options specification is not a variable reference/coderef as you would expect in GL, but solely a completion specification: either undef (meaning the option does not take an argument), a string (naming a completion type/routine to use, e.g. files to complete from filenames, commands to complete from program names in PATH, and so on), an arrayref of literal choices, or a coderef returning the choices. The option values themselves are collected in %ARGS.
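To illustrate, here is a minimal sketch of reading the parsed values (the %ARGS hash is where GC collects them, as noted above; the option names are taken from the example):

use Getopt::Complete (
    'name'       => undef,
    'outfile=s@' => 'files',
);

# At this point GC has either answered a completion request and exited,
# or parsed @ARGV into %ARGS:
print "name: $ARGS{name}\n"             if defined $ARGS{name};
print "outfiles: @{ $ARGS{outfile} }\n" if $ARGS{outfile};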

Thus, compared to GLC and GLM, specifying completion routines is simpler in GC (but I also wrote Shell::Completer to provide the same level of convenience with more flexibility).

Activating Completion

To activate completion in bash, you need to declare this shell function first:

function _getopt_complete () {
    COMPREPLY=($( COMP_CWORD=$COMP_CWORD perl `which ${COMP_WORDS[0]}` ${COMP_WORDS[@]:0} ));
}

then for each CLI application you also need to do:

% complete -F _getopt_complete myapp

This is different than the way you activate completion for GLC- or GLM-based scripts:

% complete -C myapp myapp

External programs receive raw COMP_LINE and COMP_POINT environment variables from bash when doing tab completion, while shell functions are provided with the already-parsed command-line COMP_WORDS array variable and COMP_CWORD. GC wants to avoid parsing the command-line on its own, so the _getopt_complete function is used to give the Perl program parsed command-line arguments in @ARGV, and COMP_CWORD in another environment variable.
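For contrast, here is roughly what a program activated via complete -C (the GLC/GLM style) gets to work with; a minimal sketch using just the raw variables bash provides:

#!/usr/bin/env perl
use strict;
use warnings;

# When bash runs us for completion (complete -C), it sets COMP_LINE and
# COMP_POINT instead of passing pre-split words.
if (defined $ENV{COMP_LINE}) {
    my $line  = $ENV{COMP_LINE};                   # the raw command line
    my $point = $ENV{COMP_POINT} // length $line;  # cursor position within it
    my $text  = substr $line, 0, $point;           # what the user typed so far
    # A real implementation word-splits $text itself, computes candidates,
    # prints one per line, and exits:
    exit 0;
}

# ... normal program run continues here ...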

Using a command line that has already been parsed by bash into COMP_WORDS has its pros as well as cons, due to the way bash splits the command line for COMP_WORDS. So I cannot say which way is better; what I can say is that parsing COMP_LINE ourselves is more flexible.

Completion behavior and bugs

When you press tab after the command:

% myapp <tab>

GC only offers completion from the <> specification. In the above example, it offers only a list of directories as the answer. On the other hand, GLC and GLM also show the list of available option names. With GC, to list the available options, you have to type:

% myapp -<tab>

I also cannot say that GLC's and GLM's way is strictly better, but it certainly makes the CLI program more discoverable: by just pressing Tab, a user (especially a new user) can learn more about what's possible.

GC still has a few problems. First of all, it cannot complete "--opt=" when COMP_WORDBREAKS contains "="; I have put workarounds for this issue in GLC and GLM. Second, it cannot handle filenames/directory names containing spaces, quotes, and probably other special characters.

Third, GLC and GLM, through Complete::Util, offer some matching algorithms besides simple prefix matching, for extra convenience. GC does not offer this.


perlancar's blog: Getopt modules 22: Getopt::Kingpin

perlancar's blog

About this mini-article series. Each day for 24 days, I will be reviewing a module that parses command-line options (such module is usually under the Getopt::* namespace). First article is here.

Getopt::Kingpin is a port of Go's kingpin library, written by Masaaki Takasago (TAKASAGO) in 2016. It offers the usual "nowadays standard" features: short and long options with short option bundling, automatic help/usage message generation, required options, default values, and subcommands. Two extra features are: specifying that an option can be set via an environment variable of a certain name, and built-in completion (a feature of the original library that doesn't seem to be implemented yet in the Perl port). The Go library also allows templating of the help message; this is not yet supported by Getopt::Kingpin either.

Like Smart::Options (reviewed a couple of days ago), kingpin uses the so-called "fluent style" interface, a.k.a. chained methods, which I find annoying to type in Perl since the method call operator in Perl is -> instead of a single dot. Fortunately, the chained-method interface here is slightly less annoying than in Smart::Options.
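To give a flavor of that interface, here is a sketch pieced together from my reading of Getopt::Kingpin's documentation (method names and details may differ slightly; treat this as illustrative, not authoritative):

use Getopt::Kingpin;

my $kingpin = Getopt::Kingpin->new;
# Each flag/arg declaration is a chain: name, help text, then modifiers.
my $verbose = $kingpin->flag('verbose', 'Verbose mode.')->bool;
my $name    = $kingpin->arg('name', 'Name of user.')->required->string;
$kingpin->parse;

printf "verbose : %s\n", $verbose;
printf "name    : %s\n", $name;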

After looking at these three ports of option parsing libraries (the two mentioned above plus Getopt::ArgParse, reviewed yesterday), it does seem that subcommand support is becoming a standard thing. This makes me wonder whether Getopt::Long should also add such a feature, or whether we should promote some other option parsing library as the "best practice" when one wants subcommands. So far, I don't see any single best candidate for "Getopt::Long + subcommand support".


Joel Berger: Cross-post: On the Danger of Software Magicians

I wrote a language-agnostic article and posted it on Medium. Cross posting here for anyone following my Perl posts as well. I hope you enjoy it.

https://medium.com/@joel.a.berger/on-the-danger-of-software-magicians-fd8186b8945c#.lmpgrfzie

perlancar's blog: Getopt modules 21: Getopt::ArgParse

perlancar's blog
About this mini-article series. Each day for 24 days, I will be reviewing a module that parses command-line options (such module is usually under the Getopt::* namespace). First article is here.

In contrast to Perl, where the core modules Getopt::Std and Getopt::Long stand the test of time and remain the most popular ways people parse command-line options in their Perl CLI scripts, in Python we encounter several churns of recommended standard modules.

First there is getopt, a "C-style parser for command line options". To use getopt, you pass a string containing a list of short options a la Getopt::Std, e.g. "ho:v" (meaning -o takes an argument while -h and -v are flag switches), and also an array containing long options, e.g. ["help", "output="] (meaning --output takes an argument while --help does not). But instead of supplying references to variables to set, or coderefs (remember, specifying an anonymous function is inconvenient in Python) like in Getopt::Long, getopt programmers are asked to write a manual if-then-else in a loop (see the linked documentation for an example). This is quite similar in interface to the GetoptLong class in Ruby.

No doubt, this style of programming feels manual and tedious. Thus came optparse, which is more OO and supposedly more Pythonic. Instead of passing a whole list of options at once, you add one option (object) at a time using the add_option method, along with more information for each option: usage/help message, type, whether the option is required, number of arguments expected, default value, and perhaps some callback. optparse's capability is equivalent to Getopt::Long or Getopt::Long::Descriptive, except that optparse makes some design choices; for example, it is decidedly Unix-oriented, allowing only - or -- as the option prefix (while Getopt::Long allows you to configure this). The documentation is quite probably the nicest aspect of this module: it does not assume much knowledge (like familiarity with Unix or the CLI) from the reader, and explains at length what an option is and how one should design a CLI program with regard to accepting options. I realized that "required option" is indeed an oxymoron from reading it!

But, as with Getopt::Long, optparse does not have the concept of subcommands. Thus arrived argparse. It is basically like optparse in appearance, except that it has some extra features like the ability to specify positional arguments (in Getopt::Long, this is handled by the <> option specification) and support for nested subcommands with the use of subparsers. Interestingly, argparse supports reading arguments from a file just like Getopt::ArgvFile, and this is the only form of "config file" it supports.

As things stand right now, argparse is part of the standard library (a.k.a. core modules, in Perl parlance) while optparse is deprecated and might be removed. However, getopt remains.

There is a Perl port of argparse on CPAN called Getopt::ArgParse, created by M ytraM (MYTRAM) in 2013 and last updated in 2015. It is not feature-by-feature equivalent to its Python original, because of language differences and because argparse still accumulates features over time. You get some basic features like autohelp/autousage message, default values, setting an option as required, setting the number of expected arguments, as well as subparsers for subcommand support (although not yet nested in Getopt::ArgParse). The type/validation feature is weak or almost nonexistent; perhaps a custom validation routine should be allowed, or more can be explored here.

What's rather disappointing about this port is its use of Getopt::Long (I was expecting a full port, so the option parsing should be done by the module itself) and Moo, significantly adding dependencies.

There is mention of a configuration file in the documentation, but there is actually no explicit support for configuration files. Not even via an "option file prefix" a la argparse or Getopt::ArgvFile.

All in all, I'm not seeing something to make me prefer this module. If you do not use subcommands, I recommend sticking with Getopt::Long or Getopt::Long::Descriptive. If you do use subcommands, perhaps also consider a CLI framework like App::Cmd, or Getopt::Long::Subcommand.

Perl Foundation News: White Camel Awards for 2016

brian d foy has announced the White Camel Awards for 2016 and we'd like to congratulate the winners. I'd like to add a special congratulations to Karen Pauley for all her work with Perl, both officially as TPF president and unofficially as a community member. Thanks to all of the winners for your constant efforts toward keeping the Perl community a vibrant and fun place to be.

perlancar's blog: Getopt modules 20: Smart::Options

perlancar's blog

About this mini-article series. Each day for 24 days, I will be reviewing a module that parses command-line options (such module is usually under the Getopt::* namespace). First article is here.

In the next few days, I'll be reviewing Perl ports of some popular option parsing modules from other languages. Today: Smart::Options.

Summary

Smart::Options is written by Kan Fushihara (MIKIHOSHI) and is a Perl port*) of node's optimist package, which in turn uses minimist as the option parsing engine and adds some things, mainly the ability to generate a usage/help message. Ironically, optimist is now deprecated in favor of yargs, which is roughly the same as optimist but does its own parsing and adds features like bash completion.

So minimist is roughly the equivalent of Getopt::Long, optimist the equivalent of Getopt::Long::Descriptive, and yargs roughly the equivalent of Getopt::Long::Descriptive + tab completion (like Getopt::Long::Complete or Getopt::Long::More).

You can get an idea of the sheer number of packages in npm, the CPAN equivalent in the node ecosystem, by looking at these numbers: compared to Getopt::Long's 1127 dependents, minimist has 5768 dependents. And it isn't even the most popular option parsing package. The most popular one on npm is currently commander (from the legendary TJ Holowaychuk) which has 12252 dependents! yargs has 4073, optimist has 3546 (remember that optimist has been declared as deprecated), and nomnom (another deprecated option parsing package) still has 510.

Currently there is no CPAN distribution depending on Smart::Options.

commander itself resembles Getopt::Long::Descriptive a bit more in its interface. I didn't find any Perl port of commander on CPAN though.

But I digress; let's go back to Smart::Options and optimist. As I said earlier, optimist is roughly equivalent to Getopt::Long::Descriptive, except for one main difference: you are not required to provide any options specification. Without a specification, the library will simply accept any option and put it in a hash. But remember that without a specification, you cannot check for unknown options or get auto-abbreviation.

Bundling of short one-letter options is supported, but if you don't provide a specification the library cannot differentiate which short options require a value and which don't: it will simply assume that all short options are flags which don't take a value.

Another difference is the usage of OO and method chaining.

Usage

Here's how one would use Smart::Options in the simplest way (without any specification):

use 5.010;
use Smart::Options;
my $opts = argv(); # you can also say: $opts = Smart::Options->new->parse
say "foo = ", $opts->{foo};
say "b = ", $opts->{b};
say "args = [", join(", ", @{ $opts->{_} }), "]";
say "ARGV = [", join(", ", @ARGV), "]";

Let's try to run it:

% ./script.pl --foo 10 -b -- a b c
foo = 10
b = 1
args = [a, b, c]
ARGV = [--foo, 10, -b, --, a, b, c]

As you can see, the command-line arguments will be put in the _ key. And unlike Getopt::Long, it does not modify @ARGV.

One nitpick: the argv() function (or the parse() method) can accept a list, to parse options from an array other than @ARGV; but since it accepts a list instead of an arrayref, when you pass a zero-length array it assumes that you didn't pass an array at all and still defaults to @ARGV. This could be remedied, e.g., by accepting an arrayref instead.
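The ambiguity, and the suggested remedy, in miniature (illustrative helpers, not Smart::Options code):

use strict;
use warnings;

# List interface: an empty list is indistinguishable from "nothing passed".
sub parse_list {
    my @args = @_;
    @args = @ARGV unless @args;   # () silently falls back to @ARGV
    return scalar @args;
}

# Arrayref interface: [] and "no argument at all" are distinct.
sub parse_arrayref {
    my ($args) = @_;
    $args //= \@ARGV;             # only an *absent* argument falls back
    return scalar @$args;
}

@ARGV = qw(a b c);
print parse_list(), " ", parse_arrayref([]), "\n";   # prints "3 0"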

Without options specification, it's not possible to declare an option to be required, repeatable, or as a flag. So let's add some specification:

use 5.010;
use Smart::Options;
my $opts = Smart::Options->new
    ->demand('foo')                     ->describe(foo => 'The foo option')
    ->default(bar => 3)->alias(b => 'bar')->describe(bar => 'The bar option')
    ->default(baz => 5)                 ->describe(baz => "The baz option")
    ->parse;
say "foo = ", $opts->{foo};
say "b = ", $opts->{b};
say "args = [", join(", ", @{ $opts->{_} }), "]";

After this, you can generate help message:

$ ./script.pl --help
Usage: ./script.pl

Options:
-b, --bar    The bar option [default: 3]
--baz        The baz option [default: 5]
--foo        The foo option [required]
-h, --help   Show help

Missing required arguments: foo

BTW, some option parsing modules, including Smart::Options, still complain about the missing --foo even when we instruct them to show the help message (--help), as shown above. I think this behavior is a bug and should be fixed.

Other features

*) I said earlier that Smart::Options is a port of optimist. It is actually more accurately a blend between optimist and Kan's older module opts. So beyond optimist, Smart::Options adds some more (quite substantial) features, which do not exist even in yargs or commander.

Validation. Like in Getopt::Long, you can add some validation. You can declare an option to accept Bool, Int, Num, Str, ArrayRef (similar to Getopt::Long's @ destination type to make an option repeatable), HashRef (if, say, foo is declared as a hashref, you can specify --foo.key1 or --foo.key2 on the command line, and so on), or Config.
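A sketch of the HashRef type (assuming the ->type method as I understand it from the module's documentation):

use Smart::Options;

my $opts = Smart::Options->new
    ->type(foo => 'HashRef')
    ->parse(qw(--foo.key1 val1 --foo.key2 val2));

# $opts->{foo} should now be { key1 => 'val1', key2 => 'val2' }
print $opts->{foo}{key1}, "\n";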

Configuration file. The last type, Config, is actually supposed to let you specify a filename to make the module read an INI-like configuration file. But perhaps this setting is misplaced and conflated: it is not a type/validation setting, and it is not per-option but global.

Coercion. This can be used to convert an option value which is scalar/string to, say, Path::Tiny instance.

Subcommands. This lets you support (nested) subcommands by adding a nested Smart::Options object inside another, like in Getopt::Long::Subcommand. For example:

my $opts = Smart::Options->new
    ->subcmd(subcmd1 => Smart::Options->new->...)
    ->subcmd(subcmd2 => Smart::Options->new->...)
    ->parse;

DSL. If you don't like the chained-method syntax, there's Smart::Options::Declare, which offers an alternative interface to declare options one by one, much like Moose's has. It doesn't seem to support declaring subcommands yet, though.

Performance

The startup overhead of Smart::Options is roughly the same as Getopt::Long::Descriptive, while the memory usage is higher.

% bencher-module-startup-overhead Smart::Options Getopt::Long::Descriptive
+---------------------------+------------------------------+--------------------+----------------+-----------+------------------------+------------+---------+---------+
| participant               | proc_private_dirty_size (MB) | proc_rss_size (MB) | proc_size (MB) | time (ms) | mod_overhead_time (ms) | vs_slowest | errors  | samples |
+---------------------------+------------------------------+--------------------+----------------+-----------+------------------------+------------+---------+---------+
| Smart::Options            | 4.2                          | 8                  | 33             | 36        | 33.9                   | 1          | 0.00018 | 20      |
| Getopt::Long::Descriptive | 0.82                         | 4.5                | 23             | 35        | 32.9                   | 1          | 9.9e-05 | 20      |
| perl -e1 (baseline)       | 4.9                          | 9                  | 38             | 2.1       | 0                      | 17         | 1.5e-05 | 20      |
+---------------------------+------------------------------+--------------------+----------------+-----------+------------------------+------------+---------+---------+

Also to be noted is that Smart::Options does not use Getopt::Long but does its own parsing.

Verdict

I find that optimist and yargs themselves don't offer any new feature not already existing in Getopt::Long or Getopt::Long::Descriptive (the completion feature can be had with shcompgen). But Smart::Options does offer some extra features, like subcommand support and the reading of a configuration file. On the other hand, you lose some of Getopt::Long's features, like auto-abbreviation and custom handlers (in Getopt::Long, you can assign a coderef to an option which can do anything, like printing a message early and exiting, or setting one or several other variables, or whatever).

My problem with this module is the interface: method chaining has its uses (for example, I find it convenient in some JSON modules or in jQuery), but here it just distracts and makes the options specification visually convoluted. On the other hand, the alternative DSL interface is not complete (yet).

I personally would still reach for my Perinci::CmdLine most of the time. But I will prefer Smart::Options over App::Options (which is also covered in this mini-article series).


Sawyer X: Perl 5 Porters Mailing List Summary: December 12th-18th

Hey everyone,

Following is the p5p (Perl 5 Porters) mailing list summary for the past week.

Enjoy!

December 12th-18th

Grant reports

Issues

New issues

  • Perl #130333: Perl_pp_rv2sv(): Assertion failed.
  • Perl #130334: Perl_pp_rv2av(): Assertion failed.
  • Perl #130335: sort{$a<=>$b} fails to sort occasionally.
  • Perl #130337: Perl_sv_pvn_force_flags(SV *const, STRLEN *const, const I32): Assertion failed.
  • Perl #130360: Bug #121105 for perl5: During a system(), unquoted Perl vars are evaluated after the fork() call.
  • Perl #130361: debugger does not stop at postponed breakpoints.
  • Perl #130367: perl thinks a hash is a scalar in push/keys error messages.
  • Perl #130375: Porting/release_managers_guide.pod: need advice re new directories created by CPAN synch.

Resolved issues

  • Perl #128893: printf %a botches 0 flag for negative values.
  • Perl #130108: Perl 5.24.1 fails to compile with DTrace enabled on FreeBSD.

Suggested patches

Ricardo Signes provided a patch to add missing parts to Module::Load that provide the core desired behavior of Module::Runtime.

Aaron Crane provided a patch for fixing the Unicode Bug issue with the range operator, to be used under unicode_strings.

Discussion

David Mertens asked (Changes to hints hash via keyword are clobbered by pragmatic module) about behavior he saw in the hints hash when changing the internal hash structure from C. Zefram summarized that changes to hints (whether from Pure-Perl or from C-level) should be done using %^H only. David proposed rectifying the documentation to clarify this point.

David seeks (Guidance on keyword and scope at end of for loop) additional guidance on the issue of the hints hash and keywords in a new thread.

In a move to clean up the delimiters, Karl Williamson proposes deprecating a delimiter which is part of a larger grapheme cluster and not separate.

perlancar's blog: Getopt modules 19: App::Spec

perlancar's blog

About this mini-article series. Each day for 24 days, I will be reviewing a module that parses command-line options (such module is usually under the Getopt::* namespace). First article is here.

In the previous article I discussed App::Cmd, a nice, simple CLI framework that supports subcommands by having you write a subcommand class for each subcommand you want to add. It also lets you specify Getopt::Long::Descriptive command-line options directly, so you can be as custom as Getopt::Long::Descriptive lets you be. However, many high-level features are missing.

There exist many more CLI frameworks on CPAN, just as there are many option parsing libraries; some stay close to App::Cmd (except for, say, being Moo- or Moose-specific) while others try to provide more of those missing high-level features.

App::Spec is one such module. It is closer in features to my Perinci::CmdLine, with the main difference being that App::Spec is OO (although it uses a single class with different methods to support subcommands, instead of a separate class for each subcommand) while Perinci::CmdLine is decidedly not. Here are the features it supports (or wants to support, as it's not quite polished or finished yet): a specification for the CLI app (summary/description, a list of subcommands (possibly nested), and parameters/options for each subcommand), extra validation, automatic help/usage message generation, and shell tab completion. App::Spec is relatively new (2016) and written by Tina Müller (TINITA). No applications on CPAN are using it right now. There is actually an App::Spec article in this year's Perl Advent Calendar, so I'll just direct you to that article instead of describing it myself.

What's good about App::Spec is that it does not use Moo or Moose, so you can use it for applications you want to keep light. It's also not too heavy on the OO side. It provides shell tab completion out of the box; we need more frameworks like this, because tab completion is one of the pillars of usability on the CLI. I hope the completion feature improves in the future.

What I find not really to my liking in App::Spec includes: the low-level (manual) mapping to the Getopt::Long specification format (I prefer an automatic mapper like in MooseX::Getopt or my Perinci::CmdLine), and the splitting of options into "options" and "parameters" (unnecessary; they're all options to me, where "options" just happen to be boolean switches while "parameters" have values like strings or whatever).


perlancar's blog: Getopt modules 18: App::Cmd

perlancar's blog

About this mini-article series. Each day for 24 days, I will be reviewing a module that parses command-line options (such module is usually under the Getopt::* namespace). First article is here.

Traditionally, option parsing modules in Perl like Getopt::Std and Getopt::Long do not have the concept of subcommands. The rising popularity of CLI programs with subcommands, specifically git and other post-CVS version control tools, has prompted option parsing libraries to include the concept too, as in node's commander or Python's argparse. In Perl, not many option parsing libraries offer this feature (although ports of other languages' libraries, including argparse, exist on CPAN, which I'll cover in the following days). That said, you can also use a higher-level library, such as a CLI framework that supports subcommands.

App::Cmd is one such module: it is an OO CLI application framework which happens to use Getopt::Long::Descriptive as the command-line options parser. Both modules are written by Ricardo “Rik” Signes (RJBS). App::Cmd was first released in 2006 and is still being updated. Around 60 CPAN distributions use App::Cmd (90 if we also count users of MooseX::App::Cmd), making it possibly the most popular CLI application framework on CPAN. Toby Inkster (TOBYINK) even once called it “the PSGI of the command-line world”, although I don’t think that analogy is appropriate. A very popular CLI application, dzil (Dist::Zilla), also by Rik, uses App::Cmd.

As mentioned, App::Cmd is meant for writing CLI applications which have subcommands (or commands, as it calls them) like 'git' or 'dzil' (with 'git clone' or 'dzil build' as examples of a command with a subcommand). To use App::Cmd in your application, you create a single application class and then one class for each subcommand you want to support. App::Cmd does not use Moo or Moose, making it more universally usable. Of course, your application or command classes can be Moo- or Moose-based, as demonstrated by MooseX::App::Cmd and dzil. The CLI script itself is reduced to something like:

use YourApp; # your application class
YourApp->run;

Accepting and processing command-line options is pretty direct, if not low-level:

package YourApp::Command::cmd1; # a command class
sub opt_spec {
    return (
        [ "skip-check|C",  "skip checking stuffs", ],
        [ "sleep-between|s=i",  "delay between processing file", { default =>5 } ],
    );
}

# optional
sub usage_desc { "blah blah" }

# optional
sub validate_args {
    my ($self, $opt, $args) = @_;
    $self->usage_error("Please supply at least one file") unless @$args;
    $self->usage_error("Please specify a positive number") unless $opt->sleep_between >= 0;
}

sub execute {
    my ($self, $opt, $args) = @_;
    ...
}

You provide a method opt_spec in your command class to specify which command-line options your subcommand accepts. The value returned by this method is passed directly to Getopt::Long::Descriptive. The parse result will be the $opt object (the thing you normally get from Getopt::Long::Descriptive's describe_options function) as well as $args (the command-line arguments remaining in @ARGV after the options have been stripped).

You can also provide the usage description to be passed to Getopt::Long::Descriptive’s describe_options via the usage_desc method. And an additional method validate_args to further validate $opt and $args if needed. The main method in a command class is execute, which is fed $opt and $args.

So you can see this does not differ much from a “traditional” CLI using Getopt::Long. You still provide the command-line options specification manually. This differs from other CLI frameworks like MooseX::Getopt or my Perinci::CmdLine which try to be more DRY by directly setting object attributes or function arguments from command-line options.

Another thing to note is that no configuration file support is baked in: you need to read and parse configuration files yourself. So basically what App::Cmd provides is the structure; many higher-level CLI features you have to build on your own. App::Cmd is mature and widely used, but you might also want to take a look at some other CLI frameworks that do more for you.


perlancar's blog: Getopt modules 17: Getopt::Modular

perlancar's blog

About this mini-article series. Each day for 24 days, I will be reviewing a module that parses command-line options (such module is usually under the Getopt::* namespace). First article is here.

Getopt::Modular is a Getopt::Long wrapper that lets you spread command-line options across one or more modules and then combines them as you use those modules. For example:

# Module1.pm
use Getopt::Modular;
Getopt::Modular->acceptParam(
    opt1 => {
        spec => '=s',
        aliases => ['O'],
        help => 'This is option one',
        default => 'foo',
        validate => sub { ... },
    },
    opt2 => {
        ...
    },
);
# access the parameters somewhere in your code using:
if (Getopt::Modular->getOpt('opt1')) { ... }
1;

# in Module2.pm
use Getopt::Modular;
Getopt::Modular->acceptParam(
    opt3 => { ... },
);
1;

# in myapp
use Getopt::Modular;
use Module1;
use Module2;
Getopt::Modular->parseArgs; # program accepts options opt1, opt2, opt3

As you can see, aside from splitting command-line options over several modules, Getopt::Modular also lets you specify default value, usage/help message strings, and extra validation routine.

Getopt::Modular is written by Darin McBride (DMCBRIDE), first released in 2008 and last updated in 2014. Currently no other CPAN distributions are using it. But Getopt::Modular inspired another module, Getopt::Awesome (written by Pablo Fischer (PFISCHER) in 2009), which continues Getopt::Modular's basic premise with an alternative syntax.

The intention, modularity, is good, but this is modularity at the wrong level. If you want your code in a module to be more reusable and flexible (and everybody wants that), you accept parameters. The first choice for accepting parameters should be function parameters (or, if you are building an OO class, class attributes). If that is not suitable, for example if you want to parameterize a more global behavior, you use package variables or perhaps environment variables. Using Getopt::Modular (needlessly) ties the parameters to the command line, when your module might not be command-line-specific. The mapping is perhaps best done at the script level instead of in the modules.

That said, there are surely cases when this module is appropriate, for example if you are building a rather complex CLI application that you split into several modules, where the modules are CLI/application-specific. But even then, you should probably try to make the module not CLI-specific if you can.


brian d foy: White Camels 2016

The White Camel Awards recognize outstanding, non-technical achievement in Perl. Started in 1999 by Perl mongers and later merged with The Perl Foundation, the awards committee selects three names from a long list of worthy Perl volunteers to recognize hard work in Perl Community, Perl Advocacy, and Perl User Groups. These awards have been managed by The Perl Review in conjunction with The Perl Foundation.

At the end of each year we ask the community to nominate Perl heroes. Each year we have a long list of people we could recognize. You don't have to wait to nominate someone though. We maintain a list from one year to the next. We'll take nominations at any time, but wait to announce them on Perl's birthday.

For 2016, the White Camels recognize the efforts of these people whose hard work has made Perl and the Perl community a better place:

Perl User Groups - David Golden

You've run across David's work behind the scenes of the Perl community, and probably recognize him as "xdg". He could have received a White Camel in any category. He's a constant force in establishing consensus for the Perl toolchain, the QA workshops, and many other social constructs that make everything work together. Besides that technical work, he's also one of the organizers of the New York Perl mongers.

Perl Advocacy - Karen Pauley

For years, Karen has been on the shortlist of people nominated for a White Camel Award for her work as the president of The Perl Foundation. For several years she kept the gears turning, oversaw various major improvements in management, and put everything in good order. This year she stepped out of her role at TPF. Since she's no longer part of TPF, we can recognize her with this award. Mark Keating has a much longer discussion of her work at TPF.

Perl Community - Thomas Klausner

Thomas Klausner ("domm" to many people) of the Vienna Perl mongers literally works harder than everyone else to get to Perl events. Starting in Vienna, he cycles to many events. He's worked as an organizer for several workshops in his area (most recently the first Alpine Perl Workshop). He's a core part of the conference community even if he doesn't boast about his work.

NEILB: The CPAN Pull Request Challenge for 2017

I'm going to run the CPAN Pull Request Challenge (PRC) again in 2017, as enough of this year's participants have said they'd like to continue. If you'd like to take part, email me your github username. If you're a CPAN author, please let me know if you're happy for your distributions to be assigned.

perlancar's blog: Getopt modules 16: Getopt::Attribute

perlancar's blog

About this mini-article series. Each day for 24 days, I will be reviewing a module that parses command-line options (such module is usually under the Getopt::* namespace). First article is here.

When you are doing OO, mapping command-line options to your class attributes is convenient. But what if you are not using OO? There's Getopt::Attribute for that: it maps options to your package variables. (There's also my Perinci::CmdLine, which maps command-line options to function arguments, but I'm not reviewing my own modules in this series.)

Getopt::Attribute is written by Marcel Grünauer (MARCEL), first in 2001 and last updated in 2010. Here are the users of this module on CPAN (mostly Marcel himself):

% lcpan rdeps Getopt::Attribute
+---------+----------+--------------------+--------+--------------+-------------+
| phase   | rel      | dist               | author | dist_version | req_version |
+---------+----------+--------------------+--------+--------------+-------------+
| runtime | requires | Hopkins-Plugin-RPC | DIZ    | 0.900        | 1.44        |
| runtime | requires | Module-Changes     | MARCEL | 0.05         | 0           |
| runtime | requires | Module-Cloud       | MARCEL | 1.100861     | 0           |
| runtime | requires | Task-MasteringPerl | BDFOY  | 1.002        | 0           |
| runtime | requires | Vim-Complete       | MARCEL | 1.100880     | 0           |
+---------+----------+--------------------+--------+--------------+-------------+

Here's how you would use Getopt::Attribute:

use Getopt::Attribute;

our $verbose : Getopt(verbose!);
our $all     : Getopt(all);
our $size    : Getopt(size=s);
our $more    : Getopt(more+);
our @library : Getopt(library=s);
our %defines : Getopt(define=s);
sub quiet : Getopt(quiet) { our $quiet_msg = 'seen quiet' }
usage() if our $man : Getopt(man);

As you can see, it uses a rather Perl-specific feature: attributes (on variables as well as subroutines). You can then call your CLI app like this:

% myapp --all --size=10 --more --more --library L1 --library L2

Then your variable $all will be set to 1, $size to 10, $more to 2, and @library to ["L1", "L2"].

The module code itself is surprisingly compact, at less than 30 lines. If you wonder where the actual parsing is done: it's done in the INIT phase. So at least, unlike with App::Options, you can still use "perl -c" to syntax-check your scripts.
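The mechanism it relies on, in miniature (a sketch of the INIT idea, not Getopt::Attribute's actual code): INIT blocks run after compilation but before the main program, so "perl -c", which stops after compilation, never triggers the option parsing.

#!/usr/bin/env perl
use strict;
use warnings;
use Getopt::Long ();

our $verbose;

# Runs after compile time, before run time -- skipped entirely by `perl -c`.
INIT {
    Getopt::Long::GetOptions('verbose!' => \$verbose);
}

print "verbose is ", ($verbose ? "on" : "off"), "\n";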

My main complaints are only: 1) my Emacs cperl-mode still doesn't syntax-highlight these attributes correctly; 2) if you want to put all the options into a single hash, you can't, so the module forces you to pick a particular style.

This module is a pure Getopt::Long wrapper that does not add features like a per-option summary string (although that would be doable by putting one in the attribute as a parameter), specifying that an option is required, or specifying a default value. It would make the module more interesting if it had those features.


Dave's Free Press: Journal: Module pre-requisites analyser

Dave's Free Press: Journal: CPANdeps

Dave's Free Press: Journal: Perl isn't dieing

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 3

Dave's Free Press: Journal: Devel::CheckLib can now check libraries' contents

Ocean of Awareness: Top-down parsing is guessing

Top-down parsing is guessing. Literally. Bottom-up parsing is looking.

The way you'll often hear that phrased is that top-down parsing is looking, starting at the top, and bottom-up parsing is looking, starting at the bottom. But that is misleading, because the input is at the bottom -- at the top there is nothing to look at. A usable top-down parser must have a bottom-up component, even if that component is just lookahead.

A more generous, but still accurate, way to describe the top-down component of parsers is "prediction". And prediction is, indeed, a very useful component of a parser, when used in combination with other techniques.

Of course, if a parser does nothing but predict, it can predict only one input. Top-down parsing must always be combined with a bottom-up component. This bottom-up component may be as modest as lookahead, but it must be there or else top-down parsing is really not parsing at all.

So why is top-down parsing used so much?

Top-down parsing may be unusable in its pure form, but from one point of view that is irrelevant. Top-down parsing's biggest advantage is that it is highly flexible -- there's no reason to stick to its "pure" form.

A top-down parser can be written as a series of subroutine calls -- a technique called recursive descent. Recursive descent allows you to hook in custom-written bottom-up logic at every top-down choice point, and it is a technique which is completely understandable to programmers with little or no training in parsing theory. When dealing with recursive descent parsers, it is more useful to be a seasoned, far-thinking programmer than it is to be a mathematician. This makes recursive descent very appealing to seasoned, far-thinking programmers, and they are the audience that counts.

Switching techniques

You can even use the flexibility of top-down to switch away from top-down parsing. For example, you could claim that a top-down parser could do anything my own parser (Marpa) could do, because a recursive descent parser can call a Marpa parser.

A less dramatic switchoff, and one that still leaves the parser with a good claim to be basically top-down, is very common. Arithmetic expressions are essential for a computer language. But they are also among the many things top-down parsing cannot handle, even with ordinary lookahead. Even so, most computer languages these days are parsed top-down -- by recursive descent. These recursive descent parsers deal with expressions by temporarily handing control over to a bottom-up operator precedence parser. Neither of these parsers is extremely smart about the hand-over and hand-back -- it is up to the programmer to make sure the two play together nicely. But used with caution, this approach works.
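The expression part of that hand-over is small enough to sketch. Here is a minimal precedence-climbing loop of the kind a recursive descent parser typically delegates to (illustrative code in Perl, not taken from any particular compiler):

use strict;
use warnings;

my @tokens;                     # pre-lexed token stream, e.g. qw(2 + 3 * 4)
sub peek       { $tokens[0] }
sub next_token { shift @tokens }

sub parse_primary {             # the recursive-descent side: one operand
    my $t = next_token();
    die "expected number\n" unless defined $t && $t =~ /^\d+$/;
    return $t + 0;
}

my %prec = ('+' => 1, '-' => 1, '*' => 2, '/' => 2);

sub parse_expr {                # the bottom-up side: precedence climbing
    my ($min_prec) = @_;
    my $lhs = parse_primary();
    while (defined(my $op = peek()) && exists $prec{$op} && $prec{$op} >= $min_prec) {
        next_token();
        my $rhs = parse_expr($prec{$op} + 1);   # bind tighter operators first
        $lhs = $op eq '+' ? $lhs + $rhs
             : $op eq '-' ? $lhs - $rhs
             : $op eq '*' ? $lhs * $rhs
             :              $lhs / $rhs;
    }
    return $lhs;
}

@tokens = qw(2 + 3 * 4);
print parse_expr(1), "\n";      # prints 14, not 20: * binds tighter than +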

Top-down parsing and language-oriented programming

But what about taking top-down methods into the future of language-oriented programming, extensible languages, and grammars which write grammars? Here we are forced to confront the reality -- that the effectiveness of top-down parsing comes entirely from the foreign elements that are added to it. Starting from a basis of top-down parsing is literally starting with nothing. As I have shown in more detail elsewhere, top-down techniques simply do not have enough horsepower to deal with grammar-driven programming.

Perl 6 grammars are top-down -- PEG with lots of extensions. These extensions include backtracking, backtracking control, a new style of tie-breaking and lots of opportunity for the programmer to intervene and customize everything. But behind it all is a top-down parse engine.

One aspect of Perl 6 grammars might be seen as breaking out of the top-down trap. That trick of switching over to a bottom-up operator precedence parser for expressions, which I mentioned above, is built into Perl 6 and semi-automated. (I say semi-automated because making sure the two parsers "play nice" with each other is not automated -- that's still up to the programmer.)

As far as I know, this semi-automation of expression handling is new with Perl 6 grammars, and it may prove handy for duplicating what is done in recursive descent parsers. But it adds no new technique to those already in use. And features like

  • multiple types of expression, which can be told apart based on their context,
  • n-ary expressions for arbitrary n, and
  • the autogeneration of multiple rules, each allowing a different precedence scheme, for expressions of arbitrary arity and associativity,

all of which are available and in current use in Marpa, are impossible for the technology behind Perl 6 grammars.

I am a fan of the Perl 6 effort. Obviously, I have doubts about one specific set of hopes for Perl 6 grammars. But these hopes have not been central to the Perl 6 effort, and I will be an eager student of the Perl 6 team's work over the coming months.

Comments

To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: I Love Github

Dave's Free Press: Journal: Palm Treo call db module

Dave's Free Press: Journal: Graphing tool

Dave's Free Press: Journal: Travelling in time: the CP2000AN

Dave's Free Press: Journal: XML::Tiny released

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 1

Ocean of Awareness: Parsing: an expanded timeline

The fourth century BCE: In India, Panini creates a sophisticated description of the Sanskrit language, exact and complete, and including pronunciation. Sanskrit could be recreated using nothing but Panini's grammar. Panini's grammar is probably the first formal system of any kind, predating Euclid. Even today, nothing like it exists for any other natural language of comparable size or corpus. Panini is the object of serious study today. But in the 1940's and 1950's Panini is almost unknown in the West. His work has no direct effect on the other events in this timeline.

1943: Emil Post defines and studies a formal rewriting system using productions. With this, the process of reinventing Panini in the West begins.

1948: Claude Shannon publishes the foundation paper of information theory. Andrey Markov's finite state processes are used heavily.

1952: Grace Hopper writes a linker-loader and describes it as a "compiler". She seems to be the first person to use this term for a computer program. Hopper uses the term "compiler" in its original sense: "something or someone that brings other things together".

1954: At IBM, a team under John Backus begins working on the language which will be called FORTRAN. The term "compiler" is still being used in Hopper's looser sense, instead of its modern one. In particular, there is no implication that the output of a "compiler" is ready for execution by a computer. The output of one 1954 "compiler", for example, produces relative addresses, which need to be translated by hand before a machine can execute them.

1955: Noam Chomsky is awarded a Ph.D. in linguistics and accepts a teaching post at MIT. MIT does not have a linguistics department and Chomsky, in his linguistics course, is free to teach his own approach, highly original and very mathematical.

1956: Chomsky publishes the paper which is usually considered the foundation of Western formal language theory. The paper advocates a natural language approach that involves

  • a bottom layer, using Markov's finite state processes;
  • a middle, syntactic layer, using context-free grammars and context-sensitive grammars; and
  • a top layer, which involves mappings or "transformations" of the output of the syntactic layer.

These layers resemble, and will inspire, the lexical, syntactic and AST transformation phases of modern parsers. For finite state processes, Chomsky acknowledges Markov. The other layers seem to be Chomsky's own formulations -- Chomsky does not cite Post's work.

1957: Stephen Kleene discovers regular expressions, a very handy notation for Markov's processes. Regular expressions turn out to describe exactly the mathematical objects being studied as finite state automata, as well as some of the objects being studied as neural nets.

1957: Noam Chomsky publishes Syntactic Structures, one of the most influential books of all time. The orthodoxy in 1957 is structural linguistics which argues, with Sherlock Holmes, that "it is a capital mistake to theorize in advance of the facts". Structuralists start with the utterances in a language, and build upward.

But Chomsky claims that without a theory there are no facts: there is only noise. The Chomskyan approach is to start with a grammar, and use the corpus of the language to check its accuracy. Chomsky's approach will soon come to dominate linguistics.

1957: Backus's team makes the first FORTRAN compiler available to IBM customers. FORTRAN is the first high-level language that will find widespread implementation. As of this writing, it is the oldest language that survives in practical use. FORTRAN is a line-by-line language and its parsing is primitive.

1958: John McCarthy's LISP appears. LISP goes beyond the line-by-line syntax -- it is recursively structured. But the LISP interpreter does not find the recursive structure: the programmer must explicitly indicate the structure herself, using parentheses.

1959: Backus invents a new notation to describe the IAL language (aka ALGOL). Backus's notation is influenced by his study of Post -- he seems not to have read Chomsky until later.

1960: Peter Naur improves the Backus notation and uses it to describe ALGOL 60. The improved notation will become known as Backus-Naur Form (BNF).

1960: The ALGOL 60 report specifies, for the first time, a block structured language. ALGOL 60 is recursively structured but the structure is implicit -- newlines are not semantically significant, and parentheses indicate syntax only in a few specific cases. The ALGOL compiler will have to find the structure. It is a case of 1960's optimism at its best. As the ALGOL committee is well aware, a parsing algorithm capable of handling ALGOL 60 does not yet exist. But the risk they are taking will soon pay off.

1960: A.E. Glennie publishes his description of a compiler-compiler. Glennie's "universal compiler" is more of a methodology than an implementation -- the compilers must be written by hand. Glennie credits both Chomsky and Backus, and observes that the two notations are "related". He also mentions Post's productions. Glennie may have been the first to use BNF as a description of a procedure instead of as the description of a Chomsky grammar. Glennie points out that the distinction is "important".

Chomskyan BNF and procedural BNF: BNF, when used as a Chomsky grammar, describes a set of strings, and does not describe how to parse strings according to the grammar. BNF notation, if used to describe a procedure, is a set of instructions, to be tried in some order, and used to process a string. Procedural BNF describes a procedure first, and a language only indirectly.

Both procedural and Chomskyan BNF describe languages, but usually not the same language. That is,

  • Suppose D is some BNF description.
  • Let P(D) be D interpreted as a procedure,
  • Let L(P(D)) be the language which the procedure P(D) parses.
  • Let G(D) be D interpreted as a Chomsky grammar.
  • Let L(G(D)) be the language which the grammar G(D) describes.
  • Then, usually, L(P(D)) != L(G(D)).

The pre-Chomskyan approach, using procedural BNF, is far more natural to someone trained as a computer programmer. The parsing problem appears to the programmer in the form of strings to be parsed, exactly the starting point of procedural BNF and pre-Chomsky parsing.

Even when the Chomskyan approach is pointed out, it does not at first seem very attractive. With the pre-Chomskyan approach, the examples of the language more or less naturally lead to a parser. In the Chomskyan approach the programmer has to search for an algorithm to parse strings according to his grammar -- and the search for good algorithms to parse Chomskyan grammars has proved surprisingly long and difficult. Handling semantics is more natural with a Chomskyan approach. But, using captures, semantics can be added to a pre-Chomskyan parser and, with practice, this seems natural enough.

Despite the naturalness of the pre-Chomskyan approach to parsing, we will find that the first fully-described automated parsers are Chomskyan. This is a testimony to Chomsky's influence at the time. We will also see that Chomskyan parsers have been dominant ever since.

1961: In January, Ned Irons publishes a paper describing his ALGOL 60 parser. It is the first paper to fully describe any parser. The Irons algorithm is Chomskyan and top-down with a "left corner" element. The Irons algorithm is general, meaning that it can parse anything written in BNF. It is syntax-driven (aka declarative), meaning that the parser is actually created from the BNF -- the parser does not need to be hand-written.

1961: Peter Lucas publishes the first description of a purely top-down parser. This can be considered to be recursive descent, though in Lucas's paper the algorithm has a syntax-driven implementation, useable only for a restricted class of grammars. Today we think of recursive descent as a methodology for writing parsers by hand. Hand-coded approaches became more popular in the 1960's due to three factors:

  • Memory and CPU were both extremely limited. Hand-coding paid off, even when the gains were small.
  • Non-hand coded top-down parsing, of the kind Lucas's syntax-driven approach allowed, is a very weak parsing technique. It was (and still is) often necessary to go beyond its limits.
  • Top-down parsing is intuitive -- it essentially means calling subroutines. It therefore requires little or no knowledge of parsing theory. This makes it a good fit for hand-coding.

1963: L. Schmidt, Howard Metcalf, and Val Schorre present papers on syntax-directed compilers at a Denver conference.

1964: Schorre publishes a paper on the Meta II "compiler writing language", summarizing the papers of the 1963 conference. Schorre cites both Backus and Chomsky as sources for Meta II's notation. Schorre notes that his parser is "entirely different" from that of Irons 1961 -- in fact it is pre-Chomskyan. Meta II is a template, rather than something that readers can use, but in principle it can be turned into a fully automated compiler-compiler.

1965: Don Knuth invents LR parsing. The LR algorithm is deterministic, Chomskyan and bottom-up, but it is not thought to be practical. Knuth is primarily interested in the mathematics.

1968: Jay Earley invents the algorithm named after him. Like the Irons algorithm, Earley's algorithm is Chomskyan, syntax-driven and fully general. Unlike the Irons algorithm, it does not backtrack. Earley's algorithm is both top-down and bottom-up at once -- it uses dynamic programming and keeps track of the parse in tables. Earley's approach makes a lot of sense and looks very promising indeed, but there are three serious issues:

  • First, there is a bug in the handling of zero-length rules.
  • Second, it is quadratic for right recursions.
  • Third, the bookkeeping required to set up the tables is, by the standards of 1968 hardware, daunting.
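
To make the table-driven approach concrete, here is a minimal Earley recognizer sketched in Perl. This is the textbook algorithm, not Marpa's implementation; the grammar and input are invented for the example. Its chart of item lists is the bookkeeping mentioned above, and its naive completion step is exactly where the zero-length-rule bug lives (the example grammar has no zero-length rules, so the sketch is safe here).

    #!/usr/bin/env perl
    # A minimal Earley recognizer (textbook algorithm, not Marpa's
    # implementation). Grammar and input are invented for the example.
    use strict;
    use warnings;

    # Rules as [ LHS, [ RHS symbols ] ]; nonterminals start uppercase.
    my @rules = (
        [ 'S',   ['Sum'] ],
        [ 'Sum', [ 'Sum', '+', 'Num' ] ],
        [ 'Sum', ['Num'] ],
        [ 'Num', ['n'] ],
    );

    sub is_terminal { return $_[0] !~ /^[A-Z]/ }

    sub earley_recognize {
        my @input = @_;
        my @chart = map { [] } 0 .. @input;    # chart[i]: items ending at i
        my %seen;
        my $add = sub {    # an item is [ rule number, dot position, origin ]
            my ( $i, $rule, $dot, $origin ) = @_;
            push @{ $chart[$i] }, [ $rule, $dot, $origin ]
                unless $seen{"$i;$rule;$dot;$origin"}++;
        };
        $add->( 0, $_, 0, 0 ) for grep { $rules[$_][0] eq 'S' } 0 .. $#rules;
        for my $i ( 0 .. @input ) {
            # Indexed loops: the item lists grow while being scanned.
            for ( my $j = 0; $j < @{ $chart[$i] }; $j++ ) {
                my ( $rule, $dot, $origin ) = @{ $chart[$i][$j] };
                my $rhs = $rules[$rule][1];
                if ( $dot < @{$rhs} ) {
                    my $next = $rhs->[$dot];
                    if ( is_terminal($next) ) {    # scan
                        $add->( $i + 1, $rule, $dot + 1, $origin )
                            if $i < @input && $input[$i] eq $next;
                    }
                    else {                         # predict
                        $add->( $i, $_, 0, $i )
                            for grep { $rules[$_][0] eq $next } 0 .. $#rules;
                    }
                }
                else {                             # complete (naive: this is
                    my $lhs  = $rules[$rule][0];   # where the zero-length
                    my $from = $chart[$origin];    # rule bug would bite)
                    for ( my $k = 0; $k < @{$from}; $k++ ) {
                        my ( $r2, $d2, $o2 ) = @{ $from->[$k] };
                        $add->( $i, $r2, $d2 + 1, $o2 )
                            if $d2 < @{ $rules[$r2][1] }
                            && $rules[$r2][1][$d2] eq $lhs;
                    }
                }
            }
        }
        # Accept if some start rule is complete over the whole input.
        return grep {
            my ( $rule, $dot, $origin ) = @{$_};
            $origin == 0
                && $rules[$rule][0] eq 'S'
                && $dot == @{ $rules[$rule][1] };
        } @{ $chart[-1] };
    }

    print earley_recognize(qw(n + n)) ? "accepted\n" : "rejected\n";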

1969: Frank DeRemer describes a new variant of Knuth's LR parsing. DeRemer's LALR algorithm requires only a stack and a state table of quite manageable size. LALR looks practical.

1969: Ken Thompson writes the "ed" editor as one of the first components of UNIX. At this point, regular expressions are an esoteric mathematical formalism. Through the "ed" editor and its descendants, regular expressions will become an everyday part of the working programmer's toolkit.

Recognizers: In comparing algorithms, it can be important to keep in mind whether they are recognizers or parsers. A recognizer is a program which takes a string and produces a "yes" or "no" according to whether the string is in a given language. Regular expressions are typically used as recognizers. A parser is a program which takes a string and produces a tree reflecting its structure according to a grammar. A compiler clearly requires a parser, not just a recognizer. Recognizers can be, to some extent, used as parsers by introducing captures.
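
A one-file Perl illustration of the distinction: the same regular expression acting first as a recognizer, then -- via captures -- as a crude, flat parser.

    #!/usr/bin/env perl
    # Recognizer versus parser, in regex terms (illustrative only).
    use strict;
    use warnings;

    my $date = '2016-04-30';

    # As a recognizer: the only output is yes or no.
    print "looks like a date\n" if $date =~ /^\d{4}-\d{2}-\d{2}$/;

    # Pressed into service as a parser: captures recover some structure,
    # but only a fixed, flat amount -- no tree comes back.
    if ( my ( $y, $m, $d ) = $date =~ /^(\d{4})-(\d{2})-(\d{2})$/ ) {
        print "year=$y month=$m day=$d\n";
    }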

1972: Alfred Aho and Jeffrey Ullman publish a two volume textbook summarizing the theory of parsing. This book is still important. It is also distressingly up-to-date -- progress in parsing theory slowed dramatically after 1972. Aho and Ullman describe a straightforward fix to the zero-length rule bug in Earley's original algorithm. Unfortunately, this fix involves adding even more bookkeeping to Earley's.

1972: Under the names TDPL and GTDPL, Aho and Ullman investigate the non-Chomskyan parsers in the Schorre lineage. They note that "it can be quite difficult to determine what language is defined by a TDPL parser". That is, GTDPL parsers do whatever they do, and that whatever is something the programmer in general will not be able to describe. The best a programmer can usually do is to create a test suite and fiddle with the GTDPL description until it passes. Correctness cannot be established in any stronger sense. GTDPL is an extreme form of the old joke that "the code is the documentation" -- with GTDPL nothing documents the language of the parser, not even the code.

GTDPL's obscurity buys nothing in the way of additional parsing power. Like all non-Chomskyan parsers, GTDPL is basically an extremely powerful recognizer. Pressed into service as a parser, it is comparatively weak. As a parser, GTDPL is essentially equivalent to Lucas's 1961 syntax-driven algorithm, which was in turn a restricted form of recursive descent.

At or around this time, rumor has it that the main line of development for GTDPL parsers is classified secret by the US government. GTDPL parsers have the property that even small changes to them can be very labor-intensive. For some government contractors, GTDPL parsing provides steady work for years to come. Public interest in GTDPL fades.

1975: Bell Labs converts its C compiler from hand-written recursive descent to DeRemer's LALR algorithm.

1977: The first "Dragon book" comes out. This soon-to-be classic textbook is nicknamed after the drawing on the front cover, in which a knight takes on a dragon. Emblazoned on the knight's lance are the letters "LALR". From here on out, to speak lightly of LALR will be to besmirch the escutcheon of parsing theory.

1979: Bell Laboratories releases Version 7 UNIX. V7 includes what is, by far, the most comprehensive, useable and easily available compiler writing toolkit yet developed.

1979: Part of the V7 toolkit is Yet Another Compiler Compiler (YACC). YACC is LALR-powered. Despite its name, YACC is the first compiler-compiler in the modern sense. For some useful languages, the process of going from Chomskyan specification to executable is fully automated. Most practical languages, including the C language and YACC's own input language, still require manual hackery. Nonetheless, after two decades of research, it seems that the parsing problem is solved.

1987: Larry Wall introduces Perl 1. Perl embraces complexity like no previous language. Larry uses YACC and LALR very aggressively -- to my knowledge more aggressively than anyone before or since.

1991: Joop Leo discovers a way of speeding up right recursions in Earley's algorithm. Leo's algorithm is linear for just about every unambiguous grammar of practical interest, and many ambiguous ones as well. By 1991, hardware is six orders of magnitude faster than 1968 hardware, so the issue of bookkeeping overhead has receded in importance. This is a major discovery. When it comes to speed, the game has changed in favor of the Earley algorithm.

But Earley parsing is almost forgotten. Twenty years will pass before anyone writes a practical implementation of Leo's algorithm.

1990's: Earley's is forgotten. So everyone in LALR-land is content, right? Wrong. Far from it, in fact. Users of LALR are making unpleasant discoveries. While LALR automatically generates their parsers, debugging them is so hard they could just as easily write the parser by hand. Once debugged, their LALR parsers are fast for correct inputs. But almost all an LALR parser tells the user about an incorrect input is that it is incorrect. In Larry's words, LALR is "fast but stupid".

2000: Larry Wall decides on a radical reimplementation of Perl -- Perl 6. Larry does not even consider using LALR again.

2002: John Aycock and R. Nigel Horspool publish their attempt at a fast, practical Earley's parser. Missing from it is Joop Leo's improvement -- they seem not to be aware of it. Their own speedup is limited in what it achieves and the complications it introduces can be counter-productive at evaluation time. But buried in their paper is a solution to the zero-length rule bug. And this time the solution requires no additional bookkeeping.

2004: Bryan Ford publishes his paper on PEG. Implementers by now are avoiding YACC, and it seems as if there might soon be no syntax-driven algorithms in practical use. Ford fills this gap by repackaging the nearly-forgotten GTDPL. Ford adds packratting, so that PEG is always linear, and provides PEG with an attractive new syntax. But nothing has been done to change the problematic behaviors of GTDPL.

2006: GNU announces that the GCC compiler's parser has been rewritten. For three decades, the industry's flagship C compilers have used LALR as their parser -- proof of the claim that LALR and serious parsing are equivalent. Now, GNU replaces LALR with the technology that it replaced a quarter century earlier: recursive descent.

Today: After five decades of parsing theory, the state of the art seems to be back where it started. We can imagine someone taking Ned Irons's original 1961 algorithm from the first paper ever published describing a parser, and republishing it today. True, he would have to translate its code from the mix of assembler and ALGOL into something more fashionable, say Haskell. But with that change, it might look like a breath of fresh air.

Marpa: an afterword

The recollections of my teachers cover most of this timeline. My own begin around 1970. Very early on, as a graduate student, I became unhappy with the way the field was developing. Earley's algorithm looked interesting, and it was something I returned to on and off.

The original vision of the 1960's was a parser that was

  • efficient,
  • practical,
  • general, and
  • syntax-driven.

By 2010 this vision seemed to have gone the same way as many other 1960's dreams. The rhetoric stayed upbeat, but parsing practice had become a series of increasingly desperate compromises.

But, while nobody was looking for them, the solutions to the problems encountered in the 1960's had appeared in the literature. Aycock and Horspool had solved the zero-length rule bug. Joop Leo had found the speedup for right recursion. And the issue of bookkeeping overhead had pretty much evaporated on its own. Machine operations are now a billion times faster than in 1968, and are probably no longer relevant in any case -- cache misses are now the bottleneck.

The programmers of the 1960's would have been prepared to trust a fully declarative Chomskyan parser. With the experience with LALR in their collective consciousness, modern programmers might be more guarded. As Mark Twain said, "Once a cat's been burned, he won't even sit on a cold stove." But I found it straightforward to rearrange the Earley parse engine to allow efficient event-driven handovers between procedural and syntax-driven logic. And Earley tables provide the procedural logic with full knowledge of the state of the parse so far, so that Earley's algorithm is a better platform for hand-written procedural logic than recursive descent.

References, comments, etc.

My implementation of Earley's algorithm is called Marpa. For more about Marpa, there is the semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: Thanks, Yahoo!

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 2

Ocean of Awareness: What are the reasonable computer languages?

"You see things; and you say 'Why?' But I dream things that never were; and I say 'Why not?'" -- George Bernard Shaw

In the 1960's and 1970's computer languages were evolving rapidly. It was not clear which way they were headed. Would most programming be done with general-purpose languages? Or would programmers create a language for every task domain? Or even for every project? And, if lots of languages were going to be created, what kinds of languages would be needed?

It was in that context that Čulik and Cohen, in a 1973 paper, outlined what they thought programmers would want and should have. In keeping with the spirit of the time, it was quite a lot:

  • Programmers would want to extend their grammars with new syntax, including new kinds of expressions.
  • Programmers would also want to use tools that automatically generated new syntax.
  • Programmers would not want to, and especially in the case of auto-generated syntax would usually not be able to, massage the syntax into very restricted forms. Instead, programmers would create grammars and languages which required unlimited lookahead to disambiguate, and they would require parsers which could handle these grammars.
  • Finally, programmers would need to be able to rely on all of this parsing being done in linear time.

Today, we think we know that Čulik and Cohen's vision was naive, because we think we know that parsing technology cannot support it. We think we know that parsing is much harder than they thought.

The eyeball grammars

As a thought problem, consider the "eyeball" class of grammars. The "eyeball" class of grammars contains all the grammars that a human can parse at a glance. If a grammar is in the eyeball class, but a computer cannot parse it, it presents an interesting choice. Either,

  • your computer is not using the strongest practical algorithm; or
  • your mind is using some power which cannot be reduced to a machine computation.

There are some people out there (I am one of them) who don't believe that everything the mind can do reduces to a machine computation. But even those people will tend to go for the first choice in this case: there must be some practical computer parsing algorithm which can do at least as well at parsing as a human can do by "eyeball". In other words, the class of "reasonable grammars" should contain the eyeball class.

Čulik and Cohen's candidate for the class of "reasonable grammars" was the set of grammars that a deterministic parse engine could parse if it had a lookahead that was infinite, but restricted to distinguishing between regular expressions. They called these the LR-regular, or LRR, grammars. And the LRR grammars do in fact seem to be a good first approximation to the eyeball class. They do not allow lookahead that contains things that you have to count, like palindromes. And, while I'd be hard put to eyeball every possible string for every possible regular expression, intuitively the concept of scanning for a regular expression does seem close to capturing the idea of glancing through a text looking for a telltale pattern.

So what happened?

Alas, the algorithm in the Čulik and Cohen paper turned out to be impractical. But in 1991, Joop Leo discovered a way to adapt Earley's algorithm to parse the LRR grammars in linear time, without doing the lookahead. And Leo's algorithm does have a practical implementation: Marpa.

References, comments, etc.

To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: YAPC::Europe 2007 travel plans

Dave's Free Press: Journal: Wikipedia handheld proxy

Dave's Free Press: Journal: Bryar security hole

Dave's Free Press: Journal: POD includes

Dave's Free Press: Journal: cgit syntax highlighting

Dave's Free Press: Journal: CPAN Testers' CPAN author FAQ

Ocean of Awareness: What parser do birds use?

"Here we provide, to our knowledge, the first unambiguous experimental evidence for compositional syntax in a non-human vocal system." -- "Experimental evidence for compositional syntax in bird calls", Toshitaka N. Suzuki, David Wheatcroft & Michael Griesser Nature Communications 7, Article number: 10986

In this post I look at a subset of the language of the Japanese great tit, also known as Parus major. The above cited article presents evidence that bird brains can parse this language. What about standard modern computer parsing methods? Here is the subset -- probably a tiny one -- of the language actually used by Parus major.

      S ::= ABC
      S ::= D
      S ::= ABC D
      S ::= D ABC
    

Classifying the Parus major grammar

Grammophone is a very handy new tool for classifying grammars. Its own parser is somewhat limited, so that it requires a period to mark the end of a rule. The above grammar is in Marpa's SLIF format, which is smart enough to use the "::=" operator to spot the beginning and end of rules, just as the human eye does. Here's the same grammar converted into a form acceptable to Grammophone:

      S -> ABC .
      S -> D .
      S -> ABC D .
      S -> D ABC .
    

Grammophone tells us that the Parus major grammar is not LL(1), but that it is LALR(1).

What does this mean?

LL(1) is the class of grammars parseable by top-down methods: it's the best class for characterizing most parsers in current use, including recursive descent, PEG, and Perl 6 grammars. All of these parsers fall short of dealing with the Parus major language.

LALR(1) is probably most well-known from its implementations in bison and yacc. While able to handle this subset of Parus's language, LALR(1) has its own, very strict, limits. Whether LALR(1) could handle the full complexity of the Parus language is a serious question. But it's a question that in practice would probably not arise. LALR(1) has horrible error handling properties.

When the input is correct and within its limits, an LALR-driven parser is fast and works well. But if the input is not perfectly correct, LALR parsers produce no useful analysis of what went wrong. If Parus hears "d abc d", a parser like Marpa, on the other hand, can produce something like this:

# * String before error: abc d\s
# * The error was at line 1, column 7, and at character 0x0064 'd', ...
# * here: d
    

Parus uses its language in predatory contexts, and one can assume that a Parus with a preference for parsers whose error handling is on an LALR(1) level will not be keeping its alleles in the gene pool for very long.
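
For readers who want to experiment, here is a minimal Marpa::R2 sketch of the Parus subset above. The lexemes 'abc' and 'd' are assumptions made for this example; the full test script is in the gist referenced below. Feeding the sketch "d abc d" produces a diagnostic along the lines of the one just shown.

    #!/usr/bin/env perl
    # A minimal Marpa::R2 sketch for the Parus subset (lexemes assumed).
    use v5.26;    # for indented heredocs
    use strict;
    use warnings;
    use Marpa::R2;

    my $dsl = <<~'END_OF_DSL';
    :default ::= action => [ name, values ]
    lexeme default = latm => 1
    :start ::= S
    S   ::= ABC
    S   ::= D
    S   ::= ABC D
    S   ::= D ABC
    ABC ~ 'abc'
    D   ~ 'd'
    :discard ~ whitespace
    whitespace ~ [\s]+
    END_OF_DSL

    my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
    for my $input ( 'abc d', 'd abc d' ) {
        my $recce = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
        my $ok = eval { $recce->read( \$input ); 1 };
        print $ok ? "'$input' parses\n" : "'$input' fails:\n$@";
    }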

References, comments, etc.

Those readers content with sub-Parus parsing methods may stop reading here. Those with greater parsing ambitions, however, may wish to learn more about Marpa. A Marpa test script for parsing the Parus subset is in a Github gist. Marpa has a semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: Thankyou, Anonymous Benefactor!

Dave's Free Press: Journal: Number::Phone release

Dave's Free Press: Journal: Ill

Dave's Free Press: Journal: CPANdeps upgrade

Ocean of Awareness: Introduction to Marpa Book in progress

What follows is a summary of the features of the Marpa algorithm, followed by a discussion of potential applications. It refers to itself as a "monograph", because it is a draft of part of the introduction to a technical monograph on the Marpa algorithm. I hope the entire monograph will appear in a few weeks.

The Marpa project

The Marpa project was intended to create a practical and highly available tool to generate and use general context-free parsers. Tools of this kind had long existed for LALR and regular expressions. But, despite an encouraging academic literature, no such tool had existed for context-free parsing. The first stable version of Marpa was uploaded to a public archive on Solstice Day 2011. This monograph describes the algorithm used in the most recent version of Marpa, Marpa::R2. It is a simplification of the algorithm presented in my earlier paper.

A proven algorithm

While the presentation in this monograph is theoretical, the approach is practical. The Marpa::R2 implementation has been widely available for some time, and has seen considerable use, including in production environments. Many of the ideas in the parsing literature satisfy theoretical criteria, but in practice turn out to face significant obstacles. An algorithm may be as fast as reported, but may turn out not to allow adequate error reporting. Or a modification may speed up the recognizer, but require additional processing at evaluation time, leaving no advantage to compensate for the additional complexity.

In this monograph, I describe the Marpa algorithm as it was implemented for Marpa::R2. In many cases, I believe there are better approaches than those I have described. But I treat these techniques, however solid their theory, as conjectures. Whenever I mention a technique that was not actually implemented in Marpa::R2, I will always explicitly state that that technique is not in Marpa as implemented.

Features

General context-free parsing

As implemented, Marpa parses all "proper" context-free grammars. The proper context-free grammars are those which are free of cycles, unproductive symbols, and inaccessible symbols. Worst case time bounds are never worse than those of Earley's algorithm, and therefore never worse than O(n**3).

Linear time for practical grammars

Currently, the grammars suitable for practical use are thought to be a subset of the deterministic context-free grammars. Using a technique discovered by Joop Leo, Marpa parses all of these in linear time. Leo's modification of Earley's algorithm is O(n) for LR-regular grammars. Leo's modification also parses many ambiguous grammars in linear time.

Left-eidetic

The original Earley algorithm kept full information about the parse --- including partial and fully recognized rule instances --- in its tables. At every parse location, before any symbols are scanned, Marpa's parse engine makes available its information about the state of the parse so far. This information is in useful form, and can be accessed efficiently.
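
A short Marpa::R2 illustration of left-eideticism, using a toy grammar invented for the purpose: show_progress(), as documented for Marpa::R2::Scanless::R, dumps the completed, in-progress and predicted rule instances at a parse location.

    #!/usr/bin/env perl
    # Left-eideticism in practice: dump the parse state so far.
    # The toy grammar is an assumption; show_progress() is documented
    # for Marpa::R2::Scanless::R.
    use v5.26;    # for indented heredocs
    use strict;
    use warnings;
    use Marpa::R2;

    my $dsl = <<~'END_OF_DSL';
    :default ::= action => [ name, values ]
    lexeme default = latm => 1
    :start ::= pair
    pair ::= left right
    left  ~ 'abc'
    right ~ 'd'
    :discard ~ whitespace
    whitespace ~ [\s]+
    END_OF_DSL

    my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
    my $recce   = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
    $recce->read( \'abc d' );
    print $recce->show_progress();    # rule instances at the final location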

Recoverable from read errors

When Marpa reads a token which it cannot accept, the error is fully recoverable. An application can try to read another token. The application can do this repeatedly as long as none of the tokens are accepted. Once the application provides a token that is accepted by the parser, parsing will continue as if the unsuccessful read attempts had never been made.

Ambiguous tokens

Marpa allows ambiguous tokens. These are often useful in natural language processing where, for example, the same word might be a verb or a noun. Use of ambiguous tokens can be combined with recovery from rejected tokens so that, for example, an application could react to the rejection of a token by reading two others.
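
Here is a sketch of both features together, using the SLIF's external scanning calls (lexeme_alternative() and lexeme_complete(), as documented for Marpa::R2); the toy noun/verb grammar is an assumption. A rejected alternative simply returns undef, so the application can try another reading.

    #!/usr/bin/env perl
    # Ambiguous tokens via external scanning (toy grammar assumed).
    use v5.26;    # for indented heredocs
    use strict;
    use warnings;
    use Marpa::R2;

    my $dsl = <<~'END_OF_DSL';
    :default ::= action => [ name, values ]
    lexeme default = latm => 1
    :start ::= phrase
    phrase ::= noun | verb
    noun ~ [\w]+
    verb ~ [\w]+
    END_OF_DSL

    my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
    my $recce   = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
    my $input   = 'sounds';
    $recce->read( \$input, 0, 0 );    # read no characters: we scan externally

    # Offer the same span as two readings; a rejected reading returns undef.
    defined $recce->lexeme_alternative( 'noun', $input ) or warn "noun rejected\n";
    defined $recce->lexeme_alternative( 'verb', $input ) or warn "verb rejected\n";
    $recce->lexeme_complete( 0, length $input );

    # With both readings accepted the parse is ambiguous: repeated calls
    # to value() iterate the parse trees.
    my $count = 0;
    $count++ while defined $recce->value();
    print "readings: $count\n";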

Using the features

Error reporting

An obvious application of left-eideticism is error reporting. Marpa's abilities in this respect are ground-breaking. For example, users typically regard an ambiguity as an error in the grammar. Marpa, as currently implemented, can detect an ambiguity and report specifically where it occurred and what the alternatives were.

Event driven parsing

As implemented, Marpa::R2 allows the user to define "events". Events can be defined that trigger when a specified rule is complete, when a specified rule is predicted, when a specified symbol is nulled, when a user-specified lexeme has been scanned, or when a user-specified lexeme is about to be scanned. A mid-rule event can be defined by adding a nulling symbol at the desired point in the rule, and defining an event which triggers when the symbol is nulled.
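
A sketch of the event mechanism (the toy grammar is an assumption; the event syntax and method calls are as documented for Marpa::R2). The completion event pauses the scan, read() returns the pause position, and resume() continues:

    #!/usr/bin/env perl
    # SLIF parse events: pause when the <prefix> rule is complete.
    use v5.26;    # for indented heredocs
    use strict;
    use warnings;
    use Marpa::R2;

    my $dsl = <<~'END_OF_DSL';
    :default ::= action => [ name, values ]
    lexeme default = latm => 1
    :start ::= sentence
    sentence ::= prefix body
    prefix   ::= 'abc'
    body     ::= 'd'
    event 'prefix_done' = completed prefix
    :discard ~ whitespace
    whitespace ~ [\s]+
    END_OF_DSL

    my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
    my $recce   = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
    my $input   = 'abc d';
    my $pos     = $recce->read( \$input );
    while ( $pos < length $input ) {
        print "event '$_->[0]' at position $pos\n" for @{ $recce->events() };
        $pos = $recce->resume();
    }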

Ruby slippers parsing

Left-eideticism, efficient error recovery, and the event mechanism can be combined to allow the application to change the input in response to feedback from the parser. In traditional parser practice, error detection is an act of desperation. In contrast, Marpa's error detection is so painless that it can be used as the foundation of new parsing techniques.

For example, if a token is rejected, the lexer is free to create a new token in the light of the parser's expectations. This approach can be seen as making the parser's "wishes" come true, and I have called it "Ruby Slippers Parsing".

One use of the Ruby Slippers technique is to parse with a clean but oversimplified grammar, programming the lexical analyzer to make up for the grammar's shortcomings on the fly. As part of Marpa::R2, the author has implemented an HTML parser, based on a grammar that assumes that all start and end tags are present. Such an HTML grammar is too simple even to describe perfectly standard-conformant HTML, but the lexical analyzer is programmed to supply start and end tags as requested by the parser. The result is a simple and cleanly designed parser that parses very liberal HTML and accepts all input files, in the worst case treating them as highly defective HTML.

Ambiguity as a language design technique

In current practice, ambiguity is avoided in language design. This is very different from the practice in the languages humans choose when communicating with each other. Human languages exploit ambiguity in order to design highly flexible, powerfully expressive languages. For example, the language of this monograph, English, is notoriously ambiguous.

Ambiguity of course can present a problem. A sentence in an ambiguous language may have undesired meanings. But note that this is not a reason to ban potential ambiguity --- it is only a problem with actual ambiguity.

Syntax errors, for example, are undesired, but nobody tries to design languages to make syntax errors impossible. A language in which every input was well-formed and meaningful would be cumbersome and even dangerous: all typos in such a language would be meaningful, and the parser would never warn the user about errors, because there would be no such thing.

With Marpa, ambiguity can be dealt with in the same way that syntax errors are dealt with in current practice. The language can be designed to be ambiguous, but any actual ambiguity can be detected and reported at parse time. This exploits Marpa's ability to report exactly where and what the ambiguity is. Marpa::R2's own parser description language, the SLIF, uses ambiguity in this way.

Auto-generated languages

In 1973, Čulik and Cohen pointed out that the ability to efficiently parse LR-regular languages opens the way to auto-generated languages. In particular, Čulik and Cohen note that a parser which can parse any LR-regular language will be able to parse a language generated using syntax macros.

Second order languages

In the literature, the term "second order language" is usually used to describe languages with features which are useful for second-order programming. True second-order languages --- languages which are auto-generated from other languages --- have not been seen as practical, since there was no guarantee that the auto-generated language could be efficiently parsed.

With Marpa, this barrier is lifted. As an example, Marpa::R2's own parser description language, the SLIF, allows "precedenced rules". Precedenced rules are specified in an extended BNF. The BNF extensions allow precedence and associativity to be specified for each RHS alternative, as the sketch below shows.
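
Here is what a SLIF precedenced rule looks like, modeled on the calculator examples in the Marpa::R2 documentation. The '||' operator separates precedence levels, tightest first; '|' separates alternatives at the same precedence; and assoc adjusts associativity per alternative.

    #!/usr/bin/env perl
    # A precedenced rule in the SLIF (after the Marpa::R2 calculator examples).
    use v5.26;    # for indented heredocs
    use strict;
    use warnings;
    use Marpa::R2;

    my $dsl = <<~'END_OF_DSL';
    :default ::= action => [ name, values ]
    lexeme default = latm => 1
    :start ::= Expression
    Expression ::=
          Number
        | '(' Expression ')' assoc => group
       || Expression '**' Expression assoc => right
       || Expression '*' Expression
        | Expression '/' Expression
       || Expression '+' Expression
        | Expression '-' Expression
    Number ~ [\d]+
    :discard ~ whitespace
    whitespace ~ [\s]+
    END_OF_DSL

    my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
    my $recce   = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
    my $input   = '1 + 2 * 3';
    $recce->read( \$input );
    my $value_ref = $recce->value();    # tree reflects precedence: '*' binds tighter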

Marpa::R2's precedenced rules are implemented as a true second order language. The SLIF representation of the precedenced rule is parsed to create a BNF grammar which is equivalent, and which has the desired precedence. Essentially, the SLIF does a standard textbook transformation. The transformation starts with a set of rules, each of which has a precedence and an associativity specified. The result of the transformation is a set of rules in pure BNF. The SLIF's advantage is that it is powered by Marpa, and therefore the SLIF can be certain that the grammar that it auto-generates will parse in linear time.

Notationally, Marpa's precedenced rules are an improvement over similar features in LALR-based parser generators like yacc or bison. In the SLIF, there are two important differences. First, in the SLIF's precedenced rules, precedence is generalized, so that it does not depend on the operators: there is no need to identify operators, much less class them as binary, unary, etc. This more powerful and flexible precedence notation allows the definition of multiple ternary operators, and multiple operators with arity above three.

Second, and more important, a SLIF user is guaranteed to get exactly the language that the precedenced rule specifies. The user of the yacc equivalent must hope their syntax falls within the limits of LALR.

References, comments, etc.

Marpa has a semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: YAPC::Europe 2006 report: day 3
