Marpa::R2::NAIF::Progress(3) Progress reports for the NAIF

About this document

This document describes Marpa's progress reports for its named argument interface (NAIF). If you are a beginner, or are not sure which interface you are interested in, or do not know what the NAIF interfaces is, you probably are looking for the document on progress reports for the SLIF interface.

Progress reports allow an application to know exactly where it is in the parse at any point. For parse locations of the user's choosing, progress reports list all the rules in play, and indicate the location at which the rule started, and how far into the rule parsing has progressed.

For those new to parsing, this ``situational awareness'' might seem to be a feature that they can expect from any servicable parser. In fact, situation awareness in most parsers is extremely limited, and it tends to be more limited in those parsers considered production quality. Marpa is not just unusual in the amount of feedback it offers the application, it breaks new ground.

Progress reports are extremely useful in debugging grammars. Because debugging creates a clear ``narrative'', the detailed example in this document is a debugging situation. Readers specifically interested in debugging a grammar should read the document on tracing problems before reading this document.

Introduction to Earley parsing

To read the "show_progress" output, it is important to have a basic idea of what Earley items are, and of what the information in them means. Everything that the user needs to know is explained in this section.

Dotted rules

The idea behind Earley's algorithm is that you can parse by building a table of rules and where you are in those rules. ``Where'' means two things: location in the rule relative to the rule's symbols, and location relative to the parse's input stream.

Let's look at an example of a rule in a context-free grammar. Here's the rule for assignment from the Perl distribution's "perly.y"

"    termbinop -> term ASSIGNOP term"

"ASSIGNOP" is "perly.y"'s internal name for the assignment operator. In plain Perl terms, this is the ""="" character.

In parsing this rule, we can be at any of four possible locations. One location is at the beginning, before all of the symbols. The other three locations are immediately after each of the rule's three symbols.

Within a rule, position relative to the symbols of the rule is traditionally indicated with a dot. In fact, the symbol-relative rule position is very often called the dot location. Taken as a pair, a rule and a dot location are called a dotted rule.

Here's our rule with a dot location indicated:

"    termbinop -> X term ASSIGNOP term"

The dot location in this dotted rule is at the beginning. A dot location at the beginning of a dotted rule means that we have not recognized any symbols in the rule yet. All we are doing is predicting that the rule will occur. A dotted rule with the dot before all of its symbols is called a prediction or a predicted rule.

Here's another dotted rule:

"    termbinop -> term X ASSIGNOP term"

In this dotted rule, we are saying we have seen a "term", but have not yet recognized an "ASSIGNOP".

There's another special kind of dotted rule, a completion. A completion (also called a completed rule) is a dotted rule with the dot after all of the symbols. Here is the completion for the rule that we have been using as an example:

"    termbinop -> term ASSIGNOP term X"

A completion indicates that a rule has been fully recognized.

Earley items

The dotted rules contain all but one piece of the information that Earley's algorithm needs to track. The missing piece is the second of the two ``wheres'': where in the input stream. To associate input stream location and dotted rules, Earley's algorithm uses what are now called Earley items.

A convenient way to think of an Earley item is as a triple, or 3-tuple, consisting of dotted rule, origin and current location. The origin is the location in the input stream where the dotted rule starts. The current location (also called the dot location) is the location in the input stream which corresponds to the dot position.

Two noteworthy consequences follow from the way in which origin and current location are defined. First, if a dotted rule is a prediction, then origin and current location will always be the same. Second, the input stream location where a rule ends is not tracked unless the dotted rule is a completion. In other cases, an Earley item does not tell us if a rule will ever be completed, much less at which location.

The example

For this example of debugging, I've taken a very common example of a grammar and deliberately introduced a problem. (All the code and the full trace outputs for this example are in the Appendix.) I've commented out the correct start rule:

        ## { lhs => 'Expression', rhs => [qw/Term/] },

and replaced it with another start rule, one which will cause problems:

        { lhs => 'Expression', rhs => [qw/Factor/] },

In what follows, we'll pretend we don't already know where the problem is, and use the Marpa diagnostics and tracing facilities to ``discover'' it.

First warning

Right off the bat, we get two warning messages:

    Inaccessible symbol: Add
    Inaccessible symbol: Term

If we were alert, these would be enough to tell us there is a serious problem. ``Inaccessible'' symbols are symbols which cannot be reached from the start symbol. This means that the grammar will never produce them, and that parses will never find them in the input.

Since "Add" and "Term" are both important symbols in our application, that should tell us our grammar has a serious problem. In fact, these warning messages would often be enough to point us to the error. But, in order to look at more of Marpa's tracing facilities, let's pretend we have not had our morning coffee, and that we miss the significance of these warning messages.

Output from trace_terminals()

Before looking at Marpa's progress reports, it is usually best to orient yourself by looking at the output from "trace_terminals". Typically, you will be interested in the last tokens to be accepted, and perhaps also in the tokens the recognizer was looking for when it didn't find what it wanted. Sometimes that information alone is enough to make it clear where the problem is.

The full "trace_terminals" output for this example is in the Appendix. We see that the recognizer seems to accept ""42*1"`` but it fails when it confronts the plus sign (''"+""). The last two lines are:

    Accepted "Number" at 2-3
    Expecting "Multiply" at 3

A note in passing: Marpa shows the location of the tokens it accepts as a range of locations. For "Number", the range is ""2-3"", indicating that the token starts at location 2 and ends at location 3. In its default input model, all tokens have length 1, so this is clearly information overkill. But Marpa allows other input models, and in those models the information about start and end location of the token is important.

Returning to the problem at hand: We notice that at location 3 we are expecting a "Multiply" operator, but not an "Add" operator. That should strike us as strange, and send us back to the grammar. But for the sake of our example we will assume that we are slow on the uptake today, and that this does not clue us in. We move on.

Output from show_progress()

Marpa's most powerful tool for debugging grammars is its progress report, which shows the Earley items being worked on. In the Appendix, progress reports for the entire parse are shown. Our example in this document is a very small one, so that producing progress reports for the entire parse is a reasonable thing to do in this case. If a parse is at all large, you will usually need to be selective.

The progress report that is usually of most interest is the one for the Earley set that you were working on when the error occurred. This is called the current location. In our example the current location is location 3. By default, "show_progress" prints out only the progress reports for the current location.

Here are the progress reports for the current location, location 3, from our example.

      F0 @0-3 Expression -> Factor .
      F2 @2-3 Factor -> Number .
      R4:1 x2 @0,2-3 Factor -> Factor . Multiply Factor
      F4 @0-3 Factor -> Factor Multiply Factor .

Progress report lines

The last field of each Progress Report line shows, in fully expanded form, the dotted rule we were working on. Since that is the most important information, it may be tempting to skip the rest of this section, and move directly forward with the debugging.

In fact, you might want to do exactly that --- skip to the beginning of the next section. What follows talks about the details of the format of the first few fields in each progress report line. These first few fields, while helpful, are also usually one or more of obvious in their meaning, not relevant to our example, and repetitive of information which can be deduced from other fields.

      F0 @0-3 Expression -> Factor .

Prefixed to the dotted rule are two fields: ""F0 @0-3"``. The ''"F0"" says that this is a completed or final rule, and that it is rule number 0. The rule number is used in other tracing and debugging output, when displaying the whole rule would take too much space. In what follows we won't need the rule number.

The ""@0-3"`` describes the location of the dotted rule in the parse. In its simplest form, the location field is two location numbers, separated by a hyphen. The first location number is the origin, the place where Marpa first started recognizing the rule. The last location number is the dot location, the location location of the dot in a dotted rule. ''"@0-3"" say that this rule began at location 0, and that the dot is at location 3.

The current location is location 3, and this is no coincidence. Whenever we are displaying the progress report for an location, all the progress report lines will have their dot location at that location.

As an aside, notice that the left hand side symbol is "Expression". That is the start symbol. The presence of a completed start rule in our progress report indicates that if our input ended at location 3, it would be a valid sentence in the language of our grammar.

Let's look at another progress report line:

      R4:1 x2 @0,2-3 Factor -> Factor . Multiply Factor

Here the ""R4:1"`` indicates that this is rule number 4 (the ''"R"" stands for rule number) and that its dot position is after the first symbol on the right hand side. Symbol positions are numbered using the ordinal of the symbol just before the position. Symbols are numbered starting with 1, and symbol position 1 is the position immediately after symbol 1.

The next field (""x2"``) is new. It is a count. A progress report can contain multiple instances of the same dotted rule, and when there is more than one, a count field is included in the progress report line. Here the ''"x2"" indicates that there are two instances of "Factor -> Factor . Multiply Factor" at this location.

Multiple instances of a dotted rule will differ in their origin, and where they do, this is shown in the location field of the progress report line. Here the location field is ""@0,2-3"", which indicates that one instance of this dotted rule has its origin at location 0, and the other has its origin at location 2. All instances reported on a single progress report line will always have the same dot location, and in this case it is location 3.

Predicted rules also appear in progress reports:

    P2 @2-2 Factor -> . Number

Here the ""P"`` in the summary field means ''predicted". As with much of the information in the summary field, this only repeats what is obvious from the full expansion of the dotted rule later in the line. But final (or completed) and predicted rules can be important and the initial "F" and "P" make these lines easy to spot.

Notice that in the predicted rule, the origin is the same as the dot location. This will always be the case with predicted rules.

For any given location, no predicted rule has more than one instance. For other dotted rules, there may be many instances of the dotted rule at a single location. In grammars with right recursion, the number of instances is limited only by the length of the recursion. The length of a recursion is limited primarily by the available memory.

When there are many instances of a dotted rule at a single location, it is inconvenient to show all the origins in a comma-separated list. In that case the origins in the location field are shown as a range, with the earliest separated from the most recent by a ""..."". The example in this document contains no lines with a large number of instances, but here is an example from another grammar. This is the progress report line for the completed rule in a right recursion of length 20.

    F1 x20 @0...19-20 Top_sequence -> Top Top_sequence .

OK! Now to find the bug

Here again are progress reports at the location where things went wrong:

      F0 @0-3 Expression -> Factor .
      F2 @2-3 Factor -> Number .
      R4:1 x2 @0,2-3 Factor -> Factor . Multiply Factor
      F4 @0-3 Factor -> Factor Multiply Factor .

We see that we have completed rules for "Expression", and "Factor", as expected. We also see two Earley items that show that we are in the process of building another "Factor", and that it is expecting a "Multiply" symbol. This is not the rule we want, but it explains why the "trace_terminals" output showed that the recognizer was expecting a "Multiply" symbol.

What we want to know is, why is the recognizer not expecting an "Add" symbol? Looking back at the grammar, we see that only one rule uses the "Add" symbol: the rule ""Term -> Term Add Term"". The next step is to look at the Earley items for this rule. But there is a problem. We don't find any.

Next, we ask ourselves, what is the earliest place the ""Term -> Term Add Term"`` rule should be appearing? The answer is that there should be a prediction of ''"Term -> Term Add Term"" at location 0. So we look at the predictions at location 0.

      P0 @0-0 Expression -> . Factor
      P2 @0-0 Factor -> . Number
      P4 @0-0 Factor -> . Factor Multiply Factor

No ""Term -> Term Add Term"`` rule. We are never even predicting a ''"Term -> Term Add Term"" rule. We look back at the grammar, and start from the beginning.

    { lhs     => 'Expression', rhs => [qw/Factor/] },
    { lhs => 'Term',       rhs => [qw/Factor/] },
    { lhs => 'Factor',     rhs => [qw/Number/] },
    {   lhs    => 'Term',
        rhs    => [qw/Term Add Term/],
        action => 'do_add'
    },
    {   lhs    => 'Factor',
        rhs    => [qw/Factor Multiply Factor/],
        action => 'do_multiply'
    },

The start symbol is "Expression" and we do see a rule with "Expression" on the left hand side. "Expression" in turn produces a "Factor" symbol, and there are two rules with "Factor" on the left hand side.

But none of these rules ever produce a "Term". In fact, however far we follow the productions, no rule ever produces a "Term". At this point we see the problem: If we start at the start symbol, and follow the rules of our grammar, we will never get to a "Term" symbol. Which is exactly what that first warning message was saying.

Now that we know what is wrong, we can reread our grammar, and see that our "Expression -> Factor" rule is wrong. It should be "Expression -> Term". Change that and the problem is fixed.

Empty rules

When a symbol is nulled in your parse, "show_progress" does not show its expansion into rules. This reduces clutter, and seems to be the intuitive way to treat nulled rules.

Access to the raw progress report information

This section deals with the "progress()" recognizer method, which allows access to the raw progress report information. This method is not needed for typical debugging and tracing situations. It is intended for applications which want to leverage Marpa's ``situational awareness'' in innovative ways.

progress()

    my $report0 = $recce->progress(0);

    my $latest_report = $recce->progress();

Given the parse location (Earley set ID) as its argument, the "progress()" recognizer method returns a reference to an array of ``report items''. The parse location may be negative. An argument of -X will be interpreted as location N+X+1, where N is the latest Earley set. This means that an argument of -1 indicates the latest Earley set, an argument of -2 indicates the Earley set just before the latest one, etc.

Each report item is a triple: an array of three elements. The three elements are, in order, rule ID, dot position, and origin. The data returned by the two displays above, as well as the data for the other parse locations in our example, are shown below.

The rule ID is the same number that Marpa uses to identify rules in tracing and debugging output. Given a rule ID, an application can find its LHS and RHS symbols using the grammar's "rule()" method.

Dot position is -1 for completions, and 0 for predictions. Where the report item is not for a completion or a prediction, dot position is N, where N is the number of RHS symbols successfully recognized at the parse location of the progress report.

Origin is parse location (Earley set ID) at which the rule application reported by the report item began. For a prediction, origin will always be the same as the location of the parse report.

Progress reports and efficiency

When progress reports are used for production parsing, instead of just for debugging and tracing, efficiency considerations become significant. We will start with the good news. "progress()" is implemented in C, so that the application can usually expect calls to "progress()" to be extremely fast.

Now, as to the potential bad news: Applications planning frequent use of calls to "progress()" need to be aware that, where there is a right recursion at a parse location, "progress()" needs to follow the entire chain of recursions to create the progress report. That this is happening will always be evident from the parse report itself, which will contain one report item for every completion in the right-recursive chain of completions. If an application tries to produce progress reports for a large number of parse locations, and these parse locations have long right recursive chains, it can prove computationally expensive.

The efficiency consideration just mentioned for following right recursions is not an issue for left recursions. Left recursions only produce at most two report items per parse location and are extremely fast to process. It is also not an issue for Marpa's sequence rules, because sequence rules are implemented internally as left recursions.

Appendix: full code and output for the example

Below are the code, the trace outputs and the progress report for the example used in this document.

Code

    my $grammar = Marpa::R2::Grammar->new(
        {   start          => 'Expression',
            actions        => 'My_Actions',
            default_action => 'first_arg',
            rules          => [
                ## This is a deliberate error in the grammar
                ## The next line should be:
                ## { lhs => 'Expression', rhs => [qw/Term/] },
                ## I have changed the Term to 'Factor' which
                ## will cause problems.
                { lhs => 'Expression', rhs => [qw/Factor/] },
                { lhs => 'Term',       rhs => [qw/Factor/] },
                { lhs => 'Factor',     rhs => [qw/Number/] },
                {   lhs    => 'Term',
                    rhs    => [qw/Term Add Term/],
                    action => 'do_add'
                },
                {   lhs    => 'Factor',
                    rhs    => [qw/Factor Multiply Factor/],
                    action => 'do_multiply'
                },
            ],
        }
    );

    $grammar->precompute();
    my @tokens = (
        [ 'Number',   42 ],
        [ 'Multiply', q{*} ],
        [ 'Number',   1 ],
        [ 'Add',      q{+} ],
        [ 'Number',   7 ],
    );
    sub My_Actions::do_add {
        my ( undef, $t1, undef, $t2 ) = @_;
        return $t1 + $t2;
    }
    sub My_Actions::do_multiply {
        my ( undef, $t1, undef, $t2 ) = @_;
        return $t1 * $t2;
    }
    sub My_Actions::first_arg { shift; return shift; }
    my $recce = Marpa::R2::Recognizer->new(
        { grammar => $grammar, trace_terminals => 2 } );
    my $token_ix = 0;
    TOKEN: for my $token_and_value (@tokens) {
        last TOKEN if not defined $recce->read( @{$token_and_value} );
    }
    $progress_report = $recce->show_progress( 0, -1 );

Trace output

    Inaccessible symbol: Add
    Inaccessible symbol: Term
    Setting trace_terminals option
    Expecting "Number" at earleme 0
    Accepted "Number" at 0-1
    Expecting "Multiply" at 1
    Accepted "Multiply" at 1-2
    Expecting "Number" at 2
    Accepted "Number" at 2-3
    Expecting "Multiply" at 3
    Rejected "Add" at 3-4

Note the use of the term ``earleme''. If you are using the default input model, you can assume that earleme means ``location'': the earleme and the location will always be exactly the same. Advanced users, using alternative input models, may set it up so that earleme and location are two different things, and in that case the distinction will matter.

show_progress() output

    P0 @0-0 Expression -> . Factor
    P2 @0-0 Factor -> . Number
    P4 @0-0 Factor -> . Factor Multiply Factor
    F0 @0-1 Expression -> Factor .
    F2 @0-1 Factor -> Number .
    R4:1 @0-1 Factor -> Factor . Multiply Factor
    P2 @2-2 Factor -> . Number
    P4 @2-2 Factor -> . Factor Multiply Factor
    R4:2 @0-2 Factor -> Factor Multiply . Factor
    F0 @0-3 Expression -> Factor .
    F2 @2-3 Factor -> Number .
    R4:1 x2 @0,2-3 Factor -> Factor . Multiply Factor
    F4 @0-3 Factor -> Factor Multiply Factor .

progress() outputs

These section contains the output of the "progress()" method --- the progress reports in their ``raw'' format. The output is shown in Data::Dumper format, with "Data::Dumper::Indent" set to 0 and "Data::Dumper::Terse" set to 1.

The "Data::Dumper" output from "progress()" at location 0:

    [[0,0,0],[2,0,0],[4,0,0]]

The "Data::Dumper" output from "progress()" at location 1:

    [[0,-1,0],[2,-1,0],[4,1,0]]

The "Data::Dumper" output from "progress()" at location 2:

    [[2,0,2],[4,0,2],[4,2,0]]

The default "progress()" output is for the latest Earley set, which is location 3 in our example. Here is the "progress()" output for location 3.

    [[0,-1,0],[2,-1,2],[4,-1,0],[4,1,0],[4,1,2]]

Copyright and License

  Copyright 2014 Jeffrey Kegler
  This file is part of Marpa::R2.  Marpa::R2 is free software: you can
  redistribute it and/or modify it under the terms of the GNU Lesser
  General Public License as published by the Free Software Foundation,
  either version 3 of the License, or (at your option) any later version.
  Marpa::R2 is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  Lesser General Public License for more details.
  You should have received a copy of the GNU Lesser
  General Public License along with Marpa::R2.  If not, see
  http://www.gnu.org/licenses/.