grammar debugging

CTTL on
SourceForge

Download
Latest

Documentation
Index

Library
News

CVS
Repository

introduction

CTTL implementation takes advantage of assert() macro from <cassert> header, providing adequate amount of assertions that help catching most obvious bugs. To disable assertions from being compiled, define NDEBUG macro before including any of the library headers:

  #define NDEBUG
  #include "cttl/cttl.h"

Assertions fall into the category of run time diagnostics. In addition to assertions, the program could print a fair amount of run time debugging information, commonly known as trace. Traces provide significant assistance in understanding code and alleviating hard issues. Result of the trace (usually text messages sent to the standard output) gives a "big picture" with enough details to begin working on problems, find ways to optimize code, and simply verify that the code works as expected.

Tracing helps to debug grammars without conventional debuggers. CTTL traces are built into library implementation classes; therefore, a program can turn on traces simply by defining one of the trace-level macros. CTTL trace messages are sent to the standard output. Traces follow the trail of grammar evaluations with support for different levels of intensity. The messages are globally serialized for all threads. Current implementation of the tracing facility has no special support for multi-threaded environments. Consequently, at present time, all tracing activities should be limited to grammar debugging on a single thread.

Trace generation naturally slows down the speed of execution, but that kind of influence could be reduced if smaller inputs were submitted to the parser. Different issue arises when program creates large amounts of trace messages. For example, programming language parser could easily hammer down hundreds of megabytes of traces. When standard output is redirected, an operating system does a good job writing the information into the text file, usually much faster than printing same output to the console screen. In this way, creating a trace file is not a big challenge. On the other hand, reading and searching through the data could become a burden. Our recommendation is to split the information into roughly equivalent, but better manageable chunks of text. CTTL samples library includes the file splitter utility, named partition_file.cpp, which divides large text files into fractions of 100MB-sized text. This approach is often particularly productive, because pertinent information usually appears close to the end of the trace output.

trace level macros

Programmer may choose an appropriate level of tracing by defining one of the following macros:

Table D.1 - Trace level macros

Trace macro	description
CTTL_TRACE_TRIVIAL	Turns on messages generated by `CTTL_TRACE_MESSAGE()` macro.
CTTL_TRACE_RULES	Turns on rule()-level tracing, which includes traces produced by `CTTL_RULE()`, `CTTL_MEMBER_RULE()`, `CTTL_STATIC_RULE()`, as well as `CTTL_TRACE_MESSAGE()`.
CTTL_TRACE_EVERYTHING	Turns on grammar expression tracing, in addition to the rule()-level traces. This tracing mode generates most detailed level of run time trace information, which includes rules, and both terminal, and non-terminal symbol data.
CTTL_TRACE_SILENT	Brute force silencer of the standard output, effective in case of a need to silence an enormous flood of trace messages. For example, macro `CTTL_TRACE_SILENT( true )` turns all standard output to `std::cout` off and is equivalent to std::cout.flush(); std::cout.setstate( std::ios_base::badbit ); // silence // ... Macro `CTTL_TRACE_SILENT( false )` turns standard output back on and is equivalent to std::cout.clear(); // output back to vocal // ... `CTTL_TRACE_SILENT( true )` creates a temporary object on the stack, therefore, the silencer is effective only in the scope of the function. The destructor of the temporary object turns the output back on automatically: void f() { CTTL_TRACE_SILENT( true ); // silence inside f() // ... } void g() { f(); // output is back to vocal as g() continues // ... } The opposite is also true - `CTTL_TRACE_SILENT( false )` also creates a temporary object on the stack to enable output in the current scope of the function. Object destructor automatically turns the output off: void f() { CTTL_TRACE_SILENT( false ); // output becomes vocal inside f() // ... } void g() { CTTL_TRACE_SILENT( true ); // silence inside g() f(); // silence inside g() continues... }

grammar trace macros

In addition to built-in traces, rule()-level symbolic information can be added to the trace output by a set of trace macros:

Table D.2 - Grammar trace macros

CTTL_TRACE_MESSAGE( text_message ) where `text_message` is a null-terminated string.	Sends user `text_message` to the standard output.
CTTL_RULE( r ) where r is non-static member function representing a grammar rule.	`CTTL_RULE` macro falls into the category of member function adaptors. The argument is the address of a non-static member function that represents a grammar rule. The `CTTL_RULE` macro is similar to `rule(*this,&C::r)` adaptor, but requires only address of the class member function. When this macro is used, the member function context is assumed, so that the `this` pointer is a valid expression. Depending on the level of tracing, `CTTL_RULE` produces a trace that shows events of getting in and out of the encapsulated rule r. See `arithmetics_traced.cpp` sample program for an example of use.
CTTL_MEMBER_RULE( object, &r ) where r is non-static member function representing a grammar rule, and `object` is a reference to an instance of C++ class that defines r.	`CTTL_MEMBER_RULE` is similar to `CTTL_RULE`, but requires both object reference and class member function, similar to `rule(object,&C::r)` member function adaptor.
CTTL_STATIC_RULE( r ) where r is global or static member function representing a rule.	The macro can be used in place of `rule(r)` function adaptor.

example

A modified version of the arithmetic expression parser, named arithmetics_traced.cpp, demonstrates CTTL tracing facility. The traced version of the program uses debug macros to invoke grammar rules.

To begin, we compile the program without tracing levels defined. If input expression is "2*-3", the program generates the following output:

Input:
    2*-3

Output:
        1        ;
        2        PUSH    2
        3        ;
        4        PUSH    3
        5        ;
        6        POP     R1
        7        NEG     R1             ; R1 = -(3)
        8        PUSH    R1
        9        ;
        10       POP     R1
        11       POP     R2
        12       MUL     R2, R1         ; R2 = R2 *-3
        13       PUSH    R2

If incorrect arithmetic expression is entered, "2-*3", the output becomes

Input:
    2-*3

Output:
        1        ;
        2        PUSH    2
*** syntax error ***
2-*3
  ^-- at position 2

*** error: parser terminated: ***
2-*3
 ^-- at position 1

To deal with the "unexpected" problem, we define CTTL_TRACE_RULES macro, which turns on the rule-level traces. Information is captured at run time by CTTL_RULE() and CTTL_MEMBER_RULE() macros:

Input:
    2-*3

Output:
--------------------@2-*3-198->parser::rule_expr@0-4
--------------------@2-*3--85->parser::rule_term@0-4
--------------------@2-*3---60->parser::rule_factor@0-4
-----------------------@2----36->parser::numeric_literal@0-1
        1        ;
        2        PUSH    2
-----------------------@2++++36<-parser::numeric_literal@0-0    ''
---------------------@-*3+++60<-parser::rule_factor@0-1 '2'
---------------------@-*3++85<-parser::rule_term@0-1    '2'
----------------------@*3--95->parser::rule_term@2-4
----------------------@*3---60->parser::rule_factor@2-4
----------------------@*3----50->parser::parse_error@2-4
*** syntax error ***
2-*3
  ^-- at position 2
~~~~~~~~~~~~~~~~~~~~~~@*3~~~~50<-parser::parse_error@2
~~~~~~~~~~~~~~~~~~~~~~@*3~~~60<-parser::rule_factor@2
~~~~~~~~~~~~~~~~~~~~~~@*3~~95<-parser::rule_term@2
---------------------@-*3+198<-parser::rule_expr@0-1    '2'

*** error: parser terminated: ***
2-*3
 ^-- at position 1

Traces generated by CTTL_RULE(), CTTL_MEMBER_RULE(), and CTTL_STATIC_RULE() have the following format:

--------------------@2-*3-198->parser::rule_expr@0-4
                    ^^  ^ ^  ^ ^               ^^^ ^
                    ||  | |  | |               ||| |
                    ||  | |  | |               ||`-`--Offset range corresponds
                    ||  | |  | |               ||     to the upper and lower bounds
                    ||  | |  | |               ||     of the universe.
                    ||  | |  | |               ||
                    ||  | |  | |               |`---At sign '@' separates rule name
                    ||  | |  | |               |    from the universe offset range.
                    ||  | |  | |               |
                    ||  | |  | `---------------`---Name of the rule captured by
                    ||  | |  |                     CTTL_RULE(), CTTL_MEMBER_RULE(),
                    ||  | |  |                     and CTTL_STATIC_RULE() macros.
                    ||  | |  |
                    ||  | |  `----Arrow indicates that the rule is being invoked.
                    ||  | |
                    ||  | `------The line number in the C++ source file where the rule
                    ||  |        is being invoked.
                    ||  |
                    |`--`-------Fragment of the user input text to be parsed.
                    |
                    `----------At sign '@' in front of the input text fragment.

When evaluation of the rule completes, the trace shows

---------------------@-*3+198<-parser::rule_expr@0-1    '2'
                     ^   ^   ^                          ^
                     |   |   |                          |
                     |   |   |                          '--Single quoted literal
                     |   |   |                             corresponds to the
                     |   |   |                             matched fragment.
                     |   |   |
                     |   |   `------Arrow indicates return from the rule
                     |   |          back to line 198.
                     |   |
                     |   `------The plus sign indicates successful result of evaluation.
                     |
                     `------At sign '@' in front of the input fragment corresponds
                            to the upper boundary of the universe after evaluation.

CTTL_TRACE_EVERYTHING macro turns on traces of expression evaluation inside individual grammar rules, for example,

Input:
    2-*3

Output:
--------------------@2-*3-198->parser::rule_expr@0-4
--------------------@2-*3? {;       0-4 
--------------------@2-*3---85->parser::rule_term@0-4
--------------------@2-*3?   {;     0-4 
--------------------@2-*3-----60->parser::rule_factor@0-4
--------------------@2-*3?     {|       0-4 
--------------------@2-*3?      {|      0-4 
--------------------@2-*3?       {|     0-4 
--------------------@2-*3?        {|        0-4 
--------------------@2-*3?         {&       0-4 
-----------------------2@|          $       1-4 
-----------------------@2-----------36->parser::numeric_literal@0-1
    1    ;
    2    PUSH    2
-----------------------@2+++++++++++36<-parser::numeric_literal@0-0 ''
                                   }
                                  }
                                 }
                                }
                               }
---------------------@-*3+++++60<-parser::rule_factor@0-1   '2'
---------------------@-*3?    {*        1-4 
---------------------@-*3?     {|       1-4 
---------------------@-*3?      {&      1-4 
---------------------@-*3?       {;     1-4 
~~~~~~~~~~~~~~~~~~~~~~~2@~        *     1-4 FAIL char
                                 }
~~~~~~~~~~~~~~~~~~~~~~~2@~       &      1-4 FAIL 
                                }
---------------------@-*3?      {&      1-4 
---------------------@-*3?       {;     1-4 
~~~~~~~~~~~~~~~~~~~~~~~2@~        /     1-4 FAIL char
                                 }
~~~~~~~~~~~~~~~~~~~~~~~2@~       &      1-4 FAIL 
                                }
~~~~~~~~~~~~~~~~~~~~~~~2@~      |       1-4 FAIL 
                               }
                              }
                             }
---------------------@-*3+++85<-parser::rule_term@0-1   '2'
---------------------@-*3?  {*      1-4 
---------------------@-*3?   {|     1-4 
---------------------@-*3?    {&        1-4 
---------------------@-*3?     {;       1-4 
~~~~~~~~~~~~~~~~~~~~~~~2@~      +       1-4 FAIL char
                               }
~~~~~~~~~~~~~~~~~~~~~~~2@~     &        1-4 FAIL 
                              }
---------------------@-*3?    {&        1-4 
---------------------@-*3?     {;       1-4 
----------------------2-@|      -       2-4 char
----------------------@*3-------95->parser::rule_term@2-4
----------------------@*3?       {;     2-4 
----------------------@*3---------60->parser::rule_factor@2-4
----------------------@*3?         {|       2-4 
----------------------@*3?          {|      2-4 
----------------------@*3?           {|     2-4 
----------------------@*3?            {|        2-4 
----------------------@*3?             {&       2-4 
~~~~~~~~~~~~~~~~~~~~~~2-@~              $       2-4 FAIL 
~~~~~~~~~~~~~~~~~~~~~~2-@~              &       2-4 FAIL 
                                       }
----------------------@*3?             {;       2-4 
----------------------@*3?              {;      2-4 
~~~~~~~~~~~~~~~~~~~~~~2-@~               (      2-4 FAIL char
                                        }
                                       }
~~~~~~~~~~~~~~~~~~~~~~2-@~             |        2-4 FAIL 
                                      }
----------------------@*3?            {;        2-4 
~~~~~~~~~~~~~~~~~~~~~~2-@~             -        2-4 FAIL char
                                      }
~~~~~~~~~~~~~~~~~~~~~~2-@~            |     2-4 FAIL 
                                     }
----------------------@*3?           {;     2-4 
~~~~~~~~~~~~~~~~~~~~~~2-@~            +     2-4 FAIL char
                                     }
~~~~~~~~~~~~~~~~~~~~~~2-@~           |      2-4 FAIL 
                                    }
----------------------@*3-----------50->parser::parse_error@2-4
*** syntax error ***
2-*3
  ^-- at position 2
~~~~~~~~~~~~~~~~~~~~~~@*3~~~~~~~~~~~50<-parser::parse_error@2
~~~~~~~~~~~~~~~~~~~~~~2-@~          |       2-4 FAIL 
                                   }
~~~~~~~~~~~~~~~~~~~~~~@*3~~~~~~~~~60<-parser::rule_factor@2
                                 }
~~~~~~~~~~~~~~~~~~~~~~@*3~~~~~~~95<-parser::rule_term@2
                               }
~~~~~~~~~~~~~~~~~~~~~~~2@~     &        1-4 FAIL 
                              }
~~~~~~~~~~~~~~~~~~~~~~~2@~    |     1-4 FAIL 
                             }
                            }
                           }
---------------------@-*3+198<-parser::rule_expr@0-1    '2'

*** error: parser terminated: ***
2-*3
 ^-- at position 1

Individual trace lines contain trace symbols corresponding to terminal and non-terminal units of grammar. For example, the following lines show trace of evaluation of two characters. Character 'a' matched at offset 1, but evaluation of 'b' has failed:

-----------------------a@| a            1-15    char
~~~~~~~~~~~~~~~~~~~~~~~a@~ b            1-15    FAIL char

Traces of non-terminal symbols appear enclosed by pair of braces. For example, when expression

	symbol('a') + symbol('b')

is evaluated against input text "abc", the trace output may look like

---------------------@abc?{;            0-3
-----------------------a@| a            1-3    char
----------------------ab@| b            2-3    char
                          }

To distinguish between sequences and Kleene plus operators, sequence operator, '+', is replaced by a semicolon, ';'. The question mark in front of the sequence, '?', indicates that result of the evaluation is yet to be known. Pipe characters, '|', separate fragments of input text from the symbol evaluation data.

The lines formed by tilde characters, '~', indicate failed symbols. In the following example, grammar expression

	symbol('a') ^ symbol('c')

fails against input text "abc":

---------------------@abc?{^            0-3
-----------------------a@| a            1-3    char
~~~~~~~~~~~~~~~~~~~~~~~a@~ c            1-3    FAIL char
~~~~~~~~~~~~~~~~~~~~~~~~@~ ^            0-3    FAIL
                          }

Lexeme symbol('a') successfully matches character 'a', but symbol('c') fails to match 'b'. The next line of trace shows that concatenation operator '^', connecting these two symbols, also fails.

trace symbols

The following table specifies various CTTL trace symbols:
Table D.3 - CTTL trace symbols

Trace symbol description

{e
Represents edge expression adaptor construct.

eN
where N is non-negative integer corresponding to the node identity that represents upper boundary of the edge. Displays result of the edge expression adaptor evaluation.

{n
Represents node expression adaptor construct.

nN
where N is a non-negative integer corresponding to the node identity. Displays result of the node expression adaptor evaluation.

i
Kleene list iteration messages, one of the following:
"kleene_list: iteration made no progress: bailing out" "kleene_list: epsilon match: bailing out" "kleene_list: user-specified match limit: bailing out"

{!
Unary search operator.

{!!
Identifies repeatable search operator.

{*
Identifies Kleene star operator.

{+
Identifies Kleene plus operator.

{E
Uppercase 'E' represents entity expression adaptor, written as entity(R), which enforces non-empty match of the underlying expression R

{;
Semicolon indicates relaxed sequence operator, R1 + R2. The semicolon chosen to distinguish sequence from Kleene plus operators.

{|
Pipe symbols represent set union and POSIX set union operators.

{&
Specifies set intersection operator.

{"
Identifies various quote symbols.

L
Identifies diagnostic messages generated by lexeme terminal symbols, one of the following:
"empty universe" "lexeme match out-of-universe bounds" "lexeme match inside void space" "lexeme find: invalid universe" "lexeme find: out-of-universe bounds" "lexeme find: lower bound inside void region" "lexeme find: upper bound inside void region"

0
Digit '0' represents symbol(false) and begin(false) lexemes.

1
Digit '1' represents symbol(true) and begin(true) lexemes.

S
Uppercase 'S' represents string set lexeme.

<
Indicates upper boundary of the character entity, specified by begin(char), begin(is...), and begin(text) lexemes.

>
Indicates lower boundary of the character entity, specified by end(char), end(is...), and end(text) lexemes.

F
Indicates first character of the character entity, specified by first(is...) and first(text) lexemes.

$
Indicates character entity corresponding to entity(is...) and entity(text) lexemes.

T
Identifies symbol(text) lexeme, specifying user-defined character entity.

A
begin() lexeme, corresponding to the beginning of the user input.

Z
Identifies end() lexeme, which corresponds to the lower boundary of the universe.

R
Represents grammar rule invoked by one of the function adaptors. Note that traces of grammar rule calls can be expanded by CTTL_RULE(), CTTL_MEMBER_RULE(), and CTTL_STATIC_RULE() macros.

{u
Trace produced by unary operator of lambda expression.

{b
Trace produced by binary operator of lambda expression.

{d
Trace produced by lambda compound set of deposit instructions.

Trace symbol	description
{e	Represents edge expression adaptor construct.
eN	where `N` is non-negative integer corresponding to the node identity that represents upper boundary of the edge. Displays result of the edge expression adaptor evaluation.
{n	Represents node expression adaptor construct.
nN	where `N` is a non-negative integer corresponding to the node identity. Displays result of the node expression adaptor evaluation.
i	Kleene list iteration messages, one of the following: "kleene_list: iteration made no progress: bailing out" "kleene_list: epsilon match: bailing out" "kleene_list: user-specified match limit: bailing out"
{!	Unary search operator.
{!!	Identifies repeatable search operator.
{*	Identifies Kleene star operator.
{+	Identifies Kleene plus operator.
{E	Uppercase `'E'` represents entity expression adaptor, written as `entity(R)`, which enforces non-empty match of the underlying expression R
{;	Semicolon indicates relaxed sequence operator, `R1 + R2`. The semicolon chosen to distinguish sequence from Kleene plus operators.
{\|	Pipe symbols represent set union and POSIX set union operators.
{&	Specifies set intersection operator.
{"	Identifies various quote symbols.
L	Identifies diagnostic messages generated by lexeme terminal symbols, one of the following: "empty universe" "lexeme match out-of-universe bounds" "lexeme match inside void space" "lexeme find: invalid universe" "lexeme find: out-of-universe bounds" "lexeme find: lower bound inside void region" "lexeme find: upper bound inside void region"
0	Digit `'0'` represents `symbol(false)` and `begin(false)` lexemes.
1	Digit `'1'` represents `symbol(true)` and `begin(true)` lexemes.
S	Uppercase `'S'` represents string set lexeme.
<	Indicates upper boundary of the character entity, specified by `begin(char)`, `begin(is...)`, and `begin(text)` lexemes.
>	Indicates lower boundary of the character entity, specified by `end(char)`, `end(is...)`, and `end(text)` lexemes.
F	Indicates first character of the character entity, specified by `first(is...)` and `first(text)` lexemes.
$	Indicates character entity corresponding to `entity(is...)` and `entity(text)` lexemes.
T	Identifies `symbol(text)` lexeme, specifying user-defined character entity.
A	`begin()` lexeme, corresponding to the beginning of the user input.
Z	Identifies `end()` lexeme, which corresponds to the lower boundary of the universe.
R	Represents grammar rule invoked by one of the function adaptors. Note that traces of grammar rule calls can be expanded by `CTTL_RULE()`, `CTTL_MEMBER_RULE()`, and `CTTL_STATIC_RULE()` macros.
{u	Trace produced by unary operator of lambda expression.
{b	Trace produced by binary operator of lambda expression.
{d	Trace produced by lambda compound set of deposit instructions.

After returning back to the arithmetic expression parser example, and realizing that the problem was data related, we can change the data string back to the proper input format, "2*-3", and rerun the program. To reduce the volume of output, we can also scale the trace back to the CTTL_TRACE_RULES level (compare to full version):

Input:
    2*-3

Output:
--------------------@2*-3-198->parser::rule_expr@0-4
--------------------@2*-3--85->parser::rule_term@0-4
--------------------@2*-3---60->parser::rule_factor@0-4
-----------------------@2----36->parser::numeric_literal@0-1
        1        ;
        2        PUSH    2
-----------------------@2++++36<-parser::numeric_literal@0-0    ''
---------------------@*-3+++60<-parser::rule_factor@0-1 '2'
----------------------@-3---64->parser::rule_factor@2-4
-----------------------@3----43->parser::rule_factor@3-4
-----------------------@3-----36->parser::numeric_literal@3-4
        3        ;
        4        PUSH    3
-----------------------@3+++++36<-parser::numeric_literal@3-3   ''
------------------------@++++43<-parser::rule_factor@3-4        '3'
-----------------------@3----43->parser::unary_minus@3-4
        5        ;
        6        POP     R1
        7        NEG     R1             ; R1 = -(3)
        8        PUSH    R1
-----------------------@3++++43<-parser::unary_minus@3-3        ''
------------------------@+++64<-parser::rule_factor@2-4 '-3'
---------------------@*-3---66->parser::multiply@1-4
        9        ;
        10       POP     R1
        11       POP     R2
        12       MUL     R2, R1         ; R2 = R2 *-3
        13       PUSH    R2
---------------------@*-3+++66<-parser::multiply@1-1    ''
------------------------@++85<-parser::rule_term@0-4    '2*-3'
------------------------@+198<-parser::rule_expr@0-4    '2*-3'

advice

To catch faulty inputs as early as possible, a good parser ought to employ assertive program design, based on highly restrictive set of grammars. In general, the initial grammar should unambiguously and accurately describe the input language. Even so, often the grammar has to change before it can used inside a program.

Because EBNF grammars are language-neutral, sets of production rules are often developed independently from the actual parser implementation. For heavy-duty applications, variances in grammar descriptions of the same input language may introduce a degree of ambiguity or inaccuracy. Many formal EBNF definitions are written for human audiences, creating possibility for inconsistency, redundancy, and sometimes hard-to-detect typographical errors. When original productions are re-written by automated tools, certain rules could become misinterpreted and turn into less affirmative constructs, if compared to the initial grammar requirements.

After initial inspection, elimination of left factoring and left recursion (also referred to as grammar transformations) are common steps in accommodation of new grammars. Soon after, a review of miscellaneous kleene quantifiers should take place. For example, kleene-based production rule R may succeed on empty string, but the parent rules that invoke R may not be willing to accept the empty match as a satisfactory outcome. Therefore, rules invoking R could use more restrictive expression, such as entity(R) to protect themselves against empty matches.

While traces may help to explore dynamics of grammar evaluation, understanding of a large trace output can be intimidating experience, particularly if the trace has considerable characteristics of the length and depth.

Debugging with traces requires disciplined approach of narrowing down the search to places with highest degree of relevance to the potential bug. Statements (sentences) of input failing to pass grammar evaluation yield clues about the problem. A handful of earlier, properly evaluated statements, should also be identified and analyzed. Each iterative step may require restarting the program with considerably smaller amount of input, sufficient to reproduce the bug. Eventually, the search area will become specific to a particular grammar rule, lexeme, or operator, directly associated with the problem.

In many cases it may be helpful to locate traces where lexer has made best progress in successful evaluation of the input text. When working with absolute offsets, consider using offset2line.cpp utility, or a similar tool, that can translate absolute offset to the actual line number in the input file.

Don't be discouraged by large amounts of trace information, routinely generated by parser programs. Use tools to split the output files into smaller, manageable pieces. Most of the time the problem will occur at the end of the execution; therefore, the tail of the trace will be a logical place to begin looking for clues.

Permission to copy, use, modify, sell and distribute this document is granted provided this copyright notice appears in all copies. This document is provided "as is" without express or implied warranty, and with no claim as to its suitability for any purpose.

Generated on Thu Nov 2 17:44:54 2006 for Common Text Transformation Library by

1.3.9.1