
grammar evaluation


CTTL grammar evaluation can be summarized by the following statement:

       P = R.M( U );

where:

       R is a C++ expression describing a grammar production rule;
       M is a grammar evaluation method: match(), find(), bang_find(), or runtime_match();
       U is a parseable universe;
       P is the result of evaluation.

The following sections provide overviews of related topics:

alphabets and symbols

The mathematical definition of an alphabet associates symbols with individual characters. An abstract alphabet, on the other hand, may contain symbols that are either individual characters or larger character entities, such as substrings. For example, the keywords of a programming language could belong to an abstract alphabet over which the language is defined.

The CTTL definition of a symbol corresponds to an abstract symbol of some abstract alphabet. From the parser's point of view, symbols are the smallest syntactic units of the language, otherwise known as grammar tree terminals.


Related pages: CTTL | Grammar evaluation | Grammar reference

grammar expressions

A set of grammar production rules defines a language. Each rule contains a series of terminal and non-terminal symbols, recognized as a grammar expression.


overview

Grammar expression R describes patterns of symbols in the input language. Within the CTTL framework, EBNF expressions are both human- and C++ compiler-readable. To specify individual symbols, the library provides a variety of lexeme functions, for example,

    symbol( 'Z' )     // matches a single character, 'Z'
    +symbol( 'Z' )    // matches one or more characters 'Z'
    symbol( "abc" )   // matches exact text "abc"
    entity( ispunct ) // matches a sequence of punctuation characters

Overloaded C++ operators defined within the library allow programmers to construct C++ expressions representing syntactic groupings of symbols, known as non-terminals. For example, the binary operator '^' specifies strict concatenation of two symbols:

	// matches notation for hexadecimal literals, such as 0xFF:
	symbol( "0x" ) ^ entity( isxdigit )

Parts of a grammar expression are compiled into parts of the lexical analyzer, also called a lexer or scanner. The lexer recognizes elements of the input sequence and translates them into a stream of tokens. Examples of tokens are the identifiers, operators, and literals of a programming language.

A parser program can implement a loop that iteratively invokes a particular grammar rule and then processes the result of lexical analysis upon return from each call. This is a typical approach for parsing sentences of a regular language (also known as a type 3 language). Writing such a program with CTTL is simple: there is no need to prototype a custom finite state machine (FSM) for the language scanner; a proper FSM is compiled directly from the components of a given CTTL expression.

Unfortunately, an arithmetic expression in infix notation, such as "a+b", cannot be described by a regular grammar. It requires a context-free grammar, usually defined by a set of BNF or EBNF productions, to capture its rules. If a CTTL program needs to parse a context-free language, it invokes a function designated as the starting rule of the language. Furthermore, the parser employs a set of semantic actions to collect and process data.

An alternative to semantic actions would be automatic abstract syntax tree (AST) generation, but such a solution is less flexible than the semantic action-based approach to processing parsed tokens. If your application requires routine creation of a tree, you may try the Spirit or Hapy parser generator frameworks.

Automatic tree generation is absent from the text transformation library, because it is difficult to anticipate the kind of tree a user may want; a tree container is not part of the STL. However, an STL-like C++ tree class is available for download and can be populated by CTTL semantic action functions. A tree container can also be emulated with an STL map that uses a vector key. For instance, an "integer tree of strings", typedefed as std::map< std::vector< int >, std::string >, allows imitation of a homogeneous tree with random access to its nodes.

To help programmers with syntax trees, the CTTL library includes two utility classes, cttl::inode_writer and cttl::inode_reader, which facilitate creation and navigation of integer syntax trees.

The CTTL library sample xml_parser.cpp demonstrates generation of an abstract syntax tree to store XML elements, populated by the semantic actions of the sample parser class. Most grammars, including XML, require heterogeneous tree nodes to accommodate evaluation results. This requirement separates ASTs from STL containers, which, by design, store homogeneous objects. Certain applications, for example an XPath search engine, impose further requirements on the tree functionality. An XPath query is likely to need random access to the nodes of the tree. Therefore, an auxiliary index structure may be needed to map values of XML nodes (elements, attributes, attribute values, and so on) to the physical locations of the nodes in the underlying tree container. These and other considerations make it difficult to design a general-purpose tree container. Usually, application requirements dictate the characteristics of trees qualified for a heavy-duty implementation.

The lexer invokes semantic actions based on events triggered by lexical analysis. Such events may signal successful completion of a particular production rule, or an error, usually caused by the lexer's transition to an undefined state.

Depending on the structure of the grammar expression, the C++ compiler generates either a Nondeterministic Finite Automaton (NFA) or a Deterministic Finite Automaton (DFA) implementation. (Every NFA can be converted into an equivalent DFA whose states correspond to collections of NFA states.) Thus, the generated CTTL solution may be a DFA, an NFA, or a combination of both. Multiple NFAs may be connected into a bigger NFA.

Because the low-level details of lexical analysis are known before the code is compiled, classes and functions of the library encapsulate pre-fabricated parts of NFA and DFA algorithms. The compiler translates grammar expressions into finite state automata interconnected by calls to the production rule functions. This compile-time solution relies on C++ operator overloading and template meta-programming techniques.

The following grammar expression invokes production rule named digits, defined in rules.cpp sample:
    rule( digits )
    +
    ( '.' + rule( digits ) ) * 1
An EBNF grammar production rule is written with a non-terminal symbol specifying the name of the rule on the left side, and a grammar expression containing terminal and non-terminal symbols on the right side. Every non-terminal, in its turn, can include a pattern of terminal and non-terminal symbols. While individual CTTL grammar expressions represent non-terminal symbols, productions are defined as C++ functions encapsulating those grammar expressions. This way, a grammar rule invokes other rules defined as C++ functions. Ultimately, the parser becomes a collection of recursive procedures.

A CTTL parser program follows the top-down parsing technique. The lexical analyzer reads user input from left to right. If a non-terminal symbol is present in the grammar expression, it is immediately evaluated. If the lexer must choose between two non-terminals, the leftmost is always evaluated first. This process easily becomes recursive, since any non-terminal may refer to other non-terminals, including itself.



left factoring

Depending on the types of operators chosen to construct grammar expressions, grammar rules may be predictive or backtracking. Sometimes branches of a grammar expression begin the same way:

    ( symbol( "if" )
      +
      '('
      +
      rule( boolean_expression )
      +
      ')'
      +
      rule( statement )
      +
      "else"
      +
      rule( statement )
    )
    |
    ( symbol( "if" )
      +
      '('
      +
      rule( boolean_expression )
      +
      ')'
      +
      rule( statement )
    )
The expression is acceptable, but it is inefficient, since it causes the lexer to backtrack and re-evaluate the same fragment more than once. Such lack of determinism can be eliminated by splitting the original rule into two, reorganizing the common part into a separate grammar rule:
    rule( if_head ) + rule( if_tail )
where if_head is defined as
    symbol( "if" )
    +
    '('
    +
    rule( boolean_expression )
    +
    ')'
    +
    rule( statement )
and if_tail is defined as
    (
      "else"
      +
      rule( statement )
    )
    |
    begin( true )
To make the parser predictive, we factored out the common sub-expression on the left side of the original production rule by creating a new rule named if_head. This kind of transformation toward a deterministic parser is called left factoring.

Left factoring generally improves parser performance. The left factoring transformation compacts a BNF grammar containing a disjunction of two or more sub-expressions connected by the set union '|' operator. For example, the BNF rule
    S --> P Q | P R
can be transformed into
    S --> P ( Q | R )
where P, Q, R, and S are grammar productions. Additionally, commonalities may exist deeper in separate productions:
    S --> A | B
    A --> P Q
    B --> P R
which can be optimized into the more efficient set of productions
    S --> P ( A | B )
    A --> Q
    B --> R



left recursion

Left recursion arises when a BNF rule invokes itself as the first non-terminal of its own definition, that is, as the leftmost symbol on the right-hand side of the production. It is a pitfall for top-down parsers like CTTL and should be avoided.

                                                      Grammar A

    <expr> --> <term> | <expr> ( '+' | '-' ) <term>
This BNF suggests that fragment matching <expr> always starts with a fragment that matches <term>. Furthermore, the matching fragment can be indefinitely extended by adding plus or minus sign, followed by a <term>. For example, if symbols x, y, and z qualify as terms, then input strings
    x
    x+y
    x-y+z
    x+x+y-z
    z-z-z+z-z
will satisfy the rule <expr>. If written "as is" in C++, the grammar A leads to infinite recursion:
    size_t expr( UniverseT& universe_ )
    {
        return (

            rule( term )
            |
            (
                rule( expr )  // endless recursion!
                +
                ( symbol( '+' ) | '-' )
                +
                rule( term ) 
            )

        ).match( universe_ );
     }
Every repetitive fragment of input is covered by BNF sub-expression "( '+' | '-' ) <term>", which can be pushed out into an auxiliary BNF rule named <expr_suffix>:
                                                      Grammar B

    <expr_suffix> --> ( '+' | '-' ) <term> <expr_suffix> | <epsilon>
           <expr> --> <term> <expr_suffix>
This grammar transformation removes left recursion from the original grammar; grammar B uses right recursion instead. The Kleene star operator simplifies things a step further by replacing recursion with plain iteration:
                                                      Grammar B'
    <expr_suffix> --> ( ( '+' | '-' ) <term> )*
           <expr> --> <term> <expr_suffix>

                                                      Grammar C

           <expr> --> <term> ( ( '+' | '-' ) <term> )*
The C++ version of grammar C can be written as
    size_t expr( UniverseT& universe_ )
    {
        return (

            rule( term )
            +
            *(
                ( symbol( '+' ) | '-' )
                +
                rule( term ) 
            )

        ).match( universe_ );
    }
Steps to remove left recursion and to left factor are known as grammar transformations. Formal algorithms exist to remove left recursion. In addition to direct left recursion, a variation known as indirect, or cyclic, left recursion may also be found in BNF grammars. This type of recursion happens when production P invokes another production, R, through a chain of productions, and R refers back to the original production P in a left-recursive way:
    P --> Q rest-of-P
    Q --> R rest-of-Q
    R --> P rest-of-R
The CTTL library does not support BNF left recursion, and requires its removal or replacement with the alternative grammar constructs discussed in this section. Grammar transformations carry a risk of losing the semantics of the initial, unmodified grammar. In particular, all modifications should eliminate the risk of accidental changes to the operator associativity or precedence specified by the original grammar.



parseable universe

Substring classes
    cttl::const_edge    // constant substring access
    cttl::edge          // mutable substring access

specify the universe of discourse U, which can be parsed by the grammar rule R. The universe is accepted by the grammar evaluation method M as an input parameter, passed to the method as a mutable reference. The evaluation method M may change the universe boundaries as follows:

While a grammar expression is evaluated, the universe U gets consumed. Evaluation starts at the beginning of the universe, that is, at the upper boundary position specified by the node U.first. With every match found by the grammar expression R, the physical offset of the node U.first advances forward to point to the lower boundary (the right-hand-side ending position) of the matched symbols. As a result, the node U.first continuously points to the unparsed symbols of the input. The absolute offset P returned by the evaluation method M corresponds to the upper boundary of the matched fragment, pointing to the symbols that were already parsed.

For example, assume that letters x, y, and z represent unparsed symbols of the language, and the input contains a sequence of symbols "xxxyyyyyyyzzz". If universe U points to the substring "yyyyyyy" of the input, then the state of the system before grammar evaluation looks like

   xxxyyyyyyyzzz
      ^      ^
      |      |
   first    second
      |      |
      '---U--'

If grammar rule R finds a match for the substring "ABC" at the beginning of the universe, it advances the U.first node to the end of the matched symbols. According to the terminology of grammar evaluation, the fragment "ABC" was consumed by the evaluation method M:

   xxxABCyyyyzzz
      ^  ^   ^
      |  |   |
      |first second
      |  |   |
      P  '-U-'

The universe consumption stops when evaluation of rule R is complete. If the evaluation succeeds, the caller of M receives the absolute offset P pointing to the upper boundary of the matched fragment. In addition, the caller receives the consumed universe, whose node U.first points to the lower boundary of the matched symbols. If the evaluation method M fails, it returns the value of string::npos, where string is the type of the user input string. In case of failure, the method restores (backtracks) the boundaries of the universe to their original positions.

If the length of the edge U becomes zero or negative, the universe is considered to be fully exhausted:

   xxxABCDEFGzzz
      ^      ^
      |      |
      |     first
      |     second
      |      |
      P      U

Internally, the implementation of the CTTL lexer engine depends on the type of the universe. Two types of universe are available: cttl::const_edge for constant substring access, and cttl::edge for mutable substring access.




grammar evaluation methods

A CTTL grammar expression R is resolved by the compiler as an instance of a C++ template class, which exposes public member functions named grammar evaluation methods. The evaluation methods are the entry points into the CTTL lexical analysis implementation compiled for the rule R. There are four grammar evaluation methods:

Table E.1 - Grammar evaluation methods

    method name          description
    R.match()            Matches symbols at the beginning of the universe.
    R.find()             Searches for a matching fragment from the beginning of the universe.
    R.bang_find()        Executes a repeatable search for a matching fragment from the beginning of the universe.
    R.runtime_match()    Performs a run-time flag lookup and delegates the work to match(), find(), or bang_find().

Each method takes mutable reference to the substring representing the universe, and returns an absolute offset corresponding to the upper boundary of the matched symbols, for example:

	template< typename UniverseT >
	size_t find( UniverseT& U )
	{
		// ...
	}

In case of success, the evaluation method updates the universe by forwarding the position of the node U.first to the lower boundary (the right-hand-side ending position) of the matched fragment of input. If evaluation fails, the method returns the value of string::npos, where string is the string type used by the program, for example std::string. The failure case has no universe-related side effects: the universe boundaries remain unchanged.


method match()

Method match() attempts to match the grammar expression immediately at the position specified by the upper boundary of the universe, U.first. This method is also known as match-evaluation.


method find()

Method find() initiates a search for the leftmost terminal symbol L of the grammar expression R, starting at the beginning of the universe, U.first. If symbol L is found, the rest of the expression R is matched against the input using match-evaluation.


method bang_find()

Much like find-evaluation, the evaluation method bang_find() initiates a repeatable search for the leftmost terminal symbol L of the grammar expression R: if L is found, the rest of the components of rule R are matched against the input by match-evaluation. If the latter evaluation fails, bang_find() resumes the search for symbol L, continuing until either a successful match or the full exhaustion of the universe. For example,

// sample code: yellow_taxi.cpp

#include <iostream>
#include "cttl/cttl.h"

using namespace cttl;

int main()
{
    input<> inp( "yellow pages were found in yellow taxi" );
    const_edge< policy_space<> > U( new_edge( inp ) );
    size_t P = ( symbol( "yellow" ) + "taxi" ).bang_find( U );
    assert( P != std::string::npos );   // assert bang_find() succeeded
    return 0;
}

If rule R contains only one terminal symbol L in its pattern, the evaluation methods find() and bang_find() behave in exactly the same way. For instance, the outcomes of the following find() and bang_find() evaluations are always identical, regardless of the user input:

size_t f( cttl::const_edge<>& universe_ )
{
	return symbol( "ABC" ).find( universe_ );
	// result is exactly the same as
	// return symbol( "ABC" ).bang_find( universe_ );
}
On the other hand, the result of a more complex grammar expression depends on the grammar evaluation method: find() is likely to fail sooner than bang_find(). Indeed, if either the symbol("ABC") or the entity(isdigit) sub-expression fails, the bang_find() evaluation repeats the search for the leftmost terminal, symbol("ABC"):
size_t f( cttl::const_edge<>& universe_ )
{
	return ( symbol( "ABC" ) + entity( isdigit ) ).bang_find( universe_ );
}


leftmost terminal symbol

Once production replacements have been applied to an EBNF grammar, all non-terminals are replaced by a sequence of terminal symbols matching the input sentence. In case of successful evaluation, each production rule contains at least one terminal symbol. The lexeme function describing the first symbol on the left side of the terminal sequence is called the leftmost terminal symbol.


search transitivity

Regardless of the grammar structure and complexity, only the leftmost terminal symbol can be the target of a search method. This property of CTTL grammar expressions is called search transitivity. Search transitivity is guaranteed by two common design characteristics of the find() and bang_find() method implementations. First, because all overloaded binary operators have left-to-right associativity, they remain search-transitive by delegating incoming search requests to their left-hand-side expressions. Second, the remaining non-terminal constructs, such as unary operators, adaptors, and quotes, maintain search transitivity by delegating search requests to their underlying sub-expressions. As soon as a find() or bang_find() request reaches the first terminal symbol (which is, by definition, the leftmost symbol in the expression), the search request is consumed, that is, executed. All remaining terminal symbols in the expression are evaluated by the match() method.

The following table shows examples of search transitivity. If e is an expression adaptor and R, R1, and R2 are valid grammar expressions, then the rules on the left side of the table can be rewritten in the equivalent form on the right side:

Table E.2 - Search transitivity examples

    search expression    equivalent to
    !e( R )              e( !R )
    !( R1 + R2 )         ( !R1 ) + R2
    !+R                  ( !R ) + *R


search parsers

The search methods find() and bang_find() make it possible to write parsers that search for fragments of text corresponding to patterns of symbols identified by a set of grammar production rules. Search parsers usually implement only a small fraction of the input grammar, sufficient to unambiguously describe the required search pattern. A typical search parser is a search-and-replace program.

Another library sample, cpp_comment_strip.cpp, demonstrates a search parser. The program loads a C/C++ or Java source file, searches for fragments of text corresponding to program comments, and prints the stripped version of the source file on the screen.


method runtime_match()

The evaluation method runtime_match() should be used inside grammar production functions representing subrules. Subrules can be invoked by the rule() adaptor with the unary search operators "!" or "!!". The runtime_match() of the subrule examines the flag set by the rule() adaptor and, if the "!" or "!!" operator was applied in front of rule(), invokes the corresponding find() or bang_find() evaluation method. If neither operator was used, the work is delegated to the match() evaluation method. For example,

// sample code: runtime_match.cpp

#include <iostream>
#define CTTL_TRACE_EVERYTHING   //define to turn tracing on
#include "cttl/cttl.h"

using namespace cttl;

struct parser {
    size_t subrule( const_edge< policy_space<> >& edge_ )
    {
        return ( symbol( "yellow" ) + "taxi" ).runtime_match( edge_ );
    }

    size_t start( const_edge< policy_space<> >& edge_ )
    {
        return ( !!rule( *this, &parser::subrule ) ).match( edge_ );
    }
};

int main()
{
    input<> inp( "yellow pages were found in yellow taxi" );
    const_edge< policy_space<> > U( new_edge( inp ) );
    parser parser_taxi;
    size_t P = parser_taxi.start( U );
    assert( P != std::string::npos );   // assert bang_find() succeeded
    return 0;
}


evaluation result

The result returned by the evaluation method M of a grammar production rule R is a non-negative absolute offset P, corresponding to the upper bound (beginning) of the leftmost terminal symbol L of the successfully matched pattern of symbols. For example,

// sample code: Z.cpp

#include "cttl/cttl.h"

using namespace cttl;

int main()
{
    input<> inp( "Z" );
    const_edge<> U( new_edge( inp ) );
    // evaluation method is "match"
    // grammar expression is "symbol( 'Z' )"
    size_t P = symbol( 'Z' ).match( U );
    assert( P == 0 );
    return 0;
}

If an evaluation method fails, it returns the value of string::npos, where string is the type of the string used by the program, for example std::string. The npos result is the mechanism for reporting an undefined DFA state to the CTTL lexer implementation.

When the lexer receives an undefined state from a production rule function, a semantic action, or an evaluation method of a sub-expression, it backtracks to the last known successful position and continues with an alternative path through the grammar, if one is available. The set of grammar alternatives is formulated as a disjunction of evaluation paths linked together with set union operators.

Semantic actions capable of handling errors often choose to register the error condition and then recover, instead of an unconditional transition to an undefined state. Recovery helps the parser discover more errors in a single pass through the input. The job of recovery is to reach a synchronizing boundary of the current grammar construct - a phrase, an expression, or a statement. Once recovered, the lexer continues with the main line of evaluation. Implementation of error recovery can be tricky, as it may involve private evaluation of auxiliary grammar constructs.

The set intersection expression, RL & RR, treats the result of the right-hand-side operand RR as a boolean. Prior to RR evaluation, the lexer saves on the stack the result of the left operand, RL, as well as the position of the universe upper boundary. If RR succeeds, the universe backtracks to the saved position, and the result of RL becomes the result of the set intersection. This behavior renders RR unsuitable for error recovery, since its successful result is thrown away. If the RR operand is a semantic action, it can only process data, or fail unconditionally by returning npos.

On the other hand, constructs such as the sequence RL + RR and the concatenation RL ^ RR allow production rules and semantic actions to return input positions of their choice, constrained only by the validity rules of the edge class that represents the universe of discourse in the current context. The RR parts of the sequence and concatenation are proper placeholders for full-featured CTTL error handlers.



production rule functions

By nature, EBNF grammar rules can invoke other rules recursively. CTTL supports EBNF recursion by means of C++ functions.

A user can define a C++ function r() that encapsulates a grammar expression R corresponding to an individual EBNF grammar rule. A prototype of such a function may look like

	size_t r( edge<>& U )
	{
		return R.M( U );
	}
where R is the grammar expression, M is the evaluation method, and U is the universe of discourse. Distinct grammar production rules can therefore be organized into a set of C++ functions invoking one another recursively. For example, a parser for fractional numbers might be organized as a set of global functions named start(), digits(), and fract(). The rules are connected by the rule() function adaptors:

// sample code: rules.cpp
// demonstrates EBNF grammar rules

#include "cttl/cttl.h"

using namespace cttl;

// EBNF grammar for fractional number can be written as: 
//
//  <start>  := '-'* <fract>
//  <fract>  := <digits> ('.' <digits>)*
//  <digits> := ( '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' )+
//
// With cttl, these rules can be written as C++ functions:
//

size_t digits( const_edge<>& universe_ )
{
    return entity( isdigit ).match( universe_ );
}

size_t fract( const_edge<>& universe_ )
{
    return (
        rule( digits )
        +
        ( '.' + rule( digits ) ) * 1

    ).match( universe_ );
}

size_t start( const_edge<>& universe_ )
{
    return ( *symbol( '-' ) + rule( fract ) ).match( universe_ );
}

int main()
{
    input<> inp( "123.45" );
    const_edge<> U( new_edge( inp ) );
    size_t P = rule( start ).match( U );
    assert( P == 0 );
    assert( U.length() == 0 );
    return 0;
}

To call C++ functions from grammar expressions, the library provides function adaptors similar to the STL adaptor std::mem_fun1_t. The family of adaptor functions named cttl::rule() allows invoking user-defined global, static, and class member functions with the following signatures:

// Examples of function signatures that can be used
// as wrapper functions for CTTL grammar rules:

size_t gfc( const_edge<>& );
size_t gfm( edge<>& );

class X {
    static size_t sfc( const_edge<>& );
    static size_t sfm( edge<>& );
    size_t mfc( const_edge<>& );
    size_t mfcc( const_edge<>& ) const;
    size_t mfm( edge<>& );
    size_t mfmc( edge<>& ) const;
};
where
	gfc  is a global function accepting constant universe;
	gfm  is a global function accepting mutable universe;
	sfc  is a static member function of class X accepting constant universe;
	sfm  is a static member function of class X accepting mutable universe;
	mfc  is a member function of class X accepting constant universe;
	mfcc is a const member function of class X accepting constant universe;
	mfm  is a member function of class X accepting mutable universe;
	mfmc is a const member function of class X accepting mutable universe;

In addition to this, a family of function objects such as

	struct RC  { size_t operator() ( const_edge<>& ); };
	struct RCC { size_t operator() ( const_edge<>& ) const; };
	struct RE  { size_t operator() ( edge<>& ); };
	struct REC { size_t operator() ( edge<>& ) const; };

could be used to encapsulate CTTL grammar expressions. For example,

// sample code: functor.cpp
// demonstrates function object rule_char used inside grammar rule.

#include "cttl/cttl.h"

using namespace cttl;

class rule_char : public std::unary_function< const_edge<>, size_t >{
    char m_char;

public:
    rule_char( char char_ )
        :   m_char( char_ )
    {}

    size_t operator() ( const_edge<>& edge_ ) const
    {
        return symbol( m_char ).match( edge_ );
    }
};


int main()
{
    input<> inp( "ZZZA" );
    const_edge<> U( new_edge( inp ) );
    char L = 'Z';
    rule_char R( L );           // construct grammar rule for a single character
    // the following statement describes grammar production rule R+ 'A'
    // and evaluates this grammar against specified universe by
    // invoking "match()" evaluation method:
    size_t P = ( +rule( R ) + 'A' ).match( U );
    assert( P == 0 );           // assert match found at the beginning of U
    assert( U.length() == 0 );  // make sure universe is fully consumed
    return 0;
}


semantic actions

During grammar expression evaluation, tokens are either instantly discarded or handed to the parser component for validation, storage, and other data-related tasks. Consequently, the parser receives and consumes the data generated by the lexer. A parser function that stores the data, or takes an immediate processing step during lexical analysis, is commonly known as a semantic action. The set of all semantic actions represents the parser component of the program. If a semantic action detects an error, it can signal the error condition back to the lexer. Based on the structure of the grammar rule, the lexer may choose to proceed with an alternative grammar evaluation path, or discontinue further input processing and return the error condition to the caller.

In CTTL, semantic actions are similar to production rule functions: they have the same signature and the same kind of evaluation result. In addition to semantic actions, the latest CTTL release adds support for lambda expressions, higher-order functions (functional composition), and closures (delayed function calls).

To make a clean distinction between grammar rules and semantic actions, the lexer and parser may be implemented as two separate classes, for example:

// sample code: word_count.cpp
// demonstrates stateful parser and lexer implementation.

//#define NDEBUG    // must appear before assert.h is included to stop assertions from being compiled 
//#define CTTL_TRACE_EVERYTHING

#include <iostream>
#include "cttl/cttl.h"

using namespace cttl;

template< typename UniverseT >
struct base_parser {
    // parser defines two kinds of universes:
    typedef UniverseT universe_T;
    typedef typename UniverseT::strict_edge_T strict_universe_T;

    // semantic actions:
    size_t count_words( strict_universe_T& ) const
    {
        return 0;
    }

    size_t replace_words( strict_universe_T& ) const
    {
        return 0;
    }
};

template< typename UniverseT >
struct parser : public base_parser< UniverseT > {

    // parser defines two kinds of universes:
    typedef UniverseT universe_T;
    typedef typename UniverseT::strict_edge_T strict_universe_T;

    int count;

    parser( int count_ )
        :
        count( count_ )
    {
    }

    // semantic actions:
    size_t count_words( strict_universe_T& universe_ )
    {
        ++count;
        return universe_.second.offset();
    }

    /*
    size_t replace_words( strict_universe_T& universe_ )
    {
        universe_ = "<WORD/>";
        return universe_.second.offset();
    }
    */
};

template< typename ParserT >
struct lexer : public ParserT {

    // lexer defines two kinds of universes:
    typedef typename ParserT::universe_T universe_T;
    typedef typename universe_T::strict_edge_T strict_universe_T;

    // lexer static data:
    std::set< std::string > keywords;

    lexer( int count_ = 0 )
        :   ParserT( count_ )
    {
        // populate list of keywords:
        keywords.insert( "abc" );
        keywords.insert( "xyz" );
    }

    // grammar rule definitions
    size_t start( universe_T& universe_ )
    {
        return (
            // at least one word should be present
            +(
                // invoke grammar rule
                rule( *this, &lexer< ParserT >::word )
                &
                // invoke semantic action
                rule( *this, &ParserT::count_words )
                &
                // invoke another semantic action
                rule( *this, &ParserT::replace_words )
            )

        ).match( universe_ )
        ;
    }

    size_t word( universe_T& universe_ ) const
    {
        // a word can be anything made of alphabetic
        // characters, but not a keyword:
        return (

            isalpha
            -
            begin( keywords )

        ).match( universe_ );
    }
};


int main(int argc, char* argv[])
{
    if ( argc == 1 ) {
        std::cout
            << "Usage: on the command line, enter some words to count, for example,"
            << std::endl
            << '\t'
            << argv[ 0 ]
            << " abc def ghi"
            << std::endl
            ;

        return 1;
    }

    // construct input string from the command line arguments:
    input<> inp( &argv[ 1 ], ' ' );

    // construct universe to be parsed:
    typedef const_edge< policy_space<> > universe_T;

    universe_T universe( new_edge( inp ) );

    // construct the parser:
    lexer< parser< universe_T > > word_parser;

    // count words:
    if ( word_parser.start( universe ) != std::string::npos ) {
        std::cout << "Word count: " << word_parser.count << std::endl;

    } else {
        std::cout << "*** parser failed ***" << std::endl;
        return 1;
    }

    std::cout << "Input: " << inp.text() << std::endl;
    return 0;
}

Related pages: CTTL | Grammar evaluation | Grammar reference

strict universe: strict_edge_T

In the word_count.cpp sample above, the template parameter UniverseT specifies the type of parseable universe provided by the user. The sample lexer and parser classes declare their use of the UniverseT type as follows:

template< typename UniverseT >
struct parser : public base_parser< UniverseT > {

    // parser defines two kinds of universes:
    typedef UniverseT universe_T;
    typedef typename UniverseT::strict_edge_T strict_universe_T;
    //...
};

template< typename ParserT >
struct lexer : public ParserT {

    // lexer defines two kinds of universes:
    typedef typename ParserT::universe_T universe_T;
    typedef typename universe_T::strict_edge_T strict_universe_T;
    //...
};
The semantic action functions parser< UniverseT >::count_words(), base_parser< UniverseT >::count_words(), and base_parser< UniverseT >::replace_words() have a different signature than the member functions implementing grammar rules: semantic actions expect a reference to a strict_universe_T object as their input parameter:
    size_t sa( strict_universe_T& universe_ )
    {
        //...
    }
Here, sa is the name of the semantic action. The strict_universe_T is the type name of the modified universe:
    typedef typename UniverseT::strict_edge_T strict_universe_T;
The variations of function signatures in different parts of the program are imposed by the overloaded binary operators: concatenation '^', set intersection '&', and set complement '-'. These operators remove the user-provided white space policy from the type of the universe and replace it with its strictest form, strict_policy_T, which by default is a typedef of policy_default.

The usage of strictness in this context refers to white space sensitivity: a stricter policy pays less attention to white space, whereas a more relaxed policy processes more of it. To allow switching between the user-defined and the stricter white space mode, CTTL edge template classes expose typedefs named strict_policy_T and strict_edge_T. Internally, CTTL grammar evaluation methods can redefine their white space policy as follows:

    template< typename UniverseT >
    size_t match( UniverseT& universe_ )
    {
        typename UniverseT::strict_edge_T strict_universe( universe_ );
        // use strict_universe instead of universe_
        //...
    }
The policy replacement takes place before the universe is presented to the right-hand-side expression of the binary operator. For example:
    rule( *this, &lexer< ParserT >::word )
    &
    // original policy is replaced with policy_default
    rule( *this, &ParserT::count_words )
Therefore, the right-hand-side expression, as well as the semantic action ParserT::count_words() itself, should be able to parse a universe of the modified type, UniverseT::strict_edge_T:

    size_t count_words( strict_universe_T& universe_ )
    {
        ++count;
        return universe_.second.offset();
    }
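The typedef machinery behind this policy replacement can be sketched in isolation. The edge, policy_default, and policy_space types below are hypothetical stand-ins, not CTTL's actual definitions; they only show how a strict_edge_T typedef rebinds an edge to the strictest policy:

```cpp
#include <type_traits>

// Hypothetical default (strict) policy: recognizes no white space.
struct policy_default {
    typedef policy_default strict_policy_T;
};

// Hypothetical user policy that skips spaces.
// Inherits strict_policy_T = policy_default.
struct policy_space : policy_default {
};

// Hypothetical edge: parameterized by its white space policy, it
// exposes a typedef of itself rebound to the strictest policy.
template< typename PolicyT = policy_default >
struct edge {
    typedef typename PolicyT::strict_policy_T strict_policy_T;
    typedef edge< strict_policy_T > strict_edge_T;
};

// A binary operator can hand its right-hand side a universe of type
// UniverseT::strict_edge_T, stripping the user's policy:
static_assert( std::is_same<
    edge< policy_space >::strict_edge_T,
    edge< policy_default > >::value,
    "strict_edge_T rebinds the edge to policy_default" );
```

Under this scheme, any expression compiled against `edge< policy_space >` can mechanically derive the stricter universe type without knowing which user policy was supplied.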

Related pages: CTTL | Grammar evaluation | Grammar reference

white space policy

Languages contain "strings" which represent various "symbols", such as words, punctuation characters, numbers, and so on. Symbols are typically separated either by white space, or by predefined combinations of other symbols not considered part of the input language. For example, compilers are known to ignore programming language comments. Comments are commonly intended for a human audience, or for tools that use comment areas as metadata storage. The interpreted programming language MUMPS allowed programmers to store data inside the comment lines of a program. By convention, a MUMPS precompiler would not strip such comments away from the generated code.

Given all this variety, the lexer engine benefits from a facility that specifies a "white space" category of symbols at compile time, and then stays sensitive to that category at run time. The text transformation library defines a parseable universe specialized by a compile-time policy class, yielding a white-space-aware universe.

CTTL grammar expressions determine C++ template classes, which become the actual implementation components of the lexer engine once the program is compiled. The implementation classes have entry points named evaluation methods, defined as member template functions accepting a mutable reference to the parseable universe. The universe, described by an instance of either the const_edge or the edge template class, exposes its policy object to the lexer implementation. The lexer delegates white space management to the interface of the policy class.

A CTTL edge is a composition of two nodes representing edge boundaries within the user input. The first template parameter of the edge, the white space policy, determines the space sensitivity of the edge. Specific symbols (or no symbols at all), categorized by the policy, are distinguished as white space.

White space policy classes derive from the cttl::policy_default class. The policy class must define a member function named match(), which skips over white space symbols by advancing the upper boundary of the universe forward. For example:

// sample code: follow_whitespace.cpp
// demonstrates custom white space policy
// defined as dash characters, '-'.
#include <iostream>
#include "cttl/cttl.h"

using namespace cttl;

struct policy_dash : public policy_default
{
    template< typename UniverseT >
    static size_t match( UniverseT& universe_ )
    {
        typedef typename UniverseT::char_T char_T;
        size_t non_greedy_offset = universe_.first.offset();
        ( true ^ *symbol( char_T( '-' ) ) ).match( universe_ );
        // policy_dash is non-greedy:
        return non_greedy_offset;
    }
};  // struct policy_dash

int main()
{
    input<> inp( "-a-b-c--d-e-f---" );
    const_edge< policy_dash > universe( new_edge( inp ) );
    const_edge<> data( new_edge( inp ) );
    size_t result = ( data( +entity( isalpha ) ) ).match( universe );
    assert( result != std::string::npos );
    std::cout << "data: /" << data.text() << '/' << std::endl;
    return 0;
}
This program prints the following output:
data: /a-b-c--d-e-f/

During grammar evaluation, the lexer implementation invokes the match() function of the white space policy class, giving it a chance to modify the upper boundary of the universe according to the definition of white space.

The match() function must always succeed and is supposed to return a valid offset within the specified universe of discourse. If the match() function returns the position before the white space, its behavior is described as non-greedy white space evaluation. If, on the other hand, the function returns the position immediately after the white space, the behavior is greedy. Non-greedy evaluation consumes white space symbols, but retains the ability to backtrack to the original position if necessary. The greedy version also consumes white space, but offers the newly modified offset for future backtracking attempts.

The sample policy_dash class saves the original position of the upper boundary of the universe in a local variable named non_greedy_offset, and returns that position back to the lexer. A non-greedy implementation therefore creates the potential for multiple white space evaluations in case of repetitive backtracking.

A greedy version of the policy_dash class could be written as follows:

struct policy_dash : public policy_default
{
    template< typename UniverseT >
    static size_t match( UniverseT& universe_ )
    {
        typedef typename UniverseT::char_T char_T;
        ( true ^ *symbol( char_T( '-' ) ) ).match( universe_ );
        // greedy policy simply returns the modified offset:
        return universe_.first.offset();
    }
};  // struct policy_dash
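The practical difference between the two variants can be demonstrated without CTTL. In this standalone sketch (hypothetical helper names, not library code), both skippers advance the cursor past the dashes, but only the greedy one reports the advanced position as the new backtracking point:

```cpp
#include <cstddef>
#include <string>

// Advance pos past any run of '-' characters; return the offset that
// the caller should treat as the backtracking position.
std::size_t skip_non_greedy( const std::string& s, std::size_t& pos )
{
    std::size_t original = pos;            // remember where we started
    while ( pos < s.size() && s[ pos ] == '-' ) ++pos;
    return original;                       // backtrack to before the dashes
}

std::size_t skip_greedy( const std::string& s, std::size_t& pos )
{
    while ( pos < s.size() && s[ pos ] == '-' ) ++pos;
    return pos;                            // backtrack only to after the dashes
}
```

With input "---a" and a cursor at 0, both functions leave the cursor at offset 3; the non-greedy skipper reports offset 0, so a failed match backtracks to before the dashes and the white space may be evaluated again, while the greedy skipper reports offset 3.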

Most policy objects do not require their own state; for that reason, they can implement a static version of match(). If, however, the implementation requires an actual instance of the policy object, the policy can be provided as the second argument of the edge constructor. For example, policy_dash from the previous example could be explicitly instantiated and passed to the constructor of the universe object:

int main()
{
    input<> inp( "-a-b-c--d-e-f---" );
    policy_dash space_policy;   // instantiate the policy
    const_edge< policy_dash > universe( new_edge( inp ), space_policy );
    //...
}

To avoid infinite recursion, it is essential for the policy implementation to evaluate its white space expression in strict mode, that is, against a universe that uses the policy_default class. That is why the sample policy uses the true^... construct to describe the white space:

    ( true ^ *symbol( char_T( '-' ) ) ).match( universe_ );

This is equivalent to:

struct policy_dash : public policy_default
{
    template< typename UniverseT >
    static size_t match( UniverseT& universe_ )
    {
        typedef typename UniverseT::char_T char_T;
        size_t non_greedy_offset = universe_.first.offset();
        const_edge<> strict_universe( universe_ );
        ( *symbol( char_T( '-' ) ) ).match( strict_universe );
        // policy_dash is non-greedy:
        return non_greedy_offset;
    }
};  // struct policy_dash
The policy template should derive from the cttl::policy_default class, which is also used as the default edge template parameter. The default policy provides no recognition of white space.

If compiled against a mutable universe, a CTTL policy expression may modify the universe if necessary. For example, a white space rule may physically remove white space from the input text.

The policy template can provide a custom version of strict_policy_T:

   typedef user-defined-strict-policy strict_policy_T;
Please note that the C++ type designated as the strict policy type must be default-constructible. The strict policy is always instantiated as a static member of the universe using the default constructor. Since the strict policy member is static, it should have a stateless implementation in order to remain thread-safe. For more information, see the strict universe overview.


Related pages: CTTL | Grammar evaluation | Grammar reference

pre-defined policies

For the most common cases of white space, the library provides a number of pre-defined policy classes that can be used with the const_edge and edge classes. The class template policy_space is specialized for various types of white space by combinations of compile-time flags. Feel free to examine the source code of the policy classes when implementing custom white space policies.

Table E.3 - Pre-defined white space policies

policy_default

    Default policy; supports strict evaluation of the grammar. The implementation is empty, therefore it performs no white space recognition.

policy_space<>
policy_space< flag_follow_space >
policy_space< flag_greedy|flag_follow_space >
policy_space< flag_greedy >

    Instructs the CTTL lexer to bypass conventional white space characters, '\t', '\n', '\v', '\f', '\r', and ' ', formally known as ht, lf, vt, ff, cr, and space.

    The flag_greedy flag selects greedy evaluation, which reduces the amount of backtracking.

policy_space< flag_follow_region >
policy_space< flag_greedy|flag_follow_region >

    The policy flag flag_follow_region instructs the CTTL lexer to bypass pre-defined regions of input. The region policy is typically used in two-pass algorithms.

    Search and repeatable-search methods of grammar evaluation can use a small subset of grammar to describe particular patterns of search subjects. For example, a search-and-replace algorithm identifies replacement targets described by small grammar patterns. If the input text contains some programming language, the algorithm may require an additional constraint: to avoid changes inside string and character literals. One possible solution is a two-pass algorithm: boundaries of the string literals, detected by the first pass, are added to the set of client regions; the second pass replaces target fragments, but skips the input regions stored earlier.

    Sample program client_regions.cpp demonstrates region-specific policy functionality.

    The flag_greedy flag selects greedy evaluation, which reduces the amount of backtracking.

policy_space< flag_follow_space|flag_follow_region >
policy_space< flag_greedy|flag_follow_space|flag_follow_region >

    Implements the functionality of conventional white space combined with client-defined regions of input.

policy_space< flag_cpp_comments >
policy_space< flag_greedy|flag_cpp_comments >

    Combines the functionality of conventional white space with C and C++ style comments.

    The flag_greedy flag selects greedy evaluation, which reduces the amount of backtracking.

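The two-pass scheme described for flag_follow_region can be sketched without CTTL. This hypothetical example (find_string_literals, inside, and replace_outside are illustrative names) collects the regions occupied by double-quoted string literals in a first pass, then performs a same-length replacement everywhere except inside those regions:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

typedef std::pair< std::size_t, std::size_t > region;  // [begin, end)

// Pass 1: record boundaries of double-quoted string literals.
std::vector< region > find_string_literals( const std::string& s )
{
    std::vector< region > regions;
    std::size_t i = 0;
    while ( ( i = s.find( '"', i ) ) != std::string::npos ) {
        std::size_t close = s.find( '"', i + 1 );
        if ( close == std::string::npos ) break;
        regions.push_back( region( i, close + 1 ) );
        i = close + 1;
    }
    return regions;
}

bool inside( const std::vector< region >& rs, std::size_t pos )
{
    for ( std::size_t k = 0; k != rs.size(); ++k )
        if ( pos >= rs[ k ].first && pos < rs[ k ].second ) return true;
    return false;
}

// Pass 2: replace `from` with `to` outside the recorded regions.
// For simplicity, assumes from and to have equal length so the
// region offsets recorded in pass 1 stay valid.
std::string replace_outside( std::string s,
                             const std::string& from, const std::string& to )
{
    std::vector< region > rs = find_string_literals( s );
    std::size_t i = 0;
    while ( ( i = s.find( from, i ) ) != std::string::npos ) {
        if ( !inside( rs, i ) ) {
            s.replace( i, from.size(), to );
            i += to.size();
        } else {
            ++i;
        }
    }
    return s;
}
```

The flag_follow_region policy plays the role of the `inside()` check here: during the second pass, the lexer's white space policy transparently skips the stored client regions.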
Related pages: CTTL | Grammar evaluation | Grammar reference

stream parsers

Multi-threaded applications, such as communication protocol servers, may impose significant constraints on the amount of memory available to individual worker threads. In theory, a thread should be able to process the data using an input buffer only one byte in size. The parser must not fail due to incomplete input; instead, it should ask for more data to assemble a meaningful message (although there may be other reasons for the parser to fail, such as a timeout limit, a message size limit, and so on).

A stream parser is a program that receives data from a stream, interprets complete messages, and accumulates incomplete ones. Stream parsers solve the problems known as incremental parsing, or incremental lexical analysis.

A CTTL stream parser implements an incremental parser. The space policy class, capable of modifying the content of the current universe, is the key component of the stream parser implementation. Since the match() function of the policy class is invoked before mainstream grammar evaluation, the policy gets a chance to inspect the user input, verify its completeness, add more data if the universe is empty, and, finally, discard already processed data to free up memory.
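The buffering discipline of an incremental parser can be sketched without CTTL. This hypothetical message_stream class accepts arbitrary chunk boundaries, hands out complete newline-terminated messages, and discards consumed data to free memory, mirroring the responsibilities the stream parser delegates to its space policy:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical incremental message assembler: feed() accepts chunks
// with arbitrary boundaries; complete '\n'-terminated messages are
// extracted, and incomplete tails are retained for the next chunk.
class message_stream {
    std::string buffer_;
public:
    std::vector< std::string > feed( const std::string& chunk )
    {
        buffer_ += chunk;                  // accumulate incomplete input
        std::vector< std::string > messages;
        std::size_t pos;
        while ( ( pos = buffer_.find( '\n' ) ) != std::string::npos ) {
            messages.push_back( buffer_.substr( 0, pos ) );
            buffer_.erase( 0, pos + 1 );   // discard processed data
        }
        return messages;
    }

    std::size_t pending() const { return buffer_.size(); }
};
```

Feeding "hel" yields no messages; feeding "lo\nwor" next yields the complete message "hello" and retains the three-byte tail "wor" for the following chunk.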

For an example of CTTL stream parser, see library sample xml_parser.cpp.


Related pages: CTTL | Grammar evaluation | Grammar reference



Copyright © 1997-2006 Igor Kholodov mailto:cttl@users.sourceforge.net.

Permission to copy, use, modify, sell and distribute this document is granted provided this copyright notice appears in all copies. This document is provided "as is" without express or implied warranty, and with no claim as to its suitability for any purpose.


Generated on Thu Nov 2 17:44:54 2006 for Common Text Transformation Library by  doxygen 1.3.9.1