<<< Stream parsers | Table Of Contents | Production rule functions >>> |
Common Text Transformation Library http://cttl.sourceforge.net/
To parse text data, CTTL takes a grammar-expression-based approach. Grammar expressions are C++ expressions constructed from a number of pre-built lexeme operands, interconnected by a set of overloaded operators.
Lexemes represent terminal symbol definitions of the input language. Logical operators combine smaller grammar expressions into larger ones.
Ideally, each grammar expression represents one complete grammar production rule of the input language. Such an expression is typically hosted inside a specially designated and properly named C++ function, conventionally recognized by the rest of the program as a grammar production function, or simply a grammar rule.
As C++ functions, CTTL grammar rules may invoke each other recursively. Consequently, a C++ application embeds a set of EBNF-style grammar expressions directly in natively compiled binary code. Because the grammar expressions are resolved by the C++ compiler rather than interpreted at runtime, the resulting input-examination machinery is fast and thread safe.
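The mutual recursion of rule functions can be illustrated with a plain-C++ analogy (a hypothetical sketch, not CTTL code): two hand-written production rule functions for the grammar list ::= '(' list* ')' that invoke each other recursively, the same way CTTL grammar rule functions may call one another. The function names and the npos failure convention are illustrative assumptions.

```cpp
#include <cstddef>
#include <string>

// Forward declaration: the two rules are mutually recursive.
size_t match_list( const std::string& s, size_t pos );

// items ::= list*  -- zero or more nested lists.
// Returns the position past the last matched item.
size_t match_items( const std::string& s, size_t pos )
{
    while ( pos < s.size() && s[ pos ] == '(' ) {
        size_t next = match_list( s, pos ); // recursive rule invocation
        if ( next == std::string::npos )
            return std::string::npos;
        pos = next;
    }
    return pos;
}

// list ::= '(' items ')'
// Returns the position past the match, or std::string::npos on failure.
size_t match_list( const std::string& s, size_t pos )
{
    if ( pos >= s.size() || s[ pos ] != '(' )
        return std::string::npos;
    size_t after = match_items( s, pos + 1 ); // nested lists
    if ( after == std::string::npos || after >= s.size() || s[ after ] != ')' )
        return std::string::npos;
    return after + 1;
}
```

CTTL rules follow the same convention seen later in this page: a rule function accepts a substring and returns either the offset past the matched fragment or std::string::npos on failure.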
A collection of grammar productions defines the lexer component of the program. C++ functions that take immediate action while grammar evaluation is in progress (for example, storing and distributing parsed tokens, signalling errors, requesting additional input, etc.) are known as semantic actions. A collection of semantic actions provides the basis for the parser component(*) of the program. Although grammars and their corresponding sets of semantic actions are closely associated, parser and lexer implementations should be kept as distinct as possible. Potential reuse by future applications with minimal code changes is a good reason for a clear division between grammar definitions and any particular set of semantic actions.
____________________________
(*) Note that the proposed lexer and parser terminology is purely suggestive. It was chosen to clarify possible structural decisions related to the design and organization of a CTTL-based parser program. Whenever possible, CTTL documentation follows lexer/parser vocabulary to emphasize the distinction between grammar definitions and semantic actions.
Common tasks and responsibilities of CTTL grammar expressions are:
To parse the user input by matching (or searching for) tokens (terminal symbols) in the input language.
To parse input patterns formulated by non-terminal grammar constructs.
To invoke other grammar rules, defined(**) by the lexer component.
To invoke semantic actions, defined by the parser component.
To invoke in-line semantic actions, defined by lambda expressions.
A design in which the lexer class inherits from the parser is quite common:
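The pattern can be sketched without any CTTL dependency (a simplified, hypothetical analogy of the edge_functors.cpp structure shown below): the parser struct owns the semantic actions and accumulated results, while a lexer class template derives from it and contributes the production rule. The word-splitting logic here stands in for grammar evaluation and is purely illustrative.

```cpp
#include <cctype>
#include <string>
#include <vector>

// parser: semantic actions and the data they accumulate.
struct parser {
    std::vector< std::string > words;

    // semantic action: remember each word that was found
    void accumulate_words( const std::string& word_ )
    {
        words.push_back( word_ );
    }
};

// lexer: inherits semantic actions from ParserT and adds the
// production rule function start().
template< typename ParserT >
struct lexer : public ParserT {

    // production rule: scan the input, firing the semantic
    // action once per alphabetic word.
    bool start( const std::string& input_ )
    {
        std::string word;
        for ( char ch : input_ ) {
            if ( std::isalpha( static_cast< unsigned char >( ch ) ) )
                word += ch;
            else if ( !word.empty() ) {
                this->accumulate_words( word ); // inherited action
                word.clear();
            }
        }
        if ( !word.empty() )
            this->accumulate_words( word );
        return !this->words.empty();
    }
};
```

Instantiating lexer< parser > yields one object that carries both the grammar rule and the semantic actions, which is exactly the shape of the word_parser object in the sample program below.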
For example, the edge_functors.cpp program uses a parser that keeps track of the parsed substrings. The program includes components discussed in subsequent parts of the documentation:
The lexer struct includes one production rule function named start();
The parser struct provides a semantic action named accumulate_words();
The main() function uses the parser to match input tokens, and then replaces them with the "<WORD/>" tag.
Keywords: edge_functors.cpp, std::vector, edge, const_edge, edge_functors.h, policy_space, isalpha, accumulate_words, string_array2string, std::for_each, edge_replace
// sample code: edge_functors.cpp
// demonstrates cttl::edge_replace class.

//#define NDEBUG // must appear before assert.h is included to stop assertions from being compiled
//#define CTTL_TRACE_EVERYTHING

#include <algorithm>
#include <iostream>
#include <vector>
#include "cttl/cttl.h"
#include "cttl/edge_functors.h"

using namespace cttl;

struct parser {
    std::vector< edge<> > word_edges;

    // semantic actions:
    size_t accumulate_words( const_edge<>& edge_ )
    {
        // remember each word that was found:
        word_edges.push_back( edge_ );
        return edge_.second.offset();
    }
};

template< typename ParserT >
struct lexer : public ParserT {

    size_t start( const_edge< policy_space<> >& substr_ )
    {
        return (
            +( isalpha & rule( *this, &ParserT::accumulate_words ) )
        ).match( substr_ );
    }
};

int main( int argc, char* argv[] )
{
    if ( argc == 1 ) {
        std::cout
            << "Usage: on the command line, enter some words to process, for example,"
            << std::endl
            << '\t' << argv[ 0 ] << " abc def ghi"
            << std::endl
            ;
        return 1;
    }

    // construct input string from the command line arguments:
    std::string inp;
    string_array2string( inp, &argv[ 1 ], ' ' );

    // construct substring to be parsed:
    const_edge< policy_space<> > substring( inp );

    // construct the parser:
    lexer< parser > word_parser;

    // count words:
    if ( word_parser.start( substring ) != std::string::npos ) {
        std::cout << "Input: " << inp << std::endl;

        // Replace words with word marker:
        std::for_each(
            word_parser.word_edges.begin(),
            word_parser.word_edges.end(),
            edge_replace<>( "<WORD/>" )
            );

        std::cout << "Output: " << inp << std::endl;

    } else {
        std::cout << "*** parser failed ***" << std::endl;
        return 1;
    }

    return 0;
}
Copyright © 1997-2009 Igor Kholodov mailto:cttl@users.sourceforge.net.
Permission to copy and distribute this document is granted provided this copyright notice appears in all copies. This document is provided "as is" without express or implied warranty, and with no claim as to its suitability for any purpose.