<<< Void regions | Table Of Contents | Parser design >>> |
Common Text Transformation Library http://cttl.sourceforge.net/
A stream parser receives data from a sequential data stream. Stream parsers solve a class of problems known as incremental parsing, or incremental lexical analysis.
Multi-threaded applications, such as communication protocol servers, may impose significant constraints on the amount of memory available to individual worker threads. In theory, a thread should be able to process the data using an input buffer only one byte in size. Apart from legitimate reasons for a program to fail, such as a timeout limit or a message size limit, the parser must not give up processing due to incomplete input. Instead, the scanner should ask for more data from the stream and eventually be able to assemble a meaningful message.
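The distinction between a hard failure and incomplete input can be sketched in plain C++ (this is a hypothetical illustration, not CTTL code): the matcher reports "need more data" instead of a mismatch, so the caller can keep reading from the stream one byte at a time.

#include <cassert>
#include <string>

// Hypothetical sketch (not CTTL code): an incremental matcher that,
// instead of failing on a short read, reports that more input is needed.
enum class match_result { matched, mismatched, need_more_data };

// Try to match a ';'-terminated message at the end of the buffer.
// An incomplete buffer is not an error: the caller should read more
// bytes from the stream and call again.
match_result match_message( const std::string& buffer_ )
{
    if ( buffer_.empty() )
        return match_result::need_more_data;

    if ( buffer_.back() == ';' )
        return match_result::matched;

    // Reject only on a hard limit, such as a maximum message size:
    if ( buffer_.size() > 1024 )
        return match_result::mismatched;

    return match_result::need_more_data;
}

The caller's loop appends one character per iteration and re-invokes the matcher until it returns either matched or mismatched.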
In addition to pattern matching, the lexer becomes responsible for extracting known-size tokens from the data stream.
CTTL includes a number of key components that support incremental parsing:
User-defined space policy classes, capable of appending content to the input string;
Positive unlimited lookahead based on grammar expression adaptors;
Positive lookahead assertion (lookup) by begin() lexemes;
Negative lookahead assertion by operator-;
Direct, length-based validations of character-counted entities.
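The last item above, length-based validation of a character-counted entity, can be sketched in standalone C++ (a hypothetical illustration, not CTTL code): the token "exactly 3 decimal digits" is validated only once enough characters have arrived; until then the scanner asks for more data.

#include <cassert>
#include <cctype>
#include <string>

// Hypothetical sketch (not CTTL code): direct, length-based validation
// of a character-counted token, here "exactly 3 decimal digits" starting
// at offset_. Returns: 0 = matched, 1 = mismatched, 2 = need more data.
int validate_digit3( const std::string& input_, size_t offset_ )
{
    const size_t token_length = 3;
    if ( input_.size() < offset_ + token_length )
        return 2; // incomplete input: read more from the stream

    for ( size_t i = 0; i < token_length; ++i )
        if ( !std::isdigit( static_cast<unsigned char>( input_[ offset_ + i ] ) ) )
            return 1; // wrong character class

    return 0; // token validated
}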
To assemble a series of meaningful tokens, the lexer needs to be contextually predictive. In practice, the scanner has to read a number of characters, reset the input position, and then perform the usual lexical analysis. The "reset" part of this technique may require lookahead assertion, character counting, and input position backtracking.
Since data comes from the stream, the scanner may append text at the end of the input string at any time during grammar evaluation. Equally, the parser should be able to safely discard consumed parts of the input to free up memory.
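Appending at the end while discarding the consumed prefix requires some offset bookkeeping. A minimal standalone sketch (hypothetical, not CTTL code) keeps track of how many characters have been discarded, so absolute stream offsets remain valid after a discard:

#include <cassert>
#include <string>

// Hypothetical sketch (not CTTL code): an input buffer that grows at the
// end as stream data arrives and discards its consumed prefix to free
// memory. "base" is the absolute stream offset of data[0], so absolute
// positions remain stable across discards.
struct stream_buffer {
    std::string data;   // unconsumed tail of the stream
    size_t base = 0;    // absolute offset of data[0] in the stream

    void append( const std::string& chunk_ ) { data += chunk_; }

    // Discard everything before absolute offset pos_:
    void discard_before( size_t pos_ )
    {
        data.erase( 0, pos_ - base );
        base = pos_;
    }

    char at( size_t pos_ ) const { return data[ pos_ - base ]; }
};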
Prior to lexical analysis, certain types of tokens may need to be pre-assembled by the scanner.
The sample program policy_stream.cpp demonstrates an implementation of a stream lexer that gets exactly one character at a time from the standard input device. Such an extreme setting puts the interaction between the parser's software components through a rigorous test. (The number of characters read from the stream in a single attempt remains scalable and can be reconfigured according to the application's memory consumption requirements.)
Keywords: policy_stream.cpp, policy_strict_stream, policy_default, strict_policy_T, policy_relaxed_stream, std::cin, cin.get(), flag_follow_space, ungreedy policy, isalpha, isdigit, isxdigit, CTTL_STATIC_RULE
// sample code: policy_stream.cpp
//#define CTTL_TRACE_EVERYTHING //define to turn tracing on
#include <iostream>
#include "cttl/cttl.h"

using namespace cttl;

struct policy_strict_stream : public policy_default {

    typedef policy_strict_stream strict_policy_T;

    template< typename SubstrT >
    static size_t match( SubstrT& substr_ )
    {
        // Non-greedy by design - does nothing but appends new data:
        if ( std::cin.good() ) {
            if ( !substr_.length() ) {
                substr_.parent().append( 1, std::cin.get() );
                substr_.second.go_eof();
            }
        } else {
            // No more data from the stream is available:
            return std::string::npos;
        }
        return substr_.first.offset();
    }

}; // struct policy_strict_stream

struct policy_relaxed_stream :
    public policy_strict_stream,
    public policy_space< flag_follow_space >
{
    typedef policy_strict_stream strict_policy_T;

    template< typename SubstrT >
    static size_t match( SubstrT& substr_ )
    {
        // non-greedy implementation:
        size_t ungreedy_offset = substr_.first.offset();
        policy_space< flag_follow_space >::match( substr_ );
        while ( !substr_.length() ) {
            if ( policy_strict_stream::match( substr_ ) == std::string::npos ) {
                // No more data from the stream is available:
                return std::string::npos;
            }
            policy_space< flag_follow_space >::match( substr_ );
        }
        return ungreedy_offset;
    }

}; // struct policy_relaxed_stream

template< typename SubstrT >
struct lexer {

    static size_t alpha( SubstrT& substr_ )
    {
        return (
            +first( isalpha )             // Get as many alpha chars as you can,
            &                             // and, in strict mode,
            ( entity( isalpha ) + end() ) // verify that no white space is present.
        ).match( substr_ );
    }

    static size_t digit3( SubstrT& substr_ )
    {
        static const std::pair< int, int > TOKEN_LENGTH( 3, std::string::npos );
        return (
            // Matches exactly 3 digits:
            ( symbol() + TOKEN_LENGTH )   // Get any three characters
            &                             // and, in strict mode,
            ( entity( isdigit ) + end() ) // verify that they are 3 digits.
        ).match( substr_ );
    }

    static size_t hex48( SubstrT& substr_ )
    {
        static const std::pair< int, int > TOKEN_LENGTH( 4, 8 );
        return (
            // Matches 4 to 8 hex digits:
            ( first( isxdigit ) + TOKEN_LENGTH )
            &                              // and, in strict mode,
            ( entity( isxdigit ) + end() ) // verify that they are hex digits.
        ).match( substr_ );
    }

    static size_t start( SubstrT& substr_ )
    {
        return (
            CTTL_STATIC_RULE( &lexer< SubstrT >::digit3 )
            +
            CTTL_STATIC_RULE( &lexer< SubstrT >::alpha )
            +
            CTTL_STATIC_RULE( &lexer< SubstrT >::hex48 )
        ).match( substr_ );
    }

}; // lexer

int main()
{
    std::string inp;
    typedef const_edge< policy_relaxed_stream > substr_T;
    substr_T substring( inp );
    const_edge<> fragment = substring;

    std::cout
        << "Enter 3-digit number, 1+ alpha chars, 4-8 hex digits, and semicolon:"
        << std::endl
        << "123 xyz 123abc ;"
        << std::endl
        << "\t\t(Any mismatch causes program to exit.)"
        << std::endl
        ;

    while (
        (
            fragment( CTTL_STATIC_RULE( &lexer< substr_T >::start ) )
            +
            symbol( ';' )
        ).match( substring ) != std::string::npos
    )
    {
        std::cout << "Good match: /" << fragment << "/ -- try again:" << std::endl;
    }

    if ( substring.length() ) {
        std::cout << "Mismatch: " << substring << " -- exiting." << std::endl;
    }

    return 0;
}
The grammar rules describe a set of exemplary input tokens as follows:
lexer< SubstrT >::digit3 // matches 3-digit-long decimal number
lexer< SubstrT >::alpha  // matches one or more alphabetical characters
lexer< SubstrT >::hex48  // matches hex number 4 to 8 digits long
lexer< SubstrT >::start  // matches sequence of the above
Since the start rule permits white space between the entered tokens, the resulting fragment is best described as a non-terminal consisting of three terminal symbols. The other rules match terminal symbols only; therefore, no embedded space between characters is allowed.
The required terminal lengths are dictated by the Kleene plus quantifiers and the TOKEN_LENGTH pairs. The types of the entered characters are reinforced by operator&, which defines a set intersection with the corresponding entity() expression. For example,
+first( isalpha )             // Get as many alpha chars as you can,
&                             // and, in strict mode,
( entity( isalpha ) + end() ) // verify that no white space is present.
The end() lexeme guarantees that the matched fragment contains no categories of characters other than the one specified, isalpha in this case.
The validation of character types in the above grammar expressions relies on backtracking, introduced by the set intersection operator&.
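The semantics of such a set intersection can be sketched in standalone C++ (a hypothetical illustration, not CTTL code): evaluate the left pattern, backtrack the input position to the starting offset, evaluate the right pattern, and succeed only if both matched exactly the same span.

#include <cassert>
#include <cctype>
#include <string>

// Hypothetical sketch (not CTTL code) of the backtracking behind operator&.
using matcher = size_t (*)( const std::string&, size_t ); // end offset or npos

// Left pattern: a greedy run of one or more non-space characters.
size_t match_nonspace_run( const std::string& s_, size_t pos_ )
{
    size_t end = pos_;
    while ( end < s_.size() && !std::isspace( static_cast<unsigned char>( s_[ end ] ) ) )
        ++end;
    return end == pos_ ? std::string::npos : end;
}

// Right pattern: a greedy run of one or more alphabetic characters.
size_t match_alpha_run( const std::string& s_, size_t pos_ )
{
    size_t end = pos_;
    while ( end < s_.size() && std::isalpha( static_cast<unsigned char>( s_[ end ] ) ) )
        ++end;
    return end == pos_ ? std::string::npos : end;
}

// Set intersection: both patterns must match at the same position
// and consume exactly the same number of characters.
size_t intersect( matcher left_, matcher right_, const std::string& s_, size_t pos_ )
{
    size_t left_end = left_( s_, pos_ );
    if ( left_end == std::string::npos )
        return std::string::npos;
    size_t right_end = right_( s_, pos_ );   // input position backtracked to pos_
    return right_end == left_end ? left_end : std::string::npos;
}

Here "abc def" matches at offset 0 (both runs end at 3), while "ab1 def" fails, because the non-space run ends at 3 but the alpha run ends at 2.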
It is possible to eliminate backtracking using alternative grammar constructs. For example, the lexer's alpha() rule could be rewritten as
static size_t alpha( SubstrT& substr_ )
{
    return (
        // Causes lexer to invoke space policy match():
        *end( isspace )        // skip optional white space
        +
        ( begin( true )        // Always succeeds; ignores space policy.
          ^                    // In strict mode,
          +entity( isalpha )   // get alpha characters.
        )
    ).match( substr_ );
}
Instead of backtracking, the rule has now become a predictive grammar. To preserve the terminal symbol format (that is, to permit no white space between individual characters), the grammar switches to the strict evaluation mode before scanning the input. (The stream policy implementation is discussed below.)
The switch to strict grammar mode is made by the epsilon symbol begin(true) and the concatenation operator^ placed in front of the actual character sequence scan, +entity(isalpha).
By definition, the epsilon symbol has no knowledge of the space policy: the lexer never invokes the policy's match() function for the begin(true) lexeme. This creates a situation in which characters are ready to be entered from the stream, but an arbitrary amount of white space could still legally precede the alphabetical characters. To prevent +entity(isalpha) from failing, that white space has to be skipped, so the grammar checks for it explicitly with the *end(isspace) clause.
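The predictive skip-then-scan shape can be sketched in standalone C++ (hypothetical, not CTTL code): explicitly consume optional white space, then scan the terminal in strict mode, i.e. without any implicit space skipping between characters.

#include <cassert>
#include <cctype>
#include <string>

// Hypothetical sketch (not CTTL code) of the predictive rule:
// *end(isspace) + ( begin(true) ^ +entity(isalpha) ).
// Returns the end offset of the alpha token, or npos on mismatch.
size_t scan_alpha_token( const std::string& s_, size_t pos_ )
{
    while ( pos_ < s_.size() && std::isspace( static_cast<unsigned char>( s_[ pos_ ] ) ) )
        ++pos_;                  // *end(isspace): skip optional white space

    size_t start = pos_;         // begin(true): epsilon, always succeeds

    while ( pos_ < s_.size() && std::isalpha( static_cast<unsigned char>( s_[ pos_ ] ) ) )
        ++pos_;                  // +entity(isalpha): strict character scan

    return pos_ == start ? std::string::npos : pos_;
}

No backtracking occurs: each character is examined once, and failure is detected immediately when the first non-space character is not alphabetic.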
Proper usage of lexeme-specific space sensitivity rules is important for a stream parser, because its space policy classes are responsible for the data supply, in addition to the standard white space processing.
The match() member function of the space policy class is invoked prior to the mainstream grammar evaluation of every terminal symbol. The policy thus gets a chance to append new data from the stream and skip white space before the match is attempted.
A CTTL space policy class capable of appending content to the input string therefore becomes an important component of the stream parser implementation.
A lexer of a stream parser needs two kinds of policy classes: a strict policy, which presents raw input to the lexer, and a relaxed policy, which also skips white space between tokens.
In addition, both policies are responsible for the data supply from the stream and must properly handle important input events, such as stream errors and end-of-file conditions.
The policy_stream.cpp sample defines two such policy classes, policy_strict_stream and policy_relaxed_stream.
Both definitions declare policy_strict_stream to be the strict-mode policy by adding
typedef policy_strict_stream strict_policy_T;
The policy_strict_stream::match() implementation is responsible for reading data from std::cin and handling cin's state. The policy is designed in accordance with strict policy requirements: the function does not attempt to skip any white space, simply returning raw text back to the lexer. This allows the sample program to declare policy_strict_stream as the strict-mode space policy type in both policy classes.
The second policy class, policy_relaxed_stream, derives from policy_strict_stream and reimplements the match() member function to automatically skip white space delimiters between the tokens.
The policy_relaxed_stream::match() is an example of a policy mixer. Instead of rewriting low-level logic, the policy mixes in two existing match() implementations together:
struct policy_relaxed_stream :
    public policy_strict_stream,
    public policy_space< flag_follow_space >
{
    typedef policy_strict_stream strict_policy_T;

    template< typename SubstrT >
    static size_t match( SubstrT& substr_ )
    {
        // ungreedy implementation:
        size_t ungreedy_offset = substr_.first.offset();
        policy_space< flag_follow_space >::match( substr_ );
        while ( !substr_.length() ) {
            if ( policy_strict_stream::match( substr_ ) == std::string::npos ) {
                // No more data from the stream is available:
                return std::string::npos;
            }
            policy_space< flag_follow_space >::match( substr_ );
        }
        return ungreedy_offset;
    }

}; // struct policy_relaxed_stream
Note that the multiple inheritance form is somewhat artificial in this example. Since the input state is held entirely by std::cin, policy_relaxed_stream::match() is declared static. As long as both base classes are in scope, a simplified version would also compile just fine:
struct policy_relaxed_stream :
    //public policy_strict_stream,
    //public policy_space< flag_follow_space >
    public policy_default
{
    //...
}; // struct policy_relaxed_stream
Multiple inheritance can be a convenient design choice for mixing multiple stateful policy classes together.
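A minimal standalone sketch of this idea (hypothetical, not CTTL code): each base class keeps its own state, and the derived "mixer" combines their steps in one match() function, much as policy_relaxed_stream combines stream reading with white-space skipping.

#include <cassert>
#include <string>

// Hypothetical sketch (not CTTL code): mixing two stateful policy
// classes via multiple inheritance.
struct counting_policy {
    size_t calls = 0;                 // stateful: counts match attempts
    void note_call() { ++calls; }
};

struct space_policy {
    size_t skipped = 0;               // stateful: counts skipped blanks
    size_t skip_space( const std::string& s_, size_t pos_ )
    {
        while ( pos_ < s_.size() && s_[ pos_ ] == ' ' ) { ++pos_; ++skipped; }
        return pos_;
    }
};

struct mixed_policy : counting_policy, space_policy {
    size_t match( const std::string& s_, size_t pos_ )
    {
        note_call();                   // behavior mixed in from counting_policy
        return skip_space( s_, pos_ ); // behavior mixed in from space_policy
    }
};

Because each base contributes an orthogonal piece of state, the derived class needs no state of its own; it only sequences the inherited operations.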
The library sample XML parser, located in example/lambda/xml subdirectory, demonstrates various aspects of incremental parsing in CTTL.
To cope with the stream input-related tasks, the program defines two custom space policy classes:
The policy_strict_stream class is responsible for opening the file, reading the data, and closing the file when the end of the file is reached. At run time, an instance of the policy maintains the state of input and discards data which has already been processed by the parser. The policy's policy_strict_stream::match() does not skip any white space, presenting raw data to the lexer.
The second policy class, policy_relaxed_stream, derives from the policy_strict_stream, and reimplements its policy_relaxed_stream::match() member function in order to automatically skip the white space delimiters from the input.
The xml_lexer::xml_name() grammar rule parses the names of XML elements and attributes. It uses the policy_strict_stream policy class. The grammar performs an explicit check for white space and then scans the XML name, which may contain alphanumeric characters, underscores, and colons.
The XML parser sample creates and populates an abstract syntax tree (AST), built on top of the CTTL integer tree facility. The tree annotation is made with the CTTL lambda library, which extends grammar expressions with in-line semantic actions.
Copyright © 1997-2009 Igor Kholodov mailto:cttl@users.sourceforge.net.
Permission to copy and distribute this document is granted provided this copyright notice appears in all copies. This document is provided "as is" without express or implied warranty, and with no claim as to its suitability for any purpose.