CTTL XML stream parser sample: CTTL XML stream parser sample

Multi-threaded applications, such as communication protocol servers, may impose significant constraints on the amount of memory available to the individual worker threads. Theoretically, a thread should be able to process the data by using input buffer of only one byte in size. The parser must not fail due to an incomplete input; instead, it should ask for more data to assemble a meaningful message (although there may be other reasons for parser to fail, such as timeout limit, message size limit, and so on.)

A stream parser is a program that receives data from a stream, and can successfully interpret and accumulate incomplete messages. CTTL space policy class, capable of modifying content of the current universe, becomes an important component of the stream parser implementation. Since match() function of the policy class is invoked before mainstream grammar evaluation, the policy gets a chance to inspect user input, verify its completeness, add more data if the universe is empty, and finally, discard already processed data to free up the memory.

Library sample xml_parser.cpp demonstrates CTTL stream parser. The program takes two arguments: xml file name to parse, and optional switch -o, which sends results to the standard output. To cope with the stream input, the parser defines two custom policy classes: policy_strict_stream and policy_relaxed_stream. Policy classes are responsible for opening the file, reading the data, and closing the file when the end of file is reached. At run time, an instance of the policy class maintains state of the input file specified by the user.

The "strict" part in the name of the class suggests that its match() implementation does not skip any white space characters. This makes policy_strict_stream suitable for lexer<>::xml_name() grammar rule. The function makes explicit check for the white space, and then scans individual characters of the xml name, which may contain alphanumeric characters, underscores, and a colon.

Policy policy_relaxed_stream derives from the policy_strict_stream class and adds automatic white space checks to its implementation of the match(). This policy is employed by the grammar rule functions of the lexer class.

Grammar rules of xml lexer have fairly straight forward implementation. Rules referring to multi-character terminal symbols check for availability of required amount of the input characters. The checks are performed by positive lookahead assertions

Program starts with an empty input. Before grammar lexemes can see the data, custom policy object loads first chunk of data from the input file. The size of input is arbitrary; it can be anything from 1 byte to megabytes in size. Current implementation uses max_buffer_size template parameter to determine amount of input characters to add. Setting max_buffer_size equal to 1 byte is a good durability test for the lexer.

Both lexer and policy objects share edge object named consumed_data. The upper boundary of the edge never changes its position and always points to the beginning of the user input. The lower boundary is repositioned by the xml_node() grammar rule. Whenever it is time to load more data from the stream, the content of consumed_data substring is removed. Other implementations may retain the entire input in memory. Removal of consumed_data can be disabled by commenting out DISPOSE_CONSUMED_DATA macro definition, which controls parts of the code to free up memory allocated by the parser input string:

Sample parser implementation stores parsed xml data in several data structures of the parser object. Abstract syntax tree, xml_parser::xml_tree, is populated by semantic actions of the parser class. The tree references individual elements of xml data. (Alternatively, the entire xml document could be retained in memory, allowing the syntax tree to point back to the substrings of xml elements in the original document.)

// xml_parser.cpp
// utility to parse XML stream from a large file

//#define NDEBUG    // define before assert.h to stop assertions from being compiled 
//#define CTTL_TRACE_EVERYTHING //define to turn tracing on
//#define CTTL_TRACE_RULES  //define to turn light tracing on
//#define CTTL_LAMBDA_SINGULAR   // turns off multidimensional depositories
#define DISPOSE_CONSUMED_DATA   // when defined, free up memory as we read data from the stream

#include <iostream>
#include <stack>
#include "utils/win32_specific.h"
#include "cttl/cttl.h"
#include "utils/fileio.h"
#include "utils/timeutils.h"
#include "lambda/lambda.h"
#include "utils/offset_stack_guard.h"
#include "utils/inode.h"
#include <algorithm>

#include "example/lambda/xml/xml_tree_struct.h"
#include "example/lambda/xml/xml_tree_show.h"
#include "example/lambda/xml/xml_stream_policy.h"

using namespace cttl;

namespace {
    // common xml alphabet
    const std::string LITERAL_XML_NAME          = CTTL_QWERTY_123_ ":-";
    const std::string LITERAL_XML_COMMENT_OPEN  = "<!--";
    const std::string LITERAL_XML_COMMENT_CLOSE = "-->";
    const std::string LITERAL_XML_PI_OPEN       = "<?";
    const std::string LITERAL_XML_PI_CLOSE      = "?>";
    const std::string LITERAL_XML_CLOSE_ELEM    = "</";
    const std::string LITERAL_XML_ELEM_CLOSED   = "/>";
}

#include "example/lambda/xml/xml_parser.h"
#include "example/lambda/xml/xml_lexer.h"

int main(int argc, char* argv[])
{
    if ( argc < 2 ) {
        std::cout
            << "Usage: "
            << std::endl
            << '\t'
            << argv[ 0 ]
            << " path/file.xml [-o]"
            << std::endl
            << "If -o switch is specified, results will be sent to the standard output."
            << std::endl
            ;
        return 1;

    } else if ( ( argc > 2 ) && strcmp( argv[ 2 ], "-o" ) ) {
        std::cout
            << "ERROR: unknown switch "
            << argv[ 2 ]
            << std::endl
            ;
        return 1;
    }


    input<> inp;
    edge<> consumed_data( new_edge( inp ) ); 

    policy_relaxed_stream file_stream( consumed_data );
    if ( !file_stream.init( argv[ 1 ] ) ) {
        std::cout
            << "ERROR: failed to open input file "
            << std::endl
            << '\t'
            << argv[ 1 ]
            << std::endl
            ;
        return 1;
    }

    typedef edge< policy_relaxed_stream > universe_T;
    universe_T universe( new_edge( inp ), file_stream );
    lexer< parser< universe_T > > xml_parser( consumed_data );

    std::cout
        << time2string( current_time() )
        << " parsing "
        << argv[ 1 ]
        << std::endl
        ;

    if ( xml_parser.xml_grammar( universe ) == universe_T::string_T::npos ) {
        std::cout
            << time2string( current_time() )
            << " parser failed"
            << std::endl
            ;

        return 1;
    }

    std::cout
        << time2string( current_time() )
        << " done"
        << std::endl
        ;

    if ( ( argc > 2 ) && !strcmp( argv[ 2 ], "-o" ) ) {
        inode_reader<> root( xml_parser.vect_xml_tree, 1 );
        std::for_each(
            root,
            root.end(),
            xml_tree_show(
                1,
                xml_parser.vect_xml_names,
                xml_parser.vect_xml_text
                )
            );
    }
    return 0;
}

Permission to copy, use, modify, sell and distribute this document is granted provided this copyright notice appears in all copies. This document is provided "as is" without express or implied warranty, and with no claim as to its suitability for any purpose.

CTTL XML stream parser sample

Version 2.08