Main Page | Namespace List | Class Hierarchy | Alphabetical List | Class List | Directories | File List | Namespace Members | Class Members | File Members | Related Pages

syntax trees


SourceForge.net Logo     CTTL on    
    SourceForge    
    Download    
    Latest    
    Documentation    
    Index    
    Library    
    News    
    CVS    
    Repository    
   Other    
   Links    

Most grammars require generation of abstract syntax trees (ASTs) with heterogeneous tree nodes that accommodate and store parsed data at runtime. Such requirement separates AST tree from existing STL containers, which, by design, store only homogeneous objects.

Automatic tree generation is absent from the text transformation library. However, trees can be successfully created and populated by lambda expressions during CTTL grammar parsing. An advantage of this approach is that it gives programmer a higher level of control, along with an ability to tailor desired tree structure to the specific needs of a particular application. This section of documentation covers aspects of dealing with challenges, associated with the generation of syntax trees.

A relatively complex tree structure can be modeled on top of all-familiar STL sequence container. For example, STL vector of integers could be divided into distinct logical nodes. In this arrangement, a subsequence of the vector elements of finite length describes a tree node . The length of a node becomes specific to the node type. Node types are bound to the pieces of data coming from the individual grammar productions of the input language. For instance, a typical tree structure may include a root node, a terminal node, and a variety of non-terminal node types.

Vector elements that belong to a particular logical node could be referred to as data fields. For simplicity, first data field of each node will store the node type. Distance from the beginning of the STL container to the first physical element of any given node provides the node with a unique identifier. Thus, utilizing unique id numbers, parent nodes can point to the child nodes (and vice versa). The relation of the nodes to each other can be obtained by looking at the node ids in the designated data fields. Multiple child nodes may also be arranged into the lists of siblings. The simplicity of id-based preparation is based on the assumption that once created, the node cannot be removed. This assumption restricts parser dynamics. That is, no tree nodes should be created until a corresponding grammar unit is entirely parsed and understood by the program. Otherwise the program could quickly begin to waste memory, occupied by the nodes that might became no longer needed, in case if grammar evaluation path takes turn to a different production rule. This is a reasonable constraint, since parser may store intermediate results in a temporary location, such as a stack, and move parsed data to the permanent data fields in the tree as soon as the end of the grammar production is reached. In a typical context-free grammar, parsing of parent rule begins first, followed by the child rules, than followed by the ending of the parent. Therefore, child nodes must be created before the parent, with child ids stored someplace on the stack. When the parent is ready, it links itself with children, simultaneously clearing the temporary storage on the stack.

Consider an interpreter of already parsed arithmetic expressions with floating point numbers. Grammar results are stored inside an integer tree, interpreter::m_tree.(see interpreter.h listing below) Constructor of the interpreter object receives a table of values, represented by interpreter::m_vector_doubles. Public function interpreter::non_terminal( ) is an entry point for the expression calculation. There are four types of non-terminal nodes that could be present in the tree, including addition, subtraction, multiplication, division, and negation, one for each common operator of the arithmetic expression. In combination with those, the terminal nodes store expression operands. Since floating point numbers cannot be accumulated by a vector of integers, the terminal node relies on indirect storage of terminal values. Each terminal node has a data field that contains position of the actual value inside a separate data table, interpreter::m_vector_doubles. The data table translates data to double precision value when the computation takes place.

arithmetic interpreter example

// Common Text Transformation Library
// Copyright (C) 1997-2006 by Igor Kholodov. 
//
// This library is free software; you can redistribute it and/or
// modify it under the terms of the GNU Lesser General Public
// License as published by the Free Software Foundation; either
// version 2.1 of the License, or (at your option) any later version.
//
// This library is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
// Lesser General Public License for more details.
//
// You should have received a copy of the GNU Lesser General Public
// License along with this library; if not, write to the
// Free Software Foundation, Inc.,
// 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301  USA
//
// mailto:cttl@users.sourceforge.net
// http://sourceforge.net/projects/cttl/

// interpreter.h
struct interpreter
{
    std::vector< int > const& m_tree;
    std::vector< double > const& m_vector_doubles;

    interpreter( std::vector< int > const& tree_, std::vector< double > const& vector_doubles_ )
    :
    m_tree( tree_ ),
    m_vector_doubles( vector_doubles_ )
    {
    }

    void operator=( interpreter const& ) const
    {
    }

    double non_terminal( size_t id )
    {
        // input: id of the node
        // field            content
        //m_tree[ id + 0 ]  opcode
        //m_tree[ id + 1 ]  left operand id
        //m_tree[ id + 2 ]  right operand id (optional)

        assert( id < m_tree.size() );

        switch ( m_tree[ id ] ) {
        case op_add:
            return non_terminal( m_tree[ id + left_node ] ) + non_terminal( m_tree[ id + right_node ] );

        case op_subtract:
            return non_terminal( m_tree[ id + left_node ] ) - non_terminal( m_tree[ id + right_node ] );

        case op_multiply:
            return non_terminal( m_tree[ id + left_node ] ) * non_terminal( m_tree[ id + right_node ] );

        case op_divide:
            return non_terminal( m_tree[ id + left_node ] ) / non_terminal( m_tree[ id + right_node ] );

        case op_negate:
            return -non_terminal( m_tree[ id + factor_node ] );

        case op_terminal:
            return m_vector_doubles[ m_tree[ id + double_value ] ];

        default:
            std::cout
                << "Error: unknown node type @" << id
                << ", opcode " << m_tree[ id ]
                << std::endl
                ;
            return 0;
        };
    }

};  // struct interpreter

Arithmetic expression is parsed and populated by the calc_lexer<UniverseT> object (see lexer.h source file below). All temporary data is stored in calc_lexer::m_stack_nodes stack of integers. Each grammar production manufactures an appropriate tree node upon completion of grammar evaluation. Physical writing of data is done by std::back_insert_iterator adaptor. For example, the following fragment of code inserts a new terminal node into the tree:

        // lambda expression to create terminal node:
        *m_tree_inserter++ = const_scalar( op_terminal ),  // code of operation
        *m_tree_inserter++ = ++( scalar( 0 )^vect_doubles^atof^edge_value ),
        m_stack_nodes = alias::size( scalar( &vect_parse_tree ) ),
        *m_stack_nodes -= terminal_size

Note that ++( scalar( 0 )^vect_doubles^atof^edge_value ) is a higher order function, which starts with text argument specified by edge_value. Library function atof() converts text to double value, and stores it in vect_doubles. Finally, position of the value in data table is returned and stored in the terminal node:

        *m_tree_inserter++ = ++( scalar( 0 )^vect_doubles^atof^edge_value )

Since the parent of the new terminal node is not known at the time when lambda expression is executed, the expression simply calculates and pushes identifier of the newly created node into the m_stack_nodes stack:

        m_stack_nodes = alias::size( scalar( &vect_parse_tree ) ),
        *m_stack_nodes -= terminal_size

When a non-terminal parent node is created, it gets the ids of the corresponding child nodes from the stack. For example, unary minus grammar production does the following:

        // lambda expression that creates non-terminal node:
        *m_tree_inserter++ = const_scalar( op_negate ),  // code of operation
        *m_tree_inserter++ = *m_stack_nodes--, // child node already on the stack
        m_stack_nodes = alias::size( scalar( &vect_parse_tree ) ),
        *m_stack_nodes -= factor_node_size

The source file for the lexer implements similar lambda expression for every type of the syntax tree nodes for arithmetic expressions:

// Common Text Transformation Library
// Copyright (C) 1997-2006 by Igor Kholodov. 
//
// This library is free software; you can redistribute it and/or
// modify it under the terms of the GNU Lesser General Public
// License as published by the Free Software Foundation; either
// version 2.1 of the License, or (at your option) any later version.
//
// This library is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
// Lesser General Public License for more details.
//
// You should have received a copy of the GNU Lesser General Public
// License along with this library; if not, write to the
// Free Software Foundation, Inc.,
// 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301  USA
//
// mailto:cttl@users.sourceforge.net
// http://sourceforge.net/projects/cttl/

// calc_lexer.h
#ifndef _CALC_LEXER_H_INCLUDED_
#define _CALC_LEXER_H_INCLUDED_

namespace {
    // non-terminal node constants
    static const int op_code = 0;
    static const int left_node = 1;
    static const int right_node = 2;
    static const int non_terminal_size = 3;

    // terminal node constants
    static const int double_value = 1;
    static const int terminal_size = 2;

    // factor non-terminal node constants
    static const int factor_node = 1;
    static const int factor_node_size = 2;

    // arithmetic interpreter constants
    static const int op_add = '+';
    static const int op_subtract = '-';
    static const int op_multiply = '*';
    static const int op_divide = '/';
    static const int op_negate = 'N';
    static const int op_terminal = 'T';
} // namespace


template< typename UniverseT >
struct calc_lexer
{
    typedef typename UniverseT::string_T string_T;
    typedef typename UniverseT::policy_T policy_T;
    typedef calc_lexer< UniverseT > lexer_T;

    std::vector< int >& vect_parse_tree;
    const_edge<>& edge_value;
    std::vector< double >& vect_doubles;

    lambda< std::back_insert_iterator< std::vector< int > > >::scalar m_tree_inserter;
    lambda<>::stack m_stack_nodes;
    lambda<>::scalar m_temp;

public:
    calc_lexer(
        std::vector< int >& vect_parse_tree_,
        const_edge<>& edge_value_,
        std::vector< double >& vect_doubles_
        )
        :
        vect_parse_tree( vect_parse_tree_ ),
        edge_value( edge_value_ ),
        vect_doubles( vect_doubles_ ),
        m_tree_inserter( scalar( std::back_insert_iterator< std::vector< int > >( vect_parse_tree ) ) )
    {
        assert( vect_parse_tree_.size() );
    }

    void operator=( calc_lexer< UniverseT > const& ) const
    {
    }

    bool grammar( UniverseT& universe_ )
    {
        size_t result =
            (
                CTTL_RULE( lexer_T::lex_expr )
                +
                // make sure entire input was parsed:
                end()
            ).match( universe_ );

        if ( result == UniverseT::string_T::npos )
            // parser failed
            return false;

        assert( m_stack_nodes.size() == 1 );
        vect_parse_tree[ 0 ] = m_stack_nodes.top();
        return true;

    } // rule

// arithmetic calculator grammar
     size_t lex_expr( UniverseT& universe_ )
     {
         return
         (
            CTTL_RULE( lexer_T::lex_term )
            +
            *(
                (
                    symbol( '+' )
                    +
                    CTTL_RULE( lexer_T::lex_term )
                    +
                    *(
                        *m_tree_inserter++ = const_scalar( op_add ),  // code of operation
                        m_temp = *m_stack_nodes--,             // the right side on top of the stack
                        *m_tree_inserter++ = *m_stack_nodes--, // store left operand
                        *m_tree_inserter++ = m_temp,           // store right operand
                        m_stack_nodes = alias::size( scalar( &vect_parse_tree ) ),
                        *m_stack_nodes -= non_terminal_size
                    )
                )
                |
                (
                    symbol( '-' )
                    +
                    CTTL_RULE( lexer_T::lex_term )
                    +
                    *(
                        *m_tree_inserter++ = const_scalar( op_subtract ),  // code of operation
                        m_temp = *m_stack_nodes--,             // the right side on top of the stack
                        *m_tree_inserter++ = *m_stack_nodes--, // store left operand
                        *m_tree_inserter++ = m_temp,           // store right operand
                        m_stack_nodes = alias::size( scalar( &vect_parse_tree ) ),
                        *m_stack_nodes -= non_terminal_size
                    )
                )
            )
        ).match( universe_ );

     } // rule


     size_t lex_term( UniverseT& universe_ )
     {
         return
         (
           CTTL_RULE( lexer_T::lex_factor )
           +
           *(
                (
                    symbol( '*' )
                    +
                    CTTL_RULE( lexer_T::lex_factor )
                    +
                    *(
                        *m_tree_inserter++ = const_scalar( op_multiply ),  // code of operation
                        m_temp = *m_stack_nodes--,             // the right side on top of the stack
                        *m_tree_inserter++ = *m_stack_nodes--, // store left operand
                        *m_tree_inserter++ = m_temp,           // store right operand
                        m_stack_nodes = alias::size( scalar( &vect_parse_tree ) ),
                        *m_stack_nodes -= non_terminal_size
                    )
                )
                |
                (
                    symbol( '/' )
                    +
                    CTTL_RULE( lexer_T::lex_factor )
                    +
                    *(
                        *m_tree_inserter++ = const_scalar( op_divide ),  // code of operation
                        m_temp = *m_stack_nodes--,             // the right side on top of the stack
                        *m_tree_inserter++ = *m_stack_nodes--, // store left operand
                        *m_tree_inserter++ = m_temp,           // store right operand
                        m_stack_nodes = alias::size( scalar( &vect_parse_tree ) ),
                        *m_stack_nodes -= non_terminal_size
                    )
                )
           )
         ).match( universe_ );

     } // rule


     size_t lex_factor( UniverseT& universe_ )
     {
         return
         (
           (
                edge_value(
                    entity( isdigit ) + ( '.' + entity( isdigit ) ) * 1
                )
                +
                *(
                    *m_tree_inserter++ = const_scalar( op_terminal ),  // code of operation
                    *m_tree_inserter++ = ++( scalar( 0 )^vect_doubles^atof^edge_value ),
                    m_stack_nodes = alias::size( scalar( &vect_parse_tree ) ),
                    *m_stack_nodes -= terminal_size
                )
           )
           |
           (
             symbol( '(' )
             +
             CTTL_RULE( lexer_T::lex_expr )
             +
             symbol( ')' )
           )
           |
           (
             symbol( '-' )
             +
             CTTL_RULE( lexer_T::lex_factor )
             +
             *(
                *m_tree_inserter++ = const_scalar( op_negate ),  // code of operation
                *m_tree_inserter++ = *m_stack_nodes--,
                m_stack_nodes = alias::size( scalar( &vect_parse_tree ) ),
                *m_stack_nodes -= factor_node_size
             )
           )
           |
           (
             symbol( '+' )
             +
             CTTL_RULE( lexer_T::lex_factor )
           )
         ).match( universe_ );

     } // rule

};  // struct calc_lexer

#endif //_CALC_LEXER_H_H_INCLUDED_

Finally, the driver of the program instantiates the input specified by the user on the command line. The driver also instantiates vector of integers to hold the syntax tree, vector of doubles for the table of values, and passes these containers to the lexer object constructor.

After invoking the lexer entry point, lexer.grammar( universe ), the program passes the populated syntax tree to the interpreter object, and displays the result of calculation on the screen:

// arithmetic_interpreter.cpp

//#define NDEBUG    // define before assert.h to stop assertions from being compiled 
//#define CTTL_TRACE_EVERYTHING //define to turn tracing on
//#define CTTL_TRACE_RULES  //define to turn light tracing on

#include <iterator> // needed for ostream_iterator
#include "utils/win32_specific.h"
#include "cttl/cttl.h"
#include "lambda/lambda.h"

using namespace cttl;

#include "example/lambda/arithmetic_interpreter/lexer.h"
#include "example/lambda/arithmetic_interpreter/interpreter.h"

int main(int argc, char* argv[])
{
    assert( argc > 1 );
    input<> inp( &argv[ 1 ] );
    const_edge<> value( new_edge( inp ) );

    std::vector< int > tree( 1 );   // reserve one element for the header node
    std::vector< double > vector_doubles;

    // construct universe to be parsed:
    typedef const_edge< policy_space<> > universe_T;

    universe_T universe( new_edge( inp ) );

    calc_lexer< universe_T > lexer(
        tree,
        value,
        vector_doubles
        );

    if ( lexer.grammar( universe ) ) {

        interpreter inter( tree, vector_doubles );

        std::cout
            << std::endl
            << "result: "
            << inter.non_terminal( tree[ 0 ] )
            << std::endl
            ;

    } else {
        std::cout << "\t\t *** PARSER FAILED *** " << std::endl;
        return 1;
    }

    return 0;

} // main

integer tree

In the above example of the arithmetic expression parser, the handling of physical data storage is done by writing via std::back_insert_iterator iterator adaptor into STL container. It is possible because requirements for the tree structure were quite simple - the intermediate results of the arithmetic expression can be organized into a binary tree. On the other hand, a syntax tree for accommodation of XML data is slightly more complicated. For example, an XML node might have many children. Depending on the purpose of the application, there could be varying requirements for internal structure of the XML tree. An XPath query may wish to to jump from a child node directly to its parent. On the other hand, a system to process a purchase order may require only simple traversal of the XML nodes.

CTTL library provides two utility classes, namely cttl::inode_writer and cttl::inode_reader, which facilitate creation and navigation of integer syntax trees (modeled on top of STL sequence containers of integer data). cttl::inode_writer and cttl::inode_reader classes encapsulate the concept of integer tree iterator.

Despite the limitation imposed by the requirement that once created, integer tree cannot be modified, this type of tree is sufficient for many applications, since there is no limitation on the complexity of the structure and relationships between individual tree nodes.

Utility classes for the tree node manipulation are declared in "utils/inode.h" header file. The following example demonstrates basic principles of construction and iteration through integer tree:

syntax tree example

//#define CTTL_TRACE_DEPOSITS // turn on lambda expression tracing

#include "cttl/cttl.h"
#include "lambda/lambda.h"
#include "utils/inode.h"

using namespace cttl;

enum {
    FIRST_,
    LAST_,
    DATA_,
    SIZE_
};

int main(int argc, char* argv[])
{
    std::vector< int > vec( 1, 0 ); // one elem with zero value

    // construct integer tree
    lambda< inode_writer<> >::scalar iprev = inode_writer<>( vec );
    lambda< inode_writer<> >::scalar inext = inode_writer<>( vec );
    lambda< inode_writer<> >::scalar iparent = inode_writer<>( vec );
    (
        iprev << ( scalar(0)^0^1 ), // (c1)
        inext << ( scalar(0)^0^2 ), // (c2)
        iprev += inext,             // (c1) -> (c2)

        iprev = inext,
        inext << ( scalar(0)^0^3 ), // (c3)
        iprev += inext,             // (c2) -> (c3)

        iprev = inext,
        inext << ( scalar(0)^0^4 ), // (c4)
        iprev += inext,             // (c3) -> (c4)

        iparent << 2,               // (p)
        iprev = 1,
        iparent *= iprev,           // (p) <- *(c)

        iparent += iprev,           // (p) -> (c1)
        inext -= iparent            // (c4) <- (p)

    ).evaluate();

	std::copy(
		vec.begin(),
		vec.end(),
		std::ostream_iterator< int >( std::cout, " " )
		);

    // display tree data

    // construct root node at known position
    inode_reader<> root = inode_reader<>( vec, iparent.top().offset() );

    // iterate integer tree
    for(
        inode_reader<> child = root( FIRST_ );
        child != root.end();
        ++child
        )
    {
        std::cout << std::endl << child[ DATA_ ];
    }

    return 0;
}
/*

Program output:

0 4 13 1 7 13 2 10 13 3 0 13 4 1 10
1
2
3
4

*/

This example begins with enumeration of constants describing terminal data node in the tree. The last constant, SIZE_, specifies the size of the node.

A vector named vec holds physical storage for integer tree. The first element of the vector is left reserved. Structure of the tree does not expect to find a valid node at zero offset, because zero offset is used as a stop node for links between the nodes. To avoid zero offset, the vector is constructed with single element, which is left unused by the program.

Three inode_writer objects are also constructed. Two of them, iprev and inext, represent terminal nodes. Total four nodes are created and marked as (c1), (c2), (c3), (c4) in program comments.

Overloaded operators << physically position node writer at the end of the vector and insert data from the lambda composite structure:

        iprev << ( scalar(0)^0^1 ), // (c1)

Overloaded operator += links two nodes, creating a structure known as single-linked list:

        iprev += inext, // (c1) -> (c2)

Before another node is created, the iprev object is repositioned to point to the last node in the list:

        iprev = inext,
        inext << ( scalar(0)^0^3 ), // (c3)

Once all four nodes are inserted and linked, their parent node (p) is created. Another overloaded operator << inserts two zero elements at the end of the vector, forming a node with two data fields:

        iparent << 2, // (p)

The iprev is repositioned again to point to the very first child. Overloaded operator *= updates all child nodes in the list, in such way that their second data field points to the parent:

        iprev = 1,
        iparent *= iprev,           // (p) <- *(c)

Finally, the parent node also is linked in such way that it remembers the first and the last child node in its two fields:

        iparent += iprev,           // (p) -> (c1)
        inext -= iparent            // (c4) <- (p)

The (p) node is the only non-terminal node in the tree. The second part of the program shows how inode_reader objects iterates through the level of child nodes. For every terminal node, the value of data field, child[ DATA_ ] is displayed on the screen.

For further information about cttl::inode_writer and cttl::inode_reader classes, see XML parser library sample, which demonstrates how these objects can be used by CTTL grammar to instantiate, populate, link, and navigate a tree of XML nodes.



Copyright © 1997-2006 Igor Kholodov mailto:cttl@users.sourceforge.net.

Permission to copy, use, modify, sell and distribute this document is granted provided this copyright notice appears in all copies. This document is provided "as is" without express or implied warranty, and with no claim as to its suitability for any purpose.


Generated on Thu Nov 2 17:48:53 2006 for CTTL Lambda Expression by  doxygen 1.3.9.1