substrings

CTTL on
SourceForge

Download
Latest

Documentation
Index

Library
News

CVS
Repository

introduction

Concept of a substring plays major role in design of the text transformation library. CTTL substring is an object that interacts with fragments of text encapsulated by STL std::basic_string template class. Template classes cttl::const_edge and cttl::edge, designed for constant and mutable data access, respectively, represent CTTL substrings. Substrings may be compared, inserted, deleted, or replaced across multiple text inputs. If content of text mutates, the substrings adjust their positions accordingly to the change. CTTL guarantees that substrings remain stable with respect to a potentially mutable text.

Rudimentary text transformations (i.e. mutating operations such as deletions, insertions, and replacements of fragments) are defined on substrings of text. Therefore, reviewing properties of targets of these operations may help to establish a set of requirements for a C++ type that encapsulates a substring.

An insertion requires knowledge of the insertion position. Replacements and deletions require knowledge of starting and ending positions of the target substring. Such positions could be specified by a pair of absolute offsets counted from the beginning of the input character sequence.

Consider example of the user input text, "one two three". The input contains substrings of alphabetical characters (words) separated by spaces. A parser program might find three pairs of absolute offsets, [0,3), [4,7), and [8,13), corresponding to the beginning and ending positions of every word:

    "one two three"
     01234567890123
               1111
     |  ||  ||    |
     |  ||  ||    |
     |  ||  |`----`----- [8,13)
     |  ||  |
     |  |`--`----------- [4,7)
     |  |
     `--`--------------- [0,3)

After the first word, "one", is deleted,

    " two three"
     01234567890123
               1111
     |  ||  ||    |
     |  ||  ||    |
     |  ||  |`----`----- [8,13)
     |  ||  |
     |  |`--`----------- [4,7)
     |  |
     `--`--------------- [0,3)

offset pairs [4,7) and [8,13) no longer represent original substrings. Worse yet, absolute offset 13 is no longer a valid offset within our input. Therefore, once found, a substring must rely on indirect, or logical locations, which implicitly point to the actual physical positions of the substring.

Related pages: CTTL | Substrings

identity vector

CTTL offset management mechanism allows indirect access to the collection of absolute offsets pointing to different parts of input. References to the actual offsets require provision of the offset identity, represented by a non-negative integer number. The range of identities begins at zero and stays within the size of the identity vector, which is an instance of std::vector<size_t> object. Each element of the vector stores the actual absolute offset, while the index of the element establishes its identity:

    "one two three"
     01234567890123
               1111
     |  ||  ||    |
     |  ||  ||    `--- identity=5, offset=13
     |  ||  |`-------- identity=4, offset=8
     |  ||  |
     |  ||  `--------- identity=3, offset=7
     |  |`------------ identity=2, offset=4
     |  |
     |  `------------- identity=1, offset=3
     `---------------- identity=0, offset=0

If user deletes a fragment of text, the library implementation updates offset values in the identity vector, so that the logical positions of the text fragments remain pointing to the same underlying pieces of text.

For instance, in case of deletion of the word "one" at the beginning of the input, the length of the deleted substring is subtracted from every absolute offset on the right side of the deletion point:

    " two three"
     01234567890123
     ||  ||    |
     ||  ||    `------ identity=5, offset=10
     ||  |`----------- identity=4, offset=5
     ||  |
     ||  `------------ identity=3, offset=4
     |`--------------- identity=2, offset=1
     |
     `---------------- identity=1, offset=0
     `---------------- identity=0, offset=0

The library relies on the concept of the offset identity vector for its substring implementation. The library maintains state of every input object instantiated by the user. Along with encapsulated string, identity vector becomes a centralized repository of the current input state. As a rule of thumb, legitimate values of offsets P stored in the identity vector should stay within bounds

	0 <= P <= inp.length()
	- or -
	P == string::npos

where "inp.length()" is the length of the input, and "string" is the type of the string being used. If P equals to the string::npos value, the offset is considered to be deliberately invalid, and will be ignored by the offset adjustment algorithms.

CTTL provides boundary-unchecked access to the underlying text. Therefore, if a program is using invalid logical positions to access the data, the situation may yield unpredictable, and possibly catastrophic results. Comparison, reading, and writing through the interface of the CTTL substring require that the offset pairs representing valid substrings should be ordered. For example, if U is a substring, then it should abide by the rule

	U.first.offset() <= U.second.offset()

where expressions U.first.offset() and U.second.offset() return offsets of the lower and upper boundary of the substring, respectively.

Although it is possible to have identity vector values outside of the legal ranges, any data access requiring proper offset calculation (such as text comparison), must rely on offsets within bounds of the actual STL string contained by the cttl::input object.

Related pages: CTTL | Substrings

input class

cttl::input

Template class cttl::input stores user input text, along with the offset identity vector. It implements offset adjustment algorithms required by the identity vector maintenance tasks. The input template class gets the string type from its template parameter, and instantiates its member string object to store the input text. The default string type is std::string:

	void f()
	{
		using namespace cttl;
		input<> inp;
	}

Since instance of the input class takes care of the run time offset state, the input object becomes a parent of all associated objects:

	void f()
	{
		using namespace cttl;
		input<> inp;
		node<> nodeA( new_node( inp ) );
		node<> nodeB( new_node( inp ) );
		const_edge<> edgeC( new_edge( inp ) );
	}

The input class hides details of the low-level offset manipulation. When program invokes mutating functions exposed through the node or edge interfaces, the input parent routinely makes background offset adjustments to preserve correct logical positions that belong to the identity vector. However, the input class also exposes its own mutable reference to the underlying STL string (see input::text()). If program makes direct changes to the content of input, the program grows to be responsible for repositioning objects whose logical locations may be no longer valid.

Because of the input class configuration cost and complexity, its copy constructor and assignment operator are disabled. Instead, interaction between input objects is handled explicitly:

	void f()
	{
		using namespace cttl;
		input<> inp;
		// use inp...
		input<> output( inp.text() );
		// use output...
		inp.text( output.text() );
	}

Related pages: CTTL | Substrings

node class

cttl::node

In this section:

node creation

node identity

copying nodes

node assignment

node swap

offset-like behavior of the node

node navigation

node data access

adaptor behavior of the node

Template class cttl::node is a building block for writing programs that keep track of logical positions of input symbols. The node class abstracts away low-level absolute offset calculations associated with the text transformations and parsing.

The node stores reference to an instance of the cttl::input object, known as node's parent. The node delegates most of its tasks to the offset management mechanism of the parent input class. Another member variable, the node identity, stores a non-negative integer subscript corresponding to the element of the parent's identity vector. Parent/identity pair points to an absolute offset held by the identity vector element. Both parent and identity can be provided when the node is constructed:

	void f()
	{
		using namespace cttl;
		// create input with the identity vector of 2 elements in size:
		input<> inp( 2 );
		// create two nodes with explicitly specified identities:
		node<> nodeA( inp, 0 );
		node<> nodeB( inp, 1 );
		assert( nodeA.identity() == 0 );
		assert( nodeB.identity() == 1 );
	}

An alternative way to create a node is to make copy of the existing node. In this case, the parent/identity pair is specified indirectly by another node. The library function new_node() adds new element to the input identity vector and constructs a new node. The function takes reference to the input object as the first argument, and an optional absolute offset value as the second argument. The offset is assigned to the new node. By default, the node is positioned at the end of the input. The function returns instance of the node, constructed with newly created identity:

	void f()
	{
		using namespace cttl;
		// create input with empty identity vector:
		input<> inp;
		// construct node after adding new
		// element to the identity vector:
		node<> nodeA( new_node( inp ) );
		assert( nodeA.offset() == inp.length() );
	}

Identity vector elements are not automatically deleted when node objects go out of scope. The program should be aware of this fact, and treat the node identity as a reusable resource. In the following example, the calls to new_node() function are made in a loop. Unless this it's a planned intention, the code is poorly structured, because identity vector of the input object inp grows to an unnecessary 10000 elements in size:

	void f()
	{
		input<> inp;
		for ( int idx = 0; idx < 10000; ++idx ) {
			node<> nodeA( new_node( inp ) );  // *** poor choice!
			// use nodeA...
		}
	}

Better:

	void f()
	{
		input<> inp;
		int id = new_node( inp ).identity();
		for ( int idx = 0; idx < 10000; ++idx ) {
			node<> nodeA( inp, id ); // reuse the identity
			// use nodeA...
		}
	}

Related pages: CTTL | Substrings | Node

Next sections describe behavior and properties of the node class.

node creation

Because input object is required to create a node, nodes don't have default constructor. The input parent is specified directly as constructor's argument, or indirectly by another node. Node's parent cannot be changed. It is an error if a program allows input objects to go out of scope while outstanding nodes and edges continue to be in service (comparable to an attempt to use STL iterator after its container had been destroyed). This situation is probably rare, evidently, indicating an incorrect design. The program can obtain reference to the parent object by calling node::parent() function.

By default, a new node is positioned at the end of the user input text; otherwise, the exact location may be specified:

	void f()
	{
		using namespace cttl;
		input<> inp( "Hello, world!" );
		node<> nodeA( new_node( inp ) );    // same as new_node( inp, inp.length() )
		node<> nodeB( new_node( inp, 0 ) ); // position node at zero offset
		assert( nodeA.offset() == inp.length() );
		assert( nodeB.offset() == 0 );
	}

The type of string can be provided when the node is constructed by the template parameter:

	void wf()
	{
		using namespace cttl;
		input< std::wstring > inp( L"Hello, world!" );
		node< std::wstring > nodeA( new_node( inp ) );
		node< std::wstring > nodeA( new_node( inp ), 0 );
		assert( nodeA.offset() == inp.length() );
		assert( nodeB.offset() == 0 );
	}

Related pages: CTTL | Substrings | Node

node identity

Each node keeps a non-negative integer identity, which uniquely specifies the absolute offset of the identity vector maintained by the node's parent input object. The identity can be either explicitly provided when the node is constructed, or it can be taken over from another node, if the node is created by the copy constructor. Multiple nodes may share the same identity, just like multiple memory pointers can point to the same address.

Unlike parent reference, the node identity can change at run time, as it happens, for instance, when one node is assigned to another.

Related pages: CTTL | Substrings | Node

copying nodes

Node copy constructor manufactures an exact copy of the existing node:

	void f( node<> const& node_ )
	{
		node<> nodeA( node_ );
		assert( nodeA.identity() == node_.identity() );
	}

Related pages: CTTL | Substrings | Node

node assignment

Node assignment matches behavior of the copy constructor. The assignment operator assigns new identity from another node, while the offsets of either node remain unchanged:

	void f( node<>& nodeA_, node<>& nodeB_ )
	{
		nodeA_ = nodeB_;
		assert( nodeA_.identity() == nodeB_.identity() );
	}

If a non-negative integer type, convertible to size_t, is assigned to the node, it is stored as a new node offset:

	void f( size_t offset_, node<>& node_ )
	{
		node_ = offset_;
		assert( offset_ == node_.offset() );
	}

Related pages: CTTL | Substrings | Node

node swap

When nodes are swapped by the std::swap() algorithm, the identities of the nodes are exchanged. (For details, see node_edge_swap.cpp sample code.)

Related pages: CTTL | Substrings | Node

offset-like behavior of the node

The library provides full range of node comparison operators, such as ==, !=, <, >, <=, and >=. The comparison operators compare absolute offsets of two nodes. The two nodes are equal if their offsets are the same:

	void f( node<>& nodeA_, node<>& nodeB_ )
	{
		nodeA_ = nodeB_.offset();
		assert( nodeA_ == nodeB_ );
	}

Related pages: CTTL | Substrings | Node

node navigation

The node can be manually navigated to various locations inside user input. (Alternatively, node positions can be controlled by grammar expressions which employ node adaptor functions). As discussed earlier, the initial absolute offset of the node can be specified by the second argument of the cttl::new_node() function:

	void f( input<>& inp_, size_t offset_ )
	{
		node<> nodeA( new_node( inp_, offset_ ) );
		assert( nodeA.offset() == offset_ );
	}

Later, another offset can be either directly assigned to the node, or it can be modified by the node::offset() member function:

	void f( node<>& node_, size_t offset_ )
	{
		node_ = offset_;
		assert( node_.offset() == offset_ );
		//...
		node_.offset( offset_ );
		//...
	}

In addition to direct offset navigation, the node class provides a family of navigation member functions to dispatch nodes to locations commonly known as beginning-of-file, ending-of-file, line home and line end of the current, or an arbitrary line. Line-based traversals and other navigation member functions begin with prefix "go_", for example, node::go_line_home().

The following diagram shows agreement between line numbers and physical offsets as it is understood by the node navigation functions. Sample data contains two words, "one" and "two", separated by '\r' and '\n' characters:

	one\r\ntwo    <-- user input
	012 3 45678   <-- absolute offsets
	H   E  H  E   <-- line boundaries: H=line home, E=line end
	111 X X222X   <-- line numbers. Offsets marked by letter 'X'
	                  do not belong to any line.

For more information about line navigation, refer to the sample program line_navigate.cpp.

Manual node navigation is usually a secondary way of managing node positions. Typically, grammar expression invokes node adaptor function to resolve exact node position corresponding to the upper boundary of the leftmost input symbol.

Related pages: CTTL | Substrings | Node

node data access

The node class provides read and write access to the underlying character sequence. For example,

	void f()
	{
		using namespace cttl;
		input<> inp( "one two three" );
		node<> nodeA( new_node( inp ) );
		nodeA[ -1 ] = 'E';
		assert( inp.text() == "one two threE" );
	}

Here, the nodeA gets positioned at the end of the user input, which is not a valid read/write character position. However, subscript access to characters is relative to the current position of the node (as illustrated by the "nodeA[-1]" expression).

Another typical use case is to insert text at the position specified by the node:

	void f()
	{
		using namespace cttl;
		input<> inp;
		node<> nodeA( new_node( inp ) );

		// Morse code for letter 'A' gets inserted into the input:
		nodeA.insert_go( "dit " );
		nodeA.insert_go( "dah " );
		assert( inp.text() == "dit dah " );

		// Morse code for letter 'N' gets inserted:
		nodeA.insert_stay( "dit " );
		nodeA.insert_stay( "dah " );
		assert( inp.text() == "dit dah dah dit " );
	}

The node::insert_go() function inserts text at the current position of the node and moves current node offset to the end of the inserted fragment. This is similar to the behavior of text cursor of a text editor program. The side effect of the node::insert_go() function is that offsets of all nodes currently positioned at the insertion point are moved forward to the end of the inserted fragment of text.

The node::insert_stay() function inserts text at the current position of the node. Positions of nodes at the insertion point remain unchanged.

Related pages: CTTL | Substrings | Node

adaptor behavior of the node

The node class overloads function call operator, node::operator(), which takes a single argument, CTTL grammar expression. If grammar evaluation succeeds, the node position is set equal to the left boundary of the text fragment that matched the leftmost terminal symbol of the specified expression. For example,

void f() {
	input<> inp( "123" );
	const_edge<> universe( new_edge( inp ) );
	node<> left( new_node( inp ) );
	size_t result = left( entity( isdigit ) ).match( universe );
	assert( left.offset() == 0 );
}

The node adaptor provides access to the left boundary of the text corresponding to parsed symbols of the input language.

An important role of the node adaptor is that it allows to construct grammar rules that rely on the positive unlimited lookahead of the input symbols. The parseable universe U makes available two nodes, U.first and U.second, representing the lower and upper boundaries of the universe, respectively. If left node, U.first, is applied as an expression adaptor, it backtracks the left boundary of the universe to the position matching leftmost terminal symbol of the expression:

void f() {
	input<> inp( "123" );
	const_edge<> universe( new_edge( inp ) );
	size_t result = universe.first( entity( isdigit ) ).match( universe );
	assert( universe.first.offset() == 0 );
}

After successful evaluation of the expression entity( isdigit ), upper boundary of the universe backtracks to the beginning of the matched symbol, "123".

In summary, node adaptors can be used for locating data and to carry out the task of unlimited positive lookahead inside grammar expressions.

Related pages: CTTL | Substrings

edge classes

In this section:

edge creation

copying edges

edge assignment

edge swap

string-like behavior of the edge

adaptor behavior of the edge

To implement concept of a substring, the library defines two edge template classes for read-only and read/write data access:

    cttl::const_edge    // constant substring access
    cttl::edge          // mutable substring access

Separation of the data access levels yields better design and performance. Class cttl::edge derives from const_edge and adds a few member functions with mutable access to the underlying substring. Edge classes contain two node member objects, named first and second. The nodes represent lower and upper boundaries of the substring, respectively. Node positions adhere to the following rules:

0 <= U.first.offset() <= U.second.offset() <= inp.length()

where U is an edge, and inp.length() is the length of the user input. If both nodes point to the same location, the substring is empty. The substring "ABC" which belongs to the input "xxxABCyyy" illustrates the boundary rules. The substring has physical boundaries that can be denoted by valid range [3,6):

	xxxABCyyy    <-- user input
	0123456789   <-- absolute offsets
	   B  E      <-- substring boundaries, B=begin, E=end

CTTL edge represents a valid range of the character sequence (for valid range definition, see "Definitions" section of the STL Sequence.) The length of substring U can be calculated as

	U.second.offset() - U.first.offset()

Member function const_edge::length() computes and returns the length of the substring. Because either node of the edge could be navigated independently, the length() function may sometimes return a negative value. In such case it is illegal to call other edge member functions that require access to the underlying text data (for instance, functions or operators that compare, extract, delete, or replace substrings.)

The following sections explain basic behavior and properties of CTTL substrings:

Related pages: CTTL | Substrings | Edge

edge creation

Because input object is required to create an edge, edges don't have default constructor. The input can be specified directly by the argument of the edge constructor, or indirectly, by another edge or a pair of node objects. It is an error if a program allows input objects to go out of scope while associated edge objects are still in use. The program can obtain reference to the input (often referred to as parent object by calling edge::parent() member function.)

By default, edge constructor creates a substring that points to the entire input text by positioning nodes edge::first and edge::second at the beginning and at the end of the user input, respectively. Otherwise, exact node locations may be provided:

	void f()
	{
		using namespace cttl;
		input<> inp;
		edge<> edgeA( new_edge( inp ) );    // same as new_edge( inp, 0, inp.length() )
		edge<> edgeB( new_edge( inp, 0, 0 ) ); // position both nodes at zero offset
		assert( edgeA.first.offset() == 0 );
		assert( edgeA.second.offset() == inp.length() );
		assert( edgeB.first.offset() == 0 );
		assert( edgeB.second.offset() == 0 );
	}

To customize CTTL substring class for use as a parseable universe, the user can provide two template parameters when the edge is constructed. First template parameter determines white space sensitivity of the universe. The second parameter identifies the type of encapsulated string. For example, to parse wide character strings with the option of automatic white space support, we might choose the following templates:

	void f()
	{
		using namespace cttl;
		input< std::wstring > inp;
		// construct parseable substring which is
		// conscious of conventional white space:
		const_edge< policy_space<>, std::wstring > universe( new_edge( inp ) );
		//...
	}

For further information, please refer to the API documentation of CTTL edge classes.

Related pages: CTTL | Substrings | Edge

copying edges

Edge copy constructor manufactures an exact copy of the existing edge:

	void f( edge<> const& edge_ )
	{
		edge<> edgeA( edge_ );
		assert( edgeA.first.identity() == edge_.first.identity() );
		assert( edgeA.second.identity() == edge_.second.identity() );
	}

Related pages: CTTL | Substrings | Edge

edge assignment

Edge assignment matches behavior of the copy constructor. The assignment operator assigns new node identities from another edge, while positions of both edges remain unchanged:

	void f( edge<>& edgeA_, edge<>& edgeB_ )
	{
		edgeA_ = edgeB_;
		assert( edgeA_.first.identity() == edgeB_.first.identity() );
		assert( edgeA_.second.identity() == edgeB_.second.identity() );
	}

Related pages: CTTL | Substrings | Edge

edge swap

When edges are swapped by the std::swap() algorithm, the identities of boundary nodes are exchanged. (For details, see node_edge_swap.cpp sample code.)

Related pages: CTTL | Substrings | Edge

string-like behavior of the edge

Unlike copy constructor and assignment, edge comparison operators, ==, !=, <, >, <=, and >=, compare actual underlying substrings of the two edges. Two edges are equal if the underlying substrings are the same:

	void f( edge<> const& edgeA_, edge<> const& edgeB_ )
	{
		if ( edgeA_ == edgeB_ ) {
			assert( edgeA_.text() == edgeB_.text() );
			//...
		}
	}

An STL substring can be directly assigned to an edge object:

	void f( edge<>& edge_, const std::string& str_ )
	{
		edge_ = str_;
		assert( edge_.text() == str_ );
	}

The combination of copy, assignment, swap, and string-like less-than comparison behavior of cttl::edge class yield a mutable value type satisfying strict weak ordering requirement, as described in STL LessThan Comparable concept. Therefore, various non-mutating, mutating, and sorting algorithms can be successfully applied to a sequence of edge objects. For a complete example, see edge_sort.cpp sample program.

Related pages: CTTL | Substrings | Edge

adaptor behavior of the edge

The edge class overloads function call operator, edge::operator(), which takes a single argument, CTTL grammar expression. If grammar evaluation succeeds, the edge boundaries are set equal to the upper and lower boundaries of the matched text fragment. For example,

void f() {
	input<> inp( "123" );
	const_edge<> universe( new_edge( inp ) );
	edge<> data( new_edge( inp ) );
	size_t result = data( entity( isdigit ) ).match( universe );
	assert( data.text() == "123" );
}

The edge grammar expression adaptor provides access to the boundaries of the matched fragment of text corresponding to parsed symbols of the input language.

Related pages: CTTL | Substrings

substring example

The following program demonstrates deletion of the fragment "one" from the input "one two three" discussed earlier on this page. The parent input object manages logical positions of nodes and edges:

// sample program: input.cpp
// demonstrates concept of offset identities

//#define NDEBUG    // define to stop assertions from being compiled 
#include <cassert>
#include <iostream>
#include "cttl/cttl.h"

int main()
{
    using namespace cttl;

    // create input object with identity vector size == 6
    input<> inp( "one two three", 6 );

    // create nodes
    node<> n0( inp, 0 );
    node<> n1( inp, 1 );
    node<> n2( inp, 2 );
    node<> n3( inp, 3 );
    node<> n4( inp, 4 );
    node<> n5( inp, 5 );

    // create edges
    edge<> e01( n0, n1 );
    const_edge<> e23( n2, n3 );
    const_edge<> e45( n4, n5 );

    // pretend that we parsed the input:
    n0.offset( 0 );
    n1.offset( 3 );
    n2.offset( 4 );
    n3.offset( 7 );
    n4.offset( 8 );
    n5.offset( 13 );

    // verify that edges correctly point to the substrings:
    assert( e01.text() == "one" );
    assert( e23.text() == "two" );
    assert( e45.text() == "three" );

    // delete first substring:
    e01.text( "" );

    // verify new substrings:
    assert( e01.text() == "" );
    assert( e23.text() == "two" );
    assert( e45.text() == "three" );

    // verify new absolute offsets:
    assert( n0.offset() == 0 );
    assert( n1.offset() == 0 );
    assert( n2.offset() == 1 );
    assert( n3.offset() == 4 );
    assert( n4.offset() == 5 );
    assert( n5.offset() == 10 );

    // verify modified input text:
    assert( inp.text() == " two three" );
    return 0;
}

Related pages: CTTL | Substrings

See also:

Permission to copy, use, modify, sell and distribute this document is granted provided this copyright notice appears in all copies. This document is provided "as is" without express or implied warranty, and with no claim as to its suitability for any purpose.

Generated on Thu Nov 2 17:44:53 2006 for Common Text Transformation Library by

1.3.9.1