Main Page | Namespace List | Class Hierarchy | Alphabetical List | Class List | File List | Namespace Members | Class Members | File Members | Related Pages

Common Text Transformation Library

Version 2.08


SourceForge.net Logo     CTTL on    
    SourceForge    
    Download    
    Latest    
    Documentation    
    Index    
    Library    
    News    
    CVS    
    Repository    
   Other    
   Links    

introduction

Common Text Transformation Library, CTTL for short, is a set of C++ classes and functions to understand and modify text data. The library implementation is based on STL classes and algorithms.

Concept of a substring plays major role in design of the text transformation library. CTTL substring is an object that interacts with fragments of text encapsulated by STL std::basic_string template class. Template classes cttl::const_edge and cttl::edge, designed for constant and mutable data access, respectively, represent CTTL substrings. Substrings may be compared, inserted, deleted, or replaced across multiple text inputs. If content of text mutates, the substrings adjust their positions accordingly to the change. CTTL guarantees that substrings remain stable with respect to a potentially mutable text.

Within CTTL framework, a substring may be parsed with EBNF-like grammar. CTTL lexical analysis engine generates a stream of substrings corresponding to the parsed symbols. BNF and EBNF grammars can be written directly in C++. Template meta-programming and operator overloading offer features to write C++ expressions that describe grammar rules. No additional steps of parsing, compiling, or generating source code are required. Compiled CTTL program implements LL(INF)-parser, the recursive-descent parser with infinite lookahead.

latest release

Current CTTL release was made possible by support from our main sponsor, C-jump Factory. The company offers c-jump computer programming learning games for children and adults with no prior knowledge of computer programming. Besides that, web site makes available animated C++ Standard Library tutorial, which can be downloaded free of charge.

The latest version of CTTL adds support for lambda expressions, higher-order functions, and closures. Alltogether, this functionality allows to write powerful text transformations directly inside grammar rules. Lambda expressions bring familiar look and feel of a C++ expression to CTTL grammar, providing direct communication between parser, STL containers, iterators, and other components of the program. While core CTTL parsing facilities remain independent from the implementation of the new lambda engine, lambda expressions help to reduce the crowd of little functions with small amount of code in parser program.

license

Copyright © 1997-2006 Igor Kholodov mailto:cttl@users.sourceforge.net.

Permission to use, copy, modify, and distribute this software and its documentation under the terms of the GNU Lesser General Public License is hereby granted. No representations are made about the suitability of this software for any purpose. It is provided "as is" without express or implied warranty. See GNU Lesser General Public License for more details.

portability and debugging

CTTL architecture was designed with portability in mind. An effort was made to keep the library as small as possible. Essential tracing facility is provided for debugging grammar rules at run time. CTTL parser programs are LL(k) parsers, known to be easy to use and debug.

CTTL has been fully tested with the following compilers:

    GNU GCC        Versions 3.2.x, 3.3.x, 3.4.x
    MSVC 7.1       Version 13.10.3077 (2003)
    MSVC 8.0       Version 14.00.50727.42 (2005)
    Comeau C/C++   Version 4.3.3

thread safety

The library is STL-based and closely follows guidelines for STL thread safety. CTTL forbids any non-constant static data to be part of the library implementation. The only exception to this rule is CTTL tracing facility, which has no multithreading capability and is provided exclusively for programmer's workbench grammar debugging.

Owing to some known design problems related to the reference counted implementations of STL strings, straightforward read-only access to characters through character references and iterators obtained by operator[] calls may yield incorrect results. Please refer to the documentation of your version of STL before deciding when it is safe to access strings for simultaneous reading by multiple threads. CTTL implementation avoids expressions like s[1]==s[2], disallowed by the committee draft standard. The library never caches string iterators, making it easy to comply with STL string iterator invalidation rules.

If multiple threads attempt to use CTTL classes and functions for writing, the user is responsible for ensuring mutual exclusion of data access between multiple threads. The same restriction applies when more than one concurrent thread of execution attempts to parse text stored by the same instance of cttl::input object, because CTTL lexer implementation has to exchange data and stay in sync with the input state during grammar evaluation.

basic terminology

In CTTL documentation, the terms "text" or "input" represent instances of text data, potentially containing "strings of symbols" found in abstract alphabet of some formal language. The language can be characterized by its grammar, as a set of rewrite rules for strings that belong to the language.

For clarity, the term "string" will always refer exclusively to STL std::basic_string template class, or its instance. The term "offset" indicates an absolute offset from the beginning of the input character sequence. Absolute offsets specify physical locations of characters inside user input. A logical offset refers to either lower, or upper boundary of a particular input symbol. Logical offsets unambiguously describe positions associated with parsed symbols of the input language. To support logical offsets, the library provides identity vector facility, bound to the implementation of the user input object.

The library adopts graph terminology to work with texts, substrings, and offsets. From the graph theory, a graph is an object consisting of nodes and edges between pairs of nodes. The library grows on the idea that the text data is best described as a graph. The graph contains a set of vertices (nodes). The nodes indicate logical boundaries of the input symbols. An edge, by definition, is an object connecting two vertices; therefore, it fits description of a substring (terms edge and substring will be often used interchangeably).

grammar parsing

A program often expects input that contains sentences of some formal language. A set of grammar rules for the language in question can be specified. By definition, EBNF grammar rules may reference other rules recursively. One rule is usually designated to be the start rule, serving as an entry point for the lexical analysis of the input. Technically, any CTTL grammar rule can be used as the start rule.

C++ operator overloading and template meta-programming allow EBNF grammar rules to be written as C++ expressions. Individual grammar expressions define patterns of tokens of the input language. CTTL lexer engine is implemented as a set of functions holding expressions of the input grammar. The input is presented to the lexer, and while lexer consumes the input, it generates a stream of tokens. Token is a string of characters in the input language that is treated as a single unit during syntax and semantics analysis. For example, if input language is a programming language, string literal such as "hello, world!" is an example of a token. To describe combinations of tokens inside grammar expressions, the library provides a family of lexeme functions and overloaded operators.

During grammar expression evaluation, the tokens are either instantly discarded, or given out to the parser component for validation, storage, and other data-related tasks. Consequently, the parser receives and consumes the data generated by the lexer. Parser function that stores the data, or takes immediate processing step during lexical analysis is commonly known as a semantic action. The set of all semantic actions represents the parser component of the program. If a semantic action detects an error, it can signal error condition back to the lexer. Based on the structure of the grammar rule, the lexer may choose to proceed with alternative grammar evaluation path, or discontinue further input processing and return the error condition to the caller.

CTTL grammar expressions are compiled into instances of C++ template classes. Compiled program yields a an implementation of the recursive-descent LL(INF) parser.

For applications with low memory consumption requirements, the library supports creation of stream parsers based on incremental parsing algorithms.

mutable universe

Parseable universe is a substring of text that is presented to the lexer for semantic and lexical analysis. A typical read-only parser knows how to parse the user input, but does not anticipate any accidental changes in the input during grammar evaluation. In this situation, the input universe is expected to be constant, or immutable. The concept of immutable input helps to compile simpler, faster implementation, with minimal backtracking and good performance.

At the same time, an implementer of a conventional search-and-replace program could benefit from one-pass algorithm that modifies fragments of text instantly, while the grammar expressions are being evaluated. Besides straightforwardness of the one-pass algorithm, the parser would also be simpler, because it could avoid using potentially complicated data structure that stores temporary results. (Note that complexity of the data structure to accommodate parsed data will be proportional to the complexity of the grammar). Evidently, this kind of program might take advantage of less restrictive, mutable universe.

The library defines two C++ classes that can specify parseable universe:

    cttl::const_edge    // constant substring access
    cttl::edge          // mutable substring access

The library offers declarative, or compile-time, support for both constant and mutable universes via distinction between these two classes. C++ compiler generates an implementation that is dependent on the type of the universe. If text transformations must occur during grammar evaluation, the implementation of the parser should be compiled against mutable universe. At run time, such implementation keeps track of, and adjusts the lexer state, as the input changes take place. Consequently, the parser program compiled against mutable universe is free to modify the universe (i.e. replace, delete, or insert substrings) without restrictions at any time.

See also:

download

Common Text Transformation Library can be found on SourceForge.net. Library files can be accessed directly in CVS repository of the project, although files in the repository sometimes may be unstable.

CTTL distribution includes only C++ headers and sample source files. Latest release of the library can be downloaded as ZIP file archive. When distribution archive is unzipped with "recursive" option turned on, the relative directory structure will be as follows:

Table M-1 - CTTL distribution directory structure
Directory description
cttl20
 
Top directory containing readme file, license info, and sample main.cpp program.
 
cttl20/cttl
 
Directory containing all library headers.
 
cttl20/example
 
Sample programs.
 
cttl20/lambda
 
CTTL lambda library.
 
cttl20/test
 
Regression test programs.
 
cttl20/utils
 
Auxiliary headers used by offset2line.cpp and partition_file.cpp sample utilities.
For more information, see CTTL Utility Classes and Functions documentation.
 

This directory structure is not a requirement; the sample programs expect library headers to be available in relative subdirectory named "cttl", for example,

	#include "cttl/cttl.h"

If you plan to use different directory structure, please modify the include paths of the sample programs that you wish to compile.

The library does not require any specific compiler switches or settings; with GNU compiler, your should be able to compile and link your CTTL programs by commands like

> g++ -g -Wall -c -o main.o main.cpp
> g++ -o sample.exe main.o    

sample programs


Table M-2 - CTTL sample programs
Source file description
 xml_parser.cpp NEW!
 
Example of CTTL incremental parser. Program implements XML stream parser, which reads data from the file stream. Demonstrates stateful policy classes that load data dynamically.
 
 gumus.cpp NEW!
 
Gumus utility is a processor of a small scripting language to transform prefabricated units of free text into pieces of C++ code of stream output that is capable of reproducing the original text. This utility is useful when there is a need to generate code instead of writing it by hand.
 
 csv2xml.cpp NEW!
 
Utility to convert input file in comma separated value (CSV) format into its XML equivalent, assuming the first line contains the column headers.
 
 text2array.cpp NEW!
 
Utility to convert input text file to an array of string literals.
 
 arithmetic_interpreter.cpp NEW!
 
Arithmetic expression parser and interpreter. Demonstrates grammar results being stored inside a syntax tree.
 
 arithmetics.cpp 
 
Arithmetic expression parser. Converts arithmetic expression into sequence of pseudo-assembly instructions. For more information, see grammar tutorial.
 
 arithmetics_traced.cpp 
 
Modified version of the arithmetic expression parser. Introduces grammar rule tracing facilities of the library. For more information, see grammar debugging and tracing.
 
 action_handler.cpp 
 
Demonstrates semantic action implemented as member function template.
 
 client_regions.cpp 
 
Demonstrates policy_space< flag_follow_region >, a white space policy that relies on pre-defined regions of the input text.
 
 cpp_comment_strip.cpp 
 
Demonstrates white space policy based on pre-defined regions of input. The program takes C/C++ or java program(s) as an input and strips out comments.
 
 edge_functors.cpp 
 
Demonstrates cttl::edge_replace function object class. The program replaces words (substrings) inside input text.
 
 edge_sort.cpp 
 
Demonstrates string-like behavior of CTTL edge objects. The program applies STL sort algorithm to a vector of edges.
 
 follow_whitespace.cpp 
 
Implements user-defined white space policy.
 
 functor.cpp 
 
Program defining a function object that implements grammar production rule.
 
 grammar.cpp 
 
Demonstrates simple stateless lexer and parser. The program finds and displays types of character entities matching user input text.
 
 hello_world.cpp 
 
Demonstrates deletion of CTTL substring from the user input object. Logical positions of the input nodes and edges are preserved.
 
 input.cpp 
 
Rudimentary program that demonstrates concept of offset identity, the CTTL equivalent of a logical offset.
 
 java_lexer.cpp 
 
Working version of a bare lexer for the Java programming language. Feel free to modify for your own Java-related needs. This is a work in progress. Java grammar was taken "as is", directly from Sun web site, http://java.sun.com/docs/books/jls/second_edition/html/syntax.doc.html. Grammar optimizations soon to come!
 
 line_navigate.cpp 
 
Simple program that uses CTTL node objects to navigate the lines inside user input text.
 
 line_parser.cpp 
 
Demonstrates text line parser which uses asymmetric quote, quote(true,RM,RR), to parse individual lines inside user input.
 
 mutable_universe.cpp 
 
An example of CTTL mutable universe. "Search and replace" changes are made while input text is being parsed.
 
 node_edge_swap.cpp 
 
Demonstrates exchange of the node identities when two nodes are swapped by the std::swap() algorithm. In a similar way, identities of edge boundary nodes are exchanged when two edge objects are swapped.
 
 node_find_class.cpp 
 
Shows results of manual node navigation via cttl::node::find_class() and cttl::node::rfind_class() calls. Position of the node is moved around boundaries of the nearby character class of the user input.
 
 node_functors.cpp 
 
Demonstrates application of cttl::node_insert_go() and cttl::node_insert_stay() functors.
 
 number2words.cpp 
 
Another example that demonstrates mutability of the parseable universe. Program replaces numbers with their fancy names, such as "quadrillion".
 
 offset2line.cpp 
 
A utility that translates absolute offsets to the actual line numbers in the input file. The program may be helpful to understand CTTL trace file output (example), created during grammar debugging.
 
 partition_file.cpp 
 
A utility to divide a large text file, such as trace output, into fractions of 100MB-chunks of text.
 
 rules.cpp 
 
Sample program implementing grammar expressions to parse fractional numbers.
 
 runtime_match.cpp 
 
Program demonstrating application of runtime_match() grammar evaluation method, employed to support subrule searches.
 
 search_replace.cpp 
 inline_replace.cpp NEW!
 
Utility removes html tags from the input file specified on the command line, and prints modified text on the screen. Program demonstrates search-and-replace parser in action. The parser compiles against mutable universe, and makes changes to the underlying text while parsing the input.
 
 string_set_lexeme.cpp 
 
Demonstrates begin(std::set<std::string>&), the string set lexeme.
 
 word_count.cpp 
 
This sample demonstrates semantic action functions organized within a base parser class.
 
 yellow_taxi.cpp 
 
Demonstrates repeatable search grammar evaluation method.
 
 Z.cpp 
 
A tiny program that matches a single character inside user input. Demonstrates essentials of CTTL grammar evaluation statements.
 



Copyright © 1997-2006 Igor Kholodov mailto:cttl@users.sourceforge.net.

Permission to copy, use, modify, sell and distribute this document is granted provided this copyright notice appears in all copies. This document is provided "as is" without express or implied warranty, and with no claim as to its suitability for any purpose.


Generated on Thu Nov 2 17:44:05 2006 for Common Text Transformation Library by  doxygen 1.3.9.1