|
|
CTTL on
SourceForge |
Download
Latest |
Documentation
Index |
Library
News |
CVS
Repository |
Other
Links |
Common Text Transformation Library, CTTL for short, is a set of C++ classes and functions to understand and modify text data. The library implementation is based on STL classes and algorithms.
Concept of a substring plays major role in design of the text transformation library. CTTL substring is an object that interacts with fragments of text encapsulated by STL std::basic_string template class. Template classes cttl::const_edge and cttl::edge, designed for constant and mutable data access, respectively, represent CTTL substrings. Substrings may be compared, inserted, deleted, or replaced across multiple text inputs. If content of text mutates, the substrings adjust their positions accordingly to the change. CTTL guarantees that substrings remain stable with respect to a potentially mutable text.
Within CTTL framework, a substring may be parsed with EBNF-like grammar. CTTL lexical analysis engine generates a stream of substrings corresponding to the parsed symbols. BNF and EBNF grammars can be written directly in C++. Template meta-programming and operator overloading offer features to write C++ expressions that describe grammar rules. No additional steps of parsing, compiling, or generating source code are required. Compiled CTTL program implements LL(INF)-parser, the recursive-descent parser with infinite lookahead.
Current CTTL release was made possible by support from our main sponsor, C-jump Factory. The company offers c-jump computer programming learning games for children and adults with no prior knowledge of computer programming. Besides that, web site makes available animated C++ Standard Library tutorial, which can be downloaded free of charge.
The latest version of CTTL adds support for lambda expressions, higher-order functions, and closures. Alltogether, this functionality allows to write powerful text transformations directly inside grammar rules. Lambda expressions bring familiar look and feel of a C++ expression to CTTL grammar, providing direct communication between parser, STL containers, iterators, and other components of the program. While core CTTL parsing facilities remain independent from the implementation of the new lambda engine, lambda expressions help to reduce the crowd of little functions with small amount of code in parser program.
Copyright © 1997-2006 Igor Kholodov mailto:cttl@users.sourceforge.net.
Permission to use, copy, modify, and distribute this software and its documentation under the terms of the GNU Lesser General Public License is hereby granted. No representations are made about the suitability of this software for any purpose. It is provided "as is" without express or implied warranty. See GNU Lesser General Public License for more details.
CTTL architecture was designed with portability in mind.
An effort was made to keep the library as small as possible. Essential
tracing facility
is provided for debugging grammar rules at run time.
CTTL parser programs are LL(k) parsers, known
to be easy to use and debug.
CTTL has been fully tested with the following compilers:
GNU GCC Versions 3.2.x, 3.3.x, 3.4.x
MSVC 7.1 Version 13.10.3077 (2003)
MSVC 8.0 Version 14.00.50727.42 (2005)
Comeau C/C++ Version 4.3.3
The library is STL-based and closely follows guidelines for STL thread safety. CTTL forbids any non-constant static data to be part of the library implementation. The only exception to this rule is CTTL tracing facility, which has no multithreading capability and is provided exclusively for programmer's workbench grammar debugging.
Owing to some known design problems related to the reference counted implementations of STL strings, straightforward read-only access to characters through character references and iterators obtained by operator[] calls may yield incorrect results. Please refer to the documentation of your version of STL before deciding when it is safe to access strings for simultaneous reading by multiple threads. CTTL implementation avoids expressions like s[1]==s[2], disallowed by the committee draft standard. The library never caches string iterators, making it easy to comply with STL string iterator invalidation rules.
If multiple threads attempt to use CTTL classes and functions for writing, the user is responsible for ensuring mutual exclusion of data access between multiple threads. The same restriction applies when more than one concurrent thread of execution attempts to parse text stored by the same instance of cttl::input object, because CTTL lexer implementation has to exchange data and stay in sync with the input state during grammar evaluation.
In CTTL documentation, the terms "text" or "input" represent instances of text data, potentially containing "strings of symbols" found in abstract alphabet of some formal language. The language can be characterized by its grammar, as a set of rewrite rules for strings that belong to the language.
For clarity, the term "string" will always refer exclusively to STL std::basic_string template class, or its instance. The term "offset" indicates an absolute offset from the beginning of the input character sequence. Absolute offsets specify physical locations of characters inside user input. A logical offset refers to either lower, or upper boundary of a particular input symbol. Logical offsets unambiguously describe positions associated with parsed symbols of the input language. To support logical offsets, the library provides identity vector facility, bound to the implementation of the user input object.
The library adopts graph terminology to work with texts, substrings, and offsets. From the graph theory, a graph is an object consisting of nodes and edges between pairs of nodes. The library grows on the idea that the text data is best described as a graph. The graph contains a set of vertices (nodes). The nodes indicate logical boundaries of the input symbols. An edge, by definition, is an object connecting two vertices; therefore, it fits description of a substring (terms edge and substring will be often used interchangeably).
A program often expects input that contains sentences of some formal language. A set of grammar rules for the language in question can be specified. By definition, EBNF grammar rules may reference other rules recursively. One rule is usually designated to be the start rule, serving as an entry point for the lexical analysis of the input. Technically, any CTTL grammar rule can be used as the start rule.
C++ operator overloading and template meta-programming allow EBNF grammar rules to be written as C++ expressions. Individual grammar expressions define patterns of tokens of the input language. CTTL lexer engine is implemented as a set of functions holding expressions of the input grammar. The input is presented to the lexer, and while lexer consumes the input, it generates a stream of tokens. Token is a string of characters in the input language that is treated as a single unit during syntax and semantics analysis. For example, if input language is a programming language, string literal such as "hello, world!" is an example of a token. To describe combinations of tokens inside grammar expressions, the library provides a family of lexeme functions and overloaded operators.
During grammar expression evaluation, the tokens are either instantly discarded, or given out to the parser component for validation, storage, and other data-related tasks. Consequently, the parser receives and consumes the data generated by the lexer. Parser function that stores the data, or takes immediate processing step during lexical analysis is commonly known as a semantic action. The set of all semantic actions represents the parser component of the program. If a semantic action detects an error, it can signal error condition back to the lexer. Based on the structure of the grammar rule, the lexer may choose to proceed with alternative grammar evaluation path, or discontinue further input processing and return the error condition to the caller.
CTTL grammar expressions are compiled into instances of C++ template classes. Compiled program yields a an implementation of the recursive-descent LL(INF) parser.
For applications with low memory consumption requirements, the library supports creation of stream parsers based on incremental parsing algorithms.
Parseable universe is a substring of text that is presented to the lexer for semantic and lexical analysis. A typical read-only parser knows how to parse the user input, but does not anticipate any accidental changes in the input during grammar evaluation. In this situation, the input universe is expected to be constant, or immutable. The concept of immutable input helps to compile simpler, faster implementation, with minimal backtracking and good performance.
At the same time, an implementer of a conventional search-and-replace program could benefit from one-pass algorithm that modifies fragments of text instantly, while the grammar expressions are being evaluated. Besides straightforwardness of the one-pass algorithm, the parser would also be simpler, because it could avoid using potentially complicated data structure that stores temporary results. (Note that complexity of the data structure to accommodate parsed data will be proportional to the complexity of the grammar). Evidently, this kind of program might take advantage of less restrictive, mutable universe.
The library defines two C++ classes that can specify parseable universe:
cttl::const_edge // constant substring access cttl::edge // mutable substring access
The library offers declarative, or compile-time, support for both constant and mutable universes via distinction between these two classes. C++ compiler generates an implementation that is dependent on the type of the universe. If text transformations must occur during grammar evaluation, the implementation of the parser should be compiled against mutable universe. At run time, such implementation keeps track of, and adjusts the lexer state, as the input changes take place. Consequently, the parser program compiled against mutable universe is free to modify the universe (i.e. replace, delete, or insert substrings) without restrictions at any time.
See also:Common Text Transformation Library can be found on SourceForge.net. Library files can be accessed directly in CVS repository of the project, although files in the repository sometimes may be unstable.
CTTL distribution includes only C++ headers and sample source files. Latest release of the library can be downloaded as ZIP file archive. When distribution archive is unzipped with "recursive" option turned on, the relative directory structure will be as follows:
Table M-1 - CTTL distribution directory structure| Directory | description |
|---|---|
|
cttl20
|
Top directory containing readme file, license info,
and sample main.cpp program.
|
|
cttl20/cttl
|
Directory containing all library headers.
|
|
cttl20/example
|
Sample programs.
|
|
cttl20/lambda
|
CTTL lambda library.
|
|
cttl20/test
|
Regression test programs.
|
|
cttl20/utils
|
Auxiliary headers used by offset2line.cpp
and partition_file.cpp sample utilities.
For more information, see CTTL Utility Classes and Functions documentation. |
This directory structure is not a requirement; the sample programs expect library headers to be available in relative subdirectory named "cttl", for example,
#include "cttl/cttl.h"
If you plan to use different directory structure, please modify the include paths of the sample programs that you wish to compile.
The library does not require any specific compiler switches or settings; with GNU compiler, your should be able to compile and link your CTTL programs by commands like
> g++ -g -Wall -c -o main.o main.cpp > g++ -o sample.exe main.o
| Source file | description |
|---|---|
|
xml_parser.cpp NEW!
|
Example of CTTL incremental parser. Program implements
XML stream parser,
which reads data from the file stream.
Demonstrates stateful policy
classes that load data dynamically.
|
|
gumus.cpp NEW!
|
Gumus utility is a processor of a small scripting language to transform
prefabricated units of free text into pieces of C++ code
of stream output that is capable of reproducing the
original text. This utility is useful when there is a need to generate code
instead of writing it by hand.
|
|
csv2xml.cpp NEW!
|
Utility to convert input file in comma separated value (CSV) format into its
XML equivalent, assuming the first line contains the column headers.
|
|
text2array.cpp NEW!
|
Utility to convert input text file to an array of string literals.
|
|
arithmetic_interpreter.cpp NEW!
|
Arithmetic expression parser and interpreter.
Demonstrates grammar results being stored inside a syntax tree.
|
|
arithmetics.cpp
|
Arithmetic expression parser. Converts arithmetic expression
into sequence of pseudo-assembly instructions.
For more information, see grammar tutorial.
|
|
arithmetics_traced.cpp
|
Modified version of the arithmetic expression parser.
Introduces grammar rule tracing facilities of the library.
For more information, see
grammar debugging and tracing.
|
|
action_handler.cpp
|
Demonstrates semantic action implemented as member function template.
|
|
client_regions.cpp
|
Demonstrates
policy_space< flag_follow_region >,
a white space policy
that relies on pre-defined regions of the input text.
|
|
cpp_comment_strip.cpp
|
Demonstrates
white space policy
based on pre-defined regions of input. The program takes C/C++
or java program(s) as an input and strips out comments.
|
|
edge_functors.cpp
|
Demonstrates cttl::edge_replace function object class.
The program
replaces words (substrings)
inside input text.
|
|
edge_sort.cpp
|
Demonstrates
string-like behavior
of CTTL
edge objects.
The program applies STL
sort
algorithm to a vector of edges.
|
|
follow_whitespace.cpp
|
Implements user-defined
white space policy.
|
|
functor.cpp
|
Program defining a function object that implements
grammar production rule.
|
|
grammar.cpp
|
Demonstrates simple stateless lexer and parser.
The program finds and displays types of
character entities
matching user input text.
|
|
hello_world.cpp
|
Demonstrates deletion of
CTTL substring
from the
user input object.
Logical positions of the input
nodes
and
edges
are preserved.
|
|
input.cpp
|
Rudimentary program that demonstrates concept of
offset identity,
the CTTL equivalent of a logical offset.
|
|
java_lexer.cpp
|
Working version of a bare lexer for the Java programming language.
Feel free to modify for your own Java-related needs.
This is a work in progress. Java grammar was taken "as is",
directly from Sun web site,
http://java.sun.com/docs/books/jls/second_edition/html/syntax.doc.html.
Grammar optimizations soon to come!
|
|
line_navigate.cpp
|
Simple program that uses CTTL
node objects
to navigate the lines inside user input text.
|
|
line_parser.cpp
|
Demonstrates text line parser which uses
asymmetric quote,
quote(true,RM,RR), to parse individual lines
inside user input.
|
|
mutable_universe.cpp
|
An example of CTTL
mutable universe.
"Search and replace" changes are made while input text is being parsed.
|
|
node_edge_swap.cpp
|
Demonstrates exchange of the
node
identities when two nodes are swapped
by the std::swap() algorithm.
In a similar way, identities of
edge
boundary nodes are exchanged when two edge objects are
swapped.
|
|
node_find_class.cpp
|
Shows results of manual
node navigation
via cttl::node::find_class() and cttl::node::rfind_class()
calls. Position of the
node
is moved around boundaries of the nearby character class
of the user input.
|
|
node_functors.cpp
|
Demonstrates application of cttl::node_insert_go()
and cttl::node_insert_stay() functors.
|
|
number2words.cpp
|
Another example that demonstrates
mutability
of the
parseable universe.
Program replaces numbers with their fancy names, such as
"quadrillion".
|
|
offset2line.cpp
|
A utility that translates absolute offsets
to the actual line numbers in the input file.
The program may be helpful to understand CTTL trace file output
(example),
created during grammar
debugging.
|
|
partition_file.cpp
|
A utility to divide a large text file,
such as trace output,
into fractions of 100MB-chunks of text.
|
|
rules.cpp
|
Sample program implementing
grammar expressions
to parse fractional numbers.
|
|
runtime_match.cpp
|
Program demonstrating application of
runtime_match()
grammar evaluation method,
employed to support subrule
searches.
|
|
search_replace.cpp
inline_replace.cpp NEW! |
Utility removes html tags from the input file specified
on the command line, and prints
modified text on the screen. Program demonstrates
search-and-replace parser
in action. The parser compiles against
mutable universe,
and makes changes to the underlying text
while
parsing
the
input.
|
|
string_set_lexeme.cpp
|
Demonstrates
begin(std::set<std::string>&),
the string set
lexeme.
|
|
word_count.cpp
|
This sample demonstrates
semantic action
functions organized within a base parser class.
|
|
yellow_taxi.cpp
|
Demonstrates
repeatable search
grammar evaluation method.
|
|
Z.cpp
|
A tiny program that matches a single character inside user input.
Demonstrates essentials of CTTL
grammar evaluation statements.
|
Permission to copy, use, modify, sell and distribute this document is granted provided this copyright notice appears in all copies. This document is provided "as is" without express or implied warranty, and with no claim as to its suitability for any purpose.
1.3.9.1