grammar reference

CTTL on
SourceForge

Download
Latest

Documentation
Index

Library
News

CVS
Repository

lexemes

Lexeme is a single terminal unit in the syntax of the input language. When lexeme is recognized as part of the grammar, corresponding fragment of the input text is returned from the lexical analysis engine back to the parser program. Such fragment is commonly referred to as token. Thus, token is an actual representation of the lexeme in the input language grammar.

Text transformation library provides a family of functions to delineate lexemes. Lexeme functions may have one or no arguments. C++ types of the lexeme arguments are listed in Table R.2.

There are five basic types of lexeme functions:

Table R.1 - Lexeme types

Lexeme function	Description
`cttl::symbol()`	Usually refers to a specific terminal symbol in the abstract alphabet of the input language.
`cttl::entity()`	Describes a particular class of symbols (i.e. permutation of symbols) in the input language.
`cttl::begin()`	Describes upper (left-hand-side) boundary of a character entity.
`cttl::end()`	Describes lower (right-hand-side) boundary of a character entity.
`cttl::first()`	Refers to the first input character that belongs to a character entity.

Lexemes symbol(), entity(), and first() specify non-empty entities of characters. Lexemes begin() and end() describe boundaries of particular entities, therefore, they return tokens of zero length.

Table R.2 - Lexeme functions

C++ argument type	symbol	entity	begin	end	first
`None`	`symbol()` matches a single character.	`entity()` matches the universe.	`begin()` matches beginning of the user input.	`end()` matches ending of the universe.
`bool`	`symbol(true)` always succeeds unless at the end of the universe. `symbol(false)` always fails.		`begin(true)` always succeeds. `begin(false)` always fails.
std::set<std::string>& -or- std::set<std::wstring>&			`begin(&strset)` matches parseable universe against set of strings.
`int` `char` `wchar_t`	`symbol(0x20)` `symbol('\x20')` `symbol(L'\x20')` matches single character.	`entity(0x20)` `entity('\x20')` `entity(L'\x20')` matches single character.	`begin(65)` `begin('A')` `begin(L'A')` matches upper boundary of a single character.	`end(9)` `end('\t')` `end(L'\t')` matches lower boundary of a single character.
int(pf)(int) -or- int(pf)(wint_t)		`entity(isalpha)` `entity(iswalpha)` matches character entity.	`begin(isalpha)` `begin(iswalpha)` matches upper boundary of the character entity.	`end(isalpha)` `end(iswalpha)` matches lower boundary of the character entity.	`first(isalpha)` `first(iswalpha)` matches first character of the character entity.
int(pf)(int) -or- int(pf)(wint_t)	The following ANSI C character classification routines can be used with lexeme functions `entity(), begin(), end()`, and `first()`: `isalnum, iswalnum......`alphanumeric characters `isalpha, iswalpha......`alphabetic characters `iscntrl, iswcntrl......`control characters `isdigit, iswdigit......`decimal digits `isgraph, iswgraph......`printable other than space `islower, iswlower......`lowercase characters `isprint, iswprint......`printable characters `ispunct, iswpunct......`punctuation characters `isspace, iswspace......`white-space characters `isupper, iswupper......`uppercase characters `isxdigit, iswxdigit....`hexadecimal digits
`char` `wchar_t` `std::string&` `std::wstring&` `std::string` `std::wstring`	`symbol("ABC")` `symbol(L"ABC")` `symbol(&str)` matches exact text.	`entity("ABC")` `entity(L"ABC")` `entity(&str)` matches user-defined character entity.	`begin("ABC")` `begin(L"ABC")` `begin(&str)` matches upper boundary of the user-defined character entity.	`end("ABC")` `end(L"ABC")` `end(&str)` matches lower boundary of the user-defined character entity.	`first("ABC")` `first(L"ABC")` `first(&str)` matches first character of the user-defined character entity.

lexeme `symbol()`

Lexeme cttl::symbol() with no arguments matches a single character in the user input.

Implementation of the symbol() lexeme conforms to the white space sensitivity rules of the parseable universe. The implementation skips white space symbols defined by the space policy before symbol() is evaluated.

Search operator has no effect on the symbol() lexeme, that is, expressions

	symbol()
	!symbol()

are equivalent.

Related pages: CTTL | Grammar evaluation | Grammar reference | Lexeme functions

lexeme `entity()`

Lexeme cttl::entity() matches fragment of text defined by the current boundaries of the universe. In other words, the entity() lexeme matches the entire universe.

Implementation of the entity() lexeme ignores white space policy of the universe. Regardless of the policy, the entity() lexeme is evaluated immediately at the current position of the universe upper boundary.

The entity() evaluation succeeds against empty universe; that is, if length of the universe is zero, the function returns successful result (similar behavior exhibited by begin(true) lexeme.)

Internally, the entity() is defined as a metaphor: it is equivalent to the expression

	begin( true ) + !end()

Search operator has no effect on the entity() lexeme, that is, expressions

	entity()
	!entity()

are equivalent.

Related pages: CTTL | Grammar evaluation | Grammar reference | Lexeme functions

lexeme `begin()`

The begin() lexeme matches beginning of the user input. If evaluation of begin() succeeds, it always returns zero offset.

Implementation of the begin() lexeme ignores white space policy of the universe. Regardless of the policy, the begin() lexeme is evaluated immediately at the current position of the universe upper boundary.

Search operator forces upper boundary of the universe to move to the beginning-of-file location (the position with absolute offset zero.)

Related pages: CTTL | Grammar evaluation | Grammar reference | Lexeme functions

lexeme `end()`

Matches lower boundary of the universe. The lexeme assumes that the end of the universe is always a valid position. If end() lexeme succeeds, the universe is empty, and the result of evaluation is always the offset of the universe lower boundary.

When applied to the end() lexeme, search operators, as well as search evaluation methods, always succeed. When used with end(), these evaluations force the upper boundary of the universe to move to the position pointed by the lower boundary, resulting in empty universe.

Implementation of the end() lexeme conforms to the white space sensitivity rules of the parseable universe. The implementation skips white space symbols defined by the space policy before end() is evaluated.

Related pages: CTTL | Grammar evaluation | Grammar reference | Lexeme functions

lexeme `symbol(true)`

Lexeme cttl::symbol(true) matches an empty symbol in the input language (sometimes called epsilon symbol, or empty string.) The match evaluation of symbol(true) lexeme always succeeds, unless upper boundary of the universe points to the lower boundary (indicating that the universe is empty.) Therefore, symbol(true) imposes the rule that at least one character should be available from the input pointed by the universe. The size of the token returned by successful symbol(true) evaluation is always zero.

The symbol(true) lexeme makes it possible to enter strict evaluation modes by using construct

	symbol( true ) ^ R

where R is a valid CTTL grammar expression. Overloaded binary concatenation operators allow short hand notation when used with symbol(true) lexeme, for example,

	true ^ entity( isdigit )

Implementation of the symbol(true) lexeme ignores white space policy of the universe. Regardless of the policy, the symbol(true) lexeme is evaluated immediately at the current position of the universe upper boundary.

Search operator makes symbol(true) behave like begin(true) lexeme. Therefore, expressions

	!symbol( true )
	begin( true )

are equivalent: they always succeed.

Related pages: CTTL | Grammar evaluation | Grammar reference | Lexeme functions

lexeme `symbol(false)`

This lexeme always fails, this is a stop symbol.

Related pages: CTTL | Grammar evaluation | Grammar reference | Lexeme functions

lexeme `begin(true)`

Evaluation of cttl::begin(true) lexeme always succeeds. Lexeme cttl::symbol(true) matches an empty symbol in the input language (sometimes called epsilon symbol.) The size of the token returned by the successful begin(true) evaluation is always zero.

Implementation of the begin(true) lexeme ignores white space policy of the universe. Regardless of the policy, the begin(true) lexeme is evaluated immediately at the current position of the universe upper boundary.

Search operator has no effect on the begin(true) lexeme, that is, expressions

	begin(true)
	!begin(true)

are equivalent.

Related pages: CTTL | Grammar evaluation | Grammar reference | Lexeme functions

lexeme `begin(false)`

This lexeme always fails, this is a stop symbol (same behavior is exhibited by symbol(false) lexeme.)

Related pages: CTTL | Grammar evaluation | Grammar reference | Lexeme functions

lexemes `begin(std::set<std::string>&)` and `begin(std::set<std::wstring>&)`

String set lexemes support only match() evaluation for user-defined sets of strings (see std::set). The substring pointed by the current universe is compared against the set of strings. Based on positive lookahead algorithm, the string set lexeme does not consume any characters from the input universe.

Evaluation succeeds if there is an exact match. For a complete example of string set lexeme, see string_set_lexeme.cpp sample program.

The implementation of the string set lexeme is reference-based, it stores only reference to the std::set object provided by the user. Therefore, the set object must stay in scope while string set lexeme remains in use.

String sets ignore white space policy of the universe. Regardless of the policy, the string sets are evaluated immediately at the current position of the universe upper boundary.

Since string set lexeme supports only match() evaluation, expressions with search operator in front of the string set will not compile. For example, the following expression is illegal:

	!begin( str_set ) // will not compile

where str_set is an object of type std::set<std::string> or std::set<std::wstring>.

Related pages: CTTL | Grammar evaluation | Grammar reference | Lexeme functions

lexemes `symbol(int)`, `symbol(char)`, and `symbol(wchar_t)`

These lexemes match a particular character from the input character sequence.

Implementation of the symbol(char) lexeme conforms to the white space sensitivity rules of the parseable universe. The implementation skips white space symbols defined by the space policy before symbol(char) is evaluated.

Search operator initiates search for the specified character.

Related pages: CTTL | Grammar evaluation | Grammar reference | Lexeme functions

lexemes `entity(int)`, `entity(char)`, and `entity(wchar_t)`

These lexemes behave exactly as symbol(int), symbol(char), and symbol(wchar_t).

Related pages: CTTL | Grammar evaluation | Grammar reference | Lexeme functions

lexemes `begin(int)`, `begin(char)`, and `begin(wchar_t)`

These lexemes match upper boundary of the single character from the input character sequence.

Implementation of the begin(char) lexeme conforms to the white space sensitivity rules of the parseable universe. The implementation skips white space symbols defined by the space policy before begin(char) is evaluated.

Search operator initiates search for the specified character.

Related pages: CTTL | Grammar evaluation | Grammar reference | Lexeme functions

lexemes `end(int)`, `end(char)`, and `end(wchar_t)`

These lexemes match lower boundary of the single character from the input character sequence.

Implementation of the end(char) lexeme conforms to the white space sensitivity rules of the parseable universe. The implementation skips white space symbols defined by the space policy before end(char) is evaluated.

Search operator initiates search for the specified character.

Related pages: CTTL | Grammar evaluation | Grammar reference | Lexeme functions

lexemes `entity(is...)`

An entity lexeme matches the ANSI C multi-character entity. The argument specifies ANSI C character classification routine, such as

   isalnum, iswalnum......alphanumeric characters
   isalpha, iswalpha......alphabetic characters
   iscntrl, iswcntrl......control characters
   isdigit, iswdigit......decimal digits
   isgraph, iswgraph......printable other than space
   islower, iswlower......lowercase characters
   isprint, iswprint......printable characters
   ispunct, iswpunct......punctuation characters
   isspace, iswspace......white-space characters
   isupper, iswupper......uppercase characters
   isxdigit, iswxdigit....hexadecimal digits

For example, expression entity(isaplha) matches alphabetic substrings like "ABC", "qwerty", etc. In addition to ANSI C routines, the user may provide their own functions matching the prototypes

	int f( int );		// char classification
	int wf( wint_t );	// wchar_t classification

Implementation of entity lexemes conforms to the white space sensitivity rules of the parseable universe. The implementation skips white space symbols defined by the space policy before entity lexemes are evaluated.

Search operator initiates search for the specified character entity.

Related pages: CTTL | Grammar evaluation | Grammar reference | Lexeme functions

lexeme `begin(is...)`

Lexeme begin(is...) matches the upper boundary of the ANSI C multi-character entity. The rest of the behavior is the same as entity(is...).

Related pages: CTTL | Grammar evaluation | Grammar reference | Lexeme functions

lexeme `end(is...)`

Lexeme end(is...) matches the lower boundary of the ANSI C multi-character entity. The rest of the behavior is the same as entity(is...).

Related pages: CTTL | Grammar evaluation | Grammar reference | Lexeme functions

lexeme `first(is...)`

Lexeme first(is...) matches first character of the ANSI C multi-character entity. The rest of the behavior is the same as entity(is...).

Related pages: CTTL | Grammar evaluation | Grammar reference | Lexeme functions

lexemes `symbol(char)`, `symbol(wchar_t )`,
`symbol(std::string&)`, `symbol(std:::wstring&)`,
`symbol(std::string)`, and `symbol(std:::wstring)`

Family of terminal symbol lexemes, symbol(char*), etc. matches the exact text provided by the user.

Implementation of symbol lexemes conforms to the white space sensitivity rules of the parseable universe. The implementation skips white space symbols defined by the space policy before symbol lexemes are evaluated.

Search operator initiates search for the specified symbol.

Lexeme functions

	symbol(char*)
	symbol(wchar_t* )
	symbol(std::string&)
	symbol(std:::wstring&)

instantiate private std::string (or std::wstring) member object to store the copy of the text fragment to look for. For example, temporary object std::string("ABC") is constructed each time thread of execution runs through expression "symbol("ABC")":

	static size_t r( const_edge<>& edge_ )
	{
		return symbol( "ABC" ).match( edge_ );
	}

For many applications, instance implementation can be inefficient. Alternatively, string pointer can specify searchable text:

	static size_t r( const_edge<>& edge_ )
	{
		static const std::string ABC( "ABC" );
		return symbol( &ABC ).match( edge_ );
	}

In this case, reference implementation of symbol(std::string*) lexeme will be used when the code is compiled. The reference implementation stores only reference to the specified string, which is expected to be in scope while lexeme symbol(std::string*) is invoked.

Using token string pointers offers advantage of reusing string objects constructed once the program is loaded into memory. This yields faster implementation, plus the ability to centralize the definitions of all symbol constants (i.e. the common alphabet) that is shared by the grammar productions.

Related pages: CTTL | Grammar evaluation | Grammar reference | Lexeme functions

lexemes `entity(char)`, `entity(wchar_t )`,
`entity(std::string&)`, `entity(std:::wstring&)`,
`entity(std::string)`, and `entity(std:::wstring)`,

The family of entity lexemes matches user-defined multi-character entities. For example, grammar expression entity("abc") matches permutations of strings "a", "b", "c", "aa", "ab", "ac", "ad", and so on.

Search operator initiates search for the specified character entity.

Similar to symbol(std::string*) lexeme, implementations of entity(std::string*) and entity(std:::wstring*) lexemes hold only a reference to the specified character entity string. Other lexemes in this family create their own private copy of the entity string.

Related pages: CTTL | Grammar evaluation | Grammar reference | Lexeme functions

lexemes `begin(char)`, `begin(wchar_t )`,
`begin(std::string&)`, `begin(std:::wstring&)`,
`begin(std::string)`, and `begin(std:::wstring)`

Lexeme family begin(char*), etc. matches the upper boundary of the user-defined multi-character entity.

Implementation of boundary lexemes conforms to the white space sensitivity rules of the parseable universe. The implementation skips white space symbols defined by the space policy before boundary lexemes are evaluated.

Search operator initiates search for the specified character entity.

Similar to symbol(std::string*) lexeme, implementations of begin(std::string*) and begin(std:::wstring*) lexemes hold only a reference to the specified character entity string. Other lexemes in this family create their own private copy of the entity string.

Note that expresison

	begin( "+=" )

is not a lookahead lookup for text "+=", but rather a multi-character entity lookup. Expression "begin( "+=" )" matches permutations of characters '+' and '=', therefore, it will match the beginning of any string that begins with '+' and '='. If grammar expression needs a lookahead lookup for a particular text, it should use expression

	universe_.first( symbol( "+=" ) )

instead.

Related pages: CTTL | Grammar evaluation | Grammar reference | Lexeme functions

lexemes `end(char)`, `end(wchar_t )`,
`end(std::string&)`, `end(std:::wstring&)`,
`end(std::string)`, and `end(std:::wstring)`

Lexemes end(char*), etc. match the lower boundary of the user-defined multi-character entity.

Search operator initiates search for the specified character entity.

Similar to symbol(std::string*) lexeme, implementations of end(std::string*) and end(std:::wstring*) lexemes hold only a reference to the specified character entity string. Other lexemes in this family create their own private copy of the entity string.

Related pages: CTTL | Grammar evaluation | Grammar reference | Lexeme functions

lexemes `first(char)`, `first(wchar_t )`,
`first(std::string)`, `first(std:::wstring)`,
`first(std::string)`, and `first(std:::wstring)`

Lexeme family first(char*), etc. matches first character of the user-defined multi-character entity.

Implementation of single character lexemes conforms to the white space sensitivity rules of the parseable universe. The implementation skips white space symbols defined by the space policy before single character lexemes are evaluated.

Search operator initiates search for the specified character entity.

Similar to symbol(std::string*) lexeme, implementations of first(std::string*) and first(std:::wstring*) lexemes hold only a reference to the specified character entity string. Other lexemes in this family create their own private copy of the entity string.

Related pages: CTTL | Grammar evaluation | Grammar reference | Lexeme functions

function adaptors

The library provides function adaptors that allow grammar rules to make function calls during grammar evaluation.

At compile time, a set of template functions named cttl::rule() captures addresses of C++ functions implementing grammar production rules and semantic actions. While lexical analysis engine evaluates grammar rules, it may invoke other rules and semantic actions defined by the parser.

There are function adaptors and debugging macros acting as function adaptors:

Table R.4 - Function adaptors

function	arguments	semantics
rule(&r) function adaptor.	address of global function, -or- static member function, -or- function object representing production rule or semantic action r.	struct parser { static size_t subrule( const_edge<>& universe_ ) { std::cout << universe_.text(); return universe_.second.offset(); } static size_t start( const_edge<>& universe_ ) { return rule( &subrule ).match( universe_ ); } };
rule(O,&C::r) member function adaptor.	Object reference O pointing to an instance of class `C` and the address of a non-static, constant or mutable, member function r.	struct parser { size_t subrule( const_edge<>& universe_ ) const { std::cout << universe_.text(); return universe_.second.offset(); } size_t start( const_edge<>& universe_ ) const { return rule( *this, &parser::subrule ).match( universe_ ); } };
CTTL_RULE(&C::r) member function adaptor.	Address of a non-static member function r.	This adaptor is defined as a macro. It is similar to `rule(this,&C::r)`, but requires only address of the class member function. Member function context is assumed, in which the `this` pointer is a valid expression. Depending on the level of tracing, it produces a trace of getting in and out of the encapsulated rule function. See `arithmetics_traced.cpp` sample program for an example of use. struct parser { size_t subrule( const_edge<>& universe_ ) { std::cout << universe_.text(); return universe_.second.offset(); } size_t start( const_edge<>& universe_ ) { return CTTL_RULE*( parser::subrule ).match( universe_ ); } }; See also: grammar debugging with `CTTL_RULE()`.
CTTL_MEMBER_RULE(O,&r) member function adaptor.	Object reference O and address of a non-static member function r.	`CTTL_MEMBER_RULE` is similar to `CTTL_RULE`, but requires both object reference and class member function, similar to `rule(object,&C::r)` member function adaptor. See also: grammar debugging with `CTTL_MEMBER_RULE()`. Please note that since release 206 the `address of, '&',` operator is required in front of the member function name. This allows users to explicitly provide an instance of pointer to member function. For an example of semantic action implemented as member function template, see `action_handler.cpp` sample program. Another sample, `arithmetics_traced.cpp`, demonstrates example of `CTTL_MEMBER_RULE()` macro.
CTTL_STATIC_RULE(r) function adaptor.	address of global function, -or- static member function, -or- function object representing production rule or semantic action r.	The macro can be used in place of `rule(r)` function adaptor. See also: grammar debugging with `CTTL_STATIC_RULE()`.

Related pages: CTTL | Grammar evaluation | Grammar reference

expression adaptors

The second type of CTTL adaptor is grammar expression wrapper.

Node and edge expression adaptors locate and provide access to the boundaries of the input text fragments during grammar parsing. They capture locations of symbols parsed by the encapsulated expression and have forms

	n(R)
	e(R)

where n is an instance of the cttl::node, e is an instance of either cttl::const_edge or cttl::edge, and R is a valid grammar expression. Node and edge adaptors are instances of the template classes returned by cttl::node::operator() and cttl::const_edge::operator() function call operators. At compile time, the adaptor captures and stores a copy of node or edge object, and an instance of the grammar expression R. At run time, if evaluation of R succeeds, the adaptor positions specified node or edge on the boundaries of the matched symbols.

If R is a valid grammar expression, the entity adaptor entity(R) guarantees that the underlying expression R matches a non-empty string. Expression adaptor entity(R) helps to determine whether expression R is nullable when implementing predictive parser.

The semantics of CTTL expression adaptors are as follows:

Table R.3 - Expression adaptors

function	semantics
n(R) Node expression adaptor; if expression R succeeds, the node n is positioned on the upper boundary of the matched fragment. Node expression adaptor can provide positive unlimited lookahead facility to the grammar expression. For more information, see adaptor behavior of the node documentation.	void f() { input<> inp( "123" ); const_edge<> universe( new_edge( inp ) ); node<> left( new_node( inp ) ); size_t result = left( entity( isdigit ) ).match( universe ); assert( left.offset() == 0 ); }
e(R) where e is an instance of either `cttl::const_edge` or `cttl::edge` object. Edge expression adaptor; if expression R succeeds, the edge e is positioned on the boundaries of the matched fragment.	void f() { input<> inp( "123" ); const_edge<> universe( new_edge( inp ) ); const_edge<> number( new_edge( inp ) ); size_t result = number( entity( isdigit ) ).match( universe ); assert( number.text() == "123" ); }
`entity(R)` Entity expression adaptor; if expression R succeeds, the expression adaptor `entity(R)` succeeds only if R matched a non-empty fragment of text. Entity expression adaptor validates that the encapsulated expression yields a non-empty match.	void f() { input<> inp( "abc" ); const_edge<> universe( new_edge( inp ) ); size_t result = entity( *entity( isdigit ) ).match(universe); assert( result == std::string::npos ); }

Related pages: CTTL | Grammar evaluation | Grammar reference

quotes

In notations of logical and mathematical expressions, in programming languages, as well as in natural languages, there are numerous examples of quotations, insertions, subexpressions, grouped phrases, citations, and passages, all of which require certain level of distinction from the rest of the sentence. Such elements of the input language often utilize balanced pair grammar constructs, for example:

Examples of quotes

`"Hello, World!"`	Text is enclosed inside balanced quotation marks.
`a/(b+c)`	Arithmetic sub-expression `b+c` is put inside balanced parenthesis.
`<HTML><BODY>Hello, World!</BODY></HTML>`	Html elements, such as `HTML` and `BODY`, are formed by pairs of html tags. The pairs of angled brackets form html tags themselves.

The constructs in the examples above are made clearly distinct from the rest of the phrase, expression, or entire document. The library offers special recognition to the objects set off by balanced punctuation, or by more complex patterns. The library defines a category of quote functions, designated for parsing various styles of quotes, all of which have general form:

	quote( RL, RM, RR )

where RL, RM, and RR are valid grammar expressions, named left, middle, and right expression, respectively. RL expression describes syntactical unit that opens the quote. RM expression describes grammar for input symbols located at the beginning of the interior of the quote. RM expression describes syntactical unit that closes the quote.

The left and right units of the quote are balanced. For example, if user input is specified as

	<HTML><BODY>Hello, World!</BODY></HTML>

then the middle expression, entity(), of the following quote,

	quote(
		'<' + entity( isalpha ) + '>',
		entity(),
		symbol( '<' ) + '/' + entity( isalpha ) + '>'
	)

matches the fragment "<BODY>Hello, World!</BODY>".

Related pages: CTTL | Grammar evaluation | Grammar reference

quote evaluation rules

The following rules apply during grammar evaluation of quotes:

Left and right expressions are balanced

The quote is balanced in the sense that left and right expressions are evaluated exactly the same number of times. Evaluation of the quote stops when the balance between the two sides is reached.

If input contains nested pairs of fragments matching expression units RL and RR, both expressions are evaluated multiple times. The middle expression, RM, is invoked only once. If no balance between RL and RR could be found, the evaluation of the quote fails, and the expression RM is never evaluated.

Expressions RL and RR are evaluated by `bang_find()` evaluation method

While searching for a balanced match, implementation of the quote alternates the bang_find() calls on expressions RL and RR. It is important to remember to use runtime_match() when RL and RR grammar units are implemented as distinct grammar rule functions. For example,

struct parser {

	static size_t start( const_edge<>& universe_ )
	{
		return quote(
			rule( parser::left ),
			entity(),
			rule( parser::right )
		).match( universe_ );
	}

	static size_t left( const_edge<>& universe_ )
	{
		return symbol( "BEGIN" ).runtime_match( universe_ );
	}

	static size_t right( const_edge<>& universe_ )
	{
		return symbol( "END" ).runtime_match( universe_ );
	}

}; //struct parser

The universe presented to the middle expression corresponds to the interior of the quote.

Interior universe, which is visible to the middle expression RM, is limited to the inner boundaries of the fragments matching RL and RR. For the middle expression, it is sufficient to match only part of the interior. The middle expression is not required to consume the entire universe.

Related pages: CTTL | Grammar evaluation | Grammar reference

quote formats

The following quote formats are available:

Table R.5 - CTTL quotes

semantics	description
quote(RL,RM,RR) Here and below the abbreviations RL, RM, and RR stand for valid grammar expressions, describing opening, interior, and closing clauses of the quote.	This is least specialized form of the quote. It can be used to locate balanced pairs of two arbitrary grammatical clauses, RL and RR, such as matching XML tags in a document.
quote(true,RM,RR)	Asymmetric quote. Expression `!!`RR is evaluated before RM. The universe U visible to the middle expression RM is restricted by the boundaries corresponding to the position of U`.first` before evaluation, and the upper boundary of the fragment matched by `!!`RR. See `line_parser.cpp` for an example of this quote.
quote(char,RM,char)	Quote specialized for left and right expressions containing only a single character. For example, void f() { input<> inp( "((a+b)(c-d))" ); const_edge<> univ( new_edge( inp ) ); const_edge<> intr( new_edge( inp ) ); const_edge<> extr( new_edge( inp ) ); if ( extr( quote( '(', intr( entity() ), ')' ) ).match( univ ) != std::string::npos ) { std::cout << "Interior: " << intr.text() << std::endl; std::cout << "Exterior: " << extr.text() << std::endl; } } This example prints text Interior: (a+b)(c-d) Exterior: ((a+b)*(c-d))
wchar_quote(wchar_t,RM,wchar_t)	Version for wide characters: #include <iostream> #include "cttl/cttl.h" using namespace cttl; int main() { input< std::wstring > inp( L"((a+b)(c-d))" ); const_edge< policy_default, std::wstring > univ( new_edge( inp ) ); const_edge< policy_default, std::wstring > data( new_edge( inp ) ); if ( ( data( wchar_quote( L'(', entity(), L')' ) ) ).match( univ ) != std::wstring::npos ) std::wcout << L"Data: " << data.text() << std::endl; return 0; } This example prints text Data: ((a+b)(c-d))
ansi_single_quote(RM) ansi_single_quote()	Some languages (for example, various SQL dialects) allow two single quotes in a row to represent a single quote character inside single-quoted string literal. For example, literal `'abc''def'` delineates a sequence of 7 characters, namely, `'a'`, `'b'`, `'c'`, `'\''`, `'d'`, `'e'`, and `'f'`. CTTL provides `ansi_single_quote()` function to parse such literals. For correct matches, `ansi_single_quote()` should be combined with kleene plus operator, `'+'` (see code example below for details). The RM expression provides access to the interior of the quoted literal. Function `ansi_single_quote()` is a shorthand notation for `quote('\'',RM,'\'')`, but its implementation is optimized for literals formed by single quotation marks. Function `ansi_single_quote()` is an abbreviated version of `ansi_single_quote(RM)`. It is a shorthand notation for `quote('\'',true,'\'')`. void f() { input<> inp( "\'abc\'\'def\'" ); const_edge<> universe( new_edge( inp ) ); const_edge<> data( new_edge( inp ) ); if ( ( data( +ansi_single_quote() ) ).match( universe ) != std::string::npos ) std::cout << "Data: " << data.text() << std::endl; } This example prints text Data: 'abc''def'
ansi_double_quote(RM) ansi_double_quote()	Same as `ansi_single_quote(RM)`, but for double-quoted literals, for example, `"abc""def"`. This is a shorthand notation for `quote('\"',RM,'\"')`, but optimized for literals formed by double quotation marks. `ansi_double_quote()` is an abbreviated version of `ansi_double_quote(expr)`. It is a shorthand notation for `quote('\"',true,'\"')`, but is also more optimized. void f() { input<> inp( "\"abc\"\"def\"" ); const_edge<> universe( new_edge( inp ) ); const_edge<> data( new_edge( inp ) ); if ( ( data( +ansi_double_quote() ) ).match( universe ) != std::string::npos ) std::cout << "Data: " << data.text() << std::endl; } This example prints text Data: "abc""def"
c_single_quote(RM) c_single_quote()	An ANSI-C string literal that recognizes C-style escape characters inside string literals delimited by single quotation marks.
c_double_quote(RM) c_double_quote()	An ANSI-C string literal that recognizes C-style escape characters inside string literals delimited by double quotation marks.
wchar_ansi_single_quote(RM) wchar_ansi_single_quote()	Specializations for wide characters.
wchar_ansi_double_quote(RM) wchar_ansi_double_quote()	Specializations for wide characters.
wchar_c_single_quote(RM) wchar_c_single_quote()	Specializations for wide characters.
wchar_c_double_quote(RM) wchar_c_double_quote()	Specializations for wide characters.

Related pages: CTTL | Grammar evaluation | Grammar reference

metaphors

The library introduces an idea of lexemic metaphor functions, encapsulating frequently used distinctive lexeme constructs, which describe certain classes of terminal symbols of the input grammar. For instance, names of identifiers, such as variable, function, and object names in many programming languages, could be described by the following construct:

	// Identifiers in C, C++, Java:
	begin( "QWERTYUIOPASDFGHJKLZXCVBNM_qwertyuiopasdfghjklzxcvbnm" )
	^
	end( "QWERTYUIOPASDFGHJKLZXCVBNM_qwertyuiopasdfghjklzxcvbnm1234567890" )

Yet in other programming languages, the identifier rules could diverge into something like

	// Identifiers in language XYZ:
	begin( "%QWERTYUIOPASDFGHJKLZXCVBNMqwertyuiopasdfghjklzxcvbnm" )
	^
	end( "%QWERTYUIOPASDFGHJKLZXCVBNMqwertyuiopasdfghjklzxcvbnm1234567890" )

Programming language identifiers fall into the category of terminal symbols that can be described by a metaphor function named literal(). For example,

	literal(
		"%QWERTYUIOPASDFGHJKLZXCVBNMqwertyuiopasdfghjklzxcvbnm",
		"%QWERTYUIOPASDFGHJKLZXCVBNMqwertyuiopasdfghjklzxcvbnm1234567890"
	)

is a synonym for the identifier of XYZ language shown above. The metaphor literal() with no arguments is an overloaded function, which is short hand notation for C/C++ identifier. Table R.6 lists CTTL metaphors:

Table R.6 - CTTL metaphors

metaphor name description

metaphor name	description
`literal()` `literal(B,E)` `wchar_literal()` `wchar_literal(B,E)`	Literal metaphor is a shorthand notation for grammar expression begin( B ) ^ end( E ) where `B` and `E` are user-defined character entities. Note that `B` is a subset of `E`. The default value for `B` is a set of alphabetic characters and the underscore; the default value for `E` is alphabetic character set, underscore, and ten digits. Wide character versions of `literal` metaphor are also provided. Metaphor `cttl::literal()` is equivalent to the expression begin( "QWERTYUIOPASDFGHJKLZXCVBNM_qwertyuiopasdfghjklzxcvbnm" ) ^ end( "QWERTYUIOPASDFGHJKLZXCVBNM_qwertyuiopasdfghjklzxcvbnm1234567890" ) Metaphor `cttl::wchar_literal()` is equivalent to begin( L"QWERTYUIOPASDFGHJKLZXCVBNM_qwertyuiopasdfghjklzxcvbnm" ) ^ end( L"QWERTYUIOPASDFGHJKLZXCVBNM_qwertyuiopasdfghjklzxcvbnm1234567890" ) For most efficient implementation, character entities `B` and `E` should be passed by address, for example: static size_t r( const_edge<>& edge_ ) { static const std::string ABC( "ABC" ); static const std::string ABCabc( "ABCabc" ); return literal( &ABC, &ABCabc ).match( edge_ ); }
`entity()`	Although `cttl::entity()` falls into the category of lexeme functions, internally it is implemented as a metaphor: begin( true ) + !end()

literal()
literal(B,E)

wchar_literal()
wchar_literal(B,E)

Literal metaphor is a shorthand notation for grammar expression

	begin( B ) ^ end( E )

where B and E are user-defined character entities.

Note that B is a subset of E. The default value for B is a set of alphabetic characters and the underscore; the default value for E is alphabetic character set, underscore, and ten digits. Wide character versions of literal metaphor are also provided.

Metaphor cttl::literal() is equivalent to the expression

	begin( "QWERTYUIOPASDFGHJKLZXCVBNM_qwertyuiopasdfghjklzxcvbnm" )
	^
	end( "QWERTYUIOPASDFGHJKLZXCVBNM_qwertyuiopasdfghjklzxcvbnm1234567890" )

Metaphor cttl::wchar_literal() is equivalent to

	begin( L"QWERTYUIOPASDFGHJKLZXCVBNM_qwertyuiopasdfghjklzxcvbnm" )
	^
	end( L"QWERTYUIOPASDFGHJKLZXCVBNM_qwertyuiopasdfghjklzxcvbnm1234567890" )

For most efficient implementation, character entities B and E should be passed by address, for example:

	static size_t r( const_edge<>& edge_ )
	{
		static const std::string ABC( "ABC" );
		static const std::string ABCabc( "ABCabc" );
		return literal( &ABC, &ABCabc ).match( edge_ );
	}

entity()

Although cttl::entity() falls into the category of lexeme functions, internally it is implemented as a metaphor:

	begin( true ) + !end()

Related pages: CTTL | Grammar evaluation | Grammar reference

backtracking vs. predictive parsing

Sometimes a lookahead is necessary to decide in advance how to parse part of the input. For instance, parser may try one way of matching, and if it fails, parser may try the alternative way. For this kind of processing, the lexer has to backtrack to the original input position. Backtracking technique (resetting input position) can often simplify the structure of grammar rules. However, repetitive backtracking and parsing the same fragment multiple times may cause negative performance impact. This deficiency sometimes yields poor performance, rendering an impractical parser.

An alternative is to stay away from the backtracking technique by specifying a more restrictive set of grammars. Restrictive parser predicts what production rule to use and removes requirements for excessive backtracking.

The library provides building blocks for compiling both backtracking and restrictive parsers. Backtracking CTTL parser supports unlimited positive and negative lookahead types.

Certain grammar constructs always require backtracking. For example, set union operator,

	symbol( 'A' ) | symbol( 'a' )

requires the starting position to be restored before right-hand-side expression, symbol( 'a' ), can be evaluated.

For more information about backtracking, see CTTL operators.

Related pages: CTTL | Grammar evaluation | Grammar reference

operators

A valid production rule expression, for example,

	entity( isalpha ) + entity( isdigit )

describes syntactic grouping of symbols, known as a non-terminal symbol. Non-terminals represent patterns of symbols of the input language. They are organized as idioms containing individual terminals, as well as calls to the production rules encapsulating other non-terminal symbols of the grammar.

Grammar expression can be evaluated by one of the evaluation methods: match(), find(), bang_find(), or runtime_match(). For example,

	( entity( isalpha ) + entity( isdigit ) ).match( universe )

Unary and binary operators can be applied to form production rule expressions from separate units like terminals, non-terminals, and sub-expressions enclosed in parenthesis:

	// matches notation for hexadecimal numbers like "0x0d", "0XFFFF", etc.
	symbol( '0' ) ^ ( symbol( 'x' ) | 'X' ) ^ entity( isxdigit )

Binary operators chain two expressions together and specify relationship between them; unary operators modify behavior of the underlying expression. For example:

	// matches one or more zeroes or ones, like "0", "1", "01", etc.
	+( symbol( '0' ) | symbol( '1' ) )

Search operator illustrates modification of the expression behavior. It forces lexer engine to escalate evaluation from match() to find(), or from find() to bang_find() evaluation:

	// search for lowercase characters:
	!entity( islower )

In case of a search, first terminal symbol on the left side of the grammar expression is formally given a special recognition. Such symbol is distinguished as leftmost terminal symbol, which consumes the search request initiated by find() or bang_find() evaluation. (By saying "consumes" we imply that the lexeme specifying leftmost terminal symbol does not delegate, but actually carries out the requested task.) For example,

	// search for negative number
	!!( symbol( '-' ) + entity( isdigit ) )

Here, repeatable search !! invokes bang_find() evaluation method of the subexpression enclosed in parenthesis:

	( symbol( '-' ) + entity( isdigit ) ).bang_find( universe )

The two lexemes are connected by binary plus operator forming a sequence. The sequence is evaluated as follows: binary plus delegates bang_find() call to the left-hand-side expression:

	symbol( '-' ).bang_find( universe )

If this call succeeds, the right-hand-side expression, entity( isdigit ), is evaluated by the match() method:

	entity( isalpha ).match( universe )

The sequence operator succeeds if both of these calls succeed. Regardless of the grammar expression structure and complexity, the leftmost terminal symbol always receives the search request, because all C++ binary operators recognized within CTTL framework have left-to-right associativity.

The library supplies C++ operators overloaded for template classes implementing CTTL lexemes, quotes, and adaptors. In addition, library defines overloaded operators for string and character literals, std::string and std::wstring template classes, and other C++ types that can be used in place of lexeme arguments (see table R.2 for details about lexeme argument types). This arrangement allows to write abbreviated grammar expressions. For example,

	// matches substring "<document>"
	symbol( '<' ) + symbol( "document" ) + symbol( '>' )

	// abbreviated form:
	symbol( '<' ) + "document" + '>'

Table R.7 describes CTTL operators in order of C++ operator precedence. Items towards the bottom of the list have lower precedence; therefore, parenthesis should be used to group sub-expressions requiring alternative priority of evaluation. In our description, R, R1, and R2 will represent valid grammar expressions. An event of successful application of any evaluation method against a grammar expression is commonly known as a match. On the other hand, the term search usually indicates an attempt to execute a continuous search for a pattern of symbols via find() or bang_find() evaluation.

If grammar evaluation succeeds, the fragment of the input text corresponding to the required pattern is called matched fragment. Therefore, every instance of successful match or search will have a corresponding matched fragment. Fragments can be characterized by their boundaries and length. If the fragment has zero length, the match is considered to be an empty match. Empty matches are not uncommon; they occur when grammars contain explicit mentioning of epsilon symbols, or sub-expressions qualified with kleene star operators. For example,

	// Rule BracketsOpt has potential for empty match:
	size_t BracketsOpt( const_edge< policy_space<> >& universe_ )
	{
		return ( *( symbol( '[' ) + ']' ) ).match( universe_ );
	}

Related pages: CTTL | Grammar evaluation | Grammar reference

Table R.7 - CTTL operators

operator name	description
`!R` search	Unary search operator, '`!`', promotes current evaluation method to a step-higher search. The match is promoted to find, find is promoted to bang_find. For example, grammar rule rm size_t rm( const_edge<>& universe_ ) { return ( !R ).match( universe_ ); } could be rewritten as an equivalent rule, rf: size_t rf( const_edge<>& universe_ ) { return ( R ).find( universe_ ); } where R is a valid grammar expression.
`!!R` repeatable search	Two search operators in a row, '`!!`', applied to the grammar expression R, guarantee promotion of the lexical analysis of R to the level of repeatable search, carried out by the `bang_find()` evaluation method. Therefore, grammar rule rm size_t rm( const_edge<>& universe_ ) { return ( !!R ).match( universe_ ); } could be equally rewritten as a different rule, *rbf: size_t rbf*( const_edge<>& universe_ ) { return ( R ).bang_find( universe_ ); } where R is a valid grammar expression.
`+R` kleene plus	Unary plus operator, '`+`', specifies one or more matches of grammar expression R, evaluated as one syntactical clause.
`-R` logical not	Unary minus operator, '`-`', specifies negative lookahead assertion of grammar expression R.
`*R` kleene star	Unary star operator, '`*`', specifies zero or more matches of grammar expression `R`.
`R * N` kleene star	Binary star operator, '`*`', specifies zero or up to N matches of grammar expression R, where N is a positive integer number.
`R1 - R2` set complement (difference)	Binary minus operator, '`-`', specifies matches of grammar expression R1 that are not matches of expression R2.
`R1 + R2` sequence	Binary plus operator, '`+`', specifies that grammar expressions R1 and R2 are evaluated in sequence, from left to right.
`R + N` kleene plus	Binary plus operator, '`+`', specifies one or up to N matches of grammar expression R, where N is a positive integer number.
`R + pair( N, M )` kleene plus	Binary plus operator, '`+`', specifies at least N, or up to M matches of grammar expression R, where pair( N, M ) is an instance of `std::pair` object, N and M are positive integer numbers, such that N <= M.
`R1 & R2` set intersection	Binary and oprator, '`&`', specifies matches of grammar expression R1 that are also matches of another expression, R2.
`R1 ^ R2` concatenation, (strict sequence)	Binary exclusive or operator, '`^`', specifies that two succeeding evaluations of grammar expressions R1 and R2 form a single leximatic unit, consisting of two strictly following one another fragments of input.
`R1 \| R2` set union	Binary or operator, '`\|`', specifies matches of either expression R1 or R2.
`R1 \|\| R2` POSIX union operator	Binary logical or operator, '`\|\|`', specifies matches of either expression R1, or R2 accordingly to the POSIX standard: it guarantees the longest leftmost match.

logical not

Unary minus operator, or logical not, makes available expression format

-R

where R is a C++ expression describing grammar production rule.

Logical not specifies negative lookahead assertion of expression R. Successful lookahead assertion -R does not consume any characters of the parseable universe. The lexer engine saves universe position at the beginning of the evaluation, and restores it back after the evaluation is complete.

Result of -R is best described by the definition of absolute complement from the set theory. Assume that a universal set of strings U is defined as a set of all strings immediately before evaluation of expression -R. Also, if set A of matching strings is defined for rule R, then A is a subset of U. Then, expression -R matches a new set of strings which is an absolute complement (or simply complement) of the subset A of U, defined as a set of all strings of U which are not elements of A.

See also: Wikipedia - Absolute complement

Related pages: CTTL | Grammar evaluation | Grammar reference | Overloaded C++ operators

kleene star

Unary Kleene star operator makes available expression format

*R

where R is a C++ expression describing grammar production rule.

Kleene star specifies zero or more matches of R. This Kleene form attempts as many matches as possible, otherwise known as greedy evaluation.

From the set theory, Kleene star operator can be defined as follows: if set A of matching strings is defined for rule R, then expression *R defines a new set of strings *A, which is a set of all strings that can be made by concatenating zero or more strings from A. Therefore, empty string is also part of set *A.

Expression *R always succeeds.

Empty match is counted as a good match. To prevent the lexer engine from entering infinite loop on empty matches, empty match never repeats. To exclude empty matches from the results of evaluation, use *entity( R ) expression.

kleene plus

Unary Kleene plus operator makes available expression format

+R

where R is a C++ expression describing grammar production rule.

Kleene plus specifies one or more matches of R. This Kleene form attempts as many matches as possible, otherwise known as greedy evaluation.

From the set theory, Kleene plus operator can be defined as follows: if set A of matching strings is defined for rule R, expression +R defines a new set of strings +A, which is a set of all strings that can be made by concatenating strings from A.

Related pages: CTTL | Grammar evaluation | Grammar reference | Overloaded C++ operators

binary kleene star

Binary Kleene star operator makes available expression format

	R * N

where R is a C++ expression describing grammar production rule, and N is a positive integer number.

This version of Kleene star specifies zero or up to N matches of R. This form attempts as few repeats as possible, otherwise known as non-greedy, or lazy evaluation.

Expression R * N always succeeds.

Expression R * N never returns more than N matches; if Nth occurrence of the match fragment is found, the evaluation stops.

Empty match is counted as a good match. To exclude empty matches from the results of evaluation, use entity( R ) * N expression instead.

Many regular expression engines recognize "R?" syntax, which specifies "zero or one matches". In CTTL, request for zero or one matches is written as R * 1.

Related pages: CTTL | Grammar evaluation | Grammar reference | Overloaded C++ operators

binary kleene plus

Binary Kleene plus operator makes available expression format

	R + N

where R is a C++ expression describing grammar production rule, and N is a positive integer number.

This version of Kleene plus specifies one or up to N matches of R. This Kleene form attempts as few repeats as possible, otherwise known as non-greedy, or lazy evaluation.

Expression R + N never returns more than N matches; if Nth occurrence of the match fragment is found, the evaluation stops.

Empty match is counted as a good match. To exclude empty matches from the results of evaluation, use entity( R ) + N expression instead.

Related pages: CTTL | Grammar evaluation | Grammar reference | Overloaded C++ operators

greedy kleene plus

Binary Kleene plus operator makes available expression form

	R + pair( N, M )

where R is a C++ expression describing grammar production rule; pair( N, M ) is an instance of std::pair N and M are positive integer numbers, such that N <= M.

This version of Kleene plus requires at least N matches, or up to M matches of expression R. Anywhere between N and M, this form of Kleene plus attempts as many matches as possible, otherwise known as greedy evaluation.

If less than N or more than M matches exist, expression R + pair( N, M ) fails.

Empty match is counted as a good match, but never repeats. If empty match is found, it stops the evaluation. To exclude empty matches from the results of evaluation, use entity( R ) + pair( N, M ) expression instead.

To restrict Kleene operator to exactly N matches, use

	R + std::make_pair( N, N )

There is no explicit CTTL syntax that allows to specify an infinity for the upper limit of the Kleene matches. To work around, a "big enough" number should be specified for the upper bound parameter:

	R + std::make_pair( N, 999 )

Related pages: CTTL | Grammar evaluation | Grammar reference | Overloaded C++ operators

set complement

Binary minus operator, or set absolute complement, makes available expression format

	R1 - R2

where R1 and R2 are C++ expressions describing grammar production rules.

Binary minus operator specifies matches of R1 that are not matches of R2. Result of R1 - R2 is best described by the definition of absolute complement from the set theory: If a universal set U of matching strings is defined for rule R1, and a subset A of U of matching strings is defined for rule R2, then expression R1 - R2 matches a new set, which is an absolute complement (or simply complement) of a subset A of U, defined as a set of all strings of U which are not elements of A.

Successful evaluation of R2 does not consume any characters of the parseable universe. The lexer engine saves universe position at the beginning of the R2 evaluation, and backtracks the position after evaluation of R2 is complete.

The universe presented to the expression R2 is restricted to the boundaries of the fragment of text that matches expression R1. To enforce this rule, complement operator removes custom white space policy from the universe and replaces it with policy_default before universe is presented to the expression R2 for evaluation.

Complement operator R1 - R2 has left-to-right associativity: expression R1 is evaluated first, and then R2 is evaluated. If evaluation of R1 fails, expression R2 is never evaluated. This feature is also known as short-circuiting.

Note that expressions

	entity( isalpha ) - symbol( '(' )

and

	entity( isalpha ) + -symbol( '(' )

are very different. First expression matches an alphabetical multi-character entity and then tries to compare its first character with '(', which will always fail. The second expression attempts negative lookahead for character '(' after alphabetical entity. The rationale for set complement is to validate the fragment matched by the left hand side against expression provided on the right hand side of the operator. The negative lookahead, on the other hand, can be done by the negative lookahead assertion operator, '-'.

See also: Wikipedia - Absolute complement.

Related pages: CTTL | Grammar evaluation | Grammar reference | Overloaded C++ operators

sequence

Binary sequence operator makes available expression format

	R1 + R2

where R1 and R2 are C++ expressions describing grammar production rules.

Binary sequence operator specifies that expressions R1 and R2 are evaluated in sequence, from left to right. Resulting match contains fragment match of R1 followed by the fragment match of R2. Both expressions must succeed in order for the entire sequence to succeed.

Sequence operator R1 + R2 has left-to-right associativity: expression R1 is evaluated first, and then R2 is evaluated. If evaluation of R1 fails, expression R2 is never evaluated. This feature is also known as short-circuiting.

Related pages: CTTL | Grammar evaluation | Grammar reference | Overloaded C++ operators

set intersection

Binary and operator, or set intersection, makes available expression format

	R1 & R2

where R1 and R2 are C++ expressions describing grammar production rules.

Intersection operator specifies matches of R1 that are also matches of R2. Result of R1 & R2 is best described by the definition of set intersection from the set theory: If a universal set U of matching strings is defined for rule R1, and a subset A of U of matching strings is defined for the rule R2, then expression R1 & R2 matches set of strings U & A that contains all elements of U that also belong to A.

Successful evaluation of R2 does not have any effect on the parseable universe. The lexer engine saves universe position at the beginning of the R2 evaluation, and backtracks the position after evaluation of R2 is complete.

The universe presented to the expression R2 is restricted to the boundaries of the fragment of text that matches expression R1. To enforce this rule, intersection removes custom white space policy from the universe and replaces it with policy_default before universe is presented to the expression R2 for evaluation.

Intersection operator R1 & R2 has left-to-right associativity: expression R1 is evaluated first, and then R2 is evaluated. If evaluation of R1 fails, expression R2 is never evaluated. This feature is also known as short-circuiting.

concatenation (strict sequence)

Binary concatenation operator makes available expression format

	R1 ^ R2

where R1 and R2 are C++ expressions describing grammar production rules.

Concatenation operator specifies that expressions R1 and R2 are evaluated in sequence, from left to right. Resulting matched fragment of text contains fragment matching R1 immediately followed by the fragment matching R2. Both expressions must succeed in order for the concatenation to succeed.

The difference between sequence and concatenation is that concatenation removes custom white space policy from the universe and replaces it with policy_default before universe is presented to the expression R2 for the evaluation. Therefore, concatenation can be formally qualified as a "strict sequence".

Concatenation operator R1 ^ R2 has left-to-right associativity: expression R1 is evaluated first, and then R2 is evaluated. If evaluation of R1 fails, expression R2 is never evaluated. This feature is also known as short-circuiting.

Related pages: CTTL | Grammar evaluation | Grammar reference | Overloaded C++ operators

set union

Binary or operator, or set union, makes available expression format

	R1 | R2

where R1 and R2 are C++ expressions describing grammar production rules.

Set union operator specifies matches of either R1, or R2. It is also known as the disjunction operator. Result of R1 | R2 is best described by the definition of set union from the set theory: If a set A of matching strings is defined for rule R1, and a set B of matching strings is defined for the rule R2, then the expression R1 | R2 matches set of strings A | B which is a union of two sets A and B, defined as the set that contains all elements of A and all elements of B, but no other elements.

Union operator R1 | R2 has left-to-right associativity: expression R1 is evaluated first. If evaluation of R1 succeeds, result of the union is the result of expression R1, so expression R2 is never evaluated. This feature is also known as short-circuiting. If evaluation of R1 fails, result of the union is the result of expression R2.

CTTL implementation of the union operator is based on the traditional Non-deterministic Finite Automaton (NFA) behavior. It may leave longer matches undiscovered, because evaluation stops as soon as the first match is found. (Therefore, traditional NFA is non-greedy.) Traditional NFA backtracks in the event of a non-match, and so does the CTTL lexer engine. If R1 fails, the beginning of the parseable universe is restored to give R2 a chance.

POSIX union

Binary POSIX union operator makes available expression format

	R1 || R2

where R1 and R2 are C++ expressions describing grammar production rules.

POSIX union specifies matches of either R1 or R2 accordingly to the POSIX standard: it guarantees the longest leftmost match.

POSIX NFA is slower than the traditional NFA-based set union operator, especially if there are many possible matches. Traditional NFA may find a match early, and then stop evaluation. In contrast, POSIX NFA continues to backtrack and goes on with matching. With POSIX NFA guarantees that both sides of the union expression are evaluated; therefore, it will not favor a shorter match over a longer one by changing the order of the expressions, e g. R2 || R1.

Related pages: CTTL | Grammar evaluation | Grammar reference | Overloaded C++ operators

Permission to copy, use, modify, sell and distribute this document is granted provided this copyright notice appears in all copies. This document is provided "as is" without express or implied warranty, and with no claim as to its suitability for any purpose.

Generated on Thu Nov 2 17:44:54 2006 for Common Text Transformation Library by

1.3.9.1

lexeme symbol()

lexeme entity()

lexeme begin()

lexeme end()

lexeme symbol(true)

lexeme symbol(false)

lexeme begin(true)

lexeme begin(false)

lexemes begin(std::set<std::string>&) and begin(std::set<std::wstring>&)

lexemes symbol(int), symbol(char), and symbol(wchar_t)

lexemes entity(int), entity(char), and entity(wchar_t)

lexemes begin(int), begin(char), and begin(wchar_t)

lexemes end(int), end(char), and end(wchar_t)

lexemes entity(is...)

lexeme begin(is...)

lexeme end(is...)

lexeme first(is...)

lexemes symbol(char*), symbol(wchar_t* ), symbol(std::string&), symbol(std:::wstring&), symbol(std::string*), and symbol(std:::wstring*)

lexemes entity(char*), entity(wchar_t* ), entity(std::string&), entity(std:::wstring&), entity(std::string*), and entity(std:::wstring*),

lexemes begin(char*), begin(wchar_t* ), begin(std::string&), begin(std:::wstring&), begin(std::string*), and begin(std:::wstring*)

lexemes end(char*), end(wchar_t* ), end(std::string&), end(std:::wstring&), end(std::string*), and end(std:::wstring*)

lexemes first(char*), first(wchar_t* ), first(std::string), first(std:::wstring), first(std::string), and first(std:::wstring)

Left and right expressions are balanced

Expressions RL and RR are evaluated by bang_find() evaluation method

The universe presented to the middle expression corresponds to the interior of the quote.

logical not

kleene star

kleene plus

binary kleene star

binary kleene plus

greedy kleene plus

set complement

sequence

set intersection

concatenation (strict sequence)

set union

POSIX union

lexeme `symbol()`

lexeme `entity()`

lexeme `begin()`

lexeme `end()`

lexeme `symbol(true)`

lexeme `symbol(false)`

lexeme `begin(true)`

lexeme `begin(false)`

lexemes `begin(std::set<std::string>&)` and `begin(std::set<std::wstring>&)`

lexemes `symbol(int)`, `symbol(char)`, and `symbol(wchar_t)`

lexemes `entity(int)`, `entity(char)`, and `entity(wchar_t)`

lexemes `begin(int)`, `begin(char)`, and `begin(wchar_t)`

lexemes `end(int)`, `end(char)`, and `end(wchar_t)`

lexemes `entity(is...)`

lexeme `begin(is...)`

lexeme `end(is...)`

lexeme `first(is...)`

lexemes `symbol(char)`, `symbol(wchar_t )`,
`symbol(std::string&)`, `symbol(std:::wstring&)`,
`symbol(std::string)`, and `symbol(std:::wstring)`

lexemes `entity(char)`, `entity(wchar_t )`,
`entity(std::string&)`, `entity(std:::wstring&)`,
`entity(std::string)`, and `entity(std:::wstring)`,

lexemes `begin(char)`, `begin(wchar_t )`,
`begin(std::string&)`, `begin(std:::wstring&)`,
`begin(std::string)`, and `begin(std:::wstring)`

lexemes `end(char)`, `end(wchar_t )`,
`end(std::string&)`, `end(std:::wstring&)`,
`end(std::string)`, and `end(std:::wstring)`

lexemes `first(char)`, `first(wchar_t )`,
`first(std::string)`, `first(std:::wstring)`,
`first(std::string)`, and `first(std:::wstring)`

Expressions RL and RR are evaluated by `bang_find()` evaluation method