CTTL on
SourceForge |
Download
Latest |
Documentation
Index |
Library
News |
CVS
Repository |
Other
Links |
CTTL defines variety of functions and overloaded operators for constructing programs that parse input with EBNF grammars. Typical CTTL parser combines definitions of grammar expressions and an assortment of functions named semantic actions to store and process results.
With CTTL, EBNF production rules can be written as C++ expressions. EBNF requires that production rules can be invoked recursively. CTTL supports recursion by means of function adaptors for C++ functions. Therefore, CTTL parser program is typically organized as a set of functions implementing EBNF rules and semantic actions.
CTTL library provides implementation for the lexical analysis engine. C++ compiler resolves individual grammar expressions into template classes acting as lexer implementation components. Each lexer class has entry points named grammar evaluation methods. Therefore, CTTL lexer implementation is determined at compile time; no virtual functions or other types of dynamic polymorphism are employed (however, the library does not restrict programmers from including these features into program design.)
This reference page offers detailed overview of the building blocks to describe grammar pimitives: lexeme functions, overloaded operators, quotes, metaphors, adaptors, and compile-time policies for handling white space. Information presented here is not a formal API document: refer to the cttl namespace articles for exact signatures of template functions and classes. For simplicity, we will reflect on synopsis, semantics of use, and examples of C++ code to make it helpful to understand various facilities of the library.
Lexeme is a single terminal unit in the syntax of the input language. When lexeme is recognized as part of the grammar, corresponding fragment of the input text is returned from the lexical analysis engine back to the parser program. Such fragment is commonly referred to as token. Thus, token is an actual representation of the lexeme in the input language grammar.
Text transformation library provides a family of functions to delineate lexemes. Lexeme functions may have one or no arguments. C++ types of the lexeme arguments are listed in Table R.2.
There are five basic types of lexeme functions:
Table R.1 - Lexeme typesLexeme function | Description |
---|---|
cttl::symbol() |
Usually refers to a specific terminal symbol in the
abstract alphabet of the input language.
|
cttl::entity() |
Describes a particular class of symbols (i.e. permutation of symbols) in the input language.
|
cttl::begin() |
Describes upper (left-hand-side) boundary of a character entity.
|
cttl::end() |
Describes lower (right-hand-side) boundary of a character entity.
|
cttl::first() |
Refers to the first input character that belongs to a character entity.
|
Lexemes symbol(), entity(), and first() specify non-empty entities of characters. Lexemes begin() and end() describe boundaries of particular entities, therefore, they return tokens of zero length.
C++ argument type | symbol | entity | begin | end | first |
---|---|---|---|---|---|
None |
symbol()
matches a single character. |
entity()
matches the universe. |
begin()
matches beginning of the user input. |
end()
matches ending of the universe. |
|
bool |
symbol(true)
always succeeds unless at the end of the universe. symbol(false) always fails. |
begin(true)
always succeeds. begin(false) always fails. |
|||
std::set<std::string>&-or- std::set<std::wstring>& |
begin(&strset)
matches parseable universe against set of strings. |
||||
int
char wchar_t |
symbol(0x20)
symbol('\x20') symbol(L'\x20') matches single character. |
entity(0x20)
entity('\x20') entity(L'\x20') matches single character. |
begin(65)
begin('A') begin(L'A') matches upper boundary of a single character. |
end(9)
end('\t') end(L'\t') matches lower boundary of a single character. |
|
int(*pf)(int)-or- int(*pf)(wint_t) |
entity(isalpha)
entity(iswalpha) matches character entity. |
begin(isalpha)
begin(iswalpha) matches upper boundary of the character entity. |
end(isalpha)
end(iswalpha) matches lower boundary of the character entity. |
first(isalpha)
first(iswalpha) matches first character of the character entity. |
|
The following ANSI C character classification routines can be used with lexeme functions entity(), begin(), end(), and first(): isalnum, iswalnum......alphanumeric characters isalpha, iswalpha......alphabetic characters iscntrl, iswcntrl......control characters isdigit, iswdigit......decimal digits isgraph, iswgraph......printable other than space islower, iswlower......lowercase characters isprint, iswprint......printable characters ispunct, iswpunct......punctuation characters isspace, iswspace......white-space characters isupper, iswupper......uppercase characters isxdigit, iswxdigit....hexadecimal digits |
|||||
char*
wchar_t* std::string& std::wstring& std::string* std::wstring* |
symbol("ABC")
symbol(L"ABC") symbol(&str) matches exact text. |
entity("ABC")
entity(L"ABC") entity(&str) matches user-defined character entity. |
begin("ABC")
begin(L"ABC") begin(&str) matches upper boundary of the user-defined character entity. |
end("ABC")
end(L"ABC") end(&str) matches lower boundary of the user-defined character entity. |
first("ABC")
first(L"ABC") first(&str) matches first character of the user-defined character entity. |
Lexeme cttl::symbol() with no arguments matches a single character in the user input.
Implementation of the symbol() lexeme conforms to the white space sensitivity rules of the parseable universe. The implementation skips white space symbols defined by the space policy before symbol() is evaluated.
Search operator has no effect on the symbol() lexeme, that is, expressions
symbol() !symbol()are equivalent.
Lexeme cttl::entity() matches fragment of text defined by the current boundaries of the universe. In other words, the entity() lexeme matches the entire universe.
Implementation of the entity() lexeme ignores white space policy of the universe. Regardless of the policy, the entity() lexeme is evaluated immediately at the current position of the universe upper boundary.
The entity() evaluation succeeds against empty universe; that is, if length of the universe is zero, the function returns successful result (similar behavior exhibited by begin(true) lexeme.)
Internally, the entity() is defined as a metaphor: it is equivalent to the expression
begin( true ) + !end()
Search operator has no effect on the entity() lexeme, that is, expressions
entity() !entity()are equivalent.
The begin() lexeme matches beginning of the user input. If evaluation of begin() succeeds, it always returns zero offset.
Implementation of the begin() lexeme ignores white space policy of the universe. Regardless of the policy, the begin() lexeme is evaluated immediately at the current position of the universe upper boundary.
Search operator forces upper boundary of the universe to move to the beginning-of-file location (the position with absolute offset zero.)
Matches lower boundary of the universe. The lexeme assumes that the end of the universe is always a valid position. If end() lexeme succeeds, the universe is empty, and the result of evaluation is always the offset of the universe lower boundary.
When applied to the end() lexeme, search operators, as well as search evaluation methods, always succeed. When used with end(), these evaluations force the upper boundary of the universe to move to the position pointed by the lower boundary, resulting in empty universe.
Implementation of the end() lexeme conforms to the white space sensitivity rules of the parseable universe. The implementation skips white space symbols defined by the space policy before end() is evaluated.
Lexeme cttl::symbol(true) matches an empty symbol in the input language (sometimes called epsilon symbol, or empty string.) The match evaluation of symbol(true) lexeme always succeeds, unless upper boundary of the universe points to the lower boundary (indicating that the universe is empty.) Therefore, symbol(true) imposes the rule that at least one character should be available from the input pointed by the universe. The size of the token returned by successful symbol(true) evaluation is always zero.
The symbol(true) lexeme makes it possible to enter strict evaluation modes by using construct
symbol( true ) ^ Rwhere R is a valid CTTL grammar expression. Overloaded binary concatenation operators allow short hand notation when used with symbol(true) lexeme, for example,
true ^ entity( isdigit )
Implementation of the symbol(true) lexeme ignores white space policy of the universe. Regardless of the policy, the symbol(true) lexeme is evaluated immediately at the current position of the universe upper boundary.
Search operator makes symbol(true) behave like begin(true) lexeme. Therefore, expressions
!symbol( true ) begin( true )are equivalent: they always succeed.
This lexeme always fails, this is a stop symbol.
Evaluation of cttl::begin(true) lexeme always succeeds. Lexeme cttl::symbol(true) matches an empty symbol in the input language (sometimes called epsilon symbol.) The size of the token returned by the successful begin(true) evaluation is always zero.
Implementation of the begin(true) lexeme ignores white space policy of the universe. Regardless of the policy, the begin(true) lexeme is evaluated immediately at the current position of the universe upper boundary.
Search operator has no effect on the begin(true) lexeme, that is, expressions
begin(true) !begin(true)are equivalent.
This lexeme always fails, this is a stop symbol (same behavior is exhibited by symbol(false) lexeme.)
String set lexemes support only match() evaluation for user-defined sets of strings (see std::set). The substring pointed by the current universe is compared against the set of strings. Based on positive lookahead algorithm, the string set lexeme does not consume any characters from the input universe.
Evaluation succeeds if there is an exact match. For a complete example of string set lexeme, see string_set_lexeme.cpp sample program.
The implementation of the string set lexeme is reference-based, it stores only reference to the std::set object provided by the user. Therefore, the set object must stay in scope while string set lexeme remains in use.
String sets ignore white space policy of the universe. Regardless of the policy, the string sets are evaluated immediately at the current position of the universe upper boundary.
Since string set lexeme supports only match() evaluation, expressions with search operator in front of the string set will not compile. For example, the following expression is illegal:
!begin( str_set ) // will not compilewhere str_set is an object of type std::set<std::string> or std::set<std::wstring>.
These lexemes match a particular character from the input character sequence.
Implementation of the symbol(char) lexeme conforms to the white space sensitivity rules of the parseable universe. The implementation skips white space symbols defined by the space policy before symbol(char) is evaluated.
Search operator initiates search for the specified character.
These lexemes behave exactly as symbol(int), symbol(char), and symbol(wchar_t).
These lexemes match upper boundary of the single character from the input character sequence.
Implementation of the begin(char) lexeme conforms to the white space sensitivity rules of the parseable universe. The implementation skips white space symbols defined by the space policy before begin(char) is evaluated.
Search operator initiates search for the specified character.
These lexemes match lower boundary of the single character from the input character sequence.
Implementation of the end(char) lexeme conforms to the white space sensitivity rules of the parseable universe. The implementation skips white space symbols defined by the space policy before end(char) is evaluated.
Search operator initiates search for the specified character.
An entity lexeme matches the ANSI C multi-character entity. The argument specifies ANSI C character classification routine, such as
For example, expression entity(isaplha) matches alphabetic substrings like "ABC", "qwerty", etc. In addition to ANSI C routines, the user may provide their own functions matching the prototypes
int f( int ); // char classification int wf( wint_t ); // wchar_t classification
Implementation of entity lexemes conforms to the white space sensitivity rules of the parseable universe. The implementation skips white space symbols defined by the space policy before entity lexemes are evaluated.
Search operator initiates search for the specified character entity.
Lexeme begin(is...) matches the upper boundary of the ANSI C multi-character entity. The rest of the behavior is the same as entity(is...).
Lexeme end(is...) matches the lower boundary of the ANSI C multi-character entity. The rest of the behavior is the same as entity(is...).
Lexeme first(is...) matches first character of the ANSI C multi-character entity. The rest of the behavior is the same as entity(is...).
Family of terminal symbol lexemes, symbol(char*), etc. matches the exact text provided by the user.
Implementation of symbol lexemes conforms to the white space sensitivity rules of the parseable universe. The implementation skips white space symbols defined by the space policy before symbol lexemes are evaluated.
Search operator initiates search for the specified symbol.
Lexeme functions
symbol(char*) symbol(wchar_t* ) symbol(std::string&) symbol(std:::wstring&)instantiate private std::string (or std::wstring) member object to store the copy of the text fragment to look for. For example, temporary object std::string("ABC") is constructed each time thread of execution runs through expression "symbol("ABC")":
static size_t r( const_edge<>& edge_ ) { return symbol( "ABC" ).match( edge_ ); }For many applications, instance implementation can be inefficient. Alternatively, string pointer can specify searchable text:
static size_t r( const_edge<>& edge_ ) { static const std::string ABC( "ABC" ); return symbol( &ABC ).match( edge_ ); }In this case, reference implementation of symbol(std::string*) lexeme will be used when the code is compiled. The reference implementation stores only reference to the specified string, which is expected to be in scope while lexeme symbol(std::string*) is invoked.
Using token string pointers offers advantage of reusing string objects constructed once the program is loaded into memory. This yields faster implementation, plus the ability to centralize the definitions of all symbol constants (i.e. the common alphabet) that is shared by the grammar productions.
The family of entity lexemes matches user-defined multi-character entities. For example, grammar expression entity("abc") matches permutations of strings "a", "b", "c", "aa", "ab", "ac", "ad", and so on.
Implementation of entity lexemes conforms to the white space sensitivity rules of the parseable universe. The implementation skips white space symbols defined by the space policy before entity lexemes are evaluated.
Search operator initiates search for the specified character entity.
Similar to symbol(std::string*) lexeme, implementations of entity(std::string*) and entity(std:::wstring*) lexemes hold only a reference to the specified character entity string. Other lexemes in this family create their own private copy of the entity string.
Lexeme family begin(char*), etc. matches the upper boundary of the user-defined multi-character entity.
Implementation of boundary lexemes conforms to the white space sensitivity rules of the parseable universe. The implementation skips white space symbols defined by the space policy before boundary lexemes are evaluated.
Search operator initiates search for the specified character entity.
Similar to symbol(std::string*) lexeme, implementations of begin(std::string*) and begin(std:::wstring*) lexemes hold only a reference to the specified character entity string. Other lexemes in this family create their own private copy of the entity string.
Note that expresison
begin( "+=" )is not a lookahead lookup for text "+=", but rather a multi-character entity lookup. Expression "begin( "+=" )" matches permutations of characters '+' and '=', therefore, it will match the beginning of any string that begins with '+' and '='. If grammar expression needs a lookahead lookup for a particular text, it should use expression
universe_.first( symbol( "+=" ) )instead.
Lexemes end(char*), etc. match the lower boundary of the user-defined multi-character entity.
Implementation of boundary lexemes conforms to the white space sensitivity rules of the parseable universe. The implementation skips white space symbols defined by the space policy before boundary lexemes are evaluated.
Search operator initiates search for the specified character entity.
Similar to symbol(std::string*) lexeme, implementations of end(std::string*) and end(std:::wstring*) lexemes hold only a reference to the specified character entity string. Other lexemes in this family create their own private copy of the entity string.
Lexeme family first(char*), etc. matches first character of the user-defined multi-character entity.
Implementation of single character lexemes conforms to the white space sensitivity rules of the parseable universe. The implementation skips white space symbols defined by the space policy before single character lexemes are evaluated.
Search operator initiates search for the specified character entity.
Similar to symbol(std::string*) lexeme, implementations of first(std::string*) and first(std:::wstring*) lexemes hold only a reference to the specified character entity string. Other lexemes in this family create their own private copy of the entity string.
The library provides function adaptors that allow grammar rules to make function calls during grammar evaluation.
At compile time, a set of template functions named cttl::rule() captures addresses of C++ functions implementing grammar production rules and semantic actions. While lexical analysis engine evaluates grammar rules, it may invoke other rules and semantic actions defined by the parser.
There are function adaptors and debugging macros acting as function adaptors:
function | arguments | semantics |
---|---|---|
rule(&r)function adaptor. |
address of global function,
-or- static member function, -or- function object representing production rule or semantic action r. |
struct parser { static size_t subrule( const_edge<>& universe_ ) { std::cout << universe_.text(); return universe_.second.offset(); } static size_t start( const_edge<>& universe_ ) { return rule( &subrule ).match( universe_ ); } }; |
rule(O,&C::r)member function adaptor. |
Object reference O pointing to an instance of class C and the address of a non-static, constant or mutable, member function r. |
struct parser { size_t subrule( const_edge<>& universe_ ) const { std::cout << universe_.text(); return universe_.second.offset(); } size_t start( const_edge<>& universe_ ) const { return rule( *this, &parser::subrule ).match( universe_ ); } }; |
CTTL_RULE(&C::r)member function adaptor. |
Address of a non-static member function r. |
This adaptor is defined as a macro.
It is similar to rule(*this,&C::r), but
requires only address of the class member function.
Member function context is assumed, in which the this
pointer is a valid expression. Depending on the level of tracing,
it produces a trace of getting in and out of the encapsulated
rule function.
See arithmetics_traced.cpp
sample program for an example of use.
struct parser { size_t subrule( const_edge<>& universe_ ) { std::cout << universe_.text(); return universe_.second.offset(); } size_t start( const_edge<>& universe_ ) { return CTTL_RULE( parser::subrule ).match( universe_ ); } };See also: grammar debugging with CTTL_RULE(). |
CTTL_MEMBER_RULE(O,&r)member function adaptor. |
Object reference O and address of a non-static member function r. |
CTTL_MEMBER_RULE is similar to CTTL_RULE,
but requires both object reference and class member function,
similar to
rule(object,&C::r) member function adaptor. See also: grammar debugging with CTTL_MEMBER_RULE(). Please note that since release 206 the address of, '&', operator is required in front of the member function name. This allows users to explicitly provide an instance of pointer to member function. For an example of semantic action implemented as member function template, see action_handler.cpp sample program. Another sample, arithmetics_traced.cpp, demonstrates example of CTTL_MEMBER_RULE() macro. |
CTTL_STATIC_RULE(r)function adaptor. |
address of global function,
-or- static member function, -or- function object representing production rule or semantic action r. |
The macro can be used in place of
rule(r) function adaptor. See also: grammar debugging with CTTL_STATIC_RULE(). |
The second type of CTTL adaptor is grammar expression wrapper.
Node and edge expression adaptors locate and provide access to the boundaries of the input text fragments during grammar parsing. They capture locations of symbols parsed by the encapsulated expression and have forms
n(R) e(R)where n is an instance of the cttl::node, e is an instance of either cttl::const_edge or cttl::edge, and R is a valid grammar expression. Node and edge adaptors are instances of the template classes returned by cttl::node::operator() and cttl::const_edge::operator() function call operators. At compile time, the adaptor captures and stores a copy of node or edge object, and an instance of the grammar expression R. At run time, if evaluation of R succeeds, the adaptor positions specified node or edge on the boundaries of the matched symbols.
If R is a valid grammar expression, the entity adaptor entity(R) guarantees that the underlying expression R matches a non-empty string. Expression adaptor entity(R) helps to determine whether expression R is nullable when implementing predictive parser.
The semantics of CTTL expression adaptors are as follows:
function | semantics |
---|---|
n(R)
Node expression adaptor; if expression R succeeds, the node n is positioned on the upper boundary of the matched fragment. Node expression adaptor can provide positive unlimited lookahead facility to the grammar expression. For more information, see adaptor behavior of the node documentation. |
void f() { input<> inp( "123" ); const_edge<> universe( new_edge( inp ) ); node<> left( new_node( inp ) ); size_t result = left( entity( isdigit ) ).match( universe ); assert( left.offset() == 0 ); } |
e(R)
where e is an instance of either cttl::const_edge or cttl::edge object. Edge expression adaptor; if expression R succeeds, the edge e is positioned on the boundaries of the matched fragment. |
void f() { input<> inp( "123" ); const_edge<> universe( new_edge( inp ) ); const_edge<> number( new_edge( inp ) ); size_t result = number( entity( isdigit ) ).match( universe ); assert( number.text() == "123" ); } |
entity(R)
Entity expression adaptor; if expression R succeeds, the expression adaptor entity(R) succeeds only if R matched a non-empty fragment of text. Entity expression adaptor validates that the encapsulated expression yields a non-empty match. |
void f() { input<> inp( "abc" ); const_edge<> universe( new_edge( inp ) ); size_t result = entity( *entity( isdigit ) ).match(universe); assert( result == std::string::npos ); } |
In notations of logical and mathematical expressions, in programming languages, as well as in natural languages, there are numerous examples of quotations, insertions, subexpressions, grouped phrases, citations, and passages, all of which require certain level of distinction from the rest of the sentence. Such elements of the input language often utilize balanced pair grammar constructs, for example:
"Hello, World!" |
Text is enclosed inside balanced quotation marks.
|
a/(b+c) |
Arithmetic sub-expression b+c is put inside balanced parenthesis.
|
<HTML><BODY>Hello, World!</BODY></HTML> |
Html elements, such as HTML and BODY, are
formed by pairs of html tags.
The pairs of angled brackets form html tags themselves. |
The constructs in the examples above are made clearly distinct from the rest of the phrase, expression, or entire document. The library offers special recognition to the objects set off by balanced punctuation, or by more complex patterns. The library defines a category of quote functions, designated for parsing various styles of quotes, all of which have general form:
quote( RL, RM, RR )where RL, RM, and RR are valid grammar expressions, named left, middle, and right expression, respectively. RL expression describes syntactical unit that opens the quote. RM expression describes grammar for input symbols located at the beginning of the interior of the quote. RM expression describes syntactical unit that closes the quote.
The left and right units of the quote are balanced. For example, if user input is specified as
<HTML><BODY>Hello, World!</BODY></HTML>then the middle expression, entity(), of the following quote,
quote( '<' + entity( isalpha ) + '>', entity(), symbol( '<' ) + '/' + entity( isalpha ) + '>' )matches the fragment "<BODY>Hello, World!</BODY>".
The following rules apply during grammar evaluation of quotes:
The quote is balanced in the sense that left and right expressions are evaluated exactly the same number of times. Evaluation of the quote stops when the balance between the two sides is reached.
If input contains nested pairs of fragments matching expression units RL and RR, both expressions are evaluated multiple times. The middle expression, RM, is invoked only once. If no balance between RL and RR could be found, the evaluation of the quote fails, and the expression RM is never evaluated.
While searching for a balanced match, implementation of the quote alternates the bang_find() calls on expressions RL and RR. It is important to remember to use runtime_match() when RL and RR grammar units are implemented as distinct grammar rule functions. For example,
struct parser { static size_t start( const_edge<>& universe_ ) { return quote( rule( parser::left ), entity(), rule( parser::right ) ).match( universe_ ); } static size_t left( const_edge<>& universe_ ) { return symbol( "BEGIN" ).runtime_match( universe_ ); } static size_t right( const_edge<>& universe_ ) { return symbol( "END" ).runtime_match( universe_ ); } }; //struct parser
Interior universe, which is visible to the middle expression RM, is limited to the inner boundaries of the fragments matching RL and RR. For the middle expression, it is sufficient to match only part of the interior. The middle expression is not required to consume the entire universe.
The following quote formats are available:
Table R.5 - CTTL quotessemantics | description |
---|---|
quote(RL,RM,RR)Here and below the abbreviations RL, RM, and RR stand for valid grammar expressions, describing opening, interior, and closing clauses of the quote. |
This is least specialized form of the quote. It can be used to
locate balanced pairs of two arbitrary grammatical clauses,
RL and RR,
such as matching XML tags in a document.
|
quote(true,RM,RR) |
Asymmetric quote. Expression !!RR is evaluated before
RM. The universe U visible to the
middle expression RM is restricted by the boundaries
corresponding to the position of U.first
before evaluation, and the upper boundary of the fragment matched by
!!RR.
See line_parser.cpp for an example of this quote. |
quote(char,RM,char) |
Quote specialized for left and right expressions containing
only a single character. For example,
void f() { input<> inp( "((a+b)*(c-d))" ); const_edge<> univ( new_edge( inp ) ); const_edge<> intr( new_edge( inp ) ); const_edge<> extr( new_edge( inp ) ); if ( extr( quote( '(', intr( entity() ), ')' ) ).match( univ ) != std::string::npos ) { std::cout << "Interior: " << intr.text() << std::endl; std::cout << "Exterior: " << extr.text() << std::endl; } } This example prints text Interior: (a+b)*(c-d) Exterior: ((a+b)*(c-d)) |
wchar_quote(wchar_t,RM,wchar_t) |
Version for wide characters:
#include <iostream> #include "cttl/cttl.h" using namespace cttl; int main() { input< std::wstring > inp( L"((a+b)*(c-d))" ); const_edge< policy_default, std::wstring > univ( new_edge( inp ) ); const_edge< policy_default, std::wstring > data( new_edge( inp ) ); if ( ( data( wchar_quote( L'(', entity(), L')' ) ) ).match( univ ) != std::wstring::npos ) std::wcout << L"Data: " << data.text() << std::endl; return 0; } This example prints text Data: ((a+b)*(c-d)) |
ansi_single_quote(RM) ansi_single_quote() |
Some languages (for example, various SQL dialects) allow two single quotes
in a row to represent a single quote character
inside single-quoted string literal. For example, literal
'abc''def' delineates a sequence of 7 characters,
namely, 'a', 'b', 'c', '\'', 'd', 'e', and 'f'. CTTL provides ansi_single_quote() function to parse such literals. For correct matches, ansi_single_quote() should be combined with kleene plus operator, '+' (see code example below for details). The RM expression provides access to the interior of the quoted literal. Function ansi_single_quote() is a shorthand notation for quote('\'',RM,'\''), but its implementation is optimized for literals formed by single quotation marks. Function ansi_single_quote() is an abbreviated version of ansi_single_quote(RM). It is a shorthand notation for quote('\'',true,'\''). void f() { input<> inp( "\'abc\'\'def\'" ); const_edge<> universe( new_edge( inp ) ); const_edge<> data( new_edge( inp ) ); if ( ( data( +ansi_single_quote() ) ).match( universe ) != std::string::npos ) std::cout << "Data: " << data.text() << std::endl; } This example prints text Data: 'abc''def' |
ansi_double_quote(RM) ansi_double_quote() |
Same as ansi_single_quote(RM), but for double-quoted literals,
for example, "abc""def".
This is a shorthand notation for quote('\"',RM,'\"'), but optimized for literals formed by double quotation marks. ansi_double_quote() is an abbreviated version of ansi_double_quote(expr). It is a shorthand notation for quote('\"',true,'\"'), but is also more optimized. void f() { input<> inp( "\"abc\"\"def\"" ); const_edge<> universe( new_edge( inp ) ); const_edge<> data( new_edge( inp ) ); if ( ( data( +ansi_double_quote() ) ).match( universe ) != std::string::npos ) std::cout << "Data: " << data.text() << std::endl; } This example prints text Data: "abc""def" |
c_single_quote(RM) c_single_quote() |
An ANSI-C string literal that recognizes C-style escape characters inside
string literals delimited by single quotation marks.
|
c_double_quote(RM) c_double_quote() |
An ANSI-C string literal that recognizes C-style escape characters inside
string literals delimited by double quotation marks.
|
wchar_ansi_single_quote(RM) wchar_ansi_single_quote() |
Specializations for wide characters.
|
wchar_ansi_double_quote(RM) wchar_ansi_double_quote() |
Specializations for wide characters.
|
wchar_c_single_quote(RM) wchar_c_single_quote() |
Specializations for wide characters.
|
wchar_c_double_quote(RM) wchar_c_double_quote() |
Specializations for wide characters.
|
The library introduces an idea of lexemic metaphor functions, encapsulating frequently used distinctive lexeme constructs, which describe certain classes of terminal symbols of the input grammar. For instance, names of identifiers, such as variable, function, and object names in many programming languages, could be described by the following construct:
// Identifiers in C, C++, Java: begin( "QWERTYUIOPASDFGHJKLZXCVBNM_qwertyuiopasdfghjklzxcvbnm" ) ^ end( "QWERTYUIOPASDFGHJKLZXCVBNM_qwertyuiopasdfghjklzxcvbnm1234567890" )Yet in other programming languages, the identifier rules could diverge into something like
// Identifiers in language XYZ: begin( "%QWERTYUIOPASDFGHJKLZXCVBNMqwertyuiopasdfghjklzxcvbnm" ) ^ end( "%QWERTYUIOPASDFGHJKLZXCVBNMqwertyuiopasdfghjklzxcvbnm1234567890" )Programming language identifiers fall into the category of terminal symbols that can be described by a metaphor function named literal(). For example,
literal( "%QWERTYUIOPASDFGHJKLZXCVBNMqwertyuiopasdfghjklzxcvbnm", "%QWERTYUIOPASDFGHJKLZXCVBNMqwertyuiopasdfghjklzxcvbnm1234567890" )is a synonym for the identifier of XYZ language shown above. The metaphor literal() with no arguments is an overloaded function, which is short hand notation for C/C++ identifier. Table R.6 lists CTTL metaphors:
metaphor name | description |
---|---|
literal()
literal(B,E) wchar_literal() wchar_literal(B,E) |
Literal metaphor is a shorthand notation for grammar expression begin( B ) ^ end( E )where B and E are user-defined character entities. Note that B is a subset of E. The default value for B is a set of alphabetic characters and the underscore; the default value for E is alphabetic character set, underscore, and ten digits. Wide character versions of literal metaphor are also provided. Metaphor cttl::literal() is equivalent to the expression begin( "QWERTYUIOPASDFGHJKLZXCVBNM_qwertyuiopasdfghjklzxcvbnm" ) ^ end( "QWERTYUIOPASDFGHJKLZXCVBNM_qwertyuiopasdfghjklzxcvbnm1234567890" ) Metaphor cttl::wchar_literal() is equivalent to begin( L"QWERTYUIOPASDFGHJKLZXCVBNM_qwertyuiopasdfghjklzxcvbnm" ) ^ end( L"QWERTYUIOPASDFGHJKLZXCVBNM_qwertyuiopasdfghjklzxcvbnm1234567890" ) For most efficient implementation, character entities B and E should be passed by address, for example: static size_t r( const_edge<>& edge_ ) { static const std::string ABC( "ABC" ); static const std::string ABCabc( "ABCabc" ); return literal( &ABC, &ABCabc ).match( edge_ ); } |
entity() |
Although cttl::entity() falls into the category
of lexeme functions, internally it is implemented as a metaphor:
begin( true ) + !end() |
Sometimes a lookahead is necessary to decide in advance how to parse part of the input. For instance, parser may try one way of matching, and if it fails, parser may try the alternative way. For this kind of processing, the lexer has to backtrack to the original input position. Backtracking technique (resetting input position) can often simplify the structure of grammar rules. However, repetitive backtracking and parsing the same fragment multiple times may cause negative performance impact. This deficiency sometimes yields poor performance, rendering an impractical parser.
An alternative is to stay away from the backtracking technique by specifying a more restrictive set of grammars. Restrictive parser predicts what production rule to use and removes requirements for excessive backtracking.
The library provides building blocks for compiling both backtracking and restrictive parsers. Backtracking CTTL parser supports unlimited positive and negative lookahead types.
Certain grammar constructs always require backtracking. For example, set union operator,
symbol( 'A' ) | symbol( 'a' )requires the starting position to be restored before right-hand-side expression, symbol( 'a' ), can be evaluated.
For more information about backtracking, see CTTL operators.
A valid production rule expression, for example,
entity( isalpha ) + entity( isdigit )describes syntactic grouping of symbols, known as a non-terminal symbol. Non-terminals represent patterns of symbols of the input language. They are organized as idioms containing individual terminals, as well as calls to the production rules encapsulating other non-terminal symbols of the grammar.
Grammar expression can be evaluated by one of the evaluation methods: match(), find(), bang_find(), or runtime_match(). For example,
( entity( isalpha ) + entity( isdigit ) ).match( universe )
Unary and binary operators can be applied to form production rule expressions from separate units like terminals, non-terminals, and sub-expressions enclosed in parenthesis:
// matches notation for hexadecimal numbers like "0x0d", "0XFFFF", etc. symbol( '0' ) ^ ( symbol( 'x' ) | 'X' ) ^ entity( isxdigit )Binary operators chain two expressions together and specify relationship between them; unary operators modify behavior of the underlying expression. For example:
// matches one or more zeroes or ones, like "0", "1", "01", etc. +( symbol( '0' ) | symbol( '1' ) )Search operator illustrates modification of the expression behavior. It forces lexer engine to escalate evaluation from match() to find(), or from find() to bang_find() evaluation:
// search for lowercase characters: !entity( islower )In case of a search, first terminal symbol on the left side of the grammar expression is formally given a special recognition. Such symbol is distinguished as leftmost terminal symbol, which consumes the search request initiated by find() or bang_find() evaluation. (By saying "consumes" we imply that the lexeme specifying leftmost terminal symbol does not delegate, but actually carries out the requested task.) For example,
// search for negative number !!( symbol( '-' ) + entity( isdigit ) )Here, repeatable search !! invokes bang_find() evaluation method of the subexpression enclosed in parenthesis:
( symbol( '-' ) + entity( isdigit ) ).bang_find( universe )The two lexemes are connected by binary plus operator forming a sequence. The sequence is evaluated as follows: binary plus delegates bang_find() call to the left-hand-side expression:
symbol( '-' ).bang_find( universe )If this call succeeds, the right-hand-side expression, entity( isdigit ), is evaluated by the match() method:
entity( isalpha ).match( universe )The sequence operator succeeds if both of these calls succeed. Regardless of the grammar expression structure and complexity, the leftmost terminal symbol always receives the search request, because all C++ binary operators recognized within CTTL framework have left-to-right associativity.
The library supplies C++ operators overloaded for template classes implementing CTTL lexemes, quotes, and adaptors. In addition, library defines overloaded operators for string and character literals, std::string and std::wstring template classes, and other C++ types that can be used in place of lexeme arguments (see table R.2 for details about lexeme argument types). This arrangement allows to write abbreviated grammar expressions. For example,
// matches substring "<document>" symbol( '<' ) + symbol( "document" ) + symbol( '>' ) // abbreviated form: symbol( '<' ) + "document" + '>'Table R.7 describes CTTL operators in order of C++ operator precedence. Items towards the bottom of the list have lower precedence; therefore, parenthesis should be used to group sub-expressions requiring alternative priority of evaluation. In our description, R, R1, and R2 will represent valid grammar expressions. An event of successful application of any evaluation method against a grammar expression is commonly known as a match. On the other hand, the term search usually indicates an attempt to execute a continuous search for a pattern of symbols via find() or bang_find() evaluation.
If grammar evaluation succeeds, the fragment of the input text corresponding to the required pattern is called matched fragment. Therefore, every instance of successful match or search will have a corresponding matched fragment. Fragments can be characterized by their boundaries and length. If the fragment has zero length, the match is considered to be an empty match. Empty matches are not uncommon; they occur when grammars contain explicit mentioning of epsilon symbols, or sub-expressions qualified with kleene star operators. For example,
// Rule BracketsOpt has potential for empty match: size_t BracketsOpt( const_edge< policy_space<> >& universe_ ) { return ( *( symbol( '[' ) + ']' ) ).match( universe_ ); }
operator name | description |
---|---|
!R
search |
Unary search operator, '!', promotes current evaluation
method to a step-higher search. The match is promoted to find,
find is promoted to bang_find. For example, grammar rule
rm
size_t rm( const_edge<>& universe_ ) { return ( !R ).match( universe_ ); }could be rewritten as an equivalent rule, rf: size_t rf( const_edge<>& universe_ ) { return ( R ).find( universe_ ); }where R is a valid grammar expression. |
!!R
repeatable search |
Two search operators in a row, '!!',
applied to the grammar expression R, guarantee promotion
of the lexical analysis of R to the
level of repeatable search, carried out by the bang_find()
evaluation method. Therefore, grammar rule rm
size_t rm( const_edge<>& universe_ ) { return ( !!R ).match( universe_ ); }could be equally rewritten as a different rule, rbf: size_t rbf( const_edge<>& universe_ ) { return ( R ).bang_find( universe_ ); }where R is a valid grammar expression. |
+R
kleene plus |
Unary plus operator, '+',
specifies one or more matches of grammar expression R,
evaluated as one syntactical clause.
|
-R
logical not |
Unary minus operator, '-',
specifies negative lookahead assertion of grammar expression
R.
|
*R
kleene star |
Unary star operator, '*',
specifies zero or more matches of grammar expression
R.
|
R * N
kleene star |
Binary star operator, '*',
specifies zero or up to N matches of
grammar expression R,
where N is a positive integer number.
|
R1 - R2
set complement (difference) |
Binary minus operator, '-',
specifies matches of grammar expression R1 that
are not matches of expression R2.
|
R1 + R2
sequence |
Binary plus operator, '+',
specifies that grammar expressions R1 and
R2 are evaluated in sequence,
from left to right.
|
R + N
kleene plus |
Binary plus operator, '+',
specifies one or up to N matches of grammar
expression R,
where N is a positive integer number.
|
R + pair( N, M )
kleene plus |
Binary plus operator, '+',
specifies at least N, or
up to M matches of grammar expression R,
where pair( N, M ) is an instance of
std::pair |
R1 & R2
set intersection |
Binary and oprator, '&',
specifies matches of grammar expression R1 that
are also matches of another expression, R2.
|
R1 ^ R2
concatenation, (strict sequence) |
Binary exclusive or operator, '^',
specifies that two succeeding evaluations of grammar expressions
R1 and R2 form a single leximatic unit,
consisting of two strictly following one another fragments of input.
|
R1 | R2
set union |
Binary or operator, '|',
specifies matches of either expression R1 or
R2.
|
R1 || R2
POSIX union operator |
Binary logical or operator, '||',
specifies matches of either expression R1, or
R2 accordingly to the POSIX standard: it guarantees
the longest leftmost match.
|
-Rwhere R is a C++ expression describing grammar production rule.
Logical not specifies negative lookahead assertion of expression R. Successful lookahead assertion -R does not consume any characters of the parseable universe. The lexer engine saves universe position at the beginning of the evaluation, and restores it back after the evaluation is complete.
Result of -R is best described by the definition of absolute complement from the set theory. Assume that a universal set of strings U is defined as a set of all strings immediately before evaluation of expression -R. Also, if set A of matching strings is defined for rule R, then A is a subset of U. Then, expression -R matches a new set of strings which is an absolute complement (or simply complement) of the subset A of U, defined as a set of all strings of U which are not elements of A.
See also: Wikipedia - Absolute complement
Unary Kleene star operator makes available expression format
*Rwhere R is a C++ expression describing grammar production rule.
Kleene star specifies zero or more matches of R. This Kleene form attempts as many matches as possible, otherwise known as greedy evaluation.
From the set theory, Kleene star operator can be defined as follows: if set A of matching strings is defined for rule R, then expression *R defines a new set of strings *A, which is a set of all strings that can be made by concatenating zero or more strings from A. Therefore, empty string is also part of set *A.
Expression *R always succeeds.
Empty match is counted as a good match. To prevent the lexer engine from entering infinite loop on empty matches, empty match never repeats. To exclude empty matches from the results of evaluation, use *entity( R ) expression.
See also: Wikipedia - Kleene star
Unary Kleene plus operator makes available expression format
+Rwhere R is a C++ expression describing grammar production rule.
Kleene plus specifies one or more matches of R. This Kleene form attempts as many matches as possible, otherwise known as greedy evaluation.
From the set theory, Kleene plus operator can be defined as follows: if set A of matching strings is defined for rule R, expression +R defines a new set of strings +A, which is a set of all strings that can be made by concatenating strings from A.
Empty match is counted as a good match. To prevent the lexer engine from entering infinite loop on empty matches, empty match never repeats. To exclude empty matches from the results of evaluation, use +entity( R ) expression.
R * Nwhere R is a C++ expression describing grammar production rule, and N is a positive integer number.
This version of Kleene star specifies zero or up to N matches of R. This form attempts as few repeats as possible, otherwise known as non-greedy, or lazy evaluation.
Expression R * N always succeeds.
Expression R * N never returns more than N matches; if Nth occurrence of the match fragment is found, the evaluation stops.
Empty match is counted as a good match. To exclude empty matches from the results of evaluation, use entity( R ) * N expression instead.
Many regular expression engines recognize "R?" syntax, which specifies "zero or one matches". In CTTL, request for zero or one matches is written as R * 1.
R + Nwhere R is a C++ expression describing grammar production rule, and N is a positive integer number.
This version of Kleene plus specifies one or up to N matches of R. This Kleene form attempts as few repeats as possible, otherwise known as non-greedy, or lazy evaluation.
Expression R + N never returns more than N matches; if Nth occurrence of the match fragment is found, the evaluation stops.
Empty match is counted as a good match. To exclude empty matches from the results of evaluation, use entity( R ) + N expression instead.
R + pair( N, M )where R is a C++ expression describing grammar production rule; pair( N, M ) is an instance of std::pair
This version of Kleene plus requires at least N matches, or up to M matches of expression R. Anywhere between N and M, this form of Kleene plus attempts as many matches as possible, otherwise known as greedy evaluation.
If less than N or more than M matches exist, expression R + pair( N, M ) fails.
Empty match is counted as a good match, but never repeats. If empty match is found, it stops the evaluation. To exclude empty matches from the results of evaluation, use entity( R ) + pair( N, M ) expression instead.
To restrict Kleene operator to exactly N matches, use
R + std::make_pair( N, N )
There is no explicit CTTL syntax that allows to specify an infinity for the upper limit of the Kleene matches. To work around, a "big enough" number should be specified for the upper bound parameter:
R + std::make_pair( N, 999 )
Binary minus operator, or set absolute complement, makes available expression format
R1 - R2where R1 and R2 are C++ expressions describing grammar production rules.
Binary minus operator specifies matches of R1 that are not matches of R2. Result of R1 - R2 is best described by the definition of absolute complement from the set theory: If a universal set U of matching strings is defined for rule R1, and a subset A of U of matching strings is defined for rule R2, then expression R1 - R2 matches a new set, which is an absolute complement (or simply complement) of a subset A of U, defined as a set of all strings of U which are not elements of A.
Successful evaluation of R2 does not consume any characters of the parseable universe. The lexer engine saves universe position at the beginning of the R2 evaluation, and backtracks the position after evaluation of R2 is complete.
The universe presented to the expression R2 is restricted to the boundaries of the fragment of text that matches expression R1. To enforce this rule, complement operator removes custom white space policy from the universe and replaces it with policy_default before universe is presented to the expression R2 for evaluation.
Complement operator R1 - R2 has left-to-right associativity: expression R1 is evaluated first, and then R2 is evaluated. If evaluation of R1 fails, expression R2 is never evaluated. This feature is also known as short-circuiting.
Note that expressions
entity( isalpha ) - symbol( '(' )and
entity( isalpha ) + -symbol( '(' )are very different. First expression matches an alphabetical multi-character entity and then tries to compare its first character with '(', which will always fail. The second expression attempts negative lookahead for character '(' after alphabetical entity. The rationale for set complement is to validate the fragment matched by the left hand side against expression provided on the right hand side of the operator. The negative lookahead, on the other hand, can be done by the negative lookahead assertion operator, '-'.
See also: Wikipedia - Absolute complement.
Binary sequence operator makes available expression format
R1 + R2where R1 and R2 are C++ expressions describing grammar production rules.
Binary sequence operator specifies that expressions R1 and R2 are evaluated in sequence, from left to right. Resulting match contains fragment match of R1 followed by the fragment match of R2. Both expressions must succeed in order for the entire sequence to succeed.
Sequence operator R1 + R2 has left-to-right associativity: expression R1 is evaluated first, and then R2 is evaluated. If evaluation of R1 fails, expression R2 is never evaluated. This feature is also known as short-circuiting.
Binary and operator, or set intersection, makes available expression format
R1 & R2where R1 and R2 are C++ expressions describing grammar production rules.
Intersection operator specifies matches of R1 that are also matches of R2. Result of R1 & R2 is best described by the definition of set intersection from the set theory: If a universal set U of matching strings is defined for rule R1, and a subset A of U of matching strings is defined for the rule R2, then expression R1 & R2 matches set of strings U & A that contains all elements of U that also belong to A.
Successful evaluation of R2 does not have any effect on the parseable universe. The lexer engine saves universe position at the beginning of the R2 evaluation, and backtracks the position after evaluation of R2 is complete.
The universe presented to the expression R2 is restricted to the boundaries of the fragment of text that matches expression R1. To enforce this rule, intersection removes custom white space policy from the universe and replaces it with policy_default before universe is presented to the expression R2 for evaluation.
Intersection operator R1 & R2 has left-to-right associativity: expression R1 is evaluated first, and then R2 is evaluated. If evaluation of R1 fails, expression R2 is never evaluated. This feature is also known as short-circuiting.
See also: Wikipedia - Intersection.
Binary concatenation operator makes available expression format
R1 ^ R2where R1 and R2 are C++ expressions describing grammar production rules.
Concatenation operator specifies that expressions R1 and R2 are evaluated in sequence, from left to right. Resulting matched fragment of text contains fragment matching R1 immediately followed by the fragment matching R2. Both expressions must succeed in order for the concatenation to succeed.
The difference between sequence and concatenation is that concatenation removes custom white space policy from the universe and replaces it with policy_default before universe is presented to the expression R2 for the evaluation. Therefore, concatenation can be formally qualified as a "strict sequence".
Concatenation operator R1 ^ R2 has left-to-right associativity: expression R1 is evaluated first, and then R2 is evaluated. If evaluation of R1 fails, expression R2 is never evaluated. This feature is also known as short-circuiting.
Binary or operator, or set union, makes available expression format
R1 | R2where R1 and R2 are C++ expressions describing grammar production rules.
Set union operator specifies matches of either R1, or R2. It is also known as the disjunction operator. Result of R1 | R2 is best described by the definition of set union from the set theory: If a set A of matching strings is defined for rule R1, and a set B of matching strings is defined for the rule R2, then the expression R1 | R2 matches set of strings A | B which is a union of two sets A and B, defined as the set that contains all elements of A and all elements of B, but no other elements.
Union operator R1 | R2 has left-to-right associativity: expression R1 is evaluated first. If evaluation of R1 succeeds, result of the union is the result of expression R1, so expression R2 is never evaluated. This feature is also known as short-circuiting. If evaluation of R1 fails, result of the union is the result of expression R2.
CTTL implementation of the union operator is based on the traditional Non-deterministic Finite Automaton (NFA) behavior. It may leave longer matches undiscovered, because evaluation stops as soon as the first match is found. (Therefore, traditional NFA is non-greedy.) Traditional NFA backtracks in the event of a non-match, and so does the CTTL lexer engine. If R1 fails, the beginning of the parseable universe is restored to give R2 a chance.
Binary POSIX union operator makes available expression format
R1 || R2where R1 and R2 are C++ expressions describing grammar production rules.
POSIX union specifies matches of either R1 or R2 accordingly to the POSIX standard: it guarantees the longest leftmost match.
POSIX NFA is slower than the traditional NFA-based set union operator, especially if there are many possible matches. Traditional NFA may find a match early, and then stop evaluation. In contrast, POSIX NFA continues to backtrack and goes on with matching. With POSIX NFA guarantees that both sides of the union expression are evaluated; therefore, it will not favor a shorter match over a longer one by changing the order of the expressions, e g. R2 || R1.
Permission to copy, use, modify, sell and distribute this document is granted provided this copyright notice appears in all copies. This document is provided "as is" without express or implied warranty, and with no claim as to its suitability for any purpose.