Common Text Transformation Library http://cttl.sourceforge.net/
There are numerous examples of quotations, insertions, sub-expressions, grouped phrases, citations, and passages, all of which require a certain level of distinction from the rest of the input language syntax. Such inputs often involve grammar constructs that form balanced pairs of input symbols. CTTL collectively categorizes this type of syntax as a quoted construct, or simply, a quote.
Numerous variations of quotes can be found in the notation of logical and mathematical expressions, programming languages, and natural languages:
"Hello"     // Text surrounded by quotation marks.
( x + 5 )   // Arithmetic expression enclosed in parentheses.
Quotes can be nested inside one another, like HTML tags in a web page:
<HTML><BODY>Hello, World!</BODY></HTML>
Nesting is an important characteristic of a generic quote. HTML elements, such as HTML and BODY, are formed by pairs of balanced HTML tags. Interestingly, the tags themselves can be described as quoted constructs, formed by pairs of angle brackets.
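The two levels of quoting described above can be sketched in plain C++ without CTTL. The helper below, tags_balanced, is a hypothetical name introduced only for this illustration and is not part of the library. It treats each tag as an inner quote delimited by '<' and '>', and uses a stack of element names to check that the outer quotes (opening and closing tags) nest properly. As an assumption made for brevity, tags carry no attributes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not a CTTL function): returns true when every
// opening tag is closed by a matching closing tag in properly nested order.
// Assumption: tags have no attributes, so the text between '<' and '>'
// is exactly the element name (prefixed by '/' for a closing tag).
bool tags_balanced(const std::string& html)
{
    std::vector<std::string> open_elements;  // names of elements still open
    std::size_t pos = 0;

    // Inner quote: each tag is the text between '<' and the next '>'.
    while ((pos = html.find('<', pos)) != std::string::npos) {
        const std::size_t end = html.find('>', pos);
        if (end == std::string::npos) {
            return false;  // '<' without a closing '>'
        }
        const std::string name = html.substr(pos + 1, end - pos - 1);

        // Outer quote: a closing tag must match the most recent opening tag.
        if (!name.empty() && name[0] == '/') {
            if (open_elements.empty() || open_elements.back() != name.substr(1)) {
                return false;
            }
            open_elements.pop_back();
        } else {
            open_elements.push_back(name);
        }
        pos = end + 1;
    }
    return open_elements.empty();  // nothing may be left open
}
```

With this sketch, tags_balanced accepts the input <HTML><BODY>Hello, World!</BODY></HTML> and rejects the same input with the two closing tags swapped.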
The CTTL library defines a family of quote helper functions, designed to yield lexer components that parse various styles of quoted input symbols.
The generalized version of the CTTL quote has the format
quote( RL, RM, RR )
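where RL, RM, and RR represent the left, middle, and right grammar expressions. In particular, RL matches the opening unit of the quote, RM matches its interior, and RR matches the closing unit.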
The following sample matches pairs of parentheses that balance each other out in an arithmetic expression:
#include <cassert>
#include <string>
#include "cttl/cttl.h"

using namespace cttl;

int main()
{
    // Arithmetic expression using balanced parentheses:
    std::string inp = "((a+b)*(c-d))";
    const_edge<> substring( inp );
    const_edge<> interior = substring;

    // Match the outermost pair and capture the text between them:
    size_t result = quote(
        '(',
        interior( entity() ),
        ')'
    ).match( substring );

    assert( result != std::string::npos );
    assert( interior == "(a+b)*(c-d)" );
    return 0;
}
A more elaborate example, quotes_html.cpp, matches pairs of opening and closing tags in a simple HTML text. Note that this is not a complete HTML parser. The attributes of elements are ignored, and no white space recognition is attempted. Both limitations make the current version of quotes_html.cpp a bit unrealistic. Nevertheless, the lexer finds pairs of balanced tags and prints the interior text of each element on the screen:
Keywords: quotes_html.cpp, CTTL_STATIC_RULE, HTML, literal, recursion
// quotes_html.cpp
//#define CTTL_TRACE_RULES
#include "cttl/cttl.h"

using namespace cttl;

/**This grammar rule matches HTML text.*/
size_t rule_text( const_edge<>& substr_ )
{
    if ( substr_.length() == 0 ) {
        return std::string::npos;
    }

    // Assume that the entire parseable substring
    // represents the text inside an HTML tag:
    std::cout
        << "TEXT: ["
        << substr_
        << ']'
        << std::endl
        ;

    // Grab starting position of text:
    size_t result = substr_.first.offset();

    // Update ending position of text:
    substr_.first = substr_.second;
    return result;
}

/**This is the starting grammar rule to match HTML tags.*/
size_t rule_element( const_edge<>& substr_ )
{
    const_edge<> interior = substr_;
    const_edge<> element_name = substr_;

    size_t result = quote(
        quote( '<', literal(), '>' ),
        interior( CTTL_STATIC_RULE( rule_element ) ),
        quote( symbol('<')+'/', element_name( literal() ), symbol( '>' ) )
    ).match( substr_ );

    if ( result != std::string::npos ) {
        // A tag was found: print some info:
        std::cout
            << "ELEMENT "
            << element_name
            << ": ["
            << interior
            << ']'
            << std::endl
            ;

        // Since more tags may exist, try to evaluate them too:
        ( *CTTL_STATIC_RULE( rule_element ) ).match( substr_ );
        return result;
    }

    return CTTL_STATIC_RULE( rule_text ).match( substr_ );
}

/**Very simple program to match some HTML tags.*/
int main()
{
    // Construct simple HTML.
    // HTML elements are formed by pairs of tags:
    std::string inp =
        "<HTML>"
        "<BODY>"
        "<p id='greeting'>Hello, World!</p>"
        "<p>2nd paragraph</p>"
        "</BODY>"
        "</HTML>"
        ;

    const_edge<> substring( inp );
    CTTL_STATIC_RULE( rule_element ).match( substring );
    return 0;
}

/*Output
TEXT: [Hello, World!]
ELEMENT p: [Hello, World!]
TEXT: [2nd paragraph]
ELEMENT p: [2nd paragraph]
ELEMENT BODY: [<p id='greeting'>Hello, World!</p><p>2nd paragraph</p>]
ELEMENT HTML: [<BODY><p id='greeting'>Hello, World!</p><p>2nd paragraph</p></BODY>]
*/
The sample program intentionally makes heavy use of quotes, combining them with recursively invoked grammar rules.
The quotes_html.cpp sample demonstrates an important structural aspect of a quote: by definition, the left and right units are bound by a balanced relation. This balance is what allows quotations to be nested. The balanced relation can be illustrated by a smaller program that matches the opening and closing tags of the outermost element in HTML input:
#include <cassert>
#include <string>
#include "cttl/cttl.h"

using namespace cttl;

int main()
{
    std::string inp = "<HTML><BODY>Hello, World!</BODY></HTML>";
    const_edge<> substring( inp );
    const_edge<> interior = substring;

    // Match the outermost opening and closing tags
    // and capture everything between them:
    quote(
        '<' + entity( isalpha ) + '>',
        interior( entity() ),
        symbol( '<' ) + '/' + entity( isalpha ) + '>'
    ).match( substring );

    assert( interior == "<BODY>Hello, World!</BODY>" );
    return 0;
}
The generic formats of balanced quotes are not intended to solve every possible case of matching symmetrical patterns in the input language. The implementation of quotes relies heavily on the repeatable search algorithm. (This excludes the asymmetric and specialized quote formats listed below.)
The complexity of a quoted search, combined with validating the symmetry between left and right matches, requires additional backtracking from the underlying lexer component. As a result, generic quotes are mainly suited to building search parsers over reduced grammar sets.
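The cost of a repeatable search with symmetry validation can be sketched in plain C++. This is only an illustration of the general idea under simplified assumptions, not CTTL's actual implementation. The names QuoteMatch, is_balanced, and search_quote are hypothetical. The search finds an opening pattern, then tries each candidate closing pattern in turn, and re-validates the interior every time a candidate is rejected.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch (not a CTTL type).
struct QuoteMatch {
    bool found = false;
    std::string interior;  // text between the matched opener and closer
    int attempts = 0;      // number of closing candidates that were tried
};

// Returns true when openers and closers inside text pair up correctly.
bool is_balanced(const std::string& text,
                 const std::string& open, const std::string& close)
{
    int depth = 0;
    std::size_t pos = 0;
    while (pos < text.size()) {
        if (text.compare(pos, open.size(), open) == 0) {
            ++depth;
            pos += open.size();
        } else if (text.compare(pos, close.size(), close) == 0) {
            if (--depth < 0) return false;  // closer without an opener
            pos += close.size();
        } else {
            ++pos;
        }
    }
    return depth == 0;
}

// Repeatable search: after finding the first opener, try every later
// closer until the interior between them is balanced.
QuoteMatch search_quote(const std::string& text,
                        const std::string& open, const std::string& close)
{
    QuoteMatch match;
    const std::size_t left = text.find(open);
    if (left == std::string::npos) return match;

    const std::size_t begin = left + open.size();
    std::size_t right = text.find(close, begin);
    while (right != std::string::npos) {
        ++match.attempts;
        const std::string candidate = text.substr(begin, right - begin);
        if (is_balanced(candidate, open, close)) {  // symmetry validation
            match.found = true;
            match.interior = candidate;
            return match;
        }
        // Rejected: back up and search for the next closing candidate.
        right = text.find(close, right + close.size());
    }
    return match;
}
```

For the input ((a+b)*(c-d)) with '(' and ')' as the patterns, this sketch rejects two closing candidates before accepting the third, and rescans the interior on each attempt. That repeated work is the kind of overhead the paragraph above refers to.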
By design, the generic quote implementation is a search parser that executes a repeatable search for a pair of opening and closing patterns. A conventional recursive-descent parser, which scans the input sequentially and matches it against a complete grammar, performs better when it does not use generic quotes.
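For contrast, a conventional recursive-descent treatment of the same balanced input might look like the following plain C++ sketch. It is not CTTL code, and parse_group is a hypothetical name. The rule it implements is group := '(' { group | any other character } ')'. Because the grammar itself describes the nesting, the input is scanned once from left to right and no closing candidate is ever retried.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical recursive-descent rule (not a CTTL function):
//   group := '(' { group | any other character } ')'
// On entry, pos must point at '('. On success, pos is moved just past the
// matching ')' and the interior of the group is appended to interiors
// (inner groups are recorded before the groups that contain them).
bool parse_group(const std::string& text, std::size_t& pos,
                 std::vector<std::string>& interiors)
{
    if (pos >= text.size() || text[pos] != '(') {
        return false;
    }
    const std::size_t begin = ++pos;  // first character of the interior

    while (pos < text.size() && text[pos] != ')') {
        if (text[pos] == '(') {
            // Nested group: descend, which also advances pos past it.
            if (!parse_group(text, pos, interiors)) {
                return false;
            }
        } else {
            ++pos;  // ordinary character
        }
    }

    if (pos >= text.size()) {
        return false;  // input ended before the closing ')'
    }
    interiors.push_back(text.substr(begin, pos - begin));
    ++pos;  // consume ')'
    return true;
}
```

Calling parse_group on ((a+b)*(c-d)) with pos set to 0 records the interiors a+b, c-d, and (a+b)*(c-d), in that order, and leaves pos at the end of the input.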
On the other hand, a number of specialized versions of CTTL quotes are optimized for mainstream grammars. The following quote formats, as well as their wchar_t counterparts, can therefore be used freely in all conventional parsers:
Asymmetric quote, quote(true,RM,RR) (deprecated, use positive lookbehind assertion algorithm instead);
ANSI single quote, ansi_single_quote(RM);
ANSI double quote, ansi_double_quote(RM);
C single quote, c_single_quote(RM);
C double quote, c_double_quote(RM).
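This page does not spell out how the ANSI and C quote styles differ. The plain C++ sketch below assumes the common conventions: a C-style quoted string escapes an embedded quote character with a backslash, while an ANSI-style quoted string escapes it by doubling. Both function names are hypothetical and neither function is part of CTTL. The sketch only shows why each style can be scanned in a single forward pass, without the repeatable search that generic quotes require.

```cpp
#include <cstddef>
#include <string>

// Assumption: C-style quoting, where a backslash escapes the next character.
// text[open_pos] is expected to be the opening '"'.
// Returns the offset of the closing '"', or std::string::npos.
std::size_t find_c_closing_quote(const std::string& text, std::size_t open_pos)
{
    for (std::size_t i = open_pos + 1; i < text.size(); ++i) {
        if (text[i] == '\\') {
            ++i;  // skip the escaped character
        } else if (text[i] == '"') {
            return i;
        }
    }
    return std::string::npos;
}

// Assumption: ANSI-style quoting, where a doubled quote character ("")
// stands for one literal quote inside the string.
// text[open_pos] is expected to be the opening '"'.
// Returns the offset of the closing '"', or std::string::npos.
std::size_t find_ansi_closing_quote(const std::string& text, std::size_t open_pos)
{
    for (std::size_t i = open_pos + 1; i < text.size(); ++i) {
        if (text[i] == '"') {
            if (i + 1 < text.size() && text[i + 1] == '"') {
                ++i;  // doubled quote: literal character, keep scanning
            } else {
                return i;
            }
        }
    }
    return std::string::npos;
}
```

In both functions every character is examined at most once, which is consistent with the statement above that the specialized formats suit conventional parsers.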
Copyright © 1997-2009 Igor Kholodov mailto:cttl@users.sourceforge.net.
Permission to copy and distribute this document is granted provided this copyright notice appears in all copies. This document is provided "as is" without express or implied warranty, and with no claim as to its suitability for any purpose.