Common Text Transformation Library http://cttl.sourceforge.net/
A program can benefit from the concept of inoperable space by declaring parts of its input invisible to CTTL text operations. For example, a two-pass algorithm can detect a set of unwanted substrings in the first pass, and then exclude those fragments from processing in the second pass. If the second pass involves mutating operations on the text, the inoperable space is adjusted accordingly.
Consider a search-and-replace program that needs to separate string literals and programming language comments from the searchable areas of input. (A complete example, cpp_comment_strip.cpp, is given at the end of this section.) Locating literals and comments such as "Hello" and /*Goodbye*/ is a relatively uncomplicated task. For example, a CTTL grammar that searches for C/C++ comments may look like this:
    #include <iostream>
    #include "cttl/cttl.h"

    using namespace cttl;

    int main()
    {
        std::string inp =
            "int main() {\n"
            "\t// I am a C++ comment\n"
            "\treturn 0;\n"
            "\t/" "*\n"
            "\tI am a C comment\n"
            "\t*" "/\n"
            "}\n"
            ;

        const_edge<> substring( inp );
        const_edge<> comment = substring;

        std::cout << inp << std::endl;
        std::cout << "COMMENTS:" << std::endl;

        while (
            comment(
                begin( '/' )
                +
                (
                    (
                        symbol( '/' ) + symbol( '/' ) + !begin( symbol( '\n' ) | end() )
                    )
                    |
                    (
                        symbol( '/' ) + symbol( '*' ) + !!( symbol( '*' ) + symbol( '/' ) )
                    )
                )
            ).bang_find( substring ) != std::string::npos
        )
        {
            std::cout << "[" << comment << "]" << std::endl;
        }

        return 0;
    }
The program output is
    int main() {
            // I am a C++ comment
            return 0;
            /*
            I am a C comment
            */
    }

    COMMENTS:
    [// I am a C++ comment]
    [/*
            I am a C comment
            */]
Excluding a set of particular fragments from the searchable area is harder, because every operation must be checked against the definition of the rejected space, potentially requiring elaborate calculations of physical offsets in the input string. Fortunately, CTTL offers direct, built-in support for the concept of inoperable space. The CTTL lexer can skip predefined portions of text that are recorded in a map of void regions of the input. A stateful space policy is a logical place to store and maintain such a map.
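To make the bookkeeping concrete, here is a minimal hand-written sketch of a replace operation that honors excluded fragments. It does not use CTTL; the names region_T and replace_outside are hypothetical, and the regions are assumed to be sorted and non-overlapping.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

typedef std::pair< std::size_t, std::size_t > region_T;  // [first, second)

// Hypothetical helper, not part of CTTL: replaces occurrences of from_
// that do not intersect any region, and shifts the regions located
// after each replacement so they keep covering the same characters.
// Returns the number of replacements made.
std::size_t replace_outside(
    std::string& text_,
    std::vector< region_T >& regions_,
    std::string const& from_,
    std::string const& to_
    )
{
    assert( !from_.empty() );
    std::size_t count = 0;
    std::size_t pos = 0;
    while ( ( pos = text_.find( from_, pos ) ) != std::string::npos ) {
        std::size_t match_end = pos + from_.length();
        // Look for a region that intersects the match [pos, match_end)
        std::size_t skip_to = std::string::npos;
        for ( std::size_t i = 0; i < regions_.size(); ++i ) {
            if ( regions_[ i ].first < match_end && pos < regions_[ i ].second ) {
                skip_to = regions_[ i ].second;
                break;
            }
        }
        if ( skip_to != std::string::npos ) {
            pos = skip_to;  // resume the search past the region
            continue;
        }
        text_.replace( pos, from_.length(), to_ );
        // Shift every region that starts at or after the replaced text
        for ( std::size_t i = 0; i < regions_.size(); ++i ) {
            if ( regions_[ i ].first >= match_end ) {
                regions_[ i ].first = regions_[ i ].first - from_.length() + to_.length();
                regions_[ i ].second = regions_[ i ].second - from_.length() + to_.length();
            }
        }
        pos += to_.length();
        ++count;
    }
    return count;
}
```

Even in this small form, every match needs an intersection test and every mutation needs an offset adjustment, which is the work a region-aware policy is meant to take over.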
Implementation of void regions is built into a family of space policy classes:

    policy_space< flag_follow_region >
    policy_space< flag_greedy | flag_follow_region >
    policy_space< flag_follow_space | flag_follow_region >
    policy_space< flag_follow_either > // flag_follow_either = flag_follow_space | flag_follow_region
    policy_space< flag_greedy | flag_follow_space | flag_follow_region >
A region-aware space policy is a stateful object: it maintains a map of arbitrary input regions that describe unusable areas of input.
flag_follow_space combines user-defined void regions with the conventional white space characters:

    cr = '\r'
    lf = '\n'
    ht = '\t'
    vt = '\v'
    ff = '\f'
    space = ' '
flag_greedy specifies that a greedy evaluation algorithm should be used by the policy. Unless flag_greedy is specified, the implementation is ungreedy.
flag_follow_either is shorthand for the white space and region flags combined:
flag_follow_either = flag_follow_space | flag_follow_region
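The way such flags compose can be illustrated with a small stand-alone sketch. Everything in it is hypothetical: the sketch_ names and the numeric flag values are chosen for illustration only and do not reflect the values or the internals that CTTL actually uses.

```cpp
#include <cassert>

// Hypothetical flag values chosen for this sketch; CTTL defines its own.
enum sketch_flags {
    sketch_flag_greedy        = 1,
    sketch_flag_follow_space  = 2,
    sketch_flag_follow_region = 4,
    sketch_flag_follow_either = sketch_flag_follow_space | sketch_flag_follow_region
};

// Hypothetical policy template: the flag combination is part of the
// type, so each query below is resolved at compile time.
template< int FlagsV >
struct sketch_policy {
    static bool greedy()         { return ( FlagsV & sketch_flag_greedy ) != 0; }
    static bool follows_space()  { return ( FlagsV & sketch_flag_follow_space ) != 0; }
    static bool follows_region() { return ( FlagsV & sketch_flag_follow_region ) != 0; }
};
```

Under these assumptions, sketch_policy< sketch_flag_greedy | sketch_flag_follow_either > reports all three properties, while sketch_policy< sketch_flag_follow_region > reports only follows_region().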
Once defined, void regions become "invisible" to the CTTL lexer, while common substring operations remain available without restrictions. The capacity of algorithms and text transformations applied to substrings in the remaining operable space remains the same, with or without the void regions.
The class policy_space<flag_follow_region> provides member functions to access and manipulate the map of void regions:
    // Erase existing void regions from the map:
    void region_clear();

    // Erase region(s) from the map:
    void region_erase(size_t first_offset_, size_t second_offset_);

    // Add new region to the map:
    void region_insert(size_t first_offset_, size_t second_offset_);

    // Extract the substring specified by two offsets, excluding the intersecting void regions:
    StringT region_difference(StringT const &str_, size_t first_offset_, size_t second_offset_);

The boundaries of a void region form a half-open range, which includes all characters between positions first_offset_ and second_offset_, except the character pointed to by second_offset_:
[first_offset_, second_offset_)
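The half-open convention can be modeled in a few lines with std::map. The class below is a hypothetical model written for this illustration, not CTTL's implementation; in particular, merging of overlapping or touching regions on insertion is an assumption of the sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical model, not CTTL's implementation: void regions kept as
// a map from the first offset to the second offset of each range.
class region_map_sketch {
    std::map< std::size_t, std::size_t > m_regions;
public:
    // Add [first_, second_), merging with regions it overlaps or touches
    // (the merging behavior is an assumption of this sketch).
    void insert( std::size_t first_, std::size_t second_ )
    {
        assert( first_ <= second_ );
        if ( first_ == second_ )
            return;  // an empty range covers no characters
        std::map< std::size_t, std::size_t >::iterator it = m_regions.begin();
        while ( it != m_regions.end() ) {
            if ( it->first <= second_ && first_ <= it->second ) {
                first_ = std::min( first_, it->first );
                second_ = std::max( second_, it->second );
                m_regions.erase( it++ );
            } else {
                ++it;
            }
        }
        m_regions[ first_ ] = second_;
    }

    // Copy the characters of str_ in [first_, second_) that lie outside all regions.
    std::string difference( std::string const& str_, std::size_t first_, std::size_t second_ ) const
    {
        assert( first_ <= second_ && second_ <= str_.length() );
        std::string result;
        std::size_t pos = first_;
        std::map< std::size_t, std::size_t >::const_iterator it = m_regions.begin();
        for ( ; it != m_regions.end() && pos < second_; ++it ) {
            if ( it->second <= pos )
                continue;  // region lies entirely before pos
            if ( it->first > pos )
                result.append( str_, pos, std::min( it->first, second_ ) - pos );
            pos = std::max( pos, it->second );
        }
        if ( pos < second_ )
            result.append( str_, pos, second_ - pos );
        return result;
    }
};
```

With the regions [0, 1), [9, 10) and [5, 8) inserted, difference( "0123456789", 0, 10 ) yields "12348": the character at each second offset (1, 8) is kept, because it lies outside its half-open range.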
Void regions are navigable. For example,
    // Calculate new position, if the intersection with void region is found:
    size_t lower_bound( size_t offset_, size_t substr_length_ );
This member function of the policy_space class validates the given offset_ against the map of existing void regions. It returns a position that is guaranteed to be outside of any region. If an intersection with a region is found, the function returns the lower boundary position of that void region. Otherwise, the function returns the unmodified value of offset_. The function returns substr_length_ when the condition
( offset_ > substr_length_ )
is true. The substr_length_ parameter usually specifies the length of the input string, a position that is guaranteed to be outside of any void region. For more information, refer to the policy_space<flag_follow_region> documentation.
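One possible reading of this behavior is sketched below without CTTL. The names region_map_T, lower_bound_sketch and valid_chars are hypothetical, and the sketch assumes, in line with the output of the void_regions.cpp sample in this section, that the position returned for an offset inside a region is the first position past that region.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

typedef std::map< std::size_t, std::size_t > region_map_T;  // first -> second

// Hypothetical stand-in for policy_space::lower_bound(), written for
// illustration only; regions_ must hold non-overlapping ranges.
std::size_t lower_bound_sketch(
    region_map_T const& regions_,
    std::size_t offset_,
    std::size_t substr_length_
    )
{
    if ( offset_ > substr_length_ )
        return substr_length_;
    region_map_T::const_iterator it = regions_.begin();
    for ( ; it != regions_.end(); ++it ) {
        // Ranges are visited in ascending order, so a single pass also
        // steps over regions that directly follow one another.
        if ( it->first <= offset_ && offset_ < it->second )
            offset_ = it->second;
    }
    // Never report a position beyond the end of the input
    return offset_ < substr_length_ ? offset_ : substr_length_;
}

// Collects the characters of str_ that lie outside all regions.
std::string valid_chars( std::string const& str_, region_map_T const& regions_ )
{
    std::string result;
    std::size_t offset = 0;
    while ( offset < str_.length() ) {
        offset = lower_bound_sketch( regions_, offset, str_.length() );
        assert( offset <= str_.length() );
        if ( offset >= str_.length() )
            break;
        result += str_[ offset ];
        ++offset;
    }
    return result;
}
```

The loop in valid_chars follows the same pattern as the navigation loop in the sample program: ask for the next valid position, stop at the end of input, otherwise consume one character and advance.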
The sample program void_regions.cpp illustrates basic functionality of a region-based policy:
Keywords: void_regions.cpp, policy_space, flag_follow_region, region_insert, lower_bound, region_difference
    // sample code: void_regions.cpp
    // Program demonstrates
    //     const_edge< policy_space< flag_follow_region > >

    //#define CTTL_TRACE_EVERYTHING
    //#define CTTL_TRACE_RULES   //define to turn light tracing on
    //#define CTTL_TRACE_TRIVIAL //define for trace messages only mode

    #include <iostream>
    #include "cttl/cttl.h"

    using namespace cttl;

    int main()
    {
        std::string inp = "0123456789";
        std::cout
            << " the input is /" << inp << '/'
            << std::endl
            << "valid chars are /"
            ;

        policy_space< flag_follow_region > void_region;
        void_region.region_insert( 0, 1 );
        void_region.region_insert( 9, 10 );
        void_region.region_insert( 5, 8 );

        size_t valid_offset = 0;
        while ( valid_offset < inp.length() ) {
            valid_offset = void_region.lower_bound( valid_offset, inp.length() );
            if ( valid_offset >= inp.length() )
                break;
            std::cout << inp[ valid_offset ];
            ++valid_offset;
        }

        const_edge< policy_space< flag_follow_region > > substring( inp, void_region );
        std::cout
            << '/'
            << std::endl
            << " substring.region_difference() is /"
            << substring.region_difference()
            << '/'
            << std::endl
            ;

        substring.first.go_bof();
        substring.second.go_eof();
        const_edge<> data = substring;
        std::cout
            << "the symbols are /"
            ;

        while( ( data( !first( isdigit ) ) ).match( substring ) != std::string::npos ) {
            std::cout << data.text();
        }

        std::cout
            << '/'
            << std::endl
            ;

        return 0;
    }
The program prints output:
     the input is /0123456789/
    valid chars are /12348/
     substring.region_difference() is /12348/
    the symbols are /12348/
The const_edge class provides member functions of the same names that take no arguments:

    void cttl::const_edge< PolicyT, StringT >::region_erase();
    void cttl::const_edge< PolicyT, StringT >::region_insert();
The content of text, enclosed within substring boundaries, but excluding the intersecting void regions, can be extracted:
StringT cttl::const_edge< PolicyT, StringT >::region_difference();
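The library sample cpp_comment_strip.cpp (which includes the example/cpp_comment.h header) demonstrates a parser that is based on the void region-aware policy. The program reads each input file, records the comments it finds as void regions, and prints the text that remains outside those regions: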
Keywords: cpp_comment_strip.cpp, fileio, pipe_input, policy_space, flag_follow_either, void regions, file2string, cpp_comment_lexer, cpp_comment_parser, region_difference, region_insert, c_single_quote, c_double_quote, rule
    // cpp_comment_strip.cpp

    //#define NDEBUG    // define before assert.h to stop assertions from being compiled
    //#define CTTL_TRACE_EVERYTHING //define to turn tracing on
    //#define CTTL_TRACE_RULES  //define to turn light tracing on

    #include <iostream>
    #include "cttl/cttl.h"
    #include "utils/fileio.h"
    #include "utils/pipe_input.h"
    #include "example/cpp_comment.h"

    using namespace cttl;

    int main(int argc, char* argv[])
    {
        std::vector< std::string > vect_input_files;

        if ( argc == 2 ) {
            // assume one file was specified as input argument
            vect_input_files.push_back( argv[1] );
        } else {
            // assume list of java input files was specified as pipe input
            pipe_input_2_vector( vect_input_files );
        }

        assert( vect_input_files.size() );

        std::string inp;
        policy_space< flag_follow_either > void_region;
        typedef const_edge< policy_space< flag_follow_either > > substr_T;
        substr_T substring( inp, void_region );

        for ( size_t file_cnt = 0; file_cnt < vect_input_files.size(); ++file_cnt ) {
            void_region.region_clear();
            file2string( vect_input_files[ file_cnt ], inp );
            assert( inp.length() );
            cpp_comment_lexer< cpp_comment_parser< substr_T > > comment_parser;
            substring.first.go_bof();
            substring.second.go_eof();

            if ( comment_parser.rule_symbols( substring ) ) {
                substring.first.go_bof();
                std::cout << substring.region_difference();
            }
        }

        return 0;
    }
    #ifndef _CTTL_CPP_COMMENT_H_INCLUDED_
    #define _CTTL_CPP_COMMENT_H_INCLUDED_

    // cpp_comment.h

    using namespace cttl;

    template< typename SubstrT >
    struct comments_base_parser {
        // parser defines two kinds of substrings:
        typedef SubstrT substr_T;
        typedef typename SubstrT::strict_edge_T strict_input_T;
        typedef typename SubstrT::policy_T policy_T;

        policy_T* m_policy_ptr;

        comments_base_parser()
            :
            m_policy_ptr( NULL )
        {
        }

        void policy( policy_T* policy_ptr_ )
        {
            m_policy_ptr = policy_ptr_;
        }

        policy_T* policy()
        {
            assert( m_policy_ptr );
            return m_policy_ptr;
        }

        // default semantic actions:
        size_t cpp_comment( strict_input_T& ) { return 0; }
        size_t c_comment( strict_input_T& ) { return 0; }
        size_t single_quote_interior( strict_input_T& ) { return 0; }
        size_t double_quote_interior( strict_input_T& ) { return 0; }
    };

    /**Adds comments to the map of void regions.*/
    template< typename SubstrT >
    struct cpp_comment_parser : public comments_base_parser< SubstrT > {
        typedef SubstrT substr_T;
        typedef typename SubstrT::strict_edge_T strict_input_T;

        size_t cpp_comment( strict_input_T& edge_ )
        {
            this->policy()->region_insert( edge_.first.offset(), edge_.second.offset() );
            return 0;
        }

        size_t c_comment( strict_input_T& edge_ )
        {
            this->policy()->region_insert( edge_.first.offset(), edge_.second.offset() );
            return 0;
        }
    };

    /**Adds comments, character, string literals to the map of void regions.*/
    template< typename SubstrT >
    struct cpp_comment_and_literal_parser : public cpp_comment_parser< SubstrT > {
        typedef SubstrT substr_T;
        typedef typename SubstrT::strict_edge_T strict_input_T;

        size_t single_quote_interior( strict_input_T& edge_ )
        {
            this->policy()->region_insert( edge_.first.offset(), edge_.second.offset() );
            return 0;
        }

        size_t double_quote_interior( strict_input_T& edge_ )
        {
            this->policy()->region_insert( edge_.first.offset(), edge_.second.offset() );
            return 0;
        }
    };

    /**Finds comments in cpp or java file.*/
    template< typename ParserT >
    struct cpp_comment_lexer : public ParserT {
        // lexer defines two kinds of substrings:
        typedef typename ParserT::substr_T substr_T;
        typedef typename substr_T::strict_edge_T strict_input_T;
        typedef typename substr_T::string_T string_T;

        string_T wack_quote_quote;
        string_T wack_wack;
        string_T wack_asterisk;
        string_T asterisk_wack;

        cpp_comment_lexer()
            :
            wack_quote_quote( "/\"\'" ),
            wack_wack( "//" ),
            wack_asterisk( "/" "*" ),
            asterisk_wack( "*" "/" )
        {
        }

        bool rule_symbols( substr_T& edge_ )
        {
            // this-> is needed here: policy() is a member of the dependent base class ParserT
            this->policy( &edge_.space_policy() );
            return (
                +!!(
                    // Find closest character that can begin a comment or a literal.
                    // Once found, match the rest of the symbol at the current position
                    begin( &wack_quote_quote )
                    ^
                    (
                        c_single_quote(
                            entity()
                            &
                            rule( *this, &cpp_comment_lexer< ParserT >::single_quote_interior )
                        )
                        |
                        c_double_quote(
                            entity()
                            &
                            rule( *this, &cpp_comment_lexer< ParserT >::double_quote_interior )
                        )
                        |
                        (
                            ( &wack_wack + ( !begin( '\n' ) | !end() ) )
                            &
                            rule( *this, &cpp_comment_lexer< ParserT >::cpp_comment )
                        )
                        |
                        (
                            ( &wack_asterisk + !symbol( &asterisk_wack ) )
                            &
                            rule( *this, &cpp_comment_lexer< ParserT >::c_comment )
                        )
                        |
                        // if none of the above matched, this can only be a division operator:
                        '/'
                    )
                )
            ).match( edge_ ) != std::string::npos;
        }
    }; // struct cpp_comment_lexer

    #endif // _CTTL_CPP_COMMENT_H_INCLUDED_
Copyright © 1997-2009 Igor Kholodov mailto:cttl@users.sourceforge.net.
Permission to copy and distribute this document is granted provided this copyright notice appears in all copies. This document is provided "as is" without express or implied warranty, and with no claim as to its suitability for any purpose.