#!/usr/bin/ruby
# encoding: utf-8

=begin LICENSE

[The "BSD licence"]
Copyright (c) 2009-2010 Kyle Yetter
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:

 1. Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.
 2. Redistributions in binary form must reproduce the above copyright
    notice, this list of conditions and the following disclaimer in the
    documentation and/or other materials provided with the distribution.
 3. The name of the author may not be used to endorse or promote products
    derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

=end

module ANTLR3

=begin rdoc ANTLR3::Token

At a minimum, tokens are data structures that bind together a chunk of text and
a corresponding type symbol, which categorizes/characterizes the content of the
text. Tokens also usually carry information about their location in the input,
such as absolute character index, line number, and position within the line (or
column).

Furthermore, ANTLR tokens are assigned a "channel" number, an extra degree of
categorization that groups things on a larger scale. Parsers will usually ignore
tokens that have channel value 99 (the HIDDEN_CHANNEL), so you can keep things
like comments and white space huddled together with neighboring tokens,
effectively ignoring them without discarding them.
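
The effect of channels can be sketched without the runtime. The names below
(SketchToken, SKETCH_TOKENS, sketch_visible) are hypothetical stand-ins, the
hidden-channel value 99 is the one quoted above, and the default-channel value
0 is only an assumption made for the illustration.

```ruby
# Hypothetical stand-ins for illustration; not the runtime's own classes.
SKETCH_DEFAULT_CHANNEL = 0   # assumed value, chosen only for this sketch
SKETCH_HIDDEN_CHANNEL = 99   # the hidden-channel value quoted in the text

SketchToken = Struct.new( :type, :channel, :text )

SKETCH_TOKENS = [
  SketchToken.new( :ID, SKETCH_DEFAULT_CHANNEL, "x" ),
  SketchToken.new( :WS, SKETCH_HIDDEN_CHANNEL, " " ),
  SketchToken.new( :ASSIGN, SKETCH_DEFAULT_CHANNEL, "=" ),
  SketchToken.new( :WS, SKETCH_HIDDEN_CHANNEL, " " ),
  SketchToken.new( :INT, SKETCH_DEFAULT_CHANNEL, "1" )
].freeze

# What a parser-like consumer would look at: everything off the hidden channel.
def sketch_visible( tokens )
  tokens.reject { |t| t.channel == SKETCH_HIDDEN_CHANNEL }
end

# The hidden tokens are skipped, not discarded, so the full text survives.
puts sketch_visible( SKETCH_TOKENS ).map( &:text ).join
puts SKETCH_TOKENS.map( &:text ).join
```

The list still holds every token, so a tool that needs the original spacing can
read the hidden ones back.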

ANTLR tokens also keep a reference to the source stream from which they
originated. Token streams will also provide an index value for the token, which
indicates the position of the token relative to other tokens in the stream,
starting at zero. For example, the 22nd token pulled from a lexer by
CommonTokenStream will have index value 21.

== Token as an Interface

This library provides a token implementation (see CommonToken). Additionally,
you may write your own token class as long as you provide methods that give
access to the attributes expected by a token. Even though most of the ANTLR
library tries to use duck-typing techniques instead of pure object-oriented type
checking, it's a good idea to include the ANTLR3::Token module in your
customized token class.
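
A rough sketch of the "token as an interface" idea follows. To stay runnable
without the runtime it uses a hypothetical stand-in mixin
(SketchTokenBehavior) in place of ANTLR3::Token; the class MyToken and its
constructor are likewise invented for the illustration.

```ruby
# Hypothetical stand-in for the mixin; the real module is ANTLR3::Token.
module SketchTokenBehavior
  include Comparable

  # order tokens by their position in the token stream
  def <=>( other )
    index <=> other.index
  end

  def to_s
    text.to_s
  end
end

# A custom token class only has to expose the expected attributes.
class MyToken
  include SketchTokenBehavior

  attr_accessor :type, :channel, :text, :input,
                :start, :stop, :index, :line, :column

  def initialize( type, text, index )
    @type = type
    @text = text
    @index = index
  end
end

first = MyToken.new( 4, "foo", 0 )
second = MyToken.new( 5, "=", 1 )
puts [ second, first ].sort.map( &:to_s ).inspect
```

Because the mixin relies only on the accessors, any class that provides them
picks up the comparison and conversion behavior.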

=end

module Token
  include ANTLR3::Constants
  include Comparable

  # the token's associated chunk of text
  attr_accessor :text

  # the integer value associated with the token's type
  attr_accessor :type

  # the text's starting line number within the source (indexed starting at 1)
  attr_accessor :line

  # the text's starting position in the line within the source (indexed starting at 0)
  attr_accessor :column

  # the integer value of the channel to which the token is assigned
  attr_accessor :channel

  # the index of the token with respect to the other tokens produced during lexing
  attr_accessor :index

  # a reference to the input stream from which the token was extracted
  attr_accessor :input

  # the absolute character index in the input at which the text starts
  attr_accessor :start

  # the absolute character index in the input at which the text ends
  attr_accessor :stop

  alias :input_stream :input
  alias :input_stream= :input=
  alias :token_index :index
  alias :token_index= :index=

  #
  # The match operator has been implemented to match against several different
  # attributes of a token for convenience in quick scripts
  #
  # @example Match against an integer token type constant
  #   token =~ VARIABLE_NAME   => true/false
  # @example Match against a token type name as a Symbol
  #   token =~ :FLOAT          => true/false
  # @example Match the token text against a Regular Expression
  #   token =~ /^@[a-z_]\w*$/i
  # @example Compare the token's text to a string
  #   token =~ "class"
  #
  def =~ obj
    case obj
    when Integer then type == obj
    when Symbol then name == obj.to_s
    when Regexp then obj =~ text
    when String then text == obj
    else super
    end
  end

  #
  # Tokens are comparable by their stream index values
  #
  def <=> tk2
    index <=> tk2.index
  end

  def initialize_copy( orig )
    self.index = -1
    self.type = orig.type
    self.channel = orig.channel
    self.text = orig.text.clone if orig.text
    self.start = orig.start
    self.stop = orig.stop
    self.line = orig.line
    self.column = orig.column
    self.input = orig.input
  end

  # true when the token has an input stream and start/stop positions
  def concrete?
    input && start && stop ? true : false
  end

  # the inverse of concrete?
  def imaginary?
    input && start && stop ? false : true
  end

  def name
    token_name( type )
  end

  def source_name
    i = input and i.source_name
  end

  def hidden?
    channel == HIDDEN_CHANNEL
  end

  def source_text
    concrete? ? input.substring( start, stop ) : text
  end

  #
  # Sets the token's channel value to HIDDEN_CHANNEL
  #
  def hide!
    self.channel = HIDDEN_CHANNEL
  end

  def inspect
    text_inspect = text ? "[#{ text.inspect }] " : ' '
    text_position = line > 0 ? "@ line #{ line } col #{ column } " : ''
    stream_position = start ? "(#{ range.inspect })" : ''

    front = index >= 0 ? "#{ index } " : ''
    rep = front << name << text_inspect <<
          text_position << stream_position
    rep.strip!
    channel == DEFAULT_CHANNEL or rep << " (#{ channel.to_s })"
    return( rep )
  end

  def pretty_print( printer )
    printer.text( inspect )
  end

  def range
    start..stop rescue nil
  end

  def to_i
    index.to_i
  end

  def to_s
    text.to_s
  end

private

  def token_name( type )
    BUILT_IN_TOKEN_NAMES[ type ]
  end
end

CommonToken = Struct.new( :type, :channel, :text, :input, :start,
                         :stop, :index, :line, :column )

=begin rdoc ANTLR3::CommonToken

The base class for the standard implementation of Token.
It is implemented as a simple Struct because tokens are basically simple data
structures that bind together several pieces of information, and Structs are
slightly faster than a standard Object implementation with accessor methods.

By default, ANTLR-generated Ruby code will provide a customized subclass of
CommonToken to track token-type names efficiently for debugging, inspection, and
general utility. Thus code generated for a standard combo lexer-parser grammar
named XYZ will have a base module named XYZ and a customized CommonToken
subclass named XYZ::Token.

Here is the token structure attribute list in order:

* <tt>type</tt>
* <tt>channel</tt>
* <tt>text</tt>
* <tt>input</tt>
* <tt>start</tt>
* <tt>stop</tt>
* <tt>index</tt>
* <tt>line</tt>
* <tt>column</tt>

=end

class CommonToken
  include Token
  DEFAULT_VALUES = {
    :channel => DEFAULT_CHANNEL,
    :index => -1,
    :line => 0,
    :column => -1
  }.freeze

  def self.token_name( type )
    BUILT_IN_TOKEN_NAMES[ type ]
  end

  def self.create( fields = {} )
    fields = DEFAULT_VALUES.merge( fields )
    args = members.map { |name| fields[ name.to_sym ] }
    new( *args )
  end

  # allows you to make a copy of a token with a different class
  def self.from_token( token )
    new(
      token.type, token.channel, token.text ? token.text.clone : nil,
      token.input, token.start, token.stop, -1, token.line, token.column
    )
  end

  def initialize( type = nil, channel = DEFAULT_CHANNEL, text = nil,
                 input = nil, start = nil, stop = nil, index = -1,
                 line = 0, column = -1 )
    super
    block_given? and yield( self )
    self.text.nil? && self.start && self.stop and
      self.text = self.input.substring( self.start, self.stop )
  end

  alias :input_stream :input
  alias :input_stream= :input=
  alias :token_index :index
  alias :token_index= :index=
end

module Constants

  # End of File / End of Input character and token type
  EOF_TOKEN = CommonToken.new( EOF ).freeze
  INVALID_TOKEN = CommonToken.new( INVALID_TOKEN_TYPE ).freeze
  SKIP_TOKEN = CommonToken.new( INVALID_TOKEN_TYPE ).freeze
end



=begin rdoc ANTLR3::TokenSource

TokenSource is a simple mixin module that demands an
implementation of the method #next_token. In return, it
defines methods #next and #each, which provide basic
iterator methods for token generators. Furthermore, it
includes Enumerable to provide the standard Ruby iteration
methods to token generators, like lexers.
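
The mixin pattern described above can be sketched on its own. Everything named
here (SketchTokenSource, SketchTok, WordLexer, SKETCH_EOF) is hypothetical, and
the end-of-input type value -1 is an assumption made for the illustration
rather than a statement about the runtime's constants.

```ruby
SKETCH_EOF = -1   # assumed end-of-input type value for this sketch
SketchTok = Struct.new( :type, :text )

# Hypothetical stand-in: asks the including class for #next_token and
# supplies #each plus the Enumerable methods in return.
module SketchTokenSource
  include Enumerable

  def each
    return enum_for( :each ) unless block_given?
    while ( token = next_token ) && token.type != SKETCH_EOF
      yield( token )
    end
    self
  end
end

# A toy token generator that treats each whitespace-separated word as a token.
class WordLexer
  include SketchTokenSource

  def initialize( source )
    @words = source.split
  end

  def next_token
    word = @words.shift
    word ? SketchTok.new( :WORD, word ) : SketchTok.new( SKETCH_EOF, nil )
  end
end

puts WordLexer.new( "a b c" ).map( &:text ).inspect
```

Only #next_token is written by hand; #map, #select, #to_a and the rest come
from Enumerable through #each.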

=end

module TokenSource
  include Constants
  include Enumerable
  extend ClassMacros

  abstract :next_token

  def next
    token = next_token()
    raise StopIteration if token.nil? || token.type == EOF
    return token
  end

  def each
    block_given? or return enum_for( :each )
    while token = next_token and token.type != EOF
      yield( token )
    end
    return self
  end

  def to_stream( options = {} )
    if block_given?
      CommonTokenStream.new( self, options ) { | t, stream | yield( t, stream ) }
    else
      CommonTokenStream.new( self, options )
    end
  end
end


=begin rdoc ANTLR3::TokenFactory

There are a variety of different entities throughout the ANTLR runtime library
that need to create token objects. This module serves as a mixin that provides
methods for constructing tokens.
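
A minimal sketch of such a factory mixin, under stated assumptions: the names
SketchTokenFactory, DefaultSketchToken, LoudToken and SketchLexer are all
hypothetical, and unlike the real module this version yields the finished token
to the block instead of forwarding the block to the token class.

```ruby
DefaultSketchToken = Struct.new( :type, :text )

# Hypothetical stand-in for a token factory mixin.
module SketchTokenFactory
  attr_writer :token_class

  # falls back to a default class when none has been assigned
  def token_class
    @token_class ||= DefaultSketchToken
  end

  def create_token( *args )
    token = token_class.new( *args )
    yield( token ) if block_given?
    token
  end
end

# A token class with custom behavior, swapped in through the writer.
LoudToken = Class.new( DefaultSketchToken ) do
  def to_s
    text.upcase
  end
end

class SketchLexer
  include SketchTokenFactory
end

lexer = SketchLexer.new
puts lexer.create_token( 4, "42" ).class
lexer.token_class = LoudToken
puts lexer.create_token( 6, "id" ) { |t| t.text = "name" }.to_s
```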

Including this module provides a +token_class+ attribute. Instances of the
including class can create tokens using the token class (which defaults to
ANTLR3::CommonToken). Token classes are presumed to have an #initialize method
that can be called without any parameters, and the token objects are expected to
have the standard token attributes (see ANTLR3::Token).

=end

module TokenFactory
  attr_writer :token_class
  def token_class
    @token_class ||= begin
      self.class.token_class rescue
      self::Token rescue
      ANTLR3::CommonToken
    end
  end

  def create_token( *args )
    if block_given?
      token_class.new( *args ) do |*targs|
        yield( *targs )
      end
    else
      token_class.new( *args )
    end
  end
end


=begin rdoc ANTLR3::TokenScheme

TokenSchemes exist to handle the problem of defining token types as integer
values while maintaining meaningful text names for the types.
They are dynamically defined modules that map integer values to constants with
token-type names.

---

Fundamentally, tokens exist to take a chunk of text and identify it as belonging
to some category, like "VARIABLE" or "INTEGER". In code, the category is
represented by an integer -- some arbitrary value that ANTLR will decide to use
as it is creating the recognizer. The purpose of using an integer (instead of,
say, a Ruby symbol) is that ANTLR's decision logic often needs to test whether a
token's type falls within a range, which is not possible with symbols.

The downside of token types being represented as integers is that a developer
needs to be able to reference the unknown type value by name in action code.
Furthermore, code that references the type by name and tokens that can be
inspected with names in place of type values are more meaningful to a developer.

Since ANTLR requires token type names to follow capital-letter naming
conventions, defining types as named constants of the recognizer class resolves
the problem of referencing type values by name. Thus, a token type like
``VARIABLE'' can be represented by a number like 5 and referenced within code by
+VARIABLE+. However, when a recognizer creates tokens, the name of the token's
type cannot be seen without using the data defined in the recognizer.
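
The trade-off can be made concrete with a small standalone sketch. SketchTypes
and its two helper methods are hypothetical, and the type values are simply
borrowed from the sample scheme shown later in this comment.

```ruby
# Hypothetical illustration; the values mirror the sample scheme below.
module SketchTypes
  INT = 4
  T__5 = 5
  ID = 6
  WS = 7

  # The reverse mapping has to be stored somewhere: an integer alone
  # says nothing about the name of the type it stands for.
  NAMES = { INT => "INT", T__5 => "'='", ID => "ID", WS => "WS" }.freeze

  # Integer types make range tests cheap, which symbols would not allow.
  def self.between_int_and_id?( type )
    ( INT..ID ).cover?( type )
  end

  def self.name_of( type )
    NAMES[ type ]
  end
end

puts SketchTypes.between_int_and_id?( SketchTypes::T__5 )
puts SketchTypes.name_of( SketchTypes::ID )
```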

Of course, tokens could be defined with a name attribute that could be specified
when tokens are created. However, doing so would make tokens take up more space
than necessary, as well as making it difficult to change the type of a token
while maintaining a correct name value.

TokenSchemes exist as a technique to manage token type referencing and name
extraction. They:

1. keep token type references clear and understandable in recognizer code
2. permit access to a token's type-name independently of recognizer objects
3. allow multiple classes to share the same token information

== Building Token Schemes

TokenScheme is a subclass of Module. Thus, it has the method
<tt>TokenScheme.new(tk_class = nil) { ... module-level code ...}</tt>, which
will evaluate the block in the context of the scheme (module), similarly to
Module#module_eval. Before evaluating the block, <tt>.new</tt> will set up the
module with the following actions:

1. define a customized token class (more on that below)
2. add a new constant, TOKEN_NAMES, which is a hash that maps types to names
3. dynamically populate the new scheme module with a couple of instance methods
4. include ANTLR3::Constants in the new scheme module

Because the TokenScheme class functions as a metaclass, figuring out some of the
scoping behavior can be mildly confusing if you're trying to get a handle on the
entity for your own purposes. Remember that all of the instance methods of
TokenScheme function as module-level methods of TokenScheme instances, a la
+attr_accessor+ and friends.

<tt>TokenScheme#define_token(name_symbol, int_value)</tt> adds a constant
definition <tt>name_symbol</tt> with the value <tt>int_value</tt>. It is
essentially like <tt>Module#const_set</tt>, except it forbids constant
overwriting (which would mess up recognizer code fairly badly) and adds an
inverse type-to-name map to its own <tt>TOKEN_NAMES</tt> table.
<tt>TokenScheme#define_tokens</tt> is a convenience method for defining many
types with a hash pairing names to values.

<tt>TokenScheme#register_name(value, name_string)</tt> specifies a custom
type-to-name definition. This is particularly useful for the anonymous tokens
that ANTLR generates for literal strings in the grammar specification. For
example, if you refer to the literal <tt>'='</tt> in some parser rule in your
grammar, ANTLR will add a lexer rule for the literal and give the token a name
like <tt>T__<i>x</i></tt>, where <tt><i>x</i></tt> is the type's integer value.
Since this is pretty meaningless to a developer, generated code should add a
special name definition for type value <tt><i>x</i></tt> with the string
<tt>"'='"</tt>.

=== Sample TokenScheme Construction

  TokenData = ANTLR3::TokenScheme.new do
    define_tokens(
      :INT  => 4,
      :ID   => 6,
      :T__5 => 5,
      :WS   => 7
    )

    # note the self:: scoping below is due to the fact that
    # ruby lexically-scopes constant names instead of
    # looking up in the current scope
    register_name(self::T__5, "'='")
  end

  TokenData::ID           # => 6
  TokenData::T__5         # => 5
  TokenData.token_name(4) # => 'INT'
  TokenData.token_name(5) # => "'='"

  class ARecognizerOrSuch < ANTLR3::Parser
    include TokenData
    ID   # => 6
  end

== Custom Token Classes and Relationship with Tokens

When a TokenScheme is created, it will define a subclass of ANTLR3::CommonToken
and assign it to the constant name +Token+. This token class will both include
and extend the scheme module.
Since token schemes define the private instance
method <tt>token_name(type)</tt>, instances of the token class are now able to
provide their type names. The Token method <tt>name</tt> uses the
<tt>token_name</tt> method to provide the type name as if it were a simple
attribute without storing the name itself.

When a TokenScheme is included in a recognizer class, the class will now have
the token types as named constants, a type-to-name map constant +TOKEN_NAMES+,
and a grammar-specific subclass of ANTLR3::CommonToken assigned to the constant
Token. Thus, when recognizers need to manufacture tokens, instead of using the
generic CommonToken class, they can create tokens using the customized Token
class provided by the token scheme.

If you need to use a token class other than CommonToken, you can pass the class
as a parameter to TokenScheme.new, which will be used in place of the
dynamically-created CommonToken subclass.
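
The core idea of a scheme, a Module subclass whose instances define type
constants and remember the reverse mapping, can be sketched without the runtime.
SketchScheme, SketchData and SketchParser are hypothetical names, and the sketch
leaves out the token class, the TOKEN_NAMES constant and the built-in types that
the real TokenScheme sets up.

```ruby
# Hypothetical, much-reduced stand-in for a token scheme.
class SketchScheme < Module
  def initialize( &body )
    @names = {}
    super( &body )   # Module#initialize evaluates the block in the new module
  end

  # Like Module#const_set, but refuses to overwrite an existing type and
  # records the inverse type-to-name mapping.
  def define_token( name, value )
    name = name.to_s
    if const_defined?( name, false )
      raise NameError, "token type #{ name } is already defined"
    end
    const_set( name, value )
    @names[ value ] = name
    self
  end

  def token_name( type )
    @names[ type ]
  end
end

SketchData = SketchScheme.new do
  define_token( :INT, 4 )
  define_token( :ID, 6 )
end

# Including the scheme makes the type constants visible in the class.
class SketchParser
  include SketchData
end

puts SketchParser::ID
puts SketchData.token_name( 4 )
```

Because define_token is an instance method of the Module subclass, it behaves
as a module-level method inside the block, which is the scoping behavior the
text above describes.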

=end

class TokenScheme < ::Module
  include TokenFactory

  def self.new( tk_class = nil, &body )
    super() do
      tk_class ||= Class.new( ::ANTLR3::CommonToken )
      self.token_class = tk_class

      const_set( :TOKEN_NAMES, ::ANTLR3::Constants::BUILT_IN_TOKEN_NAMES.clone )

      @types = ::ANTLR3::Constants::BUILT_IN_TOKEN_NAMES.invert
      @unused = ::ANTLR3::Constants::MIN_TOKEN_TYPE

      scheme = self
      define_method( :token_scheme ) { scheme }
      define_method( :token_names ) { scheme::TOKEN_NAMES }
      define_method( :token_name ) do |type|
        begin
          # super needs explicit arguments inside a define_method body;
          # the bare form raises a RuntimeError on current Rubies
          token_names[ type ] or super( type )
        rescue NoMethodError
          ::ANTLR3::CommonToken.token_name( type )
        end
      end
      module_function :token_name, :token_names

      include ANTLR3::Constants

      body and module_eval( &body )
    end
  end

  def self.build( *token_names )
    token_names = [ token_names ].flatten!
    token_names.compact!
    token_names.uniq!
    tk_class = Class === token_names.first ? token_names.shift : nil
    value_maps, names = token_names.partition { |i| Hash === i }
    new( tk_class ) do
      for value_map in value_maps
        define_tokens( value_map )
      end

      for name in names
        define_token( name )
      end
    end
  end


  def included( mod )
    super
    mod.extend( self )
  end
  private :included

  attr_reader :unused, :types

  def define_tokens( token_map = {} )
    for token_name, token_value in token_map
      define_token( token_name, token_value )
    end
    return self
  end

  def define_token( name, value = nil )
    name = name.to_s

    if current_value = @types[ name ]
      # token type has already been defined
      # raise an error unless value is the same as the current value
      value ||= current_value
      unless current_value == value
        raise NameError.new(
          "new token type definition ``#{ name } = #{ value }'' conflicts " <<
          "with existing type definition ``#{ name } = #{ current_value }''", name
        )
      end
    else
      value ||= @unused
      if name =~ /^[A-Z]\w*$/
        const_set( name, @types[ name ] = value )
      else
        constant = "T__#{ value }"
        const_set( constant, @types[ constant ] = value )
        @types[ name ] = value
      end
      register_name( value, name ) unless built_in_type?( value )
    end

    value >= @unused and @unused = value + 1
    return self
  end

  def register_names( *names )
    if names.length == 1 and Hash === names.first
      names.first.each do |value, name|
        register_name( value, name )
      end
    else
      names.each_with_index do |name, i|
        type_value = Constants::MIN_TOKEN_TYPE + i
        register_name( type_value, name )
      end
    end
  end

  def register_name( type_value, name )
    name = name.to_s.freeze
    if token_names.has_key?( type_value )
      current_name = token_names[ type_value ]
      current_name == name and return name

      if current_name == "T__#{ type_value }"
        # only an anonymous name is registered -- upgrade the name to the full literal name
        token_names[ type_value ] = name
      elsif name == "T__#{ type_value }"
        # ignore name downgrade from literal to anonymous constant
        return current_name
      else
        error = NameError.new(
          "attempted assignment of token type #{ type_value }" <<
          " to name #{ name } conflicts with existing name #{ current_name }", name
        )
        raise error
      end
    else
      token_names[ type_value ] = name.to_s.freeze
    end
  end

  def built_in_type?( type_value )
    Constants::BUILT_IN_TOKEN_NAMES.fetch( type_value, false ) and true
  end

  def token_defined?( name_or_value )
    # dispatch on the argument itself (the original referenced an undefined local)
    case name_or_value
    when Integer then token_names.has_key?( name_or_value )
    else const_defined?( name_or_value.to_s )
    end
  end

  def []( name_or_value )
    case name_or_value
    when Integer then token_names.fetch( name_or_value, nil )
    # Hash#key replaces Hash#index, which is gone from current Ruby versions
    else const_get( name_or_value.to_s ) rescue token_names.key( name_or_value )
    end
  end

  def token_class
    self::Token
  end

  def token_class=( klass )
    Class === klass or raise( TypeError, "token_class must be a Class" )
    Util.silence_warnings do
      klass < self or klass.send( :include, self )
      const_set( :Token, klass )
    end
  end

end

end