#!/usr/bin/ruby
# encoding: utf-8

=begin LICENSE

[The "BSD licence"]
Copyright (c) 2009-2010 Kyle Yetter
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:

 1. Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.
 2. Redistributions in binary form must reproduce the above copyright
    notice, this list of conditions and the following disclaimer in the
    documentation and/or other materials provided with the distribution.
 3. The name of the author may not be used to endorse or promote products
    derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

=end

module ANTLR3

=begin rdoc ANTLR3::Token

At a minimum, tokens are data structures that bind together a chunk of text and
a corresponding type symbol, which categorizes/characterizes the content of the
text. Tokens also usually carry information about their location in the input,
such as absolute character index, line number, and position within the line (or
column).

Furthermore, ANTLR tokens are assigned a "channel" number, an extra degree of
categorization that groups things on a larger scale. Parsers will usually ignore
tokens that have channel value 99 (the HIDDEN_CHANNEL), so you can keep things
like comments and whitespace huddled together with neighboring tokens,
effectively ignoring them without discarding them.
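
As a rough, self-contained sketch of the channel idea (it does not use this
library; +SketchToken+, the sample token list, and the default channel value 0
are assumptions made for illustration), a consumer can skip hidden tokens while
the full token list still retains them:

```ruby
# Hypothetical stand-in for a token; only the fields this sketch needs.
SketchToken = Struct.new(:type, :text, :channel)

SKETCH_DEFAULT_CHANNEL = 0   # assumed default channel value
SKETCH_HIDDEN_CHANNEL  = 99  # hidden channel value described above

# Returns only the tokens a parser would normally look at.
def visible_tokens(tokens)
  tokens.reject { |t| t.channel == SKETCH_HIDDEN_CHANNEL }
end

SKETCH_TOKENS = [
  SketchToken.new(:ID,  "x", SKETCH_DEFAULT_CHANNEL),
  SketchToken.new(:WS,  " ", SKETCH_HIDDEN_CHANNEL),
  SketchToken.new(:EQ,  "=", SKETCH_DEFAULT_CHANNEL),
  SketchToken.new(:WS,  " ", SKETCH_HIDDEN_CHANNEL),
  SketchToken.new(:INT, "1", SKETCH_DEFAULT_CHANNEL)
].freeze

# The parser-facing view drops the whitespace...
puts visible_tokens(SKETCH_TOKENS).map(&:text).join
# ...but the original text can still be rebuilt from the full list.
puts SKETCH_TOKENS.map(&:text).join
```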

ANTLR tokens also keep a reference to the source stream from which they
originated. Token streams will also provide an index value for the token, which
indicates the position of the token relative to other tokens in the stream,
starting at zero. For example, the 22nd token pulled from a lexer by
CommonTokenStream will have index value 21.

== Token as an Interface

This library provides a token implementation (see CommonToken). Additionally,
you may write your own token class as long as you provide methods that give
access to the attributes expected by a token. Even though most of the ANTLR
library tries to use duck-typing techniques instead of pure object-oriented type
checking, it's a good idea to include the ANTLR3::Token module in your
customized token class.

=end

module Token
  include ANTLR3::Constants
  include Comparable

  # the token's associated chunk of text
  attr_accessor :text

  # the integer value associated with the token's type
  attr_accessor :type

  # the text's starting line number within the source (indexed starting at 1)
  attr_accessor :line

  # the text's starting position in the line within the source (indexed starting at 0)
  attr_accessor :column

  # the integer value of the channel to which the token is assigned
  attr_accessor :channel

  # the index of the token with respect to the other tokens produced during lexing
  attr_accessor :index

  # a reference to the input stream from which the token was extracted
  attr_accessor :input

  # the absolute character index in the input at which the text starts
  attr_accessor :start

  # the absolute character index in the input at which the text ends
  attr_accessor :stop

  alias :input_stream :input
  alias :input_stream= :input=
  alias :token_index :index
  alias :token_index= :index=

  #
  # The match operator has been implemented to match against several different
  # attributes of a token for convenience in quick scripts
  #
  # @example Match against an integer token type constant
  #   token =~ VARIABLE_NAME   => true/false
  # @example Match against a token type name as a Symbol
  #   token =~ :FLOAT          => true/false
  # @example Match the token text against a Regular Expression
  #   token =~ /^@[a-z_]\w*$/i
  # @example Compare the token's text to a string
  #   token =~ "class"
  #
  def =~ obj
    case obj
    when Integer then type == obj
    when Symbol then name == obj.to_s
    when Regexp then obj =~ text
    when String then text == obj
    else super
    end
  end

  #
  # Tokens are comparable by their stream index values
  #
  def <=> tk2
    index <=> tk2.index
  end

  def initialize_copy( orig )
    self.index   = -1
    self.type    = orig.type
    self.channel = orig.channel
    self.text    = orig.text.clone if orig.text
    self.start   = orig.start
    self.stop    = orig.stop
    self.line    = orig.line
    self.column  = orig.column
    self.input   = orig.input
  end

  def concrete?
    input && start && stop ? true : false
  end

  def imaginary?
    input && start && stop ? false : true
  end

  def name
    token_name( type )
  end

  def source_name
    i = input and i.source_name
  end

  def hidden?
    channel == HIDDEN_CHANNEL
  end

  def source_text
    concrete? ? input.substring( start, stop ) : text
  end

  #
  # Sets the token's channel value to HIDDEN_CHANNEL
  #
  def hide!
    self.channel = HIDDEN_CHANNEL
  end

  def inspect
    text_inspect    = text  ? "[#{ text.inspect }] " : ' '
    text_position   = line > 0  ? "@ line #{ line } col #{ column } " : ''
    stream_position = start ? "(#{ range.inspect })" : ''

    front =  index >= 0 ? "#{ index } " : ''
    rep = front << name << text_inspect <<
                text_position << stream_position
    rep.strip!
    channel == DEFAULT_CHANNEL or rep << " (#{ channel.to_s })"
    return( rep )
  end

  def pretty_print( printer )
    printer.text( inspect )
  end

  def range
    start..stop rescue nil
  end

  def to_i
    index.to_i
  end

  def to_s
    text.to_s
  end

private

  def token_name( type )
    BUILT_IN_TOKEN_NAMES[ type ]
  end
end

CommonToken = Struct.new( :type, :channel, :text, :input, :start,
                         :stop, :index, :line, :column )

=begin rdoc ANTLR3::CommonToken

The base class for the standard implementation of Token. It is implemented as a
simple Struct because tokens are basically simple data structures binding
together a bunch of different information, and Structs are slightly faster than
a standard Object implementation with accessor methods.

By default, ANTLR-generated Ruby code will provide a customized subclass of
CommonToken to track token-type names efficiently for debugging, inspection, and
general utility. Thus code generated for a standard combo lexer-parser grammar
named XYZ will have a base module named XYZ and a customized CommonToken
subclass named XYZ::Token.

Here is the token structure attribute list in order:

* <tt>type</tt>
* <tt>channel</tt>
* <tt>text</tt>
* <tt>input</tt>
* <tt>start</tt>
* <tt>stop</tt>
* <tt>index</tt>
* <tt>line</tt>
* <tt>column</tt>
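
Because the structure is positional, the order above is also the argument order
of +new+. The following self-contained sketch imitates that layout together
with a hash-based constructor in the spirit of <tt>CommonToken.create</tt>.
+SketchCommonToken+, +SKETCH_TOKEN_DEFAULTS+, and the default channel value 0
are assumptions for illustration, not the library class itself:

```ruby
# Assumed defaults, modeled on DEFAULT_VALUES; channel 0 is a guess here.
SKETCH_TOKEN_DEFAULTS = {
  :channel => 0, :index => -1, :line => 0, :column => -1
}.freeze

# Hypothetical stand-in with the same member order as CommonToken.
SketchCommonToken = Struct.new(:type, :channel, :text, :input, :start,
                               :stop, :index, :line, :column) do
  # Build a token from a hash, filling in defaults for missing fields.
  def self.create(fields = {})
    fields = SKETCH_TOKEN_DEFAULTS.merge(fields)
    new(*members.map { |name| fields[name] })
  end
end

# Positional construction follows the attribute order listed above.
positional = SketchCommonToken.new(4, 0, "42")

# Named construction lets the caller skip fields it does not care about.
by_name = SketchCommonToken.create(:type => 4, :text => "42")

puts positional.text
puts by_name.index
```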

=end

class CommonToken
  include Token
  DEFAULT_VALUES = {
    :channel => DEFAULT_CHANNEL,
    :index   => -1,
    :line    =>  0,
    :column  => -1
  }.freeze

  def self.token_name( type )
    BUILT_IN_TOKEN_NAMES[ type ]
  end

  def self.create( fields = {} )
    fields = DEFAULT_VALUES.merge( fields )
    args = members.map { |name| fields[ name.to_sym ] }
    new( *args )
  end

  # allows you to make a copy of a token with a different class
  def self.from_token( token )
    new(
      token.type,  token.channel, token.text ? token.text.clone : nil,
      token.input, token.start,   token.stop, -1, token.line, token.column
    )
  end

  def initialize( type = nil, channel = DEFAULT_CHANNEL, text = nil,
                 input = nil, start = nil, stop = nil, index = -1,
                 line = 0, column = -1 )
    super
    block_given? and yield( self )
    self.text.nil? && self.start && self.stop and
      self.text = self.input.substring( self.start, self.stop )
  end

  alias :input_stream :input
  alias :input_stream= :input=
  alias :token_index :index
  alias :token_index= :index=
end

module Constants

  # End of File / End of Input character and token type
  EOF_TOKEN = CommonToken.new( EOF ).freeze
  INVALID_TOKEN = CommonToken.new( INVALID_TOKEN_TYPE ).freeze
  SKIP_TOKEN = CommonToken.new( INVALID_TOKEN_TYPE ).freeze
end



=begin rdoc ANTLR3::TokenSource

TokenSource is a simple mixin module that demands an
implementation of the method #next_token. In return, it
defines methods #next and #each, which provide basic
iterator methods for token generators. Furthermore, it
includes Enumerable to provide the standard Ruby iteration
methods to token generators, like lexers.
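
The contract can be sketched without the library as follows. Everything named
+Sketch...+ is a hypothetical stand-in, and the EOF type value of -1 is an
assumption; the point is only that a class supplying +next_token+ gets +each+
and the Enumerable methods in return:

```ruby
# Hypothetical stand-in for the mixin described above.
module SketchTokenSource
  include Enumerable
  SKETCH_EOF = -1   # assumed EOF type value for this sketch

  # Yields tokens from next_token until it returns nil or an EOF token.
  def each
    return enum_for(:each) unless block_given?
    while (token = next_token) && token.type != SKETCH_EOF
      yield token
    end
    self
  end
end

SketchTok = Struct.new(:type, :text)

# A toy "lexer" that splits on whitespace and then reports EOF.
class SketchWordLexer
  include SketchTokenSource

  def initialize(source)
    @words = source.split
  end

  def next_token
    word = @words.shift
    word ? SketchTok.new(:WORD, word) : SketchTok.new(SKETCH_EOF, nil)
  end
end

# map comes from Enumerable, which builds on the each defined by the mixin.
puts SketchWordLexer.new("a b c").map(&:text).inspect
```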

=end

module TokenSource
  include Constants
  include Enumerable
  extend ClassMacros

  abstract :next_token

  def next
    token = next_token()
    raise StopIteration if token.nil? || token.type == EOF
    return token
  end

  def each
    block_given? or return enum_for( :each )
    while token = next_token and token.type != EOF
      yield( token )
    end
    return self
  end

  def to_stream( options = {} )
    if block_given?
      CommonTokenStream.new( self, options ) { | t, stream | yield( t, stream ) }
    else
      CommonTokenStream.new( self, options )
    end
  end
end


=begin rdoc ANTLR3::TokenFactory

There are a variety of different entities throughout the ANTLR runtime library
that need to create token objects. This module serves as a mixin that provides
methods for constructing tokens.

Including this module provides a +token_class+ attribute. Instances of the
including class can create tokens using the token class (which defaults to
ANTLR3::CommonToken). Token classes are presumed to have an #initialize method
that can be called without any parameters, and the token objects are expected to
have the standard token attributes (see ANTLR3::Token).
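
A simplified, library-free sketch of this arrangement is shown below. The
+Sketch...+ names are hypothetical, and the fallback logic is reduced to a
single default class, so treat it as an illustration of the idea rather than
the behavior of this module:

```ruby
# Hypothetical default token class; it can be built with no arguments.
SketchFactoryToken = Struct.new(:type, :text)

# Hypothetical stand-in for the factory mixin: token_class is resolved
# lazily and falls back to a default when none was assigned.
module SketchTokenFactory
  attr_writer :token_class

  def token_class
    @token_class ||= SketchFactoryToken
  end

  # Builds a token and, if a block is given, lets it fill in attributes.
  def create_token(*args)
    token = token_class.new(*args)
    yield(token) if block_given?
    token
  end
end

class SketchLexer
  include SketchTokenFactory
end

lexer = SketchLexer.new
token = lexer.create_token { |t| t.type = 4; t.text = "42" }
puts token.text

# Swapping in a different token class changes what the factory builds.
SketchFancyToken = Class.new(SketchFactoryToken)
lexer.token_class = SketchFancyToken
puts lexer.create_token(6, "name").class
```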

=end

module TokenFactory
  attr_writer :token_class
  def token_class
    @token_class ||= begin
      self.class.token_class rescue
      self::Token rescue
      ANTLR3::CommonToken
    end
  end

  def create_token( *args )
    if block_given?
      token_class.new( *args ) do |*targs|
        yield( *targs )
      end
    else
      token_class.new( *args )
    end
  end
end


=begin rdoc ANTLR3::TokenScheme

TokenSchemes exist to handle the problem of defining token types as integer
values while maintaining meaningful text names for the types. They are
dynamically defined modules that map integer values to constants with token-type
names.

---

Fundamentally, tokens exist to take a chunk of text and identify it as belonging
to some category, like "VARIABLE" or "INTEGER". In code, the category is
represented by an integer -- some arbitrary value that ANTLR will decide to use
as it is creating the recognizer. The purpose of using an integer (instead of,
say, a Ruby symbol) is that ANTLR's decision logic often needs to test whether a
token's type falls within a range, which is not possible with symbols.

The downside of token types being represented as integers is that a developer
needs to be able to reference the unknown type value by name in action code.
Furthermore, code that references the type by name and tokens that can be
inspected with names in place of type values are more meaningful to a developer.

Since ANTLR requires token type names to follow capital-letter naming
conventions, defining types as named constants of the recognizer class resolves
the problem of referencing type values by name. Thus, a token type like
``VARIABLE'' can be represented by a number like 5 and referenced within code by
+VARIABLE+. However, when a recognizer creates tokens, the name of the token's
type cannot be seen without using the data defined in the recognizer.

Of course, tokens could be defined with a name attribute that could be specified
when tokens are created. However, doing so would make tokens take up more space
than necessary, as well as making it difficult to change the type of a token
while maintaining a correct name value.

TokenSchemes exist as a technique to manage token type referencing and name
extraction. They:

1. keep token type references clear and understandable in recognizer code
2. permit access to a token's type-name independently of recognizer objects
3. allow multiple classes to share the same token information

== Building Token Schemes

TokenScheme is a subclass of Module. Thus, it has the method
<tt>TokenScheme.new(tk_class = nil) { ... module-level code ...}</tt>, which
will evaluate the block in the context of the scheme (module), similarly to
Module#module_eval. Before evaluating the block, <tt>.new</tt> will set up the
module with the following actions:

1. define a customized token class (more on that below)
2. add a new constant, TOKEN_NAMES, which is a hash that maps types to names
3. dynamically populate the new scheme module with a couple instance methods
4. include ANTLR3::Constants in the new scheme module

Since the TokenScheme class functions as a metaclass, figuring out some of the
scoping behavior can be mildly confusing if you're trying to get a handle on the
entity for your own purposes. Remember that all of the instance methods of
TokenScheme function as module-level methods of TokenScheme instances, much like
+attr_accessor+ and friends.

<tt>TokenScheme#define_token(name_symbol, int_value)</tt> adds a constant
definition <tt>name_symbol</tt> with the value <tt>int_value</tt>. It is
essentially like <tt>Module#const_set</tt>, except it forbids constant
overwriting (which would mess up recognizer code fairly badly) and adds an
inverse type-to-name map to its own <tt>TOKEN_NAMES</tt> table.
<tt>TokenScheme#define_tokens</tt> is a convenience method for defining many
types with a hash pairing names to values.
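
The mechanism described above can be approximated in plain Ruby. The sketch
below is a deliberately reduced stand-in (+SketchScheme+ and its error message
are hypothetical, and it omits the token class and built-in names), meant only
to show a Module subclass whose instance methods act as module-level methods:

```ruby
# Hypothetical, simplified stand-in for a token scheme: a Module subclass
# whose instances hold type constants plus an inverse type-to-name table.
class SketchScheme < Module
  def initialize(&body)
    super(&nil)                     # keep Module#initialize from running body
    const_set(:TOKEN_NAMES, {})     # set up the table before evaluating body
    module_eval(&body) if body
  end

  # Like const_set, but refuses to overwrite and records the inverse map.
  def define_token(name, value)
    if const_defined?(name, false)
      raise NameError, "token type #{name} is already defined"
    end
    const_set(name, value)
    self::TOKEN_NAMES[value] = name.to_s
    self
  end

  # Convenience wrapper for defining many types from a name => value hash.
  def define_tokens(map)
    map.each { |name, value| define_token(name, value) }
    self
  end

  def token_name(type)
    self::TOKEN_NAMES[type]
  end
end

SketchTypes = SketchScheme.new do
  define_tokens(:INT => 4, :ID => 6)
end

puts SketchTypes::ID
puts SketchTypes.token_name(4)
```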

<tt>TokenScheme#register_name(value, name_string)</tt> specifies a custom
type-to-name definition. This is particularly useful for the anonymous tokens
that ANTLR generates for literal strings in the grammar specification. For
example, if you refer to the literal <tt>'='</tt> in some parser rule in your
grammar, ANTLR will add a lexer rule for the literal and give the token a name
like <tt>T__<i>x</i></tt>, where <tt><i>x</i></tt> is the type's integer value.
Since this is pretty meaningless to a developer, generated code should add a
special name definition for type value <tt><i>x</i></tt> with the string
<tt>"'='"</tt>.

=== Sample TokenScheme Construction

  TokenData = ANTLR3::TokenScheme.new do
    define_tokens(
      :INT  => 4,
      :ID   => 6,
      :T__5 => 5,
      :WS   => 7
    )

    # note the self:: scoping below is due to the fact that
    # ruby lexically-scopes constant names instead of
    # looking up in the current scope
    register_name(self::T__5, "'='")
  end

  TokenData::ID           # => 6
  TokenData::T__5         # => 5
  TokenData.token_name(4) # => 'INT'
  TokenData.token_name(5) # => "'='"

  class ARecognizerOrSuch < ANTLR3::Parser
    include TokenData
    ID   # => 6
  end

== Custom Token Classes and Relationship with Tokens

When a TokenScheme is created, it will define a subclass of ANTLR3::CommonToken
and assign it to the constant name +Token+. This token class will both include
and extend the scheme module. Since token schemes define the private instance
method <tt>token_name(type)</tt>, instances of the token class are now able to
provide their type names. The Token method <tt>name</tt> uses the
<tt>token_name</tt> method to provide the type name as if it were a simple
attribute without storing the name itself.

When a TokenScheme is included in a recognizer class, the class will now have
the token types as named constants, a type-to-name map constant +TOKEN_NAMES+,
and a grammar-specific subclass of ANTLR3::CommonToken assigned to the constant
Token. Thus, when recognizers need to manufacture tokens, instead of using the
generic CommonToken class, they can create tokens using the customized Token
class provided by the token scheme.

If you need to use a token class other than CommonToken, you can pass the class
as a parameter to TokenScheme.new, which will be used in place of the
dynamically-created CommonToken subclass.
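
To illustrate why the name is looked up rather than stored, here is a small
library-free sketch. +SKETCH_NAMES+ and +SketchNamedToken+ are hypothetical
stand-ins for a scheme's name table and its token class; the real classes do
more than this:

```ruby
# Hypothetical type-to-name table standing in for a scheme's TOKEN_NAMES.
SKETCH_NAMES = { 4 => "INT", 6 => "ID" }.freeze

# Hypothetical token class: the name is derived from the type on demand,
# so it costs no storage and cannot drift out of sync with the type.
SketchNamedToken = Struct.new(:type, :text) do
  def name
    SKETCH_NAMES[type]
  end
end

token = SketchNamedToken.new(4, "42")
puts token.name

# Changing the type changes the reported name with no extra bookkeeping.
token.type = 6
puts token.name
```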

=end

class TokenScheme < ::Module
  include TokenFactory

  # Creates a new scheme module. +tk_class+ is an optional token class; when
  # omitted, an anonymous subclass of CommonToken is created for the scheme.
  def self.new( tk_class = nil, &body )
    super() do
      tk_class ||= Class.new( ::ANTLR3::CommonToken )
      self.token_class = tk_class

      const_set( :TOKEN_NAMES, ::ANTLR3::Constants::BUILT_IN_TOKEN_NAMES.clone )

      @types  = ::ANTLR3::Constants::BUILT_IN_TOKEN_NAMES.invert
      @unused = ::ANTLR3::Constants::MIN_TOKEN_TYPE

      scheme = self
      define_method( :token_scheme ) { scheme }
      define_method( :token_names )  { scheme::TOKEN_NAMES }
      define_method( :token_name ) do |type|
        begin
          # super needs explicit arguments inside define_method
          token_names[ type ] or super( type )
        rescue NoMethodError
          ::ANTLR3::CommonToken.token_name( type )
        end
      end
      module_function :token_name, :token_names

      include ANTLR3::Constants

      body and module_eval( &body )
    end
  end

  # Builds a scheme from a list of token names and/or name-to-value hashes.
  # A leading Class argument is used as the token class.
  def self.build( *token_names )
    token_names = token_names.flatten
    token_names.compact!
    token_names.uniq!
    tk_class = Class === token_names.first ? token_names.shift : nil
    value_maps, names = token_names.partition { |i| Hash === i }
    new( tk_class ) do
      for value_map in value_maps
        define_tokens( value_map )
      end

      for name in names
        define_token( name )
      end
    end
  end

  # Including a scheme in a class also extends the class with it, so the
  # scheme's methods are available at both instance and class level.
  def included( mod )
    super
    mod.extend( self )
  end
  private :included

  attr_reader :unused, :types

  # Defines each name/value pair in +token_map+ via define_token.
  def define_tokens( token_map = {} )
    for token_name, token_value in token_map
      define_token( token_name, token_value )
    end
    return self
  end

  # Defines a token type named +name+. When +value+ is nil, the next unused
  # type value is assigned. Redefining a name with a different value raises
  # NameError.
  def define_token( name, value = nil )
    name = name.to_s

    if current_value = @types[ name ]
      # token type has already been defined
      # raise an error unless value is the same as the current value
      value ||= current_value
      unless current_value == value
        raise NameError.new(
          "new token type definition ``#{ name } = #{ value }'' conflicts " <<
          "with existing type definition ``#{ name } = #{ current_value }''", name
        )
      end
    else
      value ||= @unused
      if name =~ /^[A-Z]\w*$/
        const_set( name, @types[ name ] = value )
      else
        # names that cannot serve as Ruby constants (such as literals) are
        # stored under an anonymous T__<value> constant
        constant = "T__#{ value }"
        const_set( constant, @types[ constant ] = value )
        @types[ name ] = value
      end
      register_name( value, name ) unless built_in_type?( value )
    end

    value >= @unused and @unused = value + 1
    return self
  end

  # Registers display names for token types. Accepts either a single
  # { type_value => name } hash or a list of names, which are numbered
  # consecutively starting at MIN_TOKEN_TYPE.
  def register_names( *names )
    if names.length == 1 and Hash === names.first
      names.first.each do |value, name|
        register_name( value, name )
      end
    else
      names.each_with_index do |name, i|
        type_value = Constants::MIN_TOKEN_TYPE + i
        register_name( type_value, name )
      end
    end
  end

  # Associates +name+ with +type_value+ in TOKEN_NAMES. An anonymous
  # T__<value> name may be upgraded to a literal name; any other renaming
  # raises NameError.
  def register_name( type_value, name )
    name = name.to_s.freeze
    if token_names.has_key?( type_value )
      current_name = token_names[ type_value ]
      current_name == name and return name

      if current_name == "T__#{ type_value }"
        # only an anonymous name is registered -- upgrade the name to the full literal name
        token_names[ type_value ] = name
      elsif name == "T__#{ type_value }"
        # ignore name downgrade from literal to anonymous constant
        return current_name
      else
        error = NameError.new(
          "attempted assignment of token type #{ type_value }" <<
          " to name #{ name } conflicts with existing name #{ current_name }", name
        )
        raise error
      end
    else
      token_names[ type_value ] = name
    end
  end

  # True when +type_value+ is one of the runtime's built-in token types.
  def built_in_type?( type_value )
    Constants::BUILT_IN_TOKEN_NAMES.fetch( type_value, false ) and true
  end

  # Accepts either an Integer type value or a token name.
  def token_defined?( name_or_value )
    case name_or_value
    when Integer then token_names.has_key?( name_or_value )
    else const_defined?( name_or_value.to_s )
    end
  end

  # Maps an Integer type value to its name, or a name to its type value.
  def []( name_or_value )
    case name_or_value
    when Integer then token_names.fetch( name_or_value, nil )
    else const_get( name_or_value.to_s ) rescue token_names.key( name_or_value )
    end
  end

  def token_class
    self::Token
  end

  # Installs +klass+ as the scheme's Token constant, including the scheme
  # in the class unless it is already an ancestor.
  def token_class=( klass )
    Class === klass or raise( TypeError, "token_class must be a Class" )
    Util.silence_warnings do
      klass < self or klass.send( :include, self )
      const_set( :Token, klass )
    end
  end

end

end