xref: /aosp_15_r20/external/antlr/runtime/Ruby/lib/antlr3/streams.rb (revision 16467b971bd3e2009fad32dd79016f2c7e421deb)
#!/usr/bin/ruby
# encoding: utf-8

=begin LICENSE

[The "BSD licence"]
Copyright (c) 2009-2010 Kyle Yetter
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:

 1. Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.
 2. Redistributions in binary form must reproduce the above copyright
    notice, this list of conditions and the following disclaimer in the
    documentation and/or other materials provided with the distribution.
 3. The name of the author may not be used to endorse or promote products
    derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

=end

module ANTLR3


=begin rdoc ANTLR3::Stream

= ANTLR3 Streams

This documentation first covers the general concept of streams as used by ANTLR
recognizers, and then discusses the specific <tt>ANTLR3::Stream</tt> module.

== ANTLR Stream Classes

ANTLR recognizers need a way to walk through input data in a serialized IO-style
fashion. They also need some book-keeping about the input to provide useful
information to developers, such as current line number and column. Furthermore,
to implement backtracking and various error recovery techniques, recognizers
need a way to record various locations in the input at a number of points in the
recognition process so the input state may be restored to a prior state.

ANTLR bundles all of this functionality into a number of Stream classes, each
designed to be used by recognizers for a specific recognition task. Most of the
Stream hierarchy is implemented in antlr3/streams.rb, which is loaded by default
when 'antlr3' is required.

---

Here's a brief overview of the various stream classes and their respective
purposes:

StringStream::
  Similar to StringIO from the standard Ruby library, StringStream wraps raw
  String data in a Stream interface for use by ANTLR lexers.
FileStream::
  A subclass of StringStream, FileStream simply wraps data read from an IO or
  File object for use by lexers.
CommonTokenStream::
  The job of a TokenStream is to read lexer output and then provide ANTLR
  parsers with the means to sequentially walk through a series of tokens.
  CommonTokenStream is the default TokenStream implementation.
TokenRewriteStream::
  A subclass of CommonTokenStream, TokenRewriteStream gives rewriting parsers
  the ability to produce new output text from an input token sequence by
  managing rewrite "programs" on top of the stream.
CommonTreeNodeStream::
  In a similar fashion to CommonTokenStream, CommonTreeNodeStream feeds tokens
  to recognizers in a sequential fashion. However, the stream object serializes
  an Abstract Syntax Tree into a flat, one-dimensional sequence, while
  preserving the two-dimensional shape of the tree using special UP and DOWN
  tokens. The sequence is primarily used by ANTLR Tree Parsers. *note* -- this
  class is defined in antlr3/tree.rb, not in antlr3/streams.rb

---

The next few sections cover the most significant methods of all stream classes.

=== consume / look / peek

<tt>stream.consume</tt> is used to advance a stream one unit. StringStreams are
advanced by one character and TokenStreams are advanced by one token.

<tt>stream.peek(k = 1)</tt> is used to quickly retrieve the object of interest
to a recognizer at the look-ahead position specified by <tt>k</tt>. For
<b>StringStreams</b>, this is the <i>integer value of the character</i>
<tt>k</tt> characters ahead of the stream cursor. For <b>TokenStreams</b>, this
is the <i>integer token type of the token</i> <tt>k</tt> tokens ahead of the
stream cursor.

<tt>stream.look(k = 1)</tt> is used to retrieve the full object of interest at
the look-ahead position specified by <tt>k</tt>. While <tt>peek</tt> provides
the <i>bare-minimum lightweight information</i> that the recognizer needs,
<tt>look</tt> provides the <i>full object of concern</i> in the stream. For
<b>StringStreams</b>, this is a <i>string object containing the single
character</i> <tt>k</tt> characters ahead of the stream cursor. For
<b>TokenStreams</b>, this is the <i>full token structure</i> <tt>k</tt> tokens
ahead of the stream cursor.

<b>Note:</b> in most ANTLR runtime APIs for other languages, <tt>peek</tt> is
implemented by some method with a name like <tt>LA(k)</tt> and <tt>look</tt> is
implemented by some method with a name like <tt>LT(k)</tt>. When writing this
Ruby runtime API, I found this naming practice confusing, ambiguous, and
un-Ruby-like. Thus, I chose <tt>peek</tt> and <tt>look</tt> to represent a
quick-look (peek) and a full-fledged look-ahead operation (look). If this causes
confusion or any sort of compatibility strife for developers using this
implementation, all apologies.
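
The difference between the three calls may be easier to see in a small sketch.
The class below is a hypothetical miniature stream written only for this
illustration (<tt>TinyCharStream</tt> and its <tt>EOF</tt> value are
assumptions, and only positive <tt>k</tt> is handled); it is not the
StringStream implementation found later in this file.

```ruby
# Hypothetical miniature stream; illustrates the calling convention only.
class TinyCharStream
  EOF = -1   # assumed end-of-input sentinel

  def initialize( text )
    @chars    = text.chars
    @position = 0
  end

  # integer value of the character k positions ahead (k = 1 is the next one)
  def peek( k = 1 )
    c = @chars[ @position + k - 1 ]
    c ? c.ord : EOF
  end

  # the same character as a one-character String (nil past the end)
  def look( k = 1 )
    @chars[ @position + k - 1 ]
  end

  # advance the cursor by one character, stopping at the end of the input
  def consume
    @position += 1 if @position < @chars.length
  end
end

stream      = TinyCharStream.new( "ab" )
first_peek  = stream.peek        # integer value of "a"
first_look  = stream.look        # the String "a"
stream.consume
second_look = stream.look        # the String "b"
far_peek    = stream.peek( 2 )   # nothing left two characters ahead: EOF
```

A token-based stream would follow the same pattern, with <tt>peek</tt>
returning an integer token type and <tt>look</tt> returning the token object.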

=== mark / rewind / release

<tt>marker = stream.mark</tt> causes the stream to record important information
about the current stream state, place the data in an internal memory table, and
return a memento, <tt>marker</tt>. The marker object is typically an integer key
to the stream's internal memory table.

Used in tandem with <tt>stream.rewind(marker = last_marker)</tt>, the marker can
be used to restore the stream to an earlier state. This is used by recognizers
to perform tasks such as backtracking and error recovery.

<tt>stream.release(marker = last_marker)</tt> can be used to release an existing
state marker from the memory table.
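
As a rough illustration of the mark / rewind / release cycle, here is a minimal
sketch built around a hypothetical <tt>MarkableCursor</tt> class. The class,
its <tt>advance</tt> helper, and the choice to store only a position in the
memory table are all assumptions made for the example; real streams may save
more state (such as line and column).

```ruby
# Hypothetical cursor with a marker table; not part of the runtime library.
class MarkableCursor
  attr_reader :position, :last_marker

  def initialize
    @position = 0
    @markers  = []   # internal memory table: marker index => saved position
  end

  # stand-in for consuming n symbols
  def advance( n = 1 )
    @position += n
  end

  # save the current state and return an integer key for it
  def mark
    @markers << @position
    @last_marker = @markers.length - 1
  end

  # restore the state saved under +marker+
  def rewind( marker = @last_marker )
    @position = @markers.fetch( marker )
  end

  # forget +marker+ and every marker created after it
  def release( marker = @last_marker )
    @markers.slice!( marker .. -1 )
    nil
  end
end

cursor = MarkableCursor.new
cursor.advance( 3 )
marker = cursor.mark       # remember position 3
cursor.advance( 4 )        # speculative work moves the cursor to 7
cursor.rewind( marker )    # backtrack to the saved position
cursor.release( marker )   # drop the saved state from the table
```

A backtracking recognizer would typically wrap a speculative rule attempt
between <tt>mark</tt> and <tt>rewind</tt> in this way.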

=== seek

<tt>stream.seek(position)</tt> moves the stream cursor to an absolute position
within the stream, basically like typical Ruby <tt>IO#seek</tt> style methods.
However, unlike <tt>IO#seek</tt>, ANTLR streams currently always use absolute
position seeking.

== The Stream Module

<tt>ANTLR3::Stream</tt> is an abstract-ish base mixin for all IO-like stream
classes used by ANTLR recognizers.

The module doesn't do much on its own besides define arguably annoying
``abstract'' pseudo-methods that demand implementation when it is mixed in to a
class that wants to be a Stream. Right now this exists as an artifact of porting
the ANTLR Java/Python runtime library to Ruby. In Java, of course, this is
represented as an interface. In Ruby, however, objects are duck-typed and
interfaces aren't that useful as programmatic entities -- in fact, it's mildly
wasteful to have a module like this hanging out. Thus, I may axe it.

When mixed in, it gives the class the #size and #source_name attribute
methods.

Except in a small handful of places, most of the ANTLR runtime library uses
duck-typing and not type checking on objects. This means that the methods which
manipulate stream objects don't usually bother checking that the object is a
Stream and assume that the object implements the proper stream interface. Thus,
it is not strictly necessary that custom stream objects include ANTLR3::Stream,
though it isn't a bad idea.
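
To make the ``abstract'' pseudo-method idea concrete, the sketch below shows
one way such a class macro could behave. <tt>AbstractMacro</tt>,
<tt>SketchStream</tt>, <tt>Incomplete</tt>, and <tt>DuckStream</tt> are
hypothetical names invented for this example; the real macro comes from
ClassMacros and may differ in its details.

```ruby
# Hypothetical stand-in for an "abstract" class macro: the generated method
# raises until the including class supplies a real implementation.
module AbstractMacro
  def abstract( name )
    define_method( name ) do |*args|
      raise NotImplementedError, "#{ self.class } must implement ##{ name }"
    end
  end
end

module SketchStream
  extend AbstractMacro
  abstract :consume
  attr_reader :size
end

# includes the mixin but forgets to implement #consume
class Incomplete
  include SketchStream
end

# never includes the mixin, yet responds to the same message
class DuckStream
  def consume
    :consumed
  end
end

begin
  Incomplete.new.consume
rescue NotImplementedError => error
  complaint = error.message   # names the class and the missing method
end

duck_result = DuckStream.new.consume   # works without the mixin
```

Because callers only send messages such as <tt>consume</tt>, an object like
<tt>DuckStream</tt> is accepted wherever the interface is all that matters.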

=end

module Stream
  include ANTLR3::Constants
  extend ClassMacros

  ##
  # :method: consume
  # used to advance a stream one unit (such as character or token)
  abstract :consume

  ##
  # :method: peek( k = 1 )
  # used to quickly retrieve the object of interest to a recognizer at lookahead
  # position specified by <tt>k</tt> (such as integer value of a character or an
  # integer token type)
  abstract :peek

  ##
  # :method: look( k = 1 )
  # used to retrieve the full object of interest at lookahead position specified
  # by <tt>k</tt> (such as a character string or a token structure)
  abstract :look

  ##
  # :method: mark
  # saves the current position for the purposes of backtracking and
  # returns a value to pass to #rewind at a later time
  abstract :mark

  ##
  # :method: index
  # returns the current position of the stream
  abstract :index

  ##
  # :method: rewind( marker = last_marker )
  # restores the stream position using the state information previously saved
  # by the given marker
  abstract :rewind

  ##
  # :method: release( marker = last_marker )
  # clears the saved state information associated with the given marker value
  abstract :release

  ##
  # :method: seek( position )
  # move the stream to the absolute index given by +position+
  abstract :seek

  ##
  # the total number of symbols in the stream
  attr_reader :size

  ##
  # indicates an identifying name for the stream -- usually the file path of the input
  attr_accessor :source_name
end

=begin rdoc ANTLR3::CharacterStream

CharacterStream further extends the abstract-ish base mixin Stream to add
methods specific to navigating character-based input data. Thus, it serves as an
imitation of the Java interface for text-based streams, which are primarily
used by lexers.

It adds the ``abstract'' method, <tt>substring(start, stop)</tt>, which must be
implemented to return a slice of the input string from position <tt>start</tt>
to position <tt>stop</tt>. It also adds attribute accessor methods <tt>line</tt>
and <tt>column</tt>, which are expected to indicate the current line number and
position within the current line, respectively.

== A Word About <tt>line</tt> and <tt>column</tt> attributes

Presumably, the concepts of <tt>line</tt> and <tt>column</tt> attributes of text
are familiar to most developers. Line numbers of text are indexed from number 1
up (not 0). Column numbers are indexed from 0 up. Thus, examining sample text:

  Hey this is the first line.
  Oh, and this is the second line.

Line 1 is the string "Hey this is the first line.\\n". If a character stream is
at line 2, column 0, the stream cursor is sitting between the characters "\\n"
and "O".
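
The indexing convention can be checked with a short sketch. The helper
<tt>location_after</tt> is a hypothetical name used only here; it replays the
consumption of a number of characters and reports where the cursor would sit,
assuming a newline character is the only thing that advances the line count.

```ruby
# Hypothetical helper: replay the consumption of +count+ characters and
# report the resulting [ line, column ] pair.
def location_after( text, count )
  line, column = 1, 0          # lines start at 1, columns start at 0
  text.each_char.first( count ).each do |c|
    if c == "\n"
      line  += 1               # a newline moves to the next line ...
      column = 0               # ... and resets the column
    else
      column += 1
    end
  end
  [ line, column ]
end

text = "Hey this is the first line.\nOh, and this is the second line.\n"
newline_at = text.index( "\n" )

at_start       = location_after( text, 0 )                # nothing consumed yet
before_newline = location_after( text, newline_at )       # end of line 1
after_newline  = location_after( text, newline_at + 1 )   # between "\n" and "O"
```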

*Note:* most ANTLR runtime APIs for other languages refer to <tt>column</tt>
with the more-precise, but lengthy name <tt>charPositionInLine</tt>. I preferred
to keep it simple and familiar in this Ruby runtime API.

=end

module CharacterStream
  include Stream
  extend ClassMacros
  include Constants

  ##
  # :method: substring(start,stop)
  abstract :substring

  attr_accessor :line
  attr_accessor :column
end


=begin rdoc ANTLR3::TokenStream

TokenStream further extends the abstract-ish base mixin Stream to add methods
specific to navigating token sequences. Thus, it serves as an imitation of the
Java interface for token-based streams, which are used by many different
components in ANTLR, including parsers and tree parsers.

== Token Streams

Token streams wrap a sequence of token objects produced by some token source,
usually a lexer. They provide the operations required by higher-level
recognizers, such as parsers and tree parsers, for navigating through the
sequence of tokens. Unlike simple character-based streams, such as StringStream,
token-based streams have an additional level of complexity because they must
manage the task of "tuning" to a specific token channel.

One of the main advantages of ANTLR-based recognition is the token
<i>channel</i> feature, which allows you to hold on to all tokens of interest
while only presenting a specific set of interesting tokens to a parser. For
example, if you need to hide whitespace and comments from a parser, but hang on
to them for some other purpose, you have the lexer assign the comments and
whitespace to channel value HIDDEN as it creates the tokens.

When you create a token stream, you can tune it to some specific channel value.
Then, all <tt>peek</tt>, <tt>look</tt>, and <tt>consume</tt> operations only
yield tokens that have the same value for <tt>channel</tt>. The stream skips
over any non-matching tokens in between.
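
The effect of tuning can be sketched with plain Ruby objects. Everything in
the example is an assumption made for illustration: the <tt>Tok</tt> struct,
the numeric channel values, and the <tt>visible_on</tt> helper are
hypothetical and stand in for real token and stream objects.

```ruby
# Hypothetical token record and channel numbers (assumed for illustration).
Tok = Struct.new( :type, :text, :channel )
DEFAULT_CHANNEL = 0
HIDDEN_CHANNEL  = 99

tokens = [
  Tok.new( :ID,  "x", DEFAULT_CHANNEL ),
  Tok.new( :WS,  " ", HIDDEN_CHANNEL ),
  Tok.new( :EQ,  "=", DEFAULT_CHANNEL ),
  Tok.new( :WS,  " ", HIDDEN_CHANNEL ),
  Tok.new( :INT, "1", DEFAULT_CHANNEL )
]

# the tokens a stream tuned to +channel+ would present to a parser
def visible_on( tokens, channel )
  tokens.select { |t| t.channel == channel }
end

# the parser only ever sees the on-channel tokens ...
parser_view = visible_on( tokens, DEFAULT_CHANNEL ).map( &:text )

# ... while the hidden ones are still held and can be used for other purposes
full_text = tokens.map( &:text ).join
```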

== The TokenStream Interface

In addition to the abstract methods and attribute methods provided by the base
Stream module, TokenStream adds a number of additional method implementation
requirements and attributes.

=end

module TokenStream
  include Stream
  extend ClassMacros

  ##
  # expected to return the token source object (such as a lexer) from which
  # all tokens in the stream were retrieved
  attr_reader :token_source

  ##
  # expected to return the value of the last marker produced by a call to
  # <tt>stream.mark</tt>
  attr_reader :last_marker

  ##
  # expected to return the integer index of the stream cursor
  attr_reader :position

  ##
  # the integer channel value to which the stream is ``tuned''
  attr_accessor :channel

  ##
  # :method: to_s(start=0,stop=tokens.length-1)
  # should take the tokens between start and stop in the sequence, extract their text
  # and return the concatenation of all the text chunks
  abstract :to_s

  ##
  # :method: at( i )
  # return the stream symbol at index +i+
  abstract :at
end

=begin rdoc ANTLR3::StringStream

A StringStream's purpose is to wrap the basic, naked text input of a recognition
system. Like all other stream types, it provides serial navigation of the input;
a recognizer can arbitrarily step forward and backward through the stream's
symbols as it requires. StringStream and its subclasses are the main way to
feed text input into an ANTLR Lexer for token processing.

The stream's symbols of interest, of course, are character values. Thus, the
#peek method returns the integer character value at look-ahead position
<tt>k</tt> and the #look method returns the character value as a +String+. The
stream also tracks various pieces of information such as the line and column
numbers at the current position.

=== Note About Text Encoding

This version of the runtime library primarily targets Ruby version 1.8, which
does not have strong built-in support for multi-byte character encodings. Thus,
characters are assumed to be represented by a single byte -- an integer between
0 and 255. Ruby 1.9 does provide built-in encoding support for multi-byte
characters, but currently this library does not provide any streams to handle
non-ASCII encoding. However, encoding-savvy recognition code is a future
development goal for this project.
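
The single-byte assumption matters as soon as the input contains a multi-byte
character. The short sketch below, which assumes an encoding-aware Ruby (1.9 or
later) and UTF-8 text, shows how the byte view and the codepoint view of the
same string diverge; the variable names are chosen only for this example.

```ruby
text = "h\u00e9"   # "h" followed by e-acute, a two-byte character in UTF-8

utf8 = text.encode( Encoding::UTF_8 )

byte_values      = utf8.bytes.to_a        # one integer per byte
codepoint_values = utf8.codepoints.to_a   # one integer per character
```

Under the byte view the accented letter occupies two symbols, each between 0
and 255, whereas the codepoint view yields a single integer above 255.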

=end

class StringStream
  NEWLINE = ?\n.ord

  include CharacterStream

  # current integer character index of the stream
  attr_reader :position

  # the current line number of the input, indexed upward from 1
  attr_reader :line

  # the current character position within the current line, indexed upward from 0
  attr_reader :column

  # the name associated with the stream -- usually a file name
  # defaults to <tt>"(string)"</tt>
  attr_accessor :name

  # the entire string that is wrapped by the stream
  attr_reader :data
  attr_reader :string

  if RUBY_VERSION =~ /^1\.9/

    # creates a new StringStream object where +data+ is the string data to stream.
    # accepts the following options in a symbol-to-value hash:
    #
    # [:file or :name] the (file) name to associate with the stream; default: <tt>'(string)'</tt>
    # [:position] the initial character index; default: +0+
    # [:line] the initial line number; default: +1+
    # [:column] the initial column number; default: +0+
    #
    def initialize( data, options = {} )      # for 1.9
      @string   = data.to_s.encode( Encoding::UTF_8 ).freeze
      @data     = @string.codepoints.to_a.freeze
      @position = options.fetch :position, 0
      @line     = options.fetch :line, 1
      @column   = options.fetch :column, 0
      @markers  = []
      @name   ||= options[ :file ] || options[ :name ] # || '(string)'
      mark
    end

    #
    # identical to #peek, except it returns the character value as a String
    #
    def look( k = 1 )               # for 1.9
      k == 0 and return nil
      k += 1 if k < 0

      index = @position + k - 1
      index < 0 and return nil

      @string[ index ]
    end

  else

    # creates a new StringStream object where +data+ is the string data to stream.
    # accepts the following options in a symbol-to-value hash:
    #
    # [:file or :name] the (file) name to associate with the stream; default: <tt>'(string)'</tt>
    # [:position] the initial character index; default: +0+
    # [:line] the initial line number; default: +1+
    # [:column] the initial column number; default: +0+
    #
    def initialize( data, options = {} )    # for 1.8
      @data = data.to_s
      @data.equal?( data ) and @data = @data.clone
      @data.freeze
      @string = @data
      @position = options.fetch :position, 0
      @line = options.fetch :line, 1
      @column = options.fetch :column, 0
      @markers = []
      @name ||= options[ :file ] || options[ :name ] # || '(string)'
      mark
    end

    #
    # identical to #peek, except it returns the character value as a String
    #
    def look( k = 1 )                        # for 1.8
      k == 0 and return nil
      k += 1 if k < 0

      index = @position + k - 1
      index < 0 and return nil

      c = @data[ index ] and c.chr
    end

  end
457*16467b97STreehugger Robot
458*16467b97STreehugger Robot  def size
459*16467b97STreehugger Robot    @data.length
460*16467b97STreehugger Robot  end
461*16467b97STreehugger Robot
462*16467b97STreehugger Robot  alias length size
463*16467b97STreehugger Robot
464*16467b97STreehugger Robot  #
465*16467b97STreehugger Robot  # rewinds the stream back to the start and clears out any existing marker entries
466*16467b97STreehugger Robot  #
467*16467b97STreehugger Robot  def reset
468*16467b97STreehugger Robot    initial_location = @markers.first
469*16467b97STreehugger Robot    @position, @line, @column = initial_location
470*16467b97STreehugger Robot    @markers.clear
471*16467b97STreehugger Robot    @markers << initial_location
472*16467b97STreehugger Robot    return self
473*16467b97STreehugger Robot  end
474*16467b97STreehugger Robot
475*16467b97STreehugger Robot  #
476*16467b97STreehugger Robot  # advance the stream by one character; returns the character consumed
477*16467b97STreehugger Robot  #
478*16467b97STreehugger Robot  def consume
479*16467b97STreehugger Robot    c = @data[ @position ] || EOF
480*16467b97STreehugger Robot    if @position < @data.length
481*16467b97STreehugger Robot      @column += 1
482*16467b97STreehugger Robot      if c == NEWLINE
483*16467b97STreehugger Robot        @line += 1
484*16467b97STreehugger Robot        @column = 0
485*16467b97STreehugger Robot      end
486*16467b97STreehugger Robot      @position += 1
487*16467b97STreehugger Robot    end
488*16467b97STreehugger Robot    return( c )
489*16467b97STreehugger Robot  end
490*16467b97STreehugger Robot
491*16467b97STreehugger Robot  #
492*16467b97STreehugger Robot  # return the character at look-ahead distance +k+ as an integer. <tt>k = 1</tt> represents
493*16467b97STreehugger Robot  # the current character. +k+ greater than 1 represents upcoming characters. A negative
494*16467b97STreehugger Robot  # value of +k+ returns previous characters consumed, where <tt>k = -1</tt> is the last
495*16467b97STreehugger Robot  # character consumed. <tt>k = 0</tt> has undefined behavior and returns +nil+
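  #
  # === Example
  #
  # A minimal sketch, assuming a freshly created stream whose buffer stores
  # characters as integer codes:
  #
  #   stream = ANTLR3::StringStream.new( "abc" )
  #   stream.peek         # => 97, the code for "a"
  #   stream.consume
  #   stream.peek( -1 )   # => 97, the character just consumed
  #   stream.peek( 2 )    # => 99, the code for "c"
  #   stream.peek( 3 )    # => EOF, past the end of the data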
496*16467b97STreehugger Robot  #
497*16467b97STreehugger Robot  def peek( k = 1 )
498*16467b97STreehugger Robot    k == 0 and return nil
499*16467b97STreehugger Robot    k += 1 if k < 0
500*16467b97STreehugger Robot    index = @position + k - 1
501*16467b97STreehugger Robot    index < 0 and return nil
502*16467b97STreehugger Robot    @data[ index ] or EOF
503*16467b97STreehugger Robot  end
504*16467b97STreehugger Robot
505*16467b97STreehugger Robot  #
506*16467b97STreehugger Robot  # return a substring around the stream cursor at a distance +k+
507*16467b97STreehugger Robot  # if <tt>k >= 0</tt>, return the next k characters
508*16467b97STreehugger Robot  # if <tt>k < 0</tt>, return the previous <tt>|k|</tt> characters
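  #
  # === Example
  #
  # A minimal sketch, assuming a freshly created stream positioned at 0:
  #
  #   stream = ANTLR3::StringStream.new( "hello world" )
  #   5.times { stream.consume }
  #   stream.through( 3 )     # => " wo"
  #   stream.through( -3 )    # => "llo"
  #   stream.through( -10 )   # => "hello", the start is clamped to 0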
509*16467b97STreehugger Robot  #
510*16467b97STreehugger Robot  def through( k )
511*16467b97STreehugger Robot    if k >= 0 then @string[ @position, k ] else
512*16467b97STreehugger Robot      start = ( @position + k ).at_least( 0 ) # start cannot be negative or index will wrap around
513*16467b97STreehugger Robot      @string[ start ... @position ]
514*16467b97STreehugger Robot    end
515*16467b97STreehugger Robot  end
516*16467b97STreehugger Robot
517*16467b97STreehugger Robot  # operator style look-ahead
518*16467b97STreehugger Robot  alias >> look
519*16467b97STreehugger Robot
520*16467b97STreehugger Robot  # operator style look-behind
521*16467b97STreehugger Robot  def <<( k )
    self >> -k
523*16467b97STreehugger Robot  end
524*16467b97STreehugger Robot
525*16467b97STreehugger Robot  alias index position
526*16467b97STreehugger Robot  alias character_index position
527*16467b97STreehugger Robot
528*16467b97STreehugger Robot  alias source_name name
529*16467b97STreehugger Robot
530*16467b97STreehugger Robot  #
531*16467b97STreehugger Robot  # Returns true if the stream appears to be at the beginning of a new line.
532*16467b97STreehugger Robot  # This is an extra utility method for use inside lexer actions if needed.
533*16467b97STreehugger Robot  #
534*16467b97STreehugger Robot  def beginning_of_line?
535*16467b97STreehugger Robot    @position.zero? or @data[ @position - 1 ] == NEWLINE
536*16467b97STreehugger Robot  end
537*16467b97STreehugger Robot
538*16467b97STreehugger Robot  #
  # Returns true if the stream appears to be at the end of a line.
540*16467b97STreehugger Robot  # This is an extra utility method for use inside lexer actions if needed.
541*16467b97STreehugger Robot  #
542*16467b97STreehugger Robot  def end_of_line?
543*16467b97STreehugger Robot    @data[ @position ] == NEWLINE #if @position < @data.length
544*16467b97STreehugger Robot  end
545*16467b97STreehugger Robot
546*16467b97STreehugger Robot  #
547*16467b97STreehugger Robot  # Returns true if the stream has been exhausted.
548*16467b97STreehugger Robot  # This is an extra utility method for use inside lexer actions if needed.
549*16467b97STreehugger Robot  #
550*16467b97STreehugger Robot  def end_of_string?
551*16467b97STreehugger Robot    @position >= @data.length
552*16467b97STreehugger Robot  end
553*16467b97STreehugger Robot
554*16467b97STreehugger Robot  #
555*16467b97STreehugger Robot  # Returns true if the stream appears to be at the beginning of a stream (position = 0).
556*16467b97STreehugger Robot  # This is an extra utility method for use inside lexer actions if needed.
557*16467b97STreehugger Robot  #
558*16467b97STreehugger Robot  def beginning_of_string?
559*16467b97STreehugger Robot    @position == 0
560*16467b97STreehugger Robot  end
561*16467b97STreehugger Robot
562*16467b97STreehugger Robot  alias eof? end_of_string?
563*16467b97STreehugger Robot  alias bof? beginning_of_string?
564*16467b97STreehugger Robot
565*16467b97STreehugger Robot  #
566*16467b97STreehugger Robot  # record the current stream location parameters in the stream's marker table and
567*16467b97STreehugger Robot  # return an integer-valued bookmark that may be used to restore the stream's
568*16467b97STreehugger Robot  # position with the #rewind method. This method is used to implement backtracking.
569*16467b97STreehugger Robot  #
570*16467b97STreehugger Robot  def mark
571*16467b97STreehugger Robot    state = [ @position, @line, @column ].freeze
572*16467b97STreehugger Robot    @markers << state
573*16467b97STreehugger Robot    return @markers.length - 1
574*16467b97STreehugger Robot  end
575*16467b97STreehugger Robot
576*16467b97STreehugger Robot  #
577*16467b97STreehugger Robot  # restore the stream to an earlier location recorded by #mark. If no marker value is
578*16467b97STreehugger Robot  # provided, the last marker generated by #mark will be used.
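  #
  # === Example
  #
  # A sketch of backtracking with #mark and #rewind, assuming the stream
  # exposes its cursor through #position:
  #
  #   stream = ANTLR3::StringStream.new( "abc" )
  #   marker = stream.mark
  #   2.times { stream.consume }
  #   stream.position           # => 2
  #   stream.rewind( marker )   # restores position, line, and column
  #   stream.position           # => 0
  #
  # By default #rewind also releases the marker it was given.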
579*16467b97STreehugger Robot  #
580*16467b97STreehugger Robot  def rewind( marker = @markers.length - 1, release = true )
581*16467b97STreehugger Robot    ( marker >= 0 and location = @markers[ marker ] ) or return( self )
582*16467b97STreehugger Robot    @position, @line, @column = location
583*16467b97STreehugger Robot    release( marker ) if release
584*16467b97STreehugger Robot    return self
585*16467b97STreehugger Robot  end
586*16467b97STreehugger Robot
587*16467b97STreehugger Robot  #
588*16467b97STreehugger Robot  # the total number of markers currently in existence
589*16467b97STreehugger Robot  #
590*16467b97STreehugger Robot  def mark_depth
591*16467b97STreehugger Robot    @markers.length
592*16467b97STreehugger Robot  end
593*16467b97STreehugger Robot
594*16467b97STreehugger Robot  #
595*16467b97STreehugger Robot  # the last marker value created by a call to #mark
596*16467b97STreehugger Robot  #
597*16467b97STreehugger Robot  def last_marker
598*16467b97STreehugger Robot    @markers.length - 1
599*16467b97STreehugger Robot  end
600*16467b97STreehugger Robot
601*16467b97STreehugger Robot  #
602*16467b97STreehugger Robot  # let go of the bookmark data for the marker and all marker
603*16467b97STreehugger Robot  # values created after the marker.
604*16467b97STreehugger Robot  #
605*16467b97STreehugger Robot  def release( marker = @markers.length - 1 )
606*16467b97STreehugger Robot    marker.between?( 1, @markers.length - 1 ) or return
607*16467b97STreehugger Robot    @markers.pop( @markers.length - marker )
608*16467b97STreehugger Robot    return self
609*16467b97STreehugger Robot  end
610*16467b97STreehugger Robot
611*16467b97STreehugger Robot  #
612*16467b97STreehugger Robot  # jump to the absolute position value given by +index+.
613*16467b97STreehugger Robot  # note: if +index+ is before the current position, the +line+ and +column+
614*16467b97STreehugger Robot  #       attributes of the stream will probably be incorrect
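  #
  # === Example
  #
  # A sketch of a forward seek across a line break, assuming the stream starts
  # at line 1, column 0 and exposes #line and #column readers:
  #
  #   stream = ANTLR3::StringStream.new( "ab\ncd" )
  #   stream.seek( 4 )
  #   stream.line     # => 2
  #   stream.column   # => 1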
615*16467b97STreehugger Robot  #
616*16467b97STreehugger Robot  def seek( index )
617*16467b97STreehugger Robot    index = index.bound( 0, @data.length )  # ensures index is within the stream's range
618*16467b97STreehugger Robot    if index > @position
619*16467b97STreehugger Robot      skipped = through( index - @position )
620*16467b97STreehugger Robot      if lc = skipped.count( "\n" ) and lc.zero?
621*16467b97STreehugger Robot        @column += skipped.length
622*16467b97STreehugger Robot      else
623*16467b97STreehugger Robot        @line += lc
624*16467b97STreehugger Robot        @column = skipped.length - skipped.rindex( "\n" ) - 1
625*16467b97STreehugger Robot      end
626*16467b97STreehugger Robot    end
627*16467b97STreehugger Robot    @position = index
628*16467b97STreehugger Robot    return nil
629*16467b97STreehugger Robot  end
630*16467b97STreehugger Robot
631*16467b97STreehugger Robot  #
632*16467b97STreehugger Robot  # customized object inspection that shows:
633*16467b97STreehugger Robot  # * the stream class
634*16467b97STreehugger Robot  # * the stream's location in <tt>index / line:column</tt> format
635*16467b97STreehugger Robot  # * +before_chars+ characters before the cursor (6 characters by default)
636*16467b97STreehugger Robot  # * +after_chars+ characters after the cursor (10 characters by default)
637*16467b97STreehugger Robot  #
638*16467b97STreehugger Robot  def inspect( before_chars = 6, after_chars = 10 )
639*16467b97STreehugger Robot    before = through( -before_chars ).inspect
640*16467b97STreehugger Robot    @position - before_chars > 0 and before.insert( 0, '... ' )
641*16467b97STreehugger Robot
642*16467b97STreehugger Robot    after = through( after_chars ).inspect
643*16467b97STreehugger Robot    @position + after_chars + 1 < @data.length and after << ' ...'
644*16467b97STreehugger Robot
645*16467b97STreehugger Robot    location = "#@position / line #@line:#@column"
646*16467b97STreehugger Robot    "#<#{ self.class }: #{ before } | #{ after } @ #{ location }>"
647*16467b97STreehugger Robot  end
648*16467b97STreehugger Robot
649*16467b97STreehugger Robot  #
650*16467b97STreehugger Robot  # return the string slice between position +start+ and +stop+
651*16467b97STreehugger Robot  #
652*16467b97STreehugger Robot  def substring( start, stop )
653*16467b97STreehugger Robot    @string[ start, stop - start + 1 ]
654*16467b97STreehugger Robot  end
655*16467b97STreehugger Robot
656*16467b97STreehugger Robot  #
657*16467b97STreehugger Robot  # identical to String#[]
658*16467b97STreehugger Robot  #
659*16467b97STreehugger Robot  def []( start, *args )
660*16467b97STreehugger Robot    @string[ start, *args ]
661*16467b97STreehugger Robot  end
662*16467b97STreehugger Robotend
663*16467b97STreehugger Robot
664*16467b97STreehugger Robot
665*16467b97STreehugger Robot=begin rdoc ANTLR3::FileStream
666*16467b97STreehugger Robot
FileStream is a character stream that uses data stored in some external file. It
is nearly identical to StringStream, except that it reads its data from a file
and automatically sets up the +source_name+ and +line+ parameters. It does
not actually use any buffered IO operations throughout the stream navigation
process. Instead, it reads the file data once when the stream is initialized.
672*16467b97STreehugger Robot
673*16467b97STreehugger Robot=end
674*16467b97STreehugger Robot
675*16467b97STreehugger Robotclass FileStream < StringStream
676*16467b97STreehugger Robot
677*16467b97STreehugger Robot  #
678*16467b97STreehugger Robot  # creates a new FileStream object using the given +file+ object.
679*16467b97STreehugger Robot  # If +file+ is a path string, the file will be read and the contents
680*16467b97STreehugger Robot  # will be used and the +name+ attribute will be set to the path.
681*16467b97STreehugger Robot  # If +file+ is an IO-like object (that responds to :read),
682*16467b97STreehugger Robot  # the content of the object will be used and the stream will
  # attempt to set its +name+ attribute, first trying the method #name
684*16467b97STreehugger Robot  # on the object, then trying the method #path on the object.
685*16467b97STreehugger Robot  #
686*16467b97STreehugger Robot  # see StringStream.new for a list of additional options
  # the constructor accepts
688*16467b97STreehugger Robot  #
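  # === Example
  #
  # A minimal sketch; the path below is hypothetical:
  #
  #   stream = ANTLR3::FileStream.new( 'input/sample.txt' )
  #   stream.name   # => "input/sample.txt"
  #
  #   # an open File object is also accepted
  #   stream = File.open( 'input/sample.txt' ) { |f| ANTLR3::FileStream.new( f ) }
  #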
689*16467b97STreehugger Robot  def initialize( file, options = {} )
690*16467b97STreehugger Robot    case file
691*16467b97STreehugger Robot    when $stdin then
692*16467b97STreehugger Robot      data = $stdin.read
693*16467b97STreehugger Robot      @name = '(stdin)'
694*16467b97STreehugger Robot    when ARGF
695*16467b97STreehugger Robot      data = file.read
696*16467b97STreehugger Robot      @name = file.path
697*16467b97STreehugger Robot    when ::File then
698*16467b97STreehugger Robot      file = file.clone
699*16467b97STreehugger Robot      file.reopen( file.path, 'r' )
700*16467b97STreehugger Robot      @name = file.path
701*16467b97STreehugger Robot      data = file.read
702*16467b97STreehugger Robot      file.close
703*16467b97STreehugger Robot    else
704*16467b97STreehugger Robot      if file.respond_to?( :read )
705*16467b97STreehugger Robot        data = file.read
706*16467b97STreehugger Robot        if file.respond_to?( :name ) then @name = file.name
707*16467b97STreehugger Robot        elsif file.respond_to?( :path ) then @name = file.path
708*16467b97STreehugger Robot        end
709*16467b97STreehugger Robot      else
710*16467b97STreehugger Robot        @name = file.to_s
711*16467b97STreehugger Robot        if test( ?f, @name ) then data = File.read( @name )
712*16467b97STreehugger Robot        else raise ArgumentError, "could not find an existing file at %p" % @name
713*16467b97STreehugger Robot        end
714*16467b97STreehugger Robot      end
715*16467b97STreehugger Robot    end
716*16467b97STreehugger Robot    super( data, options )
717*16467b97STreehugger Robot  end
718*16467b97STreehugger Robot
719*16467b97STreehugger Robotend
720*16467b97STreehugger Robot
721*16467b97STreehugger Robot=begin rdoc ANTLR3::CommonTokenStream
722*16467b97STreehugger Robot
723*16467b97STreehugger RobotCommonTokenStream serves as the primary token stream implementation for feeding
724*16467b97STreehugger Robotsequential token input into parsers.
725*16467b97STreehugger Robot
726*16467b97STreehugger RobotUsing some TokenSource (such as a lexer), the stream collects a token sequence,
727*16467b97STreehugger Robotsetting the token's <tt>index</tt> attribute to indicate the token's position
728*16467b97STreehugger Robotwithin the stream. The streams may be tuned to some channel value; off-channel
729*16467b97STreehugger Robottokens will be filtered out by the #peek, #look, and #consume methods.
730*16467b97STreehugger Robot
731*16467b97STreehugger Robot=== Sample Usage
732*16467b97STreehugger Robot
733*16467b97STreehugger Robot
734*16467b97STreehugger Robot  source_input = ANTLR3::StringStream.new("35 * 4 - 1")
735*16467b97STreehugger Robot  lexer = Calculator::Lexer.new(source_input)
736*16467b97STreehugger Robot  tokens = ANTLR3::CommonTokenStream.new(lexer)
737*16467b97STreehugger Robot
738*16467b97STreehugger Robot  # assume this grammar defines whitespace as tokens on channel HIDDEN
739*16467b97STreehugger Robot  # and numbers and operations as tokens on channel DEFAULT
  tokens.look         # => 0 INT["35"] @ line 1 col 0 (0..1)
  tokens.look(2)      # => 2 MULT["*"] @ line 1 col 3 (3..3)
  tokens.tokens(0, 2)
    # => [0 INT["35"] @ line 1 col 0 (0..1),
    #     1 WS[" "] @ line 1 col 2 (2..2),
    #     2 MULT["*"] @ line 1 col 3 (3..3)]
746*16467b97STreehugger Robot    # notice the #tokens method does not filter off-channel tokens
747*16467b97STreehugger Robot
748*16467b97STreehugger Robot  lexer.reset
749*16467b97STreehugger Robot  hidden_tokens =
750*16467b97STreehugger Robot    ANTLR3::CommonTokenStream.new(lexer, :channel => ANTLR3::HIDDEN)
  hidden_tokens.look # => 1 WS[" "] @ line 1 col 2 (2..2)
752*16467b97STreehugger Robot
753*16467b97STreehugger Robot=end
754*16467b97STreehugger Robot
755*16467b97STreehugger Robotclass CommonTokenStream
756*16467b97STreehugger Robot  include TokenStream
757*16467b97STreehugger Robot  include Enumerable
758*16467b97STreehugger Robot
759*16467b97STreehugger Robot  #
760*16467b97STreehugger Robot  # constructs a new token stream using the +token_source+ provided. +token_source+ is
761*16467b97STreehugger Robot  # usually a lexer, but can be any object that implements +next_token+ and includes
762*16467b97STreehugger Robot  # ANTLR3::TokenSource.
763*16467b97STreehugger Robot  #
764*16467b97STreehugger Robot  # If a block is provided, each token harvested will be yielded and if the block
765*16467b97STreehugger Robot  # returns a +nil+ or +false+ value, the token will not be added to the stream --
766*16467b97STreehugger Robot  # it will be discarded.
767*16467b97STreehugger Robot  #
768*16467b97STreehugger Robot  # === Options
769*16467b97STreehugger Robot  # [:channel] The channel value the stream should be tuned to initially
770*16467b97STreehugger Robot  # [:source_name] The source name (file name) attribute of the stream
771*16467b97STreehugger Robot  #
772*16467b97STreehugger Robot  # === Example
773*16467b97STreehugger Robot  #
774*16467b97STreehugger Robot  #   # create a new token stream that is tuned to channel :comment, and
775*16467b97STreehugger Robot  #   # discard all WHITE_SPACE tokens
776*16467b97STreehugger Robot  #   ANTLR3::CommonTokenStream.new(lexer, :channel => :comment) do |token|
777*16467b97STreehugger Robot  #     token.name != 'WHITE_SPACE'
778*16467b97STreehugger Robot  #   end
779*16467b97STreehugger Robot  #
780*16467b97STreehugger Robot  def initialize( token_source, options = {} )
781*16467b97STreehugger Robot    case token_source
782*16467b97STreehugger Robot    when CommonTokenStream
783*16467b97STreehugger Robot      # this is useful in cases where you want to convert a CommonTokenStream
784*16467b97STreehugger Robot      # to a RewriteTokenStream or other variation of the standard token stream
785*16467b97STreehugger Robot      stream = token_source
786*16467b97STreehugger Robot      @token_source = stream.token_source
787*16467b97STreehugger Robot      @channel = options.fetch( :channel ) { stream.channel or DEFAULT_CHANNEL }
788*16467b97STreehugger Robot      @source_name = options.fetch( :source_name ) { stream.source_name }
789*16467b97STreehugger Robot      tokens = stream.tokens.map { | t | t.dup }
790*16467b97STreehugger Robot    else
791*16467b97STreehugger Robot      @token_source = token_source
792*16467b97STreehugger Robot      @channel = options.fetch( :channel, DEFAULT_CHANNEL )
793*16467b97STreehugger Robot      @source_name = options.fetch( :source_name ) {  @token_source.source_name rescue nil }
794*16467b97STreehugger Robot      tokens = @token_source.to_a
795*16467b97STreehugger Robot    end
796*16467b97STreehugger Robot    @last_marker = nil
797*16467b97STreehugger Robot    @tokens = block_given? ? tokens.select { | t | yield( t, self ) } : tokens
798*16467b97STreehugger Robot    @tokens.each_with_index { |t, i| t.index = i }
799*16467b97STreehugger Robot    @position =
800*16467b97STreehugger Robot      if first_token = @tokens.find { |t| t.channel == @channel }
801*16467b97STreehugger Robot        @tokens.index( first_token )
802*16467b97STreehugger Robot      else @tokens.length
803*16467b97STreehugger Robot      end
804*16467b97STreehugger Robot  end
805*16467b97STreehugger Robot
806*16467b97STreehugger Robot  #
807*16467b97STreehugger Robot  # resets the token stream and rebuilds it with a potentially new token source.
808*16467b97STreehugger Robot  # If no +token_source+ value is provided, the stream will attempt to reset the
809*16467b97STreehugger Robot  # current +token_source+ by calling +reset+ on the object. The stream will
810*16467b97STreehugger Robot  # then clear the token buffer and attempt to harvest new tokens. Identical in
811*16467b97STreehugger Robot  # behavior to CommonTokenStream.new, if a block is provided, tokens will be
812*16467b97STreehugger Robot  # yielded and discarded if the block returns a +false+ or +nil+ value.
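  #
  # === Example
  #
  # A sketch, assuming +tokens+ is a CommonTokenStream whose token source
  # responds to +reset+:
  #
  #   # harvest the tokens again and drop those named WHITE_SPACE
  #   tokens.rebuild { |token| token.name != 'WHITE_SPACE' }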
813*16467b97STreehugger Robot  #
814*16467b97STreehugger Robot  def rebuild( token_source = nil )
815*16467b97STreehugger Robot    if token_source.nil?
816*16467b97STreehugger Robot      @token_source.reset rescue nil
817*16467b97STreehugger Robot    else @token_source = token_source
818*16467b97STreehugger Robot    end
819*16467b97STreehugger Robot    @tokens = block_given? ? @token_source.select { |token| yield( token ) } :
820*16467b97STreehugger Robot                             @token_source.to_a
821*16467b97STreehugger Robot    @tokens.each_with_index { |t, i| t.index = i }
822*16467b97STreehugger Robot    @last_marker = nil
823*16467b97STreehugger Robot    @position =
824*16467b97STreehugger Robot      if first_token = @tokens.find { |t| t.channel == @channel }
825*16467b97STreehugger Robot        @tokens.index( first_token )
826*16467b97STreehugger Robot      else @tokens.length
827*16467b97STreehugger Robot      end
828*16467b97STreehugger Robot    return self
829*16467b97STreehugger Robot  end
830*16467b97STreehugger Robot
831*16467b97STreehugger Robot  #
832*16467b97STreehugger Robot  # tune the stream to a new channel value
833*16467b97STreehugger Robot  #
834*16467b97STreehugger Robot  def tune_to( channel )
835*16467b97STreehugger Robot    @channel = channel
836*16467b97STreehugger Robot  end
837*16467b97STreehugger Robot
838*16467b97STreehugger Robot  def token_class
839*16467b97STreehugger Robot    @token_source.token_class
840*16467b97STreehugger Robot  rescue NoMethodError
841*16467b97STreehugger Robot    @position == -1 and fill_buffer
842*16467b97STreehugger Robot    @tokens.empty? ? CommonToken : @tokens.first.class
843*16467b97STreehugger Robot  end
844*16467b97STreehugger Robot
845*16467b97STreehugger Robot  alias index position
846*16467b97STreehugger Robot
847*16467b97STreehugger Robot  def size
848*16467b97STreehugger Robot    @tokens.length
849*16467b97STreehugger Robot  end
850*16467b97STreehugger Robot
851*16467b97STreehugger Robot  alias length size
852*16467b97STreehugger Robot
853*16467b97STreehugger Robot  ###### State-Control ################################################
854*16467b97STreehugger Robot
855*16467b97STreehugger Robot  #
856*16467b97STreehugger Robot  # rewind the stream to its initial state
857*16467b97STreehugger Robot  #
858*16467b97STreehugger Robot  def reset
859*16467b97STreehugger Robot    @position = 0
860*16467b97STreehugger Robot    @position += 1 while token = @tokens[ @position ] and
861*16467b97STreehugger Robot                         token.channel != @channel
862*16467b97STreehugger Robot    @last_marker = nil
863*16467b97STreehugger Robot    return self
864*16467b97STreehugger Robot  end
865*16467b97STreehugger Robot
866*16467b97STreehugger Robot  #
867*16467b97STreehugger Robot  # bookmark the current position of the input stream
868*16467b97STreehugger Robot  #
869*16467b97STreehugger Robot  def mark
870*16467b97STreehugger Robot    @last_marker = @position
871*16467b97STreehugger Robot  end
872*16467b97STreehugger Robot
873*16467b97STreehugger Robot  def release( marker = nil )
874*16467b97STreehugger Robot    # do nothing
875*16467b97STreehugger Robot  end
876*16467b97STreehugger Robot
877*16467b97STreehugger Robot
878*16467b97STreehugger Robot  def rewind( marker = @last_marker, release = true )
879*16467b97STreehugger Robot    seek( marker )
880*16467b97STreehugger Robot  end
881*16467b97STreehugger Robot
882*16467b97STreehugger Robot  #
883*16467b97STreehugger Robot  # saves the current stream position, yields to the block,
884*16467b97STreehugger Robot  # and then ensures the stream's position is restored before
885*16467b97STreehugger Robot  # returning the value of the block
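  #
  # === Example
  #
  # A sketch of speculative look-ahead, assuming +tokens+ is a
  # CommonTokenStream with at least two on-channel tokens remaining:
  #
  #   start = tokens.position
  #   next_type = tokens.hold do
  #     tokens.consume
  #     tokens.peek              # type of the following on-channel token
  #   end
  #   tokens.position == start   # => true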
886*16467b97STreehugger Robot  #
887*16467b97STreehugger Robot  def hold( pos = @position )
888*16467b97STreehugger Robot    block_given? or return enum_for( :hold, pos )
889*16467b97STreehugger Robot    begin
890*16467b97STreehugger Robot      yield
891*16467b97STreehugger Robot    ensure
892*16467b97STreehugger Robot      seek( pos )
893*16467b97STreehugger Robot    end
894*16467b97STreehugger Robot  end
895*16467b97STreehugger Robot
896*16467b97STreehugger Robot  ###### Stream Navigation ###########################################
897*16467b97STreehugger Robot
898*16467b97STreehugger Robot  #
899*16467b97STreehugger Robot  # advance the stream one step to the next on-channel token
900*16467b97STreehugger Robot  #
901*16467b97STreehugger Robot  def consume
902*16467b97STreehugger Robot    token = @tokens[ @position ] || EOF_TOKEN
903*16467b97STreehugger Robot    if @position < @tokens.length
904*16467b97STreehugger Robot      @position = future?( 2 ) || @tokens.length
905*16467b97STreehugger Robot    end
906*16467b97STreehugger Robot    return( token )
907*16467b97STreehugger Robot  end
908*16467b97STreehugger Robot
909*16467b97STreehugger Robot  #
910*16467b97STreehugger Robot  # jump to the stream position specified by +index+
911*16467b97STreehugger Robot  # note: seek does not check whether or not the
  #       token at the specified position is on-channel.
913*16467b97STreehugger Robot  #
914*16467b97STreehugger Robot  def seek( index )
915*16467b97STreehugger Robot    @position = index.to_i.bound( 0, @tokens.length )
916*16467b97STreehugger Robot    return self
917*16467b97STreehugger Robot  end
918*16467b97STreehugger Robot
919*16467b97STreehugger Robot  #
920*16467b97STreehugger Robot  # return the type of the on-channel token at look-ahead distance +k+. <tt>k = 1</tt> represents
921*16467b97STreehugger Robot  # the current token. +k+ greater than 1 represents upcoming on-channel tokens. A negative
922*16467b97STreehugger Robot  # value of +k+ returns previous on-channel tokens consumed, where <tt>k = -1</tt> is the last
923*16467b97STreehugger Robot  # on-channel token consumed. <tt>k = 0</tt> has undefined behavior and returns +nil+
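  #
  # === Example
  #
  # A sketch, assuming +tokens+ was built from the input "35 * 4 - 1" by a
  # hypothetical lexer that places whitespace on the hidden channel:
  #
  #   tokens.peek         # type of INT "35"
  #   tokens.peek( 2 )    # type of MULT "*", skipping the hidden WS token
  #   tokens.consume
  #   tokens.peek( -1 )   # type of INT "35", the last on-channel token consumed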
924*16467b97STreehugger Robot  #
925*16467b97STreehugger Robot  def peek( k = 1 )
926*16467b97STreehugger Robot    tk = look( k ) and return( tk.type )
927*16467b97STreehugger Robot  end
928*16467b97STreehugger Robot
929*16467b97STreehugger Robot  #
  # operates similarly to #peek, but returns the full token object at look-ahead position +k+
931*16467b97STreehugger Robot  #
932*16467b97STreehugger Robot  def look( k = 1 )
933*16467b97STreehugger Robot    index = future?( k ) or return nil
934*16467b97STreehugger Robot    @tokens.fetch( index, EOF_TOKEN )
935*16467b97STreehugger Robot  end
936*16467b97STreehugger Robot
937*16467b97STreehugger Robot  alias >> look
938*16467b97STreehugger Robot  def << k
939*16467b97STreehugger Robot    self >> -k
940*16467b97STreehugger Robot  end
941*16467b97STreehugger Robot
942*16467b97STreehugger Robot  #
943*16467b97STreehugger Robot  # returns the index of the on-channel token at look-ahead position +k+ or nil if no other
944*16467b97STreehugger Robot  # on-channel tokens exist
945*16467b97STreehugger Robot  #
946*16467b97STreehugger Robot  def future?( k = 1 )
947*16467b97STreehugger Robot    @position == -1 and fill_buffer
948*16467b97STreehugger Robot
949*16467b97STreehugger Robot    case
950*16467b97STreehugger Robot    when k == 0 then nil
951*16467b97STreehugger Robot    when k < 0 then past?( -k )
952*16467b97STreehugger Robot    when k == 1 then @position
953*16467b97STreehugger Robot    else
954*16467b97STreehugger Robot      # since the stream only yields on-channel
955*16467b97STreehugger Robot      # tokens, the stream can't just go to the
956*16467b97STreehugger Robot      # next position, but rather must skip
957*16467b97STreehugger Robot      # over off-channel tokens
958*16467b97STreehugger Robot      ( k - 1 ).times.inject( @position ) do |cursor, |
959*16467b97STreehugger Robot        begin
960*16467b97STreehugger Robot          tk = @tokens.at( cursor += 1 ) or return( cursor )
          # ^- if tk is nil (i.e. cursor is outside array limits)
962*16467b97STreehugger Robot        end until tk.channel == @channel
963*16467b97STreehugger Robot        cursor
964*16467b97STreehugger Robot      end
965*16467b97STreehugger Robot    end
966*16467b97STreehugger Robot  end
967*16467b97STreehugger Robot
968*16467b97STreehugger Robot  #
969*16467b97STreehugger Robot  # returns the index of the on-channel token at look-behind position +k+ or nil if no other
970*16467b97STreehugger Robot  # on-channel tokens exist before the current token
971*16467b97STreehugger Robot  #
972*16467b97STreehugger Robot  def past?( k = 1 )
973*16467b97STreehugger Robot    @position == -1 and fill_buffer
974*16467b97STreehugger Robot
975*16467b97STreehugger Robot    case
976*16467b97STreehugger Robot    when k == 0 then nil
977*16467b97STreehugger Robot    when @position - k < 0 then nil
978*16467b97STreehugger Robot    else
979*16467b97STreehugger Robot
980*16467b97STreehugger Robot      k.times.inject( @position ) do |cursor, |
981*16467b97STreehugger Robot        begin
982*16467b97STreehugger Robot          cursor <= 0 and return( nil )
983*16467b97STreehugger Robot          tk = @tokens.at( cursor -= 1 ) or return( nil )
984*16467b97STreehugger Robot        end until tk.channel == @channel
985*16467b97STreehugger Robot        cursor
986*16467b97STreehugger Robot      end
987*16467b97STreehugger Robot
988*16467b97STreehugger Robot    end
989*16467b97STreehugger Robot  end
990*16467b97STreehugger Robot
991*16467b97STreehugger Robot  #
  # yields each token in the stream (including off-channel tokens).
993*16467b97STreehugger Robot  # If no block is provided, the method returns an Enumerator object.
994*16467b97STreehugger Robot  # #each accepts the same arguments as #tokens
995*16467b97STreehugger Robot  #
996*16467b97STreehugger Robot  def each( *args )
997*16467b97STreehugger Robot    block_given? or return enum_for( :each, *args )
998*16467b97STreehugger Robot    tokens( *args ).each { |token| yield( token ) }
999*16467b97STreehugger Robot  end
1000*16467b97STreehugger Robot
1001*16467b97STreehugger Robot
1002*16467b97STreehugger Robot  #
1003*16467b97STreehugger Robot  # yields each token in the stream with the given channel value
1004*16467b97STreehugger Robot  # If no channel value is given, the stream's tuned channel value will be used.
1005*16467b97STreehugger Robot  # If no block is given, an enumerator will be returned.
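  #
  # === Example
  #
  # A sketch, assuming +tokens+ is a populated CommonTokenStream and the
  # grammar sends whitespace to the hidden channel:
  #
  #   tokens.each_on_channel( ANTLR3::HIDDEN ) { |token| p token }
  #   on_channel = tokens.each_on_channel.to_a   # uses the tuned channel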
1006*16467b97STreehugger Robot  #
1007*16467b97STreehugger Robot  def each_on_channel( channel = @channel )
1008*16467b97STreehugger Robot    block_given? or return enum_for( :each_on_channel, channel )
1009*16467b97STreehugger Robot    for token in @tokens
1010*16467b97STreehugger Robot      token.channel == channel and yield( token )
1011*16467b97STreehugger Robot    end
1012*16467b97STreehugger Robot  end
1013*16467b97STreehugger Robot
1014*16467b97STreehugger Robot  #
  # iterates through the token stream, yielding each on-channel token along the way.
  # After iteration has completed, the stream's position will be restored to where
  # it was before #walk was called. While #each and #each_on_channel do not change
  # the stream's position during iteration, #walk advances through the stream. This
  # makes it possible to look ahead and behind the current token during iteration.
1020*16467b97STreehugger Robot  # If no block is given, an enumerator will be returned.
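  #
  # === Example
  #
  # A sketch that pairs each on-channel token with its successor, assuming
  # +tokens+ is a populated CommonTokenStream:
  #
  #   tokens.walk do |token|
  #     following = tokens.look   # #walk has already consumed +token+
  #     puts "#{ token.type } -> #{ following.type }"
  #   end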
1021*16467b97STreehugger Robot  #
1022*16467b97STreehugger Robot  def walk
1023*16467b97STreehugger Robot    block_given? or return enum_for( :walk )
1024*16467b97STreehugger Robot    initial_position = @position
1025*16467b97STreehugger Robot    begin
1026*16467b97STreehugger Robot      while token = look and token.type != EOF
1027*16467b97STreehugger Robot        consume
1028*16467b97STreehugger Robot        yield( token )
1029*16467b97STreehugger Robot      end
1030*16467b97STreehugger Robot      return self
1031*16467b97STreehugger Robot    ensure
1032*16467b97STreehugger Robot      @position = initial_position
1033*16467b97STreehugger Robot    end
1034*16467b97STreehugger Robot  end
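=begin
A minimal sketch of how #walk differs from plain iteration. DemoToken,
DemoStream and DEMO_EOF are hypothetical names, and the #look and #consume
shown here are simplified assumptions (the real stream also skips off-channel
tokens). Only the body of #walk is copied from the method above. The point is
that the stream has already advanced when the block runs, so the block can
look ahead, and that the position is restored afterwards.

```ruby
# hypothetical stand-ins for illustration only
DemoToken = Struct.new( :type, :text )
DEMO_EOF  = -1   # assumed EOF token type

class DemoStream
  attr_reader :position

  def initialize( tokens )
    @tokens   = tokens
    @position = 0
  end

  # simplified lookahead: k = 1 is the next token, k = -1 the previous one
  def look( k = 1 )
    k == 0 and return nil
    index = @position + ( k > 0 ? k - 1 : k )
    index < 0 ? nil : @tokens[ index ]
  end

  def consume
    @position += 1
  end

  # same control flow as the method documented above
  def walk
    block_given? or return enum_for( :walk )
    initial_position = @position
    begin
      while token = look and token.type != DEMO_EOF
        consume
        yield( token )
      end
      return self
    ensure
      @position = initial_position
    end
  end
end

stream = DemoStream.new( [
  DemoToken.new( 1, 'a' ), DemoToken.new( 1, 'b' ), DemoToken.new( DEMO_EOF, '<EOF>' )
] )

# pair each token with the one that follows it
pairs = []
stream.walk { |t| pairs << [ t.text, stream.look.text ] }
```

Since the position is reset in an +ensure+ clause, it should also be restored
when the block exits early with +break+ or raises an exception.
=end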

  #
  # returns a copy of the token buffer. If +start+ and +stop+ are provided, #tokens
  # returns a slice of the token buffer from <tt>start..stop</tt>. The parameters
  # are converted to integers with their <tt>to_i</tt> methods, and thus tokens
  # can be provided to specify start and stop. If a block is provided, tokens are
  # yielded and filtered out of the returned array if the block returns a +false+
  # or +nil+ value.
  #
  def tokens( start = nil, stop = nil )
    stop.nil?  || stop >= @tokens.length and stop = @tokens.length - 1
    start.nil? || stop < 0 and start = 0
    tokens = @tokens[ start..stop ]

    if block_given?
      tokens.delete_if { |t| not yield( t ) }
    end

    return( tokens )
  end
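=begin
A short sketch of the three ways #tokens can be called: with no arguments, with
an inclusive index range, and with a filtering block. DemoToken and DemoStream
are hypothetical stand-ins, and only integer indexes are used here. The method
body is copied from the method above so the snippet runs without the antlr3
library.

```ruby
# hypothetical stand-ins for illustration only
DemoToken = Struct.new( :text, :channel )

class DemoStream
  def initialize( tokens )
    @tokens = tokens
  end

  # same body as the method documented above
  def tokens( start = nil, stop = nil )
    stop.nil?  || stop >= @tokens.length and stop = @tokens.length - 1
    start.nil? || stop < 0 and start = 0
    tokens = @tokens[ start..stop ]

    if block_given?
      tokens.delete_if { |t| not yield( t ) }
    end

    return( tokens )
  end
end

stream = DemoStream.new( [
  DemoToken.new( 'a', 0 ), DemoToken.new( ' ', 99 ),
  DemoToken.new( 'b', 0 ), DemoToken.new( 'c', 0 )
] )

slice   = stream.tokens( 1, 2 ).map { |t| t.text }                 # inclusive range
visible = stream.tokens { |t| t.channel == 0 }.map { |t| t.text }  # block filter

# the result is a copy: emptying it leaves the stream's buffer untouched
stream.tokens.clear
copy_ok = ( stream.tokens.length == 4 )
```

The block form behaves like Array#select applied to the chosen slice: tokens
for which the block returns +false+ or +nil+ are dropped from the result.
=end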

  #
  # returns the token at index +i+ of the stream's token buffer, as with Array#at
  #
  def at( i )
    @tokens.at i
  end

  #
  # identical to Array#[], as applied to the stream's token buffer
  #
  def []( i, *args )
    @tokens[ i, *args ]
  end

  ###### Standard Conversion Methods ###############################
  def inspect
    string = "#<%p: @token_source=%p @ %p/%p" %
      [ self.class, @token_source.class, @position, @tokens.length ]
    tk = look( -1 ) and string << " #{ tk.inspect } <--"
    tk = look( 1 ) and string << " --> #{ tk.inspect }"
    string << '>'
  end

  #
  # fetches the text content of all tokens between +start+ and +stop+ and
  # joins the chunks into a single string
  #
  def extract_text( start = 0, stop = @tokens.length - 1 )
    start = start.to_i.at_least( 0 )
    stop = stop.to_i.at_most( @tokens.length )
    @tokens[ start..stop ].map! { |t| t.text }.join( '' )
  end

  alias to_s extract_text

end

end