1*4947cdc7SCole Faust 2*4947cdc7SCole Faust# Starlark in Go: Implementation 3*4947cdc7SCole Faust 4*4947cdc7SCole FaustThis document (a work in progress) describes some of the design 5*4947cdc7SCole Faustchoices of the Go implementation of Starlark. 6*4947cdc7SCole Faust 7*4947cdc7SCole Faust * [Scanner](#scanner) 8*4947cdc7SCole Faust * [Parser](#parser) 9*4947cdc7SCole Faust * [Resolver](#resolver) 10*4947cdc7SCole Faust * [Evaluator](#evaluator) 11*4947cdc7SCole Faust * [Data types](#data-types) 12*4947cdc7SCole Faust * [Freezing](#freezing) 13*4947cdc7SCole Faust * [Testing](#testing) 14*4947cdc7SCole Faust 15*4947cdc7SCole Faust 16*4947cdc7SCole Faust## Scanner 17*4947cdc7SCole Faust 18*4947cdc7SCole FaustThe scanner is derived from Russ Cox's 19*4947cdc7SCole Faust[buildifier](https://github.com/bazelbuild/buildtools/tree/master/buildifier) 20*4947cdc7SCole Fausttool, which pretty-prints Bazel BUILD files. 21*4947cdc7SCole Faust 22*4947cdc7SCole FaustMost of the work happens in `(*scanner).nextToken`. 23*4947cdc7SCole Faust 24*4947cdc7SCole Faust## Parser 25*4947cdc7SCole Faust 26*4947cdc7SCole FaustThe parser is hand-written recursive-descent parser. It uses the 27*4947cdc7SCole Fausttechnique of [precedence 28*4947cdc7SCole Faustclimbing](http://www.engr.mun.ca/~theo/Misc/exp_parsing.htm#climbing) 29*4947cdc7SCole Faustto reduce the number of productions. 30*4947cdc7SCole Faust 31*4947cdc7SCole FaustIn some places the parser accepts a larger set of programs than are 32*4947cdc7SCole Fauststrictly valid, leaving the task of rejecting them to the subsequent 33*4947cdc7SCole Faustresolver pass. For example, in the function call `f(a, b=c)` the 34*4947cdc7SCole Faustparser accepts any expression for `a` and `b`, even though `b` may 35*4947cdc7SCole Faustlegally be only an identifier. For the parser to distinguish these 36*4947cdc7SCole Faustcases would require additional lookahead. 37*4947cdc7SCole Faust 38*4947cdc7SCole Faust## Resolver 39*4947cdc7SCole Faust 40*4947cdc7SCole FaustThe resolver reports structural errors in the program, such as the use 41*4947cdc7SCole Faustof `break` and `continue` outside of a loop. 42*4947cdc7SCole Faust 43*4947cdc7SCole FaustStarlark has stricter syntactic limitations than Python. For example, 44*4947cdc7SCole Faustit does not permit `for` loops or `if` statements at top level, nor 45*4947cdc7SCole Faustdoes it permit global variables to be bound more than once. 46*4947cdc7SCole FaustThese limitations come from the Bazel project's desire to make it easy 47*4947cdc7SCole Faustto identify the sole statement that defines each global, permitting 48*4947cdc7SCole Faustaccurate cross-reference documentation. 49*4947cdc7SCole Faust 50*4947cdc7SCole FaustIn addition, the resolver validates all variable names, classifying 51*4947cdc7SCole Faustthem as references to universal, global, local, or free variables. 52*4947cdc7SCole FaustLocal and free variables are mapped to a small integer, allowing the 53*4947cdc7SCole Faustevaluator to use an efficient (flat) representation for the 54*4947cdc7SCole Faustenvironment. 55*4947cdc7SCole Faust 56*4947cdc7SCole FaustNot all features of the Go implementation are "standard" (that is, 57*4947cdc7SCole Faustsupported by Bazel's Java implementation), at least for now, so 58*4947cdc7SCole Faustnon-standard features such as `lambda`, `float`, and `set` 59*4947cdc7SCole Faustare flag-controlled. The resolver reports 60*4947cdc7SCole Faustany uses of dialect features that have not been enabled. 61*4947cdc7SCole Faust 62*4947cdc7SCole Faust 63*4947cdc7SCole Faust## Evaluator 64*4947cdc7SCole Faust 65*4947cdc7SCole Faust### Data types 66*4947cdc7SCole Faust 67*4947cdc7SCole Faust<b>Integers:</b> Integers are representing using `big.Int`, an 68*4947cdc7SCole Faustarbitrary precision integer. This representation was chosen because, 69*4947cdc7SCole Faustfor many applications, Starlark must be able to handle without loss 70*4947cdc7SCole Faustprotocol buffer values containing signed and unsigned 64-bit integers, 71*4947cdc7SCole Faustwhich requires 65 bits of precision. 72*4947cdc7SCole Faust 73*4947cdc7SCole FaustSmall integers (<256) are preallocated, but all other values require 74*4947cdc7SCole Faustmemory allocation. Integer performance is relatively poor, but it 75*4947cdc7SCole Faustmatters little for Bazel-like workloads which depend much 76*4947cdc7SCole Faustmore on lists of strings than on integers. (Recall that a typical loop 77*4947cdc7SCole Faustover a list in Starlark does not materialize the loop index as an `int`.) 78*4947cdc7SCole Faust 79*4947cdc7SCole FaustAn optimization worth trying would be to represent integers using 80*4947cdc7SCole Fausteither an `int32` or `big.Int`, with the `big.Int` used only when 81*4947cdc7SCole Faust`int32` does not suffice. Using `int32`, not `int64`, for "small" 82*4947cdc7SCole Faustnumbers would make it easier to detect overflow from operations like 83*4947cdc7SCole Faust`int32 * int32`, which would trigger the use of `big.Int`. 84*4947cdc7SCole Faust 85*4947cdc7SCole Faust<b>Floating point</b>: 86*4947cdc7SCole FaustFloating point numbers are represented using Go's `float64`. 87*4947cdc7SCole FaustAgain, `float` support is required to support protocol buffers. The 88*4947cdc7SCole Faustexistence of floating-point NaN and its infamous comparison behavior 89*4947cdc7SCole Faust(`NaN != NaN`) had many ramifications for the API, since we cannot 90*4947cdc7SCole Faustassume the result of an ordered comparison is either less than, 91*4947cdc7SCole Faustgreater than, or equal: it may also fail. 92*4947cdc7SCole Faust 93*4947cdc7SCole Faust<b>Strings</b>: 94*4947cdc7SCole Faust 95*4947cdc7SCole FaustTODO: discuss UTF-8 and string.bytes method. 96*4947cdc7SCole Faust 97*4947cdc7SCole Faust<b>Dictionaries and sets</b>: 98*4947cdc7SCole FaustStarlark dictionaries have predictable iteration order. 99*4947cdc7SCole FaustFurthermore, many Starlark values are hashable in Starlark even though 100*4947cdc7SCole Faustthe Go values that represent them are not hashable in Go: big 101*4947cdc7SCole Faustintegers, for example. 102*4947cdc7SCole FaustConsequently, we cannot use Go maps to implement Starlark's dictionary. 103*4947cdc7SCole Faust 104*4947cdc7SCole FaustWe use a simple hash table whose buckets are linked lists, each 105*4947cdc7SCole Faustelement of which holds up to 8 key/value pairs. In a well-distributed 106*4947cdc7SCole Fausttable the list should rarely exceed length 1. In addition, each 107*4947cdc7SCole Faustkey/value item is part of doubly-linked list that maintains the 108*4947cdc7SCole Faustinsertion order of the elements for iteration. 109*4947cdc7SCole Faust 110*4947cdc7SCole Faust<b>Struct:</b> 111*4947cdc7SCole FaustThe `starlarkstruct` Go package provides a non-standard Starlark 112*4947cdc7SCole Faustextension data type, `struct`, that maps field identifiers to 113*4947cdc7SCole Faustarbitrary values. Fields are accessed using dot notation: `y = s.f`. 114*4947cdc7SCole FaustThis data type is extensively used in Bazel, but its specification is 115*4947cdc7SCole Faustcurrently evolving. 116*4947cdc7SCole Faust 117*4947cdc7SCole FaustStarlark has no `class` mechanism, nor equivalent of Python's 118*4947cdc7SCole Faust`namedtuple`, though it is likely that future versions will support 119*4947cdc7SCole Faustsome way to define a record data type of several fields, with a 120*4947cdc7SCole Faustrepresentation more efficient than a hash table. 121*4947cdc7SCole Faust 122*4947cdc7SCole Faust 123*4947cdc7SCole Faust### Freezing 124*4947cdc7SCole Faust 125*4947cdc7SCole FaustAll mutable values created during module initialization are _frozen_ 126*4947cdc7SCole Faustupon its completion. It is this property that permits a Starlark module 127*4947cdc7SCole Faustto be referenced by two Starlark threads running concurrently (such as 128*4947cdc7SCole Faustthe initialization threads of two other modules) without the 129*4947cdc7SCole Faustpossibility of a data race. 130*4947cdc7SCole Faust 131*4947cdc7SCole FaustThe Go implementation supports freezing by storing an additional 132*4947cdc7SCole Faust"frozen" Boolean variable in each mutable object. Once this flag is set, 133*4947cdc7SCole Faustall subsequent attempts at mutation fail. Every value defines a 134*4947cdc7SCole FaustFreeze method that sets its own frozen flag if not already set, and 135*4947cdc7SCole Faustcalls Freeze for each value that it contains. 136*4947cdc7SCole FaustFor example, when a list is frozen, it freezes each of its elements; 137*4947cdc7SCole Faustwhen a dictionary is frozen, it freezes each of its keys and values; 138*4947cdc7SCole Faustand when a function value is frozen, it freezes each of the free 139*4947cdc7SCole Faustvariables and parameter default values implicitly referenced by its closure. 140*4947cdc7SCole FaustApplication-defined types must also follow this discipline. 141*4947cdc7SCole Faust 142*4947cdc7SCole FaustThe freeze mechanism in the Go implementation is finer grained than in 143*4947cdc7SCole Faustthe Java implementation: in effect, the latter has one "frozen" flag 144*4947cdc7SCole Faustper module, and every value holds a reference to the frozen flag of 145*4947cdc7SCole Faustits module. This makes setting the frozen flag more efficient---a 146*4947cdc7SCole Faustsimple bit flip, no need to traverse the object graph---but coarser 147*4947cdc7SCole Faustgrained. Also, it complicates the API slightly because to construct a 148*4947cdc7SCole Faustlist, say, requires a reference to the frozen flag it should use. 149*4947cdc7SCole Faust 150*4947cdc7SCole FaustThe Go implementation would also permit the freeze operation to be 151*4947cdc7SCole Faustexposed to the program, for example as a built-in function. 152*4947cdc7SCole FaustThis has proven valuable in writing tests of the freeze mechanism 153*4947cdc7SCole Faustitself, but is otherwise mostly a curiosity. 154*4947cdc7SCole Faust 155*4947cdc7SCole Faust 156*4947cdc7SCole Faust### Fail-fast iterators 157*4947cdc7SCole Faust 158*4947cdc7SCole FaustIn some languages (such as Go), a program may mutate a data structure 159*4947cdc7SCole Faustwhile iterating over it; for example, a range loop over a map may 160*4947cdc7SCole Faustdelete map elements. In other languages (such as Java), iterators do 161*4947cdc7SCole Faustextra bookkeeping so that modification of the underlying collection 162*4947cdc7SCole Faustinvalidates the iterator, and the next attempt to use it fails. 163*4947cdc7SCole FaustThis often helps to detect subtle mistakes. 164*4947cdc7SCole Faust 165*4947cdc7SCole FaustStarlark takes this a step further. Instead of mutation of the 166*4947cdc7SCole Faustcollection invalidating the iterator, the act of iterating makes the 167*4947cdc7SCole Faustcollection temporarily immutable, so that an attempt to, say, delete a 168*4947cdc7SCole Faustdict element while looping over the dict, will fail. The error is 169*4947cdc7SCole Faustreported against the delete operation, not the iteration. 170*4947cdc7SCole Faust 171*4947cdc7SCole FaustThis is implemented by having each mutable iterable value record a 172*4947cdc7SCole Faustcounter of active iterators. Starting a loop increments this counter, 173*4947cdc7SCole Faustand completing a loop decrements it. A collection with a nonzero 174*4947cdc7SCole Faustcounter behaves as if frozen. If the collection is actually frozen, 175*4947cdc7SCole Faustthe counter bookkeeping is unnecessary. (Consequently, iterator 176*4947cdc7SCole Faustbookkeeping is needed only while objects are still mutable, before 177*4947cdc7SCole Faustthey can have been published to another thread, and thus no 178*4947cdc7SCole Faustsynchronization is necessary.) 179*4947cdc7SCole Faust 180*4947cdc7SCole FaustA consequence of this design is that in the Go API, it is imperative 181*4947cdc7SCole Faustto call `Done` on each iterator once it is no longer needed. 182*4947cdc7SCole Faust 183*4947cdc7SCole Faust``` 184*4947cdc7SCole FaustTODO 185*4947cdc7SCole Fauststarlark.Value interface and subinterfaces 186*4947cdc7SCole Faustargument passing to builtins: UnpackArgs, UnpackPositionalArgs. 187*4947cdc7SCole Faust``` 188*4947cdc7SCole Faust 189*4947cdc7SCole Faust<b>Evaluation strategy:</b> 190*4947cdc7SCole FaustThe evaluator uses a simple recursive tree walk, returning a value or 191*4947cdc7SCole Faustan error for each expression. We have experimented with just-in-time 192*4947cdc7SCole Faustcompilation of syntax trees to bytecode, but two limitations in the 193*4947cdc7SCole Faustcurrent Go compiler prevent this strategy from outperforming the 194*4947cdc7SCole Fausttree-walking evaluator. 195*4947cdc7SCole Faust 196*4947cdc7SCole FaustFirst, the Go compiler does not generate a "computed goto" for a 197*4947cdc7SCole Faustswitch statement ([Go issue 198*4947cdc7SCole Faust5496](https://github.com/golang/go/issues/5496)). A bytecode 199*4947cdc7SCole Faustinterpreter's main loop is a for-loop around a switch statement with 200*4947cdc7SCole Faustdozens or hundreds of cases, and the speed with which each case can be 201*4947cdc7SCole Faustdispatched strongly affects overall performance. 202*4947cdc7SCole FaustCurrently, a switch statement generates a binary tree of ordered 203*4947cdc7SCole Faustcomparisons, requiring several branches instead of one. 204*4947cdc7SCole Faust 205*4947cdc7SCole FaustSecond, the Go compiler's escape analysis assumes that the underlying 206*4947cdc7SCole Faustarray from a `make([]Value, n)` allocation always escapes 207*4947cdc7SCole Faust([Go issue 20533](https://github.com/golang/go/issues/20533)). 208*4947cdc7SCole FaustBecause the bytecode interpreter's operand stack has a non-constant 209*4947cdc7SCole Faustlength, it must be allocated with `make`. The resulting allocation 210*4947cdc7SCole Faustadds to the cost of each Starlark function call; this can be tolerated 211*4947cdc7SCole Faustby amortizing one very large stack allocation across many calls. 212*4947cdc7SCole FaustMore problematic appears to be the cost of the additional GC write 213*4947cdc7SCole Faustbarriers incurred by every VM operation: every intermediate result is 214*4947cdc7SCole Faustsaved to the VM's operand stack, which is on the heap. 215*4947cdc7SCole FaustBy contrast, intermediate results in the tree-walking evaluator are 216*4947cdc7SCole Faustnever stored to the heap. 217*4947cdc7SCole Faust 218*4947cdc7SCole Faust``` 219*4947cdc7SCole FaustTODO 220*4947cdc7SCole Faustframes, backtraces, errors. 221*4947cdc7SCole Faustthreads 222*4947cdc7SCole FaustPrint 223*4947cdc7SCole FaustLoad 224*4947cdc7SCole Faust``` 225*4947cdc7SCole Faust 226*4947cdc7SCole Faust## Testing 227*4947cdc7SCole Faust 228*4947cdc7SCole Faust``` 229*4947cdc7SCole FaustTODO 230*4947cdc7SCole Fauststarlarktest package 231*4947cdc7SCole Faust`assert` module 232*4947cdc7SCole Fauststarlarkstruct 233*4947cdc7SCole Faustintegration with Go testing.T 234*4947cdc7SCole Faust``` 235*4947cdc7SCole Faust 236*4947cdc7SCole Faust 237*4947cdc7SCole Faust## TODO 238*4947cdc7SCole Faust 239*4947cdc7SCole Faust 240*4947cdc7SCole Faust``` 241*4947cdc7SCole FaustDiscuss practical separation of code and data. 242*4947cdc7SCole Faust``` 243