# Design of the Emboss Tool

This document describes the internals of Emboss. End users do not need to read
this document.

*TODO(bolms): Update this doc to include the newer passes.*

The Emboss compiler is divided into separate "front end" and "back end"
programs. The front end parses Emboss files (`.emb` files) and produces a
stable intermediate representation (IR), which is consumed by the back ends.
This IR is defined in [public/ir_data.py][ir_pb2_py].

[ir_pb2_py]: public/ir_data.py

The back ends read the IR and emit code to view and manipulate Emboss-defined
data structures. Currently, only a C++ back end exists.

*TODO(bolms): Split the symbol resolution and validation steps into a separate
"middle" component, to allow external code generators to generate undecorated
Emboss IR instead of Emboss source text?*

## Front End

*Implemented in [front_end/...][front_end]*

[front_end]: front_end/

The front end is responsible for reading in Emboss definitions and producing a
normalized intermediate representation (IR). It is divided into several steps:
roughly, parsing, import resolution, symbol resolution, and validation.

The front end is orchestrated by [glue.py][glue_py], which runs each front end
component in the proper order to construct an IR suitable for consumption by the
back end.

[glue_py]: front_end/glue.py

The actual driver program is [emboss_front_end.py][emboss_front_end_py], which
just calls `glue.ParseEmbossFile` and prints the results.

[emboss_front_end_py]: front_end/emboss_front_end.py

### File Parsing

Per-file parsing consumes the text of a single Emboss module, and produces an
"undecorated" IR for the module, containing only syntactic-level information
from the module.

This "undecorated" IR is (almost) a subset of the final IR: later steps will add
information and perform validation, but will rarely remove anything from the IR
before it is emitted.

#### Tokenization

*Implemented in [tokenizer.py][tokenizer_py]*

[tokenizer_py]: front_end/tokenizer.py

The tokenizer is a fairly standard tokenizer, with Indent/Dedent insertion in
the style of Python. It divides source text into `parse_types.Symbol` objects,
suitable for feeding into the parser.
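
To make the Indent/Dedent idea concrete, here is a minimal sketch of indentation-aware tokenization. It is not the code in `tokenizer.py`: the function name, the `(kind, text)` tuples, and the whitespace-only word splitting are all assumptions made for illustration, and the real tokenizer produces `parse_types.Symbol` objects instead.

```python
def tokenize_with_indentation(source):
    """Splits source into (kind, text) pairs, adding Indent/Dedent tokens.

    Hypothetical illustration only; only leading spaces count as indentation.
    """
    tokens = []
    indent_stack = [0]  # Widths of the currently open indentation levels.
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # Blank lines do not affect indentation.
        width = len(line) - len(line.lstrip(" "))
        if width > indent_stack[-1]:
            # Deeper than the current level: open one new level.
            indent_stack.append(width)
            tokens.append(("Indent", ""))
        while width < indent_stack[-1]:
            # Shallower: close levels until we reach a matching one.
            indent_stack.pop()
            tokens.append(("Dedent", ""))
        if width != indent_stack[-1]:
            raise ValueError("inconsistent dedent: " + repr(line))
        for word in stripped.split():
            tokens.append(("Word", word))
        tokens.append(("EndOfLine", ""))
    # Close any indentation levels still open at end of input.
    while len(indent_stack) > 1:
        indent_stack.pop()
        tokens.append(("Dedent", ""))
    return tokens
```

Keeping a stack of indentation widths means a single line can close several nested blocks at once, which is the same approach Python's own lexer takes.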

#### Syntax Tree Generation

*Implemented in [lr1.py][lr1_py] and [parser_generator.py][parser_generator_py], with a façade in [structure_parser.py][structure_parser_py]*

[lr1_py]: front_end/lr1.py
[parser_generator_py]: front_end/parser_generator.py
[structure_parser_py]: front_end/structure_parser.py

Emboss uses a fairly standard shift-reduce LR(1) parser. This is implemented in
three parts:

* A generic parser generator implementing the table generation algorithms from
  *[Compilers: Principles, Techniques, & Tools][dragon_book]* and the
  error-marking algorithm from *[Generating LR Syntax Error Messages from
  Examples][jeffery_2003]*.
* An Emboss-specific parser builder which glues the Emboss tokenizer, grammar,
  and error examples to the parser generator, producing an Emboss parser.
* The Emboss grammar, which is extracted from the file normalizer
  (*[module_ir.py][module_ir_py]*).
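
For readers unfamiliar with table-driven shift-reduce parsing, the sketch below runs the standard driver loop over a tiny, hand-written action/goto table for the toy grammar `S -> ( S ) | x`. Everything here (`RULES`, `ACTION`, `GOTO`, `parse`) is a hypothetical illustration. In Emboss the tables are generated from the grammar by the parser generator, not written by hand, and this sketch does not reflect the interfaces of `lr1.py`.

```python
# Productions: rule number -> (left-hand side, number of right-hand-side symbols).
#   1: S -> ( S )
#   2: S -> x
RULES = {1: ("S", 3), 2: ("S", 1)}

# Hand-built tables for the toy grammar; "$" marks end of input.
ACTION = {
    (0, "("): ("shift", 2), (0, "x"): ("shift", 3),
    (1, "$"): ("accept", None),
    (2, "("): ("shift", 2), (2, "x"): ("shift", 3),
    (3, ")"): ("reduce", 2), (3, "$"): ("reduce", 2),
    (4, ")"): ("shift", 5),
    (5, ")"): ("reduce", 1), (5, "$"): ("reduce", 1),
}
GOTO = {(0, "S"): 1, (2, "S"): 4}


def parse(tokens):
    """Parses a token list and returns a nested (symbol, children) tree."""
    states = [0]  # Stack of parser states.
    trees = []    # Stack of partial parse trees, parallel to states[1:].
    stream = list(tokens) + ["$"]
    position = 0
    while True:
        lookahead = stream[position]
        action, arg = ACTION.get((states[-1], lookahead), ("error", None))
        if action == "shift":
            # Consume the token and move to the state named by the table.
            states.append(arg)
            trees.append(lookahead)
            position += 1
        elif action == "reduce":
            # Replace the right-hand side on the stacks with its left-hand side.
            lhs, length = RULES[arg]
            children = trees[-length:]
            del trees[-length:]
            del states[-length:]
            trees.append((lhs, children))
            states.append(GOTO[(states[-1], lhs)])
        elif action == "accept":
            return trees[0]
        else:
            raise SyntaxError("unexpected token %r at position %d"
                              % (lookahead, position))
```

For example, `parse(list("(x)"))` yields a tree whose root is `S` with children `(`, a nested `S`, and `)`, while `parse(list("(x"))` raises `SyntaxError` because the table has no entry for end of input in that state.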

[dragon_book]: http://www.amazon.com/Compilers-Principles-Techniques-Tools-2nd/dp/0321486811
[jeffery_2003]: http://dl.acm.org/citation.cfm?id=937566

#### Normalization

*Implemented in [module_ir.py][module_ir_py]*

[module_ir_py]: front_end/module_ir.py

Once a parse tree has been generated, it is fed into a normalizer which
recursively turns the raw syntax tree into a "first stage" intermediate
representation (IR). The first stage IR serves to isolate later stages from
minor changes in the grammar, but it only contains information from a single
file, and no semantic checking has been performed on it.

### Import Resolution

*TODO(bolms): Implement imports.*

After each file is parsed, any new imports it has are added to a work queue.
Each file in the work queue is parsed, potentially adding more imports to the
queue, until the queue is empty.

### Symbol Resolution

*Implemented in [symbol_resolver.py][symbol_resolver_py]*

[symbol_resolver_py]: front_end/symbol_resolver.py

Symbol resolution is the process of correlating names in the IR. At the end of
symbol resolution, every named entity (type definition, field definition, enum
name, etc.)
has a `CanonicalName`, and every reference in the IR has a
`Reference` to the entity to which it refers.

This assignment occurs in two passes. First, the full IR is scanned, generating
scoped symbol tables (nested dictionaries of names to `CanonicalName`), and
assigning identities to each `Name` in the IR. Then the IR is fully scanned a
second time, and each `Reference` in the IR is resolved: all scopes visible to
the reference are scanned for the name, and the corresponding `CanonicalName` is
assigned to the reference.

### Validation

*TODO(bolms): other validations?*

#### Size Checking

*TODO(bolms): describe*

#### Overlap Checking

*TODO(bolms): describe*

## Back End

*Implemented in [back_end/...][back_end]*

[back_end]: back_end/

Currently, only a C++ back end is implemented.

A back end takes Emboss IR and produces code in a specific language for
manipulating the Emboss-defined data structures.

### C++

*Implemented in [header_generator.py][header_generator_py] with templates in
[generated_code_templates][generated_code_templates], support code in
[emboss_cpp_util.h][emboss_cpp_util_h], and a driver program in
[emboss_codegen_cpp.py][emboss_codegen_cpp_py]*

[header_generator_py]: back_end/cpp/header_generator.py
[generated_code_templates]: back_end/cpp/generated_code_templates
[emboss_cpp_util_h]: back_end/cpp/emboss_cpp_util.h
[emboss_codegen_cpp_py]: back_end/cpp/emboss_codegen_cpp.py

The C++ code generator is currently very minimal. `header_generator.py`
essentially inserts values from the IR into text templates.

*TODO(bolms): add more documentation once the C++ back end has more features.*
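
As a rough illustration of what "inserting values from the IR into text templates" can look like, the sketch below uses Python's standard `string.Template`. The template text, the dictionary-shaped IR, the `Read<...>` helper in the generated C++, and the `generate_struct` function are all made up for this example. They do not reflect the actual contents of `generated_code_templates` or the real IR types.

```python
import string

# Hypothetical templates; the real ones live in generated_code_templates.
_FIELD_TEMPLATE = string.Template(
    "  ${cpp_type} ${name}() const { return Read<${cpp_type}>(${offset}); }\n")
_STRUCT_TEMPLATE = string.Template(
    "class ${name}View {\n public:\n${accessors}};\n")


def generate_struct(struct_ir):
    """Renders one C++ view class from a dictionary-shaped stand-in for the IR."""
    # Render one accessor per field, then splice them into the class template.
    accessors = "".join(
        _FIELD_TEMPLATE.substitute(field) for field in struct_ir["fields"])
    return _STRUCT_TEMPLATE.substitute(
        name=struct_ir["name"], accessors=accessors)


example_ir = {
    "name": "Foo",
    "fields": [
        {"name": "x", "cpp_type": "std::uint32_t", "offset": 0},
        {"name": "y", "cpp_type": "std::uint32_t", "offset": 4},
    ],
}
generated = generate_struct(example_ir)
```

Keeping the C++ text in templates and the traversal logic in Python lets the shape of the generated code be read and edited without following the generator's control flow.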