1# Design of the Emboss Tool 2 3This document describes the internals of Emboss. End users do not need to read 4this document. 5 6*TODO(bolms): Update this doc to include the newer passes.* 7 8The Emboss compiler is divided into separate "front end" and "back end" 9programs. The front end parses Emboss files (`.emb` files) and produces a 10stable intermediate representation (IR), which is consumed by the back ends. 11This IR is defined in [public/ir_data.py][ir_pb2_py]. 12 13[ir_pb2_py]: public/ir_data.py 14 15The back ends read the IR and emit code to view and manipulate Emboss-defined 16data structures. Currently, only a C++ back-end exists. 17 18*TODO(bolms): Split the symbol resolution and validation steps in a separate 19"middle" component, to allow external code generators to generate undecorated 20Emboss IR instead of Emboss source text?* 21 22## Front End 23 24*Implemented in [front_end/...][front_end]* 25 26[front_end]: front_end/ 27 28The front end is responsible for reading in Emboss definitions and producing a 29normalized intermediate representation (IR). It is divided into several steps: 30roughly, parsing, import resolution, symbol resolution, and validation. 31 32The front end is orchestrated by [glue.py][glue_py], which runs each front end 33component in the proper order to construct an IR suitable for consumption by the 34back end. 35 36[glue_py]: front_end/glue.py 37 38The actual driver program is [emboss_front_end.py][emboss_front_end_py], which 39just calls `glue.ParseEmbossFile` and prints the results. 40 41[emboss_front_end_py]: front_end/emboss_front_end.py 42 43### File Parsing 44 45Per-file parsing consumes the text of a single Emboss module, and produces an 46"undecorated" IR for the module, containing only syntactic-level information 47from the module. 48 49This "undecorated" IR is (almost) a subset of the final IR: later steps will add 50information and perform validation, but will rarely remove anything from the IR 51before it is emitted. 52 53#### Tokenization 54 55*Implemented in [tokenizer.py][tokenizer_py]* 56 57[tokenizer_py]: front_end/tokenizer.py 58 59The tokenizer is a fairly standard tokenizer, with Indent/Dedent insertion a la 60Python. It divides source text into `parse_types.Symbol` objects, suitable for 61feeding into the parser. 62 63#### Syntax Tree Generation 64 65*Implemented in [lr1.py][lr1_py] and [parser_generator.py][parser_generator_py], with a façade in [structure_parser.py][structure_parser_py]* 66 67[lr1_py]: front_end/lr1.py 68[parser_generator_py]: front_end/parser_generator.py 69[structure_parser_py]: front_end/structure_parser.py 70 71Emboss uses a pretty standard Shift-Reduce LR(1) parser. This is implemented in 72three parts in Emboss: 73 74* A generic parser generator implementing the table generation algorithms from 75 *[Compilers: Principles, Techniques, & Tools][dragon_book]* and the 76 error-marking algorithm from *[Generating LR Syntax Error Messages from 77 Examples][jeffery_2003]*. 78* An Emboss-specific parser builder which glues the Emboss tokenizer, grammar, 79 and error examples to the parser generator, producing an Emboss parser. 80* The Emboss grammar, which is extracted from the file normalizer 81 (*[module_ir.py][module_ir_py]*). 82 83[dragon_book]: http://www.amazon.com/Compilers-Principles-Techniques-Tools-2nd/dp/0321486811 84[jeffery_2003]: http://dl.acm.org/citation.cfm?id=937566 85 86#### Normalization 87 88*Implemented in [module_ir.py][module_ir_py]* 89 90[module_ir_py]: front_end/module_ir.py 91 92Once a parse tree has been generated, it is fed into a normalizer which 93recursively turns the raw syntax tree into a "first stage" intermediate 94representation (IR). The first stage IR serves to isolate later stages from 95minor changes in the grammar, but only contains information from a single file, 96and does not perform any semantic checking. 97 98### Import Resolution 99 100*TODO(bolms): Implement imports.* 101 102After each file is parsed, any new imports it has are added to a work queue. 103Each file in the work queue is parsed, potentially adding more imports to the 104queue, until the queue is empty. 105 106### Symbol Resolution 107 108*Implemented in [symbol_resolver.py][symbol_resolver_py]* 109 110[symbol_resolver_py]: front_end/symbol_resolver.py 111 112Symbol resolution is the process of correlating names in the IR. At the end of 113symbol resolution, every named entity (type definition, field definition, enum 114name, etc.) has a `CanonicalName`, and every reference in the IR has a 115`Reference` to the entity to which it refers. 116 117This assignment occurs in two passes. First, the full IR is scanned, generating 118scoped symbol tables (nested dictionaries of names to `CanonicalName`), and 119assigning identities to each `Name` in the IR. Then the IR is fully scanned a 120second time, and each `Reference` in the IR is resolved: all scopes visible to 121the reference are scanned for the name, and the corresponding `CanonicalName` is 122assigned to the reference. 123 124### Validation 125 126*TODO(bolms): other validations?* 127 128#### Size Checking 129 130*TODO(bolms): describe* 131 132#### Overlap Checking 133 134*TODO(bolms): describe* 135 136## Back End 137 138*Implemented in [back_end/...][back_end]* 139 140[back_end]: back_end/ 141 142Currently, only a C++ back end is implemented. 143 144A back end takes Emboss IR and produces code in a specific language for 145manipulating the Emboss-defined data structures. 146 147### C++ 148 149*Implemented in [header_generator.py][header_generator_py] with templates in 150[generated_code_templates][generated_code_templates], support code in 151[emboss_cpp_util.h][emboss_cpp_util_h], and a driver program in 152[emboss_codegen_cpp.py][emboss_codegen_cpp_py]* 153 154[header_generator_py]: back_end/cpp/header_generator.py 155[generated_code_templates]: back_end/cpp/generated_code_templates 156[emboss_cpp_util_h]: back_end/cpp/emboss_cpp_util.h 157[emboss_codegen_cpp_py]: back_end/cpp/emboss_codegen_cpp.py 158 159The C++ code generator is currently very minimal. `header_generator.py` 160essentially inserts values from the IR into text templates. 161 162*TODO(bolms): add more documentation once the C++ back end has more features.* 163