# Design of the Emboss Tool

This document describes the internals of Emboss.  End users do not need to read
this document.

*TODO(bolms): Update this doc to include the newer passes.*

The Emboss compiler is divided into separate "front end" and "back end"
programs.  The front end parses Emboss files (`.emb` files) and produces a
stable intermediate representation (IR), which is consumed by the back ends.
This IR is defined in [public/ir_data.py][ir_pb2_py].

[ir_pb2_py]: public/ir_data.py

The back ends read the IR and emit code to view and manipulate Emboss-defined
data structures.  Currently, only a C++ back end exists.

*TODO(bolms): Split the symbol resolution and validation steps into a separate
"middle" component, to allow external code generators to generate undecorated
Emboss IR instead of Emboss source text?*

## Front End

*Implemented in [front_end/...][front_end]*

[front_end]: front_end/

The front end is responsible for reading in Emboss definitions and producing a
normalized intermediate representation (IR).  It is divided into several steps:
roughly, parsing, import resolution, symbol resolution, and validation.

The front end is orchestrated by [glue.py][glue_py], which runs each front end
component in the proper order to construct an IR suitable for consumption by the
back end.

[glue_py]: front_end/glue.py

The actual driver program is [emboss_front_end.py][emboss_front_end_py], which
just calls `glue.ParseEmbossFile` and prints the results.

[emboss_front_end_py]: front_end/emboss_front_end.py

### File Parsing

Per-file parsing consumes the text of a single Emboss module, and produces an
"undecorated" IR for the module, containing only syntactic-level information
from the module.

This "undecorated" IR is (almost) a subset of the final IR: later steps will add
information and perform validation, but will rarely remove anything from the IR
before it is emitted.

#### Tokenization

*Implemented in [tokenizer.py][tokenizer_py]*

[tokenizer_py]: front_end/tokenizer.py

The tokenizer is a fairly standard tokenizer, with Indent/Dedent insertion a la
Python.  It divides source text into `parse_types.Symbol` objects, suitable for
feeding into the parser.
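
To illustrate Python-style Indent/Dedent insertion, here is a minimal sketch.
It is not the code in `tokenizer.py`: the function name, the `(kind, text)`
tuples (standing in for `parse_types.Symbol`), and the whitespace-only word
splitting are all assumptions made for the example.

```python
def tokenize_indentation(source):
    """Returns (kind, text) pairs with Indent/Dedent pseudo-tokens inserted.

    Hypothetical sketch: real tokens are replaced by whitespace-separated
    "Word" tokens, and only spaces count as indentation.
    """
    tokens = []
    indent_stack = [0]  # widths of the currently open indentation levels
    for line in source.splitlines():
        if not line.strip():
            continue  # blank lines never open or close a block
        width = len(line) - len(line.lstrip(" "))
        if width > indent_stack[-1]:
            indent_stack.append(width)
            tokens.append(("Indent", ""))
        while width < indent_stack[-1]:
            indent_stack.pop()
            tokens.append(("Dedent", ""))
        if width != indent_stack[-1]:
            # The line dedented to a column that no enclosing block uses.
            raise ValueError("inconsistent dedent: " + repr(line))
        tokens.extend(("Word", word) for word in line.split())
        tokens.append(("EndOfLine", ""))
    # Close any blocks that are still open at the end of the input.
    while len(indent_stack) > 1:
        indent_stack.pop()
        tokens.append(("Dedent", ""))
    return tokens
```

The stack of indentation widths is what lets a single line close several
nested blocks at once: each popped width emits one Dedent.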

#### Syntax Tree Generation

*Implemented in [lr1.py][lr1_py] and [parser_generator.py][parser_generator_py], with a façade in [structure_parser.py][structure_parser_py]*

[lr1_py]: front_end/lr1.py
[parser_generator_py]: front_end/parser_generator.py
[structure_parser_py]: front_end/structure_parser.py

Emboss uses a fairly standard shift-reduce LR(1) parser, implemented in three
parts:

* A generic parser generator implementing the table generation algorithms from
  *[Compilers: Principles, Techniques, & Tools][dragon_book]* and the
  error-marking algorithm from *[Generating LR Syntax Error Messages from
  Examples][jeffery_2003]*.
* An Emboss-specific parser builder which glues the Emboss tokenizer, grammar,
  and error examples to the parser generator, producing an Emboss parser.
* The Emboss grammar, which is extracted from the file normalizer
  (*[module_ir.py][module_ir_py]*).

[dragon_book]: http://www.amazon.com/Compilers-Principles-Techniques-Tools-2nd/dp/0321486811
[jeffery_2003]: http://dl.acm.org/citation.cfm?id=937566
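
For readers unfamiliar with shift-reduce parsing, the sketch below shows the
table-driven loop such a parser runs.  It is only an illustration: the toy
grammar (`S -> ( S ) | x`), the hand-written `ACTION` and `GOTO` tables, and
the `parse` function are assumptions for this example, whereas Emboss generates
its tables from the Emboss grammar.  The tables here were worked out by hand
for this grammar, which is small enough that a single lookahead token per state
suffices.

```python
# Productions of the toy grammar, as (left-hand side, right-hand-side length).
PRODUCTIONS = [
    ("S", 3),  # 0: S -> ( S )
    ("S", 1),  # 1: S -> x
]

# ACTION[(state, lookahead token)] -> ("shift", next state),
# ("reduce", production index), or ("accept", None).
ACTION = {
    (0, "("): ("shift", 2), (0, "x"): ("shift", 3),
    (1, "$"): ("accept", None),
    (2, "("): ("shift", 2), (2, "x"): ("shift", 3),
    (3, ")"): ("reduce", 1), (3, "$"): ("reduce", 1),
    (4, ")"): ("shift", 5),
    (5, ")"): ("reduce", 0), (5, "$"): ("reduce", 0),
}

# GOTO[(state, nonterminal)] -> state to enter after a reduction.
GOTO = {(0, "S"): 1, (2, "S"): 4}


def parse(tokens):
    """Parses an iterable of tokens into a (symbol, children) tree."""
    tokens = list(tokens) + ["$"]  # "$" marks the end of the input
    states = [0]  # stack of parser states
    trees = []    # stack of tokens and subtrees, parallel to states[1:]
    position = 0
    while True:
        key = (states[-1], tokens[position])
        if key not in ACTION:
            raise SyntaxError("unexpected %r at token %d" % (key[1], position))
        kind, argument = ACTION[key]
        if kind == "shift":
            states.append(argument)
            trees.append(tokens[position])
            position += 1
        elif kind == "reduce":
            symbol, length = PRODUCTIONS[argument]
            children = trees[-length:]
            del trees[-length:]
            del states[-length:]
            trees.append((symbol, children))
            states.append(GOTO[(states[-1], symbol)])
        else:  # accept
            return trees[0]
```

Calling `parse("(x)")` under these assumptions yields a nested tuple tree whose
root is an `S` node with three children; input that matches no table entry
raises `SyntaxError` at the offending token.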

#### Normalization

*Implemented in [module_ir.py][module_ir_py]*

[module_ir_py]: front_end/module_ir.py

Once a parse tree has been generated, it is fed into a normalizer which
recursively turns the raw syntax tree into a "first stage" intermediate
representation (IR).  The first stage IR serves to isolate later stages from
minor changes in the grammar, but only contains information from a single file,
and does not perform any semantic checking.
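
The following sketch illustrates the idea of recursive normalization under
simplifying assumptions.  The `StructIr` and `FieldIr` classes, the production
names, and the `(production, children)` tree shape are all hypothetical and do
not correspond to the real classes in `ir_data.py` or the real Emboss grammar.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FieldIr:
    name: str
    type_name: str


@dataclass
class StructIr:
    name: str
    fields: List[FieldIr]


def normalize(node):
    """Recursively converts a (production, children) tree into IR objects."""
    production, children = node
    if production == "struct":
        # Children are: the "struct" keyword, the name, ":", then one subtree
        # per field.  Keywords and punctuation are dropped from the IR.
        name = children[1]
        fields = [normalize(c) for c in children if isinstance(c, tuple)]
        return StructIr(name=name, fields=fields)
    if production == "field":
        type_name, name = children
        return FieldIr(name=name, type_name=type_name)
    raise ValueError("unknown production: " + production)
```

Because keywords and punctuation are discarded here, a grammar change that only
affects such tokens would leave the resulting IR, and therefore later stages,
untouched.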

### Import Resolution

*TODO(bolms): Implement imports.*

After each file is parsed, any new imports it has are added to a work queue.
Each file in the work queue is parsed, potentially adding more imports to the
queue, until the queue is empty.
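
The work-queue loop described above can be sketched as follows.  The
`resolve_imports` function and the `parse_file` callback (which is assumed to
return a module's IR together with the file names it imports) are hypothetical
names chosen for this example, not part of the Emboss code base.

```python
from collections import deque


def resolve_imports(root_file, parse_file):
    """Parses root_file and everything it transitively imports.

    parse_file(name) is assumed to return (module_ir, imported_file_names).
    Returns a dict mapping each file name to its module IR.
    """
    modules = {}
    queue = deque([root_file])
    while queue:
        name = queue.popleft()
        if name in modules:
            continue  # already parsed; this also breaks import cycles
        module_ir, imports = parse_file(name)
        modules[name] = module_ir
        # Only files that have not been parsed yet are new work.
        queue.extend(i for i in imports if i not in modules)
    return modules
```

Tracking already-parsed files is what guarantees termination: each file is
parsed at most once, even when imports are shared or circular.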

### Symbol Resolution

*Implemented in [symbol_resolver.py][symbol_resolver_py]*

[symbol_resolver_py]: front_end/symbol_resolver.py

Symbol resolution is the process of correlating names in the IR.  At the end of
symbol resolution, every named entity (type definition, field definition, enum
name, etc.) has a `CanonicalName`, and every reference in the IR has a
`Reference` to the entity to which it refers.

This assignment occurs in two passes.  First, the full IR is scanned, generating
scoped symbol tables (nested dictionaries of names to `CanonicalName`), and
assigning identities to each `Name` in the IR.  Then the IR is fully scanned a
second time, and each `Reference` in the IR is resolved: all scopes visible to
the reference are scanned for the name, and the corresponding `CanonicalName` is
assigned to the reference.
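
A minimal sketch of the two passes is shown below.  It models the IR as plain
dictionaries and canonical names as tuples; those shapes, the key names, and
both function names are assumptions for illustration, and they are much simpler
than the real structures used by `symbol_resolver.py`.

```python
def build_symbol_tables(module):
    """Pass 1: assigns canonical names and builds the scoped symbol tables.

    Returns (module_scope, type_scopes), where module_scope maps type names
    to canonical names and type_scopes maps each type name to a dictionary of
    its field names to canonical names.
    """
    module_scope = {}
    type_scopes = {}
    for type_ir in module["types"]:
        type_ir["canonical_name"] = (module["name"], type_ir["name"])
        module_scope[type_ir["name"]] = type_ir["canonical_name"]
        fields_scope = {}
        for field in type_ir["fields"]:
            field["canonical_name"] = type_ir["canonical_name"] + (field["name"],)
            fields_scope[field["name"]] = field["canonical_name"]
        type_scopes[type_ir["name"]] = fields_scope
    return module_scope, type_scopes


def resolve_references(module, module_scope, type_scopes):
    """Pass 2: resolves every reference against the scopes visible to it."""
    for type_ir in module["types"]:
        # Innermost scope first, so a field name shadows a type name.
        visible_scopes = [type_scopes[type_ir["name"]], module_scope]
        for field in type_ir["fields"]:
            for reference in field["references"]:
                for scope in visible_scopes:
                    if reference["source_name"] in scope:
                        reference["canonical_name"] = scope[reference["source_name"]]
                        break
                else:
                    raise NameError("unresolved: " + reference["source_name"])
```

Running the passes separately means a reference can name an entity defined
later in the file, since every table is complete before any lookup happens.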

### Validation

*TODO(bolms): other validations?*

#### Size Checking

*TODO(bolms): describe*

#### Overlap Checking

*TODO(bolms): describe*

## Back End

*Implemented in [back_end/...][back_end]*

[back_end]: back_end/

Currently, only a C++ back end is implemented.

A back end takes Emboss IR and produces code in a specific language for
manipulating the Emboss-defined data structures.

### C++

*Implemented in [header_generator.py][header_generator_py] with templates in
[generated_code_templates][generated_code_templates], support code in
[emboss_cpp_util.h][emboss_cpp_util_h], and a driver program in
[emboss_codegen_cpp.py][emboss_codegen_cpp_py]*

[header_generator_py]: back_end/cpp/header_generator.py
[generated_code_templates]: back_end/cpp/generated_code_templates
[emboss_cpp_util_h]: back_end/cpp/emboss_cpp_util.h
[emboss_codegen_cpp_py]: back_end/cpp/emboss_codegen_cpp.py

The C++ code generator is currently very minimal.  `header_generator.py`
essentially inserts values from the IR into text templates.
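
As a rough illustration of template-based generation, the sketch below fills
text templates using Python's standard `string.Template`.  The template text,
the dictionary-shaped IR, the `generate_struct` function, and the shape of the
emitted C++ are all invented for this example; they are not taken from
`generated_code_templates` or from actual Emboss output.

```python
import string

# Hypothetical templates; the real ones live in generated_code_templates.
_STRUCT_TEMPLATE = string.Template("""\
class ${name}View {
 public:
${accessors}
};
""")

_FIELD_TEMPLATE = string.Template("  ${type} ${name}() const;")


def generate_struct(struct_ir):
    """Renders one C++ class declaration from a dictionary-shaped IR node."""
    accessors = "\n".join(
        _FIELD_TEMPLATE.substitute(type=field["type"], name=field["name"])
        for field in struct_ir["fields"])
    return _STRUCT_TEMPLATE.substitute(
        name=struct_ir["name"], accessors=accessors)
```

Keeping the C++ text in templates and the traversal in Python means most
changes to the shape of the generated code only touch the template text.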

*TODO(bolms): add more documentation once the C++ back end has more features.*