#  TensorFlow Graph IR

This directory contains the definition of the Intermediate Representation (IR)
for TensorFlow graphs using MLIR.

## Introduction

This directory defines an MLIR dialect, the “TensorFlow Graph dialect”, that
accurately represents TensorFlow graphs. Contrary to the previous TensorFlow
dialect, which made some opinionated choices that diverged from GraphDef and
TensorFlow Graph semantics, this dialect embraces TensorFlow Graph as it is. In
particular, the concepts of control dependencies, requested device, assigned
device, and node name are all first-class attributes on the MLIR operations in
this dialect.

The main principle that drove the development of this dialect has been to ensure
a perfect round-trip and general compatibility with existing TensorFlow
semantics, so that this solution can be deployed by default in any situation
where "Graph Optimization" and Grappler transformations are involved today,
regardless of TensorFlow V1 or V2. This new approach is also made possible by
evolutions in MLIR that allow representing graphs in a way that wasn’t possible
before (more in the [Graph operation design](#graph_operation_design) section
below).

## History of Dialects for TensorFlow

MLIR started with a basic structure reflecting LLVM in that it defined a
`Module` containing a list of `Functions`. Each of these defined a body
constrained to be a Control-Flow Graph (CFG): a list of `Blocks`, each of them
containing a list of `Operations`. A fundamental aspect of the CFG
representation is the notion of “control”: the abstract semantic model considers
that a single `Operation` executes at a given time, and the next `Operation` to
execute is necessarily the one listed immediately after[^1]. The last
`Operation` in a `Block` is a `Terminator`: it decides which `Block` control
will be transferred to next (think of a branch).

When MLIR started, a first dialect -- that we were referring to as the “TF
control dialect” -- was developed to model TensorFlow graphs. This dialect
supported control dependencies, but didn’t allow cycles in the graph, which
forced some tricks to model TensorFlow V1 loops, and in particular the
`NextIteration` operation. While this dialect enabled some experimentation, it
wasn’t seen as really practical, and another dialect co-existed with it: the
“tf” dialect that we’re using currently. This dialect was designed before TF2.0
[was released](https://blog.tensorflow.org/2019/09/tensorflow-20-is-now-available.html),
and made strong assumptions about TensorFlow evolving towards a world where
eager execution and function execution become unified and V1-specific
constructs would be deprecated and disappear. As such, control dependencies are
not supported and are instead implicit; control-flow V1 ops (such as Switch &
Merge) and deadness aren’t supported[^2]; and new device placement modelling
solutions were considered. These choices in the model enabled us to write graph
transformations as stateless DAG-to-DAG patterns that can be applied to a
subgraph, without considering the entire graph.

## Motivation

The combination of the TensorFlow and executor dialects allows for importing
most TensorFlow graphs, and the TensorFlow dialect has proven sufficient to
implement the TF/XLA bridge, the TFLite converter, and TFRT. However, the
intent was for TensorFlow 2.0 to trace TensorFlow functions directly in the
TensorFlow dialect, leaving the executor dialect only as a way to provide
limited support for TensorFlow V1 graphs.

However, the implementation of TensorFlow 2.0 didn't break away from TensorFlow
V1 entirely; instead, TensorFlow functions are layered on top of TensorFlow V1
and expose a leaky abstraction over the classical graph. As a result, the
TensorFlow dialect never got in a position to be enabled by default in
TensorFlow. In particular, there are many subtle ways in which TensorFlow
functions diverge from the sequential eager interpretation. For example, the
following pattern has been recommended to users who intended to call a function
`bar` knowing that the first argument wasn’t necessary if they only used the
first result.

```
  @tf.function
  def foo(z):
    x = tf.compat.v1.placeholder(tf.int32)
    y, _ = bar(x, z)
    return y
```

The use of a placeholder would throw an exception in eager mode, but “works” in
graph mode as long as inlining and pruning ensure the placeholder is removed
before execution.

Other cases involve the need for control dependencies beyond what the
auto-control-dependency tracking offers. For example,
[tf.recompute\_grad](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/custom_gradient.py#L497)
creates control dependencies on non-side-effecting ops to gain finer-grained
control of memory usage.
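
To illustrate the general idea (a minimal sketch, not the actual
`tf.recompute_grad` implementation), a control dependency on a value with no
side effect can be used purely to sequence when another computation is allowed
to start:

```
import tensorflow as tf

@tf.function
def f(x):
  a = tf.sin(x)  # pure, non-side-effecting computation
  # Force `b` to start only after `a` has been produced, e.g. to bound peak
  # memory usage. This control edge is exactly what gets lost if the dialect
  # cannot model control dependencies on non-side-effecting ops.
  with tf.control_dependencies([a]):
    b = tf.cos(x)
  return a + b
```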

Finally, the error modelling in TensorFlow can also be surprising. While in
eager op-by-op mode the execution is interrupted as soon as an error occurs,
`tf.function` tracing does not consider error handling as side-effecting
(otherwise it would have to add a control dependency between every node!). As
such, a program like:

```
@tf.function
def foo(x, y, variable):
  b = tf.matmul(x, y)
  variable.assign(1.0)
  return b
```

does not guarantee that the assignment to the variable won’t occur if an error
occurs while processing the matmul. So calling:

```
foo(1., 2., variable)
```

throws an exception because `tf.matmul` expects rank-2 tensors, but the variable
may or may not have been assigned. As such, a user may want to opt in to a safer
behavior for their function:

```
@tf.function
def foo(x, y, variable):
  b = tf.matmul(x, y)
  with tf.control_dependencies([b]):
    variable.assign(1.0)
  return b
```

However, this control dependency cannot be modelled in the TensorFlow dialect:
it will just be dropped! There is no way today in the TensorFlow dialect to
prevent the variable assignment from being executed ahead of the `matmul`.

While many of these cases could be modeled with different constructs at the
source level, this would be a major overhaul of TensorFlow itself, and more
importantly its ecosystem. Instead, we recognize that the TensorFlow dialect as
it exists today cannot support all of these use cases, and this has prevented
MLIR from providing a general graph transformation solution for TensorFlow,
contributing to more fragmentation instead of reducing it as promised.

The rest of this document describes how this new dialect follows a more
pragmatic approach to enable MLIR deployment in TensorFlow.

## Design

This new dialect intends to allow us to replace Grappler and the existing graph
transformations, for TensorFlow V1 and V2, without constraints. As such, the
main principle is to support a perfect round-trip between TensorFlow
Graph/GraphDef and MLIR.

### General Operations

An individual TensorFlow `NodeDef` is translated into an individual MLIR
operation using the following form:

```
%AddV2, %ctl = tfg.AddV2(%placeholder, %placeholder_1) [%ctl_1, %ctl_2]
                     device("GPU") assigned_device("TPU") name("add")
                     {some_attribute = "some attr!"}
                     : (tensor<*xi32>, tensor<*xi32>) -> (tensor<*xi32>)
```

*   Each operation returns an optional variadic number of tensors as well as a
    control token to express potential control dependencies.
*   The node type is carried in the operation mnemonic.
*   The list of regular inputs is in-between parentheses.
*   Optional control dependencies are exposed after the regular inputs and
    printed between square brackets.
*   The pre-placement “requested device” as well as the post-placement “assigned
    device” information are preserved.
*   The node name is carried as a first-class attribute.
*   Optional “op specific” attributes can be listed between curly brackets.
*   Finally the type signature follows, omitting the control dependencies.

This structure allows for a perfect round-trip to NodeDef, while still being
ergonomic when manipulating it in MLIR (compared to the `tf_executor` dialect
for example). The tradeoff we are making here is that we preserve all
attributes, including the “derived” ones[^3], which creates some amount of
redundancy with the signature. We may consider pruning these redundant
attributes in the future in the same way as we do in the TensorFlow dialect.
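
For comparison, the following Python sketch (the node names and the extra
control dependency are purely illustrative) builds a similar node and prints the
resulting GraphDef, so the `NodeDef` fields that the MLIR form above preserves
(op, regular inputs, `^`-prefixed control inputs, requested device, and
attributes) can be inspected directly; the assigned device is only filled in
later by the placer on the in-memory Graph:

```
import tensorflow as tf

g = tf.Graph()
with g.as_default():
  p0 = tf.compat.v1.placeholder(tf.int32, name="placeholder")
  p1 = tf.compat.v1.placeholder(tf.int32, name="placeholder_1")
  dep = tf.no_op(name="ctl_1")
  # Requested device and a control dependency, mirroring the MLIR form above.
  with tf.device("/device:GPU:0"), tf.control_dependencies([dep]):
    tf.raw_ops.AddV2(x=p0, y=p1, name="add")

# The NodeDef for "add" lists op="AddV2", its two regular inputs, the
# "^ctl_1" control input, the requested device, and the "T" attribute.
print(g.as_graph_def())
```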

### Graph Operation

A structural operation is introduced as a container: `tfg.graph` acts as a bag
of unordered TensorFlow operations, and carries a “version” attribute that
corresponds to the
[VersionDef](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/versions.proto)
present in GraphDef:

```
tfg.graph #tfg.version<producer = 42, min_consumer = 33> {
  %arg0, %ctl_0 = tfg.placeholder() : () -> (tensor<*xi32>)
  %add, %ctl_1 = tfg.AddV2(%arg0, %arg1)
                    : (tensor<*xi32>, tensor<*xi32>) -> (tensor<*xi32>)
  %arg1, %ctl_2 = tfg.placeholder() : () -> (tensor<*xi32>)
}
```

Note that the `AddV2` operation is using the result of a `placeholder` operation
that is defined later in the list. This wasn’t possible in MLIR 2 years ago when
the TensorFlow dialect was designed. There was actually an
[attempt to allow such unordered semantics](https://groups.google.com/a/tensorflow.org/g/mlir/c/gPQFIy9XpVw/m/hfxmBGF8AQAJ)
and break away from the CFG-centric representation, but we couldn’t reach a
consensus, and some key members of the team believed that a departure from
CFG/SSA would limit the reusability of many algorithms. On the other hand, this
choice prevented us from designing a graph dialect that can just replace the
TensorFlow Graph structure as-is. Since then MLIR has evolved to become more
general and this feature is now available (it was motivated by the
[support for HW synthesis tools](https://llvm.discourse.group/t/rfc-allowing-dialects-to-relax-the-ssa-dominance-condition/833)).
Another recent development that also made it more friendly is the
[removal of the requirement for terminators](https://llvm.discourse.group/t/rfc-making-terminator-optional-for-single-block-graph-regions/2997):
the `tfg.graph` operation above contains a single block listing operations, and
a terminator does not have any role to play. Finally, a Dialect can now
[act as a fallback for OpInterfaces](https://llvm.discourse.group/t/rfc-dialect-fallback-for-opinterface/3074),
which allows us to reuse more of the TensorFlow registry to provide information
to MLIR passes about TensorFlow operations without having to register them with
MLIR in the first place.

The `tfg.graph` operation round-trips almost perfectly to
[Graph](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/graph/graph.h#L504),
with the exception of the `Function Library`, which is addressed below.
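
For reference, the version information modeled by the `#tfg.version` attribute
above is the same `VersionDef` carried by every GraphDef, which can be inspected
from Python (a small illustrative snippet, nothing tfg-specific):

```
import tensorflow as tf

# The VersionDef of a GraphDef is what #tfg.version round-trips to/from.
graph_def = tf.Graph().as_graph_def()
print(graph_def.versions.producer)
print(graph_def.versions.min_consumer)
```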

### Function Library

Functions in TensorFlow are stored as
[FunctionDef](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/function.proto),
which has a signature, holds attributes, identifies argument and returned
values, and finally contains a list of nodes for its body. While on the surface
this `repeated NodeDef node_def` field looks identical to the body of
[GraphDef](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/graph.proto#L17),
there are fundamental differences in the representation, and in particular the
format in which the edges are represented is different.

To understand these differences, it is important to realize that a key aspect of
`FunctionDef`s is that they are stored uninstantiated, and can be considered in
a similar way to a C++ template function. The signature is actually an `OpDef`,
and just like any regular TensorFlow operation, the types of the arguments and
the results are encoded and constrained with attributes. These attributes are
only provided or inferred based on the function’s use: the call-site is
responsible for instantiating a function before its body can be represented as
a Graph. Because of this, the body of an uninstantiated function is modeled
differently than a Graph body:

```
  tfg.func generic @foo(%arg0 : !tfg.tensor {tfg.name = "input"},
                        %arg1 : !tfg.tensor {tfg.name = "another_input"})
      -> (!tfg.tensor {tfg.name = "result1"},
          !tfg.tensor {tfg.name = "result2"})
      attributes {description = "function foo"} {
    %Greater, %ctl_0 = tfg.Greater(%arg0, %arg1) name("Greater")
    %G_z = tfg.get_result(%Greater) "z" : 0
    %Switch, %ctl_1 = tfg.Switch(%G_z, %G_z) name("cond/Switch")
    %s_true = tfg.get_result %Switch "output_true" : 0
    %s_false = tfg.get_result %Switch "output_false" : 0
    tfg.return(%s_true, %s_false) [%ctl_0]
  }
```

Note how the tensor types `!tfg.tensor` are opaque, and every operation returns
a single tensor output and a control token. The tensor output is then unpacked
by looking up individual results by name. This is particularly visible with the
`Switch` operation where the two results are accessed using `tfg.get_result`,
looking them up by name `output_true:0` and `output_false:0`. This is required
because the OpDef can define the number of outputs based on the attributes
present on the NodeDef, and these attributes can in turn be dependent on the
attributes added on the function during instantiation (you can read more about
it in the
[description of the placeholder attribute value](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/attr_value.proto#L48-L55)).
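
From the user's point of view, instantiation happens when a `tf.function` is
traced for a particular input signature. The following is a minimal sketch (the
function below is purely illustrative):

```
import tensorflow as tf

@tf.function
def foo(x, y):
  return tf.greater(x, y)

# Each distinct input signature instantiates the generic function with concrete
# dtypes/shapes, similar to instantiating a C++ template; each concrete
# function corresponds to an instantiated FunctionDef in the function library.
f32 = foo.get_concrete_function(tf.TensorSpec(None, tf.float32),
                                tf.TensorSpec(None, tf.float32))
i32 = foo.get_concrete_function(tf.TensorSpec(None, tf.int32),
                                tf.TensorSpec(None, tf.int32))
```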

Post-instantiation, a function body is similar to that of a graph:

```
  tfg.func @foo(%arg0 : tensor<*xf32> {tfg.name = "input"},
                %arg1 : tensor<*xf32> {tfg.name = "another_input"})
      -> (tensor<*xi1> {tfg.name = "result1"},
          tensor<*xi1> {tfg.name = "result2"})
      attributes {description = "function foo"} {
    %Greater, %ctl_0 = tfg.Greater(%arg0, %arg1) [%arg1.ctl] name("Greater")
                          : (tensor<*xf32>, tensor<*xf32>) -> tensor<*xi1>
    %Switch:2, %ctl_1 = tfg.Switch(%Greater, %Greater) name("cond/Switch")
                          : (tensor<*xi1>, tensor<*xi1>) -> tensor<*xi1>
    tfg.return(%Switch#0, %Switch#1) [%ctl_0]
  }
```

The operations aren’t ordered, except for the `tfg.return`, which is a
terminator and must be the last operation. The only remaining difference with a
graph is in the handling of the function signature (arguments and returned
values), and attributes.

There is one aspect of the modelling worth mentioning from the MLIR point of
view: FunctionDef allows for nodes in a graph to express input control
dependencies from function arguments. However, in MLIR you need an actual
[SSA](https://en.wikipedia.org/wiki/Static_single_assignment_form) value to add
an edge between two operations. These values are typed, and this is why
operations define a control token (like `%ctl_0`). We apply the same recipe for
arguments and define a control token for each of them. We omit these “shadow
arguments” from the textual form, but in memory the MLIR function really has 4
arguments:

```
  tfg.func @foo(%arg0 : tensor<*xf32> {tfg.name = "input"}, %arg0.ctl : !tfg.control,
                %arg1 : tensor<*xf32> {tfg.name = "another_input"}, %arg1.ctl : !tfg.control)
      -> (tensor<*xi1> {tfg.name = "result1"},
          tensor<*xi1> {tfg.name = "result2"})
      attributes {description = "function foo"} {
   ...
```

The convention is that callers are only exposed to the non-control inputs
(`%arg0` and `%arg1`) while the control tokens are only intended to be visible
and used in the body. This aligns closely with how TensorFlow works. Inside the
body, values for the control dependencies on the arguments are available with a
`.ctl` suffix (i.e. `%arg0.ctl` and `%arg1.ctl`).

### Saved Model

The building blocks above are enough to model `GraphDef`, but not the entirety
of SavedModel. However, most of the use cases that we’re targeting right now are
in the scope of the existing GraphOptimization and Grappler APIs, which aren’t
really coupled to SavedModel. The user can load a SavedModel independently of
MLIR and invoke MLIR transformations on a Function or Graph from there. There is
also already a dialect to model the specific aspects of SavedModel; it currently
wraps around the TensorFlow executor dialect and the TensorFlow dialect, and we
may look into integrating it with the `tfg` dialect in the future. For these
reasons, we mostly leave modeling the SavedModel for future work right now.

### Future Enhancements

Functional control-flow is modeled with nodes in the graph invoking functions in
the library. MLIR supports `region`s, a concept that allows attaching subgraphs
directly inside a graph, making it more friendly to optimizations. For example,
a conditional operation can represent its two branches as subgraphs directly in
the TensorFlow dialect as follows:

```
  %0, %1, %2 = "tf.IfRegion"(%arg0) ({
    %t0 = "tf.Abs"(%arg1) : (tensor<2xf32>) -> tensor<2xf32>
    %t1 = "tf.Acos"(%arg1) : (tensor<2xf32>) -> tensor<2xf32>
    %t2 = "tf.Acosh"(%arg1) : (tensor<2xf32>) -> tensor<2xf32>
    "tf.Yield"(%t0, %t1, %t2) : (tensor<2xf32>, tensor<2xf32>, tensor<2xf32>) -> ()
  }, {
    %e0 = "tf.Neg"(%arg1) : (tensor<2xf32>) -> tensor<2xf32>
    %e1 = "tf.Relu"(%arg1) : (tensor<2xf32>) -> tensor<2xf32>
    %e2 = "tf.Sin"(%arg1) : (tensor<2xf32>) -> tensor<2xf32>
    "tf.Yield"(%e0, %e1, %e2) : (tensor<2xf32>, tensor<2xf32>, tensor<2xf32>) -> ()
  }): (tensor<i1>) -> (tensor<2xf32>, tensor<2xf32>, tensor<2xf32>)
  %3 = "tf.Add"(%0, %1) : (tensor<2xf32>, tensor<2xf32>) -> tensor<2xf32>
  %4 = "tf.Add"(%2, %3) : (tensor<2xf32>, tensor<2xf32>) -> tensor<2xf32>
```

## Integration

MLIR transformations in this dialect will operate on a module that will contain
at most one `graph` operation as well as a list of functions. This interface
makes such transformations suitable to fit within Grappler or as
GraphOptimization passes, interchangeably.

Instead of a flat graph, an entry function will be provided when feeds/fetches
are available for the main graph (PRE_PLACEMENT graph optimizations execute in
Session before feeds/fetches are provided).

## FAQ

### Why not just use the TensorFlow Executor Dialect?

The executor dialect wasn’t designed for writing transformations: it was
designed as a wrapper around the TensorFlow dialect. The intent was for it to be
a stepping stone to integrate MLIR and TensorFlow, and then disappear when
TensorFlow V1 graphs would be deprecated. This new dialect embraces TensorFlow
as it is instead of as we wish it would be.

In particular, the executor dialect represents each TensorFlow node as an
isolated “subgraph” nested under an “island” operation. This requires 3
operations and an extra region for each TensorFlow node, which is quite
inefficient in memory and requires extra indirection when pattern matching or
updating nodes in the graph.

### What happens to the existing TensorFlow Dialects?

The existing TensorFlow dialect is suitable for representing a large subset of
TensorFlow programs (like models intended for conversion to TFLite, or for XLA),
and for such cases we will continue to use it.

### What happens to the existing TensorFlow Executor Dialect?

This new TensorFlow Graph Dialect could be used to replace the Executor Dialect
as the standalone staging format used when importing. Importing from
GraphDef/Graph would always go through the TensorFlow Graph Dialect before using
some clustering or promotion algorithms to raise some subgraphs to the
TensorFlow Dialect, just like we cluster island operations in the TensorFlow
Executor Dialect today. The details of such mechanisms are left for future work.

<!-- Footnotes -->

[^1]: While the semantic model is sequential, this does not prevent an
    implementation from executing operations in parallel when proven safe. This
    is similar to how a superscalar CPU involves implicit parallelism. For
    example, when mapping the TensorFlow dialect to TFRT, only side-effecting
    operations (Variable accesses for example) are sequenced.
[^2]: One of the first tools built with this was the TF->TFLite converter
    (replacing TOCO). Since V1 control-flow isn’t supported on TFLite this
    wasn’t a limitation.
[^3]: Derived attributes are a concept used in the TensorFlow dialect: since
    MLIR models type and shape information on each individual result produced
    by an operation, some attributes that are inserted for the sole purpose of
    typing are redundant and eliminated in MLIR.
401