# TensorFlow Graph IR

This directory contains the definition of the Intermediate Representation (IR)
for TensorFlow graphs using MLIR.

## Introduction

This directory defines an MLIR dialect, the “TensorFlow Graph dialect”, that
accurately represents TensorFlow graphs. Contrary to the previous TensorFlow
dialect, which made some opinionated choices that diverged from GraphDef and
TensorFlow Graph semantics, this dialect embraces TensorFlow Graph as it is. In
particular, the concepts of control dependencies, requested device, assigned
device, and node name are all first-class attributes on the MLIR operations in
this dialect.

The main principle that drove the development of this dialect is to ensure a
perfect round-trip and general compatibility with existing TensorFlow
semantics, so that this solution can be deployed by default in any situation
where "Graph Optimization" and Grappler transformations are involved today,
regardless of TensorFlow V1 or V2. This new approach is also made possible by
evolutions in MLIR that allow representing graphs in a way that wasn’t possible
before (more in the [Graph operation design](#graph_operation_design) section
below).

## History of Dialects for TensorFlow

MLIR started with a basic structure reflecting LLVM in that it defined a
`Module` containing a list of `Functions`. Each of these defined a body
constrained to be a Control-Flow Graph (CFG): a list of `Blocks`, each of them
containing a list of `Operations`. A fundamental aspect of the CFG
representation is the notion of “control”: the abstract semantic model
considers that a single `Operation` executes at a given time, and the next
`Operation` to execute is necessarily the one listed immediately after[^1]. The
last `Operation` in a `Block` is a `Terminator`: it decides to which `Block`
control is transferred next (think of a branch).

When MLIR started, a first dialect -- which we referred to as the “TF control
dialect” -- was developed to model TensorFlow graphs. This dialect supported
control dependencies but didn’t allow cycles in the graph, which forced some
tricks to model TensorFlow V1 loops, in particular the `NextIteration`
operation. While this dialect enabled some experimentation, it wasn’t seen as
really practical, and it co-existed with another dialect: the “tf” dialect that
we’re using currently. This dialect was designed before TF2.0
[was released](https://blog.tensorflow.org/2019/09/tensorflow-20-is-now-available.html),
and made strong assumptions about TensorFlow evolving towards a world where
eager execution and function execution become unified and V1-specific
constructs would be deprecated and disappear. As such, control dependencies are
not supported and are instead implicit, control-flow V1 ops (such as Switch &
Merge) and deadness aren’t supported[^2], and new device placement modelling
solutions were considered. These choices in the model enabled us to write graph
transformations as stateless DAG-to-DAG patterns that can be applied to a
subgraph, without considering the entire graph.

## Motivation

The combination of the TensorFlow and executor dialects allows for importing
most TensorFlow graphs, and the TensorFlow dialect has proven sufficient to
implement the TF/XLA bridge, the TFLite converter, and TFRT.
However, the intent was for
TensorFlow 2.0 to trace TensorFlow functions directly in the TensorFlow
dialect, leaving the executor dialect only as a way to provide limited support
for TensorFlow V1 graphs.

However, the implementation of TensorFlow 2.0 didn’t break away from TensorFlow
V1 entirely; instead, TensorFlow functions are wrapped on top of TensorFlow V1
and expose a leaky abstraction over the classical graph. As a result, the
TensorFlow dialect never got in a position to be enabled by default in
TensorFlow. In particular, there are many subtle ways in which TensorFlow
functions diverge from the sequential eager interpretation. For example, the
following pattern has been recommended to users who intended to call a function
`bar` knowing that the first argument wasn’t necessary if they only used the
first result.

```
@tf.function
def foo(z):
  x = tf.compat.v1.placeholder(tf.int32)
  y, _ = bar(x, z)
  return y
```

The use of a placeholder would throw an exception in eager mode, but “works” in
graph mode as long as inlining and pruning ensure the placeholder is removed
before execution.

Other cases involve the need for control dependencies beyond what the
auto-control-dependency tracking offers. For example,
[tf.recompute\_grad](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/custom_gradient.py#L497)
creates control dependencies on non-side-effecting ops to have finer-grained
control over memory usage.

Finally, the error modelling in TensorFlow can also be surprising. While in
eager op-by-op mode the execution is interrupted as soon as an error occurs,
`tf.function` tracing does not consider error handling as side-effecting
(otherwise it would have to add a control dependency between every node!), and
as such a program like:

```
@tf.function
def foo(x, y, variable):
  b = tf.matmul(x, y)
  variable.assign(1.0)
  return b
```

does not guarantee that the assignment to the variable won’t occur if an error
occurs while processing the matmul, so calling:

```
foo(1., 2., variable)
```

throws an exception because `tf.matmul` expects rank-2 tensors, but the
variable may or may not have been assigned. As such, a user may want to opt in
to a safer behavior for their function:

```
@tf.function
def foo(x, y, variable):
  b = tf.matmul(x, y)
  with tf.control_dependencies([b]):
    variable.assign(1.0)
  return b
```

However, this control dependency cannot be modelled in the TensorFlow dialect:
it is simply dropped! There is no solution today to prevent the variable
assignment from being executed ahead of the `matmul` in the TensorFlow dialect.
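
For illustration, here is a sketch of how such a dependency could be preserved
as an explicit, first-class control edge in the graph dialect introduced below
(the op names, types, and exact printed form here are illustrative; the actual
syntax is described in the Design section):

```
// Illustrative sketch only: operands such as %x, %y, %variable, and %value are
// assumed to be defined elsewhere in the graph.
%MatMul, %ctl = tfg.MatMul(%x, %y) name("matmul")
    : (tensor<*xf32>, tensor<*xf32>) -> (tensor<*xf32>)
// The assignment lists the MatMul's control token as an explicit control
// dependency, so it cannot be scheduled ahead of the MatMul.
%ctl_0 = tfg.AssignVariableOp(%variable, %value) [%ctl] name("assign")
    : (tensor<*x!tf_type.resource>, tensor<*xf32>) -> ()
```
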
While many of these cases could be modeled with different constructs at the
source level, this would be a major overhaul of TensorFlow itself, and more
importantly of its ecosystem. Instead, we recognize that the TensorFlow dialect
as it exists today cannot support all of these use cases, and this prevented
MLIR from providing a general graph transformation solution for TensorFlow,
contributing to more fragmentation instead of reducing it as promised.

The rest of this document describes how this new dialect follows a more
pragmatic approach to enable MLIR deployment in TensorFlow.

## Design

This new dialect intends to allow us to replace Grappler and existing graph
transformations, for TensorFlow V1 and V2, without constraints. As such, the
main principle is to support a perfect round-trip between TensorFlow
Graph/GraphDef and MLIR.

### General Operations

An individual TensorFlow `NodeDef` is translated into an individual MLIR
operation using the following form:

```
%AddV2, %ctl = tfg.AddV2(%placeholder, %placeholder_1) [%ctl_1, %ctl_2]
                   device("GPU") assigned_device("TPU") name("add")
                   {some_attribute = "some attr!"}
                   : (tensor<*xi32>, tensor<*xi32>) -> (tensor<*xi32>)
```

*   Each operation returns an optional variadic number of tensors as well as a
    control token to express potential control dependencies.
*   The node type is carried in the operation mnemonic.
*   The list of regular inputs is in between parentheses.
*   Optional control dependencies are exposed after the regular inputs and
    printed between square brackets.
*   The pre-placement “requested device” as well as the post-placement
    “assigned device” information are preserved.
*   The node name is carried as a first-class attribute.
*   Optional “op specific” attributes can be listed between curly brackets.
*   Finally, the type signature follows, omitting the control dependencies.

This structure allows for a perfect round-trip to `NodeDef`, while still being
ergonomic when manipulating it in MLIR (compared to the `tf_executor` dialect
for example). The tradeoff we are making here is that we preserve all
attributes, including the “derived” ones[^3], which creates some amount of
redundancy with the signature. We may consider pruning these redundant
attributes in the future in the same way as we do in the TensorFlow dialect.

### Graph Operation

A structural operation is introduced as a container: `tfg.graph` acts as a bag
of unordered TensorFlow operations and carries a “version” attribute that
corresponds to the
[VersionDef](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/versions.proto)
present in GraphDef:

```
tfg.graph #tfg.version<producer = 42, min_consumer = 33> {
  %arg0, %ctl_0 = tfg.placeholder() : () -> (tensor<*xi32>)
  %add, %ctl_1 = tfg.AddV2(%arg0, %arg1)
      : (tensor<*xi32>, tensor<*xi32>) -> (tensor<*xi32>)
  %arg1, %ctl_2 = tfg.placeholder() : () -> (tensor<*xi32>)
}
```

Note that the `AddV2` operation is using the result of a `placeholder` operation
that is defined later in the list. This wasn’t possible in MLIR 2 years ago when
the TensorFlow dialect was designed. It was actually
[attempted to allow such unordered semantics](https://groups.google.com/a/tensorflow.org/g/mlir/c/gPQFIy9XpVw/m/hfxmBGF8AQAJ)
and break away from the CFG-centric representation, but we couldn’t reach a
consensus, and some key members of the team believed that a departure from
CFG/SSA would limit the reusability of many algorithms. On the other hand, this
choice prevented us from designing a graph dialect that can just replace the
TensorFlow Graph structure as-is. Since then, MLIR has evolved to become more
general and this feature is now available (it was motivated by the
[support for HW synthesis tools](https://llvm.discourse.group/t/rfc-allowing-dialects-to-relax-the-ssa-dominance-condition/833)).
Another recent development that also makes this representation friendlier is
the
[removal of the requirement for terminators](https://llvm.discourse.group/t/rfc-making-terminator-optional-for-single-block-graph-regions/2997):
the `tfg.graph` operation above contains a single block listing operations, and
a terminator does not have any role to play. Finally, a Dialect can now
[act as a fallback for OpInterfaces](https://llvm.discourse.group/t/rfc-dialect-fallback-for-opinterface/3074),
which allows us to reuse more of the TensorFlow registry to provide information
to MLIR passes about TensorFlow operations without having to register them with
MLIR in the first place.

The `tfg.graph` operation round-trips almost perfectly to
[Graph](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/graph/graph.h#L504),
with the exception of the `Function Library`, which is addressed below.

### Function Library

Functions in TensorFlow are stored as
[FunctionDef](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/function.proto),
which has a signature, holds attributes, identifies argument and returned
values, and finally contains a list of nodes for its body. While on the surface
this `repeated NodeDef node_def` field looks identical to the body of
[GraphDef](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/graph.proto#L17),
there are fundamental differences in the representation, and in particular the
format in which edges are represented is different.

To understand these differences, it is important to realize that a key aspect
of a `FunctionDef` is that it is stored uninstantiated, and can be considered
in a similar way to a C++ template function. The signature is actually an
`OpDef`, and just like any regular TensorFlow operation, the types of the
arguments and the results are encoded and constrained with attributes. These
attributes are only provided or inferred based on the function’s use: the
call-site is responsible for instantiating a function before its body can be
represented as a Graph. Because of this, the body of an uninstantiated function
is modeled differently than a Graph body:

```
tfg.func generic @foo(%arg0 : !tfg.tensor {tfg.name = "input"},
                      %arg1 : !tfg.tensor {tfg.name = "another_input"})
     -> (!tfg.tensor {tfg.name = "result1"},
         !tfg.tensor {tfg.name = "result2"})
     attributes {description = "function foo"} {
  %Greater, %ctl_0 = tfg.Greater(%arg0, %arg1) name("Greater")
  %G_z = tfg.get_result(%Greater) "z" : 0
  %Switch, %ctl_1 = tfg.Switch(%G_z, %G_z) name("cond/Switch")
  %s_true = tfg.get_result %Switch "output_true" : 0
  %s_false = tfg.get_result %Switch "output_false" : 0
  tfg.return(%s_true, %s_false) [%ctl_0]
}
```

Note how the tensor types `!tfg.tensor` are opaque, and every operation returns
a single tensor output and a control token. The tensor output is then unpacked
by looking up individual results by name. This is particularly visible with the
`Switch` operation, where the two results are accessed using `tfg.get_result`
by looking them up by name, `output_true:0` and `output_false:0`.
This is required
because the `OpDef` can define the number of outputs based on the attributes
present on the `NodeDef`, and these attributes can in turn depend on the
attributes added to the function during instantiation (you can read more about
it in the
[description of the placeholder attribute value](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/attr_value.proto#L48-L55)).

Post-instantiation, a function body is similar to that of a graph:

```
tfg.func @foo(%arg0 : tensor<*xf32> {tfg.name = "input"},
              %arg1 : tensor<*xf32> {tfg.name = "another_input"})
     -> (tensor<*xi1> {tfg.name = "result1"},
         tensor<*xi1> {tfg.name = "result2"})
     attributes {description = "function foo"} {
  %Greater, %ctl_0 = tfg.Greater(%arg0, %arg1) [%arg1.ctl] name("Greater")
      : (tensor<*xf32>, tensor<*xf32>) -> tensor<*xi1>
  %Switch:2, %ctl_1 = tfg.Switch(%Greater, %Greater) name("cond/Switch")
      : (tensor<*xi1>, tensor<*xi1>) -> (tensor<*xi1>, tensor<*xi1>)
  tfg.return(%Switch#0, %Switch#1) [%ctl_0]
}
```

The operations aren’t ordered, except for `tfg.return`, which is a terminator
and must be the last operation. The only remaining difference with a graph is
in the handling of the function signature (arguments and returned values) and
attributes.

There is one aspect of the modelling worth mentioning from the MLIR point of
view: `FunctionDef` allows nodes in a graph to express input control
dependencies on function arguments. However, in MLIR you need an actual
[SSA](https://en.wikipedia.org/wiki/Static_single_assignment_form) value to add
an edge between two operations. These values are typed, and this is why
operations define a control token (like `%ctl_0`). We apply the same recipe to
arguments and define a control token for each of them. We omit these “shadow
arguments” from the textual form, but in memory the MLIR function really has
four arguments:

```
tfg.func @foo(%arg0 : tensor<*xf32> {tfg.name = "input"}, %arg0.ctl : !tfg.control,
              %arg1 : tensor<*xf32> {tfg.name = "another_input"}, %arg1.ctl : !tfg.control)
     -> (tensor<*xi1> {tfg.name = "result1"},
         tensor<*xi1> {tfg.name = "result2"})
     attributes {description = "function foo"} {
  ...
```

The convention is that callers are only exposed to the non-control inputs
(`%arg0` and `%arg1`), while the control tokens are only intended to be visible
and used in the body. This keeps the model closely aligned with how TensorFlow
works. Inside the body, the control tokens for the arguments are available with
a `.ctl` suffix (i.e. `%arg0.ctl` and `%arg1.ctl`).

### Saved Model

The building blocks above are enough to model `GraphDef`, but not the entirety
of SavedModel. However, most of the use cases that we’re targeting right now
are in the scope of the existing GraphOptimization and Grappler APIs, which
aren’t really coupled to SavedModel. The user can load a SavedModel
independently of MLIR and invoke MLIR transformations on a Function or Graph
from there. There is also already a dialect that models the specific aspects of
SavedModel; it currently wraps around the TensorFlow executor dialect and the
TensorFlow dialect, and we may look into integrating it with the `tfg` dialect
in the future. For these reasons, we mostly leave modeling the SavedModel as
future work for now.
### Future Enhancements

Functional control-flow is modeled with nodes in the graph invoking functions
in the library. MLIR supports `region`s, a concept that allows attaching
subgraphs directly inside a graph, making it more friendly to optimizations.
For example, a conditional operation can represent its two branch subgraphs
directly in the TensorFlow dialect as follows:

```
%0, %1, %2 = "tf.IfRegion"(%arg0) ({
  %t0 = "tf.Abs"(%arg1) : (tensor<2xf32>) -> tensor<2xf32>
  %t1 = "tf.Acos"(%arg1) : (tensor<2xf32>) -> tensor<2xf32>
  %t2 = "tf.Acosh"(%arg1) : (tensor<2xf32>) -> tensor<2xf32>
  "tf.Yield"(%t0, %t1, %t2) : (tensor<2xf32>, tensor<2xf32>, tensor<2xf32>) -> ()
}, {
  %e0 = "tf.Neg"(%arg1) : (tensor<2xf32>) -> tensor<2xf32>
  %e1 = "tf.Relu"(%arg1) : (tensor<2xf32>) -> tensor<2xf32>
  %e2 = "tf.Sin"(%arg1) : (tensor<2xf32>) -> tensor<2xf32>
  "tf.Yield"(%e0, %e1, %e2) : (tensor<2xf32>, tensor<2xf32>, tensor<2xf32>) -> ()
}) : (tensor<i1>) -> (tensor<2xf32>, tensor<2xf32>, tensor<2xf32>)
%3 = "tf.Add"(%0, %1) : (tensor<2xf32>, tensor<2xf32>) -> tensor<2xf32>
%4 = "tf.Add"(%2, %3) : (tensor<2xf32>, tensor<2xf32>) -> tensor<2xf32>
```

## Integration

MLIR transformations in this dialect will operate on a module that will contain
at most one `graph` operation as well as a list of functions. This interface
will make such transformations suitable to fit within Grappler or
GraphOptimization interchangeably.

Instead of a flat graph, an entry function will be provided when feeds/fetches
are available for the main graph (PRE_PLACEMENT graph optimizations execute in
Session before feeds/fetches are provided).
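
As an illustrative sketch (reusing the examples from the Design section, with
types and attributes abbreviated; the exact printed form may differ), such a
module could look like:

```
module {
  // At most one graph operation per module...
  tfg.graph #tfg.version<producer = 42, min_consumer = 33> {
    %placeholder, %ctl = tfg.placeholder() : () -> (tensor<*xi32>)
    %add, %ctl_0 = tfg.AddV2(%placeholder, %placeholder)
        : (tensor<*xi32>, tensor<*xi32>) -> (tensor<*xi32>)
  }
  // ...alongside the function library.
  tfg.func @foo(%arg0 : tensor<*xf32> {tfg.name = "input"},
                %arg1 : tensor<*xf32> {tfg.name = "another_input"})
      -> (tensor<*xi1> {tfg.name = "result"}) {
    %Greater, %ctl = tfg.Greater(%arg0, %arg1) name("Greater")
        : (tensor<*xf32>, tensor<*xf32>) -> tensor<*xi1>
    tfg.return(%Greater) [%ctl]
  }
}
```
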
## FAQ

### Why not just use the TensorFlow Executor Dialect?

The executor dialect wasn’t designed for writing transformations: it was
designed as a wrapper around the TensorFlow dialect, intended to be a stepping
stone to integrate MLIR and TensorFlow, and then to disappear when TensorFlow
V1 graphs would be deprecated. This new dialect embraces TensorFlow as it is
instead of as we wish it would be.

In particular, the executor dialect represents each TensorFlow node as an
isolated “subgraph” nested under an “island” operation. This requires 3
operations and an extra region for each TensorFlow node, which is quite
inefficient in memory and requires extra indirection when pattern matching or
updating nodes in the graph.

### What happens to the existing TensorFlow Dialect?

The existing TensorFlow dialect is suitable for representing a large subset of
TensorFlow programs (like models intended for conversion to TFLite or XLA), and
for such cases we will continue to use it.

### What happens to the existing TensorFlow Executor Dialect?

This new TensorFlow Graph dialect could be used to replace the executor dialect
as the standalone staging format for importing: importing from GraphDef/Graph
would always go through the TensorFlow Graph dialect before using some
clustering or promotion algorithms to raise some subgraphs to the TensorFlow
dialect, just like we cluster island operations in the TensorFlow Executor
dialect today. The details of such mechanisms are left for future work.

<!-- Footnotes -->

[^1]: While the semantic model is sequential, this does not prevent an
    implementation from executing operations in parallel when proven safe. This
    is similar to how a superscalar CPU involves implicit parallelism. For
    example, when mapping the TensorFlow dialect to TFRT, only side-effecting
    operations (variable accesses, for example) are sequenced.
[^2]: One of the first tools built with this was the TF->TFLite converter
    (replacing TOCO). Since V1 control-flow isn’t supported on TFLite, this
    wasn’t a limitation.
[^3]: Derived attributes are a concept used in the TensorFlow dialect: since
    MLIR models type and shape information on each individual result produced
    by an operation, some attributes that are inserted for the sole purpose of
    typing are redundant and eliminated in MLIR.