op {
  graph_op_name: "StopGradient"
  summary: "Stops gradient computation."
  description: <<END
When executed in a graph, this op outputs its input tensor as-is.

When building ops to compute gradients, this op prevents the contribution of
its inputs from being taken into account.  Normally, the gradient generator
adds ops to a graph to compute the derivatives of a specified 'loss' by
recursively finding the inputs that contributed to its computation.  If you
insert this op in the graph, its inputs are masked from the gradient
generator.  They are not taken into account for computing gradients.
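
For instance, a minimal eager-mode sketch using `tf.GradientTape` (an
illustration, not part of this op itself) shows both properties: the wrapped
expression keeps its value, but no gradient flows back through it.

```python

  import tensorflow as tf

  x = tf.constant(3.0)
  with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)
    y = x * x                    # contributes to gradients as usual
    z = tf.stop_gradient(x * x)  # same value as y, but masked from gradients

  print(tape.gradient(y, x))  # tf.Tensor(6.0, shape=(), dtype=float32)
  print(tape.gradient(z, x))  # None: no path from z back to x
```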

This is useful any time you want to compute a value with TensorFlow but need
to pretend that the value was a constant. For example, the softmax function
for a vector x can be written as

```python

  def softmax(x):
    numerator = tf.exp(x)
    denominator = tf.reduce_sum(numerator)
    return numerator / denominator
```

This, however, is susceptible to overflow if the values in x are large. An
alternative, more stable way is to subtract the maximum of x from each of the
values.

```python

  def stable_softmax(x):
    z = x - tf.reduce_max(x)
    numerator = tf.exp(z)
    denominator = tf.reduce_sum(numerator)
    return numerator / denominator
```

However, when we backprop through the softmax to x, we don't want to backprop
through the `tf.reduce_max(x)` calculation (if the max values are not unique,
the gradient could flow to the wrong input); instead, we want to treat it as a
constant. Therefore, we should write this out as

```python

  def stable_softmax(x):
    z = x - tf.stop_gradient(tf.reduce_max(x))
    numerator = tf.exp(z)
    denominator = tf.reduce_sum(numerator)
    return numerator / denominator
```
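
As a quick sanity check (a sketch only, reusing the `softmax` and
`stable_softmax` definitions above), both versions produce the same values and
the same gradients with respect to x on inputs that do not overflow; only the
overflow behavior differs.

```python

  import tensorflow as tf

  x = tf.constant([1.0, 2.0, 3.0])
  with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)
    naive = softmax(x)          # overflows for large x
    stable = stable_softmax(x)  # numerically safe; the max is a constant

  print(tf.reduce_max(tf.abs(naive - stable)))          # ~0
  print(tape.gradient(tf.reduce_sum(naive ** 2), x))
  print(tape.gradient(tf.reduce_sum(stable ** 2), x))   # matches the line above
```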

Some other examples include:

*  The *EM* algorithm where the *M-step* should not involve backpropagation
   through the output of the *E-step*.
*  Contrastive divergence training of Boltzmann machines where, when
   differentiating the energy function, the training must not backpropagate
   through the graph that generated the samples from the model.
*  Adversarial training, where no backprop should happen through the adversarial
   example generation process (a sketch of this case is shown below).
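
For the last case, a minimal sketch (assuming a hypothetical `model` and an
FGSM-style perturbation, neither of which is part of this op) could look like:

```python

  import tensorflow as tf

  def adversarial_example(model, x, y, eps=0.01):
    # Gradient of the loss with respect to the input.
    with tf.GradientTape() as tape:
      tape.watch(x)
      loss = tf.keras.losses.sparse_categorical_crossentropy(y, model(x))
    grad = tape.gradient(loss, x)
    # Treat the perturbed input as a constant so the training step that
    # consumes it does not backprop through the generation process.
    return tf.stop_gradient(x + eps * tf.sign(grad))
```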
END
}