xref: /aosp_15_r20/external/pytorch/torch/cuda/tunable.py (revision da0073e96a02ea20f0ac840b70461e3646d07c45)
1r"""
2This module exposes a TunableOp interface.
3
4Some operations, such as GEMMs, could be implemented using more than one library
5or more than one technique. For example, a GEMM could be implemented for CUDA or
6ROCm using either the blas or blasLt libraries. Further, ROCm's rocblas and
7hipblaslt libraries allow the user to query for all possible algorithms and then
8choose one. How does one know which implementation is the fastest and should be
9chosen? That's what TunableOp provides.
10
11Enabling TunableOp and Tuning Separately
12========================================
13
14The TunableOp feature is enabled separately from enabling the tuning phase
15itself. Enabling TunableOp means that PyTorch will replace any standard
16operators with their Tunable implementations. Any call to a TunableOp first
17checks whether it has already been tuned for the given operator inputs. If so,
18it will immediately call the tuned operation; no further tuning will take place
19even when the tuning setting is enabled. Instead if no tuning result is found,
20and tuning is enabled, the TunableOp will benchmark every registered
21implementation of that operator for the given set of inputs and select the
22fastest.
23
24File Input and Output
25=====================
26
27The first time any TunableOp is invoked, the internal database of tuned
28operations will be prepared by attempting to read the results from the given
29file. The default filename is 'tunableop_results.csv'. To support tuning when
30multiple GPUs are used across multiple processes, the GPU device ordinal is
31automatically inserted into the filename to avoid multiple processes overwriting
32the same file.
33
34If tuning is enabled and new tunings are discovered during the course of your
35workload, it will also write out to this same filename with all tunings, both
36the ones it read in at startup as well as the new ones found at runtime. This
37can be used, for example, to build up a tunings file across many workloads by
38reusing the same file. The output file is automatically created when the
39application terminates. This behavior can be controlled by the C++ and Python
40APIs but not the environment variables.
41
42Assuming you specified a filename, you'll end up with a CSV file with contents
43like so::
44
45  Validator,PT_VERSION,2.2.0
46  Validator,ROCM_VERSION,6.0.0.0-12969-1544e39
47  Validator,HIPBLASLT_VERSION,0.6.0-a9c5cc7
48  Validator,ROCBLAS_VERSION,4.0.0-72e57364-dirty
49  GemmTunableOp_float_NT,nt_25088_4096_64,1219,1.262
50  GemmTunableOp_float_NT,nt_4096_4096_64,1216,0.033
51
52Note the "Validator" lines. If you change a library verison, or ROCm version, or
53PyTorch version, TunableOp will detect this and reject the tunings file because
54the prior tunings are likely affected by other software changes.
55
56The remaining lines are the tuned solutions for each TunableOp encountered
57during your execution. Each line consists of 4 comma-separated fields: operator
58name, operator parameters, solution name, and average execution time. The
59execution time is an optional field. The CSV file can be edited, but with
60caution. For example, the solution name (field 3) can be changed to "Default"
61and it will fall back to the original PyTorch untuned implementation. Or, in the
62case of ROCm's hipBLAS or hipBLASLt libraries, if you know the specific solution
63index you can override the solution that TunableOp selected by replacing the
64value. The operator name and parameters (fields 1 and 2) are internally named
65and should not be modified. In the case of GemmTunableOp, field 1 indicates the
66datatype and whether the inputs are transposed (T) or not (N) and field 2
67indicates the M, N, K input shapes.
68
69There is an option to enable verbose output but it is only recommended for
70debugging purposes. This will produce a lot of diagnostic messages but may be
71useful to see if TunableOp is being used at all. Otherwise, TunableOp is
72completely silent, besides file output, unless there is a warning or error
73during its use. The verbose option is only available by setting the environment
74variable PYTORCH_TUNABLEOP_VEROBSE=1.
75
76A Note on Tuning Behavior
77=========================
78
79Tuning an operator consists of iterating through the list or registered
80implementations and profiling each one. The profile is established by running a
81single implementation in a loop multiple times and taking the average execution
82time.
83
84By default, each possible solution for a given operator will be run for either
85100 iterations or as many iterations that can be run within 30ms, whichever is
86smaller, and its average execution will be calculated. The fastest solution
87among all that were successfully profiled will be chosen. A profile might fail
88if the given solution doesn't achieve the same accuracy as the default
89implementation or if the solution returns an error code.
90
91Current Tunable Operators
92=========================
93
94TunableGemm for ROCm
95--------------------
96
97Currently only a TunableGemm for ROCm is implemented. Note that CUDA builds of
98PyTorch will function correctly when using TunableOp but the only solution
99available to CUDA builds is the 'Default' implementation i.e. the original
100cuBLAS default, now called through TunableOp. Any call to at::cuda::blas::gemm()
101or ::bgemm() will be routed through TunableOp when enabled. Calling gemm() for a
102given set of input arguments (transa, transb, m, n, k) will attempt to use the
103fastest available implementation across both rocblas and hipblaslt.
104
105Tuning Context
106==============
107
108The behavior of TunableOp is currently manipulated through environment
109variables, the C++ interface of at::cuda::tunable::getTuningContext(), or the
110torch.cuda.tunable python interfaces that wrap the C++ TuningContext. The
111environment variables take precedence over any setting you manipulate using the
112C++ or Python APIs.
113
114"""
115from typing import Optional, Tuple
116
117import torch
118
119
120__all__ = [
121    "enable",
122    "is_enabled",
123    "tuning_enable",
124    "tuning_is_enabled",
125    "set_max_tuning_duration",
126    "get_max_tuning_duration",
127    "set_max_tuning_iterations",
128    "get_max_tuning_iterations",
129    "set_filename",
130    "get_filename",
131    "get_results",
132    "get_validators",
133    "write_file_on_exit",
134    "write_file",
135    "read_file",
136]
137
138
139def enable(val: bool = True) -> None:
140    r"""This is the big on/off switch for all TunableOp implementations."""
141    torch._C._cuda_tunableop_enable(val)  # type: ignore[attr-defined]
142
143
144def is_enabled() -> bool:
145    r"""Returns whether the TunableOp feature is enabled."""
146    return torch._C._cuda_tunableop_is_enabled()  # type: ignore[attr-defined]
147
148
149def tuning_enable(val: bool = True) -> None:
150    r"""Enable tuning of TunableOp implementations.
151
152    When enabled, if a tuned entry isn't found, run the tuning step and record
153    the entry.
154    """
155    torch._C._cuda_tunableop_tuning_enable(val)  # type: ignore[attr-defined]
156
157
158def tuning_is_enabled() -> bool:
159    r"""Returns whether TunableOp implementations can be tuned."""
160    return torch._C._cuda_tunableop_tuning_is_enabled()  # type: ignore[attr-defined]
161
162
163def set_max_tuning_duration(duration: int) -> None:
164    r"""Set max time in milliseconds to spend tuning a given solution.
165
166    If both max tuning duration and iterations are set, the smaller of the two
167    will be honored. At minimum 1 tuning iteration will always be run.
168    """
169    torch._C._cuda_tunableop_set_max_tuning_duration(duration)  # type: ignore[attr-defined]
170
171
172def get_max_tuning_duration() -> int:
173    r"""Get max time to spend tuning a given solution."""
174    return torch._C._cuda_tunableop_get_max_tuning_duration()  # type: ignore[attr-defined]
175
176
177def set_max_tuning_iterations(iterations: int) -> None:
178    r"""Set max number of iterations to spend tuning a given solution.
179
180    If both max tuning duration and iterations are set, the smaller of the two
181    will be honored. At minimum 1 tuning iteration will always be run.
182    """
183    torch._C._cuda_tunableop_set_max_tuning_iterations(iterations)  # type: ignore[attr-defined]
184
185
186def get_max_tuning_iterations() -> int:
187    r"""Get max iterations to spend tuning a given solution."""
188    return torch._C._cuda_tunableop_get_max_tuning_iterations()  # type: ignore[attr-defined]
189
190
191def set_filename(filename: str, insert_device_ordinal: bool = False) -> None:
192    r"""Set the filename to use for input/output of tuning results.
193
194    If :attr:`insert_device_ordinal` is ``True`` then the current device ordinal
195    will be added to the given filename automatically. This can be used in a
196    1-process-per-gpu cenario to ensure all processes write to a separate file.
197    """
198    torch._C._cuda_tunableop_set_filename(filename, insert_device_ordinal)  # type: ignore[attr-defined]
199
200
201def get_filename() -> str:
202    r"""Get the results filename."""
203    return torch._C._cuda_tunableop_get_filename()  # type: ignore[attr-defined]
204
205
206def get_results() -> Tuple[str, str, str, float]:
207    r"""Return all TunableOp results."""
208    return torch._C._cuda_tunableop_get_results()  # type: ignore[attr-defined]
209
210
211def get_validators() -> Tuple[str, str]:
212    r"""Return the TunableOp validators."""
213    return torch._C._cuda_tunableop_get_validators()  # type: ignore[attr-defined]
214
215
216def write_file_on_exit(val: bool) -> None:
217    r"""During Tuning Context destruction, write file to disk.
218
219    This is useful as a final flush of your results to disk if your application
220    terminates as result of normal operation or an error. Manual flushing of
221    your results can be achieved by manually calling ``write_file()``."""
222    torch._C._cuda_tunableop_write_file_on_exit(val)  # type: ignore[attr-defined]
223
224
225def write_file(filename: Optional[str] = None) -> bool:
226    r"""Write results to a CSV file.
227
228    If :attr:`filename` is not given, ``get_filename()`` is called.
229    """
230    if filename is None:
231        filename = get_filename()
232    return torch._C._cuda_tunableop_write_file(filename)  # type: ignore[attr-defined]
233
234
235def read_file(filename: Optional[str] = None) -> bool:
236    r"""Read results from a TunableOp CSV file.
237
238    If :attr:`filename` is not given, ``get_filename()`` is called.
239    """
240    if filename is None:
241        filename = get_filename()
242    return torch._C._cuda_tunableop_read_file(filename)  # type: ignore[attr-defined]
243