1r""" 2This module exposes a TunableOp interface. 3 4Some operations, such as GEMMs, could be implemented using more than one library 5or more than one technique. For example, a GEMM could be implemented for CUDA or 6ROCm using either the blas or blasLt libraries. Further, ROCm's rocblas and 7hipblaslt libraries allow the user to query for all possible algorithms and then 8choose one. How does one know which implementation is the fastest and should be 9chosen? That's what TunableOp provides. 10 11Enabling TunableOp and Tuning Separately 12======================================== 13 14The TunableOp feature is enabled separately from enabling the tuning phase 15itself. Enabling TunableOp means that PyTorch will replace any standard 16operators with their Tunable implementations. Any call to a TunableOp first 17checks whether it has already been tuned for the given operator inputs. If so, 18it will immediately call the tuned operation; no further tuning will take place 19even when the tuning setting is enabled. Instead if no tuning result is found, 20and tuning is enabled, the TunableOp will benchmark every registered 21implementation of that operator for the given set of inputs and select the 22fastest. 23 24File Input and Output 25===================== 26 27The first time any TunableOp is invoked, the internal database of tuned 28operations will be prepared by attempting to read the results from the given 29file. The default filename is 'tunableop_results.csv'. To support tuning when 30multiple GPUs are used across multiple processes, the GPU device ordinal is 31automatically inserted into the filename to avoid multiple processes overwriting 32the same file. 33 34If tuning is enabled and new tunings are discovered during the course of your 35workload, it will also write out to this same filename with all tunings, both 36the ones it read in at startup as well as the new ones found at runtime. This 37can be used, for example, to build up a tunings file across many workloads by 38reusing the same file. The output file is automatically created when the 39application terminates. This behavior can be controlled by the C++ and Python 40APIs but not the environment variables. 41 42Assuming you specified a filename, you'll end up with a CSV file with contents 43like so:: 44 45 Validator,PT_VERSION,2.2.0 46 Validator,ROCM_VERSION,6.0.0.0-12969-1544e39 47 Validator,HIPBLASLT_VERSION,0.6.0-a9c5cc7 48 Validator,ROCBLAS_VERSION,4.0.0-72e57364-dirty 49 GemmTunableOp_float_NT,nt_25088_4096_64,1219,1.262 50 GemmTunableOp_float_NT,nt_4096_4096_64,1216,0.033 51 52Note the "Validator" lines. If you change a library verison, or ROCm version, or 53PyTorch version, TunableOp will detect this and reject the tunings file because 54the prior tunings are likely affected by other software changes. 55 56The remaining lines are the tuned solutions for each TunableOp encountered 57during your execution. Each line consists of 4 comma-separated fields: operator 58name, operator parameters, solution name, and average execution time. The 59execution time is an optional field. The CSV file can be edited, but with 60caution. For example, the solution name (field 3) can be changed to "Default" 61and it will fall back to the original PyTorch untuned implementation. Or, in the 62case of ROCm's hipBLAS or hipBLASLt libraries, if you know the specific solution 63index you can override the solution that TunableOp selected by replacing the 64value. 
The operator name and parameters (fields 1 and 2) are internally named and
should not be modified. In the case of GemmTunableOp, field 1 indicates the
datatype and whether the inputs are transposed (T) or not (N) and field 2
indicates the M, N, K input shapes.

There is an option to enable verbose output, but it is only recommended for
debugging purposes. This will produce a lot of diagnostic messages but may be
useful to see whether TunableOp is being used at all. Otherwise, TunableOp is
completely silent, besides file output, unless there is a warning or error
during its use. The verbose option is only available by setting the environment
variable PYTORCH_TUNABLEOP_VERBOSE=1.

A Note on Tuning Behavior
=========================

Tuning an operator consists of iterating through the list of registered
implementations and profiling each one. The profile is established by running a
single implementation in a loop multiple times and taking the average execution
time.

By default, each possible solution for a given operator will be run for either
100 iterations or as many iterations as can be run within 30ms, whichever is
smaller, and its average execution time will be calculated. The fastest solution
among all that were successfully profiled will be chosen. A profile might fail
if the given solution doesn't achieve the same accuracy as the default
implementation or if the solution returns an error code.

Current Tunable Operators
=========================

TunableGemm for ROCm
--------------------

Currently only a TunableGemm for ROCm is implemented. Note that CUDA builds of
PyTorch will function correctly when using TunableOp, but the only solution
available to CUDA builds is the 'Default' implementation, i.e. the original
cuBLAS default, now called through TunableOp. Any call to at::cuda::blas::gemm()
or ::bgemm() will be routed through TunableOp when enabled. Calling gemm() for a
given set of input arguments (transa, transb, m, n, k) will attempt to use the
fastest available implementation across both rocblas and hipblaslt.

Tuning Context
==============

The behavior of TunableOp is currently manipulated through environment
variables, the C++ interface of at::cuda::tunable::getTuningContext(), or the
torch.cuda.tunable Python interfaces that wrap the C++ TuningContext. The
environment variables take precedence over any setting you manipulate using the
C++ or Python APIs.

"""
from typing import Optional, Tuple

import torch


__all__ = [
    "enable",
    "is_enabled",
    "tuning_enable",
    "tuning_is_enabled",
    "set_max_tuning_duration",
    "get_max_tuning_duration",
    "set_max_tuning_iterations",
    "get_max_tuning_iterations",
    "set_filename",
    "get_filename",
    "get_results",
    "get_validators",
    "write_file_on_exit",
    "write_file",
    "read_file",
]


def enable(val: bool = True) -> None:
    r"""This is the big on/off switch for all TunableOp implementations."""
    torch._C._cuda_tunableop_enable(val)  # type: ignore[attr-defined]


def is_enabled() -> bool:
    r"""Returns whether the TunableOp feature is enabled."""
    return torch._C._cuda_tunableop_is_enabled()  # type: ignore[attr-defined]


def tuning_enable(val: bool = True) -> None:
    r"""Enable tuning of TunableOp implementations.

    When enabled, if a tuned entry isn't found, run the tuning step and record
    the entry.
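
    A minimal usage sketch, assuming a CUDA- or ROCm-enabled build of
    PyTorch::

        >>> import torch
        >>> torch.cuda.tunable.enable()         # turn the TunableOp feature on
        >>> torch.cuda.tunable.tuning_enable()  # allow new tunings to be recorded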
154 """ 155 torch._C._cuda_tunableop_tuning_enable(val) # type: ignore[attr-defined] 156 157 158def tuning_is_enabled() -> bool: 159 r"""Returns whether TunableOp implementations can be tuned.""" 160 return torch._C._cuda_tunableop_tuning_is_enabled() # type: ignore[attr-defined] 161 162 163def set_max_tuning_duration(duration: int) -> None: 164 r"""Set max time in milliseconds to spend tuning a given solution. 165 166 If both max tuning duration and iterations are set, the smaller of the two 167 will be honored. At minimum 1 tuning iteration will always be run. 168 """ 169 torch._C._cuda_tunableop_set_max_tuning_duration(duration) # type: ignore[attr-defined] 170 171 172def get_max_tuning_duration() -> int: 173 r"""Get max time to spend tuning a given solution.""" 174 return torch._C._cuda_tunableop_get_max_tuning_duration() # type: ignore[attr-defined] 175 176 177def set_max_tuning_iterations(iterations: int) -> None: 178 r"""Set max number of iterations to spend tuning a given solution. 179 180 If both max tuning duration and iterations are set, the smaller of the two 181 will be honored. At minimum 1 tuning iteration will always be run. 182 """ 183 torch._C._cuda_tunableop_set_max_tuning_iterations(iterations) # type: ignore[attr-defined] 184 185 186def get_max_tuning_iterations() -> int: 187 r"""Get max iterations to spend tuning a given solution.""" 188 return torch._C._cuda_tunableop_get_max_tuning_iterations() # type: ignore[attr-defined] 189 190 191def set_filename(filename: str, insert_device_ordinal: bool = False) -> None: 192 r"""Set the filename to use for input/output of tuning results. 193 194 If :attr:`insert_device_ordinal` is ``True`` then the current device ordinal 195 will be added to the given filename automatically. This can be used in a 196 1-process-per-gpu cenario to ensure all processes write to a separate file. 197 """ 198 torch._C._cuda_tunableop_set_filename(filename, insert_device_ordinal) # type: ignore[attr-defined] 199 200 201def get_filename() -> str: 202 r"""Get the results filename.""" 203 return torch._C._cuda_tunableop_get_filename() # type: ignore[attr-defined] 204 205 206def get_results() -> Tuple[str, str, str, float]: 207 r"""Return all TunableOp results.""" 208 return torch._C._cuda_tunableop_get_results() # type: ignore[attr-defined] 209 210 211def get_validators() -> Tuple[str, str]: 212 r"""Return the TunableOp validators.""" 213 return torch._C._cuda_tunableop_get_validators() # type: ignore[attr-defined] 214 215 216def write_file_on_exit(val: bool) -> None: 217 r"""During Tuning Context destruction, write file to disk. 218 219 This is useful as a final flush of your results to disk if your application 220 terminates as result of normal operation or an error. Manual flushing of 221 your results can be achieved by manually calling ``write_file()``.""" 222 torch._C._cuda_tunableop_write_file_on_exit(val) # type: ignore[attr-defined] 223 224 225def write_file(filename: Optional[str] = None) -> bool: 226 r"""Write results to a CSV file. 227 228 If :attr:`filename` is not given, ``get_filename()`` is called. 229 """ 230 if filename is None: 231 filename = get_filename() 232 return torch._C._cuda_tunableop_write_file(filename) # type: ignore[attr-defined] 233 234 235def read_file(filename: Optional[str] = None) -> bool: 236 r"""Read results from a TunableOp CSV file. 237 238 If :attr:`filename` is not given, ``get_filename()`` is called. 
239 """ 240 if filename is None: 241 filename = get_filename() 242 return torch._C._cuda_tunableop_read_file(filename) # type: ignore[attr-defined] 243