# PyTorch Operator Micro-benchmarks

This benchmark suite provides a systematic way to measure the performance of operators for a wide range of inputs. The generated benchmark data fully characterizes the performance of an operator in terms of execution time and the efficiency of the PyTorch framework used.

## Features

Key Features:

1\. Language used: Python

2\. Supported Frameworks: PyTorch

3\. Supported PyTorch modes: eager and JIT

4\. Input shapes: user-defined shapes, randomly generated shapes

## Getting Started

### Initial Setup
The instructions below install a cpp_extension for PyTorch, which is required to run the benchmark suite.
```bash
cd pt_extension
python setup.py install
```

### How to run the benchmarks

Run the `torch.add` benchmark:
```bash
cd pytorch/benchmarks/operator_benchmark
python -m pt.add_test --omp-num-threads 1 --mkl-num-threads 1
```
Note: we set the number of OpenMP and MKL threads to 1 here. If you want to benchmark operators with multithreading (intra-op parallelism), set the `--omp-num-threads` and `--mkl-num-threads` flags to larger values (e.g., `--omp-num-threads 4 --mkl-num-threads 4`).

List all the supported tests:
```bash
python -m pt.add_test --list-tests
```

Filter and run a test (use `add_M8_N16_K32` as an example):
```bash
python -m pt.add_test --test-name add_M8_N16_K32 \
--omp-num-threads 1 --mkl-num-threads 1
```

Run all the supported benchmarks:
```bash
python -m benchmark_all_test
```

## Code to support `torch.add` in the benchmark
The following example shows the code to support `torch.add` with 27 different tests. In the following sections, we'll step through the complete flow of adding PyTorch operators to the benchmark suite. Existing benchmarks for operators are in the `pt` directory, and we highly recommend putting your new operators there as well.

```python
import operator_benchmark as op_bench
import torch

add_short_configs = op_bench.cross_product_configs(
    M=[8, 64, 128],
    N=range(2, 10, 3),
    K=[2 ** x for x in range(0, 3)],
    tags=["short"]
)

class AddBenchmark(op_bench.TorchBenchmarkBase):
    def init(self, M, N, K, device):
        self.inputs = {
            "input_one": torch.rand(M, N, K, device=device, requires_grad=self.auto_set()),
            "input_two": torch.rand(M, N, K, device=device, requires_grad=self.auto_set())
        }
        self.set_module_name("add")

    def forward(self, input_one, input_two):
        return torch.add(input_one, input_two)

op_bench.generate_pt_test(add_short_configs, AddBenchmark)
```

## Output and Command-Line Control of the Benchmark
The output is intended to be human-readable. Here is an example output for `torch.add`:
```
# ----------------------------------------
# PyTorch Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: add
# Mode: Eager
# Name: add_M8_N16_K32
# Input: M: 8, N: 16, K: 32
Forward Execution Time (us) : 6.651

# Benchmarking PyTorch: add
# Mode: Eager
# Name: add_M16_N16_K64
# Input: M: 16, N: 16, K: 64
Forward Execution Time (us) : 11.976

# Benchmarking PyTorch: add
# Mode: Eager
# Name: add_M64_N64_K128
# Input: M: 64, N: 64, K: 128
Forward Execution Time (us) : 222.370
```
At a high level, the output includes the execution time of `torch.add` with three different inputs. Let's look at each line in detail:

1\. `Tag: short` tags a group of inputs.
For each operator, you may be interested in a large number of inputs, but you don't always want to run all of them; `Tag` allows you to run only a subset. Most of the inputs to the operators supported in the benchmark are grouped using two tags: one group is tagged `short` and stores some commonly used shapes, while the other is tagged `long` and stores many random inputs to provide better coverage than `short`.

2\. `Benchmarking PyTorch: add` shows the name of the operator being benchmarked.

3\. `Mode: Eager` shows that the benchmark runs in PyTorch eager mode.

4\. `Name: add_M8_N16_K32` is the name of the test, and it can be used to filter tests.

5\. `Input: M: 8, N: 16, K: 32` shows the inputs to the operator.

6\. `Forward Execution Time (us) : 6.651` reports the execution time of the operator in microseconds.

### Command-Line Control
You can control all aspects of the benchmark suite through the command line. Find the details of those arguments by running the following command or by looking into `benchmark_runner.py`.
```bash
python benchmark_runner.py --help
```

Run all the supported benchmarks:
```bash
python -m benchmark_all_test --omp-num-threads 1 --mkl-num-threads 1
```

List all the supported operators:
```bash
python -m benchmark_all_test --list-ops
```

List all the supported tests:
```bash
python -m benchmark_all_test --list-tests
```

Filter and run an operator (use `add` as an example):
```bash
python -m benchmark_all_test --operators add --omp-num-threads 1 --mkl-num-threads 1
```
Note: this filter is based on the operator name rather than the file name.

Run the `torch.add` benchmark with tag `long`:
```bash
python -m pt.add_test --tag-filter long
```

## Adding New Operators to the Benchmark Suite
In the previous sections, we gave several examples showing how to run the operators already available in the benchmark suite. In the following sections, we'll step through the complete flow of adding PyTorch operators to the benchmark suite. Existing benchmarks for operators are in the `pt` directory, and we highly recommend putting your new operators in that directory as well.

### Add a New PyTorch Operator
Let's say you want to measure the execution time of the following operator:
```python
C = torch.add(A, B) # Shape of A and B is [M, N, K]
```
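For concreteness, this is what a single instance of that computation looks like in plain PyTorch (the shape values here are taken from one of the `short` configs):
```python
import torch

M, N, K = 8, 16, 32
A = torch.rand(M, N, K)
B = torch.rand(M, N, K)
C = torch.add(A, B)  # element-wise add; C also has shape [M, N, K]
print(C.shape)       # torch.Size([8, 16, 32])
```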
The code below shows how to add it to the benchmark suite. Let's go over the example line by line.
```python
import operator_benchmark as op_bench
import torch

add_long_configs = op_bench.cross_product_configs(
    M=[8, 64, 128],
    N=range(2, 10, 3),
    K=[2 ** x for x in range(0, 3)],
    tags=["long"]
)

add_short_configs = op_bench.config_list(
    attr_names=["M", "N", "K"],
    attrs=[
        [8, 16, 32],
        [16, 16, 64],
        [64, 64, 128],
    ],
    tags=["short"],
)

class AddBenchmark(op_bench.TorchBenchmarkBase):
    def init(self, M, N, K, device):
        self.inputs = {
            "input_one": torch.rand(M, N, K, device=device, requires_grad=self.auto_set()),
            "input_two": torch.rand(M, N, K, device=device, requires_grad=self.auto_set())
        }
        self.set_module_name("add")

    def forward(self, input_one, input_two):
        return torch.add(input_one, input_two)

op_bench.generate_pt_test(add_long_configs + add_short_configs, AddBenchmark)

if __name__ == "__main__":
    op_bench.benchmark_runner.main()
```

#### Part 1. Specify Inputs to Operators
For the `torch.add` operator, we would like to make sure it delivers good performance with input tensors of small, medium, and large sizes. We have introduced two helper functions for users to easily generate a combination of inputs.
```python
# Generate list configurations that will be used for benchmark experiments
add_long_configs = op_bench.cross_product_configs(
    M=[8, 64, 128],
    N=range(2, 10, 3),
    K=[2 ** x for x in range(0, 3)],
    tags=["long"]
)

add_short_configs = op_bench.config_list(
    attr_names=["M", "N", "K"],
    attrs=[
        [8, 16, 32],
        [16, 16, 64],
        [64, 64, 128],
    ],
    tags=["short"],
)
```
Let's look at them in detail:

1\. `op_bench.config_list` is a helper function which specifies a list of inputs to operators. It takes three parameters, `attr_names`, `attrs`, and `tags`, all of which are Python lists. `attr_names` stores the names of the inputs, and `attrs` stores the actual values of each input. In this example, three different inputs will be returned: `M=8, N=16, K=32`; `M=16, N=16, K=64`; and `M=64, N=64, K=128`.

2\. `op_bench.cross_product_configs` is another helper function, used to generate the Cartesian product of the given inputs. Each input is specified as a Python list. In this example, the helper method returns a combination of 27 (`len(M) * len(N) * len(K)`) inputs.
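To make that count concrete, here is a small standalone sketch (plain Python, independent of the benchmark suite) that enumerates the same Cartesian product:
```python
import itertools

M = [8, 64, 128]
N = list(range(2, 10, 3))          # [2, 5, 8]
K = [2 ** x for x in range(0, 3)]  # [1, 2, 4]

configs = list(itertools.product(M, N, K))
print(len(configs))  # 27 = len(M) * len(N) * len(K)
print(configs[0])    # (8, 2, 1)
```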
#### Part 2. Create Tensors and Add Computation
After the inputs are provided, we now look at adding the computation of an operator. Adding a new operator requires implementing a new `TorchBenchmarkBase` subclass. Every new class is required to implement 2 methods:
* `init` is used to create tensors based on the inputs we provided before. In this example, the parameters to `init` are `M`, `N`, and `K`, which have been specified in the input configuration. `init` also packs all the needed inputs together into a dictionary `self.inputs`, which will be provided to `forward` as the arguments for running the benchmark.
* `forward` includes the operator to be tested and the computation based on the tensors created in `init`. Apart from `self`, the order of its arguments must match the entries specified in `self.inputs`.

The example below shows the code for `torch.add`:
```python
# Given one set of M, N, K, the init method creates input tensors based on
# that. The forward method does the torch.add calculation on those input tensors.

class AddBenchmark(op_bench.TorchBenchmarkBase):
    def init(self, M, N, K, device):
        # this is the method where you need to create tensors
        # M, N, and K can be in a different order, but they must match the
        # names in the configs
        self.inputs = {
            "input_one": torch.rand(M, N, K, device=device, requires_grad=self.auto_set()),
            "input_two": torch.rand(M, N, K, device=device, requires_grad=self.auto_set())
        }
        self.set_module_name("add")

    def forward(self, input_one, input_two):
        # this is the method where the operator computation happens
        return torch.add(input_one, input_two)
```

#### Part 3. Register Tests With the Benchmark Suite
After we have the inputs and the benchmark class, it's time to register them with our benchmark suite. Here is what it looks like:
```python
op_bench.generate_pt_test(add_long_configs + add_short_configs, AddBenchmark)
```
`generate_pt_test` takes two parameters: the input configs and the benchmark class.

#### Part 4. Run the Registered Tests
To run the benchmark, we use the `main` method in the `benchmark_runner` module.
```python
if __name__ == "__main__":
    op_bench.benchmark_runner.main()
```
That's it. You just added a new operator to the benchmark suite!

### Add a List of Operators
In the previous sections, we introduced the steps required to add a single operator to the benchmark suite. There are scenarios where you want to extend the benchmark suite with a list of operators that share the same inputs. For example, to benchmark the `abs` and `acos` operators, you can use the same set of inputs for both.

Let's say we want to benchmark the following operators separately:
```python
C = torch.abs(A) # Shape of A [M, N]
C = torch.acos(A) # Shape of A [M, N]
```
The following code shows how to do that:
```python
import operator_benchmark as op_bench
import torch

unary_ops_configs = op_bench.config_list(
    attrs=[
        [128, 128],
        [256, 256],
        [1024, 1024],
    ],
    attr_names=["M", "N"],
    tags=["short"]
)

unary_ops_list = op_bench.op_list(
    attr_names=["op_name", "op_func"],
    attrs=[
        ["abs", torch.abs],
        ["acos", torch.acos],
    ],
)

class UnaryOpBenchmark(op_bench.TorchBenchmarkBase):
    def init(self, M, N, device, op_func):
        self.inputs = {
            "input": torch.rand(M, N, device=device)
        }
        self.op_func = op_func

    def forward(self, input):
        return self.op_func(input)

op_bench.generate_pt_tests_from_op_list(unary_ops_list, unary_ops_configs, UnaryOpBenchmark)

if __name__ == "__main__":
    op_bench.benchmark_runner.main()
```
The inputs to those operators are specified using the same method we went over before, so we skip that part here.

#### Part 1. Specify the List of Operators
To add a list of operators to the benchmark suite, we introduce the `op_bench.op_list` method, which takes two parameters:
* `attrs` stores the name of each operator and the function that performs the actual calculation.
* `attr_names` stores the names of the values in `attrs`.

The example below shows the code to add `torch.abs` and `torch.acos`:
```python
unary_ops_list = op_bench.op_list(
    attr_names=["op_name", "op_func"],
    attrs=[
        ["abs", torch.abs],
        ["acos", torch.acos],
    ],
)
```
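Because the operator list is just data, covering more operators only requires adding rows to `attrs`. A small sketch (here `torch.sin` and `torch.cos` are illustrative additions, not part of the original list):
```python
unary_ops_list = op_bench.op_list(
    attr_names=["op_name", "op_func"],
    attrs=[
        ["abs", torch.abs],
        ["acos", torch.acos],
        # illustrative additions: any unary op taking one input tensor works here
        ["sin", torch.sin],
        ["cos", torch.cos],
    ],
)
```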
#### Part 2. Create Tensors and Add Computation
In this example, both operators share the same input, so we only need to implement one `TorchBenchmarkBase` subclass.
Every new subclass is required to implement 2 methods:
* `init` is used to create tensors and to set the operator name and function. In this example, the parameters to `init` are `M`, `N`, and `op_func`, which have been specified in the configurations, plus the `device` on which to create the tensors.
* `forward` includes the operator to be tested and the computation based on the tensors created in `init`. Apart from `self`, the order of its arguments must match the entries specified in `self.inputs`.

Here is the code for `abs` and `acos`:

```python
class UnaryOpBenchmark(op_bench.TorchBenchmarkBase):
    def init(self, M, N, device, op_func):
        # M and N match the attr_names in the input configuration
        # op_func matches the attr_names in the ops configuration
        self.inputs = {
            "input": torch.rand(M, N, device=device)
        }
        self.op_func = op_func

    def forward(self, input):
        return self.op_func(input)
```

#### Part 3. Register a List of Operators
To register multiple operators, we introduced the `generate_pt_tests_from_op_list` function, which takes three parameters: the list of operators, the configs, and the benchmark class.
Here is an example:
```python
op_bench.generate_pt_tests_from_op_list(unary_ops_list, unary_ops_configs, UnaryOpBenchmark)
```

### Add Gradient Ops
In this section, we go over the steps to benchmark the backward path of operators.
#### For PyTorch Gradient Ops
To measure the performance of an operator in its backward path, there are only two changes needed in addition to the steps we covered for the forward path:

1\. Specify `requires_grad=True` when creating the tensors. This is the standard PyTorch way of enabling the backward path.

2\. Use `generate_pt_gradient_test` to register the tests.

The example below shows the relevant code for that:
```python
self.input_one = torch.rand(M, N, K, requires_grad=True)
generate_pt_gradient_test(long_configs + short_configs, TorchAddBenchmark)
```
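Putting both changes together, a minimal end-to-end sketch for benchmarking the backward path of `torch.add` could look like the following. This is a sketch rather than the suite's own example: the config values are illustrative, and the class simply mirrors the earlier `AddBenchmark` under the `TorchAddBenchmark` name used in the snippet above.
```python
import operator_benchmark as op_bench
import torch

add_grad_configs = op_bench.cross_product_configs(
    M=[8, 64],   # illustrative values
    N=[16, 64],
    K=[32, 128],
    tags=["short"]
)

class TorchAddBenchmark(op_bench.TorchBenchmarkBase):
    def init(self, M, N, K, device):
        # change 1: requires_grad=True enables the backward path
        self.inputs = {
            "input_one": torch.rand(M, N, K, device=device, requires_grad=True),
            "input_two": torch.rand(M, N, K, device=device, requires_grad=True)
        }
        self.set_module_name("add")

    def forward(self, input_one, input_two):
        return torch.add(input_one, input_two)

# change 2: register with generate_pt_gradient_test instead of generate_pt_test
op_bench.generate_pt_gradient_test(add_grad_configs, TorchAddBenchmark)

if __name__ == "__main__":
    op_bench.benchmark_runner.main()
```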