| Name | Date | Size | #Lines | LOC |
|------|------|------|--------|-----|
| cpu/ | 25-Apr-2025 | - | 664 | 530 |
| cuda/ | 25-Apr-2025 | - | 734 | 589 |
| README.md | 25-Apr-2025 | 1.9 KiB | 17 | 11 |
| arg_spec.h | 25-Apr-2025 | 1.4 KiB | 57 | 38 |
| codegen.cpp | 25-Apr-2025 | 23.7 KiB | 687 | 556 |
| codegen.h | 25-Apr-2025 | 745 B | 25 | 17 |
| compiler.cpp | 25-Apr-2025 | 9.9 KiB | 299 | 226 |
| compiler.h | 25-Apr-2025 | 1.8 KiB | 57 | 40 |
| executor.cpp | 25-Apr-2025 | 13.4 KiB | 406 | 309 |
| executor.h | 25-Apr-2025 | 500 B | 20 | 12 |
| fallback.cpp | 25-Apr-2025 | 1.4 KiB | 48 | 37 |
| fallback.h | 25-Apr-2025 | 174 B | 12 | 6 |
| fused_kernel.h | 25-Apr-2025 | 3.2 KiB | 99 | 61 |
| interface.cpp | 25-Apr-2025 | 3.1 KiB | 108 | 79 |
| interface.h | 25-Apr-2025 | 1.7 KiB | 55 | 25 |
| kernel_cache.cpp | 25-Apr-2025 | 2.6 KiB | 89 | 63 |
| kernel_cache.h | 25-Apr-2025 | 994 B | 34 | 16 |
| kernel_spec.h | 25-Apr-2025 | 4.4 KiB | 148 | 109 |
| partition_desc.h | 25-Apr-2025 | 1.7 KiB | 59 | 41 |
| tensor_desc.h | 25-Apr-2025 | 2.6 KiB | 99 | 75 |
| tensor_info.h | 25-Apr-2025 | 536 B | 25 | 16 |

README.md

# PyTorch Fuser

The fuser accepts subgraphs wrapped in "fusion nodes" and tries to execute them by just-in-time (JIT) compiling kernels that run all the graph operations.
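
For intuition, a fused kernel for a chain of pointwise operations boils down to a single loop over the inputs. The hand-written C++ sketch below is illustrative only (the function name and signature are invented for this example); the fuser itself generates an equivalent kernel as a string and compiles it at runtime for the target device.

```cpp
#include <cstddef>

// Illustrative only: mul followed by add, executed as one fused loop with no
// intermediate tensor. Running the two ops separately would materialize
// a[i] * b[i] in memory; the fused loop reads each input once and writes once.
void fused_mul_add(const float* a, const float* b, const float* c,
                   float* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = a[i] * b[i] + c[i];
  }
}
```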

## Code Organization

The fuser is designed hierarchically, with device-independent logic eventually deferring to device-specific logic and implementation. The device-specific code is (mostly) found in each device's subdirectory. The device-independent logic has six components:

* The Interface (interface.h/cpp) has functions to register and run fusions, interrogate fusion functionality, and perform debugging.
* The Compiler (compiler.h/cpp) performs "upfront" and "runtime" compilation. When fusions are registered, upfront compilation produces fallback code and performs some shape inference. When a fusion is run, runtime compilation invokes code generation and the device-specific compilation logic.
* The Code Generator (codegen.h/cpp) produces the string to be compiled on the device.
* The Executor (executor.h/cpp) runs requested fusions. It performs shape inference, expands tensors as necessary, determines the device to run on, acquires a cached compiled kernel or requests that the Compiler produce a new one, invokes device-specific code to launch the kernel, and updates the stack (see the sketch after this list).
* The Fallback (fallback.h/cpp) runs subgraphs that can't be fused, either because shape inference didn't determine a common tensor size or because the device the tensors are on doesn't support fusion.
* The Kernel Specification Cache (kernel_cache.h/cpp) is a thread-safe cache holding the device-independent specifications produced during upfront compilation. These specifications each have their own thread-safe stores of compiled kernels that the Executor checks before requesting runtime compilation.
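
The interplay between the Executor and the Kernel Specification Cache can be pictured roughly as below. The types and the getOrCompile helper are simplified stand-ins invented for this sketch (the real ArgSpec, KernelSpec, and FusedKernel headers in this directory have different members and interfaces); it only shows the look-up-then-compile flow described above. In the real path the Executor first runs shape inference and expands tensors so that a single argument specification describes the launch, and hands off to the Fallback when no kernel can be produced.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <unordered_map>

// Hypothetical stand-ins for the arg_spec.h / kernel_spec.h / fused_kernel.h
// types; only the shape of the control flow is meant to match.
struct ArgSpec {
  std::size_t hash;  // in the real fuser, derived from input dtypes, devices, and contiguity
  bool operator==(const ArgSpec& other) const { return hash == other.hash; }
};

struct ArgSpecHasher {
  std::size_t operator()(const ArgSpec& spec) const { return spec.hash; }
};

struct FusedKernel {};  // placeholder for a compiled, launchable kernel

// Each specification owns a thread-safe store of kernels compiled for the
// argument configurations seen so far.
struct KernelSpec {
  std::mutex mutex;
  std::unordered_map<ArgSpec, std::shared_ptr<FusedKernel>, ArgSpecHasher> kernels;
};

// The Executor's hot path: check the spec's kernel store first and only fall
// back to runtime compilation (code generation + device compile) on a miss.
std::shared_ptr<FusedKernel> getOrCompile(
    KernelSpec& spec,
    const ArgSpec& args,
    const std::function<std::shared_ptr<FusedKernel>()>& compile) {
  std::lock_guard<std::mutex> guard(spec.mutex);
  auto it = spec.kernels.find(args);
  if (it != spec.kernels.end()) {
    return it->second;  // reuse a previously compiled kernel
  }
  auto kernel = compile();  // runtime compilation path
  spec.kernels.emplace(args, kernel);
  return kernel;
}
```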

The device-specific components, FusedKernelCPU (cpu/fused_kernel.h/cpp) and FusedKernelCUDA (cuda/fused_kernel.h/cpp), contain the logic for compiling and running code on their respective devices.
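
A rough picture of what such a device-specific component provides is sketched below. This is a hypothetical interface written for this README, not the actual FusedKernelCPU/FusedKernelCUDA definitions; names and signatures are assumptions.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical shape of a device-specific fused kernel wrapper: the
// constructor compiles the generated source for its device, and launch_raw
// runs the compiled kernel over type-erased argument pointers.
class FusedKernelSketch {
 public:
  explicit FusedKernelSketch(std::string source) : source_(std::move(source)) {
    // A CPU backend might write the source to a temporary file, invoke the
    // host compiler, and dlopen the result; a CUDA backend might compile the
    // source with NVRTC and load the resulting module.
  }
  virtual ~FusedKernelSketch() = default;

  // Launch with raw pointers to the flattened tensor arguments.
  virtual void launch_raw(uint32_t numel, std::vector<void*>& arguments) const = 0;

 private:
  std::string source_;  // the string produced by the Code Generator
};
```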