=========================================================
NVIDIA Tegra SoC Uncore Performance Monitoring Unit (PMU)
=========================================================

The NVIDIA Tegra SoC includes various system PMUs to measure key performance
metrics like memory bandwidth, latency, and utilization:

* Scalable Coherency Fabric (SCF)
* NVLink-C2C0
* NVLink-C2C1
* CNVLink
* PCIE

PMU Driver
----------

The PMUs in this document are based on the ARM CoreSight PMU Architecture as
described in the document ARM IHI 0091. Since this is a standard architecture,
the PMUs are managed by a common driver, "arm-cs-arch-pmu". This driver
describes the available events and configuration of each PMU in sysfs. Please
see the sections below to get the sysfs path of each PMU. Like other uncore PMU
drivers, the driver provides a "cpumask" sysfs attribute to show the CPU id
used to handle the PMU event. There is also an "associated_cpus" sysfs
attribute, which contains a list of CPUs associated with the PMU instance.

.. _SCF_PMU_Section:

SCF PMU
-------

The SCF PMU monitors system-level cache events, CPU traffic, and
strongly-ordered (SO) PCIE write traffic to local/remote memory. Please see
:ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more info about the PMU
traffic coverage.

The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_source/devices/nvidia_scf_pmu_<socket-id>.

Example usage:

* Count event id 0x0 in socket 0::

   perf stat -a -e nvidia_scf_pmu_0/event=0x0/

* Count event id 0x0 in socket 1::

   perf stat -a -e nvidia_scf_pmu_1/event=0x0/

NVLink-C2C0 PMU
---------------

The NVLink-C2C0 PMU monitors incoming traffic from a GPU/CPU connected with the
NVLink-C2C (Chip-2-Chip) interconnect. The type of traffic captured by this PMU
varies depending on the chip configuration:

* NVIDIA Grace Hopper Superchip: a Hopper GPU is connected with a Grace SoC.

  In this config, the PMU captures GPU ATS translated or EGM traffic from the
  GPU.

* NVIDIA Grace CPU Superchip: two Grace CPU SoCs are connected.

  In this config, the PMU captures read and relaxed ordered (RO) write traffic
  from PCIE devices of the remote SoC.

Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more info about
the PMU traffic coverage.

The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_source/devices/nvidia_nvlink_c2c0_pmu_<socket-id>.

Example usage:

* Count event id 0x0 from the GPU/CPU connected with socket 0::

   perf stat -a -e nvidia_nvlink_c2c0_pmu_0/event=0x0/

* Count event id 0x0 from the GPU/CPU connected with socket 1::

   perf stat -a -e nvidia_nvlink_c2c0_pmu_1/event=0x0/

* Count event id 0x0 from the GPU/CPU connected with socket 2::

   perf stat -a -e nvidia_nvlink_c2c0_pmu_2/event=0x0/

* Count event id 0x0 from the GPU/CPU connected with socket 3::

   perf stat -a -e nvidia_nvlink_c2c0_pmu_3/event=0x0/

The NVLink-C2C has two ports that can be connected to one GPU (occupying both
ports) or to two GPUs (one GPU per port). The user can use the "port" bitmap
parameter to select the port(s) to monitor. Each bit represents a port number,
e.g. "port=0x1" corresponds to port 0 and "port=0x3" covers port 0 and port 1.
The PMU monitors both ports by default if "port" is not specified.

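
The width and position of the "port" bit field can be checked in sysfs before
building the event string. Assuming the driver exposes this parameter under the
PMU's "format" directory, like the "rem_socket" and "root_port" parameters
described later in this document (the exact file name below is an assumption)::

   # Hypothetical path: show the bit field backing the "port" parameter
   cat /sys/bus/event_source/devices/nvidia_nvlink_c2c0_pmu_0/format/port
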

Example of port filtering:

* Count event id 0x0 from the GPU connected with socket 0 on port 0::

   perf stat -a -e nvidia_nvlink_c2c0_pmu_0/event=0x0,port=0x1/

* Count event id 0x0 from the GPUs connected with socket 0 on port 0 and port 1::

   perf stat -a -e nvidia_nvlink_c2c0_pmu_0/event=0x0,port=0x3/

NVLink-C2C1 PMU
---------------

The NVLink-C2C1 PMU monitors incoming traffic from a GPU connected with the
NVLink-C2C (Chip-2-Chip) interconnect. This PMU captures untranslated GPU
traffic, in contrast with the NVLink-C2C0 PMU, which captures ATS translated
traffic. Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more
info about the PMU traffic coverage.

The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_source/devices/nvidia_nvlink_c2c1_pmu_<socket-id>.

Example usage:

* Count event id 0x0 from the GPU connected with socket 0::

   perf stat -a -e nvidia_nvlink_c2c1_pmu_0/event=0x0/

* Count event id 0x0 from the GPU connected with socket 1::

   perf stat -a -e nvidia_nvlink_c2c1_pmu_1/event=0x0/

* Count event id 0x0 from the GPU connected with socket 2::

   perf stat -a -e nvidia_nvlink_c2c1_pmu_2/event=0x0/

* Count event id 0x0 from the GPU connected with socket 3::

   perf stat -a -e nvidia_nvlink_c2c1_pmu_3/event=0x0/

The NVLink-C2C has two ports that can be connected to one GPU (occupying both
ports) or to two GPUs (one GPU per port). The user can use the "port" bitmap
parameter to select the port(s) to monitor. Each bit represents a port number,
e.g. "port=0x1" corresponds to port 0 and "port=0x3" covers port 0 and port 1.
The PMU monitors both ports by default if "port" is not specified.

Example of port filtering:

* Count event id 0x0 from the GPU connected with socket 0 on port 0::

   perf stat -a -e nvidia_nvlink_c2c1_pmu_0/event=0x0,port=0x1/

* Count event id 0x0 from the GPUs connected with socket 0 on port 0 and port 1::

   perf stat -a -e nvidia_nvlink_c2c1_pmu_0/event=0x0,port=0x3/

CNVLink PMU
-----------

The CNVLink PMU monitors traffic from GPUs and PCIE devices on remote sockets
to local memory. For PCIE traffic, this PMU captures read and relaxed ordered
(RO) write traffic. Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section`
for more info about the PMU traffic coverage.

The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_source/devices/nvidia_cnvlink_pmu_<socket-id>.

Each SoC socket can be connected to one or more sockets via CNVLink. The user
can use the "rem_socket" bitmap parameter to select the remote socket(s) to
monitor. Each bit represents a socket number, e.g. "rem_socket=0xE" corresponds
to sockets 1 to 3. The PMU monitors all remote sockets by default if
"rem_socket" is not specified.
/sys/bus/event_source/devices/nvidia_cnvlink_pmu_<socket-id>/format/rem_socket
shows the valid bits that can be set in the "rem_socket" parameter.

The PMU cannot distinguish the remote traffic initiator, therefore it does not
provide a filter to select the traffic source to monitor. It reports combined
traffic from remote GPU and PCIE devices.

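
For example, the valid "rem_socket" bits of the socket 0 instance can be listed
before constructing the event string (socket id 0 is used here purely as an
example)::

   # Show the bit field backing the "rem_socket" parameter of the socket 0 PMU
   cat /sys/bus/event_source/devices/nvidia_cnvlink_pmu_0/format/rem_socket
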

Example usage:

* Count event id 0x0 for the traffic from remote sockets 1, 2, and 3 to socket 0::

   perf stat -a -e nvidia_cnvlink_pmu_0/event=0x0,rem_socket=0xE/

* Count event id 0x0 for the traffic from remote sockets 0, 2, and 3 to socket 1::

   perf stat -a -e nvidia_cnvlink_pmu_1/event=0x0,rem_socket=0xD/

* Count event id 0x0 for the traffic from remote sockets 0, 1, and 3 to socket 2::

   perf stat -a -e nvidia_cnvlink_pmu_2/event=0x0,rem_socket=0xB/

* Count event id 0x0 for the traffic from remote sockets 0, 1, and 2 to socket 3::

   perf stat -a -e nvidia_cnvlink_pmu_3/event=0x0,rem_socket=0x7/

PCIE PMU
--------

The PCIE PMU monitors all read/write traffic from PCIE root ports to
local/remote memory. Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section`
for more info about the PMU traffic coverage.

The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>.

Each SoC socket can support multiple root ports. The user can use the
"root_port" bitmap parameter to select the port(s) to monitor, e.g.
"root_port=0xF" corresponds to root ports 0 to 3. The PMU monitors all root
ports by default if "root_port" is not specified.
/sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>/format/root_port
shows the valid bits that can be set in the "root_port" parameter.

Example usage:

* Count event id 0x0 from root ports 0 and 1 of socket 0::

   perf stat -a -e nvidia_pcie_pmu_0/event=0x0,root_port=0x3/

* Count event id 0x0 from root ports 0 and 1 of socket 1::

   perf stat -a -e nvidia_pcie_pmu_1/event=0x0,root_port=0x3/

.. _NVIDIA_Uncore_PMU_Traffic_Coverage_Section:

Traffic Coverage
----------------

The PMU traffic coverage may vary depending on the chip configuration:

* **NVIDIA Grace Hopper Superchip**: a Hopper GPU is connected with a Grace SoC.

  Example configuration with two Grace SoCs::

   *********************************         *********************************
   * SOCKET-A                      *         * SOCKET-B                      *
   *                               *         *                               *
   *                     ::::::::  *         *  ::::::::                     *
   *                     : PCIE :  *         *  : PCIE :                     *
   *                     ::::::::  *         *  ::::::::                     *
   *                        |      *         *      |                        *
   *                        |      *         *      |                        *
   * :::::::            :::::::::  *         *  :::::::::            ::::::: *
   * :     :            :       :  *         *  :       :            :     : *
   * : GPU :<--NVLink-->: Grace :<---CNVLink--->: Grace :<--NVLink-->: GPU : *
   * :     :    C2C     :  SoC  :  *         *  :  SoC  :     C2C    :     : *
   * :::::::            :::::::::  *         *  :::::::::            ::::::: *
   *    |                   |      *         *      |                   |    *
   *    |                   |      *         *      |                   |    *
   * &&&&&&&&            &&&&&&&&  *         *  &&&&&&&&            &&&&&&&& *
   * & GMEM &            & CMEM &  *         *  & CMEM &            & GMEM & *
   * &&&&&&&&            &&&&&&&&  *         *  &&&&&&&&            &&&&&&&& *
   *                               *         *                               *
   *********************************         *********************************

   GMEM = GPU Memory (e.g. HBM)
   CMEM = CPU Memory (e.g. LPDDR5X)


  The following table shows the traffic coverage of the Grace SoC PMUs in
  socket-A::

   +--------------+-------+-----------+-----------+-----+----------+----------+
   |              |                          Source                           |
   +              +-------+-----------+-----------+-----+----------+----------+
   |              |       |GPU ATS    |GPU Not-ATS|     | Socket-B | Socket-B |
   | Destination  |PCI R/W|Translated,|Translated | CPU | CPU/PCIE1| GPU/PCIE2|
   |              |       |EGM        |           |     |          |          |
   +==============+=======+===========+===========+=====+==========+==========+
   | Local        | PCIE  |NVLink-C2C0|NVLink-C2C1| SCF | SCF PMU  | CNVLink  |
   | SYSRAM/CMEM  | PMU   |PMU        |PMU        | PMU |          | PMU      |
   +--------------+-------+-----------+-----------+-----+----------+----------+
   | Local GMEM   | PCIE  | N/A       |NVLink-C2C1| SCF | SCF PMU  | CNVLink  |
   |              | PMU   |           |PMU        | PMU |          | PMU      |
   +--------------+-------+-----------+-----------+-----+----------+----------+
   | Remote       | PCIE  |NVLink-C2C0|NVLink-C2C1| SCF |          |          |
   | SYSRAM/CMEM  | PMU   |PMU        |PMU        | PMU | N/A      | N/A      |
   | over CNVLink |       |           |           |     |          |          |
   +--------------+-------+-----------+-----------+-----+----------+----------+
   | Remote GMEM  | PCIE  |NVLink-C2C0|NVLink-C2C1| SCF |          |          |
   | over CNVLink | PMU   |PMU        |PMU        | PMU | N/A      | N/A      |
   +--------------+-------+-----------+-----------+-----+----------+----------+

  PCIE1 traffic represents strongly ordered (SO) writes.
  PCIE2 traffic represents reads and relaxed ordered (RO) writes.

* **NVIDIA Grace CPU Superchip**: two Grace CPU SoCs are connected.

  Example configuration with two Grace SoCs::

   *******************           *******************
   * SOCKET-A        *           * SOCKET-B        *
   *                 *           *                 *
   *    ::::::::     *           *     ::::::::    *
   *    : PCIE :     *           *     : PCIE :    *
   *    ::::::::     *           *     ::::::::    *
   *       |         *           *         |       *
   *       |         *           *         |       *
   *   :::::::::     *           *     :::::::::   *
   *   :       :     *           *     :       :   *
   *   : Grace :<--------NVLink------->: Grace :   *
   *   :  SoC  :     *    C2C    *     :  SoC  :   *
   *   :::::::::     *           *     :::::::::   *
   *       |         *           *         |       *
   *       |         *           *         |       *
   *    &&&&&&&&     *           *     &&&&&&&&    *
   *    & CMEM &     *           *     & CMEM &    *
   *    &&&&&&&&     *           *     &&&&&&&&    *
   *                 *           *                 *
   *******************           *******************

   CMEM = CPU Memory (e.g. LPDDR5X)

  The following table shows the traffic coverage of the Grace SoC PMUs in
  socket-A::

   +-----------------+-----------+---------+----------+-------------+
   |                 |                    Source                    |
   +                 +-----------+---------+----------+-------------+
   |                 |           |         | Socket-B | Socket-B    |
   | Destination     | PCI R/W   | CPU     | CPU/PCIE1| PCIE2       |
   +=================+===========+=========+==========+=============+
   | Local           | PCIE PMU  | SCF PMU | SCF PMU  | NVLink-C2C0 |
   | SYSRAM/CMEM     |           |         |          | PMU         |
   +-----------------+-----------+---------+----------+-------------+
   | Remote          |           |         |          |             |
   | SYSRAM/CMEM     | PCIE PMU  | SCF PMU | N/A      | N/A         |
   | over NVLink-C2C |           |         |          |             |
   +-----------------+-----------+---------+----------+-------------+

  PCIE1 traffic represents strongly ordered (SO) writes.
  PCIE2 traffic represents reads and relaxed ordered (RO) writes.

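
As a worked example of reading the tables above: on the Grace CPU Superchip,
reads and relaxed ordered (RO) writes arriving from PCIE devices of socket-B
into socket-A memory are counted by the NVLink-C2C0 PMU instance of socket-A.
Reusing the placeholder event id 0x0 from the earlier examples::

   # Count remote-socket PCIE read/RO-write traffic landing in socket 0 memory
   perf stat -a -e nvidia_nvlink_c2c0_pmu_0/event=0x0/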