=========================================================
NVIDIA Tegra SoC Uncore Performance Monitoring Unit (PMU)
=========================================================

The NVIDIA Tegra SoC includes various system PMUs to measure key performance
metrics like memory bandwidth, latency, and utilization:

* Scalable Coherency Fabric (SCF)
* NVLink-C2C0
* NVLink-C2C1
* CNVLink
* PCIE

PMU Driver
----------

The PMUs in this document are based on the ARM CoreSight PMU Architecture as
described in the document ARM IHI 0091. Since this is a standard architecture,
the PMUs are managed by a common driver, "arm-cs-arch-pmu". This driver
describes the available events and configuration of each PMU in sysfs; please
see the sections below for the sysfs path of each PMU. Like other uncore PMU
drivers, the driver provides a "cpumask" sysfs attribute to show the CPU id
used to handle the PMU events, and an "associated_cpus" sysfs attribute that
contains the list of CPUs associated with the PMU instance.
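
For example, assuming an SCF PMU instance for socket 0 is present (see
:ref:`SCF_PMU_Section`), both attributes can be read directly from sysfs::

   cat /sys/bus/event_source/devices/nvidia_scf_pmu_0/cpumask
   cat /sys/bus/event_source/devices/nvidia_scf_pmu_0/associated_cpus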

.. _SCF_PMU_Section:

SCF PMU
-------

The SCF PMU monitors system level cache events, CPU traffic, and
strongly-ordered (SO) PCIE write traffic to local/remote memory. Please see
:ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more info about the PMU
traffic coverage.

The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_source/devices/nvidia_scf_pmu_<socket-id>.
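
The available events can be listed from the "events" directory of a PMU
instance, e.g. assuming socket 0 exists::

   ls /sys/bus/event_source/devices/nvidia_scf_pmu_0/events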

Example usage:

* Count event id 0x0 in socket 0::

   perf stat -a -e nvidia_scf_pmu_0/event=0x0/

* Count event id 0x0 in socket 1::

   perf stat -a -e nvidia_scf_pmu_1/event=0x0/
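
* To observe how a count changes over time, e.g. when estimating bandwidth,
  the same event can be combined with perf's interval mode (a sketch; the
  1000 ms interval is arbitrary)::

   perf stat -a -I 1000 -e nvidia_scf_pmu_0/event=0x0/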

NVLink-C2C0 PMU
---------------

The NVLink-C2C0 PMU monitors incoming traffic from a GPU/CPU connected with the
NVLink-C2C (Chip-2-Chip) interconnect. The type of traffic captured by this PMU
varies depending on the chip configuration:

* NVIDIA Grace Hopper Superchip: Hopper GPU is connected with Grace SoC.

  In this config, the PMU captures GPU ATS translated or EGM traffic from the GPU.

* NVIDIA Grace CPU Superchip: two Grace CPU SoCs are connected.

  In this config, the PMU captures read and relaxed ordered (RO) writes from
  the PCIE devices of the remote SoC.

Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more info about
the PMU traffic coverage.

The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_source/devices/nvidia_nvlink_c2c0_pmu_<socket-id>.

Example usage:

* Count event id 0x0 from the GPU/CPU connected with socket 0::

   perf stat -a -e nvidia_nvlink_c2c0_pmu_0/event=0x0/

* Count event id 0x0 from the GPU/CPU connected with socket 1::

   perf stat -a -e nvidia_nvlink_c2c0_pmu_1/event=0x0/

* Count event id 0x0 from the GPU/CPU connected with socket 2::

   perf stat -a -e nvidia_nvlink_c2c0_pmu_2/event=0x0/

* Count event id 0x0 from the GPU/CPU connected with socket 3::

   perf stat -a -e nvidia_nvlink_c2c0_pmu_3/event=0x0/

The NVLink-C2C has two ports that can be connected to one GPU (occupying both
ports) or to two GPUs (one GPU per port). The user can use the "port" bitmap
parameter to select the port(s) to monitor. Each bit represents a port number,
e.g. "port=0x1" corresponds to port 0 and "port=0x3" to ports 0 and 1. If not
specified, the PMU monitors both ports by default.
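
Assuming the "port" parameter follows the same convention as the "rem_socket"
and "root_port" parameters described below, its valid bits should be visible in
the PMU's format attribute, e.g. for socket 0::

   cat /sys/bus/event_source/devices/nvidia_nvlink_c2c0_pmu_0/format/port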

Example for port filtering:

* Count event id 0x0 from the GPU connected with socket 0 on port 0::

   perf stat -a -e nvidia_nvlink_c2c0_pmu_0/event=0x0,port=0x1/

* Count event id 0x0 from the GPUs connected with socket 0 on port 0 and port 1::

   perf stat -a -e nvidia_nvlink_c2c0_pmu_0/event=0x0,port=0x3/

NVLink-C2C1 PMU
---------------

The NVLink-C2C1 PMU monitors incoming traffic from a GPU connected with the
NVLink-C2C (Chip-2-Chip) interconnect. This PMU captures untranslated GPU
traffic, in contrast with the NVLink-C2C0 PMU, which captures ATS translated
traffic. Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more
info about the PMU traffic coverage.

The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_source/devices/nvidia_nvlink_c2c1_pmu_<socket-id>.

Example usage:

* Count event id 0x0 from the GPU connected with socket 0::

   perf stat -a -e nvidia_nvlink_c2c1_pmu_0/event=0x0/

* Count event id 0x0 from the GPU connected with socket 1::

   perf stat -a -e nvidia_nvlink_c2c1_pmu_1/event=0x0/

* Count event id 0x0 from the GPU connected with socket 2::

   perf stat -a -e nvidia_nvlink_c2c1_pmu_2/event=0x0/

* Count event id 0x0 from the GPU connected with socket 3::

   perf stat -a -e nvidia_nvlink_c2c1_pmu_3/event=0x0/

The NVLink-C2C has two ports that can be connected to one GPU (occupying both
ports) or to two GPUs (one GPU per port). The user can use the "port" bitmap
parameter to select the port(s) to monitor. Each bit represents a port number,
e.g. "port=0x1" corresponds to port 0 and "port=0x3" to ports 0 and 1. If not
specified, the PMU monitors both ports by default.

Example for port filtering:

* Count event id 0x0 from the GPU connected with socket 0 on port 0::

   perf stat -a -e nvidia_nvlink_c2c1_pmu_0/event=0x0,port=0x1/

* Count event id 0x0 from the GPUs connected with socket 0 on port 0 and port 1::

   perf stat -a -e nvidia_nvlink_c2c1_pmu_0/event=0x0,port=0x3/

CNVLink PMU
-----------

The CNVLink PMU monitors traffic from GPUs and PCIE devices on remote sockets
to local memory. For PCIE traffic, this PMU captures read and relaxed ordered
(RO) write traffic. Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section`
for more info about the PMU traffic coverage.

The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_source/devices/nvidia_cnvlink_pmu_<socket-id>.

Each SoC socket can be connected to one or more sockets via CNVLink. The user
can use the "rem_socket" bitmap parameter to select the remote socket(s) to
monitor. Each bit represents a socket number, e.g. "rem_socket=0xE" corresponds
to sockets 1 to 3. If not specified, the PMU monitors all remote sockets by
default.
/sys/bus/event_source/devices/nvidia_cnvlink_pmu_<socket-id>/format/rem_socket
shows the valid bits that can be set in the "rem_socket" parameter.
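
For example, on socket 0::

   cat /sys/bus/event_source/devices/nvidia_cnvlink_pmu_0/format/rem_socket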

The PMU cannot distinguish the initiator of remote traffic, therefore it does
not provide a filter to select the traffic source to monitor. It reports
combined traffic from remote GPU and PCIE devices.

Example usage:

* Count event id 0x0 for the traffic from remote sockets 1, 2, and 3 to socket 0::

   perf stat -a -e nvidia_cnvlink_pmu_0/event=0x0,rem_socket=0xE/

* Count event id 0x0 for the traffic from remote sockets 0, 2, and 3 to socket 1::

   perf stat -a -e nvidia_cnvlink_pmu_1/event=0x0,rem_socket=0xD/

* Count event id 0x0 for the traffic from remote sockets 0, 1, and 3 to socket 2::

   perf stat -a -e nvidia_cnvlink_pmu_2/event=0x0,rem_socket=0xB/

* Count event id 0x0 for the traffic from remote sockets 0, 1, and 2 to socket 3::

   perf stat -a -e nvidia_cnvlink_pmu_3/event=0x0,rem_socket=0x7/

PCIE PMU
--------

The PCIE PMU monitors all read/write traffic from PCIE root ports to
local/remote memory. Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section`
for more info about the PMU traffic coverage.

The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>.

Each SoC socket can support multiple root ports. The user can use the
"root_port" bitmap parameter to select the port(s) to monitor, e.g.
"root_port=0xF" corresponds to root ports 0 to 3. If not specified, the PMU
monitors all root ports by default.
/sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>/format/root_port
shows the valid bits that can be set in the "root_port" parameter.
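
For example, on socket 0::

   cat /sys/bus/event_source/devices/nvidia_pcie_pmu_0/format/root_port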

Example usage:

* Count event id 0x0 from root ports 0 and 1 of socket 0::

   perf stat -a -e nvidia_pcie_pmu_0/event=0x0,root_port=0x3/

* Count event id 0x0 from root ports 0 and 1 of socket 1::

   perf stat -a -e nvidia_pcie_pmu_1/event=0x0,root_port=0x3/
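
Multiple PMUs can also be counted in a single invocation by passing several
events, e.g. to count event id 0x0 on both the SCF and PCIE PMUs of socket 0::

   perf stat -a -e nvidia_scf_pmu_0/event=0x0/ -e nvidia_pcie_pmu_0/event=0x0/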

.. _NVIDIA_Uncore_PMU_Traffic_Coverage_Section:

Traffic Coverage
----------------

The PMU traffic coverage may vary depending on the chip configuration:

* **NVIDIA Grace Hopper Superchip**: Hopper GPU is connected with Grace SoC.

  Example configuration with two Grace SoCs::

   *********************************          *********************************
   * SOCKET-A                      *          * SOCKET-B                      *
   *                               *          *                               *
   *                     ::::::::  *          *  ::::::::                     *
   *                     : PCIE :  *          *  : PCIE :                     *
   *                     ::::::::  *          *  ::::::::                     *
   *                         |     *          *      |                        *
   *                         |     *          *      |                        *
   *  :::::::            ::::::::: *          *  :::::::::            ::::::: *
   *  :     :            :       : *          *  :       :            :     : *
   *  : GPU :<--NVLink-->: Grace :<---CNVLink--->: Grace :<--NVLink-->: GPU : *
   *  :     :    C2C     :  SoC  : *          *  :  SoC  :    C2C     :     : *
   *  :::::::            ::::::::: *          *  :::::::::            ::::::: *
   *     |                   |     *          *      |                   |    *
   *     |                   |     *          *      |                   |    *
   *  &&&&&&&&           &&&&&&&&  *          *   &&&&&&&&           &&&&&&&& *
   *  & GMEM &           & CMEM &  *          *   & CMEM &           & GMEM & *
   *  &&&&&&&&           &&&&&&&&  *          *   &&&&&&&&           &&&&&&&& *
   *                               *          *                               *
   *********************************          *********************************

   GMEM = GPU Memory (e.g. HBM)
   CMEM = CPU Memory (e.g. LPDDR5X)

  |
  | The following table shows the traffic coverage of Grace SoC PMU in socket-A:

  ::

   +--------------+-------+-----------+-----------+-----+----------+----------+
   |              |                        Source                             |
   +              +-------+-----------+-----------+-----+----------+----------+
   | Destination  |       |GPU ATS    |GPU Not-ATS|     | Socket-B | Socket-B |
   |              |PCI R/W|Translated,|Translated | CPU | CPU/PCIE1| GPU/PCIE2|
   |              |       |EGM        |           |     |          |          |
   +==============+=======+===========+===========+=====+==========+==========+
   | Local        | PCIE  |NVLink-C2C0|NVLink-C2C1| SCF | SCF PMU  | CNVLink  |
   | SYSRAM/CMEM  | PMU   |PMU        |PMU        | PMU |          | PMU      |
   +--------------+-------+-----------+-----------+-----+----------+----------+
   | Local GMEM   | PCIE  |    N/A    |NVLink-C2C1| SCF | SCF PMU  | CNVLink  |
   |              | PMU   |           |PMU        | PMU |          | PMU      |
   +--------------+-------+-----------+-----------+-----+----------+----------+
   | Remote       | PCIE  |NVLink-C2C0|NVLink-C2C1| SCF |          |          |
   | SYSRAM/CMEM  | PMU   |PMU        |PMU        | PMU |   N/A    |   N/A    |
   | over CNVLink |       |           |           |     |          |          |
   +--------------+-------+-----------+-----------+-----+----------+----------+
   | Remote GMEM  | PCIE  |NVLink-C2C0|NVLink-C2C1| SCF |          |          |
   | over CNVLink | PMU   |PMU        |PMU        | PMU |   N/A    |   N/A    |
   +--------------+-------+-----------+-----------+-----+----------+----------+

   PCIE1 traffic represents strongly ordered (SO) writes.
   PCIE2 traffic represents reads and relaxed ordered (RO) writes.

* **NVIDIA Grace CPU Superchip**: two Grace CPU SoCs are connected.

  Example configuration with two Grace SoCs::

   *******************             *******************
   * SOCKET-A        *             * SOCKET-B        *
   *                 *             *                 *
   *    ::::::::     *             *    ::::::::     *
   *    : PCIE :     *             *    : PCIE :     *
   *    ::::::::     *             *    ::::::::     *
   *        |        *             *        |        *
   *        |        *             *        |        *
   *    :::::::::    *             *    :::::::::    *
   *    :       :    *             *    :       :    *
   *    : Grace :<--------NVLink------->: Grace :    *
   *    :  SoC  :    *     C2C     *    :  SoC  :    *
   *    :::::::::    *             *    :::::::::    *
   *        |        *             *        |        *
   *        |        *             *        |        *
   *     &&&&&&&&    *             *     &&&&&&&&    *
   *     & CMEM &    *             *     & CMEM &    *
   *     &&&&&&&&    *             *     &&&&&&&&    *
   *                 *             *                 *
   *******************             *******************

   CMEM = CPU Memory (e.g. LPDDR5X)

  |
  | The following table shows the traffic coverage of Grace SoC PMU in socket-A:

  ::

   +-----------------+-----------+---------+----------+-------------+
   |                 |                      Source                  |
   +                 +-----------+---------+----------+-------------+
   | Destination     |           |         | Socket-B | Socket-B    |
   |                 |  PCI R/W  |   CPU   | CPU/PCIE1| PCIE2       |
   |                 |           |         |          |             |
   +=================+===========+=========+==========+=============+
   | Local           |  PCIE PMU | SCF PMU | SCF PMU  | NVLink-C2C0 |
   | SYSRAM/CMEM     |           |         |          | PMU         |
   +-----------------+-----------+---------+----------+-------------+
   | Remote          |           |         |          |             |
   | SYSRAM/CMEM     |  PCIE PMU | SCF PMU |   N/A    |     N/A     |
   | over NVLink-C2C |           |         |          |             |
   +-----------------+-----------+---------+----------+-------------+

   PCIE1 traffic represents strongly ordered (SO) writes.
   PCIE2 traffic represents reads and relaxed ordered (RO) writes.