.. SPDX-License-Identifier: GPL-2.0

=====================================
Scaling in the Linux Networking Stack
=====================================

This document describes a set of complementary techniques in the Linux
networking stack to increase parallelism and improve performance for
multi-processor systems.

The following technologies are described:

- RSS: Receive Side Scaling
- RPS: Receive Packet Steering
- RFS: Receive Flow Steering
- Accelerated Receive Flow Steering
- XPS: Transmit Packet Steering

RSS: Receive Side Scaling
=========================

Contemporary NICs support multiple receive and transmit descriptor queues
(multi-queue). On reception, a NIC can send different packets to different
queues to distribute processing among CPUs. Packets for each flow are
steered to a separate receive queue, which in turn can be processed by
separate CPUs. This mechanism is generally known as “Receive-side
Scaling” (RSS). The goal of RSS and the other scaling techniques is to
increase performance uniformly. Multi-queue distribution can also be used
for traffic prioritization, but that is not the focus of these techniques.

The filter used in RSS is typically a hash function over the network
and/or transport layer headers -- for example, a 4-tuple hash over the
IP addresses and TCP ports of a packet. The most common hardware
implementation of RSS uses a 128-entry indirection table where each entry
stores a queue number. The receive queue for a packet is determined by
masking out the low order bits of the computed hash for the packet and
using the result as a key into the indirection table.

Some NICs support symmetric RSS hashing, where swapping the source and
destination addresses and ports yields the same hash. This is beneficial
for applications that monitor TCP/IP flows (IDS, firewalls, etc.) and need
both directions of the flow to land on the same Rx queue (and CPU).
“Symmetric-XOR” is a type of RSS algorithm that achieves this hash
symmetry by XORing the input fields before the hash computation, at the
cost of reduced input entropy.

Some advanced NICs allow steering packets to queues based on programmable
filters. For example, webserver bound TCP port 80 packets can be directed
to their own receive queue. Such “n-tuple” filters can be configured from
ethtool (--config-ntuple).
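
As an illustration only (the interface name, port and queue number are
arbitrary, and the NIC must support ntuple filtering), a filter that
steers webserver traffic to receive queue 2 could look like::

  # ethtool -N eth0 flow-type tcp4 dst-port 80 action 2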

RSS Configuration
-----------------

The driver for a multi-queue capable NIC typically provides a kernel
module parameter for specifying the number of hardware queues to
configure. In the bnx2x driver, for instance, this parameter is called
num_queues. A typical RSS configuration would be to have one receive queue
for each CPU if the device supports enough queues, or otherwise at least
one for each memory domain, where a memory domain is a set of CPUs that
share a particular memory level.

The indirection table of an RSS device, which resolves a queue by masked
hash, is usually programmed by the driver at initialization. The default
mapping is to distribute the queues evenly in the table, but the
indirection table can be retrieved and modified at runtime using ethtool
commands (--show-rxfh-indir and --set-rxfh-indir). Modifying the
indirection table could be done to give different queues different
relative weights.
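
As an illustration (the device name and queue counts are arbitrary and
assume the NIC exposes enough queues), the table can be inspected, spread
evenly over the first 8 queues, or given uneven relative weights::

  # ethtool -x eth0
  # ethtool -X eth0 equal 8
  # ethtool -X eth0 weight 6 2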

Each receive queue has a separate IRQ associated with it. The NIC triggers
this to notify a CPU when new packets arrive on the given queue. The
signaling path for PCIe devices uses message signaled interrupts (MSI-X),
which can route each interrupt to a particular CPU. The active mapping of
queues to IRQs can be determined from /proc/interrupts. By default, an IRQ
may be handled on any CPU. Because a non-negligible part of packet
processing takes place in receive interrupt handling, it is advantageous
to spread receive interrupts between CPUs. To manually adjust the IRQ
affinity of each interrupt see
Documentation/core-api/irq/irq-affinity.rst. Some systems will be running
irqbalance, a daemon that dynamically optimizes IRQ assignments and as a
result may override any manual settings.
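
For example (the device name, IRQ number and CPU mask below are purely
illustrative), the queue IRQs can be located in /proc/interrupts and one
of them pinned to CPU 2 by writing a hex CPU mask::

  # grep eth0 /proc/interrupts
  # echo 4 > /proc/irq/30/smp_affinity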

RSS should be enabled when latency is a concern or whenever receive
interrupt processing forms a bottleneck. Spreading load between CPUs
decreases queue length. For low latency networking, the optimal setting
is to allocate as many queues as there are CPUs in the system (or the
NIC maximum, if lower). The most efficient high-rate configuration
is likely the one with the smallest number of receive queues where no
receive queue overflows due to a saturated CPU, because in default
mode with interrupt coalescing enabled, the aggregate number of
interrupts (and thus work) grows with each additional queue.

Per-cpu load can be observed using the mpstat utility, but note that on
processors with hyperthreading (HT), each hyperthread is represented as
a separate CPU.
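
For example, per-CPU utilization, including the softirq time in which
receive processing runs, can be sampled once per second with mpstat from
the sysstat package::

  # mpstat -P ALL 1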

Modern NICs support creating multiple co-existing RSS configurations
which are selected based on explicit matching rules. This can be very
useful when an application wants to constrain the set of queues receiving
traffic, e.g. for a particular destination port or IP address.

To create an additional RSS context use::

  # ethtool -X eth0 hfunc toeplitz context new

The kernel reports back the ID of the newly created context (context 1 in
the examples below). The new context can be inspected and modified with
the same commands as the default configuration, with an additional
context argument, for example to spread its traffic over only the first
two queues::

  # ethtool -x eth0 context 1
  # ethtool -X eth0 equal 2 context 1
  # ethtool -x eth0 context 1

To make use of the new context direct traffic to it using an n-tuple
filter::

  # ethtool -N eth0 flow-type tcp6 dst-port 22 context 1

The kernel reports back the ID of the newly added rule (1023 in this
example). When done, remove the rule and the context::

  # ethtool -N eth0 delete 1023
  # ethtool -X eth0 context 1 delete

RPS: Receive Packet Steering
============================

Receive Packet Steering (RPS) is logically a software implementation of
RSS. Whereas RSS selects the queue and hence the CPU that will run the
hardware interrupt handler, RPS selects the CPU to perform protocol
processing above the interrupt handler. This is accomplished by placing
the packet on the desired CPU’s backlog queue and waking up the CPU for
processing. RPS has some advantages over RSS: 1) it can be used with any
NIC, 2) software filters can easily be added to hash over new protocols,
3) it does not increase the hardware device interrupt rate (although it
does introduce inter-processor interrupts (IPIs)).

RPS is called during the bottom half of the receive interrupt handler,
when a driver sends a packet up the network stack with netif_rx() or
netif_receive_skb(). These call the get_rps_cpu() function, which
selects the queue that should process a packet.

The first step in determining the target CPU for RPS is to calculate a
flow hash over the packet’s addresses or ports (2-tuple or 4-tuple hash
depending on the protocol). This serves as a consistent hash of the
associated flow of the packet. The hash is saved in skb->hash and can be
used elsewhere in the stack as a hash of the packet’s flow.

Each receive hardware queue has an associated list of CPUs to which RPS
may enqueue packets for processing. For each received packet, an index
into the list is computed from the flow hash modulo the size of the list.
The indexed CPU is the target for processing the packet, and the packet
is queued to the tail of that CPU’s backlog queue. At the end of the
bottom half routine, IPIs are sent to any CPUs for which packets have
been queued to their backlog queue. The IPI wakes backlog processing on
the remote CPU, and any queued packets are then processed up the
networking stack.

RPS Configuration
-----------------

RPS requires a kernel compiled with the CONFIG_RPS kconfig symbol (on by
default for SMP). Even when compiled in, RPS remains disabled until
explicitly configured. The list of CPUs to which RPS may forward traffic
can be configured for each receive queue using a sysfs file entry::

  /sys/class/net/<dev>/queues/rx-<n>/rps_cpus

This file implements a bitmap of CPUs. RPS is disabled when it is zero
(the default), in which case packets are processed on the interrupting
CPU. Documentation/core-api/irq/irq-affinity.rst explains how CPUs are
assigned to the bitmap.
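
For example (device name, queue number and mask are illustrative), to
allow RPS for receive queue 0 of eth0 to target CPUs 0-3, write the hex
bitmap f::

  # echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus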

For a single queue device, a typical RPS configuration would be to set
rps_cpus to the CPUs in the same memory domain as the interrupting CPU.
If NUMA locality is not an issue, this could also be all CPUs in the
system.

For a multi-queue system, if RSS is configured so that a hardware
receive queue is mapped to each CPU, then RPS is probably redundant
and unnecessary. If there are fewer hardware queues than CPUs, then
RPS might be beneficial if the rps_cpus for each queue are the ones that
share the same memory domain as the interrupting CPU for that queue.

Flow Limit
----------

RPS scales kernel receive processing across CPUs without introducing
reordering. The trade-off to sending all packets from the same flow to
the same CPU is CPU load imbalance if flows vary in packet rate. Flow
limit is an optional RPS feature that prioritizes small flows during CPU
contention by dropping packets from large flows slightly ahead of those
from small flows. Once a CPU's input packet queue exceeds half the
maximum queue length (as set by sysctl net.core.netdev_max_backlog), the
kernel starts a per-flow packet count over the last 256 packets. If a
flow exceeds a set ratio (by default, half) of these packets when a new
packet arrives, then the new packet is dropped. Packets from other flows
are still only dropped once the input packet queue reaches
netdev_max_backlog. No packets are dropped when the input packet queue
length is below the threshold, so flow limit does not sever connections
outright: even large flows maintain connectivity.

Flow limit is compiled in by default (CONFIG_NET_FLOW_LIMIT), but not
turned on. It is implemented for each CPU independently (to avoid lock
and cache contention) and toggled per CPU by setting the relevant bit
in sysctl net.core.flow_limit_cpu_bitmap.

Per-flow rate is calculated by hashing each packet into a hashtable
bucket and incrementing a per-bucket counter. The hash function is the
same that selects a CPU in RPS, but as the number of buckets can be much
larger than the number of CPUs, flow limit has finer-grained
identification of large flows and fewer false positives. The table size
can be adjusted through sysctl net.core.flow_limit_table_len, which is
only consulted when a new table is allocated.
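
For example (the CPU mask and table size are illustrative), the table
length should be set before flow limit is first enabled on any CPU,
since active tables are not resized; flow limit can then be turned on
for CPUs 0-3 with a hex bitmap::

  # sysctl -w net.core.flow_limit_table_len=8192
  # sysctl -w net.core.flow_limit_cpu_bitmap=f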

The feature only takes effect when the input packet queue length exceeds
the flow limit threshold (50%) plus the flow history length (256), so
net.core.netdev_max_backlog must be large enough; values of 1000 or
10000 have performed well in experiments.

RFS: Receive Flow Steering
==========================

While RPS steers packets solely based on the flow hash, RFS steers
kernel processing of packets to the CPU where the application thread
consuming the packets is running, to increase datacache hit rate. RFS
uses two tables: rps_sock_flow_table, a global table recording the
*desired* CPU for each flow (the CPU where the application last
processed the flow), and rps_dev_flow_table, a table specific to each
hardware receive queue of each device. Each table value stores a CPU
index and a counter. The CPU index is the *current* CPU onto which
packets for this flow are enqueued for kernel processing.

The counter records the length of the current CPU's backlog when a
packet in this flow was last enqueued. Each backlog queue has a head
counter that is incremented on dequeue. A tail counter is computed as
head counter + queue length. In other words, the counter in
rps_dev_flow[i] records the last element in flow i that has been
enqueued onto the currently designated CPU for flow i.

When selecting the CPU for packet processing (from get_rps_cpu()), the
rps_sock_flow table and the rps_dev_flow table of the queue that the
packet was received on are compared. If the desired CPU matches the
current CPU, the packet is enqueued onto that CPU's backlog. If they
differ, the current CPU is updated to match the desired CPU if one of
the following is true:

  - The current CPU's queue head counter >= the recorded tail counter
    value in rps_dev_flow[i]
  - The current CPU is unset (>= nr_cpu_ids)
  - The current CPU is offline

After this check, the packet is sent to the (possibly updated) current
CPU, ensuring that a flow only moves to a new CPU when there are no
packets outstanding on the old CPU, as outstanding packets could
otherwise arrive out of order.

RFS Configuration
-----------------

RFS remains disabled until explicitly configured. The number of entries
in the global flow table is set through
/proc/sys/net/core/rps_sock_flow_entries, and the number of entries in
the per-queue flow table is set through::

  /sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt

Both of these need to be set before RFS is enabled for a receive queue.

For a single queue device, the rps_flow_cnt value for the single queue
would normally be configured to the same value as rps_sock_flow_entries.
For a multi-queue device, the rps_flow_cnt for each queue might be
configured as rps_sock_flow_entries / N, where N is the number of
queues. For instance, if rps_sock_flow_entries is set to 32768 and there
are 16 configured receive queues, rps_flow_cnt for each queue might be
configured as 2048.
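
Continuing that example (the device name and values are illustrative),
a system with 16 receive queues could be configured as::

  # echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
  # for q in /sys/class/net/eth0/queues/rx-*; do
  >     echo 2048 > $q/rps_flow_cnt
  > done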

Accelerated RFS
===============

Accelerated RFS is to RFS what RSS is to RPS: a hardware-accelerated load
balancing mechanism that uses soft state to steer flows based on where
the application thread consuming the packets of each flow is running. To
enable it, the networking stack calls the ndo_rx_flow_steer driver
function to communicate the desired hardware queue for packets matching
a particular flow. The network stack calls this function every time a
flow entry in rps_dev_flow_table is updated, and the driver in turn
programs the NIC to steer the packets.

The hardware queue for a flow is derived from the CPU recorded in
rps_dev_flow_table. The stack consults a CPU to hardware queue map which
is maintained by the NIC driver. This is an auto-generated reverse map of
the IRQ affinity table shown by /proc/interrupts. Drivers can use
functions in the cpu_rmap (“CPU affinity reverse map”) kernel library
to populate the map. For each CPU, the corresponding queue in the map is
set to be one whose processing CPU is closest in cache locality.

Accelerated RFS Configuration
-----------------------------

Accelerated RFS is only available if the kernel is compiled with
CONFIG_RFS_ACCEL and support is provided by the NIC device and driver.
It also requires that ntuple filtering is enabled via ethtool. The map
of CPU to queues is automatically deduced from the IRQ affinities
configured for each receive queue by the driver, so no additional
configuration should be necessary.
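
The ntuple feature can be toggled with ethtool (the device name here is
illustrative)::

  # ethtool -K eth0 ntuple on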

XPS: Transmit Packet Steering
=============================

Transmit Packet Steering is a mechanism for intelligently selecting
which transmit queue to use when transmitting a packet on a multi-queue
device. This is accomplished by recording two kinds of maps, either
a mapping of CPUs to hardware queue(s) or a mapping of receive queue(s)
to hardware transmit queue(s).

1. XPS using CPUs map

The goal of this mapping is usually to assign queues exclusively to a
subset of CPUs, where the transmit completions for these queues are
processed on a CPU within this set. This choice provides two benefits.
First, contention on the device queue lock is significantly reduced
since fewer CPUs contend for the same queue (contention can be
eliminated completely if each CPU has its own transmit queue). Secondly,
cache miss rate on transmit completion is reduced, in particular for
data cache lines that hold the sk_buff structures.

2. XPS using receive queues map

This mapping is used to pick the transmit queue based on the receive
queue(s) map configuration set by the administrator. A set of receive
queues can be mapped to a set of transmit queues (many:many), although
the common use case is a 1:1 mapping. This enables sending packets
on the same queue associations for transmit and receive. This is useful for
busy polling multi-threaded workloads where there are challenges in
associating a given CPU with a given application thread. The application
threads are not pinned to CPUs and each thread handles packets
received on a single queue. The receive queue number is cached in the
socket for the connection. Sending packets on the transmit queue
corresponding to the associated receive queue keeps CPU overhead low:
transmit completion work is locked into the same queue-association that
a given application is polling on, which avoids the overhead of
triggering an interrupt on another CPU.

XPS is configured per transmit queue by setting a bitmap of
CPUs/receive-queues that may use that queue to transmit. The reverse
mapping, from CPUs to transmit queues or from receive queues to transmit
queues, is computed and maintained for each network device. When
transmitting the first packet in a flow, the function get_xps_queue() is
called to select a queue. This function uses the ID of the receive queue
for the socket connection as a key into the receive queue-to-transmit
queue lookup table. Alternatively, it can use the ID of the running CPU
as a key into the CPU-to-queue lookup table. If the ID matches a single
queue, that is used for transmission. If multiple queues match, one is
selected by using the flow hash to compute an index into the set. When
selecting the transmit queue based on the receive queue(s) map, the
transmit queue is not validated against the running CPU, as that would
require an expensive lookup in the datapath.

The queue chosen for transmitting a particular flow is saved in the
corresponding socket structure for the flow (e.g. a TCP connection).
This transmit queue is used for subsequent packets sent on the flow to
prevent out of order (ooo) packets. To avoid ooo packets, the queue for
a flow can subsequently only be changed if skb->ooo_okay is set for a
packet in the flow. This flag indicates that there are no outstanding
packets in the flow, so the transmit queue can change without the risk
of generating out of order packets.

XPS Configuration
-----------------

It is driver dependent whether, and how, XPS is configured at device
init. The mapping of CPUs/receive-queues to transmit queue can be
inspected and configured using sysfs.

For selection based on CPUs map::

  /sys/class/net/<dev>/queues/tx-<n>/xps_cpus

For selection based on receive-queues map::

  /sys/class/net/<dev>/queues/tx-<n>/xps_rxqs
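
For example (device, queue numbers and masks are illustrative), transmit
queue 0 could be reserved for CPUs 0-1 via the CPUs map, or alternatively
tied to receive queue 0 via the receive-queues map::

  # echo 3 > /sys/class/net/eth0/queues/tx-0/xps_cpus
  # echo 1 > /sys/class/net/eth0/queues/tx-0/xps_rxqs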

For a network device with a single transmission queue, XPS configuration
has no effect, since there is no choice in this case. In a multi-queue
system, XPS is preferably configured so that each CPU maps onto one
queue. If there are as many queues as there are CPUs in the system, each
queue can also map onto one CPU, resulting in exclusive pairings that
experience no contention. If there are fewer queues than CPUs, then the
best CPUs to share a given queue are probably those that share the cache
with the CPU that processes transmit completions for that queue
(transmit interrupts).

For transmit queue selection based on receive queue(s), XPS has to be
explicitly configured by mapping receive queue(s) to transmit queue(s).
If the user configuration for the receive-queue map does not apply, the
transmit queue is selected based on the CPUs map.

Per TX Queue rate limitation
============================

These are rate-limitation mechanisms implemented by hardware, where
currently a max-rate attribute is supported, by setting a Mbps value
to::

  /sys/class/net/<dev>/queues/tx-<n>/tx_maxrate

A value of zero means disabled, and this is the default.
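
For example (device, queue and rate are illustrative), transmit queue 0
can be capped at 1 Gbit/s, i.e. 1000 Mbps::

  # echo 1000 > /sys/class/net/eth0/queues/tx-0/tx_maxrate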

Authors:

- Tom Herbert ([email protected])
- Willem de Bruijn ([email protected])