Introduction
Figure 6-1 depicts a simple scale-out backend network for an AI data center. The topology follows a modular design, allowing the network to scale out or in as needed. The smallest building block in this example is a segment, which consists of two nodes, two rail (leaf) switches, and one spine switch. Each node in the segment is equipped with a dual-port UET NIC and two GPUs.
Within a segment, GPUs are connected to the leaf switches using a rail-based topology. For example, in Segment 1A, the communication path between GPU 0 on Node A1 and GPU 0 on Node A2 uses Rail A0 (Leaf 1A-1). Similarly, GPU 1 on both nodes connects to Rail A1 (Leaf 1A-2). In this example, we assume that intra-node GPU collective communication takes place over an internal, high-bandwidth scale-up network (such as NVLink), so traffic destined for a GPU on another rail can first be moved within the node to the GPU attached to that rail. As a result, intra-segment GPU traffic never reaches the spine layer. Communication between segments is carried over the spine layer.
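The rail-alignment rule can be summarized in a few lines of code. The following Python sketch is illustrative only: the node, GPU, and leaf names are taken from the figure description, and the path-selection logic is a simplified assumption, not an implementation of any real stack.

```python
# Minimal sketch of rail-aligned forwarding in Segment 1A. Names follow the
# figure; the logic is a simplified assumption, not a real network stack.

RAIL_MAP = {0: "Rail A0 (Leaf 1A-1)", 1: "Rail A1 (Leaf 1A-2)"}  # GPU index -> rail


def transfer_path(src_node: str, src_gpu: int, dst_node: str, dst_gpu: int) -> str:
    """Return the path a GPU-to-GPU transfer takes in this design."""
    if src_node == dst_node:
        # Intra-node traffic stays on the scale-up network (e.g., NVLink).
        return "scale-up (NVLink)"
    if src_gpu == dst_gpu:
        # Same rail on both nodes: one leaf hop, the spine is never touched.
        return RAIL_MAP[src_gpu]
    # Cross-rail traffic is first moved over the scale-up network to the
    # local GPU on the destination rail, then sent out on that rail.
    return f"scale-up (NVLink) -> {RAIL_MAP[dst_gpu]}"


print(transfer_path("A1", 0, "A2", 0))  # Rail A0 (Leaf 1A-1)
print(transfer_path("A1", 0, "A2", 1))  # scale-up (NVLink) -> Rail A1 (Leaf 1A-2)
```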
The example network is a best-effort (that is, PFC is not enabled) two-tier, three-stage non-blocking fat-tree topology, where each leaf and spine switch has four 100-Gbps links. Leaf switches have two host-facing links and two inter-switch links, while spine switches have four inter-switch links. All inter-switch and host links are Layer-3 point-to-point interfaces, meaning that no Layer-2 VLANs are used in the example network.
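A quick sanity check of the non-blocking claim, using the link counts and speeds given above (a back-of-the-envelope calculation, not a device configuration):

```python
# Leaf-layer oversubscription check for the example topology.
LINK_GBPS = 100
host_facing_links = 2        # per leaf switch
spine_facing_links = 2       # per leaf switch

downlink_gbps = host_facing_links * LINK_GBPS   # 200 Gbps toward hosts
uplink_gbps = spine_facing_links * LINK_GBPS    # 200 Gbps toward spines

print(downlink_gbps / uplink_gbps)  # 1.0 -> 1:1 oversubscription, non-blocking
```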
Links between a node’s NIC and the leaf switches are Layer-3 point-to-point connections. The IP addressing scheme uses /31 subnets, where the first address is assigned to the host NIC and the second address to the leaf switch interface. These subnets are allocated in a contiguous manner so they can be advertised as a single BGP aggregate route toward the spine layer.
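The addressing scheme can be illustrated with Python's ipaddress module. The parent block 10.0.1.0/26 below is an assumed value for demonstration; only the /31-per-link pattern and the aggregation idea come from the text.

```python
import ipaddress

# Assumed parent block for one leaf's host links, carved into contiguous /31s.
parent = ipaddress.ip_network("10.0.1.0/26")
host_links = list(parent.subnets(new_prefix=31))

for link in host_links[:2]:
    nic_ip = link.network_address        # first address -> host NIC
    leaf_ip = link.network_address + 1   # second address -> leaf interface
    print(f"{link}  NIC={nic_ip}  Leaf={leaf_ip}")
# 10.0.1.0/31  NIC=10.0.1.0  Leaf=10.0.1.1
# 10.0.1.2/31  NIC=10.0.1.2  Leaf=10.0.1.3

# Because the /31s are contiguous, the leaf can advertise one aggregate
# toward the spine layer instead of a route per host link.
print(list(ipaddress.collapse_addresses(host_links)))  # [IPv4Network('10.0.1.0/26')]
```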
The trade-off of this aggregation model is that a host-link or NIC failure is not visible to the rest of the fabric through a BGP route withdrawal, because the aggregate route remains advertised. Fast failure detection therefore requires additional local mechanisms at the leaf switch.
Although not shown in Figure 6-1, the example design supports a scalable multi-pod architecture. Multiple pods can be interconnected through a super-spine layer, enabling large-scale backend networks.
Note: The OSI label between the GPUs within a node indicates that both GPUs belong to the same Operating System Instance (OSI). The link between the GPUs, in turn, is part of a high-bandwidth domain (the scale-up backend).
Figure 6-1: Example AI DC Backend Network Topology.
Congestion Types
In this text, we categorize congestion into two distinct domains: congestion within nodes (incast, local, and outcast congestion) and congestion in the scale-out backend network (link and network congestion). The following sections describe each congestion type in detail.
Incast Congestion
Figure 6-2 depicts an AllReduce collective communication using a star topology, where Rank 0 on node A1 acts as the central process. During the reduce phase, each participating rank sends its local gradients to Rank 0, which aggregates them (by summation or averaging). The aggregated result is then distributed back to all ranks during the broadcast phase, which is described in the next section.
Note: Tree- and ring-based topologies are significantly more efficient than a star topology and are therefore commonly used in practice. The star topology is shown here purely for demonstration purposes.
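To make the two phases concrete, the sketch below implements a star-topology AllReduce over plain Python lists. The rank IDs and gradient values are illustrative; in practice the gradients live in GPU memory and are moved over the scale-up and scale-out networks.

```python
# Star-topology AllReduce sketch: Rank 0 aggregates all gradients (reduce
# phase) and returns the averaged result to every rank (broadcast phase).

def star_allreduce(local_grads: dict[int, list[float]]) -> dict[int, list[float]]:
    ranks = sorted(local_grads)

    # Reduce phase: every rank sends its gradients to Rank 0, which sums
    # them element-wise and averages the result.
    summed = [0.0] * len(local_grads[ranks[0]])
    for rank in ranks:
        for i, grad in enumerate(local_grads[rank]):
            summed[i] += grad
    averaged = [s / len(ranks) for s in summed]

    # Broadcast phase: Rank 0 distributes the averaged gradients back to all.
    return {rank: list(averaged) for rank in ranks}


grads = {0: [1.0, 2.0], 1: [3.0, 4.0], 2: [5.0, 6.0], 3: [7.0, 8.0]}
print(star_allreduce(grads)[3])  # [4.0, 5.0] -> every rank ends with the same result
```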
Incast congestion occurs when ingress data packets cannot be processed at line rate, causing the ingress buffer on the receiver NIC to overflow. At first glance, this may seem surprising because both ends of the link operate at the same speed. In theory, the receiver NIC should therefore be able to handle all incoming packets at line rate.
In this example, Rank 0 receives six interleaved packet streams at line rate. The scenario takes place during the second neural-network training iteration, by which point all required Packet Delivery Contexts (PDCs) have already been established during the first iteration.
Connections from Rank 0 to Rank 2 and Rank 3 on node A2 both target FEP-2, identified by Fabric Address (FA) 10.1.1.2. On node A2, Rank 2 and Rank 3 execute under the same operating system instance and therefore share the same Front-End Port (FEP). As a result, both processes are exposed to the fabric through the same destination FA.
In Ultra Ethernet Transport, a PDC is defined by the combination of Job ID, destination FA, Traffic Class (TC), and Service Mode (in this case, Reliable Unordered Delivery). When these parameters match, multiple communication channels can reuse the same PDC, even if they originate from different application processes.
In this scenario, both ranks participate in the same distributed training job (Job ID 101) and request identical transport characteristics: Traffic Class Low and Reliable Unordered Delivery. Because the source and destination FEPs are the same and all PDC-defining parameters match, a single PDC—identified by PDCID 12—is sufficient on the Rank 0 side to serve communication with both Rank 2 and Rank 3.
Although Rank 2 and Rank 3 are separate application processes, the NIC multiplexes their packets over the same PDC while preserving the required reliability and delivery semantics at the protocol level. The same PDC reuse logic applies to connections between Rank 0 and other remote processes, provided that the Job ID, destination FA, Traffic Class, and Service Mode remain unchanged.
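The reuse rule can be expressed as a simple key lookup. The sketch below mirrors the matching logic described above with hypothetical field and variable names; it is not UET code.

```python
from typing import NamedTuple


class PdcKey(NamedTuple):
    """A PDC is keyed by Job ID, destination FA, Traffic Class, and Service Mode."""
    job_id: int
    dst_fa: str
    traffic_class: str
    service_mode: str


pdc_table: dict[PdcKey, int] = {}
_next_pdcid = 12                      # illustrative starting PDCID


def get_or_create_pdc(key: PdcKey) -> int:
    """Reuse an existing PDC when all key fields match; otherwise allocate one."""
    global _next_pdcid
    if key not in pdc_table:
        pdc_table[key] = _next_pdcid
        _next_pdcid += 1
    return pdc_table[key]


# Rank 0 -> Rank 2 and Rank 0 -> Rank 3 on node A2: same Job ID (101), same
# destination FA (10.1.1.2), same TC and Service Mode -> one shared PDC.
key = PdcKey(job_id=101, dst_fa="10.1.1.2", traffic_class="LOW", service_mode="RUD")
print(get_or_create_pdc(key))  # 12
print(get_or_create_pdc(key))  # 12 (reused, no new context is created)
```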
For every received packet, the NIC must perform several processing steps. It first examines the PDS header to determine whether an acknowledgment is required and to identify the next protocol header. It then processes the relative address information to determine the requested operation and, in the case of a UET_WRITE, the target memory location. These operations must be performed for every packet across all interleaved packet streams.
When many packets arrive simultaneously from multiple senders, the cumulative per-packet processing load can exceed the NIC’s ingress processing capacity, even though the physical link itself is not oversubscribed. As a result, ingress buffers may overflow, leading to incast congestion.
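A back-of-the-envelope model shows how this can happen even though the link itself is not oversubscribed. All numbers below (packet size, buffer depth, processing capacity) are illustrative assumptions, not UET or NIC specifications.

```python
# Incast sketch: arrivals at line rate vs. a per-packet processing ceiling.
LINK_GBPS = 100
PACKET_BYTES = 4096
BUFFER_PKTS = 2048            # ingress buffer depth, in packets (assumed)
PROCESS_PPS = 2_500_000       # NIC per-packet processing capacity (assumed)

arrival_pps = LINK_GBPS * 1e9 / 8 / PACKET_BYTES   # ~3.05 Mpps at line rate

backlog_pkts = 0.0
for ms in range(1, 11):                            # ten 1-ms intervals
    backlog_pkts += (arrival_pps - PROCESS_PPS) / 1000
    if backlog_pkts > BUFFER_PKTS:
        print(f"ingress buffer overflows after ~{ms} ms -> incast congestion")
        break
```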
Note: In Figure 6-2, each node has a dual-port NIC and a single FEP. The NIC IP addresses are used for routing packets across the backend fabric, while the FA serves as the FEP identifier.
Figure 6-2: Intra-node Congestion - Incast Congestion.
Local Congestion
Local congestion is another form of congestion that occurs within a node. It arises when the High-Bandwidth Memory (HBM) controller, which manages access to the GPU’s memory channels, becomes a bottleneck. The HBM controller arbitrates all read and write requests to GPU memory, regardless of their source. These requests may originate from the GPU’s compute cores, from a peer GPU via NVLink, or from a network interface card (NIC) performing remote memory access (RMA) operations.
With a UET_WRITE operation, the target GPU compute cores are bypassed: the NIC writes data directly into GPU memory using DMA. The GPU does not participate in the data transfer itself, and the NIC handles packet reception and memory writes. Even in this case, however, the data must still pass through the HBM controller, which serves as the shared gateway to the GPU’s memory system.
In Figure 6-3, the HBM controller of Rank 0 receives seven concurrent memory access requests: six inter-node RMA write requests and one intra-node request. The controller must arbitrate among these requests, determining the order and timing of each access. If the aggregate demand exceeds the available memory bandwidth or arbitration capacity, some requests are delayed. These memory-access delays are referred to as local congestion.
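The arbitration problem can be reduced to a comparison of aggregate demand against available memory bandwidth. The sketch below uses arbitrary bandwidth units; only the request mix (six inter-node RMA writes plus one intra-node transfer) follows Figure 6-3.

```python
# Local congestion sketch: aggregate demand on the HBM controller vs. capacity.
AVAILABLE_BW = 100.0                                   # arbitrary bandwidth units

requests = {f"inter_node_rma_write_{i}": 12.0 for i in range(6)}  # via the NIC
requests["intra_node_scale_up"] = 40.0                            # via NVLink

demand = sum(requests.values())                        # 112.0 units
if demand > AVAILABLE_BW:
    slowdown = demand / AVAILABLE_BW
    print(f"demand {demand:.0f} > capacity {AVAILABLE_BW:.0f} "
          f"-> memory accesses delayed by ~{slowdown:.2f}x (local congestion)")
```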
Figure 6-3: Intra-node Congestion - Local Congestion.
Outcast Congestion
Outcast congestion is the third type of congestion observed in collective operations. It occurs when multiple packet streams share the same egress port and some flows are temporarily delayed relative to others. Unlike incast congestion, which is caused by simultaneous arrivals at a receiver, outcast congestion occurs when certain flows dominate the egress resources, causing other flows to experience unfair delays or buffer pressure.
Consider the broadcast phase of the AllReduce operation. After Rank 0 has aggregated the gradients from all participating ranks, it sends the averaged results back to all other ranks. Suppose Rank 0 sends these updates simultaneously to ranks on node A2 and node A3 over the same egress queue of its NIC. If one destination flow slightly exceeds the others in packet rate, the remaining flows experience longer queuing delays or may even be dropped if the egress buffer becomes full. These delayed flows are “outcast” relative to the dominant flows.
In this scenario, the NIC at Rank 0 must perform multiple UET_WRITE operations in parallel, generating high egress traffic toward several remote FEPs. At the same time, the HBM controller on Rank 0 may become a bottleneck because the data must be read from memory to feed the NIC. Thus, local congestion can occur concurrently with outcast congestion, especially during large-scale AllReduce broadcasts where multiple high-bandwidth streams are active simultaneously.
Outcast congestion illustrates that even when the network’s total capacity is sufficient, uneven traffic patterns can cause some flows to be temporarily delayed or throttled. It is mitigated by appropriate egress scheduling and flow-control mechanisms that ensure fair access to shared resources and predictable collective-operation performance. These mechanisms are explained in the upcoming Network-Signaled Congestion Control (NSCC) and Receiver Credit-Based Congestion Control (RCCC) chapters.
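As a generic illustration of fair egress scheduling (not the NSCC or RCCC mechanisms themselves), the following sketch uses deficit round robin: each flow receives an equal byte quantum per round, so a dominant flow cannot monopolize the egress port. Flow names and packet sizes are assumptions for the example.

```python
from collections import deque

QUANTUM_BYTES = 3000                       # per-flow credit per round (assumed)


def drr_schedule(queues: dict[str, deque], rounds: int) -> dict[str, int]:
    """Deficit round robin: return bytes sent per flow after the given rounds."""
    sent = {flow: 0 for flow in queues}
    deficit = {flow: 0 for flow in queues}
    for _ in range(rounds):
        for flow, q in queues.items():
            deficit[flow] += QUANTUM_BYTES
            while q and q[0] <= deficit[flow]:
                pkt_bytes = q.popleft()
                deficit[flow] -= pkt_bytes
                sent[flow] += pkt_bytes
            if not q:                      # an empty queue forfeits leftover credit
                deficit[flow] = 0
    return sent


# One aggressive flow with a deep queue and two lighter flows (1500-byte packets).
queues = {
    "to_node_A2_rank2": deque([1500] * 40),
    "to_node_A2_rank3": deque([1500] * 10),
    "to_node_A3_rank4": deque([1500] * 10),
}
print(drr_schedule(queues, rounds=5))      # each flow gets roughly equal service
```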