Saturday, 27 December 2025

Understanding Congestion in AI Fabric Backend Networks

Introduction

Figure 6-1 depicts a simple scale-out backend network for an AI data center. The topology follows a modular design, which allows the network to scale out or scale in when needed. The smallest building block in this example is a segment, which consists of two nodes, two rail switches, and one spine switch. Each node in the segment is equipped with a dual-port UET NIC and two GPUs.

Within a segment, GPUs are connected to the leaf switches using a rail-based topology. For example, in Segment 1A, the communication path between GPU 0 on Node A1 and GPU 0 on Node A2 uses Rail A0 (Leaf 1A-1). Similarly, GPU 1 on both nodes is connected to Rail A1 (Leaf 1A-2). In this example, we assume that intra-node GPU collective communication takes place over an internal, high-bandwidth scale-up network (such as NVLink). As a result, intra-segment GPU traffic never reaches the spine layer. Communication between segments is carried over the spine layer. 

The example network is a two-tier, three-stage non-blocking fat-tree topology, with each leaf and spine switch having four 100‑Gbps links. Each leaf switch uses two host-facing links and two inter-switch links, while each spine switch uses four inter-switch links. All inter-switch and host links are Layer-3 point-to-point interfaces, meaning that no Layer-2 VLANs are used in the example network. 
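To make the non-blocking claim concrete, the short Python sketch below checks the leaf-level oversubscription ratio using the link counts and speeds given above. The numbers come from the example topology; the calculation itself is simple bookkeeping.

LINK_GBPS = 100
HOST_FACING_PORTS = 2      # leaf ports toward the nodes
UPLINK_PORTS = 2           # leaf ports toward the spine layer

downlink_capacity = HOST_FACING_PORTS * LINK_GBPS   # 200 Gbps toward the nodes
uplink_capacity = UPLINK_PORTS * LINK_GBPS          # 200 Gbps toward the spines

# A ratio of 1.0:1 means the leaf can forward all host traffic to the spine layer
# without queuing caused by undersized uplinks, i.e. a non-blocking design.
print(f"Leaf oversubscription ratio: {downlink_capacity / uplink_capacity:.1f}:1")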

For node connectivity, each leaf (rail) switch is assigned its own /24 subnet. This /24 subnet is further divided into smaller /31 subnets, each providing two IP addresses. These /31 subnets are used on point-to-point links between the leaf switch and the nodes connected to that switch. With this approach, each segment contains two /24 subnets, one per rail switch. This design allows the leaf switch to advertise a single aggregated /24 route toward the spine layer using BGP, instead of advertising individual /31 routes.
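The addressing scheme is easy to illustrate with Python's ipaddress module. The sketch below carves /31 point-to-point subnets out of a leaf switch's /24 block; the 10.0.1.0/24 prefix is a made-up example, not an address taken from Figure 6-1.

import ipaddress

leaf_prefix = ipaddress.ip_network("10.0.1.0/24")

# Each /31 provides exactly two addresses: one for the leaf interface,
# one for the node-facing NIC port on that rail.
p2p_subnets = list(leaf_prefix.subnets(new_prefix=31))

for subnet in p2p_subnets[:2]:                  # the two host-facing links of this leaf
    leaf_ip, node_ip = list(subnet.hosts())     # /31 exposes both addresses as usable (Python 3.8+)
    print(f"{subnet}: leaf={leaf_ip} node={node_ip}")

# Toward the spine layer, BGP advertises only the aggregate instead of the /31s.
print(f"Aggregate advertised via BGP: {leaf_prefix}")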

The trade-off of this aggregation model is that fast detection of host-link or NIC failures cannot rely solely on BGP route withdrawal, because the aggregated /24 route remains advertised even when an individual /31 link goes down. Additional local failure detection mechanisms are therefore required at the leaf switch. The same per-switch IP addressing and aggregation scheme is also applied to the spine switches.

Although not shown in Figure 6-1, the example design supports a scalable multi-pod architecture. Multiple pods can be interconnected through a super-spine layer, enabling large-scale, scalable backend networks.



Figure 6-1: Example of AI DC Backend Network Topology.

Congestion in Datacenter Networks


Figure 6-2 illustrates four common congestion points in an AI fabric. Three of them, outcast, incast, and local congestion, occur at the node level, while link congestion occurs within the switching infrastructure. A fifth type, network congestion, is depicted in Figure 6-3.

Outcast Congestion


Outcast congestion occurs when a process attempts to send more packets onto the wire than the NIC can transmit at line rate. A common example is a multi-threaded GPU process that initiates synchronization with multiple target processes simultaneously.

Consider example 1a in Figure 6-2. The memory region bound to the process on GPU 0 (Rank 0) of Node A1 holds multiple gradient chunks. The process starts gradient synchronization using a ReduceScatter–AllGather collective operation, communicating simultaneously with Ranks 2, 4, and 6. In Ultra Ethernet terminology, these ranks correspond to PIDonFEPs located on distinct Fabric Endpoints (FEPs).

In the Ultra Ethernet Packet Delivery Service (PDS) sublayer, the PDS Manager creates a Packet Delivery Context (PDC) for each connection, based on the tuple (destination Fabric Endpoint, JobID, Traffic Class, Delivery Mode). Although all destination ranks participate in the same job (JobID 101) and use the same Traffic Class (Low) and Delivery Mode (RUD), each target process resides on a different physical node with its own dedicated NIC. As a result, each target requires a unique PDC.
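As a rough illustration of this keying rule, the sketch below models a PDS Manager that creates one PDC per unique (destination FEP, JobID, Traffic Class, Delivery Mode) tuple. The class names, attributes, and FEP identifiers are invented for the example; they are not the UET specification's API.

from dataclasses import dataclass

@dataclass(frozen=True)
class PDCKey:
    dest_fep: str          # destination Fabric Endpoint
    job_id: int
    traffic_class: str
    delivery_mode: str     # e.g. "RUD"

class PDSManager:
    def __init__(self):
        self.pdcs = {}

    def get_or_create_pdc(self, key: PDCKey) -> dict:
        # A PDC is reused only when the full tuple matches; otherwise a new one is created.
        return self.pdcs.setdefault(key, {"key": key, "next_psn": 0})

mgr = PDSManager()
# Rank 0 synchronizes with Ranks 2, 4 and 6: same JobID (101), Traffic Class (Low)
# and Delivery Mode (RUD), but three different destination FEPs -> three PDCs.
for fep in ("fep-node-a2", "fep-node-b1", "fep-node-b2"):      # hypothetical FEP names
    mgr.get_or_create_pdc(PDCKey(fep, 101, "Low", "RUD"))
print(len(mgr.pdcs))   # 3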

If all these PDCs attempt to transmit data at 100 Gbps simultaneously, the egress buffer of the UET NIC on Node A1 (port 0) can overflow. This can lead to outcast congestion, sometimes referred to as head-of-line blocking, where some flows are unfairly starved due to buffer contention.
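The arithmetic behind the overflow is simple, as the back-of-the-envelope sketch below shows: three PDCs offering line-rate traffic to a single 100 Gbps port leave two thirds of the offered load waiting in, or overflowing, the egress buffer. The numbers are the example's, not measured values.

PORT_RATE_GBPS = 100
ACTIVE_PDCS = 3

offered_gbps = ACTIVE_PDCS * PORT_RATE_GBPS       # 300 Gbps offered to port 0
excess_gbps = offered_gbps - PORT_RATE_GBPS       # 200 Gbps must queue or be dropped

print(f"{offered_gbps} Gbps offered to a {PORT_RATE_GBPS} Gbps port: "
      f"{excess_gbps} Gbps backs up in the egress buffer")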

Outcast congestion can also be caused by head-of-line blocking when inter-node, cross-rail communication is forwarded over a high-speed scale-up link, such as NVLink. Example 1b illustrates this scenario.

In this case, Rank 1 on GPU 1 in Node A1 sends data to remote Rank 2, but the traffic is first forwarded over NVLink to GPU 0 for cross-rail transmission. As a result, the NIC associated with GPU 0 becomes the fabric-facing endpoint for this communication. This causes the PDS Manager to create a PDC associated with port P0 on the NIC of GPU 0, even though the original source of the data is Rank 1.

During model training, computation and communication phases often happen simultaneously across ranks. Consequently, Rank 0 and Rank 1 may initiate gradient synchronization at the same time. When both flows are injected through the same NIC egress, the egress buffer on port P0 can become oversubscribed, leading to outcast congestion.

Incast Congestion


Incast congestion occurs when ingress data packets cannot be processed at line rate, causing the ingress buffer on the receiver NIC to overflow. At first, this may seem surprising because both ends of the link operate at the same rate. In theory, the NIC should be able to handle all incoming packets at line rate.

However, the received packets may belong to different flows targeting distinct processes, each associated with a dedicated PDC. As a result, the NIC must interleave packets across multiple flows. In Figure 6-2, for example, rank 7 on Node B1 receives three different flows, each sent by a different process.

For every packet, the NIC first checks the PDS header to determine whether the packet requires an acknowledgment and identifies the next header. It then reads the operation and relative addressing information to determine what is being requested and, in the case of a UET_WRITE, where the data should be written. This work must be repeated for every packet of every flow, which can overwhelm the NIC's ingress logic and lead to incast congestion.
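The per-packet work can be sketched as a small receive loop. The packet fields and helper below are simplified stand-ins, not the actual UET header layout or any NIC vendor API; the point is only that every packet triggers header parsing, an acknowledgment decision, and an address lookup before the payload can be placed in memory.

from dataclasses import dataclass

@dataclass
class UETPacket:                 # simplified, illustrative packet model
    pdc_id: int
    needs_ack: bool              # taken from the PDS header
    next_header: str             # e.g. "SES" (semantic sublayer)
    opcode: str                  # e.g. "UET_WRITE"
    rel_address: int             # relative offset into the registered buffer
    payload: bytes

def schedule_ack(pdc_id: int) -> None:
    pass                         # placeholder: queue an ACK on the matching PDC

def process_ingress(pkt: UETPacket, memory_regions: dict) -> None:
    if pkt.needs_ack:                # 1. PDS check: does this packet require an ACK?
        schedule_ack(pkt.pdc_id)
    if pkt.next_header != "SES":     # 2. identify the next header
        return
    if pkt.opcode == "UET_WRITE":    # 3. resolve the operation and target address
        region = memory_regions[pkt.pdc_id]
        region[pkt.rel_address:pkt.rel_address + len(pkt.payload)] = pkt.payload

regions = {1: bytearray(64)}         # one registered buffer per PDC (simplification)
process_ingress(UETPacket(1, True, "SES", "UET_WRITE", 16, b"grad"), regions)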

Local Congestion


Local congestion is a third example of congestion within a node. It occurs when the High-Bandwidth Memory Controller (HBM Ctrl), which manages access to the GPU’s memory lanes, becomes a bottleneck. The HBM controller coordinates all read and write requests to GPU memory, no matter the source. These requests can come from the GPU’s own compute cores, from another GPU via NVLink, or from a network interface card (NIC) using UET_WRITE operations.

With UET_WRITE, the target GPU cores are bypassed: the NIC performs the copy directly into GPU memory. The GPU does not participate in the data transfer, and the NIC handles all network reception and DMA work. However, even in this case, the data must still pass through the HBM controller, because it is the shared gateway to the memory.

In step 3 of Figure 6-2, the HBM controller serving Rank 7's memory receives four simultaneous access requests. The controller must arbitrate between them, deciding the order and timing of each access. If the total demand exceeds the memory system's capacity, some requests are delayed. These delays are what we call local congestion.
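A toy arbitration model makes the bottleneck visible. In the sketch below, four requesters ask for more aggregate bandwidth than the memory system can deliver, so the controller scales everyone down; the bandwidth figures are invented for illustration and are not real HBM or NVLink numbers.

def arbitrate(requests_gbps: dict, hbm_capacity_gbps: float) -> dict:
    total = sum(requests_gbps.values())
    if total <= hbm_capacity_gbps:
        return dict(requests_gbps)            # everyone is served at the requested rate
    scale = hbm_capacity_gbps / total         # proportional (fair-share) arbitration
    return {src: rate * scale for src, rate in requests_gbps.items()}

granted = arbitrate(
    {"gpu-cores": 400, "nvlink-peer": 300, "nic-uet-write-a": 100, "nic-uet-write-b": 100},
    hbm_capacity_gbps=800,
)
for src, rate in granted.items():
    print(f"{src}: granted {rate:.0f} Gbps")  # every request is slowed down -> local congestion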

Link Congestion

Traffic in a distributed neural network training workload is dominated by bursty, long-lived elephant flows. These flows are aligned with the application’s compute–communication phases: during the forward pass, network traffic is minimal, while during the backward pass, each GPU sends large data transfers at or near line rate. Due to the nature of the backward pass in backpropagation, weight updates can only be computed after gradient synchronization across all workers is complete. For this reason, even a single congested link can delay the entire training step.
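Because the step is gated by the slowest synchronization, communication time is effectively a maximum over all ranks, as the short sketch below illustrates with made-up timings in which one rank sits behind a congested link.

# Hypothetical per-rank gradient synchronization times for one training step (ms).
sync_ms = {"rank0": 12.0, "rank2": 12.4, "rank4": 13.1, "rank6": 41.7}  # rank6 behind a congested link

slowest = max(sync_ms, key=sync_ms.get)
print(f"Step communication time: {sync_ms[slowest]} ms, dictated by {slowest}")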

In a routed, best-effort fat-tree Clos fabric, link congestion may be caused by Equal-Cost Multi-Path (ECMP) collision. ECMP commonly uses a 5-tuple hash consisting of the source and destination IP addresses, the protocol, and the source and destination ports to select the outgoing path for each flow. During the backward pass, a single rank may have multiple gradient chunks that are synchronized with several remote ranks, creating a point-to-multipoint pattern.

For example, suppose rank 0 and rank 2 in segment 1 start gradient synchronization simultaneously with rank 4 and rank 6 in segment 2. Both source ranks are on rail 0, so Leaf 1A-1 (rail 0) receives four flows. The ECMP algorithm may map three of these flows to the uplink toward Spine 1A and only one flow to Spine 1B, resulting in uneven traffic distribution. This can cause egress buffer overflow on Leaf 1A-1 toward Spine 1A.
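The collision can be reproduced with a toy hash-based path selector. The addresses, ports, and CRC32 hash below are placeholders (a real switch uses a vendor-specific hash over the fabric's actual headers); the sketch only shows how four flows and two uplinks can easily end up in a three-to-one split.

import zlib

UPLINKS = ["Spine-1A", "Spine-1B"]

def ecmp_pick(src_ip, dst_ip, proto, src_port, dst_port):
    # Hash the 5-tuple and map it onto one of the equal-cost uplinks.
    key = f"{src_ip},{dst_ip},{proto},{src_port},{dst_port}".encode()
    return UPLINKS[zlib.crc32(key) % len(UPLINKS)]

flows = [
    ("10.0.1.1", "10.0.3.1", "UDP", 49152, 4791),   # rank 0 -> rank 4
    ("10.0.1.1", "10.0.3.3", "UDP", 49153, 4791),   # rank 0 -> rank 6
    ("10.0.1.3", "10.0.3.1", "UDP", 49154, 4791),   # rank 2 -> rank 4
    ("10.0.1.3", "10.0.3.3", "UDP", 49155, 4791),   # rank 2 -> rank 6
]

for f in flows:
    print(f"{f[0]} -> {f[1]}: {ecmp_pick(*f)}")
# With similar, low-entropy headers like these, the hash may send three of the four
# flows to the same spine, overloading one 100 Gbps uplink while the other idles.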

In a large-scale AI fabric, there may be thousands of flows, and low entropy in the traffic (many flows sharing similar IP addresses and ports) increases the likelihood of ECMP collisions. As a result, link utilization can become unbalanced, leading to temporary congestion even in a non-blocking network.


Figure 6-2: Congestion Types in AI Fabric Backend Network – Outcast, Incast, Local and Link Congestion.


Network Congestion

Common causes of network congestion include an overly high oversubscription ratio, ECMP collisions, and link or device failures. A less obvious but important source of short-term congestion is Priority Flow Control (PFC), which is commonly used to build lossless Ethernet networks. PFC, together with Explicit Congestion Notification (ECN), forms the foundation of lossless Ethernet for RoCEv2, but it should be avoided in a UET-enabled best-effort network. The upcoming chapters explain why.

PFC relies on two buffer thresholds to control traffic flow: xOFF and xON. The xOFF threshold defines the point at which a switch generates a pause frame when a priority queue becomes congested. A pause frame is an Ethernet MAC control frame that tells the upstream device which Traffic Class (TC) queue is congested and for how long packet transmission for that TC should be paused. Packets belonging to other traffic classes can still be forwarded normally. Once the buffer occupancy drops below the xON threshold, the switch sends a resume signal, allowing traffic for that priority queue to continue.
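A minimal watermark model of this behavior, assuming simplified cell-based thresholds and ignoring pause quanta, could look like the following sketch.

class PfcQueue:
    """Per-priority queue with xOFF/xON watermarks (simplified, cell-based)."""
    def __init__(self, xoff_cells: int, xon_cells: int):
        self.xoff, self.xon = xoff_cells, xon_cells
        self.occupancy = 0
        self.upstream_paused = False

    def enqueue(self, cells: int, send_pfc) -> None:
        self.occupancy += cells
        if not self.upstream_paused and self.occupancy >= self.xoff:
            send_pfc(pause=True)              # pause frame for this Traffic Class only
            self.upstream_paused = True

    def dequeue(self, cells: int, send_pfc) -> None:
        self.occupancy = max(0, self.occupancy - cells)
        if self.upstream_paused and self.occupancy <= self.xon:
            send_pfc(pause=False)             # resume: upstream may transmit this TC again
            self.upstream_paused = False

q = PfcQueue(xoff_cells=800, xon_cells=400)
notify = lambda pause: print("PAUSE TC-Low" if pause else "RESUME TC-Low")
q.enqueue(1000, notify)     # occupancy crosses xOFF -> PAUSE
q.dequeue(700, notify)      # occupancy falls below xON -> RESUME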

At first sight, PFC appears to affect only a single link and only a specific traffic class. In practice, however, a PFC pause can trigger a chain reaction across the network. For example, if the occupancy of the TC-Low egress buffer on the interface toward Rank 7 on leaf switch 1B-1 exceeds the xOFF threshold, the switch sends PFC pause frames to both connected spine switches, instructing them to temporarily hold TC-Low packets in their buffers. As the egress buffers for TC-Low on the spine switches begin to fill and cross their own xOFF thresholds, they in turn send PFC pause frames to the rest of the leaf switches.

This behavior can quickly spread congestion beyond the original point of contention. In the worst case, multiple switches and links may experience temporary pauses. Once buffer occupancy drops below the xON threshold, Leaf switch 1B-1 sends resume signals, and traffic gradually recovers as normal transmission resumes. Even though the congestion episode is short, it disrupts collective operations and negatively impacts distributed training performance.
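The hop-by-hop spread can be approximated with a small graph walk. Assuming, for illustration, that every switch that keeps buffering paused TC-Low traffic eventually crosses its own xOFF threshold, the pause propagates upstream as in the sketch below; the topology dictionary is a simplified stand-in for the fabric in Figure 6-3.

# Which devices feed TC-Low traffic into each switch (simplified, illustrative).
upstream_of = {
    "Leaf-1B-1": ["Spine-1A", "Spine-1B"],
    "Spine-1A": ["Leaf-1A-1", "Leaf-1A-2"],
    "Spine-1B": ["Leaf-1A-1", "Leaf-1A-2"],
}

def propagate_pause(start: str):
    """Breadth-first walk: each paused switch eventually pauses its own upstream peers."""
    visited, frontier = {start}, [start]
    while frontier:
        node = frontier.pop(0)
        for peer in upstream_of.get(node, []):
            if peer not in visited:
                visited.add(peer)
                frontier.append(peer)
                yield node, peer

for src, dst in propagate_pause("Leaf-1B-1"):
    print(f"{src} pauses TC-Low toward {dst}")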

The upcoming chapters explain how Ultra Ethernet Network-Signal Congestion Control (NSCC) and Receiver-Credit Congestion Control (RCCC) manage the amount of data that sources are allowed to send over the network, maximizing network utilization while avoiding congestion. The next chapters also describe how Explicit Congestion Notification (ECN), Packet Trimming, and Entropy Value-based Packet Spraying, when combined with NSCC and RCCC, contribute to a self-adjusting, reliable backend network.

Figure 6-3: Congestion Type in AI Fabric Backend Network – Network Congestion.

