Introduction
Figure 6-1 depicts a simple scale-out backend network for an AI data center. The topology follows a modular design, allowing the network to scale out or scale in as needed. The smallest building block in this example is a segment, which consists of two nodes, two rail switches, and one spine switch. Each node in the segment is equipped with a dual-port UET NIC and two GPUs.
Within a segment, GPUs are connected to the leaf switches using a rail-based topology. For example, in Segment 1A, the communication path between GPU 0 on Node A1 and GPU 0 on Node A2 uses Rail A0 (Leaf 1A-1). Similarly, GPU 1 on both nodes is connected to Rail A1 (Leaf 1A-2). In this example, we assume that intra-node GPU collective communication takes place over an internal, high-bandwidth scale-up network (such as NVLink). As a result, intra-segment GPU traffic never reaches the spine layer. Communication between segments is carried over the spine layer.
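As a rough illustration of the rail concept (the names below are taken from the figure, while the function and data structure are purely hypothetical), the sketch shows that the leaf serving a GPU depends only on the GPU's local index, not on the node, which is why same-rail traffic between nodes in a segment stays below the spine layer.

```python
# Hypothetical sketch of rail-aligned connectivity in Segment 1A.
# GPU index N on every node attaches to the same rail (and thus the
# same leaf switch), so same-rail traffic between Node A1 and Node A2
# never needs to cross the spine layer.

RAILS_1A = {
    0: "Leaf 1A-1 (Rail A0)",
    1: "Leaf 1A-2 (Rail A1)",
}

def leaf_for_gpu(node: str, gpu_index: int) -> str:
    """Return the leaf serving this GPU's rail; note the node is irrelevant."""
    return RAILS_1A[gpu_index]

# GPU 0 on Node A1 and GPU 0 on Node A2 land on the same leaf:
assert leaf_for_gpu("A1", 0) == leaf_for_gpu("A2", 0)
```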
The example network is a best-effort (that is, PFC is not enabled) two-tier, three-stage non-blocking fat-tree topology, where each leaf and spine switch has four 100-Gbps links. Leaf switches have two host-facing links and two inter-switch links, while spine switches have four inter-switch links. All inter-switch and host links are Layer-3 point-to-point interfaces, meaning that no Layer-2 VLANs are used in the example network.
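The non-blocking claim follows directly from the link counts; the short calculation below simply restates that arithmetic for one leaf switch.

```python
# Quick check that the example leaf layer is non-blocking (1:1).
LINK_GBPS = 100
HOST_FACING_LINKS = 2     # per leaf switch
INTER_SWITCH_LINKS = 2    # per leaf switch, toward the spine layer

downlink_gbps = HOST_FACING_LINKS * LINK_GBPS    # 200 Gbps toward hosts
uplink_gbps = INTER_SWITCH_LINKS * LINK_GBPS     # 200 Gbps toward spines

ratio = downlink_gbps / uplink_gbps
print(f"Oversubscription ratio {ratio:.1f}:1")   # 1.0:1 -> non-blocking
```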
Links between a node’s NIC and the leaf switches are Layer-3 point-to-point connections. The IP addressing scheme uses /31 subnets, where the first address is assigned to the host NIC and the second address to the leaf switch interface. These subnets are allocated in a contiguous manner so they can be advertised as a single BGP aggregate route toward the spine layer.
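The following sketch illustrates the addressing scheme with Python's ipaddress module. The 10.0.0.0/30 block is an assumed example prefix for one leaf's two host-facing links; the actual addressing plan is not specified here.

```python
import ipaddress

# Assumed example block for one leaf's two host-facing links; carving the
# /31s out of one contiguous block lets the leaf advertise a single
# aggregate toward the spines instead of one route per host link.
aggregate = ipaddress.ip_network("10.0.0.0/30")

for idx, link in enumerate(aggregate.subnets(new_prefix=31)):
    nic_ip, leaf_ip = list(link.hosts())   # /31: first address -> NIC, second -> leaf
    print(f"host link {idx}: NIC {nic_ip}/31  <->  leaf {leaf_ip}/31")

print(f"Aggregate advertised toward the spines: {aggregate}")
```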
The trade-off of this aggregation model is that host-link or NIC failures cannot rely solely on BGP route withdrawal for fast failure detection. Additional local failure-detection mechanisms are therefore required at the leaf switch.
Although not shown in Figure 6-1, the example design supports a scalable multi-pod architecture. Multiple pods can be interconnected through a super-spine layer, enabling large-scale backend networks.
Note: The OSI label between the GPUs within a node indicates that both GPUs belong to the same Operating System Instance (OSI). The link between the GPUs, in turn, is part of a high-bandwidth domain (the scale-up backend network).
Figure 6-1: Example of an AI DC Backend Network Topology.
Congestion Types
In this text, we categorize congestion into two distinct domains: congestion within nodes, which includes incast, local, and outcast congestion, and congestion in scale-out backend networks, which includes link and network congestion. The following sections describe each congestion type in detail.
Incast Congestion
In high-performance networking, Incast is a specific type of congestion that occurs when a many-to-one communication pattern overwhelms a single network point. This is fundamentally a "fan-in" problem, where the traffic volume destined for a single receiver exceeds both the physical line rate of the last-hop switch's egress interface and the storage capacity of its output buffers.
To visualize this, consider the configuration in Figure 6-2. The setup consists of four UET Nodes (A1, A2, B1, and B2), each containing two GPUs. This results in eight total processing units, labeled Rank 0 through Rank 7. Each Rank is equipped with its own dedicated 100G NIC.
The bottleneck forms when multiple sources target a single destination simultaneously. In this scenario, Ranks 1 through 7 all begin transmitting data to Rank 0 at the exact same time, each at a 100G line rate.
The backbone of the network is typically robust enough to handle this aggregate traffic. If the switches are connected via 400G or 800G links, the core of the network stays clear and fast. If the core were to experience congestion, Network Signaled Congestion Control (NSCC) could be enabled to manage it. The specific problem here, however, occurs at Leaf 1A-1, the switch where the target (Rank 0) is connected. Note that Rank 1 uses the high-speed NVLink scale-up connection rather than its Ethernet NIC, so only six ranks (Ranks 2 through 7) send over the network. The switch therefore receives a combined 600G of data destined for Rank 0, while the outgoing interface toward Rank 0 can move only 100G.
A buffer overflow is inevitable when 600G of data arrives at an egress port that can only output 100G. The switch is forced to store the extra 500G of data per second in its internal memory (buffers). Because network buffers are quite small and high-speed data moves incredibly fast, these buffers fill up in microseconds.
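A back-of-the-envelope calculation shows how quickly this happens. The 16 MB buffer allocation below is an assumed figure for illustration only; real shared-buffer sizes vary by switch ASIC.

```python
# How long can the egress buffer absorb the incast overload?
SENDERS = 6                          # Ranks 2-7 transmit over the network
LINE_RATE_GBPS = 100
EGRESS_GBPS = 100
BUFFER_BYTES = 16 * 1024 * 1024      # assumed 16 MB available to this egress queue

ingress_gbps = SENDERS * LINE_RATE_GBPS      # 600 Gbps toward Rank 0
excess_gbps = ingress_gbps - EGRESS_GBPS     # 500 Gbps must be buffered

fill_time_s = (BUFFER_BYTES * 8) / (excess_gbps * 1e9)
print(f"Buffer absorbs the overload for ~{fill_time_s * 1e6:.0f} microseconds")
```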
Once the buffers are full, the switch has no choice but to drop any new incoming packets. This leads to massive retransmission delays and "stuttering" in application performance. This is particularly devastating for AI training workloads, where all Ranks must stay synchronized to maintain efficiency.
While traditional networks use simple buffer management to deal with this, Ultra Ethernet utilizes a more sophisticated approach. To prevent "fan-in" from ever overwhelming the switch buffers in the first place, UET employs Receiver Credit-based Congestion Control (RCCC). This mechanism ensures the receiver remains in control by distributing credits that define exactly how much data each active source is allowed to transmit at any given time.
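The sketch below conveys the receiver-credit idea in its simplest form; it is not the UET RCCC algorithm or wire format, and the even split of capacity is an assumption made purely for illustration.

```python
# Minimal sketch of receiver credit-based pacing (illustrative only,
# not the UET RCCC specification). The receiver splits its 100G of
# sink capacity across the active senders, so the aggregate arriving
# at the last-hop egress port never exceeds the port speed.

RECEIVER_CAPACITY_GBPS = 100

def grant_credits(active_senders: list[str]) -> dict[str, float]:
    """Divide the receiver's capacity evenly among the active senders."""
    share = RECEIVER_CAPACITY_GBPS / len(active_senders)
    return {sender: share for sender in active_senders}

senders = [f"Rank {i}" for i in range(2, 8)]      # Ranks 2-7 in Figure 6-2
credits = grant_credits(senders)

for sender, gbps in credits.items():
    print(f"{sender} may transmit at {gbps:.1f} Gbps")

# The aggregate never exceeds the 100G egress link toward Rank 0.
assert sum(credits.values()) <= RECEIVER_CAPACITY_GBPS
```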
Figure 6-2: Intra-node Congestion - Incast Congestion.
Local Congestion
Local congestion arises when the High-Bandwidth Memory (HBM) controller, which manages access to the GPU’s memory channels, becomes a bottleneck. The HBM controller arbitrates all read and write requests to GPU memory, regardless of their source. These requests may originate from the GPU’s compute cores, from a peer GPU via NVLink, or from a network interface card (NIC) performing remote memory access (RMA) operations.
With a UET_WRITE operation, the target GPU compute cores are bypassed: the NIC writes data directly into GPU memory using DMA. The GPU does not participate in the data transfer itself, and the NIC handles packet reception and memory writes. Even in this case, however, the data must still pass through the HBM controller, which serves as the shared gateway to the GPU’s memory system.
In Figure 6-3, the HBM controller of Rank 0 receives seven concurrent memory access requests: six inter-node RMA write requests and one intra-node request. The controller must arbitrate among these requests, determining the order and timing of each access. If the aggregate demand exceeds the available memory bandwidth or arbitration capacity, some requests are delayed. These memory-access delays are referred to as local congestion.
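A toy model of this arbitration is sketched below. All figures (the bandwidth budget and transfer sizes) are assumptions chosen for illustration, not vendor numbers; the point is only that fair-sharing the controller among seven concurrent requesters stretches each transfer's completion time.

```python
# Toy model of arbitration at the HBM controller (all figures assumed,
# purely illustrative). The controller fair-shares its bandwidth among
# concurrent requesters, so each transfer finishes later than it would
# with the controller to itself -- the delay the text calls local congestion.

HBM_SHARE_GBPS = 800                 # assumed bandwidth budget for these requests
TRANSFER_GBIT = 100                  # assumed size of each pending transfer

requests = [f"inter-node RMA write {i}" for i in range(1, 7)] + ["intra-node request"]
per_request_gbps = HBM_SHARE_GBPS / len(requests)

alone_s = TRANSFER_GBIT / HBM_SHARE_GBPS        # served with no contention
shared_s = TRANSFER_GBIT / per_request_gbps     # served under 7-way arbitration
print(f"{len(requests)} concurrent requests: each ~{alone_s * 1e3:.0f} ms alone "
      f"vs ~{shared_s * 1e3:.0f} ms under arbitration")
```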
Figure 6-3: Intra-node Congestion - Local Congestion.
Outcast Congestion
Outcast congestion is the third type of congestion observed in collective operations. It occurs when multiple packet streams share the same egress port, and some flows are temporarily delayed relative to others. Unlike incast congestion, which arises from simultaneous arrivals at a receiver, outcast happens when certain flows dominate the output resources, causing other flows to experience unfair delays or buffer pressure.
Consider the broadcast phase of the AllReduce operation. After Rank 0 has aggregated the gradients from all participating ranks, it sends the averaged results back to all other ranks. Suppose Rank 0 sends these updates simultaneously to ranks on node A2 and node A3 over the same egress queue of its NIC. If one destination flow slightly exceeds the others in packet rate, the remaining flows experience longer queuing delays or may even be dropped if the egress buffer becomes full. These delayed flows are “outcast” relative to the dominant flows.
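To make the effect of a dominant flow concrete, the small calculation below models a single shared FIFO egress queue. The flow names and offered rates are assumptions for illustration only; with a plain FIFO, each flow's share of the egress link tracks its offered load, so the aggressive flow squeezes the others.

```python
# Toy illustration of outcast at a shared egress queue (rates assumed).
# A single FIFO serves packets in arrival order, so throughput share is
# proportional to offered load; the less aggressive flows fall behind
# and their backlog (queuing delay) grows.

EGRESS_GBPS = 100
offered = {"flow A": 45, "flow B": 30, "flow C": 40}   # Gbps, assumed

total = sum(offered.values())                 # 115 Gbps > 100 Gbps egress
for flow, rate in offered.items():
    served = EGRESS_GBPS * rate / total       # FIFO serves in proportion to arrivals
    print(f"{flow}: offered {rate} Gbps, served {served:.1f} Gbps, "
          f"backlog grows at {rate - served:.1f} Gbps")
```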
In this scenario, the NIC at Rank 0 must perform multiple UET_WRITE operations in parallel, generating high egress traffic toward several remote FEPs. At the same time, the HBM controller on Rank 0 may become a bottleneck because the data must be read from memory to feed the NIC. Thus, local congestion can occur concurrently with outcast congestion, especially during large-scale AllReduce broadcasts where multiple high-bandwidth streams are active simultaneously.
Outcast congestion illustrates that even when the network’s total capacity is sufficient, uneven traffic patterns can cause some flows to be temporarily delayed or throttled. Outcast congestion is mitigated through appropriate egress scheduling and flow-control mechanisms that ensure fair access to shared resources and predictable collective-operation performance. These mechanisms are explained in the upcoming Network-Signaled Congestion Control (NSCC) and Receiver Credit-Based Congestion Control (RCCC) chapters.