Tuesday, 13 January 2026

Ultra Ethernet: Congestion Control Context

 Ultra Ethernet Transport (UET) uses a vendor-neutral, sender-specific congestion window–based congestion control mechanism together with flow-based, adjustable entropy-value (EV) load balancing to manage incast, outcast, local, link, and network congestion events. Congestion control in UET is implemented through coordinated sender-side and receiver-side functions to enforce end-to-end congestion control behavior.

On the sender side, UET relies on the Network-Signaled Congestion Control (NSCC) algorithm. Its main purpose is to regulate how quickly packets are transmitted by a Packet Delivery Context (PDC). The sender adapts its transmission window based on round-trip time (RTT) measurements and Explicit Congestion Notification (ECN) Congestion Experienced (CE) feedback conveyed through acknowledgments from the receiver.

On the receiver side, Receiver Credit-based Congestion Control (RCCC) limits incast pressure by issuing credits to senders. These credits define how much data a sender is permitted to transmit toward the receiver. The receiver also observes ECN-CE markings in incoming packets to detect path congestion. When congestion is detected, the receiver can instruct the sender to change the entropy value, allowing traffic to be steered away from congested paths.

Both sender-side and receiver-side mechanisms ultimately control congestion by limiting the amount of in-flight data, meaning data that has been sent but not yet acknowledged. In UET, this coordination is handled through a Congestion Control Context (CCC). The CCC maintains the congestion control state and determines the effective transmission window, thereby bounding the number of outstanding packets in the network. A single CCC may be associated with one or more PDCs communicating between the same pair of Fabric Endpoints (FEPs) within the same traffic class.


Initializing Congestion Control Context (CCC)

When the PDS Manager receives an RMA operation request from the SES layer, it first checks whether a suitable Packet Delivery Context (PDC) already exists for the JobID, destination FEP, traffic class, and delivery mode. If no matching PDC is found, the PDS Manager allocates a new one.

For the first PDC associated with a specific FEP-to-FEP flow, a Congestion Control Context (CCC) is required to manage end-to-end congestion. The PDS Manager requests this context from the CCC Manager within the Congestion Management Sublayer (CMS). Upon instantiation, the CCC initially enters the IDLE state, containing basic data structures without an active configuration.

The CCC Manager then initializes the context by calculating values and thresholds, such as the Initial Congestion Window (Initial CWND) and Maximum CWND (MaxWnd), using pre-defined configuration parameters. Once these initial source states for the NSCC are set, the CCC is bound to the corresponding PDC.

When fully configured, the CCC transitions to the READY state. This transition signals that the CCC is authorized to enforce congestion control policies and monitor traffic. The CCC serves as the central control structure for congestion management, hosting either sender-side (NSCC) or receiver-side (RCCC) algorithms. Because a CCC is unidirectional, it is instantiated independently on both the sender and the receiver.

Once in the READY state, the PDC is permitted to begin data transmission. The CCC maintains the active state required to regulate flow, enabling the NSCC and RCCC to enforce windows, credits, and path usage to prevent network congestion and optimize transport efficiency.
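The CCC lifecycle described above can be sketched as a small state machine. This is a minimal illustration only; the class and field names are assumptions for the example, not the specification's actual state encoding:

```python
from enum import Enum, auto

class CCCState(Enum):
    IDLE = auto()    # allocated, basic data structures, no active configuration
    READY = auto()   # configured, authorized to enforce congestion control

class CongestionControlContext:
    """Illustrative sketch of the CCC lifecycle (names are assumptions)."""
    def __init__(self):
        self.state = CCCState.IDLE
        self.cwnd = None
        self.max_wnd = None
        self.bound_pdcs = []

    def configure(self, initial_cwnd, max_wnd, pdc_id):
        # The CCC Manager computes Initial CWND / MaxWnd from pre-defined
        # configuration parameters and binds the context to a PDC, after
        # which the CCC transitions from IDLE to READY.
        self.cwnd = initial_cwnd
        self.max_wnd = max_wnd
        self.bound_pdcs.append(pdc_id)
        self.state = CCCState.READY

ccc = CongestionControlContext()
ccc.configure(initial_cwnd=75_000, max_wnd=112_500, pdc_id=12)
print(ccc.state)   # CCCState.READY
```

Once `configure()` has run, the PDC bound to this context would be permitted to start transmitting under the window limits held by the CCC.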

Note: In this model, the PDS Manager acts as the control-plane authority responsible for context management and coordination, while the PDC handles data-plane execution under the guidance of the CCC. Once the CCC is operational, RMA data transfers proceed directly via the PDC without further involvement from the PDS Manager.



Figure 6-6: Congestion Context: Initialization.

Calculating Initial CWND


Following the initialization of the Congestion Control Context (CCC) for a Packet Delivery Context (PDC), specific configuration parameters are used to establish the Initial Congestion Window (CWND) and the Maximum Congestion Window (MaxWnd). 

The Congestion Window (CWND) defines the maximum number of "in-flight" bytes, data that has been transmitted but not yet acknowledged by the receiver. Effectively, the CWND regulates the volume of data allowed on the wire for a specific flow at any given time to prevent network saturation.

The primary element for computing the CWND is the Bandwidth-Delay Product (BDP). To determine the path-specific BDP, the algorithm selects the slowest link speed and multiplies it by the configured base Round-Trip Time (config_base_rtt):

BDP = min(sender.linkspeed, receiver.linkspeed) × config_base_rtt

The config_base_rtt represents the latency over the longest physical path under zero-load conditions. This value is a static constant derived from the cumulative sum of:
  • Serialization delays (time to put bits on the wire)
  • Propagation delays (speed of light through fiber)
  • Switching delays (internal switch traversal)
  • FEC (Forward Error Correction) delays

Setting MaxWnd


The MaxWnd serves as a definitive upper limit for the CWND that cannot be exceeded under any circumstances. It is typically derived by multiplying the calculated BDP by a factor of 1.5. While a CWND equal to 1.0 × BDP is theoretically sufficient to saturate a link, real-world variables, such as transient bursts, scheduling jitter, or variations in switch processing, can cause the link to go idle if the window is too restrictive. UET allows the CWND to grow up to 1.5 × BDP to maintain high utilization and accommodate acknowledgment (ACK) clocking dynamics.

Example Calculation: Consider a flow where the slowest link speed is 100 Gbps and the config_base_rtt is 6.0 µs.

Calculate BDP (Bits): 100 × 10⁹ bps × 0.000006 s = 600,000 bits
Calculate BDP (Bytes): 600,000 / 8 = 75,000 bytes
Calculate MaxWnd: 75,000 × 1.5 = 112,500 bytes
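The worked example above can be reproduced in a few lines. This is a sketch; the function and variable names are chosen for illustration only:

```python
def bdp_bytes(sender_gbps, receiver_gbps, base_rtt_s):
    """Bandwidth-Delay Product in bytes: slowest link speed x base RTT."""
    slowest_bps = min(sender_gbps, receiver_gbps) * 1e9
    return slowest_bps * base_rtt_s / 8   # convert bits to bytes

bdp = bdp_bytes(100, 100, 6.0e-6)   # 100 Gbps links, 6.0 us base RTT
max_wnd = 1.5 * bdp

print(round(bdp))      # 75000 bytes
print(round(max_wnd))  # 112500 bytes

# Sanity check: a window of 1.0 x BDP, replenished every RTT, sustains line rate.
rate_gbps = (bdp * 8) / 6.0e-6 / 1e9
print(round(rate_gbps))  # 100 Gbps
```

The last line mirrors the point made later in this section: the BDP does not cap the rate; a full window per RTT is exactly line rate.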

Note on Incast Prevention: While the "ideal" initial CWND is 1.0 x BDP, UET allows the starting window to be configured to a significantly smaller value (e.g., 10–32 KB or a few MTUs). This configuration prevents Incast congestion, a phenomenon where the aggregate traffic from multiple ingress ports exceeds the physical capacity of an egress port. By starting with a conservative CWND, the system ensures that the switch's egress buffers are not exhausted during the first RTT, providing the NSCC algorithm sufficient time to measure RTT inflation and modulate the flow rates.

A common misconception is that the BDP limits the transmission rate. In reality, the BDP defines the volume of data required to keep the "pipe" full. While the Initial CWND may be only 75,000 bytes, it is replenished every RTT. At a 6.0 µs RTT, this volume translates to a full 100 Gbps line rate:

600,000 bits / 6.0 µs = 600,000 / 0.000006 = 100 × 10⁹ bps = 100 Gbps

Therefore, a window of 1.0 x BDP achieves 100% utilization. The 1.5 x BDP (MaxWnd) simply provides the necessary headroom to prevent the link from going idle during minor acknowledgment delays.

Figure 6-7: CC Config Parameters, Initial CWND and MaxWnd.

Calculating New CWND


When the network is uncongested, indicated by a measured RTT remaining near the base_rtt, the NSCC algorithm performs an Additive Increase (AI) to grow the CWND. To ensure fairness across the entire fabric, the algorithm utilizes a universal Base_BDP parameter rather than the path-specific BDP.

The Base_BDP is a fixed protocol constant (typically 150,000 bytes, derived from a reference 100 Gbps link at 12 µs). The new CWND is calculated by adding a fraction of this constant to the current window:

CWND(new) = CWND(current) + Base_BDP / Scaling Factor

Using a universal constant ensures Scale-Invariance in a mixed-speed fabric (e.g., 100G and 400G NICs). 

If a 400G NIC were to use its own BDP (300,000 bytes) for the increase step, its window would grow four times faster than that of a 100G NIC. By using the shared Base_BDP (150,000 bytes), both NICs increase their throughput by the same number of bytes per second. This "normalized acceleration" prevents faster NICs from starving slower flows during the capacity-seeking phase.

As illustrated in Figure 6-8, consider a flow with an Initial CWND of 75,000 bytes, a Base_BDP of 150,000 bytes, and a Scaling Factor of 1024:

Step Size = 150,000 / 1024 ≈ 146.5 bytes
New CWND = 75,000 + 146.5 = 75,146.5 bytes

Note: Scaling factors are ideally set to powers of 2 (e.g., 512, 1024, 2048, 4096, 8192) to allow the hardware to use fast bit-shifting operations instead of expensive division. 

Higher factors (e.g., 8192): Result in smaller, smoother increments (high stability). 
Lower factors (e.g., 512): Result in larger increments (faster convergence to link rate).
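The additive-increase step, including the bit-shift trick from the note above, can be sketched as follows. The function name is an assumption, and a pure integer shift truncates the fractional 146.5 bytes to 146 (a fixed-point hardware implementation could retain fractional bits):

```python
def additive_increase(cwnd, base_bdp=150_000, scaling_shift=10):
    """One additive-increase step. scaling_shift=10 corresponds to a
    scaling factor of 2**10 = 1024, so the division becomes a right
    shift instead of an expensive hardware divide (illustrative sketch)."""
    step = base_bdp >> scaling_shift   # 150_000 // 1024 = 146 bytes
    return cwnd + step

cwnd = 75_000
cwnd = additive_increase(cwnd)
print(cwnd)   # 75146
```

Because every sender uses the same `base_bdp` constant, a 100G and a 400G NIC add the same number of bytes per step, which is the scale-invariance property described above.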

Figure 6-8: Increasing CWND.




Tuesday, 6 January 2026

UET Congestion Management: CCC Base RTT

Calculating Base RTT

[Edit: January 7 2026, RTT role in CWND adjustment process]

As described in the previous section, the Bandwidth-Delay Product (BDP) is a baseline value used when setting the maximum size (MaxWnd) of the Congestion Window (CWND). The BDP is calculated by multiplying the lowest link speed among the source and destination nodes by the Base Round-Trip Time (Base_RTT).

In addition to its role in BDP calculation, Base_RTT plays a key role in the CWND adjustment process. During operation, the RTT measured for each packet is compared against the Base_RTT. If the measured RTT is significantly higher than the Base_RTT, the CWND is reduced. If the RTT is close to or lower than the Base_RTT, the CWND is allowed to increase.

This adjustment process is described in more detail in the upcoming sections.
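As a rough illustration of the comparison described above, the decision can be sketched as below. The tolerance threshold is an assumption for the example, not a value taken from the UET specification:

```python
def cwnd_decision(measured_rtt_us, base_rtt_us, tolerance=1.25):
    """Sketch of the RTT comparison: grow the window while the measured
    RTT stays near the base RTT, shrink it when queuing delay inflates
    the RTT well beyond the base (tolerance factor is an assumption)."""
    if measured_rtt_us > base_rtt_us * tolerance:
        return "decrease"
    return "increase"

print(cwnd_decision(6.1, 6.0))    # increase -> RTT near base, path uncongested
print(cwnd_decision(12.0, 6.0))   # decrease -> RTT inflation signals queuing
```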

The config_base_rtt parameter represents the RTT of the longest path between sender and receiver when no other packets are in flight. In other words, it reflects the minimum RTT under uncongested conditions. Figure 6-7 illustrates the individual delay components that together form the RTT.

Serialization Delay: The network shown in Figure 6-7 supports jumbo frames with an MTU of 9216 bytes. Serialization delay is measured in time per bit, so the frame size must first be converted from bytes to bits:

9216 bytes × 8 = 73,728 bits

Serialization delay is then calculated by dividing the frame size in bits by the link speed. For a 100 Gbps link:

73,728 bits / 100 Gbps = 0.737 µs

Note: In a cut-through switched network, which is standard in modern 100 Gbps and above data center fabrics, the switch does not wait for the full 9216-byte frame to arrive before forwarding it. Instead, it processes only the packet header (typically the first 64–128 bytes) to determine the destination MAC or IP address and immediately begins transmitting the packet on the egress port. While the tail of the packet is still arriving on the ingress port, the head is already leaving the switch.

This behavior creates a pipeline effect, where bits flow through the network similarly to water through a pipe. As a result, when calculating end-to-end latency from a first-bit-in to last-bit-out perspective, the serialization delay is effectively incurred only once—the time required to place the packet onto the first link.

Propagation Delay: The time it takes for light to travel through the cabling infrastructure. In our example, the combined fiber-optic length between Rank 0 on Node A1 and GPU 7 on Node A2 is 50 meters. Light travels through fiber at approximately 5 ns per meter, resulting in a propagation delay of:

50 m × 5 ns/m = 250 ns = 0.250 µs

Switching Delay (Cut-Through): The time a packet spends inside a network switch while being processed before it is forwarded. This latency arises from internal operations such as examining the packet header, performing a Forwarding Information Base (FIB) lookup to determine the correct egress port, and updating internal buffers and queues.

In modern cut-through switches, much of this processing occurs while the packet is still being received, so the added delay per switch is very small. High-end 400G switches exhibit cut-through latencies on the order of 350–500 ns per switch. For a path traversing three switches, the total switching delay sums to approximately:

3 × 400 ns ≈ 1.2 µs

Thus, even with multiple hops, switching delay contributes only a modest portion to the total Base RTT in 100 Gbps and above data center fabrics.

Forward Error Correction (FEC) Delay: Forward Error Correction (FEC) ensures reliable, “lossless” data transfer in high-speed AI fabrics. It is required because high-speed optical links can experience bit errors due to signal distortion, fiber imperfections, or high-frequency signaling noise.

FEC operates using data blocks and symbols. The outgoing data is divided into fixed-size blocks, each consisting of data symbols. In 100G and 400G Ethernet FEC, one symbol = 10 bits. For example, a 514-symbol data block contains 514 × 10 = 5,140 bits of actual data.

To detect and correct errors, the switch or NIC ASIC computes parity symbols from the data block using Reed-Solomon (RS) math and appends them to the block. The combination of the original data and the parity symbols forms a codeword. For example, in RS(544, 514), the codeword has 544 symbols in total, of which 514 are data symbols and 30 are parity symbols. Each symbol is 10 bits, so the 30 parity symbols add 300 extra bits to the codeword.

At the receiver, the codeword is checked: the parity symbols are used to detect and correct any corrupted symbols in the original data block. Because RS-FEC operates on symbols rather than individual bits, if multiple bits within a single 10-bit symbol are corrupted, the entire symbol is corrected as a single unit.

The FEC latency (or accumulation delay) comes from the requirement to receive the entire codeword before error correction can begin. For a 400G RS(544, 514) codeword:

544 symbols × 10 bits/symbol = 5,440 bits total

At 400 Gbps, accumulating the 5,440-bit codeword takes only about 13.6 ns; together with the Reed-Solomon decoding pipeline, FEC adds a fixed delay on the order of ~150 ns per hop.

This delay is a “fixed cost” of high-speed networking and must be included in the Base RTT calculation for AI fabrics. The sum of all delays gives the one-way delay, and the round-trip time (RTT) is obtained by multiplying this value by two. The config_base_rtt value in Figure 6-7 is the RTT rounded to a safe, reasonable integer.
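The delay components above can be summed to reproduce the example config_base_rtt. In this sketch, the per-hop FEC figure follows the text, while the hop count for the FEC term (four links: NIC to leaf, leaf to spine, spine to leaf, leaf to NIC) is an assumption for illustration:

```python
MTU_BYTES = 9216
LINK_GBPS = 100

# Serialization delay is paid once end-to-end in a cut-through fabric.
serialization_us = MTU_BYTES * 8 / (LINK_GBPS * 1e9) * 1e6   # ~0.737 us

propagation_us = 50 * 5e-3        # 50 m of fiber at ~5 ns per meter
switching_us   = 3 * 400e-3       # three cut-through switches, ~400 ns each
fec_us         = 4 * 150e-3       # ~150 ns FEC delay per hop, assumed 4 links

one_way_us  = serialization_us + propagation_us + switching_us + fec_us
base_rtt_us = 2 * one_way_us
print(round(base_rtt_us, 2))      # 5.57 -> rounded up to a safe 6.0 us
```

With these numbers the computed RTT of roughly 5.6 µs rounds up to the 6.0 µs config_base_rtt used in the CWND example earlier.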

Figure 6-7: Calculating Base_RTT Value.

Saturday, 3 January 2026

UET Congestion Management: Congestion Control Context

Congestion Control Context

Updated 5.1.2026: Added CWND computation example into figure. Added CWND computation into text.
Updated 13.1.2026: Deprecated by: Ultra Ethernet: Congestion Control Context


Initializing Congestion Control Context (CCC)

When the PDS Manager receives an RMA operation request from the SES layer, it first checks whether a suitable Packet Delivery Context (PDC) already exists for the JobID, destination FEP, traffic class, and delivery mode carried in the request. If no matching PDC is found, the PDS Manager allocates a new one.

For the first PDC associated with a particular destination, a Congestion Control Context (CCC) is required to manage end-to-end congestion for that flow. The PDS Manager requests a CCC from the CCC Manager within the Congestion Management Sublayer (CMS). The CCC Manager creates the CCC, which initially enters the IDLE state, containing only the basic data structures without an active configuration. After creation, the CCC is bound to the PDC.

Next, the CCC is assigned a congestion window (CWND), which is computed based on CCC configuration parameters. The first step is to compute the Bandwidth-Delay Product (BDP), which is used to derive the upper bound for the initial congestion window. The CWND limits the total number of bytes in flight across all paths between the sender and the receiver.

The BDP is computed as:

BDP = min(sender_link_speed, receiver_link_speed) × config_base_rtt

The link speed must be expressed in bytes per second, not bits per second, because BDP is measured in bytes. The min() operator selects the smaller of the sender and receiver link speeds. In an AI fabric, these values are typically identical. The sender link speed, receiver link speed, and config_base_rtt are pre-assigned configuration parameters.

UET typically allows a maximum in-flight volume of 1.5 × BDP to provide throughput headroom while minimizing excessive queuing. A factor of 1.0 represents the minimum required to “fill the pipe” and would set the BDP directly as the maximum congestion window (MaxWnd). However, the UET specification applies a factor of 1.5 to allow controlled oversubscription and improved utilization.

Once the CWND is assigned and the CCC is bound to the PDC, the CCC transitions from the IDLE state to the ACTIVE state. In the ACTIVE state, the CCC holds all configuration information and is associated with the PDC, but data transport has not yet started.

When the CCC is fully configured and ready for operation, it transitions to the READY state. This transition signals that the CCC can enforce congestion control policies and monitor traffic. At this point, the PDC is allowed to begin sending data, and the CCC tracks and regulates the flow according to the configured congestion control algorithms.

The CCC serves as the central control structure for congestion management, hosting either sender-side (NSCC) or receiver-side (RCCC) algorithms. A CCC is unidirectional and is instantiated independently on both the sender and the receiver, where it is locally associated with the corresponding PDC. Once in the READY state, the CCC maintains the state required to regulate data flow, enabling NSCC and RCCC to enforce congestion windows, credits, and path usage to prevent network congestion and maintain efficient data transport.

Note: In this model, the PDS Manager acts as the control-plane authority responsible for context management and coordination, while the Packet Delivery Context (PDC) performs data-plane execution under the control of the Congestion Control Context (CCC). Once the CCC is operational and the PDC is authorized for data transport, RMA data transfers proceed directly over the PDC without further involvement from the PDS Manager.



Figure 6-6: Congestion Context: Initialization.

Monday, 29 December 2025

UET Congestion Management: Introduction

 Introduction


Figure 6-1 depicts a simple scale-out backend network for an AI data center. The topology follows a modular design, allowing the network to scale out or scale in as needed. The smallest building block in this example is a segment, which consists of two nodes, two rail switches, and one spine switch. Each node in the segment is equipped with a dual-port UET NIC and two GPUs.

Within a segment, GPUs are connected to the leaf switches using a rail-based topology. For example, in Segment 1A, the communication path between GPU 0 on Node A1 and GPU 0 on Node A2 uses Rail A0 (Leaf 1A-1). Similarly, GPU 1 on both nodes is connected to Rail A1 (Leaf 1A-2). In this example, we assume that intra-node GPU collective communication takes place over an internal, high-bandwidth scale-up network (such as NVLink). As a result, intra-segment GPU traffic never reaches the spine layer. Communication between segments is carried over the spine layer.

The example network is a best-effort (that is, PFC is not enabled) two-tier, three-stage non-blocking fat-tree topology, where each leaf and spine switch has four 100-Gbps links. Leaf switches have two host-facing links and two inter-switch links, while spine switches have four inter-switch links. All inter-switch and host links are Layer-3 point-to-point interfaces, meaning that no Layer-2 VLANs are used in the example network.

Links between a node’s NIC and the leaf switches are Layer-3 point-to-point connections. The IP addressing scheme uses /31 subnets, where the first address is assigned to the host NIC and the second address to the leaf switch interface. These subnets are allocated in a contiguous manner so they can be advertised as a single BGP aggregate route toward the spine layer.
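The /31 addressing scheme described above can be sketched with Python's ipaddress module. The aggregate prefix 10.0.0.0/28 is an assumed example value, not an address taken from the figures:

```python
import ipaddress

# Carve /31 point-to-point subnets for host links out of one contiguous
# aggregate, so the leaf can advertise a single summary toward the spines.
aggregate = ipaddress.ip_network("10.0.0.0/28")

for link in aggregate.subnets(new_prefix=31):
    # On a /31, both addresses are usable host addresses (RFC 3021);
    # first address to the host NIC, second to the leaf interface.
    nic_ip, leaf_ip = link.hosts()
    print(f"host NIC {nic_ip}  <->  leaf {leaf_ip}")

print(f"BGP aggregate toward spine: {aggregate}")
```

Requires Python 3.8+, where `hosts()` on a /31 returns both addresses.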

The trade-off of this aggregation model is that host-link or NIC failures cannot rely solely on BGP route withdrawal for fast failure detection. Additional local failure-detection mechanisms are therefore required at the leaf switch.

Although not shown in Figure 6-1, the example design supports a scalable multi-pod architecture. Multiple pods can be interconnected through a super-spine layer, enabling large-scale backend networks.

Note: The OSI label between GPUs within a node indicates that both GPUs belong to the same Operating System Instance (OSI). The link between the GPUs, in turn, is part of a high-bandwidth domain (the scale-up backend).

Figure 6-1: Example of AI DC Backend Networks Topology.

Congestion Types

In this text, we categorize congestion into two distinct domains: congestion within nodes, which includes incast, local, and outcast congestion, and congestion in scale-out backend networks, which includes link and network congestion. The following sections describe each congestion type in detail.


Incast Congestion

Figure 6-2 depicts an AllReduce collective communication using a star topology, where Rank 0 on node A1 acts as the central process. During the reduce phase, each participating rank sends its local gradients to Rank 0, which aggregates them (by summation or averaging). The aggregated result is then distributed back to all ranks during the broadcast phase, which is described in the next section.

Note: Tree- and ring-based topologies are significantly more efficient than a star topology and are therefore commonly used in practice. The star topology is shown here purely for demonstration purposes.

Incast congestion occurs when ingress data packets cannot be processed at line rate, causing the ingress buffer on the receiver NIC to overflow. At first glance, this may seem surprising because both ends of the link operate at the same speed. In theory, the receiver NIC should therefore be able to handle all incoming packets at line rate.

In this example, Rank 0 receives six interleaved packet streams at line rate. This occurs during the second neural network training iteration, at which point all required Packet Delivery Contexts (PDCs) were already established during the first iteration.

Connections from Rank 0 to Rank 2 and Rank 3 on node A2 both target FEP-2, identified by Fabric Address 10.1.1.2. On node A2, Rank 2 and Rank 3 execute under the same operating system instance and therefore share the same Fabric Endpoint (FEP). As a result, both processes are exposed to the fabric through the same destination FA.

In Ultra Ethernet Transport, a PDC is defined by the combination of Job ID, destination FA, Traffic Class (TC), and Service Mode (in this case, Reliable Unordered Delivery). When these parameters match, multiple communication channels can reuse the same PDC, even if they originate from different application processes.

In this scenario, both ranks participate in the same distributed training job (Job ID 101) and request identical transport characteristics: Traffic Class Low and Reliable Unordered Delivery. Because the source and destination FEPs are the same and all PDC-defining parameters match, a single PDC—identified by PDCID 12—is sufficient on the Rank 0 side to serve communication with both Rank 2 and Rank 3.

Although Rank 2 and Rank 3 are separate application processes, the NIC multiplexes their packets over the same PDC while preserving the required reliability and delivery semantics at the protocol level. The same PDC reuse logic applies to connections between Rank 0 and other remote processes, provided that the Job ID, destination FA, Traffic Class, and Service Mode remain unchanged.
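The PDC matching rule can be sketched as a dictionary lookup keyed on the four defining parameters. The identifiers mirror the example above, but the helper itself is illustrative, not an actual UET API:

```python
import itertools

pdc_table = {}
pdc_ids = itertools.count(12)   # first allocated PDCID in the example is 12

def get_or_create_pdc(job_id, dest_fa, traffic_class, delivery_mode):
    """Return an existing PDC when all four defining parameters match,
    otherwise allocate a new one (sketch of the reuse logic)."""
    key = (job_id, dest_fa, traffic_class, delivery_mode)
    if key not in pdc_table:
        pdc_table[key] = next(pdc_ids)
    return pdc_table[key]

# Rank 0 -> Rank 2 and Rank 0 -> Rank 3 on node A2 share FEP 10.1.1.2,
# Job ID 101, Traffic Class Low, Reliable Unordered Delivery (RUD):
a = get_or_create_pdc(101, "10.1.1.2", "Low", "RUD")
b = get_or_create_pdc(101, "10.1.1.2", "Low", "RUD")
print(a, b, a == b)   # 12 12 True -> one PDC serves both ranks
```

A different destination FA, Traffic Class, or delivery mode changes the key and therefore allocates a separate PDC.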

For every received packet, the NIC must perform several processing steps. It first examines the PDS header to determine whether an acknowledgment is required and to identify the next protocol header. It then processes the relative address information to determine the requested operation and, in the case of a UET_WRITE, the target memory location. These operations must be performed for every packet across all interleaved packet streams.

When many packets arrive simultaneously from multiple senders, the cumulative per-packet processing load can exceed the NIC’s ingress processing capacity, even though the physical link itself is not oversubscribed. As a result, ingress buffers may overflow, leading to incast congestion.

Note: In Figure 6-2, each node has a dual-port NIC and a single FEP. The NIC IP addresses are used for routing packets across the backend fabric, while the FA address serves as the FEP identifier.


Figure 6-2: Intra-node Congestion - Incast Congestions.


Local Congestion

Local congestion is another form of congestion that occurs within a node. It arises when the High-Bandwidth Memory (HBM) controller, which manages access to the GPU’s memory channels, becomes a bottleneck. The HBM controller arbitrates all read and write requests to GPU memory, regardless of their source. These requests may originate from the GPU’s compute cores, from a peer GPU via NVLink, or from a network interface card (NIC) performing remote memory access (RMA) operations.

With a UET_WRITE operation, the target GPU compute cores are bypassed: the NIC writes data directly into GPU memory using DMA. The GPU does not participate in the data transfer itself, and the NIC handles packet reception and memory writes. Even in this case, however, the data must still pass through the HBM controller, which serves as the shared gateway to the GPU’s memory system.

In Figure 6-3, the HBM controller of Rank 0 receives seven concurrent memory access requests: six inter-node RMA write requests and one intra-node request. The controller must arbitrate among these requests, determining the order and timing of each access. If the aggregate demand exceeds the available memory bandwidth or arbitration capacity, some requests are delayed. These memory-access delays are referred to as local congestion.



Figure 6-3: Intra-node Congestion - Local Congestions.


Outcast Congestion

Outcast congestion is the third type of congestion observed in collective operations. It occurs when multiple packet streams share the same egress port, and some flows are temporarily delayed relative to others. Unlike incast congestion, which arises from simultaneous arrivals at a receiver, outcast happens when certain flows dominate the output resources, causing other flows to experience unfair delays or buffer pressure.

Consider the broadcast phase of the AllReduce operation. After Rank 0 has aggregated the gradients from all participating ranks, it sends the averaged results back to all other ranks. Suppose Rank 0 sends these updates simultaneously to ranks on node A2 and node A3 over the same egress queue of its NIC. If one destination flow slightly exceeds the others in packet rate, the remaining flows experience longer queuing delays or may even be dropped if the egress buffer becomes full. These delayed flows are “outcast” relative to the dominant flows.

In this scenario, the NIC at Rank 0 must perform multiple UET_WRITE operations in parallel, generating high egress traffic toward several remote FEPs. At the same time, the HBM controller on Rank 0 may become a bottleneck because the data must be read from memory to feed the NIC. Thus, local congestion can occur concurrently with outcast congestion, especially during large-scale AllReduce broadcasts where multiple high-bandwidth streams are active simultaneously.

Outcast congestion illustrates that even when the network’s total capacity is sufficient, uneven traffic patterns can cause some flows to be temporarily delayed or throttled. Mitigating outcast congestion is addressed by appropriate egress scheduling and flow-control mechanisms to ensure fair access to shared resources and predictable collective operation performance. These mechanisms are explained in the upcoming Network-Signaled Congestion Control (NSCC) and Receiver Credit-Based Congestion Control (RCCC) chapters.


Figure 6-4: Intra-node Congestion - Outcast Congestions.


Link Congestion


Traffic in distributed neural network training workloads is dominated by bursty, long-lived elephant flows. These flows are tightly coupled to the application’s compute–communication phases. During the forward pass, network traffic is minimal, whereas during the backward pass, each GPU transmits large gradient updates at or near line rate. Because weight updates can only be computed after gradient synchronization across all workers has completed, even a single congested link can delay the entire training step.

In a routed, best-effort fat-tree Clos fabric, link congestion may be caused by Equal-Cost Multi-Path (ECMP) collisions. ECMP typically uses a five-tuple hash—comprising the source and destination IP addresses, transport protocol, and source and destination ports—to select an outgoing path for each flow. During the backward pass, a single rank often synchronizes multiple gradient chunks with several remote ranks simultaneously, forming a point-to-multipoint traffic pattern.

For example, suppose Ranks 0–3 in segment 1 initiate gradient synchronization with Ranks 4–7 in segment 2 at the same time. Ranks 0 and 2 are connected to rail 0 through Leaf 1A-1, while Ranks 1 and 3 are connected to rail 1 through Leaf 1A-2. As shown in Figure 6-5, the ECMP hash on Leaf 1A-1 selects the same uplink toward Spine 1A for both flows arriving via rail 0, while the ECMP hash on Leaf 1A-2 distributes its flows evenly across the available spine links.

As a result, two 100-Gbps flows are mapped onto a single 100-Gbps uplink on Leaf 1A-1. The combined traffic exceeds the egress link capacity, causing buffer buildup and eventual buffer overflow on the uplink toward Spine 1A. This condition constitutes link congestion, even though alternative equal-cost paths exist in the topology.

In large-scale AI fabrics, thousands of concurrent flows may be present, and low entropy in traffic patterns—such as many flows sharing similar IP address ranges and port numbers—further increases the likelihood of ECMP collisions. Consequently, link utilization may become uneven, leading to transient congestion and performance degradation even in a nominally non-blocking network.

Ultra Ethernet Transport includes signaling mechanisms that allow endpoints to react to persistent link congestion, including influencing path selection in ECMP-based fabrics. These mechanisms are discussed in later chapters.

Note: Although outcast congestion is fundamentally caused by the same condition—attempting to transmit more data than an egress interface can sustain—Ultra Ethernet Transport distinguishes between host-based and switch-based egress congestion events and applies different signaling and control mechanisms to each. These mechanisms are described in the following congestion control chapters.



Figure 6-5: Link Congestion.

Network Congestion


Common causes of network congestion include an excessively high oversubscription ratio, ECMP collisions, and link or device failures. A less obvious but important source of short-term congestion is Priority Flow Control (PFC), which is commonly used to build lossless Ethernet networks. PFC together with Explicit Congestion Notification (ECN) forms the foundation of lossless Ethernet for RoCEv2 but should be avoided in a UET-enabled best-effort network. The upcoming chapters explain why.

PFC relies on two buffer thresholds to control traffic flow: xOFF and xON. The xOFF threshold defines the point at which a switch generates a pause frame when a priority queue becomes congested. A pause frame is an Ethernet MAC control frame that tells the upstream device which Traffic Class (TC) queue is congested and for how long packet transmission for that TC should be paused. Packets belonging to other traffic classes can still be forwarded normally. Once the buffer occupancy drops below the xON threshold, the switch sends a resume signal, allowing traffic for that priority queue to continue before the actual pause timer expires.
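The xOFF/xON hysteresis described above can be sketched as a small state machine. The byte thresholds below are invented for illustration (real values depend on buffer size, link speed, and cable length headroom); the sketch only shows that pause and resume events fire on threshold crossings, not on every sample.

```python
# Hysteresis sketch of PFC thresholds for a single priority queue.
# Threshold values are illustrative, not taken from any switch datasheet.
XOFF, XON = 75_000, 30_000   # bytes

def pfc_events(occupancy_samples):
    """Return (occupancy, event) pairs emitted as buffer occupancy
    crosses the xOFF (pause) and xON (resume) thresholds."""
    paused = False
    events = []
    for occ in occupancy_samples:
        if not paused and occ >= XOFF:
            events.append((occ, "PAUSE"))    # pause frame sent upstream for this TC
            paused = True
        elif paused and occ <= XON:
            events.append((occ, "RESUME"))   # resume before the pause timer expires
            paused = False
    return events

samples = [10_000, 50_000, 80_000, 90_000, 60_000, 25_000, 5_000]
print(pfc_events(samples))   # [(80000, 'PAUSE'), (25000, 'RESUME')]
```

Note the gap between xOFF and xON: without that hysteresis, occupancy hovering near a single threshold would generate a flood of pause/resume frames.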

At first sight, PFC appears to affect only a single link and a specific traffic class. In practice, however, a PFC pause can trigger a chain reaction across the network. For example, if the egress buffer occupancy for TC-Low on the interface toward Rank 7 on Leaf switch 1B-1 exceeds the xOFF threshold, the switch sends PFC pause frames to both connected spine switches, instructing them to temporarily hold TC-Low packets in their buffers. As the egress buffers for TC-Low on the spine switches fill and cross their own xOFF thresholds, the spines in turn send PFC pause frames to the rest of the leaf switches.

This behavior can quickly spread congestion beyond the original point of contention. In the worst case, multiple switches and links may experience temporary pauses. Once buffer occupancy drops below the xON threshold, Leaf switch 1B-1 sends resume signals, and traffic gradually recovers as normal transmission resumes. Even though the congestion episode is short, it disrupts collective operations and negatively impacts distributed training performance.

The upcoming chapters explain how Ultra Ethernet Network-Signaled Congestion Control (NSCC) and Receiver Credit-based Congestion Control (RCCC) manage the amount of data that sources are allowed to send over the network, maximizing network utilization while avoiding congestion. The next chapters also describe how Explicit Congestion Notification (ECN), Packet Trimming, and Entropy Value-based Packet Spraying, when combined with NSCC and RCCC, contribute to a self-adjusting, reliable backend network.


Monday, 15 December 2025

UET Request–Response Packet Flow Overview

This section brings together the processes described earlier and explains the packet flow from the node perspective. A detailed network-level packet walk is presented in the following sections.

Initiator – SES Request Packet Transmission

After the Work Request Entity (WRE) and the corresponding SES and PDS headers are constructed, they are submitted to the NIC as a Work Element (WE). As part of this process, a Packet Delivery Context (PDC) is created, and the base Packet Sequence Number (PSN) is selected and encoded into the PDS header.

Once the PDC is established, it begins tracking which PSNs have been transmitted and which have been acknowledged by the target. For example, PSN 0x12000 is marked as transmitted.

The NIC then fetches the payload data from local memory according to the address and length information in the WRE. The NIC autonomously performs these steps without CPU intervention, illustrating the hardware offload capabilities of UET.

Next, the NIC encapsulates the data with the required protocol headers: Ethernet, IP, optional UDP, PDS, and SES, and computes the Cyclic Redundancy Check (CRC). The fully formed packet is then transmitted toward the target with Traffic Class (TC) set to Low.
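The encapsulation order above can be sketched as follows. The field contents and the CRC-32 stand-in are placeholders (the UET wire format defines its own header layouts and integrity check); only the header stacking order and the Low Traffic Class assignment come from the text.

```python
import zlib

def build_packet(payload: bytes, use_udp: bool = True) -> dict:
    """Sketch of the encapsulation order described above.
    Header contents are placeholders, not the normative UET wire format."""
    stack = ["Ethernet", "IP"] + (["UDP"] if use_udp else []) + ["PDS", "SES"]
    frame = ("|".join(stack)).encode() + payload
    return {
        "headers": stack,          # outermost first
        "tc": "Low",               # data packets use the Low Traffic Class
        "crc": zlib.crc32(frame),  # stand-in for the frame check sequence
    }

pkt = build_packet(b"\x00" * 4096)
print(pkt["headers"])   # ['Ethernet', 'IP', 'UDP', 'PDS', 'SES']
```

Dropping the optional UDP header simply removes one layer from the stack while the PDS and SES headers keep their relative order.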

Note: The Traffic Class is orthogonal to the PDC; a single PDC may carry packets transmitted with Low or High TC depending on their role (data vs control).

Figure 5-9: Initiator: SES Request Processing.

Target – SES Request Reception and PDC Handling


Figure 5-10 illustrates the target-side processing when a PDS Request carrying an SES Request is received. Unlike the initiator, the target PDS Manager identifies the PDC using the tuple {source IP address, destination IP address, Source PDC Identifier (SPDCID)} to perform a lookup in its PDC mapping table.


Because no matching entry exists, the lookup results in a miss, and the target creates a new PDC. The PDC identifier (PDCID) is allocated from the General PDC pool, as indicated by the DPDCID field in the received PDS header. In this example, the target selects PDCID 0x8001.

This PDCID is subsequently used as the SPDCID when sending the PDS Ack Response (carrying the Semantic Response) back to the initiator. Any subsequent PDS Requests from the initiator reference this PDC using the same DPDCID = 0x8001, ensuring continuity of the PDC across messages.

After the PDC has been created, the UET NIC writes the received data into memory according to the SES header information. The memory placement process follows several steps:

  • Job and rank identification: The relative address in the SES header identifies the JobID (101) and the PIDonFEP (RankID 2).
  • Resource Index (RI) table lookup: The NIC consults the RI table, indexed by 0x00a, and verifies that the ri_generation field (0x01) matches the current table version. This ensures the memory region is valid and has not been re-registered.
  • Remote key validation: The NIC uses the rkey = 0xacce5 to locate the correct RI table entry and confirm permissions for writing.
  • Data placement: The data is written at base address (0xba5eadd1) + buffer_offset (0). The buffer_offset allows fragmented messages to be written sequentially without overwriting previous fragments.
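The four placement steps above can be sketched end to end. The dictionary layout and function below are hypothetical; the identifier values (JobID 101, PIDonFEP 2, RI 0x00a, ri_generation 0x01, rkey 0xacce5, base address 0xba5eadd1) mirror the running example.

```python
# Hypothetical target-side placement walk for the steps listed above.
ri_tables = {
    (101, 2): {                      # (JobID, PIDonFEP) -> that process's RI tables
        0x00a: {"generation": 0x01,
                "entries": {0xacce5: {"base": 0xBA5EADD1,
                                      "perms": {"write"},
                                      "mem": bytearray(16384)}}},
    }
}

def place_data(job, pid, ri, generation, rkey, buffer_offset, data):
    table = ri_tables[(job, pid)][ri]          # steps 1-2: job/rank + RI lookup
    if table["generation"] != generation:      # step 2: version check
        raise ValueError("stale ri_generation: region was re-registered")
    entry = table["entries"][rkey]             # step 3: rkey selects the entry
    if "write" not in entry["perms"]:
        raise PermissionError("rkey does not allow writes")
    # Step 4: write at base + buffer_offset so fragments land sequentially.
    entry["mem"][buffer_offset:buffer_offset + len(data)] = data
    return entry["base"] + buffer_offset       # resolved placement address

addr = place_data(101, 2, 0x00a, 0x01, 0xacce5, 0, b"\xAB" * 4096)
print(hex(addr))   # 0xba5eadd1
```

A second fragment of the same message would call the same function with buffer_offset = 4096, landing immediately after the first payload.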

In Figure 5-10, the memory highlighted in orange shows the destination of the first data fragment, starting at the beginning of the registered memory region.

Note: The NIC handles all these steps autonomously, performing direct memory placement and verification, which is essential for high-performance, low-latency applications like AI and HPC workloads.

Figure 5-10: Target: Request Processing – NIC → PDS → SES → Memory.


Target – SES Response, PDS Ack Response and Packet Transmission

After completing the write operation, the UET provider uses a Semantic Response (SES Response) to notify the initiator that the operation was successful. The opcode in the SES Response header is set to UET_DEFAULT_RESPONSE, with list = UET_EXPECTED and return_code = RC_OK, indicating that the UET_WRITE operation has been executed successfully and the data has been written to target memory. Other fields, including message_id, ri_generation, JobID, and modified_length, are filled with the same values received in the SES Request, for example, message_id = 1, ri_generation = 0x001, JobID = 101, and modified_length = 16384.

Once the SES Response header is constructed, the UET provider creates a PDS Acknowledgement (PDS Ack) Response. The type is set to PDS_ACK, and the next_header field UET_HDR_RESPONSE references the SES Response type. The ack_psn_offset encodes the PSN from the received PDS Request, while the cumulative PSN (cack_psn) acknowledges all PDS Requests up to and including the current packet. The SPDCID is set to the target’s Initial PDCID (0x8001), and the DPDCID is set to the value received from the PDS Request as SPDCID (0x4001).
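The field assignments in the two paragraphs above can be condensed into a sketch. The field names follow the prose, not the normative header layout, and the function is hypothetical; the values (SPDCID 0x8001, DPDCID 0x4001, PSN 0x12000) are the running example's.

```python
# Sketch of the acknowledgement pair described above: an SES Response
# echoing request fields, wrapped by a PDS Ack that swaps the PDC IDs.
def build_ack(req):
    ses_resp = {"opcode": "UET_DEFAULT_RESPONSE",
                "list": "UET_EXPECTED",
                "return_code": "RC_OK",
                # Echoed back unchanged from the SES Request:
                "message_id": req["message_id"],
                "ri_generation": req["ri_generation"],
                "job_id": req["job_id"],
                "modified_length": req["modified_length"]}
    pds_ack = {"type": "PDS_ACK",
               "next_header": "UET_HDR_RESPONSE",
               "ack_psn_offset": req["psn"] - req["base_psn"],
               "cack_psn": req["psn"],      # cumulative ack up to this packet
               "spdcid": 0x8001,            # target's own PDCID
               "dpdcid": req["spdcid"]}     # initiator's PDCID from the request
    return pds_ack, ses_resp

req = {"message_id": 1, "ri_generation": 0x001, "job_id": 101,
       "modified_length": 16384, "psn": 0x12000, "base_psn": 0x12000,
       "spdcid": 0x4001}
pds_ack, ses_resp = build_ack(req)
print(hex(pds_ack["dpdcid"]), ses_resp["return_code"])   # 0x4001 RC_OK
```

Note the SPDCID/DPDCID swap: what the initiator sent as its source identifier comes back as the destination identifier, which is how the initiator finds the right PDC on receipt.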

Finally, the PDS Ack and SES Response headers are encapsulated with Ethernet, IP, and optional UDP headers and transmitted by the NIC using High Traffic Class (TC). The High TC ensures that these control and acknowledgement messages are prioritized in the network, minimizing latency and supporting reliable flow control.

Figure 5-11: Target: Response Processing – SES → PDS → Transmit.

Initiator – SES Response and PDS Ack Response


When the initiator receives a PDS Ack Response that also carries a SES Response, it first identifies the associated Packet Delivery Context (PDC) using the DPDCID field in the PDS header. Using this PDC, the initiator updates its PSN tracking state. The acknowledged PSN—for example, 0x12000—is marked as completed and released from the retransmission tracking state, indicating that the corresponding PDS Request has been successfully delivered and processed by the target.

After updating the transport-level state, the initiator extracts the SES Response and passes it to the Semantic Sublayer (SES) for semantic processing. The SES layer evaluates the response fields, including the opcode and return code, and determines that the UET_WRITE operation associated with message_id = 1 has completed successfully. As this response corresponds to the first fragment of the message, the initiator can mark that fragment as completed and, depending on the message structure, either wait for additional fragment responses or complete the overall operation. In our case, there are three more fragments to be processed.
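The two bookkeeping layers described above can be sketched as one small class. The class and its method names are hypothetical; the separation it models (PDS releases acknowledged PSNs, SES counts completed fragments) comes from the text, as do the values PSN 0x12000 and four fragments.

```python
# Sketch of initiator-side state updates on receiving a PDS Ack Response
# that carries an SES Response: transport-level PSN release (PDS) and
# operation-level fragment completion (SES) are tracked separately.
class PdcTxState:
    def __init__(self, base_psn: int, num_fragments: int):
        self.base_psn = base_psn
        self.outstanding = set()                 # PSNs sent but not yet acked
        self.pending_fragments = set(range(1, num_fragments + 1))

    def on_transmit(self, psn: int):
        self.outstanding.add(psn)

    def on_ack(self, cack_psn: int, fragment_no: int) -> int:
        # Cumulative ack: release every PSN up to and including cack_psn.
        self.outstanding = {p for p in self.outstanding if p > cack_psn}
        # SES-level completion for the fragment this response refers to.
        self.pending_fragments.discard(fragment_no)
        return len(self.pending_fragments)       # fragments still awaited

pdc = PdcTxState(base_psn=0x12000, num_fragments=4)
pdc.on_transmit(0x12000)
remaining = pdc.on_ack(cack_psn=0x12000, fragment_no=1)
print(remaining)   # 3 fragment responses still expected
```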

This separation of responsibilities allows the PDS layer to manage reliability and delivery tracking, while the SES layer handles operation-level completion and status reporting.

Figure 5-12: Initiator: PDS Response & PDS Ack Processing.

Note: PDS Requests and Responses describe transport-specific parameters, such as the delivery mode (Reliable Unordered Delivery, RUD, or Reliable Ordered Delivery, ROD). In contrast, SES Requests and Responses describe semantic operations. SES Requests specify what action the target must perform, for example, writing data and the exact memory location for that operation, while SES Responses inform the initiator whether the operation completed successfully. In some flow diagrams, SES messages are shown as flowing between the SES and PDS layers, while PDS messages are shown as flowing between the PDS layers of the initiator and the target.



Wednesday, 10 December 2025

UET Protocol: How the NIC constructs packets from Work Entries (WRE+SES+PDS)

 Semantic Sublayer (SES) Operation 

[Rewritten 12 Dec 2025]

After a Work Request Entity (WRE) is created, the UET provider generates the parameters needed by the Semantic Sublayer (SES) headers. At this stage, the SES does not construct the actual wire header. Instead, it provides the header parameters, which are later used by the Packet Delivery Context (PDC) state machine to construct the final SES wire header, as explained in the upcoming PDC section. These parameters ensure that all necessary information about the message, including addressing and size, is available for later stages of processing.

Fragmentation Due to Guaranteed Buffer Limits

In our example, the data to be written to the remote GPU is 16 384 bytes. The dual-port NIC in Figure 5-5 has a total memory capacity of 16 384 bytes, divided into three regions: a 4 096-byte guaranteed per-port buffer for Eth0 and Eth1, and an 8 192-byte shared memory pool available to both ports. Because gradient synchronization requires lossless delivery, all data must fit within the guaranteed buffer region. The shared memory pool cannot be used, as its buffer space is not guaranteed.

Since the message exceeds the size of the guaranteed buffer, it must be fragmented. The UET provider splits the 16 384-byte message into four 4 096-byte sub-messages, as illustrated in Figure 5‑6. Fragmentation ensures that each piece of the message can be transmitted reliably within the available guaranteed buffer space.

Figure 5‑5 also illustrates the main object abstractions in UET. The Domain object represents and manages the NIC, while the Endpoint defines a logical connection between the application and the NIC. The max_msg_size parameter specifies the largest data payload that can be transferred in a single operation. In our example, the NIC’s egress buffer can hold 4 096 bytes, but the application needs to write 16 384 bytes. As a result, the data must be split into multiple smaller chunks, each fitting within the buffer and max_msg_size limits.


Figure 5-5: RMA Operation – NIC’s Buffers and Data Size.

Core SES Parameters


Each of the four fragments in Figure 5-6 carries the same msg_id = 1, allowing the target to recognize them as parts of the same original message. The first fragment sets ses.som = 1 (start of message), while the last fragment sets ses.eom = 1 (end of message). The two middle fragments have both ses.som and ses.eom set to 0. In addition to these boundary markers, the SES parameters also define the source and destination FEPs, the Delivery Mode (RUD – Reliable Unordered Delivery for an fi_write operation), the Job ID (101), the Traffic Class (Low = 0), and the Packet Length (4 096 bytes). The pds.next_hdr field determines which SES base header format the receiver must parse next. For an fi_write operation, a standard SES Request header is used (UET_HDR_REQUEST_STD = 0x3).
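The fragmentation and boundary-marker rules above can be sketched directly. The function and field names are illustrative; the sizes (16 384-byte message, 4 096-byte guaranteed buffer) and the som/eom pattern come from the text.

```python
def fragment(msg_id: int, payload: bytes, max_msg_size: int = 4096):
    """Split a message into sub-messages sized to the guaranteed buffer,
    marking start/end of message as described above (illustrative fields)."""
    chunks = [payload[i:i + max_msg_size]
              for i in range(0, len(payload), max_msg_size)]
    return [{"msg_id": msg_id,                      # shared by all fragments
             "som": 1 if i == 0 else 0,             # start of message
             "eom": 1 if i == len(chunks) - 1 else 0,  # end of message
             "length": len(c)}
            for i, c in enumerate(chunks)]

frags = fragment(msg_id=1, payload=b"\x00" * 16384)
print([(f["som"], f["eom"]) for f in frags])   # [(1, 0), (0, 0), (0, 0), (0, 1)]
```

The shared msg_id is what lets the target reassemble the fragments into one logical message; som and eom only delimit it.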



Figure 5-6: RMA Operation – Semantic Sublayer: SES Parameters.


Packet Delivery Sublayer (PDS) Operation

PDS Manager


Once the SES parameters are defined, the provider issues the ses_pds_tx_req() request to the PDS Manager (Figure 5-7). The PDS Manager examines the tuple {Job ID, Destination FEP, Traffic Class, Request Mode} to determine whether a Packet Delivery Context (PDC) already exists for that combination.

Because this request corresponds to the first of the four sub-messages, no PDC exists yet. The PDS Manager selects the first available PDC Identifier from the pre-allocated PDC pool. The pool—created during job initialization—consists of a general PDC pool (used for operations such as fi_write and fi_send) and a reserved PDC pool for high-priority traffic such as Ack messages from the target to the initiator. In our example, the first free PDC ID from the general pool is 0x4001.

After opening the PDC, the PDS Manager associates all subsequent requests carrying the same msg_id = 1 with PDC 0x4001.
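The lookup-or-allocate behavior described above can be sketched with a dictionary keyed on the four-tuple. The class and the pool ID ranges are invented for illustration, apart from the first general-pool ID 0x4001 from the example.

```python
# Sketch of the PDS Manager's PDC lookup: reuse a PDC when the
# {Job ID, Destination FEP, Traffic Class, Request Mode} tuple matches,
# otherwise allocate the next identifier from the pre-allocated pool.
class PdsManager:
    def __init__(self):
        self.pdc_table = {}                            # tuple -> PDC ID
        self.general_pool = iter(range(0x4001, 0x5000))   # fi_write, fi_send, ...
        self.reserved_pool = iter(range(0x6001, 0x7000))  # high-priority traffic

    def get_pdc(self, job_id, dst_fep, traffic_class, mode) -> int:
        key = (job_id, dst_fep, traffic_class, mode)
        if key not in self.pdc_table:                  # miss -> open a new PDC
            self.pdc_table[key] = next(self.general_pool)
        return self.pdc_table[key]

mgr = PdsManager()
first = mgr.get_pdc(101, "10.0.2.14", "Low", "RUD")    # miss: allocates 0x4001
again = mgr.get_pdc(101, "10.0.2.14", "Low", "RUD")    # hit: same PDC reused
print(hex(first), first == again)   # 0x4001 True
```

Requests for the remaining three sub-messages hit the same tuple, so they ride the same PDC rather than allocating new ones.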

PDC State Machine


After selecting a PDC Identifier, the PDS Manager forwards the request to the PDC State Machine. This component assigns the base Packet Sequence Number (PSN), which will be used in the PSN field of the first and upcoming fragments. Next, it constructs the PDS and SES headers for the message.

Constructing PDC Header


The Type field in the PDS header in Figure 5-7 is set to 2, indicating Reliable Unordered Delivery (RUD). Reliability is ensured by the Acknowledgement Required (ar) flag, which is mandatory for this transport mode. The Next Header (pds.next_hdr) field specifies that the following header is the Standard SES Request header (0x3 = UET_HDR_REQUEST_STD).

The syn flag remains set until the first PSN is acknowledged by the target (explained in upcoming chapters). Using the PSNs reported in ACK messages, the initiator can determine which packets have been successfully delivered.

A dynamically established PDC works as a communication channel between endpoints. For the channel to operate, both the initiator and the target must use the same PDC type (General or Reserved) and must agree on which local and remote PDCs are used for the exchange. In Figure 5-7, the Source PDC Identifier (SPDCID) in the PDS header is derived from the Initiator PDCID (IPDCID). At this stage, the initiator does not know the Destination PDCID (DPDCID) because the communication channel is still in the SYN state (initializing). Instead of setting a DPDCID value, the initiator simply indicates the type of PDC the target should create for the connection (pdc = rsv). It also specifies the PSN offset (psn_offset = 0), indicating that the offset for this request is zero (i.e., using the base PSN value). The target opens its PDC for the connection after receiving the first packet and when generating the acknowledgment response.


Constructing SES Header


The SES header (Figure 5-7) provides the target NIC with information about which job and which process the incoming message is destined for. It also carries a Resource Index table pointer (0x00a) and an rkey (0xacce5) that together identify the entry in the RI table where the NIC can find the target’s local memory region for writing the received data.

The Opcode field (UET_WRITE) instructs the NIC how to handle the data. The rel flag (relative addressing) is set for parallel jobs to indicate that relative addressing mode is used.

Using the relative addressing mode, the NIC resolves the memory location by traversing a hierarchical address structure: Fabric Address (FA), Job ID, Fabric Endpoint (FEP), Resource Index (RI).

  • The FA, carried as the destination IP address in the IP header, selects the correct FEP when the system has multiple FEPs.
  • The Job ID identifies the job to which the packet belongs.
  • The PIDonFEP identifies the process participating in the job (for parallel jobs, PIDonFEP = rankID).
  • The RI value points to the correct entry in the RI table.

Relative addressing identifies which RI table belongs to the job on the target FEP, but not which entry inside that table. The rkey in the SES header provides this missing information: it selects the exact RI table entry that describes the registered memory region.
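The hierarchy above, with the rkey as the final selector, can be sketched as a nested lookup. The dictionary layout is hypothetical; the identifier values (FA 10.0.1.11, JobID 101, PIDonFEP 2, RI 0x00a, rkey 0xacce5) mirror the running example.

```python
# Hierarchical resolution sketch for relative addressing:
# FA -> JobID -> PIDonFEP -> RI selects the table; rkey selects the entry.
fabric = {
    "10.0.1.11": {                               # FA (destination IP address)
        101: {                                   # JobID
            2: {                                 # PIDonFEP (rank 2)
                0x00a: {                         # Resource Index -> RI table
                    0xacce5: {"base": 0xBA5EADD1, "len": 16384},
                },
            }
        }
    }
}

def resolve(fa, job_id, pid_on_fep, ri, rkey):
    """Walk the relative-address tuple to an RI table, then let the
    rkey pick the exact registered memory region inside it."""
    ri_table = fabric[fa][job_id][pid_on_fep][ri]
    return ri_table[rkey]

region = resolve("10.0.1.11", 101, 2, 0x00a, 0xacce5)
print(hex(region["base"]))   # 0xba5eadd1
```

The tuple narrows the search to one table per job per process; without the rkey, the NIC would know which table to open but not which region inside it to use.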

While the rkey selects the correct RI entry, the buffer_offset field specifies where inside the memory region the data should be written, relative to the region’s base address. In Figure 5-8, the first fragment writes at offset 0, and the second fragment (not shown) starts at offset 4 096, immediately after the first payload.

The ri_generation field (e.g., 0x01) indicates the version of the RI table in use. This is necessary because the table may be updated while the job is running. The hd (header-data-present) bit indicates whether the header_data field is included. This is useful when multiple gradient buckets must be synchronized, because each bucket can be identified by an ID in header_data (for example, GPU 0 might use bucket_id = 11 for the first chunk and bucket_id = 12 for the second). The initiator_id field specifies the initiator’s PIDonFEP (i.e., rankID).

Finally, note that the SES Standard header has several variants. For example:

  • If ses.som = 1, a Header Data field is present.
  • If ses.eom = 1, the Payload Length and Message Offset fields are included.





Figure 5-8: Packet Delivery Sublayer: PDC and PDS Header.

Work Element (WE)


Once the SES and PDS headers are created, they are inserted, along with the WRE, into the NIC’s Transmit Queue (TxQ) as a Work Element (WE). Figure 5‑9 illustrates the three components of a WE. The NIC fetches the data from local memory based on the instructions described in the WRE, wraps it with Ethernet, IP, optional UDP, PDS, and SES headers, and calculates the Cyclic Redundancy Check (CRC) for unencrypted packets. The packet is then ready for transmission.

The NIC can support multiple Transmit Queues. In our example, there are two: one for Traffic Class Low and another for Traffic Class High. The example WE is placed into the Low TxQ. Work Element 1 (WE1) corresponds to the first ses_pds_tx_req() request, completing the end-to-end flow from WRE creation to packet transmission.


Figure 5-9: UET NIC: Packetization, Queueing & Transport.



Thursday, 27 November 2025

UET Relative Addressing and Its Similarities to VXLAN

 Relative Addressing


As described in the previous section, applications use endpoint objects as their communication interfaces for data transfer. To write data from local memory to a target memory region on a remote GPU, the initiator must authorize the local UE-NIC to fetch data from local memory and describe where that data should be written on the remote side.

To route the packet to the correct Fabric Endpoint (FEP), the application and the UET provider must supply the FEP’s IP address (its Fabric Address, FA). To determine where in the remote process’s memory the received data belongs, the UE-NIC must also know:

  • Which job the communication belongs to
  • Which process within that job owns the target memory
  • Which Resource Index (RI) table should be used
  • Which entry in that table describes the exact memory location

This indirection model is called relative addressing.

How Relative Addressing Works

Figure 5-6 illustrates the concept. Two GPUs participate in distributed training. A process on GPU 0 with global rank 0 (PID 0) receives data from GPU 1 with global rank 1 (PID 1). The UE-NIC determines the target Fabric Endpoint (FEP) based on the destination IP address (FA = 10.0.1.11). This IP address forms the first component of the relative address.

Next, the NIC checks the JobID and PIDonFEP to resolve which job and which process the message is intended for. These two fields are the second and third components of the relative address { FA, JobID, PIDonFEP }.

The fourth component is the Resource Index (RI) table descriptor. It tells the NIC which RI table should be consulted for the memory lookup.

Finally, the rkey, although not part of the relative addressing tuple itself, selects the specific entry in that RI table that defines the precise remote memory region. In our example, the complete addressing information is:

{ FA: 10.0.1.11, JobID: 101, PIDonFEP: 0, RI: 0x00a }, and the rkey identifies the specific RI entry to use.



Figure 5-6: Ultra Ethernet: Relative Addressing for Distributed Learning.


Comparison with VXLAN

Relative addressing in UET has several structural similarities to addressing in VXLAN data planes.

A Fabric Address (FA) attached to a Fabric Endpoint (FEP) serves a role similar to a VTEP IP address in a VXLAN fabric. Both identify the tunnel endpoint used to route the packet across the underlay network toward its destination.

A JobID identifies a distributed job that consists of multiple processes. In VXLAN, the Layer-2 VNI (L2VNI) identifies a stretched Layer-2 segment for endpoints. In both technologies, these identifiers define the logical communication context in which the packet is interpreted.

The combination of PIDonFEP and RI tells the UE-NIC which Resource Index table describes the target memory locations owned by that process. Similarly, in VXLAN, the VNI-to-VLAN mapping on a VTEP determines which MAC address table holds the forwarding entries for that virtual network.

The rkey selects the specific entry in the RI table that defines the exact target memory location. The VXLAN equivalent is the destination MAC address, which selects the exact entry in the MAC table that determines the egress port or remote VTEP.

Figure 5-7 further illustrates this analogy. The Tunnel IP address determines the target VTEP, and the L2VNI-to-VLAN mapping on that VTEP identifies which MAC address table should be consulted for the destination MAC in the original Ethernet header. As a reminder, VXLAN is an Ethernet-in-IP encapsulation method where the entire Layer-2 frame is carried inside an IP/UDP tunnel.


Figure 5-7: Virtual eXtensible LAN: Layer2 Virtual Network Identifier.



Monday, 24 November 2025

UET Data Transfer Operation: Work Request Entity and Semantic Sublayer

Work Request Entity (WRE) 

[SES part updated 7 December 2025: text and figure]

The UET provider constructs a Work Request Entity (WRE) from a fi_write RMA operation that has been validated and passed by the libfabric core. The WRE is a software-level representation of the requested transfer and semantically describes both the source memory (local buffer) and the target memory (remote buffer) for the operation. Using the WRE, the UET provider constructs the Semantic Sublayer (SES) header and the Packet Delivery Context (PDC) header.

From the local memory perspective, the WRE specifies the address of the data in registered local memory, the length of the data, and the local memory key (lkey). This information allows the NIC to fetch the data directly from local memory when performing the transmission.

From the target memory perspective, the WRE describes the Resource Index (RI) table, which contains information about the destination memory region, including its base address and the offset within that region where the data should be written. The RI table also defines the allowed operations on the region. Because an RI table may contain multiple entries, the actual memory region is selected using the rkey, which is also included in the WRE. The rkey enables the remote NIC to locate the correct memory region within the selected RI table.

To ensure proper delivery, the WRE includes an Address Vector (AV) table entry, identified via the fi_addr_t handle. The AV provides the Fabric Address (FA) of the target and specifies which job and which rank (i.e., PIDonFEP) the data is intended for. The WRE also indicates whether the completion of the transport operation should be reported through a completion queue.

By including pointers to the AV table entry and the remote RI table, the WRE allows the UET provider to access all the transport and remote memory metadata required for the operation without duplicating the underlying AV or RI data structures. Using these indices, the UET provider can efficiently construct the SES and PDC headers for the Work Element (WE), ensuring correct delivery of the data from the initiator’s local memory to the remote target memory.

Figure 5-4 illustrates how the libfabric core passes a fi_write RMA operation request from the application to the UET provider after performing a sanity check. The UET provider then constructs the Work Request Entity, which encapsulates all the information about the local and remote memory, the AV entry identified by the fi_addr_t handle, and the transport metadata required to deliver the operation.

Figure 5-4: RMA Operation – Semantic Sublayer: Work Request Entity (WRE).

Semantic Sublayer (SES)


The UET provider’s Semantic Sublayer (SES) maps application-facing API calls, such as fi_write RMA operations, to UET operations. In our example, the UET provider constructs a SES header where the fi_write request from the libfabric application is mapped to a UET_WRITE operation. The first step of this mapping, described in the previous section, uses the content of the fi_write request to construct a UET Work Request Entity (WRE). Information from the WRE is then used to build the actual SES header, which will later be wrapped within Ethernet/IP/UDP (if an Entropy header is not used). The SES header carries information that the target NIC uses to resolve which job and process the message is targeted to, and which specific memory location the data should be written to. In other words, the SES header provides a form of intra-node routing, indicating where the data should be placed within the target node.

The first element to consider in Figure 5-5 is the maximum message size (max_msg_size) supported by the endpoint. In our example, the dual-port NIC has a total memory capacity of 16 384 bytes. Half of this memory (8 192 bytes) forms a shared pool accessible to both ports (Eth0 and Eth1). In addition, each port has a guaranteed private memory region of 4 096 bytes.

The maximum message size is determined by the largest amount of memory that the NIC can guarantee for a single message. Although the shared pool can temporarily absorb additional traffic, its availability cannot be guaranteed because both ports may consume it simultaneously. Consequently, the NIC must base max_msg_size solely on the per-port guaranteed memory. Thus, the largest message that the endpoint can safely handle is 4 096 bytes.

Note: Although an Accelerated NIC can fetch data directly from GPU VRAM without staging it through CPU memory, the NIC still needs the data briefly in its own buffer to perform transport framing (for example, adding headers and verifying CRC) before sending it onto the wire.

Because the data to be written to the target memory (16 384 bytes) exceeds the per-packet private buffer size (4 096 bytes), the UET provider must split the operation into four packets (messages).

The first key field in the SES header is the operation code (UET_WRITE), which instructs the target NIC how to handle the received data. The rel bit (relative addressing), when set, indicates that the operation uses a relative address, which includes JobID (101), PIDonFEP (process identifier, rank 2 in our example), and Resource Index (0x00a). Based on this information, the target NIC can identify the job and rank to which the message belongs. The Resource Index may contain multiple entries and the rkey (0xacce5) is used to select the correct RI entry.

The SES header also contains the buffer offset, which specifies the exact location relative to the base address where the data should be written. In Figure 5‑5, the first message will write data starting at offset 0, while the second message will write at offset 4096, immediately following the first message’s 4 096-byte payload. The ses.som bit indicates the start of the message, and ses.eom indicates the end of the message. Note that the ses.som and ses.eom bits are not used for ordering; message ordering is ensured by the Message ID field, which allows the NIC to process fragments in the correct sequence.

During the job lifetime, Resource Indices may be updated. To validate that the correct version is used, the SES header includes the ri_generation field, which identifies the initiator’s current RI table version.

The hd (header data present) bit indicates whether the header_data field is present in the SES header. A common use case for header data is when a GPU holds multiple gradient chunks that must be synchronized with a remote process. Each chunk can be identified by a bucket ID stored in the SES header’s header_data field. For example, the first chunk in GPU 0 memory may have bucket_id=11, the second chunk bucket_id=12, and so on. This allows the NIC to distinguish which messages correspond to which chunk. The initiator_id field carries the initiator’s rank ID (the initiator’s PIDonFEP).

If a gradient chunk exceeds the NIC’s max_msg_size, it must be split into multiple SES messages. Consider the second chunk (bucket_id=12) split into four messages. The first message has ses.som=1, indicating the start of the chunk, and hd=1, signaling that header data is present. The header_data field contains the bucket ID (12). This message also has message_id=1, identifying it as the first SES message of the chunk. The next two messages have message_id=2 and message_id=3, respectively. Both have hd=0, ses.som=0, and ses.eom=0, indicating they are continuation packets. The fourth message is similar but has ses.eom=1, marking it as the last message of the chunk. 





Figure 5-5: RMA Operation – Semantic Sublayer: SES Header.


Monday, 17 November 2025

UET Data Transfer Operation: Introduction

Introduction

[Updated 22 November 2025: Handoff Section]

The previous chapter described how an application gathers information about available hardware resources and uses that information to initialize the job environment. During this initialization, hardware resources are abstracted and made accessible to the UET provider as objects.

This chapter explains the data transport process, using gradient synchronization as an example.

Figure 5-1 depicts two GPUs—Rank 0 and Rank 2—participating in the same training job (JobID: 101). Both GPUs belong to the same NCCL topology and are connected to the Scale-Out Backend Network’s rail0.

Because the training model is large, each layer of the neural network is split across two GPUs using tensor parallelism, meaning that the computations of a single layer are distributed between the GPUs.

During the first forward-pass training iteration, the predicted model output does not match the expected result. This triggers the backward pass process, in which gradients—values indicating how much each weight parameter should be adjusted to improve the next forward-pass prediction—are computed.

Rank 0 computes its gradients, which in Figure 5-1 are stored as a 2D matrix with 3 rows and 1024 columns. The results are stored in a memory space registered for the process in local VRAM. The memory region’s base address is 0x20000000.

The first gradient at row 0, column 0 (index [0,0]) is stored at offset 0. In this example, each gradient value is 4 bytes. Thus, the second gradient at [0,1] is stored at offset 4, and so on. All 1024 gradients in row 0 require 4096 bytes (1024 × 4 bytes) of memory, which corresponds to the Scale-Out Backend Network’s Maximum Transfer Unit (MTU). The entire gradient block stored in Rank 0’s VRAM occupies 12,288 bytes (3 rows × 4096 bytes).

Row 0: [G[0,0], G[0,1], G[0,2], ..., G[0,1023]] → offsets 0, 4, 8, ..., 4092 bytes

Row 1: [G[1,1], G[1,1], G[1,2], ..., G[1,1023]] → offsets 4096, 4100, 4104, ..., 8188 bytes

Row 2: [G[2,0], G[2,1], G[2,2], ..., G[2,1023]] → offsets 8192, 8196, 8200, ..., 12284 bytes
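These offsets follow directly from a row-major layout with 4-byte values. A quick arithmetic check in C (the constants mirror this example, not any UET definition):

```c
#include <assert.h>
#include <stddef.h>

#define ROWS  3
#define COLS  1024
#define GSIZE 4 /* bytes per gradient value */

/* Byte offset of gradient G[row][col] from the region base (0x20000000). */
static size_t grad_offset(size_t row, size_t col)
{
    return (row * COLS + col) * GSIZE;
}
```

For instance, grad_offset(1, 0) lands on 4096 (the start of row 1), grad_offset(2, 1023) on 12284 (the last value), and the whole block spans ROWS × COLS × GSIZE = 12,288 bytes.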


After completing its part of the gradient computation, the application on Rank 0 initiates gradient synchronization with Rank 2 by executing an fi_write RMA operation. This operation writes the gradient data from local memory to remote memory. To perform this operation successfully, the application—together with the UET provider, libfabric core, and the UET NIC—must provide the necessary information for the process, UET NIC, and network:

Semantic (What we want to do): The application describes the intent: Write 12,288 bytes of data from the local registered memory region starting at base memory address 0x20000000 to the corresponding memory region of a process running on Rank 2.

Delivery (How to transport): The application specifies how the data must be delivered: reliable or unreliable transport, with ordered or unordered packet delivery. In Figure 5-1, the selected mode is Reliable, Ordered Delivery (ROD).

Forwarding (Where to transport): To route the packets over the Scale-Out Backend Network, the delivery information is encapsulated within Ethernet, IP, and optionally UDP headers.


Figure 5-1: High-Level View of Remote Memory Access Operation.



Application: RMA Write Operation


Figure 5-2 illustrates how an application gathers information for an fi_write RMA operation.

The first field, fid_ep, maps the remote write operation to the endpoint object (fid_ep = 0xFIDAE01; AE stands for Active Endpoint). The endpoint type is FI_EP_RDM, which provides reliable datagram delivery. The endpoint is bound to a registered memory region (fid_mr = 0xFIDDA01), and the RMA operation stores the memory descriptor in the desc field. Gradients reside in this memory region, starting from the base address 0x20000000.

The length field specifies how many bytes should be written to the remote memory. The target of the write is represented by an fi_addr_t value. In this example, it points to the first entry in the Address Vector (AV), which identifies the remote rank and its fabric address (IP address). The AV also references a Resource Index (RI) entry. The RI entry contains the JobID, rank ID, remote memory address, and key required for access.

After collecting all the necessary information, the application invokes the fi_write RMA operation in the libfabric core.

Figure 5-2: RMA Operation – Application: fi_write.

Libfabric Core Validation for RMA Operations


Libfabric core validation ensures that an RMA request is structurally correct, references valid objects, and complies with the capabilities negotiated during endpoint creation. It performs three main types of lightweight checks before handing off the request to the UET provider.

Object Integrity and Type Validation


The core first ensures that all objects referenced by the application are valid and consistent. For fi_write, it verifies that fid_ep is a properly initialized endpoint of the expected class (FI_CLASS_EP) and type (FI_EP_RDM). The memory region fid_mr is checked to confirm it is registered and that the desc field correctly points to the local buffer. The target fi_addr_t is validated against the Address Vector (AV) to ensure it corresponds to a valid remote rank and fabric address. Any associated Resource Index (RI) references are verified for consistency.

Attribute and Capability Compliance

Next, the core verifies that the endpoint supports the requested operation. For fi_write, this means checking that the endpoint provides reliable RMA write capability. It also ensures that any attributes or flags used are compatible with the endpoint’s capabilities, including alignment with memory registration and compliance with the provider’s supported ordering and reliability semantics.

Basic Parameter Sanity Checks 


The core performs sanity checks on operation parameters. It verifies that length is non-zero, does not exceed the registered memory region, and is correctly aligned. Any flags or optional parameters are checked for validity.
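A minimal sketch of the length and bounds checks, with illustrative types and names (mr_bounds and rma_params_ok are not libfabric internals):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative view of a registered memory region; not a libfabric type. */
struct mr_bounds {
    uint64_t base; /* registered base address     */
    uint64_t len;  /* registered length in bytes  */
};

/* Reject zero-length writes and buffers that fall outside the region. */
static bool rma_params_ok(const struct mr_bounds *mr,
                          uint64_t buf, uint64_t len)
{
    if (len == 0)
        return false;                              /* nothing to transfer   */
    if (buf < mr->base || buf + len > mr->base + mr->len)
        return false;                              /* outside registration  */
    return true;
}
```

With the example’s region (base 0x20000000, 12,288 bytes registered), a full-region write passes, while a zero-length write or one that runs past the registered end is rejected before the provider ever sees it.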

Provider Handoff


After passing all validations, the libfabric core hands off the fi_write operation to the provider, ensuring the request is well-formed.

The UET provider converts the fi_write RMA operation into a Work Request Element (WRE), from which it creates a Semantic Sublayer (SES) header that specifies where the data will be written on the target, and a Packet Delivery Context (PDC) that describes how the packet is expected to be delivered. It then constructs the Data Descriptor (DD), which includes the local memory address, length, and access key. Next, the UET provider creates a Work Element (WE) from the SES, PDC, and DD. The WE is placed into the NIC’s staging buffer, where the NIC reads it and constructs a packet, copying the data from the local memory.
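The composition of SES, PDC, and DD information into a Work Element can be sketched structurally. The types and field sets below are illustrative, following the text rather than the UET specification:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative building blocks; field sets follow the text, not the spec. */
struct ses_part { uint64_t remote_offset; uint32_t job_id; };        /* where on target */
struct pdc_part { uint32_t pdc_id; uint8_t reliable, ordered; };     /* how delivered   */
struct dd_part  { uint64_t local_addr; uint64_t len; uint64_t key; };/* local data      */

/* Work Element placed into the NIC's staging buffer. */
struct we {
    struct ses_part ses;
    struct pdc_part pdc;
    struct dd_part  dd;
};

/* Compose a WE for a Reliable, Ordered Delivery (ROD) write of `len` bytes. */
static struct we build_we(uint64_t local_addr, uint64_t len, uint64_t key,
                          uint64_t remote_offset, uint32_t job_id,
                          uint32_t pdc_id)
{
    struct we w = {
        .ses = { remote_offset, job_id },
        .pdc = { pdc_id, 1, 1 },       /* ROD: reliable = 1, ordered = 1 */
        .dd  = { local_addr, len, key },
    };
    return w;
}
```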

These processes are explained in the upcoming chapters.


Figure 5-3: RMA Operation – Libfabric Core: Lightweight Sanity Check.


Monday, 10 November 2025

UET Data Transport Part I: Introduction

[Figure updated 13 November 2025]

My previous UET posts explained how an application uses libfabric function API calls to discover available hardware resources and how this information is used to create a hardware abstraction layer composed of Fabric, Domain, and Endpoint objects, along with their child objects — Event Queues, Completion Queues, Completion Counters, Address Vectors, and Memory Regions.

This chapter explains how these objects are used during data transfer operations. It also describes how information is encoded into UET protocol headers, including the Semantic Sublayer (SES) and Packet Delivery Sublayer (PDS). In addition, the chapter covers how the Congestion Management Sublayer (CMS) monitors and controls send queue rates to prevent egress buffer overflows.

Note: In this book, libfabric API calls are divided into two categories for clarity. Functions are used to create and configure fabric objects such as fabrics, domains, endpoints, and memory regions (for example, fi_fabric(), fi_domain(), and fi_mr_reg()). Operations, on the other hand, perform actual data transfer or synchronization between processes (for example, fi_write(), fi_read(), and fi_send()).

Figure 5-1 provides a high-level overview of a libfabric Remote Memory Access (RMA) operation using the fi_write function call. When an application needs to transfer data, such as gradients, from its local memory to the memory of a GPU on a remote node, both the application and the UET provider must specify a set of parameters. These parameters ensure that the local RMA-capable NIC can forward packets to the correct destination and that the target node can locate the appropriate memory region using its process and job identifiers.

First, the application defines the operation to perform, in our example, a remote write fi_write(). It then specifies the resources involved in the transfer. The endpoint (fid_ep) represents the communication interface between the process and the underlying fabric. Each endpoint is bound to exactly one domain object, which abstracts the UET NIC. Through this binding, the UET provider automatically knows which NIC the endpoint uses, and the endpoint is automatically assigned to one or more send queues for processing work requests. This means the application does not need to manage NIC queue assignments manually.

Next, the application identifies the registered memory region (desc) that contains the local data to be transmitted. It also specifies where within that region to start reading the payload (buffer pointer: buf) and how many bytes to transfer (length: len). 

To reach the correct remote peer, the application uses a fabric address handle (fi_addr_t). The provider resolves this logical address through its Address Vector (AV) to obtain the peer’s actual fabric address—corresponding, in the UET context, to the remote UET NIC endpoint.

Finally, the application specifies the destination memory information: the remote virtual address (addr) where the data should be written and the remote protection key (key), which authorizes access to that memory region.

The resulting fi_write function call, as described in the libfabric programmer’s manual, is structured as follows:

ssize_t fi_write(struct fid_ep *ep, const void *buf, size_t len, void *desc, fi_addr_t dest_addr, uint64_t addr, uint64_t key, void *context);

Next, the application’s fi_write operation call is passed by the libfabric core to the UET provider. Based on the fi_addr_t handle, the provider knows which Address Vector (AV) table entry it should consult. In our example, the handle value 0x0001 corresponds to rank 1 with the fabric address 10.0.1.11.

Depending on the provider implementation, an AV entry may optionally reference a Resource Index (RI) entry. The RI table can associate the JobID with the work request and store an authorization key, if it was not provided directly by the application. It may also define which operations are permitted for the target rank.
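AV resolution can be sketched as a simple indexed lookup. The av_entry type and av_resolve function below are illustrative, assuming the provider stores AV entries in a flat table; real provider internals differ:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative Address Vector entry; real AV internals are provider-specific. */
struct av_entry {
    uint32_t rank;            /* remote rank                    */
    char     fabric_addr[16]; /* e.g. dotted-quad IP address    */
};

/* Resolve a logical fi_addr_t-style handle to the peer's fabric address. */
static const struct av_entry *av_resolve(const struct av_entry *av,
                                         uint32_t av_size, uint64_t handle)
{
    if (handle >= av_size)
        return 0;             /* invalid handle: no such entry  */
    return &av[handle];
}
```

In the example above, handle 0x0001 indexes the entry for rank 1 with fabric address 10.0.1.11; an out-of-range handle resolves to nothing and the operation would be rejected.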

Note: The Rank Identifier (RankID) can be considered analogous to a Process Identifier (PID); that is, the RankID defines the PID on the Fabric EndPoint (FEP).

Armed with this information, gathered from the application’s fi_write operation call and from the Address Vector and Resource Index tables, the UET provider creates a Work Request (WR) and places it into the Send Queue (SQ). Each SQ is implemented as a circular buffer in memory, shared between the GPU (running the provider) and the NIC hardware. Writing a WR into the queue does not automatically notify the NIC. To signal that new requests are available, the provider performs a doorbell operation, writing to a special NIC register. This alerts the NIC to read the SQ, determine how many WRs are pending, and identify where to start processing.
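The circular-buffer send queue and doorbell interaction can be modeled in a few lines of C. In this sketch the doorbell is a plain shared index the NIC reads; real hardware uses a memory-mapped register write, and the types below are illustrative:

```c
#include <assert.h>
#include <stdint.h>

#define SQ_DEPTH 8 /* entries; a power of two simplifies wrap-around */

/* Illustrative work request; real WRs carry SES/PDC metadata. */
struct wr { uint32_t op; uint64_t len; };

struct send_queue {
    struct wr ring[SQ_DEPTH];
    uint32_t head;     /* next entry the NIC will consume        */
    uint32_t tail;     /* next free slot for the provider        */
    uint32_t doorbell; /* last tail value published to the NIC   */
};

/* Provider side: post a WR, then ring the doorbell to publish it. */
static int sq_post(struct send_queue *sq, struct wr w)
{
    if (sq->tail - sq->head == SQ_DEPTH)
        return -1;                      /* queue full             */
    sq->ring[sq->tail % SQ_DEPTH] = w;
    sq->tail++;
    sq->doorbell = sq->tail;            /* models the register write */
    return 0;
}

/* NIC side: number of WRs pending between head and the rung doorbell. */
static uint32_t sq_pending(const struct send_queue *sq)
{
    return sq->doorbell - sq->head;
}
```

Posting advances the tail and publishes it via the doorbell; the NIC then knows how many WRs are pending and where to start consuming, advancing head as each transfer completes.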

Once notified, the NIC fetches each WR, retrieves the associated metadata (such as the destination fabric address, remote memory key, and SES/PDC header information), and begins executing the data transfer. Some NICs may also periodically poll the SQ, but modern UET NICs typically rely on doorbell notifications to achieve low-latency execution.

Because GPUs and the application are multithreaded, multiple operations may be posted to the SQ simultaneously. Each WR is treated independently and can be placed in separate send queues, allowing the NIC to execute multiple transfers in parallel. This design ensures efficient utilization of both the NIC and GPU resources while maintaining correct ordering and authorization of each operation.


Figure 5-1: Mapping between libfabric operation, Provider Objects, and Hardware.

Before transmitting packets, the NIC uses the metadata retrieved from the AV and RI tables to construct the necessary protocol headers. The Semantic Sublayer (SES) header is created using information such as the JobID, process context, and authorization key, ensuring that the remote peer can correctly identify and authorize the operation. Simultaneously, the Packet Delivery Sublayer (PDS) header is prepared to control reliable delivery, sequence numbering, and congestion management. Together, these headers allow the NIC to send the payload efficiently and securely, while preserving the correct association with the source operation and enabling proper handling by the remote UET NIC.

Next, we will examine in detail how the UET headers — SES and PDC — are constructed and encapsulated with Ethernet/IP headers, and optionally UDP headers for entropy, so that packets can be efficiently routed by the scale-out backend network switches to the correct target node for further processing. On the sending side, the PDC header provides context that the UET NIC uses to manage reliable delivery, sequence numbering, and congestion control, ensuring that packets are transmitted correctly and in order to the remote peer. On the receiving side, the SES header carries the operation-specific information that tells the remote UET NIC exactly what to do — in our example, it instructs the UET NIC to WRITE a block of data to a memory address registered with the target process participating in JobID 101.