Thursday, 5 February 2026

Ultra Ethernet: Receiver Credit-based Congestion Control (RCCC)

 Introduction

Receiver Credit-Based Congestion Control (RCCC) is a cornerstone of the Ultra Ethernet transport architecture, specifically designed to eliminate incast congestion. Incast occurs at the last-hop switch when the aggregate data rate from multiple senders exceeds the egress interface capacity of the target’s link. This mismatch leads to rapid buffer exhaustion on the outgoing interface, resulting in packet drops and severe performance degradation.


The RCCC Mechanism

Figure 8-1 illustrates the operational flow of the RCCC algorithm. In a standard scenario without credit limits, source Rank 0 and Rank 1 might attempt to transmit at their full 100G line rates simultaneously. If the backbone fabric consists of 400G inter-switch links, the core utilization remains a comfortable 50% (200G total traffic). However, because the target host link is only 100G, the last-hop switch (Leaf 1B-1) becomes an immediate bottleneck. The switch is forced to queue packets that cannot be forwarded at the 100G egress rate, eventually triggering incast congestion and buffer overflows.

While "incast" occurs at the egress interface and can resemble head-of-line blocking, it is fundamentally a "fan-in" problem where multiple sources converge on a single receiver. Under RCCC, standard Explicit Congestion Notification (ECN) on the last-hop switch's egress interface is typically disabled for this traffic class. The reasoning is twofold:

Redundancy: In Ultra Ethernet, ECN is the primary signal for NSCC to adjust the Congestion Window (CWND) and rotate the Entropy Value (EV) to trigger packet-level load balancing across the fabric.

Path Convergence: At the last-hop switch, rotating the EV is ineffective because there is only a single physical path to the destination. Since RCCC provides a more granular, proactive mechanism to throttle senders based on the receiver's actual capacity, the reactive "slow down" signaling of ECN becomes unnecessary at this stage. By disabling ECN here, the receiver (Target) takes full responsibility for flow management, ensuring that the fabric remains clear of congestion markers that might otherwise trigger unnecessary path hunting.


Credit Allocation and Flow

Instead of relying on late-stage ECN signaling, the RCCC algorithm proactively throttles senders by granting credits that match the physical transport speed of the target's connection.

Discovery: When Rank 2 receives data, it identifies the sources via the CCC_ID field in the RUD_CC_REQ (the specific request type used when RCCC is enabled) and adds them to its Active Sender Table.

Calculation: The algorithm divides the total available bandwidth (for a 100 Gbps link, roughly 12.5 GB/s) among the active senders. In this example, each sender is allocated 6.25 GB/s (50 Gbps) worth of credits.

Granting: These credits are transmitted back to the sources via ACK_CC packets once data is successfully committed to Rank 2’s memory.

Enforcement: Upon receiving the ACK_CC, the Congestion Control Context (CCC) associated with the sender’s Packet Delivery Control (PDC) updates its local credit table. The PDC only permits transmission based on these available credits, effectively capping the individual sender's rate at 50G. This ensures that when combined with the other sender, the aggregate rate at the receiver does not exceed its 100G link capacity.

This credit-grant loop is continuous. The RUD_CC_REQ carries "backlog" information, telling the target exactly how much data is waiting in the source's queue. By dynamically adjusting grants based on this feedback, RCCC ensures the backend network remains lossless.
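The sender-side half of this loop can be sketched as follows. The class and field names are illustrative assumptions, not UET specification identifiers; the numbers follow the 256 MB example used throughout this chapter.

```python
# Hypothetical sketch of sender-side RCCC bookkeeping: the CCC tracks a
# global backlog and the cumulative credits granted by the receiver;
# the PDC may only transmit bytes that unspent credits cover.

class SenderCCC:
    def __init__(self):
        self.global_backlog = 0   # bytes queued across all bound PDCs
        self.cum_credit = 0       # cumulative credits granted via ACK_CC
        self.cum_sent = 0         # cumulative bytes already transmitted

    def add_backlog(self, delta_bytes):
        """A PDC reports a delta backlog to the CCC."""
        self.global_backlog += delta_bytes

    def grant(self, cumulative_credit):
        """Record the cumulative credit carried in an ACK_CC message."""
        self.cum_credit = max(self.cum_credit, cumulative_credit)

    def transmit(self):
        """Send as much backlog as the unspent credits allow."""
        sendable = min(self.global_backlog, self.cum_credit - self.cum_sent)
        self.cum_sent += sendable
        self.global_backlog -= sendable   # remainder becomes the next backlog report
        return sendable

ccc = SenderCCC()
ccc.add_backlog(256_000_000)    # 256 MB RMA write enters the queue
ccc.grant(12_500)               # initial BDP-scaled credits
sent = ccc.transmit()           # 12,500 bytes go in flight
remaining = ccc.global_backlog  # 255,987,500 bytes still queued
```

The remaining backlog is what the source advertises as the credit target in its next RUD_CC_REQ, keeping the receiver's grant loop informed of outstanding demand.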

Figure 8-1: RCCC: Destination Flow Control.


Source RCCC Operation


The RCCC operation from the perspective of source UET Node-A begins when an application on Rank 0 initiates a 256 MB Remote Memory Access (RMA) write operation toward Rank 2. This request is handled by the Semantic Sublayer (SES), which translates the high-level command into a ses_pds_tx_req for the Packet Delivery Sublayer (PDS). In our example, the PDS Manager determines that no communication channel currently exists between the Fabric Endpoints used for these connections, so it allocates a new Packet Delivery Control (PDC) from its general pool with the PDC identifier 0x4001. Simultaneously, it requests a Congestion Control Context (CCC) from the Congestion Management System (CMS), resulting in a dedicated context, CCC_ID = 0xA1, being configured and bound to the new PDC.

Once PDC and CCC are established, the system tracks the pending data through a two-tier backlog system. In our example, PDC 0x4001 updates its delta backlog with the full 256 MB of the request, which is then added to the CCC’s global backlog. This global value represents the total volume of data currently waiting for transport across all PDCs managed by that specific context. Because this is the start of the transaction, the global backlog moves from zero to 256 MB, establishing the total "demand" the source is prepared to place on the network.

In our example, new contexts are pre-provisioned with initial credits scaled to the Bandwidth-Delay Product (BDP). While the theoretical capacity of a 100G link is 12.5 GB/s, the initial "pipe-cleaning" burst is much smaller, specifically 12.5 KB in this scenario. This value represents a safe, conservative fraction of the total BDP, ensuring that the source can trigger the feedback loop without the risk of overwhelming the receiver's buffers or the last-hop switch before the control loop fully engages. The CCC authorizes PDC 0x4001 to transmit this initial amount, subtracts it from the current cumulative credits, and updates the global backlog to show that this small portion is now in-flight, leaving 255,987,500 bytes remaining in the queue.

With this authorization from the CCC, the PDC passes the work request to the NIC, which fetches the data from memory and prepares the packet for transmission. In our example, the FEP Fabric Addresses (FA) are encoded into the IP header’s source and destination IP address fields, and the DSCP bits are configured to correspond to the TC-LOW traffic class. Additionally, the ECN bits are set to reflect that the packet is ECN-capable, ensuring visibility for Network Signaled Congestion Control (NSCC) if needed. The type of the PDS request is set to RUD_CC_REQ, which requires a pds.req_cc_state field. In our example, this field carries the CCC_ID (0xA1) and the Credit Target, which describes the size of the backlog of the sender CCC. By including these parameters, the source explicitly informs the target of its total remaining data, allowing the receiver to calculate and return the next set of credit grants to keep the pipeline moving.

Note: Since the source does not yet have information regarding the PDC on the remote target, it populates the pdc_info field with a value of 0x0 for the Destination PDC ID (DPDCID), notifying the target that the new PDC ID must be allocated from the global PDC pool. Furthermore, the SYN bit remains set until the first ACK_CC message is received, signaling to the target that the connection handshake and credit-granting loop are in the initialization phase.


Figure 8-2: Source RCCC Processing.

Target RCCC Operation – PDS Request


When the initial packet arrives at destination Node-B, the PDS Manager first checks for an existing PDC associated with the incoming connection from Fabric Address (FA) 10.0.0.1 and SPDCID 0x4001. Because no such PDC exists, the PDS Manager identifies this as a new connection request. The value of 0x0 in the pdc_info field instructs the target to allocate a General type PDC, ensuring the local delivery control matches the source's PDC type.

Since no Congestion Control Context (CCC) currently exists for this specific FEP-to-FEP connection, the PDS requests the CMS to allocate a new one. The CMS assigns CCC_ID 0xB1 and creates a source-specific entry in the Active Sender Table. This entry records the source address (FA 10.0.0.1) and the assigned traffic class (TC-LOW) from the IP header. In addition, the source CCC_ID 0xA1 carried in the PDS header reports the sender's backlog as a credit_target value of 255,987,500 bytes.

Simultaneously, the NIC extracts the semantic information from the SES header to identify the required operation. In our example, it recognizes a UET_WRITE command and determines the target memory address for the incoming data. Once the packet payload is verified, the data is forwarded to the High-Bandwidth Memory (HBM) Controller, where it waits for its turn to be committed to the physical memory.


Figure 8-3: Target RCCC Processing – PDS Request.


Target RCCC Operation – Credit Assignment


After receiving confirmation from the SES regarding the completed memory operation, the PDS prepares the response using an ACK_CC message. The CMS must now determine how much data the source is permitted to send in its next burst. In our example, the CMS allocates 12.5 KB of credits for CCC_ID 0xA1.

The math behind this allocation is a function of the receiver’s total capacity and the time-granularity of the control loop. While the NIC provides a 100 Gbps (12.5 GB/s) "pipe," the receiver does not grant a full second of data at once, as doing so would bypass the congestion control mechanism. Instead, it grants data in "time-slices"; in this scenario, each slice represents 1 microsecond of transmission. By dividing the total bandwidth by the number of active senders for that specific time-slice, the receiver ensures that the aggregate "demand" never exceeds the physical capabilities of the link.

In our example, with only one active sender, the calculation is:

(12.5 GB/s x 0.000001 s) ÷ active senders = 12.5 KB

The RCCC algorithm is designed for dynamic fairness. Though not explicitly shown in Figure 8-4, if Rank 0 had a simultaneous transfer in progress, the Active Sender Table would list two sources. The CMS would then divide that same 1-microsecond "slice" between them, reducing the granted credit per source to 6.25 KB. This prevents "incast" congestion by ensuring that even if multiple sources transmit at once, their combined throughput matches exactly what the receiver can ingest.
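The slice-based grant arithmetic above can be checked with a small helper; the function name and the integer units (bytes per microsecond) are illustrative assumptions.

```python
# Per-sender credit grant: link bandwidth is converted to bytes per
# microsecond, and one time-slice is divided among the active senders.

def credit_per_sender(link_gbps, slice_us, active_senders):
    bytes_per_us = link_gbps * 1000 // 8   # 100 Gbps -> 12,500 bytes/us
    return bytes_per_us * slice_us // active_senders

credit_per_sender(100, 1, 1)   # one sender: 12,500 bytes (12.5 KB)
credit_per_sender(100, 1, 2)   # two senders: 6,250 bytes (6.25 KB) each
```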

The PDS defines this response by setting the pds.cc_type to CC_CREDIT. The pds.ack_cc_state field is populated with this calculated credit value, while the ooo_count field tracks any Out-of-Order packets. To ensure this information is not delayed by standard data traffic, the DSCP bits in the IP header are set to TC-HIGH. This gives the ACK_CC message "express" priority across the backend fabric, minimizing the time the source spends waiting for new credits and maintaining a high-performance, steady-state flow.

Crucially, the Target populates its own local PDC ID (0x4011) into the Source PDC Identifier (SPDCID) field of the PDS Prologue header. By doing so, it provides the return address necessary for the source to transition out of its initial "discovery" state.

Figure 8-4: Target RCCC Processing – ACK_CC Message Reply.


Source-Side Processing of ACK_CC


When the ACK_CC message arrives at the source, the NIC identifies the target FEP based on the destination IP address. However, for high-speed internal processing, it uses the DPDCID in the PDS header as a local handle to jump directly to the correct PDC Context. From this entry, the NIC automatically resolves the CCC_ID associated with that specific PDC.

Once the correct CCC entry is identified, the source processes the new credit information. In our example, the receiver has sent a new Cumulative Credit value of 25,000 bytes. To determine the currently available window, the source performs a simple subtraction: 

Incremental Credit = Received Cumulative Credit - Local Cumulative Credit

By subtracting the previously recorded 12,500 bytes from the new 25,000 bytes, the source identifies an incremental grant of 12,500 bytes. The CCC then authorizes the PDC to transmit this amount. Simultaneously, the Global Backlog is updated by subtracting these 12,500 bytes from the remaining 255,987,500 bytes, keeping the sender’s demand signal accurate for the next request.
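That subtraction, together with the backlog update, can be expressed as a small sketch; the function and variable names are illustrative.

```python
# Source-side handling of a cumulative credit grant from an ACK_CC.
def apply_ack_cc(local_cum_credit, rcvd_cum_credit, global_backlog):
    incremental = rcvd_cum_credit - local_cum_credit   # newly usable bytes
    return incremental, rcvd_cum_credit, global_backlog - incremental

# Values from the example above:
inc, new_local, backlog = apply_ack_cc(12_500, 25_000, 255_987_500)
# inc == 12,500 authorized bytes; backlog == 255,975,000 bytes remaining
```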

The PDC informs the NIC that it is cleared to construct packets fitting this allowed credit size (respecting the NIC’s MTU). The NIC fetches the data from memory, packetizes it, and transports it to the destination. This control loop continues—updating demand and receiving cumulative grants—until the entire backlog has been transported and acknowledged.

Once the job is complete, the PDC context is closed. If no other PDCs are currently associated with that CCC_ID, the CCC is also closed. This hierarchical teardown ensures that no unnecessary hardware resources or bandwidth are reserved in the AI Fabric once the work is done.



Saturday, 31 January 2026

Ultra Ethernet: NSCC Destination Flow Control

Figure 6-14 depicts a demonstrative event where Rank 4 receives seven simultaneous flows (1). As these flows are processed by their respective PDCs and handed over to the Semantic Sublayer (2), the High-Bandwidth Memory (HBM) Controller becomes congested. Because HBM must arbitrate multiple fi_write RMA operations requiring concurrent memory bank access and state updates, the incoming packet rate quickly exceeds HBM’s transactional retirement rate. 

This causes internal buffers at the memory interface to fill, creating a local congestion event (3). To prevent buffer overflow, which would lead to dropped packets and expensive RMA retries, the receiver utilizes NSCC to move the queuing "pain" back to the source. This is achieved by using the pds.rcv_cwnd_pend parameter of the ACK_CC header (4). The parameter operates on a scale of 0 to 127; while zero is ignored, a value of 127 triggers the maximum possible rate decrement. In this scenario, a value of 64 is utilized, resulting in a 50% penalty relative to the newly acknowledged data.

Rather than directly computing a new transport rate, the mechanism utilizes a three-phase process to define a restricted Congestion Window (CWND). This reduction in CWND inherently forces the source to drain its inflight bucket to maintain protocol compliance and synchronize the injection rate with the HBM's processing capacity. The process begins by calculating the newly_rcvd_bytes, representing the data volume acknowledged by the incoming ACK_CC. This is the delta between the rcvd_bytes of the predecessor ACK_CC (12,288 bytes) and the newest rcvd_bytes (16,384 bytes), totaling 4,096 bytes (A).

 In the next phase, the logic multiplies 4,096 bytes by the rcv_cwnd_pend value of 64, resulting in a product of 262,144. Applying a bit-shift of 7 (equivalent to dividing by 128) yields a penalty of 2,048 bytes (B). This penalty is then subtracted from the current CWND of 75,776, establishing a new, throttled CWND of 73,728 bytes (C). 
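The three phases (A–C) reduce to a few integer operations. This sketch uses the bit-shift form described above; the variable names are illustrative.

```python
# rcv_cwnd_pend penalty: scale the newly acknowledged bytes by the
# pending value and divide by 128 via a right shift of 7 bits.
def apply_rcv_cwnd_pend(cwnd, prev_rcvd_bytes, rcvd_bytes, rcv_cwnd_pend):
    newly_rcvd = rcvd_bytes - prev_rcvd_bytes      # (A) 16,384 - 12,288
    penalty = (newly_rcvd * rcv_cwnd_pend) >> 7    # (B) 4,096 * 64 / 128
    return cwnd - penalty                          # (C) throttled CWND

apply_rcv_cwnd_pend(75_776, 12_288, 16_384, 64)   # -> 73,728 bytes
```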

In a stable state, the CWND and the inflight bucket are typically equal in size; consequently, immediately following the decrement, the current inflight bucket exceeds the newly defined CWND limit by 2,048 bytes. This state violates the fundamental transport rule where the CCC allows the PDC to transmit data only when the inflight bucket is less than or equal to the CWND (5). In response, the PDC must suspend transmission, waiting for the destination to acknowledge enough packets to reduce the inflight bucket to less than or equal to the new CWND (6).

This pause allows the HBM controller the necessary time to clear its transaction queue. Only once the inflight level has drained to meet the new CWND ceiling can the CCC authorize the PDS to resume data transport. When set, the rc-flag (Restore CWND) signals that, once the congestion event has passed, the original CWND can be utilized again.



Figure 6-14: NSCC: Destination Flow Control.

NSCC Mechanism Summary


The Network-Signaled Congestion Control framework ensures high-performance data transfer by balancing the real-time Inflight Load against a dynamic Congestion Window (CWND). By utilizing proactive feedback from the fabric and the destination, the system maintains line-rate performance while preventing buffer overflow and high tail latency.

Proportional and Fast Increase: These methods are utilized when the network is underloaded, characterized by a lack of ECN-CE signals and queuing delays below the target threshold. Proportional Increase scales the CWND based on the gap between measured and target delays to optimize utilization. Fast Increase employs exponential growth to quickly reclaim bandwidth when the network remains significantly underutilized for a duration.

Fair Increase: This method is initiated as congestion subsides to ensure an equitable recovery among competing flows. By adding a fixed, constant amount to the CWND of every active flow, it allows flows with smaller windows to grow at a faster relative rate, eventually leading all participants to converge on a fair share of the available bandwidth.

Multiplicative Decrease: This action is used to protect the fabric during periods of high pressure, specifically when queuing delay exceeds targets and ECN feedback indicates stagnant queues. It slashes the CWND proportionally to the measured buffer excess, rapidly shedding load to return the network queue to its target occupancy level within a single Round-Trip Time.

Destination Flow Control (NSCC Receiver Penalty): This mechanism addresses bottlenecks at the receiver’s hardware level, such as the High-Bandwidth Memory (HBM) controller. By applying a penalty via the rcv_cwnd_pend parameter, the receiver forces the source to reduce its CWND based on a percentage of the newly acknowledged data. This pauses new data injections until the destination's transaction queues have drained, moving the queuing pressure from the memory controller back to the source.

CWND Restoration: The Restore CWND method, triggered by the rc-flag, allows a flow to immediately resume its original transmission rate once a congestion event has passed. This prevents the flow from having to slowly ramp back up through increase phases, ensuring that the system returns to peak efficiency as soon as the bottleneck, whether in the fabric or at the destination, is resolved.


Thursday, 29 January 2026

Ultra Ethernet: Inflight Bytes and CWND Adjustment

Inflight Packet Adjustment

Figure 6-12 depicts the ACK_CC header structure and fields. When NSCC is enabled in the UET node, the PDS must use the pds.type ACK_CC in the prologue header, which serves as the common header structure for all PDS messages. Within the actual PDS ACK_CC header, the pds.cc_type must be set to CC_NSCC. The pds.ack_cc_state field describes the values and states for service_time, rc (Restore CWND), rcv_cwnd_pend, and received_bytes. The source specifically utilizes the received_bytes parameter to calculate the updated state for inflight packets.

The CCC computes the reduction in the inflight state by subtracting the rcvd_bytes value received in previous ACK_CC messages from the rcvd_bytes value carried within the latest ACK_CC message. As illustrated in Figure 6-12, the inflight state is decreased by 4,096 bytes, which is the delta between 16,384 and 12,288 bytes.

Recap: In order to transport data to the network, the Inflight bytes must be less than the CWND size.
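A minimal sketch of this decrement and the gatekeeper check; the starting inflight value is illustrative, since the figure does not state one.

```python
# Decrease the inflight counter by the delta between consecutive
# cumulative rcvd_bytes values, then re-check the transmit rule.
def on_ack_cc(inflight, prev_rcvd_bytes, rcvd_bytes):
    return inflight - (rcvd_bytes - prev_rcvd_bytes)

def may_transmit(inflight, cwnd):
    return inflight < cwnd   # data may enter the network only if true

inflight = on_ack_cc(20_480, 12_288, 16_384)   # decreased by 4,096 bytes
```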



Figure 6-12: NSCC: Inflight Bytes adjustment.


CWND Adjustment


A single, shared Congestion Window (CWND) regulates the total volume of bytes across all PDCs that are permitted for transmission to the backend network. The transport rate and network performance are continuously monitored and serve as the baseline information for dynamic CWND adjustments. The primary objective of these adjustments is to maintain minimal queue depth to eliminate queuing delay while ensuring the backend network is not overloaded, thereby guaranteeing lossless, line-rate packet transport.

Figure 6-13 illustrates the ACK_CC message structure, specifically the pds.flags field, which indicates whether a switch in the backend network has experienced congestion in an outgoing interface queue. The m-flag is set when the packet being acknowledged carries ECN-CE bits within the ToS field of the IP header.

In addition to the ECN state, the CWND adjustment algorithm compares the measured Queuing Delay against the Target Delay. As a recap from the Inflight section, the Queuing Delay is calculated as follows:

Queuing Delay = Ack_arrival_time - pkt_tx_state - service_time

The pkt_tx_state is recorded by the PDS and passed to the CCC, which derives the queuing delay by subtracting both the transmission timestamp and the service_time from the ACK arrival time.

The service_time is measured at the target by calculating the delta between the packet reception time (rx_time) and the response transmission time (tx_time). This value is encoded into the service_time parameter within the pds.ack_cc_state field. It represents the total processing overhead for the Packet Delivery and Semantic Sublayers to handle the RMA operation, including header extraction, memory addressing, data writing, SES response generation, and ACK_CC construction. By isolating this processing time, the source can accurately determine the true delay caused strictly by the network fabric.
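The derivation can be restated as a one-line helper; the microsecond example values are assumptions, not figures from the text.

```python
# Queuing delay: remove the receiver's processing time from the
# measured round trip, leaving only the time spent in the network.
def queuing_delay(ack_arrival_time, pkt_tx_time, service_time):
    return ack_arrival_time - pkt_tx_time - service_time

# e.g. a 10 us round trip with 2 us of receiver processing leaves
# 8 us attributable to the network itself:
queuing_delay(10.0, 0.0, 2.0)   # -> 8.0
```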

Proportional and Fast Increase Logic


When the m-flag is clear (indicating no ECN-CE bits are detected) and the calculated Queuing Delay is less than the Target Delay, the network utilization is considered below its optimum. In this state, the CWND size increment is directly related to the magnitude of the difference between the measured queuing delay and the target delay.

A large difference between these two values indicates that network utilization is low and the path can handle a significantly higher volume of flows. The NSCC algorithm responds by increasing the CWND at a rate proportional to this gap, allowing more room for inflight packets. As the measured delay approaches the target delay and the difference narrows, the rate of increase automatically slows down to stabilize the flow.

In scenarios where the network remains significantly underloaded for a duration, such as when competing flows terminate, the system can escalate to a fast_increase. This mode employs exponential growth to quickly converge on the newly available bandwidth, remaining active until the system detects the first signs of incipient congestion.

Fair Increase Logic


The fair_increase action is initiated when the system detects that a congestion event is subsiding. This state occurs when ECN signals indicate that the network queue has drained below the configured threshold, even if a previous packet experienced congestion. This mechanism is primarily designed to prevent the transmission rate from "undershooting" the actual network capacity during the recovery phase.

In this mode, the NSCC algorithm performs an additive increase rather than a proportional one. By adding a constant, fixed amount to the CWND, the system promotes fairness among competing flows. Because every flow receiving the same signal increases its CWND by the identical fixed value, flows with smaller windows experience a larger relative growth rate compared to those with larger windows. This ensures that all active flows eventually converge toward an equitable share of the available bandwidth.

Multiplicative Decrease Logic


The multiplicative_decrease action is triggered when the measured Queuing Delay exceeds the Target Delay and ECN feedback indicates that the network queue is not effectively decreasing. In this state, the average delay serves as a direct metric for the volume of excess data currently enqueued beyond the desired threshold.

The NSCC algorithm reacts by reducing the CWND proportionally to this measured queuing excess. By directly tying the window reduction to the specific magnitude of the buffer overflow, the system can rapidly shed load to alleviate congestion. When all competing flows execute this coordinated reduction, the objective is to clear the bottleneck and return the queue to its target occupancy level within approximately one Round-Trip Time (RTT).
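The four actions described in the last sections can be summarized as a simplified decision sketch. The selection conditions are a plausible reading of the text rather than the UET-specified algorithm, and the queue_draining and long_underload inputs are assumed abstractions of the ECN and timing state.

```python
# Simplified NSCC CWND action selection based on the m-flag (ECN-CE)
# and the measured queuing delay relative to the target delay.
def select_action(ecn_ce, queuing_delay, target_delay,
                  queue_draining, long_underload):
    if not ecn_ce and queuing_delay < target_delay:
        # Underloaded network: grow proportionally, or exponentially
        # after a sustained underload period.
        return "fast_increase" if long_underload else "proportional_increase"
    if ecn_ce and queue_draining:
        return "fair_increase"            # additive, fairness-oriented recovery
    if queuing_delay > target_delay and not queue_draining:
        return "multiplicative_decrease"  # shed load proportional to excess
    return "hold"

select_action(False, 2.0, 5.0, False, False)  # -> "proportional_increase"
select_action(True, 7.0, 5.0, False, False)   # -> "multiplicative_decrease"
```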



Figure 6-13: NSCC: CWND Adjustment.


NSCC Summary


The Network-Signaled Congestion Control (NSCC) framework is designed to solve the fundamental challenge of high-performance networking: maximizing throughput while maintaining near-zero latency. This objective is achieved by managing a continuous balance between Inflight Load and the CWND Budget.

The system continuously adjusts transmission rates to ensure optimal network utilization. By proactively responding to fabric signals, NSCC maintains line-rate performance and prevents bufferbloat (persistent queue buildup that stops short of buffer overflow), ensuring that packets spend as little time as possible waiting in switch queues. The core gatekeeper rule of the transport layer is defined by the relationship between two variables. Inflight bytes represent the real-time volume of data currently transiting the network, while the Congestion Window (CWND) represents the total data budget the network can safely handle. The source calculates the current Inflight state by tracking cumulative received_bytes from ACK_CC messages, and new data is only injected into the fabric when the Inflight count is lower than the CWND.

The sender dynamically scales the CWND budget using a sophisticated state machine based on the m-flag (ECN-CE) and measured Queuing Delay. It utilizes Proportional or Fast Increase to fill available bandwidth when the network is underutilized, ensuring the pipe remains full. As congestion clears, the Fair Increase mechanism ensures multiple flows converge to an equitable share of the pipe. Conversely, the Multiplicative Decrease action aggressively slashes the window when buffers overflow to protect the fabric from packet loss.


Tuesday, 27 January 2026

Ultra Ethernet: Network-Signaled Congestion Control (NSCC) - Overview

Network-Signaled Congestion Control (NSCC)


The Network-Signaled Congestion Control (NSCC) algorithm operates on the principle that the network fabric itself is the best source of truth regarding congestion. Rather than waiting for packet loss to occur, NSCC relies on proactive feedback from switches to adjust transmission rates in real time. The primary mechanism for this feedback is Explicit Congestion Notification (ECN) marking. When a switch interface's egress queue begins to build up, it employs a Random Early Detection (RED) logic to mark specific packets. Once the buffer’s Minimum Threshold is crossed, the switch begins randomly marking packets by setting the last two bits of the IP header’s Type of Service (ToS) field to the CE (11) state. If the congestion worsens and the Maximum Threshold is reached, every packet passing through that interface is marked, providing a clear and urgent signal to the endpoints.

The practical impact of this mechanism is best illustrated by a hash collision event, such as the one shown in Figure 6-10. In this scenario, multiple GPUs on the left-hand side of the fabric transmit data at line rate. Due to the specific entropy of these flows, the ECMP hashing algorithms on leaf switches 1A-1 and 1A-2 inadvertently select the same uplink to Spine 1A. Because all destination GPUs are concentrated on leaf switch 1B-1, the spine is forced to aggregate these incoming flows—totaling 500 Gbps—into a single outgoing interface. This bottleneck causes the queue to fill rapidly. Consequently, Spine 1A marks packets destined for Rank 9 and Rank 5 with ECN-CE. When these marked packets reach the receiver, the Packet Delivery Service (PDS) detects the congestion signal and reflects it back to the source by setting the pds.m flag in the acknowledgement (ACK) message.

The second signaling mechanism is based on measured queuing delay, which provides a granular view of fabric pressure even when ECN marks are not present. The algorithm calculates this by measuring the current Round-Trip Time (RTT) and subtracting the Base_RTT—the known minimum RTT of an uncongested path. This difference (Delta RTT) represents the time a packet spent sitting in switch buffers. By isolating the queuing delay from the total propagation time, the algorithm can detect the earliest stages of buffer buildup with high precision.

To manage these signals effectively, the algorithm maintains a constant record of the inflight packet state, tracking every byte transmitted to the network that has not yet been acknowledged or NACKed by the receiver. By synthesizing three critical factors (ECN-CE signals, calculated queuing delay, and the volume of packets in flight), the NSCC algorithm dynamically adjusts the Congestion Window (CWND). This data allows the algorithm to decide precisely when a PDS is permitted to inject new data into the fabric and, if necessary, to rotate the Entropy Value (EV) to steer traffic toward underutilized paths, effectively resolving the collision and restoring optimal flow.


Figure 6-10: NSCC: Link Congestion due to Hash Collision.


The Overview of the NSCC Control Loop


Building on the previous overview, this section examines the granular mechanics of the NSCC process. Figure 6-11 illustrates the source-side operations as various Ranks initiate communication over the backend fabric. In this scenario, data from Ranks 0 and 8 is managed by Packet Delivery Context (PDC) 0x4001, Rank 2 is handled by PDC 0x4002, and Ranks 1 and 3 are assigned to PDC 0x4003. Each rank is tasked with transferring 4,096 KB of data. While abstracted in the diagram, the process begins when an application executes a fi_write RMA operation. This request is passed to the Semantic Sublayer (SES), which translates the intent into a UET_WRITE operation before handing it off to the PDC layer. Upon receiving new data, the PDC notifies the Congestion Control Context (CCC) Manager within the Congestion Management Sublayer (CMS) of a delta backlog (Steps 1a–c). This delta represents the volume of unsent data waiting in the PDC buffers that must be added to the total CCC backlog.

The CMS then acts as the gatekeeper; it compares the current inflight bytes against the Congestion Window (CWND). If the volume of data currently on the wire is less than the CWND, the CCC scheduler permits data transport (Step 2). In our example, there is sufficient headroom in the window, allowing the scheduler to authorize PDC 0x4001 to transmit. As the packet is dispatched, the hardware records the precise transmission time and injects the Entropy Value (EV) into the header to facilitate fabric load balancing (Step 3). Simultaneously, the Inflight state is incremented and the backlog is decremented to reflect the data now transiting the network (Steps 4 and 5).

The receiver processes the incoming packet and generates an ACK_CC message (Step 6). If the packet arrived with ECN-CE bits set by a switch, the receiver sets the pds.m flag in the ACK to signal that congestion was encountered. In this specific example, no congestion is encountered, so the pds.m bit remains unset. Crucially, the ACK_CC includes the service_time (the internal processing delay at the receiver) and a cumulative byte count to inform the source of the total data successfully received.

When the source receives the ACK_CC, it logs the arrival time (Step 7) and updates the CCC state. It decreases the inflight counter based on the rcvd_bytes value and adjusts the CWND. The adjustment is governed by two factors: the state of the ECN-CE bits and the measured Queuing Delay relative to the target delay. The Queuing Delay is calculated as:

Queuing Delay = Ack_arrival_time - pkt_tx_state - service_time

When packet trimming is used, the default target delay is the same as the configured base_rtt. Without packet trimming, the target delay is base_rtt * 0.75. The CWND adjustment options are explained in the next section.
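Stated as code, the target-delay choice is a simple selection; the helper name is illustrative.

```python
# Target delay per the rule above: base_rtt when packet trimming is
# enabled, 75% of base_rtt otherwise.
def target_delay(base_rtt, packet_trimming_enabled):
    return base_rtt if packet_trimming_enabled else base_rtt * 0.75

target_delay(8.0, True)    # -> 8.0
target_delay(8.0, False)   # -> 6.0
```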

This autonomous, self-adjusting control loop represents a sophisticated implementation of Intent-Based Networking (IBN) at the transport layer. The high-level "intent" is simple: the reliable delivery of data between Ranks at line rate with minimal tail latency. To fulfill this, the NSCC algorithm operates as a real-time, closed-loop system—monitoring network feedback, analyzing fabric pressure, and adapting injection rates without human intervention. By offloading this decision-making to the Congestion Management Sublayer (CMS), the fabric becomes self-optimizing, ensuring that even in the face of unpredictable hash collisions, the network remains a transparent utility for the application.



Figure 6-11: NSCC Operation.

The following section concludes our exploration of NSCC by detailing the specific fields within the ACK_CC header and illustrating how the source-side state machine transitions between different congestion levels. While the overview provided here is sufficient to understand the fundamental operations of NSCC, the subsequent deep dive is intended for those who require bit-level architectural details.

While NSCC serves as the primary proactive mechanism for modulating flow at the source, it is only one part of the Ultra Ethernet "congestion toolbox." To ensure total fabric reliability, UEC employs additional layers of defense, such as Receiver Credit-based Congestion Control (RCCC) and Packet Trimming. These mechanisms are designed to handle specific scenarios where proactive rate-limiting isn't enough, providing the "emergency" recovery needed to maintain near-line-rate performance. Each of these solutions will be explored in detail in the upcoming chapters.

Tuesday, 13 January 2026

Ultra Ethernet: Congestion Control Context

 Ultra Ethernet Transport (UET) uses a vendor-neutral, sender-specific congestion window–based congestion control mechanism together with flow-based, adjustable entropy-value (EV) load balancing to manage incast, outcast, local, link, and network congestion events. Congestion control in UET is implemented through coordinated sender-side and receiver-side functions to enforce end-to-end congestion control behavior.

On the sender side, UET relies on the Network-Signaled Congestion Control (NSCC) algorithm. Its main purpose is to regulate how quickly packets are transmitted by a Packet Delivery Context (PDC). The sender adapts its transmission window based on round-trip time (RTT) measurements and Explicit Congestion Notification (ECN) Congestion Experienced (CE) feedback conveyed through acknowledgments from the receiver.

On the receiver side, Receiver Credit-based Congestion Control (RCCC) limits incast pressure by issuing credits to senders. These credits define how much data a sender is permitted to transmit toward the receiver. The receiver also observes ECN-CE markings in incoming packets to detect path congestion. When congestion is detected, the receiver can instruct the sender to change the entropy value, allowing traffic to be steered away from congested paths.

Both sender-side and receiver-side mechanisms ultimately control congestion by limiting the amount of in-flight data, meaning data that has been sent but not yet acknowledged. In UET, this coordination is handled through a Congestion Control Context (CCC). The CCC maintains the congestion control state and determines the effective transmission window, thereby bounding the number of outstanding packets in the network. A single CCC may be associated with one or more PDCs communicating between the same Fabric Endpoint (FEP) within the same traffic class.
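The CCC-to-PDC association described above can be sketched as a lookup keyed by destination FEP and traffic class, so that multiple PDCs toward the same endpoint share one context. The structure and names here are illustrative assumptions, not UET-defined interfaces.

```python
# Sketch of how a single CCC may be shared by several PDCs that target the
# same Fabric Endpoint (FEP) within the same traffic class.
# Names are illustrative, not UET specification identifiers.

class CCC:
    def __init__(self):
        self.state = "IDLE"   # initial state before configuration
        self.pdcs = set()     # PDCs bound to this context

ccc_table = {}   # (dest_fep, traffic_class) -> CCC

def get_ccc(dest_fep, traffic_class, pdc_id):
    key = (dest_fep, traffic_class)
    ccc = ccc_table.setdefault(key, CCC())   # reuse or create the context
    ccc.pdcs.add(pdc_id)
    return ccc

a = get_ccc("fep-2", "tc1", 0x4001)
b = get_ccc("fep-2", "tc1", 0x4003)   # same FEP + TC -> same shared CCC
print(a is b)                          # True
```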


Initializing Congestion Control Context (CCC)

When the PDS Manager receives an RMA operation request from the SES layer, it first checks whether a suitable Packet Delivery Context (PDC) already exists for the JobID, destination FEP, traffic class, and delivery mode. If no matching PDC is found, the PDS Manager allocates a new one.

For the first PDC associated with a specific FEP-to-FEP flow, a Congestion Control Context (CCC) is required to manage end-to-end congestion. The PDS Manager requests this context from the CCC Manager within the Congestion Management Sublayer (CMS). Upon instantiation, the CCC initially enters the IDLE state, containing basic data structures without an active configuration.

The CCC Manager then initializes the context by calculating values and thresholds, such as the Initial Congestion Window (Initial CWND) and Maximum CWND (MaxWnd), using pre-defined configuration parameters. Once these initial source states for the NSCC are set, the CCC is bound to the corresponding PDC.

When fully configured, the CCC transitions to the READY state. This transition signals that the CCC is authorized to enforce congestion control policies and monitor traffic. The CCC serves as the central control structure for congestion management, hosting either sender-side (NSCC) or receiver-side (RCCC) algorithms. Because a CCC is unidirectional, it is instantiated independently on both the sender and the receiver.

Once in the READY state, the PDC is permitted to begin data transmission. The CCC maintains the active state required to regulate flow, enabling the NSCC and RCCC to enforce windows, credits, and path usage to prevent network congestion and optimize transport efficiency.

Note: In this model, the PDS Manager acts as the control-plane authority responsible for context management and coordination, while the PDC handles data-plane execution under the guidance of the CCC. Once the CCC is operational, RMA data transfers proceed directly via the PDC without further involvement from the PDS Manager.



Figure 6-6: Congestion Context: Initialization.

Calculating Initial CWND


Following the initialization of the Congestion Control Context (CCC) for a Packet Delivery Context (PDC), specific configuration parameters are used to establish the Initial Congestion Window (CWND) and the Maximum Congestion Window (MaxWnd). 

The Congestion Window (CWND) defines the maximum number of "in-flight" bytes, data that has been transmitted but not yet acknowledged by the receiver. Effectively, the CWND regulates the volume of data allowed on the wire for a specific flow at any given time to prevent network saturation.

The primary element for computing the CWND is the Bandwidth-Delay Product (BDP). To determine the path-specific BDP, the algorithm selects the slowest link speed and multiplies it by the configured base Round-Trip Time (config_base_rtt):

BDP = min(sender.linkspeed, receiver.linkspeed) × config_base_rtt

The config_base_rtt represents the latency over the longest physical path under zero-load conditions. This value is a static constant derived from the cumulative sum of:
  • Serialization delays (time to put bits on the wire)
  • Propagation delays (speed of light through fiber)
  • Switching delays (internal switch traversal)
  • FEC (Forward Error Correction) delays

Setting MaxWnd


The MaxWnd serves as a definitive upper limit for the CWND that cannot be exceeded under any circumstances. It is typically derived by multiplying the calculated BDP by a factor of 1.5. While a CWND equal to 1.0 × BDP is theoretically sufficient to saturate a link, real-world variables, such as transient bursts, scheduling jitter, or variations in switch processing, can cause the link to go idle if the window is too restrictive. UET allows the CWND to grow up to 1.5 × BDP to maintain high utilization and accommodate acknowledgment (ACK) clocking dynamics.

Example Calculation: Consider a flow where the slowest link speed is 100 Gbps and the config_base_rtt is 6.0 µs.

Calculate BDP (bits): 100 × 10⁹ bps × 0.000006 s = 600,000 bits
Calculate BDP (bytes): 600,000 / 8 = 75,000 bytes
Calculate MaxWnd: 75,000 × 1.5 = 112,500 bytes
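The example calculation above can be reproduced with a few lines of code. The function names are illustrative.

```python
# The BDP / MaxWnd example as a small calculation sketch.
# Function names are illustrative.

def bdp_bytes(link_speed_bps, base_rtt_s):
    """Bandwidth-Delay Product in bytes (divide bits by 8)."""
    return link_speed_bps * base_rtt_s / 8

def max_wnd(bdp, factor=1.5):
    """Upper limit for the CWND: 1.5 x BDP by default."""
    return bdp * factor

bdp = bdp_bytes(min(100e9, 100e9), 6.0e-6)   # slowest link x base RTT
print(bdp)            # 75000.0 bytes
print(max_wnd(bdp))   # 112500.0 bytes
```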

Note on Incast Prevention: While the "ideal" initial CWND is 1.0 x BDP, UET allows the starting window to be configured to a significantly smaller value (e.g., 10–32 KB or a few MTUs). This configuration prevents Incast congestion, a phenomenon where the aggregate traffic from multiple ingress ports exceeds the physical capacity of an egress port. By starting with a conservative CWND, the system ensures that the switch's egress buffers are not exhausted during the first RTT, providing the NSCC algorithm sufficient time to measure RTT inflation and modulate the flow rates.

A common misconception is that the BDP limits the transmission rate. In reality, the BDP defines the volume of data required to keep the "pipe" full. While the Initial CWND may be only 75,000 bytes, it is replenished every RTT. At a 6.0 µs RTT, this volume translates to a full 100 Gbps line rate:

600,000 bits / 6.0 µs = 600,000 / 0.000006 = 100 × 10⁹ bps = 100 Gbps

Therefore, a window of 1.0 x BDP achieves 100% utilization. The 1.5 x BDP (MaxWnd) simply provides the necessary headroom to prevent the link from going idle during minor acknowledgment delays.

Figure 6-7: CC Config Parameters, Initial CWND and MaxWnd.

Calculating New CWND


When the network is uncongested, indicated by a measured RTT remaining near the base_rtt, the NSCC algorithm performs an Additive Increase (AI) to grow the CWND. To ensure fairness across the entire fabric, the algorithm utilizes a universal Base_BDP parameter rather than the path-specific BDP.

The Base_BDP is a fixed protocol constant (typically 150,000 bytes, derived from a reference 100 Gbps link at 12 µs). The new CWND is calculated by adding a fraction of this constant to the current window:

CWND(new) = CWND(current) + Base_BDP / Scaling_Factor

Using a universal constant ensures scale invariance in a mixed-speed fabric (e.g., 100G and 400G NICs).

If a 400G NIC were to use its own BDP (300,000 bytes) for the increase step, its window would grow four times faster than that of a 100G NIC. By using the shared Base_BDP (150,000 bytes), both NICs increase their throughput by the same number of bytes per second. This "normalized acceleration" prevents faster NICs from starving slower flows during the capacity-seeking phase.

As illustrated in Figure 6-8, consider a flow with an Initial CWND of 75,000 bytes, a Base_BDP of 150,000 bytes, and a Scaling Factor of 1024:

Step Size = 150,000 / 1024 ≈ 146.5  bytes
New CWND = 75,000 + 146.5 = 75,146.5 bytes

Note: Scaling factors are ideally set to powers of 2 (e.g., 512, 1024, 2048, 4096, 8192) to allow the hardware to use fast bit-shifting operations instead of expensive division. 

Higher factors (e.g., 8192): Result in smaller, smoother increments (high stability). 
Lower factors (e.g., 512): Result in larger increments (faster convergence to link rate).
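The Additive Increase step with a power-of-two scaling factor can be sketched with a bit shift, as the hardware would perform it. Note that the integer shift truncates (146 bytes) whereas the exact division in the text gives 146.5; names here are illustrative.

```python
# Sketch of the Additive Increase step using a bit shift for a
# power-of-two scaling factor. Names are illustrative.

BASE_BDP = 150_000        # universal constant (bytes)

def additive_increase(cwnd, scaling_shift=10):
    """Grow CWND by Base_BDP / 2**scaling_shift (1024 when shift=10).

    The shift truncates: 150,000 >> 10 = 146 bytes, versus the exact
    146.5 bytes from the division in the worked example.
    """
    return cwnd + (BASE_BDP >> scaling_shift)

cwnd = additive_increase(75_000)
print(cwnd)   # 75146
```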

Figure 6-8: Increasing CWND.




Tuesday, 6 January 2026

UET Congestion Management: CCC Base RTT

Calculating Base RTT

[Edit: January 7 2026, RTT role in CWND adjustment process]

As described in the previous section, the Bandwidth-Delay Product (BDP) is a baseline value used when setting the maximum size (MaxWnd) of the Congestion Window (CWND). The BDP is calculated by multiplying the lowest link speed among the source and destination nodes by the Base Round-Trip Time (Base_RTT).

In addition to its role in BDP calculation, Base_RTT plays a key role in the CWND adjustment process. During operation, the RTT measured for each packet is compared against the Base_RTT. If the measured RTT is significantly higher than the Base_RTT, the CWND is reduced. If the RTT is close to or lower than the Base_RTT, the CWND is allowed to increase.

This adjustment process is described in more detail in the upcoming sections.

The config_base_rtt parameter represents the RTT of the longest path between sender and receiver when no other packets are in flight. In other words, it reflects the minimum RTT under uncongested conditions. Figure 6-7 illustrates the individual delay components that together form the RTT.

Serialization Delay: The network shown in Figure 6-7 supports jumbo frames with an MTU of 9216 bytes. Serialization delay is measured in time per bit, so the frame size must first be converted from bytes to bits:

9216 bytes × 8 = 73,728 bits

Serialization delay is then calculated by dividing the frame size in bits by the link speed. For a 100 Gbps link:

73,728 bits / 100 Gbps = 0.737 µs

Note: In a cut-through switched network, which is standard in modern 100 Gbps and above data center fabrics, the switch does not wait for the full 9216-byte frame to arrive before forwarding it. Instead, it processes only the packet header (typically the first 64–128 bytes) to determine the destination MAC or IP address and immediately begins transmitting the packet on the egress port. While the tail of the packet is still arriving on the ingress port, the head is already leaving the switch.

This behavior creates a pipeline effect, where bits flow through the network similarly to water through a pipe. As a result, when calculating end-to-end latency from a first-bit-in to last-bit-out perspective, the serialization delay is effectively incurred only once—the time required to place the packet onto the first link.

Propagation delay: The time it takes for light to travel through the cabling infrastructure. In our example, the combined fiber-optic length between Rank 0 on Node A1 and GPU 7 on Node A2 is 50 meters. Light travels through fiber at approximately 5 ns per meter, resulting in a propagation delay of:

50 m × 5 ns/m = 250 ns = 0.250 µs

Switching Delay (Cut-Through): The time a packet spends inside a network switch while being processed before it is forwarded. This latency arises from internal operations such as examining the packet header, performing a Forwarding Information Base (FIB) lookup to determine the correct egress port, and updating internal buffers and queues.

In modern cut-through switches, much of this processing occurs while the packet is still being received, so the added delay per switch is very small. High-end 400G switches exhibit cut-through latencies on the order of 350–500 ns per switch. For a path traversing three switches, the total switching delay sums to approximately:

3 × 400 ns ≈ 1.2 µs

Thus, even with multiple hops, switching delay contributes only a modest portion to the total Base RTT in 100 Gbps and above data center fabrics.

Forward Error Correction (FEC) Delay: Forward Error Correction (FEC) ensures reliable, “lossless” data transfer in high-speed AI fabrics. It is required because high-speed optical links can experience bit errors due to signal distortion, fiber imperfections, or high-frequency signaling noise.

FEC operates using data blocks and symbols. The outgoing data is divided into fixed-size blocks, each consisting of data symbols. In 100G and 400G Ethernet FEC, one symbol = 10 bits. For example, a 514-symbol data block contains 514 × 10 = 5,140 bits of actual data.

To detect and correct errors, the switch or NIC ASIC computes parity symbols from the data block using Reed-Solomon (RS) math and appends them to the block. The combination of the original data and the parity symbols forms a codeword. For example, in RS(544, 514), the codeword has 544 symbols in total, of which 514 are data symbols and 30 are parity symbols. Each symbol is 10 bits, so the 30 parity symbols add 300 extra bits to the codeword.

At the receiver, the codeword is checked: the parity symbols are used to detect and correct any corrupted symbols in the original data block. Because RS-FEC operates on symbols rather than individual bits, if multiple bits within a single 10-bit symbol are corrupted, the entire symbol is corrected as a single unit.
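The RS(544, 514) codeword arithmetic from the paragraphs above works out as follows (each symbol is 10 bits):

```python
# RS(544,514) codeword arithmetic: 514 data symbols plus 30 parity
# symbols, 10 bits per symbol.

SYMBOL_BITS = 10
DATA_SYMBOLS = 514
TOTAL_SYMBOLS = 544

data_bits = DATA_SYMBOLS * SYMBOL_BITS                        # 5,140 data bits
parity_bits = (TOTAL_SYMBOLS - DATA_SYMBOLS) * SYMBOL_BITS    # 300 parity bits
codeword_bits = TOTAL_SYMBOLS * SYMBOL_BITS                   # 5,440 bits on the wire

print(data_bits, parity_bits, codeword_bits)                  # 5140 300 5440
```

The parity overhead is therefore about 5.8% of the data payload (300 / 5,140 bits).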

The FEC latency (or accumulation delay) comes from the requirement to receive the entire codeword before error correction can begin. For a 400G RS(544, 514) codeword:

544 symbols × 10 bits/symbol = 5,440 bits total

At 400 Gbps, this accumulation, together with the decoding process, adds a fixed delay of roughly 150 ns per hop.

This delay is a “fixed cost” of high-speed networking and must be included in the Base RTT calculation for AI fabrics. The sum of all delays gives the one-way delay, and the round-trip time (RTT) is obtained by multiplying this value by two. The config_base_rtt value in Figure 6-7 is the RTT rounded to a safe, reasonable integer.
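The per-component values from the worked example can be summed and doubled to produce the RTT. Assuming (as a hedge) that the ~150 ns FEC delay applies at each of the three hops, the result lands just above 5 µs, consistent with a config_base_rtt rounded up to 6 µs.

```python
# One-way delay components from the worked example, summed and doubled
# to get the RTT. The 3-hop FEC assumption is illustrative.

serialization_us = 73_728 / 100e9 * 1e6   # 9216-B frame on a 100G link ~0.737 us
propagation_us   = 50 * 5 / 1000          # 50 m of fiber at 5 ns/m = 0.250 us
switching_us     = 3 * 400 / 1000         # three cut-through switches = 1.2 us
fec_us           = 3 * 150 / 1000         # ~150 ns FEC per hop, 3 hops assumed

one_way = serialization_us + propagation_us + switching_us + fec_us
rtt = 2 * one_way
print(round(rtt, 2))   # 5.27 -> rounded up to a safe integer config_base_rtt
```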

Figure 6-7: Calculating Base_RTT Value.

Saturday, 3 January 2026

UET Congestion Management: Congestion Control Context

Congestion Control Context

Updated 5.1.2026: Added CWND computation example into figure. Added CWND computation into text.
Updated 13.1.2026: Deprecated by: Ultra Ethernet: Congestion Control Context

Ultra Ethernet Transport (UET) uses a vendor-neutral, sender-specific congestion window–based congestion control mechanism together with flow-based, adjustable entropy-value (EV) load balancing to manage incast, outcast, local, link, and network congestion events. Congestion control in UET is implemented through coordinated sender-side and receiver-side functions to enforce end-to-end congestion control behavior.

On the sender side, UET relies on the Network-Signaled Congestion Control (NSCC) algorithm. Its main purpose is to regulate how quickly packets are transmitted by a Packet Delivery Context (PDC). The sender adapts its transmission window based on round-trip time (RTT) measurements and Explicit Congestion Notification (ECN) Congestion Experienced (CE) feedback conveyed through acknowledgments from the receiver.

On the receiver side, Receiver Credit-based Congestion Control (RCCC) limits incast pressure by issuing credits to senders. These credits define how much data a sender is permitted to transmit toward the receiver. The receiver also observes ECN-CE markings in incoming packets to detect path congestion. When congestion is detected, the receiver can instruct the sender to change the entropy value, allowing traffic to be steered away from congested paths.

Both sender-side and receiver-side mechanisms ultimately control congestion by limiting the amount of in-flight data, meaning data that has been sent but not yet acknowledged. In UET, this coordination is handled through a Congestion Control Context (CCC). The CCC maintains the congestion control state and determines the effective transmission window, thereby bounding the number of outstanding packets in the network. A single CCC may be associated with one or more PDCs communicating between the same Fabric Endpoint (FEP) within the same traffic class.


Initializing Congestion Control Context (CCC)

When the PDS Manager receives an RMA operation request from the SES layer, it first checks whether a suitable Packet Delivery Context (PDC) already exists for the JobID, destination FEP, traffic class, and delivery mode carried in the request. If no matching PDC is found, the PDS Manager allocates a new one.

For the first PDC associated with a particular destination, a Congestion Control Context (CCC) is required to manage end-to-end congestion for that flow. The PDS Manager requests a CCC from the CCC Manager within the Congestion Management Sublayer (CMS). The CCC Manager creates the CCC, which initially enters the IDLE state, containing only the basic data structures without an active configuration. After creation, the CCC is bound to the PDC.

Next, the CCC is assigned a congestion window (CWND), which is computed based on CCC configuration parameters. The first step is to compute the Bandwidth-Delay Product (BDP), which is used to derive the upper bound for the initial congestion window. The CWND limits the total number of bytes in flight across all paths between the sender and the receiver.

The BDP is computed as:

BDP = min(sender_link_speed, receiver_link_speed) × config_base_rtt

The link speed must be expressed in bytes per second, not bits per second, because BDP is measured in bytes. The min() operator selects the smaller of the sender and receiver link speeds. In an AI fabric, these values are typically identical. The sender link speed, receiver link speed, and config_base_rtt are pre-assigned configuration parameters.

UET typically allows a maximum in-flight volume of 1.5 × BDP to provide throughput headroom while minimizing excessive queuing. A factor of 1.0 represents the minimum required to “fill the pipe” and would set the BDP directly as the maximum congestion window (MaxWnd). However, the UET specification applies a factor of 1.5 to allow controlled oversubscription and improved utilization.

Once the CWND is assigned and the CCC is bound to the PDC, the CCC transitions from the IDLE state to the ACTIVE state. In the ACTIVE state, the CCC holds all configuration information and is associated with the PDC, but data transport has not yet started.

When the CCC is fully configured and ready for operation, it transitions to the READY state. This transition signals that the CCC can enforce congestion control policies and monitor traffic. At this point, the PDC is allowed to begin sending data, and the CCC tracks and regulates the flow according to the configured congestion control algorithms.

The CCC serves as the central control structure for congestion management, hosting either sender-side (NSCC) or receiver-side (RCCC) algorithms. A CCC is unidirectional and is instantiated independently on both the sender and the receiver, where it is locally associated with the corresponding PDC. Once in the READY state, the CCC maintains the state required to regulate data flow, enabling NSCC and RCCC to enforce congestion windows, credits, and path usage to prevent network congestion and maintain efficient data transport.

Note: In this model, the PDS Manager acts as the control-plane authority responsible for context management and coordination, while the Packet Delivery Context (PDC) performs data-plane execution under the control of the Congestion Control Context (CCC). Once the CCC is operational and the PDC is authorized for data transport, RMA data transfers proceed directly over the PDC without further involvement from the PDS Manager.



Figure 6-6: Congestion Context: Initialization.

Monday, 29 December 2025

UET Congestion Management: Introduction

 Introduction


Figure 6-1 depicts a simple scale-out backend network for an AI data center. The topology follows a modular design, allowing the network to scale out or scale in as needed. The smallest building block in this example is a segment, which consists of two nodes, two rail switches, and one spine switch. Each node in the segment is equipped with a dual-port UET NIC and two GPUs.

Within a segment, GPUs are connected to the leaf switches using a rail-based topology. For example, in Segment 1A, the communication path between GPU 0 on Node A1 and GPU 0 on Node A2 uses Rail A0 (Leaf 1A-1). Similarly, GPU 1 on both nodes is connected to Rail A1 (Leaf 1A-2). In this example, we assume that intra-node GPU collective communication takes place over an internal, high-bandwidth scale-up network (such as NVLink). As a result, intra-segment GPU traffic never reaches the spine layer. Communication between segments is carried over the spine layer.

The example network is a best-effort (that is, PFC is not enabled) two-tier, three-stage non-blocking fat-tree topology, where each leaf and spine switch has four 100-Gbps links. Leaf switches have two host-facing links and two inter-switch links, while spine switches have four inter-switch links. All inter-switch and host links are Layer-3 point-to-point interfaces, meaning that no Layer-2 VLANs are used in the example network.

Links between a node’s NIC and the leaf switches are Layer-3 point-to-point connections. The IP addressing scheme uses /31 subnets, where the first address is assigned to the host NIC and the second address to the leaf switch interface. These subnets are allocated in a contiguous manner so they can be advertised as a single BGP aggregate route toward the spine layer.

The trade-off of this aggregation model is that host-link or NIC failures cannot rely solely on BGP route withdrawal for fast failure detection. Additional local failure-detection mechanisms are therefore required at the leaf switch.

Although not shown in Figure 6-1, the example design supports a scalable multi-pod architecture. Multiple pods can be interconnected through a super-spine layer, enabling large-scale backend networks.

Note: The OSI between GPUs within a node indicates that both GPUs belong to the same Operating System Instance (OSI). The link between GPUs, in turn, is part of a high-bandwidth domain (the scale-up backend).

Figure 6-1: Example of AI DC Backend Networks Topology.

Congestion Types

In this text, we categorize congestion into two distinct domains: congestion within nodes, which includes incast, local, and outcast congestion, and congestion in scale-out backend networks, which includes link and network congestion. The following sections describe each congestion type in detail.


Incast Congestion

In high-performance networking, Incast is a specific type of congestion that occurs when a many-to-one communication pattern overwhelms a single network point. This is fundamentally a "fan-in" problem, where the traffic volume destined for a single receiver exceeds both the physical line rate of the last-hop switch's egress interface and the storage capacity of its output buffers.

To visualize this, consider the configuration in Figure 6-2. The setup consists of four UET Nodes (A1, A2, B1, and B2), each containing two GPUs. This results in eight total processing units, labeled Rank 0 through Rank 7. Each Rank is equipped with its own dedicated 100G NIC.

The bottleneck forms when multiple sources target a single destination simultaneously. In this scenario, Ranks 1 through 7 all begin transmitting data to Rank 0 at the exact same time, each at a 100G line rate.

The backbone of the network is typically robust enough to handle this aggregate traffic. If the switches are connected via 400G or 800G links, the core of the network stays clear and fast. If the core were to experience congestion, Network Signaled Congestion Control (NSCC) could be enabled to manage it. However, the specific problem here occurs at Leaf 1A-1, the switch where the target (Rank 0) is connected. While the switch receives a combined 600G of data destined for Rank 0, the outgoing interface from the switch to Rank 0 can only move 100G. Note that Rank 1 uses the high-speed NVLink scale-up connection rather than its Ethernet NIC, which is why only six senders contribute to the fan-in.

A buffer overflow is inevitable when 600G of data arrives at an egress port that can only output 100G. The switch is forced to store the extra 500G of data per second in its internal memory (buffers). Because network buffers are quite small and high-speed data moves incredibly fast, these buffers fill up in microseconds.

Once the buffers are full, the switch has no choice but to drop any new incoming packets. This leads to massive retransmission delays and "stuttering" in application performance. This is particularly devastating for AI training workloads, where all Ranks must stay synchronized to maintain efficiency.
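To see why the buffers fill in microseconds, consider the excess rate: 600G arriving at a 100G drain leaves 500 Gbps of surplus. The buffer size below is an assumed example value, not a figure from the text.

```python
# Rough sketch of how fast incast fills an egress buffer.
# The 16 MB buffer size is an assumed example value.

BUFFER_BYTES = 16 * 1024 * 1024      # assumed shared buffer capacity
excess_bps = (600 - 100) * 1e9       # arrival rate minus drain rate

fill_time_us = BUFFER_BYTES * 8 / excess_bps * 1e6
print(round(fill_time_us, 1))        # ~268.4 us until the buffer overflows
```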

While traditional networks use simple buffer management to deal with this, Ultra Ethernet utilizes a more sophisticated approach. To prevent "fan-in" from ever overwhelming the switch buffers in the first place, UET employs Receiver Credit-based Congestion Control (RCCC). This mechanism ensures the receiver remains in control by distributing credits that define exactly how much data each active source is allowed to transmit at any given time.


Figure 6-2: Intra-node Congestion - Incast Congestions.


Local Congestion

Local congestion arises when the High-Bandwidth Memory (HBM) controller, which manages access to the GPU’s memory channels, becomes a bottleneck. The HBM controller arbitrates all read and write requests to GPU memory, regardless of their source. These requests may originate from the GPU’s compute cores, from a peer GPU via NVLink, or from a network interface card (NIC) performing remote memory access (RMA) operations.

With a UET_WRITE operation, the target GPU compute cores are bypassed: the NIC writes data directly into GPU memory using DMA. The GPU does not participate in the data transfer itself, and the NIC handles packet reception and memory writes. Even in this case, however, the data must still pass through the HBM controller, which serves as the shared gateway to the GPU’s memory system.

In Figure 6-3, the HBM controller of Rank 0 receives seven concurrent memory access requests: six inter-node RMA write requests and one intra-node request. The controller must arbitrate among these requests, determining the order and timing of each access. If the aggregate demand exceeds the available memory bandwidth or arbitration capacity, some requests are delayed. These memory-access delays are referred to as local congestion.



Figure 6-3: Intra-node Congestion - Local Congestions.


Outcast Congestion

Outcast congestion is the third type of congestion observed in collective operations. It occurs when multiple packet streams share the same egress port, and some flows are temporarily delayed relative to others. Unlike incast congestion, which arises from simultaneous arrivals at a receiver, outcast happens when certain flows dominate the output resources, causing other flows to experience unfair delays or buffer pressure.

Consider the broadcast phase of the AllReduce operation. After Rank 0 has aggregated the gradients from all participating ranks, it sends the averaged results back to all other ranks. Suppose Rank 0 sends these updates simultaneously to ranks on node A2 and node A3 over the same egress queue of its NIC. If one destination flow slightly exceeds the others in packet rate, the remaining flows experience longer queuing delays or may even be dropped if the egress buffer becomes full. These delayed flows are “outcast” relative to the dominant flows.

In this scenario, the NIC at Rank 0 must perform multiple UET_WRITE operations in parallel, generating high egress traffic toward several remote FEPs. At the same time, the HBM controller on Rank 0 may become a bottleneck because the data must be read from memory to feed the NIC. Thus, local congestion can occur concurrently with outcast congestion, especially during large-scale AllReduce broadcasts where multiple high-bandwidth streams are active simultaneously.

Outcast congestion illustrates that even when the network’s total capacity is sufficient, uneven traffic patterns can cause some flows to be temporarily delayed or throttled. Mitigating outcast congestion is addressed by appropriate egress scheduling and flow-control mechanisms to ensure fair access to shared resources and predictable collective operation performance. These mechanisms are explained in the upcoming Network-Signaled Congestion Control (NSCC) and Receiver Credit-Based Congestion Control (RCCC) chapters.


Figure 6-4: Intra-node Congestion - Outcast Congestions.


Link Congestion


Traffic in distributed neural network training workloads is dominated by bursty, long-lived elephant flows. These flows are tightly coupled to the application’s compute–communication phases. During the forward pass, network traffic is minimal, whereas during the backward pass, each GPU transmits large gradient updates at or near line rate. Because weight updates can only be computed after gradient synchronization across all workers has completed, even a single congested link can delay the entire training step.

In a routed, best-effort fat-tree Clos fabric, link congestion may be caused by Equal-Cost Multi-Path (ECMP) collisions. ECMP typically uses a five-tuple hash—comprising the source and destination IP addresses, transport protocol, and source and destination ports—to select an outgoing path for each flow. During the backward pass, a single rank often synchronizes multiple gradient chunks with several remote ranks simultaneously, forming a point-to-multipoint traffic pattern.

For example, suppose Ranks 0–3 in segment 1 initiate gradient synchronization with Ranks 4–7 in segment 2 at the same time. Ranks 0 and 2 are connected to rail 0 through Leaf 1A-1, while Ranks 1 and 3 are connected to rail 1 through Leaf 1A-2. As shown in Figure 6-5, the ECMP hash on Leaf 1A-1 selects the same uplink toward Spine 1A for both flows arriving via rail 0, while the ECMP hash on Leaf 1A-2 distributes its flows evenly across the available spine links.

As a result, two 100-Gbps flows are mapped onto a single 100-Gbps uplink on Leaf 1A-1. The combined traffic exceeds the egress link capacity, causing buffer buildup and eventual buffer overflow on the uplink toward Spine 1A. This condition constitutes link congestion, even though alternative equal-cost paths exist in the topology.
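The collision described above can be illustrated with a minimal sketch of a five-tuple ECMP hash. The hash function (MD5 here), the flow tuples, and the two-uplink topology are all illustrative assumptions; real switches use vendor-specific hash algorithms and seeds.

```python
import hashlib

def ecmp_uplink(src_ip, dst_ip, proto, src_port, dst_port, n_uplinks):
    """Hash the five-tuple and select one of n_uplinks egress links.
    MD5 is used here only for illustration; real ASICs use their own
    hash functions."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % n_uplinks

# Two distinct flows arriving via rail 0 may still hash to the same
# uplink (an ECMP collision), even though an alternative equal-cost
# path toward the spine layer exists.
flow_a = ecmp_uplink("10.0.1.0", "10.0.2.4", 17, 49152, 4791, 2)
flow_b = ecmp_uplink("10.0.1.2", "10.0.2.6", 17, 49152, 4791, 2)
print(flow_a, flow_b)  # a collision occurs whenever the two values match
```

Because the mapping is deterministic per flow, a collision persists for the lifetime of the flows: both 100-Gbps elephant flows stay pinned to the same uplink until one of them ends.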

In large-scale AI fabrics, thousands of concurrent flows may be present, and low entropy in traffic patterns—such as many flows sharing similar IP address ranges and port numbers—further increases the likelihood of ECMP collisions. Consequently, link utilization may become uneven, leading to transient congestion and performance degradation even in a nominally non-blocking network.

Ultra Ethernet Transport includes signaling mechanisms that allow endpoints to react to persistent link congestion, including influencing path selection in ECMP-based fabrics. These mechanisms are discussed in later chapters.

Note: Although outcast congestion is fundamentally caused by the same condition—attempting to transmit more data than an egress interface can sustain—Ultra Ethernet Transport distinguishes between host-based and switch-based egress congestion events and applies different signaling and control mechanisms to each. These mechanisms are described in the following congestion control chapters.



Figure 6-5: Link Congestion.

Network Congestion


Common causes of network congestion include an excessively high oversubscription ratio, ECMP collisions, and link or device failures. A less obvious but important source of short-term congestion is Priority Flow Control (PFC), which is commonly used to build lossless Ethernet networks. PFC together with Explicit Congestion Notification (ECN) forms the foundation of Lossless Ethernet for RoCEv2 but should be avoided in a UET-enabled best-effort network. The upcoming chapters explain why.

PFC relies on two buffer thresholds to control traffic flow: xOFF and xON. The xOFF threshold defines the point at which a switch generates a pause frame when a priority queue becomes congested. A pause frame is an Ethernet MAC control frame that tells the upstream device which Traffic Class (TC) queue is congested and for how long packet transmission for that TC should be paused. Packets belonging to other traffic classes can still be forwarded normally. Once the buffer occupancy drops below the xON threshold, the switch sends a resume signal, allowing traffic for that priority queue to continue before the actual pause timer expires.
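The xOFF/xON hysteresis described above can be sketched as a simple per-traffic-class queue model. The class name and threshold values are illustrative, not real switch defaults.

```python
class PfcQueue:
    """Per-traffic-class egress queue with PFC pause/resume hysteresis.
    Threshold values are illustrative byte counts."""

    def __init__(self, xoff=8000, xon=4000):
        self.xoff, self.xon = xoff, xon
        self.occupancy = 0
        self.paused = False          # has a pause frame been sent upstream?

    def enqueue(self, nbytes):
        self.occupancy += nbytes
        if not self.paused and self.occupancy >= self.xoff:
            self.paused = True       # send PFC pause frame for this TC
        return self.paused

    def dequeue(self, nbytes):
        self.occupancy = max(0, self.occupancy - nbytes)
        if self.paused and self.occupancy <= self.xon:
            self.paused = False      # send resume before the pause timer expires
        return self.paused

q = PfcQueue()
q.enqueue(9000)   # crosses xOFF -> upstream pauses this traffic class
assert q.paused
q.dequeue(3000)   # occupancy 6000, still above xON -> remain paused
assert q.paused
q.dequeue(3000)   # occupancy 3000, below xON -> resume
assert not q.paused
```

The gap between xOFF and xON provides hysteresis: without it, occupancy oscillating around a single threshold would generate a storm of pause and resume frames.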

At first sight, PFC appears to affect only a single link and only a specific traffic class. In practice, however, a PFC pause can trigger a chain reaction across the network. For example, if the egress buffer occupancy for TC-Low on the interface toward Rank 7 on Leaf switch 1B-1 exceeds the xOFF threshold, the switch sends PFC pause frames to both connected spine switches, instructing them to temporarily hold TC-Low packets in their buffers. As the egress buffers for TC-Low on the spine switches begin to fill and cross the xOFF threshold, they in turn send PFC pause frames to the rest of the leaf switches.

This behavior can quickly spread congestion beyond the original point of contention. In the worst case, multiple switches and links may experience temporary pauses. Once buffer occupancy drops below the xON threshold, Leaf switch 1B-1 sends resume signals, and traffic gradually recovers as normal transmission resumes. Even though the congestion episode is short, it disrupts collective operations and negatively impacts distributed training performance.

The upcoming chapters explain how Ultra Ethernet Network-Signaled Congestion Control (NSCC) and Receiver Credit-Based Congestion Control (RCCC) manage the amount of data that sources are allowed to send over the network, maximizing network utilization while avoiding congestion. The next chapters also describe how Explicit Congestion Notification (ECN), Packet Trimming, and Entropy Value-based Packet Spraying, when combined with NSCC and RCCC, contribute to a self-adjusting, reliable backend network.


Monday, 15 December 2025

UET Request–Response Packet Flow Overview

This section brings together the processes described earlier and explains the packet flow from the node perspective. A detailed network-level packet walk is presented in the following sections.

Initiator – SES Request Packet Transmission

After the Work Request Entity (WRE) and the corresponding SES and PDS headers are constructed, they are submitted to the NIC as a Work Element (WE). As part of this process, a Packet Delivery Context (PDC) is created, and the base Packet Sequence Number (PSN) is selected and encoded into the PDS header.

Once the PDC is established, it begins tracking which PSNs have been acknowledged by the target. For example, PSN 0x12000 is marked as transmitted.

The NIC then fetches the payload data from local memory according to the address and length information in the WRE. The NIC autonomously performs these steps without CPU intervention, illustrating the hardware offload capabilities of UET.

Next, the NIC encapsulates the data with the required protocol headers: Ethernet, IP, optional UDP, PDS, and SES, and computes the Cyclic Redundancy Check (CRC). The fully formed packet is then transmitted toward the target with Traffic Class (TC) set to Low.

Note: The Traffic Class is orthogonal to the PDC; a single PDC may carry packets transmitted with Low or High TC depending on their role (data vs control).
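The encapsulation step can be sketched as follows. The field layouts, opcodes, and header sizes below are simplified assumptions for illustration; the real PDS and SES wire formats are defined by the UET specification. CRC-32 stands in for the Ethernet frame check sequence.

```python
import struct
import zlib

def build_packet(psn, spdcid, dpdcid, payload):
    """Stack hypothetical PDS and SES headers on a payload and append a
    CRC-32, mimicking the Ethernet FCS. Field layouts are illustrative."""
    pds = struct.pack("!IHH", psn, spdcid, dpdcid)   # PSN + PDC identifiers
    ses = struct.pack("!BQ", 0x01, 0xba5eadd1)       # opcode + base address
    frame = pds + ses + payload
    crc = zlib.crc32(frame)                          # computed over the frame
    return frame + struct.pack("!I", crc)

# Values taken from the walkthrough: base PSN 0x12000, initiator
# SPDCID 0x4001, target DPDCID 0x8001.
pkt = build_packet(0x12000, 0x4001, 0x8001, b"\x00" * 64)
```

A receiver can verify integrity by recomputing the CRC over everything except the trailing four bytes and comparing it to the transmitted value.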

Figure 5-9: Initiator: SES Request Processing.

Target – SES Request Reception and PDC Handling


Figure 5-10 illustrates the target-side processing when a PDS Request carrying an SES Request is received. Unlike the initiator, the target PDS manager identifies the PDC using the tuple {source IP address, destination IP address, Source PDC Identifier (SPDCID)} to perform a lookup in its PDC mapping table.


Because no matching entry exists, the lookup results in a miss, and the target creates a new PDC. The PDC identifier (PDCID) is allocated from the General PDC pool, as indicated by the DPDCID field in the received PDS header. In this example, the target selects PDCID 0x8001.

This PDCID is subsequently used as the SPDCID when sending the PDS Ack Response (carrying the Semantic Response) back to the initiator. Any subsequent PDS Requests from the initiator reference this PDC using the same DPDCID = 0x8001, ensuring continuity of the PDC across messages.
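The lookup-miss-then-create behavior can be sketched as a table keyed by the three-tuple. The class name and allocation scheme are illustrative; only the tuple key and the example PDCID 0x8001 come from the walkthrough.

```python
class TargetPds:
    """Sketch of target-side PDC handling: lookup keyed by
    {source IP, destination IP, SPDCID}; a miss allocates a new PDCID
    from a simple sequential pool (illustrative allocation scheme)."""

    def __init__(self, first_pdcid=0x8001):
        self.pdc_table = {}          # (src_ip, dst_ip, spdcid) -> local PDCID
        self.next_pdcid = first_pdcid

    def lookup_or_create(self, src_ip, dst_ip, spdcid):
        key = (src_ip, dst_ip, spdcid)
        if key not in self.pdc_table:        # miss -> create a new PDC
            self.pdc_table[key] = self.next_pdcid
            self.next_pdcid += 1
        return self.pdc_table[key]

pds = TargetPds()
pdcid = pds.lookup_or_create("10.0.1.0", "10.0.2.4", 0x4001)  # miss: new PDC
assert pdcid == 0x8001
# A later request with the same tuple maps to the same PDC.
assert pds.lookup_or_create("10.0.1.0", "10.0.2.4", 0x4001) == 0x8001
```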

After the PDC has been created, the UET NIC writes the received data into memory according to the SES header information. The memory placement process follows several steps:

  • Job and rank identification: The relative address in the SES header identifies the JobID (101) and the PIDonFEP (RankID 2).
  • Resource Index (RI) table lookup: The NIC consults the RI table, indexed by 0x00a, and verifies that the ri_generation field (0x01) matches the current table version. This ensures the memory region is valid and has not been re-registered.
  • Remote key validation: The NIC uses the rkey = 0xacce5 to locate the correct RI table entry and confirm permissions for writing.
  • Data placement: The data is written at base address (0xba5eadd1) + buffer_offset (0). The buffer_offset allows fragmented messages to be written sequentially without overwriting previous fragments.

In Figure 5-10, the memory highlighted in orange shows the destination of the first data fragment, starting at the beginning of the registered memory region.

Note: The NIC handles all these steps autonomously, performing direct memory placement and verification, which is essential for high-performance, low-latency applications like AI and HPC workloads.
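The validation steps above can be condensed into a short sketch. The RI table layout and field names mirror the walkthrough, but the exact data structure is an assumption, not the UET specification.

```python
# Hypothetical RI table entry, indexed by the Resource Index from the
# SES header. Values are taken from the walkthrough example.
ri_table = {
    0x00a: {"ri_generation": 0x01, "rkey": 0xacce5,
            "base_addr": 0xba5eadd1, "writable": True},
}

def place_data(ri_index, ri_generation, rkey, buffer_offset, length):
    """Validate the RI entry and return the placement address."""
    entry = ri_table[ri_index]
    if entry["ri_generation"] != ri_generation:
        raise ValueError("stale RI generation: region was re-registered")
    if entry["rkey"] != rkey or not entry["writable"]:
        raise PermissionError("rkey validation failed")
    # Fragments land sequentially at base + offset, so later fragments
    # never overwrite earlier ones.
    return entry["base_addr"] + buffer_offset

addr = place_data(0x00a, 0x01, 0xacce5, 0, 16384)
assert addr == 0xba5eadd1  # first fragment starts at the region base
```

The generation check is what makes memory re-registration safe: an in-flight request carrying a stale ri_generation is rejected rather than written into memory that now belongs to someone else.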

Figure 5-10: Target: Request Processing – NIC → PDS → SES → Memory.


Target – SES Response, PDS Ack Response and Packet Transmission

After completing the write operation, the UET provider uses a Semantic Response (SES Response) to notify the initiator that the operation was successful. The opcode in the SES Response header is set to UET_DEFAULT_RESPONSE, with list = UET_EXPECTED and return_code = RC_OK, indicating that the UET_WRITE operation has been executed successfully and the data has been written to target memory. Other fields, including message_id, ri_generation, JobID, and modified_length, are filled with the same values received in the SES Request, for example, message_id = 1, ri_generation = 0x01, JobID = 101, and modified_length = 16384.

Once the SES Response header is constructed, the UET provider creates a PDS Acknowledgement (PDS Ack) Response. The type is set to PDS_ACK, and the next_header field UET_HDR_RESPONSE references the SES Response type. The ack_psn_offset encodes the PSN from the received PDS Request, while the cumulative PSN (cack_psn) acknowledges all PDS Requests up to and including the current packet. The SPDCID is set to the target’s Initial PDCID (0x8001), and the DPDCID is set to the value received from the PDS Request as SPDCID (0x4001).
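The PDS Ack construction can be sketched as below. Field names mirror the text; the encoding as a dictionary and the offset arithmetic are illustrative simplifications.

```python
def build_pds_ack(req_psn, base_psn, spdcid, dpdcid):
    """Sketch of the PDS Ack Response fields described above."""
    return {
        "type": "PDS_ACK",
        "next_header": "UET_HDR_RESPONSE",      # references the SES Response
        "ack_psn_offset": req_psn - base_psn,   # encodes the acked request's PSN
        "cack_psn": req_psn,                    # cumulative ack up to this PSN
        "spdcid": spdcid,                       # target's PDCID (0x8001)
        "dpdcid": dpdcid,                       # initiator's SPDCID (0x4001)
    }

ack = build_pds_ack(req_psn=0x12000, base_psn=0x12000,
                    spdcid=0x8001, dpdcid=0x4001)
assert ack["ack_psn_offset"] == 0
assert ack["dpdcid"] == 0x4001
```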

Finally, the PDS Ack and SES Response headers are encapsulated with Ethernet, IP, and optional UDP headers and transmitted by the NIC using High Traffic Class (TC). The High TC ensures that these control and acknowledgement messages are prioritized in the network, minimizing latency and supporting reliable flow control.

Figure 5-11: Target: Response Processing – SES → PDS → Transmit.

Initiator – SES Response and PDS Ack Response


When the initiator receives a PDS Ack Response that also carries a SES Response, it first identifies the associated Packet Delivery Context (PDC) using the DPDCID field in the PDS header. Using this PDC, the initiator updates its PSN tracking state. The acknowledged PSN—for example, 0x12000—is marked as completed and released from the retransmission tracking state, indicating that the corresponding PDS Request has been successfully delivered and processed by the target.

After updating the transport-level state, the initiator extracts the SES Response and passes it to the Semantic Sublayer (SES) for semantic processing. The SES layer evaluates the response fields, including the opcode and return code, and determines that the UET_WRITE operation associated with message_id = 1 has completed successfully. As this response corresponds to the first fragment of the message, the initiator can mark that fragment as completed and, depending on the message structure, either wait for additional fragment responses or complete the overall operation. In our case, there are three more fragments to be processed.

This separation of responsibilities allows the PDS layer to manage reliability and delivery tracking, while the SES layer handles operation-level completion and status reporting.
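This separation can be sketched as a small state model: the PDS layer releases acknowledged PSNs from its retransmission tracking, while the SES layer counts completed fragments. The class name, PSN values, and four-fragment message are illustrative, taken from the walkthrough.

```python
class InitiatorPdc:
    """Sketch of initiator-side state. PDS releases acked PSNs from the
    retransmission set; SES tracks per-fragment completion. Assumes one
    PDS Ack (with SES Response) per fragment, as in the walkthrough."""

    def __init__(self, psns, n_fragments):
        self.unacked = set(psns)                 # PDS retransmission tracking
        self.pending_fragments = n_fragments     # SES completion tracking

    def on_ack(self, psn):
        self.unacked.discard(psn)                # delivery confirmed by target
        self.pending_fragments -= 1              # SES Response: fragment done
        return self.pending_fragments == 0       # whole message complete?

# Four-fragment UET_WRITE, base PSN 0x12000.
pdc = InitiatorPdc([0x12000, 0x12001, 0x12002, 0x12003], n_fragments=4)
done = pdc.on_ack(0x12000)     # first fragment acked, three remain
assert not done
assert 0x12000 not in pdc.unacked
```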

Figure 5-12: Initiator: PDS Response & PDS Ack Processing.

Note: PDS Requests and Responses describe transport-specific parameters, such as the delivery mode (Reliable Unordered Delivery, RUD, or Reliable Ordered Delivery, ROD). In contrast, SES Requests and Responses describe semantic operations. SES Requests specify what action the target must perform, for example, writing data and the exact memory location for that operation, while SES Responses inform the initiator whether the operation completed successfully. In some flow diagrams, SES messages are shown as flowing between the SES and PDS layers, while PDS messages are shown as flowing between the PDS layers of the initiator and the target.