Saturday, 31 January 2026

Ultra Ethernet: NSCC Destination Flow Control

Figure 6-14 depicts a demonstrative event where Rank 4 receives seven simultaneous flows (1). As these flows are processed by their respective PDCs and handed over to the Semantic Sublayer (2), the High-Bandwidth Memory (HBM) Controller becomes congested. Because HBM must arbitrate multiple fi_write RMA operations requiring concurrent memory bank access and state updates, the incoming packet rate quickly exceeds HBM’s transactional retirement rate. 

This causes internal buffers at the memory interface to fill, creating a local congestion event (3). To prevent buffer overflow, which would lead to dropped packets and expensive RMA retries, the receiver uses NSCC to move the queuing "pain" back to the source. This is achieved through the pds.rcv_cwnd_pend parameter of the ACK_CC header (4). The parameter operates on a scale of 0 to 127: a value of zero is ignored, while a value of 127 triggers the maximum possible rate decrement. In this scenario, a value of 64 is used, resulting in a penalty equal to 50% of the newly acknowledged data.

Rather than directly computing a new transport rate, the mechanism utilizes a three-phase process to define a restricted Congestion Window (CWND). This reduction in CWND inherently forces the source to drain its inflight bucket to maintain protocol compliance and synchronize the injection rate with the HBM's processing capacity. The process begins by calculating the newly_rcvd_bytes, representing the data volume acknowledged by the incoming ACK_CC. This is the delta between the rcvd_bytes of the predecessor ACK_CC (12,288 bytes) and the newest rcvd_bytes (16,384 bytes), totaling 4,096 bytes (A).

In the next phase, the logic multiplies 4,096 bytes by the rcv_cwnd_pend value of 64, resulting in a product of 262,144. Shifting this product right by 7 bits (equivalent to dividing by 128) yields a penalty of 2,048 bytes (B). This penalty is then subtracted from the current CWND of 75,776 bytes, establishing a new, throttled CWND of 73,728 bytes (C).
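
The arithmetic in phases A through C can be reproduced in a few lines of Python. This is a minimal sketch, not the specification's implementation: the function name apply_receiver_penalty and its parameter names are hypothetical, and it assumes that a zero rcv_cwnd_pend leaves the CWND untouched, as described above. Any lower bound on the CWND is not modeled.

```python
def apply_receiver_penalty(cwnd, prev_rcvd_bytes, rcvd_bytes, rcv_cwnd_pend):
    """Return (new_cwnd, penalty) after one ACK_CC carrying a receiver penalty.

    Hypothetical helper for illustration; names are not taken from the specification.
    """
    if not 0 <= rcv_cwnd_pend <= 127:
        raise ValueError("rcv_cwnd_pend must be in the range 0..127")

    # Phase A: bytes newly acknowledged by this ACK_CC.
    newly_rcvd_bytes = rcvd_bytes - prev_rcvd_bytes

    # A value of zero is ignored, so the window is left unchanged.
    if rcv_cwnd_pend == 0:
        return cwnd, 0

    # Phase B: scale by rcv_cwnd_pend, then shift right by 7 bits (divide by 128).
    penalty = (newly_rcvd_bytes * rcv_cwnd_pend) >> 7

    # Phase C: subtract the penalty from the current window.
    return cwnd - penalty, penalty


new_cwnd, penalty = apply_receiver_penalty(75_776, 12_288, 16_384, 64)
print(penalty, new_cwnd)  # 2048 73728
```

With rcv_cwnd_pend set to 127, the same ACK_CC would produce a penalty of (4,096 × 127) >> 7 = 4,064 bytes, just under 100% of the newly acknowledged data.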

In a stable state, the CWND and the inflight bucket are typically equal in size; consequently, immediately after the decrement, the inflight bucket exceeds the newly defined CWND by 2,048 bytes. This violates the fundamental transport rule that the CCC allows the PDC to transmit data only when the inflight bucket is less than or equal to the CWND (5). In response, the PDC must suspend transmission and wait for the destination to acknowledge enough packets to bring the inflight bucket down to, or below, the size of the new CWND (6).
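
The gating rule and the resulting pause can be sketched as follows. The helper names can_transmit and acks_until_resume are hypothetical, and the sketch assumes that each subsequent ACK removes a fixed number of bytes from the inflight bucket and carries no further penalty.

```python
def can_transmit(inflight_bytes, cwnd):
    """Gating rule from the text: send only while inflight <= CWND."""
    return inflight_bytes <= cwnd


def acks_until_resume(inflight_bytes, cwnd, bytes_per_ack):
    """Count the ACKs needed before transmission may resume.

    Assumes every ACK drains bytes_per_ack from the inflight bucket and
    that the CWND stays fixed while draining (no further penalties).
    """
    if bytes_per_ack <= 0:
        raise ValueError("bytes_per_ack must be positive")
    acks = 0
    while not can_transmit(inflight_bytes, cwnd):
        inflight_bytes -= bytes_per_ack
        acks += 1
    return acks


# Values from the example: inflight equals the old CWND, CWND was just reduced.
inflight, new_cwnd = 75_776, 73_728
print(can_transmit(inflight, new_cwnd))              # False
print(acks_until_resume(inflight, new_cwnd, 4_096))  # 1
print(acks_until_resume(inflight, new_cwnd, 1_024))  # 2
```

In this simplified model, a single ACK covering 4,096 bytes is enough to drop the inflight bucket below the new CWND, whereas ACKs covering 1,024 bytes each would require two.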

This pause gives the HBM controller the time it needs to clear its transaction queue. Only once the inflight level has drained to meet the new CWND ceiling can the CCC authorize the PDS to resume data transport. When set, the rc-flag (Restore CWND) signals that the original CWND can be used again after the flow control event has passed.



Figure 6-14: NSCC: Destination Flow Control.

NSCC Mechanism Summary


The Network-Signaled Congestion Control framework ensures high-performance data transfer by balancing the real-time Inflight Load against a dynamic Congestion Window (CWND). By utilizing proactive feedback from the fabric and the destination, the system maintains line-rate performance while preventing buffer overflow and high tail latency.

Proportional and Fast Increase: These methods are used when the network is underloaded, characterized by a lack of ECN-CE signals and queuing delays below the target threshold. Proportional Increase scales the CWND based on the gap between measured and target delays to optimize utilization. Fast Increase employs exponential growth to quickly reclaim bandwidth when the network remains significantly underutilized for a sustained period.

Fair Increase: This method is initiated as congestion subsides to ensure an equitable recovery among competing flows. By adding a fixed, constant amount to the CWND of every active flow, it allows flows with smaller windows to grow at a faster relative rate, eventually leading all participants to converge on a fair share of the available bandwidth.

Multiplicative Decrease: This action is used to protect the fabric during periods of high pressure, specifically when queuing delay exceeds targets and ECN feedback indicates stagnant queues. It slashes the CWND proportionally to the measured buffer excess, rapidly shedding load to return the network queue to its target occupancy level within a single Round-Trip Time.

Destination Flow Control (NSCC Receiver Penalty): This mechanism addresses bottlenecks at the receiver’s hardware level, such as the High-Bandwidth Memory (HBM) controller. By applying a penalty via the rcv_cwnd_pend parameter, the receiver forces the source to reduce its CWND based on a percentage of the newly acknowledged data. This pauses new data injections until the destination's transaction queues have drained, moving the queuing pressure from the memory controller back to the source.

CWND Restoration: The Restore CWND method, triggered by the rc-flag, allows a flow to immediately resume its original transmission rate once a congestion event has passed. This prevents the flow from having to slowly ramp back up through increase phases, ensuring that the system returns to peak efficiency as soon as the bottleneck, whether in the fabric or at the destination, is resolved.
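
The convergence claim behind Fair Increase can be checked with simple arithmetic. The sketch below is an illustration only: the starting windows, the 1,024-byte increment, and the function name fair_increase_shares are assumptions chosen for the example, not values from the specification, and decrease events are left out.

```python
def fair_increase_shares(cwnds, increment, rounds):
    """Add the same constant to every window and return each flow's share."""
    windows = list(cwnds)
    for _ in range(rounds):
        windows = [w + increment for w in windows]
    total = sum(windows)
    return [w / total for w in windows]


start = [8_192, 65_536]  # one small and one large window (assumed values)
print(fair_increase_shares(start, 1_024, 0))
print(fair_increase_shares(start, 1_024, 64))
print(fair_increase_shares(start, 1_024, 1_024))
```

With these assumed numbers, the smaller flow's share of the combined window rises from about 11% to 36% after 64 rounds and keeps moving toward an equal split, because the same fixed increment is a larger relative gain for the smaller window.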

