Network-Signaled Congestion Control (NSCC)
The Network-Signaled Congestion Control (NSCC) algorithm operates on the principle that the network fabric itself is the best source of truth regarding congestion. Rather than waiting for packet loss to occur, NSCC relies on proactive feedback from switches to adjust transmission rates in real time. The primary mechanism for this feedback is Explicit Congestion Notification (ECN) marking. When a switch interface's egress queue begins to build up, the switch applies Random Early Detection (RED) logic to select packets for marking. Once the queue depth crosses the Minimum Threshold, the switch begins randomly marking packets by setting the two least-significant bits of the IP header's Type of Service (ToS) byte, the ECN field, to the Congestion Experienced (CE, binary 11) state. If the congestion worsens and the Maximum Threshold is reached, every packet passing through that interface is marked, providing a clear and urgent signal to the endpoints.
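The threshold behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not a switch implementation: the linear ramp between the two thresholds, the `max_p` parameter, and the function names are assumptions made for the example, and real ASICs expose their own WRED/ECN profiles.

```python
import random

ECN_MASK = 0b11  # two least-significant bits of the ToS byte (the ECN field)
ECN_CE = 0b11    # Congestion Experienced


def mark_probability(queue_depth, min_th, max_th, max_p=1.0):
    """Marking probability for a queue depth (assumed linear RED ramp)."""
    if queue_depth < min_th:
        return 0.0  # below the Minimum Threshold: never mark
    if queue_depth >= max_th:
        return 1.0  # at or above the Maximum Threshold: mark every packet
    return max_p * (queue_depth - min_th) / (max_th - min_th)


def maybe_mark_ce(tos, queue_depth, min_th, max_th, rng=random.random):
    """Return the ToS byte, with the ECN field set to CE if selected."""
    if (tos & ECN_MASK) == 0:
        # Not ECN-capable: this sketch leaves the packet untouched
        # (a real switch would typically drop instead of marking).
        return tos
    if rng() < mark_probability(queue_depth, min_th, max_th):
        return tos | ECN_CE
    return tos
```

With thresholds of 20 and 80 (in whatever unit the queue is measured), a depth of 50 sits halfway up the ramp, so roughly half of the ECN-capable packets would be marked under these assumptions.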
The practical impact of this mechanism is best illustrated by a hash collision event, such as the one shown in Figure 6-10. In this scenario, multiple GPUs on the left-hand side of the fabric transmit data at line rate. Due to the specific entropy of these flows, the ECMP hashing algorithms on leaf switches 1A-1 and 1A-2 inadvertently select the same uplink to Spine 1A. Because all destination GPUs are concentrated on leaf switch 1B-1, the spine is forced to aggregate these incoming flows—totaling 500 Gbps—into a single outgoing interface. This bottleneck causes the queue to fill rapidly. Consequently, Spine 1A marks packets destined for Rank 9 and Rank 5 with ECN-CE. When these marked packets reach the receiver, the Packet Delivery Service (PDS) detects the congestion signal and reflects it back to the source by setting the pds.m flag in the acknowledgement (ACK) message.
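To make the collision concrete, the sketch below models ECMP uplink selection as a hash over flow fields. Everything here is hypothetical: real switches use vendor-specific hash functions and field sets, CRC32 is only a stand-in, and the flow tuple `(src_ip, dst_ip, entropy_value)` is a simplification chosen for the example.

```python
import zlib
from collections import defaultdict


def pick_uplink(src_ip, dst_ip, entropy_value, num_uplinks):
    """Map a flow to an uplink index (CRC32 stands in for the switch hash)."""
    key = f"{src_ip}|{dst_ip}|{entropy_value}".encode()
    return zlib.crc32(key) % num_uplinks


def find_collisions(flows, num_uplinks):
    """Group flows by chosen uplink and return uplinks carrying more than one."""
    by_uplink = defaultdict(list)
    for flow in flows:
        by_uplink[pick_uplink(*flow, num_uplinks)].append(flow)
    return {uplink: fl for uplink, fl in by_uplink.items() if len(fl) > 1}
```

Because the entropy value is part of the hash key in this model, changing it for one flow produces a different key and may move that flow to another uplink, which is the property path steering relies on.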
The second signaling mechanism is based on measured queuing delay, which provides a granular view of fabric pressure even when ECN marks are not present. The algorithm calculates this by measuring the current Round-Trip Time (RTT) and subtracting the Base_RTT, the known minimum RTT of an uncongested path. To a first approximation, this difference (Delta RTT) represents the time a packet spent sitting in switch buffers; the control-loop walkthrough later in this chapter refines it by also subtracting the receiver's service time. By separating the queuing delay from the fixed components of the round trip, the algorithm can detect the earliest stages of buffer buildup with high precision.
To manage these signals effectively, the algorithm continuously tracks the inflight state: every byte transmitted to the network that has not yet been acknowledged or NACKed by the receiver. By combining three factors (ECN-CE signals, calculated queuing delay, and the volume of data in flight), the NSCC algorithm dynamically adjusts the Congestion Window (CWND). This data allows the algorithm to decide precisely when a PDS is permitted to inject new data into the fabric and, if necessary, to rotate the Entropy Value (EV) to steer traffic toward underutilized paths, resolving the collision and restoring optimal flow.
Figure 6-10: NSCC: Link Congestion due to Hash Collision.
The Overview of the NSCC Control Loop
Building on the previous overview, this section examines the granular mechanics of the NSCC process. Figure 6-11 illustrates the source-side operations as various Ranks initiate communication over the backend fabric. In this scenario, data from Ranks 0 and 8 is managed by Packet Delivery Context (PDC) 0x4001, Rank 2 is handled by PDC 0x4002, and Ranks 1 and 3 are assigned to PDC 0x4003.
Each rank is tasked with transferring 4,096 KB of data. While abstracted in the diagram, the process begins when an application executes a fi_write RMA operation. This request is passed to the Semantic Sublayer (SES), which translates the intent into a UET_WRITE operation before handing it off to the PDC layer. Upon receiving new data, the PDC notifies the Congestion Control Context (CCC) Manager within the Congestion Management Sublayer (CMS) of a delta backlog (Steps 1a–c). This delta represents the volume of unsent data waiting in the PDC buffers that must be added to the total CCC backlog.
The CMS then acts as the gatekeeper; it compares the current inflight bytes against the Congestion Window (CWND). If the volume of data currently on the wire is less than the CWND, the CCC scheduler permits data transport (Step 2). In our example, there is sufficient headroom in the window, allowing the scheduler to authorize PDC 0x4001 to transmit. As the packet is dispatched, the hardware records the precise transmission time and injects the Entropy Value (EV) into the header to facilitate fabric load balancing (Step 3). Simultaneously, the inflight state is incremented and the backlog is decremented to reflect the data now transiting the network (Steps 4 and 5).
The receiver processes the incoming packet and generates an ACK_CC message (Step 6). If the packet arrived with ECN-CE bits set by a switch, the receiver sets the pds.m flag in the ACK to signal that congestion was experienced along the path. In this specific example, no congestion is encountered, so the pds.m bit remains unset. Crucially, the ACK_CC includes the service time (the internal processing delay at the receiver) and a cumulative byte count to inform the source of the total data successfully received.
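The source-side bookkeeping described above (backlog notification, the CWND gate, and the inflight and backlog updates on transmit) can be sketched as a small state object. This is an illustrative model only: the class name, method names, and the extra "backlog must be non-empty" check are assumptions for the example, not structures taken from the specification.

```python
from dataclasses import dataclass


@dataclass
class CongestionControlContext:
    """Toy model of the per-CCC counters; all values are in bytes."""
    cwnd: int
    inflight: int = 0
    backlog: int = 0

    def add_backlog(self, delta):
        """A PDC reports newly queued, unsent data (the delta backlog)."""
        self.backlog += delta

    def may_send(self):
        """Gate: allow transmission only while inflight is below CWND."""
        return self.backlog > 0 and self.inflight < self.cwnd

    def on_send(self, nbytes):
        """Account for a dispatched packet: inflight up, backlog down."""
        self.inflight += nbytes
        self.backlog -= nbytes
```

For instance, with a hypothetical CWND of 8,192 bytes and 4,096-byte packets, the gate opens for two packets and then closes until an acknowledgement reduces the inflight count.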
When the source receives the ACK_CC, it logs the arrival time (Step 7) and updates the CCC state. It decreases the inflight counter based on the rcvd_bytes value and performs a critical adjustment of the CWND. This adjustment is calculated by synthesizing the ECN state and the measured queuing delay, defined as:
Queuing Delay = RTT_measured - (Base_RTT + Service_Time)
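As a worked example of the formula above, the sketch below derives the measured RTT from the two source-side timestamps and subtracts the fixed components. The function and parameter names are hypothetical, all times are assumed to share one unit (microseconds here), and clamping the result at zero is an added assumption to absorb measurement jitter.

```python
def queuing_delay(tx_time, ack_rx_time, base_rtt, service_time):
    """Queuing delay = RTT_measured - (Base_RTT + Service_Time)."""
    rtt_measured = ack_rx_time - tx_time  # Step 3 and Step 7 timestamps
    # Clamp at zero (assumption): jitter can push the raw value negative.
    return max(0, rtt_measured - (base_rtt + service_time))


# Packet sent at t=100 us, ACK_CC logged at t=130 us, so RTT_measured = 30 us.
# With Base_RTT = 12 us and Service_Time = 3 us, 15 us was spent queuing.
print(queuing_delay(tx_time=100, ack_rx_time=130, base_rtt=12, service_time=3))
```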
This autonomous, self-adjusting control loop represents a sophisticated implementation of Intent-Based Networking (IBN) at the transport layer. The high-level "intent" is simple: the reliable delivery of data between Ranks at line rate with minimal tail latency. To fulfill this, the NSCC algorithm operates as a real-time, closed-loop system—monitoring network feedback, analyzing fabric pressure, and adapting injection rates without human intervention. By offloading this decision-making to the Congestion Management Sublayer (CMS), the fabric becomes self-optimizing, ensuring that even in the face of unpredictable hash collisions, the network remains a transparent utility for the application.
Figure 6-11: NSCC Operation.
The following section concludes our exploration of NSCC by detailing the specific fields within the ACK_CC header and illustrating how the source-side state machine transitions between different congestion levels. While the overview provided here is sufficient to understand the fundamental operations of NSCC, the subsequent deep dive is intended for those who require bit-level architectural details.
While NSCC serves as the primary proactive mechanism for modulating flow at the source, it is only one part of the Ultra Ethernet "congestion toolbox." To ensure total fabric reliability, UEC employs additional layers of defense, such as Receiver Credit-based Congestion Control (RCCC) and Packet Trimming. These mechanisms are designed to handle specific scenarios where proactive rate-limiting isn't enough, providing the "emergency" recovery needed to maintain near-line-rate performance. Each of these solutions will be explored in detail in the upcoming chapters.