Wednesday, 25 February 2026

Packet Trimming Deep Dive - Part II

Receive Interface Group (Rx IFG)


Ingress Pre-Processing and Integrity

The Receive Interface Group (Rx IFG) is the ingress pre-processing stage that handles the incoming Ethernet bitstream before the packet enters the Packet Processing Array (PPA) of the Receive Network Processing Unit (Rx NPU) in the Cisco Silicon One architecture.

Processing begins at the Rx MAC. The Rx MAC reconstructs (“delimits”) the Ethernet frame from the Physical Coding Sublayer (PCS) bitstream and verifies frame integrity by computing a Frame Check Sequence (FCS) using the CRC-32 algorithm. If the computed FCS does not match the received FCS value, the frame is considered corrupted and is dropped immediately at ingress. If the CRC check succeeds, the frame is admitted for further processing. 
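The FCS check can be sketched in a few lines of Python. This is an illustrative software model using the standard CRC-32 (which `zlib.crc32` implements), not the hardware pipeline; the frame layout assumed here is destination MAC through FCS, with preamble and SFD already stripped.

```python
import zlib

def append_fcs(data: bytes) -> bytes:
    """Append the 4-byte FCS. Ethernet transmits CRC-32 LSB first."""
    return data + zlib.crc32(data).to_bytes(4, "little")

def fcs_valid(frame: bytes) -> bool:
    """Recompute CRC-32 over the frame body and compare it with the
    trailing FCS, as the Rx MAC does at ingress."""
    if len(frame) < 64:          # minimum Ethernet frame size incl. FCS
        return False
    body, fcs = frame[:-4], frame[-4:]
    return zlib.crc32(body) == int.from_bytes(fcs, "little")
```

A frame whose computed and received FCS disagree is simply rejected, which mirrors the immediate drop at ingress described above.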

Shallow classification and Traffic Class mapping

After frame validation, the Rx IFG identifies the Ethernet MAC header and detects the presence of IEEE 802.1Q VLAN tags. The Rx IFG performs shallow classification to efficiently manage hardware resources before deeper protocol parsing and forwarding decisions are executed in the Rx NPU. When an IEEE 802.1Q VLAN tag is present, the Rx IFG extracts the Priority Code Point (PCP) bits from the VLAN tag and maps them to an Internal Traffic Class (ITC). Based on this mapping, the frame is placed into a specific port-based local hardware FIFO queue.

If the frame does not carry a VLAN tag, or if no usable CoS information is available, the frame is assigned to a default Internal Traffic Class and buffered in a standard port-based FIFO queue.
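The shallow classification step amounts to a small table lookup. The PCP-to-ITC table below is a hypothetical mapping chosen for illustration; the real assignment is platform configuration, and the default class for untagged traffic is likewise an assumption.

```python
# Hypothetical PCP -> Internal Traffic Class mapping (not a real config).
PCP_TO_ITC = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3}
DEFAULT_ITC = 0
ETHERTYPE_8021Q = 0x8100

def classify_itc(frame: bytes) -> int:
    """Shallow classification: map the 802.1Q PCP bits to an ITC,
    falling back to a default class for untagged frames."""
    tpid = int.from_bytes(frame[12:14], "big")
    if tpid == ETHERTYPE_8021Q:
        tci = int.from_bytes(frame[14:16], "big")
        pcp = tci >> 13                  # PCP is the top 3 bits of the TCI
        return PCP_TO_ITC.get(pcp, DEFAULT_ITC)
    return DEFAULT_ITC                   # untagged or no usable CoS
```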

At this stage, no forwarding lookup is performed; the purpose is limited to frame validation and shallow classification.

Port-based FIFO queues and packet buffering

The small port-based FIFO queues serve two purposes.

First, they provide prioritization for writing packet data into shared SRAM. Prioritization is required because a large number of packets, originating from many ingress ports and representing thousands of concurrent flows, may arrive simultaneously at the Rx IFG. The FIFO queues regulate this contention and determine the order in which packets are admitted into shared memory based on their assigned Internal Traffic Class.

Second, in parallel with the memory write operation, the Rx IFG generates a compact packet metadata structure known as the Packet Header Vector (PHV).

Packet Header Vector (PHV) Creation

The Packet Header Vector summarizes essential information about the packet without requiring full packet inspection. It is created in the Rx IFG to offload basic packet characterization from the Rx NPU parser, thereby reducing processing overhead in the programmable pipeline and enabling higher sustained packet rates. The PHV includes metadata such as:


  • Ingress interface, timestamp, and packet length
  • Pointer to the packet’s cell chain in memory
  • Internal Traffic Class (ITC), VLAN ID, and EtherType
  • Router MAC hit indication


The Rx IFG operates as a pattern-matching engine rather than performing linear, sequential parsing of the entire packet. For example, when the EtherType field indicates IPv4 (0x0800), the Rx IFG recognizes that the IP Protocol field, used to identify the transport protocol such as TCP or UDP, is located at a fixed offset within the IPv4 header. It can therefore extract that field directly without scanning the header byte by byte.

Similarly, when an IEEE 802.1Q VLAN tag is present, the Rx IFG accounts for the additional header field and adjusts the parsing offset accordingly to locate the correct EtherType position before proceeding with further field extraction.
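The offset-based extraction described above can be sketched as follows. The offsets come from the standard Ethernet, 802.1Q, and IPv4 layouts; the function is a software illustration of the pattern-matching idea, not the hardware logic.

```python
ETHERTYPE_8021Q = 0x8100
ETHERTYPE_IPV4 = 0x0800

def ip_protocol(frame: bytes):
    """Jump to fixed offsets instead of scanning byte by byte.
    Returns the IPv4 Protocol field (e.g. 6 = TCP, 17 = UDP), or None."""
    off = 12                                       # EtherType position
    if int.from_bytes(frame[off:off + 2], "big") == ETHERTYPE_8021Q:
        off += 4                                   # skip the 802.1Q tag
    if int.from_bytes(frame[off:off + 2], "big") != ETHERTYPE_IPV4:
        return None
    # The IPv4 Protocol field sits at a fixed offset: byte 9 of the header.
    return frame[off + 2 + 9]
```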

The Router MAC field in the PHV is a hit indicator that signals whether the destination MAC address matches one of the switch’s configured router MAC addresses. A positive match indicates that the packet is addressed to the switch itself and allows the Rx NPU to immediately invoke Layer 3 processing logic, bypassing Layer 2 forwarding paths.
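Conceptually, the PHV is a compact record of the fields listed above. The dataclass below is an illustrative model with assumed field names and types; it is not the actual Silicon One metadata layout.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PacketHeaderVector:
    """Illustrative PHV model; field names and widths are assumptions."""
    ingress_ifg: int         # receiving interface group / port
    timestamp_ns: int        # arrival timestamp
    packet_length: int
    cell_chain_ptr: int      # first cell of the packet in shared SRAM
    itc: int                 # Internal Traffic Class
    vlan_id: Optional[int]   # None for untagged frames
    ethertype: int
    router_mac_hit: bool     # DMAC matched a configured router MAC
```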

Handoff to the Rx NPU

Once the PHV is constructed, it is dispatched to one of the many Run-to-Completion (RTC) cores in the Packet Processing Array (PPA) of the Rx NPU. Packets belonging to the same flow are consistently steered to the same RTC core to preserve packet ordering. All other flows arriving in parallel are distributed across the remaining RTC cores to maximize throughput and parallelism.



Figure 9-2: Receive Interface Group Processes.

 

Friday, 20 February 2026

Packet Trimming Deep Dive - Part I

Introduction


The previous chapter introduced the Ultra Ethernet (UE) Transport Layer and its endpoint-centric congestion control mechanisms: Network Signaled Congestion Control (NSCC) and Receiver Credit-based Congestion Control (RCCC). This chapter moves down to the UE Network Layer and introduces Packet Trimming (PT).

While node-based approaches rely on NIC-to-NIC feedback loops, Packet Trimming allows network switches to actively intervene during periods of high utilization. Instead of silently dropping packets under congestion, the network provides an explicit and fast signal that enables immediate recovery.

The primary goal of Packet Trimming is to prevent incast congestion, a situation in which multiple ingress ports simultaneously overwhelm a single egress port. In AI and HPC workloads, many-to-one traffic patterns are common—for example, when multiple workers send data to a single parameter server. Under these conditions, egress buffers can be exhausted very quickly. In a best-effort network, this typically results in tail drops. The receiver then waits for a retransmission timeout, which introduces long tail latency and disrupts synchronization across distributed workloads. Packet Trimming replaces this silent packet loss with an explicit congestion signal that travels faster than the data itself.

The process begins at the source UE node. The NIC marks outgoing data packets with a DSCP-TRIMMABLE codepoint, indicating that the packet payload may be truncated if congestion occurs in the network. When such a packet enters a switch, it is initially treated as normal data traffic, for example classified into a low-priority traffic class and forwarded according to standard scheduling rules.

As the switch resolves the egress port, its scheduling logic continuously monitors congestion on that port. If a predefined congestion threshold is exceeded, the switch performs a precise operation. Instead of dropping the packet entirely, it discards only the payload while preserving the essential transport and network headers. For IPv4 traffic, the packet is typically truncated to 64 bytes, and for IPv6 traffic to 128 bytes. These sizes are chosen to ensure that all transport-level identification fields remain intact.
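The truncation itself is a simple size decision. The sketch below models only that decision in software, using the 64/128-byte values from the text; real switches perform this in the egress datapath.

```python
# Bytes retained per IP version when trimming (typical values from the text).
TRIM_SIZE = {4: 64, 6: 128}

def trim_packet(packet: bytes, ip_version: int) -> bytes:
    """Discard the payload, keep the leading headers intact.
    Packets already at or below the trim size are left unchanged."""
    keep = TRIM_SIZE[ip_version]
    return packet[:keep] if len(packet) > keep else packet
```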

The Packet Delivery Sublayer (PDS) header carries the transport identity of the packet, including the Packet Sequence Number and the Packet Delivery Context identifiers. This information is sufficient for the receiver to detect packet loss and request retransmission. The switch does not need to preserve application semantics to signal congestion; it only needs to preserve transport-level identity.

Packet Trimming is effective only if the trimmed packet retains all information required for transport-level recovery. For this reason, Ultra Ethernet defines a minimum trim size that ensures the complete PDS header is preserved. This guarantees that sequence numbers and context identifiers are not lost during trimming.

In networks that use tunneling or encapsulation, such as VXLAN, additional headers must also be preserved. The receiver relies on this information to correctly demultiplex traffic and associate it with the appropriate transport context. As a result, the minimum trim size is a network-wide configuration parameter that depends on the transport protocol, IP version, and encapsulation methods used in the fabric. All switches in the network must apply the same value to ensure consistent behavior.

From a practical standpoint, trimming is most effective for large data packets. When a multi-kilobyte packet is reduced to a compact header-only frame, the data rate is reduced by orders of magnitude, while loss detection still happens within a single round-trip time. For small packets, the relative benefit of trimming is limited because the size reduction is much smaller.

After trimming the payload, the switch rewrites the DSCP field in the IP header from DSCP-TRIMMABLE to DSCP-TRIMMED. This rewrite explicitly signals that the packet has been trimmed and now represents congestion feedback rather than application data. The packet is then reclassified into a higher-priority traffic class before transmission.

This priority promotion is essential. Trimmed packets are forwarded using a medium- or high-priority queue so that they bypass the congestion that caused the trimming in the first place. Because trimmed packets are very small, prioritizing them does not increase congestion. Instead, it shortens the feedback loop by ensuring that congestion information reaches the destination with minimal delay.
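The DSCP rewrite while preserving the ECN bits can be sketched as below. The codepoint values are placeholders (the actual DSCP-TRIMMABLE and DSCP-TRIMMED values are fabric configuration), and the IPv4 header checksum update is deliberately omitted.

```python
DSCP_TRIMMABLE = 0x02   # placeholder codepoint, not a standard value
DSCP_TRIMMED = 0x30     # placeholder codepoint, not a standard value

def mark_trimmed(packet: bytearray) -> None:
    """Rewrite DSCP-TRIMMABLE -> DSCP-TRIMMED in the IPv4 ToS byte,
    keeping the two ECN bits intact. Header checksum update omitted."""
    tos = packet[1]                       # byte 1 of the IPv4 header
    if tos >> 2 == DSCP_TRIMMABLE:
        packet[1] = (DSCP_TRIMMED << 2) | (tos & 0x03)   # preserve ECN
```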

When the trimmed packet arrives at the destination UE node, the transport layer immediately recognizes the DSCP marking and the absence of the payload. Using the preserved Packet Sequence Number from the PDS header, the receiver generates a selective negative acknowledgment with the retransmit flag set. This allows the sender to begin retransmission almost immediately, without waiting for a timeout.

By combining payload removal, DSCP rewrite, and priority forwarding, Packet Trimming enables fast, hardware-assisted recovery while maintaining high throughput. This capability is critical for large-scale AI training and HPC workloads, where low latency and tight synchronization directly impact performance.

The following sections describe how Packet Trimming and signal processing are implemented within the Cisco Silicon One G200 ASIC.


Optical to Digital Signal Processing


Figure 9-1 provides a high-level overview of the components involved in translating a received signal into an Ethernet bitstream. It traces the path of a signal encoded using Pulse Amplitude Modulation 4 (PAM4) as it travels from the optical domain, through electrical processing inside the ASIC, and finally to the RX MAC, where Ethernet frame boundaries are identified and the Cyclic Redundancy Check (CRC) is validated.

In Figure 9-1, an 800 Gbps optical transceiver is attached to front-panel interface E0, which belongs to Interface Group 0 (IFG0) covering interfaces E0–E3. Internally, each 800G interface is supported by eight 112 Gb/s SerDes lanes. These lanes are processed independently and later aggregated by the Physical Coding Sublayer (PCS) before reaching the RX MAC.

Phase 1 - Optical Reception: The transceiver receives an optical waveform that is already PAM4-modulated by the transmitting device. PAM4 encodes two bits per symbol by using four distinct optical symbol levels, allowing higher data rates without increasing the symbol rate.

Phase 2 - Photodiode Conversion: A photodiode converts the incoming optical signal into an analog electrical waveform, preserving the relative PAM4 symbol levels. Due to attenuation, dispersion, and noise introduced during transmission over fiber, this waveform may be distorted when it arrives at the receiver.

Phase 3 - Transceiver DSP Conditioning: To ensure the signal can reliably traverse the switch’s internal circuit board, the transceiver includes a small Digital Signal Processor (DSP). This DSP performs signal conditioning functions such as amplification, equalization, and retiming. The result is a cleaned PAM4 electrical signal, which is transmitted toward the ASIC across eight high-speed differential electrical lanes.

Phase 4 - ADC and SerDes Processing: Inside the G200 ASIC, each incoming electrical lane from the transceiver terminates at a dedicated SerDes RX slice, which forms the physical-layer front end of the ASIC. An Analog-to-Digital Converter (ADC) samples the incoming analog waveform, and SerDes logic performs equalization, clock recovery, and PAM4 symbol decoding to reconstruct the digital bitstream for that lane. At this stage, each SerDes lane produces a high-speed stream of digital data. To make this data manageable for internal logic, the SerDes widens the data path by converting the serial stream into parallel data. For example, instead of processing a single bit at a 112 Gb/s rate, the data may be represented as a 64-bit-wide word operating at approximately 1.75 GHz. This example illustrates the principle rather than a fixed architectural constant.
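The serial-to-parallel arithmetic in this phase is easy to verify. The numbers below simply restate the illustrative example from the text, not fixed architectural constants.

```python
lane_rate_gbps = 112            # per-lane serial rate
word_width_bits = 64            # illustrative parallel word width

# Widening the datapath divides the required clock rate by the word width:
word_clock_ghz = lane_rate_gbps / word_width_bits
assert word_clock_ghz == 1.75   # ~1.75 GHz internal clock, as in the text

# Eight bonded lanes at 64 bits each yield a 512-bit internal datapath:
datapath_bits = 8 * word_width_bits
assert datapath_bits == 512
```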

Phase 5 - PCS Lane Bonding: Because the full 800 Gbps signal is distributed across eight independent lanes, the Physical Coding Sublayer (PCS), alongside the RS-FEC sublayer, performs lane alignment and deskew. The PCS then bonds the lane data together, producing a single continuous logical bitstream that represents the Ethernet interface. This aggregated bitstream is delivered to the RX MAC, which detects the Ethernet preamble and Start-of-Frame Delimiter (SFD), validates the Frame Check Sequence (CRC), and prepares the frame for further processing within the switching pipeline.


Why This Structure Is Necessary

No practical silicon device can process a serial data stream at hundreds of gigabits per second directly. Such operation would exceed thermal, power, and signal integrity limits. 

By converting speed into width, the G200 trades extremely fast serial signaling for wide, lower-frequency parallel processing. In the example above, widening the data path to 512 bits (8 lanes × 64 bits) allows the ASIC to sustain the line-rate throughput while operating internal logic at a manageable clock rate. This architectural approach enables high performance without compromising power efficiency or reliability.

In modern high-speed Ethernet ASICs, the SerDes and PCS functions form the physical-layer front end of the chip, while the RX MAC operates only after the electrical lanes have been recovered, aligned, and bonded into a logical bitstream.




Figure 9-1: Optical to Digital Signal Processing.


Thursday, 5 February 2026

Ultra Ethernet: Receiver Credit-based Congestion Control (RCCC)

Introduction

Receiver Credit-Based Congestion Control (RCCC) is a cornerstone of the Ultra Ethernet transport architecture, specifically designed to eliminate incast congestion. Incast occurs at the last-hop switch when the aggregate data rate from multiple senders exceeds the egress interface capacity of the target’s link. This mismatch leads to rapid buffer exhaustion on the outgoing interface, resulting in packet drops and severe performance degradation.


The RCCC Mechanism

Figure 8-1 illustrates the operational flow of the RCCC algorithm. In a standard scenario without credit limits, source Rank 0 and Rank 1 might attempt to transmit at their full 100G line rates simultaneously. If the backbone fabric consists of 400G inter-switch links, the core utilization remains a comfortable 50% (200G total traffic). However, because the target host link is only 100G, the last-hop switch (Leaf 1B-1) becomes an immediate bottleneck. The switch is forced to queue packets that cannot be forwarded at the 100G egress rate, eventually triggering incast congestion and buffer overflows.

While "incast" occurs at the egress interface and can resemble head-of-line blocking, it is fundamentally a "fan-in" problem where multiple sources converge on a single receiver. Under RCCC, standard Explicit Congestion Notification (ECN) on the last-hop switch's egress interface is typically disabled for this traffic class. The reasoning is twofold:

Redundancy: In Ultra Ethernet, ECN is the primary signal for NSCC to adjust the Congestion Window (CWND) and rotate the Entropy Value (EV) to trigger packet-level load balancing across the fabric.

Path Convergence: At the last-hop switch, rotating the EV is ineffective because there is only a single physical path to the destination. Since RCCC provides a more granular, proactive mechanism to throttle senders based on the receiver's actual capacity, the reactive "slow down" signaling of ECN becomes unnecessary at this stage. By disabling ECN here, the receiver (Target) takes full responsibility for flow management, ensuring that the fabric remains clear of congestion markers that might otherwise trigger unnecessary path hunting.


Credit Allocation and Flow

Instead of relying on late-stage ECN signaling, the RCCC algorithm proactively throttles senders by granting credits that match the physical transport speed of the target's connection.

Discovery: When Rank 2 receives data, it identifies the sources via the CCC_ID field in the RUD_CC_REQ (the specific request type used when RCCC is enabled) and adds them to its Active Sender Table.

Calculation: The algorithm divides the total available bandwidth (roughly 12.5 GB/s for a 100 Gbps link) among the active senders. In this example, each sender is allocated 6.25 GB/s (50 Gbps) worth of credits.

Granting: These credits are transmitted back to the sources via ACK_CC packets once data is successfully committed to Rank 2’s memory.

Enforcement: Upon receiving the ACK_CC, the Congestion Control Context (CCC) associated with the sender’s Packet Delivery Control (PDC) updates its local credit table. The PDC only permits transmission based on these available credits, effectively capping the individual sender's rate at 50G. This ensures that when combined with the other sender, the aggregate rate at the receiver does not exceed its 100G link capacity.

This credit-grant loop is continuous. The RUD_CC_REQ carries "backlog" information, telling the target exactly how much data is waiting in the source's queue. By dynamically adjusting grants based on this feedback, RCCC ensures the backend network remains lossless.

Figure 8-1: RCCC: Destination Flow Control.


Source RCCC Operation


The RCCC operation from the perspective of source UET Node-A begins when an application on Rank 0 initiates a 256 MB Remote Memory Access (RMA) write operation toward Rank 2. This request is handled by the Semantic Sublayer (SES), which translates the high-level command into a ses_pds_tx_req for the Packet Delivery Sublayer (PDS). In our example, the PDS Manager determines that no communication channel currently exists between the Fabric Endpoints used for these connections, so it allocates a new Packet Delivery Control (PDC) from its general pool with the PDC identifier 0x4001. Simultaneously, it requests a Congestion Control Context (CCC) from the Congestion Management System (CMS), resulting in a dedicated context, CCC_ID = 0xA1, being configured and bound to the new PDC.

Once PDC and CCC are established, the system tracks the pending data through a two-tier backlog system. In our example, PDC 0x4001 updates its delta backlog with the full 256 MB of the request, which is then added to the CCC’s global backlog. This global value represents the total volume of data currently waiting for transport across all PDCs managed by that specific context. Because this is the start of the transaction, the global backlog moves from zero to 256 MB, establishing the total "demand" the source is prepared to place on the network.

In our example, new contexts are pre-provisioned with initial credits scaled to the Bandwidth-Delay Product (BDP). While the theoretical capacity of a 100G link is 12.5 GB/s, the initial "pipe-cleaning" burst is much smaller, specifically 12.5 KB in this scenario. This value represents a safe, conservative fraction of the total BDP, ensuring that the source can trigger the feedback loop without the risk of overwhelming the receiver's buffers or the last-hop switch before the control loop fully engages. The CCC authorizes PDC 0x4001 to transmit this initial amount, subtracts it from the current cumulative credits, and updates the global backlog to show that this small portion is now in-flight, leaving 255,987,500 bytes remaining in the queue.
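The backlog bookkeeping for this example works out as follows (values in bytes, taken directly from the text):

```python
# Two-tier backlog bookkeeping for the 256 MB RMA write on PDC 0x4001.
request = 256 * 10**6            # 256 MB queued by the application
initial_credit = 12_500          # conservative BDP fraction ("pipe-cleaning")

global_backlog = 0
global_backlog += request        # PDC delta backlog rolls up into the CCC
global_backlog -= initial_credit # the authorized burst is now in flight
assert global_backlog == 255_987_500
```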

With this authorization from the CCC, the PDC passes the work request to the NIC, which fetches the data from memory and prepares the packet for transmission. In our example, the FEP Fabric Addresses (FA) are encoded into the IP header’s source and destination IP address fields, and the DSCP bits are configured to correspond to the TC-LOW traffic class. Additionally, the ECN bits are set to reflect that the packet is ECN-capable, ensuring visibility for Network Signaled Congestion Control (NSCC) if needed. The type of the PDS request is set to RUD_CC_REQ, which requires a pds.req_cc_state field. In our example, this field carries the CCC_ID (0xA1) and the Credit Target, which describes the size of the backlog of the sender CCC. By including these parameters, the source explicitly informs the target of its total remaining data, allowing the receiver to calculate and return the next set of credit grants to keep the pipeline moving.

Note: Since the source does not yet have information regarding the PDC on the remote target, it populates the pdc_info field with a value of 0x0 for the Destination PDC ID (DPDCID), notifying the target that a new PDC ID must be taken from its global PDC pool. Furthermore, the SYN bit remains set until the first ACK_CC message is received, signaling to the target that the connection handshake and credit-granting loop are in the initialization phase.


Figure 8-2: Source RCCC Processing.

Target RCCC Operation – PDS Request


When the initial packet arrives at destination Node-B, the PDS Manager first checks for an existing PDC associated with the incoming connection from Fabric Address (FA) 10.0.0.1 and SPDCID 0x4001. Because no such PDC exists, the PDS Manager identifies this as a new connection request. The value of 0x0 in the pdc_info field instructs the target to allocate a General type PDC, ensuring the local delivery control matches the source's PDC type.

Since no Congestion Control Context (CCC) currently exists for this specific FEP-to-FEP connection, the PDS requests the CMS to allocate a new one. The CMS assigns CCC_ID 0xB1 and creates a source-specific entry in the Active Sender Table. The entry records the source address (FA 10.0.0.1) and the assigned traffic class (TC-LOW) taken from the IP header. In addition, the entry for source CCC_ID 0xA1, carried in the PDS header, records the source backlog as a credit_target value of 255,987,500 bytes.

Simultaneously, the NIC extracts the semantic information from the SES header to identify the required operation. In our example, it recognizes a UET_WRITE command and determines the target memory address for the incoming data. Once the packet payload is verified, the data is forwarded to the High-Bandwidth Memory (HBM) Controller, where it waits for its turn to be committed to the physical memory.


Figure 8-3: Target RCCC Processing – PDS Request.


Target RCCC Operation – Credit Assignment


After receiving confirmation from the SES regarding the completed memory operation, the PDS prepares the response using an ACK_CC message. The CMS must now determine how much data the source is permitted to send in its next burst. In our example, the CMS allocates 12.5 KB of credits for CCC_ID 0xA1.

The math behind this allocation is a function of the receiver’s total capacity and the time-granularity of the control loop. While the NIC provides a 100 Gbps (12.5 GB/s) "pipe," the receiver does not grant a full second of data at once, as doing so would bypass the congestion control mechanism. Instead, it grants data in "time-slices"; in this scenario, each slice represents 1 microsecond of transmission. By dividing the total bandwidth by the number of active senders for each time-slice, the receiver ensures that the aggregate "demand" never exceeds the physical capabilities of the link.

In our example, with only one active sender, the calculation is:

(12.5 GB/s × 0.000001 s) ÷ active senders = 12.5 KB

The RCCC algorithm is designed for dynamic fairness. Though not explicitly shown in Figure 8-4, if Rank 1 had a simultaneous transfer in progress, the Active Sender Table would list two sources. The CMS would then divide that same 1-microsecond "slice" between them, reducing the granted credit per source to 6.25 KB. This prevents "incast" congestion by ensuring that even if multiple sources transmit at once, their combined throughput matches exactly what the receiver can ingest.
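The grant calculation can be written as a one-line function. The 1-microsecond slice is the value used in this scenario, not a fixed constant of the algorithm, and the even split is a simplification of the real credit scheduler.

```python
def grant_credits(link_bytes_per_s: float, slice_s: float, active: int) -> float:
    """RCCC-style grant: split one time-slice of link capacity evenly
    across the active senders (a simplified model of the algorithm)."""
    return link_bytes_per_s * slice_s / active

# 100 Gbps link = 12.5e9 B/s; one sender gets the whole 1 us slice:
assert grant_credits(12.5e9, 1e-6, 1) == 12500.0   # 12.5 KB
# With two active senders, the same slice is halved:
assert grant_credits(12.5e9, 1e-6, 2) == 6250.0    # 6.25 KB
```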

The PDS defines this response by setting the pds.cc_type to CC_CREDIT. The pds.ack_cc_state field is populated with this calculated credit value, while the ooo_count field tracks any Out-of-Order packets. To ensure this information is not delayed by standard data traffic, the DSCP bits in the IP header are set to TC-HIGH. This gives the ACK_CC message "express" priority across the backend fabric, minimizing the time the source spends waiting for new credits and maintaining a high-performance, steady-state flow.

Crucially, the Target populates its own local PDC ID (0x4011) into the Source PDC Identifier (SPDCID) field of the PDS Prologue header. By doing so, it provides the return address necessary for the source to transition out of its initial "discovery" state.

Figure 8-4: Target RCCC Processing – ACK_CC Message Reply.


Source-Side Processing of ACK_CC


When the ACK_CC message arrives at the source, the NIC identifies the target FEP based on the destination IP address. However, for high-speed internal processing, it uses the DPDCID in the PDS header as a local handle to jump directly to the correct PDC Context. From this entry, the NIC automatically resolves the CCC_ID associated with that specific PDC.

Once the correct CCC entry is identified, the source processes the new credit information. In our example, the receiver has sent a new Cumulative Credit value of 25,000 bytes. To determine the currently available window, the source performs a simple subtraction: 

Incremental Credit = Received Cumulative Credit – Local Cumulative Credit

By subtracting the previously recorded 12,500 bytes from the new 25,000 bytes, the source identifies an incremental grant of 12,500 bytes. The CCC then authorizes the PDC to transmit this amount. Simultaneously, the Global Backlog is updated by subtracting these 12,500 bytes from the remaining 255,987,500 bytes, keeping the sender’s demand signal accurate for the next request.
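The source-side arithmetic from this step, using the values in the text:

```python
# Credit bookkeeping at the source after the second ACK_CC (bytes).
received_cumulative = 25_000
local_cumulative = 12_500

incremental = received_cumulative - local_cumulative
assert incremental == 12_500     # newly authorized window

global_backlog = 255_987_500
global_backlog -= incremental    # another 12.5 KB moves into flight
assert global_backlog == 255_975_000
```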

The PDC informs the NIC that it is cleared to construct packets fitting this allowed credit size (respecting the NIC’s MTU). The NIC fetches the data from memory, packetizes it, and transports it to the destination. This control loop continues—updating demand and receiving cumulative grants—until the entire backlog has been transported and acknowledged.

Once the job is complete, the PDC context is closed. If no other PDCs are currently associated with that CCC_ID, the CCC is also closed. This hierarchical teardown ensures that no unnecessary hardware resources or bandwidth are reserved in the AI Fabric once the work is done.