Introduction
The previous chapter introduced the Ultra Ethernet (UE) Transport Layer and its endpoint-centric congestion control mechanisms: Network Signaled Congestion Control (NSCC) and Receiver Credit-based Congestion Control (RCCC). This chapter moves down to the UE Network Layer and introduces Packet Trimming (PT).
While node-based approaches rely on NIC-to-NIC feedback loops, Packet Trimming allows network switches to actively intervene during periods of high utilization. Instead of silently dropping packets under congestion, the network provides an explicit and fast signal that enables immediate recovery.
The primary goal of Packet Trimming is to mitigate the effects of incast congestion, a situation in which multiple ingress ports simultaneously overwhelm a single egress port. In AI and HPC workloads, many-to-one traffic patterns are common, such as when multiple workers send data to a single parameter server. Under these conditions, egress buffers can be exhausted very quickly. In a best-effort network, this typically results in tail drops. The receiver then waits for a retransmission timeout, which introduces long tail latency and disrupts synchronization across distributed workloads. Packet Trimming replaces this silent packet loss with an explicit congestion signal that travels faster than the data itself.
The process begins at the source UE node. The NIC marks outgoing data packets with a DSCP-TRIMMABLE codepoint, indicating that the packet payload may be truncated if congestion occurs in the network. When such a packet enters a switch, it is initially treated as normal data traffic, for example classified into a low-priority traffic class and forwarded according to standard scheduling rules.
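The marking itself is an ordinary DSCP write. The sketch below shows the idea in Python, assuming a hypothetical operator-chosen codepoint value; in a real fabric the concrete DSCP-TRIMMABLE value is a configuration choice, not a fixed constant:

# Minimal sketch of source-side marking. The codepoint value below is a
# hypothetical, deployment-specific choice used only for illustration.
DSCP_TRIMMABLE = 0x04  # hypothetical DSCP-TRIMMABLE codepoint

def mark_trimmable(ip_header: bytearray) -> None:
    # DSCP occupies the upper 6 bits of the IPv4 TOS / IPv6 Traffic Class
    # byte; the lower 2 bits (ECN) are preserved.
    ecn = ip_header[1] & 0x03
    ip_header[1] = (DSCP_TRIMMABLE << 2) | ecn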
As the switch resolves the egress port, its scheduling logic continuously monitors congestion on that port. If a predefined congestion threshold is exceeded, the switch does not drop the packet entirely; instead, it discards only the payload while preserving the essential network and transport headers. For IPv4 traffic, the packet is typically truncated to 64 bytes, and for IPv6 traffic to 128 bytes. These sizes are chosen to ensure that all transport-level identification fields remain intact.
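A minimal sketch of this decision, assuming a simple queue-depth threshold and the trim sizes quoted above (the function and variable names are illustrative, not switch firmware):

TRIM_SIZE = {"ipv4": 64, "ipv6": 128}  # bytes retained after trimming

def handle_trimmable(packet: bytes, ip_version: str,
                     queue_depth: int, threshold: int) -> bytes:
    # Below the congestion threshold, forward the packet unchanged.
    if queue_depth <= threshold:
        return packet
    # Above it, keep only the leading bytes so the L2/L3 headers and the
    # PDS header survive; the payload is discarded.
    return packet[:TRIM_SIZE[ip_version]]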
The Packet Delivery Sublayer (PDS) header carries the transport identity of the packet, including the Packet Sequence Number and the Packet Delivery Context identifiers. This information is sufficient for the receiver to detect packet loss and request retransmission. The switch does not need to preserve application semantics to signal congestion; it only needs to preserve transport-level identity.
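The identity a trimmed packet must retain can be pictured as follows. This is an illustrative view only; the exact UE wire layout is defined by the specification and is not reproduced here:

from dataclasses import dataclass

@dataclass(frozen=True)
class PdsIdentity:
    # Illustrative field set mirroring the prose above.
    psn: int       # Packet Sequence Number
    src_pdc: int   # source Packet Delivery Context identifier
    dst_pdc: int   # destination Packet Delivery Context identifier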
Packet Trimming is effective only if the trimmed packet retains all information required for transport-level recovery. For this reason, Ultra Ethernet defines a minimum trim size that ensures the complete PDS header is preserved. This guarantees that sequence numbers and context identifiers are not lost during trimming.
In networks that use tunneling or encapsulation, such as VXLAN, additional headers must also be preserved. The receiver relies on this information to correctly demultiplex traffic and associate it with the appropriate transport context. As a result, the minimum trim size is a network-wide configuration parameter that depends on the transport protocol, IP version, and encapsulation methods used in the fabric. All switches in the network must apply the same value to ensure consistent behavior.
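The dependency can be expressed as a simple sum over the header stack. The per-header sizes below are standard, but the PDS header length is a hypothetical value chosen for illustration, so the resulting numbers are illustrative only:

HEADER_BYTES = {"eth": 14, "ipv4": 20, "ipv6": 40, "udp": 8, "vxlan": 8}
PDS_HEADER_BYTES = 20  # hypothetical size, for illustration only

def min_trim_size(stack: list[str]) -> int:
    # Every header preceding the PDS header must survive trimming.
    return sum(HEADER_BYTES[h] for h in stack) + PDS_HEADER_BYTES

# A plain IPv4 fabric versus a VXLAN-tunneled one:
plain = min_trim_size(["eth", "ipv4", "udp"])                  # 62 bytes
tunneled = min_trim_size(["eth", "ipv4", "udp", "vxlan",
                          "eth", "ipv4", "udp"])               # 112 bytes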
From a practical standpoint, trimming is most effective for large data packets. When a multi-kilobyte packet is reduced to a compact header-only frame, its footprint in the congested queue shrinks by one to two orders of magnitude, while loss detection still happens within a single round-trip time. For small packets, the relative benefit of trimming is limited because the size reduction is much smaller.
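A quick calculation makes the asymmetry concrete:

# Size reduction when trimming to a 64-byte IPv4 frame:
for size in (4096, 1024, 128):
    print(f"{size:5d} B -> 64 B: {size / 64:5.1f}x smaller")
# 4096 B -> 64 B:  64.0x smaller
# 1024 B -> 64 B:  16.0x smaller
#  128 B -> 64 B:   2.0x smaller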
After trimming the payload, the switch rewrites the DSCP field in the IP header from DSCP-TRIMMABLE to DSCP-TRIMMED. This rewrite explicitly signals that the packet has been trimmed and now represents congestion feedback rather than application data. The packet is then reclassified into a higher-priority traffic class before transmission.
This priority promotion is essential. Trimmed packets are forwarded using a medium- or high-priority queue so that they bypass the congestion that caused the trimming in the first place. Because trimmed packets are very small, prioritizing them does not increase congestion. Instead, it shortens the feedback loop by ensuring that congestion information reaches the destination with minimal delay.
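Continuing the earlier sketch, the rewrite-and-promote step might look like the following, again with hypothetical codepoints and an illustrative traffic-class mapping:

DSCP_TRIMMED = 0x05  # hypothetical DSCP-TRIMMED codepoint
TC_MAP = {DSCP_TRIMMABLE: "low-priority queue",
          DSCP_TRIMMED: "high-priority queue"}

def rewrite_and_promote(ip_header: bytearray) -> str:
    # Rewrite DSCP-TRIMMABLE to DSCP-TRIMMED, preserving the ECN bits,
    # and return the queue the trimmed packet is now scheduled from.
    ecn = ip_header[1] & 0x03
    ip_header[1] = (DSCP_TRIMMED << 2) | ecn
    return TC_MAP[DSCP_TRIMMED]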
When the trimmed packet arrives at the destination UE node, the transport layer immediately recognizes the DSCP marking and the absence of the payload. Using the preserved Packet Sequence Number from the PDS header, the receiver generates a selective negative acknowledgment with the retransmit flag set. This allows the sender to begin retransmission almost immediately, without waiting for a timeout.
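Reusing the illustrative constants from the earlier sketches, the receiver-side reaction reduces to a simple branch. The acknowledgment structure shown here is schematic; the actual control-packet format is defined by the UE specification:

def on_packet(ip_header: bytes, pds: PdsIdentity, payload: bytes):
    dscp = ip_header[1] >> 2
    if dscp == DSCP_TRIMMED:
        # The payload was removed in the fabric: request retransmission
        # immediately instead of waiting for a timeout.
        return {"type": "nack", "psn": pds.psn, "retransmit": True}
    # ... normal data delivery path (omitted) ...
    return None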
By combining payload removal, DSCP rewrite, and priority forwarding, Packet Trimming enables fast, hardware-assisted recovery while maintaining high throughput. This capability is critical for large-scale AI training and HPC workloads, where low latency and tight synchronization directly impact performance.
The following sections describe how Packet Trimming and signal processing are implemented within the Cisco Silicon One G200 ASIC.
Optical to Digital Signal Processing
Figure 9-1 provides a high-level overview of the components involved in translating a received signal into an Ethernet bitstream. It traces the path of a signal encoded using Pulse Amplitude Modulation 4 (PAM4) as it travels from the optical domain, through electrical processing inside the ASIC, and finally to the RX MAC, where Ethernet frame boundaries are identified and the Cyclic Redundancy Check (CRC) is validated.
In Figure 9-1, an 800 Gbps optical transceiver is attached to front-panel interface E0, which belongs to Interface Group 0 (IFG0) covering interfaces E0–E3. Internally, each 800G interface is supported by eight 112 Gb/s SerDes lanes. These lanes are processed independently and later aggregated by the Physical Coding Sublayer (PCS) before reaching the RX MAC.
Phase 1 - Optical Reception: The transceiver receives an optical waveform that is already PAM4-modulated by the transmitting device. PAM4 encodes two bits per symbol by using four distinct optical symbol levels, allowing higher data rates without increasing the symbol rate.
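The following sketch shows the encoding idea; the Gray-coded bit-to-level mapping used here is the common convention and is shown purely for illustration:

# PAM4: each pair of bits selects one of four amplitude levels. Gray
# coding means adjacent levels differ by a single bit, which halves the
# damage from a small level misread.
PAM4_LEVELS = {(0, 0): -3, (0, 1): -1, (1, 1): +1, (1, 0): +3}

def pam4_encode(bits):
    # Two bits per symbol: a 112 Gb/s lane needs only ~56 GBd of symbols.
    pairs = zip(bits[0::2], bits[1::2])
    return [PAM4_LEVELS[p] for p in pairs]

print(pam4_encode([0, 0, 0, 1, 1, 1, 1, 0]))  # [-3, -1, 1, 3]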
Phase 2 - Photodiode Conversion: A photodiode converts the incoming optical signal into an analog electrical waveform, preserving the relative PAM4 symbol levels. Due to attenuation, dispersion, and noise introduced during transmission over fiber, this waveform may be distorted when it arrives at the receiver.
Phase 3 - Transceiver DSP Conditioning: To ensure the signal can reliably traverse the switch’s internal circuit board, the transceiver includes a small Digital Signal Processor (DSP). This DSP performs signal conditioning functions such as amplification, equalization, and retiming. The result is a cleaned PAM4 electrical signal, which is transmitted toward the ASIC across eight high-speed differential electrical lanes.
Phase 4 - ADC and SerDes Processing: Inside the G200 ASIC, each incoming electrical lane from the transceiver terminates at a dedicated SerDes RX slice, which forms the physical-layer front end of the ASIC. An Analog-to-Digital Converter (ADC) samples the incoming analog waveform, and SerDes logic performs equalization, clock recovery, and PAM4 symbol decoding to reconstruct the digital bitstream for that lane. At this stage, each SerDes lane produces a high-speed stream of digital data. To make this data manageable for internal logic, the SerDes widens the data path by converting the serial stream into parallel data. For example, instead of processing a single bit at a 112 Gb/s rate, the data may be represented as a 64-bit-wide word operating at approximately 1.75 GHz. This example illustrates the principle rather than a fixed architectural constant.
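The arithmetic behind this example is straightforward:

lane_rate_bps = 112e9   # serial bit rate per SerDes lane
word_width = 64         # bits presented per internal clock cycle
word_clock_hz = lane_rate_bps / word_width
print(f"{word_clock_hz / 1e9:.2f} GHz")  # 1.75 GHz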
Phase 5 - PCS Lane Bonding: Because the full 800 Gbps signal is distributed across eight independent lanes, the Physical Coding Sublayer (PCS), together with the RS-FEC sublayer, performs lane alignment and deskew. The PCS then bonds the lane data together, producing a single continuous logical bitstream that represents the Ethernet interface. This aggregated bitstream is delivered to the RX MAC, which detects the Ethernet preamble and Start-of-Frame Delimiter (SFD), validates the Frame Check Sequence (FCS, a CRC-32 checksum), and prepares the frame for further processing within the switching pipeline.
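The FCS check itself is a well-defined operation: the FCS is the IEEE CRC-32 computed over the frame from the destination MAC address through the payload, carried in the last four bytes of the frame, least-significant byte first. A sketch using the Python standard library:

import struct
import zlib

def fcs_ok(frame: bytes) -> bool:
    # The CRC-32 of everything before the 4-byte FCS must reproduce it.
    body, fcs = frame[:-4], frame[-4:]
    return struct.pack("<I", zlib.crc32(body)) == fcs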
Why This Structure Is Necessary
No practical silicon device can process a serial data stream at hundreds of gigabits per second directly. Such operation would exceed thermal, power, and signal integrity limits.
By converting speed into width, the G200 trades extremely fast serial signaling for wide, lower-frequency parallel processing. In the example above, widening the data path to 512 bits (8 lanes × 64 bits) allows the ASIC to sustain the line-rate throughput while operating internal logic at a manageable clock rate. This architectural approach enables high performance without compromising power efficiency or reliability.
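The same example numbers confirm the trade: eight 64-bit words per cycle at roughly 1.75 GHz provide 896 Gb/s of raw internal bandwidth, above the 800 Gb/s line rate, with the margin absorbing coding and protocol overhead:

lanes, width_bits, clock_hz = 8, 64, 1.75e9
print(f"{lanes * width_bits * clock_hz / 1e9:.0f} Gb/s")  # 896 Gb/s raw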
In modern high-speed Ethernet ASICs, the SerDes and PCS functions form the physical-layer front end of the chip, while the RX MAC operates only after the electrical lanes have been recovered, aligned, and bonded into a logical bitstream.
Figure 9-1: Optical to Digital Signal Processing.