Monday, 29 December 2025

UET Congestion Management: Introduction

Introduction


Figure 6-1 depicts a simple scale-out backend network for an AI data center. The topology follows a modular design, allowing the network to scale out or scale in as needed. The smallest building block in this example is a segment, which consists of two nodes, two rail switches, and one spine switch. Each node in the segment is equipped with a dual-port UET NIC and two GPUs.

Within a segment, GPUs are connected to the leaf switches using a rail-based topology. For example, in Segment 1A, the communication path between GPU 0 on Node A1 and GPU 0 on Node A2 uses Rail A0 (Leaf 1A-1). Similarly, GPU 1 on both nodes is connected to Rail A1 (Leaf 1A-2). In this example, we assume that intra-node GPU collective communication takes place over an internal, high-bandwidth scale-up network (such as NVLink). As a result, intra-segment GPU traffic never reaches the spine layer. Communication between segments is carried over the spine layer.

The example network is a best-effort (that is, PFC is not enabled) two-tier, three-stage non-blocking fat-tree topology, where each leaf and spine switch has four 100-Gbps links. Leaf switches have two host-facing links and two inter-switch links, while spine switches have four inter-switch links. All inter-switch and host links are Layer-3 point-to-point interfaces, meaning that no Layer-2 VLANs are used in the example network.

Links between a node’s NIC and the leaf switches are Layer-3 point-to-point connections. The IP addressing scheme uses /31 subnets, where the first address is assigned to the host NIC and the second address to the leaf switch interface. These subnets are allocated in a contiguous manner so they can be advertised as a single BGP aggregate route toward the spine layer.
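The /31 addressing and aggregation scheme can be sketched with Python's standard ipaddress module. The 10.1.0.0/28 block below is a hypothetical example; the actual addressing plan is not specified in the text.

```python
import ipaddress

# Hypothetical /28 block for one leaf's host links (illustrative only).
block = ipaddress.ip_network("10.1.0.0/28")

# Carve the block into /31 point-to-point subnets: the first address goes
# to the host NIC, the second to the leaf switch interface.
links = [(str(s[0]), str(s[1])) for s in block.subnets(new_prefix=31)]
print(links[0])   # ('10.1.0.0', '10.1.0.1')

# Because the /31s are allocated contiguously, the leaf can advertise the
# whole block toward the spine layer as a single BGP aggregate.
agg = list(ipaddress.collapse_addresses(block.subnets(new_prefix=31)))
print(agg)        # [IPv4Network('10.1.0.0/28')]
```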

The trade-off of this aggregation model is that host-link or NIC failures cannot rely solely on BGP route withdrawal for fast failure detection. Additional local failure-detection mechanisms are therefore required at the leaf switch.

Although not shown in Figure 6-1, the example design supports a scalable multi-pod architecture. Multiple pods can be interconnected through a super-spine layer, enabling large-scale backend networks.

Note: The OSI between GPUs within a node indicates that both GPUs belong to the same Operating System Instance (OSI). The link between GPUs, in turn, is part of a high-bandwidth domain (the scale-up backend).

Figure 6-1: Example of an AI DC Backend Network Topology.

Congestion Types

In this text, we categorize congestion into two distinct domains: congestion within nodes, which includes incast, local, and outcast congestion, and congestion in scale-out backend networks, which includes link and network congestion. The following sections describe each congestion type in detail.


Incast Congestion

In high-performance networking, Incast is a specific type of congestion that occurs when a many-to-one communication pattern overwhelms a single network point. This is fundamentally a "fan-in" problem, where the traffic volume destined for a single receiver exceeds both the physical line rate of the last-hop switch's egress interface and the storage capacity of its output buffers.

To visualize this, consider the configuration in Figure 6-2. The setup consists of four UET Nodes (A1, A2, B1, and B2), each containing two GPUs. This results in eight total processing units, labeled Rank 0 through Rank 7. Each Rank is equipped with its own dedicated 100G NIC.

The bottleneck forms when multiple sources target a single destination simultaneously. In this scenario, Ranks 1 through 7 all begin transmitting data to Rank 0 at the exact same time, each at a 100G line rate.

The backbone of the network is typically robust enough to handle this aggregate traffic. If the switches are connected via 400G or 800G links, the core of the network stays clear and fast. If the core were to experience congestion, Network-Signaled Congestion Control (NSCC) could be enabled to manage it. However, the specific problem here occurs at Leaf 1A-1, the switch where the target (Rank 0) is connected. While the switch receives a combined 600G of data destined for Rank 0, the outgoing interface from the switch to Rank 0 can only move 100G. Note that Rank 1 uses the high-speed NVLink scale-up connection, not its Ethernet NIC, because it resides on the same node as Rank 0.

A buffer overflow is inevitable when 600G of data arrives at an egress port that can only output 100G. The switch is forced to store the extra 500G of data per second in its internal memory (buffers). Because network buffers are quite small and high-speed data moves incredibly fast, these buffers fill up in microseconds.
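A quick back-of-envelope calculation shows just how fast this happens. The 16 MiB buffer size below is an assumption for illustration; real switch buffer sizes vary by ASIC.

```python
# How quickly does a switch buffer fill under this incast pattern?
ingress_gbps = 600          # six ranks each sending at 100G over the fabric
egress_gbps = 100           # single 100G link toward Rank 0
net_fill_gbps = ingress_gbps - egress_gbps   # 500 Gbit/s of excess traffic

buffer_bytes = 16 * 2**20   # 16 MiB of packet buffer (assumed, illustrative)
buffer_bits = buffer_bytes * 8

fill_time_us = buffer_bits / (net_fill_gbps * 1e9) * 1e6
print(f"{fill_time_us:.0f} us until overflow")   # ~268 us
```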

Once the buffers are full, the switch has no choice but to drop any new incoming packets. This leads to massive retransmission delays and "stuttering" in application performance. This is particularly devastating for AI training workloads, where all Ranks must stay synchronized to maintain efficiency.

While traditional networks use simple buffer management to deal with this, Ultra Ethernet utilizes a more sophisticated approach. To prevent "fan-in" from ever overwhelming the switch buffers in the first place, UET employs Receiver Credit-based Congestion Control (RCCC). This mechanism ensures the receiver remains in control by distributing credits that define exactly how much data each active source is allowed to transmit at any given time.
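The receiver-driven idea behind RCCC can be sketched as follows. The class names, credit granularity, and even-split policy are illustrative simplifications, not the UET wire protocol.

```python
# Minimal sketch of receiver credit-based congestion control: the receiver
# grants credits so that aggregate arrivals never exceed its egress capacity.
class Receiver:
    def __init__(self, egress_bytes_per_round):
        self.budget = egress_bytes_per_round

    def grant_credits(self, senders):
        # Split the drainable egress capacity evenly across active sources.
        share = self.budget // len(senders)
        return {s: share for s in senders}

class Sender:
    def __init__(self, name, backlog):
        self.name, self.backlog = name, backlog

    def transmit(self, credit):
        sent = min(self.backlog, credit)   # never exceed the granted credit
        self.backlog -= sent
        return sent

senders = [Sender(f"rank{i}", backlog=4096) for i in range(2, 8)]
rx = Receiver(egress_bytes_per_round=12_000)   # what the 100G link can drain
credits = rx.grant_credits(senders)
total = sum(s.transmit(credits[s]) for s in senders)
print(total)   # 12000 -> fan-in is bounded by the receiver's egress capacity
```

Because the receiver owns the credit budget, the "fan-in" problem is solved before packets ever queue at the last-hop switch.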


Figure 6-2: Intra-node Congestion - Incast Congestion.


Local Congestion

Local congestion arises when the High-Bandwidth Memory (HBM) controller, which manages access to the GPU’s memory channels, becomes a bottleneck. The HBM controller arbitrates all read and write requests to GPU memory, regardless of their source. These requests may originate from the GPU’s compute cores, from a peer GPU via NVLink, or from a network interface card (NIC) performing remote memory access (RMA) operations.

With a UET_WRITE operation, the target GPU compute cores are bypassed: the NIC writes data directly into GPU memory using DMA. The GPU does not participate in the data transfer itself, and the NIC handles packet reception and memory writes. Even in this case, however, the data must still pass through the HBM controller, which serves as the shared gateway to the GPU’s memory system.

In Figure 6-3, the HBM controller of Rank 0 receives seven concurrent memory access requests: six inter-node RMA write requests and one intra-node request. The controller must arbitrate among these requests, determining the order and timing of each access. If the aggregate demand exceeds the available memory bandwidth or arbitration capacity, some requests are delayed. These memory-access delays are referred to as local congestion.
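A toy arbitration model illustrates how aggregate demand turns into memory-access delay. The bandwidth figure, request sizes, and FIFO policy are illustrative assumptions, not real HBM controller behavior.

```python
# Toy model: seven concurrent requests compete for fixed memory bandwidth;
# the excess demand becomes queuing delay (local congestion).
BW_BYTES_PER_US = 100_000      # assumed usable HBM bandwidth per microsecond

# Six inter-node RMA writes plus one intra-node (scale-up) write request.
requests = [("rma_write", 50_000)] * 6 + [("nvlink_write", 50_000)]

clock_us = 0.0
completion = {}
for i, (src, size) in enumerate(requests):   # simple FIFO arbitration
    clock_us += size / BW_BYTES_PER_US       # each request occupies the
    completion[f"{src}#{i}"] = clock_us      # controller for its full duration

print(completion["nvlink_write#6"])   # 3.5 -> the last request waits 3 us
```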



Figure 6-3: Intra-node Congestion - Local Congestion.


Outcast Congestion

Outcast congestion is the third type of congestion observed in collective operations. It occurs when multiple packet streams share the same egress port, and some flows are temporarily delayed relative to others. Unlike incast congestion, which arises from simultaneous arrivals at a receiver, outcast happens when certain flows dominate the output resources, causing other flows to experience unfair delays or buffer pressure.

Consider the broadcast phase of the AllReduce operation. After Rank 0 has aggregated the gradients from all participating ranks, it sends the averaged results back to all other ranks. Suppose Rank 0 sends these updates simultaneously to ranks on node A2 and node A3 over the same egress queue of its NIC. If one destination flow slightly exceeds the others in packet rate, the remaining flows experience longer queuing delays or may even be dropped if the egress buffer becomes full. These delayed flows are “outcast” relative to the dominant flows.

In this scenario, the NIC at Rank 0 must perform multiple UET_WRITE operations in parallel, generating high egress traffic toward several remote FEPs. At the same time, the HBM controller on Rank 0 may become a bottleneck because the data must be read from memory to feed the NIC. Thus, local congestion can occur concurrently with outcast congestion, especially during large-scale AllReduce broadcasts where multiple high-bandwidth streams are active simultaneously.

Outcast congestion illustrates that even when the network’s total capacity is sufficient, uneven traffic patterns can cause some flows to be temporarily delayed or throttled. Mitigating outcast congestion is addressed by appropriate egress scheduling and flow-control mechanisms to ensure fair access to shared resources and predictable collective operation performance. These mechanisms are explained in the upcoming Network-Signaled Congestion Control (NSCC) and Receiver Credit-Based Congestion Control (RCCC) chapters.
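One classic egress-scheduling approach that prevents a dominant flow from starving others is deficit round-robin. The sketch below is a generic fairness mechanism offered for intuition, not the UET-specified scheduler.

```python
from collections import deque

# Deficit round-robin: each flow earns a quantum per round and may send
# packets only while it has sufficient deficit, so no flow is starved.
def drr(flows, quantum, rounds):
    sent = {name: 0 for name in flows}
    deficit = {name: 0 for name in flows}
    for _ in range(rounds):
        for name, q in flows.items():
            deficit[name] += quantum
            while q and q[0] <= deficit[name]:
                pkt = q.popleft()
                deficit[name] -= pkt
                sent[name] += pkt
            if not q:
                deficit[name] = 0   # standard DRR: reset when queue is empty
    return sent

flows = {
    "to_A2": deque([1500] * 8),   # dominant flow: many queued packets
    "to_A3": deque([1500] * 2),   # lighter flow
}
print(drr(flows, quantum=1500, rounds=4))
# {'to_A2': 6000, 'to_A3': 3000} -> the light flow drains without being outcast
```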


Figure 6-4: Intra-node Congestion - Outcast Congestion.


Link Congestion


Traffic in distributed neural network training workloads is dominated by bursty, long-lived elephant flows. These flows are tightly coupled to the application’s compute–communication phases. During the forward pass, network traffic is minimal, whereas during the backward pass, each GPU transmits large gradient updates at or near line rate. Because weight updates can only be computed after gradient synchronization across all workers has completed, even a single congested link can delay the entire training step.

In a routed, best-effort fat-tree Clos fabric, link congestion may be caused by Equal-Cost Multi-Path (ECMP) collisions. ECMP typically uses a five-tuple hash—comprising the source and destination IP addresses, transport protocol, and source and destination ports—to select an outgoing path for each flow. During the backward pass, a single rank often synchronizes multiple gradient chunks with several remote ranks simultaneously, forming a point-to-multipoint traffic pattern.
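The hash-based path selection can be sketched as follows. Real switches use vendor-specific hash functions and seeds; CRC32 and the addresses below are stand-ins for illustration.

```python
import zlib

# ECMP path selection: a deterministic hash of the five-tuple picks one
# of the equal-cost uplinks for every packet of a flow.
def ecmp_uplink(src_ip, dst_ip, proto, sport, dport, n_uplinks=2):
    key = f"{src_ip},{dst_ip},{proto},{sport},{dport}".encode()
    return zlib.crc32(key) % n_uplinks   # CRC32 stands in for the ASIC hash

# Two flows from different ranks on the same rail; addresses and ports
# are made up. Because the hash is deterministic per flow, two long-lived
# elephant flows can land on the same uplink and stay there.
flow_a = ecmp_uplink("10.1.0.0", "10.2.0.0", 17, 49152, 4791)
flow_b = ecmp_uplink("10.1.0.2", "10.2.0.2", 17, 49152, 4791)
print(flow_a, flow_b)   # equal values -> both 100G flows share one uplink
```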

For example, suppose Ranks 0–3 in segment 1 initiate gradient synchronization with Ranks 4–7 in segment 2 at the same time. Ranks 0 and 2 are connected to rail 0 through Leaf 1A-1, while Ranks 1 and 3 are connected to rail 1 through Leaf 1A-2. As shown in Figure 6-5, the ECMP hash on Leaf 1A-1 selects the same uplink toward Spine 1A for both flows arriving via rail 0, while the ECMP hash on Leaf 1A-2 distributes its flows evenly across the available spine links.

As a result, two 100-Gbps flows are mapped onto a single 100-Gbps uplink on Leaf 1A-1. The combined traffic exceeds the egress link capacity, causing buffer buildup and eventual buffer overflow on the uplink toward Spine 1A. This condition constitutes link congestion, even though alternative equal-cost paths exist in the topology.

In large-scale AI fabrics, thousands of concurrent flows may be present, and low entropy in traffic patterns—such as many flows sharing similar IP address ranges and port numbers—further increases the likelihood of ECMP collisions. Consequently, link utilization may become uneven, leading to transient congestion and performance degradation even in a nominally non-blocking network.

Ultra Ethernet Transport includes signaling mechanisms that allow endpoints to react to persistent link congestion, including influencing path selection in ECMP-based fabrics. These mechanisms are discussed in later chapters.

Note: Although outcast congestion is fundamentally caused by the same condition—attempting to transmit more data than an egress interface can sustain—Ultra Ethernet Transport distinguishes between host-based and switch-based egress congestion events and applies different signaling and control mechanisms to each. These mechanisms are described in the following congestion control chapters.



Figure 6-5: Link Congestion.

Network Congestion


Common causes of network congestion include an excessively high oversubscription ratio, ECMP collisions, and link or device failures. A less obvious but important source of short-term congestion is Priority Flow Control (PFC), which is commonly used to build lossless Ethernet networks. PFC together with Explicit Congestion Notification (ECN) forms the foundation of lossless Ethernet for RoCEv2 but should be avoided in a UET-enabled best-effort network. The upcoming chapters explain why.

PFC relies on two buffer thresholds to control traffic flow: xOFF and xON. The xOFF threshold defines the point at which a switch generates a pause frame when a priority queue becomes congested. A pause frame is an Ethernet MAC control frame that tells the upstream device which Traffic Class (TC) queue is congested and for how long packet transmission for that TC should be paused. Packets belonging to other traffic classes can still be forwarded normally. Once the buffer occupancy drops below the xON threshold, the switch sends a resume signal, allowing traffic for that priority queue to continue before the actual pause timer expires.
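The xOFF/xON behavior can be modeled as a small state machine. The threshold values below are illustrative, not recommended configuration.

```python
# Minimal model of PFC pause/resume on one priority queue.
XOFF, XON = 80_000, 40_000   # buffer-occupancy thresholds in bytes (assumed)

class PfcQueue:
    def __init__(self):
        self.occupancy = 0
        self.paused = False      # pause state signaled to the upstream device

    def enqueue(self, nbytes):
        self.occupancy += nbytes
        if not self.paused and self.occupancy >= XOFF:
            self.paused = True   # emit a pause frame for this traffic class

    def drain(self, nbytes):
        self.occupancy = max(0, self.occupancy - nbytes)
        if self.paused and self.occupancy <= XON:
            self.paused = False  # emit resume before the pause timer expires

q = PfcQueue()
q.enqueue(90_000)
print(q.paused)   # True  -> upstream must stop sending this traffic class
q.drain(60_000)
print(q.paused)   # False -> occupancy fell below xON, traffic resumes
```

The gap between xOFF and xON provides hysteresis, so the queue does not oscillate between pausing and resuming on every packet.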

At first sight, PFC appears to affect only a single link and only a specific traffic class. In practice, however, a PFC pause can trigger a chain reaction across the network. For example, if the egress buffer occupancy for TC-Low on the interface toward Rank 7 on Leaf switch 1B-1 exceeds the xOFF threshold, the switch sends PFC pause frames to both connected spine switches, instructing them to temporarily hold TC-Low packets in their buffers. As the egress buffers for TC-Low on the spine switches begin to fill and the xOFF threshold is crossed, they in turn send PFC pause frames to the rest of the leaf switches.

This behavior can quickly spread congestion beyond the original point of contention. In the worst case, multiple switches and links may experience temporary pauses. Once buffer occupancy drops below the xON threshold, Leaf switch 1B-1 sends resume signals, and traffic gradually recovers as normal transmission resumes. Even though the congestion episode is short, it disrupts collective operations and negatively impacts distributed training performance.

The upcoming chapters explain how Ultra Ethernet Network-Signaled Congestion Control (NSCC) and Receiver Credit-based Congestion Control (RCCC) manage the amount of data that sources are allowed to send over the network, maximizing network utilization while avoiding congestion. The next chapters also describe how Explicit Congestion Notification (ECN), Packet Trimming, and Entropy Value-based Packet Spraying, when combined with NSCC and RCCC, contribute to a self-adjusting, reliable backend network.


Monday, 15 December 2025

UET Request–Response Packet Flow Overview

This section brings together the processes described earlier and explains the packet flow from the node perspective. A detailed network-level packet walk is presented in the following sections.

Initiator – SES Request Packet Transmission

After the Work Request Entity (WRE) and the corresponding SES and PDS headers are constructed, they are submitted to the NIC as a Work Element (WE). As part of this process, a Packet Delivery Context (PDC) is created, and the base Packet Sequence Number (PSN) is selected and encoded into the PDS header.

Once the PDC is established, it begins tracking transmitted PSNs and the acknowledgments returned by the target. For example, PSN 0x12000 is marked as transmitted.

The NIC then fetches the payload data from local memory according to the address and length information in the WRE. The NIC autonomously performs these steps without CPU intervention, illustrating the hardware offload capabilities of UET.

Next, the NIC encapsulates the data with the required protocol headers: Ethernet, IP, optional UDP, PDS, and SES, and computes the Cyclic Redundancy Check (CRC). The fully formed packet is then transmitted toward the target with Traffic Class (TC) set to Low.

Note: The Traffic Class is orthogonal to the PDC; a single PDC may carry packets transmitted with Low or High TC depending on their role (data vs control).

Figure 5-9: Initiator: SES Request Processing.

Target – SES Request Reception and PDC Handling


Figure 5-10 illustrates the target-side processing when a PDS Request carrying an SES Request is received. Unlike the initiator, the target PDS manager identifies the PDC using the tuple {source IP address, destination IP address, Source PDC Identifier (SPDCID)} to perform a lookup in its PDC mapping table.


Because no matching entry exists, the lookup results in a miss, and the target creates a new PDC. The PDC identifier (PDCID) is allocated from the General PDC pool, as indicated by the DPDCID field in the received PDS header. In this example, the target selects PDCID 0x8001.

This PDCID is subsequently used as the SPDCID when sending the PDS Ack Response (carrying the Semantic Response) back to the initiator. Any subsequent PDS Requests from the initiator reference this PDC using the same DPDCID = 0x8001, ensuring continuity of the PDC across messages.

After the PDC has been created, the UET NIC writes the received data into memory according to the SES header information. The memory placement process follows several steps:

  • Job and rank identification: The relative address in the SES header identifies the JobID (101) and the PIDonFEP (RankID 2).
  • Resource Index (RI) table lookup: The NIC consults the RI table, indexed by 0x00a, and verifies that the ri_generation field (0x01) matches the current table version. This ensures the memory region is valid and has not been re-registered.
  • Remote key validation: The NIC uses the rkey = 0xacce5 to locate the correct RI table entry and confirm permissions for writing.
  • Data placement: The data is written at base address (0xba5eadd1) + buffer_offset (0). The buffer_offset allows fragmented messages to be written sequentially without overwriting previous fragments.
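The placement steps above can be sketched in code. The table layout and function are simplified stand-ins for the NIC's internal structures, using the example values from the text.

```python
# Sketch of target-side memory placement: job/rank lookup, RI generation
# check, rkey validation, and final address computation.
RI_TABLES = {
    (101, 2): {                      # keyed by (JobID, PIDonFEP/RankID)
        "generation": 0x01,
        "entries": {0xACCE5: {"base": 0xBA5EADD1, "len": 65536, "perm": "w"}},
    }
}

def place_write(job_id, pid_on_fep, ri_generation, rkey, buffer_offset, length):
    table = RI_TABLES[(job_id, pid_on_fep)]
    if table["generation"] != ri_generation:
        raise ValueError("stale RI generation: region was re-registered")
    entry = table["entries"][rkey]            # rkey selects the RI table entry
    assert "w" in entry["perm"]               # confirm write permission
    assert buffer_offset + length <= entry["len"]
    return entry["base"] + buffer_offset      # DMA destination address

addr = place_write(101, 2, 0x01, 0xACCE5, buffer_offset=0, length=16384)
print(hex(addr))   # 0xba5eadd1 -> first fragment lands at the region base
```

A second fragment with buffer_offset = 4096 would resolve to base + 4096, landing immediately after the first payload without overwriting it.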

In Figure 5-10, the memory highlighted in orange shows the destination of the first data fragment, starting at the beginning of the registered memory region.

Note: The NIC handles all these steps autonomously, performing direct memory placement and verification, which is essential for high-performance, low-latency applications like AI and HPC workloads.

Figure 5-10: Target: Request Processing – NIC → PDS → SES → Memory.


Target – SES Response, PDS Ack Response and Packet Transmission

After completing the write operation, the UET provider uses a Semantic Response (SES Response) to notify the initiator that the operation was successful. The opcode in the SES Response header is set to UET_DEFAULT_RESPONSE, with list = UET_EXPECTED and return_code = RC_OK, indicating that the UET_WRITE operation has been executed successfully and the data has been written to target memory. Other fields, including message_id, ri_generation, JobID, and modified_length, are filled with the same values received in the SES Request, for example, message_id = 1, ri_generation = 0x01, JobID = 101, and modified_length = 16384.

Once the SES Response header is constructed, the UET provider creates a PDS Acknowledgement (PDS Ack) Response. The type is set to PDS_ACK, and the next_header field UET_HDR_RESPONSE references the SES Response type. The ack_psn_offset encodes the PSN from the received PDS Request, while the cumulative PSN (cack_psn) acknowledges all PDS Requests up to and including the current packet. The SPDCID is set to the target’s Initial PDCID (0x8001), and the DPDCID is set to the value received from the PDS Request as SPDCID (0x4001).

Finally, the PDS Ack and SES Response headers are encapsulated with Ethernet, IP, and optional UDP headers and transmitted by the NIC using High Traffic Class (TC). The High TC ensures that these control and acknowledgement messages are prioritized in the network, minimizing latency and supporting reliable flow control.

Figure 5-11: Target: Response Processing – SES → PDS → Transmit.

Initiator – SES Response and PDS Ack Response


When the initiator receives a PDS Ack Response that also carries a SES Response, it first identifies the associated Packet Delivery Context (PDC) using the DPDCID field in the PDS header. Using this PDC, the initiator updates its PSN tracking state. The acknowledged PSN—for example, 0x12000—is marked as completed and released from the retransmission tracking state, indicating that the corresponding PDS Request has been successfully delivered and processed by the target.
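The initiator's retransmission-tracking state can be sketched as follows. The class and field names are illustrative; only the PSN values come from the example.

```python
# Sketch of initiator PSN tracking: transmitted PSNs are held for possible
# retransmission until an acknowledgment releases them.
class PdcTxState:
    def __init__(self, base_psn):
        self.base_psn = base_psn
        self.unacked = set()

    def on_transmit(self, psn):
        self.unacked.add(psn)          # candidate for retransmission

    def on_cumulative_ack(self, cack_psn):
        # A cumulative ack covers every PSN up to and including cack_psn.
        self.unacked = {p for p in self.unacked if p > cack_psn}

pdc = PdcTxState(base_psn=0x12000)
for off in range(4):                   # four fragments of the message
    pdc.on_transmit(0x12000 + off)
pdc.on_cumulative_ack(0x12001)         # acks the first two fragments
print([hex(p) for p in sorted(pdc.unacked)])   # ['0x12002', '0x12003']
```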

After updating the transport-level state, the initiator extracts the SES Response and passes it to the Semantic Sublayer (SES) for semantic processing. The SES layer evaluates the response fields, including the opcode and return code, and determines that the UET_WRITE operation associated with message_id = 1 has completed successfully. As this response corresponds to the first fragment of the message, the initiator can mark that fragment as completed and, depending on the message structure, either wait for additional fragment responses or complete the overall operation. In our case, there are three more fragments to be processed.

This separation of responsibilities allows the PDS layer to manage reliability and delivery tracking, while the SES layer handles operation-level completion and status reporting.

Figure 5-12: Initiator: PDS Response & PDS Ack Processing.

Note: PDS Requests and Responses describe transport-specific parameters, such as the delivery mode (Reliable Unordered Delivery, RUD, or Reliable Ordered Delivery, ROD). In contrast, SES Requests and Responses describe semantic operations. SES Requests specify what action the target must perform, for example, writing data and the exact memory location for that operation, while SES Responses inform the initiator whether the operation completed successfully. In some flow diagrams, SES messages are shown as flowing between the SES and PDS layers, while PDS messages are shown as flowing between the PDS layers of the initiator and the target.



Wednesday, 10 December 2025

UET Protocol: How the NIC constructs a packet from the Work Entries (WRE+SES+PDS)

Semantic Sublayer (SES) Operation

[Rewritten 12 Dec 2025]

After a Work Request Entity (WRE) is created, the UET provider generates the parameters needed by the Semantic Sublayer (SES) headers. At this stage, the SES does not construct the actual wire header. Instead, it provides the header parameters, which are later used by the Packet Delivery Context (PDC) state machine to construct the final SES wire header, as explained in the upcoming PDC section. These parameters ensure that all necessary information about the message, including addressing and size, is available for later stages of processing.

Fragmentation Due to Guaranteed Buffer Limits

In our example, the data to be written to the remote GPU is 16 384 bytes. The dual-port NIC in Figure 5-5 has a total memory capacity of 16 384 bytes, divided into three regions: a 4 096-byte guaranteed buffer for each port (Eth0 and Eth1) and an 8 192-byte shared memory pool available to both ports. Because gradient synchronization requires lossless delivery, all data must fit within the guaranteed buffer region. The shared memory pool cannot be used, as its buffer space is not guaranteed.

Since the message exceeds the size of the guaranteed buffer, it must be fragmented. The UET provider splits the 16 384-byte message into four 4 096-byte sub-messages, as illustrated in Figure 5‑6. Fragmentation ensures that each piece of the message can be transmitted reliably within the available guaranteed buffer space.

Figure 5‑5 also illustrates the main object abstractions in UET. The Domain object represents and manages the NIC, while the Endpoint defines a logical connection between the application and the NIC. The max_msg_size parameter specifies the largest data payload that can be transferred in a single operation. In our example, the NIC’s egress buffer can hold 4 096 bytes, but the application needs to write 16 384 bytes. As a result, the data must be split into multiple smaller chunks, each fitting within the buffer and max_msg_size limits.


Figure 5-5: RMA Operation – NIC’s Buffers and Data Size.

Core SES Parameters


Each of the four fragments in Figure 5-6 carries the same msg_id = 1, allowing the target to recognize them as parts of the same original message. The first fragment sets ses.som = 1 (start of message), while the last fragment sets ses.eom = 1 (end of message). The two middle fragments have both ses.som and ses.eom set to 0. In addition to these boundary markers, the SES parameters also define the source and destination FEPs, the Delivery Mode (RUD – Reliable Unordered Delivery for an fi_write operation), the Job ID (101), the Traffic Class (Low = 0), and the Packet Length (4 096 bytes). The pds.next_hdr field determines which SES base header format the receiver must parse next. For an fi_write operation, a standard SES Request header is used (UET_HDR_REQUEST_STD = 0x3).
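The fragmentation and boundary-marker logic described above can be sketched as a small helper. The function and field names are illustrative; the som/eom semantics follow the text.

```python
# Split a message into fixed-size fragments and derive the SES boundary
# markers: som on the first fragment, eom on the last.
def fragment(msg_id, total_len, frag_len):
    frags = []
    for offset in range(0, total_len, frag_len):
        frags.append({
            "msg_id": msg_id,                 # shared by all fragments
            "offset": offset,
            "length": min(frag_len, total_len - offset),
            "som": 1 if offset == 0 else 0,                     # start of message
            "eom": 1 if offset + frag_len >= total_len else 0,  # end of message
        })
    return frags

frags = fragment(msg_id=1, total_len=16_384, frag_len=4_096)
print([(f["som"], f["eom"]) for f in frags])   # [(1, 0), (0, 0), (0, 0), (0, 1)]
```

The shared msg_id lets the target reassemble the four fragments into the original 16 384-byte message regardless of arrival order under RUD.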



Figure 5-6: RMA Operation – Semantic Sublayer: SES Parameters.


Packet Delivery Sublayer (PDS) Operation

PDS Manager


Once the SES parameters are defined, the provider issues the ses_pds_tx_req() request to the PDS Manager (Figure 5-7). The PDS Manager examines the tuple {Job ID, Destination FEP, Traffic Class, Request Mode} to determine whether a Packet Delivery Context (PDC) already exists for that combination.

Because this request corresponds to the first of the four sub-messages, no PDC exists yet. The PDS Manager selects the first available PDC Identifier from the pre-allocated PDC pool. The pool—created during job initialization—consists of a general PDC pool (used for operations such as fi_write and fi_send) and a reserved PDC pool for high-priority traffic such as Ack messages from the target to the initiator. In our example, the first free PDC ID from the general pool is 0x4001.
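The lookup-or-allocate behavior of the PDS Manager can be sketched as follows. The pool ranges and class shape are illustrative assumptions built around the example's PDC ID 0x4001.

```python
# Sketch of the PDS Manager: a PDC is keyed by {Job ID, Destination FEP,
# Traffic Class, Request Mode}; a miss allocates from the general pool.
class PdsManager:
    def __init__(self):
        self.pdc_table = {}                        # tuple -> PDC ID
        self.general_pool = iter(range(0x4001, 0x5000))   # e.g. fi_write/fi_send
        self.reserved_pool = iter(range(0x6001, 0x7000))  # high-priority traffic

    def get_pdc(self, job_id, dst_fep, tc, mode):
        key = (job_id, dst_fep, tc, mode)
        if key not in self.pdc_table:              # miss -> open a new PDC
            self.pdc_table[key] = next(self.general_pool)
        return self.pdc_table[key]

mgr = PdsManager()
first = mgr.get_pdc(101, "FEP-2", "low", "RUD")    # first sub-message: allocate
rest = mgr.get_pdc(101, "FEP-2", "low", "RUD")     # later sub-messages: reuse
print(hex(first), hex(rest))   # 0x4001 0x4001 -> same PDC for all fragments
```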

After opening the PDC, the PDS Manager associates all subsequent requests carrying the same msg_id = 1 with PDC 0x4001.

PDC State Machine


After selecting a PDC Identifier, the PDS Manager forwards the request to the PDC State Machine. This component assigns the base Packet Sequence Number (PSN), which will be used in the PSN field of the first and upcoming fragments. Next, it constructs the PDS and SES headers for the message.

Constructing PDS Header


The Type field in the PDS header in Figure 5-7 is set to 2, indicating Reliable Unordered Delivery (RUD). Reliability is ensured by the Acknowledgement Required (ar) flag, which is mandatory for this transport mode. The Next Header (pds.next_hdr) field specifies that the following header is the Standard SES Request header (0x3 = UET_HDR_REQUEST_STD).

The syn flag remains set until the first PSN is acknowledged by the target (explained in upcoming chapters). Using the PSNs reported in ACK messages, the initiator can determine which packets have been successfully delivered.

A dynamically established PDC works as a communication channel between endpoints. For the channel to operate, both the initiator and the target must use the same PDC type (General or Reserved) and must agree on which local and remote PDCs are used for the exchange. In Figure 5-7, the Source PDC Identifier (SPDCID) in the PDS header is derived from the Initiator PDCID (IPDCID). At this stage, the initiator does not know the Destination PDCID (DPDCID) because the communication channel is still in the SYN state (initializing). Instead of setting a DPDCID value, the initiator simply indicates the type of PDC the target should create for the connection (pdc = rsv). It also specifies the PSN offset (psn_offset = 0), indicating that the offset for this request is zero (i.e., using the base PSN value). The target opens its PDC for the connection after receiving the first packet and when generating the acknowledgment response.


Constructing SES Header


The SES header (Figure 5-7) provides the target NIC with information about which job and which process the incoming message is destined for. It also carries a Resource Index table pointer (0x00a) and an rkey (0xacce5) that together identify the entry in the RI table where the NIC can find the target’s local memory region for writing the received data.

The Opcode field (UET_WRITE) instructs the NIC how to handle the data. The rel flag (relative addressing) is set for parallel jobs to indicate that relative addressing mode is used.

Using the relative addressing mode, the NIC resolves the memory location by traversing a hierarchical address structure: Fabric Address (FA), Job ID, Fabric Endpoint (FEP), Resource Index (RI).

  • The FA, carried as the destination IP address in the IP header, selects the correct FEP when the system has multiple FEPs.
  • The Job ID identifies the job to which the packet belongs.
  • The PIDonFEP identifies the process participating in the job (for parallel jobs, PIDonFEP = rankID).
  • The RI value points to the correct entry in the RI table.

Relative addressing identifies which RI table belongs to the job on the target FEP, but not which entry inside that table. The rkey in the SES header provides this missing information: it selects the exact RI table entry that describes the registered memory region.

While the rkey selects the correct RI entry, the buffer_offset field specifies where inside the memory region the data should be written, relative to the region’s base address. In Figure 5-8, the first fragment writes at offset 0, and the second fragment (not shown) starts at offset 4 096, immediately after the first payload.

The ri_generation field (e.g., 0x01) indicates the version of the RI table in use. This is necessary because the table may be updated while the job is running. The hd (header-data-present) bit indicates whether the header_data field is included. This is useful when multiple gradient buckets must be synchronized, because each bucket can be identified by an ID in header_data (for example, GPU 0 might use bucket_id = 11 for the first chunk and bucket_id = 12 for the second). The initiator_id field specifies the initiator’s PIDonFEP (i.e., rankID).

Finally, note that the SES Standard header has several variants. For example:

  • If ses.som = 1, a Header Data field is present.
  • If ses.eom = 1, the Payload Length and Message Offset fields are included.





Figure 5-8: Packet Delivery Sublayer: PDC and PDS Header.

Work Element (WE)


Once the SES and PDS headers are created, they are inserted, along with the WRE, into the NIC’s Transmit Queue (TxQ) as a Work Element (WE). Figure 5‑9 illustrates the three components of a WE. The NIC fetches the data from local memory based on the instructions described in the WRE, wraps it with Ethernet, IP, optional UDP, PDS, and SES headers, and calculates the Cyclic Redundancy Check (CRC) for unencrypted packets. The packet is then ready for transmission.

The NIC can support multiple Transmit Queues. In our example, there are two: one for Traffic Class Low and another for Traffic Class High. The example WE is placed into the Low TxQ. Work Element 1 (WE1) corresponds to the first ses_pds_tx_req() request, completing the end-to-end flow from WRE creation to packet transmission.


Figure 5-9: UET NIC: Packetization, Queueing & Transport.