Wednesday, 10 December 2025

UET Protocol: How the NIC constructs a packet from the Work Entries (WRE+SES+PDS)

 Semantic Sublayer (SES) Operation 

[Rewritten 12 Dec 2025]

After a Work Request Entity (WRE) is created, the UET provider generates the parameters needed by the Semantic Sublayer (SES) headers. At this stage, the SES does not construct the actual wire header. Instead, it provides the header parameters, which are later used by the Packet Delivery Context (PDC) state machine to construct the final SES wire header, as explained in the upcoming PDC section. These parameters ensure that all necessary information about the message, including addressing and size, is available for later stages of processing.

Fragmentation Due to Guaranteed Buffer Limits

In our example, the data to be written to the remote GPU is 16 384 bytes. The dual-port NIC in Figure 5-5 has a total memory capacity of 16 384 bytes, divided into three regions: a 4 096-byte guaranteed buffer for each of the two ports (Eth0 and Eth1) and an 8 192-byte shared memory pool available to both ports. Because gradient synchronization requires lossless delivery, all data must fit within the guaranteed buffer region. The shared memory pool cannot be used, as its buffer space is not guaranteed.

Since the message exceeds the size of the guaranteed buffer, it must be fragmented. The UET provider splits the 16 384-byte message into four 4 096-byte sub-messages, as illustrated in Figure 5‑6. Fragmentation ensures that each piece of the message can be transmitted reliably within the available guaranteed buffer space.
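The split described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not provider code: the helper name fragment_message and the constant GUARANTEED_BUFFER are hypothetical, and the sizes are taken from the example.

```python
GUARANTEED_BUFFER = 4096  # guaranteed per-port buffer from the example (bytes)


def fragment_message(total_len: int, max_fragment: int = GUARANTEED_BUFFER):
    """Return (offset, length) pairs that cover total_len bytes.

    Hypothetical helper: every fragment is at most max_fragment bytes,
    and only the last one may be shorter.
    """
    if total_len < 0 or max_fragment <= 0:
        raise ValueError("total_len must be >= 0 and max_fragment > 0")
    return [
        (offset, min(max_fragment, total_len - offset))
        for offset in range(0, total_len, max_fragment)
    ]


fragments = fragment_message(16384)
print(fragments)
# [(0, 4096), (4096, 4096), (8192, 4096), (12288, 4096)]
```

A message that is not a multiple of the buffer size simply ends with a shorter final fragment, for example fragment_message(5000) yields one 4 096-byte piece and one 904-byte piece.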

Figure 5‑5 also illustrates the main object abstractions in UET. The Domain object represents and manages the NIC, while the Endpoint defines a logical connection between the application and the NIC. The max_msg_size parameter specifies the largest data payload that can be transferred in a single operation. In our example, the NIC’s egress buffer can hold 4 096 bytes, but the application needs to write 16 384 bytes. As a result, the data must be split into multiple smaller chunks, each fitting within the buffer and max_msg_size limits.


Figure 5-5: RMA Operation – NIC’s Buffers and Data Size.

Core SES Parameters


Each of the four fragments in Figure 5-6 carries the same msg_id = 1, allowing the target to recognize them as parts of the same original message. The first fragment sets ses.som = 1 (start of message), while the last fragment sets ses.eom = 1 (end of message). The two middle fragments have both ses.som and ses.eom set to 0. In addition to these boundary markers, the SES parameters also define the source and destination FEPs, the Delivery Mode (RUD – Reliable Unordered Delivery for an fi_write operation), the Job ID (101), the Traffic Class (Low = 0), and the Packet Length (4 096 bytes). The pds.next_hdr field determines which SES base header format the receiver must parse next. For an fi_write operation, a standard SES Request header is used (UET_HDR_REQUEST_STD = 0x3).
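To make the boundary markers concrete, the following sketch generates one parameter set per fragment. The SesParams class and build_ses_params function are hypothetical names used only for illustration; the field values (msg_id, Job ID, Traffic Class, packet length, next header) come from the example above, and the source and destination FEPs are omitted for brevity.

```python
from dataclasses import dataclass

UET_HDR_REQUEST_STD = 0x3  # next-header value quoted in the text


@dataclass(frozen=True)
class SesParams:
    """Illustrative container for per-fragment SES parameters (not a wire format)."""
    msg_id: int
    som: int            # 1 only on the first fragment
    eom: int            # 1 only on the last fragment
    job_id: int
    traffic_class: int
    packet_len: int
    delivery_mode: str = "RUD"
    next_hdr: int = UET_HDR_REQUEST_STD


def build_ses_params(msg_id, job_id, total_len, frag_len, traffic_class=0):
    """Create one SesParams per fragment, marking start and end of message."""
    count = -(-total_len // frag_len)  # ceiling division
    params = []
    for i in range(count):
        length = min(frag_len, total_len - i * frag_len)
        params.append(
            SesParams(
                msg_id=msg_id,
                som=int(i == 0),
                eom=int(i == count - 1),
                job_id=job_id,
                traffic_class=traffic_class,
                packet_len=length,
            )
        )
    return params


ses_params = build_ses_params(msg_id=1, job_id=101, total_len=16384, frag_len=4096)
for p in ses_params:
    print(p.msg_id, p.som, p.eom, p.packet_len)
```

All four entries share msg_id = 1; only the som and eom bits differ between them.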



Figure 5-6: RMA Operation – Semantic Sublayer: SES Parameters.


Packet Delivery Sublayer (PDS) Operation

PDS Manager


Once the SES parameters are defined, the provider issues the ses_pds_tx_req() request to the PDS Manager (Figure 5-7). The PDS Manager examines the tuple {Job ID, Destination FEP, Traffic Class, Request Mode} to determine whether a Packet Delivery Context (PDC) already exists for that combination.

Because this request corresponds to the first of the four sub-messages, no PDC exists yet. The PDS Manager selects the first available PDC Identifier from the pre-allocated PDC pool. The pool, which is created during job initialization, consists of a general PDC pool (used for operations such as fi_write and fi_send) and a reserved PDC pool for high-priority traffic such as Ack messages from the target to the initiator. In our example, the first free PDC ID from the general pool is 0x4001.

After opening the PDC, the PDS Manager associates all subsequent requests that carry the same msg_id = 1 (the remaining three sub-messages) with PDC 0x4001.
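The lookup-or-allocate behaviour of the PDS Manager can be modelled as a dictionary keyed by the tuple described above. This is a simplified sketch: the class PdsManager, its method handle_tx_req (standing in for the handling of ses_pds_tx_req()), the destination FEP strings, and the ID ranges other than 0x4001 are all assumptions made for illustration.

```python
class PdsManager:
    """Toy model: maps {Job ID, Destination FEP, Traffic Class, Request Mode} to a PDC ID."""

    def __init__(self, general_ids, reserved_ids):
        # Pools are pre-allocated; here they are plain lists of free IDs.
        self.free = {"general": list(general_ids), "reserved": list(reserved_ids)}
        self.open_pdcs = {}  # tuple key -> PDC ID

    def handle_tx_req(self, job_id, dst_fep, traffic_class, mode, pool="general"):
        key = (job_id, dst_fep, traffic_class, mode)
        if key not in self.open_pdcs:
            if not self.free[pool]:
                raise RuntimeError(f"no free PDC in the {pool} pool")
            # Take the first available PDC Identifier from the pool.
            self.open_pdcs[key] = self.free[pool].pop(0)
        return self.open_pdcs[key]


# Hypothetical pool contents; only 0x4001 comes from the example.
mgr = PdsManager(general_ids=range(0x4001, 0x4005), reserved_ids=range(0x0001, 0x0003))

first = mgr.handle_tx_req(101, "fep-target", 0, "RUD")   # opens a new PDC
second = mgr.handle_tx_req(101, "fep-target", 0, "RUD")  # reuses the same PDC
print(hex(first), hex(second))
```

A request with a different tuple, for example another destination FEP, would take the next free ID from the pool instead of reusing the existing PDC.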

PDC State Machine


After selecting a PDC Identifier, the PDS Manager forwards the request to the PDC State Machine. This component assigns the base Packet Sequence Number (PSN), which will be used in the PSN field of the first and upcoming fragments. Next, it constructs the PDS and SES headers for the message.

Constructing PDS Header


The Type field in the PDS header in Figure 5-7 is set to 2, indicating Reliable Unordered Delivery (RUD). Reliability is ensured by the Acknowledgement Required (ar) flag, which is mandatory for this transport mode. The Next Header (pds.next_hdr) field specifies that the following header is the Standard SES Request header (0x3 = UET_HDR_REQUEST_STD).

The syn flag remains set until the first PSN is acknowledged by the target (explained in upcoming chapters). Using the PSNs reported in ACK messages, the initiator can determine which packets have been successfully delivered.

A dynamically established PDC works as a communication channel between endpoints. For the channel to operate, both the initiator and the target must use the same PDC type (General or Reserved) and must agree on which local and remote PDCs are used for the exchange. In Figure 5-7, the Source PDC Identifier (SPDCID) in the PDS header is derived from the Initiator PDCID (IPDCID). At this stage, the initiator does not know the Destination PDCID (DPDCID) because the communication channel is still in the SYN (initializing) state. Instead of setting a DPDCID value, the initiator simply indicates the type of PDC the target should create for the connection (pdc = rsv). It also specifies the PSN offset (psn_offset = 0), indicating that this request uses the base PSN value. The target opens its PDC for the connection after receiving the first packet, when it generates the acknowledgment response.
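The difference between a PDS header sent in the SYN state and one sent after the target's PDC is known can be sketched as follows. This is a rough model, not the wire format: the PdsHeader class and build_pds_header function are hypothetical, the base PSN (1000) and the DPDCID (0x5001) are invented values, and the sketch assumes that the PSN increases by one per packet and that psn_offset is the distance from the base PSN.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PdsHeader:
    """Illustrative field set only; not the real bit layout."""
    type: int                       # 2 = RUD in the example
    next_hdr: int                   # 0x3 = UET_HDR_REQUEST_STD
    ar: int                         # acknowledgement required
    syn: int                        # 1 while the PDC is still being established
    psn: int
    spdcid: int                     # derived from the initiator's PDC ID
    dpdcid: Optional[int] = None    # unknown until the target answers
    pdc_type: Optional[str] = None  # requested target PDC type (SYN state only)
    psn_offset: Optional[int] = None


def build_pds_header(ipdcid, base_psn, index, dpdcid=None, pdc_type="rsv"):
    """Build the header for packet number `index` (0-based) on this PDC."""
    psn = base_psn + index  # assumption: PSN advances by one per packet
    if dpdcid is None:
        # SYN state: no DPDCID yet, so send the PDC type and the PSN offset instead.
        return PdsHeader(type=2, next_hdr=0x3, ar=1, syn=1, psn=psn,
                         spdcid=ipdcid, pdc_type=pdc_type, psn_offset=index)
    return PdsHeader(type=2, next_hdr=0x3, ar=1, syn=0, psn=psn,
                     spdcid=ipdcid, dpdcid=dpdcid)


syn_hdr = build_pds_header(ipdcid=0x4001, base_psn=1000, index=0)
est_hdr = build_pds_header(ipdcid=0x4001, base_psn=1000, index=1, dpdcid=0x5001)
print(syn_hdr)
print(est_hdr)
```

The first header carries syn = 1 and psn_offset = 0 and leaves dpdcid empty, matching the first packet in the example; the second shows what a later packet could look like once the target's PDC ID has been learned from an ACK.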


Constructing SES Header


The SES header (Figure 5-7) provides the target NIC with information about which job and which process the incoming message is destined for. It also carries a Resource Index table pointer (0x00a) and an rkey (0xacce5) that together identify the entry in the RI table where the NIC can find the target’s local memory region for writing the received data.

The Opcode field (UET_WRITE) instructs the NIC how to handle the data. The rel flag is set for parallel jobs and indicates that relative addressing mode is in use.

Using the relative addressing mode, the NIC resolves the memory location by traversing a hierarchical address structure: Fabric Address (FA), Job ID, PIDonFEP, Resource Index (RI).

  • The FA, carried as the destination IP address in the IP header, selects the correct FEP when the system has multiple FEPs.
  • The Job ID identifies the job to which the packet belongs.
  • The PIDonFEP identifies the process participating in the job (for parallel jobs, PIDonFEP = rankID).
  • The RI value points to the correct entry in the RI table.

Relative addressing identifies which RI table belongs to the job on the target FEP, but not which entry inside that table. The rkey in the SES header provides this missing information: it selects the exact RI table entry that describes the registered memory region.

While the rkey selects the correct RI entry, the buffer_offset field specifies where inside the memory region the data should be written, relative to the region’s base address. In Figure 5-8, the first fragment writes at offset 0, and the second fragment (not shown) starts at offset 4 096, immediately after the first payload.
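The lookup chain and the offset arithmetic can be summarized in a small sketch. It is a simplified model under stated assumptions: the nested-dictionary layout, the function resolve_write_address, the PIDonFEP value 0, and the region base address 0x10000 are hypothetical, and the sketch treats the RI value and the rkey together as the key for the table entry. Only Job ID 101, RI 0x00a, rkey 0xacce5, and the 4 096-byte offsets come from the example.

```python
# (job_id, pid_on_fep) -> RI table; each table maps (ri, rkey) -> memory region
ri_tables = {
    (101, 0): {
        (0x00A, 0xACCE5): {"base": 0x10000, "length": 16384},
    },
}


def resolve_write_address(job_id, pid_on_fep, ri, rkey, buffer_offset, length):
    """Return the address where `length` bytes should be written."""
    table = ri_tables[(job_id, pid_on_fep)]   # relative addressing picks the table
    region = table[(ri, rkey)]                # RI and rkey pick the entry
    if buffer_offset < 0 or buffer_offset + length > region["length"]:
        raise ValueError("write would fall outside the registered region")
    return region["base"] + buffer_offset     # offset is relative to the region base


first_addr = resolve_write_address(101, 0, 0x00A, 0xACCE5, buffer_offset=0, length=4096)
second_addr = resolve_write_address(101, 0, 0x00A, 0xACCE5, buffer_offset=4096, length=4096)
print(hex(first_addr), hex(second_addr))
# 0x10000 0x11000
```

The four fragments of the example land back to back at offsets 0, 4 096, 8 192, and 12 288 from the region's base address.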

The ri_generation field (e.g., 0x01) indicates the version of the RI table in use. This is necessary because the table may be updated while the job is running. The hd (header-data-present) bit indicates whether the header_data field is included. This is useful when multiple gradient buckets must be synchronized, because each bucket can be identified by an ID in header_data (for example, GPU 0 might use bucket_id = 11 for the first chunk and bucket_id = 12 for the second). The initiator_id field specifies the initiator’s PIDonFEP (i.e., rankID).

Finally, note that the SES Standard header has several variants. For example:

  • If ses.som = 1, a Header Data field is present.
  • If ses.eom = 1, the Payload Length and Message Offset fields are included.





Figure 5-8: Packet Delivery Sublayer: PDC and PDS Header.

Work Element (WE)


Once the SES and PDS headers are created, they are inserted, along with the WRE, into the NIC’s Transmit Queue (TxQ) as a Work Element (WE). Figure 5‑9 illustrates the three components of a WE. The NIC fetches the data from local memory based on the instructions described in the WRE, wraps it with Ethernet, IP, optional UDP, PDS, and SES headers, and calculates the Cyclic Redundancy Check (CRC) for unencrypted packets. The packet is then ready for transmission.

The NIC can support multiple Transmit Queues. In our example, there are two: one for Traffic Class Low and another for Traffic Class High. The example WE is placed into the Low TxQ. Work Element 1 (WE1) corresponds to the first ses_pds_tx_req() request, completing the end-to-end flow from WRE creation to packet transmission.
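As a rough illustration of the queueing step, a Work Element can be modelled as a bundle of the WRE and the two headers, placed into a per-traffic-class FIFO. The WorkElement class, the enqueue function, and the dictionary contents are hypothetical placeholders; the sketch does not model how the NIC schedules between the Low and High queues.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class WorkElement:
    """The three components of a WE: the WRE plus the SES and PDS headers."""
    wre: dict
    ses_header: dict
    pds_header: dict


# One transmit queue per traffic class, as in the example.
tx_queues = {"low": deque(), "high": deque()}


def enqueue(we: WorkElement, traffic_class: str) -> None:
    """Append a Work Element to the TxQ of its traffic class."""
    tx_queues[traffic_class].append(we)


# Placeholder contents; the real structures carry many more fields.
we1 = WorkElement(
    wre={"op": "fi_write", "length": 4096},
    ses_header={"opcode": "UET_WRITE", "som": 1, "eom": 0},
    pds_header={"type": 2, "syn": 1, "spdcid": 0x4001},
)
enqueue(we1, "low")

next_we = tx_queues["low"].popleft()  # the NIC takes the oldest WE first
print(next_we.ses_header["opcode"], len(tx_queues["low"]))
```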


Figure 5-9: UET NIC: Packetization, Queueing & Transport.


