Semantic Sublayer (SES) Operation
[Edited 11. Dec-2025]
After a Work Request Entity (WRE) is created, the UET provider generates the parameters needed by the Semantic Sublayer (SES) headers. At this stage, the SES does not construct the actual wire header. Instead, it provides the header parameters, which are later used by the Packet Delivery Context (PDC) state machine to construct the final SES wire header, as explained in the upcoming PDC section. These parameters ensure that all necessary information about the message, including addressing and size, is available for later stages of processing.
Fragmentation Due to Guaranteed Buffer Limits
In our example, the data to be written to the remote GPU is 16 384 bytes. The dual-port NIC has a total memory capacity of 16 384 bytes, divided into three regions: a 4 096-byte guaranteed per-port buffer for Eth0 and Eth1, and an 8 192-byte shared memory pool available to both ports. Because gradient synchronization requires lossless delivery, all data must fit within the guaranteed buffer region. The shared memory pool cannot be used, as its buffer space is not guaranteed.
Since the message exceeds the size of the guaranteed buffer, it must be fragmented. The UET provider splits the 16 384-byte message into four 4 096-byte sub-messages, as illustrated in Figures 5‑5 and 5‑6. Fragmentation ensures that each piece of the message can be transmitted reliably within the available guaranteed buffer space.
Figure 5‑5 also illustrates the main object abstractions in UET. The Domain object represents and manages the NIC, while the Endpoint defines a logical connection between the application and the NIC. The max_msg_size parameter specifies the largest data payload that can be transferred in a single operation. In our example, the NIC’s egress buffer can hold 4 096 bytes, but the application needs to write 16 384 bytes. As a result, the data must be split into multiple smaller chunks, each fitting within the buffer and max_msg_size limits.
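The fragmentation arithmetic above can be sketched as follows. The function name and the fixed 4 096-byte limit are illustrative stand-ins, not part of the UET specification or the provider's API:

```python
# Sketch: splitting a message into sub-messages that fit the
# guaranteed per-port buffer (illustrative, not the UET provider's API).

GUARANTEED_BUFFER = 4096  # bytes of guaranteed (lossless) buffer per port

def fragment(message_size: int, limit: int = GUARANTEED_BUFFER) -> list[int]:
    """Return the payload size of each sub-message."""
    sizes = []
    remaining = message_size
    while remaining > 0:
        chunk = min(remaining, limit)  # each piece fits the guaranteed buffer
        sizes.append(chunk)
        remaining -= chunk
    return sizes

# A 16 384-byte gradient buffer yields four 4 096-byte sub-messages.
print(fragment(16384))  # [4096, 4096, 4096, 4096]
```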
Figure 5-5: RMA Operation – NIC’s Buffers and Data Size.
Message Identification and Sequencing
Each sub-message carries the same msg_id = 1 parameter, which allows the target to identify them as parts of the same original message. The SES header also marks message boundaries: the first fragment has ses.som = 1 (start of message), the last fragment has ses.eom = 1 (end of message), and the two middle fragments have both ses.som = 0 and ses.eom = 0.
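The boundary marking described above can be sketched as a small helper. The field names echo msg_id, ses.som, and ses.eom from the text, but the dictionary layout is purely illustrative:

```python
def boundary_flags(num_fragments: int) -> list[dict]:
    """Assign ses.som / ses.eom markers for each fragment of one message."""
    flags = []
    for i in range(num_fragments):
        flags.append({
            "msg_id": 1,                                # same for every fragment
            "som": 1 if i == 0 else 0,                  # start of message
            "eom": 1 if i == num_fragments - 1 else 0,  # end of message
        })
    return flags

# Four fragments: the first has som=1, the last has eom=1,
# and the two middle fragments carry neither flag.
for f in boundary_flags(4):
    print(f)
```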
The target SES determines where the data should be written based on the relative address information in the SES header. This relative address includes the Fabric Address (FA), JobID, PIDonFEP, and Resource Index (RI). The RI identifies an entry in the Resource Index table corresponding to a set of memory regions registered for the operation. If multiple memory regions exist under the same RI, the rkey in the SES header selects the exact memory region. Finally, the base address from the selected RI entry is combined with the offset from the SES header to locate the precise memory location where the data should be written.
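The target-side lookup can be sketched as nested table lookups ending in a base-plus-offset calculation. The nested dictionaries stand in for hardware tables; the base address and table contents are invented for illustration:

```python
# Sketch of the target NIC's address resolution. The nested dictionaries
# stand in for hardware tables; keys and the base address are illustrative.

RI_TABLES = {
    # (job_id, pid_on_fep) -> Resource Index table
    (101, 2): {
        0x00A: {                    # Resource Index entry
            0xACCE5: 0x7F00_0000,   # rkey -> base address of a memory region
        },
    },
}

def resolve(job_id, pid_on_fep, ri, rkey, offset):
    """Combine the selected RI entry's base address with the SES offset."""
    base = RI_TABLES[(job_id, pid_on_fep)][ri][rkey]
    return base + offset

# The second fragment of the 16 384-byte write lands 4 096 bytes past the base.
addr = resolve(101, 2, 0x00A, 0xACCE5, 4096)
print(hex(addr))  # 0x7f001000
```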
Core SES Parameters
In addition to boundary information, the SES parameters define the source and destination FEPs, the Delivery Mode (RUD: Reliable Unordered Delivery for an fi_write operation), the Job ID (101), the Traffic Class (Low = 0), and the Packet Length (4 096 bytes). The pds.next_hdr field determines which SES base header format the receiver must parse next.
In this example, the application issues an fi_write RMA request, and the data size requires fragmentation across several packets. This operation uses the standard SES Request header (0x3 = UET_HDR_REQUEST_STD). Note that the exact SES Standard header format depends on whether ses.som is set to 1 or 0, because the Start-of-Message bit determines which fields are included in the header.
The exact location within the memory space is indicated by the memory_offset (or buffer offset) field, specifying the starting point relative to the base address where the data should be written.
Each sub-message, along with its corresponding Work Request Entity (WRE), is then passed to the Packet Delivery Sublayer (PDS) Manager via ses_pds_tx_req(), one request per sub-message. At this stage, the SES layer does not handle ordering; its role is to provide the necessary information so the target knows the memory space and offset for each fragment.
Figure 5-6: RMA Operation – Semantic Sublayer: SES Parameters.
Packet Delivery Sublayer (PDS) Operation
PDS Manager
Once the SES parameters are defined, the provider issues the ses_pds_tx_req() request to the PDS Manager. The PDS Manager examines the tuple {JobID, Destination FEP, Traffic Class, Request Mode} to determine whether a Packet Delivery Context (PDC) already exists for that combination.
Since this request corresponds to the first of the four sub-messages, no PDC exists yet. The PDS Manager selects the first available PDC Identifier from the pre-allocated PDC pool. The pool, created during job initialization, consists of a general PDC pool for operations such as fi_write and fi_send, and a reserved PDC pool for high-priority traffic, such as control-plane messages.
In our example, the first free PDC ID from the general pool is 0x4001. After opening the PDC, the PDS Manager associates all subsequent requests carrying the same msg_id with this PDC, ensuring that all sub-messages are delivered within the same context.
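A minimal sketch of the PDC Manager's lookup, assuming a simple dictionary keyed by the four-tuple; the pool contents, IDs, and function name are illustrative:

```python
# Sketch: PDC Manager resolving a tuple to a Packet Delivery Context.
# Pool layout and PDC IDs are illustrative.

general_pool = [0x4001, 0x4002, 0x4003]   # pre-allocated at job initialization
active_pdcs = {}                           # tuple -> PDC ID

def get_pdc(job_id, dest_fep, traffic_class, mode):
    key = (job_id, dest_fep, traffic_class, mode)
    if key not in active_pdcs:             # first sub-message: open a new PDC
        active_pdcs[key] = general_pool.pop(0)
    return active_pdcs[key]                # later sub-messages reuse the PDC

# All four sub-messages of our fi_write map to the same PDC, 0x4001.
first = get_pdc(101, "10.0.1.12", 0, "RUD")
again = get_pdc(101, "10.0.1.12", 0, "RUD")
print(hex(first), first == again)  # 0x4001 True
```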
PDC State Machine
After selecting a PDC Identifier, the PDS Manager forwards the request to the PDC State Machine. This component assigns the base Packet Sequence Number (PSN), which is used as the PSN of the first fragment and the starting point for the fragments that follow. Next, it constructs the PDS and SES headers for the message.
Constructing PDC Header
The Type field in the PDS header is set to 2, indicating Reliable Unordered Delivery (RUD). Reliability is ensured by the Acknowledge Required (ar) flag, which is mandatory for this transport mode. The Next Header (pds.next_hdr) field specifies that the following header is the Standard SES Request header (0x3 = UET_HDR_REQUEST_STD). Note that the exact format of this SES Standard header depends on the Start-of-Message (ses.som) bit: when ses.som = 1, the header includes additional fields for the start of the message, whereas when ses.som = 0, the header is shorter for continuation packets.
The syn flag remains set until the first PSN is acknowledged by the target. Using the PSNs reported in ACK messages, the initiator can determine which packets have been successfully delivered.
Because the PDC tracks PSNs and provides reliable delivery, both the initiator and target must indicate which PDC-to-PDC context each packet belongs to. The Source PDC Identifier (SPDCID) is derived from the Initiator PDCID (IPDCID) assigned for this FEP-to-FEP connection. The Destination PDCID (DPDCID) does not yet exist on the target at this stage because the communication channel is still in the SYN state (initializing); the target opens its own PDC, just as the initiator did, after receiving the first packet and when generating the acknowledgment response. In the meantime, the DPDCID field in the first packet instructs the target to select its IPDCID from the general PDC pool and carries the PSN offset, which is zero for the first packet.
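Putting the fields above together, the first packet's PDS header could be sketched as follows. The field names mirror the text (type, ar, next_hdr, syn, spdcid, dpdcid), but the dictionary encoding is purely illustrative and not the wire format:

```python
def build_pds_header(psn: int, spdcid: int, first_packet: bool) -> dict:
    """Illustrative PDS header for an RUD packet (not the wire encoding)."""
    return {
        "type": 2,        # Reliable Unordered Delivery (RUD)
        "ar": 1,          # acknowledgment required (mandatory for RUD)
        "next_hdr": 0x3,  # UET_HDR_REQUEST_STD follows
        "syn": 1 if first_packet else 0,  # cleared once the first PSN is ACKed
        "psn": psn,
        "spdcid": spdcid,  # derived from the initiator's PDCID, e.g. 0x4001
        "dpdcid": 0,       # SYN state: no target PDC yet; carries PSN offset 0
    }

hdr = build_pds_header(psn=1000, spdcid=0x4001, first_packet=True)
print(hdr["type"], hdr["syn"], hex(hdr["spdcid"]))  # 2 1 0x4001
```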
Constructing SES Header
The SES header provides the target NIC with information about which job and process the message is destined for, and the specific memory location where the data should be written. Essentially, it acts as an intra-node routing header, guiding the data to its final location.
The first key field is the operation code (UET_WRITE), which instructs the NIC how to handle the data. The rel flag (relative addressing), when set, indicates that the address is relative and composed of the Fabric Address (FA), JobID, PIDonFEP, and Resource Index (RI). For our example, the FA is 10.0.1.12, the JobID is 101, PIDonFEP is rank 2, and the Resource Index is 0x00a.
The NIC uses the relative addressing information in the SES header to map the message to the correct target memory region. The destination IP address selects the local FA instance, which is bound to a specific FEP. The JobID in which the FEP participates, together with its PIDonFEP, identifies which Resource Index (RI) table applies. Finally, the rkey (e.g., 0xacce5) selects the specific RI entry that describes the target memory space for the operation.
The SES header includes the buffer_offset, which specifies the location in the memory region relative to the base address where the data should be written. In Figure 5-8, the first fragment writes at offset 0, while the second (not shown) starts at offset 4 096, immediately following the first payload. The ses.som and ses.eom flags mark the start and end of the message. This offset is encoded into the SES header by the PDC state machine and determines the actual data placement order.
The ri_generation field (e.g., 0x01) indicates the version of the Resource Index table in use, which is required because the table may be updated while the job is running. The hd (header-data-present) bit indicates whether the header_data field is included. This is useful when multiple gradient chunks must be synchronized, because each chunk can be identified by a bucket ID in header_data (for example, GPU 0 might use bucket_id = 11 for the first chunk and bucket_id = 12 for the second). The initiator_id field specifies the initiator’s PIDonFEP rank.
Figure 5-8: Packet Delivery Sublayer: PDC and PDS Header.
Work Element (WE)
Once the SES and PDS headers are created, they are inserted, along with the WRE, into the NIC’s Transmit Queue (TxQ) as a Work Element (WE). Figure 5‑9 illustrates the three components of a WE. The NIC fetches the data from local memory based on the instructions described in the WRE, wraps it with Ethernet, IP, optional UDP, PDS, and SES headers, and calculates the Cyclic Redundancy Check (CRC) for unencrypted packets. The packet is then ready for transmission.
The NIC can support multiple Transmit Queues. In our example, there are two: one for Traffic Class Low and another for Traffic Class High. The example WE is placed into the Low TxQ. Work Element 1 (WE1) corresponds to the first ses_pds_tx_req() request, completing the end-to-end flow from WRE creation to packet transmission.
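The queueing step can be sketched as follows; the WE layout (WRE plus the two headers) follows the text, while the queue names and field contents are illustrative:

```python
# Sketch: placing a Work Element into a per-traffic-class Transmit Queue.
# Queue names and the WE layout are illustrative.
from collections import deque

txq = {"low": deque(), "high": deque()}   # two TxQs, one per Traffic Class

def enqueue_we(wre, pds_hdr, ses_hdr, traffic_class="low"):
    we = {"wre": wre, "pds": pds_hdr, "ses": ses_hdr}  # the three WE parts
    txq[traffic_class].append(we)
    return we

# WE1 for the first sub-message goes into the Low TxQ.
enqueue_we({"addr": 0x1000_0000, "len": 4096}, {"psn": 1000}, {"som": 1})
print(len(txq["low"]), len(txq["high"]))  # 1 0
```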
Figure 5-9: UET NIC: Packetization, Queueing & Transport.
Relative Addressing
As described in the previous section, applications use endpoint objects as their communication interfaces for data transfer. To write data from local memory to a target memory region on a remote GPU, the initiator must authorize the local UE-NIC to fetch data from local memory and indicate where that data should be written on the remote side.
To route the packet to the correct Fabric Endpoint (FEP), the application and the UET provider must supply the FEP’s IP address (its Fabric Address, FA). To determine where in the remote process’s memory the received data belongs, the UE-NIC must also know:
• Which job the communication belongs to
• Which process within that job owns the target memory
• Which Resource Index (RI) table should be used
• Which entry in that table describes the exact memory location
This indirection model is called relative addressing.
How Relative Addressing Works
Figure 5-6 illustrates the concept. Two GPUs participate in distributed training. A process on GPU 0 with global rank 0 (PID 0) receives data from GPU 1 with global rank 1 (PID 1). The UE-NIC determines the target Fabric Endpoint (FEP) based on the destination IP address (FA = 10.0.1.11). This IP address forms the first component of the relative address.
Next, the NIC checks the JobID and PIDonFEP to resolve which job and which process the message is intended for. These two fields are the second and third components of the relative address { FA, JobID, PIDonFEP }.
The fourth component is the Resource Index (RI) table descriptor, which tells the NIC which RI table should be consulted for the memory lookup.
Finally, the rkey, although not part of the relative addressing tuple itself, selects the specific entry in that RI table that defines the precise remote memory region. In our example, the complete addressing information is:
{ FA: 10.0.1.11, JobID: 101, PIDonFEP: 0, RI: 0x00a },
and the rkey identifies the specific RI entry to use.
Note: In InfiniBand, each communication endpoint is represented by a Queue Pair (QP), and historically, each QP required its own receive queue. In large systems with tens or hundreds of thousands of peers, maintaining a per-QP receive queue consumed an impractical amount of memory. This limitation led to the introduction of Shared Receive Queues (SRQ) in InfiniBand.
Ultra Ethernet takes a different approach. A receive queue at a FEP can be uniquely addressed using the combination of JobID + PIDonFEP + Resource Index (RI), without requiring any connection context at the sender. As a result, the concept of shared receive queues becomes unnecessary: any worker in the distributed application can write to the queue, and the JobID provides the authorization boundary for those operations.
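This connectionless receive model can be sketched as a single queue keyed by the addressing tuple, with the JobID acting as the authorization check. The dictionary, error type, and function name are illustrative:

```python
# Sketch: a FEP's receive queue keyed by (JobID, PIDonFEP, RI).
# No per-sender connection context is required; the layout is illustrative.

receive_queues = {
    (101, 0, 0x00A): [],   # one queue, addressable by any worker in job 101
}

def deliver(job_id, pid_on_fep, ri, payload):
    key = (job_id, pid_on_fep, ri)
    if key not in receive_queues:
        # The JobID forms the authorization boundary for queue access.
        raise PermissionError("no queue for this JobID/PIDonFEP/RI")
    receive_queues[key].append(payload)

# Two different workers write to the same queue with no sender-side state.
deliver(101, 0, 0x00A, b"gradient chunk from rank 1")
deliver(101, 0, 0x00A, b"gradient chunk from rank 2")
print(len(receive_queues[(101, 0, 0x00A)]))  # 2
```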
Figure 5-9: Ultra Ethernet: Relative Addressing for Distributed Learning.
Comparison with VXLAN
Relative addressing in UET has several structural similarities to addressing in VXLAN data planes.
A Fabric Address (FA) attached to a Fabric Endpoint (FEP) serves a role similar to a VTEP IP address in a VXLAN fabric. Both identify the tunnel endpoint used to route the packet across the underlay network toward its destination.
A JobID identifies a distributed job that consists of multiple processes. In VXLAN, the Layer-2 VNI (L2VNI) identifies a stretched Layer-2 segment for endpoints. In both technologies, these identifiers define the logical communication context in which the packet is interpreted.
The combination of PIDonFEP and RI tells the UE-NIC which Resource Index table describes the target memory locations owned by that process. Similarly, in VXLAN, the VNI-to-VLAN mapping on a VTEP determines which MAC address table holds the forwarding entries for that virtual network.
The rkey selects the specific entry in the RI table that defines the exact target memory location. The VXLAN equivalent is the destination MAC address, which selects the exact entry in the MAC table that determines the egress port or remote VTEP.
Figure 5-10 further illustrates this analogy. The tunnel IP address determines the target VTEP, and the L2VNI-to-VLAN mapping on that VTEP identifies which MAC address table should be consulted for the destination MAC in the original Ethernet header. As a reminder, VXLAN is an Ethernet-in-IP encapsulation method in which the entire Layer-2 frame is carried inside an IP/UDP tunnel.
Figure 5-10: Virtual eXtensible LAN: Layer2 Virtual Network Identifier.