Monday, 24 November 2025

UET Data Transfer Operation: Work Request Entity and Semantic Sublayer

Work Request Entity (WRE) 


The UET provider constructs a Work Request Entity (WRE) from a fi_write RMA operation that has been validated and passed by the libfabric core. The WRE is a software-level representation of the requested transfer and semantically describes both the source memory (local buffer) and the target memory (remote buffer) for the operation. Using the WRE, the UET provider constructs the Semantic Sublayer (SES) header and the Packet Delivery Context (PDC) header.

From the local memory perspective, the WRE specifies the address of the data in registered local memory, the length of the data, and the local memory key (lkey). This information allows the NIC to fetch the data directly from local memory when performing the transmission.

From the target memory perspective, the WRE references the Resource Index (RI) table, which contains information about the destination memory region, including its base address and the offset within that region where the data should be written. The RI table also defines the operations allowed on the region. Because an RI table may contain multiple entries, the actual memory region is selected using the rkey, which is also included in the WRE. The rkey enables the remote NIC to locate the correct memory region within the selected RI table.

To ensure proper delivery, the WRE includes an Address Vector (AV) table entry, identified via the fi_addr_t handle. The AV provides the Fabric Address (FA) of the target and specifies which job and which rank (i.e., PIDonFEP) the data is intended for. The WRE also indicates whether the completion of the transport operation should be reported through a completion queue.

By referencing the AV table entry and the remote RI table rather than copying them, the WRE gives the UET provider access to all the transport and remote memory metadata required for the operation without duplicating the underlying AV or RI data structures. Using these references, the UET provider can construct the SES and PDC headers for the Work Element (WE), so that the data is delivered from the initiator’s local memory to the correct remote target memory.
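The information carried by the WRE can be summarized in a small sketch. This is an illustrative Python model, not the UET provider's actual data structure: the class names (AvEntry, WorkRequestEntity), all field names, the local address, the lkey value, and the fabric address are assumptions made for the example. Only the kinds of information (local address, length, lkey, RI, rkey, offset, AV handle, completion flag) and the JobID, rank, RI, and rkey values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvEntry:
    """Hypothetical AV table entry describing where the target lives."""
    fabric_address: str   # Fabric Address (FA) of the target
    job_id: int           # job the data is intended for
    pid_on_fep: int       # rank (PIDonFEP) within that job


@dataclass(frozen=True)
class WorkRequestEntity:
    """Hypothetical software-level description of one fi_write request."""
    local_addr: int          # address of the data in registered local memory
    length: int              # number of bytes to transfer
    lkey: int                # local memory key
    fi_addr: int             # fi_addr_t-style handle, used here as an AV index
    resource_index: int      # selects the RI table at the target
    rkey: int                # selects the memory region within that RI table
    remote_offset: int       # offset from the region's base address
    report_completion: bool  # report completion through a completion queue?


def resolve_target(wre, av_table):
    """Look up the AV entry by handle instead of copying it into the WRE."""
    return av_table[wre.fi_addr]


av_table = [AvEntry("10.0.0.2", 101, 2)]  # example values only
wre = WorkRequestEntity(
    local_addr=0x7F00_0000_0000, length=16384, lkey=0x1234,
    fi_addr=0, resource_index=0x00A, rkey=0xACCE5,
    remote_offset=0, report_completion=True,
)
target = resolve_target(wre, av_table)
print(target.job_id, target.pid_on_fep)
```

The WRE stores only the handle and keys, so the AV and RI data stay in one place and are looked up when the headers are built.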

Figure 5-4 illustrates how the libfabric core passes a fi_write RMA operation request from the application to the UET provider after performing a sanity check. The UET provider then constructs the Work Request Entity, which encapsulates all the information about the local and remote memory, the AV entry identified by the fi_addr_t handle, and the transport metadata required to deliver the operation.

Figure 5-4: RMA Operation – Semantic Sublayer: Work Request Entity (WRE).

Semantic Sublayer (SES)


The UET provider’s Semantic Sublayer (SES) maps application-facing API calls, such as fi_write RMA operations, to UET operations. In our example, the UET provider constructs a SES header in which the fi_write request from the libfabric application is mapped to a UET_WRITE operation. The first step of this mapping, described in the previous section, uses the content of the fi_write request to construct a UET Work Request Entity (WRE). Information from the WRE is then used to build the actual SES header, which is later encapsulated in Ethernet/IP/UDP (when an Entropy header is not used). The SES header carries the information that the target NIC uses to resolve which job and process the message is destined for, and which specific memory location the data should be written to. In other words, the SES header provides a form of intra-node routing, indicating where the data should be placed within the target node.

The first element to consider in Figure 5‑5 is the maximum message size for the endpoint. The NIC is abstracted as a Domain object, with which the endpoint is associated. In our example, the NIC has two Ethernet ports, each with an egress buffer size of 8192 bytes. A shared staging buffer defines the maximum number of bytes that can be queued for transmission through the egress ports. Because the data to be written to the target memory (16384 bytes) exceeds the staging buffer size (8192 bytes), the UET provider must split the operation into two messages of 8192 bytes each, each with a slightly different SES header.

The first key field in the SES header is the operation code (UET_WRITE), which instructs the target NIC how to handle the received data. The rel bit (relative addressing), when set, indicates that the operation uses a relative address, which consists of the JobID (101), the PIDonFEP (the process identifier, rank 2 in our example), and the Resource Index (0x00a). Based on this information, the target NIC can identify the job and rank to which the message belongs. The RI table selected by the Resource Index may contain multiple entries, so the rkey (0xacce5) is used to select the correct RI entry.
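The lookup described above can be sketched as a two-step resolution at the target. This is a simplified Python model, not the NIC's real implementation: the RI_TABLES dictionary, the MemoryRegion class, the resolve_region function, the base address, and the error handling are all assumptions for illustration. The JobID, PIDonFEP, Resource Index, and rkey values are the ones used in our example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryRegion:
    """Hypothetical RI table entry."""
    base_addr: int
    length: int
    allowed_ops: frozenset


# Hypothetical target-side state:
# (JobID, PIDonFEP, Resource Index) -> RI table, and rkey -> memory region.
RI_TABLES = {
    (101, 2, 0x00A): {
        0xACCE5: MemoryRegion(0x1000_0000, 16384, frozenset({"UET_WRITE"})),
    },
}


def resolve_region(job_id, pid_on_fep, resource_index, rkey, opcode):
    """Resolve the relative address, then select the region with the rkey."""
    ri_table = RI_TABLES.get((job_id, pid_on_fep, resource_index))
    if ri_table is None:
        raise LookupError("no RI table for this job/rank/resource index")
    region = ri_table.get(rkey)
    if region is None:
        raise LookupError("rkey does not match any entry in the RI table")
    if opcode not in region.allowed_ops:
        raise PermissionError(f"{opcode} is not allowed on this region")
    return region


region = resolve_region(101, 2, 0x00A, 0xACCE5, "UET_WRITE")
print(hex(region.base_addr))
```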

The SES header also contains the buffer offset, which specifies the exact location relative to the base address where the data should be written. In Figure 5‑5, the first message will write data starting at offset 0, while the second message will write at offset 8192, immediately following the first message’s payload. The ses.som bit indicates the start of the message, and ses.eom indicates the end of the message. Note that ses.som and ses.eom bits are not used for ordering; message ordering is ensured by the Message ID field, which allows the NIC to process fragments in the correct sequence. 
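The offset arithmetic for our example can be sketched in a few lines of Python. The function name split_into_messages and the dictionary keys are hypothetical, and the way som and eom are set follows the description in this post rather than any normative definition.

```python
def split_into_messages(total_len, max_msg_size, base_offset=0):
    """Split a transfer into per-message (offset, length, som, eom) records."""
    if total_len <= 0 or max_msg_size <= 0:
        raise ValueError("lengths must be positive")
    messages = []
    sent = 0
    while sent < total_len:
        length = min(max_msg_size, total_len - sent)
        messages.append({
            "buffer_offset": base_offset + sent,     # relative to the region base
            "length": length,
            "som": int(sent == 0),                   # first message of the transfer
            "eom": int(sent + length == total_len),  # last message of the transfer
        })
        sent += length
    return messages


# 16384 bytes with an 8192-byte limit, as in Figure 5-5.
for msg in split_into_messages(16384, 8192):
    print(msg)
```

With these inputs the sketch yields two records, one at buffer offset 0 and one at buffer offset 8192.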

During the job lifetime, Resource Indices may be updated. To validate that the correct version is used, the SES header includes the ri_generation field, which identifies the initiator’s current RI table version.
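A minimal sketch of the version check, assuming that the target keeps a generation counter per RI table and increments it by one on every update. The RiTable class, its methods, and the increment policy are hypothetical. What the target actually does on a mismatch is not covered here, so the sketch only reports whether the versions agree.

```python
class RiTable:
    """Hypothetical RI table with a generation counter."""

    def __init__(self):
        self.generation = 0
        self.entries = {}

    def update(self, rkey, region):
        """Change an entry and bump the generation (assumed policy)."""
        self.entries[rkey] = region
        self.generation += 1

    def accepts(self, header_generation):
        """True if the ri_generation in the SES header matches this table."""
        return header_generation == self.generation


table = RiTable()
stale = table.generation          # version the initiator saw earlier
table.update(0xACCE5, "region")   # the table changes during the job
print(table.accepts(stale), table.accepts(table.generation))
```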

The hd (header data present) bit indicates whether the header_data field is present in the SES header. A common use case for header data is when a GPU holds multiple gradient chunks that must be synchronized with a remote process. Each chunk can be identified by a bucket ID stored in the SES header’s header_data field. For example, the first chunk in GPU 0 memory may have bucket_id=11, the second chunk bucket_id=12, and so on. This allows the NIC to distinguish which messages correspond to which chunk. The initiator ID field carries the initiator’s rank (its PIDonFEP).

If a gradient chunk exceeds the NIC’s max_msg_size, it must be split into multiple SES messages. Consider the second chunk (bucket_id=12) split into four messages. The first message has ses.som=1, indicating the start of the chunk, and hd=1, signaling that header data is present. The header_data field contains the bucket ID (12). This message also has message_id=1, identifying it as the first SES message of the chunk. The next two messages have message_id=2 and message_id=3, respectively. Both have hd=0, ses.som=0, and ses.eom=0, indicating they are continuation messages. The fourth message (message_id=4) is similar but has ses.eom=1, marking it as the last message of the chunk.
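The four-message walkthrough can be reproduced with a short Python sketch. The function name build_chunk_headers and the dictionary layout are hypothetical, and only the fields discussed in this example are modeled.

```python
def build_chunk_headers(bucket_id, num_messages, first_message_id=1):
    """Build simplified SES header fields for one chunk split into messages."""
    if num_messages < 1:
        raise ValueError("a chunk needs at least one message")
    headers = []
    for i in range(num_messages):
        first = i == 0
        header = {
            "message_id": first_message_id + i,
            "som": int(first),                  # start of the chunk
            "eom": int(i == num_messages - 1),  # end of the chunk
            "hd": int(first),                   # header data only in the first
        }
        if first:
            header["header_data"] = bucket_id   # bucket ID identifies the chunk
        headers.append(header)
    return headers


for header in build_chunk_headers(bucket_id=12, num_messages=4):
    print(header)
```

Only the first header carries header_data, which matches the hd=0 setting on the three messages that follow it.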



Figure 5-5: RMA Operation – Semantic Sublayer: SES Header.

