[Figure updated 12 November 2025]
My previous UET posts explained how an application uses libfabric API calls to discover available hardware resources and how this information is used to create a hardware abstraction layer composed of Fabric, Domain, and Endpoint objects, along with their child objects: Event Queues, Completion Queues, Completion Counters, Address Vectors, and Memory Regions.
This chapter explains how these objects are used during data transfer operations. It also describes how information is encoded into UET protocol headers, including the Semantic Sublayer (SES) and the Packet Delivery Sublayer (PDS). In addition, the chapter covers how the Congestion Management Sublayer (CMS) monitors and controls send queue rates to prevent egress buffer overflows.
Note: In this book, libfabric API calls are divided into two categories for clarity. Functions are used to create and configure fabric objects such as fabrics, domains, endpoints, and memory regions (for example, fi_fabric(), fi_domain(), and fi_mr_reg()). Operations, on the other hand, perform actual data transfer or synchronization between processes (for example, fi_write(), fi_read(), and fi_send()).
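To make the distinction concrete, the minimal sketch below labels each call with its category. It is only an illustration: object declarations and error handling are omitted, and variable names such as buf, len, and raddr are placeholders rather than values taken from a real program.

/* Setup functions: create and configure fabric objects. */
fi_fabric(info->fabric_attr, &fabric, NULL);               /* function  */
fi_domain(fabric, info, &domain, NULL);                     /* function  */
fi_mr_reg(domain, buf, len, FI_WRITE, 0, 0, 0, &mr, NULL);  /* function  */

/* Data transfer operations: move or synchronize data between processes. */
fi_write(ep, buf, len, desc, dest_addr, raddr, rkey, NULL); /* operation */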
Figure 5-1 provides a high-level overview of a libfabric Remote Memory Access (RMA) operation using the fi_write function call. When an application needs to transfer data, such as gradients, from its local memory to the memory of a GPU on a remote node, both the application and the UET provider must specify a set of parameters. These parameters ensure that the local RMA-capable NIC can forward packets to the correct destination and that the target node can locate the appropriate memory region using its process and job identifiers.
First, the application defines the operation to perform; in our example, this is a remote write, fi_write(). It then specifies the resources involved in the transfer. The endpoint (fid_ep) represents the communication interface between the process and the underlying fabric. Each endpoint is bound to exactly one domain object, which abstracts the UET NIC. Through this binding, the UET provider automatically knows which NIC the endpoint uses, and the endpoint is automatically assigned to one or more send queues for processing work requests. This means the application does not need to manage NIC queue assignments manually.
Next, the application identifies the registered memory region (desc) that contains the local data to be transmitted. It also specifies where within that region to start reading the payload (buffer pointer: buf) and how many bytes to transfer (length: len).
To reach the correct remote peer, the application uses a fabric address handle (fi_addr_t). The provider resolves this logical address through its Address Vector (AV) to obtain the peer’s actual fabric address—corresponding, in the UET context, to the remote UET NIC endpoint.
Finally, the application specifies the destination memory information: the remote virtual address (addr) where the data should be written and the remote protection key (key), which authorizes access to that memory region.
The resulting fi_write function call, as described in the libfabric programmer’s manual, is structured as follows:
ssize_t fi_write(struct fid_ep *ep, const void *buf, size_t len, void *desc, fi_addr_t dest_addr, uint64_t addr, uint64_t key, void *context);
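The following sketch shows how an application might fill in these arguments. It assumes that the endpoint (ep), the registered memory region (mr), and the address vector entry were created earlier during setup; the buffer name, transfer length, and remote values are illustrative only.

/* Assumed to exist from earlier setup: ep (fi_endpoint), mr (fi_mr_reg),
   and peer_handle (fi_av_insert). Names and values are examples only.  */
void      *buf       = gradient_buffer;        /* start of local payload    */
size_t     len       = 1 << 20;                /* bytes to transfer         */
void      *desc      = fi_mr_desc(mr);         /* local MR descriptor       */
fi_addr_t  dest_addr = peer_handle;            /* AV handle, e.g. 0x0001    */
uint64_t   addr      = remote_virt_addr;       /* remote virtual address    */
uint64_t   key       = remote_mr_key;          /* remote protection key     */

ssize_t ret = fi_write(ep, buf, len, desc, dest_addr, addr, key, NULL);
if (ret == -FI_EAGAIN)
    ;  /* send queue full: progress completions and retry */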
Next, the application's fi_write operation call is passed by the libfabric core to the UET provider. Based on the fi_addr_t handle, the provider knows which Address Vector (AV) table entry to consult. In our example, the handle value 0x0001 corresponds to rank 1 with the fabric address 10.0.1.11.
Depending on the provider implementation, an AV entry may optionally reference a Resource Index (RI) entry. The RI table can associate the JobID with the work request and store an authorization key, if it was not provided directly by the application. It may also define which operations are permitted for the target rank.
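The fi_addr_t handle itself is typically obtained during setup. The sketch below shows one way this could look, assuming the AV was opened earlier with fi_av_open() and that peer_addr holds the provider-specific encoding of the fabric address 10.0.1.11; both names are illustrative.

/* Sketch: resolving the peer (rank 1, fabric address 10.0.1.11) into a
   logical fi_addr_t handle. av and peer_addr are assumed from earlier.  */
fi_addr_t peer_handle;                          /* e.g. resolves to 0x0001 */
int n = fi_av_insert(av, peer_addr, 1, &peer_handle, 0, NULL);
if (n != 1)
    ;  /* address insertion failed */

/* peer_handle is later passed to fi_write() as dest_addr. */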
Note: The Rank Identifier (RankID) can be considered analogous to a Process Identifier (PID); that is, the RankID identifies the process (PID) on the Fabric Endpoint (FEP).
Armed with this information, gathered from the application’s fi_write operation call and from the Address Vector and Resource Index tables, the UET provider creates a Work Request (WR) and places it into the Send Queue (SQ). Each SQ is implemented as a circular buffer in memory, shared between the GPU (running the provider) and the NIC hardware. Writing a WR into the queue does not automatically notify the NIC. To signal that new requests are available, the provider performs a doorbell operation, writing to a special NIC register. This alerts the NIC to read the SQ, determine how many WRs are pending, and identify where to start processing.
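The work request layout and doorbell mechanism are provider- and NIC-specific. The fragment below is purely a conceptual illustration with hypothetical field and register names, intended only to show the post-then-doorbell sequence described above.

/* Hypothetical provider-internal structures; real layouts differ. */
struct work_request {
    uint32_t opcode;        /* e.g. RMA write                    */
    uint64_t local_addr;    /* source buffer                     */
    uint32_t length;        /* payload size in bytes             */
    uint64_t remote_addr;   /* destination virtual address       */
    uint64_t rkey;          /* remote protection key             */
    uint64_t fabric_addr;   /* resolved destination FEP address  */
};

/* 1. Place the WR into the next free slot of the circular send queue. */
sq->ring[sq->tail % sq->depth] = wr;
sq->tail++;

/* 2. Ring the doorbell: write to a NIC register so the NIC knows that
      new work requests are waiting in this send queue.                */
*doorbell_register = sq->tail;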
Once notified, the NIC fetches each WR, retrieves the associated metadata (such as the destination fabric address, the remote memory key, and the SES/PDS header information), and begins executing the data transfer. Some NICs may also periodically poll the SQ, but modern UET NICs typically rely on doorbell notifications to achieve low-latency execution.
Because both the GPU and the application are multithreaded, multiple operations may be posted to the SQ simultaneously. Each WR is treated independently and can be placed in a separate send queue, allowing the NIC to execute multiple transfers in parallel. This design ensures efficient utilization of both the NIC and GPU resources while maintaining correct ordering and authorization of each operation.
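To express this parallelism in libfabric terms, the sketch below posts several independent writes and then reaps their completions from a completion queue. It assumes the CQ was bound to the endpoint during setup; the chunks array and counters are illustrative only.

/* Post several independent RMA writes; each WR may land in its own SQ. */
for (int i = 0; i < nchunks; i++)
    fi_write(ep, chunks[i].buf, chunks[i].len, desc,
             dest_addr, chunks[i].remote_addr, key, &chunks[i]);

/* Reap completions from the bound completion queue. */
struct fi_cq_entry comp;
int done = 0;
while (done < nchunks) {
    ssize_t n = fi_cq_read(cq, &comp, 1);
    if (n == 1)
        done++;                  /* comp.op_context identifies the chunk */
    else if (n != -FI_EAGAIN)
        break;                   /* inspect errors with fi_cq_readerr()  */
}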