Memory Registration and Endpoint Binding in UET with libfabric
In distributed AI workloads, each process requires memory regions that are visible to the fabric for efficient data transfer. The Job framework or application typically allocates these buffers in GPU VRAM to maximize throughput and enable low-latency direct memory access. These buffers store model parameters, gradients, neuron outputs, and temporary workspace, such as intermediate activations or partial gradients during collective operations in forward and backward passes.
Memory Registration and Key Generation
Once memory is allocated, it must be registered with the fabric domain using fi_mr_reg(). Registration informs the NIC that the memory is pinned and accessible for data transfers initiated by endpoints. The fabric library associates the buffer with a Memory Region handle (fid_mr) and assigns it a remote protection key, later retrieved with fi_mr_key(), which uniquely identifies the memory region within the Job and domain context.
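The fid_mr is then associated with the local endpoint using fi_mr_bind(). In libfabric, the permitted operations (FI_REMOTE_WRITE in Figure 4-10) are requested through the access argument of fi_mr_reg(), while fi_mr_bind() ties the region to the endpoint. Together, these steps allow the NIC to access local memory efficiently and perform zero-copy operations.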
The application retrieves the memory key using fi_mr_key(fid_mr) and constructs a Resource Index (RI) entry. The RI entry serves as a compact, portable identifier of the memory region for remote ranks. It does not expose the local fid_mr, but encapsulates the key along with the metadata that remote ranks need to access the region.
Distribution of Resource Index Pointers
During UET job initialization, only remotely accessible resources are distributed to peers. Typically, these are registered memory regions that have a local key and an RI assigned.
Each rank sends its memory RI information to the master rank over the control channel. The transmitted metadata for each memory region includes:
• RI pointer (local identifier for the memory region)
• Job ID
• Rank ID
• Memory Key (rkey)
• Memory type (DRAM, VRAM, etc.)
• Access rights (allowed RDMA operations)
Note: The local fid_mr handle is never shared, as it only has meaning inside the owning rank.
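The metadata listed above can be pictured as a small fixed-size record that each rank sends to the master rank. The following is only a sketch: the field widths, the enum values, the 22-byte little-endian wire layout, and the names ri_entry, ri_pack, and ri_unpack are assumptions made for illustration and are not defined by libfabric or the UET specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encodings for memory type and access rights. */
enum { RI_MEM_DRAM = 0, RI_MEM_VRAM = 1 };
enum { RI_ACC_REMOTE_READ = 1 << 0, RI_ACC_REMOTE_WRITE = 1 << 1 };

/* One advertised memory region. The fid_mr handle is deliberately absent. */
struct ri_entry {
    uint32_t ri;        /* RI pointer (width is an assumption) */
    uint32_t job_id;
    uint32_t rank_id;
    uint64_t rkey;      /* value the owner obtained from fi_mr_key() */
    uint8_t  mem_type;  /* RI_MEM_* */
    uint8_t  access;    /* bitwise OR of RI_ACC_* */
};

#define RI_WIRE_SIZE 22 /* 4 + 4 + 4 + 8 + 1 + 1 bytes */

/* Store the low n bytes of v in little-endian order. */
static void put_le(uint8_t *p, uint64_t v, size_t n)
{
    for (size_t i = 0; i < n; i++)
        p[i] = (uint8_t)(v >> (8 * i));
}

/* Read n little-endian bytes back into an integer. */
static uint64_t get_le(const uint8_t *p, size_t n)
{
    uint64_t v = 0;
    for (size_t i = 0; i < n; i++)
        v |= (uint64_t)p[i] << (8 * i);
    return v;
}

/* Serialize an entry for the control channel. */
static void ri_pack(const struct ri_entry *e, uint8_t out[RI_WIRE_SIZE])
{
    assert(e != NULL && out != NULL);
    put_le(out + 0,  e->ri,      4);
    put_le(out + 4,  e->job_id,  4);
    put_le(out + 8,  e->rank_id, 4);
    put_le(out + 12, e->rkey,    8);
    out[20] = e->mem_type;
    out[21] = e->access;
}

/* Rebuild an entry on the receiving side. */
static void ri_unpack(const uint8_t in[RI_WIRE_SIZE], struct ri_entry *e)
{
    assert(in != NULL && e != NULL);
    e->ri       = (uint32_t)get_le(in + 0, 4);
    e->job_id   = (uint32_t)get_le(in + 4, 4);
    e->rank_id  = (uint32_t)get_le(in + 8, 4);
    e->rkey     = get_le(in + 12, 8);
    e->mem_type = in[20];
    e->access   = in[21];
}
```

Encoding each field byte by byte, rather than copying the struct, keeps the record independent of compiler padding and host byte order, which matters when ranks run on different machines.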
The master rank collects all entries from participating ranks and constructs a job-wide Resource Index Table. Each entry in this table corresponds to one remotely accessible memory region and includes all the metadata above. Once distributed back to all ranks, each rank can resolve a remote RI pointer into the corresponding memory key and access rights, enabling safe and efficient RDMA operations without additional control-plane lookups.
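As a rough illustration of the collection and lookup described above, the sketch below models the job-wide table as a flat array that the master rank fills and that every rank can later search by (rank, RI pointer). The capacity, the field widths, and the names ri_table, ri_table_add, and ri_table_lookup are hypothetical; a real implementation may well use a different structure, such as a per-rank index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RI_TABLE_CAP 64 /* arbitrary capacity for this sketch */

/* Same metadata as the per-region record each rank reports. */
struct ri_entry {
    uint32_t ri, job_id, rank_id;
    uint64_t rkey;
    uint8_t  mem_type, access;
};

struct ri_table {
    struct ri_entry e[RI_TABLE_CAP];
    size_t n;
};

/* Resolve (rank, RI pointer) to its entry, or NULL if it was never advertised. */
static const struct ri_entry *ri_table_lookup(const struct ri_table *t,
                                              uint32_t rank_id, uint32_t ri)
{
    assert(t != NULL);
    for (size_t i = 0; i < t->n; i++)
        if (t->e[i].rank_id == rank_id && t->e[i].ri == ri)
            return &t->e[i];
    return NULL;
}

/* Master side: add one reported entry. Returns 0 on success and -1 when
 * the table is full or the (rank, RI pointer) pair is already present. */
static int ri_table_add(struct ri_table *t, const struct ri_entry *e)
{
    assert(t != NULL && e != NULL);
    if (t->n == RI_TABLE_CAP || ri_table_lookup(t, e->rank_id, e->ri) != NULL)
        return -1;
    t->e[t->n++] = *e;
    return 0;
}
```

Because the RI pointer is only a local identifier, the lookup key in this sketch is the pair (rank, RI pointer); two ranks may reuse the same RI value without colliding.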
Objects that are purely local — such as domain, fabric, completion queues, event queues, or local counters — do not require RI distribution and are therefore excluded from the table. This selective sharing ensures efficiency while giving all ranks sufficient knowledge to access remote memory.
Memory Binding and Accessing Remote Memory
Once the Resource Index table is distributed, each rank binds its local memory regions to the local endpoint using fi_mr_bind(). Binding associates the fid_mr handle with the endpoint; the access permissions are the ones requested when the region was registered. This step ensures that the NIC can access the local memory efficiently and perform zero-copy operations.
The Resource Index table, like the AV table, contains only remote memory entries. Conceptually, the endpoint is “bound” to this table: it can automatically resolve remote RI pointers to the corresponding memory key, type, and access rights during RDMA operations. Local binding provides the actual handle and permissions needed for the NIC, while the table provides the mapping required for remote access, just as the AV table provides remote destination addresses for send operations.
When an application wants to send data, it specifies the destination using an fi_addr_t handle — a compact identifier representing the remote rank or endpoint. The application does not need to know the remote Fabric Address (FA) or RI. The local endpoint looks up the fi_addr_t in the AV table, retrieves the remote FA, and uses it to identify the remote rank. For RDMA operations, the NIC references the Resource Index table for that rank to resolve the remote RI into the memory key, type, and access rights. This combination of local fid_mr binding and remote RI table lookup allows the endpoint to safely and efficiently perform zero-copy RDMA operations.
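The two-step resolution described above can be modeled in a few lines. This is a conceptual sketch only: the AV is represented as an array indexed by an fi_addr_t-like handle, the Resource Index table as an array of entries, and the names av_entry, ri_entry, write_target, resolve_write, and ACC_REMOTE_WRITE are invented for illustration. It does not reproduce how a libfabric provider or a UET NIC stores these tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ACC_REMOTE_WRITE 0x2u /* hypothetical access bit */

/* AV entry: what the handle resolves to. */
struct av_entry { uint32_t fabric_addr; uint32_t rank_id; };

/* Remote region advertised by a rank. */
struct ri_entry { uint32_t rank_id, ri; uint64_t rkey; uint8_t access; };

/* Everything needed to address a remote write. */
struct write_target { uint32_t fabric_addr; uint64_t rkey; };

/* Resolve a destination handle and RI pointer into a write target.
 * Returns 0 on success, -1 for an unknown handle, -2 if the region
 * does not permit remote writes, -3 if the region is not in the table. */
static int resolve_write(const struct av_entry *av, size_t av_len,
                         const struct ri_entry *rit, size_t rit_len,
                         size_t dest, uint32_t ri, struct write_target *out)
{
    assert(av != NULL && rit != NULL && out != NULL);
    if (dest >= av_len)
        return -1;

    /* Step 1: AV lookup gives the remote Fabric Address and rank. */
    const struct av_entry *peer = &av[dest];

    /* Step 2: RI lookup for that rank gives the key and access rights. */
    for (size_t i = 0; i < rit_len; i++) {
        if (rit[i].rank_id == peer->rank_id && rit[i].ri == ri) {
            if (!(rit[i].access & ACC_REMOTE_WRITE))
                return -2;
            out->fabric_addr = peer->fabric_addr;
            out->rkey = rit[i].rkey;
            return 0;
        }
    }
    return -3;
}
```

Checking the access rights before building the target means a write to a read-only region is rejected locally instead of being sent and refused by the remote NIC.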
Endpoint Binding to Monitoring and Signaling Objects
After memory regions are bound, the endpoint must also be associated with monitoring and signaling objects:
• Event Queues (EQs): Deliver asynchronous fabric events, such as errors or connection state changes. Created with fi_eq_open() and bound to the endpoint using fi_ep_bind(fid_ep, fid_eq, flags).
• Completion Queues (CQs): Track completion status of send, receive, or RDMA operations. Created with fi_cq_open() and bound to the endpoint using fi_ep_bind(fid_ep, fid_cq, FI_TRANSMIT | FI_RECV).
• Completion Counters (CNTRs): Provide lightweight tracking of operation progress for flow control or synchronization. Created with fi_cntr_open() and bound to the endpoint using fi_ep_bind(fid_ep, fid_cntr, flags).
This allows the endpoint to receive notifications, track operation completions, and integrate with application-level synchronization.
Endpoint Enablement
Once all memory and monitoring bindings are complete, the endpoint is enabled using fi_enable(fid_ep). This final step activates the endpoint for communication, ensuring that all memory regions, Resource Index pointers, and signaling mechanisms are properly connected. After enabling, the endpoint can safely issue and receive messages, perform remote memory operations, and fully participate in the distributed AI workload.
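The ordering constraint, bind first and enable last, can be illustrated with a toy state model. This is not libfabric code: model_ep, model_bind, model_enable, model_post_write, and the HAS_* bits are hypothetical names, and which bindings are actually required depends on the provider and the endpoint's capabilities. The model only captures the rule that transfers are refused until the endpoint is enabled, and that enabling is refused while a required binding is missing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bits naming the objects bound to the endpoint. */
enum { HAS_MR = 1u << 0, HAS_EQ = 1u << 1, HAS_CQ = 1u << 2, HAS_CNTR = 1u << 3 };

struct model_ep {
    unsigned bound;   /* bitwise OR of HAS_* recorded so far */
    bool     enabled;
};

/* Record a binding (stands in for fi_mr_bind() or fi_ep_bind()). */
static void model_bind(struct model_ep *ep, unsigned what)
{
    assert(ep != NULL);
    ep->bound |= what;
}

/* Enable only if every required binding is present (stands in for fi_enable()). */
static int model_enable(struct model_ep *ep, unsigned required)
{
    assert(ep != NULL);
    if ((ep->bound & required) != required)
        return -1;
    ep->enabled = true;
    return 0;
}

/* A data transfer is accepted only on an enabled endpoint. */
static int model_post_write(const struct model_ep *ep)
{
    assert(ep != NULL);
    return ep->enabled ? 0 : -1;
}
```

In this model, an endpoint that has a bound memory region and CQ but no EQ fails to enable when the EQ is listed as required, and succeeds once the EQ is bound as well.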
With memory regions registered, Resource Index pointers distributed, and endpoints fully bound and enabled, the application is now ready to perform send, receive, and remote memory operations. This chapter has focused on the high-level setup from the application’s perspective, illustrating how memory and endpoint resources are prepared for communication.
Figure 4-10: Memory Registration.