Endpoint Creation and Operation
[Updated 12 October 2025: Figure & UET addressing section]
In libfabric and Ultra Ethernet Transport (UET), the endpoint, represented by the object fid_ep, serves as the primary communication interface between a process and the underlying network fabric. Every data exchange, whether it involves message passing, remote memory access (RMA), or atomic operations, ultimately passes through an endpoint. It acts as a software abstraction of the transport hardware, exposing a programmable interface that the application can use to perform high-performance data transfers.
Conceptually, an endpoint resembles a socket in the TCP/IP world. However, while sockets hide much of the underlying network stack behind a simple API, endpoints expose far more detail and control. They allow the process to define which completion queues to use, what capabilities to enable, and how multiple communication contexts are managed concurrently. This design gives applications, especially large distributed training frameworks and HPC workloads, direct control over latency, throughput, and concurrency in ways that traditional sockets cannot provide.
Furthermore, socket-based communication typically relies on the operating system’s networking stack and consumes CPU cycles for data movement and protocol handling. In contrast, endpoint communication paths can interact directly with the NIC, enabling user-space data transfers and RDMA operations that bypass the kernel and minimize CPU involvement.
Endpoint Types
Libfabric defines three endpoint types: active, passive, and scalable. Each serves a specific role in communication setup and operation and is created using a dedicated constructor function. Their configuration and behavior are largely determined by the information returned by fi_getinfo(), which populates a structure called fi_info. Within this structure, the subfields fi_ep_attr, fi_tx_attr, and fi_rx_attr define how the endpoint interacts with the provider and the transport hardware.
| Endpoint Type     | Constructor Function | Typical Use Case                                                              |
|-------------------|----------------------|-------------------------------------------------------------------------------|
| Active Endpoint   | fi_endpoint()        | Actively sends and receives data once configured and enabled                   |
| Passive Endpoint  | fi_passive_ep()      | Listens for and accepts incoming connection requests                           |
| Scalable Endpoint | fi_scalable_ep()     | Supports multiple transmit and receive contexts for concurrent communication   |
Table 4-1: Endpoint Types.
Active Endpoint
The active endpoint is the most common and versatile type. It is created using the fi_endpoint() function, which initializes an endpoint that can actively send and receive data once configured and enabled. The attributes describing its behavior are drawn from the provider’s fi_info structure returned by fi_getinfo(), which provides general provider capabilities and limits. The caps field specifies which operations the endpoint supports, such as message transmission (FI_SEND and FI_RECV), remote memory operations (FI_RMA), and atomic instructions (FI_ATOMIC).
The fi_ep_attr substructure defines the endpoint’s communication semantics. For example, FI_EP_MSG describes a reliable, connection-oriented model similar to TCP, whereas FI_EP_RDM represents a reliable datagram model with connectionless semantics. Active endpoints support only a single transmit and a single receive context. In fi_ep_attr, tx_ctx_cnt and rx_ctx_cnt are typically set to 0, which indicates that the default context should be used. In contrast, scalable endpoints can support multiple TX and RX contexts, enabling concurrent operations across several threads or contexts while sharing the same logical endpoint.
The fi_tx_attr and fi_rx_attr substructures define the operational behavior of the endpoint for sending and receiving. They specify which operations are supported (e.g., FI_SEND, FI_RMA), how many outstanding transmit or receive operations the provider should allocate resources for (size), and the message ordering guarantees (msg_order). In Figure 4-1, msg_order = 0 indicates no ordering guarantees, meaning the provider does not enforce any specific ordering of operations. These attributes guide the provider in configuring internal queues and hardware resources to meet the requested behavior.
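The following minimal sketch shows how these attributes might be requested in practice: hints are populated, fi_getinfo() returns a matching fi_info, and fi_endpoint() creates the active endpoint. The API version and capability choices are illustrative assumptions, and error handling is abbreviated.

```c
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

/* Minimal sketch: request a reliable-datagram endpoint with messaging
 * and RMA capabilities, then create the active endpoint. */
static int open_active_ep(struct fid_fabric **fabric, struct fid_domain **domain,
                          struct fid_ep **ep)
{
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info = NULL;
    int ret;

    if (!hints)
        return -FI_ENOMEM;

    hints->caps = FI_MSG | FI_RMA;        /* FI_SEND/FI_RECV plus RMA */
    hints->ep_attr->type = FI_EP_RDM;     /* reliable, connectionless */
    hints->tx_attr->msg_order = 0;        /* no ordering guarantees */
    hints->rx_attr->msg_order = 0;

    ret = fi_getinfo(FI_VERSION(1, 20), NULL, NULL, 0, hints, &info);
    if (ret)
        goto out;

    ret = fi_fabric(info->fabric_attr, fabric, NULL);
    if (!ret)
        ret = fi_domain(*fabric, info, domain, NULL);
    if (!ret)
        ret = fi_endpoint(*domain, info, ep, NULL); /* active endpoint */

    fi_freeinfo(info);
out:
    fi_freeinfo(hints);
    return ret;
}
```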
From a transport perspective, active endpoints can operate in either a connection-oriented or a connectionless mode. When configured as FI_EP_MSG, the endpoint behaves much like a TCP socket, requiring an explicit connection setup before data transfer can occur. When configured as FI_EP_RDM, the endpoint exchanges messages with peers directly through the fabric. Address resolution is handled through the Address Vector table, and the PIDonFEP (Process Identifier on Fabric Endpoint, described later in this chapter) ensures that packets reach the correct process on the host, eliminating the need for connection handshakes. This mode is particularly useful for distributed workloads, such as collective operations in AI training.
Before an endpoint can be used, it must be bound to supporting objects such as completion queues, event queues, and the Address Vector, and then enabled. Binding and the role of the Address Vector are described in dedicated sections later; the short sketch below previews the sequence.
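As a rough sketch, assuming a completion queue cq and an Address Vector av have already been created with fi_cq_open() and fi_av_open(), the binding sequence for an RDM endpoint looks roughly like this:

```c
/* Sketch: bind the endpoint to its completion queue and Address Vector,
 * then enable it. Assumes ep, cq, and av were created earlier. */
int ret;

ret = fi_ep_bind(ep, &cq->fid, FI_TRANSMIT | FI_RECV); /* one CQ for both directions */
if (!ret)
    ret = fi_ep_bind(ep, &av->fid, 0);                 /* address resolution */
if (!ret)
    ret = fi_enable(ep);                               /* endpoint is now usable */
```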
Passive Endpoint
The passive endpoint, created with fi_passive_ep(), performs a role analogous to a listening socket in TCP. It does not transmit data itself but instead waits for incoming connection requests from remote peers. When a connection request arrives, it triggers an event in the associated Event Queue, which the application can monitor. Based on this event, the application can decide whether to accept or reject the request by calling fi_accept() or fi_reject().
Because passive endpoints exist solely for connection management, they always operate in FI_EP_MSG mode and do not perform data transmission. Consequently, they only need to bind to an Event Queue, which receives notifications of connection attempts, completions, and related state changes. Once a connection is accepted, the provider automatically creates a corresponding active endpoint that can then send and receive data. In this process, the Address Vector is updated to include the new peer’s address, ensuring that subsequent communication can occur transparently.
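A hedged sketch of that accept flow, assuming a fabric object, an event queue eq opened with fi_eq_open(), and an FI_EP_MSG fi_info were created earlier, might look like this:

```c
#include <rdma/fi_cm.h>
#include <rdma/fi_eq.h>

/* Sketch: listen on a passive endpoint and accept one connection. */
struct fid_pep *pep;
struct fi_eq_cm_entry entry;
uint32_t event;
ssize_t rd;

fi_passive_ep(fabric, info, &pep, NULL);   /* error checks omitted */
fi_pep_bind(pep, &eq->fid, 0);             /* EQ receives connection events */
fi_listen(pep);

/* Block until a connection request arrives. */
rd = fi_eq_sread(eq, &event, &entry, sizeof(entry), -1, 0);
if (rd >= (ssize_t)sizeof(entry) && event == FI_CONNREQ) {
    struct fid_domain *domain;
    struct fid_ep *ep;

    /* The provider supplies the peer's fi_info in entry.info; use it
     * to create the active endpoint for this connection. */
    fi_domain(fabric, entry.info, &domain, NULL);
    fi_endpoint(domain, entry.info, &ep, NULL);
    fi_ep_bind(ep, &eq->fid, 0);           /* FI_CONNECTED arrives on the EQ */
    fi_accept(ep, NULL, 0);                /* or fi_reject(pep, entry.info->handle, NULL, 0) */
}
```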
Scalable Endpoint
The scalable endpoint represents the most advanced and parallelized endpoint type. It is constructed with fi_scalable_ep() and is designed for applications that require massive concurrency, such as distributed AI training jobs running on multiple GPUs or multi-threaded inference servers.
Unlike a regular endpoint, a scalable endpoint can be subdivided into multiple transmit (TX) and receive (RX) contexts. Each context acts like an independent communication lane with its own completion queues and state, allowing multiple threads to perform concurrent operations without locking or contention. In essence, scalability in this context means that a single endpoint object can represent many lightweight communication contexts that share the same underlying hardware resources.
The number of transmit and receive contexts for a scalable endpoint is specified in fi_ep_attr using tx_ctx_cnt and rx_ctx_cnt. The provider validates the requested counts against the domain’s capabilities (max_ep_tx_ctx and max_ep_rx_ctx), ensuring that the requested contexts do not exceed the hardware limits. The capabilities field (caps) indicates support for features such as FI_SHARED_CONTEXT and FI_MULTI_RECV, which are required for efficient scalable endpoint operation.
From a conceptual point of view, this arrangement can be compared to a single process maintaining multiple sockets, each used by a separate thread to communicate independently. However, scalable endpoints achieve this concurrency without the overhead of creating and maintaining separate endpoints per thread. Each context binds to its own completion queue, enabling parallel operation, while the scalable endpoint as a whole binds only once to the Event Queue and Address Vector. This design allows high-throughput, low-latency communication suitable for GPU-parallelized data exchange, where multiple communication streams operate concurrently between the same set of nodes.
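A sketch of this pattern, assuming four worker threads and per-context completion queues txcq[i] and rxcq[i] created earlier, could look like the following:

```c
#include <rdma/fi_endpoint.h>

#define NUM_CTX 4   /* one TX/RX context pair per worker thread (assumption) */

/* Sketch: create a scalable endpoint, derive per-thread TX/RX contexts,
 * and bind each context to its own completion queue. */
struct fid_ep *sep, *tx[NUM_CTX], *rx[NUM_CTX];

info->ep_attr->tx_ctx_cnt = NUM_CTX;       /* validated against max_ep_tx_ctx */
info->ep_attr->rx_ctx_cnt = NUM_CTX;       /* validated against max_ep_rx_ctx */

fi_scalable_ep(domain, info, &sep, NULL);  /* error checks omitted */
fi_scalable_ep_bind(sep, &av->fid, 0);     /* one AV bind for the whole SEP */

for (int i = 0; i < NUM_CTX; i++) {
    fi_tx_context(sep, i, NULL, &tx[i], NULL);
    fi_ep_bind(tx[i], &txcq[i]->fid, FI_TRANSMIT);
    fi_enable(tx[i]);

    fi_rx_context(sep, i, NULL, &rx[i], NULL);
    fi_ep_bind(rx[i], &rxcq[i]->fid, FI_RECV);
    fi_enable(rx[i]);
}
fi_enable(sep);
```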
UET Endpoint Address
When a UET endpoint is created, it receives a UET Endpoint Address, which uniquely identifies it within the fabric. Unlike a simple IP or MAC-style address, this is a composite structure that encodes several critical attributes of the endpoint and its execution context. The fields defined in UET Specification v1.0 include Version, Flags, Fabric Endpoint Capabilities, PIDonFEP, Fabric Address (FA), Start Resource Index, Num Resource Indices, and Initiator ID. Together, these fields describe both where the endpoint resides and how it interacts with other endpoints across the transport layer.
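To make the composition concrete, the struct below mirrors these logical fields in C. Field widths and ordering are assumptions chosen for readability, not the wire encoding defined in the specification.

```c
#include <stdint.h>
#include <netinet/in.h>

/* Illustrative only: the logical fields of a UET Endpoint Address.
 * Widths and ordering are assumptions, not the encoding from
 * UET Specification v1.0. */
struct uet_ep_addr_sketch {
    uint8_t         version;               /* address format revision */
    uint32_t        flags;                 /* validity + addressing-mode bits */
    uint64_t        fep_caps;              /* Fabric Endpoint Capabilities */
    uint16_t        pid_on_fep;            /* process/context ID on the FEP */
    struct in6_addr fabric_address;        /* FA: IPv4 or IPv6, per flags */
    uint32_t        start_resource_index;  /* first index owned by this EP */
    uint32_t        num_resource_indices;  /* size of the index range */
    uint32_t        initiator_id;          /* e.g., JobID in relative mode */
};
```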
Address Assignment Process
The UET address assignment is a multi-stage process that completes when the endpoint is opened and the final address is retrieved. During the fi_endpoint() call, the provider uses the information from the fi_info structure and its substructures, including the transmit and receive attributes, to form the basic addressing context. At this stage, the provider contacts the UET kernel driver and finalizes the internal endpoint address by adding UET-specific fields such as the JobID, PIDonFEP, Initiator ID, and resource indices.
Although these fields are now part of the provider’s internal address structure, the application does not yet have access to them. To retrieve the complete and fully resolved UET address, the application must call the fi_getname() API. This call returns the composite address containing all the fields that were assigned during the fi_endpoint() operation.
In summary, fi_endpoint() finalizes the endpoint’s address internally, while fi_getname() exposes this finalized address to the application.
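In code, the retrieval step is a standard libfabric call; the buffer size below is an assumption, and a -FI_ETOOSMALL return reports the size actually required:

```c
/* Sketch: fetch the fully resolved, provider-specific UET address
 * after the endpoint has been opened. */
char addr[128];                 /* assumed large enough for the UET address */
size_t addrlen = sizeof(addr);

int ret = fi_getname(&ep->fid, addr, &addrlen);
if (ret == -FI_ETOOSMALL) {
    /* addrlen now holds the required size; retry with a larger buffer */
}
```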
Version and Flags
The Version field ensures compatibility between different revisions of the UET library or NIC firmware, allowing peers to interpret the endpoint address structure correctly.
The Flags field defines both the validity of other fields in the endpoint address and the addressing behavior of the endpoint. It indicates whether the Fabric Endpoint Capabilities, PIDonFEP, Resource Index, or Initiator ID fields are valid. It also indicates which addressing mode is used (relative or absolute), the type of address in the Fabric Address (IPv4 or IPv6), and whether the maximum message size is limited to the domain object’s (NIC) MTU.
From a transport perspective, the flags make the endpoint address self-descriptive: peers can parse it to know which fields are valid and how messages should be addressed without additional negotiation. This is especially important in large distributed AI clusters, where thousands of endpoints are initialized and need to communicate efficiently.
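As a purely hypothetical illustration of such parsing, the mask values below are invented for this sketch; the real bit assignments are defined in UET Specification v1.0.

```c
#include <stdint.h>

/* Hypothetical masks: invented for illustration, not the real layout. */
#define UET_FLAG_CAPS_VALID    (1u << 0)  /* Fabric Endpoint Capabilities present */
#define UET_FLAG_PID_VALID     (1u << 1)  /* PIDonFEP present */
#define UET_FLAG_RI_VALID      (1u << 2)  /* resource index range present */
#define UET_FLAG_INIT_VALID    (1u << 3)  /* Initiator ID present */
#define UET_FLAG_REL_ADDR      (1u << 4)  /* relative vs. absolute addressing */
#define UET_FLAG_FA_IPV6       (1u << 5)  /* FA holds IPv6 rather than IPv4 */

/* A peer can decide how to address messages from the flags alone. */
static int use_relative_addressing(uint32_t flags)
{
    return (flags & UET_FLAG_REL_ADDR) != 0;
}
```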
Fabric Endpoint Capabilities
The Fabric Endpoint Capabilities field complements the flags by describing what the endpoint can actually do. It indicates the supported operational profile of the endpoint. In UET, three profiles are defined:
- AI Base Profile: Provides the minimal feature set for distributed AI workloads, including reliable messaging, basic RDMA operations, and standard memory registration.
- AI Full Profile: Extends AI Base with advanced features, such as optimized queue handling and SES header enhancements, which are critical for large-scale, high-throughput AI training.
- HPC Profile: Optimized for low-latency, synchronization-heavy workloads typical in high-performance computing.
Each profile requires specific capabilities and parameters to be set in the fi_info structure, which the UET provider returns during fi_getinfo(). These include supported operations (such as send/receive, RDMA, or atomics), maximum message sizes, queue depths, memory registration limits, and other provider-specific settings.
For a complete list of per-profile requirements, see Table 2‑4, “Per-Profile Libfabric Parameter Requirements,” in the UET Specification v1.0.
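As a simple sanity check, an application can compare the capabilities returned by fi_getinfo() against the operations its chosen profile needs; the capability grouping below is an illustrative assumption, not the normative per-profile table.

```c
#include <stdio.h>

/* Sketch: confirm the returned fi_info covers the operations a
 * workload needs. The grouping is illustrative, not normative. */
uint64_t required = FI_MSG | FI_RMA | FI_ATOMIC;

if ((info->caps & required) != required) {
    fprintf(stderr, "provider does not satisfy the requested profile\n");
    /* fall back to a smaller feature set or a different provider */
}
```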
PIDonFEP – Process Identifier on Fabric Endpoint
Every Fabric Endpoint (FEP) managed by a UET NIC has a Process Identifier on FEP (PIDonFEP), which distinguishes different processes or contexts sharing the same NIC. For example, each PyTorch or MPI rank can create one or more endpoints, and each endpoint’s PIDonFEP ensures that the NIC can demultiplex incoming packets to the correct process context.
PIDonFEP is analogous to an operating system PID but scoped to the fabric device rather than the OS. It is also critical for resource isolation: multiple AI training jobs can share the same NIC hardware while maintaining independent endpoint contexts.
Fabric Address (FA)
The NIC’s Fabric Address (FA) is used for network-level routing and is inserted into the IP header so that packets are delivered to the correct NIC over the backend network. Once a packet reaches the NIC, the PIDonFEP value, carried within the SES header in the data plane, identifies the target process or endpoint on that host. Together, FA and PIDonFEP form a globally unique identifier for the endpoint within the UET fabric, allowing multiple processes and workloads to share the same NIC without conflicts.
The mode of the address (relative or absolute) and the address type (IPv4 or IPv6) are defined by flag bits. Relative addresses, meaningful only within the scope of a job or endpoint group, are commonly used in AI and distributed training workloads to simplify routing and reduce overhead.
Note: Relative addresses typically include a Job Identifier to distinguish endpoints belonging to different jobs, in addition to the PIDonFEP and, optionally, an index. This ensures uniqueness within the job while keeping addresses lightweight.
Note: Details of data plane encapsulation, including the structure of headers and how fields like PIDonFEP are carried, are explained in upcoming chapters on data transport.
Start Resource Index and Num Resource Indices
Each endpoint is allocated a range of Resource Indices. The Start Resource Index indicates the first index assigned to the endpoint, and Num Resource Indices specifies how many consecutive indices are reserved.
This range acts as a local namespace for all objects associated with the endpoint, including memory regions, completion queues, and transmit/receive contexts. When a remote peer performs an operation such as an RMA write (fi_write()) targeting a specific index, the NIC can resolve the correct resource entirely in hardware without software intervention.
For example, if an endpoint’s Start Resource Index is 300 and Num Resource Indices is 64, all local resources occupy indices 300–363. A remote write to index 312 is immediately mapped to the appropriate memory region or queue, enabling high-throughput, low-latency operations required in large-scale AI training clusters.
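The lookup itself is simple interval arithmetic, sketched below with the numbers from the example:

```c
#include <stdint.h>

/* Sketch of the index-to-resource lookup: an incoming index is valid
 * only inside [start, start + num), and its offset selects the entry. */
static int resolve_resource(uint32_t idx, uint32_t start, uint32_t num,
                            uint32_t *offset)
{
    if (idx < start || idx >= start + num)
        return -1;              /* outside this endpoint's namespace */
    *offset = idx - start;      /* e.g., index 312 - 300 = entry 12 */
    return 0;
}
```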
Initiator ID and Job Identification
The Initiator ID provides an optional identifier for the training job or distributed session owning the endpoint. In AI frameworks, this often corresponds to a Job ID, grouping multiple ranks within the same session. In relative addressing mode, the Job ID forms part of the endpoint address, ensuring that endpoints belonging to different jobs remain distinct even if they share the same NIC and PIDonFEP indices.
Note: Implementation details, including how the Job ID is carried in fi_ep_attr via the auth_key field when auth_key_size = 3, are explained in the upcoming transport chapter.
As described earlier, UET address assignment is a multi-stage process. Figure 4‑7 illustrates the basic flow of how a complete UET address is assigned to an endpoint. While the figure shows only the fi_endpoint() API call, the final fid_ep object box provides an illustrative example of the fully resolved endpoint after the application has also called fi_getname() to retrieve the complete provider-specific UET address.
The focus of this section has been on opening an endpoint, rather than detailing the full addressing schema. Later, in the packet transport chapter, we will examine the UET address structure and its role in data plane operations in greater depth.
Figure 4-7: Objects Creation Process – Endpoint (EP).