Sunday, 12 October 2025

Ultra Ethernet: Creating Endpoint Object

Endpoint Creation and Operation

[Updated 12-October, 2025: Figure & uet addressing section]

In libfabric and Ultra Ethernet Transport (UET), the endpoint, represented by the object fid_ep, serves as the primary communication interface between a process and the underlying network fabric. Every data exchange, whether it involves message passing, remote memory access (RMA), or atomic operations, ultimately passes through an endpoint. It acts as a software abstraction of the transport hardware, exposing a programmable interface that the application can use to perform high-performance data transfers.

Conceptually, an endpoint resembles a socket in the TCP/IP world. However, while sockets hide much of the underlying network stack behind a simple API, endpoints expose far more detail and control. They allow the process to define which completion queues to use, what capabilities to enable, and how multiple communication contexts are managed concurrently. This design gives applications, especially large distributed training frameworks and HPC workloads, direct control over latency, throughput, and concurrency in ways that traditional sockets cannot provide.

Furthermore, socket-based communication typically relies on the operating system’s networking stack and consumes CPU cycles for data movement and protocol handling. In contrast, endpoint communication paths can interact directly with the NIC, enabling user-space data transfers and RDMA operations that bypass the kernel and minimize CPU involvement.


Endpoint Types

Libfabric defines three endpoint types: active, passive, and scalable. Each serves a specific role in communication setup and operation and is created using a dedicated constructor function. Their configuration and behavior are largely determined by the information returned by fi_getinfo(), which populates a structure called fi_info. Within this structure, the subfields fi_ep_attr, fi_tx_attr, and fi_rx_attr define how the endpoint interacts with the provider and the transport hardware.



Endpoint Type        Constructor Function   Typical Use Case
Active Endpoint      fi_endpoint()          Actively sends and receives data once configured and enabled
Passive Endpoint     fi_passive_ep()        Listens for and accepts incoming connection requests
Scalable Endpoint    fi_scalable_ep()       Supports multiple transmit and receive contexts for concurrent communication

Table 4-1: Endpoint Types.


Active Endpoint

The active endpoint is the most common and versatile type. It is created using the fi_endpoint() function, which initializes an endpoint that can actively send and receive data once configured and enabled. The attributes describing its behavior are drawn from the provider’s fi_info structure returned by fi_getinfo(), which provides general provider capabilities and limits. The caps field specifies which operations the endpoint supports, such as message transmission (FI_SEND and FI_RECV), remote memory operations (FI_RMA), and atomic instructions (FI_ATOMIC).

The fi_ep_attr substructure defines the endpoint’s communication semantics. For example, FI_EP_MSG describes a reliable, connection-oriented model similar to TCP, whereas FI_EP_RDM represents a reliable datagram model with connectionless semantics. Active endpoints support only a single transmit and a single receive context. In fi_ep_attr, tx_ctx_cnt and rx_ctx_cnt are typically set to 0, which indicates that the default context should be used. In contrast, scalable endpoints can support multiple TX and RX contexts, enabling concurrent operations across several threads or contexts while sharing the same logical endpoint.

The fi_tx_attr and fi_rx_attr substructures define the operational behavior of the endpoint for sending and receiving. They specify which operations are supported (e.g., FI_SEND, FI_RMA), how many transmit or receive commands the provider should allocate resources for (size), and the message ordering guarantees (msg_order). In figure 4-1, msg_order = 0 indicates no ordering guarantees, meaning the provider does not enforce any specific ordering of operations. These attributes guide the provider in configuring internal queues and hardware resources to meet the requested behavior.

From a transport perspective, active endpoints can operate either in a connection-oriented or connectionless mode. When configured as FI_EP_MSG, the endpoint behaves much like a TCP socket, requiring an explicit connection setup before data transfer can occur. When configured as FI_EP_RDM, the endpoint exchanges messages with peers directly through the fabric. Address resolution is handled through the Address Vector table, and the PIDonFEP ensures that packets reach the correct process on the host, eliminating the need for connection handshakes. This mode is particularly useful for distributed workloads, such as collective operations in AI training.

Before an endpoint can be used, it must eventually be bound to supporting objects such as completion queues, event queues, and the Address Vector. Binding and the role of the Address Vector are described in dedicated sections later.
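As an illustration, a minimal sketch of creating and enabling an active endpoint is shown below. It assumes a domain, completion queue, and Address Vector have already been opened, omits error handling, and uses illustrative variable names.

struct fid_ep *ep;

/* Create the active endpoint from the selected fi_info entry */
fi_endpoint(domain, info, &ep, NULL);

/* Bind it to the completion queue and Address Vector (both covered later) */
fi_ep_bind(ep, &cq->fid, FI_TRANSMIT | FI_RECV);
fi_ep_bind(ep, &av->fid, 0);

/* Enable the endpoint; it can now post transmit and receive operations */
fi_enable(ep);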


Passive Endpoint

The passive endpoint, created with fi_passive_ep(), performs a role analogous to a listening socket in TCP. It does not transmit data itself but instead waits for incoming connection requests from remote peers. When a connection request arrives, it triggers an event in the associated Event Queue, which the application can monitor. Based on this event, the application can decide whether to accept or reject the request by calling fi_accept() or fi_reject().

Because passive endpoints exist solely for connection management, they always operate in FI_EP_MSG mode and do not perform data transmission. Consequently, they only need to bind to an Event Queue, which receives notifications of connection attempts, completions, and related state changes. Once a connection is accepted, the provider automatically creates a corresponding active endpoint that can then send and receive data. In this process, the Address Vector is updated to include the new peer’s address, ensuring that subsequent communication can occur transparently.
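A hedged sketch of this flow is shown below. It assumes the fabric, domain, and Event Queue handles already exist and omits error handling; in the usual libfabric pattern, the application itself creates the data endpoint from the fi_info carried in the connection-request event before accepting.

struct fid_pep *pep;
struct fi_eq_cm_entry entry;
uint32_t event;

fi_passive_ep(fabric, info, &pep, NULL);    /* listening endpoint */
fi_pep_bind(pep, &eq->fid, 0);              /* connection events arrive on this EQ */
fi_listen(pep);

fi_eq_sread(eq, &event, &entry, sizeof(entry), -1, 0);   /* block until an event arrives */
if (event == FI_CONNREQ) {
    struct fid_ep *ep;
    fi_endpoint(domain, entry.info, &ep, NULL);   /* active endpoint for this connection */
    fi_ep_bind(ep, &eq->fid, 0);
    fi_enable(ep);
    fi_accept(ep, NULL, 0);                       /* or reject the request instead */
}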


Scalable Endpoint

The scalable endpoint represents the most advanced and parallelized endpoint type. It is constructed with fi_scalable_ep() and is designed for applications that require massive concurrency, such as distributed AI training jobs running on multiple GPUs or multi-threaded inference servers.

Unlike a regular endpoint, a scalable endpoint can be subdivided into multiple transmit (TX) and receive (RX) contexts. Each context acts like an independent communication lane with its own completion queues and state, allowing multiple threads to perform concurrent operations without locking or contention. In essence, scalability in this context means that a single endpoint object can represent many lightweight communication contexts that share the same underlying hardware resources.

The number of transmit and receive contexts for a scalable endpoint is specified in fi_ep_attr using tx_ctx_cnt and rx_ctx_cnt. The provider validates the requested counts against the domain’s capabilities (max_ep_tx_ctx and max_ep_rx_ctx), ensuring that the requested contexts do not exceed the hardware limits. The capabilities field (caps) indicates support for features such as FI_SHARED_CONTEXT and FI_MULTI_RECV, which are required for efficient scalable endpoint operation.

From a conceptual point of view, this arrangement can be compared to a single process maintaining multiple sockets, each used by a separate thread to communicate independently. However, scalable endpoints achieve this concurrency without the overhead of creating and maintaining separate endpoints per thread. Each context binds to its own completion queue, enabling parallel operation, while the scalable endpoint as a whole binds only once to the Event Queue and Address Vector. This design allows high-throughput, low-latency communication suitable for GPU-parallelized data exchange, where multiple communication streams operate concurrently between the same set of nodes.
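The sketch below illustrates this arrangement under a few assumptions: four TX and four RX contexts, one completion queue per context (the tx_cq[i] and rx_cq[i] handles are assumed to exist), and no error handling.

struct fid_ep *sep, *tx[4], *rx[4];

info->ep_attr->tx_ctx_cnt = 4;              /* request the contexts before opening the endpoint */
info->ep_attr->rx_ctx_cnt = 4;
fi_scalable_ep(domain, info, &sep, NULL);
fi_scalable_ep_bind(sep, &av->fid, 0);      /* the AV is bound once, at the endpoint level */

for (int i = 0; i < 4; i++) {
    fi_tx_context(sep, i, NULL, &tx[i], NULL);    /* per-thread transmit context */
    fi_rx_context(sep, i, NULL, &rx[i], NULL);    /* per-thread receive context */
    fi_ep_bind(tx[i], &tx_cq[i]->fid, FI_TRANSMIT);
    fi_ep_bind(rx[i], &rx_cq[i]->fid, FI_RECV);
    fi_enable(tx[i]);
    fi_enable(rx[i]);
}
fi_enable(sep);

Each tx[i]/rx[i] pair can then be driven by its own thread without locking, while the endpoint as a whole shares one Address Vector.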


UET Endpoint Address

When a UET endpoint is created, it receives a UET Endpoint Address, which uniquely identifies it within the fabric. Unlike a simple IP or MAC-style address, this is a composite structure that encodes several critical attributes of the endpoint and its execution context. The fields defined in UET Specification v1.0 include Version, Flags, Fabric Endpoint Capabilities, PIDonFEP, Fabric Address (FA), Start Resource Index, Num Resource Indices, and Initiator ID. Together, these fields describe both where the endpoint resides and how it interacts with other endpoints across the transport layer.

Address Assignment Process

The UET address assignment is a multi-stage process that completes when the endpoint is opened and the final address is retrieved. During the fi_ep_open() call, the provider uses the information from the fi_info structure and its substructures, including the transmit and receive attributes, to form the basic addressing context. At this stage, the provider contacts the UET kernel driver and finalizes the internal endpoint address by adding UET-specific fields such as the JobID, PIDonFEP, Initiator ID, and resource indices.

Although these fields are now part of the provider’s internal address structure, the application does not yet have access to them. To retrieve the complete and fully resolved UET address, the application must call the fi_getname() API. This call returns the composite address containing all the fields that were assigned during the fi_ep_open() operation.

In summary, fi_ep_open() finalizes the endpoint’s address internally, while fi_getname() exposes this finalized address to the application.
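A minimal sketch of this retrieval step, assuming the endpoint has already been opened (the buffer size is illustrative):

char addr[128];                    /* opaque, provider-specific UET address bytes */
size_t addrlen = sizeof(addr);

int ret = fi_getname(&ep->fid, addr, &addrlen);
if (ret == -FI_ETOOSMALL) {
    /* addrlen now reports the required size; grow the buffer and retry */
}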

Version and Flags

The Version field ensures compatibility between different revisions of the UET library or NIC firmware, allowing peers to interpret the endpoint address structure correctly.

The Flags field defines both the validity of other fields in the endpoint address and the addressing behavior of the endpoint. It indicates whether the Fabric Endpoint Capabilities, PIDonFEP, Resource Index, and Initiator ID fields are valid. It also indicates which addressing mode is used (relative or absolute), the type of address in the Fabric Address (IPv4 or IPv6), and whether the maximum message size is limited to the domain object’s (NIC) MTU.

From a transport perspective, the flags make the endpoint address self-descriptive: peers can parse it to know which fields are valid and how messages should be addressed without additional negotiation. This is especially important in large distributed AI clusters, where thousands of endpoints are initialized and need to communicate efficiently.

Fabric Endpoint Capabilities

The Fabric Endpoint Capabilities field complements the flags by describing what the endpoint can actually do. It indicates the supported operational profile of the endpoint. In UET, three profiles are defined:

  • AI Base Profile: Provides the minimal feature set for distributed AI workloads, including reliable messaging, basic RDMA operations, and standard memory registration.
  • AI Full Profile: Extends AI Base with advanced features, such as optimized queue handling and SES header enhancements, which are critical for large-scale, high-throughput AI training.
  • HPC Profile: Optimized for low-latency, synchronization-heavy workloads typical in high-performance computing.

Each profile requires specific capabilities and parameters to be set in the fi_info structure, which the UET provider returns during fi_getinfo(). These include supported operations (such as send/receive, RDMA, or atomics), maximum message sizes, queue depths, memory registration limits, and other provider-specific settings. 

For a complete list of per-profile requirements, see Table 2‑4, “Per-Profile Libfabric Parameter Requirements,” in the UET Specification v1.0.


PIDonFEP – Process Identifier on Fabric Endpoint

Every Fabric Endpoint (FEP) managed by a UET NIC has a Process Identifier on FEP (PIDonFEP), which distinguishes different processes or contexts sharing the same NIC. For example, each PyTorch or MPI rank can create one or more endpoints, and each endpoint’s PIDonFEP ensures that the NIC can demultiplex incoming packets to the correct process context.

PIDonFEP is analogous to an operating system PID but scoped to the fabric device rather than the OS. It is also critical for resource isolation: multiple AI training jobs can share the same NIC hardware while maintaining independent endpoint contexts.


Fabric Address (FA)

The NIC’s Fabric Address (FA) is used for network-level routing and is inserted into the IP header so that packets are delivered to the correct NIC over the backend network. Once a packet reaches the NIC, the PIDonFEP value, carried within the Session header in the data plane, identifies the target process or endpoint on that host. Together, FA and PIDonFEP form a globally unique identifier for the endpoint within the UET fabric, allowing multiple processes and workloads to share the same NIC without conflicts.

The mode of the address (relative or absolute) and the address type (IPv4 or IPv6) are defined by flag bits. Relative addresses, meaningful only within the scope of a job or endpoint group, are commonly used in AI and distributed training workloads to simplify routing and reduce overhead.

Note: Relative addresses typically include a Job Identifier to distinguish endpoints belonging to different jobs, in addition to the PIDonFEP and, optionally, an index. This ensures uniqueness within the job while keeping addresses lightweight.

Note: Details of data plane encapsulation, including the structure of headers and how fields like PIDonFEP are carried, are explained in upcoming chapters on data transport.


Start Resource Index and Num Resource Indices

Each endpoint is allocated a range of Resource Indices. The Start Resource Index indicates the first index assigned to the endpoint, and Num Resource Indices specifies how many consecutive indices are reserved.

This range acts as a local namespace for all objects associated with the endpoint, including memory regions, completion queues, and transmit/receive contexts. When a remote peer performs an RMA operation such as fi_write() targeting a specific index, the NIC can resolve the correct resource entirely in hardware without software intervention.

For example, if an endpoint’s Start Resource Index is 300 and Num Resource Indices is 64, all local resources occupy indices 300–363. A remote write to index 312 is immediately mapped to the appropriate memory region or queue, enabling high-throughput, low-latency operations required in large-scale AI training clusters.


Initiator ID and Job Identification

The Initiator ID provides an optional identifier for the training job or distributed session owning the endpoint. In AI frameworks, this often corresponds to a Job ID, grouping multiple ranks within the same session. In relative addressing mode, the Job ID forms part of the endpoint address, ensuring that endpoints belonging to different jobs remain distinct even if they share the same NIC and PIDonFEP indices.

Note: Implementation details, including how the Job ID is carried in fi_ep_attr via the auth_key field when auth_key_size = 3, are explained in the upcoming transport chapter.

As described earlier, UET address assignment is a multi-stage process. Figure 4‑7 illustrates the basic flow of how a complete UET address is assigned to an endpoint. While the figure shows only the fi_ep_open() API call, the final fid_ep object box provides an illustrative example of the fully resolved endpoint after the application has also called fi_getname() to retrieve the complete provider-specific UET address.

The focus of this section has been on opening an endpoint, rather than detailing the full addressing schema. Later, in the packet transport chapter, we will examine the UET address structure and its role in data plane operations in greater depth.





Figure 4-7: Objects Creation Process – Endpoint (EP).

Wednesday, 8 October 2025

Ultra Ethernet: Fabric Object - What it is and How it is created

Fabric Object


Fabric Object Overview

In libfabric, a fabric represents a logical network domain, a group of hardware and software resources that can communicate with each other through a shared network. All network ports that can exchange traffic belong to the same fabric domain. In practice, a fabric corresponds to one interconnected network, such as an Ethernet or Ultra Ethernet Transport (UET) fabric.

A good way to think about a fabric is to compare it to a Virtual Data Center (VDC) in a cloud environment. Just as a VDC groups together compute, storage, and networking resources into an isolated logical unit, a libfabric fabric groups together network interfaces, addresses, and transport resources that belong to the same communication context. Multiple fabrics can exist on the same system, just like multiple VDCs can operate independently within one cloud infrastructure.

The fabric object acts as the top-level context for all communication. Before an application can create domains, endpoints, or memory regions, it must first open a fabric using the fi_fabric() call. This creates the foundation for all other libfabric objects.

Each fabric is associated with a specific provider, for example, libfabric-uet, which defines how the fabric interacts with the underlying hardware and network stack. Once created, the fabric object maintains provider-specific state, hardware mappings, and resource visibility for all subsequent objects created under it.

For the application, the fabric object is simply a handle to a network domain that other libfabric calls will use. For the provider, it is the root structure that connects all internal data structures and controls how communication resources are managed within the same network fabric.

The following section explains how the application requests a fabric object and how the provider and libfabric core work together to create and publish it.


Creating Fabric Object

After the UET provider populates the fi_info structures for each NIC/port combination during discovery, the application can begin creating objects. It first consults the fi_info list to identify the entry that best matches its requirements. Figure 4-3 shows an illustrative example of how the application calls fi_fabric() to request a fabric object based on the fi_fabric_attr sub-structure of fi_info[0] corresponding to NIC Eth0.

Once the application issues the API call, the request is handed over to the libfabric core, which acts as a lightweight coordinator and translator between the application and the provider.

The UET provider receives the request via the uet_fabric function pointer. It may first verify that the requested fabric name is still valid and supported by NIC 0 before consulting the fi_fabric_attr structure specified in the API call. The provider then creates the fabric object, defining its type, fabric ID (fid), provider, and fabric name.
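A minimal sketch of this call, assuming info points to the selected entry (fi_info[0] in the figure):

struct fid_fabric *fabric;

int ret = fi_fabric(info->fabric_attr, &fabric, NULL);
if (ret)
    fprintf(stderr, "fi_fabric failed: %s\n", fi_strerror(-ret));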


Figure 4-3: Objects Creation Process – Fabric for Cluster.

Memory Allocation and Object Publication


In libfabric, memory for fabric objects — as well as for all other objects created later, such as domains, endpoints, and memory regions — is allocated by the provider, not the libfabric core. The provider is responsible for creating the actual data structures that represent these objects and for mapping them to underlying hardware resources. This ensures that the object’s state, capabilities, and hardware associations are maintained correctly and consistently.

When the provider creates a fabric object, it allocates memory in its own address space and initializes all internal fields, including type, fabric ID (fid), provider-specific metadata, and associated NIC resources. Once the object is fully initialized, the provider returns a pointer to the libfabric core. This pointer effectively tells the core the location of the object in memory.

The libfabric core then wraps this provider pointer in a lightweight descriptor called fid_fabric, which acts as the application-visible handle for the fabric object. This descriptor contains metadata and a reference to the provider-managed object, allowing the core to track and route subsequent API calls correctly without duplicating the object. The core stores the fid_fabric handle in its internal tables, enabling fast lookup and validation whenever the application references the fabric in later calls.
Finally, the libfabric core returns the fid_fabric handle to the application. From the application’s perspective, this handle uniquely identifies the fabric object, while internally the provider maintains the persistent state and hardware mappings.


Tuesday, 7 October 2025

Ultra Ethernet: Discovery

Updated 8-October 2025

Creating the fi_info Structure

Before the application can discover what communication services are available, it first needs a way to describe what it is looking for. This description is built using a structure called fi_info. The fi_info structure acts like a container that holds the application’s initial requirements, such as desired endpoint type or capabilities.

The first step is to reserve memory for this structure in the system’s main memory. The fi_allocinfo() helper function does this for the application. When called, fi_allocinfo() allocates space for a new fi_info structure, which this book refers to as the pre-fi_info, that will later be passed to the libfabric core for matching against available providers.

At this stage, most of the fields inside the pre-fi_info structure are left at their default values. The application typically sets only the most relevant parameters that express what it needs, such as the desired endpoint type or provider name, and leaves the rest for the provider to fill in later.

In addition to the main fi_info structure, the helper function also allocates memory for a set of sub-structures. These describe different parts of the communication stack, including the fabric, domain, and endpoints. The fixed sub-structures include:

  • fi_fabric_attr: Describes the fabric-level properties such as the provider’s name and fabric identifier.
  • fi_domain_attr: Defines attributes related to the domain, including threading and data progress behavior.
  • fi_ep_attr: Specifies endpoint-related characteristics such as endpoint type.
  • fi_tx_attr and fi_rx_attr: Contain parameters for transmit and receive operations, such as message ordering and completion behavior.
  • fi_nic: Represents the properties and capabilities of the UET network interface.

In figure 4-1, these are labeled as fixed sub-structures because their layout and meaning are always the same. They consist of predefined fields and expected value types, which makes them consistent across different applications. Like the main fi_info structure, they usually remain at their default values until the provider fills them in. The information stored in these sub-structures will later be leveraged when the application begins creating actual fabric objects, such as domains and endpoints.

In addition to the fixed parts, the fi_info structure can contain generic sub-structures such as src_addr. Unlike the fixed ones, the generic sub-structure src_addr depends on the chosen addr_format. For example, when using Ultra Ethernet Transport, the address field points to a structure describing a UET endpoint address, which includes fields for Version, Flags, Fabric Endpoint Capabilities, PIDonFEP, Fabric Address, Start Resource Index, Num Resource Indices, and Initiator ID. This compact representation carries both addressing and capability information, allowing the same structure definition to be reused across different transport technologies and addressing schemes. Note that in Figure 4-2 the returned src_addr is only partially filled because the complete address information is not available until the endpoint is created.

In Figure 4-1, the application defines its communication requirements in the fi_info structure by setting the caps (capabilities) field. This field describes the types of operations the application intends to perform through the fabric interface. For example, values such as FI_MSG, FI_RMA, FI_WRITE, FI_REMOTE_WRITE, FI_COLLECTIVE, FI_ATOMIC, and FI_HMEM specify support for message-based communication, remote memory access, atomic operations, and host memory extensions.

When the fi_getinfo() call is issued, the provider compares these requested capabilities against what the underlying hardware and driver can support. Only compatible providers return a matching fi_info structure.

In this example, the application also sets the addr_format field to FI_ADDR_UET, indicating that Ultra Ethernet Transport endpoint addressing is used. This format includes hardware-specific addressing details beyond a simple IP address.

The current Ultra Ethernet Transport specification v1.0 does not define or support the FI_COLLECTIVE capability. Therefore, the UET provider does not return this flag, and collective operations are not offloaded or accelerated by the UET NIC.

After fi_allocinfo() has allocated memory for both the fixed and generic sub-structures, it automatically links them together by inserting pointers into the main fi_info structure. The application can then easily access each attribute through fi_info without manually handling the memory layout. 
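As an illustration, a hedged sketch of preparing such a pre-fi_info (hints) structure is shown below. The capability and address-format values mirror those discussed in this section; the FI_EP_RDM endpoint type and the provider name string are illustrative assumptions.

struct fi_info *hints = fi_allocinfo();      /* allocates fi_info plus its fixed sub-structures */

hints->caps          = FI_MSG | FI_RMA | FI_WRITE | FI_REMOTE_WRITE | FI_ATOMIC | FI_HMEM;
hints->addr_format   = FI_ADDR_UET;          /* Ultra Ethernet Transport endpoint addressing */
hints->ep_attr->type = FI_EP_RDM;            /* assumed: reliable datagram endpoint */
hints->fabric_attr->prov_name = strdup("uet");   /* assumed provider name string */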

Once the structure is prepared, the next step is to request matching provider information using fi_getinfo() API call, which will be described in detail in the following section.

Figure 4-1: Discovery: Allocate Memory and Create Structures – fi_allocinfo.

Requesting Provider Services with fi_getinfo()


After creating a pre-fi_info structure, the application calls the fi_getinfo() API to discover which services and transport features are available on the node’s NIC(s). This function takes a pointer to the pre-fi_info structure, which contains hints describing the application’s requirements, such as desired capabilities and address format.

When the discovery request reaches the libfabric core, the library identifies and loads an appropriate provider from the available options, which may include providers for TCP, verbs (InfiniBand), or sockets (TCP/UDP). For Ultra Ethernet Transport, the core selects the UET provider. The core invokes the provider’s entry points, including the .getinfo callback, which is responsible for returning the provider’s supported capabilities. Internally, this callback is implemented by the provider’s uet_getinfo() function.

Inside the UET provider, the uet_getinfo() routine queries the NIC driver or kernel interface to determine what capabilities each NIC can safely and efficiently support. The provider does not access the hardware directly. For multi-GPU AI workloads, the focus is on push-based remote memory access operations:

  • FI_MSG: Used for standard message-based communication.
  • FI_RMA: Enables direct remote memory access, forming the foundation for high-performance gradient or parameter transfers between GPUs.
  • FI_WRITE: Allows a GPU to push local gradient updates directly into another GPU’s memory.
  • FI_REMOTE_WRITE: Signals that remote GPUs can write directly into this GPU’s memory, supporting push-based collective operations.
  • FI_COLLECTIVE: Indicates support for collective operations like AllReduce, though the current UET specification does not implement this capability.
  • FI_ATOMIC: Allows atomic operations on remote memory.
  • FI_HMEM: Marks support for host memory or GPU memory extensions.

Low-level hardware metrics, such as link speed or MTU, are not returned at this phase; the focus is on semantic capabilities that the application can rely on. 

The provider allocates new fi_info structures in CPU memory, creating one structure per NIC that satisfies the hints provided by the application and describes all other supported services.

After the provider has created these structures, libfabric returns them to the application as a linked list. The next pointer links all available fi_info structures, allowing the application to iterate over the discovered NICs. Each fi_info entry contains both top-level fields, such as caps and addr_format, and several attribute sub-structures—fi_fabric_attr, fi_domain_attr, fi_ep_attr, fi_tx_attr, and fi_rx_attr.

Even if the application provides no hints, the provider fills in these attribute groups with its default or supported values. This ensures that every fi_info structure returned by fi_getinfo() contains a complete description of the provider’s capabilities and configuration options. During object creation, the libfabric core passes these attributes to the provider, which uses them to map the requested fabric, domain, and endpoint objects to the appropriate NIC and transport configuration.


The application can then select the most appropriate entry and request the creation of the Fabric object. Creating the Fabric first establishes the context in which all subsequent domains, sub-resources, and endpoints are organized and managed. Once all pieces of the AI Fabric “jigsaw puzzle” abstraction have been created and initialized, the application can release the memory for all fi_info structures in the linked list by calling fi_freeinfo().
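A hedged sketch of this discovery-and-selection flow, continuing from the hints structure prepared earlier (the requested API version is illustrative):

struct fi_info *info_list = NULL;

fi_getinfo(FI_VERSION(2, 0), NULL, NULL, 0, hints, &info_list);

/* Walk the linked list: one entry per matching NIC/port */
for (struct fi_info *cur = info_list; cur; cur = cur->next)
    printf("provider=%s fabric=%s domain=%s\n",
           cur->fabric_attr->prov_name,
           cur->fabric_attr->name,
           cur->domain_attr->name);

/* ... create the fabric, domain, and endpoint objects from the chosen entry ... */

fi_freeinfo(info_list);      /* release the whole list once the objects exist */
fi_freeinfo(hints);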





Figure 4-2: Discovery: Discover Provider Capabilities – fi_getinfo.


Note: Fields and values of fi_info and its sub-structures are explained in upcoming chapters.

Next, Fabric Object... 


Thursday, 2 October 2025

Ultra Ethernet: Address Vector (AV)

 Address Vector (AV)

The Address Vector (AV) is a provider-managed mapping that connects remote fabric addresses to compact integer handles (fi_addr_t) used in communication operations. Unlike a routing table, the AV does not store IP-to-device mappings. Instead, it converts an opaque Fabric Address (FA), which may contain IP, port, and transport-specific identifiers, into a simple handle that endpoints can use for sending and receiving messages. The application never needs to reference the raw IP addresses directly.

Phase 1: Application – Request & Definition

The application begins by requesting an Address Vector (AV) through the fi_av_open() call. To do this, it first defines the desired AV properties in a fi_av_attr structure:

int fi_av_open(struct fid_domain *domain, struct fi_av_attr *attr, struct fid_av **av, void *context);

struct fi_av_attr av_attr = {
    .type        = FI_AV_TABLE,
    .count       = 16,
    .rx_ctx_bits = 0,
    .ep_per_node = 1,
    .name        = "my_av",
    .map_addr    = NULL,
    .flags       = 0
};

Example 4-1: structure fi_av_attr.


fi_av_attr Fields

type: Specifies the type of Address Vector. In Ultra Ethernet, the most common choice is FI_AV_TABLE, which organizes peer addresses in a simple, contiguous table. Each inserted Fabric Address (FA) receives a sequential fi_addr_t starting from zero. The type determines how the provider manages and looks up peer addresses during runtime.

Count: Indicates the expected number of addresses that will be inserted into the AV. The provider uses this value to optimize resource allocation. In most implementations, this also acts as an upper bound—if the application attempts to insert more addresses than specified, the operation may fail. Therefore, the application should set this field according to the anticipated communication pattern, ensuring the AV is sized appropriately without over-allocating memory.

Receive Context Bits (rx_ctx_bits): The rx_ctx_bits field is relevant only for scalable endpoints, which can expose multiple independent receive contexts. Normally, when an address is inserted into the Address Vector (AV), the provider returns a compact handle (fi_addr_t) that identifies the peer. If the peer has multiple receive contexts, the handle must also encode which specific receive context the application intends to target.

The rx_ctx_bits field specifies how many bits of the fi_addr_t are reserved for this purpose. For example, if an endpoint has 8 receive contexts (rx_ctx_cnt = 8), then at least 3 bits are required (2^3 = 8) to distinguish all contexts. These reserved bits allow the application to refer to each receive context individually without inserting duplicate entries into the AV.

The provider uses rx_ctx_bits internally when constructing the fi_addr_t returned by fi_av_insert(). The helper function fi_rx_addr() can then combine a base fi_addr_t with a receive context index to produce the final address used in communication operations.


Example:


Suppose a peer has 4 receive contexts, and the application sets rx_ctx_bits = 2 (2 bits reserved for receive contexts). The base fi_addr_t for the peer returned by the AV might be 0x10. Using the 2 reserved bits, the application can address each receive context as follows:


Peer FA          Base fi_addr_t   Receive Context   Final fi_addr_t
10.0.0.11:7500   0x10             0                 0x10
10.0.0.11:7500   0x10             1                 0x11
10.0.0.11:7500   0x10             2                 0x12
10.0.0.11:7500   0x10             3                 0x13


This approach keeps the AV compact while allowing fine-grained targeting of receive contexts. The application never needs to manipulate raw IP addresses or transport identifiers; it simply uses the final fi_addr_t values in send, receive, or RDMA operations.
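A hedged sketch of how the helper might be used in this scenario; the exact bit layout of the returned handle is provider-defined, so the application should always go through fi_rx_addr() rather than computing it by hand:

fi_addr_t base   = fi_addrs[0];              /* handle returned by fi_av_insert() */
fi_addr_t target = fi_rx_addr(base, 3, 2);   /* peer's receive context 3, rx_ctx_bits = 2 */

fi_send(ep, buffer, length, NULL, target, NULL);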


ep_per_node: Indicates the expected number of endpoints that will be associated with a given Fabric Address. The provider uses this value to optimize resource allocation. If the number of endpoints per node is unknown, the application can set it to 0. In distributed or parallel applications, this is typically set to the number of processes per node multiplied by the number of endpoints each process will open. This value is a hint rather than a strict limit, allowing the AV to scale efficiently in multi-endpoint configurations.


Name: An optional human-readable system name for the AV. If non-NULL and the AV is opened with write access, the provider may create a shared, named AV, which can be accessed by multiple processes within the same domain on the same node. This feature enables resource sharing in multi-process applications. If sharing is not required, the name can be set to NULL.


map_addr: An optional base address for the AV, used primarily when creating a shared, named AV (FI_AV_MAP) across multiple processes. If multiple processes provide the same map_addr and name, the AV guarantees that the fi_addr_t handles returned by fi_av_insert() are consistent across all processes. The provider may internally memory-map the AV at this address, but this is not required. In single-process AVs, or when the AV is private, this field can be set to NULL.


Flags: Behavior-modifying flags. These can control provider-specific features, such as enabling asynchronous operations or special memory access patterns. Setting it to 0 selects default behavior.

Once the structure is populated, the application calls fi_av_open(domain, &av_attr, &av, NULL).


The fi_av_open() call allocates a provider-managed fid_av object, which contains a pointer to operations such as insert(), remove(), and lookup(), along with an internal mapping table that translates integer handles into transport-specific Fabric Addresses. At this stage, the AV exists independently of endpoints or peers and contains no entries; it is a blank, ready-to-populate mapping structure.


The AV type, described in Example 4-1, determines how the provider organizes and manages peer addresses. In Ultra Ethernet, FI_AV_TABLE is the most common choice. Conceptually, it is a simple, contiguous table that maps each peer’s Fabric Address to a sequential integer handle (fi_addr_t). Each inserted FA receives a handle starting from zero, and the application never accesses the raw IP or transport identifiers directly, relying solely on the fi_addr_t to refer to peers. From the Ultra Ethernet perspective, FI_AV_TABLE is ideal because it allows constant-time lookup of a peer’s FA. When the application posts a send, receive, or memory operation, the provider translates the fi_addr_t into the transport-specific FA efficiently. Internally, each table entry can also reference provider-managed resources, such as transmit and receive contexts or resource indexes, making runtime operations lightweight and predictable. 


The application should select attribute values in line with the provider’s capabilities reported by fi_getinfo(), considering the expected number of peers and any provider-specific recommendations. This ensures that the AV is sized appropriately for the intended communication pattern and supports efficient, opaque addressing throughout the system.

Example: Mapping fi_addr_t index to peer FA and Rank



fi_addr_t   Peer Rank   Fabric Address (FA)
0           1           10.0.0.11:7500
1           2           10.0.0.12:7500
2           3           10.0.0.13:7500


Each inserted Fabric Address receives a sequential integer handle (fi_addr_t) starting from zero. The application then uses this handle in communication operations, for example:


fi_addr_t peer2 = fi_addrs[1];  // index 1 corresponds to rank 2

fi_send(ep, buffer, length, NULL, peer2, 0);


This mapping allows the application to refer to peers purely by fi_addr_t handles, avoiding any direct manipulation of IP addresses or transport identifiers, and enabling efficient, opaque peer referencing in the data path.


Phase 2: Provider – Validation & Limit Check.

After the application requests an AV using fi_av_open(), the provider validates the requested attributes against the capabilities reported in the domain attributes (fi_domain_attr) returned earlier by fi_getinfo(). The fid_domain serves as the context for AV creation, but the actual limits and supported features come from the domain attributes.

The provider performs the following checks to ensure the AV configuration is compatible with its supported limits:

  • type is compared against fi_info structure -> fi_domain_attr sub-structure -> av_type to verify that the requested AV organization (for example, FI_AV_TABLE) is supported.
  • count is checked to ensure it does not exceed the maximum number of entries the provider can manage efficiently.
  • rx_ctx_bits and ep_per_node are validated against domain context limits, such as rx_ctx_cnt, tx_ctx_cnt, and max_ep_*_ctx values, to guarantee that the requested receive context and endpoint configuration can be supported.
  • flags are compared against fi_info->caps and fi_info->mode to confirm that the requested behaviors are allowed.

Note: The caps field lists the capabilities supported by the provider that an application may request, such as RMA, atomics, or tagged messaging. The mode field, in contrast, specifies mandatory requirements imposed by the provider. If a mode bit is set, the application must comply with that constraint to use the provider. Typical examples include requiring all memory buffers to be explicitly registered (FI_LOCAL_MR) or the use of struct fi_context with operations (FI_CONTEXT). 

  • name and map_addr are optional; the provider may validate them if the AV is intended to be shared or named, but they can also be ignored for local/private AVs.

If all checks pass, the provider allocates the AV and returns a handle (fid_av *) to the application. At this stage, the AV exists as an empty container, independent of any endpoints or peer Fabric Addresses. It is ready to be populated with addresses in the subsequent phase.

Phase 3: Population and distribution of FAs.

Once the AV has been created and validated, it must be populated with the Fabric Addresses (FAs) of all peers that the application will communicate with. Initially, each process only knows its own local FA. To enable inter-process communication, these local FAs must be exchanged and distributed to all ranks, typically using an out-of-band control channel such as a bootstrap TCP connection.

In a common master-collect model, the master process (Rank 0) collects the local FAs from all worker processes and constructs a global map of {Rank → FA}. The master then broadcasts this global mapping back to all ranks. Each process inserts the received FAs into its AV using fi_av_insert().

Each inserted FA is assigned a sequential integer handle (fi_addr_t) starting from zero. These handles are then used in all subsequent communication operations, so the application never directly references IP addresses or transport-specific identifiers.

Example: Populating the AV with peer FAs

/* requires <netinet/in.h> and <arpa/inet.h> */
struct sockaddr_in peers[3] = {0};
const char *ips[3] = { "10.0.0.11", "10.0.0.12", "10.0.0.13" };   /* ranks 1, 2, 3 */

for (int i = 0; i < 3; i++) {
    peers[i].sin_family = AF_INET;
    peers[i].sin_port   = htons(7500);
    inet_pton(AF_INET, ips[i], &peers[i].sin_addr);
}

fi_addr_t fi_addrs[3];
ret = fi_av_insert(av, peers, 3, fi_addrs, 0, NULL);

Note on Local Fabric Address: In most Ultra Ethernet applications, the local Fabric Address (FA) is not inserted into the Address Vector (AV_TABLE). This is because it is rare for a process to communicate with itself; endpoints typically only need to send or receive messages to remote peers. As a result, the AV contains entries only for remote FAs, each mapped to a sequential fi_addr_t handle. The local FA remains known to the process but does not occupy an AV entry, keeping the mapping compact and efficient. If an application does require self-communication, the local FA can be explicitly inserted into the AV, but this is an uncommon scenario.


Once this step is complete, the AV contains the full mapping of all ranks to their assigned fi_addr_t handles, including the master itself. The application can now communicate with any peer by passing its fi_addr_t to libfabric operations.

Example: Using fi_addr_t in a send operation

fi_addr_t peer2 = fi_addrs[1];  // index 1 corresponds to rank 2

fi_send(ep, buffer, length, NULL, peer2, 0);

This design allows all communication operations to reference peers via compact, opaque integer handles, keeping the data path efficient and abstracted from transport-specific details.


Phase 4: Runtime usage.

After the AV is populated with all peer Fabric Addresses (FAs), the provider handles all runtime operations transparently. The fi_addr_t handles inserted by the application are used internally to direct traffic to the correct peer, hiding the underlying transport details.

  • Send and Receive Operations: When the application posts a send, receive, or RDMA operation, the provider translates the fi_addr_t handle into the transport-specific FA. The application never needs to reference IP addresses, ports, or other identifiers directly.
  • Updating the AV: If a peer’s FA needs to be removed or replaced, the provider uses its AV operations (fi_ops_av->remove() and fi_ops_av->insert()) to update the mapping seamlessly. These updates maintain the integrity of the fi_addr_t handles across the application.
  • Resource References: Each table entry may internally reference provider-managed resources, such as transmit/receive contexts or reserved bits for scalable endpoints. This ensures that runtime operations are lightweight and predictable.
  • Consistency Across Processes: For shared or named AVs, the provider ensures that fi_addr_t handles are consistent across processes, allowing communication to function correctly even in distributed, multi-node applications.


At this stage, the AV is fully functional. The application can perform communication with any known peer using the opaque fi_addr_t handles. This design keeps the data path efficient, predictable, and independent of transport-specific details, just like the Completion Queue and Event Queue abstractions.

Example fid_av (Address Vector) – Illustrative


fid_av {
    fid_type        : FI_AV
    fid             : 0xF1AD01
    parent_fid      : 0xF1DD001
    provider        : "libfabric-uet"
    type            : FI_AV_TABLE
    count           : 16
    rx_ctx_bits     : 0
    ep_per_node     : 1
    flags           : 0x00000000
    name            : "my_av"
    map_addr        : 0x0
    ops             : 0xF1AA10
    fi_addr_table   : [0..count-1]
}

Explanation of fields:

Note: Some of these fields were previously explained as part of the fi_av_attr structure.


Identity

  • fid_type: Specifies the type of object. For an Address Vector, this is always FI_AV. This allows the provider and application to distinguish AVs from other objects like fid_cq, fid_eq, or fid_domain.
  • Fid: A unique identifier for this AV instance, typically assigned by the provider. It serves as a handle for internal management and debugging, similar to fid_cq or fid_eq.
  • parent_fid: Points to the parent domain (fid_domain) in which this AV was created. This links the AV to its domain’s resources and capabilities and ensures proper hierarchy in the provider’s object model.
  • Name: Optional human-readable system name for the AV. Used to identify shared AVs across processes within the same domain. Named AVs allow multiple processes to open the same underlying mapping.


Provider Info

  • Provider: A string identifying the provider managing this AV (e.g., "libfabric-uet").
  • Type: Reflects the AV organization, usually set to FI_AV_TABLE in Ultra Ethernet. Conceptually, it defines the AV as a simple contiguous table mapping peer FAs to integer handles (fi_addr_t).
  • Flags: Behavior-modifying options for the AV. These can enable provider-specific features or modify how insert/remove operations behave.
  • Ops: Pointer to the provider-specific operations table (fi_ops_av). This contains function pointers for AV operations such as insert(), remove(), lookup(), and control().


Resource Limits / Configuration

  • Count: The expected number of addresses to be inserted into the AV. This guides the provider in allocating resources efficiently and sizing the internal mapping table. Unlike a hard maximum, this is primarily an optimization hint.
  • rx_ctx_bits: Used with scalable endpoints. Specifies the number of bits in the returned fi_addr_t that identify a target receive context. Ensures the AV can support multiple receive contexts per endpoint when needed.
  • ep_per_node: Indicates how many endpoints are associated with the same Fabric Address. This helps the provider optimize internal data structures when multiple endpoints share the same FA.
  • map_addr: Optional base address used when sharing an AV between processes. For shared AVs, this ensures that the same fi_addr_t handles are returned on all processes opening the AV at the same map_addr.


Internal State

  • fi_addr_table: Internal array or mapping that connects fi_addr_t handles to actual remote Fabric Addresses (FAs). Initially empty after creation, entries are populated during the AV population phase. Each entry allows the application to refer to peers efficiently without directly using IP addresses or transport-specific identifiers.




Figure 4-6: Objects Creation Process – Address Vector.

Tuesday, 30 September 2025

Ultra Ethernet: Completion Queue

Completion Queue Creation (fi_cq_open)


Phase 1: Application – Request & Definition


The purpose of this phase is to define the queue where operation completions will be reported. Completion queues are used to report the completion of operations submitted to endpoints, such as data transfers, RMA accesses, or remote write requests. By preparing a struct fi_cq_attr, the application describes exactly what it needs, so the provider can allocate a CQ that meets its requirements.


Example API Call:

struct fi_cq_attr cq_attr = {
    .size      = 2048,
    .format    = FI_CQ_FORMAT_DATA,
    .wait_obj  = FI_WAIT_FD,
    .flags     = FI_WRITE | FI_REMOTE_WRITE | FI_RMA,
    .data_size = 64
};

struct fid_cq *cq;
int ret = fi_cq_open(domain, &cq_attr, &cq, NULL);


Explanation of fields:

.size = 2048:  The CQ can hold up to 2048 completions. This determines how many completed operations can be buffered before the application consumes them.

.format = FI_CQ_FORMAT_DATA: This setting determines the level of detail included in each completion entry. With FI_CQ_FORMAT_DATA, the CQ entries contain information about the operation, such as the buffer pointer, the length of data, and optional completion data. If the application uses tagged messaging, choosing FI_CQ_FORMAT_TAGGED expands the entries to also include the tag, allowing the application to match completions with specific operations. The format attribute essentially defines the structure of the data returned when reading the completion queue, letting the application control how much information it receives about each completed operation.

.wait_obj = FI_WAIT_FD: Provides a file descriptor for the application to poll or select on; other options include FI_WAIT_NONE (busy polling) or FI_WAIT_SET.

.flags = FI_WRITE | FI_REMOTE_WRITE | FI_RMA: This field is a bitmask specifying which types of completions the application wants the Completion Queue to report. The provider checks these flags against the capabilities of the domain (fid_domain) to ensure they are supported. If a requested capability is not available, fi_cq_open() will fail. This allows the application to control which events are tracked while the provider manages the underlying resources.

Note: You don’t always need to request every completion type. For example, if your application only cares about local sends, you can set the flag for FI_WRITE and skip FI_REMOTE_WRITE or FI_RMA. Limiting the flags reduces the amount of tracking the provider must do, which can save memory and improve performance, while still giving you the information your application actually needs.

.data_size = 64:  Maximum size of immediate data per entry, in bytes, used for RMA or atomic operations.


Phase 2: Provider – Validation & Limits Check

When the application calls fi_cq_open() with a fi_cq_attr structure, the provider validates each attribute against the parent domain’s capabilities (fid_domain):


  • fi_cq_attr.size: compared to the domain’s maximum CQ depth.
  • fi_cq_attr.data_size: compared to the domain’s supported CQ data size.
  • The total number of CQs requested: limited by the domain’s CQ count.
  • fi_cq_attr.flags: each requested capability is checked against the domain’s supported features.

If any requested value exceeds the domain’s limits, the provider may adjust it to the maximum allowed or return an error.


Phase 3: Provider – Creation & Handle Return

The purpose of this phase is to allocate memory and internal structures for the CQ and return a handle to the application. The provider creates the fid_cq object in RAM, associates it with the parent domain (fid_domain), and returns the handle. The CQ is now ready to be bound to endpoints (fi_ep_bind) and used for reporting operation completions.
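Once bound, completions are typically harvested with fi_cq_read(). Below is a minimal polling sketch matching the FI_CQ_FORMAT_DATA format requested above; error handling is simplified.

struct fi_cq_data_entry comp;

ssize_t n = fi_cq_read(cq, &comp, 1);        /* returns the number of entries read */
if (n > 0) {
    /* comp.op_context, comp.len, and comp.data describe the completed operation */
} else if (n == -FI_EAVAIL) {
    struct fi_cq_err_entry err;
    fi_cq_readerr(cq, &err, 0);              /* fetch the error completion details */
}
/* n == -FI_EAGAIN simply means no completion is available yet */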

Example fid_cq (Completion Queue) – Illustrative

fid_cq {
    fid_type        : FI_CQ
    fid             : 0xF1DC601
    parent_fid      : 0xF1DD001
    provider        : "libfabric-uet"
    caps            : FI_WRITE | FI_REMOTE_WRITE | FI_RMA
    size            : 2048
    format          : FI_CQ_FORMAT_DATA
    wait_obj        : FI_WAIT_FD
    flags           : FI_WRITE | FI_REMOTE_WRITE | FI_RMA
    data_size       : 64
    provider_data   : <pointer to provider CQ struct>
    ref_count       : 1
    context         : <app-provided void *>
}

Object Example 4-4: Completion Queue (CQ).

Explanation of fields:

fid_type: Type of object, here CQ.

Fid: Unique handle for the CQ object.

parent_fid: Domain the CQ belongs to.

caps: Capabilities supported by this CQ.

size: Queue depth (number of completion entries).

Format: Structure format for completion entries.

wait_obj: Mechanism to wait for completions.

Flags: Requested capabilities for this CQ.

data_size: Maximum size of immediate data per completion entry.

provider_data: Pointer to provider-internal CQ structure.

ref_count: Tracks references to this object.

context: Application-provided context pointer.

Note: In a Completion Queue (fid_cq), the flags field represents the capabilities requested by the application when calling fi_cq_open() (for example, tracking user events, remote writes, or RMA operations). The provider checks these flags against the capabilities of the parent domain (fid_domain). The caps field, on the other hand, shows the capabilities that the provider actually granted to the CQ. This distinction is important because the provider may adjust or limit requested flags to match what the domain supports. In short:

Flags: what the application asked for.

Caps: what the CQ can actually do.


Why EQs and CQs Reside in Host Memory

Event Queues (EQs) and Completion Queues (CQs) are not data buffers in which application payloads are stored. Instead, they are control structures that track the state of communication. When the application posts an operation, such as sending or receiving data, the provider allocates descriptors and manages the flow of that operation. As the operation progresses or completes, the provider generates records describing what has happened. These records typically include information such as completion status, error codes, or connection management events.

Because the application must observe these records to make progress, both EQs and CQs are placed in host memory where the CPU can access them directly. The application typically calls functions like fi_cq_read() or fi_eq_read() to poll the queue, which means that the CPU is actively checking for new records. If these control structures were stored in GPU memory, the CPU would not be able to efficiently poll them, as each access would require a costly transfer over the PCIe or NVLink bus. The design is therefore intentional: the GPU may own the data buffers being transferred, but the coordination, synchronization, and signaling of those transfers are always managed through CPU-accessible queue structures.

Figure 4-5: Objects Creation Process – Completion Queue.

Ultra Ethernet: Event Queue

Event Queue Creation (fi_eq_open)


Phase 1: Application – Request & Definition


The purpose of this phase is to specify the type, size, and capabilities of the Event Queue (EQ) your application needs. Event queues are used to report events associated with control operations. They can be linked to memory registration, address vectors, connection management, and fabric- or domain-level events. Reported events are either associated with a requested operation or affiliated with a call that registers for specific types of events, such as listening for connection requests. By preparing a struct fi_eq_attr, the application describes exactly what it needs so the provider can allocate the EQ properly.

In addition to basic properties like .size (number of events the queue can hold) and .wait_obj (how the application waits for events), the .flags field can request specific EQ capabilities. Common flags include:


  • FI_WRITE: Requests support for user-inserted events via fi_eq_write(). If this flag is set, the provider must allow the application to invoke fi_eq_write().
  • FI_REMOTE_WRITE: Requests support for remote write completions being reported to this EQ.
  • FI_RMA: Requests support for Remote Memory Access events (e.g., RMA completions) to be delivered to this EQ.

Flags are encoded as a bitmask, so multiple capabilities can be requested simultaneously using bitwise OR.

Example API Call:

struct fi_eq_attr eq_attr = {
    .size     = 1024,
    .wait_obj = FI_WAIT_FD,
    .flags    = FI_WRITE | FI_REMOTE_WRITE | FI_RMA,
};

struct fid_eq *eq;
int ret = fi_eq_open(fabric, &eq_attr, &eq, NULL);


Explanation of the fields in this example:


.size = 1024: The EQ can hold up to 1024 events. This defines the queue depth, i.e., how many events can be buffered by the provider before the application consumes them.

.wait_obj = FI_WAIT_FD: Specifies the mechanism the application will use to wait for events. FI_WAIT_FD means the EQ provides a file descriptor that the application can poll or select on, integrating event waiting into standard OS I/O mechanisms. Other options include FI_WAIT_NONE for busy polling or FI_WAIT_SET to attach the EQ to a wait set.

.flags = FI_WRITE | FI_REMOTE_WRITE | FI_RMA: This field is a bitmask specifying which types of events the application expects the Event Queue to support. FI_WRITE allows user-inserted events, FI_REMOTE_WRITE requests notifications for remote write completions, and FI_RMA requests notifications for RMA operations. The provider checks these flags against its supported capabilities to ensure they can be honored. If a requested capability is not available, fi_eq_open() will fail. In this example, the bitmask is written with descriptive capability names rather than a numeric value, for clarity.


Phase 2: Provider – Validation & Limits Check

The purpose of this phase is to ensure that the requested EQ can be supported by the provider. The provider validates the fi_eq_attr structure against the capabilities it advertised in the fi_info structure returned during discovery. Specifically, the .flags bitmask is checked against fi_info->caps, and each requested capability (FI_WRITE, FI_REMOTE_WRITE, FI_RMA) must be supported. Other checks cover provider limits, such as the maximum number of EQs and the maximum queue depth. If any requested flag or attribute exceeds the provider's limits, the call fails.
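The validation itself happens inside the provider, but the application can perform a rough pre-check of its own against the discovery snapshot before calling fi_eq_open(). The sketch below follows the flags used in the example above; a real provider may enforce additional limits (queue depth, number of EQs) that are not visible in fi_info.

#include <stdio.h>
#include <rdma/fabric.h>
#include <rdma/fi_errno.h>

/* Illustrative application-side pre-check against the discovery snapshot. */
static int precheck_eq_caps(struct fi_info *info)
{
    uint64_t wanted = FI_WRITE | FI_REMOTE_WRITE | FI_RMA;

    if ((info->caps & wanted) != wanted) {
        fprintf(stderr, "provider does not report the requested EQ capabilities\n");
        return -FI_ENOSYS;
    }
    return 0;
}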


Phase 3: Provider – Creation & Handle Return

The purpose of this phase is to allocate memory and internal structures for the EQ and return a usable handle to the application. The provider creates the fid_eq object in RAM, associates it with the parent fabric (fid_fabric), and returns the handle. The EQ is now ready to be bound to endpoints and used for event reporting, as sketched below. Note that the EQ carries control events only; the results of data transfer operations are reported through the completion queue (CQ), where every communication request eventually produces a completion record.
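As a usage sketch, the fragment below binds the EQ from the earlier example to a connection-oriented endpoint and then blocks until one control event arrives. The endpoint variable ep is assumed to have been created already; fi_ep_bind(), fi_eq_sread(), and the FI_CONNECTED event are standard libfabric, while the helper itself is illustrative.

#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>   /* fi_ep_bind() */
#include <rdma/fi_eq.h>         /* fi_eq_sread(), struct fi_eq_cm_entry */

/* Illustrative: route an endpoint's control events to 'eq' and wait for one. */
static int bind_and_wait(struct fid_ep *ep, struct fid_eq *eq)
{
    int ret = fi_ep_bind(ep, &eq->fid, 0);
    if (ret)
        return ret;

    struct fi_eq_cm_entry entry;
    uint32_t event;
    ssize_t rd = fi_eq_sread(eq, &event, &entry, sizeof(entry), -1 /* block */, 0);
    if (rd < 0)
        return (int)rd;

    /* For FI_EP_MSG endpoints, 'event' is typically FI_CONNECTED or FI_SHUTDOWN. */
    return (event == FI_CONNECTED) ? 0 : -1;
}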


Example fid_eq (Event Queue) – Illustrative


fid_eq {

    fid_type        : FI_EQ

    fid             : 0xF1DE601

    parent_fid      : 0xF1DFA01

    provider        : "libfabric-uet"

    caps            : FI_WRITE | FI_REMOTE_WRITE | FI_RMA

    size            : 1024           

    wait_obj        : FI_WAIT_FD

    provider_data   : <pointer to provider EQ struct>

    ref_count       : 1

    context         : <app-provided void *>

}

Object Example 4-3: Event Queue (EQ).


Explanation of fields:


fid_type: Type of object, here EQ.

fid: Unique handle for the EQ object.

parent_fid: Pointer to the fabric it belongs to.

caps: Bitmask of requested/available capabilities: user events, remote write completions, RMA completions.

size: Queue depth (number of events).

wait_obj: Wait mechanism used by application.

provider_data: Pointer to provider-internal EQ structure.

ref_count: Tracks object references for lifecycle management.

context: Application-provided context pointer.


Figure 4-4: Objects Creation Process – Event Queue.

Note: Creating an Event Queue (EQ) differs from fabric or domain creation. Instead of using fi_info to select a provider, fi_eq_open() simply takes an existing fabric handle (fid_fabric). The provider then allocates the EQ’s internal structures in host memory and returns a handle the application can use. This design ensures the CPU can efficiently track events, while the provider manages the details internally.

Sunday, 28 September 2025

Ultra Ethernet: Domain Creation Process in Libfabric

Creating a domain object is the step where the application establishes a logical context for a NIC within a fabric, enabling endpoints, completion queues, and memory regions to be created and managed consistently.

Phase 1: Application (Discovery & choice — selecting a domain snapshot)

During discovery, the provider had populated one or more fi_info entries — each entry was a snapshot describing one possible NIC/port/transport combination. Each fi_info contained nested attribute structures for fabric, domain, and endpoint: fi_fabric_attr, fi_domain_attr, and fi_ep_attr. The fi_domain_attr substructure captured the domain-level template the provider had reported during discovery (memory registration modes, MR key sizes, counts and limits, capability and mode bitmasks, CQ/CTX limits, authentication key sizes, etc.).

When the application had decided which NIC/port it wanted to use, it selected a single fi_info entry whose fi_domain_attr matched its needs. That chosen fi_info became the authoritative configuration for domain creation, containing both the application’s requested settings and the provider-reported capabilities. At this phase, the application moved forward from fabric initialization to domain creation.

To create the domain, the application called the fi_domain function:


API Call → Create Domain object

    Within Fabric ID: 0xF1DFA01

    Using fi_info structure: 0xCAFE43E

    On success: returns fid_domain handle
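Expressed as code, the call above might look like the short sketch below. It assumes fabric is the fid_fabric handle created earlier and info is the fi_info entry selected during discovery; the wrapper function and error print are illustrative, and only fi_domain() and fi_strerror() are libfabric calls.

#include <stdio.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

/* Illustrative wrapper around domain creation. */
static int create_domain(struct fid_fabric *fabric, struct fi_info *info,
                         struct fid_domain **domain)
{
    int ret = fi_domain(fabric, info, domain, NULL);
    if (ret)
        fprintf(stderr, "fi_domain failed: %s\n", fi_strerror(-ret));
    return ret;
}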


Phase 2: Libfabric core (dispatch & validation)

The application calls the domain creation API:

int fi_domain(struct fid_fabric *fabric, struct fi_info *info,

              struct fid_domain **domain, void *context);

What the core does, at a high level:

  • Validate arguments: Ensure fabric is a live fid_fabric handle and info is non-NULL.
  • Sanity-check provider/fabric match: The core checks that the fi_info the application supplied corresponds to the same provider (and, indirectly, the same NIC/port) represented by fabric. This is the first piece of the “glue”: the fid_fabric (published earlier) contains the provider identity and fabric name; fi_info also contains provider/fabric identifiers from discovery. The core rejects or returns an error if the two do not match (this prevents cross-provider or cross-fabric mixes).
  • Forward the call to the provider: The core hands the fi_info (including its fi_domain_attr) and the fabric handle to the provider’s domain creation entry point. The core itself remains lightweight: it performs validation and routing only and does not modify attributes or allocate hardware resources; the provider does the heavy lifting of mapping attributes onto hardware.


Phase 3: UET provider (mapping to NIC / resource allocation)

The provider receives the fabric handle (so it knows which NIC/port and which provider instance to use) and the fi_info/fi_domain_attr descriptor. The provider:

  • Interprets the domain attributes and verifies they are feasible given the NIC hardware, driver state and current configuration. For example: requested MR key size, number of CQ/CTXs, per-endpoint limits, requested capability bitmask.
  • Allocates driver / NIC resources or driver contexts that correspond to a domain: memory-registration state, structures for completion queues, context objects for send/recv, and any other provider-private handles.
  • Fails early on a mismatch (for example, the NIC has been removed, the driver does not support a requested capability, or the requested limits exceed available resources).

Because the fi_info came from discovery for that NIC port, the provider immediately knows the physical mapping. The created domain is a logical handle for accessing the NIC, or more precisely the NIC/port context the provider manages. In other words, the domain is the provider’s logical handle to NIC resources (memory registration tables, per-device queues, etc.). The exact mapping to hardware structures may vary by provider implementation, but it typically corresponds one-to-one or one-to-few with real NIC ports.


Phase 4: Libfabric core (fid_domain publication & hierarchy)

On successful provider creation, the provider returns a provider-private handle (pointer) to the domain state. The libfabric core then:

  • Wraps the provider handle into an application-visible fid_domain object.
  • Links the fid_domain to its parent fid_fabric (the fabric FID is stored as the domain’s parent/owner). This is the second piece of the “glue”: the created fid_domain explicitly references the fid_fabric that it belongs to, so the core can route future child-creation calls (endpoints, MRs, CQs) for this domain back to the same provider/fabric.
  • Copies or records the domain-level attributes (caps, mode, limits) into fields of fid_domain so they can be queried, validated on child creation, and used for lifetime/ref-counting.
  • Increments ref_count on the fid_fabric to prevent fabric destruction while the domain exists.


After this step the application holds a fid_domain handle and can proceed to create endpoints, register memory, create completion queues, etc., all of which the provider maps into the NIC/driver context that the domain represents.
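As a rough illustration of those follow-on calls, the sketch below opens a completion queue, registers a scratch buffer, and creates an active endpoint against the new domain. The attribute values and the buffer are placeholders; only the fi_cq_open(), fi_mr_reg(), and fi_endpoint() signatures come from the libfabric API.

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

/* Illustrative creation of child objects once the fid_domain exists. */
static int create_children(struct fid_domain *domain, struct fi_info *info)
{
    static char buf[4096];                              /* placeholder data buffer */
    struct fi_cq_attr cq_attr = {
        .size   = 1024,
        .format = FI_CQ_FORMAT_CONTEXT,
    };
    struct fid_cq *cq;
    struct fid_mr *mr;
    struct fid_ep *ep;

    int ret = fi_cq_open(domain, &cq_attr, &cq, NULL);          /* completion queue */
    if (!ret)
        ret = fi_mr_reg(domain, buf, sizeof(buf),
                        FI_SEND | FI_RECV, 0, 0, 0, &mr, NULL);  /* memory region */
    if (!ret)
        ret = fi_endpoint(domain, info, &ep, NULL);              /* active endpoint */
    return ret;
}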

Example fid_domain (illustrative)


fid_domain {

    fid_type        : FI_DOMAIN

    fid             : 0xF1DD01

    parent_fid      : 0xF1DFA01

    provider        : "libfabric-uet"

    caps            : 0x00000011      

    mode            : 0x00000004

    mr_key_size     : 8

    cq_cnt          : 4

    ep_cnt          : 128

    provider_data   : <pointer to provider domain struct>

    ref_count       : 1

    context         : <app-provided void *>

}

This stored state allows the core and provider to validate, route and implement subsequent calls that reference the domain.


Explanation of fields:

Identity

  • fid_type : FI_DOMAIN: Fabric Identifier (FID). Identifies the object as a domain. The domain object allows the application (and libfabric itself) to distinguish between different object types (fabric, domain, endpoint, etc.).
  • fid: 0xF1DD01: Unique handle for this domain instance. The application uses this handle in API calls that act on the domain.
  • parent_fid: 0xF1DFA01: Reference to the parent fabric object. This links the domain to the fabric where it belongs, ensuring resources remain associated with the correct fabric.

Provider Info

  • provider: "libfabric-uet": Name of the provider managing the domain. The application can confirm which provider implementation it is using, which is important when multiple providers are installed.
  • caps: 0x00000011: Bitmask of capabilities (for example, FI_MSG | FI_RMA). Defines which communication operations the domain supports so the application can use only valid features.
  • mode: 0x00000004: Mode requirements (for example, scalable endpoints). Tells the application about specific restrictions or rules it must follow when using the domain.

Resource Limits

  • mr_key_size: 8: Size in bytes of memory registration keys. The application uses this value when registering memory regions so it provides keys of the correct length.
  • cq_cnt: 4: Maximum number of completion queues supported. Guides the application when designing its event handling because it cannot create more than this limit.
  • ep_cnt: 128: Maximum number of endpoints supported. Tells the application how many communication endpoints it can create within this domain (a short limit-check sketch follows these field explanations).

Internal States

  • provider_data: <pointer to provider domain struct>: Provider-specific internal pointer. Not used directly by the application but allows the provider to maintain its own internal state.
  • ref_count : 1: Current reference count for the domain. Tracks how many objects depend on this domain and ensures proper cleanup when the domain is released.
  • context : <app-provided void *>: Application-supplied pointer. Lets the application attach custom data, such as state or identifiers, which the provider will return in callbacks.
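To show how these limits might be consulted in practice, here is a small illustrative pre-check. In code, the limits come from the discovery snapshot's fi_info->domain_attr (mr_key_size, cq_cnt, ep_cnt); the fid_domain dump above only visualizes the same values, and the helper name and thresholds are placeholders.

#include <stdio.h>
#include <rdma/fabric.h>

/* Illustrative: compare planned resource usage against reported domain limits. */
static int check_domain_limits(struct fi_info *info,
                               size_t cqs_needed, size_t eps_needed)
{
    struct fi_domain_attr *da = info->domain_attr;

    if (cqs_needed > da->cq_cnt || eps_needed > da->ep_cnt) {
        fprintf(stderr, "domain limits too small: cq_cnt=%zu ep_cnt=%zu\n",
                da->cq_cnt, da->ep_cnt);
        return -1;
    }
    printf("MR keys on this domain are %zu bytes long\n", da->mr_key_size);
    return 0;
}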


Figure 4-4: Objects Creation Process – Domain for NIC(s).

Friday, 26 September 2025

Ultra Ethernet: Fabric Creation Process in Libfabric

[edit Oct-8, 2025]
New version of this subject can be found here: Ultra Ethernet: Fabric Object - What it is and How it is created

Phase 1: Application (Discovery & choice)

After the UET provider populated fi_info structures for each NIC/port combination during discovery, the application can begin the object creation process. It first consults the in-memory fi_info list to identify the entry that best matches its requirements. Each fi_info contains nested attribute structures describing fabric, domain, and endpoint capabilities, including fi_fabric_attr (fabric name, provider identifier, version information), fi_domain_attr (memory registration mode, key details, domain capabilities), and fi_ep_attr (endpoint type, reliable versus unreliable semantics, size limits, and supported capabilities). The application examines the returned entries and selects the fi_info that satisfies its needs (for example: provider == "uet", fabric name == "UET", required capabilities, reliable transport, or a specific memory registration mode). The chosen fi_info then provides the attributes — effectively serving as hints — that the application passes into subsequent creation calls such as fi_fabric(), fi_domain(), and fi_endpoint(). Each fi_info acts as a self-contained “capability snapshot,” describing one possible combination of NIC, port, and transport mode.
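As a rough sketch of this selection step, the code below queries fi_getinfo() with a hints structure and walks the returned list until an entry satisfies the required capabilities. The "uet" provider name and the capability set follow the example criteria in this paragraph; the requested API version and the helper itself are placeholders.

#include <string.h>
#include <rdma/fabric.h>

/* Illustrative discovery and selection of a capability snapshot (fi_info). */
static struct fi_info *discover_and_select(void)
{
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *list = NULL, *chosen = NULL;

    hints->ep_attr->type = FI_EP_RDM;                 /* reliable datagram semantics */
    hints->caps = FI_MSG | FI_RMA;                    /* required capabilities */
    hints->fabric_attr->prov_name = strdup("uet");    /* provider name used in the text */

    if (fi_getinfo(FI_VERSION(1, 21), NULL, NULL, 0, hints, &list) == 0) {
        for (struct fi_info *cur = list; cur; cur = cur->next) {
            if ((cur->caps & (FI_MSG | FI_RMA)) == (FI_MSG | FI_RMA)) {
                chosen = fi_dupinfo(cur);             /* keep a private copy of the snapshot */
                break;
            }
        }
        fi_freeinfo(list);
    }
    fi_freeinfo(hints);
    return chosen;                                    /* NULL if nothing matched */
}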


Phase 2: Libfabric Core (dispatch & wiring)

When the application calls fi_fabric(), the core forwards this request to the corresponding provider’s fabric entry point. In this way, the fi_info produced during discovery effectively becomes the configuration input for object creation.

The core’s role is intentionally lightweight: it matches the application’s selected fi_info to the appropriate provider implementation and invokes the provider callback for fabrics, domains, and endpoints. Throughout this process, the fi_info acts as a context carrier, containing the provider identifier, fabric name, and attribute templates for domains and endpoints. The core passes these attributes directly to the provider during creation, ensuring that the provider has all the information necessary to map the requested objects to the correct NIC and transport configuration.
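A minimal sketch of that call, assuming chosen is the fi_info entry selected in Phase 1: the application passes only the fabric-level attributes and receives the fid_fabric handle in return.

#include <rdma/fabric.h>

/* Illustrative wrapper: create the fabric object from the chosen snapshot. */
static int open_fabric(struct fi_info *chosen, struct fid_fabric **fabric)
{
    return fi_fabric(chosen->fabric_attr, fabric, NULL);
}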


Phase 3: UET Provider

When the application invokes fi_fabric(), passing the fi_fabric_attr obtained from the chosen fi_info, the call is routed to the UET provider’s uet_fabric() entry point for fabric creation. The provider treats the attributes contained within the chosen fi_info as the authoritative configuration for fabric creation. Because each fi_fabric_attr originates from a NIC-specific fi_info, the provider immediately knows which physical NIC and port are associated with the requested fabric object.

The provider uses the fi_fabric_attr to determine which interfaces belong to the requested fabric, the provider-specific capabilities that must be supported, and any optional flags supplied by the application. It then allocates and initializes internal data structures to represent the fabric object, mapping it to the underlying NIC and driver resources described by the discovery snapshot.

During creation, the provider validates that the requested fabric can be supported by the current hardware and driver state. If the NIC or configuration has changed since discovery—for example, if the NIC is unavailable or the requested capabilities are no longer supported—the provider returns an error, preventing creation of an invalid or unsupported fabric object. Otherwise, the provider completes the fabric initialization, making it ready for subsequent domain and endpoint creation calls.

By relying exclusively on the fi_info snapshot from discovery, the provider ensures that the fabric object is created consistently and deterministically, reflecting the capabilities and constraints reported to the application during the discovery phase.


Phase 4: Libfabric Core (Fabric Object Publication)

Once the UET provider successfully creates the fabric object, the libfabric core generates the corresponding fid_fabric handle, which serves as the application-visible representation of the fabric. The term FID stands for Fabric Identifier, a unique identifier assigned to each libfabric object. All subsequent libfabric objects created within the context of this fabric — including domains, endpoints, and memory regions — are prefixed with this FID to maintain object hierarchy and enable internal tracking.

The fid_fabric structure contains metadata about the fabric, such as the associated provider, the fabric name, and internal pointers to provider-specific data structures. It acts as a lightweight descriptor for the application, while the actual resources and state remain managed by the provider.

The libfabric core stores the fid_fabric in in-memory structures in RAM, typically within internal libfabric tables that track all active fabric objects. This allows the core to efficiently validate future API calls that reference the fabric, to maintain object hierarchies, and to route creation requests (e.g., for domains or endpoints) to the correct provider instance. Because the FID resides in RAM, operations using fid_fabric are fast and transient; the core relies on the provider to maintain the persistent state and hardware mappings associated with the fabric.

By publishing the fabric as a fid_fabric object, libfabric establishes a clear and consistent handle for the application. This handle allows the application to reference the fabric unambiguously in subsequent creation calls, while preserving the mapping between the abstract libfabric object and the underlying NIC resources managed by the provider.

Example fid_fabric Object

fid_fabric {

    fid_type        : FI_FABRIC

    fid              : 0x0001ABCD                

    provider         : "uet"                     

    fabric_name      : "UET-Fabric1"         

    version          : 1.0                       

    state            : ACTIVE                     

    ref_count        : 1                          

    provider_data    : <pointer>                 

    creation_flags   : 0x0                        

    timestamp        : 1690000000                 

}


Explanation of fields:

fid_type: Identifies the type of object within libfabric and distinguishes fabric objects (FI_FABRIC) from other object types such as domains or endpoints. This allows both the libfabric core and the provider to validate and correctly handle API calls, object creation, and destruction. The fid field is a unique Fabric Identifier assigned by libfabric, serving as the primary identifier for the fabric object. All child objects, including domains, endpoints, and memory regions, inherit this FID as a prefix to maintain hierarchy and uniqueness, enabling the core to validate object references and route creation calls to the correct provider instance.

provider field: Indicates the name of the provider managing this fabric object, such as "uet". It associates the fabric with the underlying hardware implementation and ensures that all subsequent calls for child objects are dispatched to the correct provider. The fabric_name field contains the human-readable name of the fabric, selected during discovery or by the application, for example "UET-Fabric1". This name allows applications to identify the fabric among multiple available options and is used as a selection criterion during both discovery (fi_getinfo) and creation (fi_fabric).

version field: Specifies the provider or fabric specification version and ensures compatibility between the application and provider. It can be used for logging, debugging, or runtime checks to verify that the fabric supports the required feature set. The state field tracks the current lifecycle status of the fabric object, indicating whether it is active, ready for use, or destroyed. Both the libfabric core and provider validate this field before allowing any operations on the fabric.

ref_count: Maintains a reference counter for the object, preventing it from being destroyed while still in use by the application or other libfabric structures. This counter is incremented during creation or when child objects reference the fabric and decremented when objects are released or destroyed. 

provider_data: Contains an internal pointer to provider-managed structures, including NIC mappings, hardware handles, and configuration details. This field is only accessed by the provider; the application interacts with the fabric through the fid_fabric handle.

creation_flags: Contains optional flags provided by the application during fi_fabric() creation, allowing customization of the fabric initialization process, such as enabling non-default modes or debug options. 

timestamp field: An optional value indicating when the fabric was created. It is useful for debugging, logging, and performance tracking, helping to correlate fabric initialization with other libfabric objects and operations.



Figure 4-2: Objects Creation Process – Fabric for Cluster.


Related post:
https://nwktimes.blogspot.com/2025/09/ultra-ethernet-resource-initialization.html