Tuesday, 7 October 2025

Ultra Ethernet: Discovery

Creating the fi_info Structure

Before the application can discover what communication services are available, it first needs a way to describe what it is looking for. This description is built using a structure called fi_info. The fi_info structure acts like a container that holds the application’s initial requirements, such as desired endpoint type or capabilities.

The first step is to reserve memory for this structure in the system’s main memory. The fi_allocinfo() helper function does this for the application. When called, fi_allocinfo() allocates space for a new fi_info structure, which this book refers to as the pre-fi_info, that will later be passed to the libfabric core for matching against available providers.

At this stage, most of the fields inside the pre-fi_info structure are left at their default values. The application typically sets only the most relevant parameters that express what it needs, such as the desired endpoint type or provider name, and leaves the rest for the provider to fill in later.

In addition to the main fi_info structure, the helper function also allocates memory for a set of sub-structures. These describe different parts of the communication stack, including the fabric, domain, and endpoints. The fixed sub-structures include:

  • fi_fabric_attr: Describes the fabric-level properties such as the provider’s name and fabric identifier.
  • fi_domain_attr: Defines attributes related to the domain, including threading and data progress behavior.
  • fi_ep_attr: Specifies endpoint-related characteristics such as endpoint type.
  • fi_tx_attr and fi_rx_attr: Contain parameters for transmit and receive operations, such as message ordering and completion behavior.
  • fi_nic: Represents the properties and capabilities of the UET network interface.

In figure 4-1, these are labeled as fixed sub-structures because their layout and meaning are always the same. They consist of predefined fields and expected value types, which makes them consistent across different applications. Like the main fi_info structure, they usually remain at their default values until the provider fills them in. The information stored in these sub-structures will later be leveraged when the application begins creating actual fabric objects, such as domains and endpoints.

In addition to the fixed parts, the fi_info structure can contain generic sub-structures such as src_addr. Unlike the fixed ones, the layout of the generic sub-structure src_addr depends on the chosen addr_format. For example, when using Ultra Ethernet Transport, the address field points to a structure describing a UET endpoint address, which includes bits for Version, Flags, Fabric Endpoint Capabilities, PIDonFEB, Fabric Address, Start Resource Index, Num Resource Indices, and Initiator ID. This compact representation carries both addressing and capability information, allowing the same structure definition to be reused across different transport technologies and addressing schemes. Note that in Figure 4-2 the returned src_addr is only partially filled because the complete address information is not available until the endpoint is created.

In Figure 4-1, the application defines its communication requirements in the fi_info structure by setting the caps (capabilities) field. This field describes the types of operations the application intends to perform through the fabric interface. For example, values such as FI_MSG, FI_RMA, FI_WRITE, FI_REMOTE_WRITE, FI_COLLECTIVE, FI_ATOMIC, and FI_HMEM specify support for message-based communication, remote memory access, atomic operations, and host memory extensions.

When the fi_getinfo() call is issued, the provider compares these requested capabilities against what the underlying hardware and driver can support. Only compatible providers return a matching fi_info structure.

In this example, the application also sets the addr_format field to FI_ADDR_UET, indicating that Ultra Ethernet Transport endpoint addressing is used. This format includes hardware-specific addressing details beyond a simple IP address.

The current Ultra Ethernet Transport specification v1.0 does not define or support the FI_COLLECTIVE capability. Therefore, the UET provider does not return this flag, and collective operations are not offloaded or accelerated by the UET NIC.

After fi_allocinfo() has allocated memory for both the fixed and generic sub-structures, it automatically links them together by inserting pointers into the main fi_info structure. The application can then easily access each attribute through fi_info without manually handling the memory layout. 
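
The sketch below illustrates these preparation steps in C against the standard libfabric API. The capability flags and the FI_ADDR_UET address format follow the values discussed in this chapter; the FI_EP_RDM endpoint type and the "uet" provider name are assumptions added here for illustration.

#include <string.h>
#include <rdma/fabric.h>

/* Build the pre-fi_info ("hints"); every field not set here keeps its default. */
static struct fi_info *build_uet_hints(void)
{
    struct fi_info *hints = fi_allocinfo();          /* allocates fi_info and its fixed sub-structures */
    if (!hints)
        return NULL;

    hints->caps = FI_MSG | FI_RMA | FI_WRITE | FI_REMOTE_WRITE |
                  FI_ATOMIC | FI_HMEM;               /* operations the application intends to perform */
    hints->addr_format = FI_ADDR_UET;                /* UET endpoint addressing (as discussed above) */
    hints->ep_attr->type = FI_EP_RDM;                /* assumed: reliable, connectionless endpoint */
    hints->fabric_attr->prov_name = strdup("uet");   /* assumed provider name; freed by fi_freeinfo() */
    return hints;
}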

Once the structure is prepared, the next step is to request matching provider information using the fi_getinfo() API call, which is described in detail in the following section.

Figure 4-1: Discovery: Allocate Memory and Create Structures – fi_allocinfo.

Requesting Provider Services with fi_getinfo()


After creating a pre-fi_info structure, the application calls the fi_getinfo() API to discover which services and transport features are available on the node’s NIC(s). This function takes a pointer to the pre-fi_info structure, which contains hints describing the application’s requirements, such as desired capabilities and address format.

When the discovery request reaches the libfabric core, the library identifies and loads an appropriate provider from the available options, which may include providers for TCP, verbs (InfiniBand), or sockets (TCP/UDP). For Ultra Ethernet Transport, the core selects the UET provider. The core invokes the provider's entry points, including the .getinfo callback, which is responsible for returning the provider's supported capabilities; internally, this callback resolves to the provider's uet_getinfo() function.

Inside the UET provider, the uet_getinfo() routine queries the NIC driver or kernel interface to determine what capabilities each NIC can safely and efficiently support. The provider does not access the hardware directly. For multi-GPU AI workloads, the focus is on push-based remote memory access operations:

  • FI_MSG: Used for standard message-based communication.
  • FI_RMA: Enables direct remote memory access, forming the foundation for high-performance gradient or parameter transfers between GPUs.
  • FI_WRITE: Allows a GPU to push local gradient updates directly into another GPU’s memory.
  • FI_REMOTE_WRITE: Signals that remote GPUs can write directly into this GPU’s memory, supporting push-based collective operations.
  • FI_COLLECTIVE: Indicates support for collective operations like AllReduce, though the current UET specification does not implement this capability.
  • FI_ATOMIC: Allows atomic operations on remote memory.
  • FI_HMEM: Marks support for host memory or GPU memory extensions.

Low-level hardware metrics, such as link speed or MTU, are not returned at this phase; the focus is on semantic capabilities that the application can rely on. 

The provider allocates new fi_info structures in CPU memory, creating one structure per NIC that satisfies the hints provided by the application and describes all other supported services.

After the provider has created these structures, libfabric returns them to the application as a linked list. The next pointer links all available fi_info structures, allowing the application to iterate over all discovered NICs. The main fields, including caps and addr_format, are populated and can be inspected immediately, while sub-structures such as fi_fabric_attr, fi_domain_attr, and fi_ep_attr remain empty unless explicitly requested in the hints. These sub-structures are typically filled later during the creation of fabric objects, domains, or endpoints.

The application can then select the most appropriate entry and request the creation of the Fabric object. Creating the Fabric first establishes the context in which all subsequent domains, sub-resources, and endpoints are organized and managed. Once all pieces of the AI Fabric “jigsaw puzzle” abstraction have been created and initialized, the application can release the memory for all fi_info structures in the linked list by calling fi_freeinfo().
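
As a rough sketch of this flow, the fragment below calls fi_getinfo() with the hints built earlier, walks the returned linked list via the next pointer, and finally releases everything with fi_freeinfo(). The API version number and the build_uet_hints() helper from the previous sketch are illustrative assumptions.

#include <stdio.h>
#include <rdma/fabric.h>

static int discover_uet_nics(void)
{
    struct fi_info *hints, *info = NULL, *cur;
    int ret;

    hints = build_uet_hints();                       /* hypothetical helper from the previous sketch */
    if (!hints)
        return -FI_ENOMEM;

    ret = fi_getinfo(FI_VERSION(1, 21), NULL, NULL, 0, hints, &info);  /* version is illustrative */
    if (ret == 0) {
        for (cur = info; cur; cur = cur->next)       /* one fi_info per matching NIC */
            printf("provider %s, caps 0x%llx\n",
                   cur->fabric_attr->prov_name,
                   (unsigned long long)cur->caps);

        /* ... select an entry, create the Fabric, Domain, and Endpoint objects ... */

        fi_freeinfo(info);                           /* frees the whole linked list */
    }

    fi_freeinfo(hints);
    return ret;
}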


Figure 4-2: Discovery: Discover Provider Capabilities – fi_getinfo.


Note: The fields and values of fi_info and its sub-structures are explained in upcoming chapters.

Next, Fabric Object... 


Thursday, 2 October 2025

Ultra Ethernet: Address Vector (AV)

 Address Vector (AV)

The Address Vector (AV) is a provider-managed mapping that connects remote fabric addresses to compact integer handles (fi_addr_t) used in communication operations. Unlike a routing table, the AV does not store IP-to-device mappings. Instead, it converts an opaque Fabric Address (FA), which may contain IP, port, and transport-specific identifiers, into a simple handle that endpoints can use for sending and receiving messages. The application never needs to reference the raw IP addresses directly.

Phase 1: Application – Request & Definition

The application begins by requesting an Address Vector (AV) through the fi_av_open() call. To do this, it first defines the desired AV properties in a fi_av_attr structure:

int fi_av_open(struct fid_domain *domain, struct fi_av_attr *attr, struct fid_av **av, void *context);

struct fi_av_attr av_attr = {
    .type        = FI_AV_TABLE,
    .count       = 16,
    .rx_ctx_bits = 0,
    .ep_per_node = 1,
    .name        = "my_av",
    .map_addr    = NULL,
    .flags       = 0
};

Example 4-1: structure fi_av_attr.


fi_av_attr Fields

type: Specifies the type of Address Vector. In Ultra Ethernet, the most common choice is FI_AV_TABLE, which organizes peer addresses in a simple, contiguous table. Each inserted Fabric Address (FA) receives a sequential fi_addr_t starting from zero. The type determines how the provider manages and looks up peer addresses during runtime.

Count: Indicates the expected number of addresses that will be inserted into the AV. The provider uses this value to optimize resource allocation. In most implementations, this also acts as an upper bound—if the application attempts to insert more addresses than specified, the operation may fail. Therefore, the application should set this field according to the anticipated communication pattern, ensuring the AV is sized appropriately without over-allocating memory.

Receive Context Bits (rx_ctx_bits): The rx_ctx_bits field is relevant only for scalable endpoints, which can expose multiple independent receive contexts. Normally, when an address is inserted into the Address Vector (AV), the provider returns a compact handle (fi_addr_t) that identifies the peer. If the peer has multiple receive contexts, the handle must also encode which specific receive context the application intends to target.

The rx_ctx_bits field specifies how many bits of the fi_addr_t are reserved for this purpose. For example, if an endpoint has 8 receive contexts (rx_ctx_cnt = 8), then at least 3 bits are required (2^3 = 8) to distinguish all contexts. These reserved bits allow the application to refer to each receive context individually without inserting duplicate entries into the AV.

The provider uses rx_ctx_bits internally when constructing the fi_addr_t returned by fi_av_insert(). The helper function fi_rx_addr() can then combine a base fi_addr_t with a receive context index to produce the final address used in communication operations.


Example:


Suppose a peer has 4 receive contexts, and the application sets rx_ctx_bits = 2 (2 bits reserved for receive contexts). The base fi_addr_t for the peer returned by the AV might be 0x10. Using the 2 reserved bits, the application can address each receive context as follows:


Peer FA           Base fi_addr_t   Receive Context   Final fi_addr_t
10.0.0.11:7500    0x10             0                 0x10
10.0.0.11:7500    0x10             1                 0x11
10.0.0.11:7500    0x10             2                 0x12
10.0.0.11:7500    0x10             3                 0x13


This approach keeps the AV compact while allowing fine-grained targeting of receive contexts. The application never needs to manipulate raw IP addresses or transport identifiers; it simply uses the final fi_addr_t values in send, receive, or RDMA operations.
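
A brief sketch of the helper in use, assuming the base handle and rx_ctx_bits values from the table above (the fragment would sit inside the application's setup code):

#include <rdma/fi_endpoint.h>   /* declares the fi_rx_addr() helper */

fi_addr_t base        = 0x10;   /* base fi_addr_t returned by fi_av_insert() */
int       rx_ctx_bits = 2;      /* same value supplied in fi_av_attr */

/* Encode "receive context 2 of this peer" into a single handle;
 * the helper takes care of the bit placement. */
fi_addr_t peer_rx2 = fi_rx_addr(base, 2, rx_ctx_bits);
/* peer_rx2 can now be passed to fi_send(), fi_write(), and so on. */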


ep_per_node: Indicates the expected number of endpoints that will be associated with a given Fabric Address. The provider uses this value to optimize resource allocation. If the number of endpoints per node is unknown, the application can set it to 0. In distributed or parallel applications, this is typically set to the number of processes per node multiplied by the number of endpoints each process will open. This value is a hint rather than a strict limit, allowing the AV to scale efficiently in multi-endpoint configurations.


Name: An optional human-readable system name for the AV. If non-NULL and the AV is opened with write access, the provider may create a shared, named AV, which can be accessed by multiple processes within the same domain on the same node. This feature enables resource sharing in multi-process applications. If sharing is not required, the name can be set to NULL.


map_addr: An optional base address for the AV, used primarily when creating a shared, named AV (FI_AV_MAP) across multiple processes. If multiple processes provide the same map_addr and name, the AV guarantees that the fi_addr_t handles returned by fi_av_insert() are consistent across all processes. The provider may internally memory-map the AV at this address, but this is not required. In single-process AVs, or when the AV is private, this field can be set to NULL.


Flags: Behavior-modifying flags. These can control provider-specific features, such as enabling asynchronous operations or special memory access patterns. Setting it to 0 selects default behavior.

Once the structure is populated, the application calls fi_av_open() with the domain handle and the fi_av_attr structure shown in Example 4-1.
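
A minimal sketch of the call, reusing the av_attr from Example 4-1 and assuming a previously opened domain handle named domain:

struct fid_av *av;

int ret = fi_av_open(domain, &av_attr, &av, NULL);
if (ret)
    return ret;   /* e.g. unsupported AV type or count */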


The fi_av_open() call allocates a provider-managed fid_av object, which contains a pointer to operations such as insert(), remove(), and lookup(), along with an internal mapping table that translates integer handles into transport-specific Fabric Addresses. At this stage, the AV exists independently of endpoints or peers and contains no entries; it is a blank, ready-to-populate mapping structure.


The AV type, described in Example 4-1, determines how the provider organizes and manages peer addresses. In Ultra Ethernet, FI_AV_TABLE is the most common choice. Conceptually, it is a simple, contiguous table that maps each peer’s Fabric Address to a sequential integer handle (fi_addr_t). Each inserted FA receives a handle starting from zero, and the application never accesses the raw IP or transport identifiers directly, relying solely on the fi_addr_t to refer to peers. From the Ultra Ethernet perspective, FI_AV_TABLE is ideal because it allows constant-time lookup of a peer’s FA. When the application posts a send, receive, or memory operation, the provider translates the fi_addr_t into the transport-specific FA efficiently. Internally, each table entry can also reference provider-managed resources, such as transmit and receive contexts or resource indexes, making runtime operations lightweight and predictable. 


The application should select attribute values in line with the provider’s capabilities reported by fi_getinfo(), considering the expected number of peers and any provider-specific recommendations. This ensures that the AV is sized appropriately for the intended communication pattern and supports efficient, opaque addressing throughout the system.

Example: Mapping fi_addr_t index to peer FA and Rank



fi_addr_t   Peer Rank   Fabric Address (FA)
0           1           10.0.0.11:7500
1           2           10.0.0.12:7500
2           3           10.0.0.13:7500


Each inserted Fabric Address receives a sequential integer handle (fi_addr_t) starting from zero. The application then uses this handle in communication operations, for example:


fi_addr_t peer2 = fi_addrs[1];  // index 1 corresponds to rank 2

fi_send(ep, buffer, length, NULL, peer2, 0);


This mapping allows the application to refer to peers purely by fi_addr_t handles, avoiding any direct manipulation of IP addresses or transport identifiers, and enabling efficient, opaque peer referencing in the data path.


Phase 2: Provider – Validation & Limit Check.

After the application requests an AV using fi_av_open(), the provider validates the requested attributes against the capabilities reported in the domain attributes (fi_domain_attr) returned earlier by fi_getinfo(). The fid_domain serves as the context for AV creation, but the actual limits and supported features come from the domain attributes.

The provider performs the following checks to ensure the AV configuration is compatible with its supported limits:

  • type is compared against fi_info structure -> fi_domain_attr sub-structure -> av_type to verify that the requested AV organization (for example, FI_AV_TABLE) is supported.
  • count is checked to ensure it does not exceed the maximum number of entries the provider can manage efficiently.
  • rx_ctx_bits and ep_per_node are validated against domain context limits, such as rx_ctx_cnt, tx_ctx_cnt, and max_ep_*_ctx values, to guarantee that the requested receive context and endpoint configuration can be supported.
  • flags are compared against fi_info->caps and fi_info->mode to confirm that the requested behaviors are allowed.
  • name and map_addr are optional; the provider may validate them if the AV is intended to be shared or named, but they can also be ignored for local/private AVs.

Note: The caps field lists the capabilities supported by the provider that an application may request, such as RMA, atomics, or tagged messaging. The mode field, in contrast, specifies mandatory requirements imposed by the provider. If a mode bit is set, the application must comply with that constraint to use the provider. Typical examples include requiring all memory buffers to be explicitly registered (FI_LOCAL_MR) or the use of struct fi_context with operations (FI_CONTEXT).

If all checks pass, the provider allocates the AV and returns a handle (fid_av *) to the application. At this stage, the AV exists as an empty container, independent of any endpoints or peer Fabric Addresses. It is ready to be populated with addresses in the subsequent phase.

Phase 3: Population and distribution of FAs.

Once the AV has been created and validated, it must be populated with the Fabric Addresses (FAs) of all peers that the application will communicate with. Initially, each process only knows its own local FA. To enable inter-process communication, these local FAs must be exchanged and distributed to all ranks, typically using an out-of-band control channel such as a bootstrap TCP connection.

In a common master-collect model, the master process (Rank 0) collects the local FAs from all worker processes and constructs a global map of {Rank → FA}. The master then broadcasts this global mapping back to all ranks. Each process inserts the received FAs into its AV using fi_av_insert().

Each inserted FA is assigned a sequential integer handle (fi_addr_t) starting from zero. These handles are then used in all subsequent communication operations, so the application never directly references IP addresses or transport-specific identifiers.

Example: Populating the AV with peer FAs

#include <string.h>
#include <netinet/in.h>
#include <arpa/inet.h>

struct sockaddr_in peers[3];
const char *ips[3] = { "10.0.0.11", "10.0.0.12", "10.0.0.13" };   // ranks 1-3

for (int i = 0; i < 3; i++) {                  // build the sockaddr for each remote peer
    memset(&peers[i], 0, sizeof(peers[i]));
    peers[i].sin_family = AF_INET;
    peers[i].sin_port   = htons(7500);
    inet_pton(AF_INET, ips[i], &peers[i].sin_addr);
}

fi_addr_t fi_addrs[3];
ret = fi_av_insert(av, peers, 3, fi_addrs, 0, NULL);   // returns handles 0, 1, 2

Note on Local Fabric Address: In most Ultra Ethernet applications, the local Fabric Address (FA) is not inserted into the Address Vector (AV_TABLE). This is because it is rare for a process to communicate with itself; endpoints typically only need to send or receive messages to remote peers. As a result, the AV contains entries only for remote FAs, each mapped to a sequential fi_addr_t handle. The local FA remains known to the process but does not occupy an AV entry, keeping the mapping compact and efficient. If an application does require self-communication, the local FA can be explicitly inserted into the AV, but this is an uncommon scenario.


Once this step is complete, the AV contains the full mapping of all ranks to their assigned fi_addr_t handles, including the master itself. The application can now communicate with any peer by passing its fi_addr_t to libfabric operations.

Example: Using fi_addr_t in a send operation

fi_addr_t peer2 = fi_addrs[1];  // index 1 corresponds to rank 2

fi_send(ep, buffer, length, NULL, peer2, 0);

This design allows all communication operations to reference peers via compact, opaque integer handles, keeping the data path efficient and abstracted from transport-specific details.


Phase 4: Runtime usage.

After the AV is populated with all peer Fabric Addresses (FAs), the provider handles all runtime operations transparently. The fi_addr_t handles inserted by the application are used internally to direct traffic to the correct peer, hiding the underlying transport details.

  • Send and Receive Operations: When the application posts a send, receive, or RDMA operation, the provider translates the fi_addr_t handle into the transport-specific FA. The application never needs to reference IP addresses, ports, or other identifiers directly.
  • Updating the AV: If a peer’s FA needs to be removed or replaced, the provider uses its AV operations (fi_ops_av->remove() and fi_ops_av->insert()) to update the mapping seamlessly. These updates maintain the integrity of the fi_addr_t handles across the application.
  • Resource References: Each table entry may internally reference provider-managed resources, such as transmit/receive contexts or reserved bits for scalable endpoints. This ensures that runtime operations are lightweight and predictable.
  • Consistency Across Processes: For shared or named AVs, the provider ensures that fi_addr_t handles are consistent across processes, allowing communication to function correctly even in distributed, multi-node applications.


At this stage, the AV is fully functional. The application can perform communication with any known peer using the opaque fi_addr_t handles. This design keeps the data path efficient, predictable, and independent of transport-specific details, just like the Completion Queue and Event Queue abstractions.

Example fid_av (Address Vector) – Illustrative


fid_av {
    fid_type        : FI_AV
    fid             : 0xF1AD01
    parent_fid      : 0xF1DD001
    provider        : "libfabric-uet"
    type            : FI_AV_TABLE
    count           : 16
    rx_ctx_bits     : 0
    ep_per_node     : 1
    flags           : 0x00000000
    name            : "my_av"
    map_addr        : 0x0
    ops             : 0xF1AA10
    fi_addr_table   : [0..count-1]
}

Explanation of fields:

Note: Some of these fields were previously explained as part of the fi_av_attr structure.


Identity

  • fid_type: Specifies the type of object. For an Address Vector, this is always FI_AV. This allows the provider and application to distinguish AVs from other objects like fid_cq, fid_eq, or fid_domain.
  • Fid: A unique identifier for this AV instance, typically assigned by the provider. It serves as a handle for internal management and debugging, similar to fid_cq or fid_eq.
  • parent_fid: Points to the parent domain (fid_domain) in which this AV was created. This links the AV to its domain’s resources and capabilities and ensures proper hierarchy in the provider’s object model.
  • Name: Optional human-readable system name for the AV. Used to identify shared AVs across processes within the same domain. Named AVs allow multiple processes to open the same underlying mapping.


Provider Info

  • Provider: A string identifying the provider managing this AV (e.g., "libfabric-uet").
  • Type: Reflects the AV organization, usually set to FI_AV_TABLE in Ultra Ethernet. Conceptually, it defines the AV as a simple contiguous table mapping peer FAs to integer handles (fi_addr_t).
  • Flags: Behavior-modifying options for the AV. These can enable provider-specific features or modify how insert/remove operations behave.
  • Ops: Pointer to the provider-specific operations table (fi_ops_av). This contains function pointers for AV operations such as insert(), remove(), lookup(), and control().


Resource Limits / Configuration

  • Count: The expected number of addresses to be inserted into the AV. This guides the provider in allocating resources efficiently and sizing the internal mapping table. Unlike a hard maximum, this is primarily an optimization hint.
  • rx_ctx_bits: Used with scalable endpoints. Specifies the number of bits in the returned fi_addr_t that identify a target receive context. Ensures the AV can support multiple receive contexts per endpoint when needed.
  • ep_per_node: Indicates how many endpoints are associated with the same Fabric Address. This helps the provider optimize internal data structures when multiple endpoints share the same FA.
  • map_addr: Optional base address used when sharing an AV between processes. For shared AVs, this ensures that the same fi_addr_t handles are returned on all processes opening the AV at the same map_addr.


Internal State

  • fi_addr_table: Internal array or mapping that connects fi_addr_t handles to actual remote Fabric Addresses (FAs). Initially empty after creation, entries are populated during the AV population phase. Each entry allows the application to refer to peers efficiently without directly using IP addresses or transport-specific identifiers.




Figure 4-6: Objects Creation Process – Address Vector.

Tuesday, 30 September 2025

Ultra Ethernet: Completion Queue

Completion Queue Creation (fi_cq_open)


Phase 1: Application – Request & Definition


The purpose of this phase is to define the queue where operation completions will be reported. Completion queues are used to report the completion of operations submitted to endpoints, such as data transfers, RMA accesses, or remote write requests. By preparing a struct fi_cq_attr, the application describes exactly what it needs, so the provider can allocate a CQ that meets its requirements.


Example API Call:

struct fi_cq_attr cq_attr = {
    .size      = 2048,
    .format    = FI_CQ_FORMAT_DATA,
    .wait_obj  = FI_WAIT_FD,
    .flags     = FI_WRITE | FI_REMOTE_WRITE | FI_RMA,
    .data_size = 64
};

struct fid_cq *cq;
int ret = fi_cq_open(domain, &cq_attr, &cq, NULL);


Explanation of fields:

.size = 2048:  The CQ can hold up to 2048 completions. This determines how many completed operations can be buffered before the application consumes them.

.format = FI_CQ_FORMAT_DATA: This setting determines the level of detail included in each completion entry. With FI_CQ_FORMAT_DATA, the CQ entries contain information about the operation, such as the buffer pointer, the length of data, and optional completion data. If the application uses tagged messaging, choosing FI_CQ_FORMAT_TAGGED expands the entries to also include the tag, allowing the application to match completions with specific operations. The format attribute essentially defines the structure of the data returned when reading the completion queue, letting the application control how much information it receives about each completed operation.

.wait_obj = FI_WAIT_FD: Provides a file descriptor for the application to poll or select on; other options include FI_WAIT_NONE (busy polling) or FI_WAIT_SET.

.flags = FI_WRITE | FI_REMOTE_WRITE | FI_RMA: This field is a bitmask specifying which types of completions the application wants the Completion Queue to report. The provider checks these flags against the capabilities of the domain (fid_domain) to ensure they are supported. If a requested capability is not available, fi_cq_open() will fail. This allows the application to control which events are tracked while the provider manages the underlying resources.

Note: You don’t always need to request every completion type. For example, if your application only cares about local sends, you can set the flag for FI_WRITE and skip FI_REMOTE_WRITE or FI_RMA. Limiting the flags reduces the amount of tracking the provider must do, which can save memory and improve performance, while still giving you the information your application actually needs.

.data_size = 64:  Maximum size of immediate data per entry, in bytes, used for RMA or atomic operations.


Phase 2: Provider – Validation & Limits Check

When the application calls fi_cq_open() with a fi_cq_attr structure, the provider validates each attribute against the parent domain’s capabilities (fid_domain):


  • fi_cq_attr.size: compared to the domain’s maximum CQ depth.
  • fi_cq_attr.data_size: compared to the domain’s supported CQ data size.
  • The total number of CQs requested: limited by the domain’s CQ count.
  • fi_cq_attr.flags: each requested capability is checked against the domain’s supported features.

If any requested value exceeds the domain’s limits, the provider may adjust it to the maximum allowed or return an error.


Phase 3: Provider – Creation & Handle Return

The purpose of this phase is to allocate memory and internal structures for the CQ and return a handle to the application. The provider creates the fid_cq object in RAM, associates it with the parent domain (fid_domain), and returns the handle. The CQ is now ready to be bound to endpoints (fi_ep_bind) and used for reporting operation completions.
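
As a hedged sketch, binding the new CQ to an already created endpoint (here assumed to be ep) might look like the fragment below, using fi_ep_bind() from rdma/fi_endpoint.h; the flags select which completions the endpoint reports to this CQ:

/* Report both transmit and receive completions of endpoint "ep" to this CQ. */
ret = fi_ep_bind(ep, &cq->fid, FI_TRANSMIT | FI_RECV);
if (ret)
    return ret;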

Example fid_cq (Completion Queue) – Illustrative

fid_cq {
    fid_type        : FI_CQ
    fid             : 0xF1DC601
    parent_fid      : 0xF1DD001
    provider        : "libfabric-uet"
    caps            : FI_WRITE | FI_REMOTE_WRITE | FI_RMA
    size            : 2048
    format          : FI_CQ_FORMAT_DATA
    wait_obj        : FI_WAIT_FD
    flags           : FI_WRITE | FI_REMOTE_WRITE | FI_RMA
    data_size       : 64
    provider_data   : <pointer to provider CQ struct>
    ref_count       : 1
    context         : <app-provided void *>
}

Object Example 4-4: Completion Queue (CQ).

Explanation of fields:

fid_type: Type of object, here CQ.

Fid: Unique handle for the CQ object.

parent_fid: Domain the CQ belongs to.

caps: Capabilities supported by this CQ.

size: Queue depth (number of completion entries).

Format: Structure format for completion entries.

wait_obj: Mechanism to wait for completions.

Flags: Requested capabilities for this CQ.

data_size: Maximum size of immediate data per completion entry.

provider_data: Pointer to provider-internal CQ structure.

ref_count: Tracks references to this object.

context: Application-provided context pointer.

Note: In a Completion Queue (fid_cq), the flags field represents the capabilities requested by the application when calling fi_cq_open() (for example, tracking user events, remote writes, or RMA operations). The provider checks these flags against the capabilities of the parent domain (fid_domain). The caps field, on the other hand, shows the capabilities that the provider actually granted to the CQ. This distinction is important because the provider may adjust or limit requested flags to match what the domain supports. In short:

Flags: what the application asked for.

Caps: what the CQ can actually do.


Why EQs and CQs Reside in Host Memory

Event Queues (EQs) and Completion Queues (CQs) are not data buffers in which application payloads are stored. Instead, they are control structures that track the state of communication. When the application posts an operation, such as sending or receiving data, the provider allocates descriptors and manages the flow of that operation. As the operation progresses or completes, the provider generates records describing what has happened. These records typically include information such as completion status, error codes, or connection management events.

Because the application must observe these records to make progress, both EQs and CQs are placed in host memory where the CPU can access them directly. The application typically calls functions like fi_cq_read() or fi_eq_read() to poll the queue, which means that the CPU is actively checking for new records. If these control structures were stored in GPU memory, the CPU would not be able to efficiently poll them, as each access would require a costly transfer over the PCIe or NVLink bus. The design is therefore intentional: the GPU may own the data buffers being transferred, but the coordination, synchronization, and signaling of those transfers are always managed through CPU-accessible queue structures.
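
A simplified sketch of such CPU-side polling, assuming the cq handle created earlier with FI_CQ_FORMAT_DATA:

struct fi_cq_data_entry entry;   /* entry layout matching FI_CQ_FORMAT_DATA */
ssize_t n;

do {
    n = fi_cq_read(cq, &entry, 1);          /* non-blocking: read at most one completion */
} while (n == -FI_EAGAIN);                  /* queue empty: keep polling (or wait on the FD) */

if (n == -FI_EAVAIL) {                      /* an operation completed in error */
    struct fi_cq_err_entry err;
    fi_cq_readerr(cq, &err, 0);             /* retrieve the error details */
}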

Figure 4-5: Objects Creation Process – Completion Queue.

Ultra Ethernet: Event Queue

Event Queue Creation (fi_eq_open)


Phase 1: Application – Request & Definition


The purpose of this phase is to specify the type, size, and capabilities of the Event Queue (EQ) your application needs. Event queues are used to report events associated with control operations. They can be linked to memory registration, address vectors, connection management, and fabric- or domain-level events. Reported events are either associated with a requested operation or affiliated with a call that registers for specific types of events, such as listening for connection requests. By preparing a struct fi_eq_attr, the application describes exactly what it needs so the provider can allocate the EQ properly.

In addition to basic properties like .size (number of events the queue can hold) and .wait_obj (how the application waits for events), the .flags field can request specific EQ capabilities. Common flags include:


  • FI_WRITE: Requests support for user-inserted events via fi_eq_write(). If this flag is set, the provider must allow the application to invoke fi_eq_write().
  • FI_REMOTE_WRITE: Requests support for remote write completions being reported to this EQ.
  • FI_RMA: Requests support for Remote Memory Access events (e.g., RMA completions) to be delivered to this EQ.

Flags are encoded as a bitmask, so multiple capabilities can be requested simultaneously using bitwise OR.

Example API Call:

struct fi_eq_attr eq_attr = {
    .size     = 1024,
    .wait_obj = FI_WAIT_FD,
    .flags    = FI_WRITE | FI_REMOTE_WRITE | FI_RMA,
};

struct fid_eq *eq;
int ret = fi_eq_open(domain, &eq_attr, &eq, NULL);


Explanation of the fields in this example:


.size = 1024: The EQ can hold up to 1024 events. This defines the queue depth, i.e., how many events can be buffered by the provider before the application consumes them.

.wait_obj = FI_WAIT_FD: Specifies the mechanism the application will use to wait for events. FI_WAIT_FD means the EQ provides a file descriptor that the application can poll or select on, integrating event waiting into standard OS I/O mechanisms. Other options include FI_WAIT_NONE for busy polling or FI_WAIT_SET to attach the EQ to a wait set.

.flags = FI_WRITE | FI_REMOTE_WRITE | FI_RMA: This field is a bitmask specifying which types of events the application expects the Event Queue to support. FI_WRITE allows user-inserted events, FI_REMOTE_WRITE requests notifications for remote write completions, and FI_RMA requests notifications for RMA operations. The provider checks these flags against the capabilities of the parent domain (fid_domain) to ensure they are supported. If a requested capability is not available, fi_eq_open() will fail. In this example, the named flag constants are combined with bitwise OR rather than written as a raw numeric bitmask, which keeps the requested capabilities readable.
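
The fragment below sketches how an application might wait on the EQ when wait_obj is FI_WAIT_FD: it retrieves the file descriptor with fi_control(FI_GETWAIT), sleeps in poll(), and then drains the event with fi_eq_read(). The eq handle is the one returned by fi_eq_open() in the example above.

#include <poll.h>

int fd;
ret = fi_control(&eq->fid, FI_GETWAIT, &fd);        /* obtain the EQ's wait file descriptor */

struct pollfd pfd = { .fd = fd, .events = POLLIN };
poll(&pfd, 1, -1);                                  /* block until an event is pending */

uint32_t event;
struct fi_eq_entry entry;
ssize_t n = fi_eq_read(eq, &event, &entry, sizeof(entry), 0);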


Phase 2: Provider – Validation & Limits Check

The purpose of this phase is to ensure that the requested EQ can be supported by the provider. The provider validates the fi_eq_attr structure against its capabilities in the fi_info structure returned during discovery. Specifically, the .flags bitmask is checked against fi_info->caps, and each requested capability (FI_WRITE, FI_REMOTE_WRITE, FI_RMA) must be supported. Other checks include domain constraints, such as maximum number of EQs per domain and maximum queue depth. If any requested flag or attribute exceeds provider limits, the call fails.


Phase 3: Provider – Creation & Handle Return

The purpose of this phase is to allocate memory and internal structures for the EQ and return a usable handle to the application. The provider creates the fid_eq object in RAM, associates it with the parent domain (fid_domain), and returns the handle. The EQ is now ready to be bound to endpoints and used for event reporting. The Completion Queue (CQ), in contrast, is used to track the results of data transfer operations: every communication request eventually produces a completion, which is placed into the CQ once processed.


Example fid_eq (Event Queue) – Illustrative


fid_eq {
    fid_type        : FI_EQ
    fid             : 0xF1DE601
    parent_fid      : 0xF1DD001
    provider        : "libfabric-uet"
    caps            : FI_WRITE | FI_REMOTE_WRITE | FI_RMA
    size            : 1024
    wait_obj        : FI_WAIT_FD
    provider_data   : <pointer to provider EQ struct>
    ref_count       : 1
    context         : <app-provided void *>
}

Object Example 4-3: Event Queue (EQ).


Explanation of fields:


fid_type: Type of object, here EQ.

fid: Unique handle for the EQ object.

parent_fid: Pointer to the domain it belongs to.

Caps: Bitmask of requested/available capabilities: user events, remote write completions, RMA completions.

Size: Queue depth (number of events).

wait_obj: Wait mechanism used by application.

provider_data: Pointer to provider-internal EQ structure.

ref_count: Tracks object references for lifecycle management.

Context: Application-provided context pointer.


Figure 4-4: Objects Creation Process – Event Queue.

Note: Creating an Event Queue (EQ) differs from fabric or domain creation. Instead of using fi_info to select a provider, fi_eq_open() simply takes an existing domain handle (fid_domain). The provider then allocates the EQ’s internal structures in host memory and returns a handle the application can use. This design ensures the CPU can efficiently track events, while the provider manages the details internally. 

Sunday, 28 September 2025

Ultra Ethernet: Domain Creation Process in Libfabric

Creating a domain object is the step where the application establishes a logical context for a NIC within a fabric, enabling endpoints, completion queues, and memory regions to be created and managed consistently.

Phase 1: Application (Discovery & choice — selecting a domain snapshot)

During discovery, the provider had populated one or more fi_info entries — each entry was a snapshot describing one possible NIC/port/transport combination. Each fi_info contained nested attribute structures for fabric, domain, and endpoint: fi_fabric_attr, fi_domain_attr, and fi_ep_attr. The fi_domain_attr substructure captured the domain-level template the provider had reported during discovery (memory registration modes, MR key sizes, counts and limits, capability and mode bitmasks, CQ/CTX limits, authentication key sizes, etc.).

When the application had decided which NIC/port it wanted to use, it selected a single fi_info entry whose fi_domain_attr matched its needs. That chosen fi_info became the authoritative configuration for domain creation, containing both the application’s requested settings and the provider-reported capabilities. At this phase, the application moved forward from fabric initialization to domain creation.

To create the domain, the application called the fi_domain function:


API Call → Create Domain object
    Within Fabric ID: 0xF1DFA01
    Using fi_info structure: 0xCAFE43E
    On success: returns fid_domain handle


Phase 2: Libfabric core (dispatch & validation)

The application calls the domain creation API:

int fi_domain(struct fid_fabric *fabric, struct fi_info *info,
              struct fid_domain **domain, void *context);

What the core does, at a high level:

  • Validate arguments: Ensure fabric is a live fid_fabric handle and info is non-NULL.
  • Sanity-check provider/fabric match: The core checks that the fi_info the application supplied corresponds to the same provider (and, indirectly, the same NIC/port) represented by fabric. This is the first piece of the “glue”: the fid_fabric (published earlier) contains the provider identity and fabric name; fi_info also contains provider/fabric identifiers from discovery. The core rejects or returns an error if the two do not match (this prevents cross-provider or cross-fabric mixes).
  • Forward the call to the provider: The core hands the fi_info (including its fi_domain_attr) and the fabric handle to the provider’s domain creation entry point. The core itself remains lightweight: it performs validation and routing only and does not modify attributes or allocate hardware resources; the provider does the heavy lifting of mapping attributes onto hardware.


Phase 3: UET provider (mapping to NIC / resource allocation)

The provider receives the fabric handle (so it knows which NIC/port and which provider instance to use) and the fi_info/fi_domain_attr descriptor. The provider:

  • Interprets the domain attributes and verifies they are feasible given the NIC hardware, driver state and current configuration. For example: requested MR key size, number of CQ/CTXs, per-endpoint limits, requested capability bitmask.
  • Allocates driver / NIC resources or driver contexts that correspond to a domain: memory-registration state, structures for completion queues, context objects for send/recv, and any other provider-private handles.
  • Fails early on mismatch (for example, the NIC has been removed, the driver does not support a requested capability, or the requested limits exceed available resources).

Because the fi_info came from discovery for that NIC port, the provider immediately knows the physical mapping. The created domain is the provider’s logical handle to NIC resources (memory registration tables, per-device queues, and so on), or to the NIC/port context the provider manages. The exact mapping to hardware structures may vary by provider implementation, but it typically corresponds one-to-one or one-to-few with real NIC ports.


Phase 4: Libfabric core (fid_domain publication & hierarchy)

On successful provider creation, the provider returns a provider-private handle (pointer) to the domain state. The libfabric core then:

  • Wraps the provider handle into an application-visible fid_domain object.
  • Links the fid_domain to its parent fid_fabric (the fabric FID is stored as the domain’s parent/owner). This is the second piece of the “glue”: the created fid_domain explicitly references the fid_fabric that it belongs to, so the core can route future child-creation calls (endpoints, MRs, CQs) for this domain back to the same provider/fabric.
  • Copies or records the domain-level attributes (caps, mode, limits) into fields of fid_domain so they can be queried, validated on child creation, and used for lifetime/ref-counting.
  • Increments ref_count on the fid_fabric to prevent fabric destruction while the domain exists.


After this step the application holds a fid_domain handle and can proceed to create endpoints, register memory, create completion queues, etc., all of which the provider maps into the NIC/driver context that the domain represents.
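
A minimal sketch of the call, assuming chosen is the fi_info entry selected during discovery and fabric is the fid_fabric handle created earlier:

struct fid_domain *domain;

ret = fi_domain(fabric, chosen, &domain, NULL);
if (ret)
    return ret;   /* e.g. NIC removed or attributes no longer supported */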

Example fid_domain (illustrative)


fid_domain {
    fid_type        : FI_DOMAIN
    fid             : 0xF1DD001
    parent_fid      : 0xF1DFA01
    provider        : "libfabric-uet"
    caps            : 0x00000011
    mode            : 0x00000004
    mr_key_size     : 8
    cq_cnt          : 4
    ep_cnt          : 128
    provider_data   : <pointer to provider domain struct>
    ref_count       : 1
    context         : <app-provided void *>
}

This stored state allows the core and provider to validate, route and implement subsequent calls that reference the domain.


Explanation of fields:

Identity

  • fid_type : FI_DOMAIN: Fabric Identifier (FID) type. Identifies the object as a domain, which allows the application (and libfabric itself) to distinguish between different object types (fabric, domain, endpoint, etc.).
  • fid: 0xF1DD001: Unique handle for this domain instance. The application uses this handle in API calls that act on the domain.
  • parent_fid: 0xF1DFA01: Reference to the parent fabric object. This links the domain to the fabric where it belongs, ensuring resources remain associated with the correct fabric.

Provider Info

  • provider: "libfabric-uet": Name of the provider managing the domain. The application can confirm which provider implementation it is using, which is important when multiple providers are installed.
  • caps: 0x00000011: Bitmask of capabilities (for example, FI_MSG | FI_RMA). Defines which communication operations the domain supports so the application can use only valid features.
  • mode: 0x00000004: Mode requirements (for example, scalable endpoints). Tells the application about specific restrictions or rules it must follow when using the domain.

Resource Limits

  • mr_key_size: 8: Size in bytes of memory registration keys. The application uses this value when registering memory regions so it provides keys of the correct length.
  • cq_cnt: 4: Maximum number of completion queues supported. Guides the application when designing its event handling because it cannot create more than this limit.
  • ep_cnt: 128: Maximum number of endpoints supported. Tells the application how many communication endpoints it can create within this domain.

Internal States

  • provider_data: <pointer to provider domain struct>: Provider-specific internal pointer. Not used directly by the application but allows the provider to maintain its own internal state.
  • ref_count : 1: Current reference count for the domain. Tracks how many objects depend on this domain and ensures proper cleanup when the domain is released.
  • context : <app-provided void *>: Application-supplied pointer. Lets the application attach custom data, such as state or identifiers, which the provider will return in callbacks.


Figure 4-4: Objects Creation Process – Domain for NIC(s).

Friday, 26 September 2025

Ultra Ethernet: Fabric Creation Process in Libfabric

 Phase 1: Application (Discovery & choice)

After the UET provider populated fi_info structures for each NIC/port combination during discovery, the application can begin the object creation process. It first consults the in-memory fi_info list to identify the entry that best matches its requirements. Each fi_info contains nested attribute structures describing fabric, domain, and endpoint capabilities, including fi_fabric_attr (fabric name, provider identifier, version information), fi_domain_attr (memory registration mode, key details, domain capabilities), and fi_ep_attr (endpoint type, reliable versus unreliable semantics, size limits, and supported capabilities). The application examines the returned entries and selects the fi_info that satisfies its needs (for example: provider == "uet", fabric name == "UET", required capabilities, reliable transport, or a specific memory registration mode). The chosen fi_info then provides the attributes — effectively serving as hints — that the application passes into subsequent creation calls such as fi_fabric(), fi_domain(), and fi_endpoint(). Each fi_info acts as a self-contained “capability snapshot,” describing one possible combination of NIC, port, and transport mode.
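
A rough sketch of this selection step, assuming info is the linked list returned by fi_getinfo() and that the UET provider reports its name as "uet":

#include <string.h>

struct fi_info *chosen = NULL, *cur;

for (cur = info; cur; cur = cur->next) {
    if (strcmp(cur->fabric_attr->prov_name, "uet") == 0 &&        /* provider match */
        (cur->caps & (FI_MSG | FI_RMA)) == (FI_MSG | FI_RMA)) {   /* required capabilities */
        chosen = cur;
        break;
    }
}
/* chosen->fabric_attr, chosen->domain_attr, and chosen->ep_attr now act as
 * the hints passed to fi_fabric(), fi_domain(), and fi_endpoint(). */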


Phase 2: Libfabric Core (dispatch & wiring)

When the application calls fi_fabric(), the core forwards this request to the corresponding provider’s fabric entry point. In this way, the fi_info produced during discovery effectively becomes the configuration input for object creation.

The core’s role is intentionally lightweight: it matches the application’s selected fi_info to the appropriate provider implementation and invokes the provider callback for fabrics, domains, and endpoints. Throughout this process, the fi_info acts as a context carrier, containing the provider identifier, fabric name, and attribute templates for domains and endpoints. The core passes these attributes directly to the provider during creation, ensuring that the provider has all the information necessary to map the requested objects to the correct NIC and transport configuration.


Phase 3: UET Provider

When the application invokes fi_fabric(), passing the fi_fabric_attr obtained from the chosen fi_info, the call is routed to the UET provider’s uet_fabric() entry point for fabric creation. The provider treats the attributes contained within the chosen fi_info as the authoritative configuration for fabric creation. Because each fi_fabric_attr originates from a NIC-specific fi_info, the provider immediately knows which physical NIC and port are associated with the requested fabric object.

The provider uses the fi_fabric_attr to determine which interfaces belong to the requested fabric, the provider-specific capabilities that must be supported, and any optional flags supplied by the application. It then allocates and initializes internal data structures to represent the fabric object, mapping it to the underlying NIC and driver resources described by the discovery snapshot.

During creation, the provider validates that the requested fabric can be supported by the current hardware and driver state. If the NIC or configuration has changed since discovery—for example, if the NIC is unavailable or the requested capabilities are no longer supported—the provider returns an error, preventing creation of an invalid or unsupported fabric object. Otherwise, the provider completes the fabric initialization, making it ready for subsequent domain and endpoint creation calls.

By relying exclusively on the fi_info snapshot from discovery, the provider ensures that the fabric object is created consistently and deterministically, reflecting the capabilities and constraints reported to the application during the discovery phase.


Phase 4: Libfabric Core (Fabric Object Publication)

Once the UET provider successfully creates the fabric object, the libfabric core generates the corresponding fid_fabric handle, which serves as the application-visible representation of the fabric. The term FID stands for Fabric Identifier, a unique identifier assigned to each libfabric object. All subsequent libfabric objects created within the context of this fabric — including domains, endpoints, and memory regions — are prefixed with this FID to maintain object hierarchy and enable internal tracking.

The fid_fabric structure contains metadata about the fabric, such as the associated provider, the fabric name, and internal pointers to provider-specific data structures. It acts as a lightweight descriptor for the application, while the actual resources and state remain managed by the provider.

The libfabric core stores the fid_fabric in in-memory structures in RAM, typically within internal libfabric tables that track all active fabric objects. This allows the core to efficiently validate future API calls that reference the fabric, to maintain object hierarchies, and to route creation requests (e.g., for domains or endpoints) to the correct provider instance. Because the FID resides in RAM, operations using fid_fabric are fast and transient; the core relies on the provider to maintain the persistent state and hardware mappings associated with the fabric.

By publishing the fabric as a fid_fabric object, libfabric establishes a clear and consistent handle for the application. This handle allows the application to reference the fabric unambiguously in subsequent creation calls, while preserving the mapping between the abstract libfabric object and the underlying NIC resources managed by the provider.
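
As a hedged sketch, the creation call that leads to this publication step takes only the fi_fabric_attr from the selected fi_info (here assumed to be named chosen):

struct fid_fabric *fabric;

ret = fi_fabric(chosen->fabric_attr, &fabric, NULL);
if (ret)
    return ret;   /* e.g. the NIC or its capabilities changed since discovery */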

Example fid_fabric Object

fid_fabric {
    fid_type        : FI_FABRIC
    fid             : 0x0001ABCD
    provider        : "uet"
    fabric_name     : "UET-Fabric1"
    version         : 1.0
    state           : ACTIVE
    ref_count       : 1
    provider_data   : <pointer>
    creation_flags  : 0x0
    timestamp       : 1690000000
}


Explanation of fields:

fid_type: Identifies the type of object within libfabric and distinguishes fabric objects (FI_FABRIC) from other object types such as domains or endpoints. This allows both the libfabric core and the provider to validate and correctly handle API calls, object creation, and destruction. The fid field is a unique Fabric Identifier assigned by libfabric, serving as the primary identifier for the fabric object. All child objects, including domains, endpoints, and memory regions, inherit this FID as a prefix to maintain hierarchy and uniqueness, enabling the core to validate object references and route creation calls to the correct provider instance.

provider field: Indicates the name of the provider managing this fabric object, such as "uet". It associates the fabric with the underlying hardware implementation and ensures that all subsequent calls for child objects are dispatched to the correct provider. The fabric_name field contains the human-readable name of the fabric, selected during discovery or by the application, for example "UET-Fabric1". This name allows applications to identify the fabric among multiple available options and is used as a selection criterion during both discovery (fi_getinfo) and creation (fi_fabric).

version field: Specifies the provider or fabric specification version and ensures compatibility between the application and provider. It can be used for logging, debugging, or runtime checks to verify that the fabric supports the required feature set. The state field tracks the current lifecycle status of the fabric object, indicating whether it is active, ready for use, or destroyed. Both the libfabric core and provider validate this field before allowing any operations on the fabric.

ref_count: Maintains a reference counter for the object, preventing it from being destroyed while still in use by the application or other libfabric structures. This counter is incremented during creation or when child objects reference the fabric and decremented when objects are released or destroyed. 

provider_data: Contains an internal pointer to provider-managed structures, including NIC mappings, hardware handles, and configuration details. This field is only accessed by the provider; the application interacts with the fabric through the fid_fabric handle.

creation_flags: Contains optional flags provided by the application during fi_fabric() creation, allowing customization of the fabric initialization process, such as enabling non-default modes or debug options. 

timestamp field: An optional value indicating when the fabric was created. It is useful for debugging, logging, and performance tracking, helping to correlate fabric initialization with other libfabric objects and operations.



Figure 4-2: Objects Creation Process – Fabric for Cluster.


Related post:
https://nwktimes.blogspot.com/2025/09/ultra-ethernet-resource-initialization.html

Wednesday, 24 September 2025

Ultra Ethernet: Resource Initialization

Introduction to libfabric and Ultra Ethernet

[Updated: September-26, 2025]

Deprecated: new version: Ultra Ethernet: Discovery

Libfabric is a communication library that belongs to the OpenFabrics Interfaces (OFI) framework. Its main goal is to provide applications with high-performance and scalable communication services, especially in areas like high-performance computing (HPC) and artificial intelligence (AI). Instead of forcing applications to work directly with low-level networking details, libfabric offers a clean user-space API that hides complexity while still giving applications fast and efficient access to the network.

One of the strengths of libfabric is that it has been designed together with both application developers and hardware vendors. This makes it possible to map application needs closely to the capabilities of modern network hardware. The result is lower software overhead and better efficiency when applications send or receive data.

Ultra Ethernet builds on this foundation by adopting libfabric as its communication abstraction layer. Ultra Ethernet uses the libfabric framework to let endpoints interact with AI frameworks and, ultimately, with each other across GPUs. Libfabric provides a high-performance, low-latency API that hides the details of the underlying transport, so AI frameworks do not need to manage the low-level details of endpoints, buffers, or the underlying address tables that map communication paths. This makes applications more portable across different fabrics while still providing access to advanced features such as zero-copy transfers and RDMA, which are essential for large-scale AI workloads.

During system initialization, libfabric coordinates with the appropriate provider—such as the UET provider—to query the network hardware and organize communication around three main objects: the Fabric, the Domain, and the Endpoint. Each object manages specific sub-objects and resources. For example, a Domain handles memory registration and hardware resources, while an Endpoint is associated with completion queues, transmit/receive buffers, and transport metadata. Ultra Ethernet maps these objects directly to the network hardware, ensuring that when GPUs begin exchanging training data, the communication paths are already aligned for low-latency, high-bandwidth transfers.
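To make this object hierarchy concrete, the short sketch below shows the order in which an application typically creates these objects once discovery (fi_getinfo, described in the following phases) has returned a suitable fi_info entry, here called info. Error handling is omitted to keep the sketch minimal.

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

struct fid_fabric *fabric;
struct fid_domain *domain;
struct fid_ep *ep;

// Fabric: the top-level container selected from the discovery results
fi_fabric(info->fabric_attr, &fabric, NULL);

// Domain: binds the fabric to one NIC (or NIC port) and its resources
fi_domain(fabric, info, &domain, NULL);

// Endpoint: the transmit/receive object used for the actual data transfers
fi_endpoint(domain, info, &ep, NULL);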

Once initialization is complete, AI frameworks issue standard libfabric calls to send and receive data. Ultra Ethernet ensures that this data flows efficiently across GPUs and servers. By separating initialization from runtime communication, this approach reduces overhead, minimizes bottlenecks, and enables scalable training of modern AI models.

In the following sections, we will look more closely at these libfabric objects and explain how they are created and used within Ultra Ethernet.


Discovery Phase: How libfabric and Ultra Ethernet reveal NIC capabilities


Phase 1: Application starts discovery

The discovery process begins in the application. Before communication can start, the application needs to know which services and transport features are available. It calls the function fi_getinfo(). This function uses a data structure, struct fi_info, to exchange information with the provider. The application may fill in parts of this structure with hints that describe its requirements—for example, expected operations such as RMA reads and writes, or message-based exchange, as well as the preferred addressing format. For Ultra Ethernet, this could include a provider-specific format, signaling that the application expects the UET transport to manage its communication efficiently.
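As a minimal sketch of this first step, the snippet below allocates the hints structure, expresses RMA and messaging requirements, and pins the provider name to "uet" so that discovery is steered toward the Ultra Ethernet provider. The specific hint values are illustrative choices, and the usual C and libfabric headers are assumed to be included.

struct fi_info *hints, *info;

hints = fi_allocinfo();

// Express the application's requirements as hints
hints->caps = FI_RMA | FI_MSG;                  // RMA reads/writes plus message exchange
hints->ep_attr->type = FI_EP_RDM;               // reliable, connectionless endpoints
hints->fabric_attr->prov_name = strdup("uet");  // steer discovery toward the UET provider

// Hand the hints to libfabric; matching results are returned in 'info'
int ret = fi_getinfo(FI_VERSION(1, 18), NULL, NULL, 0, hints, &info);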

For multi-GPU AI workloads, this initial step is critical. Large-scale training depends on fast movement of gradients and activations between GPUs. By letting the application express its needs up front, libfabric ensures that only endpoints with the required low-latency and high-throughput capabilities are returned. This reduces surprises at runtime and confirms that the selected transport paths are suitable for demanding AI workloads.


Phase 2: libfabric core loads and initializes a provider

Once the discovery request reaches the libfabric core, the library identifies and loads the appropriate provider from the available options, which may include providers for TCP, verbs (InfiniBand), or sockets (TCP/UDP). In the case of Ultra Ethernet, the core selects the UET provider (libfabric-uet.so). The core invokes the provider entry points, including the getinfo callback, which is responsible for returning the provider’s supported capabilities. The provider structure contains pointers to essential functions like uet_getinfo, uet_fabric, uet_domain, and uet_endpoint. The function pointer uet_getinfo is used here.

This modular design allows libfabric to support multiple transports, such as verbs or TCP, without changing the application code. For AI workloads, this flexibility is vital. A framework can run on Ultra Ethernet in a GPU cluster, or fall back to a different transport in a mixed environment, while maintaining the same high-level communication logic. The core abstracts away provider-specific details, enabling the application to scale seamlessly across different hardware and system configurations.
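The fragment below sketches this dispatch idea with invented names: once the core has selected a provider, the discovery request reaches the provider through its registered getinfo function pointer rather than through a direct call into provider code. It illustrates the mechanism only and is not the actual core implementation.

#include <stdint.h>
#include <rdma/fabric.h>

// Function-pointer type matching the shape of a provider's getinfo entry point
typedef int (*getinfo_fn)(uint32_t version, const char *node,
                          const char *service, uint64_t flags,
                          struct fi_info *hints, struct fi_info **info);

// Invented helper: invoke the selected provider (for Ultra Ethernet, the UET
// provider) through its registered pointer. Node and service are left NULL
// and flags zero for brevity.
static int core_dispatch_getinfo(getinfo_fn provider_getinfo, uint32_t version,
                                 struct fi_info *hints, struct fi_info **info)
{
    return provider_getinfo(version, NULL, NULL, 0, hints, info);
}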


Phase 3: Provider inspects capabilities and queries the driver

Inside the UET provider, the fi_getinfo() function is executed. The provider does not access the NIC hardware directly; instead, it queries the vendor driver or kernel services that manage the device. Through this interface, the provider identifies the capabilities that each NIC can safely and efficiently support. These capabilities are exposed via endpoints, and each endpoint is associated with exactly one domain (i.e., one NIC or NIC port). The endpoint capabilities describe the types of communication operations available—such as RMA, message passing, and reads/writes—and indicate how they can be used effectively in demanding multi-GPU AI workloads.

For small control operations, such as signaling the start of a new training step, coordinating collective operations, or confirming completion of data transfers, the provider advertises capabilities that support message-based communication:

  • FI_SEND: Allows an endpoint to send control messages or metadata to another endpoint without requiring a prior read request.
  • FI_RECV: Enables an endpoint to post a buffer to receive incoming control messages. Combined with FI_SEND, this forms the basic point-to-point control channel between GPUs.
  • FI_SHARED_AV: Provides shared address vectors, allowing multiple endpoints to reference the same set of remote addresses efficiently. Even for control messages, this reduces memory overhead and simplifies management in large clusters.


For bulk data transfer operations, such as gradient or parameter exchanges in collective operations like AllReduce, the provider advertises RMA (remote memory access) capabilities. These allow GPUs to exchange large volumes of data efficiently, without involving the CPU, which is critical for maintaining high throughput in distributed training:


  • FI_RMA: Enables direct remote memory access, forming the foundation for high-performance transfers of gradients or model parameters between GPUs.
  • FI_WRITE: Allows a GPU to push local gradient updates directly into another GPU’s memory.
  • FI_REMOTE_WRITE: Signals that remote GPUs can write directly into this GPU’s memory, supporting push-based collective operations.
  • FI_READ: Allows a GPU to pull gradients or parameters from a peer GPU’s memory.
  • FI_REMOTE_READ: Indicates that remote GPUs can perform read operations on this GPU’s memory, enabling zero-copy pull transfers without CPU involvement.

Low-level hardware metrics, such as link speed or MTU, are not returned in this phase; the focus is on semantic capabilities that the application can rely on. By distinguishing control capabilities from data transfer capabilities, the system ensures that all operations required by the AI framework, both coordination and bulk gradient exchange, are supported by the transport and the underlying hardware driver.

This preemptive check avoids runtime errors, guarantees predictable performance, and ensures that GPUs can rely on zero-copy transfers for gradient synchronization, all-reduce operations, and small control signaling. By exposing both control and RMA capabilities, the UET provider allows AI frameworks to efficiently orchestrate distributed training across many GPUs while maintaining low latency and high throughput.
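To connect the advertised capabilities back to application code, the short check below verifies that a returned fi_info entry carries both the control-path and the bulk-data capability bits discussed above. It is a minimal sketch assuming info points to an entry returned by fi_getinfo().

// Capability sets discussed above: control signaling and bulk RMA transfers
uint64_t control_caps = FI_SEND | FI_RECV;
uint64_t rma_caps     = FI_RMA | FI_WRITE | FI_REMOTE_WRITE | FI_READ | FI_REMOTE_READ;

if ((info->caps & control_caps) == control_caps &&
    (info->caps & rma_caps) == rma_caps) {
    // The endpoint can carry both control messages and zero-copy
    // gradient transfers; safe to continue with object creation.
} else {
    // Try the next fi_info entry or a different provider.
}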

Phase 4: Provider stores capabilities in an internal in-memory database

After querying the driver, the UET provider creates fi_info structures in CPU memory to store the information it has gathered about available endpoints and domains. Each fi_info structure describes one endpoint, including the domain it belongs to, the communication operations it supports (e.g., RMA, send/receive, tagged messages), and the preferred address formats. These structures are entirely in-memory objects and act as a temporary snapshot of the provider’s capabilities at the time of the query.

The application receives a pointer to these structures from fi_getinfo() and can read their fields directly. This allows the application to examine which endpoints and capabilities are available, select those that meet its requirements, and plan subsequent operations such as creating endpoints or registering memory regions. Using in-memory structures ensures fast access, avoids stale information, and supports multiple threads or processes querying provider capabilities concurrently. Once the application no longer needs the information, it can release the memory using fi_freeinfo(), freeing the structures for reuse.

For AI workloads across many GPUs, this in-memory caching is particularly important. Large-scale training often involves dozens or hundreds of simultaneous communication channels. The in-memory database allows the provider to quickly answer repeated discovery requests without querying the hardware again, ensuring that GPU-to-GPU transfers are set up efficiently and consistently. This rapid access to provider capabilities contributes directly to reducing startup overhead and achieving predictable scaling across large clusters.


Phase 5: libfabric returns discovery results to the application

Finally, libfabric returns the discovery results to the application. The fi_info structures built and filled by the provider are handed back as a linked list, each entry describing a specific combination of fabric, domain, and endpoint attributes, along with supported capabilities and address formats. The application can then select the most appropriate entry and proceed to create the fabric, open domains, register memory regions, and establish endpoints.

In addition, when the application calls fi_getinfo(), the returned fi_info structures include several key sub-structures that describe the fabric, domain, and endpoint layers of the transport. The fi_fabric_attr pointer provides fabric-wide information, including the provider name "uet" and the fabric name "UET", which the provider fills to allow the application to create a fabric object with fi_fabric(). The caps field in fi_info reflects the domain or NIC capabilities, such as FI_SEND | FI_RECV | FI_WRITE | FI_REMOTE_WRITE, giving the application a clear view of what operations are supported. The fi_domain_attr sub-structure contains domain-specific details, including the NIC name, for example "Eth0", and the memory registration mode (mr_mode) indicating which memory operations are supported locally and remotely. Endpoint-specific attributes, such as endpoint type and queue configurations, are contained in the fi_ep_attr sub-structure. 

By returning all this information during discovery, the UET provider ensures that the application can select the appropriate fabric, domain, and endpoint configuration before creating any objects. This allows the application to proceed confidently with fabric creation, domain opening, memory registration, and endpoint instantiation without hardcoding any provider, fabric, or domain details. The upcoming sections will explain in detail how each of the fi_info sub-structures—fi_fabric_attr, fi_domain_attr, and fi_ep_attr—is used during the object creation processes, showing how the provider maps them to hardware and how the application leverages them to establish communication endpoints.
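As an illustration of how an application might act on these returned attributes, the fragment below walks the list, picks the first entry served by the "uet" provider, and reads the fabric, domain, and memory registration attributes mentioned above. The printed values are whatever the provider filled in, error handling is omitted, and the usual C and libfabric headers are assumed.

struct fi_info *chosen = NULL;

// Walk the linked list returned by fi_getinfo() and select the UET entry
for (struct fi_info *p = info; p; p = p->next) {
    if (p->fabric_attr && p->fabric_attr->prov_name &&
        strcmp(p->fabric_attr->prov_name, "uet") == 0) {
        chosen = p;
        break;
    }
}

if (chosen) {
    printf("fabric: %s, domain: %s, mr_mode: 0x%x\n",
           chosen->fabric_attr->name,
           chosen->domain_attr->name,
           chosen->domain_attr->mr_mode);
    // 'chosen' is later passed to fi_fabric(), fi_domain(), and fi_endpoint()
}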

This separation between discovery and runtime communication is essential for multi-GPU AI workloads. Applications gain a complete, portable view of the transport capabilities before any data transfer begins. By knowing which operations are supported and having ready-to-use fi_info entries, AI frameworks can coordinate transfers across GPUs with minimal overhead. This ensures high throughput and low latency, even in large clusters, while reducing the risk of bottlenecks caused by unsupported operations or poorly matched endpoints.




Figure 4-1: Initialization stage – Discovery of Provider capabilities.


Friday, 19 September 2025

Ultra Ethernet: Libfabric Resource Initialization

Introduction

Ultra Ethernet uses the libfabric communication framework to let endpoints interact with AI frameworks and, ultimately, with each other across GPUs. Libfabric provides a high-performance, low-latency API that hides the details of the underlying transport, so AI frameworks do not need to manage the low-level details of endpoints, buffers, or the underlying address tables that map communication paths. This makes applications more portable across different fabrics while still providing access to advanced features such as zero-copy transfers and RDMA, which are essential for large-scale AI workloads.

During system initialization, libfabric coordinates with the appropriate provider—such as the UET provider—to query the network hardware and organize communication around three main objects: the Fabric, the Domain, and the Endpoint. Each object manages specific sub-objects and resources. For example, a Domain handles memory registration and hardware resources, while an Endpoint is associated with completion queues, transmit/receive buffers, and transport metadata. Ultra Ethernet maps these objects directly to the network hardware, ensuring that when GPUs begin exchanging training data, the communication paths are already aligned for low-latency, high-bandwidth transfers.

Once initialization is complete, AI frameworks issue standard libfabric calls to send and receive data. Ultra Ethernet ensures that this data flows efficiently across GPUs and servers. By separating initialization from runtime communication, this approach reduces overhead, minimizes bottlenecks, and enables scalable training of modern AI models.


Ultra Ethernet Inter-GPU Capability Discovery Flow  


Phase 1: Application Requests Transport Capabilities

The process begins when an AI application specifies its needs for Ultra Ethernet RMA transport between GPUs, including the type of communication, completion method, and other preferences. The application calls the libfabric function fi_getinfo, providing these hints. Libfabric receives the request and prepares to identify network interfaces and capabilities that match the application’s requirements. This information will eventually be returned in a fi_info structure, which describes the available interfaces and their features.


Phase 2: Libfabric Core Loads the Provider

The libfabric core determines which provider can fulfill the request. In this case, it selects the UET provider for Ultra Ethernet RMA transport. Libfabric can support multiple providers for different network types, such as TCP, InfiniBand (verbs), or other specialized fabrics. The core loads the chosen provider and calls its getinfo function through the registration structure, which references the provider’s main operations: fabric creation, domain creation, endpoint creation, and network information retrieval. This allows libfabric to interact with the provider without knowing its internal implementation.


Phase 3-4: Provider Queries the NIC

Inside the provider, the registration structure directs libfabric to the correct function implementation (getinfo). This function queries each network interface on the host. The hardware driver responds with detailed information for each interface, including MTU, link speed, address formats, memory registration support, and supported transport modes like RMA or messaging. At this stage, the provider has a complete picture of the hardware, but the information has not yet been organized into fi_info structures.


Phase 5: Provider Fills fi_info Structures and Libfabric Filters Results

The provider fills the fi_info structures with the discovered NIC capabilities. The list is then returned to the libfabric core, which applies the application’s original hints to filter the results. Only the interfaces and transport options that match the requested criteria are presented to the application, providing a pre-filtered set of network options ready for use.


Phase 6: Application Receives Filtered Capabilities

The filtered fi_info structures are returned to the application. The AI framework can now create fabrics, domains, and endpoints, confident that the selected NICs are ready for efficient inter-GPU communication. By separating initialization from runtime communication, Ultra Ethernet and libfabric ensure that resources are aligned with the hardware, minimizing overhead and enabling predictable, high-bandwidth transfers.

The flow—from the application request, through libfabric core, the UET provider, the NIC and hardware driver, and back to the application—establishes a clear separation of responsibilities: the application defines what it needs, libfabric coordinates the providers, the provider interacts with the hardware, and the application receives a pre-filtered set of options ready for low-latency, high-bandwidth inter-GPU communication.



Figure 4-1: Initialization stage – Discovery of Provider capabilities.

Simplified Coding Examples (Optional)


This book does not aim to teach programming, but simplified code examples are provided here to support readers who learn best by studying practical snippets. These examples map directly to the six phases of initialization and show how libfabric and the UET provider work together under the hood.

Example 1 – Application fi_info Request with Hints


Why: The application needs to describe what kind of transport it wants (e.g., RMA for GPU communication) without worrying about low-level NIC details.

What: It prepares a fi_info structure with hints, such as transport type and completion method, then calls fi_getinfo() to ask libfabric which providers and NICs can match these requirements.

How: The application sets fields in hints, then hands them to fi_getinfo(). Libfabric uses this information to begin provider discovery.

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

struct fi_info *hints, *info;
hints = fi_allocinfo();

// Request a reliable datagram endpoint for inter-GPU RMA
hints->ep_attr->type = FI_EP_RDM;

// Request RMA and messaging capabilities
hints->caps = FI_RMA | FI_MSG;

// Indicate that the application will supply a struct fi_context with each operation
hints->mode = FI_CONTEXT;

// Memory registration preferences
hints->domain_attr->mr_mode = FI_MR_LOCAL | FI_MR_VIRT_ADDR;

// Query libfabric to get matching providers and NICs
int ret = fi_getinfo(FI_VERSION(1, 18), NULL, NULL, 0, hints, &info);

Example 2 – UET Provider Registration with getinfo


Why: Providers tell libfabric what operations they support. Registration connects the generic libfabric core to the UET-specific implementation.

What: The fi_provider structure is filled with function pointers. Among them, uet_getinfo is the callback used when libfabric queries the UET provider for NIC capabilities.

How: Libfabric calls these functions through the registration, so it doesn’t need to know the provider’s internal code.

/* Simplified illustration of provider registration. In upstream libfabric,
 * struct fi_provider registers entry points such as getinfo and fabric;
 * domain and endpoint creation are reached later through the fabric
 * object's operations. The extra members below are shown only to
 * visualize the full call chain. */
struct fi_provider uet_prov = {
    .name     = "uet",
    .version  = FI_VERSION(1, 18),
    .getinfo  = uet_getinfo,   // provider-specific discovery implementation
    .fabric   = uet_fabric,
    .domain   = uet_domain,    // illustrative only
    .endpoint = uet_endpoint,  // illustrative only
};


Example 3 – Provider Builds fi_info Structures


Why: After querying the NIC driver, the provider must describe all capabilities in a format libfabric understands. This includes link speed, MTU, memory registration, and supported transport modes. These details allow libfabric to determine which providers can satisfy the application’s requirements.

What: The provider allocates and fills an fi_info structure with the capabilities of each discovered NIC. This structure represents the full picture of the hardware, independent of what the application specifically requested.

How: The uet_getinfo() function queries each NIC, populates fi_info fields, and returns them to libfabric for further filtering.


int uet_getinfo(uint32_t version, const char *node, const char *service,
                uint64_t flags, struct fi_info *hints, struct fi_info **info) {
    struct fi_info *fi;

    fi = fi_allocinfo();
    if (!fi)
        return -FI_ENOMEM;

    // Provider and endpoint information
    fi->fabric_attr->prov_name = strdup("uet");
    fi->fabric_attr->name = strdup("UltraEthernetFabric");
    fi->ep_attr->type = FI_EP_RDM;
    fi->caps = FI_RMA | FI_MSG;
    fi->domain_attr->mr_mode = FI_MR_LOCAL | FI_MR_VIRT_ADDR;

    // Address format of the returned entry (IPv4 used for illustration)
    fi->addr_format = FI_SOCKADDR_IN;

    // NIC-level details such as MTU and link speed are exposed through the
    // optional fi->nic sub-structure in real providers; they are omitted
    // here to keep this example minimal.

    *info = fi;
    return 0;
}

Note: The provider describes all capabilities, even those not requested, so libfabric can filter later. Hardware details such as MTU and link speed illustrate the kind of information a provider gathers from the driver; in libfabric they are reported through the optional fi_nic sub-structure rather than as direct fi_info fields, while the address format is carried in the top-level addr_format field.

Example 4 – Filtered fi_info Returned to Application


Why: The application does not need all hardware details — only the NICs and transport modes that match its original hints. Filtering ensures it sees a concise, relevant set of options.

What: Libfabric applies the application’s criteria to the provider’s fi_info structures and returns only the matching entries.

How: The application receives the filtered list and can use it to initialize fabrics, domains, and endpoints for communication.

struct fi_info *p;
for (p = info; p; p = p->next) {
    // Display key information relevant to the application
    printf("Provider: %s, Endpoint type: %d, Caps: 0x%llx, Addr format: %u\n",
           p->fabric_attr->prov_name,
           p->ep_attr->type,
           (unsigned long long)p->caps,
           p->addr_format);
}
fi_freeinfo(info);

Note: Only the relevant subset of NIC capabilities is shown to the application. The application can now create fabrics, domains, and endpoints confident that the hardware matches its requirements. Filtering reduces complexity and ensures predictable, high-performance inter-GPU transfers.

References

[1] Libfabric Programmer's Manual: Libfabric man pages https://ofiwg.github.io/libfabric/v2.3.0/man/
[2] Ultra Ethernet Specification v1.0, June 11, 2025 by Ultra Ethernet Consortium, https://ultraethernet.org