Wednesday, 24 September 2025

Ultra Ethernet: Resource Initialization

Introduction to libfabric and Ultra Ethernet

[Updated: September 26, 2025]

Libfabric is a communication library that belongs to the OpenFabrics Interfaces (OFI) framework. Its main goal is to provide applications with high-performance and scalable communication services, especially in areas like high-performance computing (HPC) and artificial intelligence (AI). Instead of forcing applications to work directly with low-level networking details, libfabric offers a clean user-space API that hides complexity while still giving applications fast and efficient access to the network.

One of the strengths of libfabric is that it has been designed together with both application developers and hardware vendors. This makes it possible to map application needs closely to the capabilities of modern network hardware. The result is lower software overhead and better efficiency when applications send or receive data.

Ultra Ethernet builds on this foundation by adopting libfabric as its communication abstraction layer, using it to let endpoints interact with AI frameworks and, ultimately, with each other across GPUs. Libfabric provides a high-performance, low-latency API that hides the specifics of the underlying transport, so AI frameworks do not need to manage the low-level details of endpoints, buffers, or the address tables that map communication paths. This makes applications more portable across different fabrics while still providing access to advanced features such as zero-copy transfers and RDMA, which are essential for large-scale AI workloads.

During system initialization, libfabric coordinates with the appropriate provider—such as the UET provider—to query the network hardware and organize communication around three main objects: the Fabric, the Domain, and the Endpoint. Each object manages specific sub-objects and resources. For example, a Domain handles memory registration and hardware resources, while an Endpoint is associated with completion queues, transmit/receive buffers, and transport metadata. Ultra Ethernet maps these objects directly to the network hardware, ensuring that when GPUs begin exchanging training data, the communication paths are already aligned for low-latency, high-bandwidth transfers.
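As a rough sketch of how these objects relate in code (using the public libfabric C API), the fragment below opens each object in turn. It assumes a struct fi_info entry named info has already been returned by the discovery step described in the following sections, and it omits error handling and address-vector setup for brevity; the exact sub-objects a UET deployment uses may differ.

#include <stddef.h>
#include <rdma/fabric.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

/* Sketch of the Fabric -> Domain -> Endpoint hierarchy. 'info' is assumed to
 * come from fi_getinfo(); error handling is omitted for brevity. */
static void open_resources(struct fi_info *info, void *buf, size_t len)
{
        struct fid_fabric *fabric;
        struct fid_domain *domain;
        struct fid_cq     *cq;
        struct fid_mr     *mr;
        struct fid_ep     *ep;

        /* Fabric: top-level container for one provider/fabric network. */
        fi_fabric(info->fabric_attr, &fabric, NULL);

        /* Domain: one NIC or NIC port; owns memory registration and
         * other hardware resources. */
        fi_domain(fabric, info, &domain, NULL);

        /* Sub-objects of the domain: a completion queue and a registered
         * memory region for transmit/receive buffers. */
        struct fi_cq_attr cq_attr = { .format = FI_CQ_FORMAT_CONTEXT };
        fi_cq_open(domain, &cq_attr, &cq, NULL);
        fi_mr_reg(domain, buf, len, FI_SEND | FI_RECV | FI_REMOTE_WRITE,
                  0, 0, 0, &mr, NULL);

        /* Endpoint: the transmit/receive object, bound to the completion
         * queue so its operations report completions there. An address
         * vector would also be bound here for connectionless endpoints. */
        fi_endpoint(domain, info, &ep, NULL);
        fi_ep_bind(ep, &cq->fid, FI_TRANSMIT | FI_RECV);
        fi_enable(ep);
}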

Once initialization is complete, AI frameworks issue standard libfabric calls to send and receive data. Ultra Ethernet ensures that this data flows efficiently across GPUs and servers. By separating initialization from runtime communication, this approach reduces overhead, minimizes bottlenecks, and enables scalable training of modern AI models.

In the following sections, we will look more closely at these libfabric objects and explain how they are created and used within Ultra Ethernet.


Discovery Phase: How libfabric and Ultra Ethernet reveal NIC capabilities


Phase 1: Application starts discovery

The discovery process begins in the application. Before communication can start, the application needs to know which services and transport features are available. It calls the function fi_getinfo(). This function uses a data structure, struct fi_info, to exchange information with the provider. The application may fill in parts of this structure with hints that describe its requirements—for example, expected operations such as RMA reads and writes, or message-based exchange, as well as the preferred addressing format. For Ultra Ethernet, this could include a provider-specific format, signaling that the application expects the UET transport to manage its communication efficiently.
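A minimal sketch of this first step in C is shown below; the capability bits and endpoint type chosen for the hints are illustrative, not values prescribed by Ultra Ethernet.

#include <stdio.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_errno.h>

/* Describe the application's requirements in a hints structure and ask
 * libfabric which providers and endpoints can satisfy them. */
static int discover(struct fi_info **info)
{
        struct fi_info *hints = fi_allocinfo();
        if (!hints)
                return -FI_ENOMEM;

        hints->ep_attr->type = FI_EP_RDM;              /* reliable datagram endpoints */
        hints->caps = FI_MSG | FI_RMA |                /* control messages + bulk RMA */
                      FI_READ | FI_WRITE |
                      FI_REMOTE_READ | FI_REMOTE_WRITE;
        hints->addr_format = FI_FORMAT_UNSPEC;         /* or a provider-specific format */

        int ret = fi_getinfo(FI_VERSION(FI_MAJOR_VERSION, FI_MINOR_VERSION),
                             NULL, NULL, 0, hints, info);
        if (ret)
                fprintf(stderr, "fi_getinfo: %s\n", fi_strerror(-ret));

        fi_freeinfo(hints);
        return ret;
}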

For multi-GPU AI workloads, this initial step is critical. Large-scale training depends on fast movement of gradients and activations between GPUs. By letting the application express its needs up front, libfabric ensures that only endpoints with the required low-latency and high-throughput capabilities are returned. This reduces surprises at runtime and confirms that the selected transport paths are suitable for demanding AI workloads.


Phase 2: libfabric core loads and initializes a provider

Once the discovery request reaches the libfabric core, the library identifies and loads the appropriate provider from the available options, which may include the tcp, verbs (InfiniBand), or sockets (TCP/UDP) providers. In the case of Ultra Ethernet, the core selects the UET provider (libfabric-uet.so). The core then invokes the provider's entry points, starting with the getinfo callback, which is responsible for returning the provider's supported capabilities. The provider structure contains pointers to essential functions such as uet_getinfo, uet_fabric, uet_domain, and uet_endpoint; in this phase, the uet_getinfo function pointer is invoked.
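The sketch below gives a rough idea of how a provider like UET might register its entry points with the core. It is loosely modeled on libfabric's provider-facing interface (struct fi_provider); the header path and exact field set vary between libfabric versions, and the uet_* functions are hypothetical placeholders rather than the real UET code. In the upstream interface, the top-level structure carries only the getinfo, fabric, and cleanup callbacks, while domain and endpoint creation are reached later through ops tables installed on the fabric and domain objects.

/* Illustrative provider-side sketch; not the actual UET implementation. */
#include <rdma/fabric.h>
#include <rdma/fi_errno.h>
#include <rdma/providers/fi_prov.h>   /* provider-facing header (path may vary) */

static int uet_getinfo(uint32_t version, const char *node, const char *service,
                       uint64_t flags, const struct fi_info *hints,
                       struct fi_info **info)
{
        /* Phases 3-4: query the vendor driver and build fi_info entries
         * describing the NIC's domains, endpoints, and capabilities. */
        return -FI_ENOSYS;   /* placeholder body */
}

static int uet_fabric(struct fi_fabric_attr *attr, struct fid_fabric **fabric,
                      void *context)
{
        /* Create the fabric object; domains and endpoints are opened later
         * through the ops tables this object installs. */
        return -FI_ENOSYS;   /* placeholder body */
}

static void uet_cleanup(void)
{
        /* Release provider-wide state when the core unloads the provider. */
}

/* The core locates this structure when it loads the provider library;
 * an application's fi_getinfo() lands in .getinfo, fi_fabric() in .fabric. */
struct fi_provider uet_prov = {
        .name       = "uet",
        .version    = FI_VERSION(1, 0),   /* provider's own version (illustrative) */
        .fi_version = FI_VERSION(FI_MAJOR_VERSION, FI_MINOR_VERSION),
        .getinfo    = uet_getinfo,
        .fabric     = uet_fabric,
        .cleanup    = uet_cleanup,
};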

This modular design allows libfabric to support multiple transports, such as verbs or TCP, without changing the application code. For AI workloads, this flexibility is vital. A framework can run on Ultra Ethernet in a GPU cluster, or fall back to a different transport in a mixed environment, while maintaining the same high-level communication logic. The core abstracts away provider-specific details, enabling the application to scale seamlessly across different hardware and system configurations.
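One way an application or framework can steer this selection without changing its communication logic is to name the preferred provider in the discovery hints; a small sketch, assuming the UET provider registers under the name "uet":

#include <string.h>
#include <rdma/fabric.h>

/* Restrict discovery to one provider by name. The string "uet" is an
 * assumption; whatever name the UET provider registers under applies.
 * Leaving prov_name unset (or steering selection with the FI_PROVIDER
 * environment variable) lets the same code fall back to another transport. */
static void prefer_uet_provider(struct fi_info *hints)
{
        hints->fabric_attr->prov_name = strdup("uet");   /* freed by fi_freeinfo() */
}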


Phase 3: Provider inspects capabilities and queries the driver

Inside the UET provider, the getinfo callback (uet_getinfo) runs in response to the application's fi_getinfo() call. The provider does not access the NIC hardware directly; instead, it queries the vendor driver or kernel services that manage the device. Through this interface, the provider identifies the capabilities that each NIC can safely and efficiently support. These capabilities are exposed via endpoints, and each endpoint is associated with exactly one domain (i.e., one NIC or NIC port). The endpoint capabilities describe the types of communication operations available—such as RMA reads/writes and message passing—and indicate how they can be used effectively in demanding multi-GPU AI workloads.

For small control operations, such as signaling the start of a new training step, coordinating collective operations, or confirming completion of data transfers, the provider advertises capabilities that support message-based communication:

  • FI_SEND: Allows an endpoint to send control messages or metadata to another endpoint without the receiver first having to request the data.
  • FI_RECV: Enables an endpoint to post a buffer to receive incoming control messages. Combined with FI_SEND, this forms the basic point-to-point control channel between GPUs.
  • FI_SHARED_AV: Provides shared address vectors, allowing multiple endpoints to reference the same set of remote addresses efficiently. Even for control messages, this reduces memory overhead and simplifies management in large clusters.


For bulk data transfer operations, such as gradient or parameter exchanges in collective operations like AllReduce, the provider advertises RMA (remote memory access) capabilities. These allow GPUs to exchange large volumes of data efficiently, without involving the CPU, which is critical for maintaining high throughput in distributed training:


  • FI_RMA: Enables direct remote memory access, forming the foundation for high-performance transfers of gradients or model parameters between GPUs.
  • FI_WRITE: Allows a GPU to push local gradient updates directly into another GPU’s memory.
  • FI_REMOTE_WRITE: Signals that remote GPUs can write directly into this GPU’s memory, supporting push-based collective operations.
  • FI_READ: Allows a GPU to pull gradients or parameters from a peer GPU’s memory.
  • FI_REMOTE_READ: Indicates that remote GPUs can perform read operations on this GPU’s memory, enabling zero-copy pull transfers without CPU involvement.
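From the application's point of view, these advertised flags show up in the caps field of the returned fi_info entries, so it can verify up front that both the control and the bulk-transfer operations it needs are available. A minimal check might look like the sketch below; the exact mask an AI framework requires is application-specific, and this combination is only illustrative.

#include <rdma/fabric.h>

/* Verify that one discovered fi_info entry advertises both the control-channel
 * and the RMA capabilities listed above. */
static int entry_supports_training_traffic(const struct fi_info *info)
{
        const uint64_t control_caps = FI_MSG | FI_SEND | FI_RECV;
        const uint64_t rma_caps     = FI_RMA | FI_READ | FI_WRITE |
                                      FI_REMOTE_READ | FI_REMOTE_WRITE;

        return (info->caps & control_caps) == control_caps &&
               (info->caps & rma_caps) == rma_caps;
}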

Low-level hardware metrics, such as link speed or MTU, are not returned in this phase; the focus is on semantic capabilities that the application can rely on. By distinguishing control capabilities from data transfer capabilities, the system ensures that all operations required by the AI framework, both coordination and bulk gradient exchange, are supported by the transport and the underlying hardware driver.

This preemptive check avoids runtime errors, guarantees predictable performance, and ensures that GPUs can rely on zero-copy transfers for gradient synchronization, all-reduce operations, and small control signaling. By exposing both control and RMA capabilities, the UET provider allows AI frameworks to efficiently orchestrate distributed training across many GPUs while maintaining low latency and high throughput.

Phase 4: Provider stores capabilities in an internal in-memory database

After querying the driver, the UET provider creates fi_info structures in CPU memory to store the information it has gathered about available endpoints and domains. Each fi_info structure describes one endpoint, including the domain it belongs to, the communication operations it supports (e.g., RMA, send/receive, tagged messages), and the preferred address formats. These structures are entirely in-memory objects and act as a temporary snapshot of the provider’s capabilities at the time of the query.

The application receives a pointer to these structures from fi_getinfo() and can read their fields directly. This allows the application to examine which endpoints and capabilities are available, select those that meet its requirements, and plan subsequent operations such as creating endpoints or registering memory regions. Using in-memory structures ensures fast access, avoids stale information, and supports multiple threads or processes querying provider capabilities concurrently. Once the application no longer needs the information, it can release the memory using fi_freeinfo(), freeing the structures for reuse.
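A short sketch of how an application might walk the returned list, read a few fields, and then release the snapshot; fi_tostr() is used here to render the capability bits as text.

#include <stdio.h>
#include <rdma/fabric.h>

/* Walk the fi_info list returned by fi_getinfo(), print a few fields of
 * interest, then hand the in-memory snapshot back to the library. */
static void dump_and_free(struct fi_info *info)
{
        for (struct fi_info *cur = info; cur; cur = cur->next) {
                printf("provider: %s, fabric: %s, domain: %s\n",
                       cur->fabric_attr->prov_name,
                       cur->fabric_attr->name,
                       cur->domain_attr->name);
                printf("  caps: %s\n", fi_tostr(&cur->caps, FI_TYPE_CAPS));
        }
        fi_freeinfo(info);   /* release the snapshot once it is no longer needed */
}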

For AI workloads across many GPUs, this in-memory caching is particularly important. Large-scale training often involves dozens or hundreds of simultaneous communication channels. The in-memory database allows the provider to quickly answer repeated discovery requests without querying the hardware again, ensuring that GPU-to-GPU transfers are set up efficiently and consistently. This rapid access to provider capabilities contributes directly to reducing startup overhead and achieving predictable scaling across large clusters.


Phase 5: libfabric returns discovery results to the application

Finally, libfabric returns the discovery results to the application. The fi_info structures built and filled by the provider are handed back as a linked list, each entry describing a specific combination of fabric, domain, and endpoint attributes, along with supported capabilities and address formats. The application can then select the most appropriate entry and proceed to create the fabric, open domains, register memory regions, and establish endpoints.

In addition, when the application calls fi_getinfo(), the returned fi_info structures include several key sub-structures that describe the fabric, domain, and endpoint layers of the transport. The fi_fabric_attr pointer provides fabric-wide information, including the provider name "uet" and the fabric name "UET", which the provider fills to allow the application to create a fabric object with fi_fabric(). The caps field in fi_info reflects the domain or NIC capabilities, such as FI_SEND | FI_RECV | FI_WRITE | FI_REMOTE_WRITE, giving the application a clear view of what operations are supported. The fi_domain_attr sub-structure contains domain-specific details, including the NIC name, for example "Eth0", and the memory registration mode (mr_mode), which tells the application how memory must be registered for local and remote access. Endpoint-specific attributes, such as endpoint type and queue configurations, are contained in the fi_ep_attr sub-structure.

By returning all this information during discovery, the UET provider ensures that the application can select the appropriate fabric, domain, and endpoint configuration before creating any objects. This allows the application to proceed confidently with fabric creation, domain opening, memory registration, and endpoint instantiation without hardcoding any provider, fabric, or domain details. The upcoming sections will explain in detail how each of the fi_info sub-structures—fi_fabric_attr, fi_domain_attr, and fi_ep_attr—is used during the object creation processes, showing how the provider maps them to hardware and how the application leverages them to establish communication endpoints.
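The fragment below sketches this selection step: it picks the first returned entry whose caps satisfy a caller-chosen mask and uses its fi_fabric_attr to create the fabric object. Domain opening, memory registration, and endpoint creation follow the same pattern and are covered in the upcoming sections.

#include <stdio.h>
#include <rdma/fabric.h>
#include <rdma/fi_errno.h>

/* Pick the first discovered entry that advertises the required capabilities,
 * then create the fabric object from its fabric-wide attributes. */
static int select_and_open_fabric(struct fi_info *list, uint64_t required_caps,
                                  struct fi_info **chosen,
                                  struct fid_fabric **fabric)
{
        for (struct fi_info *cur = list; cur; cur = cur->next) {
                if ((cur->caps & required_caps) != required_caps)
                        continue;
                *chosen = cur;
                /* fi_fabric() consumes the fabric-wide attributes filled in by
                 * the provider (e.g., provider name "uet", fabric name "UET"). */
                return fi_fabric(cur->fabric_attr, fabric, NULL);
        }
        fprintf(stderr, "no fi_info entry matches the requested capabilities\n");
        return -FI_ENODATA;
}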

This separation between discovery and runtime communication is essential for multi-GPU AI workloads. Applications gain a complete, portable view of the transport capabilities before any data transfer begins. By knowing which operations are supported and having ready-to-use fi_info entries, AI frameworks can coordinate transfers across GPUs with minimal overhead. This ensures high throughput and low latency, even in large clusters, while reducing the risk of bottlenecks caused by unsupported operations or poorly matched endpoints.




Figure 4-1: Initialization stage – Discovery of Provider capabilities.

