Introduction to libfabric and Ultra Ethernet
Libfabric is a communication library that belongs to the OpenFabrics Interfaces (OFI) framework. Its main goal is to provide applications with high-performance and scalable communication services, especially in areas like high-performance computing (HPC) and artificial intelligence (AI). Instead of forcing applications to work directly with low-level networking details, libfabric offers a clean user-space API that hides complexity while still giving applications fast and efficient access to the network.
One of the strengths of libfabric is that it has been designed together with both application developers and hardware vendors. This makes it possible to map application needs closely to the capabilities of modern network hardware. The result is lower software overhead and better efficiency when applications send or receive data.
Ultra Ethernet builds on this foundation by adopting libfabric as its communication abstraction layer. Through libfabric, AI frameworks drive Ultra Ethernet endpoints, and those endpoints ultimately let GPUs communicate with one another across the fabric. Libfabric provides a high-performance, low-latency API that hides the details of the underlying transport, so AI frameworks do not need to manage endpoints, buffers, or the address tables that map communication paths themselves. This makes applications more portable across different fabrics while still providing access to advanced features such as zero-copy transfers and RDMA, which are essential for large-scale AI workloads.
During system initialization, libfabric coordinates with the appropriate provider—such as the UET provider—to query the network hardware and organize communication around three main objects: the Fabric, the Domain, and the Endpoint. Each object manages specific sub-objects and resources. For example, a Domain handles memory registration and hardware resources, while an Endpoint is associated with completion queues, transmit/receive buffers, and transport metadata. Ultra Ethernet maps these objects directly to the network hardware, ensuring that when GPUs begin exchanging training data, the communication paths are already aligned for low-latency, high-bandwidth transfers.
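To make this hierarchy concrete, the following minimal C sketch shows how an application typically creates the three objects from a discovered fi_info entry, then binds a completion queue and an address vector to the endpoint before enabling it. It is a simplified illustration rather than UET-specific code: the helper name open_objects is hypothetical, the `info` entry is assumed to come from fi_getinfo() (covered in the next sections), and error handling is reduced to a single check per call.

```c
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_eq.h>

/* 'info' is one fi_info entry previously returned by fi_getinfo(). */
static int open_objects(struct fi_info *info)
{
    struct fid_fabric *fabric;
    struct fid_domain *domain;
    struct fid_ep     *ep;
    struct fid_cq     *cq;
    struct fid_av     *av;
    struct fi_cq_attr  cq_attr = { .format = FI_CQ_FORMAT_CONTEXT };
    struct fi_av_attr  av_attr = { .type = FI_AV_TABLE };
    int ret;

    /* Fabric: top-level object representing the network fabric. */
    ret = fi_fabric(info->fabric_attr, &fabric, NULL);
    if (ret) return ret;

    /* Domain: owns memory registration and hardware resources. */
    ret = fi_domain(fabric, info, &domain, NULL);
    if (ret) return ret;

    /* Completion queue: reports completed transmit/receive operations. */
    ret = fi_cq_open(domain, &cq_attr, &cq, NULL);
    if (ret) return ret;

    /* Address vector: maps peer addresses to fi_addr_t handles. */
    ret = fi_av_open(domain, &av_attr, &av, NULL);
    if (ret) return ret;

    /* Endpoint: the transport-addressable communication object. */
    ret = fi_endpoint(domain, info, &ep, NULL);
    if (ret) return ret;

    ret = fi_ep_bind(ep, &cq->fid, FI_TRANSMIT | FI_RECV);
    if (ret) return ret;

    ret = fi_ep_bind(ep, &av->fid, 0);
    if (ret) return ret;

    return fi_enable(ep);
}
```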
Once initialization is complete, AI frameworks issue standard libfabric calls to send and receive data. Ultra Ethernet ensures that this data flows efficiently across GPUs and servers. By separating initialization from runtime communication, this approach reduces overhead, minimizes bottlenecks, and enables scalable training of modern AI models.
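As a rough sketch of that runtime path, the snippet below posts a receive buffer, sends a small control message, and polls the completion queue until both operations complete. It assumes the `ep` and `cq` objects from the previous sketch and a peer address already resolved through the address vector; the function name, message contents, and buffer sizes are placeholders.

```c
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

/* Assumes 'ep', 'cq', and a resolved fi_addr_t 'peer_addr' already exist. */
static int exchange_control(struct fid_ep *ep, struct fid_cq *cq,
                            fi_addr_t peer_addr)
{
    char tx_buf[64] = "step-begin";   /* placeholder control message */
    char rx_buf[64];
    struct fi_cq_entry comp;
    ssize_t ret;
    int completions = 0;

    /* Post a receive buffer before the peer's message can arrive. */
    ret = fi_recv(ep, rx_buf, sizeof(rx_buf), NULL, FI_ADDR_UNSPEC, NULL);
    if (ret) return (int) ret;

    /* Send a small control message to the peer endpoint. */
    ret = fi_send(ep, tx_buf, sizeof(tx_buf), NULL, peer_addr, NULL);
    if (ret) return (int) ret;

    /* Poll the completion queue until both operations complete. */
    while (completions < 2) {
        ret = fi_cq_read(cq, &comp, 1);
        if (ret == 1)
            completions++;
        else if (ret != -FI_EAGAIN)
            return (int) ret;         /* unexpected error */
    }
    return 0;
}
```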
In the following sections, we will look more closely at these libfabric objects and explain how they are created and used within Ultra Ethernet.
Discovery Phase: How libfabric and Ultra Ethernet reveal NIC capabilities
Phase 1: Application starts discovery
The discovery process begins in the application. Before any communication can occur, the application must understand which services and transport capabilities are available. It uses the fi_getinfo() API call, where it may provide a set of hints to express its requirements. Here, the application specifies the operations it plans to perform—such as remote memory access, reads, writes, and message-based communication—as well as the preferred address format. For Ultra Ethernet, this might include a provider-specific format, indicating that the application expects the UET transport to handle its communication paths efficiently. By calling the fi_getinfo() API with these hints, the application signals libfabric to discover all available providers and endpoints capable of fulfilling its requirements.
For multi-GPU AI workloads, this initial step is crucial. Large-scale training involves many GPUs exchanging gradients and activations rapidly. By allowing the application to express exactly what it needs, libfabric ensures that only endpoints capable of supporting low-latency, high-throughput operations are returned. This avoids runtime surprises and ensures that the chosen transport paths are suitable for demanding AI workloads.
Phase 2: libfabric core loads and initializes a provider
Once the discovery request reaches the libfabric core, the library identifies and loads the appropriate provider from the available options, which may include providers for TCP, verbs (InfiniBand), or sockets (TCP/UDP). In the case of Ultra Ethernet, the core selects the UET provider (libfabric-uet.so). The core invokes the provider entry points, including the getinfo callback, which is responsible for returning the provider’s supported capabilities. The provider structure contains pointers to essential functions such as uet_getinfo, uet_fabric, uet_domain, and uet_endpoint; during discovery, it is the uet_getinfo pointer that is invoked.
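The sketch below illustrates what such an entry-point table can look like, using libfabric’s generic struct fi_provider with stubbed-out callbacks. The uet_* names and bodies are placeholders standing in for the real UET provider implementation, which is not shown here.

```c
#include <rdma/fabric.h>
#include <rdma/fi_errno.h>

/* Illustrative stubs only; the real UET provider implements these fully. */
static int uet_getinfo(uint32_t version, const char *node, const char *service,
                       uint64_t flags, const struct fi_info *hints,
                       struct fi_info **info)
{
    /* Build and return fi_info entries describing the UET transport. */
    return -FI_ENODATA;            /* placeholder: no entries in this sketch */
}

static int uet_fabric(struct fi_fabric_attr *attr, struct fid_fabric **fabric,
                      void *context)
{
    /* Create the provider's fabric object; domains and endpoints follow. */
    return -FI_ENOSYS;             /* placeholder */
}

static void uet_cleanup(void)
{
    /* Release provider-wide resources when libfabric unloads the provider. */
}

/* Entry-point table registered with the libfabric core when the provider
 * shared object is loaded; fi_getinfo() lands in the getinfo callback. */
struct fi_provider uet_prov = {
    .version    = FI_VERSION(1, 0),    /* provider's own version */
    .fi_version = FI_VERSION(1, 20),   /* libfabric API version it targets */
    .name       = "uet",
    .getinfo    = uet_getinfo,
    .fabric     = uet_fabric,
    .cleanup    = uet_cleanup,
};
```

The domain and endpoint entry points mentioned above are reached afterwards through the fabric object that the fabric callback creates.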
This modular design allows libfabric to support multiple transports, such as verbs or TCP, without changing the application code. For AI workloads, this flexibility is vital. A framework can run on Ultra Ethernet in a GPU cluster, or fall back to a different transport in a mixed environment, while maintaining the same high-level communication logic. The core abstracts away provider-specific details, enabling the application to scale seamlessly across different hardware and system configurations.
Phase 3: Provider inspects capabilities and queries the driver
Inside the UET provider, the getinfo function is executed. The provider does not access the NIC directly; instead, it communicates with the vendor driver or kernel services that manage the hardware. Through this interface, the provider discovers the capabilities that the NIC can safely and efficiently support. These capabilities describe both the types of communication operations that can be performed and how they can be used in AI multi-GPU workloads.
For small control operations, such as signaling the start of a new training step, coordinating collective operations, or confirming completion of data transfers, the provider advertises capabilities that support message-based communication:
- FI_SEND: Allows an endpoint to send control messages or metadata to another endpoint without requiring a prior read request.
- FI_RECV: Enables an endpoint to post a buffer to receive incoming control messages. Combined with FI_SEND, this forms the basic point-to-point control channel between GPUs.
- FI_SHARED_AV: Provides shared address vectors, allowing multiple endpoints to reference the same set of remote addresses efficiently. Even for control messages, this reduces memory overhead and simplifies management in large clusters.
For bulk data transfer operations, such as gradient or parameter exchanges in collective operations like AllReduce, the provider advertises RMA (remote memory access) capabilities. These allow GPUs to exchange large volumes of data efficiently, without involving the CPU, which is critical for maintaining high throughput in distributed training (a short sketch after this list shows how these capabilities translate into registration and write calls):
- FI_RMA: Enables direct remote memory access, forming the foundation for high-performance transfers of gradients or model parameters between GPUs.
- FI_WRITE: Allows a GPU to push local gradient updates directly into another GPU’s memory.
- FI_REMOTE_WRITE: Signals that remote GPUs can write directly into this GPU’s memory, supporting push-based collective operations.
- FI_READ: Allows a GPU to pull gradients or parameters from a peer GPU’s memory.
- FI_REMOTE_READ: Indicates that remote GPUs can perform read operations on this GPU’s memory, enabling zero-copy pull transfers without CPU involvement.
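To show how these capabilities are used in practice, here is a minimal sketch that registers a local gradient buffer and pushes it into a peer’s registered memory with fi_write. The function name and the remote_addr/remote_key parameters are hypothetical, the peer’s registered address and key are assumed to have been exchanged out of band, and completion handling on the bound completion queue is omitted.

```c
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_rma.h>

/* Assumes 'domain', 'ep', 'peer_addr', and the peer's registered
 * remote_addr/remote_key already exist (exchanged out of band). */
static int push_gradients(struct fid_domain *domain, struct fid_ep *ep,
                          fi_addr_t peer_addr, uint64_t remote_addr,
                          uint64_t remote_key,
                          void *grad_buf, size_t grad_len)
{
    struct fid_mr *mr;
    int ret;

    /* Local registration: allow local reads/writes and remote access. */
    ret = fi_mr_reg(domain, grad_buf, grad_len,
                    FI_READ | FI_WRITE | FI_REMOTE_READ | FI_REMOTE_WRITE,
                    0, 0, 0, &mr, NULL);
    if (ret)
        return ret;

    /* FI_WRITE in action: place the local buffer directly into the
     * peer's memory region identified by remote_addr/remote_key. */
    ret = (int) fi_write(ep, grad_buf, grad_len, fi_mr_desc(mr),
                         peer_addr, remote_addr, remote_key, NULL);
    return ret;   /* completion is reported later on the bound CQ */
}
```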
Low-level hardware metrics, such as link speed or MTU, are not returned in this phase; the focus is on semantic capabilities that the application can rely on. By distinguishing control capabilities from data transfer capabilities, the system ensures that all operations required by the AI framework, both coordination and bulk gradient exchange, are supported by the transport and the underlying hardware driver.
This preemptive check avoids runtime errors, guarantees predictable performance, and ensures that GPUs can rely on zero-copy transfers for gradient synchronization, all-reduce operations, and small control signaling. By exposing both control and RMA capabilities, the UET provider allows AI frameworks to efficiently orchestrate distributed training across many GPUs while maintaining low latency and high throughput.
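As a small illustration of such a preemptive check, an application or middleware layer might confirm that a returned fi_info entry advertises every capability it depends on before committing to it. The required mask below simply mirrors the flags listed above and is an assumption about what a given framework needs.

```c
#include <rdma/fabric.h>

/* Capabilities the AI framework relies on, mirroring the lists above. */
#define REQUIRED_CAPS (FI_MSG | FI_SEND | FI_RECV | \
                       FI_RMA | FI_READ | FI_WRITE | \
                       FI_REMOTE_READ | FI_REMOTE_WRITE)

/* Returns 1 if the fi_info entry advertises all required capabilities. */
static int caps_ok(const struct fi_info *info)
{
    return (info->caps & REQUIRED_CAPS) == REQUIRED_CAPS;
}
```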
Phase 4: Provider stores capabilities in an internal in-memory database
After querying the driver, the UET provider organizes the gathered capability information in an internal in-memory database. This cache contains supported capabilities, address formats, and metadata, effectively representing the provider’s working model of the transport. Using memory instead of the file system ensures low-latency access, avoids stale information, and supports concurrent queries from multiple threads or processes.
For AI workloads across many GPUs, this in-memory caching is particularly important. Large-scale training often involves dozens or hundreds of simultaneous communication channels. The in-memory database allows the provider to quickly answer repeated discovery requests without querying the hardware again, ensuring that GPU-to-GPU transfers are set up efficiently and consistently. This rapid access to provider capabilities contributes directly to reducing startup overhead and achieving predictable scaling across large clusters.
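Purely as an illustration of the idea (the UET provider’s internal data structures are not public), the sketch below caches a discovered fi_info entry behind a mutex so that repeated lookups are served from memory and the driver is queried only once. The uet_* names and the query_hw callback are hypothetical.

```c
#include <pthread.h>
#include <rdma/fabric.h>

/* Hypothetical provider-internal cache of discovered capabilities. */
static struct fi_info *uet_cached_info;              /* filled on first query */
static pthread_mutex_t uet_cache_lock = PTHREAD_MUTEX_INITIALIZER;

/* Return a private copy of the cached entry, querying hardware only once. */
static struct fi_info *uet_cache_lookup(struct fi_info *(*query_hw)(void))
{
    struct fi_info *copy;

    pthread_mutex_lock(&uet_cache_lock);
    if (!uet_cached_info)
        uet_cached_info = query_hw();     /* one trip to the driver */
    copy = uet_cached_info ? fi_dupinfo(uet_cached_info) : NULL;
    pthread_mutex_unlock(&uet_cache_lock);

    return copy;                          /* caller frees with fi_freeinfo() */
}
```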
Phase 5: libfabric returns discovery results to the application
Finally, libfabric returns the discovery results to the application. The fi_info structures built and filled by the provider are handed back as a linked list, each entry describing a specific combination of fabric, domain, and endpoint attributes, along with supported capabilities and address formats. The application can then select the most appropriate entry and proceed to create the fabric, open domains, register memory regions, and establish endpoints.
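A short sketch of this selection step is shown below: the application walks the returned linked list, prints each entry, and picks the first one that matches its needs. The provider name "uet" and the capability mask are assumptions made for illustration.

```c
#include <stdio.h>
#include <string.h>
#include <rdma/fabric.h>

/* Walk the list returned by fi_getinfo() and pick a suitable entry. */
static struct fi_info *select_entry(struct fi_info *info_list)
{
    uint64_t needed = FI_MSG | FI_RMA;

    for (struct fi_info *cur = info_list; cur; cur = cur->next) {
        printf("provider %-10s domain %-12s caps %s\n",
               cur->fabric_attr->prov_name,
               cur->domain_attr->name,
               fi_tostr(&cur->caps, FI_TYPE_CAPS));

        if (!strcmp(cur->fabric_attr->prov_name, "uet") &&
            (cur->caps & needed) == needed)
            return cur;   /* use this entry for fi_fabric()/fi_domain()/... */
    }
    return NULL;          /* no suitable entry found */
}
```

Typically the application uses the selected entry to create its fabric, domain, and endpoint, and then releases the whole list with fi_freeinfo().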
This separation between discovery and runtime communication is essential for multi-GPU AI workloads. Applications gain a complete, portable view of the transport capabilities before any data transfer begins. By knowing which operations are supported and having ready-to-use fi_info entries, AI frameworks can coordinate transfers across GPUs with minimal overhead. This ensures high throughput and low latency, even in large clusters, while reducing the risk of bottlenecks caused by unsupported operations or poorly matched endpoints.
In the next section, we will examine how these fi_info entries are used to create the actual libfabric objects—the Fabric, Domain, and Endpoint—and how the UET provider maps these objects directly to network hardware to enable low-latency, high-bandwidth transfers across GPUs.