Friday, 26 September 2025

Ultra Ethernet: Fabric Creation Process in Libfabric

Phase 1: Application (Discovery & Choice)

After the UET provider populated fi_info structures for each NIC/port combination during discovery, the application can begin the object creation process. It first consults the in-memory fi_info list to identify the entry that best matches its requirements. Each fi_info contains nested attribute structures describing fabric, domain, and endpoint capabilities, including fi_fabric_attr (fabric name, provider identifier, version information), fi_domain_attr (memory registration mode, key details, domain capabilities), and fi_ep_attr (endpoint type, reliable versus unreliable semantics, size limits, and supported capabilities). The application examines the returned entries and selects the fi_info that satisfies its needs (for example: provider == "uet", fabric name == "UET", required capabilities, reliable transport, or a specific memory registration mode). The chosen fi_info then provides the attributes — effectively serving as hints — that the application passes into subsequent creation calls such as fi_fabric(), fi_domain(), and fi_endpoint(). Each fi_info acts as a self-contained “capability snapshot,” describing one possible combination of NIC, port, and transport mode.
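As a minimal sketch (assuming the list returned by fi_getinfo() is available as info, and that <rdma/fabric.h> and <string.h> are included), the selection and the subsequent fabric creation might look like this; the selection criteria are illustrative and error handling is omitted:

struct fi_info *cur, *chosen = NULL;

// Walk the list returned by fi_getinfo() and pick the first UET entry
// that advertises the required capabilities.
for (cur = info; cur; cur = cur->next) {
    if (strcmp(cur->fabric_attr->prov_name, "uet") == 0 &&
        (cur->caps & (FI_RMA | FI_MSG)) == (FI_RMA | FI_MSG)) {
        chosen = cur;
        break;
    }
}

// The chosen fi_info is the capability snapshot handed to the creation calls.
struct fid_fabric *fabric;
int ret = fi_fabric(chosen->fabric_attr, &fabric, NULL);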


Phase 2: Libfabric Core (Dispatch & Wiring)

When the application calls fi_fabric(), the core forwards this request to the corresponding provider’s fabric entry point. In this way, the fi_info produced during discovery effectively becomes the configuration input for object creation.

The core’s role is intentionally lightweight: it matches the application’s selected fi_info to the appropriate provider implementation and invokes the provider callback for fabrics, domains, and endpoints. Throughout this process, the fi_info acts as a context carrier, containing the provider identifier, fabric name, and attribute templates for domains and endpoints. The core passes these attributes directly to the provider during creation, ensuring that the provider has all the information necessary to map the requested objects to the correct NIC and transport configuration.


Phase 3: UET Provider

When the application invokes fi_fabric(), passing the fi_fabric_attr obtained from the chosen fi_info, the call is routed to the UET provider’s uet_fabric() entry point for fabric creation. The provider treats the attributes contained within the chosen fi_info as the authoritative configuration for fabric creation. Because each fi_fabric_attr originates from a NIC-specific fi_info, the provider immediately knows which physical NIC and port are associated with the requested fabric object.

The provider uses the fi_fabric_attr to determine which interfaces belong to the requested fabric, the provider-specific capabilities that must be supported, and any optional flags supplied by the application. It then allocates and initializes internal data structures to represent the fabric object, mapping it to the underlying NIC and driver resources described by the discovery snapshot.

During creation, the provider validates that the requested fabric can be supported by the current hardware and driver state. If the NIC or configuration has changed since discovery—for example, if the NIC is unavailable or the requested capabilities are no longer supported—the provider returns an error, preventing creation of an invalid or unsupported fabric object. Otherwise, the provider completes the fabric initialization, making it ready for subsequent domain and endpoint creation calls.

By relying exclusively on the fi_info snapshot from discovery, the provider ensures that the fabric object is created consistently and deterministically, reflecting the capabilities and constraints reported to the application during the discovery phase.
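A sketch of what the provider-side entry point could look like is shown below. The uet_fabric_priv structure and the validation step are assumptions made for illustration; the real UET provider code will differ.

// Illustrative provider-side fabric creation (simplified).
struct uet_fabric_priv {
    struct fid_fabric fabric_fid;   // handle returned to libfabric
    char *name;                     // fabric name from the discovery snapshot
    // ... NIC and driver mappings would live here ...
};

int uet_fabric(struct fi_fabric_attr *attr, struct fid_fabric **fabric,
               void *context)
{
    struct uet_fabric_priv *fab;

    // Re-validate the discovery snapshot against the current hardware state.
    if (strcmp(attr->name, "UET") != 0)
        return -FI_ENODATA;

    fab = calloc(1, sizeof(*fab));
    if (!fab)
        return -FI_ENOMEM;

    fab->name = strdup(attr->name);
    fab->fabric_fid.fid.fclass  = FI_CLASS_FABRIC;
    fab->fabric_fid.fid.context = context;

    *fabric = &fab->fabric_fid;
    return 0;
}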


Phase 4: Libfabric Core (Fabric Object Publication)

Once the UET provider successfully creates the fabric object, the libfabric core generates the corresponding fid_fabric handle, which serves as the application-visible representation of the fabric. The term FID stands for Fabric Identifier, a unique identifier assigned to each libfabric object. All subsequent libfabric objects created within the context of this fabric — including domains, endpoints, and memory regions — are prefixed with this FID to maintain object hierarchy and enable internal tracking.

The fid_fabric structure contains metadata about the fabric, such as the associated provider, the fabric name, and internal pointers to provider-specific data structures. It acts as a lightweight descriptor for the application, while the actual resources and state remain managed by the provider.

The libfabric core stores the fid_fabric in in-memory structures, typically internal tables that track all active fabric objects. This allows the core to efficiently validate future API calls that reference the fabric, to maintain object hierarchies, and to route creation requests (e.g., for domains or endpoints) to the correct provider instance. Because the FID resides in RAM, operations using fid_fabric are fast and transient; the core relies on the provider to maintain the persistent state and hardware mappings associated with the fabric.

By publishing the fabric as a fid_fabric object, libfabric establishes a clear and consistent handle for the application. This handle allows the application to reference the fabric unambiguously in subsequent creation calls, while preserving the mapping between the abstract libfabric object and the underlying NIC resources managed by the provider.

Example fid_fabric Object

fid_fabric {
    fid_type         : FI_FABRIC
    fid              : 0x0001ABCD
    provider         : "uet"
    fabric_name      : "UET-Fabric1"
    version          : 1.0
    state            : ACTIVE
    ref_count        : 1
    provider_data    : <pointer>
    creation_flags   : 0x0
    timestamp        : 1690000000
}


Explanation of fields:

fid_type: Identifies the type of object within libfabric and distinguishes fabric objects (FI_FABRIC) from other object types such as domains or endpoints. This allows both the libfabric core and the provider to validate and correctly handle API calls, object creation, and destruction. The fid field is a unique Fabric Identifier assigned by libfabric, serving as the primary identifier for the fabric object. All child objects, including domains, endpoints, and memory regions, inherit this FID as a prefix to maintain hierarchy and uniqueness, enabling the core to validate object references and route creation calls to the correct provider instance.

provider field: Indicates the name of the provider managing this fabric object, such as "uet". It associates the fabric with the underlying hardware implementation and ensures that all subsequent calls for child objects are dispatched to the correct provider. The fabric_name field contains the human-readable name of the fabric, selected during discovery or by the application, for example "UET-Fabric1". This name allows applications to identify the fabric among multiple available options and is used as a selection criterion during both discovery (fi_getinfo) and creation (fi_fabric).

version field: Specifies the provider or fabric specification version and ensures compatibility between the application and provider. It can be used for logging, debugging, or runtime checks to verify that the fabric supports the required feature set. The state field tracks the current lifecycle status of the fabric object, indicating whether it is active, ready for use, or destroyed. Both the libfabric core and provider validate this field before allowing any operations on the fabric.

ref_count: Maintains a reference counter for the object, preventing it from being destroyed while still in use by the application or other libfabric structures. This counter is incremented during creation or when child objects reference the fabric and decremented when objects are released or destroyed. 

provider_data: Contains an internal pointer to provider-managed structures, including NIC mappings, hardware handles, and configuration details. This field is only accessed by the provider; the application interacts with the fabric through the fid_fabric handle.

creation_flags: Contains optional flags provided by the application during fi_fabric() creation, allowing customization of the fabric initialization process, such as enabling non-default modes or debug options. 

timestamp field: An optional value indicating when the fabric was created. It is useful for debugging, logging, and performance tracking, helping to correlate fabric initialization with other libfabric objects and operations.



Figure 4-2: Object Creation Process – Fabric for Cluster.


Related post:
https://nwktimes.blogspot.com/2025/09/ultra-ethernet-resource-initialization.html

Wednesday, 24 September 2025

Ultra Ethernet: Resource Initialization

Introduction to libfabric and Ultra Ethernet

[Updated: September-26, 2025]

Libfabric is a communication library that belongs to the OpenFabrics Interfaces (OFI) framework. Its main goal is to provide applications with high-performance and scalable communication services, especially in areas like high-performance computing (HPC) and artificial intelligence (AI). Instead of forcing applications to work directly with low-level networking details, libfabric offers a clean user-space API that hides complexity while still giving applications fast and efficient access to the network.

One of the strengths of libfabric is that it has been designed together with both application developers and hardware vendors. This makes it possible to map application needs closely to the capabilities of modern network hardware. The result is lower software overhead and better efficiency when applications send or receive data.

Ultra Ethernet builds on this foundation by adopting libfabric as its communication abstraction layer. Ultra Ethernet uses the libfabric framework to let endpoints interact with AI frameworks and, ultimately, with each other across GPUs. Libfabric provides a high-performance, low-latency API that hides the details of the underlying transport, so AI frameworks do not need to manage the low-level details of endpoints, buffers, or the underlying address tables that map communication paths. This makes applications more portable across different fabrics while still providing access to advanced features such as zero-copy transfers and RDMA, which are essential for large-scale AI workloads.

During system initialization, libfabric coordinates with the appropriate provider—such as the UET provider—to query the network hardware and organize communication around three main objects: the Fabric, the Domain, and the Endpoint. Each object manages specific sub-objects and resources. For example, a Domain handles memory registration and hardware resources, while an Endpoint is associated with completion queues, transmit/receive buffers, and transport metadata. Ultra Ethernet maps these objects directly to the network hardware, ensuring that when GPUs begin exchanging training data, the communication paths are already aligned for low-latency, high-bandwidth transfers.

Once initialization is complete, AI frameworks issue standard libfabric calls to send and receive data. Ultra Ethernet ensures that this data flows efficiently across GPUs and servers. By separating initialization from runtime communication, this approach reduces overhead, minimizes bottlenecks, and enables scalable training of modern AI models.

In the following sections, we will look more closely at these libfabric objects and explain how they are created and used within Ultra Ethernet.


Discovery Phase: How libfabric and Ultra Ethernet reveal NIC capabilities


Phase 1: Application starts discovery

The discovery process begins in the application. Before communication can start, the application needs to know which services and transport features are available. It calls the function fi_getinfo(). This function uses a data structure, struct fi_info, to exchange information with the provider. The application may fill in parts of this structure with hints that describe its requirements—for example, expected operations such as RMA reads and writes, or message-based exchange, as well as the preferred addressing format. For Ultra Ethernet, this could include a provider-specific format, signaling that the application expects the UET transport to manage its communication efficiently.
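As a brief sketch (assuming the "uet" provider name used throughout this chapter), the hints might be prepared like this before calling fi_getinfo():

struct fi_info *hints = fi_allocinfo();

// Name the provider explicitly and describe the required operations;
// everything else is left for the provider to fill in.
hints->fabric_attr->prov_name = strdup("uet");
hints->caps = FI_RMA | FI_MSG;
hints->ep_attr->type = FI_EP_RDM;
// addr_format is left unspecified here; a UET-specific format would be
// requested at this point if the application depends on it.

struct fi_info *info;
int ret = fi_getinfo(FI_VERSION(1, 18), NULL, NULL, 0, hints, &info);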

For multi-GPU AI workloads, this initial step is critical. Large-scale training depends on fast movement of gradients and activations between GPUs. By letting the application express its needs up front, libfabric ensures that only endpoints with the required low-latency and high-throughput capabilities are returned. This reduces surprises at runtime and confirms that the selected transport paths are suitable for demanding AI workloads.


Phase 2: libfabric core loads and initializes a provider

Once the discovery request reaches the libfabric core, the library identifies and loads the appropriate provider from the available options, which may include providers for TCP, verbs (InfiniBand), or sockets (TCP/UDP). In the case of Ultra Ethernet, the core selects the UET provider (libfabric-uet.so). The core invokes the provider entry points, including the getinfo callback, which is responsible for returning the provider’s supported capabilities. The provider registration structure exposes entry points such as uet_getinfo and uet_fabric; domain and endpoint creation are reached later through operations on the objects those calls return. The function pointer uet_getinfo is used here.

This modular design allows libfabric to support multiple transports, such as verbs or TCP, without changing the application code. For AI workloads, this flexibility is vital. A framework can run on Ultra Ethernet in a GPU cluster, or fall back to a different transport in a mixed environment, while maintaining the same high-level communication logic. The core abstracts away provider-specific details, enabling the application to scale seamlessly across different hardware and system configurations.


Phase 3: Provider inspects capabilities and queries the driver

Inside the UET provider, the fi_getinfo() function is executed. The provider does not access the NIC hardware directly; instead, it queries the vendor driver or kernel services that manage the device. Through this interface, the provider identifies the capabilities that each NIC can safely and efficiently support. These capabilities are exposed via endpoints, and each endpoint is associated with exactly one domain (i.e., one NIC or NIC port). The endpoint capabilities describe the types of communication operations available—such as RMA, message passing, and reads/writes—and indicate how they can be used effectively in demanding multi-GPU AI workloads.

For small control operations, such as signaling the start of a new training step, coordinating collective operations, or confirming completion of data transfers, the provider advertises capabilities that support message-based communication:

  • FI_SEND: Allows an endpoint to send control messages or metadata to another endpoint without requiring a prior read request.
  • FI_RECV: Enables an endpoint to post a buffer to receive incoming control messages. Combined with FI_SEND, this forms the basic point-to-point control channel between GPUs.
  • FI_SHARED_AV: Provides shared address vectors, allowing multiple endpoints to reference the same set of remote addresses efficiently. Even for control messages, this reduces memory overhead and simplifies management in large clusters.


For bulk data transfer operations, such as gradient or parameter exchanges in collective operations like AllReduce, the provider advertises RMA (remote memory access) capabilities. These allow GPUs to exchange large volumes of data efficiently, without involving the CPU, which is critical for maintaining high throughput in distributed training:


  • FI_RMA: Enables direct remote memory access, forming the foundation for high-performance transfers of gradients or model parameters between GPUs.
  • FI_WRITE: Allows a GPU to push local gradient updates directly into another GPU’s memory.
  • FI_REMOTE_WRITE: Signals that remote GPUs can write directly into this GPU’s memory, supporting push-based collective operations.
  • FI_READ: Allows a GPU to pull gradients or parameters from a peer GPU’s memory.
  • FI_REMOTE_READ: Indicates that remote GPUs can perform read operations on this GPU’s memory, enabling zero-copy pull transfers without CPU involvement.

Low-level hardware metrics, such as link speed or MTU, are not returned in this phase; the focus is on semantic capabilities that the application can rely on. By distinguishing control capabilities from data transfer capabilities, the system ensures that all operations required by the AI framework, both coordination and bulk gradient exchange, are supported by the transport and the underlying hardware driver.
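As a rough sketch, the capability advertisement for one endpoint could combine these flags as shown below; the field population is illustrative rather than actual provider code.

struct fi_info *fi = fi_allocinfo();

// Control-channel capabilities
fi->caps  = FI_MSG | FI_SEND | FI_RECV | FI_SHARED_AV;

// Bulk data-transfer (RMA) capabilities
fi->caps |= FI_RMA | FI_READ | FI_WRITE | FI_REMOTE_READ | FI_REMOTE_WRITE;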

This preemptive check avoids runtime errors, guarantees predictable performance, and ensures that GPUs can rely on zero-copy transfers for gradient synchronization, all-reduce operations, and small control signaling. By exposing both control and RMA capabilities, the UET provider allows AI frameworks to efficiently orchestrate distributed training across many GPUs while maintaining low latency and high throughput.

Phase 4: Provider stores capabilities in an internal in-memory database

After querying the driver, the UET provider creates fi_info structures in CPU memory to store the information it has gathered about available endpoints and domains. Each fi_info structure describes one endpoint, including the domain it belongs to, the communication operations it supports (e.g., RMA, send/receive, tagged messages), and the preferred address formats. These structures are entirely in-memory objects and act as a temporary snapshot of the provider’s capabilities at the time of the query.

The application receives a pointer to these structures from fi_getinfo() and can read their fields directly. This allows the application to examine which endpoints and capabilities are available, select those that meet its requirements, and plan subsequent operations such as creating endpoints or registering memory regions. Using in-memory structures ensures fast access, avoids stale information, and supports multiple threads or processes querying provider capabilities concurrently. Once the application no longer needs the information, it can release the memory using fi_freeinfo(), freeing the structures for reuse.
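If the application wants to keep only its selected entry, it can duplicate that entry with fi_dupinfo() before releasing the full list; a short sketch (chosen refers to the selected entry):

struct fi_info *selected = fi_dupinfo(chosen);   // deep copy of one entry
fi_freeinfo(info);                               // release the whole list
// ... use 'selected' for fi_fabric(), fi_domain(), fi_endpoint() ...
fi_freeinfo(selected);                           // release the copy when done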

For AI workloads across many GPUs, this in-memory caching is particularly important. Large-scale training often involves dozens or hundreds of simultaneous communication channels. The in-memory database allows the provider to quickly answer repeated discovery requests without querying the hardware again, ensuring that GPU-to-GPU transfers are set up efficiently and consistently. This rapid access to provider capabilities contributes directly to reducing startup overhead and achieving predictable scaling across large clusters.


Phase 5: libfabric returns discovery results to the application

Finally, libfabric returns the discovery results to the application. The fi_info structures built and filled by the provider are handed back as a linked list, each entry describing a specific combination of fabric, domain, and endpoint attributes, along with supported capabilities and address formats. The application can then select the most appropriate entry and proceed to create the fabric, open domains, register memory regions, and establish endpoints.

In addition, when the application calls fi_getinfo(), the returned fi_info structures include several key sub-structures that describe the fabric, domain, and endpoint layers of the transport. The fi_fabric_attr pointer provides fabric-wide information, including the provider name "uet" and the fabric name "UET", which the provider fills to allow the application to create a fabric object with fi_fabric(). The caps field in fi_info reflects the domain or NIC capabilities, such as FI_SEND | FI_RECV | FI_WRITE | FI_REMOTE_WRITE, giving the application a clear view of what operations are supported. The fi_domain_attr sub-structure contains domain-specific details, including the NIC name, for example "Eth0", and the memory registration mode (mr_mode) indicating which memory operations are supported locally and remotely. Endpoint-specific attributes, such as endpoint type and queue configurations, are contained in the fi_ep_attr sub-structure. 
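A compact sketch of how an application can read these sub-structures from a returned entry (here info) is shown below; the commented values mirror the examples mentioned above.

printf("provider : %s\n", info->fabric_attr->prov_name);        // "uet"
printf("fabric   : %s\n", info->fabric_attr->name);             // "UET"
printf("domain   : %s\n", info->domain_attr->name);             // "Eth0"
printf("ep type  : %d\n", info->ep_attr->type);                 // FI_EP_RDM
printf("caps     : 0x%llx\n", (unsigned long long)info->caps);
printf("mr_mode  : 0x%x\n", (unsigned)info->domain_attr->mr_mode);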

By returning all this information during discovery, the UET provider ensures that the application can select the appropriate fabric, domain, and endpoint configuration before creating any objects. This allows the application to proceed confidently with fabric creation, domain opening, memory registration, and endpoint instantiation without hardcoding any provider, fabric, or domain details. The upcoming sections will explain in detail how each of the fi_info sub-structures—fi_fabric_attr, fi_domain_attr, and fi_ep_attr—is used during the object creation processes, showing how the provider maps them to hardware and how the application leverages them to establish communication endpoints.

This separation between discovery and runtime communication is essential for multi-GPU AI workloads. Applications gain a complete, portable view of the transport capabilities before any data transfer begins. By knowing which operations are supported and having ready-to-use fi_info entries, AI frameworks can coordinate transfers across GPUs with minimal overhead. This ensures high throughput and low latency, even in large clusters, while reducing the risk of bottlenecks caused by unsupported operations or poorly matched endpoints.




Figure 4-1: Initialization stage – Discovery of Provider capabilities.


Friday, 19 September 2025

Ultra Ethernet: Libfabric Resource Initialization

Introduction

Ultra Ethernet uses the libfabric communication framework to let endpoints interact with AI frameworks and, ultimately, with each other across GPUs. Libfabric provides a high-performance, low-latency API that hides the details of the underlying transport, so AI frameworks do not need to manage the low-level details of endpoints, buffers, or the underlying address tables that map communication paths. This makes applications more portable across different fabrics while still providing access to advanced features such as zero-copy transfers and RDMA, which are essential for large-scale AI workloads.

During system initialization, libfabric coordinates with the appropriate provider—such as the UET provider—to query the network hardware and organize communication around three main objects: the Fabric, the Domain, and the Endpoint. Each object manages specific sub-objects and resources. For example, a Domain handles memory registration and hardware resources, while an Endpoint is associated with completion queues, transmit/receive buffers, and transport metadata. Ultra Ethernet maps these objects directly to the network hardware, ensuring that when GPUs begin exchanging training data, the communication paths are already aligned for low-latency, high-bandwidth transfers.

Once initialization is complete, AI frameworks issue standard libfabric calls to send and receive data. Ultra Ethernet ensures that this data flows efficiently across GPUs and servers. By separating initialization from runtime communication, this approach reduces overhead, minimizes bottlenecks, and enables scalable training of modern AI models.


Ultra Ethernet Inter-GPU Capability Discovery Flow  


Phase 1: Application Requests Transport Capabilities

The process begins when an AI application specifies its needs for Ultra Ethernet RMA transport between GPUs, including the type of communication, completion method, and other preferences. The application calls the libfabric function fi_getinfo, providing these hints. Libfabric receives the request and prepares to identify network interfaces and capabilities that match the application’s requirements. This information will eventually be returned in a fi_info structure, which describes the available interfaces and their features.


Phase 2: Libfabric Core Loads the Provider

The libfabric core determines which provider can fulfill the request. In this case, it selects the UET provider for Ultra Ethernet RMA transport. Libfabric can support multiple providers for different network types, such as TCP, InfiniBand (verbs), or other specialized fabrics. The core loads the chosen provider and calls its getinfo function through the registration structure, which references the provider’s main operations: fabric creation, domain creation, endpoint creation, and network information retrieval. This allows libfabric to interact with the provider without knowing its internal implementation.


Phase 3-4: Provider Queries the NIC

Inside the provider, the registration structure directs libfabric to the correct function implementation (getinfo). This function queries each network interface on the host. The hardware driver responds with detailed information for each interface, including MTU, link speed, address formats, memory registration support, and supported transport modes like RMA or messaging. At this stage, the provider has a complete picture of the hardware, but the information has not yet been organized into fi_info structures.


Phase 5: Provider Fills fi_info Structures and Libfabric Filters Results

The provider fills the fi_info structures with the discovered NIC capabilities. The list is then returned to the libfabric core, which applies the application’s original hints to filter the results. Only the interfaces and transport options that match the requested criteria are presented to the application, providing a pre-filtered set of network options ready for use.


Phase 6: Application Receives Filtered Capabilities

The filtered fi_info structures are returned to the application. The AI framework can now create fabrics, domains, and endpoints, confident that the selected NICs are ready for efficient inter-GPU communication. By separating initialization from runtime communication, Ultra Ethernet and libfabric ensure that resources are aligned with the hardware, minimizing overhead and enabling predictable, high-bandwidth transfers.

The flow—from the application request, through libfabric core, the UET provider, the NIC and hardware driver, and back to the application—establishes a clear separation of responsibilities: the application defines what it needs, libfabric coordinates the providers, the provider interacts with the hardware, and the application receives a pre-filtered set of options ready for low-latency, high-bandwidth inter-GPU communication.



Figure 4-1: Initialization stage – Discovery of Provider capabilities.

Simplified Coding Examples (Optional)


This book does not aim to teach programming, but simplified code examples are provided here to support readers who learn best by studying practical snippets. These examples map directly to the six phases of initialization and show how libfabric and the UET provider work together under the hood.

Example 1 – Application fi_info Request with Hints


Why: The application needs to describe what kind of transport it wants (e.g., RMA for GPU communication) without worrying about low-level NIC details.

What: It prepares a fi_info structure with hints, such as transport type and completion method, then calls fi_getinfo() to ask libfabric which providers and NICs can match these requirements.

How: The application sets fields in hints, then hands them to fi_getinfo(). Libfabric uses this information to begin provider discovery.

struct fi_info *hints, *info;
hints = fi_allocinfo();

// Request a reliable datagram endpoint for inter-GPU RMA
hints->ep_attr->type = FI_EP_RDM;

// Request RMA and messaging capabilities
hints->caps = FI_RMA | FI_MSG;

// Indicate the application will supply a struct fi_context with each operation
hints->mode = FI_CONTEXT;

// Memory registration preferences
hints->domain_attr->mr_mode = FI_MR_LOCAL | FI_MR_VIRT_ADDR;

// Query libfabric to get matching providers and NICs
int ret = fi_getinfo(FI_VERSION(1, 18), NULL, NULL, 0, hints, &info);

Example 2 – UET Provider Registration with getinfo


Why: Providers tell libfabric what operations they support. Registration connects the generic libfabric core to the UET-specific implementation.

What: The fi_provider structure is filled with function pointers. Among them, uet_getinfo is the callback used when libfabric queries the UET provider for NIC capabilities.

How: Libfabric calls these functions through the registration, so it doesn’t need to know the provider’s internal code.

struct fi_provider uet_prov = {
    .name       = "uet",
    .version    = FI_VERSION(1, 0),     // provider's own version
    .fi_version = FI_VERSION(1, 18),    // libfabric API version it targets
    .getinfo    = uet_getinfo,          // capability discovery callback
    .fabric     = uet_fabric,           // fabric creation callback
    // Note: domain and endpoint creation (uet_domain, uet_endpoint) are not
    // part of struct fi_provider; they are reached through the operations of
    // the fabric and domain objects created via these callbacks.
};


Example 3 – Provider Builds fi_info Structures


Why: After querying the NIC driver, the provider must describe all capabilities in a format libfabric understands. This includes link speed, MTU, memory registration, and supported transport modes. These details allow libfabric to determine which providers can satisfy the application’s requirements.

What: The provider allocates and fills an fi_info structure with the capabilities of each discovered NIC. This structure represents the full picture of the hardware, independent of what the application specifically requested.

How: The uet_getinfo() function queries each NIC, populates fi_info fields, and returns them to libfabric for further filtering.


int uet_getinfo(uint32_t version, const char *node, const char *service,
                uint64_t flags, const struct fi_info *hints,
                struct fi_info **info) {
    struct fi_info *fi;
    fi = fi_allocinfo();

    // Provider and endpoint information
    fi->fabric_attr->prov_name = strdup("uet");
    fi->fabric_attr->name = strdup("UltraEthernetFabric");
    fi->ep_attr->type = FI_EP_RDM;
    fi->caps = FI_RMA | FI_MSG;
    fi->domain_attr->mr_mode = FI_MR_LOCAL | FI_MR_VIRT_ADDR;

    // Addressing format advertised for this NIC (IPv4 example)
    fi->addr_format = FI_SOCKADDR_IN;

    // Illustrative note: link speed and MTU are not fields of the domain or
    // fabric attributes; real providers report them through the fi->nic
    // (fid_nic) link attributes.

    *info = fi;
    return 0;
}

Note: The provider describes all capabilities, even those not requested, so libfabric can filter later. The address format belongs to the fi_info entry itself, while hardware details such as link speed and MTU are exposed through the NIC (fid_nic) attributes; they are mentioned here only to illustrate the kind of hardware information involved.

Example 4 – Filtered fi_info Returned to Application


Why: The application does not need all hardware details — only the NICs and transport modes that match its original hints. Filtering ensures it sees a concise, relevant set of options.

What: Libfabric applies the application’s criteria to the provider’s fi_info structures and returns only the matching entries.

How: The application receives the filtered list and can use it to initialize fabrics, domains, and endpoints for communication.

struct fi_info *p;
for (p = info; p; p = p->next) {
    // Display key information relevant to the application
    printf("Provider: %s, Endpoint type: %d, Caps: %llu, MTU: %d\n",
           p->fabric_attr->prov_name,
           p->ep_attr->type,
           (unsigned long long)p->caps,
           p->domain_attr->mtu);
}
fi_freeinfo(info);

Note: Only the relevant subset of NIC capabilities is shown to the application. The application can now create fabrics, domains, and endpoints confident that the hardware matches its requirements. Filtering reduces complexity and ensures predictable, high-performance inter-GPU transfers.

References

[1] Libfabric Programmer's Manual: Libfabric man pages https://ofiwg.github.io/libfabric/v2.3.0/man/
[2] Ultra Ethernet Specification v1.0, June 11, 2025 by Ultra Ethernet Consortium, https://ultraethernet.org



Sunday, 7 September 2025

Ultra Ethernet: Fabric Setup

Introduction: Job Environment Initialization

Distributed AI training requires careful setup of both hardware and software resources. In a UET-based system, the environment initialization proceeds through several key phases, each ensuring that GPUs, network interfaces, and processes are correctly configured before training begins:


1. Fabric Endpoint (FEP) Creation

Each GPU process is associated with a logical Fabric Endpoint (FEP) that abstracts the connection to its NIC port. FEPs, together with the connected switch ports, form a Fabric Plane (FP)—an isolated, high-performance data path. The NICs advertise their capabilities via LLDP messages to ensure compatibility and readiness.

2. Vendor UET Provider Publication

Once FEPs are created, they are published to the Vendor UET Provider, which exposes them as Libfabric domains. This step makes the Fabric Addresses (FAs) discoverable, but actual communication objects (endpoints, address vectors) are created later by the application processes. This abstraction ensures consistent interaction with the hardware regardless of vendor-specific implementations.

3. Job Launcher and Environment Variables

When a distributed training job is launched, the job launcher (e.g., Torchrun) sets up environment variables for each process. These include the master rank IP and port, local and global ranks, and the total number of processes.

4. Environment Variable Interpretation

The framework reads the environment variables to compute process-specific Global Rank IDs and assign processes to GPUs. The lowest global rank is designated as the master rank, which coordinates control connections and allocates GPU memory for training data, model weights, and gradients.

5. Control Channel Establishment

Processes establish TCP connections with the master rank, exchanging metadata including JobID, ranks, and Fabric Endpoint information. The master generates and distributes the NCCL Unique ID (UID), defining collective communication groups. The control channel remains open throughout training, used for coordination, synchronization, and distribution of model partitions in model-parallel setups.

6. Initialized Job

After these phases, all GPUs are assigned unique process IDs and global ranks, know their collective communication groups, and have their Fabric Endpoints accessible via Libfabric. The job environment is now fully prepared to run the application—in this case, an AI training workload.

Fabric Endpoint - FEP


A Fabric Endpoint (FEP) is a logical entity that abstracts the connection between a single process running on a GPU and the NIC port attached to that GPU. In a UET-based system, FEPs and their connected interfaces on scale-out backend switches together form a Fabric. The path between FEPs, including the uplink switch ports, forms a Fabric Plane (FP): an isolated data path between FEPs.


The FEP abstraction is conceptually similar to a Virtual Routing and Forwarding (VRF) instance on a Layer 3 router or switch. An administrator creates each FEP and assigns it an IP address, referred to in UET as a Fabric Address (FA). FEPs within the same FP may belong to the same or different IP subnets, depending on the chosen backend network rail topology. If FEPs in the same FP belong to different subnets, those subnets must still be part of the same routing instance on Layer 3 devices to preserve plane isolation. In comparison with modern data center networks using BGP EVPN, a Fabric Plane can be thought of as analogous to either a Layer 2 VNI or a Layer 3 VNI.


After a FEP is created, its attached NIC port must be enabled. When the port comes up, it begins sending LLDP messages to its connected peers to advertise and discover UET capabilities. UET NICs and switch ports use LLDP messages to exchange mandatory TLVs (Chassis ID, Port ID, Time to Live, and End of LLDPDU) as well as optional TLVs. The UET specification defines two optional LLDP extensions to advertise support for Link Level Retry (LLR) and Credit-Based Flow Control (CBFC).


The purpose of this LLDP exchange is to confirm that both ends of the link support the same UET feature set before higher-level initialization begins. Once LLDP negotiation succeeds, the participating ports are considered part of the same Fabric Plane and are ready to be used by the upper layers of the Ultra Ethernet stack.


Figure 3-1 illustrates this setup: each node has two FEPs, with FEP0 attached to Eth0 and FEP1 to Eth1. In this example, a rail is implemented as a subnet, and FEP0 on both nodes belongs to Fabric Plane 0, while FEP1 belongs to Fabric Plane 1.


Figure 3-1: Create FEP and Link Enablement.

Vendor UET Provider


As described in the previous section, a Fabric Endpoint (FEP) abstracts the connection between a GPU process and its associated NIC port. The FEP also serves as the termination point of the Fabric on the node side.

In the UET stack, the NIC publishes its FEPs to an abstraction layer called the Vendor UET Provider. The UET provider is implemented by the NIC vendor and exposes a standardized API to the Libfabric core. In practice, this means that key FEP information—such as the FEP ID, the NIC port it is bound to, and its assigned Fabric Address (FA)—is made available to the upper-layer Libfabric functions.

The Vendor UET Provider translates UET concepts into Libfabric constructs. Each FEP is exposed to Libfabric as a domain, representing the communication resource associated with that NIC port. The Fabric Address (FA) assigned to the FEP becomes an entry in the Libfabric address vector (AV), making it possible for applications to reference and communicate with remote FEPs. Within a domain, applications create Libfabric endpoints, which act as the actual communication contexts for sending and receiving messages, or for performing RMA and atomic operations toward peers identified by their FAs.

It is important to note that when the NIC publishes its FEPs through the UET Provider, the Libfabric domain, endpoint, and address vector objects do not yet exist. At this stage, the FEPs and their Fabric Addresses are simply made discoverable and accessible as resources. The actual Libfabric objects are created later—after the job launcher has assigned ranks and JobIDs to processes—when each application process calls the relevant Libfabric functions (fi_domain(), fi_av_open(), fi_endpoint()). This separation ensures that FEP publishing is independent of job-level initialization and that applications remain in control of when and how communication resources are instantiated.
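The creation order described above can be sketched as follows; the fi_info entry (info) comes from discovery, completion-queue bindings and error handling are omitted, and the attribute values are illustrative.

struct fid_fabric *fabric;
struct fid_domain *domain;
struct fid_av     *av;
struct fid_ep     *ep;
struct fi_av_attr  av_attr = { .type = FI_AV_TABLE };

// Created only after ranks and JobIDs have been assigned:
fi_fabric(info->fabric_attr, &fabric, NULL);   // fabric object
fi_domain(fabric, info, &domain, NULL);        // one domain per FEP/NIC port
fi_av_open(domain, &av_attr, &av, NULL);       // address vector holding FAs
fi_endpoint(domain, info, &ep, NULL);          // communication endpoint
fi_ep_bind(ep, &av->fid, 0);                   // bind AV to the endpoint
fi_enable(ep);                                 // endpoint ready for use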

By handling this mapping and lifecycle, the Vendor UET Provider abstracts away vendor-specific hardware details and ensures that applications interact with a consistent programming model across different NICs. This enables portability: the same Libfabric-based code can run on any UET-compliant hardware, regardless of vendor-specific design choices.

Figure 3-2 illustrates this flow. The FEPs created on the node are published by the NIC through the Vendor UET Provider, making them visible as Libfabric domains with associated Fabric Addresses, ready for use by distributed AI frameworks.


Figure 3-2: Vendor UET Provider.


Job Initialization



Setting Environmental Variable


When a distributed training job is launched, the job launcher tool (such as Torchrun in PyTorch) sets job-specific environment variables for every process participating in the job.

In Figure 3-3, Torchrun defines and stores the environment variables for two processes on each node in the CPU’s DRAM. These tables contain both the shared JobID and the process-specific Process ID (PID). Even though JobID appears as part of the environment variables, it is usually assigned earlier by the cluster’s job scheduler (e.g., SLURM, Kubernetes) and then propagated to the processes. Torchrun itself does not create JobIDs; here it is shown as a conceptual abstraction to describe a unique identifier for the training job.

Torchrun itself defines a specific set of environment variables for process coordination:

NODE_RANK: Index of this node in the cluster
LOCAL_RANK: Local rank of the process within its node
RANK: Global rank of the process, computed by the job launcher
WORLD_SIZE: Total number of processes in the job
MASTER_ADDR: IP address or hostname of the master rank
MASTER_PORT: TCP port of the master rank



The variable WORLD_SIZE specifies how many processes participate in the job. Based on this, the master rank (the master process) knows how many control connections will later be opened by peer processes.

A Global Rank ID (unique across all nodes) is computed using the Node Rank, the number of processes per node, and the Local Rank ID (which is unique within its node).

Each environment variable table also includes the master rank’s IP address and the TCP port that it is listening on. This information is used to establish control connections between processes.

Typically, the job launcher itself runs on one node, while the deep learning framework runs on every node. The job launcher may distribute the environment variables to the processes over an SSH connection via the management network. 


Figure 3-3: Distributing Environmental Variables.

Environment Variable Interpretation


The environment variables table instructs the framework on how to initialize distributed communication and which GPUs will participate in the job.

In our example, the PyTorch framework reads these environment variables and computes the process-specific Global Rank ID by multiplying the Node Rank by the number of processes per node and then adding the Local Rank ID.

Global Rank = Node Rank × Processes per Node + Local Rank

Note! Torchrun does not export PROCESSES_PER_NODE as an environment variable. Instead, it is implied by the number of LOCAL_RANK values per node. The job launcher itself knows this value (provided via --nproc_per_node), but it does not need to pass it explicitly, since each process can infer it from WORLD_SIZE / num_nodes.
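A small illustration of this computation (in C, assuming two nodes as in the figures; the processes-per-node value is inferred exactly as the note describes):

// Read the launcher-provided variables and derive the global rank.
int node_rank  = atoi(getenv("NODE_RANK"));
int local_rank = atoi(getenv("LOCAL_RANK"));
int world_size = atoi(getenv("WORLD_SIZE"));

int num_nodes      = 2;                         // known to the launcher
int procs_per_node = world_size / num_nodes;    // inferred, see note above

int global_rank = node_rank * procs_per_node + local_rank;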

Each process is then assigned to a GPU, usually based on its local rank.

In Figure 3-4, for example, the first process—with Global Rank ID 0—is assigned to GPU0 on the node, and the second process is assigned to GPU1. The worker with the lowest global rank (typically rank 0) is designated as the master rank. The master rank is responsible for coordinating the training job: it provides its IP address and the TCP port it listens on for control connections from other workers. In this example, GPU0 on Host-A has the lowest global rank, so it becomes the master rank.

PyTorch also allocates memory space in the GPU’s VRAM, which will later be used to store training data, weight parameters, and gradients.

Together, the JobID, local ranks, and global ranks allow the distributed training framework to organize workers, identify the master process, and manage communication efficiently across nodes.


Figure 3-4: Reading Variables and Allocating Memory Space.

Opening Control Channel


After rank assignment and role selection, the processes running on GPUs begin a TCP three-way handshake to establish a connection with the master rank (GPU0 on Host-A). This is done by sending a TCP SYN packet to the destination IP address 10.1.0.11 and TCP port 12345, both read from the environment variable table, followed by the SYN-ACK and ACK. The source IP address typically belongs to the NIC connected to the frontend (management) network.

Once the TCP sockets for control connections are established between ranks, each process notifies the master rank with the following information:

JobID: Confirms that all processes are participating in the same job. 

Global and Local Ranks: Used by the master rank to assign the NCCL Unique ID (UID) to all ranks that belong to the same collective communication group. Each process sharing this UID can synchronize gradients or exchange data using collectives.

WORLD_SIZE: Although set during initialization, resending it ensures every process has a consistent view of the total number of participants.

FEP and FA: The Fabric Endpoint (FEP) IP address, expressed as a Fabric Address (FA), is tied to the correct process for RDMA communication.


Figure 3-5: Establishing a TCP Socket for Control Channel.

After the master rank has accepted all expected connections (equal to WORLD_SIZE – 1, since the master itself is excluded), it generates the NCCL Unique ID, a 128-byte token. 

While it might look like another job identifier, the NCCL UID serves a narrower purpose: it defines the scope of a collective communication group within NCCL. Only processes that share the same UID participate in the same communication context (for example, all-reduce or broadcast), while processes with different UIDs are excluded. This separation allows multiple distributed training jobs to run on the same hosts without interfering with each other.

In practice, the NCCL UID can also distinguish between different communication groups inside the same training job. For example, in tensor parallelism, the GPUs holding partitions of a layer’s weights must synchronize partial results using collectives such as all-reduce. These GPUs all share the same NCCL UID, ensuring their collectives are scoped only to that tensor-parallel group. Other groups of GPUs—for instance, those assigned to pipeline stages or data-parallel replicas—use different UIDs to form their own isolated communication contexts.

In short:

The JobID identifies the training job at the cluster level.
The NCCL Unique ID identifies the communication group (or subgroup) of GPUs that must synchronize within that job.

Finally, the master rank distributes the collected information across the job, ensuring that all processes receive the necessary environment variables and their process-specific NCCL UIDs. The WORLD_SIZE value is not redistributed, since it was already defined during initialization and synchronized over the control channel.
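In NCCL terms, the steps above correspond roughly to the following calls; broadcast_over_control_channel() is a hypothetical helper standing in for the TCP control-channel exchange, and global_rank and world_size are the values described earlier.

ncclUniqueId uid;
ncclComm_t   comm;

if (global_rank == 0)
    ncclGetUniqueId(&uid);          // master generates the 128-byte UID

// Distribute the UID to every rank over the TCP control channel.
broadcast_over_control_channel(&uid, sizeof(uid));

// Every process holding the same UID joins the same communication group.
ncclCommInitRank(&comm, world_size, uid, global_rank);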


Figure 3-6: Distributing NCCL UID along Received Information.

The control channel remains open throughout the lifetime of the training job. It is later used by the framework to exchange metadata and coordination messages between ranks. For example, in model-parallel training, the master rank can use this channel to distribute model partition information or updated parameters to the processes, coordinate checkpointing, or handle dynamic changes in the communication group. Essentially, it serves as a persistent control path for tasks that require synchronization or configuration outside of the high-bandwidth data communication performed via NCCL collectives.

Initialized Job


Figure 3-7 summarizes the result of the job environment initialization. All UET NICs are defined as Fabric Endpoints (FEPs) and associated with their respective Fabric Addresses (FAs). The UET NIC kernel has published the NIC-to-FEP/FA associations to the vendor UET Provider, making them accessible via Libfabric APIs. All GPUs have joined the same job and have been assigned unique process IDs and global rank IDs. Additionally, each process is aware of the collective communication group to which it belongs. With this setup, the job environment is fully prepared to serve the application—in our case, AI Training (AIT).


Figure 3-7: Complete Setup for AI Training.