Sunday, 7 September 2025

Ultra Ethernet: Fabric Setup

Introduction: Job Environment Initialization

Distributed AI training requires careful setup of both hardware and software resources. In a UET-based system, the environment initialization proceeds through several key phases, each ensuring that GPUs, network interfaces, and processes are correctly configured before training begins:


1. Fabric Endpoint (FEP) Creation

Each GPU process is associated with a logical Fabric Endpoint (FEP) that abstracts the connection to its NIC port. FEPs, together with the connected switch ports, form a Fabric Plane (FP)—an isolated, high-performance data path. The NICs advertise their capabilities via LLDP messages to ensure compatibility and readiness.

2. Vendor UET Provider Publication

Once FEPs are created, they are published to the Vendor UET Provider, which exposes them as Libfabric domains. This step makes the Fabric Addresses (FAs) discoverable, but actual communication objects (endpoints, address vectors) are created later by the application processes. This abstraction ensures consistent interaction with the hardware regardless of vendor-specific implementations.

3. Job Launcher and Environment Variables

When a distributed training job is launched, the job launcher (e.g., Torchrun) sets up environment variables for each process. These include the master rank IP and port, local and global ranks, and the total number of processes.

4. Environment Variable Interpretation

The framework reads the environment variables to compute process-specific Global Rank IDs and assign processes to GPUs. The lowest global rank is designated as the master rank, which coordinates control connections and allocates GPU memory for training data, model weights, and gradients.

5. Control Channel Establishment

Processes establish TCP connections with the master rank, exchanging metadata including JobID, ranks, and Fabric Endpoint information. The master generates and distributes the NCCL Unique ID (UID), defining collective communication groups. The control channel remains open throughout training, used for coordination, synchronization, and distribution of model partitions in model-parallel setups.

6. Initialized Job

After these phases, all GPUs are assigned unique process IDs and global ranks, know their collective communication groups, and have their Fabric Endpoints accessible via Libfabric. The job environment is now fully prepared to run the application—in this case, an AI training workload.

Fabric Endpoint - FEP


A Fabric Endpoint (FEP) is a logical entity that abstracts the connection between a single process running on a GPU and the NIC port attached to that GPU. In a UET-based system, FEPs and their connected interfaces on scale-out backend switches together form a Fabric. The path between FEPs, including the uplink switch ports, belongs to the same Fabric Plane (FP), an isolated data path between FEPs.


The FEP abstraction is conceptually similar to a Virtual Routing and Forwarding (VRF) instance on a Layer 3 router or switch. An administrator creates each FEP and assigns it an IP address, referred to in UET as a Fabric Address (FA). FEPs within the same FP may belong to the same or different IP subnets, depending on the chosen backend network rail topology. If FEPs within a single FP belong to different subnets, those subnets must still be part of the same routing instance on the Layer 3 devices to preserve plane isolation. Compared with modern data center networks using BGP EVPN, a Fabric Plane can be thought of as analogous to either a Layer 2 VNI or a Layer 3 VNI.


After a FEP is created, its attached NIC port must be enabled. When the port comes up, it begins sending LLDP messages to its connected peers to advertise and discover UET capabilities. UET NICs and switch ports use LLDP messages to exchange mandatory TLVs (Chassis ID, Port ID, Time to Live, and End of LLDPDU) as well as optional TLVs. The UET specification defines two optional LLDP extensions to advertise support for Link Level Retry (LLR) and Credit-Based Flow Control (CBFC).


The purpose of this LLDP exchange is to confirm that both ends of the link support the same UET feature set before higher-level initialization begins. Once LLDP negotiation succeeds, the participating ports are considered part of the same Fabric Plane and are ready to be used by the upper layers of the Ultra Ethernet stack.
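The outcome of this exchange can be pictured as a small capability-parsing routine on either end of the link. The C sketch below walks a received LLDPDU and records whether the peer advertised LLR and CBFC. It is a minimal sketch: the TLV header layout and type codes follow IEEE 802.1AB, but the OUI and subtype values used for the UET extensions are placeholders, since the actual numbers are defined in the UET specification.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Standard LLDP TLV header (IEEE 802.1AB): 7-bit type, 9-bit length. */
#define LLDP_TLV_TYPE(hdr)  ((uint16_t)(hdr) >> 9)
#define LLDP_TLV_LEN(hdr)   ((uint16_t)(hdr) & 0x01FF)

#define TLV_END  0     /* End of LLDPDU                              */
#define TLV_ORG  127   /* organizationally specific (optional) TLV   */

/* Placeholders: the real OUI and subtypes come from the UET specification. */
static const uint8_t UET_OUI[3] = { 0x00, 0x00, 0x00 };
#define UET_SUBTYPE_LLR   0x01
#define UET_SUBTYPE_CBFC  0x02

struct uet_link_caps { int llr; int cbfc; };

/* Walk the LLDPDU and note whether the peer advertised LLR and CBFC support. */
static void parse_lldpdu(const uint8_t *pdu, size_t len, struct uet_link_caps *caps)
{
    size_t off = 0;
    while (off + 2 <= len) {
        uint16_t hdr  = (uint16_t)((pdu[off] << 8) | pdu[off + 1]);
        uint16_t type = LLDP_TLV_TYPE(hdr);
        uint16_t tlen = LLDP_TLV_LEN(hdr);
        const uint8_t *val = pdu + off + 2;

        if (type == TLV_END || off + 2 + tlen > len)
            break;
        /* Org-specific TLV value: 3-byte OUI followed by a 1-byte subtype. */
        if (type == TLV_ORG && tlen >= 4 && memcmp(val, UET_OUI, 3) == 0) {
            if (val[3] == UET_SUBTYPE_LLR)  caps->llr  = 1;
            if (val[3] == UET_SUBTYPE_CBFC) caps->cbfc = 1;
        }
        off += (size_t)2 + tlen;
    }
}
```

A link is admitted to the Fabric Plane only if both sides report support for the same required feature set.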


Figure 3-1 illustrates this setup: each node has two FEPs, with FEP0 attached to Eth0 and FEP1 to Eth1. In this example, a rail is implemented as a subnet, and FEP0 on both nodes belongs to Fabric Plane 0, while FEP1 belongs to Fabric Plane 1.


Figure 3-1: FEP Creation and Link Enablement.

Vendor UET Provider


As described in the previous section, a Fabric Endpoint (FEP) abstracts the connection between a GPU process and its associated NIC port. The FEP also serves as the termination point of the Fabric on the node side.

In the UET stack, the NIC publishes its FEPs to an abstraction layer called the Vendor UET Provider. The UET provider is implemented by the NIC vendor and exposes a standardized API to the Libfabric core. In practice, this means that key FEP information—such as the FEP ID, the NIC port it is bound to, and its assigned Fabric Address (FA)—is made available to the upper-layer Libfabric functions.

The Vendor UET Provider translates UET concepts into Libfabric constructs. Each FEP is exposed to Libfabric as a domain, representing the communication resource associated with that NIC port. The Fabric Address (FA) assigned to the FEP becomes an entry in the Libfabric address vector (AV), making it possible for applications to reference and communicate with remote FEPs. Within a domain, applications create Libfabric endpoints, which act as the actual communication contexts for sending and receiving messages, or for performing RMA and atomic operations toward peers identified by their FAs.

It is important to note that when the NIC publishes its FEPs through the UET Provider, the Libfabric domain, endpoint, and address vector objects do not yet exist. At this stage, the FEPs and their Fabric Addresses are simply made discoverable and accessible as resources. The actual Libfabric objects are created later—after the job launcher has assigned ranks and JobIDs to processes—when each application process calls the relevant Libfabric functions (fi_domain(), fi_av_open(), fi_endpoint()). This separation ensures that FEP publishing is independent of job-level initialization and that applications remain in control of when and how communication resources are instantiated.
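To make this concrete, the sequence below sketches how an application process could open those objects through the standard Libfabric C API once job-level initialization has completed. It is a minimal sketch under stated assumptions, not a definitive implementation: the provider name "uet", the capability bits, and the endpoint type are illustrative choices, and completion queue setup is omitted.

```c
#include <stdio.h>
#include <string.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

/* Minimal sketch: open the Libfabric objects backing one published FEP.
 * Error handling is reduced to a single check so the call order stays visible. */
static int open_fep_resources(void)
{
    struct fi_info    *hints  = fi_allocinfo(), *info = NULL;
    struct fid_fabric *fabric = NULL;
    struct fid_domain *domain = NULL;   /* one domain per FEP/NIC port         */
    struct fid_av     *av     = NULL;   /* address vector holding remote FAs   */
    struct fid_ep     *ep     = NULL;   /* endpoint: the communication context */
    struct fi_av_attr  av_attr = { .type = FI_AV_TABLE };

    hints->caps = FI_MSG | FI_RMA;                  /* two-sided send/recv + RMA */
    hints->ep_attr->type = FI_EP_RDM;               /* reliable, unconnected     */
    hints->fabric_attr->prov_name = strdup("uet");  /* assumed provider name     */

    int ret = fi_getinfo(FI_VERSION(1, 21), NULL, NULL, 0, hints, &info);
    if (ret) { fprintf(stderr, "fi_getinfo failed: %d\n", ret); return ret; }

    fi_fabric(info->fabric_attr, &fabric, NULL);    /* the Fabric                */
    fi_domain(fabric, info, &domain, NULL);         /* the published FEP         */
    fi_av_open(domain, &av_attr, &av, NULL);        /* table for peers' FAs      */
    fi_endpoint(domain, info, &ep, NULL);           /* per-process endpoint      */
    fi_ep_bind(ep, &av->fid, 0);                    /* endpoint resolves peers via the AV */

    /* A complete application would also open completion queues (fi_cq_open),
     * bind them to the endpoint, call fi_enable(), and insert remote Fabric
     * Addresses with fi_av_insert() before issuing fi_send()/fi_write(). */
    fi_freeinfo(info);
    fi_freeinfo(hints);
    return 0;
}
```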

By handling this mapping and lifecycle, the Vendor UET Provider abstracts away vendor-specific hardware details and ensures that applications interact with a consistent programming model across different NICs. This enables portability: the same Libfabric-based code can run on any UET-compliant hardware, regardless of vendor-specific design choices.

Figure 3-2 illustrates this flow. The FEPs created on the node are published by the NIC through the Vendor UET Provider, making them visible as Libfabric domains with associated Fabric Addresses, ready for use by distributed AI frameworks.


Figure 3-2: Vendor UET Provider.


Job Initialization



Setting Environment Variables


When a distributed training job is launched, the job launcher tool (such as Torchrun in PyTorch) sets job-specific environment variables for every process participating in the job.

In Figure 3-3, Torchrun defines and stores the environment variables for two processes on each node in the CPU’s DRAM. These tables contain both the shared JobID and the process-specific Process ID (PID). Even though JobID appears as part of the environment variables, it is usually assigned earlier by the cluster’s job scheduler (e.g., SLURM, Kubernetes) and then propagated to the processes. Torchrun itself does not create JobIDs; here it is shown as a conceptual abstraction to describe a unique identifier for the training job.

Torchrun itself defines a specific set of environment variables for process coordination:

NODE_RANK: Index of this node in the cluster
LOCAL_RANK: Local rank of the process within its node
RANK: Global rank of the process, computed by the job launcher
WORLD_SIZE: Total number of processes in the job
MASTER_ADDR: IP address or hostname of the master rank
MASTER_PORT: TCP port the master rank listens on



The variable WORLD_SIZE specifies how many processes participate in the job. Based on this, the master rank (the master process) knows how many control connections will later be opened by peer processes.

A Global Rank ID (unique across all nodes) is computed from the Node Rank, the number of processes per node, and the Local Rank ID (which is unique only within its node).

Each environment variable table also includes the master rank’s IP address and the TCP port that it is listening on. This information is used to establish control connections between processes.

Typically, the job launcher itself runs on one node, while the deep learning framework runs on every node. The job launcher may distribute the environment variables to the processes over an SSH connection via the management network. 
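As a concrete illustration, the sketch below shows a typical torchrun invocation for the two-node, two-process-per-node example used in this chapter, together with a small C program that reads the resulting variables, mirroring how native libraries such as NCCL consume them. The addresses, ports, and counts are assumptions taken from the example topology.

```c
#include <stdio.h>
#include <stdlib.h>

/* Example launch on Host-A (Host-B would use --node_rank=1):
 *
 *   torchrun --nnodes=2 --nproc_per_node=2 --node_rank=0 \
 *            --master_addr=10.1.0.11 --master_port=12345 train.py
 *
 * torchrun exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and
 * MASTER_PORT into the environment of every process it spawns. */

static int env_int(const char *name)
{
    const char *val = getenv(name);
    return val ? atoi(val) : -1;
}

int main(void)
{
    int rank       = env_int("RANK");          /* global rank               */
    int local_rank = env_int("LOCAL_RANK");    /* rank within this node     */
    int world_size = env_int("WORLD_SIZE");    /* total number of processes */
    const char *addr = getenv("MASTER_ADDR");  /* master rank IP/hostname   */
    const char *port = getenv("MASTER_PORT");  /* master rank TCP port      */

    printf("rank %d/%d (local %d), master %s:%s\n",
           rank, world_size, local_rank,
           addr ? addr : "?", port ? port : "?");
    return 0;
}
```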

Figure 3-3: Distributing Environment Variables.

Environment Variable Interpretation


The environment variables table instructs the framework on how to initialize distributed communication and which GPUs will participate in the job.

In our example, the PyTorch framework reads these environment variables and computes the process-specific Global Rank ID by multiplying the Node Rank by the number of processes per node and then adding the Local Rank ID.

Global Rank: Node Rank × Number of Processes per Node + Local Rank ID

Note! Torchrun does not export PROCESSES_PER_NODE as an environment variable. Instead, it is implied by the number of LOCAL_RANK values per node. The job launcher itself knows this value (provided via --nproc_per_node), but it does not need to pass it explicitly, since each process can infer it from WORLD_SIZE / num_nodes.

Each process is then assigned to a GPU, usually based on its local rank.
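A minimal C sketch of this interpretation step, assuming CUDA GPUs, the environment variables listed earlier, and the two-node topology of our example (the node count is an assumption; as noted above, the per-node process count is inferred rather than exported):

```c
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Assumes the launcher has exported these variables (see the table above);
     * a robust program would check getenv() for NULL. */
    int node_rank  = atoi(getenv("NODE_RANK"));
    int local_rank = atoi(getenv("LOCAL_RANK"));
    int world_size = atoi(getenv("WORLD_SIZE"));

    int num_nodes      = 2;                        /* assumption for this example */
    int procs_per_node = world_size / num_nodes;   /* inferred, not exported      */

    /* Global Rank = Node Rank x Processes per Node + Local Rank */
    int global_rank = node_rank * procs_per_node + local_rank;

    /* Bind this process to "its" GPU based on the local rank. */
    cudaSetDevice(local_rank);

    /* The lowest global rank acts as the master rank and coordinates the job. */
    int is_master = (global_rank == 0);
    printf("global rank %d -> GPU %d, master=%d\n", global_rank, local_rank, is_master);
    return 0;
}
```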

In Figure 3-4, for example, the first process—with Global Rank ID 0—is assigned to GPU0 on the node, and the second process is assigned to GPU1. The worker with the lowest global rank (typically rank 0) is designated as the master rank. The master rank is responsible for coordinating the training job: it provides its IP address and the TCP port it listens on for control connections from other workers. In this example, GPU0 on Host-A has the lowest global rank, so it becomes the master rank.

PyTorch also allocates memory space in the GPU’s VRAM, which will later be used to store training data, weight parameters, and gradients.

Together, the JobID, local ranks, and global ranks allow the distributed training framework to organize workers, identify the master process, and manage communication efficiently across nodes.

Figure 3-4: Reading Variables and Allocating Memory Space.

Opening the Control Channel


After rank assignment and role selection, each process establishes a TCP connection with the master rank (GPU0 on Host-A) via a three-way handshake. The process sends a TCP SYN packet to destination IP address 10.1.0.11 and TCP port 12345, both read from the environment variable table, followed by the SYN-ACK and ACK. The source IP address typically belongs to the NIC connected to the frontend (management) network.

Once the TCP sockets for control connections are established between ranks, each process notifies the master rank with the following information (a socket-level sketch follows this list):

JobID: Confirms that all processes are participating in the same job. 

Global and Local Ranks: Used by the master rank to assign the NCCL Unique ID (UID) to all ranks that belong to the same collective communication group. Each process sharing this UID can synchronize gradients or exchange data using collectives.

WORLD_SIZE: Although set during initialization, resending it ensures every process has a consistent view of the total number of participants.

FEP and FA: The Fabric Endpoint (FEP) IP address, expressed as a Fabric Address (FA), is tied to the correct process for RDMA communication.
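The sketch below illustrates this notification step at the socket level. The fixed-size struct is purely illustrative (real frameworks define their own wire formats, for example PyTorch's TCPStore protocol), and the Fabric Address is carried as a plain IPv4 address for simplicity.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

/* Illustrative wire format only. */
struct rank_info {
    uint64_t job_id;          /* shared JobID                      */
    int32_t  global_rank;     /* RANK                              */
    int32_t  local_rank;      /* LOCAL_RANK                        */
    int32_t  world_size;      /* WORLD_SIZE                        */
    uint32_t fabric_address;  /* FA of this process' FEP (as IPv4) */
};

/* Connect to the master rank and announce this process' metadata. */
static int announce_to_master(const char *master_addr, uint16_t master_port,
                              const struct rank_info *me)
{
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in dst = { .sin_family = AF_INET,
                               .sin_port   = htons(master_port) };
    inet_pton(AF_INET, master_addr, &dst.sin_addr);

    /* The TCP three-way handshake (SYN, SYN-ACK, ACK) happens here. */
    if (connect(sock, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
        close(sock);
        return -1;
    }
    if (send(sock, me, sizeof(*me), 0) != (ssize_t)sizeof(*me)) {
        close(sock);
        return -1;
    }
    return sock;  /* keep the control channel open for the lifetime of the job */
}
```

The master accepts WORLD_SIZE – 1 such connections, as described next.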


Figure 3-5: Establishing a TCP Socket for Control Channel.

After the master rank has accepted all expected connections (equal to WORLD_SIZE – 1, since the master itself is excluded), it generates the NCCL Unique ID, a 128-byte token. 

While it might look like another job identifier, the NCCL UID serves a narrower purpose: it defines the scope of a collective communication group within NCCL. Only processes that share the same UID participate in the same communication context (for example, all-reduce or broadcast), while processes with different UIDs are excluded. This separation allows multiple distributed training jobs to run on the same hosts without interfering with each other.

In practice, the NCCL UID can also distinguish between different communication groups inside the same training job. For example, in tensor parallelism, the GPUs holding partitions of a layer’s weights must synchronize partial results using collectives such as all-reduce. These GPUs all share the same NCCL UID, ensuring their collectives are scoped only to that tensor-parallel group. Other groups of GPUs—for instance, those assigned to pipeline stages or data-parallel replicas—use different UIDs to form their own isolated communication contexts.
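The following sketch shows how this looks with NCCL's C API. The generation and scoping calls (ncclGetUniqueId(), ncclCommInitRank()) are the standard NCCL functions; the helper bcast_over_control_channel() is a hypothetical placeholder for distributing the UID over the control channel established earlier.

```c
#include <nccl.h>
#include <stddef.h>

/* Hypothetical helper: rank 0 sends 'bytes' to every peer over the already
 * open control sockets; the other ranks receive into 'buf'. */
extern void bcast_over_control_channel(void *buf, size_t bytes, int my_rank);

static ncclComm_t init_collective_group(int my_rank, int nranks)
{
    ncclUniqueId uid;   /* 128-byte opaque token */
    ncclComm_t   comm;

    if (my_rank == 0)
        ncclGetUniqueId(&uid);            /* master rank generates the UID */

    /* Every rank that should join this communication group must receive
     * exactly the same UID over the control channel. */
    bcast_over_control_channel(&uid, sizeof(uid), my_rank);

    /* All ranks sharing this UID form one NCCL communicator; ranks holding
     * a different UID end up in a separate, isolated group. */
    ncclCommInitRank(&comm, nranks, uid, my_rank);
    return comm;
}
```

Tensor-, pipeline-, or data-parallel subgroups within the same job would simply repeat this sequence with their own UID and their own subset of ranks.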

In short:

The JobID identifies the training job at the cluster level.
The NCCL Unique ID identifies the communication group (or subgroup) of GPUs that must synchronize within that job.

Finally, the master rank distributes the collected information across the job, ensuring that all processes receive the necessary environment variables and the NCCL UID of each communication group they belong to. The WORLD_SIZE value is not redistributed, since it was already defined during initialization and synchronized over the control channel.


Figure 3-6: Distributing NCCL UID along Received Information.

The control channel remains open throughout the lifetime of the training job. It is later used by the framework to exchange metadata and coordination messages between ranks. For example, in model-parallel training, the master rank can use this channel to distribute model partition information or updated parameters to the processes, coordinate checkpointing, or handle dynamic changes in the communication group. Essentially, it serves as a persistent control path for tasks that require synchronization or configuration outside of the high-bandwidth data communication performed via NCCL collectives.

Initialized Job


Figure 3-7 summarizes the result of the job environment initialization. All UET NICs are defined as Fabric Endpoints (FEPs) and associated with their respective Fabric Addresses (FAs). The UET NIC kernel has published the NIC-to-FEP/FA associations to the vendor UET Provider, making them accessible via Libfabric APIs. All GPUs have joined the same job and have been assigned unique process IDs and global rank IDs. Additionally, each process is aware of the collective communication group to which it belongs. With this setup, the job environment is fully prepared to serve the application—in our case, AI Training (AIT).


Figure 3-7: Complete Setup for AI Training.