Sunday, 7 September 2025

Ultra Ethernet: Fabric Setup

Introduction: Job Environment Initialization

Distributed AI training requires careful setup of both hardware and software resources. In a UET-based system, the environment initialization proceeds through several key phases, each ensuring that GPUs, network interfaces, and processes are correctly configured before training begins:


1. Fabric Endpoint (FEP) Creation

Each GPU process is associated with a logical Fabric Endpoint (FEP) that abstracts the connection to its NIC port. FEPs, together with the connected switch ports, form a Fabric Plane (FP)—an isolated, high-performance data path. The NICs advertise their capabilities via LLDP messages to ensure compatibility and readiness.

2. Vendor UET Provider Publication

Once FEPs are created, they are published to the Vendor UET Provider, which exposes them as Libfabric domains. This step makes the Fabric Addresses (FAs) discoverable, but actual communication objects (endpoints, address vectors) are created later by the application processes. This abstraction ensures consistent interaction with the hardware regardless of vendor-specific implementations.

3. Job Launcher and Environment Variables

When a distributed training job is launched, the job launcher (e.g., Torchrun) sets up environment variables for each process. These include the master rank IP and port, local and global ranks, and the total number of processes.

4. Environment Variable Interpretation

The framework reads the environment variables to compute process-specific Global Rank IDs and assign processes to GPUs. The lowest global rank is designated as the master rank, which coordinates control connections and allocates GPU memory for training data, model weights, and gradients.

5. Control Channel Establishment

Processes establish TCP connections with the master rank, exchanging metadata including JobID, ranks, and Fabric Endpoint information. The master generates and distributes the NCCL Unique ID (UID), defining collective communication groups. The control channel remains open throughout training, used for coordination, synchronization, and distribution of model partitions in model-parallel setups.

6. Initialized Job

After these phases, all GPUs are assigned unique process IDs and global ranks, know their collective communication groups, and have their Fabric Endpoints accessible via Libfabric. The job environment is now fully prepared to run the application—in this case, an AI training workload.

Fabric Endpoint - FEP


A Fabric Endpoint (FEP) is a logical entity that abstracts the connection between a single process running on a GPU and the NIC port attached to that GPU. In a UET-based system, FEPs and their connected interfaces on scale-out backend switches together form a Fabric. The path between FEPs, including the uplink switch ports, belongs to the same Fabric Plane (FP), an isolated data path between FEPs.


The FEP abstraction is conceptually similar to a Virtual Routing and Forwarding (VRF) instance on a Layer 3 router or switch. An administrator creates each FEP and assigns it an IP address, referred to in UET as a Fabric Address (FA). FEPs within the same FP may belong to the same or different IP subnets, depending on the chosen backend network rail topology. If FEPs in a given FP belong to different subnets, those subnets must still be part of the same routing instance on the Layer 3 devices to preserve plane isolation. In comparison with modern data center networks using BGP EVPN, a Fabric Plane can be thought of as analogous to either a Layer 2 VNI or a Layer 3 VNI.


After a FEP is created, its attached NIC port must be enabled. When the port comes up, it begins sending LLDP messages to its connected peers to advertise and discover UET capabilities. UET NICs and switch ports use LLDP messages to exchange mandatory TLVs (Chassis ID, Port ID, Time to Live, and End of LLDPDU) as well as optional TLVs. The UET specification defines two optional LLDP extensions to advertise support for Link Level Retry (LLR) and Credit-Based Flow Control (CBFC).


The purpose of this LLDP exchange is to confirm that both ends of the link support the same UET feature set before higher-level initialization begins. Once LLDP negotiation succeeds, the participating ports are considered part of the same Fabric Plane and are ready to be used by the upper layers of the Ultra Ethernet stack.


Figure 3-1 illustrates this setup: each node has two FEPs, with FEP0 attached to Eth0 and FEP1 to Eth1. In this example, a rail is implemented as a subnet; FEP0 on both nodes belongs to Fabric Plane 0 and FEP1 to Fabric Plane 1.


Figure 3-1: Create FEP and Link Enablement.

Vendor UET Provider


As described in the previous section, a Fabric Endpoint (FEP) abstracts the connection between a GPU process and its associated NIC port. The FEP also serves as the termination point of the Fabric on the node side.

In the UET stack, the NIC publishes its FEPs to an abstraction layer called the Vendor UET Provider. The UET provider is implemented by the NIC vendor and exposes a standardized API to the Libfabric core. In practice, this means that key FEP information—such as the FEP ID, the NIC port it is bound to, and its assigned Fabric Address (FA)—is made available to the upper-layer Libfabric functions.

The Vendor UET Provider translates UET concepts into Libfabric constructs. Each FEP is exposed to Libfabric as a domain, representing the communication resource associated with that NIC port. The Fabric Address (FA) assigned to the FEP becomes an entry in the Libfabric address vector (AV), making it possible for applications to reference and communicate with remote FEPs. Within a domain, applications create Libfabric endpoints, which act as the actual communication contexts for sending and receiving messages, or for performing RMA and atomic operations toward peers identified by their FAs.

It is important to note that when the NIC publishes its FEPs through the UET Provider, the Libfabric domain, endpoint, and address vector objects do not yet exist. At this stage, the FEPs and their Fabric Addresses are simply made discoverable and accessible as resources. The actual Libfabric objects are created later—after the job launcher has assigned ranks and JobIDs to processes—when each application process calls the relevant Libfabric functions (fi_domain(), fi_av_open(), fi_endpoint()). This separation ensures that FEP publishing is independent of job-level initialization and that applications remain in control of when and how communication resources are instantiated.

By handling this mapping and lifecycle, the Vendor UET Provider abstracts away vendor-specific hardware details and ensures that applications interact with a consistent programming model across different NICs. This enables portability: the same Libfabric-based code can run on any UET-compliant hardware, regardless of vendor-specific design choices.

Figure 3-2 illustrates this flow. The FEPs created on the node are published by the NIC through the Vendor UET Provider, making them visible as Libfabric domains with associated Fabric Addresses, ready for use by distributed AI frameworks.


Figure 3-2: Vendor UET Provider.


Job Initialization



Setting Environment Variables


When a distributed training job is launched, the job launcher tool (such as Torchrun in PyTorch) sets job-specific environment variables for every process participating in the job.

In Figure 3-3, Torchrun defines and stores the environment variables for two processes on each node in the CPU’s DRAM. These tables contain both the shared JobID and the process-specific Process ID (PID). Even though JobID appears as part of the environment variables, it is usually assigned earlier by the cluster’s job scheduler (e.g., SLURM, Kubernetes) and then propagated to the processes. Torchrun itself does not create JobIDs; here it is shown as a conceptual abstraction to describe a unique identifier for the training job.

Torchrun itself defines a specific set of environment variables for process coordination:

NODE_RANK: Index of this node in the cluster
LOCAL_RANK: Local rank of the process within its node
RANK: Global rank of the process, computed by the job launcher
WORLD_SIZE: Total number of processes in the job
MASTER_ADDR: IP address or hostname of the master rank
MASTER_PORT: TCP port of the master rank



The variable WORLD_SIZE specifies how many processes participate in the job. Based on this, the master rank (the master process) knows how many control connections will later be opened by peer processes.

A Global Rank ID (unique across all nodes) is computed from the Node Rank, the number of processes per node, and the Local Rank ID (which is unique within its node).

Each environment variable table also includes the master rank’s IP address and the TCP port that it is listening on. This information is used to establish control connections between processes.

Typically, the job launcher itself runs on one node, while the deep learning framework runs on every node. The job launcher may distribute the environment variables to the processes over an SSH connection via the management network. 
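
The following is a minimal sketch of how one worker process might read these launcher-provided variables. The variable names follow the table above; the fallback defaults are illustrative assumptions so the snippet can also run standalone.

```python
import os

# Read the launcher-provided environment variables (names as listed above).
# The defaults are illustrative assumptions for a standalone run.
node_rank   = int(os.environ.get("NODE_RANK", 0))
local_rank  = int(os.environ.get("LOCAL_RANK", 0))
global_rank = int(os.environ.get("RANK", 0))
world_size  = int(os.environ.get("WORLD_SIZE", 1))
master_addr = os.environ.get("MASTER_ADDR", "127.0.0.1")
master_port = int(os.environ.get("MASTER_PORT", 29500))

print(f"rank {global_rank}/{world_size} on node {node_rank}, "
      f"local rank {local_rank}, master {master_addr}:{master_port}")
```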


Figure 3-3: Distributing Environment Variables.

Environment Variable Interpretation


The environment variable table tells the framework how to initialize distributed communication and which GPUs will participate in the job.

In our example, the PyTorch framework reads these environment variables and computes the process-specific Global Rank ID by multiplying the Node Rank by the number of processes per node and then adding the Local Rank ID.

Global Rank = Node Rank × Number of Processes per Node + Local Rank

Note: Torchrun does not export PROCESSES_PER_NODE as an environment variable. Instead, it is implied by the number of LOCAL_RANK values per node. The job launcher itself knows this value (provided via --nproc_per_node), but it does not need to pass it explicitly, since each process can infer it from WORLD_SIZE / num_nodes.

Each process is then assigned to a GPU, usually based on its local rank.
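
As a small illustration of these two steps, the sketch below applies the global-rank formula and pins the process to a GPU selected by its local rank. PROCS_PER_NODE is an assumed constant, since, as noted above, the launcher does not export it as an environment variable.

```python
import os
import torch

PROCS_PER_NODE = 2   # assumed value; in practice inferred, not exported by Torchrun

node_rank  = int(os.environ.get("NODE_RANK", 0))
local_rank = int(os.environ.get("LOCAL_RANK", 0))

# Global Rank = Node Rank x Processes per Node + Local Rank
global_rank = node_rank * PROCS_PER_NODE + local_rank

# Each process is pinned to one GPU, selected by its local rank.
if torch.cuda.is_available():
    torch.cuda.set_device(local_rank)

print(f"global rank {global_rank} -> GPU {local_rank}")
```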

In Figure 3-4, for example, the first process—with Global Rank ID 0—is assigned to GPU0 on the node, and the second process is assigned to GPU1. The worker with the lowest global rank (typically rank 0) is designated as the master rank. The master rank is responsible for coordinating the training job: it provides its IP address and the TCP port it listens on for control connections from other workers. In this example, GPU0 on Host-A has the lowest global rank, so it becomes the master rank.

PyTorch also allocates memory space in the GPU’s VRAM, which will later be used to store training data, weight parameters, and gradients.

Together, the JobID, local ranks, and global ranks allow the distributed training framework to organize workers, identify the master process, and manage communication efficiently across nodes.


Figure 3-4: Reading Variables and Allocating Memory Space.

Opening Control Channel


After rank assignment and role selection, the processes running on GPUs begin a TCP three-way handshake to establish a connection with the master rank (GPU0 on Host-A). This is done by sending a TCP SYN packet to the destination IP address 10.1.0.11 and TCP port 12345, both read from the environment variable table (followed by SYN-ACK and ACK). The source IP address typically belongs to the NIC connected to the frontend (management) network.

Once the TCP sockets for control connections are established between ranks, each process notifies the master rank with the following information:

JobID: Confirms that all processes are participating in the same job. 

Global and Local Ranks: Used by the master rank to assign the NCCL Unique ID (UID) to all ranks that belong to the same collective communication group. Each process sharing this UID can synchronize gradients or exchange data using collectives.

WORLD_SIZE: Although set during initialization, resending it ensures every process has a consistent view of the total number of participants.

FEP and FA: The Fabric Endpoint (FEP) IP address, expressed as a Fabric Address (FA), is tied to the correct process for RDMA communication.


Figure 3-5: Establishing a TCP Socket for Control Channel.

After the master rank has accepted all expected connections (equal to WORLD_SIZE – 1, since the master itself is excluded), it generates the NCCL Unique ID, a 128-byte token. 

While it might look like another job identifier, the NCCL UID serves a narrower purpose: it defines the scope of a collective communication group within NCCL. Only processes that share the same UID participate in the same communication context (for example, all-reduce or broadcast), while processes with different UIDs are excluded. This separation allows multiple distributed training jobs to run on the same hosts without interfering with each other.

In practice, the NCCL UID can also distinguish between different communication groups inside the same training job. For example, in tensor parallelism, the GPUs holding partitions of a layer’s weights must synchronize partial results using collectives such as all-reduce. These GPUs all share the same NCCL UID, ensuring their collectives are scoped only to that tensor-parallel group. Other groups of GPUs—for instance, those assigned to pipeline stages or data-parallel replicas—use different UIDs to form their own isolated communication contexts.

In short:

The JobID identifies the training job at the cluster level.
The NCCL Unique ID identifies the communication group (or subgroup) of GPUs that must synchronize within that job.

Finally, the master rank distributes the collected information across the job, ensuring that all processes receive the necessary environment variables and their process-specific NCCL UIDs. The WORLD_SIZE value is not redistributed, since it was already defined during initialization and synchronized over the control channel.
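
In PyTorch, this rendezvous with the master rank and the distribution of the NCCL Unique ID are wrapped by torch.distributed rather than implemented by hand. The sketch below assumes the job was started with Torchrun (so the environment variables discussed earlier are set) and that at least two ranks with GPUs participate; the subgroup of ranks 0 and 1 stands in for a tensor-parallel group with its own UID.

```python
import torch
import torch.distributed as dist

# env:// rendezvous: reads MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE, and
# exchanges the NCCL Unique ID over the TCP control channel with the master rank.
dist.init_process_group(backend="nccl", init_method="env://")

rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

# A hypothetical tensor-parallel subgroup: ranks 0 and 1 receive their own
# communication context (their own UID), isolated from the rest of the job.
tp_group = dist.new_group(ranks=[0, 1])

# Collectives are scoped by the group they run in.
x = torch.ones(1, device="cuda")
dist.all_reduce(x)                      # job-wide communication group
if rank in (0, 1):
    dist.all_reduce(x, group=tp_group)  # tensor-parallel group only

dist.destroy_process_group()
```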


Figure 3-6: Distributing NCCL UID along Received Information.

The control channel remains open throughout the lifetime of the training job. It is later used by the framework to exchange metadata and coordination messages between ranks. For example, in model-parallel training, the master rank can use this channel to distribute model partition information or updated parameters to the processes, coordinate checkpointing, or handle dynamic changes in the communication group. Essentially, it serves as a persistent control path for tasks that require synchronization or configuration outside of the high-bandwidth data communication performed via NCCL collectives.

Initialized Job


Figure 3-7 summarizes the result of the job environment initialization. All UET NICs are defined as Fabric Endpoints (FEPs) and associated with their respective Fabric Addresses (FAs). The UET NIC kernel has published the NIC-to-FEP/FA associations to the vendor UET Provider, making them accessible via Libfabric APIs. All GPUs have joined the same job and have been assigned unique process IDs and global rank IDs. Additionally, each process is aware of the collective communication group to which it belongs. With this setup, the job environment is fully prepared to serve the application—in our case, AI Training (AIT).


Figure 3-7: Complete Setup for AI Training.

Monday, 18 August 2025

Parallelization Strategies in Neural Networks

From a network engineer’s perspective, it is not mandatory to understand the full functionality of every application running in a datacenter. However, understanding the communication patterns of the most critical applications—such as their packet and flow sizes, entropy, transport frequency, and link utilization—is essential. Additionally, knowing the required transport services, including reliability, in-order packet delivery, and lossless transmission, is important.

In AI fabrics, a neural network, including both its training and inference phases, can be considered an application. For this reason, this section first briefly explains the basic operation of the simplest neural network: the Feed Forward Neural Network (FNN). It then discusses the operation of a single neuron. Although a deep understanding of the application itself is not required, this section equips the reader with knowledge of what pieces of information are exchanged between GPUs during each phase and why these data exchanges are important.


Feedforward Neural Network: Forward Pass


Figure 1-7 illustrates a simple four-layer Feed Forward Neural Network (FNN) distributed across four GPUs. The two leftmost GPUs reside in Node-1, and the other two GPUs reside in Node-2. The training data is fed into the first layer. In real neural networks, this first layer is the input layer, which simply passes input data unmodified to the neurons in the first hidden layer. To save space, the input layer is excluded in this illustration.


Each neuron in the first hidden layer is connected to all input values, with each connection assigned an initial weight parameter. For example, the first neuron, n1, receives all three inputs, x1 through x3. The neuron performs two main computations. First, it calculates a weighted sum of its inputs by multiplying each input by its associated weight and summing the results. This produces the neuron’s pre-activation value, denoted as z. The pre-activation value is then passed through an activation function such as the Rectified Linear Unit (ReLU). ReLU is computationally efficient due to its simplicity: if the pre-activation z is greater than zero, the output y equals z; otherwise, the output is zero.
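
A minimal NumPy sketch of these two computations, using made-up input and weight values:

```python
import numpy as np

x = np.array([0.5, -1.2, 2.0])   # inputs x1..x3 (illustrative values)
w = np.array([0.8,  0.1, -0.4])  # weights of neuron n1 (illustrative values)

z = np.dot(w, x)                 # step 1: weighted sum (pre-activation)
y = np.maximum(0.0, z)           # step 2: ReLU activation

print(f"pre-activation z = {z:.2f}, output y = {y:.2f}")
```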


GPU0 stores these outputs in its VRAM. The outputs are then copied via Direct Memory Access (DMA) over the scale-up network to the memory of GPU1, which holds the second hidden layer on the same Node-1; once transferred, the source copies can be freed.


The neurons in the second layer compute their outputs using the same process. After the computations in the second layer, the output data is immediately transferred from GPU1 in Node-1 to GPU0 in Node-2 (moving from the second layer to the third) over the scale-out network using Remote Direct Memory Access (RDMA).


The processing in the final two layers follows the same pattern. The output of the last layer represents the model’s prediction. The training dataset is labeled with the expected results, but the prediction rarely matches the target in the first iteration. Therefore, at the end of the forward pass, the model loss value is computed to measure the prediction error. 


Figure 1-7: The Operation of FNN: Forward Pass.


Feedforward Neural Network: Backward Pass


After computing the model error, the training process moves into the backward pass phase. During the backward pass, the model determines how each weight parameter should be adjusted to improve prediction accuracy.

First, the output error and the derivative of the neuron's activation function, evaluated at its pre-activation value z, are used to compute a neuron-specific delta value (also called the neuron error). Then, this delta value is multiplied by the corresponding input activation from the previous layer to obtain a gradient. The gradient indicates both the magnitude and direction in which the weight should be updated.

To avoid overly large adjustments that could destabilize training, the gradient is scaled by the learning rate parameter. For example, if the gradient suggests increasing a weight by 2.0 and the learning rate is 0.1, the actual adjustment will be 0.1 × 2.0 = 0.2.
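
The same arithmetic as a short sketch; the delta, activation, starting weight, and learning rate values are illustrative:

```python
delta      = 2.5   # neuron error (illustrative)
activation = 0.8   # input activation from the previous layer (illustrative)
lr         = 0.1   # learning rate

gradient = delta * activation   # magnitude and direction of the update
update   = lr * gradient        # scaled adjustment (cf. 0.1 x 2.0 = 0.2 in the text)

weight  = 0.5
weight -= update                # gradient descent step
print(f"gradient {gradient:.2f}, update {update:.2f}, new weight {weight:.2f}")
```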

Because the delta value for one layer depends on the delta of the next layer, it must be propagated backward immediately before gradient computation can begin for the previous layer. In distributed training, this means delta values may need to be transferred between GPUs either within the same node (scale-up network) or across nodes (scale-out network), depending on the parallelization strategy.

Whether exchanging deltas or synchronizing gradients, the transfers occur over scale-up or scale-out networks based on GPU placement.


Figure 1-8: Gradient Calculation.


Parallelization Strategies


At the time of writing, the world’s largest single-location GPU supercomputer is Colossus in Memphis, Tennessee, built by Elon Musk’s AI startup xAI. It contains over 200,000 GPUs. The Grok-4 large language model (LLM), published in July 2025, was trained on Colossus. The parameter count of Grok‑4 has not been made public, but Grok-1 (released in October 2023) used a Mixture-of-Experts (MoE) architecture with about 314 billion parameters.

Large-scale and hyper-scale GPU clusters such as Colossus require parallel computation and communication to achieve fast training and near real-time inference. Parallelization strategies define how computations and data are distributed across GPUs to maximize efficiency and minimize idle time. In addition, the chosen parallelization strategy, together with the AI cluster size and the collective communication topology, is the main factor that determines when, with whom, and over which network GPU communication happens.

The main approaches include:

  • Model Parallelism: Splits the neural network layers across multiple GPUs when a single GPU cannot hold the entire model.

  • Tensor Parallelism: Divides the computations of a single layer (e.g., matrix multiplications) across multiple GPUs, improving throughput for large layers.

  • Data Parallelism: Distributes different portions of the training dataset to multiple GPUs, ensuring that all GPUs are actively processing data simultaneously.

  • Pipeline Parallelism: Divides the model into sequential stages across GPUs and processes micro-batches of training data in a staggered fashion, reducing idle time between stages.

  • 3D Parallelism: Combines tensor, pipeline, and data parallelism to scale extremely large models efficiently. Tensor parallelism splits computations within layers, pipeline parallelism splits the model across sequential GPU stages, and data parallelism replicates the model to process different batches simultaneously. Together, they maximize GPU utilization and minimize idle time.

These strategies ensure that AI clusters operate at high efficiency, accelerating training while reducing wasted energy and idle GPU time.


Model Parallelism


Each weight in a neural network is typically stored using 32-bit floating point (FP32) format, which consumes 4 bytes of memory per weight. A floating point number allows representation of real numbers, including very large and very small values, with a decimal point. Large neural networks, with billions of weight parameters, quickly exceed the memory capacity of a single GPU. To reduce the memory load on a single GPU, model parallelism distributes the neural network’s layers (including layer-specific weight matrices and neurons) across multiple GPUs. In this approach, each GPU is responsible for computing the forward and backward pass of the layers it holds. This not only reduces the memory usage on a single GPU but also lowers CPU and GPU cycles by distributing the computation across multiple devices.
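
As a rough, hypothetical illustration of this memory pressure (the parameter count below is made up, not taken from any specific model):

```python
params        = 70_000_000_000        # hypothetical 70-billion-parameter model
bytes_per_w   = 4                     # FP32 = 4 bytes per weight
weights_bytes = params * bytes_per_w

print(f"{weights_bytes / 2**30:.0f} GiB for the weights alone")   # ~261 GiB
```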


Note: During training, additional memory is temporarily required to store the results of activation functions as data passes through the network and the gradients needed to compute weight updates during the backward pass, which can easily double the memory consumption.

Figure 1-9 depicts a simple feedforward neural network with three layers (excluding the input layer for simplicity). The input layer is where the training data’s features are passed into the network; no computations are performed in this layer. The first hidden layer receives the input features and performs computations.

First hidden layer (GPU 1): This layer has three neurons and receives four input features from the input layer. Its weight matrix has a size of 3 × 4. Each row corresponds to a specific neuron (n1, n2, n3), and each element in a row represents the weight for a particular input feature—the first element corresponds to the first input feature, the second to the second, and so on. GPU 1 computes the pre-activation values by multiplying the weight matrix with the input features (matrix multiplication) and then applies the activation function to produce the output of this layer.

Second hidden layer (GPU 2): This layer has two neurons and receives three input features from the outputs of the first hidden layer. Its weight matrix has a size of 2 × 3. Each row corresponds to a specific neuron (n4, n5), and each element in a row represents the weight for a particular input feature from the previous layer. GPU 2 computes the pre-activation values by multiplying the weight matrix with the input features from GPU 1 and then applies the activation function to produce the output of this layer.

Output layer (GPU 3): This layer has one neuron and receives two input features from the second hidden layer. Its weight matrix has a size of 1 × 2. The single row corresponds to the neuron in this layer, and each element represents the weight for a specific input feature from GPU 2. GPU 3 computes the pre-activation value by multiplying the weight matrix with the input features from the second hidden layer and then applies the activation function to produce the final output of the network.
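
The following NumPy sketch mimics this three-stage forward pass with randomly initialized matrices of the stated shapes. In real model parallelism each matrix multiplication would run on its own GPU, with the activations handed over in between.

```python
import numpy as np

rng  = np.random.default_rng(0)
relu = lambda z: np.maximum(0.0, z)

x  = rng.random(4)        # four input features

W1 = rng.random((3, 4))   # first hidden layer, "GPU 1"
W2 = rng.random((2, 3))   # second hidden layer, "GPU 2"
W3 = rng.random((1, 2))   # output layer, "GPU 3"

a1 = relu(W1 @ x)         # GPU 1: 3x4 weights -> 3 outputs, handed to GPU 2
a2 = relu(W2 @ a1)        # GPU 2: 2x3 weights -> 2 outputs, handed to GPU 3
y  = relu(W3 @ a2)        # GPU 3: 1x2 weights -> final prediction

print("prediction:", y)
```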

By assigning each layer to a different GPU, model parallelism enables large neural networks to be trained even when the combined memory requirements of the weight matrices exceed the capacity of a single GPU.

If all GPUs reside on the same node, the activation values during the forward pass are passed to the next layer over the intra-server Scale-Up network using Direct Memory Access (DMA). If the GPUs are located on different nodes, the communication occurs over the Scale-Out Backend network using Remote Direct Memory Access (RDMA). Gradient synchronization during the backward pass follows the same paths: intra-node communication uses the Scale-Up network, and inter-node communication uses the Scale-Out network. 



Figure 1-9: Model Parallelism.

Tensor Parallelism


Model parallelism distributes layers across GPUs. In very large neural networks, even the weight matrices of a single layer may become too large to store on a single GPU or compute efficiently. Tensor parallelism addresses this by splitting a layer’s weight matrix and computation across multiple GPUs. While model parallelism splits layers, tensor parallelism splits within a layer, allowing multiple GPUs to work on the same layer in parallel. Each GPU holds a portion of the layer’s parameters and computes partial outputs. These partial results are then combined to produce the full output of the layer, enabling scaling without exceeding memory or compute limits.

In Figure 1‑10, we have a 4 × 8 weight matrix that is split in half: the first half is assigned to GPU0, and the second half to GPU1. GPU0 belongs to Tensor Parallel Rank 1 (TP Rank 1), and GPU1 belongs to TP Rank 2. GPU0 has two neurons, where neuron n1 is associated with the first row of weights and neuron n2 with the second row. GPU1 works the same way with its portion of the matrix. Both GPUs process the same input feature matrix. On GPU0, neurons n1 and n2 perform matrix multiplication with their weight submatrix and the input feature matrix to produce pre-activation values, which are then passed through the activation function to produce neuron outputs. Because the layer’s weight matrix is distributed across GPUs, each GPU produces only a partial output vector. 

Before passing the results to the next layer, GPUs synchronize these partial output vectors using AllGather collective communication, forming the complete layer output vector that can then be fed into the next layer. If the GPUs reside on the same node, this communication happens over the intra-node Scale-Up network using DMA. If the GPUs are on different nodes, the communication occurs over the inter-node Scale-Out backend network.


Figure 1-10: Tensor Parallelism.


Figure 1‑11 depicts the AllGather collective communication used for synchronizing partial output vectors in tensor parallelism. Recall that the layer’s weight matrix is split across GPUs, with each GPU computing partial outputs for its assigned portion of the matrix. These partial outputs are first stored in the local VRAM of each GPU. The GPUs then exchange these partial results with other GPUs in the same Tensor Parallel Group. For example, GPU0 (TP Rank 1) computes and temporarily stores the outputs of neurons n1 and n2, and synchronizes them with GPU3. After this AllGather operation, all participating GPUs have the complete output vector, which can then be passed to the next layer of their corresponding TP Rank. Once the complete output vectors are passed on, the memory used to store the partial results can be freed.
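
A single-process NumPy simulation of this row-wise split, with shapes taken from Figure 1-10. The final concatenation stands in for the AllGather that a real tensor-parallel group would perform over the scale-up or scale-out network.

```python
import numpy as np

rng  = np.random.default_rng(0)
relu = lambda z: np.maximum(0.0, z)

W = rng.random((4, 8))   # full 4 x 8 layer weight matrix
x = rng.random(8)        # shared input feature vector

W_gpu0, W_gpu1 = W[:2], W[2:]    # rows n1, n2 -> "GPU0"; remaining rows -> "GPU1"

partial0 = relu(W_gpu0 @ x)      # partial output vector computed on GPU0
partial1 = relu(W_gpu1 @ x)      # partial output vector computed on GPU1

# Stand-in for the AllGather collective: every TP rank ends up with the
# complete layer output vector.
full_output = np.concatenate([partial0, partial1])
print("complete layer output:", full_output)
```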



Figure 1-11: Tensor Parallelism – AllGather Operation for Complete Output Vector.


3D Parallelism


3D parallelism combines model parallelism, tensor parallelism, data parallelism, and pipeline parallelism into a unified training strategy. The first two were described earlier, and this section introduces the remaining two from the 3D parallelism perspective.
At the bottom of Figure 1-12 is a complete input feature matrix with 16 input elements (x1–x16). This matrix is divided into two mini-batches:

The first mini-batch (x1–x8) is distributed to GPU0 and GPU1.
The second mini-batch (x9–x16) is distributed to GPU4 and GPU5.

This is data parallelism: splitting the dataset into smaller mini-batches so multiple GPUs can process them in parallel, either because the dataset is too large for a single GPU or to speed up training.
Pipeline parallelism further divides each mini-batch into micro-batches, which are processed in a pipeline across the GPUs. In Figure 1-12, the mini-batch (x1–x8) is split into two micro-batches: x1–x4 and x5–x8. These are processed one after the other in pipeline fashion.
In this example, GPU0 and GPU2 belong to TP Rank 1, while GPU1 and GPU3 belong to TP Rank 2. Since they share the same portion of the dataset, TP Ranks 1 and 2 together form a Data Parallel Group (DP Group). The same applies to GPUs 4–7, which form a second DP Group.

Forward Pass in 3D Parallelism


Training proceeds layer by layer:


  1. First layer: The neurons compute their outputs for the first micro-batch (step 1a). Within each TP Rank, these outputs form partial vectors, which are synchronized across TP Ranks within the same DP Group using collective communication (step 1b). The result is a complete output vector, which is passed to the next layer (step 1c).
  2. Second layer: Neurons receive the complete output vector from the first layer and repeat the process—compute outputs (2a), synchronize partial vectors (2b), and pass the complete vector to the next layer (2c).
  3. Pipeline execution: As soon as GPU0 and GPU1 finish processing the first micro-batch and forward it to the third layer, they can immediately begin processing the second micro-batch (step 3a). At the same time, the GPUs handling layers three and four start processing the outputs received from the second layer.

This overlapping execution means that eventually all eight GPUs are active simultaneously, each processing different micro-batches and layers in parallel. This is the essence of 3D parallelism: maximizing efficiency by distributing memory load and computation while minimizing GPU idle time, which greatly speeds up training.

During the forward pass, communication between GPUs happens only within a Data Parallel Group.
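
A small sketch of the batch splitting described above, using the sample indices from Figure 1-12: the full batch is divided into mini-batches for the data-parallel groups, and each mini-batch is further divided into micro-batches for the pipeline.

```python
import numpy as np

full_batch = np.arange(1, 17)            # x1 .. x16

# Data parallelism: one mini-batch per DP group.
mini_batches = np.split(full_batch, 2)   # [x1..x8], [x9..x16]

# Pipeline parallelism: each mini-batch is split into micro-batches.
for dp_group, mini in enumerate(mini_batches):
    micro_batches = np.split(mini, 2)    # e.g. [x1..x4], [x5..x8]
    print(f"DP group {dp_group}:", [m.tolist() for m in micro_batches])
```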

Figure 1-12: 3D Parallelism Forward pass.

Backward Pass in 3D Parallelism


After the model output is produced and the loss is computed, the training process enters the backward pass. The backward pass calculates the gradients, which serve as the basis for determining how much, and in which direction, the model weights should be adjusted to improve performance. 
The gradient computation begins at the output layer, where the derivative of the loss is used to measure the error contribution of each neuron. These gradients are then propagated backward through the network layer by layer, in the reverse order of the forward pass. Within each layer, gradients are first computed locally on each GPU.

Once the local gradients are available, synchronization takes place. Just as neuron outputs were synchronized during the forward pass between TP Ranks within a Data Parallel (DP) Group, the gradients now need to be synchronized both within each DP Group and across DP Groups. This step ensures that all GPUs hold consistent weight updates before the optimizer applies them. Figure 1-13 illustrates this gradient synchronization process.
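
A simple NumPy simulation of this synchronization step with illustrative values: each data-parallel replica computes its own local gradients, and averaging them is the effect of an AllReduce across the group.

```python
import numpy as np

# Local gradients computed by two data-parallel replicas for the same weights.
grads_replica0 = np.array([0.20, -0.50, 0.10])
grads_replica1 = np.array([0.40, -0.30, 0.30])

# AllReduce (average): afterwards every replica applies the same weight update.
synced = (grads_replica0 + grads_replica1) / 2
print("synchronized gradients:", synced)   # [ 0.3 -0.4  0.2]
```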



Figure 1-13: 3D Parallelism Backward pass.

Summary


In this chapter, the essential building blocks of an AI cluster were described. The different networks that connect the system together were introduced: the scale-out backend network, the scale-up network, as well as the management, storage, and frontend networks.

After that foundation was established, the operation of neural networks was explained. The forward pass through individual neurons was described to show how outputs are produced, and the backward pass was outlined to demonstrate how errors propagate and how gradients are computed.

The parallelization strategies—model parallelism, tensor parallelism, pipeline parallelism, and data parallelism—were then presented, and their combination into full 3D parallelism was discussed to illustrate how large GPU clusters can be efficiently utilized.

With this understanding of both the infrastructure and the computational model established, attention can now be turned to Ultra Ethernet, the transport technology used to carry RDMA traffic across the Ethernet-based scale-out backend network that underpins large-scale AI training. 

References


[1] Colossus Supercomputer: https://en.wikipedia.org/wiki/Colossus_(supercomputer)
[2] Inside the 100K GPU xAI Colossus Cluster that Supermicro Helped Build for Elon Musk, https://www.servethehome.com/inside-100000-nvidia-gpu-xai-colossus-cluster-supermicro-helped-build-for-elon-musk/




AI Cluster Networking

Introduction

The Ultra Ethernet Specification v1.0 (UES), created by the Ultra Ethernet Consortium (UEC), defines end-to-end communication practices for Remote Direct Memory Access (RDMA) services in AI and HPC workloads over Ethernet network infrastructure. UES not only specifies a new RDMA-optimized transport layer protocol, Ultra Ethernet Transport (UET), but also defines how the full application stack—from Software through Transport, Network, Link, and Physical—can be adjusted to provide improved RDMA services while continuing to leverage well-established standards. UES includes, but is not limited to, a software API, mechanisms for low-latency and lossless packet delivery, and an end-to-end secure software communication path. 

Before diving into the details of Ultra Ethernet, let’s briefly look at what we are dealing with when we talk about an AI cluster. From this point onward, we focus on Ultra Ethernet from the AI cluster perspective. This chapter first introduces AI cluster networking. Then, it briefly explains how a neural network operates during the training process, including a short introduction to the backpropagation algorithm and its forward and backward pass functionality.

Note: This book does not include complex mathematical treatment of the backpropagation algorithm or detailed explanations of different neural network types. I have written a separate book, Deep Learning for Network Engineers, whose first part covers Feedforward Neural Networks (FNN), Convolutional Neural Networks (CNN), and Large Language Models (LLM).


AI Cluster Networking


Scale-Out Backend Network: Inter-Node GPU Communication

Figure 1-1 illustrates a logical view of an AI Training (AIT) cluster consisting of six nodes, each equipped with four GPUs, for a total of 24 GPUs. Each GPU has a dedicated Remote Direct Memory Access-capable Network Interface Card (RDMA-NIC), which at the time of writing typically operates at speeds ranging from 400 to 800 Gbps.


Figure 1-1: Inter-Node GPU Connection: Scale-Out Backend Network.

An RDMA-NIC can directly read from and write to the GPU’s VRAM without involving the host CPU or triggering interrupts. In this sense, RDMA-NICs act as hardware accelerators by offloading data transfer operations from the CPU, reducing latency and freeing up compute resources.

GPUs in the different nodes with the same local rank number are connected to the same rail of the Scale-Out Backend network. For example, all GPUs with a local rank of zero are connected to rail zero, while those with a local rank of one are connected to rail one.

The Scale-Out Backend network is used for inter-node GPU communication and must support low-latency, lossless RDMA message transport. Its physical topology depends on the scale and scalability requirements of the implementation. A leaf switch may be dedicated to a single rail, or it may support multiple rails by grouping its ports, with each port group mapped one-to-one to a rail. Inter-rail traffic is commonly routed through spine switches. In larger implementations, the network often follows a routed two-tier (3-stage) Clos topology or a pod-based, three-tier (5-stage) topology.

The Scale-Out Backend network is primarily used to transfer the results of neuron activation functions to the next layer during the forward pass and to support collective communication for gradient synchronization during the backward pass. The communication pattern between inter-node GPUs, however, depends on the selected parallelization strategy.

Traffic on the Scale-Out Backend network is characterized as highly latency-sensitive, bursty, low-entropy traffic with a few long-lived elephant flows. Since full link utilization is common during communication phases, an efficient congestion control mechanism must be implemented.

Note: Scale-Out Backend network topologies are covered in more detail in my Deep Learning for Network Engineers book.


Scale-Up Networks: Intra-Node GPU Communication


Intra-node GPU communication occurs over a high-bandwidth, low-latency scale-up network. Common technologies for intra-node GPU communication include NVIDIA NVLink, NVSwitch, and AMD Infinity Fabric, depending on the GPU vendor and server architecture. Additionally, the Ultra Accelerator (UA) Consortium has introduced the Ultra Accelerator Link (UALink) 200G 1.0 Specification, a standards-based, vendor-neutral solution designed to enable GPU communication over intra-node or pod scale-up networks.



Figure 1-2: Intra-Node GPU Connection: Scale-Up Network.

These interconnects form a Scale-Up communication channel that allows GPUs within the same node to exchange data directly, bypassing the host CPU and system memory. Compared to PCIe-based communication, NVLink and similar solutions provide significantly higher bandwidth and lower latency.

In a typical NVLink topology, GPUs are connected in a mesh or fully connected ring, enabling peer-to-peer data transfers. In systems equipped with NVSwitch, all GPUs within a node are interconnected through a centralized switch fabric, allowing uniform access latency and bandwidth across any GPU pair.

The Scale-Up fabric is used for largely the same purpose as the Scale-Out Backend network, but it serves as the intra-node GPU communication path.
Because communication happens directly over the GPU interconnect, Scale-Up communication is generally much faster and more efficient than inter-node communication over the Scale-Out Backend network.

Note: In-depth discussion of NVLink/NVSwitch topologies and intra-node parallelism strategies can be found in my Deep Learning for Network Engineers book.

Frontend Network: User Inference


The modern Frontend Network in a large-scale AI training cluster is often implemented as a routed Clos fabric designed to provide scalable and reliable connectivity for user access, orchestration, and inference workloads. The primary function of the Frontend Network is to handle user interactions with deployed AI models by serving inference requests.


Figure 1-3: User Inference: Frontend Network.

When multitenancy is required, the modern Frontend Network typically uses BGP EVPN as the control plane and VXLAN as the data plane encapsulation mechanism, enabling virtual network isolation. Data transport is usually based on TCP. Multitenancy also makes it possible to create a secure and isolated network segment for training job initialization, where GPUs join the job and receive initial model parameters from the master rank.

Unlike the Scale-Out Backend, which connects GPUs across nodes using a dedicated RDMA-NIC per GPU, the Frontend Network is accessed via a shared NIC, commonly operating at 100 Gbps.

Traffic on the Frontend Network is characterized by bursty, irregular communication patterns, dominated by short-lived, high-entropy mouse flows involving many unique IP and port combinations. These flows are moderately sensitive to latency, particularly in interactive inference scenarios. Despite the burstiness, the average link utilization remains relatively low compared to the Scale-Out or Scale-Up fabrics.


Note: An in-depth treatment of BGP EVPN/VXLAN can be found in my book Virtual eXtensible LAN – VXLAN Fabric with BGP EVPN Control-Plane.

Management Network


The Management Network is a dedicated or logically isolated network used for the orchestration, control, and administration of an AI cluster. It provides secure and reliable connectivity between management servers, compute nodes, and auxiliary systems. These auxiliary systems typically include time synchronization servers (NTP/PTP), authentication and authorization services (such as LDAP or Active Directory), license servers, telemetry collectors, remote management interfaces (e.g., IPMI, Redfish), and configuration automation platforms.



Figure 1-4: Cluster Management: Management Network.

Traffic on the Management Network is typically low-bandwidth but highly sensitive, requiring strong security policies, high reliability, and low-latency access to ensure stability and operational continuity. It supports administrative operations such as remote access, configuration changes, service monitoring, and software updates.

To ensure isolation from user, training, and storage traffic, management traffic is usually carried over separate physical interfaces, or logically isolated using VLANs or VRFs. This network is not used for model training data, gradient synchronization, or inference traffic.

Typical use cases include:

  • Cluster Orchestration and Scheduling: Facilitates communication between orchestration systems and compute nodes for job scheduling, resource allocation, and lifecycle management.
  • Job Initialization and Coordination: Handles metadata exchange and service coordination required to bootstrap distributed training jobs and synchronize GPUs across multiple nodes.
  • Firmware and Software Lifecycle Management: Supports remote OS patching, BIOS or firmware upgrades, driver installation, and configuration rollouts.
  • Monitoring and Telemetry Collection: Enables collection of logs, hardware metrics, software health indicators, and real-time alerts to centralized observability platforms.
  • Remote Access and Troubleshooting: Provides secure access for administrators via SSH, IPMI, or Redfish for diagnostics, configuration, or out-of-band management.
  • Security and Segmentation: Ensures that control plane and administrative traffic remain isolated from data plane workloads, maintaining both performance and security boundaries.

The Management Network is typically built with a focus on operational stability and fault tolerance. While bandwidth requirements are modest, low latency and high availability are critical for maintaining cluster health and responsiveness.

Storage Network


The Storage Network connects compute nodes, including GPUs, to the underlying storage infrastructure that holds training datasets, model checkpoints, and inference data.

Figure 1-5: Data Access: Storage Network.

Key use cases include:

  • High-Performance Data Access: Streaming large datasets from distributed or centralized storage systems (e.g., NAS, SAN, or parallel file systems such as Lustre or GPFS) to GPUs during training.
  • Data Preprocessing and Caching: Supporting intermediate caching layers and fast read/write access for preprocessing pipelines that prepare training data.
  • Shared Storage for Distributed Training: Providing a consistent and accessible file system view across multiple nodes to facilitate synchronization and checkpointing.
  • Model Deployment and Inference: Delivering trained model files to inference services and storing input/output data for auditing or analysis.

Due to the high volume and throughput requirements of training data access, the Storage Network is typically designed for high bandwidth, low latency, and scalability. It may leverage protocols such as NVMe over Fabrics (NVMe-oF), Fibre Channel, or high-speed Ethernet with RDMA support.

Summary of AI Cluster Networks

Scale-Out Backend Network

The Scale-Out Backend network connects GPUs across multiple nodes for inter-node GPU communication. It supports low-latency, lossless RDMA message transport essential for synchronizing gradients and transferring neuron activation results during training. Its topology varies by scale, typically mapping each rail to a leaf switch with inter-rail traffic routed through spine switches. Larger deployments often use routed two-tier (3-stage) or pod-based three-tier (5-stage) Clos architectures.

Scale-Up Network

The Scale-Up network provides high-speed intra-node communication between GPUs within the same server. It typically uses NVLink, NVSwitch, or PCIe fabrics enabling direct, low-latency, high-bandwidth access to GPU VRAM across GPUs in a single node. This network accelerates collective operations and data sharing during training and reduces CPU involvement.

Frontend Network

The Frontend network serves as the user access and orchestration interface in the AI datacenter. Implemented as a routed Clos fabric, it handles inference requests. When multitenancy is required, it leverages BGP EVPN for control plane and VXLAN for data plane encapsulation. This network uses TCP transport and operates typically at 100 Gbps, connecting GPUs through shared NICs.

Management Network

The Management Network is a dedicated or logically isolated network responsible for cluster orchestration, control, and administration. It connects management servers, compute nodes, and auxiliary systems such as time synchronization servers, authentication services, license servers, telemetry collectors, and remote console interfaces. This network supports low-bandwidth but latency-sensitive traffic like job initialization, monitoring, remote access, and software updates, typically isolated via VRFs or VLANs for security.

Storage Network

The Storage Network links compute nodes to storage systems housing training datasets, model checkpoints, and inference data. It supports high-performance data streaming, data preprocessing, shared distributed storage access, and model deployment. Designed for high bandwidth, low latency, and scalability, it commonly uses protocols such as NVMe over Fabrics (NVMe-oF), Fibre Channel, or RDMA-enabled Ethernet.


Figure 1-6: A Logical view of 6x4 GPU Cluster.

Monday, 7 July 2025

Ultra Ethernet

Introduction

Remote Direct Memory Access over Converged Ethernet (RoCE) is a transport model that extends InfiniBand semantics over Ethernet networks. It enables direct memory access between hosts by encapsulating InfiniBand transport headers—such as the InfiniBand Transport Header (IBTH) and the RDMA Extended Transport Header (RETH)—within Ethernet, IP, and UDP packets. Chapter 9 of my book "Deep Learning for Network Engineers" describes how RDMA NICs process application work requests, known as InfiniBand verbs, and how these are encoded into IBTH and RETH headers for delivery to remote targets using RoCEv2.

This post shifts focus to the Ultra Ethernet Transport (UET) model, developed by the Ultra Ethernet Consortium (UEC). UET defines an alternative RDMA transport architecture that operates over standard Ethernet networks, without relying on InfiniBand message formats or semantics. While both RoCEv2 and UET enable remote memory access between nodes, UET is not based on InfiniBand transport headers, and the term RoCE is not used in UET systems.

Instead, UET introduces a new Ultra Ethernet (UE) layer composed of several sublayers, including the Semantic Sublayer (SES) and the Packet Delivery Sublayer (PDS). These sublayers are responsible for encoding and transmitting RDMA operations—such as memory addresses, remote keys (RKEYs), operation codes, and completion signaling—over Ethernet. This contrasts with RoCEv2, where such RDMA information is carried within IBTH and RETH headers.

In this chapter, we explore how UET transports data across the Scale-Out Network—a role comparable to the “Backend Network” in RoCEv2-based systems. We will examine how UET supports GPU-to-GPU communication at data center scale, and how its design differs in terms of packet structure, connection setup, and flow control when compared to InfiniBand-based approaches.

This chapter is based on the Ultra Ethernet Specification v1.0, published on June 11, 2025, by the Ultra Ethernet Consortium.

Figure 14-1 depicts a small, two-node parallel computing system, where each node contains two GPUs and a UE-capable NIC per GPU. These nodes, along with the Scale-Out Backend Network (switching infrastructure), form a cluster. The UEC specification uses the term Fabric Interface (FI) to refer to a UE-capable NIC. In Figure 14-1, all four FIs, together with the leaf and spine switches in the Scale-Out Backend Network, make up the fabric. In this context, the fabric includes all network components: UE-NICs, switching infrastructure, inter-switch links, cabling, and transceivers. One-way delay in the Scale-Out Backend Network should be less than 10 µs.

The Scale-Up Network refers to intra-node GPU communication using NVLink and PCIe. Scale-Up Networks can also include short-range interconnects that use a high-speed, low-latency, single-tier (non-CLOS topology) switch—enabling tightly coupled multi-node systems that retain scale-up characteristics. Latency for Scale-Up network should be < 1 µs.

A Fabric Endpoint (FEP) is a logical entity that terminates the UET protocol and is identified by a unique Fabric Address (FA). The UET protocol consists of three key sublayers—the Semantic (SES), Packet Delivery (PDS), and Transport Security (TSS) sublayers—analogous to how a VXLAN Tunnel Endpoint (VTEP) terminates the VXLAN data plane. These sublayers will be discussed in detail in upcoming chapters.

The Packet Delivery Context (PDC), a component of the FEP, is a logical construct responsible for unidirectional packet delivery between two FEPs. The Congestion Control Context (CCC), in turn, manages the transmission rate of traffic exchanged over a given PDC.

The term port refers to the physical port of the UE-capable NIC that connects to a fabric switch. Since multiple FEPs may exist on a single FI—each assigned a unique FA—but the FI may have only one (or two, if dual-homed) physical ports, the FAs of all FEPs on the same FI typically share the same MAC address.

A Fabric Plane is a communication path between two or more Fabric Endpoints (FEPs). In Figure 14-1, two Fabric Planes are shown. Fabric Plane 1 connects FEP 111 and FEP 122 through interfaces E1 and E2 on Leaf Switch 101. Fabric Plane 2 connects FEP 211 and FEP 222 via interfaces E1 and E2 on Leaf Switch 102.

In a RoCEv2-based solution, the Scale-Out Network is referred to as the Backend Network, and the communication paths between processes are called Rails.

The Fabric Interface exposes FEPs to parallel computing processes, which are identified by Process IDs (PIDs). A Virtual Address Space (VAS) represents the memory space allocated and registered to a specific process and is identified by a Process Address Space Identifier (PASID).

Figure 14-1: Ultra Ethernet Terminology.

 

Thursday, 15 May 2025

Deep Learning for Network Engineers: Understanding Traffic Patterns and Network Requirements in the AI Data Center

 

About This Book

Several excellent books have been published over the past decade on Deep Learning (DL) and Datacenter Networking. However, I have not found a book that covers these topics together—as an integrated deep learning training system—while also highlighting the architecture of the datacenter network, especially the backend network, and the demands it must meet.

This book aims to bridge that gap by offering insights into how Deep Learning workloads interact with and influence datacenter network design.

So, what is Deep Learning?

Deep Learning is a subfield of Machine Learning (ML), which itself is a part of the broader concept of Artificial Intelligence (AI). Unlike traditional software systems where machines follow explicitly programmed instructions, Deep Learning enables machines to learn from data without manual rule-setting.

At its core, Deep Learning is about training artificial neural networks. These networks are mathematical models composed of layers of artificial neurons. Different types of networks suit different tasks—Convolutional Neural Networks (CNNs) for image recognition, and Large Language Models (LLMs) for natural language processing, to name a few.

Training a neural network involves feeding it labeled data and adjusting its internal parameters through a process called backpropagation. During the forward pass, the model makes a prediction based on its current parameters. This prediction is then compared to the correct label to calculate an error. In the backward pass, the model uses this error to update its parameters, gradually improving its predictions. Repeating this process over many iterations allows the model to learn from the data and make increasingly accurate predictions.

Why should network engineers care?

Modern Deep Learning models can be extremely large, often exceeding the memory capacity of a single GPU or CPU. In these cases, training must be distributed across multiple processors. This introduces the need for high-speed communication between GPUs—both within a single server (intra-node) and across multiple servers (inter-node).

Intra-node GPU communication typically relies on high-speed interconnects like NVLink, with Direct Memory Access (DMA) operations enabling efficient data transfers between GPUs. Inter-node communication, however, depends on the backend network, either InfiniBand or Ethernet-based. Synchronization of model parameters across GPUs places strict requirements on the network: high throughput, ultra-low latency, and zero packet loss. Achieving this in an Ethernet fabric is challenging but possible.

This is where datacenter networking meets Deep Learning. Understanding how GPUs communicate and what the network must deliver is essential for designing effective AI data center infrastructures.




What this book is—and isn’t


This book provides a theoretical and conceptual overview. It is not a configuration or implementation guide, although some configuration examples are included to support key concepts. Since the focus is on the Deep Learning process, not on interacting with or managing the model, there are no chapters covering frontend or management networks. The storage network is also outside the scope. The focus is strictly on the backend network.
The goal is to help readers—especially network professionals—grasp the “big picture” of how Deep Learning impacts data center networking.

One final note

In all my previous books, I’ve used font size 10 and single line spacing. For this book, I’ve increased the font size to 11 and the line spacing to 1.15. This wasn’t to add more pages but to make the reading experience more comfortable. I’ve also tried to ensure that figures and their explanations appear on the same page, which occasionally results in some white space.
I hope you find this book helpful and engaging as you explore the fascinating intersection of Deep Learning and Datacenter Networking.

How this book is organized


Part I – Chapters 1-8: Deep Learning and Deep Neural Networks


This part of the book lays the theoretical foundation for understanding how modern AI models are built and trained. It introduces the structure and purpose of artificial neurons and gradually builds up to complete deep learning architectures and parallel training methods.

Artificial Neurons and Feedforward Networks (Chapters 1 - 3)

The journey begins with the artificial neuron, also known as a perceptron, which is the smallest functional unit of a neural network. It operates in two key steps: performing a matrix multiplication between inputs and weights, followed by applying a non-linear activation function to provide an output. 
By connecting many neurons across layers, we form a Feedforward Neural Network (FNN). FNNs are ideal for basic classification and regression tasks and provide the stepping stone to more advanced architectures.
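
As a minimal illustration of those two steps, the following NumPy sketch computes a single neuron's output from made-up inputs and weights; the numbers carry no special meaning.

import numpy as np

# A single artificial neuron (perceptron): weighted sum, then activation.
def relu(z):
    return np.maximum(0.0, z)

inputs = np.array([0.5, -1.2, 3.0])      # one input sample (illustrative values)
weights = np.array([0.8, 0.1, -0.4])     # one weight per input
bias = 0.2

z = np.dot(inputs, weights) + bias       # weighted sum of inputs and weights
output = relu(z)                         # non-linear activation function
print(output)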

Specialized Architectures: CNNs, RNNs, and Transformers  (Chapters 3 - 9)

After covering FNNs, this part dives into models designed for specific data types:
  • Convolutional Neural Networks (CNNs): Optimized for spatial data like images, CNNs use filters to extract local features such as edges, textures, and shapes, while keeping the model size efficient through weight sharing.
  • Recurrent Neural Networks (RNNs): Designed for sequential data like text and time series, RNNs maintain a hidden state that captures previous input history. This allows them to model temporal dependencies and context across sequences.
  • Transformer-based Large Language Models (LLMs): Unlike RNNs, Transformers use self-attention mechanisms to weigh relationships between all tokens in a sequence simultaneously. This architecture underpins state-of-the-art language models and enables scaling to billions of parameters.
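
To make the self-attention idea in the last bullet concrete, here is a minimal NumPy sketch of scaled dot-product attention over a toy sequence; all shapes and values are illustrative.

import numpy as np

# Scaled dot-product self-attention for a toy sequence of four tokens.
def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model = 4, 8
tokens = np.random.randn(seq_len, d_model)   # token embeddings

Wq = np.random.randn(d_model, d_model)       # learned projections (random here)
Wk = np.random.randn(d_model, d_model)
Wv = np.random.randn(d_model, d_model)

Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
scores = Q @ K.T / np.sqrt(d_model)          # every token attends to every token
weights = softmax(scores, axis=-1)           # attention weights per token pair
context = weights @ V                        # weighted mix of value vectors
print(context.shape)                         # (4, 8)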

Parallel Training and Scaling Deep Learning  (Chapter 8)

As models and datasets grow, training them on a single GPU becomes impractical. This section explores the three major forms of distributed training:


  • Data Parallelism: Each GPU holds a replica of the model but processes different mini-batches of input data. Gradients are synchronized at the end of each iteration to keep weights aligned.
  • Pipeline Parallelism: The model is split across multiple GPUs, with each GPU handling one stage of the forward and backward pass. Micro-batches are used to keep the pipeline full and maximize utilization.
  • Tensor (Model) Parallelism: Very large model layers are broken into smaller slices, and each GPU computes part of the matrix operations. This approach enables the training of ultra-large models that don't fit into a single GPU's memory.
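
As a small illustration of the last bullet, the NumPy sketch below splits one layer's weight matrix column-wise across two simulated GPUs and shows that the concatenated partial results match the unpartitioned computation; all sizes are arbitrary.

import numpy as np

# Tensor (model) parallelism in miniature: each "GPU" holds half the columns
# of the weight matrix and computes its partial product independently.
batch, d_in, d_out = 16, 32, 64
x = np.random.randn(batch, d_in)
W = np.random.randn(d_in, d_out)

W_gpu0, W_gpu1 = np.split(W, 2, axis=1)   # each device holds half the columns
y_gpu0 = x @ W_gpu0                       # computed on GPU 0
y_gpu1 = x @ W_gpu1                       # computed on GPU 1

y = np.concatenate([y_gpu0, y_gpu1], axis=1)
assert np.allclose(y, x @ W)              # same result as the unpartitioned layer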

Part II – Chapters 9–14: AI Data Center Networking


This part of the book focuses on the network technologies that enable distributed training at scale in modern AI data centers. It begins with an overview of GPU-to-GPU memory transfer mechanisms over Ethernet and then moves on to congestion control, load balancing strategies, network topologies, and GPU communication collectives.

RoCEv2 and GPU-to-GPU Transfers  (Chapter 9)

The section starts by explaining how Remote Direct Memory Access (RDMA) is used to copy data directly between GPU memories across Ethernet using RoCEv2 (RDMA over Converged Ethernet version 2). This method allows GPUs located in different servers to exchange large volumes of data without CPU involvement.

DCQCN: Data Center Quantized Congestion Notification  (Chapters 10 - 11)

RoCEv2’s performance depends on a lossless transport layer, which makes congestion management essential. To address this, DCQCN provides an advanced congestion control mechanism: it dynamically adjusts each sender’s transmission rate based on real-time congestion feedback from the network, minimizing latency and packet loss during GPU-to-GPU communication.


  • Explicit Congestion Notification (ECN): Network switches mark packets instead of dropping them when congestion builds. These marks trigger rate adjustments at the sender to prevent overload.
  • Priority-based Flow Control (PFC): PFC ensures that traffic classes like RoCEv2 can pause independently, preventing buffer overflows without stalling the entire link.
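
To give a feel for the sender-side behavior, the sketch below models a DCQCN-style reaction point in a deliberately simplified way. Real NIC implementations maintain more state (timers, byte counters, fast-recovery stages), and the constants used here are illustrative assumptions rather than standard values.

# A highly simplified model of a DCQCN-style reaction point (sender).
LINE_RATE = 400.0   # Gbps, illustrative
G = 1 / 16          # gain used when updating the congestion estimate alpha (illustrative)
R_AI = 5.0          # additive-increase step in Gbps, illustrative

current_rate = LINE_RATE
target_rate = LINE_RATE
alpha = 1.0

def on_cnp_received():
    """Receiver saw ECN-marked packets and sent a CNP: cut the rate."""
    global current_rate, target_rate, alpha
    target_rate = current_rate
    current_rate = current_rate * (1 - alpha / 2)
    alpha = (1 - G) * alpha + G

def on_quiet_period():
    """No CNPs for a while: decay alpha and climb back toward the target."""
    global current_rate, target_rate, alpha
    alpha = (1 - G) * alpha
    target_rate = min(target_rate + R_AI, LINE_RATE)
    current_rate = (target_rate + current_rate) / 2

on_cnp_received()
print(round(current_rate, 1))   # rate drops after congestion feedback
on_quiet_period()
print(round(current_rate, 1))   # rate recovers when the network is quiet
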
Load Balancing Techniques in AI Traffic  (Chapter 12)

In addition to congestion control, effective load distribution is critical for sustaining GPU throughput during collective communication. This section introduces several techniques used in modern data center fabrics:


  • Flow-based Load Balancing: Assigns entire flows to paths using hash-based distribution or real-time link utilization, improving path diversity and overall utilization (illustrated in the sketch after this list).
  • Flowlet Switching: Divides a flow into smaller time-separated bursts ("flowlets") that can be load-balanced independently without reordering issues.
  • Packet Spraying: Distributes packets belonging to the same flow across multiple available paths, helping to avoid link-level bottlenecks.
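
The toy Python model below contrasts the three policies for a single RoCEv2 flow across four equal-cost uplinks. The hashing scheme and tuple values are illustrative assumptions, not a description of any specific switch ASIC.

import hashlib

# Four equal-cost uplinks and one RoCEv2 flow identified by its 5-tuple.
PATHS = [0, 1, 2, 3]
flow_5tuple = ("10.0.1.2", "10.0.2.2", 54321, 4791, "UDP")   # RoCEv2 uses UDP/4791

def flow_based(five_tuple):
    """Hash the 5-tuple once: every packet of the flow takes the same path."""
    digest = hashlib.sha256(str(five_tuple).encode()).hexdigest()
    return PATHS[int(digest, 16) % len(PATHS)]

def flowlet_based(five_tuple, flowlet_id):
    """Hash the 5-tuple plus a flowlet counter: each idle-gap-separated burst
    may take a different path without reordering packets inside a burst."""
    digest = hashlib.sha256(f"{five_tuple}{flowlet_id}".encode()).hexdigest()
    return PATHS[int(digest, 16) % len(PATHS)]

def packet_spraying(packet_seq):
    """Round-robin individual packets across all available paths."""
    return PATHS[packet_seq % len(PATHS)]

print("flow-based   :", flow_based(flow_5tuple))
print("flowlet-based:", [flowlet_based(flow_5tuple, f) for f in range(3)])
print("packet spray :", [packet_spraying(p) for p in range(8)])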

AI Data Center Network Topologies (Chapter 13)

Next, the section discusses design choices in the East-West fabric—the internal network connecting GPU servers. It introduces topologies such as:

  • Top-of-Rack (ToR): Traditional rack-level switching used to connect servers within a rack.
  • Rail and Rail-Optimized Designs: High-throughput topologies tailored for parallel GPU clusters. These layouts improve resiliency and throughput, especially during bursty communication phases in training jobs.

GPU-to-GPU Communication  (Chapter 14)

The part concludes with a practical look at collective communication patterns used to synchronize GPUs across the network. These collectives are essential for distributed training workloads:


  • AllReduce: Each GPU contributes and receives a complete, aggregated copy of the data. Internally, this is commonly implemented in two phases (simulated in the sketch after this list):
    • ReduceScatter: GPUs exchange partial results and compute a portion of the final sum.
    • AllGather: Each GPU shares its computed segment so that every GPU receives the complete aggregated result.
  • Broadcast: A single GPU (often rank 0) sends data—such as communication identifiers or job-level metadata—to all other GPUs at the start of a training job.
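
The short simulation below mimics four ranks performing AllReduce as ReduceScatter followed by AllGather, entirely in NumPy within one process; sizes and values are arbitrary.

import numpy as np

# Four simulated "GPUs", each starting with its own gradient vector.
num_ranks, chunk = 4, 3
grads = [np.random.randn(num_ranks * chunk) for _ in range(num_ranks)]

# ReduceScatter: rank r ends up with the summed chunk r of the vector.
reduced_chunks = []
for r in range(num_ranks):
    start, stop = r * chunk, (r + 1) * chunk
    reduced_chunks.append(sum(g[start:stop] for g in grads))

# AllGather: every rank collects all reduced chunks and reassembles the vector.
allreduced = np.concatenate(reduced_chunks)

assert np.allclose(allreduced, sum(grads))   # same result as a direct AllReduce
print(allreduced.shape)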

Target Audience


I wrote this book for professionals working in the data center networking domain—whether in architectural, design, or specialist roles. It is especially intended for those who are already involved in, or are preparing to work with, the unique demands of AI-driven data centers. As AI workloads reshape infrastructure requirements, this book aims to provide the technical grounding needed to understand both the deep learning models and the networking systems that support them.

Back Cover Text


Deep Learning for Network Engineers bridges the gap between AI theory and modern data center network infrastructure. This book offers a technical foundation for network professionals who want to understand how Deep Neural Networks (DNNs) operate—and how GPU clusters communicate at scale.

Part I (Chapters 1–8) explains the mathematical and architectural principles of deep learning. It begins with the building blocks of artificial neurons and activation functions, and then introduces Feedforward Neural Networks (FNNs) for basic pattern recognition, Convolutional Neural Networks (CNNs) for more advanced image recognition, Recurrent Neural Networks (RNNs) for sequential and time-series prediction, and Transformers for large-scale language modeling using self-attention. The final chapters present parallel training strategies used when models or datasets no longer fit into a single GPU. In data parallelism, the training dataset is divided across GPUs, each processing different mini-batches using identical model replicas. Pipeline parallelism segments the model into sequential stages distributed across GPUs. Tensor (or model) parallelism further divides large model layers across GPUs when a single layer no longer fits into memory. These approaches enable training jobs to scale efficiently across large GPU clusters.

Part II (Chapters 9–14) focuses on the networking technologies and fabric designs that support distributed AI workloads in modern data centers. It explains how RoCEv2 enables direct GPU-to-GPU memory transfers over Ethernet, and how congestion control mechanisms like DCQCN, ECN, and PFC ensure lossless high-speed transport. You’ll also learn about AI-specific load balancing techniques, including flow-based, flowlet-based, and per-packet spraying, which help avoid bottlenecks and keep GPU throughput high. Later chapters examine GPU collectives such as AllReduce—used to synchronize model parameters across all workers—alongside ReduceScatter and AllGather operations. The book concludes with a look at rail-optimized topologies that keep multi-rack GPU clusters efficient and resilient.

This book is not a configuration or deployment guide. Instead, it equips you with the theory and technical context needed to begin deeper study or participate in cross-disciplinary conversations with AI engineers and systems designers. Architectural diagrams and practical examples clarify complex processes—without diving into implementation details.

Readers are expected to be familiar with routed Clos fabrics, BGP EVPN control planes, and VXLAN data planes. These technologies are assumed knowledge and are not covered in the book.

Whether you're designing next-generation GPU clusters or simply trying to understand what happens inside them, this book provides the missing link between AI workloads and network architecture.

Sunday, 4 May 2025

AI for Network Engineers: Rail Designs in GPU Fabric

When building a scalable, resilient GPU network fabric, the design of the rail layer, the portion of the topology that interconnects GPU servers via Top-of-Rack (ToR) switches, plays a critical role. This section explores three models: Multi-rail-per-switch, Dual-rail-per-switch, and Single-rail-per-switch. All three support dual-NIC-per-GPU designs, allowing each GPU to connect redundantly to two separate switches, thereby removing the Rail switch as a single point of failure.


Multi-Rail-per-Switch

In this model, multiple small subnets and VLANs are configured per switch, with each logical rail mapped to a subset of physical interfaces. For example, a single 48-port switch might host four or eight logical rails using distinct Layer 2 and Layer 3 domains. Because all logical rails share the same physical device, the isolation is only logical, not physical. As a result, a hardware or software failure in the switch can impact all rails and their associated GPUs, creating a large failure domain.


This model is not part of NVIDIA’s validated Scalable Unit (SU) architecture but may suit test environments, development clusters, or small-scale GPU fabrics where hardware cost efficiency is a higher priority than strict fault isolation. From a CapEx perspective, multi-rail-per-switch is the most economical, requiring fewer switches. 


Figure 13-10 illustrates the multi-rail-per-switch architecture, where each rail is implemented as a separate VLAN-subnet pair mapped to a subset of switch ports. In the figure, interfaces 1–4 are assigned to subnet 10.0.1.0/28 and VLAN 101, while interfaces 5–8 are mapped to subnet 10.0.2.0/28 and VLAN 102. Each VLAN maintains its own MAC address table, learning GPU NIC MACs through ingress traffic. Although not shown in the figure, the Rail switch acts as the default gateway for all eight VLANs.


The figure also illustrates the BGP process when a Clos architecture with a spine layer is used to connect rail switches. All directly connected subnets are installed into the local Routing Information Base (RIB) as connected routes. These routes are then imported into the BGP Loc-RIB. Next, the routes pass through the BGP output policy engine, where they are aggregated into a single summary route: 10.0.1.0/24. This aggregate is placed into the BGP Adj-RIB-Out. When the BGP Update message is sent to a peer, the NEXT_HOP attribute is set accordingly.
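
The route-aggregation step can be sanity-checked with Python's ipaddress module. The sketch below assumes, purely for illustration, eight /28 rail subnets carved out of 10.0.1.0/24; the figure's exact addressing may differ, but the aggregation logic is the same.

import ipaddress

# Hypothetical rail subnets for illustration: eight /28s inside 10.0.1.0/24.
rail_subnets = [ipaddress.ip_network(f"10.0.1.{i * 16}/28") for i in range(8)]

# The summary route the rail switch would advertise toward the spine layer.
aggregate = ipaddress.ip_network("10.0.1.0/24")

# A connected /28 is covered by the aggregate only if it falls inside it.
for net in rail_subnets:
    print(f"{net}  covered by {aggregate}: {net.subnet_of(aggregate)}")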

Figure 13-10: Multi-Rail per Switch.

Dual-Rail-per-Switch


While dual-rail-per-switch improves manageability and is easier to scale than the multi-rail model, it shares the same limitation: both logical rails reside within a single physical switch, so the failure domain remains large. A single switch failure or misconfiguration affects both rails and all associated GPUs.

This design resembles the dual-rail concept used in scalable AI clusters, but NVIDIA’s SU approach places the two rails on two separate physical switches, one per rail, which provides full physical isolation. Dual-rail-per-switch strikes a middle ground in terms of CapEx and OpEx: fewer switches are required than in the single-rail model, and operational complexity is lower than in the multi-rail model. It is often a good choice for intermediate-scale environments where fault tolerance must be balanced against cost.

Figure 13-11 illustrates a dual-rail-per-switch design, where the switch interfaces are divided evenly between two separate rails. Rail 1 uses interfaces 1 through 16 and is assigned to subnet 10.0.1.0/25 (VLAN 101). Rail 2 uses interfaces 17 through 32 and is assigned to subnet 10.0.1.128/25 (VLAN 102). Each VLAN has its own MAC address table, and the rail switch serves as the default gateway for both. The individual /25 subnets are redistributed into the BGP process and summarized as 10.0.1.0/24 for advertisement toward the spine layer.

Figure 13-11: Dual-Rail Switch.


Single-Rail-per-Switch


This model offers the highest level of physical isolation. Each switch forms a single rail, serving its connected GPU servers through one subnet and one VLAN. No logical separation is needed, as each rail is entirely independent in hardware. As a result, a switch failure affects only the GPU servers attached to that specific rail, yielding a small, predictable failure domain.

The design closely aligns with NVIDIA’s Scalable Unit (SU) architecture, in which each rack or rack group includes its own rail switch, and horizontal scaling is achieved by repeating modular, self-contained units.

While this model demands the highest CapEx due to the one-to-one mapping between switches and rails, it offers major operational advantages. Configuration is simpler, troubleshooting is faster, and the risk of cascading faults is minimized. There is no need for route summarization or custom BGP redistribution logic. Over time, these benefits help drive down OpEx, particularly in large-scale or mission-critical GPU clusters.

To ensure optimal hardware utilization, it is important to align the number of GPU servers per rack with the switch’s port capacity. Otherwise, underutilized ports can lead to inefficiencies in infrastructure cost and resource planning.
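
A quick back-of-the-envelope calculation, as sketched below with assumed numbers, shows how server count, NIC ports per server, and switch port count interact; none of the values are vendor guidance.

# Port planning for a single-rail-per-switch design (illustrative numbers only).
switch_ports = 32          # e.g., a 32-port rail switch
uplink_ports = 0           # assume uplinks use separate ports
gpus_per_server = 8
rail_ports_per_gpu = 1     # one NIC port per GPU on this rail

ports_per_server = gpus_per_server * rail_ports_per_gpu
usable_ports = switch_ports - uplink_ports
servers_per_rail = usable_ports // ports_per_server
stranded_ports = usable_ports - servers_per_rail * ports_per_server

print(f"Servers per rail switch: {servers_per_rail}, unused ports: {stranded_ports}")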

Figure 13-12 illustrates a simplified single-rail-per-switch topology. All interfaces from 1 to 32 operate within a single rail, configured with subnet 10.0.1.0/24 and VLAN 101. The rail switch serves as the default gateway, and because the full /24 subnet is used without subnetting, route summarization is not needed.


Figure 13-12: Single-Rail Switch.


AI Fabric Architecture Conclusion


Figure 13-13 illustrates one way to describe the overall architecture of an AI Fabric. It is divided into three domains. The first domain, called the Segment, includes GPU hosts and Rail switches. The second domain, the Pod, aggregates multiple segments using Spine switches. When NCCL builds a topology in which cross-rail inter-host traffic is first copied to local GPU memory on the destination rail and then sent over that GPU’s NIC to the remote GPU via the corresponding Rail switch, a Pod architecture with Spine switches may not be necessary. The third domain, multi-Pod, interconnects multiple pods using Super Spine switches, enabling large-scale AI Fabric deployments. Figure 13-13 also depicts global settings and properties shared across the AI Fabric backend network.

Segment: GPU I/O Topology and Rail Switch Fabric Profile


GPU I/O Topology: Each GPU connects to the network through a NIC. You can either dedicate a NIC to each GPU or share one NIC among multiple GPUs. NICs may have single, dual, or quad ports and support speeds such as 100, 200, or 400 Gbps. The interconnect type can be InfiniBand, RoCEv2, or NVLink. A segment typically includes multiple hosts.

Rail Switch Fabric Profile: Rail switches connect directly to GPU hosts. Each rail handles a group of NIC ports. You can map rails one-to-one to switches for physical isolation or map multiple rails per switch for logical isolation. In the latter case, two or more rails can be mapped per switch depending on performance and capacity requirements. Rail switches are responsible for ingress packet classification and for mapping RoCEv2 traffic to the correct queues. 

Pod: Spine Switch Profile


Spine switches aggregate multiple Rail switches, forming a Pod that consists of n segments. Spine switches enable cross-rail communication between GPUs. They use high-density, high-speed ports. When the Spine layer is used, the result is a 2-tier, 3-stage architecture.

Multi-Pod: Super Spine Switch Profile


Super Spine switches provide inter-Pod connectivity. They are built with very high port density to support all connected Spine switches. When the Super Spine layer is used, the architecture becomes a 3-tier, 5-stage fabric.
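
The sizing arithmetic behind these domains can be sketched with a few multiplications. The numbers below are illustrative assumptions only, not values taken from the figures.

# Rough sizing of a Pod and a multi-Pod fabric (illustrative numbers only).
gpus_per_host = 8
hosts_per_segment = 4          # limited by rail-switch port capacity
segments_per_pod = 8           # limited by spine port capacity
pods = 4                       # joined by super spines in a multi-pod fabric

gpus_per_segment = gpus_per_host * hosts_per_segment
gpus_per_pod = gpus_per_segment * segments_per_pod
gpus_total = gpus_per_pod * pods

print(f"GPUs per segment: {gpus_per_segment}")
print(f"GPUs per pod (2-tier, 3-stage): {gpus_per_pod}")
print(f"GPUs across pods (3-tier, 5-stage): {gpus_total}")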

Global AI Fabric Profile


All layers are governed by the Global AI Fabric Profile. This profile defines the control plane (eBGP, iBGP, BGP EVPN), the data plane (Ethernet, VXLAN), Layer 3 ECMP strategies (flow-based, flowlet-based, or per-packet), congestion control mechanisms (ECN marking, PFC), inter-switch link monitoring (BFD), and global MTU settings.


Figure 13-13: AI Fabric Architecture Description.