Monday, 18 August 2025

Parallelization Strategies in Neural Networks

From a network engineer’s perspective, it is not mandatory to understand the full functionality of every application running in a datacenter. However, understanding the communication patterns of the most critical applications—such as their packet and flow sizes, entropy, transport frequency, and link utilization—is essential. Additionally, knowing the required transport services, including reliability, in-order packet delivery, and lossless transmission, is important.

In AI fabrics, a neural network, including both its training and inference phases, can be considered an application. For this reason, this section first briefly explains the basic operation of the simplest neural network: the Feedforward Neural Network (FNN). It then discusses the operation of a single neuron. Although a deep understanding of the application itself is not required, this section equips the reader with knowledge of what pieces of information are exchanged between GPUs during each phase and why these data exchanges are important.


Feedforward Neural Network: Forward Pass


Figure 1-7 illustrates a simple four-layer Feedforward Neural Network (FNN) distributed across four GPUs. The two leftmost GPUs reside in Node-1, and the other two GPUs reside in Node-2. The training data is fed into the first layer. In real neural networks, this first layer is the input layer, which simply passes input data unmodified to the neurons in the first hidden layer. To save space, the input layer is excluded in this illustration.


Each neuron in the first hidden layer is connected to all input values, with each connection assigned an initial weight parameter. For example, the first neuron, n1, receives all three inputs, x1 through x3. The neuron performs two main computations. First, it calculates a weighted sum of its inputs by multiplying each input by its associated weight and summing the results. This produces the neuron’s pre-activation value, denoted as z. The pre-activation value is then passed through an activation function such as the Rectified Linear Unit (ReLU). ReLU is computationally efficient due to its simplicity: if the pre-activation z is greater than zero, the output y equals z; otherwise, the output is zero.
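
To make the two steps concrete, the short Python sketch below computes the pre-activation value and applies ReLU for a single neuron. The input and weight values are illustrative only and are not taken from Figure 1-7.

```python
# A minimal sketch of the neuron computation described above: a weighted sum
# followed by a ReLU activation. Values are illustrative placeholders.
import numpy as np

x = np.array([0.5, -1.2, 0.8])   # inputs x1..x3
w = np.array([0.4, 0.1, 0.6])    # weights of neuron n1 (illustrative initial values)
b = 0.0                          # bias term, often included in the weighted sum

z = np.dot(w, x) + b             # pre-activation value
y = max(0.0, z)                  # ReLU: output equals z if z > 0, otherwise 0

print(f"pre-activation z = {z:.3f}, output y = {y:.3f}")
```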


GPU0 stores these outputs in its VRAM. The outputs are then copied via Direct Memory Access (DMA) over the scale-up network to the memory of GPU1, which holds the second hidden layer on the same node (Node-1); the source buffers can then be freed.


The neurons in the second layer compute their outputs using the same process. After the computations in the second layer, the output data is immediately transferred from GPU1 in Node-1 to GPU0 in Node-2 (that is, from the second layer to the third) over the scale-out network using Remote Direct Memory Access (RDMA).


The processing in the final two layers follows the same pattern. The output of the last layer represents the model’s prediction. The training dataset is labeled with the expected results, but the prediction rarely matches the target in the first iteration. Therefore, at the end of the forward pass, the model loss value is computed to measure the prediction error. 


Figure 1-7: The Operation of FNN: Forward Pass.


Feedforward Neural Network: Backward Pass


After computing the model error, the training process moves into the backward pass phase. During the backward pass, the model determines how each weight parameter should be adjusted to improve prediction accuracy.

First, the output error and the neuron’s pre-activation value z are used to compute a neuron-specific delta value (also called the neuron error). Then, this delta value is multiplied by the corresponding input activation from the previous layer to obtain a gradient. The gradient indicates both the magnitude and direction in which the weight should be updated.

To avoid overly large adjustments that could destabilize training, the gradient is scaled by the learning rate parameter. For example, if the gradient suggests increasing a weight by 2.0 and the learning rate is 0.1, the actual adjustment will be 0.1 × 2.0 = 0.2.
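
The following minimal Python sketch walks through the same chain of calculations for one weight: the delta from the upstream error, the gradient from the input activation, and the learning-rate-scaled update. The numeric values are illustrative and chosen so that the gradient works out to 2.0, matching the example above; real frameworks derive these quantities automatically through autograd.

```python
# A hedged sketch of the per-weight update described above (illustrative numbers).
def relu_derivative(z):
    return 1.0 if z > 0 else 0.0

upstream_error   = 0.8    # error signal arriving from the next layer
z                = 1.5    # the neuron's pre-activation value from the forward pass
input_activation = 2.5    # activation received from the previous layer
learning_rate    = 0.1

delta         = upstream_error * relu_derivative(z)   # neuron-specific delta (neuron error)
gradient      = delta * input_activation              # gradient for this particular weight
weight_update = learning_rate * gradient              # scaled adjustment: 0.1 * 2.0 = 0.2

print(delta, gradient, weight_update)                 # 0.8 2.0 0.2 with these numbers
```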

Because the delta value for one layer depends on the delta of the next layer, it must be propagated backward immediately before gradient computation can begin for the previous layer. In distributed training, this means delta values may need to be transferred between GPUs either within the same node (scale-up network) or across nodes (scale-out network), depending on the parallelization strategy.

Whether exchanging deltas or synchronizing gradients, the transfers occur over scale-up or scale-out networks based on GPU placement.


Figure 1-8: Gradient Calculation


Parallelization Strategies


At the time of writing, the world’s largest single-location GPU supercomputer is Colossus in Memphis, Tennessee, built by Elon Musk’s AI startup xAI. It contains over 200,000 GPUs. The Grok-4 large language model (LLM), published in July 2025, was trained on Colossus. The parameter count of Grok‑4 has not been made public, but Grok-1 (released in October 2023) used a Mixture-of-Experts (MoE) architecture with about 314 billion parameters.

Large-scale and hyper-scale GPU clusters such as Colossus require parallel computation and communication to achieve fast training and near real-time inference. Parallelization strategies define how computations and data are distributed across GPUs to maximize efficiency and minimize idle time. Moreover, the chosen parallelization strategy, together with the cluster size and the collective communication topology, largely determines when, with whom, and over which network each GPU communicates.

The main approaches include:

  • Model Parallelism: Splits the neural network layers across multiple GPUs when a single GPU cannot hold the entire model.

  • Tensor Parallelism: Divides the computations of a single layer (e.g., matrix multiplications) across multiple GPUs, improving throughput for large layers.

  • Data Parallelism: Distributes different portions of the training dataset to multiple GPUs, ensuring that all GPUs are actively processing data simultaneously.

  • Pipeline Parallelism: Divides the model into sequential stages across GPUs and processes micro-batches of training data in a staggered fashion, reducing idle time between stages.

  • 3D Parallelism: Combines tensor, pipeline, and data parallelism to scale extremely large models efficiently. Tensor parallelism splits computations within layers, pipeline parallelism splits the model across sequential GPU stages, and data parallelism replicates the model to process different batches simultaneously. Together, they maximize GPU utilization and minimize idle time.

These strategies ensure that AI clusters operate at high efficiency, accelerating training while reducing wasted energy and idle GPU time.


Model Parallelism


Each weight in a neural network is typically stored using 32-bit floating point (FP32) format, which consumes 4 bytes of memory per weight. A floating point number allows representation of real numbers, including very large and very small values, with a decimal point. Large neural networks, with billions of weight parameters, quickly exceed the memory capacity of a single GPU. To reduce the memory load on a single GPU, model parallelism distributes the neural network’s layers (including layer-specific weight matrices and neurons) across multiple GPUs. In this approach, each GPU is responsible for computing the forward and backward pass of the layers it holds. This not only reduces the memory usage on a single GPU but also spreads the computational load across multiple devices.
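
The sketch below gives a rough, back-of-the-envelope view of why this matters: it converts a few illustrative parameter counts into FP32 weight memory at 4 bytes per parameter. The figures cover weights only; as the note below points out, activations, gradients, and optimizer state add to this.

```python
# A rough estimate of weight memory in FP32 (4 bytes per parameter).
# Parameter counts are illustrative, not tied to any specific model.
def weight_memory_gib(num_parameters, bytes_per_param=4):
    return num_parameters * bytes_per_param / 2**30

for params in (7e9, 70e9, 314e9):
    print(f"{params / 1e9:.0f}B parameters ≈ {weight_memory_gib(params):,.0f} GiB of FP32 weights")
```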


Note: During training, additional memory is temporarily required to store the results of activation functions as data passes through the network and the gradients needed to compute weight updates during the backward pass, which can easily double the memory consumption.

Figure 1-9 depicts a simple feedforward neural network with three layers (excluding the input layer for simplicity). The input layer is where the training data’s features are passed into the network; no computations are performed in this layer. The first hidden layer receives the input features and performs computations.

First hidden layer (GPU 1): This layer has three neurons and receives four input features from the input layer. Its weight matrix has a size of 3 × 4. Each row corresponds to a specific neuron (n1, n2, n3), and each element in a row represents the weight for a particular input feature—the first element corresponds to the first input feature, the second to the second, and so on. GPU 1 computes the pre-activation values by multiplying the weight matrix with the input features (matrix multiplication) and then applies the activation function to produce the output of this layer.

Second hidden layer (GPU 2): This layer has two neurons and receives three input features from the outputs of the first hidden layer. Its weight matrix has a size of 2 × 3. Each row corresponds to a specific neuron (n4, n5), and each element in a row represents the weight for a particular input feature from the previous layer. GPU 2 computes the pre-activation values by multiplying the weight matrix with the input features from GPU 1 and then applies the activation function to produce the output of this layer.

Output layer (GPU 3): This layer has one neuron and receives two input features from the second hidden layer. Its weight matrix has a size of 1 × 2. The single row corresponds to the neuron in this layer, and each element represents the weight for a specific input feature from GPU 2. GPU 3 computes the pre-activation value by multiplying the weight matrix with the input features from the second hidden layer and then applies the activation function to produce the final output of the network.
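
A minimal, single-process sketch of this three-stage forward pass is shown below, using the same matrix shapes as Figure 1-9 (3 × 4, 2 × 3, and 1 × 2). The weights and inputs are random placeholders, and the three "GPUs" are simply separate NumPy arrays; a real deployment would place each stage on its own device and move the activations over the scale-up or scale-out network.

```python
# A minimal sketch of the three-stage forward pass in Figure 1-9.
import numpy as np

rng = np.random.default_rng(0)
relu = lambda a: np.maximum(a, 0.0)

x  = rng.normal(size=(4,))      # four input features
W1 = rng.normal(size=(3, 4))    # GPU 1: first hidden layer, 3 neurons x 4 inputs
W2 = rng.normal(size=(2, 3))    # GPU 2: second hidden layer, 2 neurons x 3 inputs
W3 = rng.normal(size=(1, 2))    # GPU 3: output layer, 1 neuron x 2 inputs

h1 = relu(W1 @ x)               # computed on GPU 1, then sent to GPU 2
h2 = relu(W2 @ h1)              # computed on GPU 2, then sent to GPU 3
y  = relu(W3 @ h2)              # final model output on GPU 3

print(h1.shape, h2.shape, y.shape)   # (3,) (2,) (1,)
```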

By assigning each layer to a different GPU, model parallelism enables large neural networks to be trained even when the combined memory requirements of the weight matrices exceed the capacity of a single GPU.

If all GPUs reside on the same node, the activation values during the forward pass are passed to the next layer over the intra-server Scale-Up network using Direct Memory Access (DMA). If the GPUs are located on different nodes, the communication occurs over the Scale-Out Backend network using Remote Direct Memory Access (RDMA). Gradient synchronization during the backward pass follows the same paths: intra-node communication uses the Scale-Up network, and inter-node communication uses the Scale-Out network. 



Figure 1-9: Model Parallelism.

Tensor Parallelism


Model parallelism distributes layers across GPUs. In very large neural networks, even the weight matrices of a single layer may become too large to store on a single GPU or compute efficiently. Tensor parallelism addresses this by splitting a layer’s weight matrix and computation across multiple GPUs. While model parallelism splits layers, tensor parallelism splits within a layer, allowing multiple GPUs to work on the same layer in parallel. Each GPU holds a portion of the layer’s parameters and computes partial outputs. These partial results are then combined to produce the full output of the layer, enabling scaling without exceeding memory or compute limits.

In Figure 1‑10, we have a 4 × 8 weight matrix that is split in half: the first half is assigned to GPU0, and the second half to GPU1. GPU0 belongs to Tensor Parallel Rank 1 (TP Rank 1), and GPU1 belongs to TP Rank 2. GPU0 has two neurons, where neuron n1 is associated with the first row of weights and neuron n2 with the second row. GPU1 works the same way with its portion of the matrix. Both GPUs process the same input feature matrix. On GPU0, neurons n1 and n2 perform matrix multiplication with their weight submatrix and the input feature matrix to produce pre-activation values, which are then passed through the activation function to produce neuron outputs. Because the layer’s weight matrix is distributed across GPUs, each GPU produces only a partial output vector. 

Before passing the results to the next layer, GPUs synchronize these partial output vectors using AllGather collective communication, forming the complete layer output vector that can then be fed into the next layer. If the GPUs reside on the same node, this communication happens over the intra-node Scale-Up network using DMA. If the GPUs are on different nodes, the communication occurs over the inter-node Scale-Out backend network.
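
The hedged sketch below imitates this behavior in a single process: a 4 × 8 weight matrix is split row-wise between two TP ranks, each rank computes a partial output vector, and the AllGather step is modeled as a simple concatenation. The values are random placeholders; a real implementation would use a collective library such as NCCL for the AllGather.

```python
# A single-process sketch of the tensor-parallel split described above.
import numpy as np

rng = np.random.default_rng(1)
relu = lambda a: np.maximum(a, 0.0)

W = rng.normal(size=(4, 8))     # full layer: 4 neurons x 8 input features
x = rng.normal(size=(8,))       # input feature vector (identical on both ranks)

W_rank1, W_rank2 = W[:2], W[2:]          # first two rows on GPU0, last two on GPU1
partial1 = relu(W_rank1 @ x)             # partial output vector on GPU0 (TP Rank 1)
partial2 = relu(W_rank2 @ x)             # partial output vector on GPU1 (TP Rank 2)

full_output = np.concatenate([partial1, partial2])    # AllGather modeled as concatenation
assert np.allclose(full_output, relu(W @ x))          # matches the unsplit layer
```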


Figure 1-10: Tensor Parallelism.


Figure 1‑11 depicts the AllGather collective communication used for synchronizing partial output vectors in tensor parallelism. Recall that the layer’s weight matrix is split across GPUs, with each GPU computing partial outputs for its assigned portion of the matrix. These partial outputs are first stored in the local VRAM of each GPU. The GPUs then exchange these partial results with other GPUs in the same Tensor Parallel Group. For example, GPU0 (TP Rank 1) computes and temporarily stores the outputs of neurons n1 and n2, and synchronizes them with GPU3. After this AllGather operation, all participating GPUs have the complete output vector, which can then be passed to the next layer of their corresponding TP Rank. Once the complete output vectors are passed on, the memory used to store the partial results can be freed.



Figure 1-11: Tensor Parallelism – AllGather Operation for Complete Output Vector.


3D Parallelism


3D parallelism combines model parallelism, tensor parallelism, data parallelism, and pipeline parallelism into a unified training strategy. The first two were described earlier, and this section introduces the remaining two from the 3D parallelism perspective.
At the bottom of Figure 1-12 is a complete input feature matrix with 16 input elements (x1–x16). This matrix is divided into two mini-batches:

  • The first mini-batch (x1–x8) is distributed to GPU0 and GPU1.
  • The second mini-batch (x9–x16) is distributed to GPU4 and GPU5.

This is data parallelism: splitting the dataset into smaller mini-batches so multiple GPUs can process them in parallel, either because the dataset is too large for a single GPU or to speed up training.
Pipeline parallelism further divides each mini-batch into micro-batches, which are processed in a pipeline across the GPUs. In Figure 1-12, the mini-batch (x1–x8) is split into two micro-batches: x1–x4 and x5–x8. These are processed one after the other in pipeline fashion.
In this example, GPU0 and GPU2 belong to TP Rank 1, while GPU1 and GPU3 belong to TP Rank 2. Since they share the same portion of the dataset, TP Ranks 1 and 2 together form a Data Parallel Group (DP Group). The same applies to GPUs 4–7, which form a second DP Group.
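
The small sketch below shows, under these assumptions, how the 16 input elements are carved up: data parallelism yields one mini-batch per DP Group, and pipeline parallelism splits each mini-batch into two micro-batches.

```python
# A small sketch of how the input matrix in Figure 1-12 is divided.
samples = [f"x{i}" for i in range(1, 17)]           # x1 .. x16

# Data parallelism: one mini-batch per Data Parallel Group.
mini_batches = [samples[:8], samples[8:]]           # DP Group 1, DP Group 2

# Pipeline parallelism: each mini-batch becomes two micro-batches.
micro_batches = [[mb[:4], mb[4:]] for mb in mini_batches]

print(micro_batches[0])   # [['x1'..'x4'], ['x5'..'x8']] -> processed in pipeline order
```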

Forward Pass in 3D Parallelism


Training proceeds layer by layer:


  1. First layer: The neurons compute their outputs for the first micro-batch (step 1a). Within each TP Rank, these outputs form partial vectors, which are synchronized across TP Ranks within the same DP Group using collective communication (step 1b). The result is a complete output vector, which is passed to the next layer (step 1c).
  2. Second layer: Neurons receive the complete output vector from the first layer and repeat the process—compute outputs (2a), synchronize partial vectors (2b), and pass the complete vector to the next layer (2c).
  3. Pipeline execution: As soon as GPU0 and GPU1 finish processing the first micro-batch and forward it to the third layer, they can immediately begin processing the second micro-batch (step 3a). At the same time, the GPUs handling layers three and four start processing the outputs received from the second layer.
This overlapping execution means that eventually all eight GPUs are active simultaneously, each processing different micro-batches and layers in parallel. This is the essence of 3D parallelism: maximizing efficiency by distributing memory load and computation while minimizing GPU idle time, which greatly speeds up training.
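
The toy schedule below illustrates this overlap under simplified assumptions (two pipeline stages and two micro-batches): each stage starts working on the next micro-batch as soon as it has handed the previous one forward, so idle slots appear only at the start and end of the pipeline.

```python
# A toy pipeline schedule, assuming 2 stages (layers 1-2 on one GPU pair,
# layers 3-4 on the other) and 2 micro-batches per mini-batch.
stages, micro_batches = 2, 2
steps = stages + micro_batches - 1      # 3 time steps in this toy case

for t in range(steps):
    row = []
    for s in range(stages):
        mb = t - s                      # micro-batch index this stage works on
        row.append(f"mb{mb + 1}" if 0 <= mb < micro_batches else "---")
    print(f"t={t}: " + " | ".join(row))
# Expected output:
# t=0: mb1 | ---
# t=1: mb2 | mb1
# t=2: --- | mb2
```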

During the forward pass, communication between GPUs happens only within a Data Parallel Group.

Figure 1-12: 3D Parallelism Forward pass.

Backward Pass in 3D Parallelism


After the model output is produced and the loss is computed, the training process enters the backward pass. The backward pass calculates the gradients, which serve as the basis for determining how much, and in which direction, the model weights should be adjusted to improve performance. 
The gradient computation begins at the output layer, where the derivative of the loss is used to measure the error contribution of each neuron. These gradients are then propagated backward through the network layer by layer, in the reverse order of the forward pass. Within each layer, gradients are first computed locally on each GPU.

Once the local gradients are available, synchronization takes place. Just as neuron outputs were synchronized during the forward pass between TP Ranks within a Data Parallel (DP) Group, the gradients now need to be synchronized both within each DP Group and across DP Groups. This step ensures that all GPUs hold consistent weight updates before the optimizer applies them. Figure 1-13 illustrates this gradient synchronization process.
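
The sketch below models this synchronization step in plain NumPy under simplified assumptions: two replicas hold different local gradients, an AllReduce (here an element-wise average) gives both the same result, and the identical update is then applied to the weights. Real training jobs perform the AllReduce with a collective library such as NCCL rather than NumPy.

```python
# A hedged sketch of gradient synchronization across data-parallel replicas.
import numpy as np

local_gradients = {
    "dp_group_1": np.array([0.20, -0.50, 0.10]),   # gradients computed by replica 1
    "dp_group_2": np.array([0.40, -0.30, 0.30]),   # gradients computed by replica 2
}

allreduced = np.mean(list(local_gradients.values()), axis=0)
print(allreduced)                         # [ 0.3 -0.4  0.2] -> identical on every GPU

learning_rate = 0.1
weights = np.array([1.0, 1.0, 1.0])
weights -= learning_rate * allreduced     # the same update applied everywhere
```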



Figure 1-13: 3D Parallelism Backward pass.

Summary


In this chapter, the essential building blocks of an AI cluster were described. The different networks that connect the system together were introduced: the scale-out backend network, the scale-up network, as well as the management, storage, and frontend networks.

After that foundation was established, the operation of neural networks was explained. The forward pass through individual neurons was described to show how outputs are produced, and the backward pass was outlined to demonstrate how errors propagate and gradients are computed.

The parallelization strategies—model parallelism, tensor parallelism, pipeline parallelism, and data parallelism—were then presented, and their combination into full 3D parallelism was discussed to illustrate how large GPU clusters can be efficiently utilized.

With this understanding of both the infrastructure and the computational model established, attention can now be turned to Ultra Ethernet, the transport technology used to carry RDMA traffic across the Ethernet-based scale-out backend network that underpins large-scale AI training. 

References


[1] Colossus Supercomputer: https://en.wikipedia.org/wiki/Colossus_(supercomputer)
[2] Inside the 100K GPU xAI Colossus Cluster that Supermicro Helped Build for Elon Musk: https://www.servethehome.com/inside-100000-nvidia-gpu-xai-colossus-cluster-supermicro-helped-build-for-elon-musk/




AI Cluster Networking

Introduction

The Ultra Ethernet Specification v1.0 (UES), created by the Ultra Ethernet Consortium (UEC), defines end-to-end communication practices for Remote Direct Memory Access (RDMA) services in AI and HPC workloads over Ethernet network infrastructure. UES not only specifies a new RDMA-optimized transport layer protocol, Ultra Ethernet Transport (UET), but also defines how the full application stack—from Software through Transport, Network, Link, and Physical—can be adjusted to provide improved RDMA services while continuing to leverage well-established standards. UES includes, but is not limited to, a software API, mechanisms for low-latency and lossless packet delivery, and an end-to-end secure software communication path. 

Before diving into the details of Ultra Ethernet, let’s briefly look at what we are dealing with when we talk about an AI cluster. From this point onward, we focus on Ultra Ethernet from the AI cluster perspective. This chapter first introduces AI cluster networking. Then, it briefly explains how a neural network operates during the training process, including a short introduction to the backpropagation algorithm and its forward and backward pass functionality.

Note: This book doesn’t include any complex mathematical treatment of the backpropagation algorithm or detailed explanations of different neural network types. I have written a book, Deep Learning for Network Engineers, whose first part covers Feedforward Neural Networks (FNN), Convolutional Neural Networks (CNN), and Large Language Models (LLM).


AI Cluster Networking


Scale-Out Backend Network: Inter-Node GPU Communication

Figure 1-1 illustrates a logical view of an AI Training (AIT) cluster consisting of six nodes, each equipped with four GPUs, for a total of 24 GPUs. Each GPU has a dedicated Remote Direct Memory Access-capable Network Interface Card (RDMA-NIC), which at the time of writing typically operates at speeds ranging from 400 to 800 Gbps.


Figure 1-1: Inter-Node GPU Connection: Scale-Out Backend Network.

An RDMA-NIC can directly read from and write to the GPU’s VRAM without involving the host CPU or triggering interrupts. In this sense, RDMA-NICs act as hardware accelerators by offloading data transfer operations from the CPU, reducing latency and freeing up compute resources.

GPUs in the different nodes with the same local rank number are connected to the same rail of the Scale-Out Backend network. For example, all GPUs with a local rank of zero are connected to rail zero, while those with a local rank of one are connected to rail one.
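
A minimal sketch of this rail-alignment rule, assuming the six-node, four-GPU-per-node cluster of Figure 1-1, is shown below: the rail index is simply the GPU's local rank, regardless of which node the GPU resides in.

```python
# A minimal sketch of rail assignment by local rank, assuming six nodes
# with four GPUs each (as in Figure 1-1).
nodes, gpus_per_node = 6, 4

rails = {rail: [] for rail in range(gpus_per_node)}
for node in range(nodes):
    for local_rank in range(gpus_per_node):
        rails[local_rank].append(f"node{node}/gpu{local_rank}")

print(rails[0])   # all local-rank-0 GPUs share rail 0
```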

The Scale-Out Backend network is used for inter-node GPU communication and must support low-latency, lossless RDMA message transport. Its physical topology depends on the scale and scalability requirements of the implementation. A leaf switch may be dedicated to a single rail, or it may support multiple rails by grouping its ports, with each port group mapped one-to-one to a rail. Inter-rail traffic is commonly routed through spine switches. In larger implementations, the network often follows a routed two-tier (3-stage) Clos topology or a pod-based, three-tier (5-stage) topology.

The Scale-Out Backend network is primarily used to transfer the results of neuron activation functions to the next layer during the forward pass and to support collective communication for gradient synchronization during the backward pass. The communication pattern between inter-node GPUs, however, depends on the selected parallelization strategy.

Traffic on the Scale-Out Backend network is highly latency-sensitive, bursty, and low-entropy, consisting of a few long-lived elephant flows. Since full link utilization is common during communication phases, an efficient congestion control mechanism must be implemented.

Note: Scale-Out Backend network topologies are covered in more detail in my Deep Learning for Network Engineers book.


Scale-Up Networks: Intra-Node GPU Communication


Intra-node GPU communication occurs over a high-bandwidth, low-latency scale-up network. Common technologies for intra-node GPU communication include NVIDIA NVLink, NVSwitch, and AMD Infinity Fabric, depending on the GPU vendor and server architecture. Additionally, the Ultra Accelerator (UA) Consortium has introduced the Ultra Accelerator Link (UALink) 200G 1.0 Specification, a standards-based, vendor-neutral solution designed to enable GPU communication over intra-node or pod scale-up networks.



Figure 1-2: Intra-Node GPU Connection: Scale-Up Network.

These interconnects form a Scale-Up communication channel that allows GPUs within the same node to exchange data directly, bypassing the host CPU and system memory. Compared to PCIe-based communication, NVLink and similar solutions provide significantly higher bandwidth and lower latency.

In a typical NVLink topology, GPUs are connected in a mesh or fully connected ring, enabling peer-to-peer data transfers. In systems equipped with NVSwitch, all GPUs within a node are interconnected through a centralized switch fabric, allowing uniform access latency and bandwidth across any GPU pair.

The Scale-Up fabric is used for the same purposes as the Scale-Out Backend network, but it serves as the intra-node GPU communication path. Because communication happens directly over the GPU interconnect, Scale-Up communication is generally much faster and more efficient than inter-node communication over the Scale-Out Backend network.

Note: In-depth discussion of NVLink/NVSwitch topologies and intra-node parallelism strategies can be found in my Deep Learning for Network Engineers book.

Frontend Network: User Inference


The modern Frontend Network in a large-scale AI training cluster is often implemented as a routed Clos fabric designed to provide scalable and reliable connectivity for user access, orchestration, and inference workloads. The primary function of the Frontend Network is to handle user interactions with deployed AI models, serving inference requests.


Figure 1-3: User Inference: Frontend Network.

When multitenancy is required, the modern Frontend Network typically uses BGP EVPN as the control plane and VXLAN as the data plane encapsulation mechanism, enabling virtual network isolation. Data transport is usually based on TCP. Multitenancy also makes it possible to create a secure, isolated network segment for training job initialization, where GPUs join the job and receive the initial model parameters from the master rank.

Unlike the Scale-Out Backend, which connects GPUs across nodes using dedicated RDMA-NIC per GPU, the Frontend Network is accessed via a shared NIC, commonly operating at 100 Gbps. 

Traffic on the Frontend Network is characterized by bursty, irregular communication patterns, dominated by short-lived, high-entropy mouse flows involving many unique IP and port combinations. These flows are moderately sensitive to latency, particularly in interactive inference scenarios. Despite the burstiness, the average link utilization remains relatively low compared to the Scale-Out or Scale-Up fabrics.


Note: An in-depth discussion of BGP EVPN/VXLAN can be found in my book Virtual eXtensible LAN – VXLAN Fabric with BGP EVPN Control-Plane.

Management Network


The Management Network is a dedicated or logically isolated network used for the orchestration, control, and administration of an AI cluster. It provides secure and reliable connectivity between management servers, compute nodes, and auxiliary systems. These auxiliary systems typically include time synchronization servers (NTP/PTP), authentication and authorization services (such as LDAP or Active Directory), license servers, telemetry collectors, remote management interfaces (e.g., IPMI, Redfish), and configuration automation platforms.



Figure 1-4: Cluster Management: Management Network.

Traffic on the Management Network is typically low-bandwidth but highly sensitive, requiring strong security policies, high reliability, and low-latency access to ensure stability and operational continuity. It supports administrative operations such as remote access, configuration changes, service monitoring, and software updates.

To ensure isolation from user, training, and storage traffic, management traffic is usually carried over separate physical interfaces, or logically isolated using VLANs or VRFs. This network is not used for model training data, gradient synchronization, or inference traffic.

Typical use cases include:

  • Cluster Orchestration and Scheduling: Facilitates communication between orchestration systems and compute nodes for job scheduling, resource allocation, and lifecycle management.
  • Job Initialization and Coordination: Handles metadata exchange and service coordination required to bootstrap distributed training jobs and synchronize GPUs across multiple nodes.
  • Firmware and Software Lifecycle Management: Supports remote OS patching, BIOS or firmware upgrades, driver installation, and configuration rollouts.
  • Monitoring and Telemetry Collection: Enables collection of logs, hardware metrics, software health indicators, and real-time alerts to centralized observability platforms.
  • Remote Access and Troubleshooting: Provides secure access for administrators via SSH, IPMI, or Redfish for diagnostics, configuration, or out-of-band management.
  • Security and Segmentation: Ensures that control plane and administrative traffic remain isolated from data plane workloads, maintaining both performance and security boundaries.

The Management Network is typically built with a focus on operational stability and fault tolerance. While bandwidth requirements are modest, low latency and high availability are critical for maintaining cluster health and responsiveness.

Storage Network


The Storage Network connects compute nodes, including GPUs, to the underlying storage infrastructure that holds training datasets, model checkpoints, and inference data.

Figure 1-5: Data Access: Storage Network.

Key use cases include:

  • High-Performance Data Access: Streaming large datasets from distributed or centralized storage systems (e.g., NAS, SAN, or parallel file systems such as Lustre or GPFS) to GPUs during training.
  • Data Preprocessing and Caching: Supporting intermediate caching layers and fast read/write access for preprocessing pipelines that prepare training data.
  • Shared Storage for Distributed Training: Providing a consistent and accessible file system view across multiple nodes to facilitate synchronization and checkpointing.
  • Model Deployment and Inference: Delivering trained model files to inference services and storing input/output data for auditing or analysis.

Due to the high volume and throughput requirements of training data access, the Storage Network is typically designed for high bandwidth, low latency, and scalability. It may leverage protocols such as NVMe over Fabrics (NVMe-oF), Fibre Channel, or high-speed Ethernet with RDMA support.

Summary of AI Cluster Networks

Scale-Out Backend Network

The Scale-Out Backend network connects GPUs across multiple nodes for inter-node GPU communication. It supports low-latency, lossless RDMA message transport essential for synchronizing gradients and transferring neuron activation results during training. Its topology varies by scale, typically mapping each rail to a leaf switch with inter-rail traffic routed through spine switches. Larger deployments often use routed two-tier (3-stage) or pod-based three-tier (5-stage) Clos architectures.

Scale-Up Network

The Scale-Up network provides high-speed intra-node communication between GPUs within the same server. It typically uses NVLink, NVSwitch, or PCIe fabrics enabling direct, low-latency, high-bandwidth access to GPU VRAM across GPUs in a single node. This network accelerates collective operations and data sharing during training and reduces CPU involvement.

Frontend Network

The Frontend network serves as the user access and orchestration interface in the AI datacenter. Implemented as a routed Clos fabric, it handles inference requests. When multitenancy is required, it leverages BGP EVPN for the control plane and VXLAN for data plane encapsulation. This network uses TCP transport, typically operates at 100 Gbps, and connects nodes through shared NICs.

Management Network

The Management Network is a dedicated or logically isolated network responsible for cluster orchestration, control, and administration. It connects management servers, compute nodes, and auxiliary systems such as time synchronization servers, authentication services, license servers, telemetry collectors, and remote console interfaces. This network supports low-bandwidth but latency-sensitive traffic like job initialization, monitoring, remote access, and software updates, typically isolated via VRFs or VLANs for security.

Storage Network

The Storage Network links compute nodes to storage systems housing training datasets, model checkpoints, and inference data. It supports high-performance data streaming, data preprocessing, shared distributed storage access, and model deployment. Designed for high bandwidth, low latency, and scalability, it commonly uses protocols such as NVMe over Fabrics (NVMe-oF), Fibre Channel, or RDMA-enabled Ethernet.


Figure 1-6: A Logical view of 6x4 GPU Cluster.

Monday, 7 July 2025

Ultra Ethernet

Introduction

Remote Direct Memory Access over Converged Ethernet (RoCE) is a transport model that extends InfiniBand semantics over Ethernet networks. It enables direct memory access between hosts by encapsulating InfiniBand transport headers—such as the InfiniBand Transport Header (IBTH) and the RDMA Extended Transport Header (RETH)—within Ethernet, IP, and UDP packets. Chapter 9 of my book Deep Learning for Network Engineers describes how RDMA NICs process application work requests, known as InfiniBand verbs, and how these are encoded into IBTH and RETH headers for delivery to remote targets using RoCEv2.

This post shifts focus to the Ultra Ethernet Transport (UET) model, developed by the Ultra Ethernet Consortium (UEC). UET defines an alternative RDMA transport architecture that operates over standard Ethernet networks, without relying on InfiniBand message formats or semantics. While both RoCEv2 and UET enable remote memory access between nodes, UET is not based on InfiniBand transport headers, and the term RoCE is not used in UET systems.

Instead, UET introduces a new Ultra Ethernet (UE) layer composed of several sublayers, including the Semantic Sublayer (SES) and the Packet Delivery Sublayer (PDS). These sublayers are responsible for encoding and transmitting RDMA operations—such as memory addresses, remote keys (RKEYs), operation codes, and completion signaling—over Ethernet. This contrasts with RoCEv2, where such RDMA information is carried within IBTH and RETH headers.

In this chapter, we explore how UET transports data across the Scale-Out Network—a role comparable to the “Backend Network” in RoCEv2-based systems. We will examine how UET supports GPU-to-GPU communication at data center scale, and how its design differs in terms of packet structure, connection setup, and flow control when compared to InfiniBand-based approaches.

This chapter is based on the Ultra Ethernet Specification v1.0, published on June 11, 2025, by the Ultra Ethernet Consortium.

Figure 14-1 depicts a small, two-node parallel computing system, where each node contains two GPUs and a UE-capable NIC per GPU. These nodes, along with the Scale-Out Backend Network (Switching Infrastructure), form a cluster. The UEC specification uses the term Fabric Interface (FI) to refer to a UE-capable NIC. In Figure 14-1, all four FIs, together with the leaf and spine switches in the Scale-Out Backend Network, make up the fabric. In this context, the fabric includes all network components: UE-NICs, switching infrastructure, inter-switch links, cabling, and transceivers. One-way delay in the Scale-Out Backend Network should be less than 10 µs.

The Scale-Up Network refers to intra-node GPU communication using NVLink and PCIe. Scale-Up Networks can also include short-range interconnects that use a high-speed, low-latency, single-tier (non-CLOS topology) switch—enabling tightly coupled multi-node systems that retain scale-up characteristics. Latency for Scale-Up network should be < 1 µs.

A Fabric Endpoint (FEP) is a logical entity that terminates the UET protocol and is identified by a unique Fabric Address (FA). The UET protocol consists of three key sublayers—the Semantic (SES), Packet Delivery (PDS), and Transport Security (TSS) sublayers—analogous to how a VXLAN Tunnel Endpoint (VTEP) terminates the VXLAN data plane. These sublayers will be discussed in detail in upcoming chapters.

The Packet Delivery Context (PDC), a component of the FEP, is a logical construct responsible for unidirectional packet delivery between two FEPs. The Congestion Control Context (CCC), in turn, manages the transmission rate of traffic exchanged over a given PDC.

The term port refers to the physical port of the UE-capable NIC that connects to a fabric switch. Since multiple FEPs—each assigned a unique FA—may exist on a single FI, while the FI has only one (or two, if dual-homed) physical ports, the FAs of all FEPs on the same FI typically share the same MAC address.

A Fabric Plane is a communication path between two or more Fabric Endpoints (FEPs). In Figure 14-1, two Fabric Planes are shown. Fabric Plane 1 connects FEP 111 and FEP 122 through interfaces E1 and E2 on Leaf Switch 101. Fabric Plane 2 connects FEP 211 and FEP 222 via interfaces E1 and E2 on Leaf Switch 102.

In a RoCEv2-based solution, the Scale-Out Network is referred to as the Backend Network, and the communication paths between processes are called Rails.

The Fabric Interface exposes FEPs to parallel computing processes, which are identified by Process IDs (PIDs). A Virtual Address Space (VAS) represents the memory space allocated and registered to a specific process and is identified by a Process Address Space Identifier (PASID).

Figure 14-1: Ultra Ethernet Terminology.

 

Thursday, 15 May 2025

Deep Learning for Network Engineers: Understanding Traffic Patterns and Network Requirements in the AI Data Center

 

About This Book

Several excellent books have been published over the past decade on Deep Learning (DL) and Datacenter Networking. However, I have not found a book that covers these topics together—as an integrated deep learning training system—while also highlighting the architecture of the datacenter network, especially the backend network, and the demands it must meet.

This book aims to bridge that gap by offering insights into how Deep Learning workloads interact with and influence datacenter network design.

So, what is Deep Learning?

Deep Learning is a subfield of Machine Learning (ML), which itself is a part of the broader concept of Artificial Intelligence (AI). Unlike traditional software systems where machines follow explicitly programmed instructions, Deep Learning enables machines to learn from data without manual rule-setting.

At its core, Deep Learning is about training artificial neural networks. These networks are mathematical models composed of layers of artificial neurons. Different types of networks suit different tasks—Convolutional Neural Networks (CNNs) for image recognition, and Large Language Models (LLMs) for natural language processing, to name a few.

Training a neural network involves feeding it labeled data and adjusting its internal parameters through a process called backpropagation. During the forward pass, the model makes a prediction based on its current parameters. This prediction is then compared to the correct label to calculate an error. In the backward pass, the model uses this error to update its parameters, gradually improving its predictions. Repeating this process over many iterations allows the model to learn from the data and make increasingly accurate predictions.

Why should network engineers care?

Modern Deep Learning models can be extremely large, often exceeding the memory capacity of a single GPU or CPU. In these cases, training must be distributed across multiple processors. This introduces the need for high-speed communication between GPUs—both within a single server (intra-node) and across multiple servers (inter-node).

Intra-node GPU communication typically relies on high-speed interconnects like NVLink, with Direct Memory Access (DMA) operations enabling efficient data transfers between GPUs. Inter-node communication, however, depends on the backend network, either InfiniBand- or Ethernet-based. Synchronization of model parameters across GPUs places strict requirements on the network: high throughput, ultra-low latency, and zero packet loss. Achieving this in an Ethernet fabric is challenging but possible.

This is where datacenter networking meets Deep Learning. Understanding how GPUs communicate and what the network must deliver is essential for designing effective AI data center infrastructures.




What this book is—and isn’t


This book provides a theoretical and conceptual overview. It is not a configuration or implementation guide, although some configuration examples are included to support key concepts. Since the focus is on the Deep Learning process, not on interacting with or managing the model, there are no chapters covering frontend or management networks. The storage network is also outside the scope. The focus is strictly on the backend network.
The goal is to help readers—especially network professionals—grasp the “big picture” of how Deep Learning impacts data center networking.

One final note

In all my previous books, I’ve used font size 10 and single line spacing. For this book, I’ve increased the font size to 11 and the line spacing to 1.15. This wasn’t to add more pages but to make the reading experience more comfortable. I’ve also tried to ensure that figures and their explanations appear on the same page, which occasionally results in some white space.
I hope you find this book helpful and engaging as you explore the fascinating intersection of Deep Learning and Datacenter Networking.

How this book is organized


Part I – Chapters 1-8: Deep Learning and Deep Neural Networks


This part of the book lays the theoretical foundation for understanding how modern AI models are built and trained. It introduces the structure and purpose of artificial neurons and gradually builds up to complete deep learning architectures and parallel training methods.

Artificial Neurons and Feedforward Networks (Chapters 1 - 3)

The journey begins with the artificial neuron, also known as a perceptron, which is the smallest functional unit of a neural network. It operates in two key steps: performing a matrix multiplication between inputs and weights, followed by applying a non-linear activation function to provide an output. 
By connecting many neurons across layers, we form a Feedforward Neural Network (FNN). FNNs are ideal for basic classification and regression tasks and provide the stepping stone to more advanced architectures.

Specialized Architectures: CNNs, RNNs, and Transformers  (Chapters 3 - 9)

After covering FNNs, this part dives into models designed for specific data types:
  • Convolutional Neural Networks (CNNs): Optimized for spatial data like images, CNNs use filters to extract local features such as edges, textures, and shapes, while keeping the model size efficient through weight sharing.
  • Recurrent Neural Networks (RNNs): Designed for sequential data like text and time series, RNNs maintain a hidden state that captures previous input history. This allows them to model temporal dependencies and context across sequences.
  • Transformer-based Large Language Models (LLMs): Unlike RNNs, Transformers use self-attention mechanisms to weigh relationships between all tokens in a sequence simultaneously. This architecture underpins state-of-the-art language models and enables scaling to billions of parameters.

Parallel Training and Scaling Deep Learning  (Chapter 8)

As models and datasets grow, training them on a single GPU becomes impractical. This section explores the three major forms of distributed training:


  • Data Parallelism: Each GPU holds a replica of the model but processes different mini-batches of input data. Gradients are synchronized at the end of each iteration to keep weights aligned.
  • Pipeline Parallelism: The model is split across multiple GPUs, with each GPU handling one stage of the forward and backward pass. Micro-batches are used to keep the pipeline full and maximize utilization.
  • Tensor (Model) Parallelism: Very large model layers are broken into smaller slices, and each GPU computes part of the matrix operations. This approach enables the training of ultra-large models that don't fit into a single GPU's memory.

Part II – Chapters 9 – 14: AI Data Center Networking


This part of the book focuses on the network technologies that enable distributed training at scale in modern AI data centers. It begins with an overview of GPU-to-GPU memory transfer mechanisms over Ethernet and then moves on to congestion control, load balancing strategies, network topologies, and GPU communication collectives.

RoCEv2 and GPU-to-GPU Transfers  (Chapter 9)

The section starts by explaining how Direct Memory Access (DMA) is used to copy data between GPUs across Ethernet using RoCEv2 (RDMA over Converged Ethernet version 2). This method allows GPUs located in different servers to exchange large volumes of data without CPU involvement.

DCQCN: Data Center Quantized Congestion Notification  (Chapters 10 - 11)

RoCEv2’s performance depends on a lossless transport layer, which makes congestion management essential. To address this, DCQCN provides an advanced congestion control mechanism. It dynamically adjusts traffic flow based on real-time feedback from the network to minimize latency and packet loss during GPU-to-GPU communication.


  • Explicit Congestion Notification (ECN): Network switches mark packets instead of dropping them when congestion builds. These marks trigger rate adjustments at the sender to prevent overload.
  • Priority-based Flow Control (PFC): PFC ensures that traffic classes like RoCEv2 can pause independently, preventing buffer overflows without stalling the entire link.

Load Balancing Techniques in AI Traffic  (Chapter 12)

In addition to congestion control, effective load distribution is critical for sustaining GPU throughput during collective communication. This section introduces several techniques used in modern data center fabrics:


  • Flow-based Load Balancing: Assigns entire flows or flowlets to paths based on real-time link usage or hash-based distribution, improving path diversity and utilization.
  • Flowlet Switching: Divides a flow into smaller time-separated bursts ("flowlets") that can be load-balanced independently without reordering issues.
  • Packet Spraying: Distributes packets belonging to the same flow across multiple available paths, helping to avoid link-level bottlenecks.

AI Data Center Network Topologies (Chapter 13)

Next, the section discusses design choices in the East-West fabric—the internal network connecting GPU servers. It introduces topologies such as:

  • Top-of-Rack (ToR): Traditional rack-level switching used to connect servers within a rack.
  • Rail and Rail-Optimized Designs: High-throughput topologies tailored for parallel GPU clusters. These layouts improve resiliency and throughput, especially during bursty communication phases in training jobs.

GPU-to-GPU Communication  (Chapter 14)

The part concludes with a practical look at collective communication patterns used to synchronize GPUs across the network. These collectives are essential for distributed training workloads:


  • AllReduce: Each GPU contributes and receives a complete, aggregated copy of the data. Internally, this is implemented in two phases:
    • ReduceScatter: GPUs exchange partial results and compute a portion of the final sum.
    • AllGather: Each GPU shares its computed segment so that every GPU receives the complete aggregated result.
  • Broadcast: A single GPU (often rank 0) sends data—such as communication identifiers or job-level metadata—to all other GPUs at the start of a training job.

Target Audience


I wrote this book for professionals working in the data center networking domain—whether in architectural, design, or specialist roles. It is especially intended for those who are already involved in, or are preparing to work with, the unique demands of AI-driven data centers. As AI workloads reshape infrastructure requirements, this book aims to provide the technical grounding needed to understand both the deep learning models and the networking systems that support them.

Back Cover Text


Deep Learning for Network Engineers bridges the gap between AI theory and modern data center network infrastructure. This book offers a technical foundation for network professionals who want to understand how Deep Neural Networks (DNNs) operate—and how GPU clusters communicate at scale.

Part I (Chapters 1–8) explains the mathematical and architectural principles of deep learning. It begins with the building blocks of artificial neurons and activation functions, and then introduces Feedforward Neural Networks (FNNs) for basic pattern recognition, Convolutional Neural Networks (CNNs) for more advanced image recognition, Recurrent Neural Networks (RNNs) for sequential and time-series prediction, and Transformers for large-scale language modeling using self-attention. The final chapters present parallel training strategies used when models or datasets no longer fit into a single GPU. In data parallelism, the training dataset is divided across GPUs, each processing different mini-batches using identical model replicas. Pipeline parallelism segments the model into sequential stages distributed across GPUs. Tensor (or model) parallelism further divides large model layers across GPUs when a single layer no longer fits into memory. These approaches enable training jobs to scale efficiently across large GPU clusters.

Part II (Chapters 9–14) focuses on the networking technologies and fabric designs that support distributed AI workloads in modern data centers. It explains how RoCEv2 enables direct GPU-to-GPU memory transfers over Ethernet, and how congestion control mechanisms like DCQCN, ECN, and PFC ensure lossless high-speed transport. You’ll also learn about AI-specific load balancing techniques, including flow-based, flowlet-based, and per-packet spraying, which help avoid bottlenecks and keep GPU throughput high. Later chapters examine GPU collectives such as AllReduce—used to synchronize model parameters across all workers—alongside ReduceScatter and AllGather operations. The book concludes with a look at rail-optimized topologies that keep multi-rack GPU clusters efficient and resilient.

This book is not a configuration or deployment guide. Instead, it equips you with the theory and technical context needed to begin deeper study or participate in cross-disciplinary conversations with AI engineers and systems designers. Architectural diagrams and practical examples clarify complex processes—without diving into implementation details.

Readers are expected to be familiar with routed Clos fabrics, BGP EVPN control planes, and VXLAN data planes. These technologies are assumed knowledge and are not covered in the book.

Whether you're designing next-generation GPU clusters or simply trying to understand what happens inside them, this book provides the missing link between AI workloads and network architecture.

Sunday, 4 May 2025

AI for Network Engineers: Rail Designs in GPU Fabric

 When building a scalable, resilient GPU network fabric, the design of the rail layer, the portion of the topology that interconnects GPU servers via Top-of-Rack (ToR) switches, plays a critical role. This section explores three different models: Multi-rail-per-switch, Dual-rail-per-switch, and Single-rail-per-switch. All three support dual-NIC-per-GPU designs, allowing each GPU to connect redundantly to two separate switches, thereby removing the Rail switch as a single point of failure.


Multi-Rail-per-Switch

In this model, multiple small subnets and VLANs are configured per switch, with each logical rail mapped to a subset of physical interfaces. For example, a single 48-port switch might host four or eight logical rails using distinct Layer 2 and Layer 3 domains. Because all logical rails share the same physical device, isolation is logical. As a result, a hardware or software failure in the switch can impact all rails and their associated GPUs, creating a large failure domain.


This model is not part of NVIDIA’s validated Scalable Unit (SU) architecture but may suit test environments, development clusters, or small-scale GPU fabrics where hardware cost efficiency is a higher priority than strict fault isolation. From a CapEx perspective, multi-rail-per-switch is the most economical, requiring fewer switches. 


Figure 13-10 illustrates the multi-rail-per-switch architecture, where each rail is implemented as a separate VLAN-subnet pair mapped to a subset of switch ports. In the figure, interfaces 1–4 are assigned to subnet 10.0.1.0/28 and VLAN 101, while interfaces 5–8 are mapped to subnet 10.0.1.16/28 and VLAN 102. Each VLAN maintains its own MAC address table, learning GPU NIC MACs through ingress traffic. Although not shown in the figure, the Rail switch acts as the default gateway for all eight VLANs.


The figure also illustrates the BGP process when a Clos architecture with a spine layer is used to connect rail switches. All directly connected subnets are installed into the local Routing Information Base (RIB) as connected routes. These routes are then imported into the BGP Loc-RIB. Next, the routes pass through the BGP output policy engine, where they are aggregated into a single summary route: 10.0.1.0/24. This aggregate is placed into the BGP Adj-RIB-Out. When the BGP Update message is sent to a peer, the NEXT_HOP attribute is set accordingly.
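
As a small illustration of the summarization step, the sketch below uses Python's ipaddress module and assumes the per-rail /28 subnets are consecutive blocks carved out of 10.0.1.0/24; it is an addressing illustration, not a router configuration.

```python
# A small sketch of the route summarization idea, assuming eight per-rail /28
# subnets carved consecutively out of 10.0.1.0/24 (illustrative addressing).
import ipaddress

aggregate = ipaddress.ip_network("10.0.1.0/24")
rail_subnets = list(aggregate.subnets(new_prefix=28))[:8]   # 10.0.1.0/28 .. 10.0.1.112/28

for vlan, subnet in enumerate(rail_subnets, start=101):
    assert subnet.subnet_of(aggregate)                      # each rail falls within the summary
    print(f"VLAN {vlan}: {subnet} -> advertised as {aggregate}")
```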

Figure 13-10: Multi-Rail per Switch.

Dual-Rail-per-Switch


While dual-rail-per-switch improves manageability and is easier to scale, it shares the same limitation as the multi-rail design: both logical rails reside within a single physical switch, so the failure domain remains large. A single switch failure or misconfiguration affects both rails and all associated GPUs.

This design resembles the dual-rail concept used in scalable AI clusters, but NVIDIA’s SU approach calls for two separate physical switches per rail, which provides full physical isolation. Dual-rail-per-switch hits a middle ground in terms of CapEx and OpEx: fewer switches are required than in the single-rail model, and operational complexity is reduced compared to multi-rail. It’s often a good choice for intermediate-scale environments where some fault tolerance and cost control must be balanced. 

Figure 13-11 illustrates a dual-rail-per-switch design, where the switch interfaces are divided evenly between two separate rails. Rail 1 uses interfaces 1 through 16 and is assigned to subnet 10.0.1.0/25 (VLAN 101). Rail 2 uses interfaces 17 through 32 and is assigned to subnet 10.0.1.128/25 (VLAN 102). Each VLAN has its own MAC address table, and the rail switch serves as the default gateway for both. The individual /25 subnets are redistributed into the BGP process and summarized as 10.0.1.0/24 for advertisement toward the spine layer.

Figure 13-11: Dual-Rail Switch.


Single-Rail-per-Switch


This model offers the highest level of physical isolation. Each switch forms a single rail, serving its connected GPU servers through one subnet and one VLAN. No logical separation is needed, as each rail is entirely independent in hardware. As a result, a switch failure affects only the GPU servers attached to that specific rail, yielding a small, predictable failure domain.

The design closely aligns with NVIDIA’s Scalable Unit (SU) architecture, in which each rack or rack group includes its own rail switch, and horizontal scaling is achieved by repeating modular, self-contained units.

While this model demands the highest CapEx, due to the one-to-one mapping between switches and rails, it offers major operational advantages. Configuration is simpler, troubleshooting is faster, and the risk of cascading faults is minimized. There is no need for route summarization, or custom BGP redistribution logic. Over time, these benefits help drive down OpEx, particularly in large-scale or mission-critical GPU clusters.

To ensure optimal hardware utilization, it is important to align the number of GPU servers per rack with the switch’s port capacity. Otherwise, underutilized ports can lead to inefficiencies in infrastructure cost and resource planning.

Figure 13-12 illustrates a simplified single-rail-per-switch topology. All interfaces from 1 to 32 operate within a single rail, configured with subnet 10.0.1.0/24 and VLAN 101. The rail switch serves as the default gateway, and because the full /24 subnet is used without subnetting, route summarization is not needed.


Figure 13-12: Single-Rail Switch.


AI Fabric Architecture Conclusion


Figure 13-13 illustrates one way to describe the overall architecture of an AI Fabric. It is divided into three domains. The first domain, called the Segment, includes GPU hosts and Rail switches. The second domain, the Pod, aggregates multiple segments using Spine switches. In cases where NCCL builds a topology in which cross-rail inter-host traffic is first copied to the local GPU memory (located on the destination rail) and then sent over the GPU NIC to the remote GPU via the correct Rail switch, a Pod architecture with Spine switches may not be necessary. The third domain, multi-Pod, interconnects multiple pods using Super Spine switches, enabling large-scale AI Fabric deployments. Figure 13-13 also depicts global settings and properties shared across the AI Fabric backend network.

Segment: GPU I/O Topology and Rail Switch Fabric Profile


GPU I/O Topology: Each GPU connects to the network through a NIC. You can either dedicate a NIC to each GPU or share one NIC among multiple GPUs. NICs may have single, dual, or quad ports and support speeds such as 100, 200, or 400 Gbps. The interconnect type can be InfiniBand, RoCEv2, or NVLink. A segment typically includes multiple hosts.

Rail Switch Fabric Profile: Rail switches connect directly to GPU hosts. Each rail handles a group of NIC ports. You can map rails one-to-one to switches for physical isolation or map multiple rails per switch for logical isolation. In the latter case, two or more rails can be mapped per switch depending on performance and capacity requirements. Rail switches are responsible for ingress packet classification and for mapping RoCEv2 traffic to the correct queues. 

Pod: Spine Switch Profile


Spine switches aggregate multiple Rail switches, forming a Pod that consists of n segments. Spine switches enable cross-rail communication between GPUs. They use high-density, high-speed ports. When the Spine layer is used, the result is a 2-tier, 3-stage architecture.

Multi-Pod: Super Spine Switch Profile


Super Spine switches provide inter-Pod connectivity. They are built with very high port density to support all connected Spine switches. When the Super Spine layer is used, the architecture becomes a 3-tier, 5-stage fabric.

Global AI Fabric Profile


All layers are governed by the Global AI Fabric Profile. This profile defines the control plane (eBGP, iBGP, BGP EVPN), the data plane (Ethernet, VXLAN), Layer 3 ECMP strategies (flow-based, flowlet-based, or per-packet), congestion control mechanisms (ECN marking, PFC), inter-switch link monitoring (BFD), and global MTU settings.
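To make the profile concrete, the sketch below expresses the same global parameters as a simple Python data structure. The field names and values are illustrative assumptions only, not a vendor schema:

# Illustrative only: one hypothetical way to model a Global AI Fabric Profile.
global_fabric_profile = {
    "control_plane": "eBGP",            # alternatives: iBGP, BGP EVPN
    "data_plane": "Ethernet",           # alternative: VXLAN
    "ecmp_strategy": "flowlet",         # alternatives: flow-based, per-packet
    "congestion_control": {
        "ecn_marking": True,            # ECN marking for DCQCN-style congestion control
        "pfc": True,                    # Priority Flow Control for lossless RoCEv2 queues
    },
    "link_monitoring": "BFD",           # inter-switch link failure detection
    "mtu": 9216,                        # assumed jumbo-frame MTU for the backend fabric
}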


Figure 13-13: AI fabric Architecture Description.

Sunday, 27 April 2025

Backend Network Topologies for AI Fabrics

Although there are best practices for AI Fabric backend networks, such as Data Center Quantized Congestion Notification (DCQCN) for congestion avoidance, rail-optimized routed Clos fabrics, and Layer 2 Rail-Only topologies for small-scale implementations, each vendor offers its own validated design. This approach is beneficial because validated designs are thoroughly tested, and when you build your system based on the vendor’s recommendations, you receive full vendor support and avoid having to reinvent the wheel.

However, instead of focusing on any specific vendor’s design, this chapter explains general design principles for building a resilient, non-blocking, and lossless Ethernet backend network for AI workloads.

Before diving into backend network design, this chapter first provides a high-level overview of a GPU server based on NVIDIA H100 GPUs. The first section introduces a shared NIC architecture, where the eight GPUs share NIC capacity. The second section covers an architecture where each of the 8 GPUs has a dedicated NIC.


Shared NIC


Figure 13-1 illustrates a shared NIC approach. In this example setup, NVIDIA H100 GPUs 0–3 are connected to NVSwitch chips 1-1, 1-2, 1-3, and 1-4 on baseboard-1, while GPUs 4–7 are connected to NVSwitch chips 2-1, 2-2, 2-3, and 2-4 on baseboard-2. Each GPU connects to all four NVSwitch chips on its respective baseboard using a total of 18 NVLink 4 connections: 5 links to chip 1-1, 4 links to chip 1-2, 4 links to chip 1-3, and 5 links to chip 1-4.

The NVSwitch chips themselves are paired between the two baseboards. For example, chip 1-1 on baseboard-1 connects to chip 2-1 on baseboard-2 with four NVLink connections, chip 1-2 connects to chip 2-2, and so on. This design forms a fully connected crossbar topology across the entire system.

Thanks to this balanced pairing, GPU-to-GPU communication is very efficient whether the GPUs are located on the same baseboard or on different baseboards. Each GPU can achieve up to 900 GB/s of total GPU-to-GPU bandwidth at full NVLink 4 speed.
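The 900 GB/s figure follows directly from the link count: NVLink 4 provides roughly 25 GB/s per direction per link (50 GB/s bidirectional), and each GPU has 18 links. A quick arithmetic sketch, assuming these commonly quoted per-link rates:

# NVLink 4 per-GPU bandwidth estimate (assumed per-link rates).
links_per_gpu = 18            # 5 + 4 + 4 + 5 links across the four NVSwitch chips
per_link_bidir_gb_s = 50      # GB/s, both directions combined
print(links_per_gpu * per_link_bidir_gb_s)   # 900 GB/s total GPU-to-GPU bandwidth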

For inter-server GPU communication, the GPUs are also connected to a shared NVIDIA ConnectX-7 200 GbE NIC through a PEX89144 PCIe Gen5 switch. Each GPU has a dedicated PCIe Gen5 x16 link to the switch, providing up to roughly 64 GB/s of bandwidth in each direction between the GPU and the switch. The ConnectX-7 (200 Gbps) NIC is also connected to the same PCIe switch, enabling high-speed data transfers between remote GPUs and the NIC through the PCIe fabric.

While each GPU benefits from a high-bandwidth, low-latency PCIe connection to the switch, the NIC itself has a maximum network bandwidth of 200 GbE, which corresponds to roughly 25 GB/s. Therefore, the PCIe switch is not a bottleneck; instead, the NIC’s available bandwidth must be shared among all eight GPUs. In scenarios where multiple GPUs are sending or receiving data simultaneously, the NIC becomes the limiting factor, and the bandwidth is divided between the GPUs.
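A back-of-the-envelope calculation shows why the NIC, not the PCIe switch, is the limiting factor in this shared design. The sketch below uses the figures quoted above and assumes the worst case where all eight GPUs transmit at the same time:

# Shared ConnectX-7 NIC: 200 Gbps of Ethernet bandwidth shared by eight GPUs.
nic_gbps = 200
nic_gb_s = nic_gbps / 8            # ~25 GB/s of usable network bandwidth
pcie_gb_s_per_gpu = 64             # approximate per-direction PCIe Gen5 x16 bandwidth

gpus = 8
worst_case_per_gpu = nic_gb_s / gpus
print(f"NIC: {nic_gb_s} GB/s total, ~{worst_case_per_gpu:.1f} GB/s per GPU if all 8 send at once")
# Each GPU's PCIe link (~64 GB/s) far exceeds its worst-case NIC share (~3.1 GB/s).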

In real-world AI workloads, however, GPUs rarely saturate both the PCIe interface and the NIC at the same time. Data transfers between the GPUs and the NIC are often bursty and asynchronous, depending on the training or inference pipeline stage. For example, during deep learning training, large gradients might be exchanged periodically, but not every GPU constantly sends data at full speed. Additionally, many optimizations like gradient compression, pipeline parallelism, and overlapping computation with communication further reduce the likelihood of sustained full-speed congestion.

As a result, even though the NIC bandwidth must be shared, the shared ConnectX-7 design generally provides sufficient network performance for typical AI workloads without significantly impacting training or inference times.

In high-performance environments, such as large-scale training workloads or GPU communication across nodes, this shared setup can become a bottleneck. Latency may increase under load, and data transfer speeds can slow down. 

Despite these challenges, the design is still useful in many cases. It is well-suited for development environments, smaller models, or setups where cost is a primary concern. If the workload does not require maximum GPU-to-network performance, sharing a NIC across GPUs can be a reasonable and efficient solution. However, for optimal performance and full support for technologies like GPUDirect RDMA, it is better to use a dedicated NIC for each GPU. 

Figure 13-1: Shared NIC GPU Server.

NIC per GPU


Figure 13-2 builds on the shared NIC design from Figure 13-1 but takes a different approach. In this setup, each GPU has its own dedicated ConnectX-7 200 GbE NIC. All NICs are connected to the PCIe Gen5 switch, just like in the earlier setup, but now each GPU uses its own PCIe Gen5 x16 connection to a dedicated NIC. This design eliminates the need for NIC sharing and allows every GPU to use the full 64 GB/s PCIe bandwidth independently.

The biggest advantage of this design is in GPU-to-NIC communication. There is no bandwidth contention at the PCIe level, and each GPU can fully utilize RDMA and GPUDirect features with its own NIC. This setup improves network throughput and reduces latency, especially in multi-node training workloads where GPUs frequently send and receive large amounts of data over Ethernet. 
The main drawback of this setup is cost. Adding one NIC per GPU increases both hardware costs and power consumption. It also requires more switch ports and cabling, which may affect system design. Still, these trade-offs are often acceptable in performance-critical environments.

This overall design reflects NVIDIA’s DGX and HGX architecture, where GPUs are fully interconnected using NVLink and NVSwitch and each GPU is typically paired with a dedicated ConnectX or BlueField NIC to maximize network performance. In addition, this configuration is well suited for rail-optimized backend networks, where consistent per-GPU network bandwidth and predictable east-west traffic patterns are important.


Figure 13-2: Dedicated NIC per GPU.

Before moving to the design sections, it is worth mentioning that the need for a high-performance backend network, and how it is designed, is closely related to the size of the neural networks being used. Larger models require more GPU memory and often must be split across multiple GPUs or even servers. This increases the need for fast, low-latency communication between GPUs, which puts more pressure on the backend network.

Figure 13-3 shows a GPU server with 8 GPUs. Each GPU has 80 GB of memory, giving a total of 640 GB GPU memory. This kind of setup is common in high-performance AI clusters.
The figure also shows three examples of running large language models (LLMs) with different parameter sizes:
  • 8B model: This model has 8 billion parameters and needs only approximately 16 GB of memory. It fits on a single GPU if model parallelism is not required. 
  • 70B model: This larger model has 70 billion parameters and needs approximately 140 GB of memory. It cannot fit into one GPU, so it must use at least two GPUs. In this case, the GPUs communicate using intra-host GPU connections across NVLink.
  • 405B model: This large model has 405 billion parameters and needs approximately 810 GB of memory. It does not fit into one server. Running this model requires at least 11 GPUs (810 GB at 80 GB per GPU), spread across multiple servers. The GPUs must use both intra-server GPU connections over NVLink and inter-server connections across the backend network.
This figure highlights how model size directly affects memory needs and the number of GPUs required. As models grow, parallelism and fast GPU interconnects become essential.
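The memory figures above follow a simple rule of thumb: with 16-bit (2-byte) weights, a model needs roughly 2 GB of GPU memory per billion parameters just for its weights, before activations, optimizer state, or KV cache are counted. A minimal sketch of that arithmetic, assuming 80 GB GPUs:

import math

def gpus_needed(params_billion, gpu_mem_gb=80, bytes_per_param=2):
    # Weights-only estimate; ignores activations, optimizer state, and KV cache.
    mem_gb = params_billion * bytes_per_param          # e.g., 70 * 2 = 140 GB
    return mem_gb, math.ceil(mem_gb / gpu_mem_gb)

for size in (8, 70, 405):
    mem, gpus = gpus_needed(size)
    print(f"{size}B parameters: ~{mem} GB -> at least {gpus} x 80 GB GPU(s)")
# 8B: ~16 GB -> 1 GPU, 70B: ~140 GB -> 2 GPUs, 405B: ~810 GB -> 11 GPUs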

Figure 13-3: Model Size and Required GPUs.

Design Scenarios


Single Rail Switch Design with Dedicated, Single-Port NICs per GPU


Figure 13-4 illustrates a single rail switch design. The switch interfaces are divided into three groups of eight 200 Gbps interfaces each. The first group of eight ports is reserved for Host-1, the second group for Host-2, and the third group for Host-3. Each host has eight GPUs, and each GPU is equipped with a dedicated, single-port NIC.

Within each group, ports are assigned to different VLANs to separate traffic into different logical rails. Specifically, the first port of each group belongs to the VLAN representing Rail-1, the second port belongs to Rail-2, and so on. This pattern continues across all three host groups.
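The assignment pattern is regular enough to generate programmatically, which also helps when documenting or templating switch configurations. The snippet below is illustrative only; the port numbering follows Figure 13-4 and the VLAN IDs are assumptions:

# Hypothetical port-to-rail mapping for Figure 13-4: three hosts, eight GPUs each, one switch.
HOSTS, GPUS_PER_HOST = 3, 8

port_map = {}
for host in range(HOSTS):                         # Host-1 .. Host-3
    for gpu in range(GPUS_PER_HOST):              # GPU0 .. GPU7
        port = host * GPUS_PER_HOST + gpu + 1     # ports 1-24, one group of eight per host
        port_map[port] = {"host": host + 1, "rail": gpu + 1, "vlan": 101 + gpu}

print(port_map[1])    # {'host': 1, 'rail': 1, 'vlan': 101}
print(port_map[9])    # {'host': 2, 'rail': 1, 'vlan': 101}
print(port_map[24])   # {'host': 3, 'rail': 8, 'vlan': 108}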


Benefits


  • Simplicity: The architecture is very easy to design, configure, and troubleshoot. A single switch and straightforward VLAN assignment simplify management.
  • Cost-Effectiveness: Only one switch is needed, reducing capital expenditure (CapEx) compared to dual-rail or redundant designs. Less hardware also means lower operational expenditure (OpEx), including reduced power, cooling, and maintenance costs. Additionally, fewer devices translate to lower subscription-based licensing fees and service contract costs, further improving the total cost of ownership.
  • Efficient Use of Resources: Ports are used efficiently by directly mapping each GPU’s NIC to a specific port on the switch, minimizing wasted capacity.
  • Low Latency within the Rail: Since all communications stay within the same switch, latency is minimized, benefiting tightly-coupled GPU workloads.
  • Sufficient for Smaller Deployments: In smaller clusters or test environments where absolute redundancy is not critical, this design is perfectly sufficient.

Drawbacks


  • No Redundancy: A single switch creates a single point of failure. If the switch fails, all GPU communications are lost.
  • Limited Scalability: Expanding beyond the available switch ports can be challenging. Adding more hosts or GPUs might require replacing the switch or redesigning the network.
  • Potential Oversubscription: With all GPUs sending and receiving traffic through the same switch, there’s a risk of oversubscription, especially under heavy AI workload patterns where network traffic bursts are common.
  • Difficult Maintenance: Software upgrades or hardware maintenance on the switch impact all connected hosts, making planned downtime more disruptive.
  • Not Suitable for High Availability (HA) Requirements: Critical AI workloads, especially in production environments, often require dual-rail (redundant) networking to meet high availability requirements. This design would not meet such standards.
Single rail designs are cost-efficient and simple but lack redundancy and scalability, making them best suited for small or non-critical AI deployments.



Figure 13-4: Single Rail Switch Design: GPU with Single Port NIC.

Dual-Rail Switch Topology with Dedicated, Dual-Port NICs per GPU


In this topology, each host contains 8 GPUs, and each GPU has a dedicated dual-port NIC. The NICs are connected across two independent Rail switches equipped with 200 Gbps interfaces. This design ensures that every GPU has redundant network connectivity through separate switches, maximizing performance, resiliency, and failover capabilities.

Each Rail switch independently connects to one port of each NIC, creating a dual-homed connection per GPU. To ensure seamless operations and redundancy, the two switches must logically appear as a single device to the host NICs, even though they are physically distinct systems.

Benefits

  • High Availability: The failure of a single switch, link, or NIC port does not isolate any GPU, maintaining system uptime.
  • Load Balancing: Traffic can be distributed across both switches, maximizing bandwidth utilization and reducing bottlenecks.
  • Scalability: Dual-rail architectures can be extended easily to larger deployments while maintaining predictable performance and redundancy.
  • Operational Flexibility: Maintenance can often be performed on one switch without service disruption.

Drawbacks


  • Higher Cost: Requires two switches, twice the number of cables, and dual-port NICs, increasing CapEx and OpEx.
  • Complexity: Managing a dual-rail environment introduces more design complexity due to Multi-Chassis Link Aggregation (MLAG).
  • Increased Power and Space Requirements: Two switches and more cabling demand more rack space, power, and cooling.

Challenges of Multi-Chassis Link Aggregation (MLAG)


To create a logical channel between dual-port NICs and two switches, the switches must be presented as a single logical device to each NIC. Multi-Chassis Link Aggregation (MLAG) is often used for this purpose. MLAG allows a host to see both switch uplinks as part of the same LAG (Link Aggregation Group).
Another solution is to assign the two NIC ports to different VLANs without bundling them into a LAG, though this approach may limit bandwidth utilization and redundancy benefits compared to MLAG.
MLAG introduces several challenges:

  • MAC Address Synchronization: Both switches must advertise the same MAC address to the host NICs, allowing the two switches to appear as a single device.
  • Port Identification: A common approach to building MLAG is to use the same interface numbers on both switches. Therefore, the system must be capable of uniquely identifying each member link internally.
  • Control Plane Synchronization: The two switches must exchange state information (e.g., MAC learning, link status) to maintain a consistent and synchronized view of the network.
  • Failover Handling: The switches must detect failures quickly and handle them gracefully without disrupting existing sessions, requiring robust failure detection and recovery mechanisms.


Vendor-Specific MLAG Solutions


The following list shows some of the vendor-proprietary MLAG implementations:

  • Cisco Virtual Port Channel (vPC): Cisco's vPC allows two Nexus switches to appear as one logical switch to connected devices, synchronizing MAC addresses and forwarding state.
  • Juniper Virtual Chassis / MC-LAG: Juniper offers Virtual Chassis and MC-LAG solutions, where two or more switches operate with a shared control plane, presenting themselves as a single switch to the host.
  • Arista MLAG: Arista Networks implements MLAG with a simple peer-link architecture, supporting independent control planes while synchronizing forwarding state.
  • NVIDIA/Mellanox MLAG: Mellanox switches also offer MLAG solutions, often optimized for HPC and AI workloads.


Standards-Based Alternative: EVPN ESI Multihoming


Instead of vendor-specific MLAG, a standards-based approach using Ethernet Segment Identifier (ESI) Multihoming under BGP EVPN can be used. In this model:

  • Switches advertise shared Ethernet segments (ESIs) to the host over BGP EVPN.
  • Hosts see multiple physical links but treat them as part of a logical redundant connection.
  • EVPN ESI Multihoming allows for interoperable solutions across vendors, but typically adds more complexity to the control plane compared to simple MLAG setups.


Figure 13-5: Dual Rail Switch Design: GPU with Dual-Port NIC.


Cross-Rail Communication over NVLink in Rail-Only Topologies


In the introduced single- and dual-rail topologies (Figures 13-4 and 13-5), each GPU is connected to a dedicated NIC, and each NIC connects to a specific Rail switch. However, there is no direct cross-rail connection between the switches themselves — no additional spine layer interconnecting the rails. As a result, if a GPU needs to send data to a destination GPU that belongs to a different rail, special handling is required within the host before the data can exit over the network.

For example, consider a memory copy operation where GPU-2 (connected to Rail 3) on Host-1 needs to send data to GPU-3 (connected to Rail 4) on Host-2. Since GPU-2’s NIC is associated with Rail 3 and GPU-3 expects data arriving over Rail 4, the communication path must traverse multiple stages:

  1. Intra-Host Transfer: The data is first copied locally over NVLink from GPU-2 to GPU-3 within Host-1. NVLink provides a high-bandwidth, low-latency connection between GPUs inside the same server.
  2. NIC Transmission: Once the data resides in GPU-3’s memory, it can be sent out through GPU-3’s NIC, which connects to Rail 4.
  3. Inter-Host Transfer: The packet travels over Rail 4 through one of the Rail switches to reach Host-2.
  4. Destination Reception: Finally, the data is delivered to GPU-3 on Host-2.

This method ensures that each network link (and corresponding NIC) is used according to its assigned rail without needing direct switch-to-switch rail interconnects.
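Conceptually, the decision made on the sending host can be captured in a few lines of pseudo-logic. This is not NCCL code, only a hedged illustration of the staged path described above; the helper functions and data layout are hypothetical:

# Illustrative only: staged cross-rail transfer in a rail-only topology.
def nvlink_copy(src, dst, data):
    print(f"NVLink copy: GPU{src['id']} -> GPU{dst['id']} (intra-host)")

def rdma_send(nic, dst, data):
    print(f"RDMA send via {nic} -> Host-{dst['host']}/GPU{dst['id']} on rail {dst['rail']}")

def send_cross_rail(src, dst, local_gpus, data):
    if src["rail"] == dst["rail"]:
        rdma_send(src["nic"], dst, data)              # same rail: go straight out the NIC
    else:
        relay = next(g for g in local_gpus if g["rail"] == dst["rail"])
        nvlink_copy(src, relay, data)                 # step 1: intra-host NVLink copy
        rdma_send(relay["nic"], dst, data)            # step 2: egress on the destination rail

# Example from the text: Host-1 GPU-2 (Rail 3) sends to Host-2 GPU-3 (Rail 4).
host1 = [{"id": i, "host": 1, "rail": i + 1, "nic": f"nic{i}"} for i in range(8)]
dst = {"id": 3, "host": 2, "rail": 4, "nic": "nic3"}
send_cross_rail(host1[2], dst, host1, data=b"gradients")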

To coordinate and optimize such multi-step communication, NVIDIA Collective Communications Library (NCCL) plays a critical role. NCCL automatically handles GPU-to-GPU communication across multiple nodes and rails, selecting the appropriate path, initiating memory copies over NVLink, and scheduling transmissions over the correct NICs — all while maximizing bandwidth and minimizing latency. The upcoming chapter will explore NCCL in greater detail.

Figure 13-6 illustrates how the upcoming topology in Figure 13-7 maps NIC-to-Rail connections, transitioning from a switch interface-based view to a rail-based view. Figure 13-6 shows a partial interface layout of a Cisco Nexus 9348D-GX2A switch and how its interfaces are grouped into different rails as follows:

Rail-1 Interfaces: 1, 4, 7, 10
Rail-2 Interfaces: 13, 16, 19, 22
Rail-3 Interfaces: 25, 28, 31, 34
Rail-4 Interfaces: 37, 40, 43, 46
Rail-5 Interfaces: 2, 5, 8, 11
Rail-6 Interfaces: 14, 17, 20, 23
Rail-7 Interfaces: 26, 29, 32, 35
Rail-8 Interfaces: 38, 41, 44, 47

However, a port-based layout becomes extremely messy when describing larger implementations. Therefore, the common practice is to reference the rail number instead of individual switch interface identifiers.
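For larger builds, the rail-to-interface blocks above can be generated rather than maintained by hand. The snippet below reproduces the exact grouping shown in Figure 13-6; the formula is inferred from the listed interface numbers and applies only to this example layout:

# Reproduces the Rail-1..Rail-8 interface groups listed above (Figure 13-6).
def rail_interfaces(rail):
    # Rails 1-4 start at interfaces 1, 13, 25, 37; rails 5-8 at 2, 14, 26, 38.
    # Each rail then takes every third interface, four ports in total.
    start = ((rail - 1) % 4) * 12 + 1 + (rail - 1) // 4
    return [start + 3 * i for i in range(4)]

for rail in range(1, 9):
    print(f"Rail-{rail}:", rail_interfaces(rail))
# Rail-1: [1, 4, 7, 10] ... Rail-8: [38, 41, 44, 47]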


Figure 13-6: Interface Block to Rail Mapping.

Figure 13-7 provides an example showing how each NIC is now connected to a rail instead of being directly mapped to a specific physical interface. In this approach, each rail represents a logical group of physical interfaces, simplifying the overall design and making larger deployments easier to visualize and document.

In our example "Host-Segment" (an unofficial name), we have four hosts, each equipped with eight GPUs — 32 GPUs in total. Each GPU has a dedicated 200 Gbps dual-port NIC. All GPUs are connected to two rail switches over a 2 × 200 Gbps MLAG, providing 400 Gbps of transmission speed per GPU.

Figure 13-7: Example of Connecting 32 Dual-Port NICs to 8 Rails on 2 Switches.

Figure 13-8 shows how multiple Host-Segments can be connected. The figure illustrates a simplified two-tier, three-stage Clos fabric topology, where full-mesh Layer 3 links are established between the four Rail switches (leaf switches) and the Spine switches. The figure also presents the link capacity calculations. Each Rail switch has 32 × 100 Gbps connections to the hosts, providing a total downlink capacity of 3.2 Tbps.

Since oversubscription is generally not preferred in GPU clusters, in order to maintain high performance and low latency, the uplink capacity from each Rail switch to the Spine layer must also total 3.2 Tbps. This can be implemented either with native 800 Gbps interfaces or with logical Layer 3 port channels composed of two 400 Gbps links per Spine connection. Inter-switch capacity can also be increased by adding more switches to the Spine layer. This is one of the benefits of a Clos fabric: capacity can be scaled out without, for example, replacing 400 Gbps interfaces with 800 Gbps interfaces.
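The uplink sizing can be verified with a couple of lines of arithmetic. The sketch below uses the figures quoted above (32 host-facing ports per Rail switch at 100 Gbps) and shows that either uplink option reaches a 1:1 subscription ratio:

# Non-blocking check for one Rail switch, using the Pod example figures above.
downlink_ports, downlink_gbps = 32, 100
downlink_total = downlink_ports * downlink_gbps       # 3200 Gbps = 3.2 Tbps toward hosts

for name, uplink_gbps in (("native 800G uplinks", 800), ("2 x 400G L3 port channels", 2 * 400)):
    uplinks_needed = downlink_total // uplink_gbps
    print(f"{name}: {uplinks_needed} uplinks carry the full {downlink_total / 1000} Tbps")
# Four 800 Gbps uplinks (or four 2x400G port channels) per Rail switch keep the fabric non-blocking.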


This topology forms a Pod that supports 64 GPUs in total and provides a non-blocking architecture, ensuring optimal east-west traffic performance between GPUs across different Host-Segments.

In network design, the terms "two-tier" and "three-stage" Clos fabric describe different aspects of the same topology. "Two-tier" refers to the physical switch layers (typically Leaf and Spine) and describes the depth of the hierarchy, that is, how many switching layers are present. "Three-stage" Clos, on the other hand, describes the logical path a packet follows between endpoints: Leaf–Spine–Leaf. In short, a two-tier topology refers to the physical switch structure, while a three-stage Clos describes the logical path traffic takes across those three stages. These two perspectives are complementary, not contradictory, and together they provide a complete view of the Clos network design.


Figure 13-8: AI fabric – Pod Design.

Figure 13-9 extends the previous example by adding a second 64-GPU Pod, creating a larger multi-Pod architecture. To interconnect the two Pods, four Super-Spine switches are introduced, forming an additional aggregation layer above the Spine layer. Each Pod retains its internal two-tier Clos fabric structure, with Rail switches fully meshed to the Spine switches as described earlier. The Spine switches from both Pods are then connected northbound to the Super-Spine switches over Layer 3 links.

Due to the introduction of the Super-Spine layer, the complete system now forms a three-tier, five-stage Clos topology. This design supports scalable expansion while maintaining predictable latency and high bandwidth between GPUs across different Pods. Similar to the Rail-to-Spine design, maintaining a non-blocking architecture between the Spine and Super-Spine layers is critical. Each Spine switch aggregates 3.2 Tbps of traffic from its Rail switches; therefore, the uplink capacity from each Spine to the Super-Spine layer must also be 3.2 Tbps.

This can be achieved either by using native 800 Gbps links or logical Layer 3 port channels composed of two 400 Gbps links per Super-Spine connection. All Spine switches are fully meshed with all Super-Spine switches to ensure high availability and consistent bandwidth. This architecture enables seamless east-west traffic between GPUs located in different Pods, ensuring that inter-Pod communication maintains the same non-blocking performance as intra-Pod traffic.

Figure 13-9: AI fabric – Multi-Pod Design.

In this chapter, we focused mainly on different topology options: Single Rail with Single-Port GPU NIC, Dual Rail Switch with Dual-Port GPU NIC, Cross-Rail over a Layer 3 Clos fabric, and finally, the Inter-Pod architecture. The next chapter delves deeper into the technical solutions and challenges.