Monday, 17 March 2025

Tensor Parallelism

The previous section described how Pipeline Parallelism distributes entire layers across multiple GPUs. However, Large Language Models (LLMs) based on the transformer architecture contain billions of parameters, which makes this approach insufficient on its own.

For example, GPT-3 has approximately 605 million parameters in a single self-attention layer and about 1.2 billion parameters in a feedforward layer, and these figures apply to just one transformer block. Since GPT-3 has 96 transformer blocks, the total parameter count reaches approximately 173 billion. When adding embedding and normalization parameters, the total increases to roughly 175 billion parameters.
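
These per-block figures follow directly from GPT-3's published dimensions (96 decoder blocks, a model width of 12,288, a vocabulary of roughly 50k tokens, and a 2,048-token context). The short Python sketch below reproduces the arithmetic; it is only a back-of-the-envelope check, since biases, layer norms, and other small tensors are left out.

# Rough parameter-count arithmetic for GPT-3 175B (biases and layer-norm
# parameters are ignored; they add comparatively little).
d_model = 12288            # model width
n_blocks = 96              # number of transformer (decoder) blocks
vocab = 50257              # vocabulary size
n_ctx = 2048               # context length (rows in the position embedding table)

attn_params = 4 * d_model ** 2             # W_Q, W_K, W_V and the output projection
ffn_params = 2 * d_model * (4 * d_model)   # two matrices with a 4x hidden expansion

per_block = attn_params + ffn_params
all_blocks = n_blocks * per_block
embeddings = (vocab + n_ctx) * d_model

print(f"self-attention per block: {attn_params / 1e6:.0f} M")               # ~604 M
print(f"feedforward per block   : {ffn_params / 1e9:.2f} B")                # ~1.21 B
print(f"96 blocks               : {all_blocks / 1e9:.1f} B")                # ~173.9 B
print(f"plus embeddings         : {(all_blocks + embeddings) / 1e9:.1f} B") # ~174.6 B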

The parameters of even a single layer, together with their activations, gradients, and optimizer states, can exceed the memory capacity of a single GPU, so Pipeline Parallelism alone is not enough. In addition, performing such large matrix multiplications on a single GPU would be extremely slow and inefficient. Tensor Parallelism addresses this challenge by splitting the computations within individual layers across multiple GPUs, rather than assigning whole layers to separate GPUs as Pipeline Parallelism does.
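
The idea can be seen in a few lines of NumPy. The sketch below is my own minimal illustration, not taken from the figures: a weight matrix is split column-wise across two simulated GPUs, each "GPU" multiplies the same input by its own shard, and concatenating the partial results reproduces the full product.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))    # activations: 4 tokens, hidden size 8
W = rng.standard_normal((8, 6))    # a layer's weight matrix

# Tensor Parallelism: split W column-wise across two "GPUs".
W_gpu0, W_gpu1 = np.hsplit(W, 2)   # each GPU holds an (8 x 3) shard

# Each GPU multiplies the same input by its own shard only.
Y_gpu0 = X @ W_gpu0
Y_gpu1 = X @ W_gpu1

# Concatenating the partial outputs (an All-Gather in a real system)
# reproduces the result of the full, unsplit multiplication.
assert np.allclose(np.hstack([Y_gpu0, Y_gpu1]), X @ W)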

Chapter 7 introduces the Transformer architecture, but as a refresher, Figure 8-15 illustrates a stack of decoder modules in a transformer architecture. Each decoder module consists of a Self-Attention layer and a Feedforward layer. The figure also shows how an input word, represented by x1, is first mapped to a token. The token, in turn, receives a positional word embedding vector through lookups in the word embedding and position embedding tables.

The resulting word vector is used to compute the Query (Q) and Key (K) matrices, whose dot products produce the attention logits. These logits are then passed through the SoftMax function, and the resulting matrix is multiplied with the Value (V) matrix. After the Add & Normalization computation, the resulting matrix is fed into the Feedforward, fully connected, neural network.

Figure 8-15: An Overview of a Transformer Architecture.


Self-Attention Layer


In most cases, the word embedding matrix fits within a single GPU and is not split across multiple GPUs when using Tensor Parallelism. This is because a typical embedding matrix is approximately 200 MB, which is significantly smaller than large Transformer layers that can contain billions of parameters.

Another reason for keeping the embedding matrix on a single GPU is efficient lookup operations. Unlike large matrix multiplications, embedding lookups are memory-efficient and do not impose significant computational overhead. Splitting the embedding matrix across multiple GPUs would introduce high communication costs, as each GPU would store only a fraction of the vocabulary. This would require frequent cross-GPU communication for token lookups, increasing latency and reducing efficiency. After the embedding lookup, the embedding vectors are broadcast to all GPUs before the Transformer computations start.

However, in very large-scale models (such as GPT-3 with 175 billion parameters), embeddings may be sharded across multiple GPUs using distributed embeddings or model parallelism techniques. One approach is row-wise parallelism, where the vocabulary is split across GPUs, and each GPU stores only a fraction of the embeddings, handling lookups for the tokens it owns. 
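
As a minimal sketch of the row-wise idea, assuming two GPUs and a toy eight-word vocabulary: each simulated GPU owns half of the embedding rows, answers lookups only for the token IDs it owns, and contributes zeros for the rest, so summing the partial results (an All-Reduce in a real system) yields the same vectors as one unsplit table.

import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 8, 4
E = rng.standard_normal((vocab_size, d_model))   # full embedding table (reference)

# Row-wise parallelism: GPU 0 owns token IDs 0-3, GPU 1 owns token IDs 4-7.
shards = {0: (0, E[:4]), 1: (4, E[4:])}

token_ids = np.array([1, 6, 3, 7])               # an example input sequence

partial = []
for gpu, (offset, shard) in shards.items():
    local = token_ids - offset                   # position inside this GPU's shard
    owned = (local >= 0) & (local < shard.shape[0])
    out = np.zeros((len(token_ids), d_model))
    out[owned] = shard[local[owned]]             # look up only the tokens this GPU owns
    partial.append(out)

# Summing the partial results (All-Reduce) reconstructs the full lookup.
assert np.allclose(sum(partial), E[token_ids])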

Figure 8-16 illustrates how the positional word embedding matrix (Eevp) is multiplied with the Query (Q), Key (K), and Value (V) weight matrices to produce the corresponding Q, K, and V vectors. The Query and Key vectors are then used as inputs to the self-attention layer.


Figure 8-16: Local Query (Q), Key (K), and Value (V) Matrices.


Figure 8-17 illustrates how the Query, Key, and Value matrices are sharded across two GPUs. The first fragments of these matrices are assigned to GPU A1, while the second fragments are assigned to GPU A2. The positional word embedding matrix (Eevp) is also distributed between GPU A1 and GPU A2. Matrix multiplication is then performed between the corresponding fragment of Eevp and the respective shards of the Q, K, and V matrices.


Figure 8-17: Sharded Query (Q), Key (K), and Value (V) Matrices.


Figure 8-18 illustrates the cross-GPU communication involved in the forward pass of the Self-Attention layer when using Tensor Parallelism. In this example, both the word embedding and positional embedding matrices fit within GPU A1. After computing the positional word embeddings for the input words, the resulting vectors are broadcast to GPU A2.

Since we are using Tensor Parallelism, the Query (Q), Key (K), and Value (V) matrices are partitioned across GPU A1 and GPU A2. Once each GPU has computed its assigned slices of the Q, K, and V vectors, the Q and K vectors are shared between GPUs using an All-Gather operation. This ensures that each GPU receives the missing parts of the Q and K matrices, reconstructing the complete matrices across GPUs. Only the Q and K matrices are synchronized; the V matrix remains local to each GPU.

The Q and K matrices are then used in the Self-Attention layer, where the first operation is a matrix multiplication between the Query and Key vectors for all tokens; the process is explained in detail in Chapter 7. The resulting dot products are scaled (scaled dot-product attention) to produce the logits, which are the inputs to the SoftMax function. The output of the SoftMax function is then multiplied by the local fragment of the V matrix on each GPU.
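
The NumPy sketch below simulates this flow for two GPUs: Q, K, and V are computed from column-wise weight shards, the Q and K shards are all-gathered (here simply concatenated), the SoftMax attention weights are computed from the full Q and K, and each GPU multiplies those weights by its local V shard only. The two-way split and the matrix sizes are illustrative assumptions, not the exact layout of Figure 8-18.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d_model = 5, 8
X = rng.standard_normal((n_tokens, d_model))          # positional word embeddings

# Full projection matrices (single-GPU reference).
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))

# Tensor Parallelism: split each projection column-wise across two "GPUs".
Wq_s, Wk_s, Wv_s = np.hsplit(Wq, 2), np.hsplit(Wk, 2), np.hsplit(Wv, 2)

# 1) Each GPU computes its local slices of Q, K, and V.
Q_local = [X @ w for w in Wq_s]
K_local = [X @ w for w in Wk_s]
V_local = [X @ w for w in Wv_s]

# 2) All-Gather: both GPUs reconstruct the complete Q and K matrices.
Q_full = np.hstack(Q_local)
K_full = np.hstack(K_local)

# 3) Scaled dot-product attention weights (identical on both GPUs).
weights = softmax(Q_full @ K_full.T / np.sqrt(d_model))

# 4) Each GPU multiplies the weights by its *local* V shard only.
C_partial = [weights @ v for v in V_local]

# Concatenating the partial context vectors matches single-GPU attention.
C_ref = softmax(X @ Wq @ (X @ Wk).T / np.sqrt(d_model)) @ (X @ Wv)
assert np.allclose(np.hstack(C_partial), C_ref)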

Multiplying the SoftMax output by the Value vectors produces a Context Vector (Cv) for each input word, which serves as the input to the Feedforward Neural Network (FFN) layer. That said, the SoftMax in the self-attention layer is not the final prediction layer; it is used to compute attention weights. The feedforward network processes the context vectors (the token representations produced by self-attention), not predicted tokens. The final prediction is typically made by a separate output projection followed by a SoftMax over the vocabulary.

Figure 8-18: Tensor Parallelism in Self-Attention Layer.


Feedforward Layer


Figure 8-19 illustrates a Feedforward layer in the decoder module of a transformer. The feedforward network consists of two hidden layers and an output layer. In addition to Tensor Parallelism, we also employ Model Parallelism with Pipeline Parallelism.

The first hidden layer is split between GPU A1 and GPU B1, both located in the same server. The weight matrices for neurons 1–3 reside in GPU A1, while the weight matrices for neurons 4–6 are in GPU B1. The inter-GPU communication between GPU A1 and GPU B1 occurs over NVLinks, which I refer to as the High-speed Domain (HsD).

The second hidden layer is distributed across GPU A2 and GPU B2 within the same server. GPU A2 holds the weight matrices for neurons 1–2, while GPU B2 contains the weight matrices for neurons 3–4. The inter-GPU connection between GPU A2 and GPU B2 also utilizes NVLinks.

The output layer is divided between GPU A3 and GPU B3, both residing in the same server. The weight matrix for neuron 1 is stored in GPU A3, while the weight matrix for neuron 2 is in GPU B3. Inter-GPU communication occurs over NVLinks.

Additionally, GPU A1, GPU A2, and GPU A3 are interconnected via Rail Switch-1 across the Backend Network. Similarly, GPU B1, GPU B2, and GPU B3 are connected via Rail Switch-2 across the Backend Network.


Figure 8-19: Tensor, Model and Pipeline Parallelism in Feedforward Layer.


Backpropagation


Forward pass


First Hidden Layer (H1): The input to H1, the output of the Self-Attention block after the Add & Norm step (context vectors), is shared with GPU A1 and GPU B1. Each GPU then performs its local matrix multiplication. After these local computations are complete, the partial outputs are synchronized between GPU A1 and GPU B1 using an All-Gather operation. This synchronization ensures that the complete H1 output (ynA1+B1) is calculated before it is passed to the next stage. Because GPU A1 and GPU B1 reside on the same server, the communication occurs over a high-speed domain via NVLink.
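
As a sketch of this stage, assuming the six-neuron H1 layer of Figure 8-19 and a ReLU activation (the chapter does not specify one): GPU A1 holds the weights of neurons 1-3, GPU B1 holds the weights of neurons 4-6, both receive the same context vector, and an All-Gather concatenates the partial outputs into the complete H1 output.

import numpy as np

rng = np.random.default_rng(0)
d_in = 4                                  # size of the incoming context vector
C = rng.standard_normal((1, d_in))        # one context vector from self-attention

relu = lambda x: np.maximum(x, 0.0)       # illustrative activation (assumption)

# First hidden layer (H1): six neurons, split three per GPU as in Figure 8-19.
W_A1 = rng.standard_normal((d_in, 3))     # weights for neurons 1-3 on GPU A1
W_B1 = rng.standard_normal((d_in, 3))     # weights for neurons 4-6 on GPU B1

# The same input C is shared with both GPUs; each performs its local matmul.
y_A1 = relu(C @ W_A1)
y_B1 = relu(C @ W_B1)

# All-Gather over the high-speed domain (NVLink): both GPUs assemble the
# complete H1 output before it is pipelined to the next stage (A2 and B2).
y_H1 = np.hstack([y_A1, y_B1])            # shape (1, 6)

assert np.allclose(y_H1, relu(C @ np.hstack([W_A1, W_B1])))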

In the context of pipeline parallelism, H1 constitutes one pipeline stage. Once its context vector-based output is fully assembled, it is sent to the GPUs responsible for the next layer. Specifically, GPU A1 and GPU B1 first pass the output computed from the first context vector (C1), and then the GPUs process the next context vector. This communication occurs over the backend network. GPU A1, GPU A2, and GPU A3 are all connected to the same rail switch, so the RDMA packets traverse only one switch. The same design applies to GPU B1, GPU B2, and GPU B3. If communication between GPUs connected to different rail switches is required, the rail switches must be interconnected via spine switches. Alternatively, the RDMA packets may be sent over the high-speed domain to a GPU on the same rail as the destination GPU.

Second Hidden Layer (H2): The complete output from H1 (obtained after synchronization in the previous stage) is pipelined to GPUs A2 and B2. Each of these GPUs performs its own local matrix multiplication. As before, after the local computations, the partial outputs from GPU A2 and GPU B2 are synchronized via an All-Gather operation, forming the complete H2 output (ynA2+B2).

The synchronization and forwarding between hidden layer 2 and the output layer, and within the output layer, follow the same pattern as in the previous hidden layers.

This hybrid approach, using tensor parallelism within each stage and pipeline parallelism across stages, helps balance the computational load and memory usage across the six GPUs while minimizing idle time.

Although the focus of this section is on tensor parallelism, pipeline parallelism is also discussed because large language models (LLMs) can process multiple training sequences simultaneously during the training process.

On the other hand, during inference, when answering our questions, LLMs use autoregressive next-word prediction. In this process, the final SoftMax layer of the Transformer calculates the probabilities over the vocabulary to predict the next token. The predicted token is then converted into a word, mapped back to a token, and appended to the input sequence. The lookup process assigns the token a positional embedding vector, which is used to compute the Query, Key, and Value vectors that feed into the Transformer's self-attention layer. Consequently, pipeline parallelism is not required during the inference phase.
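
A toy loop makes the autoregressive dependency concrete. The forward function below is a hypothetical stand-in for the full Transformer; it simply returns vocabulary logits for the sequence so far. The point is that each new token must be produced before the next forward pass can start, so the pipeline cannot be kept full with work from a single request.

import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 50, 16

# Hypothetical stand-in for the full Transformer: embeds the token IDs and
# returns logits over the vocabulary for the *last* position only.
E = rng.standard_normal((vocab_size, d_model))
W_out = rng.standard_normal((d_model, vocab_size))   # output projection

def forward(token_ids):
    h = E[token_ids].mean(axis=0)        # toy "context" of the sequence so far
    return h @ W_out                     # logits over the vocabulary

prompt = [3, 17, 42]                     # token IDs of the user's question
tokens = list(prompt)

for _ in range(5):                       # generate five tokens, one at a time
    logits = forward(np.array(tokens))
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                 # SoftMax over the vocabulary
    next_token = int(np.argmax(probs))   # greedy next-token prediction
    tokens.append(next_token)            # the new token becomes part of the input

print(tokens)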

Backward pass


The error propagates backward from the Feedforward Neural Network (FFNN) layer to the Self-Attention layer. The backpropagation process in a Transformer follows a sequential order, meaning the error from the output propagates first to the FFNN layer, and from there, it continues backward to the Self-Attention mechanism.

The process begins at the output layer, where the error is computed using the SoftMax function and cross-entropy loss. This error is then backpropagated through the FFNN layer, where gradients for the weight matrices are computed. Since the FFNN weights are split across multiple GPUs in Tensor Parallelism, each GPU computes its local gradients. An All-Reduce operation is then performed to synchronize these gradients across GPUs, ensuring that every GPU holds the same, complete gradients before the weights are updated.
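
Functionally, the All-Reduce leaves every GPU holding the identical sum (or average) of all locally computed gradients. The NumPy sketch below simulates that outcome for four GPUs; in a real system the exchange is issued as a single collective over NVLink or the backend network rather than computed in one process.

import numpy as np

rng = np.random.default_rng(0)

# Gradients computed independently on four simulated GPUs for the same tensor.
local_grads = [rng.standard_normal((3, 3)) for _ in range(4)]

# All-Reduce: every GPU ends up with the element-wise sum of all local
# gradients, usually divided by the GPU count to obtain the average.
averaged = sum(local_grads) / len(local_grads)

# After the collective, each GPU holds an identical copy and can apply
# the same weight update.
per_gpu_result = [averaged.copy() for _ in local_grads]
assert all(np.allclose(g, per_gpu_result[0]) for g in per_gpu_result)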

Once the gradients for the FFNN weights are synchronized, the error propagates back to the Self-Attention layer. Here, gradients for the Query (Q), Key (K), and Value (V) matrices are computed. Since these matrices were split across GPUs during the forward pass, the missing Q and K fragments must be gathered before calculating gradients. An All-Gather operation is used to collect Q and K values across GPUs. Once each GPU has a complete Q and K matrix, it computes the required gradients locally. After the local gradient computation, an All-Reduce operation is performed to ensure all GPUs have the synchronized gradients before updating the weights.

After both layers complete their gradient computations and synchronizations, the optimizer updates the weights, and the next iteration begins. The key communication phases include All-Gather for assembling required Q and K values before gradient computation and All-Reduce for synchronizing gradients before weight updates.
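
For readers who want to see these two collectives in code, the sketch below maps them onto torch.distributed calls. The choice of PyTorch, the Gloo CPU backend, and the tensor shapes are my own assumptions for a small runnable example; the chapter itself does not prescribe a framework.

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500",
                            rank=rank, world_size=world_size)

    # All-Gather: each rank contributes its local shard (e.g., a Q or K slice)
    # and receives the shards of every other rank.
    shard = torch.full((2, 2), float(rank))          # this rank's local shard
    gathered = [torch.zeros(2, 2) for _ in range(world_size)]
    dist.all_gather(gathered, shard)
    full = torch.cat(gathered, dim=1)                # reconstructed matrix

    # All-Reduce: sum the local gradients so every rank holds the same result.
    grad = torch.full((2, 2), float(rank + 1))
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)

    if rank == 0:
        print("gathered shape:", tuple(full.shape), "reduced value:", grad[0, 0].item())
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)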



Tuesday, 11 March 2025

Model Parallelism with Pipeline Parallelism

 

In Model Parallelism, the neural network is partitioned across multiple GPUs, with each GPU responsible for specific layers of the model. This strategy is particularly beneficial for large-scale models that surpass the memory limitations of a single GPU.

Conversely, Pipeline Parallelism involves dividing the model into consecutive stages and assigning each stage to a different GPU. This setup allows data to be processed in a pipeline fashion, akin to an assembly line, enabling simultaneous processing of multiple training samples. Without pipeline parallelism, each GPU would process its part of the complete dataset in turn, while all other GPUs remain idle.

Our example neural network in Figure 8-3 consists of three hidden layers and an output layer. The first hidden layer is assigned to GPU A1, while the second and third hidden layers are assigned to GPU A2 and GPU B1, respectively. The output layer is placed on GPU B2. The training dataset is divided into four micro-batches and stored on the GPUs. These micro-batches are fed sequentially into the first hidden layer on GPU A1. 

Note 8-1. In this example, we use a small training dataset. However, if the dataset is too large to fit on a single GPU, we combine model parallelism, pipeline parallelism, and data parallelism to distribute the workload efficiently. See Note 8-2 for more detail.

I have divided the forward pass and backward pass into time steps, which are further split into computation and communication phases.

During the forward pass, neurons first calculate the weighted sum of inputs, apply the activation function, and produce an output y (computation phase). The computed outputs y, stored in GPU memory, are then transferred to peer GPU(s) using Remote Direct Memory Access (RDMA) (communication phase).

During the backward pass, the backpropagation algorithm computes the model error (computation phase) and propagates it backward across GPUs using RDMA (communication phase). This process was explained in detail in Chapter 2. 

Note 8-2: In our example, Hidden Layer 1 fits entirely on GPU A1. The same applies to other layers—they each fit within a single GPU. However, if the input dataset is too large to fit into a single GPU, it must be split across multiple GPUs. In that case, Hidden Layer 1 will be distributed across multiple GPUs, with each GPU handling a different portion of the dataset. When this happens, the gradients of Hidden Layer 1 must be synchronized across all GPUs that store part of the layer.
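
Before walking through the time steps, the short simulation below reproduces the forward-pass portion of the schedule: four micro-batches flow through the four stages (A1, A2, B1, B2), and the printout shows which GPUs are busy at each time step, matching the 25%, 50%, 75%, and 100% utilization figures. The backward pass that overlaps with the forward pass from time step 5 onward is not modeled here.

# Toy simulation of the forward-pass pipeline schedule in Figures 8-3 to 8-6.
stages = ["A1", "A2", "B1", "B2"]      # one pipeline stage per GPU
n_microbatches = 4                     # x1 .. x4

for step in range(1, n_microbatches + len(stages)):
    busy = []
    for stage_idx, gpu in enumerate(stages):
        mb = step - stage_idx          # micro-batch this stage works on now
        if 1 <= mb <= n_microbatches:
            busy.append(f"{gpu}:y{mb}")
    pct = 100 * len(busy) // len(stages)
    print(f"time step {step}: active {pct:3d}% -> {', '.join(busy)}")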

Time step 1:

     Computing:

·    A1 processes the input x1 and produces the output y1.

Communication:

·    A1 transports y1 to A2. 

Active GPUs (25%): A1

Figure 8-3: Model Parallelism with Pipeline Parallelism – Time Step 1.

 

Time step 2: 

Computing:

·    A1 processes the input x2 and produces the output y2.

·    A2 processes the input y1 and produces the output y1.

Communication:

·    A1 transports y2 to A2.

·    A2 transports y1 to B1.

Active GPUs (50%): A1, A2



Figure 8-4: Model Parallelism with Pipeline Parallelism – Time Step 2.

 

Time step 3:

Computing:

·    A1 processes the input x3 and produces the output y3.

·    A2 processes the input y2 and produces the output y2.

·    B1 processes the input y1 and produces the output y1.

Communication:

·    A1 transports y3 to A2.

·    A2 transports y2 to B1.

·    B1 transports y1 to B2.

Active GPUs (75%): A1, A2, B1

 

Figure 8-5: Model Parallelism with Pipeline Parallelism – Time Step 3. 

Time step 4: 

Computing:

·    A1 processes the input x4 and produces the output y4.

·    A2 processes the input y3 and produces the output y3.

·    B1 processes the input y2 and produces the output y2.

·    B2 processes the input y1 and produces the model output 1.

Communication:

·    A1 transports y4 to A2.

·    A2 transports y3 to B1.

·    B1 transports y2 to B2.

Active GPUs (100%): A1, A2, B1, B2


Figure 8-6: Model Parallelism with Pipeline Parallelism – Time Step 4. 

Time step 5:

Computing:

·    A2 processes the input y4 and produces the output y4.

·    B1 processes the input y3 and produces the output y3.

·    B2 processes the input y2 and produces the model output 2.

·    B2 computes the local neuron error E1 and gradient G1.

Communication:

·    A2 transports y4 to B1.

·    B1 transports y3 to B2.

·    B2 transports error E1 to B1.

Active GPUs (75%): A2, B1, B2

 The notation x3 above G1 on GPU B2 indicates that the algorithm computes gradients from the error for each weight associated with the inputs, including the bias. This process is repeated with all four micro-batches. This notation will be used in the upcoming figures as well.


Figure 8-7: Model Parallelism with Pipeline Parallelism – Time Step 5. 

Time step 6: 

Computing:

·    B1 processes the input y4 and produces the output y4.

·    B2 processes the input y3 and produces model output 3.

·    B2 computes the local neuron error E2 and gradient G2.

·    B1 computes the local neuron error E1 and gradient G1.

Communication:

·    B1 transports y4 to B2.

·    B2 transports error E2 to B1.

Active GPUs (50%): B1, B2

 


 Figure 8-8: Model Parallelism with Pipeline Parallelism – Time Step 6. 

Time step 7: 

Computing:

·    B2 processes the input y4 and produces model output 4.

·    B2 computes the local neuron error E3 and gradient G3.

·    B1 computes the local neuron error E2 and gradient G2.

·    A2 computes the local neuron error E1 and gradient G1.

Communication:

·    B2 transports error E3 to B1.

·    B1 transports error E2 to A2.

·    A2 transports error E1 to A1.

Active GPUs (75%): A2, B1, B2

 

 

Figure 8-9: Model Parallelism with Pipeline Parallelism – Time Step 7. 

Time step 8: 

Computing: 

·    B2 computes the local neuron error E4 and gradient G4.

·    B1 computes the local neuron error E3 and gradient G3.

·    A2 computes the local neuron error E2 and gradient G2.

·    A1 computes the local neuron error E1 and gradient G1.

Communication:

·    B2 transports error E4 to B1.

·    B1 transports error E3 to A2.

·    A2 transports error E2 to A1.

Active GPUs (100%): A1, A2, B1, B2

 

Figure 8-10: Model Parallelism with Pipeline Parallelism – Time Step 8. 

Time step 9: 

Computing: 

·    B1 computes the local neuron error E4 and gradient G4.

·    A2 computes the local neuron error E3 and gradient G3.

·    A1 computes the local neuron error E2 and gradient G2.

Communication: 

·    B1 transports error E4 to A2.

·    A2 transports error E3 to A1.

Active GPUs (75%): A1, A2, B1

 

Figure 8-11: Model Parallelism with Pipeline Parallelism – Time Step 9. 

Time step 10:

Computing: 

·    A2 computes the local neuron error E4 and gradient G4.

·    A1 computes the local neuron error E3 and gradient G3.

Communication: 

·    A2 transports error E4 to A1.

Active GPUs (50%): A1, A2

 

Figure 8-12: Model Parallelism with Pipeline Parallelism – Time Step 10. 

Time step 11: 

Computing: 

·    A1 computes the local neuron error E4 and gradient G4.

Communication: 

Active GPUs (25%): A1

 

In our example, the micro-batches fit into a single GPU, so we do not need to split them across multiple GPUs. Once GPU A1 has computed the gradients for the last micro-batch, the weights are updated, and the second iteration of the forward pass begins.

 

Figure 8-13: Model Parallelism with Pipeline Parallelism – Time Step 11. 

If the training dataset is too large for a single GPU and must be split across multiple GPUs, the layers must also be split between GPUs. For example, hidden layer 1 is placed on GPUs A1 and C1, while hidden layer 2 is placed on GPUs A2 and C2. This requires intra-layer gradient synchronization between the GPUs sharing the same layer, resulting in inter-GPU packet transport. Figure 8-14 illustrates how the gradients are first exchanged between the GPUs sharing a layer (intra-layer synchronization). Then, each GPU averages the gradients (the sum of the gradients divided by the number of GPUs). Finally, the averaged gradients are synchronized so that every GPU sharing the layer applies the same weight update.
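
A minimal sketch of that three-step synchronization, assuming GPUs A1 and C1 hold replicas of hidden layer 1 and have computed different local gradients from their portions of the data:

import numpy as np

rng = np.random.default_rng(0)

# GPUs A1 and C1 hold copies of hidden layer 1 but see different data,
# so their locally computed gradients differ.
grad_A1 = rng.standard_normal((4, 3))
grad_C1 = rng.standard_normal((4, 3))

# Step 1: synchronize, so each GPU obtains the other's gradients.
# Step 2: average, the sum of the gradients divided by the number of GPUs.
avg_grad = (grad_A1 + grad_C1) / 2

# Step 3: both GPUs apply the same averaged gradient, so their copies of
# hidden layer 1 remain identical after the weight update.
lr = 0.01
W = rng.standard_normal((4, 3))            # hidden layer 1 weights (replicated)
W_after_A1 = W - lr * avg_grad
W_after_C1 = W - lr * avg_grad
assert np.allclose(W_after_A1, W_after_C1)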

 

Figure 8-14: Model Parallelism with Pipeline Parallelism – Synchronization.