In Model Parallelism, the neural network is partitioned across multiple GPUs, with each GPU responsible for specific layers of the model. This strategy is particularly beneficial for large-scale models that surpass the memory limitations of a single GPU.
Conversely, Pipeline Parallelism involves dividing the model into consecutive stages and assigning each stage to a different GPU. This setup allows data to be processed in a pipeline fashion, akin to an assembly line, enabling simultaneous processing of multiple training samples. Without pipeline parallelism, the complete dataset would flow through the model one stage at a time, so only one GPU computes while all the others remain idle.
Our example neural network in Figure 8-3 consists of three hidden layers and an output layer. The first hidden layer is assigned to GPU A1, while the second and third hidden layers are assigned to GPU A2 and GPU B1, respectively. The output layer is placed on GPU B2. The training dataset is divided into four micro-batches and stored on the GPUs. These micro-batches are fed sequentially into the first hidden layer on GPU A1.
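The placement in Figure 8-3 can be written down in a few lines of framework code. The PyTorch-style sketch below is only illustrative: the device mapping, layer sizes, and names (A1, hidden1, and so on) are assumptions made for this example rather than part of any particular training framework.

```python
import torch
import torch.nn as nn

# Hypothetical mapping of the four GPUs in Figure 8-3 to CUDA devices.
A1, A2, B1, B2 = "cuda:0", "cuda:1", "cuda:2", "cuda:3"

# One pipeline stage per GPU: three hidden layers and the output layer.
hidden1 = nn.Linear(16, 32).to(A1)   # Hidden Layer 1 on GPU A1
hidden2 = nn.Linear(32, 32).to(A2)   # Hidden Layer 2 on GPU A2
hidden3 = nn.Linear(32, 32).to(B1)   # Hidden Layer 3 on GPU B1
output  = nn.Linear(32, 4).to(B2)    # Output layer on GPU B2

# Split the training data into the four micro-batches x1..x4 of Figure 8-3.
batch = torch.randn(64, 16)          # made-up dataset: 64 samples, 16 features
x1, x2, x3, x4 = batch.chunk(4)
```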
Note 8-1: In this example, we use a small training dataset. However, if the dataset is too large to fit on a single GPU, we combine model parallelism, pipeline parallelism, and data parallelism to distribute the workload efficiently. See Note 8-2 for more details.
I have divided the forward pass and backward pass into time steps, which are further split into computation and communication phases.
During the forward pass, neurons first calculate the weighted sum of inputs, apply the activation function, and produce an output y (computation phase). The computed outputs y, stored in GPU memory, are then transferred to peer GPU(s) using Remote Direct Memory Access (RDMA) (communication phase).
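Continuing the sketch above, one forward time step on GPU A1 is the computation phase followed by the hand-off of the activation to GPU A2. The device-to-device copy below stands in for the RDMA transfer described here; on a real cluster it travels over NVLink or GPUDirect RDMA rather than through host memory.

```python
# Computation phase on GPU A1: weighted sum of the inputs plus activation.
y1 = torch.relu(hidden1(x1.to(A1)))

# Communication phase: hand the activation y1 to the next stage on GPU A2.
y1 = y1.to(A2, non_blocking=True)
```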
During the backward pass, the backpropagation algorithm computes the model error (computation phase) and propagates it backward across GPUs using RDMA (communication phase). This process was explained in detail in Chapter 2.
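The backward hand-off can be sketched in the same style. The two-stage, CPU-only example below is a simplification of the four-GPU layout (stage1 and stage2 loosely play the roles of B1 and B2, and all shapes are made up); it only illustrates how the gradient of a received activation becomes the error E that is sent back to the previous GPU.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

stage1, stage2 = nn.Linear(8, 8), nn.Linear(8, 4)
x, targets = torch.randn(4, 8), torch.randint(0, 4, (4,))

# Forward: the activation is detached when it is "received" by the next stage,
# just as an activation arriving over RDMA carries no autograd history.
y = stage1(x)
y_recv = y.detach().requires_grad_(True)
logits = stage2(y_recv)

# Backward on the last stage: model error and local gradients (E and G).
loss = F.cross_entropy(logits, targets)
loss.backward()                 # fills stage2's parameter gradients and y_recv.grad

# The gradient of the received activation is the error sent back to the previous stage.
error = y_recv.grad             # in the figures: E transported to the previous GPU
y.backward(error)               # the previous stage continues backpropagation locally
```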
Note 8-2: In our example, Hidden Layer 1 fits entirely on GPU A1. The same applies to other layers—they each fit within a single GPU. However, if the input dataset is too large to fit into a single GPU, it must be split across multiple GPUs. In that case, Hidden Layer 1 will be distributed across multiple GPUs, with each GPU handling a different portion of the dataset. When this happens, the gradients of Hidden Layer 1 must be synchronized across all GPUs that store part of the layer.
Time step 1:
Computing:
· A1 processes the input x1 and produces the output y1.
Communication:
· A1 transports y1 to A2.
Active GPUs (25%): A1
Figure 8-3: Model Parallelism with Pipeline Parallelism – Time Step 1.
Time step 2:
Computing:
· A1 processes the input x2 and produces the output y2.
· A2 processes the input y1 and produces the output y1.
Communication:
· A1 transports y2 to A2.
· A2 transports y1 to B1.
Active GPUs (50%): A1, A2
Figure 8-4: Model Parallelism with Pipeline Parallelism – Time Step 2.
Time step 3:
Computing:
· A1 processes the input x3 and produces the output y3.
· A2 processes the input y2 and produces the output y2.
· B1 processes the input y1 and produces the output y1.
Communication:
· A1 transports y3 to A2.
· A2 transports y2 to B1.
· B1 transports y1 to B2.
Active GPUs (75%): A1, A2, B1
Figure 8-5: Model Parallelism with Pipeline Parallelism – Time Step 3.
Time step 4:
Computing:
· A1 processes the input x4 and produces the output y4.
· A2 processes the input y3 and produces the output y3.
· B1 processes the input y2 and produces the output y2.
· B2 processes the input y1 and produces the model output.
Communication:
· A1 transports y4 to A2.
· A2 transports y3 to B1.
· B1 transports y2 to B2.
Active GPUs (100%): A1, A2, B1, B2
Figure 8-6: Model Parallelism with Pipeline Parallelism – Time Step 4.
Time step 5:
Computing:
· A2 processes the input y4 and produces the output y4.
· B1 processes the input y3 and produces the output y3.
· B2 processes the input y2 and produces the model output.
· B2 computes the local neuron error E1 and gradient G1.
Communication:
· A2 transports y4 to B1.
· B1 transports y3 to B2.
· B2 transports error E1 to B1.
Active GPUs (75%): A2, B1, B2
The notation x3 above G1 on GPU B2 indicates that the algorithm computes a gradient from the error for each weight associated with the inputs, as well as for the bias. This process is repeated for all four micro-batches, and the same notation is used in the upcoming figures.
Figure 8-7: Model Parallelism with Pipeline Parallelism – Time Step 5.
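As a purely numerical illustration of that notation, assume the output neuron has three inputs and a bias, and that the values below are made up: each input weight receives the gradient error × input, and the bias receives the error itself.

```python
E = 0.35                             # hypothetical local neuron error
inputs = [0.2, -0.7, 1.1]            # hypothetical inputs to the neuron

G_weights = [E * x for x in inputs]  # one gradient per input weight: ~[0.07, -0.245, 0.385]
G_bias = E                           # the bias gradient is the error itself
```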
Time step 6:
Computing:
· B1 processes the input y4 and produces the output y4.
· B2 processes the input y3 and produces the model output.
· B2 computes the local neuron error E2 and gradient G2.
· B1 computes the local neuron error E1 and gradient G1.
Communication:
· B1 transports y4 to B2.
· B2 transports error E2 to B1.
· B1 transports error E1 to A2.
Active GPUs (50%): B1, B2
Figure 8-8: Model Parallelism with Pipeline Parallelism – Time Step 6.
Time step 7:
Computing:
· B2 processes the input y4 and produces the model output.
· B2 computes the local neuron error E3 and gradient G3.
· B1 computes the local neuron error E2 and gradient G2.
· A2 computes the local neuron error E1 and gradient G1.
Communication:
· B2 transports error E3 to B1.
· B1 transports error E2 to A2.
· A2 transports error E1 to A1.
Active GPUs (75%): A2, B1, B2
Figure 8-9: Model Parallelism with Pipeline Parallelism – Time Step 7.
Time step 8:
Computing:
· B2 computes the local neuron error E4 and gradient G4.
· B1 computes the local neuron error E3 and gradient G3.
· A2 computes the local neuron error E2 and gradient G2.
· A1 computes the local neuron error E1 and gradient G1.
Communication:
· B2 transports error E4 to B1.
· B1 transports error E3 to A2.
· A2 transports error E2 to A1.
Active GPUs (100%): A1, A2, B1, B2
Figure 8-10: Model Parallelism with Pipeline Parallelism – Time Step 8.
Time step 9:
Computing:
· B1 computes the local neuron error E4 and gradient G4.
· A2 computes the local neuron error E3 and gradient G3.
· A1 computes the local neuron error E2 and gradient G2.
Communication:
· B1 transports error E4 to A2.
· A2 transports error E3 to A1.
Active GPUs (75%): A1, A2, B1
Figure 8-11: Model Parallelism with Pipeline Parallelism – Time Step 9.
Time step 10:
Computing:
· A2 computes the local neuron error E4 and gradient G4.
· A1 computes the local neuron error E3 and gradient G3.
Communication:
· A2 transports error E4 to A1.
Active GPUs (50%): A1, A2
Figure 8-12: Model Parallelism with Pipeline Parallelism – Time Step 10.
Time step 11:
Computing:
· A1 computes the local neuron error E4 and gradient G4.
Communication:
· None. A1 holds the first hidden layer, so there is no further error to propagate.
Active GPUs (25%): A1
In our example, the micro-batches fit into a single GPU, so we don’t need to split them across multiple GPUs. Once GPU A1 has computed the gradients for the last micro-batch, the weights are adjusted, and the second iteration of the forward pass begins.
Figure 8-13: Model Parallelism with Pipeline Parallelism – Time Step 11.
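The utilization pattern of the eleven time steps can be reproduced with a short, dependency-free sketch. The two scheduling formulas below are inferred from Figures 8-3 through 8-13 (all forward micro-batches first, then all backward micro-batches) rather than taken from any framework; running it prints the same utilization ramp (25%, 50%, 75%, 100%, ..., 25%) shown in the figures.

```python
STAGES = ["A1", "A2", "B1", "B2"]   # one pipeline stage per GPU
MICRO_BATCHES = 4

def active_gpus(step):
    """Return the GPUs that compute during a given 1-based time step."""
    busy = set()
    for m in range(1, MICRO_BATCHES + 1):          # micro-batch index
        for s in range(1, len(STAGES) + 1):        # stage index
            forward_step = m + s - 1                               # x_m reaches stage s
            backward_step = MICRO_BATCHES + m + (len(STAGES) - s)  # E_m reaches stage s
            if step in (forward_step, backward_step):
                busy.add(STAGES[s - 1])
    return busy

for step in range(1, 12):
    gpus = active_gpus(step)
    print(f"Time step {step:2d}: {len(gpus) * 100 // len(STAGES):3d}% {sorted(gpus)}")
```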
If the training dataset is too large for a single GPU and must be split across multiple GPUs, the layers must also be shared between GPUs. For example, hidden layer 1 is on GPUs A1 and C1, while hidden layer 2 is on GPUs A2 and C2. This requires intra-layer gradient synchronization between GPUs sharing the same layer, resulting in inter-GPU packet transport. Figure 8-14 illustrates how the gradients are first synchronized (intra-layer). Then, each GPU averages the gradients (the sum of the gradients divided by the number of GPUs). Finally, the averaged gradients are synchronized.
Figure 8-14: Model Parallelism with Pipeline Parallelism – Synchronization.
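The intra-layer synchronization shown in Figure 8-14 is typically implemented as an all-reduce over the GPUs that hold a shard of the same layer. The sketch below uses PyTorch's torch.distributed package; the process group for those GPUs (for example A1 and C1) is assumed to have been created elsewhere, and with the NCCL backend the transport can use GPUDirect RDMA.

```python
import torch.distributed as dist

def sync_layer_gradients(layer, group=None):
    """Average a layer's gradients across all GPUs that hold a shard of that layer."""
    world = dist.get_world_size(group)
    for p in layer.parameters():
        if p.grad is not None:
            # Sum the peer GPUs' gradient tensors (carried by NCCL / RDMA) ...
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM, group=group)
            # ... then divide by the number of participating GPUs to get the average.
            p.grad /= world

# Example (hypothetical): average Hidden Layer 1's gradients across GPUs A1 and C1.
# sync_layer_gradients(hidden1, group=hidden1_group)
```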