In Model Parallelism, the
neural network is partitioned across multiple GPUs, with each GPU responsible
for specific layers of the model. This strategy is particularly
beneficial for large-scale models that surpass the memory limitations of a
single GPU.
Conversely, Pipeline Parallelism
involves dividing the model into consecutive stages, assigning each stage to a
different GPU. This setup allows data to be processed in a pipeline fashion,
akin to an assembly line, enabling simultaneous processing of multiple training
samples. Without pipeline parallelism, each GPU would have to process the complete dataset for its stage before handing the result to the next GPU, leaving all other GPUs idle in the meantime.
Our example neural network in
Figure 8-3 consists of three hidden layers and an output layer. The first
hidden layer is assigned to GPU A1, while the second and third hidden
layers are assigned to GPU A2 and GPU B1, respectively. The
output layer is placed on GPU B2. The training dataset is divided into
four micro-batches and stored on the GPUs. These micro-batches are fed
sequentially into the first hidden layer on GPU A1.
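To make the layer-to-GPU mapping concrete, the sketch below expresses it in PyTorch. It only shows the placement of the layers and the movement of activations between GPUs; the device identifiers (cuda:0 through cuda:3 standing in for A1, A2, B1, and B2), the layer sizes, and the batch split are assumptions made for illustration, not values taken from Figure 8-3.

import torch
import torch.nn as nn

# Illustrative device mapping: cuda:0..cuda:3 stand in for GPUs A1, A2, B1, B2.
DEV_A1, DEV_A2, DEV_B1, DEV_B2 = "cuda:0", "cuda:1", "cuda:2", "cuda:3"

class PartitionedModel(nn.Module):
    """Three hidden layers and an output layer, each placed on its own GPU."""
    def __init__(self, n_in=8, n_hidden=16, n_out=4):
        super().__init__()
        self.h1 = nn.Linear(n_in, n_hidden).to(DEV_A1)      # Hidden layer 1 -> A1
        self.h2 = nn.Linear(n_hidden, n_hidden).to(DEV_A2)  # Hidden layer 2 -> A2
        self.h3 = nn.Linear(n_hidden, n_hidden).to(DEV_B1)  # Hidden layer 3 -> B1
        self.out = nn.Linear(n_hidden, n_out).to(DEV_B2)    # Output layer   -> B2

    def forward(self, x):
        # Each .to() call moves the activation to the next stage's GPU,
        # playing the role of the inter-GPU transfer in the figures.
        y = torch.relu(self.h1(x.to(DEV_A1)))
        y = torch.relu(self.h2(y.to(DEV_A2)))
        y = torch.relu(self.h3(y.to(DEV_B1)))
        return self.out(y.to(DEV_B2))

# Four micro-batches (x1..x4), fed sequentially into the first stage on A1.
model = PartitionedModel()
micro_batches = torch.randn(128, 8).chunk(4)   # hypothetical batch split
outputs = [model(x) for x in micro_batches]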
Note 8-1: In this example, we use a small training dataset. However, if the dataset is too large to fit on a single GPU, we combine model parallelism, pipeline parallelism, and data parallelism to distribute the workload efficiently. See Note 8-2 for more detail.
I have divided the forward pass and
backward pass into time steps, which are further split into computation and
communication phases.
During the forward pass, neurons
first calculate the weighted sum of inputs, apply the activation function, and
produce an output y (computation phase). The computed outputs y, stored in GPU
memory, are then transferred to peer GPU(s) using Remote Direct Memory Access
(RDMA) (communication phase).
During the backward pass, the
backpropagation algorithm computes the model error (computation phase) and
propagates it backward across GPUs using RDMA (communication phase). This
process was explained in detail in Chapter 2.
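The computation and communication phases of a single stage can be sketched as follows. This is a simplified illustration, assuming one process per GPU and an already initialized torch.distributed NCCL process group; the point-to-point send stands in for the RDMA transfer (NCCL can use GPUDirect RDMA when the fabric supports it), and the function names, ranks, and shapes are assumptions rather than part of the figures.

import torch
import torch.distributed as dist

# Assumes one process per GPU and that dist.init_process_group("nccl", ...)
# has already been called.

def stage_forward(x, weight, bias, next_rank=None):
    """One pipeline stage during the forward pass."""
    # Computation phase: weighted sum of the inputs plus bias, then activation.
    y = torch.relu(x @ weight + bias)
    # Communication phase: transfer the activation y to the peer GPU, if any.
    if next_rank is not None:
        dist.send(y, dst=next_rank)
    return y

def stage_backward(local_error, prev_rank=None):
    """One pipeline stage during the backward pass."""
    # The computation phase (gradients from the error) is omitted for brevity;
    # the communication phase propagates the error to the previous stage.
    if prev_rank is not None:
        dist.send(local_error, dst=prev_rank)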
Note 8-2: In our example, Hidden Layer 1 fits
entirely on GPU A1. The same applies to other layers—they each fit within a
single GPU. However, if the input dataset is too large to fit into a single
GPU, it must be split across multiple GPUs. In that case, Hidden Layer 1 will
be distributed across multiple GPUs, with each GPU handling a different portion
of the dataset. When this happens, the gradients of Hidden Layer 1 must be
synchronized across all GPUs that store part of the layer.
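As a minimal sketch of the situation described in Note 8-2, the snippet below replicates Hidden Layer 1 on two GPUs and gives each replica half of the input batch. The device names and sizes are assumptions for illustration; after the backward pass, the gradients of the two replicas must be synchronized as the note states.

import torch
import torch.nn as nn

# Hypothetical setup: Hidden Layer 1 is replicated on two GPUs because the
# input batch is too large for one GPU. Each replica sees half of the batch.
layer1_replica_a = nn.Linear(8, 16).to("cuda:0")             # replica on one GPU
layer1_replica_b = nn.Linear(8, 16).to("cuda:1")             # replica on another GPU
layer1_replica_b.load_state_dict(layer1_replica_a.state_dict())  # identical weights

x = torch.randn(256, 8)                  # full input batch
x_a, x_b = x.chunk(2)                    # split the batch between the two GPUs

y_a = torch.relu(layer1_replica_a(x_a.to("cuda:0")))
y_b = torch.relu(layer1_replica_b(x_b.to("cuda:1")))

# After the backward pass, the gradients of the two replicas differ because
# each replica saw different data; they must be synchronized (averaged)
# across all GPUs holding the layer, as described in the note above.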
Time step 1:
Computing:
· A1 processes the input x1 and produces the output y1.
Communication:
· A1 transports y1 to A2.
Active GPUs (25%): A1
Figure 8-3: Model Parallelism with Pipeline Parallelism – Time Step 1.
Time step 2:
Computing:
· A1 processes the input x2 and produces the output y2.
· A2 processes the input y1 and produces the output y1.
Communication:
· A1 transports y2 to A2.
· A2 transports y1 to B1.
Active GPUs (50%): A1, A2
Figure 8-4: Model Parallelism with Pipeline Parallelism – Time Step 2.
Time step 3:
Computing:
· A1 processes the input x3 and produces the output y3.
· A2 processes the input y2 and produces the output y2.
· B1 processes the input y1 and produces the output y1.
Communication:
· A1 transports y3 to A2.
· A2 transports y2 to B1.
· B1 transports y1 to B2.
Active GPUs (75%): A1, A2, B1
Figure 8-5: Model Parallelism with Pipeline Parallelism – Time Step 3.
Time step 4:
Computing:
· A1 processes the input x4 and produces the output y4.
· A2 processes the input y3 and produces the output y3.
· B1 processes the input y2 and produces the output y2.
· B2 processes the input y1 and produces the model output 1.
Communication:
· A1 transports y4 to A2.
· A2 transports y3 to B1.
· B1 transports y2 to B2.
Active GPUs (100%): A1, A2, B1, B2
Figure 8-6: Model Parallelism with Pipeline Parallelism – Time Step 4.
Time step 5:
Computing:
· A2 processes the input y4 and produces the output y4.
· B1 processes the input y3 and produces the output y3.
· B2 processes the input y2 and produces the model output 2.
· B2 computes the local neuron error E1 and gradient G1.
Communication:
· A2 transports y4 to B1.
· B1 transports y3 to B2.
· B2 transports error E1 to B1.
Active GPUs (75%): A2, B1, B2
The notation x3 above G1 on GPU B2 indicates that the algorithm computes a gradient from the error for every weight associated with the inputs, including the bias. This process is repeated for all four micro-batches. The same notation is used in the upcoming figures.
Figure 8-7: Model Parallelism with Pipeline Parallelism – Time Step 5.
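To make the preceding note concrete, here is a tiny numeric illustration of computing one gradient per weight from a local error. The input values and the error are invented, and the activation derivative is assumed to be already folded into the local error term.

import torch

# Hypothetical values: three inputs to a neuron plus its local error term.
inputs = torch.tensor([0.5, -1.2, 0.8])   # inputs arriving at the neuron
local_error = torch.tensor(0.3)           # local error E computed for this neuron

# One gradient per weight: the error scaled by the input that weight multiplies.
weight_grads = local_error * inputs        # gradients for w1, w2, w3
bias_grad = local_error * 1.0              # the bias acts like a weight with input 1

print(weight_grads, bias_grad)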
Time step 6:
Computing:
· B1 processes the input y4 and produces the output y4.
· B2 processes the input y3 and produces the model output 3.
· B2 computes the local neuron error E2 and gradient G2.
· B1 computes the local neuron error E1 and gradient G1.
Communication:
· B1 transports y4 to B2.
· B2 transports error E2 to B1.
Active GPUs (50%): B1, B2
Figure 8-8: Model Parallelism with Pipeline Parallelism – Time Step 6.
Time step 7:
Computing:
· B2 processes the input y4 and produces the model output 4.
· B2 computes the local neuron error E3 and gradient G3.
· B1 computes the local neuron error E2 and gradient G2.
· A2 computes the local neuron error E1 and gradient G1.
Communication:
· B2 transports error E3 to B1.
· B1 transports error E2 to A2.
· A2 transports error E1 to A1.
Active GPUs (75%): A2, B1, B2
Figure 8-9: Model Parallelism with Pipeline Parallelism – Time Step 7.
Time step 8:
Computing:
· B2 computes the local neuron error E4 and gradient G4.
· B1 computes the local neuron error E3 and gradient G3.
· A2 computes the local neuron error E2 and gradient G2.
· A1 computes the local neuron error E1 and gradient G1.
Communication:
· B2 transports error E4 to B1.
· B1 transports error E3 to A2.
· A2 transports error E2 to A1.
Active GPUs (100%): A1, A2, B1, B2
Figure 8-10: Model Parallelism with Pipeline Parallelism – Time Step 8.
Time step 9:
Computing:
· B1 computes the local neuron error E4 and gradient G4.
· A2 computes the local neuron error E3 and gradient G3.
· A1 computes the local neuron error E2 and gradient G2.
Communication:
· B1 transports error E4 to A2.
· A2 transports error E3 to A1.
Active GPUs (75%): A1, A2, B1
Figure 8-11: Model Parallelism with Pipeline Parallelism – Time Step 9.
Time step 10:
Computing:
· A2 computes the local neuron error E4 and gradient G4.
· A1 computes the local neuron error E3 and gradient G3.
Communication:
· A2 transports error E4 to A1.
Active GPUs (50%): A1, A2
Figure 8-12: Model Parallelism with Pipeline Parallelism – Time Step 10.
Time step 11:
Computing:
· A1 computes the local neuron error E4 and gradient G4.
Communication:
· None; A1 is the first stage, so the error is not propagated further.
Active GPUs (25%): A1
In our example, the micro-batches fit into a single GPU, so we do not need to split them across multiple GPUs. Once GPU A1 has computed the gradients for the last micro-batch, the weights are adjusted, and the second iteration of the forward pass begins.
Figure 8-13: Model Parallelism with Pipeline Parallelism – Time Step 11.
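The utilization figures in the walkthrough follow directly from the schedule: micro-batch m reaches forward stage s at time step m + s - 1, and its error reaches stage s on the way back at time step m + 2S - s, where S = 4 is the number of stages. The short Python sketch below simply replays that schedule and reproduces the active-GPU percentages listed above; it is an illustrative simulation rather than real training code.

GPUS = ["A1", "A2", "B1", "B2"]          # pipeline stages 1..4
MICRO_BATCHES = 4

def active_gpus(step):
    """GPUs busy at a given time step in the schedule of Figures 8-3 to 8-13."""
    busy = set()
    for s, gpu in enumerate(GPUS, start=1):
        for m in range(1, MICRO_BATCHES + 1):
            if step == m + s - 1:                    # forward: micro-batch m at stage s
                busy.add(gpu)
            if step == m + 2 * len(GPUS) - s:        # backward: error Em reaches stage s
                busy.add(gpu)
    return sorted(busy)

for step in range(1, 12):
    busy = active_gpus(step)
    util = 100 * len(busy) // len(GPUS)
    print(f"Time step {step:2d}: active GPUs ({util:3d}%): {', '.join(busy)}")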
If the training dataset is too large for a single GPU and must be split across multiple GPUs, the layers must also be shared between GPUs. For example, hidden layer 1 is placed on GPUs A1 and C1, while hidden layer 2 is placed on GPUs A2 and C2. This requires intra-layer gradient synchronization between the GPUs sharing the same layer, resulting in inter-GPU packet transport. Figure 8-14 illustrates how the gradients are first synchronized (intra-layer). Then, each GPU averages the gradients (the sum of the gradients divided by the number of GPUs). Finally, the averaged gradients are synchronized.
Figure 8-14: Model Parallelism with Pipeline Parallelism – Synchronization.
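A common way to implement the intra-layer synchronization described above is an all-reduce across the GPUs that hold the same layer, followed by a division by the group size. The sketch below shows this with torch.distributed; the function name, the group containing A1 and C1, and the assumption that an NCCL process group has already been initialized are all illustrative, not a prescribed implementation.

import torch.distributed as dist

def sync_layer_gradients(layer, group, group_size):
    """Average gradients across the GPUs that hold a replica of the same layer.

    `group` would contain, for example, the ranks of GPUs A1 and C1, which
    both hold Hidden Layer 1's weights but process different parts of the data.
    Assumes dist.init_process_group("nccl", ...) has already been called.
    """
    for param in layer.parameters():
        if param.grad is None:
            continue
        # Sum the gradients from every GPU in the group ...
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, group=group)
        # ... then divide by the number of GPUs to get the average.
        param.grad /= group_size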