Introduction
The previous chapter explained the operation of a single artificial neuron. It covered how input values are multiplied by their respective weight parameters, summed together, and combined with a bias term. The resulting value, z, is then passed through a non-linear sigmoid function, which squeezes the neuron's output value ŷ between 0 and 1.
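As a quick refresher, the sketch below shows these steps in code. It is a minimal illustration only; the input, weight, and bias values are made up for the example and do not come from the figures in this chapter.

```python
import math

def sigmoid(z):
    # Non-linear activation that squeezes z into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def neuron_forward(inputs, weights, bias):
    # Weighted sum of the inputs plus the bias term (pre-activation value z)
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Activation: the neuron's output y_hat
    return sigmoid(z)

# Hypothetical input, weight, and bias values for illustration only
y_hat = neuron_forward(inputs=[0.2, 0.5, 0.1], weights=[0.4, 0.3, 0.9], bias=0.1)
print(y_hat)  # a value between 0 and 1
```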
In this chapter, we form the smallest possible Feed Forward Neural Network (FFNN) model using only two neurons. While this is far from a Deep Neural Network (DNN), a simple NN with two neurons is sufficient to explain the Backpropagation algorithm, which is the focus of this chapter.
The goal is to demonstrate the training process and illustrate how the Forward Pass (computation phase) first generates a model output, ŷ. The algorithm then evaluates the model's accuracy by computing the error term using Mean Squared Error (MSE). The first training iteration rarely, if ever, produces a perfect output. To gradually bring the training result closer to the expected value, the Backward Pass (adjustment and communication phase) calculates the magnitude and direction by which the weight values should be adjusted. The Backward Pass is repeated as many times as necessary until an acceptable model output is achieved. We are using a supervised training process with a pre-labeled training dataset, although it is not shown in Figure 2-1. Chapter Three covers training datasets in detail.
After the training process is completed, we use a test dataset to evaluate the model's performance. The test dataset also contains input data and labels, but these labels are not used during training; they serve only as the reference against which the trained model's predictions are measured. At this stage, we check how well the predictions from the training and test phases align. When the model produces the expected results on the test dataset, it can be taken into production.
Forward Pass
Figure 2-1 illustrates how neuron-a computes a weighted sum from three input values, adds a bias term, and produces a pre-activation value za. This value is then passed through the Sigmoid activation function. The output ŷa from neuron-a serves as an input for neuron-b, which processes it and generates the final model output ŷb. Since these computational steps were covered in detail in Chapter 1, we will not repeat them here.
As the final step of the Forward Pass, we apply the error function E to the model output. The error function measures how far the model output ŷb is from the expected value y. We use the Mean Squared Error (MSE), which is computed by subtracting the expected value from the model's output, squaring the result, and multiplying it by 0.5 (or equivalently, dividing by two).
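A quick numeric check of this formula, using the values that appear later in this chapter (model output ŷb = 1.7, expected value y = 1.0), is sketched below.

```python
def mse(y_hat, y):
    # 0.5 * (prediction - expected)^2, the single-sample MSE used in this chapter
    return 0.5 * (y_hat - y) ** 2

# Values from the chapter's example: model output 1.7, expected value 1.0
error = mse(1.7, 1.0)
print(error)  # 0.245, the error term shown in the error-space discussion below
```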
Figure 2-1: An Overview of a Complete Forward Pass Process.
On the right side of Figure 2-2, we have a two-dimensional error space. In this space, a symmetric parabolic curve visualizes the error function. The curve is centered at the expected value, which is 1.0 in our example. The horizontal axis represents the model output ŷ, and the vertical axis represents the error. For instance, if the model prediction is 1.7, you can draw a vertical line from this point on the horizontal axis up to the parabolic curve. In our case, this intersection shows an error term of 0.245. In real-life scenarios, the error landscape often has many peaks and valleys rather than a simple symmetric curve.
The Mean Squared Error (MSE) is a loss function that measures the difference between the predicted values and the expected values. It provides an overall error value for the model, also called the loss or cost, which indicates how far off the predictions are.
Next, the gradient is computed by taking the derivative of the loss function with respect to the model's weights. This gradient shows both the direction and the magnitude of the steepest increase in error. During the Backward Pass, the algorithm calculates the gradient for each weight. By moving in the opposite direction of the gradient (using a method called Gradient Descent), the algorithm adjusts the weights to reduce the loss. This process is repeated many times so that the model output gradually becomes closer to the expected value.
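The sketch below illustrates this update rule in code. It is a minimal, generic Gradient Descent step, not the exact computation used in this chapter's example; the gradient and learning-rate values are placeholders.

```python
def gradient_descent_step(weights, gradients, learning_rate):
    # Move each weight in the opposite direction of its gradient,
    # scaled by the learning rate (eta)
    return [w - learning_rate * g for w, g in zip(weights, gradients)]

# Placeholder values for illustration only
weights = [0.1, 0.4]
gradients = [0.35, 0.70]   # hypothetical gradients, one per weight
eta = 0.01                 # hypothetical learning rate

for _ in range(3):
    # In a real training loop, the gradients are recomputed by a new
    # Forward and Backward Pass on every iteration
    weights = gradient_descent_step(weights, gradients, eta)
print(weights)
```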
The following sections will cover the processes and computations performed during the Backward Pass.
Learning Rate
Figure 2-3: Learning Rate.
Backward Pass
The Forward Pass produces the model output ŷ, which is then used to compute the model error E. The closer ŷ is to the expected value y, the smaller the error, indicating better model performance. The purpose of the Backward Pass, as part of the Backpropagation algorithm, is to adjust the model’s weight parameters during training in a direction that gradually moves the model’s predictions closer to the expected values y.
In Figure 2-4, the model’s output ŷb depends on the weighted sum zb of neuron-b. This weighted sum zb, in turn, is calculated by multiplying the input value ŷa by its associated weight wb1. The same process applies to neuron-a. The Backpropagation algorithm cannot directly modify the results of an activation function or the weighted sum itself. Nor can it alter the input values directly. Instead, it calculates weight adjustments, which are then used to update the model’s weights.
Figure 2-4 illustrates this dependency chain and provides a high-level overview of how the Backpropagation algorithm determines weight adjustments. The following sections will explain this process in detail.
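To make the dependency chain concrete, the chain rule expresses the gradient of the error with respect to a weight as a product of local derivatives. Written in the same notation as the rest of this chapter, and assuming the structure shown in Figure 2-4, the gradient for neuron-b's weight wb1 is:

∂E/∂wb1 = ∂E/∂ŷb ⋅ ∂ŷb/∂zb ⋅ ∂zb/∂wb1

The first factor is the derivative of the error function (MSE′), the second is the derivative of neuron-b's activation function, and the third is simply the input ŷa, because zb is the product of wb1 and ŷa plus the bias. The following sections compute these factors step by step.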
Figure 2-4: Backpropagation Overview: Backward Pass Dependency Chain.
Figure 2-5: The Backward Pass Overview.
Partial Derivative for Error Function – Output Error (Gradient)
The goal of training a model is to minimize the error, meaning we want ŷb (the model's prediction/output) to get as close as possible to y (the expected value).
After computing the error E = 0.245, we compute the partial derivative of the error function with respect to ŷb = 1.7, which shows how small changes in ŷb affect the error E. A derivative is called partial when one of its input values is held constant (i.e., not adjusted by the algorithm). In our example, the expected value y is the constant input. The result of the partial derivative of the error function indicates how the predicted output ŷb should change to minimize the model’s error.
We use the following formula for computing the derivative of the error function:
MSE′ = −(y − ŷb)
MSE′ = −(1.0 − 1.7)
MSE′ = 0.7
Since the model output ŷb = 1.7 is too high, the positive gradient indicates that it should be lowered by 0.7, which is the derivative of the error function (MSE′). This makes perfect sense because subtracting the MSE derivative of 0.7 from the model output ŷb = 1.7 gives 1.0, which matches the expected value.
Partial Derivative for the Activation Function
Error Term for Neurons
The error term for neuron-b is calculated by multiplying the partial derivative of the error function, MSE′ = 0.7, by the derivative of the neuron's activation function, f′(b) = 1.0. This means we propagate the model's error backward, using it as a base value for fine-tuning the model's accuracy (i.e., refining the weight values). This is why the term Backward Pass fits the process perfectly.
Error term (Enb) for neuron-b = MSE′ ⋅ f′(b) = 0.7 ⋅ 1.0 = 0.7
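The same computation in code, using the values quoted above (MSE′ = 0.7, and f′(b) = 1.0 taken as given from the figure):

```python
def neuron_error_term(loss_derivative, activation_derivative):
    # Error term = derivative of the loss w.r.t. the neuron's output
    # multiplied by the derivative of its activation function
    return loss_derivative * activation_derivative

# Values from the chapter's example; f'(b) = 1.0 is taken as given
error_term_b = neuron_error_term(0.7, 1.0)
print(error_term_b)  # 0.7
```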
Weight Adjustment Value
Weight Adjustment
The weight adjustment value is computed by multiplying the gradient (averaged in our example) by the learning rate η. We use a learning rate of 0.012, which results in a weight adjustment of 0.042 for weight wa1 and 0.105 for weight wb1.
The weight adjustment values are then subtracted from the initial weights. This yields an updated weight of 0.058 (0.1−0.042) for wa1 and 0.295 (0.4−0.105) for wb1.
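A minimal sketch of this final update step, taking the adjustment values quoted above as given (the gradients behind them come from the figures and are not re-derived here):

```python
def apply_weight_adjustments(weights, adjustments):
    # New weight = old weight - adjustment,
    # where adjustment = learning rate * (averaged) gradient
    return {name: w - adjustments[name] for name, w in weights.items()}

# Initial weights and adjustment values from the chapter's example
weights = {"wa1": 0.1, "wb1": 0.4}
adjustments = {"wa1": 0.042, "wb1": 0.105}

updated = apply_weight_adjustments(weights, adjustments)
print(updated)  # wa1 ≈ 0.058, wb1 ≈ 0.295 (up to floating-point rounding)
```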
The Second Iteration - Forward Pass
Network Impact
Figure 2-13: Inter-GPU Communication.