Sunday, 2 March 2025

Training Neural Networks: Backpropagation Algorithm

Introduction


The previous chapter explained the operation of a single artificial neuron. It covered how input values are multiplied by their respective weight parameters, summed together, and combined with a bias term. The resulting value, z, is then passed through a non-linear sigmoid function, which squeezes the neuron's output value ŷ between 0 and 1.

In this chapter, we form the smallest possible Feed Forward Neural Network (FFNN) model using only two neurons. While this is far from a Deep Neural Network (DNN), a simple NN with two neurons is sufficient to explain the Backpropagation algorithm, which is the focus of this chapter.

The goal is to demonstrate the training process and illustrate how the Forward Pass (computation phase) first generates a model output, ŷ. The algorithm then evaluates the model's accuracy by computing the error term using Mean Squared Error (MSE). The first training iteration rarely, if ever, produces a perfect output. To gradually bring the training result closer to the expected value, the Backward Pass (adjustment and communication phase) calculates the magnitude and direction by which the weight values should be adjusted. The Backward Pass is repeated as many times as necessary until an acceptable model output is achieved. We are using a supervised training process with a pre-labeled training dataset, although it is not shown in Figure 2-1. Chapter Three covers training datasets in detail.

After the training process, we use a test dataset to evaluate how well the model performs on data it has not seen during training. At this phase, we measure how well the predictions from the training and test phases align. When the model produces the expected results on the test dataset, it can be taken into production.

Forward Pass


Figure 2-1 illustrates how neuron-a computes a weighted sum from three input values, adds a bias term, and produces a pre-activation value za. This value is then passed through the activation function. The output ŷa from neuron-a serves as an input for neuron-b, which processes it and generates the final model output ŷb. Since these computational steps were covered in detail in Chapter 1, we will not repeat them here.

As the final step of the Forward Pass, we apply the error function E to the model output. The error function measures how far the model output ŷb is from the expected value y. We use the Mean Squared Error (MSE), which is computed by subtracting the expected value from the model's output, squaring the result, and multiplying it by 0.5 (or equivalently, dividing by two).
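The MSE computation described above can be sketched in a few lines of Python (the function name is ours, not from the chapter):

```python
def mse_error(y_expected, y_pred):
    """Half mean squared error for a single output: E = 0.5 * (y - y_hat)^2."""
    return 0.5 * (y_expected - y_pred) ** 2

# Values from the running example: expected value 1.0, model output 1.7.
print(round(mse_error(1.0, 1.7), 3))  # 0.245
```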


Figure 2-1: An Overview of a Complete Forward Pass Process.


On the right side of Figure 2-2, we have a two-dimensional error space. In this space, a symmetric parabolic curve visualizes the error function. The curve is centered at the expected value, which is 1.0 in our example. The horizontal axis represents the model output, ŷ, and the vertical axis represents the error. For instance, if the model prediction is 1.7, you can draw a vertical line from this point on the horizontal axis to meet the parabolic curve. In our case, this intersection shows an error term of 0.245. In real-life scenarios, the error landscape often has many peaks and valleys rather than a simple symmetric curve.

The Mean Squared Error (MSE) is a loss function that measures the difference between the predicted values and the expected values. It provides an overall error value for the model, also called the loss or cost, which indicates how far off the predictions are.

Next, the gradient is computed by taking the derivative of the loss function with respect to the model's weights. This gradient shows both the direction and the magnitude of the steepest increase in error. During the Backward Pass, the algorithm calculates the gradient for each weight. By moving in the opposite direction of the gradient (using a method called Gradient Descent), the algorithm adjusts the weights to reduce the loss. This process is repeated many times so that the model output gradually becomes closer to the expected value. 
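Gradient Descent in this one-dimensional error space can be sketched in a few lines; the starting values mirror the running example, and the learning rate of 0.5 is an assumed value for illustration:

```python
# Gradient descent on the error curve E(y_hat) = 0.5 * (y - y_hat)^2.
# The gradient dE/dy_hat = -(y - y_hat); stepping against it moves y_hat toward y.
y = 1.0        # expected value
y_hat = 1.7    # initial model output
eta = 0.5      # learning rate (assumed for illustration)
for _ in range(10):
    grad = -(y - y_hat)   # direction of steepest error increase
    y_hat -= eta * grad   # move in the opposite direction
print(round(y_hat, 4))    # approaches the expected value 1.0
```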

The following sections will cover the processes and computations performed during the Backward Pass.



Figure 2-2: Mean Square Error.

Learning Rate


Besides determining the direction in which the error should be reduced, the process also needs to know the size of each adjustment step. This is defined by the Learning Rate. The Learning Rate value affects how much the weights are adjusted in response to the gradient during each iteration of the Backward Pass. A small Learning Rate leads to small, gradual changes, which may result in slower training but more stable convergence. On the other hand, a large Learning Rate can speed up training by making larger adjustments, yet it might overshoot the optimal values and cause instability. Therefore, choosing the right Learning Rate is crucial for effective and efficient training. This is illustrated in Figure 2-3. We will get back to the Learning Rate in the Backward Pass section.
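The trade-off can be demonstrated with a quick sketch; all three learning-rate values below are illustrative:

```python
def descend(eta, steps=20, y=1.0, y_hat=1.7):
    """Run gradient descent on E = 0.5 * (y - y_hat)^2 with learning rate eta."""
    for _ in range(steps):
        y_hat -= eta * -(y - y_hat)
    return y_hat

print(descend(0.1))  # small steps: slow but stable convergence toward 1.0
print(descend(1.9))  # large steps: oscillates around the optimum
print(descend(2.5))  # too large: each step overshoots further, and it diverges
```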


Figure 2-3: Learning Rate.

Backward Pass


The Forward Pass produces the model output ŷ, which is then used to compute the model error E. The closer ŷ is to the expected value y, the smaller the error, indicating better model performance. The purpose of the Backward Pass, as part of the Backpropagation algorithm, is to adjust the model’s weight parameters during training in a direction that gradually moves the model’s predictions closer to the expected values y.

In Figure 2-4, the model's output ŷb depends on the weighted sum zb of neuron-b. This weighted sum, in turn, is influenced by the bias and by the input value ŷa, the output of neuron-a, together with their associated weights. The Backpropagation algorithm cannot directly modify the result of an activation function or the weighted sum itself. Nor can it alter the input values directly. Instead, it calculates weight adjustment values, which are then used to update the model's weights.

Figure 2-4 illustrates this dependency chain and provides a high-level overview of how the Backpropagation algorithm determines weight adjustments. The following sections will explain this process in detail.




The backpropagation algorithm computes the partial derivative of the error function e with respect to the weight parameters wn using function composition and the chain rule. In terms of Neuron-a's weight w1, the error function is:

e(w1) = e(b(zb(a(za(w1)))))

where:

a is the output of the activation function of Neuron-a,
b is the output of the activation function of Neuron-b,
za and zb represent the weighted sums of Neuron-a and Neuron-b, respectively.

Using function composition, this can be rewritten as:

e(w1) = (e ∘ b ∘ zb ∘ a ∘ za)(w1)

To compute how the error function changes with respect to changes of w1, we apply the chain rule, giving:

∂e/∂w1 = (∂e/∂b) ⋅ (∂b/∂zb) ⋅ (∂zb/∂a) ⋅ (∂a/∂za) ⋅ (∂za/∂w1)

This expression shows how the error gradient propagates backward through the network. Each term represents a partial derivative, capturing how small changes in one variable influence the next, ultimately determining how w₁ should be updated during training. The upcoming sections describe these complex formulas using flowcharts that will hopefully make learning easier.
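The chain-rule expression can be verified numerically. The sketch below assumes ReLU activations for both neurons (neuron-b's ReLU is given later in the chapter; treating neuron-a the same is our simplification), a single input, and illustrative values for the input and weights; biases are omitted for brevity:

```python
def relu(z):
    return max(0.0, z)

def relu_d(z):
    return 1.0 if z > 0 else 0.0

def forward(w1, x=1.0, wb1=0.4, y=1.0):
    """Compute the error e(w1) through the chain e(b(zb(a(za(w1)))))."""
    za = w1 * x            # neuron-a weighted sum
    a = relu(za)           # neuron-a output
    zb = wb1 * a           # neuron-b weighted sum
    b = relu(zb)           # neuron-b output, i.e., the model prediction
    return 0.5 * (y - b) ** 2

w1, x, wb1, y = 0.8, 1.0, 0.4, 1.0
za = w1 * x; a = relu(za); zb = wb1 * a; b = relu(zb)

# Chain rule: de/dw1 = de/db * db/dzb * dzb/da * da/dza * dza/dw1
analytic = -(y - b) * relu_d(zb) * wb1 * relu_d(za) * x

# Finite-difference check of the same derivative
h = 1e-6
numeric = (forward(w1 + h) - forward(w1 - h)) / (2 * h)
print(round(analytic, 6), round(numeric, 6))  # the two values agree
```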

The somewhat crowded Figure 2-5 illustrates the components of the backpropagation algorithm, along with their relationships and dependencies. The figure consists of three main blocks. The rightmost block depicts the calculation of the error function. The middle and left blocks outline the steps for defining and adjusting new weight values. The complete backward pass process is explained next in detail, one step at a time.




Figure 2-5: The Backward Pass Overview.


Partial Derivative for Error Function – Output Error (Gradient)

The goal of training a model is to minimize the error, meaning we want ŷb (the model's prediction/output) to get as close as possible to y (the expected value).

After computing the error E = 0.245, we compute the partial derivative of the error function with respect to ŷb = 1.7, which shows how small changes in ŷb affect the error E. A derivative is called partial when the function has several inputs and all but one are held constant (i.e., not adjusted by the algorithm). In our example, the expected value y is the constant input. The result of the partial derivative of the error function indicates how the predicted output ŷb should change to minimize the model's error.

We use the following formula for computing the derivative of the error function:

MSE′ = −(y − ŷb)

MSE′ = −(1.0 − 1.7)

MSE′ = 0.7

Since the model output ŷb = 1.7 is too high, the positive gradient indicates that it should be lowered; its magnitude, 0.7, is the derivative of the error function (MSE′). This makes sense because subtracting 0.7 from the model output ŷb = 1.7 yields 1.0, which matches the expected value.
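The same calculation in code form, as a direct transcription of the formula above:

```python
def mse_derivative(y_expected, y_pred):
    """Partial derivative of E = 0.5 * (y - y_hat)^2 with respect to y_hat."""
    return -(y_expected - y_pred)

grad = mse_derivative(1.0, 1.7)
print(round(grad, 3))  # 0.7 -> positive gradient: the output should be lowered
```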



Figure 2-6: The Backward Pass – Derivative of the Error Function.


Partial Derivative for the Activation Function


After computing the output error, we calculate the derivative of the activation function f(b) with respect to zb. Neuron-b uses a ReLU activation function, whose derivative is 1 when its input is greater than 0 and 0 otherwise. In our case, the pre-activation zb = 1.7 is positive (and so f(b) = 1.7), so the derivative is 1.
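In code, the ReLU function and its derivative look like this:

```python
def relu(z):
    """ReLU passes positive values through and clamps negatives to zero."""
    return max(0.0, z)

def relu_derivative(z):
    """Slope 1 on the positive side, 0 elsewhere."""
    return 1.0 if z > 0 else 0.0

zb = 1.7  # neuron-b's pre-activation value in the running example
print(relu(zb), relu_derivative(zb))  # 1.7 1.0
```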



Error Term for Neurons (Gradient)


The error term (Gradient) for neuron-b is calculated by multiplying the partial derivative of the error function, MSE′ = 0.7, by the derivative of the neuron's activation function, f′(b) = 1.0. In this way, the model's error is propagated backward and used as a base value for refining the new weight values, which is why the term Backward Pass fits the process so well.

E-term for Neuron-b = MSE′ ⋅ f′(b) = 0.7 ⋅ 1 = 0.7




Figure 2-7: The Backward Pass – Error Term (Gradient) for Neuron-b.


After computing the error term for Neuron-b, the backward pass moves to the preceding layer, the hidden layer, and calculates the error term for Neuron-a. The algorithm computes the derivative of the activation function, f′(a) = 1, just as it did for Neuron-b. Next, it multiplies the result by Neuron-b's error term (0.7) and the connecting weight parameter wb1 = 0.4. The result, 0.28, is the error term for Neuron-a.
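Both error terms can be computed directly from the values given above:

```python
# Error terms (gradients) for the running example.
mse_grad = 0.7     # derivative of the error function
f_prime_b = 1.0    # ReLU derivative at neuron-b
f_prime_a = 1.0    # ReLU derivative at neuron-a
wb1 = 0.4          # weight connecting neuron-a's output to neuron-b

error_b = mse_grad * f_prime_b        # error term for neuron-b
error_a = f_prime_a * error_b * wb1   # error term for neuron-a
print(round(error_b, 2), round(error_a, 2))  # 0.7 0.28
```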



Figure 2-8: The Backward Pass – Error Term (Gradient) for Neuron-a.

Weight Adjustment Value


After computing error terms for all neurons in every layer, the algorithm simultaneously calculates the weight adjustment value for each weight. The process is straightforward: the error term is multiplied by the input value connected to the weight and the learning rate (η). The learning rate balances convergence speed and training stability. We have set it to 0.5 for the first iteration.

The learning rate is a hyperparameter, meaning it is set by the user rather than learned by the model during training. It affects the behavior of the backpropagation algorithm by controlling the size of weight updates. It is also possible to adjust the learning rate during training—starting with a higher value to allow faster convergence and lowering it later to prevent overshooting the optimal result.
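The adjustment formula is delta_w = eta * error_term * input. In the sketch below, the error terms and learning rate come from the text, but the input values are assumptions for illustration; the chapter's figures carry the actual numbers:

```python
eta = 0.5        # learning rate for the first iteration
error_b = 0.7    # neuron-b's error term
error_a = 0.28   # neuron-a's error term
y_hat_a = 1.5    # assumed: neuron-a's output, the input connected to wb1
x1 = 0.9         # assumed: the input value connected to wa1

delta_wb1 = eta * error_b * y_hat_a   # adjustment for neuron-b's weight
delta_wa1 = eta * error_a * x1        # adjustment for neuron-a's weight
print(round(delta_wb1, 3), round(delta_wa1, 3))
```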

Weight adjustment values for neuron-b and neuron-a respectively:




Figure 2-9: The Backward Pass – Weight Adjustment Value for Neurons.

Refine Weights 


As the final step, the backpropagation algorithm updates each weight parameter in the model by subtracting the weight adjustment value from the initial weight.
New weight values are computed for wb1 and wa1, respectively.
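The update rule itself is a one-liner; the old weights and adjustment values below are assumed for illustration, not taken from the chapter's figures:

```python
def update_weight(w_old, delta_w):
    """Gradient-descent update: subtract the adjustment from the old weight."""
    return w_old - delta_w

# Assumed example values.
print(round(update_weight(1.1, 0.525), 3))  # new wb1
print(round(update_weight(0.8, 0.126), 3))  # new wa1
```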



Figure 2-10: The Backward Pass – Compute New Weight Values.

The Second Iteration - Forward Pass


After updating all the weight values, including those associated with biases, the backpropagation process begins the second iteration of the forward pass. As shown in Figure 2-11, the model output ŷb = 1.119 is very close to the expected value y = 1.0. The new MSE = 0.007 is significantly lower than the initial MSE = 0.245 computed in the first iteration.



Figure 2-11: The Second Iteration of the Forward Pass.


In Figure 2-12, we have two 2-dimensional error spaces. Using the initial weight values, the model output is 1.7, resulting in an MSE of 0.245. After adjusting the weights, the model prediction is 1.119, reducing the MSE to 0.007.
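The whole loop, forward pass, backward pass, and weight update, can be condensed into a short sketch. The input, initial weights, and biases are assumed values (the chapter's figures use different ones), and the learning rate is set to 0.1 here for smooth convergence:

```python
relu = lambda z: max(0.0, z)
relu_d = lambda z: 1.0 if z > 0 else 0.0

x, y, eta = 1.0, 1.0, 0.1   # input, expected value, learning rate (assumed)
wa, ba = 1.2, 0.1           # neuron-a weight and bias (assumed)
wb, bb = 1.1, 0.1           # neuron-b weight and bias (assumed)

for i in range(3):
    # Forward pass
    za = wa * x + ba; a = relu(za)
    zb = wb * a + bb; b = relu(zb)
    error = 0.5 * (y - b) ** 2
    # Backward pass: error terms
    err_b = -(y - b) * relu_d(zb)
    err_a = err_b * wb * relu_d(za)
    # Weight and bias updates
    wb -= eta * err_b * a; bb -= eta * err_b
    wa -= eta * err_a * x; ba -= eta * err_a
    print(i, round(error, 4))  # the error shrinks each iteration
```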



Figure 2-12: Results Comparison.


Network Impact


In Figure 2-13, we have a fully connected Feed Forward Neural Network (FFNN) with four layers: an input layer, two hidden layers, and an output layer. The training dataset is split into two batches, A and B, which are processed by GPU-A and GPU-B.

After computing a model prediction during the forward pass, the backpropagation algorithm begins the backward pass by calculating the gradient (output error) of the error function. Once computed, the gradients are synchronized between the GPUs and averaged, and the process moves to the preceding layer. Neurons in the preceding layer calculate their gradients by multiplying the weighted sum of their connected neurons' averaged gradients by the partial derivative of the local activation function. These neuron-level gradients are then synchronized across the GPU interconnect and averaged before the process moves on to the next preceding layer. The backpropagation algorithm repeats the same process through all layers.
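The gradient-averaging step can be simulated in a few lines; each list stands in for the gradients one GPU computed on its own batch (the values are illustrative), and the element-wise mean plays the role of the all-reduce synchronization:

```python
def average_gradients(grads_gpu_a, grads_gpu_b):
    """Element-wise mean of two workers' gradients (what an all-reduce yields)."""
    return [(ga + gb) / 2 for ga, gb in zip(grads_gpu_a, grads_gpu_b)]

grads_a = [0.7, 0.28]   # gradients from batch A (illustrative values)
grads_b = [0.5, 0.20]   # gradients from batch B (illustrative values)
print([round(g, 3) for g in average_gradients(grads_a, grads_b)])  # [0.6, 0.24]
```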

If packet loss occurs during synchronization, it can ruin the entire training process, which would need to be restarted unless snapshots were taken. The cost of losing even a single packet can be enormous, especially if training has been running for days or weeks. Why is a single packet so important? If the synchronization between the gradients of two parallel neurons fails due to packet loss, the algorithm cannot compute the average, and the neurons in the preceding layer cannot calculate their gradients. In addition, if the connection, whether synchronization happens over NVLink, InfiniBand, Ethernet (RoCE or RoCEv2), or a wireless link, introduces delay, training slows down. This leads to GPU under-utilization, which is inefficient from a business perspective.



Figure 2-13: Inter-GPU Communication.