Introduction
The previous chapter explained the operation of a single artificial neuron. It covered how input values are multiplied by their respective weight parameters, summed together, and combined with a bias term. The resulting value, z, is then passed through a non-linear sigmoid function, which squeezes the neuron's output value ŷ between 0 and 1.
In this chapter, we form the smallest possible Feed Forward Neural Network (FFNN) model using only two neurons. While this is far from a Deep Neural Network (DNN), a simple NN with two neurons is sufficient to explain the Backpropagation algorithm, which is the focus of this chapter.
The goal is to demonstrate the training process and illustrate how the Forward Pass (computation phase) first generates a model output, ŷ. The algorithm then evaluates the model's accuracy by computing the error term using Mean Squared Error (MSE). The first training iteration rarely, if ever, produces a perfect output. To gradually bring the training result closer to the expected value, the Backward Pass (adjustment and communication phase) calculates the magnitude and direction by which the weight values should be adjusted. The Backward Pass is repeated as many times as necessary until an acceptable model output is achieved. We use a supervised training process with a pre-labeled training dataset, although it is not shown in Figure 2-1. Chapter Three covers training datasets in detail.
After the training process, we use a test dataset to evaluate the model's performance on previously unseen data. In this phase, we measure how well the predictions from the training and test phases align. When the model produces the expected results on the test dataset, it can be deployed to production.
Forward Pass
Figure 2-1 illustrates how neuron-a computes a weighted sum from three input values, adds a bias term, and produces a pre-activation value za. This value is then passed through the Sigmoid activation function. The output ŷa from neuron-a serves as an input to neuron-b, which processes it and generates the final model output ŷb. Since these computational steps were covered in detail in Chapter 1, we will not repeat them here.
As the final step of the Forward Pass, we apply the error function E to the model output. The error function measures how far the model output ŷb is from the expected value y. We use the Mean Squared Error (MSE), which is computed by subtracting the expected value from the model's output, squaring the result, and multiplying it by 0.5 (or equivalently, dividing by two).
Figure 2-1: An Overview of a Complete Forward Pass Process.
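The forward-pass computation just described can be sketched in a few lines of Python. The values below (inputs, weights, and biases) are illustrative placeholders rather than the book's actual numbers, and for simplicity both neurons are assumed to use the Sigmoid activation from Chapter 1.

import math

def sigmoid(z):
    # Squeezes the pre-activation value into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative inputs and parameters (placeholder values)
x = [0.2, 0.5, 0.8]       # three input values
w_a = [0.4, -0.3, 0.6]    # weights of neuron-a
b_a = 0.1                 # bias of neuron-a
w_b = 0.7                 # weight of neuron-b for its single input y_hat_a
b_b = -0.2                # bias of neuron-b

# Neuron-a: weighted sum plus bias, then Sigmoid
z_a = sum(xi * wi for xi, wi in zip(x, w_a)) + b_a
y_hat_a = sigmoid(z_a)

# Neuron-b: takes y_hat_a as its only input
z_b = w_b * y_hat_a + b_b
y_hat_b = sigmoid(z_b)

# Error function E: Mean Squared Error against the expected value y
y = 1.0                   # expected (labeled) value
E = 0.5 * (y - y_hat_b) ** 2
print(y_hat_b, E)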
On the right side of Figure 2-2, we have a two-dimensional error space. In this space, a symmetric parabolic curve visualizes the error function. The curve is centered at the expected value, which is 1.0 in our example. The horizontal axis represents the model output ŷb, and the vertical axis represents the error. For instance, if the model prediction is 1.7, you can draw a vertical line from this point on the horizontal axis up to the parabolic curve. In our case, this intersection shows an error term of 0.245. In real-life scenarios, the error landscape often has many peaks and valleys rather than a simple symmetric curve.
The Mean Squared Error (MSE) is a loss function that measures the difference between the predicted values and the expected values. It provides an overall error value for the model; this value, also called the loss or cost, indicates how far off the predictions are.
Next, the gradient is computed by taking the derivative of the loss function with respect to the model's weights. This gradient shows both the direction and the magnitude of the steepest increase in error. During the Backward Pass, the algorithm calculates the gradient for each weight. By moving in the opposite direction of the gradient (using a method called Gradient Descent), the algorithm adjusts the weights to reduce the loss. This process is repeated many times so that the model output gradually becomes closer to the expected value.
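As a minimal sketch of this update rule, the snippet below applies a single Gradient Descent step to one weight; the weight, gradient, and learning-rate values are placeholders chosen purely for illustration.

# One Gradient Descent step for a single weight (illustrative values)
learning_rate = 0.1   # step size; discussed in the Learning Rate section
w1 = 0.4              # current weight value (placeholder)
grad_w1 = 0.25        # dE/dw1 computed during the Backward Pass (placeholder)

# Move against the gradient to reduce the loss
w1 = w1 - learning_rate * grad_w1
print(w1)             # 0.375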
The following sections will cover the processes and computations performed during the Backward Pass.
Learning Rate
Figure 2-3: Learning Rate.
Backward Pass
The Forward Pass produces the model output ŷ, which is then used to compute the model error E. The closer ŷ is to the expected value y, the smaller the error, indicating better model performance. The purpose of the Backward Pass, as part of the Backpropagation algorithm, is to adjust the model’s weight parameters during training in a direction that gradually moves the model’s predictions closer to the expected values y.
In Figure 2-4, the model's output ŷb depends on the weighted sum zb of neuron-b. This weighted sum, in turn, depends on neuron-b's inputs and their associated weights: the bias term and the input value ŷa, the output of neuron-a. The Backpropagation algorithm cannot directly modify the result of an activation function or the weighted sum itself, nor can it alter the input values directly. Instead, it calculates weight adjustment values, which are then used to update the model's weights.
Figure 2-4 illustrates this dependency chain and provides a high-level overview of how the Backpropagation algorithm determines weight adjustments. The following sections will explain this process in detail.
Expressed as a function composition, the error e depends on the weight w1 through the following chain:

e(w1) = e(b(zb(a(za(w1)))))

e(w1) = (e ∘ b ∘ zb ∘ a ∘ za)(w1)
To compute how the error function changes with respect to changes in w1, we apply the chain rule, giving:

∂e/∂w1 = (∂e/∂ŷb) · (∂ŷb/∂zb) · (∂zb/∂ŷa) · (∂ŷa/∂za) · (∂za/∂w1)
This expression shows how the error gradient propagates backward through the network. Each term represents a partial derivative, capturing how small changes in one variable influence the next, ultimately determining how w₁ should be updated during training. The upcoming sections describe these complex formulas using flowcharts that will hopefully make learning easier.
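To make the chain rule concrete, the following sketch multiplies the individual partial derivatives for the two-neuron network, assuming Sigmoid activations in both neurons and reusing the placeholder parameter values from the earlier forward-pass sketch.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    # Derivative of the Sigmoid: sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# Placeholder inputs and parameters (same illustrative values as before)
x = [0.2, 0.5, 0.8]
w_a = [0.4, -0.3, 0.6]
b_a = 0.1
w_b, b_b = 0.7, -0.2
y = 1.0

# Forward pass
z_a = sum(xi * wi for xi, wi in zip(x, w_a)) + b_a
y_hat_a = sigmoid(z_a)
z_b = w_b * y_hat_a + b_b
y_hat_b = sigmoid(z_b)

# Chain rule: dE/dw1 = dE/dy_hat_b * dy_hat_b/dz_b * dz_b/dy_hat_a
#                      * dy_hat_a/dz_a * dz_a/dw1
dE_dyb  = -(y - y_hat_b)        # derivative of the MSE
dyb_dzb = sigmoid_prime(z_b)    # derivative of neuron-b's activation
dzb_dya = w_b                   # because z_b = w_b * y_hat_a + b_b
dya_dza = sigmoid_prime(z_a)    # derivative of neuron-a's activation
dza_dw1 = x[0]                  # z_a depends on w1 through input x1

dE_dw1 = dE_dyb * dyb_dzb * dzb_dya * dya_dza * dza_dw1
print(dE_dw1)

Each factor captures how a small change in one variable influences the next, mirroring the composition shown above.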
The somewhat crowded Figure 2-5 illustrates the components of the backpropagation algorithm, along with their relationships and dependencies. The figure consists of three main blocks. The rightmost block depicts the calculation of the error function. The middle and left blocks outline the steps for defining and adjusting new weight values. The complete backward pass process is explained next in detail, one step at a time.
Figure 2-5: The Backward Pass Overview.
Partial Derivative for Error Function – Output Error (Gradient)
The goal of training a model is to minimize the error, meaning we want ŷb (the model's prediction/output) to get as close as possible to y (the expected value).
After computing the error E = 0.245, we compute the partial derivative of the error function with respect to the model output ŷb (evaluated at ŷb = 1.7), which shows how small changes in ŷb affect the error E. A derivative is called partial when the function depends on several variables and all but one are held constant (i.e., not adjusted by the algorithm). In our example, the expected value y is the constant input. The result of this partial derivative indicates how the predicted output ŷb should change to minimize the model's error.
We use the following formula for computing the derivative of the error function:
MSE′ = −(y − ŷb)

MSE′ = −(1.0 − 1.7)

MSE′ = 0.7
Since the model output ŷb = 1.7 is too high, the positive gradient of 0.7, the derivative of the error function (MSE′), indicates that the output should be lowered. This makes intuitive sense: subtracting 0.7 from the model output ŷb = 1.7 gives 1.0, which matches the expected value.
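This result is easy to sanity-check numerically. The short snippet below compares the analytical derivative with a finite-difference approximation of the error function around ŷb = 1.7; it is a verification sketch, not part of the book's workflow.

# Analytical derivative of the MSE at y_hat_b = 1.7, y = 1.0
y, y_hat_b = 1.0, 1.7
mse = lambda pred: 0.5 * (y - pred) ** 2
analytical = -(y - y_hat_b)                # 0.7

# Finite-difference approximation of the same slope
eps = 1e-6
numerical = (mse(y_hat_b + eps) - mse(y_hat_b - eps)) / (2 * eps)

print(analytical, round(numerical, 6))     # both approximately 0.7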
Partial Derivative for the Activation Function
Error Term for Neurons (Gradient)
Weight Adjustment Value
Refine Weights
The Second Iteration - Forward Pass
Network Impact
Figure 2-13: Inter-GPU Communication.