Introduction
The previous chapter explained the operation of a single artificial neuron. It covered how input values are multiplied by their respective weight parameters, summed together, and combined with a bias term. The resulting value, z, is then passed through a non-linear sigmoid function, which squeezes the neuron's output value ŷ between 0 and 1.
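As a quick refresher, the sketch below shows these steps in code. It is a minimal illustration only; the input, weight, and bias values are made up for the example and do not come from the figures in this chapter.

```python
import math

def sigmoid(z):
    # Non-linear activation that squeezes z into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def neuron_forward(inputs, weights, bias):
    # Weighted sum of the inputs plus the bias term (pre-activation value z)
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Activation: the neuron's output y_hat
    return sigmoid(z)

# Hypothetical input, weight, and bias values for illustration only
y_hat = neuron_forward(inputs=[0.2, 0.5, 0.1], weights=[0.4, 0.3, 0.9], bias=0.1)
print(y_hat)  # a value between 0 and 1
```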
In this chapter, we form the smallest possible Feed Forward Neural Network (FFNN) model using only two neurons. While this is far from a Deep Neural Network (DNN), a simple NN with two neurons is sufficient to explain the Backpropagation algorithm, which is the focus of this chapter.
The goal is to demonstrate the training process and illustrate how the Forward Pass (computation phase) first generates a model output, ŷ. The algorithm then evaluates the model's accuracy by computing the error term using Mean Squared Error (MSE). The first training iteration rarely, if ever, produces a perfect output. To gradually bring the training result closer to the expected value, the Backward Pass (adjustment and communication phase) calculates the magnitude and direction by which the weight values should be adjusted. The Backward Pass is repeated as many times as necessary until an acceptable model output is achieved. We are using a supervised training process with a pre-labeled training dataset, although it is not shown in Figure 2-1. Chapter Three covers training datasets in detail.
After the training process is completed, we use a test dataset to evaluate the model's performance. The test dataset also contains input data and labels, but these labels are not used during training; they serve only as the reference against which the trained model's predictions are measured. At this stage, we check how well the predictions from the training and test phases align. When the model produces the expected results on the test dataset, it can be taken into production.
Forward Pass
Figure 2-1 illustrates how neuron-a computes a weighted sum from three input values, adds a bias term, and produces a pre-activation value za. This value is then passed through the Sigmoid activation function. The output ŷa from neuron-a serves as an input for neuron-b, which processes it and generates the final model output ŷb. Since these computational steps were covered in detail in Chapter 1, we will not repeat them here.
As the final step of the Forward Pass, we apply the error function E to the model output. The error function measures how far the model output ŷb is from the expected value y. We use the Mean Squared Error (MSE), which is computed by subtracting the expected value from the model's output, squaring the result, and multiplying it by 0.5 (or equivalently, dividing by two).
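A quick numeric check of this formula, using the values that appear later in this chapter (model output ŷb = 1.7, expected value y = 1.0), is sketched below.

```python
def mse(y_hat, y):
    # 0.5 * (prediction - expected)^2, the single-sample MSE used in this chapter
    return 0.5 * (y_hat - y) ** 2

# Values from the chapter's example: model output 1.7, expected value 1.0
error = mse(1.7, 1.0)
print(error)  # 0.245, the error term shown in the error-space discussion below
```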
Figure 2-1: An Overview of a Complete Forward Pass Process.
On the right side of Figure 2-2, we have a two-dimensional error space. In this space, a symmetric parabolic curve visualizes the error function. The curve is centered at the expected value, which is 1.0 in our example. The horizontal axis represents the model output ŷ, and the vertical axis represents the error. For instance, if the model prediction is 1.7, you can draw a vertical line from this point on the horizontal axis up to the parabolic curve. In our case, this intersection shows an error term of 0.245. In real-life scenarios, the error landscape often has many peaks and valleys rather than a simple symmetric curve.
The Mean Squared Error (MSE) is a loss function that measures the difference between the predicted values and the expected values. It provides an overall error value for the model, also called the loss or cost, which indicates how far off the predictions are.
Next, the gradient is computed by taking the derivative of the loss function with respect to the model's weights. This gradient shows both the direction and the magnitude of the steepest increase in error. During the Backward Pass, the algorithm calculates the gradient for each weight. By moving in the opposite direction of the gradient (using a method called Gradient Descent), the algorithm adjusts the weights to reduce the loss. This process is repeated many times so that the model output gradually becomes closer to the expected value.
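The sketch below illustrates this update rule in code. It is a minimal, generic Gradient Descent step, not the exact computation used in this chapter's example; the gradient and learning-rate values are placeholders.

```python
def gradient_descent_step(weights, gradients, learning_rate):
    # Move each weight in the opposite direction of its gradient,
    # scaled by the learning rate (eta)
    return [w - learning_rate * g for w, g in zip(weights, gradients)]

# Placeholder values for illustration only
weights = [0.1, 0.4]
gradients = [0.35, 0.70]   # hypothetical gradients, one per weight
eta = 0.01                 # hypothetical learning rate

for _ in range(3):
    # In a real training loop, the gradients are recomputed by a new
    # Forward and Backward Pass on every iteration
    weights = gradient_descent_step(weights, gradients, eta)
print(weights)
```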
The following sections will cover the processes and computations performed during the Backward Pass.
Learning Rate
Figure 2-3: Learning Rate.
Backward Pass
The Forward Pass produces the model output ŷ, which is then used to compute the model error E. The closer ŷ is to the expected value y, the smaller the error, indicating better model performance. The purpose of the Backward Pass, as part of the Backpropagation algorithm, is to adjust the model’s weight parameters during training in a direction that gradually moves the model’s predictions closer to the expected values y.
In Figure 2-4, the model’s output ŷb depends on the weighted sum zb of neuron-b. This weighted sum zb, in turn, is calculated by multiplying the input value ŷa by its associated weight wb1. The same process applies to neuron-a. The Backpropagation algorithm cannot directly modify the results of an activation function or the weighted sum itself. Nor can it alter the input values directly. Instead, it calculates weight adjustments, which are then used to update the model’s weights.
Figure 2-4 illustrates this dependency chain and provides a high-level overview of how the Backpropagation algorithm determines weight adjustments. The following sections will explain this process in detail.
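To make the dependency chain concrete, the chain rule expresses the gradient of the error with respect to a weight as a product of local derivatives. Written in the same notation as the rest of this chapter, and assuming the structure shown in Figure 2-4, the gradient for neuron-b's weight wb1 is:

∂E/∂wb1 = ∂E/∂ŷb ⋅ ∂ŷb/∂zb ⋅ ∂zb/∂wb1

The first factor is the derivative of the error function (MSE′), the second is the derivative of neuron-b's activation function, and the third is simply the input ŷa, because zb is the product of wb1 and ŷa plus the bias. The following sections compute these factors step by step.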
Figure 2-4: Backpropagation Overview: Backward Pass Dependency Chain.
Figure 2-5: The Backward Pass Overview.
Partial Derivative for Error Function – Output Error (Gradient)
The goal of training a model is to minimize the error, meaning we want ŷb (the model's prediction/output) to get as close as possible to y (the expected value).
After computing the error E = 0.245, we compute the partial derivative of the error function with respect to ŷb = 1.7, which shows how small changes in ŷb affect the error E. A derivative is called partial when one of its input values is held constant (i.e., not adjusted by the algorithm). In our example, the expected value y is the constant input. The result of the partial derivative of the error function indicates how the predicted output ŷb should change to minimize the model’s error.
We use the following formula for computing the derivative of the error function:
MSE′ = −(y − ŷb)
MSE′ = −(1.0 − 1.7)
MSE′ = 0.7
Since the model output ŷb = 1.7 is too high, the positive gradient indicates that it should be lowered by 0.7, which is the derivative of the error function (MSE′). This makes perfect sense because subtracting the MSE derivative of 0.7 from the model output ŷb = 1.7 gives 1.0, which matches the expected value.
Partial Derivative for the Activation Function
Error Term for Neurons
The error term for neuron-b is calculated by multiplying the partial derivative of the error function, MSE′ = 0.7, by the derivative of the neuron's activation function, f′(b) = 1.0. This means we propagate the model's error backward, using it as a base value for fine-tuning the model's accuracy (i.e., refining the weight values). This is why the term Backward Pass fits the process perfectly.
Error term (Enb) for neuron-b = MSE′ ⋅ f′(b) = 0.7 ⋅ 1.0 = 0.7
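The same computation in code, using the values quoted above (MSE′ = 0.7, and f′(b) = 1.0 taken as given from the figure):

```python
def neuron_error_term(loss_derivative, activation_derivative):
    # Error term = derivative of the loss w.r.t. the neuron's output
    # multiplied by the derivative of its activation function
    return loss_derivative * activation_derivative

# Values from the chapter's example; f'(b) = 1.0 is taken as given
error_term_b = neuron_error_term(0.7, 1.0)
print(error_term_b)  # 0.7
```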
Weight Adjustment Value
Weight Adjustment
The weight adjustment value is computed by multiplying the gradient (averaged in our example) by the learning rate η. We use a learning rate of 0.012, which results in a weight adjustment of 0.042 for weight wa1 and 0.105 for weight wb1.
The weight adjustment values are then subtracted from the initial weights. This yields an updated weight of 0.058 (0.1−0.042) for wa1 and 0.295 (0.4−0.105) for wb1.
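A minimal sketch of this final update step, taking the adjustment values quoted above as given (the gradients behind them come from the figures and are not re-derived here):

```python
def apply_weight_adjustments(weights, adjustments):
    # New weight = old weight - adjustment,
    # where adjustment = learning rate * (averaged) gradient
    return {name: w - adjustments[name] for name, w in weights.items()}

# Initial weights and adjustment values from the chapter's example
weights = {"wa1": 0.1, "wb1": 0.4}
adjustments = {"wa1": 0.042, "wb1": 0.105}

updated = apply_weight_adjustments(weights, adjustments)
print(updated)  # wa1 ≈ 0.058, wb1 ≈ 0.295 (up to floating-point rounding)
```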
The Second Iteration - Forward Pass
Network Impact
Figure 2-13: Inter-GPU Communication.