Monday 14 October 2024

AI for Network Engineers: Backpropagation Algorithm

 Introduction 


This chapter introduces the training model of a neural network based on the Backpropagation algorithm. The goal is to provide a clear and solid understanding of the process without delving deeply into the mathematical formulas, while still explaining the fundamental operations of the involved functions. The chapter also briefly explains why, and in which phases the training job generates traffic to the network, and why lossless packet transport is required. The Backpropagation algorithm is composed of two phases: the Forward pass (computation phase) and the Backward pass (adjustment and communication phase).

In the Forward pass, neurons in the first hidden layer calculate the weighted sum of input parameters received from the input layer, which is then passed to the neuron's activation function. Note that neurons in the input layer are not computational units; they simply pass the input variables to the connected neurons in the first hidden layer. The output from the activation function of a neuron is then used as input for the connected neurons in the next layer. The result of the activation function in the output layer represents the model's prediction, which is compared to the expected value (ground truth) using the error function. The output of the error function indicates the accuracy of the current training iteration. If the result is sufficiently close to the expected value (error function close to zero), the training is complete. Otherwise, it triggers the Backward pass process.

As the first step in the backward pass, the backpropagation algorithm calculates the derivative of the error function, providing the output error (gradient) of the model. Next, the algorithm computes the error term (gradient) for the neuron(s) in the output layer by multiplying the derivative of each neuron’s activation function by the model's error term. Then, the algorithm moves to the preceding layer and calculates the error term (gradient) for its neuron(s). This error term is now calculated using the error term of the connected neuron(s) in the next layer, the derivative of each neuron’s activation function, and the value of the weight parameter associated with the connection to the next layer.

After calculating the error terms, the algorithm determines the weight adjustment values for all neurons simultaneously. This computation is based on the input values, the adjustment values, and a user-defined learning rate. Finally, the backpropagation algorithm refines all weight values by adding the adjustment values to the initial weights. Once the backward pass is complete, the backpropagation algorithm starts a new iteration of the forward pass, gradually improving the model's predictions until they closely match the expected values, at which point the training is complete.


Figure 2-1: Backpropagation Overview.


The First Iteration - Forward Pass


Training a model often requires multiple iterations of forward and backward passes. In the forward pass, neurons in the first hidden layer calculate the weighted sum of input values, each multiplied by its associated weight parameter. These neurons then apply an activation function to the result. Neurons in subsequent layers use the activation output from previous layers as input for their own weighted sum calculations. This process continues through all the layers until reaching the output layer, where the activation function produces the model's prediction.

After the forward pass, the backpropagation algorithm calculates the error by comparing the model's output with the expected value, providing a measure of accuracy. If the model's output is close to the expected value, training is complete. Otherwise, the backpropagation algorithm initiates the backward pass to adjust the weights and reduce the error in subsequent iterations.

Neuron-a Forward Pass Calculations

Weighted Sum


In Figure 2-2, we have an imaginary training dataset with three inputs and a bias term. Input values and their respective initial weight values are listed below: 

x1 = 0.2 , initial weight wa1 = 0.1
x2 = 0.1, initial weight wa2 = 0.2
x3 = 0.4 , initial weight wa3 = 0.3
ba0 = 1.0 , initial weight wa0 = 0.6

From the model training perspective, the input values are constant, unchageable values, while weight values are variables which will be refined during the backward pass process.

The standard way to write the weighted sum formula is: 


Where:
n = 3 represents the number of input values (x1, x2, x3).
Each input xi  is multiplied by its respective weight wi, and the sum of these products is added to the bias term b.

In this case, the equation can be explicitly stated as:


Which with our parameters gives:

Activation Function


Neuron-a uses the previously calculated weighted sum as input for the activation function. We are using the ReLU function (Rectified Linear Unit), which is more popular than the hyperbolic tangent and sigmoid functions due to its simplicity and lower computational cost.

The standard way to write the ReLU function is:


Where:
f(a) represents the activation function.
z  is the weighted sum of inputs.

The ReLU function returns the z if z > 0. Otherwise, it returns 0 if z ≤ 0.

In our example, the weighted sum za is 0.76, so the ReLU function returns:



Figure 2-2: Activation Function for Neuron-a.


Neuron-b Forward Pass Calculations

Weighted Sum


Besides the bias term value of 1.0,  Neuron-b uses the result provided by the activation function of neuron-a as an input to weighted sum calculation. Input values and their respective initial weight values are listed below: 


This gives us:


Activation Function


Just like Neuron-a, Neuron-b uses the previously calculated weighted sum as input for the activation function. Because the zb = 0.804 is greater than zero, the ReLU activation function f(b) returns:


Neuron-b is in the output layer, so its activation function result yb represents the prediction of the model. 

 

Figure 2-3: Activation Function for Neuron-b.

Error Function


To keep things simple, we have used only one training example. However, in real-life scenarios, there will always be more than one training example. For instance, a training dataset might contain 10 images of cats and 10 images of dogs, each having 28x28 pixels. Each image represents a training example, giving us a total of 20 training examples. The purpose of the error function is to provide a single error metric for all training examples. In this case, we are using the Mean Squared Error (MSE).

We can calculate the MSE using the formula below where the expected value y is 1.0 and the model’s  prediction for the training example yb = 0.804. This gives an error metric of 0.019, which can be interpreted as an indicator of the model's accuracy.

The result of the error function is not sufficiently close to the desired value, which is why this result triggers the backward pass process.

Figure 2-4: Calculating the Error Function for Training Examples.

Backward Pass


In the forward pass, we calculate the model’s accuracy using several functions. First, Neuron-a computes the weighted sum Σ(za ) by multiplying the input values and the bias term with their respective weights. The output, za, is then passed through the activation function f(a), producing ya. Neuron-b, in turn, takes ya and the bias term to calculate the weighted sum Σ(zb ). The activation function f(b) then uses zb to compute the model’s output, yb. Finally, the error function f(e) calculates the model’s accuracy based on the output.

So, dependencies between function can be seen as:


The backpropagation algorithm combines these five functions to create a new error function, enew(x), using function composition and the chain rule. The following expression shows how the error function relates to the weight parameter w1 used by Neuron-a:


This can be expressed using the composition operator (∘) between functions:


Next, we use a method called gradient descent to gradually adjust the initial weight values, refining them to bring the model's output closer to the expected result. To do this, we compute the derivative of the composite function using the chain rule, where we take the derivatives of:

1. The error function (e) with respect to the activation function (b).
2. The activation function b with respect to the weighted sum (zb). 
3. The weighted sum (zb) with respect to the activation function (a).
4. The activation function (a) with respect to weighted sum (za(w1)). 

In Leibniz’s notation, this looks like:


Figure 2-5 illustrates the components of the backpropagation algorithm, along with their relationships and dependencies.


Figure 2-5: The Backward Pass Overview.


Partial Derivative for Error Function – Output Error (Gradient)


As a recap, and for illustrating that the prediction of the first iteration fails, Figure 2-6 includes the computation for the error function (MSE = 0.019). 

As a first step, we calculate the partial derivative of the error function. In this case, the partial derivative describes the rate of change of the error function when the input variable yb changes. The derivative is called partial when one of its input values is held constant (i.e., not adjusted by the algorithm). In our example, the expected value y is constant input. The result of the partial derivative of the error function describes how the predicted output should change yb to minimize the model’s error.

We use the following formula for computing the derivative of the error function:


Figure 2-6: The Backward Pass – Derivative of the Error Function.

The following explanation is meant for readers interested in why there is a minus sign in front of the function.

When calculating the derivative, we use the Power Rule. The Power Rule states that if we have a function f(x) = xn , then its derivative is f’(x) = n ⋅ xn-1. In our case, this applies to the error function:


Using the Power Rule, the derivative becomes:


Next, we apply the chain rule by multiplying this result by the derivative of the inner function (y − yb), with respect to yb . Since y is treated as a constant (because it represents our target value, which doesn't change during optimization), the derivative of (y − yb) with respect to yb  is simply −1, as the derivative of − yb  with respect to yb  is −1, and the derivative of y (a constant) is 0.

Therefore, the final derivative of the error function with respect to yb  is:


Partial Derivative for the Activation Function


After computing the output error, we calculate the derivative of the activation function f(b) with respect to zb . Neuron b uses the ReLU activation function, which states that if the input to the function is greater than 0, the derivative is 1; otherwise, it is 0. In our case, the result of the activation function f(b)=0.804, so the derivative is 1.

Error Term for Neurons (Gradient)


The error term (Gradient) for neuron-b is calculated by multiplying the output error, the partial derivative of the error function,  by the derivative of the neuron's activation function. This means that now we propagate the model's error backward using it as a base value for finetuning the model accuracy (i.e., refining new weight values). This is why the term backward pass fits perfectly for the process.



Figure 2-7: The Backward Pass – Error Term (Gradient) for Neuron-b.


After computing the error term for Neuron-b, the backward pass moves to the preceding layer, the hidden layer, and calculates the error term for Neuron-a. The algorithm computes the derivative for the activation function f(a) = 1, as it did with the Neuron-b. Next, it multiplies the result by Neuron-b's error term (-0.196) and the connected weight parameter , wb1 =0.4. The result -0.0784 is the error term for Neuron-a.


Figure 2-8: The Backward Pass – Error Term (Gradient) for Neuron-a.


Weight Adjustment Value


After computing error terms for all neurons in every layer, the algorithm simultaneously calculates the weight adjustment value for every weight. The process is simple, the error term is multiplied with an input value connected to weight and with learning rate (η). The learning rate balances convergence speed and training stability. We have set it to -0.6 for the first iteration. The learning rate is a hyper-parameter, meaning it is set by the user rather than learned by the model during training. It affects the behavior of the backpropagation algorithm by controlling the size of the weight updates. It is also possible to adjust the learning rate during training—using a higher learning rate at the start to allow faster convergence and lowering it later to avoid overshooting the optimal result. 

Weight adjustment value for weight wb1 and wa1 respectively:


Note! It is not recommended to use a negative learning rate. I use it here because we get a good enough output for the second forward pass iteration.


Figure 2-9: The Backward Pass – Weight Adjustment Value for Neurons.

Refine Weights


As the last step, the backpropagation algorithm computes new values for every weight parameter in the model by simply summing the initial weight value and weight adjustment value.

New values for weight  parameters wb1 and wa1 respectively:



Figure 2-10: The Backward Pass – Compute New Weight Values.

The Second Iteration - Forward Pass


After updating all the weight values (wa0, wa1, wa2, and wa3 ), the backpropagation process begins the second iteration of the forward pass. As shown in Figure 2-11, the model output yb = 0.9982 is very close to the expected value y = 1.0. The new MSE = 0.0017, is much better than 0.019 computed in the first iteration.


Figure 2-11: The Second Iteration of the Forward Pass.

Network Impact


Figure 2-12 shows a hypothetical example of Data Parallelization, where our training data set is split into two batches, A and B, which are processed by GPU-A and GPU-B, respectively. The training model is the same on both GPUs: Fully-Connected, with two hidden layers of four neurons each, and one output neuron in the output layer.

After computing a model prediction during the forward pass, the backpropagation algorithm begins the backward pass by calculating the gradient (output error) for the error function. Once computed, the gradients are synchronized between the GPUs. The algorithm then averages the gradients, and the process moves to the preceding layer. Neurons in the preceding layer calculate their gradient by multiplying the weighted sum of their connected neurons’ averaged gradients and connected weight with the local activation function’s partial derivative. These neuron-based gradients are then synchronized over connections. Before the process moves to the preceding layer, gradients are averaged. The backpropagation algorithm executes the same process through all layers. 

If packet loss occurs during the synchronization, it can ruin the entire training process, which would need to be restarted unless snapshots were taken. The cost of losing even a single packet could be enormous, especially if training has been ongoing for several days or weeks. Why is a single packet so important? If the synchronization between the gradients of two parallel neurons fails due to packet loss, the algorithm cannot compute the average, and the neurons in the preceding layer cannot calculate their gradient. Besides, if the connection, whether the synchronization happens over NVLink, InfiniBand, Ethernet (RoCE or RoCEv2), or wireless connection, causes a delay, the completeness of the training slows down. This causes GPU under-utilization which is not efficient from the business perspective.

Figure 2-12: Backward Pass – Gradient Synchronization and Averaging.

To be conntinued...