Neural Network Architecture Overview
Deep Neural Networks (DNNs) can be built from various architectures, one of the simplest and most fundamental being the Feedforward Neural Network (FNN). Figure 2-1 illustrates our simple, three-layer FNN.
Input Layer:
The first layer does not contain computing neurons. Instead, it holds the input data values X1, X2, and X3, which are fed to the first hidden layer.
Hidden Layer:
The neurons in the hidden layer calculate a weighted sum of the input data, which is then passed through an activation function. In our example, we use the Rectified Linear Unit (ReLU) activation function. The result, the neuron's activation value, is in effect a modified version of the input data received from the input layer, and it is published to the next layer.
Output Layer:
Neurons in this layer calculate the weighted sum in the same manner as neurons in the hidden layer, but the result of the activation function is the final output.
The process described above is known as the forward pass. Once the forward pass is completed, the result is passed through a loss function, where the received value is compared to the expected value. The difference between these two values triggers the backpropagation process; the loss calculation is the initial phase of backpropagation. During backpropagation, the network fine-tunes the weight values, neuron by neuron, from the output layer back through the hidden layers. The neurons in the input layer do not participate in backpropagation because they have no weight values to adjust.
After the backpropagation process, a new iteration of the forward pass begins from the first hidden layer. This loop continues until the received value is close enough to the expected value, indicating that the training is complete.
Figure 2-1: Deep Neural Network Basic Structure and Operations.
Forward Pass
Next, let's examine the operation of a neural network in more detail. Figure 2-2 illustrates a simple, three-layer Feedforward Neural Network (FNN) data model. The input layer has two neurons, X1 and X2, each holding one input data value: X1 holds a value of one (1) and X2 holds a value of zero (0). The neurons in the input layer do not calculate a weighted sum or an activation value; instead, they pass the data to the next layer, which is the first hidden layer.
The hidden layer in our example consists of two neurons. These neurons use the ReLU activation function to calculate their activation values. During the initialization phase, the weight values for these neurons are assigned using the He initialization method, which is often used with the ReLU function. The He initialization method calculates the variance as 2/n, where n is the number of neurons in the previous layer. In this example, with two input neurons, this gives a variance of 1 (= 2/2). The weights are then drawn from a normal distribution N(0, √variance), which in this case is N(0, 1). Basically, this means that the randomly generated weight values are centered around zero with a standard deviation of one.
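As a rough illustration of He initialization (this sketch is not part of the original example, and the variable names are mine), the following Python lines draw weights for a layer whose previous layer has two neurons:

import random

n_prev = 2                      # number of neurons in the previous layer
variance = 2 / n_prev           # He initialization: variance = 2/n = 1
std_dev = variance ** 0.5       # standard deviation = 1

# Draw each weight from a normal distribution centered around zero.
w31 = random.gauss(0, std_dev)
w32 = random.gauss(0, std_dev)
w41 = random.gauss(0, std_dev)
w42 = random.gauss(0, std_dev)

In the worked example that follows, the weights are instead fixed to round numbers (0.5 and 1) to keep the arithmetic simple.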
In Figure 2-2, the weight value for neuron H3 in the hidden layer is 0.5 for both input sources X1 (input data 1) and X2 (input data 0). Similarly, for the hidden layer neuron H4, the weight value is 1 for both input sources. Neurons in the hidden and output layers also have a bias variable. Without a bias, a neuron whose weighted input sum is zero would also output zero; the bias ensures that a neuron can still produce a meaningful output even when its inputs are zero (i.e., the neuron would otherwise be inactive). Neurons H4 and O5 have a bias value of 0.5, while neuron H3 has a bias value of 0 (I am using zero to simplify the calculation).
Let’s start the forward pass from neuron H3 in the hidden layer. First, we calculate the weighted sum using the formula below, where Z3 represents the weighted sum of the inputs to H3, Xn is the input data value received from input-layer neuron n, and W3n is the weight associated with that particular input neuron.
The weighted sum calculation (Z3) for neuron H3:
Z3 = (X1 ⋅ W31) + (X2 ⋅ W32) + b3
Given:
Z3 = (1 ⋅ 0.5) + (0 ⋅ 0.5) + 0
Z3 = 0.5 + 0 + 0
Z3 = 0.5
To get the activation value a3 (shown as H3=0.5 in figure), we apply the ReLU function. The ReLU function outputs zero (0) if the calculated weighted sum Z is less than or equal to zero; otherwise, it outputs the value of the weighted sum Z.
The activation value a3 for H3 is:
ReLU (Z3) = ReLU (0.5) = 0.5
The weighted sum calculation for neuron H4:
Z4 = (X1 ⋅ W41) + (X2 ⋅ W42) + b4
Given:
Z4 = (1 ⋅ 1) + (0 ⋅ 1) + 0.5
Z4 = 1 + 0 + 0.5
Z4 = 1.5
The activation value using ReLU for Z4 is:
ReLU (Z4) = ReLU (1.5) = 1.5
Figure 2-2: Forwarding Pass on Hidden Layer.
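The hidden-layer calculations above can be reproduced with a few lines of Python. This is only a sketch: the function name relu and the variable names are mine, while the weight and bias values are taken from the example (b3 = 0, b4 = 0.5).

def relu(z):
    # ReLU: output 0 for z <= 0, otherwise output z itself
    return max(0.0, z)

x1, x2 = 1.0, 0.0                 # input data values X1 and X2

# Hidden neuron H3: weights 0.5, bias 0
z3 = x1 * 0.5 + x2 * 0.5 + 0.0    # weighted sum Z3 = 0.5
a3 = relu(z3)                     # activation value a3 = 0.5

# Hidden neuron H4: weights 1, bias 0.5
z4 = x1 * 1.0 + x2 * 1.0 + 0.5    # weighted sum Z4 = 1.5
a4 = relu(z4)                     # activation value a4 = 1.5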
After neurons H3 and H4 publish their activation values to neuron O5 in the output layer, O5 calculates the weighted sum Z5 for its inputs with weights W53 = 1 and W54 = 1, giving Z5 = (0.5 ⋅ 1) + (1.5 ⋅ 1) + 0.5 = 2.5. Using Z5, it calculates the output with the ReLU function, so the received output value is Yr = 2.5. The difference between the received output value (Yr) and the expected value (Ye = 2) triggers the backpropagation process. In our example, Yr – Ye = 0.5.
Backpropagation Process
The loss function measures the difference between the predicted output and the actual expected output. The loss function value indicates how well the neural network is performing. A high loss value means the network's predictions are far from the actual values, while a low loss value means the predictions are close.
After calculating the loss, backpropagation is initiated to minimize this loss. Backpropagation involves calculating the gradient of the loss function with respect to each weight and bias in the network. This step is crucial for adjusting the weights and biases to reduce the loss in subsequent forward pass iterations.
The loss function is calculated using the formula below:
Loss (L) = (H3 x W53 + H4 x W54 + b5 – Ye)²
L = (0.5 x 1 + 1.5 x 1 + 0.5 – 2)²
L = (0.5 + 1.5 + 0.5 – 2)²
L = 0.5² = 0.25
Figure 2-3: Forwarding Pass on Output Layer.
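Continuing the same Python sketch for the output neuron O5 and the loss calculation above (again, the variable names are my own):

a3, a4 = 0.5, 1.5                       # activation values from the hidden layer
w53, w54, b5 = 1.0, 1.0, 0.5            # output-layer weights and bias
y_expected = 2.0                        # expected value Ye

z5 = a3 * w53 + a4 * w54 + b5           # weighted sum Z5 = 2.5
y_received = max(0.0, z5)               # ReLU output, received value Yr = 2.5

loss = (y_received - y_expected) ** 2   # L = 0.5^2 = 0.25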
The result of the loss function is then fed into the gradient calculation process, where we compute the gradient of the loss function with respect to each weight and bias in the network. The result of the gradient calculation is then used to fine-tune the old weight values. The hyper-parameter η (eta, the learning rate) controls the step size during weight updates in the backpropagation process, balancing the speed of convergence against the stability of training. In our example, we use a learning rate of 1/100 = 0.01. The term hyper-parameter refers to a setting, such as the learning rate, that is chosen before training rather than learned from the data but still affects the final result.
First, we compute the partial derivative of the loss function (the gradient) with respect to the old weight values. The following example shows the gradient calculation for weight W53. The same computation applies to W54 and b5.
Gradient Calculation:
∂L/∂W53 = 2 x H3 x (Yr – Ye)
Given:
= 2 x 0.5 x (2.5 - 2)
= 1 x 0.5
= 0.5
New weight value calculation:
W53 (new) = W53(old) – η x ∂L/∂W53
Given:
W53 (new) = 1 – 0.01 x 0.5
W53 (new) = 0.995
Figure 2-4: Backpropagation - Gradient Calculation and New Weight Value Computation.
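The same gradient calculation and weight update can be written as a short Python sketch (the variable names are mine, the values come from the example):

a3 = 0.5                                        # activation value of H3
y_received, y_expected = 2.5, 2.0               # Yr and Ye
eta = 0.01                                      # learning rate
w53_old = 1.0

grad_w53 = 2 * a3 * (y_received - y_expected)   # dL/dW53 = 0.5
w53_new = w53_old - eta * grad_w53              # 1 - 0.01 * 0.5 = 0.995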
Figure 2-5 shows the formulas for calculating the new bias b5. The process is the same as the one used for updating the weight values.
Figure 2-5: Backpropagation - Gradient Calculation and New Bias Computation.
After updating the weights and biases, the backpropagation process moves to the hidden layer. Gradient computation in the hidden layer is more complex because the loss function directly includes only the weights of the output layer, as you can see from the loss function formula below:
Loss (L) = (H3 x W53 + H4 x W54 + b5 – Ye)²
The formula for computing the weights and biases of neurons in the hidden layers uses the chain rule. The mathematical formula is shown below, but the actual computation is beyond the scope of this chapter.
∂L/∂W31 = (∂L/∂H3) x (∂H3/∂W31)
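Although the full derivation is beyond the scope of this chapter, the chain rule above can be illustrated numerically with the values of this example. This is my own sketch; the intermediate names dL_dH3 and dH3_dW31 do not appear in the figures.

x1 = 1.0                                        # input value feeding weight W31
w53 = 1.0                                       # output-layer weight attached to H3
z3 = 0.5                                        # weighted sum of H3 (positive, so the ReLU slope is 1)
y_received, y_expected = 2.5, 2.0               # Yr and Ye

dL_dH3   = 2 * (y_received - y_expected) * w53  # how the loss changes with H3's activation
dH3_dW31 = x1 * (1.0 if z3 > 0 else 0.0)        # ReLU derivative times the input
dL_dW31  = dL_dH3 * dH3_dW31                    # chain rule: 1 * 1 = 1

w31_new = 0.5 - 0.01 * dL_dW31                  # 0.5 - 0.01 = 0.49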
After the backpropagation process is completed, the next iteration of the forward pass starts. This loop continues until the received result is close enough to the expected result.
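To make the training loop concrete, the sketch below repeats the forward pass, loss calculation, and weight updates for the example network until the loss is small. The stopping threshold (0.001), the iteration limit, and the use of the pre-update output-layer weights in the hidden-layer gradients are my own choices, not something specified in the figures.

def relu(z):
    return max(0.0, z)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

x = [1.0, 0.0]                      # input values X1, X2
y_expected = 2.0                    # expected value Ye
eta = 0.01                          # learning rate

w3, b3 = [0.5, 0.5], 0.0            # H3 weights and bias
w4, b4 = [1.0, 1.0], 0.5            # H4 weights and bias
w5, b5 = [1.0, 1.0], 0.5            # O5 weights and bias

for step in range(10000):
    # Forward pass
    z3 = x[0] * w3[0] + x[1] * w3[1] + b3
    z4 = x[0] * w4[0] + x[1] * w4[1] + b4
    a3, a4 = relu(z3), relu(z4)
    z5 = a3 * w5[0] + a4 * w5[1] + b5
    y_received = relu(z5)

    # Loss calculation
    loss = (y_received - y_expected) ** 2
    if loss < 0.001:                # close enough to the expected value
        break

    # Backpropagation: output-layer gradient, then hidden-layer gradients via the chain rule
    d5 = 2 * (y_received - y_expected) * relu_grad(z5)
    d3 = d5 * w5[0] * relu_grad(z3)
    d4 = d5 * w5[1] * relu_grad(z4)

    # Weight and bias updates
    w5 = [w5[0] - eta * d5 * a3, w5[1] - eta * d5 * a4]
    b5 -= eta * d5
    w3 = [w3[0] - eta * d3 * x[0], w3[1] - eta * d3 * x[1]]
    b3 -= eta * d3
    w4 = [w4[0] - eta * d4 * x[0], w4[1] - eta * d4 * x[1]]
    b4 -= eta * d4

print(step, round(y_received, 3), round(loss, 5))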
If the size of the input data exceeds the GPU’s memory capacity, or if the computing power of one GPU is insufficient for the data model, we need to decide on a parallelization strategy. This strategy defines how the training workload is distributed across several GPUs. Parallelization also affects the network load if we need more GPUs than are available on one server. Dividing the workload among GPUs within a single GPU server, or between multiple GPU servers, requires synchronization of the calculated gradients between GPUs: once the gradients are calculated, the GPUs synchronize the results and compute the average gradient, which is then used to update the weight values.
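As a simplified, framework-independent illustration of that synchronization step, assume every GPU has computed its own gradient for the same weight from its share of the data; the gradients are averaged and every copy of the weight is updated with the same value (the numbers below are made up):

# Hypothetical per-GPU gradients for the same weight, each computed
# from a different slice of the training data.
local_grads = [0.50, 0.46, 0.54, 0.50]

# All-reduce style synchronization: average the gradients ...
avg_grad = sum(local_grads) / len(local_grads)

# ... and apply the identical update on every GPU so that all
# copies of the weight stay in sync.
eta = 0.01
w53 = 1.0 - eta * avg_grad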
The upcoming chapter introduces pipeline parallelization and the synchronization processes in detail. We will also discuss why a lossless connection is required for AI/ML.