Monday 21 October 2024

AI for Network Engineers: Multi-Class Classification

 Introduction 

This chapter explains the multi-class classification training process. It begins with an introduction to the MNIST dataset (Modified National Institute of Standards and Technology dataset). Next, it describes how the SoftMax activation function computes the class probabilities for the image fed into the model during the forward pass, and how the weight parameters are adjusted during the backward pass to improve training results. Additionally, the chapter discusses the data parallelization strategy from a network perspective.


MNIST Dataset

We will use the MNIST dataset [1], which consists of handwritten digits, to demonstrate the training process. The MNIST dataset includes four files: (1) a training set of 60,000 gray-scale images (28x28 pixels), (2) the training set labels, (3) a test set of 10,000 images (28x28 pixels), and (4) the test set labels. Figure 3-1 illustrates the structure and dependencies between the training dataset and the labels.

The file train-images-idx3-ubyte contains metadata describing how the images are ordered, along with the image pixel order. The file train-labels-idx1-ubyte defines which label (the digits 0-9) corresponds to which image in the image file. Since we have ten possible outputs, we use ten output neurons.
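
If you want to inspect these files yourself, the sketch below shows one way to read them in Python. It assumes the standard IDX header layout (big-endian magic number, item count, and image dimensions) and that the files have been downloaded and decompressed into the working directory.

import struct
import numpy as np

def read_idx_images(path):
    # IDX3 image file: magic number, image count, rows, cols, then raw pixel bytes.
    with open(path, "rb") as f:
        magic, count, rows, cols = struct.unpack(">IIII", f.read(16))
        pixels = np.frombuffer(f.read(), dtype=np.uint8)
    return pixels.reshape(count, rows, cols)

def read_idx_labels(path):
    # IDX1 label file: magic number, label count, then one byte (0-9) per image.
    with open(path, "rb") as f:
        magic, count = struct.unpack(">II", f.read(8))
        return np.frombuffer(f.read(), dtype=np.uint8)

images = read_idx_images("train-images-idx3-ubyte")   # shape (60000, 28, 28)
labels = read_idx_labels("train-labels-idx1-ubyte")   # shape (60000,)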

Before the training process begins, the labels for each image-label pair are one-hot encoded. This process occurs before the data is fed into the model and involves marking the output neuron that corresponds to the digit represented by the image. For example, image number 142 in Figure 3-1 represents the digit 8, which corresponds to output neuron 9. Note that the first digit, 0, is mapped to neuron 1, so the digit 8 is mapped to neuron 9. One-hot encoding creates a vector of ten values, where the number 1 is placed at the position of the expected neuron, and all other positions are set to 0.

Figure 3-1: Training Dataset & Labels – The MNIST Database.
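
A minimal sketch of one-hot encoding in Python using NumPy; the label 8 follows the image-142 example above.

import numpy as np

def one_hot(label, num_classes=10):
    # A vector of ten zeros with a one at the position of the expected digit.
    vector = np.zeros(num_classes)
    vector[label] = 1.0
    return vector

print(one_hot(8))   # [0. 0. 0. 0. 0. 0. 0. 0. 1. 0.] - digit 8, ninth output neuron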

Forward Pass


Model Probability


Figures 3-2 and 3-3 illustrate the forward pass process for multi-class classification. The input layer flattens the 28x28 pixel image into 784 input parameters, where each parameter represents the intensity of a pixel (0 = black, 255 = white). These 784 input values are then passed to all 128 neurons in the hidden layer. Each neuron in the hidden layer receives all 784 inputs, and each of these inputs is associated with a unique weight. Therefore, each of the 128 neurons has 784 weight parameters, and the total weight parameter count of the hidden layer is 100,352.

In the hidden layer, each neuron computes the weighted sum of its inputs and then applies the ReLU activation function to the result. This process produces 128 activation values—one for each neuron in the hidden layer.

Next, these 128 activation values are fed into the output layer, which consists of 10 neurons (corresponding to the 10 possible classes of the MNIST dataset). Each output neuron is connected to all 128 activation values from the hidden layer. Therefore, the weight parameter count in the output layer is 1,280. Again, each neuron computes the weighted sum of its inputs, and the result of this calculation is called a logit.
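
The layer sizes above can be expressed as a small model. The sketch below uses PyTorch purely as an illustration (the chapter does not prescribe a framework) and reproduces the weight counts mentioned in the text.

import torch.nn as nn

# 784 inputs -> 128 hidden neurons (ReLU) -> 10 output neurons (logits).
model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)

hidden_weights = 784 * 128    # 100,352 weights in the hidden layer
output_weights = 128 * 10     # 1,280 weights in the output layer
print(hidden_weights + output_weights)              # 101,632 weights
# Including the 138 bias terms (128 + 10) gives the full trainable parameter count:
print(sum(p.numel() for p in model.parameters()))   # 101,770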

In the output layer, the SoftMax activation function is applied to these logits. SoftMax first computes the exponential of each logit, using Euler’s number e as the base. Then, it computes the sum of these exponentials, which in this example is 24.813. The probability for each class (denoted as ŷ) is calculated by dividing each neuron’s exponential by the sum of all exponentials.

In this example, the output neuron corresponding to class "5" produces the highest probability, meaning the model predicts the digit in the image is 5. However, since this prediction is incorrect in the first iteration, the model will adjust its weights during backpropagation.
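
A short sketch of the SoftMax computation in NumPy; the logits below are made-up values, not the exact numbers in Figure 3-2.

import numpy as np

def softmax(logits):
    # Exponentiate each logit (base e) and divide by the sum of the exponentials.
    exps = np.exp(logits)
    return exps / exps.sum()

logits = np.array([0.8, 0.2, 1.1, 0.5, 1.6, 2.1, 0.3, 0.9, 1.8, 0.4])  # hypothetical
probs = softmax(logits)
print(probs.sum())      # 1.0 - the ten probabilities always sum to one
print(probs.argmax())   # the class the model currently predicts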

In our model, we have 101,632 weight parameters. The number of bits used to store each weight parameter in a neural network depends on the numerical precision chosen for the model. The 32-bit floating point (FP32 – single precision) is the standard precision used, where each weight is represented using 32 bits (4 bytes). This format offers good precision but can be memory-intensive for large models. To reduce memory usage and increase speed, many modern hardware systems use 16-bit floating point (FP16 – half precision), where each weight is represented using 16 bits (2 bytes). There is also 64-bit floating point (FP64 – double precision), which uses 64 bits (8 bytes), providing more precision and a larger range than FP32, but at the cost of increased memory usage.

In our model, using FP32, the memory required for the weight parameters is 406,528 bytes (4 × 101,632).

Figure 3-2: Forward pass – Probability Computation.

Cross-Entropy Loss


In our example, the highest probability value (0.244) is provided by neuron 5, though the expected value should be produced by neuron 9. Next, the algorithm computes the cross-entropy loss by taking the negative logarithm of the probability value for the expected neuron, as defined by the one-hot encoded label. In our example, the probability of the digit being 8, computed by neuron 9, is 0.181. The cross-entropy loss is calculated by taking the negative log of 0.181, resulting in 0.734.

Figure 3-3: Forward pass – Cross-Entropy Loss.
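
As a sketch, the loss for a single image is simply the negative logarithm of the probability assigned to the expected class. Note that deep-learning frameworks use the natural logarithm, which gives roughly 1.71 for a probability of 0.181; the exact value depends on the logarithm base and the rounding used in the figure.

import numpy as np

def cross_entropy(probs, one_hot_label):
    # Negative log of the probability the model assigned to the expected class.
    expected_index = one_hot_label.argmax()
    return -np.log(probs[expected_index])

print(-np.log(0.181))   # ~1.71 (natural logarithm)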

Backward Pass


Gradient Computing


The gradient for the neurons in the output layer is calculated by subtracting the ground truth value (from the one-hot encoding) from the probability generated by the SoftMax function. Although the cross-entropy loss is used during training, it is not directly involved in this gradient calculation. Figure 3-4 illustrates the gradient computation process.

For neurons in the hidden layer, the gradient is computed by taking the weighted sum of the gradients from the connected output neurons, using the connecting weight values, and then multiplying the result by the derivative of the neuron’s own ReLU activation function. The formula is shown in the figure below.

Figure 3-4: Backward pass - Gradient Calculation.
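
A minimal NumPy sketch of both gradient rules, assuming the layer sizes used in this chapter (10 output neurons, 128 hidden neurons):

import numpy as np

def output_gradients(probs, one_hot_label):
    # Output layer: SoftMax probability minus the ground truth (one-hot) value.
    return probs - one_hot_label

def hidden_gradients(output_grads, output_weights, hidden_z):
    # Hidden layer: weighted sum of the output-layer gradients (using the
    # connecting weights), multiplied by the ReLU derivative (1 where z > 0).
    relu_derivative = (hidden_z > 0).astype(float)
    return (output_weights.T @ output_grads) * relu_derivative

# output_weights has shape (10, 128), hidden_z has shape (128,),
# probs and one_hot_label have shape (10,).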

Weight Adjustment Values


After calculating the gradients for all neurons, the backpropagation algorithm determines the weight adjustments. While this process was explained in the previous chapter, let's briefly recap it here. The weight adjustment value is computed by multiplying the neuron's gradient by the input value associated with the weight and by the learning rate, a shared hyperparameter.

Figure 3-5 illustrates the computation from two perspectives: neuron 9 in the output layer and neuron 1 in the hidden layer. 

Figure 3-5: Backward Pass - Weight Adjustment Value.

Weight Update


Figure 3-6 depicts how the new weight value is obtained by adding the adjustment value to the initial weight value. 

Figure 3-6: Backward Pass – Computing New Value for the Weight Parameter.
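
Below is a sketch of the adjustment and update steps shown in Figures 3-5 and 3-6. It follows the sign convention used in this book (adjustment = gradient x input x learning rate, added to the old weight); most implementations instead subtract learning rate x gradient x input with a positive learning rate.

import numpy as np

def update_weights(weights, gradients, inputs, learning_rate):
    # One adjustment per weight: neuron gradient x connected input x learning rate.
    adjustments = learning_rate * np.outer(gradients, inputs)
    # The new weight is the old weight plus its adjustment value.
    return weights + adjustments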

Data Parallelization and Network Impact


So far, we have explored how the backpropagation algorithm functions in multi-class classification on a single GPU. In the next section, we will discuss the process when the input data size exceeds the memory capacity of a single GPU and how we leverage a Data Parallelization Strategy to divide the training data among multiple GPUs. We will also look at the Data Parallelization Strategy from the link utilization perspective.

In Figure 3-7, we have divided the training data into mini-batches. The first half of these mini-batches is stored in system memory DRAM-1 (on Server-1), and the second half is stored in system memory DRAM-2 (on Server-2). This approach demonstrates how, when the data exceeds the memory capacity of a GPU, idle mini-batches can be stored in system memory. The model and its parameters are stored in the GPU’s VRAM alongside the active mini-batch being processed for training.

In our example, each mini-batch is composed of 64 images, which the GPU processes in parallel, with each image handled by a dedicated GPU core. Thus, for the first mini-batch of 64 images, we have 8,832 neuron computations (64 images x 138 neurons) and 6,504,448 weight computations ((128 neurons x 64 images x 784 inputs) + (10 neurons x 64 images x 128 inputs)) for the input data, along with 8,832 bias terms. After the forward pass (computation phase), the backward pass begins. During this phase, the backpropagation algorithm calculates the gradients for all neurons.

We employ the All-Reduce collective communication model, where gradients for each layer are summed across mini-batches and synchronized between GPUs. Synchronization happens through direct memory copy using Remote Direct Memory Access (RDMA), enabling GPUs to communicate directly with each other's memory while bypassing the network stack. The RDMA process is covered in detail in a later chapter.

Gradients are summed and synchronized among the GPUs over a network connection. During this synchronization, the GPU’s network interface controller (NIC) forwards packets at line rate, resulting in nearly 100% link utilization. Once synchronization is complete, the GPUs average the gradients and compute new weight adjustment values, which are then used to update all weight parameters. These updated weights are also synchronized between GPUs via RDMA.
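
The sketch below illustrates the gradient synchronization step with PyTorch's torch.distributed package, assuming one process per GPU and an already initialized NCCL process group (in practice, DistributedDataParallel performs this step automatically).

import torch.distributed as dist

def sync_gradients(model, world_size):
    for param in model.parameters():
        if param.grad is not None:
            # All-Reduce: sum this gradient tensor across all GPUs over the network.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            # Average the summed gradients before the weight update.
            param.grad /= world_size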

Next, the training data is fed back into the model with the newly adjusted weights. During this computation phase, network link utilization is relatively low. The training process typically requires multiple iterations of the forward and backward passes. Once the training results are satisfactory, the adjusted weight values are stored in VRAM, and the processed mini-batch is copied back to system memory, while a new mini-batch is transferred from system memory to the GPU’s VRAM. This transfer usually occurs over the PCIe link, which can introduce delays and increase the overall training time.

To mitigate this, multiple GPUs can be used, allowing the entire training dataset to be stored across their combined memory. This avoids frequent data transfers over PCIe and accelerates the training process. If all GPUs are within the same server, communication occurs over the high-speed NVLink, further enhancing performance.

Due to the time-consuming nature of the training process, which can take several weeks or even months, it is crucial that the communication channel between GPUs is lossless and forwards packets at line rate. Additionally, regular snapshots of the training state should be taken to guard against failures such as packet loss in the backend network. Without these snapshots, the training job would have to start over from the beginning.

Figure 3-7: Gradient Synchronization and Network Utilization.

References



1. Yann LeCun, Corinna Cortes, Christopher J.C. Burges: The MNIST database of handwritten digits. 





Monday 14 October 2024

AI for Network Engineers: Backpropagation Algorithm

 Introduction 


This chapter introduces the training model of a neural network based on the Backpropagation algorithm. The goal is to provide a clear and solid understanding of the process without delving deeply into the mathematical formulas, while still explaining the fundamental operations of the involved functions. The chapter also briefly explains why, and in which phases the training job generates traffic to the network, and why lossless packet transport is required. The Backpropagation algorithm is composed of two phases: the Forward pass (computation phase) and the Backward pass (adjustment and communication phase).

In the Forward pass, neurons in the first hidden layer calculate the weighted sum of input parameters received from the input layer, which is then passed to the neuron's activation function. Note that neurons in the input layer are not computational units; they simply pass the input variables to the connected neurons in the first hidden layer. The output from the activation function of a neuron is then used as input for the connected neurons in the next layer. The result of the activation function in the output layer represents the model's prediction, which is compared to the expected value (ground truth) using the error function. The output of the error function indicates the accuracy of the current training iteration. If the result is sufficiently close to the expected value (error function close to zero), the training is complete. Otherwise, it triggers the Backward pass process.

As the first step in the backward pass, the backpropagation algorithm calculates the derivative of the error function, providing the output error (gradient) of the model. Next, the algorithm computes the error term (gradient) for the neuron(s) in the output layer by multiplying the derivative of each neuron’s activation function by the model's error term. Then, the algorithm moves to the preceding layer and calculates the error term (gradient) for its neuron(s). This error term is now calculated using the error term of the connected neuron(s) in the next layer, the derivative of each neuron’s activation function, and the value of the weight parameter associated with the connection to the next layer.

After calculating the error terms, the algorithm determines the weight adjustment values for all neurons simultaneously. This computation is based on the input values, the error terms, and a user-defined learning rate. Finally, the backpropagation algorithm refines all weight values by adding the adjustment values to the initial weights. Once the backward pass is complete, the backpropagation algorithm starts a new iteration of the forward pass, gradually improving the model's predictions until they closely match the expected values, at which point the training is complete.


Figure 2-1: Backpropagation Overview.


The First Iteration - Forward Pass


Training a model often requires multiple iterations of forward and backward passes. In the forward pass, neurons in the first hidden layer calculate the weighted sum of input values, each multiplied by its associated weight parameter. These neurons then apply an activation function to the result. Neurons in subsequent layers use the activation output from previous layers as input for their own weighted sum calculations. This process continues through all the layers until reaching the output layer, where the activation function produces the model's prediction.

After the forward pass, the backpropagation algorithm calculates the error by comparing the model's output with the expected value, providing a measure of accuracy. If the model's output is close to the expected value, training is complete. Otherwise, the backpropagation algorithm initiates the backward pass to adjust the weights and reduce the error in subsequent iterations.

Neuron-a Forward Pass Calculations

Weighted Sum


In Figure 2-2, we have an imaginary training dataset with three inputs and a bias term. Input values and their respective initial weight values are listed below: 

x1 = 0.2 , initial weight wa1 = 0.1
x2 = 0.1, initial weight wa2 = 0.2
x3 = 0.4 , initial weight wa3 = 0.3
ba0 = 1.0 , initial weight wa0 = 0.6

From the model training perspective, the input values are constant, unchangeable values, while the weight values are variables that are refined during the backward pass process.

The standard way to write the weighted sum formula is: 

z = Σ (wi ⋅ xi) + b, where the sum runs from i = 1 to n

Where:
n = 3 represents the number of input values (x1, x2, x3).
Each input xi is multiplied by its respective weight wi, and the sum of these products is added to the bias term b.

In this case, the equation can be explicitly stated as:

za = (wa1 ⋅ x1) + (wa2 ⋅ x2) + (wa3 ⋅ x3) + (wa0 ⋅ ba0)

Which with our parameters gives:

za = (0.1 ⋅ 0.2) + (0.2 ⋅ 0.1) + (0.3 ⋅ 0.4) + (0.6 ⋅ 1.0) = 0.02 + 0.02 + 0.12 + 0.60 = 0.76

Activation Function


Neuron-a uses the previously calculated weighted sum as input for the activation function. We are using the ReLU function (Rectified Linear Unit), which is more popular than the hyperbolic tangent and sigmoid functions due to its simplicity and lower computational cost.

The standard way to write the ReLU function is:

f(a) = max(0, z)

Where:
f(a) represents the activation function.
z is the weighted sum of inputs.

The ReLU function returns z if z > 0; otherwise, it returns 0.

In our example, the weighted sum za is 0.76, so the ReLU function returns:

ya = f(a) = max(0, 0.76) = 0.76


Figure 2-2: Activation Function for Neuron-a.


Neuron-b Forward Pass Calculations

Weighted Sum


Besides the bias term value of 1.0, Neuron-b uses the result provided by the activation function of Neuron-a as an input to its weighted sum calculation. Input values and their respective initial weight values are listed below: 

ya = 0.76, initial weight wb1 = 0.4
bb0 = 1.0, initial weight wb0 = 0.5

This gives us:

zb = (wb1 ⋅ ya) + (wb0 ⋅ bb0) = (0.4 ⋅ 0.76) + (0.5 ⋅ 1.0) = 0.304 + 0.5 = 0.804

Activation Function


Just like Neuron-a, Neuron-b uses the previously calculated weighted sum as input for the activation function. Because zb = 0.804 is greater than zero, the ReLU activation function f(b) returns:

yb = f(b) = max(0, 0.804) = 0.804

Neuron-b is in the output layer, so its activation function result yb represents the prediction of the model. 

 

Figure 2-3: Activation Function for Neuron-b.

Error Function


To keep things simple, we have used only one training example. However, in real-life scenarios, there will always be more than one training example. For instance, a training dataset might contain 10 images of cats and 10 images of dogs, each having 28x28 pixels. Each image represents a training example, giving us a total of 20 training examples. The purpose of the error function is to provide a single error metric for all training examples. In this case, we are using the Mean Squared Error (MSE).

We can calculate the MSE as MSE = ½ (y − yb)², where the expected value y is 1.0 and the model’s prediction for the training example is yb = 0.804. This gives ½ (1.0 − 0.804)² ≈ 0.019, an error metric that can be interpreted as an indicator of the model's accuracy.

The result of the error function is not sufficiently close to the desired value, which is why this result triggers the backward pass process.

Figure 2-4: Calculating the Error Function for Training Examples.
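
The forward pass and the error function of this chapter can be reproduced with a few lines of Python. The sketch below uses the weight values given in the figures (wb1 = 0.4 and a bias weight of 0.5 for Neuron-b) and the squared error scaled by ½, which matches the reported 0.019.

def relu(z):
    return max(0.0, z)

# Neuron-a (Figure 2-2): inputs, initial weights, and the bias weight.
x, wa, wa0 = [0.2, 0.1, 0.4], [0.1, 0.2, 0.3], 0.6
za = sum(xi * wi for xi, wi in zip(x, wa)) + 1.0 * wa0
ya = relu(za)                       # 0.76

# Neuron-b (Figure 2-3): one input (ya) plus the bias term.
wb1, wb0 = 0.4, 0.5
zb = ya * wb1 + 1.0 * wb0
yb = relu(zb)                       # 0.804

# Error function: squared error scaled by 1/2.
y = 1.0
mse = 0.5 * (y - yb) ** 2
print(round(ya, 3), round(yb, 3), round(mse, 3))   # 0.76 0.804 0.019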

Backward Pass


In the forward pass, we calculate the model’s accuracy using several functions. First, Neuron-a computes the weighted sum Σ(za ) by multiplying the input values and the bias term with their respective weights. The output, za, is then passed through the activation function f(a), producing ya. Neuron-b, in turn, takes ya and the bias term to calculate the weighted sum Σ(zb ). The activation function f(b) then uses zb to compute the model’s output, yb. Finally, the error function f(e) calculates the model’s accuracy based on the output.

So, the dependencies between the functions can be seen as:

e depends on f(b), which depends on zb, which depends on f(a), which depends on za, which depends on the weight w1.

The backpropagation algorithm combines these five functions to create a new error function, enew(x), using function composition and the chain rule. The following expression shows how the error function relates to the weight parameter w1 used by Neuron-a:

enew(w1) = e(f(b)(zb(f(a)(za(w1)))))

This can be expressed using the composition operator (∘) between functions:

enew = e ∘ f(b) ∘ zb ∘ f(a) ∘ za


Next, we use a method called gradient descent to gradually adjust the initial weight values, refining them to bring the model's output closer to the expected result. To do this, we compute the derivative of the composite function using the chain rule, where we take the derivatives of:

1. The error function (e) with respect to the activation function (b).
2. The activation function b with respect to the weighted sum (zb). 
3. The weighted sum (zb) with respect to the activation function (a).
4. The activation function (a) with respect to weighted sum (za(w1)). 

In Leibniz’s notation, this looks like:

∂e/∂w1 = (∂e/∂f(b)) ⋅ (∂f(b)/∂zb) ⋅ (∂zb/∂f(a)) ⋅ (∂f(a)/∂za) ⋅ (∂za/∂w1)

Figure 2-5 illustrates the components of the backpropagation algorithm, along with their relationships and dependencies.


Figure 2-5: The Backward Pass Overview.


Partial Derivative for Error Function – Output Error (Gradient)


As a recap, and to illustrate that the prediction of the first iteration fails, Figure 2-6 includes the computation of the error function (MSE = 0.019). 

As a first step, we calculate the partial derivative of the error function. In this case, the partial derivative describes the rate of change of the error function when the input variable yb changes. The derivative is called partial when one of its input values is held constant (i.e., not adjusted by the algorithm). In our example, the expected value y is the constant input. The result of the partial derivative of the error function describes how the predicted output yb should change to minimize the model’s error.

We use the following formula for computing the derivative of the error function:

∂e/∂yb = −(y − yb) = −(1.0 − 0.804) = −0.196

Figure 2-6: The Backward Pass – Derivative of the Error Function.

The following explanation is meant for readers interested in why there is a minus sign in front of the function.

When calculating the derivative, we use the Power Rule. The Power Rule states that if we have a function f(x) = x^n, then its derivative is f’(x) = n ⋅ x^(n-1). In our case, this applies to the error function:

e = ½ (y − yb)²

Using the Power Rule, the derivative becomes:

2 ⋅ ½ (y − yb) = (y − yb)

Next, we apply the chain rule by multiplying this result by the derivative of the inner function (y − yb), with respect to yb . Since y is treated as a constant (because it represents our target value, which doesn't change during optimization), the derivative of (y − yb) with respect to yb  is simply −1, as the derivative of − yb  with respect to yb  is −1, and the derivative of y (a constant) is 0.

Therefore, the final derivative of the error function with respect to yb is:

∂e/∂yb = −(y − yb)

Partial Derivative for the Activation Function


After computing the output error, we calculate the derivative of the activation function f(b) with respect to zb. Neuron-b uses the ReLU activation function, whose derivative is 1 if the input is greater than 0 and 0 otherwise. In our case, the input zb = 0.804 is greater than zero, so the derivative is 1.

Error Term for Neurons (Gradient)


The error term (gradient) for Neuron-b is calculated by multiplying the output error (the partial derivative of the error function) by the derivative of the neuron's activation function. This means that we now propagate the model's error backward and use it as a base value for fine-tuning the model's accuracy (i.e., refining the weight values). This is why the term backward pass fits the process perfectly.

Error term for Neuron-b = output error ⋅ f’(b) = −0.196 ⋅ 1 = −0.196

Figure 2-7: The Backward Pass – Error Term (Gradient) for Neuron-b.


After computing the error term for Neuron-b, the backward pass moves to the preceding layer, the hidden layer, and calculates the error term for Neuron-a. The algorithm computes the derivative of the activation function, f’(a) = 1, as it did for Neuron-b. Next, it multiplies this result by Neuron-b's error term (−0.196) and the connecting weight parameter, wb1 = 0.4. The result, −0.0784, is the error term for Neuron-a.


Figure 2-8: The Backward Pass – Error Term (Gradient) for Neuron-a.
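
The two error terms can be checked with a short sketch (the ReLU derivatives are 1 because both weighted sums are positive):

# Error term for Neuron-b: output error x derivative of its ReLU (= 1).
output_error = -(1.0 - 0.804)                 # -0.196
delta_b = output_error * 1.0                  # -0.196

# Error term for Neuron-a: Neuron-b's error term x connecting weight x ReLU derivative.
wb1 = 0.4
delta_a = delta_b * wb1 * 1.0                 # -0.0784
print(round(delta_b, 4), round(delta_a, 4))   # -0.196 -0.0784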


Weight Adjustment Value


After computing the error terms for all neurons in every layer, the algorithm simultaneously calculates the weight adjustment value for every weight. The process is simple: the error term is multiplied by the input value connected to the weight and by the learning rate (η). The learning rate balances convergence speed and training stability. We have set it to −0.6 for the first iteration. The learning rate is a hyperparameter, meaning it is set by the user rather than learned by the model during training. It affects the behavior of the backpropagation algorithm by controlling the size of the weight updates. It is also possible to adjust the learning rate during training: using a higher learning rate at the start allows faster convergence, and lowering it later avoids overshooting the optimal result. 

Weight adjustment value for weight wb1 and wa1 respectively:


Note! It is not recommended to use a negative learning rate. I use it here because we get a good enough output for the second forward pass iteration.


Figure 2-9: The Backward Pass – Weight Adjustment Value for Neurons.

Refine Weights


As the last step, the backpropagation algorithm computes new values for every weight parameter in the model by simply summing the initial weight value and weight adjustment value.

New values for weight  parameters wb1 and wa1 respectively:



Figure 2-10: The Backward Pass – Compute New Weight Values.

The Second Iteration - Forward Pass


After updating all the weight values (wa0, wa1, wa2, and wa3), the backpropagation process begins the second iteration of the forward pass. As shown in Figure 2-11, the model output yb = 0.9982 is very close to the expected value y = 1.0. The new MSE of 0.0017 is much better than the 0.019 computed in the first iteration.


Figure 2-11: The Second Iteration of the Forward Pass.

Network Impact


Figure 2-12 shows a hypothetical example of Data Parallelization, where our training data set is split into two batches, A and B, which are processed by GPU-A and GPU-B, respectively. The training model is the same on both GPUs: Fully-Connected, with two hidden layers of four neurons each, and one output neuron in the output layer.

After computing a model prediction during the forward pass, the backpropagation algorithm begins the backward pass by calculating the gradient (output error) of the error function. Once computed, the gradients are synchronized between the GPUs. The algorithm then averages the gradients, and the process moves to the preceding layer. Neurons in the preceding layer calculate their gradients by taking the weighted sum of the averaged gradients of their connected neurons, using the connecting weights, and multiplying the result by the partial derivative of their local activation function. These neuron-level gradients are then synchronized between the GPUs and averaged before the process moves to the preceding layer. The backpropagation algorithm repeats the same process through all layers. 

If packet loss occurs during synchronization, it can ruin the entire training process, which would need to be restarted unless snapshots were taken. The cost of losing even a single packet could be enormous, especially if training has been ongoing for several days or weeks. Why is a single packet so important? If the synchronization between the gradients of two parallel neurons fails due to packet loss, the algorithm cannot compute the average, and the neurons in the preceding layer cannot calculate their gradients. Moreover, any delay on the connection, whether the synchronization happens over NVLink, InfiniBand, Ethernet (RoCE or RoCEv2), or a wireless link, slows down the completion of the training. This causes GPU under-utilization, which is not efficient from a business perspective.

Figure 2-12: Backward Pass – Gradient Synchronization and Averaging.

To be continued...



Monday 30 September 2024

AI for Network Engineers: Chapter 2 - Backpropagation Algorithm: Introduction

This chapter introduces the training model of a neural network based on the Backpropagation algorithm. The goal is to provide a clear and solid understanding of the process without delving deeply into the mathematical formulas, while still explaining the fundamental operations of the involved functions. The chapter also briefly explains why, and in which phases the training job generates traffic to the network, and why lossless packet transport is required. The Backpropagation algorithm is composed of two phases: the Forward pass (computation phase) and the Backward pass (adjustment and communication phase).

In the Forward pass, neurons in the first hidden layer calculate the weighted sum of input parameters received from the input layer, which is then passed to the neuron's activation function. Note that neurons in the input layer are not computational units; they simply pass the input variables to the connected neurons in the first hidden layer. The output from the activation function of a neuron is then used as input for the connected neurons in the next layer, whether it is another hidden layer or the output layer. The result of the activation function in the output layer represents the model's prediction, which is compared to the expected value (ground truth) using the error function. The output of the error function indicates the accuracy of the current training iteration. If the result is sufficiently close to the expected value, the training is complete. Otherwise, it triggers the Backward pass process.

In the Backward pass process, the Backpropagation algorithm first calculates the derivative of the error function. This derivative is then used to compute the error term for each neuron in the model. Neurons use their calculated error terms to determine how much and in which direction the current weight values must be fine-tuned. Depending on the model and the parallelization strategy, GPUs in multi-GPU clusters synchronize information during the Backpropagation process. This process affects network utilization.

Our feedforward neural network, shown in Figure 2-1, has one hidden layer and one output layer. If we wanted our example to be a deep neural network, we would need to add additional layers, as the definition of "deep" requires two or more hidden layers. For simplicity, the input layer is not shown in the figure.

We have three input parameters connected to neuron-a in the hidden layer as follows:

Input X1 = 0.2 > neuron-a via weight Wa1 = 0.1

Input X2 = 0.1 > neuron-a via weight Wa2 = 0.2

Input X3 = 0.4 > neuron-a via weight Wa3 = 0.3

Bias ba0 = 1.0 > neuron-a via weight Wa0 = 0.6

The bias term helps ensure that the neuron is active, meaning its output value is not zero.

The input parameters are treated as constant values, while the weight values are variables that will be adjusted during the Backward pass if the training result does not meet expectations. The initial weight values are our best guess for achieving the desired training outcome. The result of the weighted sum calculation is passed to the activation function, which provides the input for neuron-b in the output layer. We use the ReLU (Rectified Linear Unit) activation function in both layers due to its simplicity. There are other activation functions, such as hyperbolic tangent (tanh), sigmoid, and softmax, but those are outside the scope of this chapter.

The input values and weights for neuron-b are:

Neuron-a activation function output f(a) > neuron-b via weight Wb1

Bias bb0 = 1.0 > neuron-b via weight Wb0 = 0.5

The output, Ŷ, from neuron-b represents our feedforward neural network's prediction. This value is used along with the expected result, y, as input for the error function. In this example, we use the Mean Squared Error (MSE) error function. As we will see, the result of the first training iteration does not match our expected value, leading us to initiate the Backward pass process.

In the first step of the Backward pass, the Backpropagation algorithm calculates the derivative of the error function (MSE’). Neurons a and b use this result as input to compute their respective error terms by multiplying MSE’ with the derivative of their activation function and the weight value associated with the connection to the next neuron. Note that for neuron-b, there is no next layer—just the error function—so the weight parameter is excluded from the error term calculation of neuron-b. Next, the error term value is multiplied by an input value and the learning rate, and this adjustment value is added to the current weight.

After completing the Backward pass, the Backpropagation algorithm starts a new iteration of the Forward pass, gradually improving the model's prediction until it closely matches the expected value, at which point the training is complete.

Figure 2-1: Backpropagation Algorithm.



Sunday 15 September 2024

Artificial Intelligence for Network Engineers: Introduction

Several books on artificial intelligence (AI) and deep learning (DL) have been published over the past decade. However, I have yet to find a book that explains deep learning from a networking perspective while providing a solid introduction to DL. My goal is to fill this gap by writing a book titled AI for Network Engineers (note that the title name may change during the writing process). Writing about such a complex subject will take time, but I hope to complete and release it within a year.

Part I: Deep Learning and Deep Neural Networks

The first part of the book covers the theory behind Deep Learning. It begins by explaining the construct of a single artificial neuron and its functionality. Then, it explores various Deep Neural Network models, such as Feedforward Neural Networks (FNN), Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN). Next, the first part discusses data and model parallelization strategies such as Data, Pipeline, and Tensor Parallelism, explaining how input data and/or model sizes that exceed the memory capacity of GPUs within a single server can be distributed across multiple GPU servers.

Part II: AI Data Center Networking - Lossless Ethernet

After a brief introduction of RoCEv2, the second part continues from part one by explaining how parallelization strategies affect network utilization. It then discusses the Data Center Quantized Congestion Notification (DCQCN) scheme for RoCEv2, introducing key concepts such as Explicit Congestion Notification (ECN) and Priority-based Flow Control (PFC). In addition to ECN and PFC, this section covers other congestion-avoidance methods, such as packet spraying and deep buffers. The second part also delves into AI data center design choices focusing on the East-West backend network. It introduces Rail, Top-of-Rack (ToR), and Rail-Optimized designs.



Figure 1: Book Introduction.

AI for Network Engineers: Chapter 1 - Deep Learning Basics

Content

Introduction 
Artificial Neuron 
  Weighted Sum for Pre-Activation Value 
  ReLU Activation Function for Post-Activation 
  Bias Term
  S-Shaped Functions – TANH and SIGMOID
Network Impact
Summary
References

Introduction


Artificial Intelligence (AI) is a broad term for solutions that aim to mimic the functions of the human brain. Machine Learning (ML), in turn, is a subset of AI, suitable for tasks like simple pattern recognition and prediction. Deep Learning (DL), the focus of this section, is a subset of ML that leverages algorithms to extract meaningful patterns from data. Unlike ML, DL does not necessarily require human intervention, such as providing structured, labeled datasets (e.g., 1,000 bird images labeled as “bird” and 1,000 cat images labeled as “cat”). 


DL utilizes layered, hierarchical Deep Neural Networks (DNNs), where hidden and output layers consist of computational units, artificial neurons, which individually process input data. The nodes in the input layer pass the input data to the first hidden layer without performing any computations, which is why they are not considered neurons or computational units. Each neuron calculates a pre-activation value (z) based on the input received from the previous layer and then applies an activation function to this value, producing a post-activation output (ŷ) value. There are various DNN models, such as Feed-Forward Neural Networks (FNN), Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN), each designed for different use cases. For example, FNNs are suitable for simple, structured tasks like handwritten digit recognition using the MNIST dataset [1], CNNs are effective for larger image recognition tasks such as with the CIFAR-10 dataset [2], and RNNs are commonly used for time-series forecasting, like predicting future sales based on historical sales data. 


To provide accurate predictions based on input data, neural networks are trained using labeled datasets. The MNIST (Modified National Institute of Standards and Technology) dataset [1] contains 60,000 training and 10,000 test images of handwritten digits (grayscale, 28x28 pixels). The CIFAR-10 [2] dataset consists of 60,000 color images (32x32 pixels), with 50,000 training images and 10,000 test images, divided into 10 classes. The CIFAR-100 dataset [3], as the name implies, has 100 image classes, with each class containing 600 images (500 training and 100 test images per class). Once the test results reach the desired level, the neural network can be deployed to production.


Figure 1-1: Deep Learning Introduction.

Saturday 10 August 2024

AI/ML Networking: Part-IV: Convolutional Neural Network (CNN) Introduction

Feed-forward Neural Networks are suitable for simple tasks like basic time series prediction without long-term relationships. However, FNNs are not a one-size-fits-all solution. For instance, the digital image training process uses the pixel values of an image as input data. Consider training a model to recognize a high-resolution (600 dpi), 3.937 x 3.937 inch digital RGB (red, green, blue) image. The number of input parameters can be calculated as follows (see the sketch after the list):

Width: 3.937 in x 600 ≈ 2362 pixels
Height: 3.937 in x 600 ≈ 2362 pixels
Pixels in image: 2362 x 2362 = 5,579,044 pixels
RGB (3 channels): 5,579,044 pixels x 3 channels = 16,737,132
Total input parameters: 16,737,132
Memory consumption: ≈ 16 MB
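
A quick Python sketch of the same arithmetic, assuming one byte per 8-bit channel value:

dpi, inches, channels = 600, 3.937, 3

width = round(inches * dpi)               # ~2362 pixels
height = round(inches * dpi)              # ~2362 pixels
pixels = width * height                   # 5,579,044
input_parameters = pixels * channels      # 16,737,132
memory_bytes = input_parameters           # one byte per channel value
print(input_parameters, f"{memory_bytes / 1e6:.1f} MB")   # 16737132 16.7 MB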

FNNs are not ideal for digital image training. If we used an FNN for training in our example, we would feed 16,737,132 input parameters to the first hidden layer, each having a unique weight. For image training there might be thousands of images, and handling millions of parameters demands significant computation cycles and is a memory-intensive process. Besides, FNNs treat each pixel as an independent unit, so the algorithm does not understand dependencies between pixels and cannot recognize the same image if it shifts within the frame. FNNs also do not detect edges and other crucial details. 

A better model for training on digital images is the Convolutional Neural Network (CNN). Unlike FNNs, where each neuron has a unique set of weights, CNNs use the same set of weights (kernel/filter) across different regions of the image, which reduces the number of parameters. Besides, the CNN algorithm understands pixel dependencies and can recognize patterns and objects regardless of their position in the image. 

The input data processing in CNNs is hierarchical. The first layer, the convolutional layer, focuses on low-level features such as textures and edges. The second layer, the pooling layer, captures higher-level features like shapes and objects. These two layers significantly reduce the number of input parameters before they are fed into the neurons in the first hidden layer, the fully connected layer, where each neuron has unique weights (like in FNNs).



Friday 19 July 2024

AI/ML Networking: Part-III: Basics of Neural Networks Training Process

Neural Network Architecture Overview

Deep Neural Networks (DNN) leverage various architectures for training, with one of the simplest and most fundamental being the Feedforward Neural Network (FNN). Figure 2-1 illustrates our simple, three-layer FNN.

Input Layer: 

The first layer doesn’t have neurons; instead, the input data parameters X1, X2, and X3 reside in this layer, from where they are fed to the first hidden layer. 

Hidden Layer: 

The neurons in the hidden layer calculate a weighted sum of the input data, which is then passed through an activation function. In our example, we are using the Rectified Linear Unit (ReLU) activation function. These calculations produce activation values for the neurons. The activation value is the transformed input received from the input layer, which is then published to the next layer.

Output Layer: 

Neurons in this layer calculate the weighted sum in the same manner as neurons in the hidden layer, but the result of the activation function is the final output.


The process described above is known as the Forwarding pass operation. Once the forward pass process is completed, the result is passed through a loss function, where the received value is compared to the expected value. The difference between these two values triggers the backpropagation process. The loss calculation is the initial phase of the backpropagation process. During backpropagation, the network fine-tunes the weight values, neuron by neuron, from the output layer back through the hidden layers. The neurons in the input layer do not participate in the backpropagation process because they do not have weight values to be adjusted.


After the backpropagation process, a new iteration of the forward pass begins from the first hidden layer. This loop continues until the received value is close enough to the expected value, indicating that the training is complete.


Figure 2-1: Deep Neural Network Basic Structure and Operations.

Forwarding Pass 


Next, let's examine the operation of a Neural Network in more detail. Figure 2-2 illustrates a simple, three-layer Feedforward Neural Network (FNN) data model. The input layer has two neurons, H1 and H2, each receiving one input data value: a value of one (1) is fed to neuron H1 by input neuron X1, and a value of zero (0) is fed to neuron H2 by input neuron X2. The neurons in the input layer do not calculate a weighted sum or an activation value but instead pass the data to the next layer, which is the first hidden layer.

The hidden layer in our example consists of two neurons. These neurons use the ReLU activation function to calculate the activation value. During the initialization phase, the weight values for these neurons are assigned using the He Initialization method, which is often used with the ReLU function. The He Initialization method calculates the variance as 2/n, where n is the number of neurons in the previous layer. In this example, with two input neurons, this gives a variance of 1 (= 2/2). The weights are then drawn from a normal distribution ~N(0, √variance), which in this case is ~N(0, 1). Basically, this means that the randomly generated weight values are centered around zero with a standard deviation of one.
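
A minimal NumPy sketch of He Initialization for this two-input layer; the actual weight values used in Figure 2-2 are, of course, fixed for the example.

import numpy as np

def he_init(n_in, n_out):
    # He Initialization: variance = 2 / n_in, weights drawn from N(0, sqrt(variance)).
    std = np.sqrt(2.0 / n_in)
    return np.random.randn(n_out, n_in) * std

hidden_weights = he_init(n_in=2, n_out=2)   # variance 2/2 = 1, standard deviation 1
print(hidden_weights)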

In Figure 2-2, the weight value for neuron H3 in the hidden layer is 0.5 for both input sources X1 (input data 1) and X2 (input data 0). Similarly, for the hidden layer neuron H4, the weight value is 1 for both input sources X1 (input data 1) and X2 (input data 0). Neurons in the hidden and output layers also have a bias variable. If the input to a neuron is zero, the output would also be zero if there were no bias. The bias ensures that a neuron can still produce a meaningful output even when the input is zero (i.e., the neuron is inactive). Neurons H4 and O5 have a bias value of 0.5, while neuron H3 has a bias value of 0 (I am using zero to simplify the calculation). 

Let’s start the forward pass process from neuron H3 in the hidden layer. First, we calculate the weighted sum using the formula below, where Z3 represents the weighted sum of input. Here, Xn is the actual input data value received from the input layer’s neuron, and Wn  is the weight associated with that particular input neuron.

The weighted sum calculation (Z3) for neuron H3:

Z3 = (X1 ⋅ W31) + (X2 ⋅ W32) + b3
Given:
Z3 = (1 ⋅ 0.5) + (0 ⋅ 0.5) + 0
Z3 = 0.5 + 0 + 0
Z3 = 0.5

To get the activation value a3 (shown as H3=0.5 in figure), we apply the ReLU function. The ReLU function outputs zero (0) if the calculated weighted sum Z is less than or equal to zero; otherwise, it outputs the value of the weighted sum Z.

The activation value a3 for H3 is:

ReLU (Z3) = ReLU (0.5) = 0.5

The weighted sum calculation for neuron H4:

Z4 = (X1 ⋅ W41) + (X2 ⋅ W42) + b4
Given:
Z4 = (1 ⋅ 1) + (0 ⋅1) + 0.5
Z4 = 1 + 0 + 0.5
Z4 = 1.5

The activation value using ReLU for Z4 is:

ReLU (Z4) = ReLU (1.5) = 1.5

 

Figure 2-2: Forwarding Pass on Hidden Layer.

After neurons H3 and H4 publish their activation values to neuron O5 in the output layer, O5 calculates the weighted sum Z5 of its inputs with weights W53 = 1 and W54 = 1. Using Z5, it calculates the output with the ReLU function. The difference between the received output value (Yr) and the expected value (Ye) triggers a backpropagation process. In our example, Yr − Ye = 0.5.

Backpropagation process

The loss function measures the difference between the predicted output and the actual expected output. The loss function value indicates how well the neural network is performing. A high loss value means the network's predictions are far from the actual values, while a low loss value means the predictions are close.

After calculating the loss, backpropagation is initiated to minimize this loss. Backpropagation involves calculating the gradient of the loss function with respect to each weight and bias in the network. This step is crucial for adjusting the weights and biases to reduce the loss in subsequent forwarding pass iterations.

Loss function is calculated using the formula below:

Loss (L) = (H3 x W53 + H4 x W54 + b5 – Ye)²
Given:
L = (0.5 x 1 + 1.5 x 1 + 0.5 - 2)²
L = (0.5 + 1.5 + 0.5 - 2)²
L = 0.5²
L = 0.25

 


Figure 2-3: Forwarding Pass on Output Layer.
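
The whole forwarding pass and the loss calculation can be reproduced with a short Python sketch using the weights and biases listed above:

def relu(z):
    return max(0.0, z)

X1, X2 = 1.0, 0.0

# Hidden layer (Figure 2-2): H3 with weights 0.5/0.5 and bias 0, H4 with 1/1 and bias 0.5.
H3 = relu(X1 * 0.5 + X2 * 0.5 + 0.0)      # 0.5
H4 = relu(X1 * 1.0 + X2 * 1.0 + 0.5)      # 1.5

# Output layer and loss (Figure 2-3), expected value Ye = 2.
W53, W54, b5, Ye = 1.0, 1.0, 0.5, 2.0
Yr = relu(H3 * W53 + H4 * W54 + b5)       # 2.5
loss = (Yr - Ye) ** 2
print(Yr, loss)                            # 2.5 0.25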

The result of the loss function is then fed into the gradient calculation process, where we compute the gradient of the loss function with respect to each weight and bias in the network. The gradient calculation result is then used to fine-tune the old weight values. The Eta hyper-parameter η (the learning rate) controls the step size during weight updates in the backpropagation process, balancing the speed of convergence with the stability of training. In our example, we are using a learning rate of 1/100 = 0.01. The term hyper-parameter refers to a parameter that is set by the user rather than learned by the model and that affects the training behavior and the final result.

First, we compute the partial derivative of the loss function (gradient calculation) with respect to the old weight values. The following example shows the gradient calculation for weight W53. The same computation applies to W54 and b5.

Gradient Calculation:

∂L/∂W53 = 2 x H3 x (Yr – Ye)

Given:

∂L/∂W53 = 2 x 0.5 x (2.5 - 2)
∂L/∂W53 = 1 x 0.5
∂L/∂W53 = 0.5

New weight value calculation:

W53 (new) = W53 (old) – η x ∂L/∂W53
Given:
W53 (new) = 1 – 0.01 x 0.5
W53 (new) = 0.995

 


Figure 2-4: Backpropagation - Gradient Calculation and New Weight Value Computation.
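
The same gradient calculation and weight update can be written as a few lines of Python:

H3, Yr, Ye = 0.5, 2.5, 2.0
eta = 0.01                           # learning rate

grad_W53 = 2 * H3 * (Yr - Ye)        # 0.5
W53_new = 1.0 - eta * grad_W53       # 0.995
print(grad_W53, W53_new)             # 0.5 0.995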


Figure 2-5 shows the formulas for calculating the new bias b5. The process is the same as the one used for updating the weight values.


Figure 2-5: Backpropagation - Gradient Calculation and New Bias Computation.

After updating the weights and biases, the backpropagation process moves to the hidden layer. Gradient computation in the hidden layer is more complex because the loss function only includes weights from the output layer as you can see from the Loss function formula below:

Loss (L) = (H3 x W53 + H4 x W54 + b5 – Ye)²

The formula for computing the weights and biases for neurons in the hidden layers uses the chain rule. The mathematical formula is shown below, but the actual computation is beyond the scope of this chapter.

∂L/∂W31 = (∂L/∂H3) x (∂H3/∂W31)

After the backpropagation process is completed, the next iteration of the forward pass starts. This loop continues until the received result is close enough to the expected result.

If the size of the input data exceeds the GPU’s memory capacity or if the computing power of one GPU is insufficient for the data model, we need to decide on a parallelization strategy. This strategy defines how the training workload is distributed across several GPUs. Parallelization impacts network load if we need more GPUs than are available on one server. Dividing the workload among GPUs within a single GPU-server or between multiple GPU-servers triggers synchronization of calculated gradients between GPUs. When the gradient is calculated, the GPUs synchronize the results and compute the average gradient, which is then used to update the weight values.

The upcoming chapter introduces pipeline parallelization and synchronization processes in detail. We will also discuss why lossless connection is required for AI/ML.



Tuesday 16 July 2024

AI/ML Networking: Part-II: Introduction of Deep Neural Networks

Machine Learning (ML) is a subset of Artificial Intelligence (AI). ML is based on algorithms that allow learning, predicting, and making decisions based on data rather than pre-programmed tasks. ML leverages Deep Neural Networks (DNNs), which have multiple layers, each consisting of neurons that process information from sub-layers as part of the training process. Large Language Models (LLMs), such as OpenAI’s GPT (Generative Pre-trained Transformers), utilize ML and Deep Neural Networks.

For network engineers, it is crucial to understand the fundamental operations and communication models used in ML training processes. To emphasize the importance of this, I quote the Chinese philosopher and strategist Sun Tzu, who lived around 600 BCE, from his work The Art of War.

If you know the enemy and know yourself, you need not fear the result of a hundred battles.

We don’t have to be data scientists to design a network for AI/ML, but we must understand the operational fundamentals and communication patterns of ML. Additionally, we must have a deep understanding of network solutions and technologies to build a lossless and cost-effective network for enabling efficient training processes.

In the upcoming two posts, I will explain the basics of: 

a) Data Models: Layers and neurons, forward and backward passes, and algorithms. 

b) Parallelization Strategies: How training times can be reduced by dividing the model into smaller entities, batches, and even micro-batches, which are processed by several GPUs simultaneously.

The number of parameters, the selected data model, and the parallelization strategy affect the network traffic that crosses the data center switch fabric.

After these two posts, we will be ready to jump into the network part. 

At this stage, you may need to read (or re-read) my previous post about Remote Direct Memory Access (RDMA), a solution that enables GPUs to write data from local memory to remote GPUs' memory.