Monday, 21 October 2024

AI for Network Engineers: Multi-Class Classification

Introduction

This chapter explains the multi-class classification training process. It begins with an introduction to the MNIST dataset (Modified National Institute of Standards and Technology dataset). Next, it describes how the SoftMax activation function computes the class probabilities for the image fed into the model during the forward pass, and how the weight parameters are adjusted during the backward pass to improve training results. Additionally, the chapter discusses the data parallelization strategy from a network perspective.


MNIST Dataset

We will use the MNIST dataset [1], which consists of handwritten digits, to demonstrate the training process. The MNIST dataset includes four files: a training set with 60,000 gray-scale images (28x28 pixels), the corresponding training labels, a test set with 10,000 images (28x28 pixels), and the corresponding test labels. Figure 3-1 illustrates the structure and dependencies between the training dataset and the labels.

The file train-images-idx3-ubyte contains a header describing the image set (the number of images and their dimensions), followed by the pixel data of the images. The file train-labels-idx1-ubyte defines which label (a digit 0-9) corresponds to which image in the image file. Since we have ten possible outputs, we use ten output neurons.
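
As a rough sketch of how these files can be read (the file paths below are placeholders, and the gzip-compressed files must be downloaded and unpacked first), Python's struct module is enough to parse the big-endian IDX headers:

import struct

def read_idx_images(path):
    # IDX3 image file: 16-byte big-endian header, then one unsigned byte per pixel.
    with open(path, "rb") as f:
        magic, count, rows, cols = struct.unpack(">IIII", f.read(16))
        assert magic == 2051                 # magic number of an IDX3 image file
        pixels = f.read(count * rows * cols)
    return count, rows, cols, pixels

def read_idx_labels(path):
    # IDX1 label file: 8-byte big-endian header, then one unsigned byte per label (0-9).
    with open(path, "rb") as f:
        magic, count = struct.unpack(">II", f.read(8))
        assert magic == 2049                 # magic number of an IDX1 label file
        labels = f.read(count)
    return count, labels

# Usage (after downloading and unpacking the files):
# count, rows, cols, pixels = read_idx_images("train-images-idx3-ubyte")
# count, labels = read_idx_labels("train-labels-idx1-ubyte")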

Before the training process begins, the labels for each image-label pair are one-hot encoded. This happens before the data is fed into the model and marks the output neuron that corresponds to the digit shown in the image. For example, image number 142 in Figure 3-1 represents the digit 8, which corresponds to output neuron 9. Note that the first digit, 0, is mapped to neuron 1, so the digit 8 is mapped to neuron 9. One-hot encoding creates a vector of ten values, where the number 1 is placed at the position of the expected neuron, and all other positions are set to 0.
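
As a small illustration (using NumPy, independent of any specific framework), one-hot encoding a label can be done like this:

import numpy as np

def one_hot(label, num_classes=10):
    # Length-10 vector with 1.0 at the position of the expected class.
    vector = np.zeros(num_classes)
    vector[label] = 1.0
    return vector

# The digit 8 sets index 8, i.e. the ninth position / output neuron 9.
print(one_hot(8))    # [0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]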

Figure 3-1: Training Dataset & Labels – The MNIST Database.

Forward Pass


Model Probability


Figures 3-2 and 3-3 illustrate the forward pass process for multi-class classification. The input layer flattens the 28x28 pixel image into 784 input values, where each value represents the intensity of one pixel (0-255). These 784 input values are then passed to all 128 neurons in the hidden layer. Each neuron in the hidden layer receives all 784 inputs, and each of these inputs is associated with a unique weight. Therefore, each of the 128 neurons has 784 weight parameters, and the total weight parameter count of the hidden layer is 100,352 (128 x 784).

In the hidden layer, each neuron computes the weighted sum of its inputs and then applies the ReLU activation function to the result. This process produces 128 activation values—one for each neuron in the hidden layer.
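
The sketch below shows this hidden-layer computation in NumPy; the weights are randomly initialized placeholders rather than the values used in the figures:

import numpy as np

rng = np.random.default_rng(0)

x  = rng.random(784)                          # flattened 28x28 image, one value per pixel
W1 = rng.standard_normal((128, 784)) * 0.01   # 128 neurons x 784 inputs = 100,352 weights
b1 = np.zeros(128)                            # one bias per hidden neuron

z1 = W1 @ x + b1                              # weighted sum for each hidden neuron
a1 = np.maximum(0.0, z1)                      # ReLU: 128 activation values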

Next, these 128 activation values are fed into the output layer, which consists of 10 neurons (corresponding to the 10 possible classes in the MNIST dataset). Each output neuron is connected to all 128 activation values from the hidden layer. Therefore, the weight parameter count in the output layer is 1,280 (10 x 128). Again, each neuron computes the weighted sum of its inputs, and the result of this calculation is called a logit.

In the output layer, the SoftMax activation function is applied to these logits. SoftMax first computes the exponential of each logit, using Euler's number e as the base. Then, it computes the sum of these exponentials, which in this example is 24.813. The probability for each class (denoted as ŷ) is calculated by dividing each neuron's exponential by the sum of all exponentials.
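
Continuing the same kind of sketch for the output layer (the hidden-layer activations below are placeholder values, not the ones behind the 24.813 sum shown in the figure):

import numpy as np

rng = np.random.default_rng(1)

a1 = rng.random(128)                          # hidden-layer activations (placeholders)
W2 = rng.standard_normal((10, 128)) * 0.01    # 10 neurons x 128 inputs = 1,280 weights
b2 = np.zeros(10)                             # one bias per output neuron

logits = W2 @ a1 + b2                         # one logit per output neuron

exp_logits = np.exp(logits)                   # e raised to each logit
y_hat = exp_logits / exp_logits.sum()         # SoftMax: each exponential divided by the sum
print(y_hat.sum())                            # the ten probabilities sum to 1.0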

In this example, the output neuron corresponding to class "5" produces the highest probability, meaning the model predicts the digit in the image is 5. However, since this prediction is incorrect in the first iteration, the model will adjust its weights during backpropagation.

In our model, we have 101,632 weight parameters. The number of bits used to store each weight parameter in a neural network depends on the numerical precision chosen for the model. The 32-bit floating point (FP32 – single precision) is the standard precision used, where each weight is represented using 32 bits (4 bytes). This format offers good precision but can be memory-intensive for large models. To reduce memory usage and increase speed, many modern hardware systems use 16-bit floating point (FP16 – half precision), where each weight is represented using 16 bits (2 bytes). There is also 64-bit floating point (FP64 – double precision), which uses 64 bits (8 bytes), providing more precision and a larger range than FP32, but at the cost of increased memory usage.

In our model, using FP32, the memory required for the weight parameters is 406,528 bytes (4 × 101,632).
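
These numbers are easy to verify with a few lines of arithmetic:

hidden_weights = 128 * 784                           # 100,352
output_weights = 10 * 128                            # 1,280
total_weights  = hidden_weights + output_weights     # 101,632

for precision, bytes_per_weight in [("FP16", 2), ("FP32", 4), ("FP64", 8)]:
    print(precision, total_weights * bytes_per_weight, "bytes")
# FP32: 101,632 x 4 = 406,528 bytes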

Figure 3-2: Forward pass – Probability Computation.

Cross-Entropy Loss


In our example, the highest probability value (0.244) is produced by the output neuron corresponding to the digit 5, though the expected answer, the digit 8, corresponds to neuron 9. Next, the algorithm computes the cross-entropy loss from the probability of the expected neuron, as identified by the one-hot encoded label. In our example, the probability of the digit being 8, computed by neuron 9, is 0.181. The cross-entropy loss is calculated by taking the negative logarithm of 0.181, resulting in 0.734.
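
Because the label is one-hot encoded, the cross-entropy sum collapses to the negative logarithm of the probability assigned to the correct class. The sketch below uses the natural logarithm and probabilities loosely based on the figure (the remaining classes are filled in evenly just so the vector sums to 1.0):

import numpy as np

y_hat = np.full(10, (1.0 - 0.244 - 0.181) / 8)   # spread the remaining probability evenly
y_hat[5] = 0.244                                 # the class the model (wrongly) favors
y_hat[8] = 0.181                                 # the probability assigned to the digit 8

y_true = np.zeros(10)
y_true[8] = 1.0                                  # one-hot label: the image shows the digit 8

loss = -np.sum(y_true * np.log(y_hat))           # only the correct-class term survives
print(loss)                                      # -ln(0.181) ≈ 1.71; the exact value depends on the logarithm base used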

Figure 3-3: Forward pass – Cross-Entropy Loss.

Backward Pass


Gradient Computing


The gradient for the neurons in the output layer is calculated by subtracting the ground truth value (from the one-hot encoding) from the probability generated by the SoftMax function. Although the cross-entropy loss value itself does not appear in this formula, the difference ŷ - y is exactly the derivative of the cross-entropy loss with respect to the output neuron's logit, so the loss still drives the update. Figure 3-4 illustrates the gradient computation process.

For neurons in the hidden layer, the gradient is computed by taking the sum of the gradients of the connected output neurons, each weighted by the current value of the connecting weight. This weighted sum is then multiplied by the derivative of the neuron's own ReLU activation function. The formula is shown in the figure below.
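
Both rules fit in a few lines of NumPy; the inputs and weights below are placeholders standing in for the real forward-pass results:

import numpy as np

rng = np.random.default_rng(0)

x  = rng.random(784)                          # flattened input image (placeholder)
W1 = rng.standard_normal((128, 784)) * 0.01   # hidden-layer weights
W2 = rng.standard_normal((10, 128)) * 0.01    # output-layer weights

z1 = W1 @ x                                   # hidden-layer weighted sums
a1 = np.maximum(0.0, z1)                      # hidden-layer ReLU activations

logits = W2 @ a1
y_hat  = np.exp(logits) / np.exp(logits).sum()
y_true = np.zeros(10)
y_true[8] = 1.0                               # one-hot label for the digit 8

grad_out = y_hat - y_true                     # output-layer gradients: probability minus ground truth
grad_hidden = (W2.T @ grad_out) * (z1 > 0)    # weighted sum of output gradients times ReLU derivative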

Figure 3-4: Backward pass - Gradient Calculation.

Weight Adjustment Values


After calculating the gradients for all neurons, the backpropagation algorithm determines the weight adjustments. While this process was explained in the previous chapter, let's briefly recap it here. The weight adjustment value for a given weight is computed by multiplying the gradient of the neuron by the input value associated with that weight and by the shared hyperparameter, the learning rate.
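
As a small numeric sketch (the hidden-layer activation and the learning rate below are hypothetical values, not taken from Figure 3-5):

learning_rate = 0.01
gradient_neuron9 = 0.181 - 1.0     # SoftMax probability minus the one-hot target for neuron 9
activation_h1 = 0.42               # hypothetical output of hidden neuron 1 (the weight's input)

# The minus sign makes the adjustment point against the gradient, so it can
# simply be added to the old weight in the next step.
delta_w = -learning_rate * gradient_neuron9 * activation_h1
print(delta_w)                     # ≈ 0.00344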

Figure 3-5 illustrates the computation from two perspectives: neuron 9 in the output layer and neuron 1 in the hidden layer. 

Figure 3-5: Backward Pass - Weight Adjustment Value.

Weight Update


Figure 3-6 depicts how the new weight value is obtained by adding the adjustment value to the initial weight value. 
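
Continuing the numeric sketch, with a hypothetical initial weight and the adjustment value computed above:

old_weight = 0.35                  # hypothetical initial weight value
delta_w = 0.0034398                # adjustment value from the previous sketch
new_weight = old_weight + delta_w  # the adjustment is simply added to the old weight
print(new_weight)                  # ≈ 0.3534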

Figure 3-6: Backward Pass – Computing New Value for the Weight Parameter.

Data Parallelization and Network Impact


So far, we have explored how the backpropagation algorithm functions in multi-class classification on a single GPU. In the next section, we will discuss the process when the input data size exceeds the memory capacity of a single GPU and how we leverage a Data Parallelization Strategy to divide the training data among multiple GPUs. We will also look at the Data Parallelization Strategy from the link utilization perspective.

In Figure 3-7, we have divided the training data into mini-batches. The first half of these mini-batches is stored in system memory DRAM-1 (on Server-1), and the second half is stored in system memory DRAM-2 (on Server-2). This approach demonstrates how, when the data exceeds the memory capacity of a GPU, idle mini-batches can be stored in system memory. The model and its parameters are stored in the GPU’s VRAM alongside the active mini-batch being processed for training.

In our example, each mini-batch is composed of 64 images, which the GPU processes in parallel, each image handled by a dedicated GPU core. Thus, for the first mini-batch of 64 images, the forward pass involves 8,832 neuron computations (64 x 138) and 6,504,448 weight computations ((128 neurons x 64 images x 784 inputs) + (10 neurons x 64 images x 128 inputs)), along with 8,832 bias parameters. After the forward pass (computation phase), the backward pass begins. During this phase, the backpropagation algorithm calculates the gradients for all neurons.
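
The per-mini-batch arithmetic can be checked with a few lines:

batch_size = 64
hidden, out, inputs = 128, 10, 784

neuron_computations = batch_size * (hidden + out)                    # 64 x 138 = 8,832
weight_computations = batch_size * (hidden * inputs + out * hidden)  # 6,504,448
bias_parameters     = batch_size * (hidden + out)                    # 8,832
print(neuron_computations, weight_computations, bias_parameters)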

We employ the All-Reduce collective communication model, where gradients for each layer are summed across mini-batches and synchronized between GPUs. Synchronization happens through direct memory copy using Remote Direct Memory Access (RDMA), enabling GPUs to communicate directly with each other's memory while bypassing the network stack. The RDMA process is covered in detail in a later chapter.

Gradients are summed and synchronized among the GPUs over a network connection. During this synchronization, the GPU’s network interface controller (NIC) forwards packets at line rate, resulting in nearly 100% link utilization. Once synchronization is complete, the GPUs average the gradients and compute new weight adjustment values, which are then used to update all weight parameters. These updated weights are also synchronized between GPUs via RDMA.
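
In practice this step is performed by a collective communication library (for example NCCL) invoked by the training framework. The sketch below only simulates the sum-and-average logic of All-Reduce in a single process with NumPy, to show what every GPU ends up holding:

import numpy as np

rng = np.random.default_rng(0)

# Local gradients computed independently on two GPUs (placeholder values).
grads_gpu1 = rng.standard_normal(101_632)
grads_gpu2 = rng.standard_normal(101_632)

# All-Reduce: sum the gradients across GPUs, then average them, so that
# every GPU applies an identical weight update.
summed   = grads_gpu1 + grads_gpu2
averaged = summed / 2

learning_rate = 0.01
weights = rng.standard_normal(101_632) * 0.01   # identical model replicas on both GPUs
weights -= learning_rate * averaged             # the same update is applied on every GPU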

Next, the training data is fed back into the model with the newly adjusted weights. During this computation phase, network link utilization is relatively low. The training process typically requires multiple iterations of the forward and backward passes. Once the training results are satisfactory, the adjusted weight values are stored in VRAM, and the processed mini-batch is copied back to system memory, while a new mini-batch is transferred from system memory to the GPU’s VRAM. This transfer usually occurs over the PCIe link, which can introduce delays and increase the overall training time.

To mitigate this, multiple GPUs can be used, allowing the entire training dataset to be stored across their combined memory. This avoids frequent data transfers over PCIe and accelerates the training process. If all GPUs are within the same server, communication occurs over the high-speed NVLink, further enhancing performance.

Due to the time-consuming nature of the training process, which can take several weeks or even months, it is crucial that the communication channel between GPUs is lossless and forwards packets at line rate. Additionally, regular snapshots (checkpoints) of the training state should be taken. If the backend network loses even a single packet for any reason and the training job fails, the job can resume from the latest snapshot; without these snapshots, it would have to start over from the beginning.

Figure 3-7: Gradient Synchronization and Network Utilization.

References



1. Yann LeCun, Corinna Cortes, Christopher J.C. Burges: The MNIST database of handwritten digits.




