Sunday, 23 February 2025

Introduction to an Artificial Neuron

Introduction


Before diving into the somewhat complex world of Artificial Intelligence (AI), let’s first consider what intelligence means from a human perspective. Judo, as a martial art, serves as a good, though not an obvious, example. I trained in judo for over 20 years. During that time, I learned which throwing techniques to use to take down an opponent efficiently by leveraging their movement energy and reactions. But how did I learn that? Through a supervised training process, in which our coach first taught us the throwing techniques and the situations in which they work best. Then we practiced them ourselves. Mastering these techniques requires thousands of repetitions before achieving perfection. Ultimately, timing and reaction to the opponent’s movements play a significant role in determining whether a throw succeeds. After mastering several throwing techniques, I was able to apply them in situations I had not necessarily seen before.

How does this relate to Artificial Intelligence (AI)? AI is a broad term encompassing solutions that aim to mimic human brain functions. A subset of AI is Machine Learning (ML), which enables systems to make decisions based on input data without being explicitly programmed for each scenario. The driving force behind this capability is Deep Learning (DL), which utilizes Deep Neural Networks (DNNs). The intelligence of these networks resides in thousands of artificial neurons (perceptrons) and their interconnections, which together form a neural network.

Training a Neural Network to perform well in its given task follows the same principles as training a human to execute a perfectly timed, well-performed throwing technique. The training process takes time, requires thousands of iterations, and involves analyzing results before achieving the expected, high-quality outcome. When training Neural Networks, we use a training dataset that, in addition to the input data, includes the expected output (supervised learning). Before deploying the network into production, it is tested with a separate test dataset to evaluate how well it performs on unseen data.

The duration of the training process depends on several factors, such as dataset size, network architecture, hardware, and selected parallelization strategies (if any). Training a neural network requires multiple iterations—sometimes even tens of thousands—where, at the end of each iteration, the model's output is compared to the actual value. If the difference between these two values is not small enough, the network is adjusted to improve performance. The entire process may take months, but the result is a system that responds accurately and quickly, providing an excellent user experience.

This chapter begins by discussing the artificial neuron and its functionality. We then move on to the Feedforward Neural Network (FFNN) model, first explaining its layered structure and how input data flows through it in a process called the Forward Pass (FP). Next, we examine how the FFNN is adjusted during the Backward Pass (BP), which fine-tunes the model by minimizing errors. The combination of FP and BP is known as the Backpropagation Algorithm.


Artificial Neuron 

An artificial neuron, also known as a perceptron, is a fundamental building block of any neural network. It functions as a computational unit that processes input data in two phases. First, it collects and processes all inputs, and then applies an activation function. Figure 1-1 illustrates the basic process without the complex mathematical functions (which I will explain later for those interested in studying them). On the left-hand side, we have a bias term and two input values, x1 and x2. The bias and inputs are connected to the perceptron through adjustable weight parameters: w0, w1, and w2, respectively. During the initial training phase, weight values are randomly generated.


Weighted Sum and Activation Function

As the first step, the neuron calculates the weighted sum of inputs x1 and x2 and adds the bias. A weighted sum simply means that each input is multiplied by its corresponding weight parameter, the results are summed, and the bias is added to the total. The bias value is set to one, so its contribution is always equal to the value of its weight parameter. I will explain the purpose of the bias term later in this chapter. The result of the weighted sum is denoted as z, which serves as a pre-activation value. This value is then passed through a non-linear activation function, which produces the actual output of the neuron, ŷ (y-hat). Before explaining what non-linearity means in the context of activation functions and why it is used, consider the following: The input values fed into a neuron can be any number between negative infinity (-∞) and positive infinity (+∞). Additionally, there may be thousands of input values. As a result, the weighted sum can become a very large positive or negative value.

Now, think about neural networks with thousands of neurons. In Feedforward Neural Networks (FFNNs), neurons are structured into layers: an input layer, one or more hidden layers, and an output layer. If input values were only processed through the weighted sum computation and passed to the next layer, the neuron outputs would grow linearly with each layer. Even if we applied a linear activation function, the same issue would persist: the output would keep increasing, and a stack of purely linear layers would collapse into a single linear transformation, so the network could not learn complex, non-linear relationships. With a vast number of neurons and large input values, this uncontrolled growth could also lead to excessive computational demands, slowing down the training process. Non-linear activation functions help keep output values within a manageable range. For example, the S-shaped Sigmoid activation function squeezes the neuron’s output to a range between 0 and 1, even for very large input values.
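The squashing effect is easy to verify with a few lines of Python. This is a minimal sketch; the pre-activation values are arbitrary and chosen only to show the bounded output:

import math

def sigmoid(z):
    # Squash the pre-activation value z into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

for z in (-20.0, -5.0, 0.0, 5.0, 20.0):
    print(z, sigmoid(z))
# Every output stays strictly between 0 and 1, no matter how large |z| grows.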

Let’s go back to Figure 1-1, where we first multiply the input values by their respective weight parameters, sum them, and then add the bias. Since the bias value is 1, it is reasonable to represent it using only its associated weight parameter in the formula. If we plot the result z on the horizontal axis of a two-dimensional chart and draw a vertical line upwards, we obtain the neuron’s output value ŷ at the point where the line intersects the S-curve. Simple as that. Naturally, there is a mathematical definition and equation for this process, which is depicted in Figure 1-2.
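Putting the two phases together, here is a minimal sketch of the neuron in Figure 1-1. It reuses the sigmoid helper from the previous snippet; the parameter names follow the figure, and the values in the print statement are illustrative only:

def neuron_output(x1, x2, w0, w1, w2):
    # Uses the sigmoid() helper defined in the previous snippet.
    # Step 1: weighted sum of the inputs plus the bias
    # (the bias input is fixed to 1, so its contribution equals w0).
    z = w0 * 1.0 + w1 * x1 + w2 * x2
    # Step 2: the non-linear activation produces the neuron's output, y-hat.
    return sigmoid(z)

print(round(neuron_output(2.0, 3.0, 0.3, 0.5, -0.3), 3))   # 0.599 with these illustrative values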

Before moving on, there is one more thing to note. In the figure below, each weight has an associated adjustment knob. These knobs are simply a visual representation to indicate that weight values are adjustable parameters, which the backpropagation algorithm tunes if the model output is not close enough to the expected result. The backpropagation process is covered in detail in a dedicated chapter.

Figure 1-1: An Architecture of an Artificial Neuron.

Figure 1-2 shows the mathematical equations for calculating the weighted sum and the Sigmoid function. The Greek letter used in the weighted sum equation is Σ (uppercase Sigma). The lowercase i is set to 1 beneath the Sigma symbol, indicating that the weighted sum calculation starts from the first pair of elements: input x1 and its corresponding weight w1. The notation n=2 specifies the total number of paired elements included in the weighted sum calculation. In our example, both input values and their respective weights are included.

After computing the weighted sum, we add the bias term. The result, z, is then passed through the Sigmoid function, producing the output ŷ. The Sigmoid function is commonly represented by the Greek letter σ (lowercase sigma).
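In code, the summation notation corresponds to a simple loop over the index i from 1 to n. The sketch below follows the notation of Figure 1-2; the input values are illustrative:

x = [1.0, 2.0, 3.0]    # x[0] is the fixed bias input 1; x[1] and x[2] are illustrative inputs
w = [0.3, 0.5, -0.3]   # w[0] is the bias weight
n = 2

weighted_sum = sum(w[i] * x[i] for i in range(1, n + 1))   # sum of w_i * x_i for i = 1..n
z = weighted_sum + w[0] * x[0]                             # then add the bias contribution
print(round(z, 2))                                         # 0.4 with these values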

The lower equation in Figure 1-2 shows how the Sigmoid function is computed. To obtain the denominator for the fraction, Euler’s number (e≈2.71828) is raised to the power of −z and then summed with 1. The final output is simply the reciprocal of this sum.

Figure 1-2: The Math Behind an Artificial Neuron.
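As a quick sanity check of the lower equation, computed step by step in Python (z = 4.8 is the same example value used with Figure 1-4 later in this chapter):

import math

z = 4.8
denominator = 1.0 + math.exp(-z)   # 1 + e^(-z)
y_hat = 1.0 / denominator          # the reciprocal of that sum
print(round(y_hat, 3))             # 0.992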


The formulas can be expressed in an even simpler manner using dot products, which are commonly used in linear algebra. Dot products frequently appear in research papers and deep learning literature.

In Figure 1-3, both input values and weights are arranged as column vectors. The notation for the input vector uses an uppercase X, while the weight vector is denoted by an uppercase W. Although these are technically vectors, illustrating them as simple one-column matrices is fine for demonstration purposes. Generally speaking, a matrix has more than one row and column, as you will learn later.

The dot product collapses the weighted sum into a single vector multiplication, as shown in the figure. This greatly simplifies both the notation and the computation.


Figure 1-3: Matrix Multiplication with Dot Product.
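Written with NumPy, the dot product form and the element-by-element weighted sum give the same pre-activation value. The vector values below are illustrative only, reusing the numbers from the previous sketch:

import numpy as np

X = np.array([2.0, 3.0])                  # input vector (illustrative values)
W = np.array([0.5, -0.3])                 # weight vector (illustrative values)
w0 = 0.3                                  # bias weight (bias input is 1)

z_dot = np.dot(W, X) + w0                 # dot product form: W . X + w0
z_sum = W[0] * X[0] + W[1] * X[1] + w0    # element-by-element weighted sum
assert np.isclose(z_dot, z_sum)           # both produce the same z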

Bias term


Figures 1-4 and 1-5 illustrate how changes in the bias weight parameter affect the weighted sum and shift z horizontally. This, in turn, changes the output of the Sigmoid function and the neuron’s final output.

In Figure 1-4, the initial weight values for the bias, input x1, and input x2 are +0.3, +0.5, and −0.3, respectively. The calculated weighted sum is z=4.8. Applying the Sigmoid function to z=4.8, we obtain an output value of 0.992. Figure 1-4 visualizes this process: z=4.8 is positioned on the horizontal axis, and the intersection with the S-curve results in an output of 0.992.


Figure 1-4: Construct of an Artificial Neuron.

Now, we adjust the weight w0 associated with the bias from +0.3 down to −4.0. As a result, the weighted sum decreases from 4.8 to 0.5, shifting z 4.3 steps to the left on the horizontal axis. Applying the Sigmoid function to the new z, the neuron’s output decreases from 0.992 to 0.622.

Figure 1-5: Construct of an Artificial Neuron.
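The figures do not list the individual input values, but since only w0 changes between Figures 1-4 and 1-5, the inputs’ combined contribution to the weighted sum must be 4.8 - 0.3 = 4.5. With that derived value, the bias shift can be reproduced with the sigmoid helper defined earlier:

input_contribution = 4.5             # w1*x1 + w2*x2, derived from Figures 1-4 and 1-5

for w0 in (0.3, -4.0):
    z = w0 + input_contribution      # only the bias weight changes
    print(w0, z, round(sigmoid(z), 3))
# w0 = +0.3 -> z = 4.8 -> output 0.992
# w0 = -4.0 -> z = 0.5 -> output 0.622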

In the example calculation above, imagine that input values x1 and x2 are zero. Without a bias term, the weighted sum would then always be zero, regardless of how large the weight parameters are, and the neuron’s response would be stuck at a fixed point. The bias term therefore allows the neuron to produce a non-zero pre-activation value, and a meaningful output, even when all input values are zero.

ReLU Activation Function


A lighter alternative to the Sigmoid activation function is ReLU (Rectified Linear Unit). ReLU is a piecewise linear function: if the weighted sum z≤0, the output is zero, and if z>0, the output is equal to z. Even though each piece is linear, the kink at zero makes the function as a whole non-linear.

From a computational perspective, ReLU requires fewer CPU cycles than the Sigmoid function. Figure 1-6 illustrates how z=4.8 is processed by ReLU, resulting in an output value of ŷ=4.8. The figure also shows two common notations for ReLU. The first notation states:
  • If z>0, return z.
  • If z≤0, return 0.
The second notation, written as MAX(0,z), simply means selecting the greater value between 0 and z.


Figure 1-6: Artificial Neuron with a ReLU Activation Function. 
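Both notations map to a one-line Python function; a minimal sketch with the example value from Figure 1-6 (the negative test value is illustrative):

def relu(z):
    # MAX(0, z): pass positive values through unchanged, clamp everything else to zero.
    return max(0.0, z)

print(relu(4.8))    # 4.8  (z > 0, so return z)
print(relu(-2.5))   # 0.0  (z <= 0, so return 0)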

Network Impact


A single artificial neuron is the smallest unit of a neural network. The size of the neuron depends on its connections to input nodes. Every connection has an associated weight parameter, which is typically stored as a 32-bit value. In our example, with 2 input connections and a bias, the size of the neuron is 3 x 32 bits = 96 bits.

Although we haven’t defined the size of the input in this example, let’s assume that each input (x) is an 8-bit value, giving us 2 x 8 bits = 16 bits for the input data. Thus, our single-neuron "model" requires 96 bits for the weights plus 16 bits for the input data, totaling 112 bits of memory. This is small enough not to require parallelization. Besides the weight parameters and input values, the result of the weighted sum and the neuron’s output must also be stored during processing.
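The same back-of-the-envelope calculation expressed in Python, using the bit widths assumed above (32-bit weights, 8-bit inputs):

num_inputs  = 2
weight_bits = (num_inputs + 1) * 32          # two input weights plus the bias weight
input_bits  = num_inputs * 8                 # 8 bits per input value
total_bits  = weight_bits + input_bits
print(weight_bits, input_bits, total_bits)   # 96 16 112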

However, if the memory requirement of the neural network model combined with the input data (the "job") exceeds the memory capacity of a GPU, a parallelization strategy is needed. The job can be split across multiple GPUs within a single server, with synchronization happening over high-speed NVLink. If the job must be divided between multiple GPU servers, synchronization occurs over the backend network, which must provide lossless, high-speed packet forwarding.

Parallelization strategies will be discussed in the next chapter, which introduces a Feedforward Neural Network using the Backpropagation algorithm, and in later chapters dedicated to Parallelization.

Summary


Deep Learning leverages Neural Networks, which consist of artificial neurons. An artificial neuron mimics the structure and operation of a biological neuron. Input data is fed to the neuron through connections, each with its own weight parameter. The neuron uses these weights to calculate a weighted sum of the inputs, known as the pre-activation value. This result is then passed through an activation function, which provides the post-activation value, or the actual output of the neuron. The activation functions discussed in this chapter are the non-linear ReLU (Rectified Linear Unit) and logistic Sigmoid functions.

