Sunday, 12 January 2025

AI for Network Engineers: LSTM-Based RNN


Recap of the Operation of an LSTM Cell

The previous section introduced the construction and operation of a single Long Short-Term Memory (LSTM) cell. This section briefly discusses an LSTM-based Recurrent Neural Network (RNN). Before diving into the details, let’s recap how an individual LSTM cell operates with a theoretical, non-mathematical example.

Suppose we want our model to produce the sentence: “It was cloudy, but it is raining now.” The first part of the sentence refers to the past, and one of the LSTM cells has stored the tense “was” in its internal cell state. The last portion of the sentence, however, refers to the present. Naturally, we want the model to forget the previous tense “was” and update its state to reflect the current tense “is.”

The Forget Gate is responsible for discarding unnecessary information. In this case, the forget gate suppresses the word “was” by closing its gate (outputting 0). The Input Gate is responsible for providing a new candidate cell state, which in this example is the word “is.” The input gate is fully open (outputting 1) to allow the latest information to be introduced.

The Identification function computes the updated cell state by summing the contributions of the forget gate and the input gate. This updated cell state represents the memory for the next time step. Additionally, the updated cell state is passed through an Output Activation function, whose result forms the basis of the cell's output.

The Output Gate controls how much of this activated output is shared as the public output. In this example, the output gate is fully open (outputting 1), allowing the word “is” to be published as the final output.
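
The interplay of the gates in this example can be summarized with a few lines of Python. The snippet below is a minimal illustrative sketch, assuming scalar gate values and a ReLU-style output activation; the variable names and numbers are hypothetical and are not part of the model described in this chapter.

# Illustrative single LSTM cell update with scalar values (hypothetical example).
old_cell_state = 0.7   # assume this encodes the past tense "was"
candidate      = 1.0   # assume this encodes the present tense "is"

forget_gate = 0.0      # "was" is no longer relevant -> gate closed
input_gate  = 1.0      # let the new information in  -> gate fully open
output_gate = 1.0      # publish the result          -> gate fully open

cell_state  = forget_gate * old_cell_state + input_gate * candidate
cell_output = output_gate * max(0.0, cell_state)   # ReLU as the output activation
print(cell_state, cell_output)                     # -> 1.0 1.0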

An Overview of an LSTM-Based RNN

Figure 6-5 illustrates an LSTM-based RNN model featuring two LSTM layers and a SoftMax layer. The input vectors x1 and x2, along with the cell output ht−1 from the previous time step, are fed into all LSTM cells in the input layer. To keep the figure simple, only two LSTM cells are shown per layer.

The input vectors pass through gates, producing both the internal cell state and the cell output. The internal states are stored using a Constant Error Carousel (CEC) to be utilized in subsequent time steps. The cell output is looped back as an input vector for the next time step. Additionally, the cell output is passed to all LSTM cells in the next layer.

Finally, the SoftMax layer generates the model's output. Note that Figure 6-5 depicts a single time step.
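
For readers who prefer code over diagrams, the sketch below builds a comparable structure with PyTorch: two stacked LSTM layers followed by a linear projection and a SoftMax layer. PyTorch and the layer sizes used here are my own assumptions for illustration; they are not taken from the figure.

import torch
import torch.nn as nn

# Two stacked LSTM layers followed by a SoftMax output layer (sizes are placeholders).
input_size, hidden_size, num_classes = 2, 8, 4

lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size,
               num_layers=2, batch_first=True)
classifier = nn.Linear(hidden_size, num_classes)

x = torch.randn(1, 1, input_size)   # one sequence, one time step, inputs x1 and x2
out, (h_n, c_n) = lstm(x)           # h_n and c_n hold each layer's cell output and state
probs = torch.softmax(classifier(out[:, -1, :]), dim=-1)
print(probs)                        # class probabilities produced by the SoftMax layer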


Figure 6-5: LSTM-Based RNN Layer Model.

Figure 6-6 illustrates a layered LSTM-based Recurrent Neural Network (RNN) model that processes sequential data across four time steps. The model consists of three layers: the input LSTM layer, a hidden LSTM layer, and a SoftMax output layer. Each gray square labeled "LSTM" represents a layer containing n LSTM cells.

At the first time step, the input value x1 is fed to the LSTM cells in the input layer. Each LSTM cell computes its internal cell state (C), passes it through the output activation function, and produces a cell output (ht). This output is passed both to the LSTM cells in the next time step via recurrent connections and, as an input vector, to the LSTM cells in the hidden layer at the same time step.

The LSTM cells in the hidden layer repeat the process performed by the input layer LSTM cells. Their output (ht) is passed to the SoftMax layer, which computes probabilities for each possible output class, generating the model's predictions (y1). The cell output is also passed to the next time step on the same layer.

The figure also depicts the autoregressive mode, where the output of the SoftMax layer at the first time step (t1) is fed back as part of the input for the next time step (t2) in the input layer. This feedback loop enables the model to use its predictions from previous time steps to inform its processing of subsequent time steps. Autoregressive models are particularly useful in tasks such as sequence generation, where the output sequence depends on previously generated elements.
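
A minimal sketch of this autoregressive loop is shown below. It again assumes PyTorch, and it simply treats the SoftMax probabilities of one time step as the input vector of the next; how a prediction is actually mapped back to an input is model-specific and is simplified here.

import torch
import torch.nn as nn

# Autoregressive loop over four time steps: the SoftMax output of step t becomes
# part of the input of step t+1 (sizes and the feedback mapping are illustrative).
num_steps, num_classes, hidden_size = 4, 3, 8
lstm = nn.LSTM(input_size=num_classes, hidden_size=hidden_size,
               num_layers=2, batch_first=True)
classifier = nn.Linear(hidden_size, num_classes)

x = torch.zeros(1, 1, num_classes)    # initial input at time step t1
state = None                          # (h, C) for both layers, carried across time steps
for t in range(num_steps):
    out, state = lstm(x, state)       # recurrent connections via the carried state
    probs = torch.softmax(classifier(out[:, -1, :]), dim=-1)
    x = probs.unsqueeze(1)            # feed the prediction back as the next input
    print(f"t{t+1} prediction:", probs.detach().numpy())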

Key Features Depicted in Figure 6-6

Recurrent Data Flow: The outputs from each time step are recurrently fed into the next time step, capturing temporal dependencies.

Layered Structure: The vertical connections between layers allow the model to hierarchically process input data, with higher layers learning progressively abstract features.

Autoregressive Feedback: The use of SoftMax outputs as part of the next time step’s input highlights the autoregressive nature of the model, commonly used in sequence prediction and generation tasks.

Figure 6-6: LSTM-Based RNN Model with Layered Structure and Four Time Steps.

Conclusion


Figure 6-6 demonstrates the interplay between sequential and layered data flow in a multi-layered LSTM model, showcasing how information is processed both temporally (across time steps) and hierarchically (across layers). The autoregressive feedback loop further illustrates the model’s capability to adapt its predictions based on prior outputs, making it well-suited for tasks such as time series forecasting, natural language processing, and sequence generation.


Monday, 6 January 2025

AI for Network Engineers: Long Short-Term Memory (LSTM)

Introduction


As mentioned in the previous chapter, Recurrent Neural Networks (RNNs) can have hundreds or even thousands of time steps. These basic RNNs often suffer from the gradient vanishing problem, where the network struggles to retain historical information across all time steps. In other words, the network gradually "forgets" historical information as it progresses through the time steps.

One solution to the horizontal gradient vanishing problem between time steps is to use a Long Short-Term Memory (LSTM) based RNN instead of a basic RNN. LSTM cells can preserve historical information across all time steps, whether the model contains ten or several thousand time steps.

Figure 6-1 illustrates the overall architecture of an LSTM cell. It includes three gates: the Forget gate, the Input gate (a.k.a. Remember gate), and the Output gate. Each gate contains input neurons that use the Sigmoid activation function. The reason for employing the Sigmoid function, as shown in Figure 5-4 of the previous chapter, is its ability to produce outputs in the range of 0 to 1. An output of 0 indicates that the gate is “closed,” meaning the information is excluded from the cell's internal state calculations, while an output of 1 means that the information is fully utilized in the computation. Note, however, that the Sigmoid function never produces exactly zero or one: as the input approaches negative infinity, the output gets closer and closer to zero without ever reaching it, and as the input approaches positive infinity, the output gets closer and closer to one without ever reaching it.

One way to treat a gate as completely closed is to set a threshold manually and define, for example, that outputs lower than 0.01 are interpreted as zero (gate closed). The same principle applies to opening a gate: you can define, for example, that outputs higher than 0.95 are interpreted as one (gate fully open). However, instead of hard-coded thresholds, consider alternatives such as smooth activation adjustments.
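
The short NumPy sketch below illustrates both points: the Sigmoid output only approaches 0 and 1 asymptotically, and a manually chosen pair of thresholds (the 0.01 and 0.95 values from the text) can be used to interpret a gate as fully closed or fully open. The gate_state helper is hypothetical, shown only to make the idea concrete.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(-10.0))   # ~0.000045 -> very close to 0, but never exactly 0
print(sigmoid(10.0))    # ~0.999955 -> very close to 1, but never exactly 1

def gate_state(z, closed_below=0.01, open_above=0.95):
    # Interpret a Sigmoid gate output using hard thresholds (illustrative only).
    g = sigmoid(z)
    if g < closed_below:
        return 0.0      # treat the gate as fully closed
    if g > open_above:
        return 1.0      # treat the gate as fully open
    return g            # otherwise keep the partial opening

print(gate_state(-10.0), gate_state(10.0), gate_state(0.5))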

This gating mechanism enables LSTM cells to selectively retain or discard information, allowing the network to manage long-term dependencies effectively.


Figure 6-1: Long Short-Term Memory Cell – Architectural Overview.


LSTM Cell Operation

In addition to producing input for the next layer (if one exists), the output (h) of the LSTM cell serves as input for the next time step via the recurrent connection. This process is similar to how neurons in a basic RNN operate. The LSTM cell also has a cell state (C), which is used to retain historical information. The cell state relies on the Constant Error Carousel (CEC) mechanism, which feeds the cell state (C) back into the computation where the new cell state is calculated. The following sections briefly describe how an LSTM cell computes the cell state (C) and the cell output (h), and explain the role of the gates in the process.
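
The skeleton below sketches this recurrence in Python: both the cell output h and the cell state C computed at one time step are fed into the next time step. The lstm_cell function is a deliberately simplified stand-in for the gate computations described in the following sections, not the actual LSTM equations.

import numpy as np

def lstm_cell(x, h_prev, c_prev):
    # Stand-in for the Forget/Input/Output gate computations described below.
    c_new = 0.5 * c_prev + 0.5 * (x + h_prev)   # placeholder for the new cell state
    h_new = np.maximum(0.0, c_new)              # placeholder for the cell output
    return h_new, c_new

# The recurrent loop: h feeds the next time step, C is carried via the CEC connection.
h, c = np.zeros(4), np.zeros(4)
for x in np.random.randn(10, 4):   # ten time steps, four features each
    h, c = lstm_cell(x, h, c)
print(h, c)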


Forget Gate

The Forget Gate (FG) adjusts the extent to which historical data is preserved. In Figure 6-2, the cell state Ct-1 represents historical data computed by the Identification function during the preceding time step. The cell state (C) represents the LSTM cell's internal state, not the LSTM cell output (h), and it is used to protect historical data from gradient vanishing during BPTT. The adjustment factor for Ct-1 is calculated by a neuron using the Sigmoid activation function within the FG.

The neuron in the FG uses shared, input-specific weight matrices for the input data (X1) and for the input received from the preceding LSTM cell's output (ht-1). These weight matrices are shared across FG neurons over all time steps, similar to the approach used in a basic recurrent neural network (RNN). As described in the previous chapter, this sharing reduces the computational requirements for calculating weight adjustment values during Backpropagation Through Time (BPTT). Additionally, the shared weight matrices help reduce the model's memory utilization by limiting the number of weight variables.

In the figure, the matrix WFG1 is associated with the input received from the preceding time step, while the matrix UFG1 is used for the new input value X1. The weighted sum (WFG1 ⋅ ht-1) + (UFG1 ⋅ X1) is passed through the Sigmoid activation function, which produces the adjustment factor for the cell state value Ct-1. The closer the Sigmoid output is to one, the more the original value affects the calculation of the new value; the closer it is to zero, the less the original value contributes.

Finally, the output of the FG, referred to as XFG, is computed by multiplying the Sigmoid output by the cell state Ct-1.
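
In code, the Forget Gate computation can be sketched with NumPy as shown below. The weight matrices, inputs, and their dimensions are random placeholders of my own choosing; only the structure of the computation, (WFG1 ⋅ ht-1) + (UFG1 ⋅ X1) passed through the Sigmoid and multiplied by Ct-1, follows the figure.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden = 4
W_FG1 = np.random.randn(hidden, hidden)   # shared weights for the previous output h_{t-1}
U_FG1 = np.random.randn(hidden, 1)        # shared weights for the new input X1
h_prev = np.random.randn(hidden)          # cell output from the preceding time step
C_prev = np.random.randn(hidden)          # cell state from the preceding time step
x1 = np.random.randn(1)                   # new input value X1

f = sigmoid(W_FG1 @ h_prev + U_FG1 @ x1)  # adjustment factor, strictly between 0 and 1
X_FG = f * C_prev                         # output of the Forget Gate
print(X_FG)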


Figure 6-2: Long Short-Term Memory Cell – Forget Gate.

Input Gate


The Input Gate (IG) determines to what extent the input X1 and the output ht-1 from the preceding time step affect the new cell state Ct. For this process, the LSTM cell has two neurons. In Figure 6-3, the internal neuron of the IG uses the Sigmoid function, while the Input Activation neuron leverages the ReLU function. Both neurons use input-specific weight matrices in the same way as the Forget Gate.

The Input Gate neuron feeds the weighted sum (WIG1 ⋅ ht-1) + (UIG1 ⋅ X1) to the Sigmoid function. The output determines the proportion in which the new input values X1 and ht-1 influence the computation of the cell's internal value.

The Input Activation neuron feeds the weighted sum (WIA1 ⋅ ht-1) + (UIA1 ⋅ X1) to the ReLU function. The output is then multiplied by the output of the Sigmoid function, providing the result of the Input Gate.

At this phase, the LSTM cell has computed the output of both the Forget Gate (XFG) and the Input Gate (XIG). Next, the LSTM feeds these values to the Identification function.
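
Continuing the NumPy sketch from the Forget Gate, the Input Gate can be written as below. Again, the weights, inputs, and dimensions are random placeholders; only the structure of the two weighted sums and their element-wise product follows the figure.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

hidden = 4
W_IG1, U_IG1 = np.random.randn(hidden, hidden), np.random.randn(hidden, 1)   # Input Gate weights
W_IA1, U_IA1 = np.random.randn(hidden, hidden), np.random.randn(hidden, 1)   # Input Activation weights
h_prev, x1 = np.random.randn(hidden), np.random.randn(1)

i = sigmoid(W_IG1 @ h_prev + U_IG1 @ x1)       # Input Gate: how much new information is admitted
candidate = relu(W_IA1 @ h_prev + U_IA1 @ x1)  # Input Activation: the candidate itself
X_IG = i * candidate                           # output of the Input Gate
print(X_IG)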

Figure 6-3: Long Short-Term Memory Cell – Input Gate.

Output Gate


The Output Gate determines whether the output of the Output Activation neuron (ReLU) is fully published, partially published, or left unpublished. The factor of the Output Gate is calculated based on the input value X1 and the output ht-1 from the previous time step. In other words, all Sigmoid neurons and the ReLU Input Activation neuron use the same inputs, and they leverage shared weight matrices.

The input to the Output Activation neuron is the sum of the outputs from the Forget Gate (XFG) and the Input Gate (XIG). In the figure, the sum is represented as f(x)=XFG+XIG. The operation is computed by a neuron that uses the Identification function (IDF). The original output of the Identification function is preserved as the internal cell state (C) for the next time step through the CEC (Constant Error Carousel) connection.

The output of the Identification Function is then passed to the ReLU Output Activation function. This output is multiplied by the result of the Output Gate, producing the actual cell output h1. This value serves as input to the same LSTM cell in the next time step. In a multi-layer model, the cell output is also used as input for the subsequent layer.
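
The final steps can be sketched in the same NumPy style. Here XFG and XIG stand for the Forget Gate and Input Gate outputs computed earlier; they are filled with random placeholders so the snippet runs on its own, and the weights and dimensions are again arbitrary.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

hidden = 4
W_OG1, U_OG1 = np.random.randn(hidden, hidden), np.random.randn(hidden, 1)   # Output Gate weights
h_prev, x1 = np.random.randn(hidden), np.random.randn(1)
X_FG, X_IG = np.random.randn(hidden), np.random.randn(hidden)   # placeholder gate outputs

C_t = X_FG + X_IG                          # Identification function: new cell state, kept via CEC
o = sigmoid(W_OG1 @ h_prev + U_OG1 @ x1)   # Output Gate factor
h1 = o * relu(C_t)                         # Output Activation scaled by the Output Gate
print(C_t, h1)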


Figure 6-4: Long Short-Term Memory Cell – Output Gate.