Monday, 6 January 2025

AI for Network Engineers: Long Short-Term Memory (LSTM)

 Introduction


As mentioned in the previous chapter, Recurrent Neural Networks (RNNs) can have hundreds or even thousands of time steps. These basic RNNs often suffer from the gradient vanishing problem, where the network struggles to retain historical information across all time steps. In other words, the network gradually "forgets" historical information as it progresses through the time steps.

One solution to the horizontal gradient vanishing problem between time steps is to use a Long Short-Term Memory (LSTM) based RNN instead of a basic RNN. LSTM cells can preserve historical information across all time steps, whether the model contains ten or several thousand time steps.

Figure 6-1 illustrates the overall architecture of an LSTM cell. It includes three gates: the Forget gate, the Input gate (a.k.a. Remember gate), and the Output gate. Each gate contains input neurons that use the Sigmoid activation function. The reason for employing the Sigmoid function, as shown in Figure 5-4 of the previous chapter, is its ability to produce outputs in the range of 0 to 1. An output of 0 indicates that the gate is "closed," meaning the information is excluded from the cell's internal state calculations. An output of 1, on the other hand, means that the information is fully utilized in the computation. Note, however, that the Sigmoid function never produces exactly zero or one: as the input approaches negative infinity, the output gets ever closer to zero without reaching it, and as the input approaches positive infinity, the output gets ever closer to one without reaching it.

One way to completely close a gate is to set a threshold manually and define, for example, that outputs below 0.01 are interpreted as zero (gate closed). The same principle applies to opening a gate: you can define, for example, that outputs above 0.95 are interpreted as one (gate fully open). However, instead of hard-coded thresholds, consider alternatives such as smooth activation adjustments.
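The short NumPy sketch below illustrates this behavior: the Sigmoid output only approaches 0 and 1, and the 0.01 / 0.95 cutoffs from the paragraph above are applied purely as an interpretation layer on top of it. The function name and the threshold handling are illustrative, not part of any standard LSTM implementation.

import numpy as np

def sigmoid(z):
    # Squashes z into the open interval (0, 1); never exactly 0 or 1.
    return 1.0 / (1.0 + np.exp(-z))

# Very negative and very positive inputs approach, but never reach, 0 and 1.
for z in (-20.0, -2.0, 0.0, 2.0, 20.0):
    g = sigmoid(z)
    if g < 0.01:
        state = "closed"          # interpreted as 0
    elif g > 0.95:
        state = "fully open"      # interpreted as 1
    else:
        state = "partially open"
    print(f"z={z:+6.1f}  gate={g:.10f}  -> {state}")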

This gating mechanism enables LSTM cells to selectively retain or discard information, allowing the network to manage long-term dependencies effectively.


Figure 6-1: Long Short-Term Memory Cell – Architectural Overview.


LSTM Cell Operation

In addition to producing input for the next layer (if one exists), the output (h) of the LSTM cell serves as input for the next time step via the recurrent connection. This is similar to how neurons in a basic RNN operate. The LSTM cell also has a cell state (C), which retains historical information. The cell state relies on the Constant Error Carousel (CEC) mechanism, which feeds the previous cell state back into the computation of the new cell state. The following sections briefly describe how an LSTM cell computes the cell state (C) and the cell output (h), and explain the role of the gates in the process.
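As a rough sketch of this data flow (not of the gate internals, which the following sections cover), the loop below carries the cell output h and the cell state C from one time step to the next. The stand-in cell computation is purely illustrative.

import numpy as np

hidden_size = 4
h = np.zeros(hidden_size)   # cell output, fed back over the recurrent connection
c = np.zeros(hidden_size)   # cell state, carried forward by the CEC

def lstm_cell_step(x_t, h_prev, c_prev):
    # Stand-in for the gate computations described in the next sections:
    # it only mixes the previous state with the new input so the loop runs.
    c_t = 0.5 * c_prev + 0.5 * x_t     # placeholder for the Forget/Input Gate logic
    h_t = np.maximum(c_t, 0.0)         # placeholder for the Output Gate logic
    return h_t, c_t

rng = np.random.default_rng(0)
for t in range(1, 4):                  # three time steps of dummy input data
    x = rng.standard_normal(hidden_size)
    h, c = lstm_cell_step(x, h, c)     # h and C flow forward to the next time step
    print(f"t={t}  h={np.round(h, 3)}  C={np.round(c, 3)}")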


Forget Gate

The Forget Gate (FG) adjusts the extent to which historical data is preserved. In Figure 6-2, the cell state Ct-1 represents historical data computed by the identity function during the preceding time step. The cell state (C) is the LSTM cell's internal state, not the LSTM cell output (h), and it protects historical information from gradient vanishing during Backpropagation Through Time (BPTT). The adjustment factor for Ct-1 is calculated by a neuron using the Sigmoid activation function within the FG.

The neuron in the FG uses shared, input-specific weight matrices for the input data (X1) and for the input received from the preceding LSTM cell output (ht-1). These weight matrices are shared across FG neurons over all time steps, similar to the approach used in a basic recurrent neural network (RNN). As described in the previous chapter, this sharing reduces the computational requirements for calculating weight adjustment values during BPTT. Additionally, the shared weight matrices reduce the model's memory utilization by limiting the number of weight variables.

In the figure, the matrix WFG1 is associated with the input received from the preceding time step, while the matrix UFG1 is used for the new input value X1. The weighted sum (WFG1 ⋅ ht-1) + (UFG1 ⋅ X1) is passed through the Sigmoid activation function, which produces the adjustment factor for the cell state value Ct-1. The closer the Sigmoid output is to one, the more the original value affects the calculation of the new value; the closer it is to zero, the less the original value contributes.

Finally, the output of the FG, referred to as XFG, is computed by multiplying the Sigmoid output by the cell state Ct-1.
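A minimal sketch of this computation with toy NumPy values is shown below. W_FG and U_FG stand for the shared weight matrices WFG1 and UFG1 in Figure 6-2; the variable names and dimensions are illustrative, and bias terms are omitted because the figure does not show them.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 3, 2
rng = np.random.default_rng(0)

W_FG = rng.standard_normal((hidden_size, hidden_size))  # applied to h(t-1)
U_FG = rng.standard_normal((hidden_size, input_size))   # applied to X1
h_prev = rng.standard_normal(hidden_size)               # previous cell output h(t-1)
c_prev = rng.standard_normal(hidden_size)               # previous cell state C(t-1)
x1 = rng.standard_normal(input_size)                    # new input X1

# Weighted sum (W_FG . h(t-1)) + (U_FG . X1), squashed into (0, 1).
forget_factor = sigmoid(W_FG @ h_prev + U_FG @ x1)

# Output of the Forget Gate: the adjustment factor scales the old cell state.
x_fg = forget_factor * c_prev
print("forget factor:", np.round(forget_factor, 3))
print("X_FG:", np.round(x_fg, 3))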


Figure 6-2: Long Short-Term Memory Cell – Forget Gate.

Input Gate


The Input Gate (IG) determines to what extent the input X1 and the output ht-1 from the preceding time step affect the new cell state Ct. For this process, the LSTM cell has two neurons. In Figure 6-3, the internal neuron of IG uses the Sigmoid function, while the Input Activation neuron leverages the ReLU function. Both neurons use input-specific weight matrices in the same way as the Forget Gate. 
The Input Gate neuron feeds the weighted sum (WIG1 ⋅ ht-1) + (UIG1 ⋅ X1) to the sigmoid function. The output determines the proportion in which new input values X1 and ht-1 influence the computation of the cell's internal value. 
The Input Activation neuron feeds the weighted sum (WIA1 ⋅ ht-1) + (UIA1 ⋅ X1) to the ReLU function. The output is then multiplied by the output of the Sigmoid function, providing the result of the Input Gate.
At this point, the LSTM cell has computed the outputs of both the Forget Gate (XFG) and the Input Gate (XIG). Next, the LSTM cell feeds these values to the identity function.
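The sketch below walks through the Input Gate path with toy NumPy values: a Sigmoid gate neuron and a ReLU Input Activation neuron, each with its own weight matrices (here named W_IG/U_IG and W_IA/U_IA after WIG1/UIG1 and WIA1/UIA1 in Figure 6-3). Names and sizes are illustrative, and bias terms are again omitted.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(z, 0.0)

hidden_size, input_size = 3, 2
rng = np.random.default_rng(1)

W_IG = rng.standard_normal((hidden_size, hidden_size))  # gate neuron, applied to h(t-1)
U_IG = rng.standard_normal((hidden_size, input_size))   # gate neuron, applied to X1
W_IA = rng.standard_normal((hidden_size, hidden_size))  # input activation, applied to h(t-1)
U_IA = rng.standard_normal((hidden_size, input_size))   # input activation, applied to X1
h_prev = rng.standard_normal(hidden_size)
x1 = rng.standard_normal(input_size)

gate = sigmoid(W_IG @ h_prev + U_IG @ x1)      # how much of the candidate to admit
candidate = relu(W_IA @ h_prev + U_IA @ x1)    # candidate contribution to the cell state
x_ig = gate * candidate                        # output of the Input Gate
print("X_IG:", np.round(x_ig, 3))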

Figure 6-3: Long Short-Term Memory Cell – Input Gate.

Output Gate


The Output Gate determines whether the output of the Output Activation neuron (ReLU) is passed on fully, passed on partially, or blocked entirely. The Output Gate's factor is calculated from the input value X1 and the output ht-1 from the previous time step. In other words, all Sigmoid neurons and the ReLU Input Activation neuron use the same inputs, and they leverage shared weight matrices.

The input to the Output Activation neuron is the sum of the outputs from the Forget Gate (XFG) and the Input Gate (XIG). In the figure, the sum is represented as f(x) = XFG + XIG. The operation is computed by a neuron that uses the identity function. The output of the identity function is preserved as the internal cell state (C) for the next time step through the CEC connection.

The output of the identity function is then passed to the ReLU Output Activation neuron. Its output is multiplied by the result of the Output Gate, producing the actual cell output h1. This value serves as input to the same LSTM cell in the next time step. In a multi-layer model, the cell output is also used as input for the subsequent layer.
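Putting the three gates together, the sketch below implements one full LSTM cell step as described in this chapter: the new cell state is the identity-function sum XFG + XIG, carried forward on the CEC connection, and the cell output h is the ReLU of that state scaled by the Output Gate factor. All parameter names and sizes are illustrative, the same shared weights are reused at every time step, and bias terms are omitted to match the figures.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(z, 0.0)

def lstm_cell_step(x1, h_prev, c_prev, p):
    # One time step of the LSTM cell as drawn in Figures 6-2 to 6-4.
    x_fg = sigmoid(p["W_FG"] @ h_prev + p["U_FG"] @ x1) * c_prev   # Forget Gate output
    gate = sigmoid(p["W_IG"] @ h_prev + p["U_IG"] @ x1)            # Input Gate factor
    cand = relu(p["W_IA"] @ h_prev + p["U_IA"] @ x1)               # Input Activation output
    x_ig = gate * cand                                             # Input Gate output
    c_new = x_fg + x_ig                                            # identity function / CEC
    out_gate = sigmoid(p["W_OG"] @ h_prev + p["U_OG"] @ x1)        # Output Gate factor
    h_new = relu(c_new) * out_gate                                 # cell output h1
    return h_new, c_new

hidden_size, input_size = 3, 2
rng = np.random.default_rng(2)
params = {
    name: rng.standard_normal((hidden_size, hidden_size if name.startswith("W") else input_size))
    for name in ("W_FG", "U_FG", "W_IG", "U_IG", "W_IA", "U_IA", "W_OG", "U_OG")
}

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for t in range(1, 4):                          # the same shared weights are reused every step
    x1 = rng.standard_normal(input_size)
    h, c = lstm_cell_step(x1, h, c, params)
    print(f"t={t}  h={np.round(h, 3)}  C={np.round(c, 3)}")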


Figure 6-4: Long Short-Term Memory Cell – Output Gate.
