Introduction
So far, this book has introduced two neural network architectures. The first one, the Feed-Forward Neural Network (FNN), works well for simple tasks, such as recognizing handwritten digits in small-sized images. The second one, the Convolutional Neural Network (CNN), is designed for processing larger images. CNNs can identify objects in images even when the location or orientation of the object changes.
This chapter introduces the Recurrent Neural Network (RNN). Unlike FNNs and CNNs, an RNN bases its result not only on the current input but also on the inputs it has processed previously. In other words, an RNN preserves and uses historical data. This is achieved by feeding the hidden layer's result from the previous time step back into the hidden layer along with the current input vector.
Although RNNs can be used for predicting sequential data of variable lengths, such as sales figures or a patient’s historical health records, this chapter focuses on how RNNs can perform character-based text autocompletion. The upcoming chapters will explore word-based text prediction.
Text Datasets
RNN models are typically trained on text datasets such as IMDB Reviews or the Wikipedia Text Corpus. In this chapter, however, we simplify the process by using a tailored dataset containing only the word "alley". Figure 5-1 illustrates the steps involved in preparing this dataset.
Splitting the text into characters: First, we break the word into its individual letters (e.g., a, l, l, e, y).
Index mapping: Each character is assigned an index number, which maps it to a one-hot-encoded vector. For example, the letter a is assigned index 0, corresponding to the one-hot vector [1, 0, 0, 0].
Sequence creation: Finally, we define the sequence of characters to predict. For example, when the input character is a (input vector [1, 0, 0, 0]), the model should output the letter l (output vector [0, 0, 1, 0]).
Figure 5-1: Recurrent Neural Networks – Text Dataset and One-Hot Encoding.
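The three dataset preparation steps can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation: the helper names build_vocab and one_hot are hypothetical, and the alphabetical index order (a=0, e=1, l=2, y=3) is an assumption chosen because it agrees with the vectors quoted above for "a" and "l".

```python
def build_vocab(text):
    """Assign an index to each distinct character (assumed: alphabetical order)."""
    chars = sorted(set(text))
    return {ch: i for i, ch in enumerate(chars)}


def one_hot(index, size):
    """Return a one-hot vector of the given size with a 1 at position index."""
    vec = [0] * size
    vec[index] = 1
    return vec


word = "alley"

# Step 1: split the word into characters.
characters = list(word)

# Step 2: map each distinct character to an index.
char_to_idx = build_vocab(word)
vocab_size = len(char_to_idx)

# Step 3: pair every character with the character that follows it.
pairs = [
    (one_hot(char_to_idx[cur], vocab_size), one_hot(char_to_idx[nxt], vocab_size))
    for cur, nxt in zip(characters, characters[1:])
]

for (x, y), cur, nxt in zip(pairs, characters, characters[1:]):
    print(f"{cur} {x} -> {nxt} {y}")
```

The word "alley" contains four distinct characters, which is why every one-hot vector has four positions, and its five letters yield four input/target pairs.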
Training Recurrent Neural Networks
Figure 5-2 illustrates a simplified view of the RNN training process. In the previous section, we explained how one-hot encoding is used to produce an input vector for training. For example, the character “a” is represented by the input vector [1, 0, 0, 0], which is fed into the hidden layer. Each neuron in the hidden layer has its own dedicated set of weights for the input vector, one weight per input position. Together, these per-neuron weights form one row of the input weight matrix.
Weight Matrices in RNNs
The weight values associated with input vectors are denoted as U, while the weights for the recurrent connections (connections between neurons across time steps) are denoted as W. This separation is a standard way to distinguish the weights used for input processing from those used in recurrent operations.
Weighted Sum Calculation in the Hidden Layer
The neurons in the hidden layer calculate the weighted sum of the input vector. Only the position that holds the 1 in the input vector contributes to the calculation, as all other positions result in zero when multiplied by their weights. This calculation also includes a bias term. For example, if the weights of neuron n for the input vector [1, 0, 0, 0] are [Un,1, Un,2, Un,3, Un,4], only the weight Un,1 contributes to the sum.
The result of this weighted sum for the initial time step is denoted as hn,(t-1), because from the viewpoint of the next time step t it belongs to the previous step. This result is "stored" and used as an input for the next time step. Once calculated, the weighted sum is also passed through an activation function, and the resulting activation values are fed into the output layer.
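The weighted sum of a single hidden neuron n can be sketched as follows. All weight and bias values are made up for illustration, the hidden layer size of two is an assumption, and tanh is an assumed activation function, since the text does not name one. Following the description above, the value carried forward is the weighted sum itself; many RNN formulations carry forward the activated value instead.

```python
import math


def weighted_sum(u_row, x, w_row, h_prev, bias):
    """Weighted sum of one hidden neuron: input part + recurrent part + bias."""
    input_part = sum(u * xi for u, xi in zip(u_row, x))
    recurrent_part = sum(w * h for w, h in zip(w_row, h_prev))
    return input_part + recurrent_part + bias


# Hypothetical weights of neuron n for the four input positions (U_n,1 .. U_n,4).
u_n = [0.5, -0.3, 0.8, 0.1]
# Hypothetical recurrent weights of neuron n (assumed hidden layer of two neurons).
w_n = [0.2, -0.4]
b_n = 0.1

x_a = [1, 0, 0, 0]       # one-hot vector for "a"
h_prev = [0.0, 0.0]      # nothing has been stored before the initial time step

# Only U_n,1 survives the multiplication with the one-hot vector.
z = weighted_sum(u_n, x_a, w_n, h_prev, b_n)
activation = math.tanh(z)   # assumed activation function

print(z, u_n[0] + b_n)
```

Because the input is one-hot, the input part of the sum reduces to a simple lookup of one weight, which is why the two printed values agree.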
Output Layer Operations
For simplicity, the figure shows only two output neurons. In practice, the output layer typically contains the same number of neurons as the input vector has dimensions (four in this case). Each output neuron calculates a weighted sum of its inputs, producing a value known as a logit. These logits are passed through the SoftMax activation function, which converts them into probabilities, one for each output neuron. The SoftMax function is discussed in Chapter 3, Multi-Class Classification.
In this example, the output neuron with the highest probability corresponds to the third position (not shown in the figure). This results in the output vector [0, 0, 1, 0], which represents the character “l.”
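The step from logits to the output vector can be sketched as below. The four logit values are invented for illustration, and the index-to-character mapping assumes the alphabetical order a=0, e=1, l=2, y=3.

```python
import math


def softmax(logits):
    """Convert logits into probabilities that sum to 1."""
    peak = max(logits)                       # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [0.2, -1.0, 2.5, 0.4]               # hypothetical logits of four output neurons
probs = softmax(logits)

# The neuron with the highest probability determines the predicted character.
best = probs.index(max(probs))
output_vector = [1 if i == best else 0 for i in range(len(probs))]

idx_to_char = {0: "a", 1: "e", 2: "l", 3: "y"}   # assumed index mapping
print(output_vector, idx_to_char[best])
```

With these logits, the third neuron has the largest value, so the output vector marks the third position.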
Comparison with Feed-Forward Neural Networks (FNNs)
So far, this process resembles that of a Feed-Forward Neural Network (FNN). Input vectors are passed from the input layer to the hidden layer, where the neurons compute weighted sums and apply an activation function. Since the hidden and output layers are fully connected, the hidden layer's activation values are passed to the output layer.
Moving to the Second Time Step
At the second time step, the output vector [0, 0, 1, 0] from the first step serves as the new input vector. Together with the weighted sum hn,(t-1) stored from the previous step, it is used to calculate the new weighted sum. This calculation also includes a bias term. Since the same model is used at every time step, the weight matrices remain unchanged. At this time step, only the weight Un,3 contributes to the input part of the sum, as it corresponds to the non-zero value in the input vector. The rest of the process follows the same steps as in the initial time step.
Once a time step is completed, the process advances to the next one, repeating the same calculations. This continues until the whole sequence has been processed.
Figure 5-2: Recurrent Neural Networks – Basic Operation.
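The forward pass over several time steps can be sketched as a loop. This is an untrained toy model under stated assumptions: the weights are random, the hidden layer size of three is arbitrary, tanh is an assumed activation function, and the names V and c for the output layer weights and biases are hypothetical, since the text does not name them. As in the description above, the stored value is the weighted sum of the hidden layer, and each predicted output vector becomes the next input vector.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rnn_step(x, h_prev, U, W, b, V, c):
    """One time step: hidden weighted sum, activation, output probabilities."""
    h = [ux + wh + bi for ux, wh, bi in zip(matvec(U, x), matvec(W, h_prev), b)]
    a = [math.tanh(v) for v in h]            # assumed activation function
    logits = [va + ci for va, ci in zip(matvec(V, a), c)]
    return h, softmax(logits)


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
vocab_size, hidden_size = 4, 3               # hidden size is an arbitrary choice
U = rand_matrix(hidden_size, vocab_size)     # input weights
W = rand_matrix(hidden_size, hidden_size)    # recurrent weights
V = rand_matrix(vocab_size, hidden_size)     # output weights (hypothetical name)
b = [0.0] * hidden_size
c = [0.0] * vocab_size

x = [1, 0, 0, 0]                             # start with the character "a"
h = [0.0] * hidden_size                      # nothing stored before the first step
outputs = []
for t in range(4):
    # The same U, W, and V are reused at every time step.
    h, probs = rnn_step(x, h, U, W, b, V, c)
    best = probs.index(max(probs))
    x = [1 if i == best else 0 for i in range(vocab_size)]
    outputs.append(x)

print(outputs)
```

Because the weights are random, the predicted characters are meaningless here; the sketch only shows how the stored value and the shared weight matrices move from one time step to the next.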
Backward Pass in Recurrent Neural Networks
The backward pass in RNNs is called Backpropagation Through Time (BPTT) because it propagates errors not only through the network layers but also backward through the time steps. If you think of the time steps as stacked layers, an RNN has far fewer weights to store and update than a Feed-Forward Neural Network (FNN) of the same depth, because the RNN shares the same weight matrices across all of these layers while the FNN has separate weight values for each layer. The shared weights do not make the backward pass itself cheaper, however: the values computed at every time step must still be kept until the backward pass, and the gradient of each shared weight is the sum of its contributions from all time steps. Like the RNN, the Convolutional Neural Network (CNN), introduced in Chapter 4, uses shared weights, but it shares them within a layer rather than between layers.
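The effect of weight sharing on the number of trainable hidden-layer parameters can be sketched with simple arithmetic. The function names are hypothetical, the hidden layer size of three is an arbitrary assumption, and the "unshared" network is an imagined counterpart in which every time step has its own copy of U, W, and the bias.

```python
def rnn_hidden_params(vocab_size, hidden_size):
    """Hidden-layer parameters of an RNN: U, W, and one bias per neuron."""
    u = hidden_size * vocab_size      # input weights
    w = hidden_size * hidden_size     # recurrent weights
    return u + w + hidden_size


def unshared_hidden_params(vocab_size, hidden_size, steps):
    """The same structure if every time step had its own separate weights."""
    return steps * rnn_hidden_params(vocab_size, hidden_size)


vocab_size, hidden_size, steps = 4, 3, 4     # four steps cover the word "alley"

shared = rnn_hidden_params(vocab_size, hidden_size)
unshared = unshared_hidden_params(vocab_size, hidden_size, steps)
print(shared, unshared)
```

The shared count stays constant no matter how long the sequence is, while the unshared count grows linearly with the number of time steps.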