Challenges of an RNN Model
Figure 5-3 shows the last two time steps of our Recurrent Neural Network (RNN). At time step n (on the left side), there are two inputs to the weighted sum calculation: Xn (the input at the current time step) and hn−1 (the hidden state from the previous time step).
First, the model calculates the weighted sum of these inputs. The result is then passed through the neuron's activation function (Sigmoid in this example). The output of the activation function, hn, is fed back into the recurrent layer at the next time step, n+1. At time step n+1, hn is combined with the input Xn+1 to calculate the weighted sum. This result is then passed through the activation function, which now produces the model's prediction, ŷ (y hat). These steps make up the Forward Pass process.
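To make the forward pass concrete, here is a minimal NumPy sketch of these two time steps for a single recurrent neuron. The weight values, inputs, and variable names (w_x, w_h, x_n, h_prev, and so on) are illustrative assumptions, not values taken from the figure.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values for a single recurrent neuron (not from the figure)
w_x, w_h, b = 0.7, 0.4, 0.1   # input weight, recurrent weight, bias
x_n, x_n1 = 0.5, 0.8          # inputs at time steps n and n+1
h_prev = 0.2                  # hidden state carried over from step n-1

# Time step n: weighted sum of the current input and the previous hidden state
z_n = w_x * x_n + w_h * h_prev + b
h_n = sigmoid(z_n)            # hidden state fed forward to step n+1

# Time step n+1: the same shared weights are reused
z_n1 = w_x * x_n1 + w_h * h_n + b
y_hat = sigmoid(z_n1)         # the model's prediction (y hat)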
As the final step in the forward pass, we calculate the model's error using the Mean Squared Error (MSE) function (explained in Chapter 2).
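Continuing the sketch above, the error for this single prediction could be computed as follows; the target value y_true is made up for illustration.

y_true = 1.0                   # expected output for this example (illustrative)
mse = (y_hat - y_true) ** 2    # squared error for a single prediction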
If the model's prediction is not close enough to the expected result, the model begins the Backward Pass to improve its performance. The most commonly used optimization algorithm for minimizing the loss function during the backward pass is Gradient Descent, which updates the model's parameters step by step.
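The gradient descent update itself is a one-line rule: each parameter moves a small step against its gradient, scaled by the learning rate. The value 0.1 below is an arbitrary choice for this sketch.

learning_rate = 0.1

def gradient_descent_step(w, grad, lr):
    # Move the parameter a small step against the direction of its gradient
    return w - lr * grad
    # e.g. w_h = gradient_descent_step(w_h, grad_w_h, learning_rate)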
The backward pass process starts by calculating the derivative of the error function (i.e., the gradient of the error function with respect to the output activation value) to determine the Output Error.
Next, the Output Error is multiplied by the derivative of the activation function to compute the neuron's local Error Term. (The derivative of the activation function with respect to its input is the local gradient; it determines how the activation value changes in response to a change in its input.) The error terms are then propagated back through all time steps to calculate the actual Weight Adjustment Values.
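Staying with the variables from the forward-pass sketch, the output error and the local error terms might look like the following. The factor of 2 comes from differentiating the squared error; some texts fold it into the learning rate.

def sigmoid_derivative(a):
    # Local gradient of the sigmoid, written in terms of its output a = sigmoid(z)
    return a * (1.0 - a)

# Output error: gradient of the squared error with respect to the prediction
output_error = 2.0 * (y_hat - y_true)

# Local error term at time step n+1: output error times the local gradient
delta_n1 = output_error * sigmoid_derivative(y_hat)

# Propagate the error term back to time step n through the recurrent weight
delta_n = delta_n1 * w_h * sigmoid_derivative(h_n)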
In this example, we focus on how the weight value associated with the recurrent connection is updated, but the same process also applies to the weights linked to the input values. The neuron-specific weight adjustment values are calculated by multiplying the local error term by the corresponding input value and the learning rate.
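For the recurrent connection in the sketch, the input feeding that connection at each time step is the hidden state from the previous step, so the per-time-step adjustment values could be written as:

# Weight adjustment values for the recurrent connection, one per time step.
# The input feeding that connection is the hidden state of the previous step.
adjust_n1 = learning_rate * delta_n1 * h_n     # step n+1: input was h_n
adjust_n = learning_rate * delta_n * h_prev    # step n:   input was h_{n-1}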
The difference between the backward pass in a Feedforward Neural Network (FNN) and in a Recurrent Neural Network (RNN) is that the RNN uses Backpropagation Through Time (BPTT). In this method, the weight adjustment values from each time step are accumulated during backpropagation. Optionally, these accumulated gradients can be averaged over the number of time steps to prevent the gradient magnitude from growing too large for long sequences. In frameworks such as TensorFlow and PyTorch, this averaging typically happens by default through the loss function's mean reduction.
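Using the toy values from the earlier sketches, the accumulation (and optional averaging) over our two time steps might look like this:

# BPTT: accumulate the per-time-step adjustment values for the shared weight
adjustments = [adjust_n, adjust_n1]
total_adjustment = sum(adjustments)

# Optional: average over the number of time steps to keep the magnitude
# from growing with sequence length (only two steps in this sketch)
total_adjustment /= len(adjustments)

# The recurrent weight is shared across time steps, so it is updated only once
w_h = w_h - total_adjustment   # learning rate is already included in the adjustments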
Since the RNN model uses shared weight matrices across all time steps, only one weight parameter per recurrent connection needs to be updated. In this simplified example, we have one recurrent connection because there is only one neuron in the recurrent layer. However, in real-world scenarios, RNN layers often have hundreds of neurons and thousands of time steps.