Challenges of an RNN Model
Figure 5-3 shows the last two time steps of our Recurrent Neural Network (RNN). At time step n (on the left side), there are two inputs to the weighted sum calculation: Xn (the input at the current time step) and ht−1 (the hidden state from the previous time step).
First, the model calculates the weighted sum of these inputs. The result is then passed through the neuron’s activation function (Sigmoid in this example). The output of the activation function, ht, is fed back into the recurrent layer at the next time step, n+1. At time step n+1, ht is combined with the input Xn+1 to calculate the weighted sum. This result is then passed through the activation function, which now produces the model's prediction, ŷ (y hat). These steps make up the Forward Pass process.
As the final step in the forward pass, we measure the model's error using the Mean Squared Error (MSE) loss function (explained in Chapter 2).
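The forward pass described above can be sketched in a few lines of Python. This is a minimal illustration with a single recurrent neuron; the weights, inputs, and target below are hypothetical values chosen for demonstration, not taken from Figure 5-3.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters for a single-neuron recurrent layer.
w_x, w_h, b = 0.5, 0.8, 0.1   # input weight, recurrent weight, bias
x = [0.2, 0.4]                # inputs at time steps n and n+1
h_prev = 0.3                  # hidden state carried in from earlier steps

# Time step n: weighted sum of the current input and the prior hidden state.
z_n = w_x * x[0] + w_h * h_prev + b
h_n = sigmoid(z_n)            # becomes the hidden state fed into step n+1

# Time step n+1: the new hidden state is combined with the next input.
z_n1 = w_x * x[1] + w_h * h_n + b
y_hat = sigmoid(z_n1)         # the model's prediction

# The forward pass ends by measuring the error with MSE.
y_true = 0.9
mse = (y_true - y_hat) ** 2
```

Note how the same weights (w_x, w_h, b) are reused at both time steps; only the input and the hidden state change.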
If the model's prediction is not close enough to the expected result, the model begins the Backward Pass to improve its performance. The most commonly used optimization algorithm for minimizing the loss function during the backward pass is Gradient Descent, which updates the model's parameters step by step.
The backward pass process starts by calculating the derivative of the error function (i.e., the gradient of the error function with respect to the output activation value) to determine the Output Error.
Next, the Output Error is multiplied by the derivative of the activation function to compute the neuron's local Error Term. (The derivative of the activation function with respect to its input is the local gradient, which describes how the activation value changes in response to a change in its input.) The error terms are then propagated back through all time steps to calculate the actual Weight Adjustment Values.
In this example, we focus on how the weight associated with the recurrent connection is updated. However, the same process also applies to the weights linked to the input values. The neuron-specific weight adjustment values are calculated by multiplying the local error term by the corresponding input value and the learning rate.
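For a single time step, the calculation chain above can be sketched as follows. All numbers here are hypothetical, chosen only to show the order of operations: output error → local error term → weight adjustment value.

```python
# Hypothetical values for one time step of the backward pass.
output_error = -0.21       # gradient of the MSE loss w.r.t. the output
y_hat = 0.69               # Sigmoid output at this time step

# Local gradient of the Sigmoid: y * (1 - y).
sigmoid_derivative = y_hat * (1.0 - y_hat)

# Local error term: output error scaled by the activation's local gradient.
error_term = output_error * sigmoid_derivative

# Adjustment value for the recurrent weight: local error term multiplied
# by the corresponding input (the previous hidden state) and the learning rate.
h_prev = 0.61              # hidden state that fed this time step
learning_rate = 0.05
delta_w_h = learning_rate * error_term * h_prev
```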
The difference between the backward pass in a Feedforward Neural Network (FNN) and a Recurrent Neural Network (RNN) is that the RNN uses Backpropagation Through Time (BPTT). In this method, the weight adjustment values from each time step are accumulated (summed) during backpropagation. Optionally, the result can be averaged over the number of time steps to keep the gradient magnitude manageable for long sequences. In TensorFlow and PyTorch, a comparable effect is typically obtained by using a mean-reduced loss over the sequence, since the frameworks themselves sum gradient contributions across time steps.
Since the RNN model uses shared weight matrices across all time steps, only one weight parameter per recurrent connection needs to be updated. In this simplified example, we have one recurrent connection because there is only one neuron in the recurrent layer. However, in real-world scenarios, RNN layers often have hundreds of neurons and thousands of time steps.
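The BPTT accumulation for the single shared recurrent weight can be sketched end to end. This is a toy example with made-up values; for simplicity, only the final time step produces a prediction and contributes to the loss, which matches the single-output setup in Figure 5-3.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w_x, w_h, b = 0.5, 0.8, 0.1        # shared across all time steps
xs = [0.2, 0.4, 0.1, 0.7]          # one input per time step
y_true, lr = 0.9, 0.05

# Forward pass: store every hidden state, because BPTT needs them later.
hs = [0.0]                          # initial hidden state h_0
for x in xs:
    hs.append(sigmoid(w_x * x + w_h * hs[-1] + b))
y_hat = hs[-1]                      # prediction from the last time step

# Backward pass: propagate the error term through time, accumulating the
# gradient for the single shared recurrent weight w_h.
delta = 2 * (y_hat - y_true)        # derivative of MSE w.r.t. y_hat
grad_w_h = 0.0
for t in range(len(xs), 0, -1):
    local = delta * hs[t] * (1.0 - hs[t])   # local error term at step t
    grad_w_h += local * hs[t - 1]           # step t's contribution to w_h
    delta = local * w_h                     # carry back through the recurrent link

w_h -= lr * grad_w_h                # one gradient-descent update
```

Because w_h is shared, the loop adds up one contribution per time step but updates only a single parameter at the end, exactly as described above.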
Figure 5-3: Overview of the Weight Adjustment Process.
Saturated Neurons
Figure 5-4 depicts the S-curve of the Sigmoid activation function. It shows how the output of the function (y) changes in response to variations in the input (z). The chart illustrates how the rate of change slows down significantly when the input value exceeds 2.2 or falls below -2.2. Beyond these thresholds, approaching input values of 5.5 and -5.5, the rate of change becomes negligible from a learning perspective. This behavior can result from a poor weight initialization strategy, where the initial weight values are either too small or too large, pushing the weighted sums into these flat regions. There, the derivative of the activation function is close to zero, so backpropagation through time (BPTT) can barely adjust the weights. This issue is commonly known as neuron saturation.
Another issue illustrated in the figure is that the Sigmoid activation function output (y) is practically zero when the input value is less than -5. For example, with z = −5, y ≈ 0.0067, and with z = −7, y drops to just 0.0009. The problem with these "almost-zero" output values is that the neuron becomes "dead," meaning its output (y) has negligible impact on the model's learning process. In an RNN model, where the neuron's output is reused in the recurrent layer as the hidden state (h), a close-to-zero value causes the neuron to "forget" inputs from preceding time steps.
Figure 5-4: The Problem with the S-curved Sigmoid Function.
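The saturation behavior in Figure 5-4 is easy to verify numerically. The short sketch below prints the Sigmoid output and its derivative y·(1 − y) for a few input values; near the flat regions the derivative collapses toward zero, which is exactly the local gradient BPTT multiplies by.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# The Sigmoid output flattens out quickly outside roughly [-2.2, 2.2].
for z in (-7, -5, -2.2, 0, 2.2, 5, 7):
    y = sigmoid(z)
    # y * (1 - y) is the local gradient used during BPTT; near saturation
    # it approaches zero, so the weight adjustments become negligible.
    print(f"z = {z:+5.1f}  y = {y:.4f}  dy/dz = {y * (1 - y):.4f}")
```

Note that the derivative peaks at 0.25 (at z = 0) and is already below 0.01 at |z| = 5.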
Figure 5-5 illustrates a hypothetical RNN with five time steps. This example demonstrates how some recurrent connections for the hidden state values (h) can become insignificant from the perspective of subsequent time steps. For instance, if the output of the Sigmoid activation function at time step 1 (h1) is 0.0007, the corresponding value at time step 2 (h2) would increase by only 0.0008 compared to the scenario where h1 is zero.
Similarly, in an RNN with 1000 time steps, the learning process is prone to the vanishing gradient problem during backpropagation through time (BPTT). As gradients are propagated backward across many time steps, they often shrink exponentially due to repeated multiplication by small values (e.g., derivatives of the Sigmoid activation function). This can cause the learning curve to plateau, leading to poor weight updates and suboptimal learning. In severe cases, the learning process may effectively stop, preventing the model from achieving the expected performance.
Figure 5-5: RNN and "Forgotten" History.
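The exponential shrinkage behind the vanishing gradient problem can be illustrated with a quick upper-bound calculation. The Sigmoid derivative never exceeds 0.25, so the gradient factor carried back over many steps is bounded by a repeated product; the recurrent weight value below is a hypothetical example.

```python
# The Sigmoid derivative is at most 0.25, so the per-step factor that
# gradients are multiplied by during BPTT is bounded by 0.25 * |w_h|.
max_sigmoid_derivative = 0.25
w_h = 0.9                      # hypothetical recurrent weight

for steps in (10, 100, 1000):
    # Upper bound on the gradient factor after `steps` time steps.
    factor = (max_sigmoid_derivative * w_h) ** steps
    print(f"{steps:5d} steps -> gradient factor <= {factor:.3e}")
```

Even at 100 time steps the factor is astronomically small, which is why early time steps contribute essentially nothing to the weight updates.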
When using a data parallelization strategy with Recurrent Neural Networks (RNNs), input data batches are distributed across multiple GPUs, each running the same model independently on its assigned batch. During the backpropagation through time (BPTT) process, each GPU calculates gradients locally for its portion of the data. These gradients are then synchronized across all GPUs, typically by averaging them, to ensure consistent updates to the shared model parameters.
Since the weight matrices are part of the shared model, the updated weights remain synchronized across all GPUs after each training step. This synchronization ensures that all GPUs use the same model for subsequent forward and backward passes. However, due to the sequential nature of RNNs, BPTT must compute gradients step by step, which can still limit scalability when dealing with long sequences. Despite this, data parallelization accelerates training by distributing the workload and reducing the computational burden for each GPU.
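The gradient synchronization step can be sketched in plain Python. This is a conceptual illustration only, with made-up gradient values standing in for what each GPU would compute locally; real frameworks perform this averaging with collective operations such as all-reduce.

```python
# Each "GPU" computes local gradients for its own batch slice during BPTT.
# Hypothetical per-replica gradients for the two weights of our toy RNN.
local_gradients = [
    {"w_x": 0.12, "w_h": -0.30},   # gradients from GPU 0's batch
    {"w_x": 0.08, "w_h": -0.26},   # gradients from GPU 1's batch
]

# Synchronize by averaging across replicas, as the text describes.
num_replicas = len(local_gradients)
synced = {
    name: sum(g[name] for g in local_gradients) / num_replicas
    for name in local_gradients[0]
}

# Every replica applies the identical update, keeping the shared
# model weights in sync for the next forward and backward pass.
weights = {"w_x": 0.5, "w_h": 0.8}
lr = 0.05
weights = {name: w - lr * synced[name] for name, w in weights.items()}
```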
We can also implement a model parallelization strategy with RNNs, which requires synchronizing both activation values during the forward pass and gradients during the backward pass.
The parallelization strategy significantly affects network utilization due to the synchronization process—specifically, what we synchronize and at what rate. Several upcoming chapters will focus on different parallelization strategies.