Memorization in recurrent neural networks (RNNs) remains a hurdle in many applications. We would like RNNs to retain information over a large number of timesteps and retrieve it when it becomes useful, but vanilla RNNs frequently fail at this.
A number of network architectures, including the Gated Recurrent Unit (GRU) and the Long Short-Term Memory (LSTM) unit, have been proposed to address various parts of this issue. Even so, memorization remains a practical difficulty, and research into new recurrent units with improved memorization abilities is still ongoing.
Both older and more recent papers, such as the Nested LSTM paper by Moniz et al., mostly rely on quantitative comparisons to evaluate a recurrent unit against its alternatives. These comparisons frequently measure accuracy or cross-entropy loss on standard problems such as Penn Treebank, Chinese Poetry Generation, or text8, where the objective is to predict the next character given the preceding input.
Although helpful, quantitative comparisons reveal only a portion of a recurrent unit’s memorization behavior. A model can, for instance, attain high accuracy and low cross-entropy loss by making extremely accurate predictions whenever short-term memorization suffices, while failing whenever long-term memorization is required. A model with only short-term knowledge may still autocomplete the ends of words with high accuracy once most of the letters are present, but when only a few letters are known, it cannot anticipate words without longer-term contextual understanding.
This article presents a qualitative visualization technique for comparing recurrent units in terms of memorization and contextual understanding. The method is applied to three recurrent units: Nested LSTMs, LSTMs, and GRUs.
Recurrent Units
A vanilla RNN updates its hidden state at every timestep from the previous hidden state and the current input. Theoretically, this temporal dependency enables it, at each iteration, to know about every part of the sequence that came before. In practice, however, the same dependency causes a vanishing gradient problem, and long-term dependencies are often overlooked during training.
Vanishing gradient: the contribution from earlier steps becomes negligible in the gradient for the vanilla RNN unit.
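As a brief sketch of the problem, consider one common formulation of the vanilla RNN update and the gradient it induces (the notation here is illustrative, not taken from a specific implementation):

```latex
h_t = \tanh(W_h h_{t-1} + W_x x_t + b)
\qquad
\frac{\partial h_t}{\partial h_k}
  = \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}}
  = \prod_{i=k+1}^{t} \operatorname{diag}\left(1 - h_i \odot h_i\right) W_h
```

When the norm of each factor is below 1, this product shrinks exponentially in the distance t − k, so inputs from early timesteps contribute almost nothing to the gradient.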
Over the years, a number of solutions to the vanishing gradient problem have been proposed. The LSTM and GRU units mentioned above are the most widely used, although research in this area is still ongoing. Both GRU and LSTM are well known and extensively described in the literature.
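For reference, the GRU update can be written as follows (one standard convention among several in the literature):

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) && \text{(update gate)} \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) && \text{(candidate state)} \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(interpolation)}
\end{aligned}
```

The additive interpolation in the last line is what lets gradients flow across many timesteps: when z_t is near zero, the hidden state is carried over almost unchanged.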
It is not fully understood why one type of recurrent unit performs better in some applications while another type performs better in others. Although they all theoretically solve the vanishing gradient problem, their practical performance varies greatly depending on the application.
Determining the causes of these variations is probably a difficult and somewhat ambiguous task. This article instead aims to demonstrate a visualization technique that can better illustrate these differences; ideally, such insight can lead to a deeper understanding of recurrent units in general.
Recurrent Unit Comparison
Comparing different recurrent units frequently involves more than just comparing accuracy or cross-entropy loss. Variations in these high-level quantitative metrics can have many explanations; for example, a slight improvement may come solely from better predictions that require only short-term contextual understanding, when long-term contextual understanding is what is actually of interest.
A Challenge For Qualitative Analysis
A good problem for qualitatively analyzing contextual understanding should therefore depend on both short-term and long-term contextual understanding and should have an output that humans can interpret. The commonly used problems, such as Penn Treebank, Chinese Poetry Generation, or text8 generation, do not satisfy this: they produce only a single letter at a time, or demand a thorough understanding of grammar or of Chinese poetry, so their outputs are difficult to reason about.
To this end, this article examines the autocomplete problem. For every character, the target is the entire word that the character belongs to; the space preceding a word likewise maps to that word. This prediction from the space character is especially useful for demonstrating contextual understanding.
The sole distinction between the autocomplete and text8 generation problems is that the model predicts the whole word rather than just the next letter, which greatly improves the output’s interpretability. Finally, because of its close relationship to text8 generation, the existing research on that problem remains relevant and comparable: models that perform well on text8 generation should also perform well on the autocomplete problem.
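As a minimal sketch of how such examples could be constructed (the function and details below are illustrative, not the exact preprocessing pipeline):

```python
def make_examples(text):
    """Map each character of a word, and the space before it, to the full word."""
    examples = []
    pos = 0
    for word in text.split(" "):
        start = max(pos - 1, 0)  # include the preceding space, if any
        for i in range(start, pos + len(word)):
            examples.append((text[: i + 1], word))  # (visible prefix, target word)
        pos += len(word) + 1
    return examples

for prefix, target in make_examples("deep learning is fun")[:6]:
    print(repr(prefix), "->", target)  # 'd' -> deep ... 'deep ' -> learning
```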
Autocomplete: a problem that relies on both short-term and long-term contextual understanding and produces human-interpretable output. In this instance, the network uses past information to recognize that the next word should be a country.
The output shown here was generated by the GRU model; descriptions of all model configurations appear in the appendix.
The autocomplete dataset is built from the complete text8 dataset. The recurrent neural networks used to solve the problem each have two layers of 600 units. Three models are compared, using GRU, LSTM, and Nested LSTM units respectively. See the appendix for further details.
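A rough sketch of one such model in Keras (the vocabulary sizes, embedding dimension, and training settings here are illustrative assumptions, not the exact configuration):

```python
import tensorflow as tf

CHAR_VOCAB = 28     # assumed: 'a'-'z', space, and padding
WORD_VOCAB = 16384  # assumed size of the target word vocabulary

# Swapping the GRU layers for LSTM layers (or a Nested LSTM
# implementation) yields the other two model variants.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(CHAR_VOCAB, 16),
    tf.keras.layers.GRU(600, return_sequences=True),
    tf.keras.layers.GRU(600, return_sequences=True),
    tf.keras.layers.Dense(WORD_VOCAB, activation="softmax"),  # a word prediction per character
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```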
Connectivity in the Autocomplete Problem
In the recently published Nested LSTM paper, the authors qualitatively compared their Nested LSTM unit with other recurrent units by visualizing individual cell activations, to demonstrate how it memorizes in comparison.
That visualization was inspired by Karpathy et al.’s identification of cells that capture a specific feature. The technique works well for identifying such a feature, but it is not a viable argument about memorization in general, because the result depends entirely on what feature the chosen cell happens to capture.
Rather than inspecting individual cells, this article examines the connectivity between the input and the desired output, to gain a better understanding of how well each model memorizes and uses its memory for contextual understanding.
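One simple way to compute such a connectivity measure is the gradient magnitude of the desired word’s output with respect to each input character embedding. A hedged sketch in TensorFlow follows (the exact formulation behind the figures may differ):

```python
import tensorflow as tf

def connectivity(model, char_ids, t, target_word_id):
    """Gradient norm of the target word's output at timestep t with respect
    to every input character embedding."""
    embed, *rest = model.layers
    x = tf.Variable(embed(tf.constant([char_ids])))  # (1, T, embedding_dim)
    with tf.GradientTape() as tape:
        h = x
        for layer in rest:
            h = layer(h)
        score = h[0, t, target_word_id]  # output for the desired word at step t
    grad = tape.gradient(score, x)       # (1, T, embedding_dim)
    return tf.norm(grad[0], axis=-1)     # one connectivity value per input character
```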
Investigating this connectivity reveals a remarkable amount about the various models’ capacity for long-term contextual understanding. Examine the figure below to see what information the different models use to make their predictions.
Let’s focus on three particular observations:
1. Given just the first two characters as input, the Nested LSTM model hardly uses past information when predicting the word “learning”, and it only suggests common words beginning with the letter “l”.
2. In contrast, both the LSTM and GRU models suggest the word “learning”.
3. The suggestions show that the GRU model predicts a higher probability for “learning” than the LSTM model does, and the GRU model also exhibits stronger connectivity with the word “advanced”.
These observations demonstrate how effective the connectivity visualization is for comparing models with respect to which prior inputs they use for context. Nevertheless, the models can only be compared on a single example and a single dataset at a time. So even though these findings may indicate that Nested LSTM lacks long-term contextual understanding in this particular instance, the conclusion might not hold for other datasets or hyperparameters.
Future Research: A Quantitative Metric
The observations above suggest that short-term contextual understanding often involves the word being predicted itself: as more of its letters become available, the models increasingly rely on the letters already seen from that word. At the beginning of a word, by contrast, the models, particularly the GRU network, use previously seen words as context for the prediction.
This observation suggests a quantitative metric: measure accuracy as a function of how many letters of the predicted word are already known. It is unclear whether this is the best quantitative metric, as it is very problem-dependent and does not condense the model’s performance into a single number, which one might want for a more straightforward comparison.
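A hedged sketch of that metric, with accuracy bucketed by the number of already-known letters (the names and input format are illustrative):

```python
from collections import defaultdict

def accuracy_by_known_letters(predictions):
    """predictions: iterable of (letters_seen, predicted_word, target_word)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for letters_seen, predicted, target in predictions:
        totals[letters_seen] += 1
        hits[letters_seen] += int(predicted == target)
    # one accuracy per prefix length, rather than a single overall number
    return {k: hits[k] / totals[k] for k in sorted(totals)}
```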
The findings imply that the GRU model is better at long-term contextual understanding, while the LSTM model is better at short-term contextual understanding. These insights are valuable, as they explain why the overall accuracy of the GRU and LSTM models is almost identical even though the connectivity visualization shows the GRU model to be far superior at long-term contextual understanding.
Even though more detailed quantitative metrics like this offer fresh perspectives, qualitative analyses such as the connectivity figure in this article remain valuable. Unlike a quantitative metric, the connectivity visualization provides an intuitive understanding of how the model operates, and it can reveal that an incorrect prediction may still be a useful one, such as a synonym or a prediction that makes sense in context.
In Conclusion
Looking at cross-entropy loss and overall accuracy alone is not very interesting. Different models may prioritize either short-term or long-term contextual understanding while still achieving comparable accuracy and cross entropy.
Therefore, when evaluating models, a qualitative analysis of how past input is used in the prediction is equally crucial. In this instance, the connectivity visualization and the autocomplete predictions together show that the GRU model outperforms LSTM and Nested LSTM in long-term contextual understanding, and the difference is significantly greater than what cross-entropy loss and overall accuracy alone would suggest. On its own, this observation is not that interesting, since it probably depends heavily on the hyperparameters and the particular application.
Far more useful than accuracy and cross entropy alone, this visualization technique enables a much deeper intuitive understanding of how the models differ. In this application, it is evident that the GRU model relies on repeated words and the semantic meaning of past words to make its predictions to a far greater extent than the LSTM and Nested LSTM models. This is both a valuable insight when selecting the final model and crucial knowledge for developing better recurrent units in the future.

