What are the real problems of continual learning?
Reflections on catastrophic interference, plasticity, and learning for the future in the era of large language models
Lately, there’s been a surge of interest in continual learning. In particular, there’s been an increasing sense that continual learning is one of the areas where the gap between humans and AI is largest.1 In this post, I want to explain what continual learning is, and some of the past perspectives on it — including how I think the field focused on what in retrospect turned out to be the wrong problems. I’ll then describe what I think are the actual problems that remain, why humans are still superior to AI, and why I think continual learning is so important.
What is continual learning?
In short, continual learning is the ability of a system to keep improving throughout its existence, just as humans can learn new skills and knowledge throughout our lives. This is not an inherent feature of most contemporary AI systems: if you have one conversation with a language model, and then have another conversation on the same topic, the model will have no recollection of the first one or of what you explained during it (unless the AI writes notes for itself, as some memory systems allow). Likewise, if you work with a language model to write a paper, you will have to put that paper back in context when you want to work on a follow-up. Why don’t these systems support continual learning? To understand this question, let’s start from earlier perspectives.
Catastrophic interference: the original problem
From the early eras of neural network research, it was noted that these networks exhibit catastrophic interference (or catastrophic forgetting): training a network on a new task catastrophically degrades its ability to perform the tasks it was trained on previously. Thus, networks were generally trained on data whose distribution was constant over training (that is, sampled independently from a fixed distribution, or IID), rather than on a sequence of tasks. This was seen to pose a challenge for using neural networks as cognitive models, since human development and education are fundamentally sequential.
This challenge provided one motivation for theories of the role of the human hippocampus (episodic memory system) as allowing for rapid learning that complemented the cortical system — if the hippocampus can rapidly store a memory, it can then be learned by the cortical system more gradually as it is interleaved with other experiences. This type of interleaving makes the data distribution closer to IID, and thus reduces the problem of catastrophic interference. The idea of replaying past experiences to smooth the data distribution has been very influential in subsequent machine learning works, for example in reinforcement learning.
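As a rough illustration of the interleaving idea, the sketch below mixes stored examples from earlier tasks into each batch of new-task data. The names `ReplayBuffer` and `interleaved_batches`, and the `replay_fraction` parameter, are hypothetical and chosen for illustration only; reservoir sampling is just one common way to keep a bounded sample of past data, not a method taken from any particular paper discussed here.

```python
import random


class ReplayBuffer:
    """Bounded store of past examples, filled by reservoir sampling."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        # Reservoir sampling: each example seen so far is retained
        # with equal probability once the buffer is full.
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = example

    def sample(self, k):
        # Return up to k stored examples, without replacement.
        k = min(k, len(self.items))
        return self.rng.sample(self.items, k)


def interleaved_batches(task_stream, buffer, batch_size=8, replay_fraction=0.5):
    """Yield batches that mix new-task examples with replayed old ones."""
    n_replay = int(batch_size * replay_fraction)
    n_new = batch_size - n_replay
    fresh = []
    for example in task_stream:
        fresh.append(example)
        if len(fresh) == n_new:
            # Replay is drawn before the fresh examples are stored, so the
            # replayed part of the batch always comes from earlier data.
            batch = fresh + buffer.sample(n_replay)
            for ex in fresh:
                buffer.add(ex)
            fresh = []
            yield batch
    # A trailing partial batch is dropped, for simplicity.
```

Feeding two tasks through the same buffer one after the other yields second-task batches that also contain first-task examples, which is the sense in which replay pushes a sequential stream back toward an IID mixture. Note that this sketch stores raw examples, so it inherits the storage cost that makes veridical replay impractical as tasks multiply.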
However, machine learning approaches to replay have generally relied on storing veridical experiences; as such, they tended to seem impractical as the tasks being learned became more numerous and complex. This problem led to an explosion of approaches trying to address catastrophic interference through other changes to architectures or learning objectives — for example, various approaches that preserve weights proportional to their importance to prior tasks, or prevent gradients from interfering with prior tasks, so that learning will occur where it interferes the least.
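To make the "preserve weights in proportion to their importance" idea concrete, here is a minimal sketch of a quadratic penalty in the spirit of elastic weight consolidation. It assumes that a per-parameter importance score has already been estimated (for example, from squared gradients on the earlier task); the function names, the flat-list parameter representation, and the `strength` parameter are illustrative assumptions, not any specific paper's implementation.

```python
def importance_penalty(params, anchor_params, importance, strength=1.0):
    """Quadratic penalty that grows when important weights drift from their
    values after the earlier task (anchor_params)."""
    return 0.5 * strength * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, importance)
    )


def penalty_gradient(params, anchor_params, importance, strength=1.0):
    """Gradient of the penalty with respect to each parameter."""
    return [
        strength * f * (p - a)
        for p, a, f in zip(params, anchor_params, importance)
    ]


def regularized_step(params, task_grads, anchor_params, importance,
                     lr=0.1, strength=1.0):
    """One gradient-descent step on the new-task loss plus the penalty."""
    pen_grads = penalty_gradient(params, anchor_params, importance, strength)
    return [p - lr * (g + pg) for p, g, pg in zip(params, task_grads, pen_grads)]
```

With a high importance score, a weight is pulled back toward its anchor and settles close to it; with an importance of zero, it follows the new-task gradient freely. In that sense, learning is steered toward the parameters where it interferes least.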
Is interference as catastrophic as it seems?
However, various recent works have argued that “interference” in continual learning is not quite as catastrophic as it seems. Often, the knowledge of earlier tasks is preserved within the model’s representations in some sense, and can be recovered relatively easily. For example, one paper finds that internal representations often preserve relatively high linear decodability of earlier task information, even when performance degrades. Several other papers have suggested similarly that interference is strongest at the readout layers, while earlier layer representations preserve information about other tasks — thus allowing relatively easy recovery of earlier tasks by retraining the output layers. On their own, these findings do not completely resolve the interference problem — but they do suggest it may not be fundamental.
Loss of plasticity: forward interference
More recently, there has been a newer perspective on a different type of interference: loss of plasticity. While typically catastrophic forgetting works backwards (tasks learned later interfere with earlier tasks), loss of plasticity is instead about forward interference: how earlier learning impairs the ability of the network to learn later tasks. Evidently, for a continual learner this type of interference is just as bad as catastrophic forgetting; a continual learning system cannot lose its ability to learn over time.
However, again, the problems are not universal. Commonly used interventions like layer normalization and weight decay can substantially reduce loss of plasticity. Moreover, some researchers have argued that loss of plasticity is an artifact of artificial, hard boundaries between tasks: if the environment drifts more continuously, plasticity may be preserved.
The modern paradigm: scale and pretraining reduce interference
However, I believe that large pretrained models have more fundamentally changed the paradigm for continual learning and interference. Large language models are trained on huge quantities of data, over a series of stages that often include pretraining (possibly with longer sequences at later stages), midtraining, supervised fine-tuning, and reinforcement learning (RL) from various types of rewards. If language models experienced substantial catastrophic interference or loss of plasticity across these dramatically different distributions, these training pipelines simply would not work. So why do they?
One key piece of the answer is scale. Scale is something that I think prior continual learning research often got wrong. Historically, continual learning research focused on small models (sometimes only a few layers) trained on many tasks. However, more recently researchers have found that larger models exhibit substantially less interference. For example, wider models show less catastrophic forgetting on visual continual learning tasks, in part because gradients are sparser and more orthogonal between tasks in wider architectures. Indeed, it seems intuitive that a model with more capacity will show reduced interference between tasks.
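One way to see why more orthogonal gradients mean less interference is a first-order argument: after a small gradient step on a new task, the loss on an old task changes by roughly the negative learning rate times the dot product of the two tasks' gradients. The sketch below computes that quantity and the cosine similarity between gradients. The function names are hypothetical, and the approximation ignores curvature, so it is only indicative for small steps.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Cosine of the angle between two gradient vectors (0.0 if either is zero)."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm > 0 else 0.0


def first_order_interference(grad_new, grad_old, lr):
    """Approximate change in the old task's loss after one SGD step on the
    new task. Positive means the old task gets worse (interference);
    negative means it improves (transfer)."""
    return -lr * dot(grad_new, grad_old)
```

Orthogonal gradients give a value of zero, so the step leaves the old task's loss unchanged to first order. Opposed gradients give a positive value (interference), and aligned gradients give a negative one (positive transfer).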
Moreover, scale has a positive interaction with pretraining for both vision and language models: pretrained models forget less than models trained from scratch, and the benefits of pretraining increase with scale. (In fact, a relatively unknown cognitive science proceedings paper noted in 1993 that pretraining on related tasks reduces catastrophic interference in subsequent sequential learning.)
Larger models may memorize more data — but they also show less overfitting even when they memorize. Perhaps because they are less overfit to the training distribution, larger models have also been found to perform better on downstream tasks, even if their pretraining loss is similar to smaller models.
In a recent preprint, we’ve argued that processes like these drive many of the benefits of language model scaling: the reason larger language models learn more than smaller models is precisely because of their reduced interference and increased capacity for memorization, allowing the model to more effectively preserve its learning about rarer structures in the data until they are next encountered.
Beyond scale, other details of language model training may help to ameliorate interference. Strategies that smooth the distributional shift between phases of training, such as midtraining to bridge the pretraining and post-training distributions, may help to reduce interference and loss of plasticity. Language models already incorporate architectural features like normalization layers, and are often trained with techniques like weight decay, both of which can reduce loss of plasticity. Other simple strategies, such as regularizing with the KL divergence from the original model (as is often done in RL with LLMs), or using parameter-efficient fine-tuning methods that constrain updates to a subset of the parameters, likely reduce interference as well.
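As a sketch of the KL-regularization strategy mentioned above, the snippet below subtracts a penalty proportional to the KL divergence between the updated model's next-token distribution and the original model's from a scalar reward. It works on a single toy distribution over a handful of tokens; the function names and the `beta` coefficient are illustrative assumptions, and real pipelines compute this per token over sampled sequences.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two categorical distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kl_regularized_objective(reward, policy_logits, reference_logits, beta=0.1):
    """Reward minus a penalty for drifting away from the reference model."""
    policy = softmax(policy_logits)
    reference = softmax(reference_logits)
    return reward - beta * kl_divergence(policy, reference)
```

When the updated model still matches the original, the penalty is zero and the objective equals the raw reward. The further the distributions diverge, the more reward is needed to justify the change, which limits how far fine-tuning can pull the model from its earlier behavior.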
Thus, pretraining and scale — especially when coupled with relatively standard strategies for preserving prior knowledge — have substantially reduced the catastrophic interference and loss-of-plasticity problems in large language models, even if they have not eliminated them entirely.
In search of positive transfer & cumulative learning
However, avoiding interference is only one piece of continual learning. From the early days of continual learning research, it was observed that human and animal learning does not just avoid interference, but that learned tasks actively support one another — for example, after learning hundreds of mathematical concepts, a mathematician will be faster to learn the next one (forward positive transfer), and may even use it to improve their understanding of earlier concepts (backward positive transfer). There has been a renewed interest more recently in this kind of cumulative learning, where tasks build on each other and help to learn future tasks.2
It’s in this area that I believe current language models still show the weakest continual learning abilities. Despite having been trained on huge numbers of tasks, language models cannot necessarily learn an entirely new domain as efficiently and reliably as one might hope, given the incredible number of related prior tasks from which they could transfer.
For example, while pretrained large language models can often learn a new task effectively in context, there is not a universal recipe for transferring that knowledge beyond that context. However, there are a growing number of interesting explorations on how to help language models learn more effectively for the future, including:
Data augmentation: allowing a model to “learn by thinking” and generate further synthetic inferences from some data, that can then be distilled back into the model (e.g., 1, 2, 3, 4).
Document retrieval: explicitly retrieving documents from the training corpus (or another corpus) that are relevant to the present task (e.g., 1, 2, 3), which can allow more flexible transfer.
KV caches: a variety of papers have suggested clever strategies for adapting the KV cache (which normally stores just the key-value activations from the earlier parts of the present document), such as compressing its knowledge into fewer KV pairs (e.g., 1, 2). While these strategies are often simply intended for more efficient inference, compressed KV caches can sometimes be recomposed to integrate knowledge across multiple corpora — thus showing their potential for continual learning (e.g., 1, 2).
Textual memory: the model can also write succinct notes about its tasks, such as causal abstractions (e.g., 1). Indeed, most chat platforms now incorporate some form of textual memory for personalization (e.g., 1, 2, 3).
Context distillation and self-distillation: a variety of works have considered distilling a context into the model’s parameters by training it to make similar predictions without the context, or distilling from a teacher that has access to privileged information (e.g., 1, 2) — these approaches appear to offer an effective path to integrating skills from context into the weights, and have thus been suggested as a path to continual learning (e.g., 1, 2).
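One common way to write the context-distillation idea from the list above as an objective is sketched below. The notation is an assumption introduced here for illustration: c is the context to be absorbed, x is a query drawn from some distribution D, θ₀ denotes the frozen teacher parameters (in self-distillation, a copy of the model itself), and θ denotes the student parameters being trained.

```latex
\[
\mathcal{L}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D}}
    \left[
      D_{\mathrm{KL}}\!\left(
        p_{\theta_0}(\,\cdot \mid c, x)
        \;\middle\|\;
        p_{\theta}(\,\cdot \mid x)
      \right)
    \right]
\]
```

Minimizing this loss trains the student to predict, without the context, what the teacher predicts with it. The practical limit noted in the text still applies: each context distilled this way requires its own training signal and parameter updates.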
Nevertheless, while any of these approaches can partially ameliorate the problem of transferring knowledge forward from a particular learning experience, it is not clear whether they fundamentally solve it. A model cannot fit arbitrarily many retrieved documents, KVs, or textual memories in context; nor can it be trained on arbitrarily many synthetic documents or distilled from every context. Is there still something missing for enabling true continual learning systems?
Handing off between different learning systems: some directions for the future of continual learning
I see two possible paths forward for continual learning research, which involve different ways of handing off between different learning systems.
First, it’s possible that the pieces I listed above (letting the system write notes or other inferences, retrieve them or its experiences when needed, and then distill them back into the model for the future) are sufficient to achieve more effective continual learning. Combining these approaches is often constrained by practical considerations (such as the increased expense or technical difficulty of updating model parameters frequently, or differently for different users), rather than by any inadequacy of the approaches themselves. But because of algorithmic progress, the scale of models needed to achieve a given performance level is decreasing, and thus it may become increasingly feasible to deploy systems that combine several of these approaches.
However, it’s also possible that something more fundamental needs to change.
In the discussion of “the broader spectrum of in-context learning” we noted that there is currently an artificial boundary between in-context learning and the longer-term parametric learning of a language model — whereas natural intelligences do not have any such hard boundary.
Specifically, while natural intelligences do certainly have qualitatively different learning systems — from working memory to episodic memory, and neural plasticity — these learning systems are not so cleanly divided across timescales. Synaptic plasticity can operate on timescales ranging from milliseconds (short term synaptic plasticity) to a lifetime; episodic memory likewise can operate from within a minute to much longer time periods. Maybe allowing multiple systems of learning to work together across every timescale, instead of having entirely separate systems at different timescales, is needed to enable the system to more effectively learn cumulatively.
Thus, it is possible that for our AI systems to achieve efficient and cumulative continual learning, we need to remove the artificial boundaries we’ve introduced: both between the present context and the rest of their past experiences, and between the tokens to come in the present context and those to come in their future tasks.
1. It’s worth caveating up front that continual learning is not obviously a necessary feature for achieving any particular goal with artificial intelligence; a mathematician might learn a new domain more efficiently than a current language model, but that doesn’t mean language models cannot achieve new insights in an area, as indeed they have recently appeared to.
2. A related area of recent work focuses on prospective learning, where a system explicitly attempts to learn in a way that is future-oriented rather than focused on the current task distribution, for example by preplaying possible future tasks, or by using episodic memory to store learning experiences in a way that allows more flexible reuse in the future.



Thanks for this interesting post. I agree that in-context learning, memory-based systems, and parametric modifications are all forms of continual learning; in a recent paper (https://cl-eval.github.io/), we crystallised this taxonomy into three "levels" of continual learning (and we discuss how continual learning requires a rethinking of evaluation practice).
Andrew, when do you think that continual learning will be satisfactorily solved?