How cognitive science can contribute to AI: methods for understanding
#2 in a series on cognitive science and AI
In a prior post, I laid out some of the directions that try to build AI directly on what we know about the brain, and explained why I think they haven't generally been the most effective. In this post, I want to lay out a few areas where I do find cognitive science to provide consistently useful perspectives, both in the research I do and in the field more broadly.
How can cognitive scientists help in AI research?
At a high level, the places I’ve found my background in cognitive science to contribute most to my work in AI are:
Analyzing a complex phenomenon by distilling it down to its bare essentials, and understanding levels of analysis.
Knowing how to design a good experiment, and thinking critically about confounds and alternative explanations.
Making sense of complex behavioral datasets, and going beyond single summary statistics like overall accuracy.
Taking high-level inspiration from the behavioral phenomena of natural intelligence to think about learning pressures.
Applying rational analysis — i.e., making sense of behavior in terms of rational adaptation to “environmental” (training) pressures.
(I don’t want to claim that these are the only areas where the cognitive sciences can contribute to AI; I’m merely listing the places where I, personally, have found it to be most useful in my work. I’ll suggest some other relevant directions at the end.)
Note that very few of these are about taking specific *insights* about the architecture or computations of natural intelligence and applying them to improve AI. Instead, these strategies focus on how to thoroughly understand the systems that we have (including their weaknesses), by using *techniques* similar to those we use in cognitive (neuro)science to understand natural intelligence. I think that cognitive scientists usually have unique strengths in making sense of the complex behaviors of a complex system, and that these strengths are often more useful than any particular insights we might have about natural intelligence.
In the remainder of this post, I’ll give a few examples from our work that illustrate some of these themes.
Rational analysis of in-context learning as adaptation to simple data pressures
Much of the excitement around language models (from the research side) started with the surprising observation that large language models somehow acquired the ability to do few-shot learning in-context. The field had recently been building lots of models that could do this, but they had been trained specifically for that task using techniques like meta-learning — that is, heavily engineering the data and training process to focus on the ability. It was surprising to find that models could learn it simply from natural data, without any of the heavy-handed meta-learning techniques in prior work.
My colleague Stephanie Chan (Ph.D. in computational neuroscience!) set out to understand how this could be. In particular, meta-learning approaches for few-shot classification relied on putting multiple examples of the same class into each training step, and on randomizing the labels every time a new task was introduced, to prevent the model from memorizing the answers and instead force it to learn from the examples. Obviously, language models trained on internet data don't have that label randomization or consistent data structure. So how could they learn how to learn something new?
The key insight turned on several well-known properties of natural language data: its heavy-tailed, power-law distribution over words (Zipf's law from linguistics: the observation that a word's frequency is inversely proportional to its frequency rank), and its burstiness (the fact that if a word or phrase appears once in a document, it is much more likely to appear again later in that document than in another random sample from the internet). Essentially, in a long, heavy-tailed distribution there are too many rare-but-occasionally-encountered things to readily memorize, while burstiness provides the structure needed for the alternative strategy of learning in context. By creating controlled datasets in which these distributional factors were modulated, and training models from scratch, Stephanie showed that these two factors alone sufficed to produce the emergent ability to learn new categories in context. (Stephanie also made a number of other interesting observations, like the fact that the optimal tradeoff between memorizing common things and learning novel ones in context occurs at a word-frequency distribution with a power-law exponent of around 1, which is empirically the exponent found in almost all natural languages.)
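To make that kind of controlled dataset concrete, here is a minimal sketch (in Python, with illustrative parameters rather than the paper's actual setup) of how one might generate training sequences whose class frequencies follow a power law with exponent alpha, and whose within-sequence structure is either bursty or not:

```python
import numpy as np

def zipfian_class_probs(n_classes, alpha=1.0):
    """Marginal class frequencies follow a power law: p(rank) ~ rank^(-alpha)."""
    ranks = np.arange(1, n_classes + 1)
    weights = ranks.astype(float) ** (-alpha)
    return weights / weights.sum()

def sample_sequence(probs, seq_len=16, bursty=True, classes_per_seq=2, rng=None):
    """Sample one training sequence of class labels.

    bursty=True: a few classes recur within the sequence (the local structure
    that supports learning in context). bursty=False: labels are drawn i.i.d.
    from the marginal distribution, so there is nothing to learn in context.
    """
    rng = rng or np.random.default_rng()
    if bursty:
        # Pick a handful of classes for this "document", then repeat them.
        chosen = rng.choice(len(probs), size=classes_per_seq, replace=False, p=probs)
        return rng.choice(chosen, size=seq_len)
    return rng.choice(len(probs), size=seq_len, p=probs)

rng = np.random.default_rng(0)
probs = zipfian_class_probs(n_classes=1000, alpha=1.0)  # exponent ~1, as in natural language
print(sample_sequence(probs, bursty=True, rng=rng))   # a few classes, repeated
print(sample_sequence(probs, bursty=False, rng=rng))  # mostly distinct, frequency-weighted
```

Training small models from scratch on sequences like these, while sweeping the exponent, the number of classes, and the burstiness, is the kind of manipulation that lets you ask which data properties are actually responsible for in-context learning.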
This study exemplifies distilling a phenomenon to its most elementary principles: the few data factors that suffice to give rise to it. The inspiration for what those principles should be came directly from what was already known about the structure of natural data in linguistics and other areas of cognitive science. This work can also be interpreted as a kind of rational analysis: an interpretation of why in-context learning is a rational response to the optimization pressures of sufficiently long-tailed data with local bursty structure.
This type of rational analysis has become increasingly popular, e.g. for understanding why chain-of-thought reasoning abilities are a rational response to local structure, and we’ve made a conceptual argument that this same kind of analysis can illuminate a much broader spectrum of in-context behaviors. Similar analyses based on Bayesian models have also been used to argue why transitions from in-context learning to memorization are rational. These analyses can also be applied to make sense of when models might fail, as in our new work that uses a similar level of analysis to argue that language models and RL agents fail to capture latent information in their training data, unless they have access to it via test-time retrieval (or train-time augmentation). In subsequent studies Stephanie and her collaborators have dug deeper into the mechanisms of in-context learning using intervention methods inspired by neuroscience.
These lines of work demonstrate a variety of ways that the methods and analysis techniques of cognitive (neuro)science can shed new light on why language models do the things they do.
Explanations as a learning signal
One way I’ve often found cognitive science useful is as a source for thinking about learning signals. Cognitive scientists, and especially developmental researchers, often debate which cues people rely on to infer things like the referents of words or an entity’s goals. These cues can be implicit or explicit, and come in many forms.
A particular kind of cue I’ve been interested in over the years is the role of explanations as a learning signal. Explanations are thought to be incredibly important to human learning; they can shape what we learn and how we generalize. This kind of learning signal is often missing from classic AI methods like Reinforcement Learning (RL). However, explanations do appear in the training data of language models.
Thus, we embarked on a series of studies to evaluate how explanations could affect learning in AI. We first studied RL; we designed tasks that were very hard for agents to learn from rewards alone, and then showed that language modeling of explanations could enable them to learn. In settings with ambiguous rewards, providing different kinds of explanations could even change how the agents generalize out-of-distribution! We also showed that non-explanatory alternatives (such as true-but-not-currently-relevant statements about the situation) did not offer these benefits; thus, explanations have a special role to play.
In a subsequent study, we considered the role of providing explanations after the answer during in-context learning. Again, we designed careful experiments and compared against carefully matched ablations (true-but-non-explanatory statements again, as well as including the same explanations in a few-shot prompt but applying them to the wrong problems). As before, we found that explanations could play a unique role.
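To illustrate the kind of controlled comparison involved (a sketch with made-up items, not the tasks or prompts from the paper), the conditions differ only in what follows each answer in the few-shot prompt:

```python
# Hypothetical few-shot items; the real tasks, answers, and explanations differ.
examples = [
    {"q": "Is 14 even?", "a": "Yes", "expl": "because 14 is divisible by 2",
     "true_irrelevant": "14 is greater than 10"},
    {"q": "Is 9 even?", "a": "No", "expl": "because 9 leaves a remainder when divided by 2",
     "true_irrelevant": "9 has a single digit"},
]

def build_prompt(examples, condition):
    """Assemble a few-shot prompt under one experimental condition."""
    expls = [ex["expl"] for ex in examples]
    if condition == "scrambled":
        # Same explanations, but paired with the wrong problems.
        expls = expls[1:] + expls[:1]
    lines = []
    for ex, expl in zip(examples, expls):
        line = f"Q: {ex['q']} A: {ex['a']}."
        if condition in ("explanation", "scrambled"):
            line += f" Explanation: {expl}."
        elif condition == "true_nonexplanatory":
            line += f" Note: {ex['true_irrelevant']}."
        lines.append(line)
    return "\n".join(lines) + "\nQ: Is 21 even? A:"

for cond in ["explanation", "true_nonexplanatory", "scrambled", "no_explanation"]:
    print(f"--- {cond} ---\n{build_prompt(examples, cond)}\n")
```

The point of matched controls like these is that all conditions share the same questions and answers, so differences in performance on the held-out question can be attributed to the explanatory content itself rather than to surface features of the prompt.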
Finally, we built on these studies and other past works to ask a fundamental question about language models: what can they learn about causal structures and causal reasoning from their passive imitation training? We showed how explanations, along with other data features like the inherent interventions behind the generation of internet text, can help to unlock generalizable causal reasoning even in the case of passive learning.
This series of studies illustrates how taking inspiration from a high-level aspect of human cognition, at a mostly behavioral level, can provide new perspectives on understanding and improving AI. A persistent theme across the papers is the use of good experimental design and careful controls to eliminate confounds and alternative hypotheses, such as the use of simpler sentence-level associations.
In the in-context explanations part of the work, we also drew on a particular family of analytic methods often used in the cognitive sciences: mixed models. These models allocate uncertainty appropriately, yielding more precise effect estimates when there are many non-independent measurements (as when a human is tested on multiple problems, or when we evaluate language models on tasks using multiple prompts combined with different explanation conditions). Although these methods aren’t as widely used in AI, they are increasingly relevant for making sense of the complex behavioral datasets produced by language model evaluations, and I hope they will be adopted more often. A nice introduction to these methods, as well as to many other principles of experiment design and analysis from the cognitive sciences, can be found at experimentology.io.
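As a minimal sketch of what this looks like in practice (using statsmodels, with hypothetical column names rather than our actual analysis code), a random intercept per evaluation item accounts for the fact that scores from the same prompt or problem are not independent:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format results: one row per (item, condition) evaluation.
df = pd.read_csv("eval_results.csv")  # columns: item_id, condition, score

# Fixed effect of condition, random intercept for each item; this allocates
# variance between items and conditions instead of treating every row as an
# independent observation.
model = smf.mixedlm("score ~ C(condition)", data=df, groups=df["item_id"])
result = model.fit()
print(result.summary())
```

For binary outcomes like per-item accuracy, a logistic mixed model (e.g., lme4's glmer in R) is the more standard choice.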
Challenges of bridging levels of analysis in interpretability: representation biases and unfaithful simplifications
In his 1982 book Vision, David Marr articulated three levels at which a computational system could be analyzed:
Computational: What is the system’s goal and the high-level strategy by which it achieves it?
Algorithmic: How is this strategy implemented; what representations are used and how are they transformed to execute the computation?
Implementational: How can the representations and algorithms be realized physically?
These three levels of analysis have been highly influential in computational cognitive science and neuroscience (and even in AI). However, much subsequent work has highlighted challenges to the framework, including interactions between the levels.
The levels framework, and its challenges, are useful context for understanding the goals and difficulties of mechanistic interpretability, which, like neuroscience, attempts to make sense of a system through its internal activity. In our recent work studying the challenges of interpretability, we’ve drawn on these debates, and particularly on the difficulties of linking representations and algorithms to the higher, computational level.
For example, we’ve highlighted how biases in the representations that models learn (for instance, towards simpler or more prevalent features over more complex or rarer ones, even when both features play an equivalent computational role) can mislead us about the computations of the system. Common analyses of representations (like PCA, or the loss used for training SAEs) implicitly or explicitly assume that stronger signals in the representations are more important, but we’ve found that this isn’t always true. This highlights a complexity in the relationship between the algorithmic and computational levels that poses a challenge for making inferences from one to the other.
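Here is a toy illustration of that gap (not our actual experimental setup): two features contribute equally to the computation, but one is encoded with a much larger norm, so a variance-based analysis like PCA attributes nearly all of the "important" structure to it:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 32

# Two binary features with an equivalent computational role:
# the output depends on both of them equally.
f1 = rng.integers(0, 2, n)
f2 = rng.integers(0, 2, n)
output = f1 ^ f2  # needs both features

# But the representation encodes f1 much more strongly than f2.
dir1, dir2 = rng.normal(size=(2, d))
reps = (5.0 * f1[:, None] * dir1      # large-norm encoding of f1
        + 0.3 * f2[:, None] * dir2    # small-norm encoding of f2
        + 0.05 * rng.normal(size=(n, d)))

# A variance-based summary (the top principal component) sees mostly f1.
centered = reps - reps.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
proj = centered @ vt[0]
corr = lambda a, b: abs(np.corrcoef(a, b)[0, 1])
print("corr(top-PC projection, f1):", round(corr(proj, f1), 3))  # close to 1
print("corr(top-PC projection, f2):", round(corr(proj, f2), 3))  # close to 0
# Yet discarding the weak f2 direction would make the output (f1 XOR f2)
# unrecoverable, because the computation needs both features.
```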
Similarly, we’ve studied how simplifying a system in order to interpret it can lead to unfaithful interpretations that only reliably describe the system’s behavior on the training distribution. Many approaches to interpretability replace a model with a simplified proxy model, e.g., by treating soft attention as though it were hard, or (as above) by assuming that only the dominant features matter. We show how these kinds of approximations can break down in edge cases: the simplified proxy behaves like the original model in distribution but makes systematically different predictions out of distribution. For example, in cases where the simplified model fails out of distribution, the original model may still perform well, or vice versa. Again, this mismatch stems from a failure to appropriately consider the mapping between levels, and in particular the possibility that a computational description of a system may only be faithful to its algorithmic details over certain distributions.
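A toy version of this failure mode (a sketch, not one of the models from the paper): replace soft attention over two values with a hard argmax proxy. On a training-like distribution where one score clearly dominates, the two agree; on a shifted distribution where the scores are nearly tied, they diverge:

```python
import numpy as np

rng = np.random.default_rng(1)
values = np.array([0.0, 1.0])

def soft_model(scores):
    """'Original' model: softmax-weighted average of the values."""
    w = np.exp(scores) / np.exp(scores).sum()
    return w @ values

def hard_proxy(scores):
    """Simplified proxy: treat attention as hard (argmax only)."""
    return values[np.argmax(scores)]

# In distribution: one score clearly dominates, so the proxy tracks the model.
in_dist = [np.array([4.0, 0.0]) + rng.normal(0, 0.3, 2) for _ in range(1000)]
# Out of distribution: the scores are nearly tied, and the proxy diverges.
out_dist = [np.array([0.2, 0.0]) + rng.normal(0, 0.3, 2) for _ in range(1000)]

for name, batch in [("in-distribution", in_dist), ("out-of-distribution", out_dist)]:
    gap = np.mean([abs(soft_model(s) - hard_proxy(s)) for s in batch])
    print(f"{name}: mean |original - proxy| = {gap:.3f}")
```

The proxy is a reasonable description of the computation on the first distribution and a misleading one on the second, which is exactly the sense in which a simplified interpretation can be faithful only over certain distributions.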
These examples illustrate how applying frameworks from neuroscience can help us think about research methodologies in AI, and their challenges. They also illustrate how reducing a phenomenon (e.g., the relationship between a system’s representations and its computations) down to its simplest instantiations (small models trained on simple data that we understand) can yield new insights into the phenomena at play in more complex settings.
Summary
I think these examples illustrate the themes I laid out above: how cognitive science can teach AI about methods for probing and making sense of complex systems, through careful experiments, analysis, and reducing phenomena down to their simplest instantiations and models.
Of course, there are many other exciting directions that I haven’t touched on here, or explored as much, but where cognitive science could be especially impactful — particularly in thinking about how AI can be made to help people most effectively. For example:
Work on human-AI interaction that draws on perspectives from cognitive science for understanding social aspects, building interfaces that readily expose useful information without being overwhelming, etc.
Thinking about opportunities and risks for AI’s effects on mental health.
And opportunities and risks for its effects on democracy.
Work on AI for tutoring or education that builds on what we know about human learning.
Etc.
I believe that work on understanding the fundamentals of what AI systems learn, using techniques like those I outlined above, goes hand-in-hand with thinking about AI in more applied areas like these.