Like most sciences, neuroscience has historically sought causal explanations for empirical phenomena. Machine learning, in contrast, has historically sought to engineer systems capable of prediction. Lately, however, this division has been dissolving: Neuroscience has become increasingly concerned with prediction, adopting machine-learning methods, while machine learning has become increasingly concerned with causal explanation, adopting neuroscience methods.
Before discussing the implications of this swap, let’s examine a few examples. Brain-Score, an effort to evaluate models based on their ability to predict neural responses, illustrates neuroscience evolving into a predictive discipline. The platform includes a set of quantitative benchmarks, such as neural recordings, along with a leaderboard of models. A parallel effort, inspired by machine learning, has been the development of “foundation models” for neuroscience, trained on vast amounts of neural data and tested on their predictive ability.
On the machine-learning side, mechanistic interpretability research illustrates the field’s transition into an explanatory discipline: It has emerged to identify “circuit” mechanisms within machine-learning systems trained for prediction. In contrast to earlier interpretability research that focused on relationships between inputs and outputs (why a system denied one person a loan but not another, for example), mechanistic interpretability seeks relationships between computing elements within the system. The connection to neuroscience is explicit: The field proposes a sort of connectomics for artificial systems. As Chris Olah, co-founder of Anthropic, and his colleagues wrote in an online review in 2020:
What if we treated individual neurons, even individual weights, as being worthy of serious investigation? What if we were willing to spend thousands of hours tracing through every neuron and its connections? What kind of picture of neural networks would emerge?
Neuroscientists have responded enthusiastically to this call, bringing their tools, ideas and explanatory frameworks. These have included analyses of single-neuron tuning and population-level representational similarity, characterizations of nonlinear dynamics, and circuit ablations, among other approaches. Even when machine-learning researchers are not explicitly borrowing tools from neuroscience, they often end up reinventing similar tools.
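To make one of these borrowed tools concrete, here is a minimal sketch of representational similarity analysis (RSA) applied to two hypothetical networks: each system is summarized by a stimulus-by-stimulus dissimilarity matrix, and the two matrices are compared with a rank correlation. The networks, stimuli and numbers are all illustrative stand-ins, not any published pipeline.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
stimuli = rng.standard_normal((20, 64))        # 20 hypothetical stimuli

def layer_responses(seed):
    # Stand-in for a layer of a trained network: random projection + ReLU.
    W = np.random.default_rng(seed).standard_normal((64, 100))
    return np.maximum(stimuli @ W, 0.0)

# Each system is summarized by a representational dissimilarity matrix
# (condensed form): one dissimilarity per pair of stimuli.
rdm_a = pdist(layer_responses(1), metric="correlation")
rdm_b = pdist(layer_responses(2), metric="correlation")

rho, _ = spearmanr(rdm_a, rdm_b)
print(f"RSA score (Spearman correlation between RDMs): {rho:.2f}")
```

The appeal of the method is visible even in this toy form: it compares two systems without requiring any unit-to-unit correspondence between them.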
However, I argue that these trends may not (on their own) bring us closer to understanding neural systems; prediction cannot completely supplant explanation in neuroscience without sacrificing important insights. And explanation in machine learning inevitably runs into the same problems facing explanation in neuroscience, namely that complex systems do not easily yield to the kinds of tools commonly used in neuroscience. Ironically, this has been recognized by machine-learning researchers (and a few philosophers) but has still not penetrated the neuroscience discourse.
Causal-mechanistic explanations in neuroscience, as in other sciences, attempt to discard “spurious correlations” that might nonetheless be useful for prediction. For example, L-DOPA can have side effects, such as involuntary movements and headaches, which are correlated with its ameliorative effects on Parkinson’s symptoms. A machine-learning algorithm might be able to predict the ameliorative effects from the side effects, but it’s generally understood that the side effects do not cause the ameliorative effects. Treating the side effects (taking Tylenol for headaches, for example) without affecting the hypothesized causal mechanism, dopamine, should leave the Parkinson’s symptoms unchanged.
Although this seems to drive a strong wedge between prediction and causal-mechanistic explanation, current thinking in machine learning and statistics links the two: Causal-mechanistic explanation is invariant prediction. A predictive algorithm might be able to exploit spurious correlations in observational data, but those predictions would fail under certain interventions, as in the Tylenol example above. The causal mechanisms are the predictive relations that persist even when spurious correlations are removed.
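To make the idea of invariant prediction concrete, here is a toy simulation of the L-DOPA example. The generative model, variable names and coefficients are all hypothetical: dose causes both dopamine restoration (the mechanism) and headaches (a side effect), and relief is caused by dopamine alone.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Observational regime: dose causes both dopamine restoration (the
# mechanism) and headaches (a side effect); relief is caused by dopamine.
dose = rng.uniform(0.0, 1.0, n)
dopamine = dose + 0.1 * rng.standard_normal(n)
headache = dose + 0.1 * rng.standard_normal(n)
relief = dopamine + 0.1 * rng.standard_normal(n)

# Fit simple least-squares predictors of relief from each variable.
b_head = np.polyfit(headache, relief, 1)
b_dopa = np.polyfit(dopamine, relief, 1)

# Interventional regime: "Tylenol" removes headaches without touching
# dose or dopamine, so the spurious predictor should break.
headache_tx = 0.1 * rng.standard_normal(n)
relief_tx = dopamine + 0.1 * rng.standard_normal(n)

def mse(b, x, y):
    return np.mean((np.polyval(b, x) - y) ** 2)

print(f"headache predictor: obs {mse(b_head, headache, relief):.3f} "
      f"-> intervention {mse(b_head, headache_tx, relief_tx):.3f}")
print(f"dopamine predictor: obs {mse(b_dopa, dopamine, relief):.3f} "
      f"-> intervention {mse(b_dopa, dopamine, relief_tx):.3f}")
```

Both predictors perform comparably on observational data; only the one that runs through the mechanism survives the intervention, which is the sense in which causal-mechanistic explanation is invariant prediction.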
Invariant prediction may be a necessary condition for causality, but it does not by itself shed light on causal mechanisms. This requires measurement and manipulation of a system’s component processes in order to know which predictive relations persist under which interventions. Approaches that focus purely on prediction, such as Brain-Score and neural foundation models, cannot on their own supplant explanation, assuming neuroscientists will continue to care about explanation as an epistemic goal.
Machine-learning researchers have recognized the importance of a more interventionist approach to causal-mechanistic explanation, motivated by a variety of concerns, including alignment, safety and debugging. The most influential approach is based on the “circuit hypothesis,” in which specific subnetworks of an artificial network drive specific behaviors. Neuroscience seems to offer the perfect tool kit for identifying such circuits: analysis of single-neuron and population-level tuning, brain stimulation and ablation/silencing. However, several pessimistic results indicate insurmountable “complexity barriers” for circuit reduction. For example, comprehensive circuit understanding requires, in the worst case, a number of interventions (silencing subsets of neurons) that grows exponentially with the number of neurons. Such intractability afflicts even approximate circuit understanding.
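A back-of-the-envelope sketch illustrates the scaling problem: exhaustively mapping ablation effects means silencing every subset of units, and the number of subsets is 2^n. The tiny random network below is purely illustrative.

```python
from itertools import chain, combinations
import numpy as np

rng = np.random.default_rng(1)
n_hidden = 12                                  # 2**12 = 4,096 ablation experiments
W1 = rng.standard_normal((n_hidden, 4))
w2 = rng.standard_normal(n_hidden)
x = rng.standard_normal(4)                     # a fixed probe input

def output(silenced):
    h = np.maximum(W1 @ x, 0.0)                # ReLU hidden layer
    h[list(silenced)] = 0.0                    # intervention: silence these units
    return w2 @ h

baseline = output(())
all_subsets = chain.from_iterable(
    combinations(range(n_hidden), k) for k in range(n_hidden + 1))
effects = {s: abs(output(s) - baseline) for s in all_subsets}

print(f"{len(effects):,} interventions needed for {n_hidden} units")
# For 24 units the same sweep would take ~16.8 million runs; for 300
# units, more runs than there are atoms in the observable universe.
```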
Another cherished assumption in neuroscience is that interventions can be used to establish functional localization: If stimulating or silencing particular neurons changes the system’s behavior in a specific way, researchers typically conclude that those neurons are functionally responsible for the change. But evidence from machine learning shows that such procedures can give rise to localization illusions, in which an intervention erroneously ties a subnetwork to a specific function. It is also possible to change the system output in specific ways by modifying synaptic weights that are outside the subnetwork identified by functional localization. Another pessimistic result shows that standard dimensionality-reduction techniques, widely used in neuroscience, can give rise to interpretability illusions: Even when the low-dimensional representations provide an adequate summary of the model behavior on the training data, they can fail when the model is tested on a new data distribution.
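The following sketch, loosely in the spirit of those results, shows how such an illusion can arise. The setup is hypothetical: a readout depends on one activation axis that, in the training distribution, happens to covary with another; a top-k PCA summary of the training activations appears to capture the behavior, then breaks down when the correlation disappears in a new distribution.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, k = 50, 5_000, 10

def activations(correlated):
    x = rng.standard_normal((n, d))
    if correlated:                     # training regime: unit 0 and unit -1 covary
        x[:, -1] = x[:, 0] + 0.01 * rng.standard_normal(n)
    return x

readout = np.zeros(d); readout[-1] = 10.0      # behavior reads out unit -1 only

train = activations(correlated=True)
_, _, Vt = np.linalg.svd(train - train.mean(0), full_matrices=False)
P = Vt[:k].T @ Vt[:k]                          # projector onto the top-k PCA subspace

def rel_err(x):
    y, y_hat = x @ readout, (x @ P) @ readout  # true vs. summary-based behavior
    return np.mean((y - y_hat) ** 2) / np.var(y)

print("train distribution:", rel_err(train))                  # ~0: summary looks adequate
print("new distribution:  ", rel_err(activations(False)))     # ~0.5: summary fails
```

The low-dimensional summary is faithful on the data it was fit to and silently wrong off-distribution, which is exactly what makes the illusion hard to detect.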
These observations should give pause to those who hope that the tools of neuroscience will be powerful assets to machine learning. They should also give pause to those who hope that the tools of neuroscience will be powerful assets to neuroscience! Indeed, it has been known for nearly a decade that neuroscience tools can fail to unravel even modestly complex computational circuitry. Yet these tools continue to be used in neuroscience, largely because we have not yet devised better alternatives.
To end on a more positive note, I think that the cross talk between machine learning and neuroscience has been very valuable, if only for revealing the limitations of our tools and the frailty of our assumptions. My hope is that a continued dialogue will be the starting point for new approaches.
To get a sense of how the broader neuroscience community is thinking about these issues, I asked eight neuroscientists to weigh in on several questions: Can we replace explanation with prediction in neuroscience? Is circuit mapping an adequate explanatory framework for deep learning? Is it an adequate explanatory framework for neuroscience?
