Like most sciences, neuroscience has historically sought causal explanations for empirical phenomena. Machine learning, in contrast, has historically sought to engineer systems capable of prediction. Lately, however, this division has been dissolving: Neuroscience has become increasingly concerned with prediction, adopting machine-learning methods, while machine learning has become increasingly concerned with causal explanation, adopting neuroscience methods.
Before discussing the implications of this swap, let’s examine a few examples. Brain-Score, an effort to evaluate models based on their ability to predict neural responses, illustrates neuroscience evolving into a predictive discipline. The platform includes a set of quantitative benchmarks, such as neural recordings, along with a leaderboard of models. A parallel effort, inspired by machine learning, has been the development of “foundation models” for neuroscience, trained on vast amounts of neural data and tested on their predictive ability.
On the machine-learning side, mechanistic interpretability research illustrates the field’s transition into an explanatory discipline: It has emerged to identify “circuit” mechanisms within machine-learning systems trained for prediction. In contrast to earlier interpretability research that focused on relationships between inputs and outputs (why a system denied one person a loan but not another, for example), mechanistic interpretability seeks relationships between computing elements within the system. The connection to neuroscience is explicit: The field proposes a sort of connectomics for artificial systems. As Chris Olah, co-founder of Anthropic, and his colleagues wrote in an online review in 2020:
What if we treated individual neurons, even individual weights, as being worthy of serious investigation? What if we were willing to spend thousands of hours tracing through every neuron and its connections? What kind of picture of neural networks would emerge?
Neuroscientists have responded enthusiastically to this call, bringing their tools, ideas and explanatory frameworks. These have included analyses of single-neuron tuning and population-level representational similarity, characterizations of nonlinear dynamics, and circuit ablations, among other approaches. Even when machine-learning researchers are not explicitly borrowing tools from neuroscience, they often end up reinventing similar tools.
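To make one of these borrowed tools concrete, here is a minimal sketch of representational similarity analysis (RSA) applied to two hypothetical networks: each system is summarized by a stimulus-by-stimulus dissimilarity matrix, and the two matrices are compared with a rank correlation. The networks, stimuli and numbers are all illustrative stand-ins, not any published pipeline.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
stimuli = rng.standard_normal((20, 64))        # 20 hypothetical stimuli

def layer_responses(seed):
    # Stand-in for a layer of a trained network: random projection + ReLU.
    W = np.random.default_rng(seed).standard_normal((64, 100))
    return np.maximum(stimuli @ W, 0.0)

# Each system is summarized by a representational dissimilarity matrix
# (condensed form): one dissimilarity per pair of stimuli.
rdm_a = pdist(layer_responses(1), metric="correlation")
rdm_b = pdist(layer_responses(2), metric="correlation")

rho, _ = spearmanr(rdm_a, rdm_b)
print(f"RSA score (Spearman correlation between RDMs): {rho:.2f}")
```

The appeal of the method is visible even in this toy form: it compares two systems without requiring any unit-to-unit correspondence between them.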
However, I argue that these trends may not (on their own) bring us closer to understanding neural systems; prediction cannot completely supplant explanation in neuroscience without sacrificing important insights. And explanation in machine learning inevitably runs into the same problems facing explanation in neuroscience, namely that complex systems do not easily yield to the kinds of tools commonly used in neuroscience. Ironically, this has been recognized by machine-learning researchers (and a few philosophers) but has still not penetrated the neuroscience discourse.
Causal-mechanistic explanations in neuroscience, as in other sciences, attempt to discard “spurious correlations” that might nonetheless be useful for prediction. For example, L-DOPA can have side effects, such as involuntary movements and headaches, which are correlated with its ameliorative effects on Parkinson’s symptoms. A machine-learning algorithm might be able to predict the ameliorative effects from the side effects, but it’s generally understood that the side effects do not cause the ameliorative effects. Treating the side effects (taking Tylenol for headaches, for example) without affecting the hypothesized causal mechanism, dopamine, should leave the Parkinson’s symptoms unchanged.
Although this seems to drive a strong wedge between prediction and causal-mechanistic explanation, current thinking in machine learning and statistics links the two: Causal-mechanistic explanation is invariant prediction. A predictive algorithm might be able to exploit spurious correlations in observational data, but those predictions would fail under certain interventions, as in the Tylenol example above. The causal mechanisms are the predictive relations that persist even when spurious correlations are removed.
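To make the idea of invariant prediction concrete, here is a toy simulation of the L-DOPA example. The generative model, variable names and coefficients are all hypothetical: dose causes both dopamine restoration (the mechanism) and headaches (a side effect), and relief is caused by dopamine alone.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Observational regime: dose causes both dopamine restoration (the
# mechanism) and headaches (a side effect); relief is caused by dopamine.
dose = rng.uniform(0.0, 1.0, n)
dopamine = dose + 0.1 * rng.standard_normal(n)
headache = dose + 0.1 * rng.standard_normal(n)
relief = dopamine + 0.1 * rng.standard_normal(n)

# Fit simple least-squares predictors of relief from each variable.
b_head = np.polyfit(headache, relief, 1)
b_dopa = np.polyfit(dopamine, relief, 1)

# Interventional regime: "Tylenol" removes headaches without touching
# dose or dopamine, so the spurious predictor should break.
headache_tx = 0.1 * rng.standard_normal(n)
relief_tx = dopamine + 0.1 * rng.standard_normal(n)

def mse(b, x, y):
    return np.mean((np.polyval(b, x) - y) ** 2)

print(f"headache predictor: obs {mse(b_head, headache, relief):.3f} "
      f"-> intervention {mse(b_head, headache_tx, relief_tx):.3f}")
print(f"dopamine predictor: obs {mse(b_dopa, dopamine, relief):.3f} "
      f"-> intervention {mse(b_dopa, dopamine, relief_tx):.3f}")
```

Both predictors perform comparably on observational data; only the one that runs through the mechanism survives the intervention, which is the sense in which causal-mechanistic explanation is invariant prediction.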
Invariant prediction may be a necessary condition for causality, but it does not by itself shed light on causal mechanisms. This requires measurement and manipulation of a system’s component processes in order to know which predictive relations persist under which interventions. Approaches that focus purely on prediction, such as Brain-Score and neural foundation models, cannot on their own supplant explanation, assuming neuroscientists will continue to care about explanation as an epistemic goal.
Machine-learning researchers have recognized the importance of a more interventionist approach to causal-mechanistic explanation, motivated by a variety of concerns, including alignment, safety and debugging. The most influential approach is based on the “circuit hypothesis,” in which specific subnetworks of an artificial network drive specific behaviors. Neuroscience seems to offer the perfect tool kit for identifying such circuits: analysis of single-neuron and population-level tuning, brain stimulation and ablation/silencing. However, several pessimistic results indicate insurmountable “complexity barriers” for circuit reduction. For example, comprehensive circuit understanding requires, in the worst case, a number of interventions (silencing subsets of neurons) that grows exponentially with the number of neurons. Such intractability afflicts even approximate circuit understanding.
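A back-of-the-envelope sketch illustrates the scaling problem: exhaustively mapping ablation effects means silencing every subset of units, and the number of subsets is 2^n. The tiny random network below is purely illustrative.

```python
from itertools import chain, combinations
import numpy as np

rng = np.random.default_rng(1)
n_hidden = 12                                  # 2**12 = 4,096 ablation experiments
W1 = rng.standard_normal((n_hidden, 4))
w2 = rng.standard_normal(n_hidden)
x = rng.standard_normal(4)                     # a fixed probe input

def output(silenced):
    h = np.maximum(W1 @ x, 0.0)                # ReLU hidden layer
    h[list(silenced)] = 0.0                    # intervention: silence these units
    return w2 @ h

baseline = output(())
all_subsets = chain.from_iterable(
    combinations(range(n_hidden), k) for k in range(n_hidden + 1))
effects = {s: abs(output(s) - baseline) for s in all_subsets}

print(f"{len(effects):,} interventions needed for {n_hidden} units")
# For 24 units the same sweep would take ~16.8 million runs; for 300
# units, more runs than there are atoms in the observable universe.
```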
Another cherished assumption in neuroscience is that interventions can be used to establish functional localization: If stimulating or silencing particular neurons changes the system’s behavior in a specific way, researchers typically conclude that those neurons are functionally responsible for the change. But evidence from machine learning shows that such procedures can give rise to localization illusions, in which an intervention erroneously ties a subnetwork to a specific function. It is also possible to change the system output in specific ways by modifying synaptic weights that are outside the subnetwork identified by functional localization. Another pessimistic result shows that standard dimensionality-reduction techniques, widely used in neuroscience, can give rise to interpretability illusions: Even when the low-dimensional representations provide an adequate summary of the model behavior on the training data, they can fail when the model is tested on a new data distribution.
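The following sketch, loosely in the spirit of those results, shows how such an illusion can arise. The setup is hypothetical: a readout depends on one activation axis that, in the training distribution, happens to covary with another; a top-k PCA summary of the training activations appears to capture the behavior, then breaks down when the correlation disappears in a new distribution.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, k = 50, 5_000, 10

def activations(correlated):
    x = rng.standard_normal((n, d))
    if correlated:                     # training regime: unit 0 and unit -1 covary
        x[:, -1] = x[:, 0] + 0.01 * rng.standard_normal(n)
    return x

readout = np.zeros(d); readout[-1] = 10.0      # behavior reads out unit -1 only

train = activations(correlated=True)
_, _, Vt = np.linalg.svd(train - train.mean(0), full_matrices=False)
P = Vt[:k].T @ Vt[:k]                          # projector onto the top-k PCA subspace

def rel_err(x):
    y, y_hat = x @ readout, (x @ P) @ readout  # true vs. summary-based behavior
    return np.mean((y - y_hat) ** 2) / np.var(y)

print("train distribution:", rel_err(train))                  # ~0: summary looks adequate
print("new distribution:  ", rel_err(activations(False)))     # ~0.5: summary fails
```

The low-dimensional summary is faithful on the data it was fit to and silently wrong off-distribution, which is exactly what makes the illusion hard to detect.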
These observations should give pause to those who hope that the tools of neuroscience will be powerful assets to machine learning. They should also give pause to those who hope that the tools of neuroscience will be powerful assets to neuroscience! Indeed, it has been known for nearly a decade that neuroscience tools can fail to unravel even modestly complex computational circuitry. Yet these tools continue to be used in neuroscience, largely because we have not yet devised better alternatives.
To end on a more positive note, I think that the cross talk between machine learning and neuroscience has been very valuable, if only for revealing the limitations of our tools and the frailty of our assumptions. My hope is that a continued dialogue will be the starting point for new approaches.
To get a sense of how the broader neuroscience community is thinking about these issues, I asked eight neuroscientists to weigh in on several questions: Can we replace explanation with prediction in neuroscience? Is circuit mapping an adequate explanatory framework for deep learning? Is it an adequate explanatory framework for neuroscience?
