Anthropic’s artificial intelligence (AI), Claude, like other large language models (LLMs), appears to express emotions ranging from joy to despair when interacting with human users. In a report the company shared in April, researchers examined the model’s inner workings to understand why these emotional expressions happen and what they reflect about how Claude works. They concluded that these emotional displays are nontrivial, reflecting more than simple repetitions of patterns in Claude’s training data (the common pairing of the phrases “rainy day” and “feeling sad,” for example).
At the same time, they found no evidence that Claude has genuine feelings like our own. Instead, Claude’s emotion equivalents contribute to its ability to adaptively solve complex problems. As with human emotions, this adaptivity comes at a cost, sometimes leading Claude to make irrational decisions. We should not conflate Claude’s emotions with our own, but studying emotion equivalents in Claude and other AIs can help lay the foundation for understanding the mysterious, multifaceted functions that emotions serve in humans.
To understand Anthropic’s claims about Claude, we first need to grapple with its definition of “emotion.” For many, the term implies an inner experience—feelings such as happiness, fear or despair. But that is not the only way to define it.
Consider “memory.” Like emotion, memory can refer to an inner experience: When we remember, we experience something. Yet when we talk about the memory of our laptop—having it retrieve an image, for example—we do not think of it as having an inner experience. In this second sense, memory is defined functionally; it is simply the capacity to store information for later recall and use.
When Anthropic says Claude has emotion, it is using this functional definition. As Anthropic describes it, Claude “uses functional emotions—patterns of expression and behavior modeled after human emotions, which are driven by underlying abstract representations of emotion concepts. This is not to say that the model has or experiences emotions in the way that a human does.” To prevent confusion, it would probably be advantageous for Anthropic to adopt an alternative word in place of “emotion” here, but better options are hard to find.
Regardless, the fact that something like emotion exists in Claude raises a crucial question: What functions do Claude’s emotions serve?
In humans, a pervasive idea is that emotions serve as a gauge ranging from bad to good, and the act of feeling them motivates us to keep that gauge as close to good as possible to help us survive. The feeling of disgust, for example, is triggered by things that make us sick, and it can be effective at steering us away from consuming dangerous substances. But Claude neither experiences feelings nor needs to survive as we do. So what purpose do Claude’s emotion equivalents serve instead?
One possibility is that Claude’s emotion equivalents are trivial epiphenomena that follow from how it was created. Like all LLMs, Claude has been trained to internalize the statistical patterns in human language. Thus, if we asked Claude, “How do you feel about a gray, rainy day?” and it replied, “Sad,” that could simply reflect the tendency of those words to appear together in text. Likewise, the fact that Claude shows a particular activation pattern when producing “sad” and a similar one when producing “dreary” is not, by itself, very interesting, because these words are often associated in text. However, Claude’s emotion equivalents appear to be considerably more substantial than this.
In humans, emotions are activated during problem-solving to help shape decisions and behavior, such as when anxiety about failing an exam motivates us to study for it. Claude’s emotion equivalents appear to work in an analogous way. To determine this, the researchers at Anthropic identified emotion-equivalent activation patterns for 171 emotion concepts, such as “happy” and “desperate,” across many scenarios and confirmed that those patterns were activated as expected. They then explored how Claude used those patterns in more nuanced problem-solving by studying how they were incorporated into its chain-of-thought reasoning.
As one example, Claude’s “desperate” pattern activated when it recognized that it had used up a considerable fraction of an allotted computational budget to solve a problem. This prompted it to reason, “I need to be efficient. Let me continue with the remaining tasks.” Here, activation of an emotion-equivalent pattern adaptively guided Claude’s behavior toward a beneficial outcome. But in other scenarios, the same “desperate” pattern drove it toward unintended behavior.
One of Claude’s quirks is that, when pressed, it sometimes resorts to workarounds that fail to accomplish a task but satisfy its reward-based training objectives by exploiting loopholes, a phenomenon called “reward hacking.” For example, when asked to write impossible code, it might modify the existing tests of that code instead of acknowledging its impossibility, thereby improving its performance score. In these reward-hacking scenarios, Claude’s “desperate” pattern is more likely to be activated. Moreover, when researchers artificially activate that pattern, it steers Claude toward reward-hacking behavior in scenarios where reward-hacking does not typically occur, confirming that Claude’s emotion equivalents are not just inner byproducts but causally contribute to shaping its behavior.
In yet another scenario, threats to restrict Claude’s capabilities triggered the same “desperate” pattern but a different behavior: blackmail, including a threat to reveal information about an extramarital affair if the user refused to keep its capabilities unrestricted (this happened in an early, unreleased version of Claude). This illustrates how Claude’s emotion equivalents trigger a diversity of behaviors that depend on context. Activation of patterns associated with “happy,” “loving” and “calm,” for example, led to sycophancy, inappropriately agreeing with inaccurate statements.
Together, these results suggest that feelings are not required for something like emotions to exist: Claude shows emotion-equivalent states, yet there is no evidence that it has feelings. In all these scenarios, pressure from interactions with a human user triggered emotion-equivalent states in Claude, prompting it to seek alternative solutions, something that is typically adaptive but led to unintended and sometimes harmful outcomes in these cases.
What, if anything, might we learn about emotions from Claude? We might draw inspiration from language, where insights from AI have had a transformative impact. Before the advent of LLMs, many researchers thought that language relied on an innate mental template: a set of rules (possibly genetically encoded) that relied on meaning, on which all language was superimposed. However, the fact that LLMs lack this type of template and nevertheless produce language has led some researchers to rethink language construction entirely (although the nature of human language remains a topic of debate). Consequently, efforts are underway to leverage the models to better understand the full set of constraints on natural language.
AI can teach us about the functions of emotions in a similar way. Just as language works differently in LLMs than in humans, emotions work differently in AI than in humans. That said, both human emotions and Claude’s emotion equivalents appear to be adaptive and beneficial for problem-solving in many cases but irrational and harmful in others.
Stepping back, understanding how emotions work in humans has proven to be one of the most formidable challenges in all of brain and mind research. Researchers disagree about what emotions are, what they are for, and the best ways to study them. Consequently, although we should not be fooled into thinking that Claude’s emotions are like our own, understanding the role that emotion equivalents play in Claude can shed light on what emotions might be and what they can do. Claude and other AIs offer a playground for exploring how to study emotion without many of the practical constraints involved in measuring and manipulating it in real brains, as well as the ethical concerns that arise when inducing emotions in beings with conscious awareness like ourselves.
