The Sweet Lesson of Neuroscience

Adam Marblestone

Scientists once hoped that studying the brain would teach us how to build AI. Now, one AI researcher may have something to teach us about the brain.

In the early years of modern deep learning, the brain was a North Star. Ideas like hippocampal replay — the brain’s way of rehearsing past experience — offered templates for how an agent might learn from memories. Meanwhile, work on temporal-difference learning showed that some dopamine neuron responses in the brain closely parallel reward-prediction errors — solidifying a useful framework for reinforcement learning.

DeepMind’s 2013 Atari-playing breakthrough was perhaps the high-water mark of brain-inspired optimism. The system was in part a digital echo of hippocampal replay and dopamine-based learning. DeepMind’s CEO gave talks in the early days with titles like “A systems neuroscience approach to building AGI.”

By around 2020, though, many in AI had accepted what Rich Sutton in 2019 called the “bitter lesson”: Simple, general-purpose methods powered by massive compute and data outperformed methods built on hand-crafted domain knowledge, whether brain-inspired or otherwise. “Scaling laws” for transformer-based language modeling — an architecture that owes little to the brain — showed a path to vastly improved performance. And, of course, it didn’t help that our knowledge of neuroscience was, and still is, primitive.

But I believe the brain may have something more to teach us about AI — and that, in the process, AI may have quite a bit to teach us about the brain. Modern AI research centers on three key ingredients: architectures, learning rules, and training signals. The first two — how to build up complex patterns of information from simple ones and how to learn from errors to produce useful patterns — have been substantially mastered by modern AI. But the third factor — what training signals (typically called “loss” functions, “cost” functions, or “reward”) should drive learning — remains deeply underexplored. And that, I think, is where neuroscience still has surprises left to deliver.
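To make the three-ingredient framing concrete, here is a toy sketch in Python. It is a minimal illustration, not anyone’s actual model; every name and number in it is invented for the example. The architecture is a single layer of randomly initialized weights, the learning rule is gradient descent, and the training signal is a squared-error loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ingredient 1: the architecture -- a single linear layer with
# randomly initialized weights (a stand-in for an untrained network).
w = rng.normal(size=2)

# Ingredient 3: the training signal -- a loss function that scores outputs.
# (The gradient used in the loop below is the derivative of this loss.)
def loss(y_pred, y_true):
    return (y_pred - y_true) ** 2

# Toy data embodying whatever the system is "supposed" to learn.
xs = rng.normal(size=(100, 2))
ys = xs @ np.array([1.5, -0.7])     # a hidden rule to be discovered

# Ingredient 2: the learning rule -- plain gradient descent on the loss.
lr = 0.05
for _ in range(3):                  # a few passes over the data
    for x, y in zip(xs, ys):
        w -= lr * 2 * (w @ x - y) * x

print(np.round(w, 2))  # approaches [1.5, -0.7]: shaped entirely by the signal
```

Swap in a different loss, and the identical architecture and learning rule will learn something entirely different. That is why the third ingredient deserves more attention than it gets.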

I’ve been fascinated by this question since 2016, when advances in artificial deep learning led me to propose that the brain probably has many highly specific cost functions built by evolution that might train different parts of the cerebral cortex to help an animal learn exactly what it needs to in its ecological niche.

More recently, Steve Byrnes, a physicist turned AI safety researcher, has shed new light on the question of how the brain trains itself. In a remarkable synthesis of the neuroscience literature, Byrnes recasts the entire brain as two interacting systems: a learning subsystem and a steering subsystem. The first learns from experience during the animal’s lifetime — a bit like one of AI’s neural networks that starts with randomly initialized “weights,” or “parameters,” inside the network, which are adjusted by training. The second is mostly hardwired and sets the goals, priorities, and reward signals that shape that learning. A learning machine — like a neural network — can learn almost anything; the steering subsystem determines what it is being asked to learn.

Byrnes’ work suggests that some of the most relevant insights in AI alignment will come from neuroscientific frameworks about how the steering system teaches and aligns the learner from within. I agree. This perspective is the seed of what we might call the “sweet lesson” of neuroscience.

Axon of Purkinje neurons in the cerebellum of a drowned man, c. 1900, Santiago Ramón y Cajal, Cajal Institute, Madrid. Courtesy Wikimedia Commons. 

The brain in two parts

So let’s talk about Byrnes, and brains. In Byrnes’ view, brains have two main parts: a learning subsystem and a steering subsystem.

The learning subsystem consists primarily of the neocortex, hippocampus, cerebellum, and striatum. It’s the part of the brain that develops a model of the world, generates plans of action, and predicts how well they’ll work. When we’re born, it isn’t able to produce much in the way of useful outputs, but it learns continuously, over time, to find patterns in the world, and then more abstract patterns among those patterns, until it becomes very useful indeed.

This reflects the “cortical uniformity” idea popular in neuroscience: While the neocortex handles most of our complex cognition, its own structure is relatively simple and consistent. In other words, despite the cortex’s organization into functional substructures, its capacity to support complex thought comes only partially from preexisting architecture; much of what it does is the result of learning and experience. The cortex is a learning machine that starts out a bit like an uninitialized neural network — one with equal potential to be trained to drive a car, to generate language, or to do any number of other things.

Then there’s the steering subsystem, which consists of the hypothalamus and brainstem, with contributions from other regions like the pallidum. Unlike the learning subsystem, the steering subsystem is relatively static — a fixed set of rules “hand-coded” by evolution early in the history of our species. These structures, like most in the brain, have some plasticity throughout life, but their rewiring capacity is dwarfed by that of structures in the learning subsystem, such as the neocortex. These rules control (or “steer”) what the learning subsystem is being asked to learn from, and when it is supposed to learn it. Evolution, in other words, didn’t just build a learner. It built a teacher inside the same brain.

The steering subsystem influences the learning subsystem through “supervision signals.” These are analogous to the cost or reward functions used to train today’s AI, but are much more diverse, elaborate, and species-specific. The steering subsystem also handles innate reflexes that must be present from birth, before the learning subsystem has had a chance to discover the world’s patterns.

The steering subsystem doesn’t simply hand out rewards for evolution’s ultimate goals of survival or reproduction. Those signals would come far too late to shape an animal’s behavior early in its life: By the time the animal learned from them, it would already be dead or, at best, abjectly failing to mate.

Instead, evolution built an intricate scaffold of intermediate rewards. Some of these are obvious — food and warmth feel rewarding, because we need them to live. Others are much more sophisticated. We get neural rewards for things like play and exploration, which aren't useful at the moment we do them but do help us learn skills that matter later in life. Humans have instincts for things like social bonding, mimicry, attraction to certain kinds of mates, and many other internal assessment signals we don’t yet have names for. The orchestra of training instructions built up from these signals makes it possible to learn useful skills within one lifetime.

These signals differ depending on the types of behaviors an animal is ultimately “supposed” to learn in its particular ecological niche. Humans need to learn language and complex social skills, so our steering subsystem directs us to pay particular attention to the faces, voices, and behavior of our peers. Birds that are fed by their parents when young have reward signals that help them “imprint” on their parents and learn to follow them around. Beavers might have reward signals for picking up sticks, while young squirrels are attentive to acorns. This is why humans learn social deception and squirrels learn to bury nuts, even though the parts of our brains that house the learning subsystem are structured in largely the same way.

In order to produce useful reward signals, the steering subsystem needs its own sensory systems. We normally think of the visual cortex — part of the learning subsystem — as where vision lives in the mammalian brain. But the superior colliculus is a separate and mostly innate visual system that gives the steering subsystem its own window into the world. The superior colliculus quickly detects hardwired cues for motion, faces, and threats, allowing the steering system to react before the cortex finishes processing a scene, or even before the learning subsystem has discovered what “faces” look like. This can be used both to drive innate behaviors and to construct sophisticated reward signals for teaching the rest of the brain.

The social instincts and the symbol grounding problem

The most surprising part of Byrnes’ theory is his explanation of how the steering subsystem makes use of concepts and patterns discovered by the learning subsystem. 

Remember, the steering subsystem itself is a bundle of mostly fixed instincts. It has minimal ability to learn new information or even really compute concepts at all. Let’s say I embarrass myself by making a mistake in front of another scientist whose work I admire. I’ll almost certainly feel a sense of shame. This felt sense comes in large part from the steering subsystem generating some kind of negative reward signal — but what would trigger it? It certainly has no internal representation of “professional acquaintance” or “scientist.” Even concepts like “respect,” “admiration,” “shame,” and “status” are complex and contingent, far above the steering subsystem’s pay grade. Still, we find all of these things highly motivating, even though the steering subsystem doesn’t know what they are or even where the neurons representing such concepts might show up.

“Important scientist” and its ilk are patterns that emerge in the learned world model in the learning subsystem of a modern person in the industrialized world. According to Byrnes’ model, the steering subsystem has no way to know in advance what those patterns will be. How is it supposed to emit the right rewards to shape our social development when those rewards depend on concepts it can neither predict nor comprehend?

This is a version of what the cognitive scientist Stevan Harnad called the symbol grounding problem: How do thinking systems connect abstract symbols to their referents in the real world — or, in this case, how can innate motivations be triggered by learned abstractions? The steering subsystem is a set of hard-coded genetic rules, but it still needs to respond to learned concepts like “colleague” or “friend.”

Byrnes has a proposal for how that works. He thinks that there are neural circuits in the brain — he calls them Thought Assessors — that learn to recognize important patterns of thought in the learning system and connect them to the more primitive signals that the steering subsystem has knobs for.

The Thought Assessors are part of the learning subsystem (Byrnes predicts they’ll primarily be found in the extended striatum). They predict, based on input from the learning subsystem, what specific elements of the steering subsystem are about to do. A given Thought Assessor might start out with many different learning subsystem neurons feeding into it while it tries to predict a specific steering subsystem signal. If some neurons turn out to help make that prediction accurately, they’ll stay as inputs to that Thought Assessor. If other neurons prove irrelevant or detrimental to the prediction, their connections to that Thought Assessor are weakened or ignored by the learning subsystem. 
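As a cartoon of how such an assessor could wire itself up (a minimal sketch assuming a simple linear predictor, with every number invented for illustration), the classic delta rule already does the job: inputs that help predict the steering signal keep their weight, and connections that don’t slowly fade away.

```python
import numpy as np

rng = np.random.default_rng(1)

N = 50                    # learning-subsystem neurons feeding this assessor
RELEVANT = [3, 17, 42]    # only a few happen to predict the steering signal

def steering_signal(activity):
    # Stand-in for a hardwired steering-subsystem output (say, a visceral
    # flinch) that co-occurs with certain patterns of learned activity.
    return activity[RELEVANT].sum()

w = np.zeros(N)           # the assessor's input weights, initially blank
lr, decay = 0.02, 0.0005

for _ in range(5000):
    activity = rng.normal(size=N)                # a passing "thought"
    target = steering_signal(activity)           # what the steering system did
    prediction = w @ activity                    # the assessor's guess
    w += lr * (target - prediction) * activity   # strengthen useful inputs
    w *= 1 - decay                               # let useless ones fade away

print(np.round(w, 1))     # ~1.0 at the three RELEVANT inputs, ~0.0 elsewhere
```

Nothing here requires the assessor to “understand” the concepts it is tracking; it only needs to find whatever upstream activity reliably predicts the hardwired signal.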

Thought Assessors don’t transfer information about abstract concepts to the steering system, which can't process it anyway. But by learning to predict how the steering subsystem will react across many different kinds of situations, they help connect basic instincts to the neurons that handle learned concepts. Once a Thought Assessor is wired up, it can utilize a more sophisticated learned model of how its steering rules should be applied, one that generalizes to the complicated situations we encounter in the real world. 

Byrnes proposes that there are many different Thought Assessors and many different corresponding kinds of supervisory signals from the steering subsystem — perhaps hundreds to thousands of them. One of those Thought Assessors provides a prediction of a thought’s overall valence — something like how rewarding it is to the animal. There are also many other assessors predicting other innately important features, like “I’m about to get goosebumps” or “I’m about to flee a looming predator” or “this will lead to me crying.” The steering subsystem has signals — and the learning subsystem has corresponding Thought Assessors — for most of the key building block variables underlying all innately controlled, species-specific behaviors, including human social behaviors. 

One of Byrnes’ best-developed examples has to do with laughter. Many biologists think that laughter evolved as a way to signal play. Young animals, the theory goes, enjoy playing because it helps them practice activities like fighting, chasing prey, or fleeing predators in a safe, low-stakes environment. And playful animals often have ways of letting their playmates know that their pawing and batting isn’t a serious attack. Dogs exhibit a “play bow.” Rats and humans laugh. This urge to laugh is an innate instinct: In Byrnes’ parlance, it comes from the steering subsystem. In neurological terms, Byrnes believes that there are specific circuits in the hypothalamus that detect the conditions under which an animal should laugh — when it detects some mild danger signs, like being batted, pawed at, or tickled, but has no other reason to believe it’s in serious trouble.

Researchers have actually found these circuits in experiments with tickled rats. Humans, of course, don’t just laugh during play fights — we laugh at things we find funny or unexpected, or even when we’re nervous. When we’re born, our innate instincts tell us to laugh when we’re basically safe but detect just a little bit of danger. In babies, this might mean tickling or peekaboo. Over time, we learn increasingly abstract mappings of social threat, confusion, and discomfort. We can imagine a laughter Thought Assessor learning to predict which of the ever-more-complex situations a person encounters strikes the right balance of safety and danger to drive a laughter response, while the laughter response, in turn, trains the learning subsystem to label some of its learned internal pathways as playful, humorous, safe, or friendly.
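In code-sketch terms, the innate rule might be as simple as a gate. To be clear, this is a cartoon of the hypothesis, not Byrnes’ actual proposed circuit, and the thresholds are invented:

```python
def innate_laugh_circuit(mild_threat: float, serious_threat: float) -> bool:
    """Cartoon of the hypothesized hypothalamic rule: signal play (laugh)
    when there are mild danger cues but no evidence of real trouble."""
    return mild_threat > 0.3 and serious_threat < 0.1

print(innate_laugh_circuit(0.6, 0.0))   # tickling: True -> laugh
print(innate_laugh_circuit(0.6, 0.9))   # an actual attack: False
print(innate_laugh_circuit(0.0, 0.0))   # nothing happening: False
```

The laughter Thought Assessor’s job is then to learn which abstract, grown-up situations, from an awkward pun to a near-miss retold at dinner, map onto those two crude input variables.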

When you feel pride, shame, or empathy, some of those Thought Assessors are likely firing, allowing a learned cognitive pattern to trigger an ancient reinforcement pathway — a pathway that, in turn, can shape the further development of your social responses. 

Speculation, or serious theory?

Byrnes is not the first to suggest ways that innate evolved brain mechanisms could “bootstrap” learning of complex social behaviors. Other cognitive scientists have imagined processes by which simple innate reward signals could steer an animal to pick up on more complex patterns, which themselves could be the basis for the production of more complex forms of reward. 

A version of this concept appeared in the cognitive science literature in Ullman, Harari, and Dorfman’s 2012 paper, “From simple innate biases to complex visual concepts.” The authors proposed a computational model in which infants would use mover-event detectors — simple innate visual cues for when one object causes another to move — as teaching signals. A system that detects “movers” can label the likely source of motion as a “hand.” Once it recognizes “hands” using this primitive labeling scheme — and assuming it also has an innate face detector that can draw attention to the eyes — it can then infer gaze direction by assuming that eyes tend to look toward “hands.” Other cognitive science literature recognizes that gaze direction, in turn, is a useful signal for training the ability to pay attention to the same thing a caretaker is paying attention to, which helps drive imitation learning and theory of mind.
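A toy version of that bootstrapping cascade can be written down directly. The sketch below is my own simplification of the paper’s idea, set in a made-up two-dimensional world with invented names: an innate “mover” detector provides free labels for hands, and those labels in turn supervise a gaze interpreter.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy world: each scene contains five objects; one of them (the "hand")
# moves, and a "gaze" vector points roughly toward it.
def make_scene():
    objects = rng.uniform(-1, 1, size=(5, 2))          # object positions
    hand_idx = rng.integers(5)
    motion = np.zeros(5)
    motion[hand_idx] = 1.0                             # only the hand moves
    gaze = objects[hand_idx] + rng.normal(0, 0.1, 2)   # eyes look at the hand
    return objects, motion, gaze, hand_idx

# Stage 1: an innate "mover" detector labels the moving object as a hand.
def innate_mover_detector(motion):
    return int(np.argmax(motion))

# Stage 2: those free labels train a gaze interpreter -- a linear map
# from the gaze vector to the attended position.
A = np.zeros((2, 2))
lr = 0.1
for _ in range(3000):
    objects, motion, gaze, _ = make_scene()
    hand = objects[innate_mover_detector(motion)]      # bootstrapped label
    A += lr * np.outer(hand - A @ gaze, gaze)          # delta rule

# Stage 3: the learned interpreter now finds the attended object
# even with no motion cue at all.
objects, _, gaze, hand_idx = make_scene()
guess = int(np.argmin(np.linalg.norm(objects - A @ gaze, axis=1)))
print(guess == hand_idx)   # almost always True
```

Each stage’s output becomes the next stage’s training signal; no stage needs to be told in advance what a “hand” or a “gaze” is.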

Although far from validated in its details, Byrnes’ theory is also supported by several lines of neuroscientific evidence.

A 2023 Nature paper by Fei Chen and colleagues found that a disproportionate number of the mouse brain’s distinct cell types reside in the hypothalamus, the midbrain, and the brainstem, suggesting that evolution poured much of its innovation into the areas that make up the steering subsystem. 

This would make sense if we believe in the learning/steering subsystem distinction. The learning subsystem just needs to be set up to learn: It needs only a generic, somewhat random scaffold and a learning algorithm to fill in its details. But to be the kind of sophisticated teacher Byrnes hypothesizes, the steering subsystem would have to contain a lot of information about the behaviors and thought patterns that are likely to be useful to a specific species in its ecological niche. We would expect a lot of bespoke innate biological complexity to be built into the steering subsystem, and the experimental evidence suggests that this is in fact the case.

There is substantial evidence that the hypothalamus is involved in shaping social behaviors. In a recent paper titled “A hypothalamic circuit underlying the dynamic control of social homeostasis,” Catherine Dulac’s lab at Harvard identifies the specific neuronal circuits in the mouse hypothalamus that play a role in how “social isolation generates an aversive state (‘loneliness’) that motivates social seeking and heightens social interaction upon reunion.” And David Anderson’s lab at Caltech found “a circuit that integrates drive state and social contact to gate mating,” involving a different set of hypothalamus neurons. 

One key aspect of the theory is that the brain’s reward signals are not just simple functions of external conditions. Rather, they are the results of complex computations by the steering subsystem, which themselves can draw input from the learned Thought Assessors. Is this complexity of reward production realistic? 

Song learning in songbirds provides one of the best-understood examples of biological reinforcement learning. Young birds learn their songs by listening to more experienced tutors, storing an internal template, or memory, of what the song should sound like based on the tutor song, and then practicing producing their own songs thousands of times while their brains generate error signals by comparing the sound of their outputs with that of the stored template.

Dopamine neurons in the songbird brain signal precisely whether a sung note has come closer to — or strayed further from — the tutor song. More interestingly, these reward signals change with social context: When practicing alone, feedback is sharp and corrective, but when singing to a mate, the same circuits suppress error signals, freezing the learned performance. In other words, the bird has an innate instinct that makes it want to copy the songs of older birds.

But translating this instinct into practice requires more sophistication than the steering subsystem can provide. It needs to remember a repertoire of songs, calculate the difference between a memorized note and the note the bird actually produces at any given time in a song, and know when to ignore this whole reward system when it’s time to stop practicing and focus on a real opportunity to attract a mate. According to Byrnes’ theory, the songbird brain first built something like a Thought Assessor for evaluating “match my song’s sound to that of my tutor’s song,” and then used that evaluator to help train its song production. This allows the songbird brain to generate dopamine reward signals in response to purely internal processes (and change which of those processes is rewarding in different conditions). The songbird doesn’t just learn; it teaches itself.
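A toy rendering of that internally generated reward (my own sketch, with invented numbers standing in for pitches) looks something like this: the “dopamine signal” is computed from a stored template rather than from anything in the outside world, and it switches off when a mate shows up.

```python
import numpy as np

rng = np.random.default_rng(4)
TEMPLATE = np.array([2.0, 3.5, 3.0, 4.2])   # memorized tutor song ("pitches")

def dopamine_signal(sung, mate_present):
    """Toy internally computed reward: negative mismatch with the stored
    template -- suppressed entirely when performing for a mate."""
    if mate_present:
        return 0.0               # practice is over; freeze the performance
    return -np.mean((sung - TEMPLATE) ** 2)

song = np.zeros(4)               # babbling: starts far from the template
for _ in range(500):             # stand-in for thousands of practice renditions
    trial = song + rng.normal(0, 0.2, 4)     # try a slight variation
    if dopamine_signal(trial, False) > dopamine_signal(song, False):
        song = trial             # keep variations that sound closer

print(np.round(song, 1))         # ends up near TEMPLATE
```

Notice what the reward function is not: it is not “survive” or “reproduce,” but a bespoke, internally computed teaching signal that exists only to sculpt one skill.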

Byrnes’ model also predicts that there are many different Thought Assessors that link patterns in the learning subsystem to different supervision signals from the steering subsystem. Different areas of the learning subsystem might also learn from different steering subsystem outputs. Indeed, there is evidence of many specialized supervision signals even in fruit flies, an organism for which we have a full brain wiring map that can start to reveal such complexity. The fly brain doesn’t just have one kind of “dopamine neuron.” Instead, it has about 20 different kinds of dopamine neurons capable of assessing a combination of features of the fly’s internal state, with some ability to detect external cues. Each of these 20 or so kinds of dopamine neurons sends signals to different subcompartments of the fly brain’s “mushroom body,” which functions as an associative learning center. This is suggestive of an evolutionarily ancient structure that supports training many different “assessors” inside a brain.

There are still many unknowns. Byrnes has fascinating hypotheses about how Thought Assessors fit within our current understanding of neuroanatomy, and how specific social instincts are grounded in steering subsystem circuits, but they are not fully biologically or algorithmically fleshed out. Refining and testing them, especially in the absence of a more unified map of the brain, would be something of a fishing expedition. Remedying this situation is not a small task. I think it would take some brain mapping megaprojects and the formation of new groups studying these kinds of questions, and even then it would likely take more than a few years to bear fruit.

Implications for AI alignment research

Byrnes’ research has important implications for how we understand the brain, but that’s not his primary motivation. He thinks that the way our brain steers its own learning could prove important for aligning future “brain-like” artificial general intelligence systems.

Byrnes thinks that today’s LLMs won’t scale to true general intelligence. Instead, AI researchers will naturally converge on the same broad type of architecture that evolution has developed in the human brain. In his view, this will involve a malleable learning subsystem and a hard-coded steering subsystem. Within the learning subsystem, he predicts that the model’s architecture will be a form of continuous model-based reinforcement learning, just as our brains build internal world models that we use to simulate the outcomes of actions before we take them, and that update themselves continually as we learn new information.
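In skeleton form, the loop might look like the sketch below. This is a guess at the shape of such a system, not a description of any existing one; all names and numbers are invented. A learned world model imagines the outcome of each candidate action, a hard-coded steering function scores the imagined outcomes, and the model keeps updating from real experience.

```python
import numpy as np

rng = np.random.default_rng(3)

class WorldModel:
    """Learned component: predicts the next state from (state, action)."""
    def __init__(self):
        self.w = np.zeros(2)               # one learned effect per action

    def predict(self, state, action):
        return state + self.w[action]

    def update(self, state, action, next_state, lr=0.2):
        err = next_state - self.predict(state, action)
        self.w[action] += lr * err         # continual, online learning

def steering_reward(state):
    """Hard-coded component: prefers states near a fixed set point."""
    return -abs(state - 10.0)

def true_dynamics(state, action):   # the world itself, unknown to the agent
    return state + (1.0 if action == 0 else -1.0) + rng.normal(0, 0.1)

model, state = WorldModel(), 0.0
for _ in range(100):
    # Plan: imagine each action with the world model; score with steering.
    imagined = [steering_reward(model.predict(state, a)) for a in (0, 1)]
    action = int(np.argmax(imagined))
    next_state = true_dynamics(state, action)
    model.update(state, action, next_state)   # learn from what really happened
    state = next_state

print(round(state, 1))   # hovers near the steering set point of 10.0
```

Notice where this toy agent’s “values” live: not in the learned world model but in the fixed steering function, which is exactly the component Byrnes argues we would need to get right.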

This high-level view of the brain substantially overlaps with Yann LeCun’s proposal for the ultimate path to AGI (which is not large language models), as well as perspectives laid out in books like Max Bennett’s A Brief History of Intelligence. Byrnes is among a small set of neuroscience and AI researchers, like Beren Millidge, who have sought to link this idea to questions of AI alignment.

In any case, this type of setup is not how we train LLMs. Today, model-based reinforcement learning is used for other AI applications, like game playing or robotics, and no current AIs are capable of fully continuous learning — their store of knowledge is fixed once they’re trained. Many AI researchers believe these things aren’t necessary: Sufficiently large and sophisticated LLMs might have enough stored knowledge that they don’t need to keep learning and enough internal sophistication to make good predictions without an explicit world model built into their architecture. If they’re right, then Byrnes’ model of the brain doesn’t have much to tell us about AI safety.  

But if you believe, as Byrnes does, that AGI development may ultimately land on such a brain-like architecture, then we might be able to learn something relevant for AI alignment from the brain. 

Any brain-like AGI would need at least a simple steering subsystem, and if it is to make use of learned concepts, it will need something like Thought Assessors. The challenge will be figuring out how to design this system so that it bootstraps the development of prosocial motives instead of selfish ones.

This is where studying the brain might be able to help us. In Byrnes’ framework, the human brain’s particular Thought Assessors, and the particular logic by which they are used by its steering subsystem to generate rewards, are how the brain aligns itself: They are what keep an increasingly sophisticated network of learned motivations and incentives in line with the instinctual demands of the steering subsystem. If we could figure out how our brains create the different kinds of social reward functions that make humans want to help each other, this insight could offer a step toward designing a reward system that can robustly elicit similar behavior in a powerful intelligence that continuously learns.

Recently, Geoffrey Hinton stated that he is becoming more optimistic about AI alignment because mammals possess a “maternal instinct” — a process that allows a baby to strongly control the behavior of its more intelligent and powerful mother. In light of Byrnes’ framework, Hinton’s wacky-sounding idea could become an actual research program, if still a speculative one.

Validating and refining this framework with more specificity could open a path toward a neuroscience of alignment. Imagine comparing the steering circuits of related species with different social instincts. The goal would not be to copy the brain exactly but to emulate its methods of teaching the learner from within.

Which Thought Assessors do different animals have in their brains, and what patterns do human brain Thought Assessors look for when trying to ground out concepts like “friendliness”? What more primitive steering subsystem signals are needed for the concept of “friendliness” to emerge in the first place? Or do the Thought Assessors tag concepts we haven’t thought of yet? Optimistically, we might not have to understand the brain perfectly to gain some relevant insights.

To be clear, we shouldn’t oversell this as a near-term or straightforward path to aligning AI. Even if we understood perfectly how the human brain is steered, this might not generalize to steering an artificial superintelligence.

A cut nerve outside the spinal cord, 1913, Santiago Ramón y Cajal, Cajal Institute, Madrid. Courtesy Wikimedia Commons. 

Implications beyond AI

The implications of understanding the brain’s steering circuits would go well beyond AI. Many psychiatric conditions — addiction, obsessive-compulsive disorder, depression — can be seen as failures of internal teaching: loops in which the brain’s evaluative machinery gets stuck or miscalibrated.

Aspects of normal brain function like gender and sexual identity may trace back to how the brain’s steering subsystem develops its Thought Assessors — how it learns to interpret patterns of attraction, status, or affiliation and ties them to innate steering responses. 

All of this brings us to a single question: How does sophisticated, within-lifetime teaching actually work inside mammalian brains? Answering that could link psychiatry, developmental neuroscience, and new ideas for AI alignment into a consilient research program.

If scaling was AI’s bitter lesson of the early 2020s, the 2030s and beyond may teach a sweet lesson of neuroscience:

Brains are not just learners; they are architectures of internal teachers.

We should try to find those teachers in the brain — and learn from them.

Adam Marblestone is the co-founder and CEO of Convergent Research, an incubator for Focused Research Organizations.


Have something to say? Email us at letters@asteriskmag.com.