Building blocks: The BabyLM Challenge includes two tracks with datasets of 100 million or 10 million words, both much smaller than typical LM training sets, which include more than a trillion words of text.
Photography by Danielle Ezzo / Stable Diffusion AI

The BabyLM Challenge: In search of more efficient learning algorithms, researchers look to infants

A competition that trains language models on relatively small datasets of words, closer in size to what a child hears up to age 13, seeks solutions to some of the major challenges of today’s large language models.

Language models (LMs) and babies learn language in totally different ways. Typical LM training sets include more than a trillion words of text—1,000 times more words than the average 13-year-old has experienced in their lifetime. Teaching babies language like we train LMs would require a baby to listen to 200 million words a day until they are 13 years old. That’s 2,000 words per second, 24/7!
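
As a quick sanity check, the back-of-envelope arithmetic behind those figures looks roughly like this (a sketch using rounded numbers; the trillion-word training-set size is itself an approximation):

```python
# Rough arithmetic behind the "200 million words a day" claim (rounded values).
trillion_words = 1e12                     # approximate size of a modern LM training set
days = 13 * 365                           # days in a child's first 13 years
seconds_per_day = 24 * 60 * 60

words_per_day = trillion_words / days     # roughly 210 million
words_per_second = words_per_day / seconds_per_day  # roughly 2,400

print(f"{words_per_day:,.0f} words per day")
print(f"{words_per_second:,.0f} words per second")
```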

Clearly, LMs are data-hungry. That’s a problem, both because training them takes a lot of power and because we likely can’t keep up with their appetite. Indeed, one of the main ways LMs have improved in recent years is by training bigger models on more data. But that approach isn’t sustainable—researchers predict that we’re likely to run out of new data to train LMs on by 2028. This raises the question: Are we doing something wrong when we train LMs? Why are they so data inefficient?

This question is the inspiration for the BabyLM Challenge, a competition in which researchers try to develop LMs that work well when trained on an amount of data closer to what a human might encounter in their first 13 years. BabyLM includes two tracks, featuring datasets of 100 million or 10 million words. Aside from limits on the amount of training data, challenge participants can use whatever model and data-processing techniques they like. On the day of the competition, each BabyLM submission is run against a surprise test set to assess performance.

The BabyLM Challenge has been held just twice, in 2023 and 2024, with plans for a third event in November of this year; my group participated in 2024. Although there is still a lot to learn, the project has already generated some interesting takeaways. Results so far suggest that taking some inspiration from patterns of human language acquisition can improve learning and that recombining data in novel ways can help compensate for smaller datasets. Researchers in the most recent competition struggled to combine vision and language into a model that can be trained in a data-efficient way, but there may still be ways to combine sensory streams to improve learning.

Beyond these specific advances, the BabyLM Challenge has also sparked valuable academic conversations, particularly around the issue of LM efficiency—an urgent consideration given the environmental impact of large-scale training. It also underscores a fundamental reality for academic researchers: We simply can’t compete with the multimillion-dollar budgets of industry giants. But this constraint can be a powerful catalyst for innovation. It forces us to move beyond the engineering goals of making LMs bigger and training them on more data. Instead, we can draw inspiration from fields such as developmental psychology and advance language modeling in more insightful ways. The BabyLM Challenge provides a safe space to try out new ideas, share successes alongside failures and push language modeling in new directions.

One of the popular approaches for BabyLM submissions is inspired by how we teach children to read: Begin training on the simplest text, such as short sentences with simple words, and then progress to more difficult text over the course of training. This approach—ordering the training data by difficulty—is called curriculum learning. Despite its intellectual appeal, curriculum learning did not seem to lead to better-performing LMs—with one exception, a 2024 submission that presented a new take on curriculum learning.
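
To make the idea concrete, here is a minimal sketch of sentence-level curriculum learning, assuming sentence length as a crude stand-in for difficulty; it is an illustration only, not the method of any particular submission:

```python
# Sentence-level curriculum learning sketch: order training text from easy to hard.
# "Difficulty" here is simply sentence length, a deliberately crude proxy.
def sentence_difficulty(sentence: str) -> int:
    return len(sentence.split())

def curriculum_order(corpus: list[str]) -> list[str]:
    """Return the corpus sorted from easiest to hardest sentence."""
    return sorted(corpus, key=sentence_difficulty)

corpus = [
    "Although it was raining, the children decided to play outside anyway.",
    "The cat sat.",
    "Dogs bark.",
]
for sentence in curriculum_order(corpus):
    print(sentence)  # shortest (easiest) sentences are seen first during training
```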

Rather than ordering text by the difficulty of each sentence, Suchir Salhan, a computer science graduate student at the University of Cambridge, and his collaborators ordered the data at the word level, grouping words by their parts of speech, such as nouns and verbs. The team trained masked language models, which learn to predict randomly selected missing words that can appear anywhere in a sentence. The team’s insight was to apply the curriculum at the word-mask level: They masked words according to their parts of speech, gradually expanding the set of masked categories as training progressed, guided by theories of language acquisition. They found significant improvement in LM performance for an ordering that first masked nouns and verbs, and then moved on to determiners, conjunctions, adjectives and so on—a benefit that held true for several languages.
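
The sketch below illustrates the general idea of a part-of-speech masking curriculum; the stages, tags and masking rate are hypothetical stand-ins rather than the team’s actual schedule or code:

```python
import random

# Hypothetical part-of-speech masking curriculum: early in training only nouns and
# verbs may be masked; later stages expand the set of maskable categories.
STAGES = [
    {"NOUN", "VERB"},                        # stage 0: content words only
    {"NOUN", "VERB", "DET", "CONJ", "ADJ"},  # stage 1: add function words and adjectives
]

def mask_sentence(tagged_tokens, stage, mask_rate=0.15, mask_token="[MASK]"):
    """Mask only words whose POS tag is allowed at the current curriculum stage."""
    allowed = STAGES[stage]
    return [
        mask_token if tag in allowed and random.random() < mask_rate else word
        for word, tag in tagged_tokens
    ]

sentence = [("the", "DET"), ("dog", "NOUN"), ("chased", "VERB"), ("a", "DET"), ("ball", "NOUN")]
print(mask_sentence(sentence, stage=0))  # only "dog", "chased" or "ball" can become [MASK]
```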

Why does this word-level curriculum work when ordering data based on difficulty at the sentence or document level does not? Though there is a focus on baby-directed speech in the BabyLM Challenge—the data contain CHILDES, a database of transcribed child-directed speech—babies actually hear a lot of adult-directed conversation, and it’s likely that they learn from speech that is not specifically directed at them. Evidence suggests that ambient language in the environment contributes to language acquisition. For example, despite large cultural differences in the amount of child-directed speech a child experiences, there are only minor or no differences in acquisition rates. Salhan and his collaborators’ focus on specific word categories could simulate the sort of learning possible when a baby understands only a small number of words.

How else might a learner bootstrap their learning when they have access to only a small amount of data? One of the winning approaches in 2023 remixed sentences from the BabyLM training data to create additional training examples. On its face, this feels a bit like cheating: Their model works better because they trained it on more data. But, given that the new data were derived from the smaller dataset, it is an interesting approach that fell within the bounds of the challenge. Perhaps this remixing of words could be akin to using our own internal monologues to improve our literacy, a kind of “talking to yourself” for LMs. (A data remixing effort in the 2024 challenge had more limited success.)
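
As a rough illustration of what remixing a small corpus might look like, here is a toy sketch that splices pieces of different sentences together to create new training examples; it is an illustrative stand-in, not the winning team’s actual augmentation method:

```python
import random

# Toy data remixing: build extra training sentences by recombining pieces of the
# originals. Real augmentation schemes are far more careful about grammaticality.
def remix(corpus: list[str], n_new: int, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    halves = []
    for sentence in corpus:
        words = sentence.rstrip(".").split()
        mid = len(words) // 2
        halves.append((words[:mid], words[mid:]))
    new_sentences = []
    for _ in range(n_new):
        first, _ = rng.choice(halves)   # opening of one sentence
        _, second = rng.choice(halves)  # ending of another (possibly the same) sentence
        new_sentences.append(" ".join(first + second) + ".")
    return new_sentences

corpus = ["The baby watched the red ball.", "Her brother read a long story."]
print(remix(corpus, n_new=3))
```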

In 2024, the best-performing model combined the training objectives of two well-known LMs: GPT and BERT. GPT uses causal language modeling; it is trained to predict the next word in a sequence. BERT uses masked language modeling; it is trained to predict randomly masked words that can appear anywhere in a sentence. Each performs well on different tasks, implying that combining them might give the best of both worlds. The aptly named GPT-BERT model dominated all submissions in 2024 in both the 100-million- and the 10-million-word tracks.
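
To make the two objectives concrete, the toy sketch below shows the kind of prediction targets each one produces over a single token sequence; it illustrates the objectives only, not how the GPT-BERT submission actually trains a single model on both:

```python
import random

tokens = ["the", "baby", "hears", "many", "words"]

# Causal language modeling (GPT-style): predict each next token from its prefix.
causal_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Masked language modeling (BERT-style): hide random tokens, predict them in place.
rng = random.Random(0)
masked_input = [t if rng.random() > 0.3 else "[MASK]" for t in tokens]
masked_targets = [(i, tokens[i]) for i, t in enumerate(masked_input) if t == "[MASK]"]

print(causal_pairs)                  # e.g. (["the", "baby"], "hears")
print(masked_input, masked_targets)  # e.g. ["the", "baby", "hears", "[MASK]", "words"], [(3, "many")]
```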

How might masked and causal language modeling fit into human language acquisition? Psycholinguistics has long leveraged the fact that people predict the next word in a sequence based on what they’ve encountered so far. Psycholinguists often manipulate violations of that expectation, which led to the discovery of characteristic activity patterns tied to semantic and syntactic surprise, the N400 and P600 event-related potentials. So causal LMs are well motivated, but what about masked language models? Masking a word within a stimulus sentence is much less common in psycholinguistics, but you can probably remember a time that you predicted a word you missed mid-sentence when listening to someone. Combining these two predictive models into one makes sense in light of the predictions people (and babies) make when processing noisy language.

Of course, babies don’t learn language in isolation. They experience a rich world of sight, sound, touch and smell—and some hypotheses from developmental psychology suggest that this combination of sensory information helps drive language learning. The most recent BabyLM Challenge adopted that idea, expanding to include a multimodal track. The multimodal dataset included images paired with text to encourage teams to consider learning from two sensory streams at the same time. In this first multimodal iteration, none of the teams were able to outperform the baselines provided by the challenge organizers. But the approach has strong theoretical support, so the BabyLM organizers plan to include the multimodal track in future challenges—perhaps one of the models in this year’s effort will find the key combination.
