Nested Learning: Google’s New Paradigm That Could Redefine the Future of LLMs


Making room in the ML series: what Google just unveiled could be the future of LLMs

Google introduces Nested Learning, a new paradigm that could be the conceptual successor to Transformers.

Today we’re pausing our series on Machine Learning, Deep Learning, and LLMs.

Not to go off track, but because something has happened that fits perfectly with where we are: a breakthrough that could define the next decade of AI.

The context: from «Attention Is All You Need» to «Nested Learning»

In 2017, Google Research published Attention Is All You Need.

That paper changed history: it introduced the Transformer, the architecture behind virtually every modern large language model. ChatGPT, Claude, Gemini, Llama… they’re all built on Transformers.

Eight years later, in November 2025, Google Research makes its move again, this time with a different group of authors.

And what they’re proposing isn’t an incremental improvement, but a new paradigm: Nested Learning.

An approach designed to tackle one of the biggest problems with current LLMs: catastrophic forgetting.

Infographic explaining Google’s Nested Learning paradigm and the Hope architecture with continuous memory.

The problem Nested Learning aims to solve

Current LLMs are impressive, but they have a critical structural limitation:

They can’t learn new things without forgetting part of what came before.

They’re static models, frozen after training. If you want GPT-4 to know something that happened yesterday, you can’t simply «teach it.» You have to retrain or fine-tune it, which is expensive and risks overwriting what it already knows, or use workarounds like RAG (Retrieval-Augmented Generation).

This isn’t a bug. It’s a direct consequence of how Transformers are trained and deployed: every weight is fitted in one training run and then frozen.

And it’s exactly what Nested Learning proposes to change.
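To make the forgetting problem concrete, here is a toy sketch in Python. It is a hypothetical illustration, not something taken from the paper: a model with a single weight is trained on one task and then only on a second, conflicting task. The function names and the two tasks are assumptions chosen for brevity.

```python
# Toy illustration of catastrophic forgetting (hypothetical, not from the paper).
# A one-parameter model y = w * x is trained on task A, then on task B only.

def mse(w, data):
    """Mean squared error of the model y = w * x on a list of (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def train(w, data, lr=0.05, epochs=200):
    """Full-batch gradient descent on the MSE; every step overwrites the weight."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

xs = [0.5, 1.0, 1.5, 2.0]
task_a = [(x, 2.0 * x) for x in xs]    # task A: y = 2x
task_b = [(x, -1.0 * x) for x in xs]   # task B: y = -x

w = train(0.0, task_a)
loss_a_before = mse(w, task_a)   # near zero: task A has been learned

w = train(w, task_b)             # sequential training on task B only
loss_a_after = mse(w, task_a)    # large again: task A has been overwritten

print(f"loss on A after training on A: {loss_a_before:.6f}")
print(f"loss on A after training on B: {loss_a_after:.6f}")
```

A real LLM has billions of weights rather than one, so the damage is partial instead of total, but the mechanism is the same: new gradients land on the same weights that stored the old knowledge.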

The mental «click»: different learning speeds within the same model

In a traditional Transformer:

  • All layers are updated on the same schedule
  • Every weight is updated at every training step
  • There’s no notion of memory with different rhythms

Nested Learning breaks that paradigm.

Imagine a model where:

  • Some parts learn very fast → short-term memory, adapted to current context
  • Others learn very slowly → stable knowledge, long-term memory
  • And others operate at intermediate frequencies → semantic knowledge, recurring patterns

This is much closer to how the human brain is thought to work: neural rhythms at different frequencies, and layers of memory that update at different speeds.

The key lies in the conceptual proposal:

Architecture and optimization aren’t separate things. They’re the same process operating at different scales.

Instead of viewing a model as a single block learning at a fixed speed, Nested Learning conceives it as a system of nested optimizations, each with its own update frequency.
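As a rough sketch of what «its own update frequency» can mean, split the parameters into levels ℓ = 1, …, K. The symbols below (θ^(ℓ) for the parameters of level ℓ, C_ℓ for its update period, η_ℓ for its step size) are assumed notation for illustration, and the rule is a simplification, not the paper’s exact formulation:

```latex
\theta^{(\ell)}_{t+1} =
\begin{cases}
\theta^{(\ell)}_{t} - \eta_\ell \displaystyle\sum_{s=t-C_\ell+1}^{t}
  \nabla_{\theta^{(\ell)}} \mathcal{L}(\theta_s; x_s)
  & \text{if } (t+1) \bmod C_\ell = 0, \\[6pt]
\theta^{(\ell)}_{t} & \text{otherwise.}
\end{cases}
```

With C_ℓ = 1 this reduces to ordinary gradient descent, where every step updates the level. A large C_ℓ gives a level that changes rarely and therefore holds on to older information.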

«Hope»: the experimental architecture that proves the concept

To test the idea, Google didn’t stop at theory. They built an experimental architecture called Hope.

Hope implements a Continuum Memory System (CMS):

  • There’s not just «short memory» vs «long memory»
  • There’s a full spectrum of memory modules
  • Each updates at its own frequency
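The three points above can be sketched in a few lines of Python. This is a deliberately minimal toy under stated assumptions, not Hope’s implementation: each «memory module» is reduced to a single number, and the class name `MemoryLevel`, the periods, and the learning rate are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class MemoryLevel:
    """One toy memory module: a single value updated once every `period` steps."""
    period: int            # the level updates once per `period` observations
    lr: float              # step size applied at each update
    value: float = 0.0     # the stored "memory"
    grad_sum: float = 0.0  # gradients accumulated since the last update
    updates: int = 0       # how many times this level has actually changed

    def observe(self, step: int, x: float) -> None:
        # Accumulate the gradient of 0.5 * (value - x)^2 with respect to value.
        self.grad_sum += self.value - x
        # Apply the averaged gradient only when this level's period has elapsed.
        if (step + 1) % self.period == 0:
            self.value -= self.lr * self.grad_sum / self.period
            self.grad_sum = 0.0
            self.updates += 1

# A small spectrum: fast, intermediate, and slow levels (assumed periods).
levels = [MemoryLevel(period=1, lr=0.5),
          MemoryLevel(period=8, lr=0.5),
          MemoryLevel(period=64, lr=0.5)]

# The stream shows 1.0 for 64 steps, then switches to 5.0 for 64 steps.
stream = [1.0] * 64 + [5.0] * 64
for step, x in enumerate(stream):
    for level in levels:
        level.observe(step, x)

for level in levels:
    print(f"period={level.period:>2}  updates={level.updates:>3}  value={level.value:.3f}")
```

After the switch, the fast level has fully adopted the new value, while the slow level still sits between old and new: it has only updated twice. That is the intuition behind a spectrum of memories, although Hope’s actual modules are neural network blocks, not single numbers.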

The results are promising

In the experiments Google reports, Hope outperforms standard Transformers in:

  1. General language modeling → better perplexity
  2. Long-context tasks → handles extended conversations and documents with more coherence
  3. Needle-In-A-Haystack (NIAH) tests → finds specific information in massive contexts
  4. Memory efficiency → retains relevant information without exploding in size

In other words: on these benchmarks, Hope remembers better, for longer, and at lower cost.
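For readers who haven’t seen a Needle-In-A-Haystack test, here is a minimal sketch of how one can be built. Everything in it is hypothetical: the filler text, the «passcode» needle, and `keyword_baseline`, which is a plain string search standing in for the model under test. It shows the shape of the benchmark, not the setup Google used.

```python
import random

def build_haystack(needle: str, n_filler: int, depth: float, seed: int = 0) -> str:
    """Return filler text with `needle` inserted at relative position `depth` (0.0 to 1.0)."""
    rng = random.Random(seed)
    topics = ["weather", "traffic", "cooking", "sports", "gardening"]
    filler = [f"Note {i}: nothing unusual about {rng.choice(topics)} today."
              for i in range(n_filler)]
    position = int(depth * n_filler)
    return " ".join(filler[:position] + [needle] + filler[position:])

def keyword_baseline(context: str, question: str) -> str:
    """Stand-in for a real model: ignores the question and returns the
    first sentence that mentions 'passcode'."""
    for sentence in context.split(". "):
        if "passcode" in sentence:
            return sentence
    return ""

def niah_accuracy(answer_fn, secret: str, depths, n_filler: int = 1000) -> float:
    """Fraction of needle positions at which the answer contains the secret."""
    needle = f"The passcode is {secret}."
    question = "What is the passcode?"
    hits = 0
    for depth in depths:
        context = build_haystack(needle, n_filler, depth)
        hits += secret in answer_fn(context, question)
    return hits / len(depths)

accuracy = niah_accuracy(keyword_baseline, "7421", depths=[0.0, 0.25, 0.5, 0.75, 1.0])
print(f"accuracy: {accuracy:.2f}")
```

To evaluate a real model, `answer_fn` would wrap a call to that model. The interesting result is how accuracy changes as `n_filler` grows and as the needle moves deeper into the context.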

My take: Google hits the accelerator in the AI race

I’ll be honest: in recent months it seemed like China was leading the race, especially with advances like:

  • DeepSeek and its ultra-efficient MoE (Mixture of Experts)
  • DSA (DeepSeek Sparse Attention)
  • Open-source models rivaling proprietary ones

But this Google paper changes the tone.

Nested Learning is not:

  • A patch
  • An incremental optimization
  • An engineering trick
  • A bigger model

It’s a foundational proposal.

And that matters for three reasons:

1. It’s a path beyond brute force

Until now, the recipe was simple:

More data → more parameters → more GPUs

Nested Learning proposes something different: smarter models, not just bigger ones.

2. It opens the door to models that learn in real-time

Imagine an AI that:

  • Reads today’s news
  • Integrates that knowledge into its memory
  • Adapts without massive retraining

This would be a game changer.

Today, if you want an LLM to know something new, you have to:

  • Retrain it (expensive, slow)
  • Use RAG (limited, fragile)
  • Wait for the next version

With Nested Learning, the model could learn continuously, like we do.

3. It’s a step toward closing the gap with the human brain

Not in size, but in learning dynamics.

The brain doesn’t learn everything at the same speed. Some memories form quickly (where you left your keys); others take years to consolidate (your native language).

Nested Learning is one of the most serious attempts so far to build that structure into an AI model.

What does this mean for the future of LLMs?

It’s too early to say that Nested Learning will be the successor to Transformers.

But in my view it’s the most serious and ambitious proposal since 2017.

If it scales, we could see:

  • LLMs that update in real-time without retraining
  • Smaller but more capable models thanks to efficient memory
  • AI that learns from interaction with users, not just from the initial corpus

And that’s why we’re making room in the series today: because understanding the future of LLMs is also part of understanding the present.


«If 2017 was the year of Transformers, 2025 could be remembered as the year their successor began.»


Reference links

For those who want to dive deeper, here are the direct links to the original material:

🔗 Google Research Blog — research.google

🔗 NeurIPS 2025 paper — arxiv.org

🔗 Analysis article — thedavestack.com


TL;DR

  • Google proposes Nested Learning, a new training paradigm
  • Tackles catastrophic forgetting with a multi-frequency continuum memory
  • In Google’s reported experiments, the experimental architecture Hope outperforms Transformers on long-context tasks
  • It’s the most serious proposal toward adaptive LLMs since «Attention Is All You Need»
  • Could change how we train and deploy AI models