Titans: How Google Is Teaching AI to Actually Remember Things
TL;DR
Google Research introduced Titans, a new architecture family that gives transformers a learnable long-term memory module. Unlike standard transformers that "forget" everything outside their context window, Titans maintain a neural memory that updates its own weights at test time using gradient descent. The architecture has three pieces: short-term attention (like working memory), a neural long-term memory (persistent, gradient-updated), and persistent memory (task-level knowledge). Titans scale to 2M+ token contexts, outperform both standard transformers and modern recurrent models, and represent a fundamentally different way of thinking about memory in neural networks.
1. Introduction: The Amnesia Problem
Here's a thought experiment. Imagine you're having a conversation with a colleague about a complex software project. You discuss the architecture on Monday, the database schema on Tuesday, the API design on Wednesday, and deployment strategy on Thursday. By Friday, when you're making final decisions, you remember all of it. The key constraints, the tradeoffs you discussed, the decisions you made.
Now imagine your colleague forgets everything from Monday through Wednesday. Every morning, they wake up with a fresh slate, only able to recall the last few hours of conversation. That's pretty much how most large language models work today.
Modern transformers (the architecture behind GPT-4, Claude, Gemini, and every other frontier LLM) have a fundamental limitation: they operate within a fixed context window. Whether it's 8K, 128K, or even 1M tokens, there's always a hard boundary. Everything outside that window is gone. Not compressed, not summarized, just gone.
This isn't just a minor inconvenience. It's a fundamental architectural constraint that limits what AI systems can do. Long-running agents that need to remember weeks of interactions, personalized assistants that should learn your preferences over months, systems that need to process entire codebases... all of these bump up against the memory wall.
In late 2024, Google Research published a paper called "Titans: Learning to Memorize at Test Time" that proposes a genuinely novel approach to this problem. Rather than just making the context window bigger (which has quadratic cost), Titans introduce a neural long-term memory: a separate module that learns to remember important information using gradient descent at inference time.
It's one of the most interesting architecture papers I've read recently (pun intended), and it has significant implications for the future of AI systems. Let's dig in.
2. Background: How Transformers Handle Memory Today
Before we understand what Titans does differently, we need to understand the current state of affairs.
The Context Window: A Fixed-Size Buffer
A transformer processes a sequence of tokens by computing attention, where every token looks at every other token to determine what's relevant. This is powerful because it captures direct dependencies between any two positions in the sequence. It's also expensive: the computation is O(N²) where N is the sequence length.
Think of it like a database query that does a full cross-join. Great for accuracy, terrible for scale.
┌─────────────────────────────────────────┐
│      Context Window (e.g., 128K)        │
│                                         │
│ [token₁] ←→ [token₂] ←→ ... ←→ [tokenₙ] │
│     ↕          ↕               ↕        │
│  All tokens attend to all other tokens  │
│        (Quadratic cost: O(N²))          │
│                                         │
│  Everything outside this window?        │
│  Gone.                                  │
└─────────────────────────────────────────┘

The KV Cache: Memory as Append-Only Storage
During inference, transformers maintain a key-value (KV) cache, basically a growing table of key-value pairs from previous tokens. Each new token queries this cache to find relevant context. It's like an append-only log: you keep adding entries, and each new query searches the entire log.
The problem? This cache grows linearly with sequence length, and the attention computation over it grows quadratically. For a 1M token context, you're looking at enormous memory requirements and compute costs.
```python
# Pseudocode: how standard transformer attention works
class StandardAttention:
    def __init__(self):
        self.kv_cache = []  # Grows forever (within the context window)

    def forward(self, new_token):
        q = project_query(new_token)
        k, v = project_key_value(new_token)
        self.kv_cache.append((k, v))  # Append, never compress
        # Attend to the ENTIRE cache: O(N) per token, O(N²) total
        all_keys, all_values = stack(self.kv_cache)
        output = softmax(q @ all_keys.T) @ all_values
        return output
```

RAG: The External Hard Drive Approach
Retrieval-Augmented Generation (RAG) is the most common workaround for limited context. The idea is simple: store documents in a vector database, retrieve relevant chunks at query time, and inject them into the context window. It's like having an external hard drive that you search when needed.
RAG works, but it has real limitations:
- Retrieval quality is a bottleneck. You need to know what to look for before you find it.
- No learning. The retrieval system doesn't get better at remembering what's important for you.
- The context window still constrains you. Retrieved chunks compete for limited context space.
- Latency. Retrieval adds an extra step to every query.
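To make the "external hard drive" picture concrete, here is a minimal sketch of RAG-style retrieval. It is illustrative only: a bag-of-words cosine similarity stands in for learned embeddings, and all names (`ToyRAG`, `embed`, `cosine`) are hypothetical, not from any real RAG library.

```python
from collections import Counter
import math

def embed(text):
    """Toy 'embedding': a term-frequency vector (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class ToyRAG:
    def __init__(self):
        self.chunks = []  # Static store: nothing here ever learns

    def add(self, chunk):
        self.chunks.append((embed(chunk), chunk))

    def retrieve(self, query, k=1):
        """Rank stored chunks by similarity to the query."""
        q = embed(query)
        ranked = sorted(self.chunks, key=lambda ec: cosine(q, ec[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]

rag = ToyRAG()
rag.add("the database schema uses postgres with three tables")
rag.add("deployment runs on kubernetes in us-east-1")
print(rag.retrieve("which database do we use"))
# → ['the database schema uses postgres with three tables']
```

Note how every limitation above is visible even in the toy: retrieval quality depends entirely on the query wording, and the store itself never adapts.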
Linear Attention & Recurrent Models: Compression with Costs
Models like Mamba (state-space models), RWKV, and linear attention variants try a different approach: compress the sequence into a fixed-size hidden state, like an RNN. This gives you O(N) complexity, linear in sequence length. Great for efficiency.
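A minimal sketch of this fixed-state compression, in the spirit of linear attention (shapes, names, and the rank-1 write rule here are my own illustrative choices, not any specific model's):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
S = np.zeros((d, d))                  # the model's entire "memory"

for _ in range(1000):                 # a long input sequence
    k = rng.standard_normal(d)        # key for this token
    v = rng.standard_normal(d)        # value for this token
    S += np.outer(v, k)               # write: rank-1 update, no growth

q = rng.standard_normal(d)
readout = S @ q                       # read: one matvec, whatever the length
print(S.shape, readout.shape)         # → (4, 4) (4,)
```

The appeal and the flaw are the same line of code: `S` never grows, so cost stays O(1) per step, but so does the amount of structure it can faithfully hold.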
The catch? You're trying to compress arbitrarily long sequences into a fixed-size vector or matrix. It's like trying to summarize an entire book into a single paragraph. You inevitably lose information. As the paper puts it:
"On one hand, we use these linear models to enhance scalability and efficiency... whose advantages appear for very long context; On the other hand, a very long context cannot be properly compressed in a small vector-valued or matrix-valued state."
This is the fundamental tension. Transformers are accurate but don't scale. Recurrent models scale but lose information. What if you could have both?
3. The Human Memory Analogy
The key insight behind Titans comes from neuroscience. Human memory isn't a single system. It's a confederation of systems, each serving a different function.
Working Memory (Short-Term)
Your working memory holds what you're actively processing right now. It has limited capacity (the famous "7 ± 2 items") but high precision. When you're reading this sentence, your working memory holds the beginning of the sentence while you process the end.
Analogy in transformers: This is what attention does. Precise modeling of dependencies within a limited window.
Long-Term Memory (Episodic & Semantic)
Your long-term memory stores vast amounts of information (experiences, facts, skills) for extended periods. Crucially, it's not a passive tape recorder. Your brain actively decides what to store based on importance and surprise. You remember your wedding day, not your 847th commute to work.
Analogy in current AI: This doesn't really exist. The KV cache is too short-lived, and model weights are frozen after training.
Meta-Memory
This is your ability to know what you know. To assess whether a piece of information is in your memory and how to retrieve it. It's the system that lets you say "I know I read something about this, let me think..."
Analogy in current AI: Barely exists. Models can't introspect on their own memory state.
Human Memory System:
┌──────────────────────────────────────────────────┐
│ │
│ ┌─────────────┐ ┌───────────────┐ ┌────────┐ │
│ │ Working │ │ Long-Term │ │ Meta- │ │
│ │ Memory │ │ Memory │ │ Memory │ │
│ │ │ │ │ │ │ │
│ │ • Limited │ │ • Vast │ │ • Self │ │
│ │ • Precise │ │ • Selective │ │ aware │ │
│ │ • Active │ │ • Surprise- │ │ • Know │ │
│ │ processing│ │ driven │ │ what │ │
│ │ │ │ • Persistent │ │ you │ │
│ │ │ │ │ │ know │ │
│ └──────┬──────┘ └───────┬───────┘ └───┬────┘ │
│ │ │ │ │
│ └────────┬────────┴──────────────┘ │
│ │ │
│ Interconnected but independent │
│  └──────────────────────────────────────────────────┘

The Titans paper argues that existing architectures implement only one of these systems, and that's why they struggle:
| Architecture | Memory Type | Limitation |
|---|---|---|
| Transformers | Working memory (attention) | Limited context, quadratic cost |
| RNNs/LSTMs | Compressed long-term | Lossy compression, forgets |
| Linear Transformers | Matrix-valued compressed | Same compression problem |
| State-space (Mamba) | Vector-valued compressed | Even more compression |
What if we built an architecture that has all three?
4. Enter Titans: Three Memory Systems Working Together
Titans introduces an architecture with three distinct but interconnected modules, what the paper calls "hyper-heads":
4.1 Core Module: Short-Term Memory (Attention)
This is standard sliding-window attention, but with a deliberately small window. Instead of trying to attend to millions of tokens, the core module only looks at recent context, maybe a few thousand tokens.
Think of it as your RAM: fast, precise, limited capacity. Its job is to handle the immediate dependencies in the current segment of text.
```python
# Core module: standard attention with a small window (pseudocode)
class CoreAttention:
    def forward(self, segment):
        # Only attend within a small local window:
        # fast, precise, but limited scope
        Q, K, V = self.project(segment)
        return sliding_window_attention(Q, K, V, window_size=4096)
```

4.2 Long-Term Neural Memory: The Big Innovation
This is where things get interesting. The long-term memory module is a separate neural network, a small MLP, whose weights are the memory. Here's the key idea:
The memory module's parameters are themselves the stored information. The module learns to memorize by updating its own weights at test time using gradient descent.
Let that sink in. During inference (not training), the memory module is actively learning. It adjusts its weights to store new information as it processes the input sequence. It's a neural network that learns to be a database.

```python
# Long-term neural memory (simplified pseudocode)
class NeuralLongTermMemory:
    def __init__(self):
        self.memory_network = MLP(layers=2)  # The weights ARE the memory

    def write(self, token):
        """Store information by updating weights (at test time!)"""
        k, v = project_kv(token)
        # Compute "surprise": how unexpected is this token?
        prediction = self.memory_network(k)
        surprise = loss(prediction, v)
        # Update memory proportional to surprise
        gradient = compute_gradient(surprise, self.memory_network.params)
        self.memory_network.params -= learning_rate * gradient

    def read(self, query):
        """Retrieve from memory via a forward pass"""
        q = project_query(query)
        return self.memory_network(q)
```

The database analogy: Imagine a key-value store where, instead of explicit INSERT/SELECT operations, the database itself is a neural network. Writing data = adjusting weights. Reading data = a forward pass. The "database" gets smarter over time because it learns what's worth storing.

4.3 Persistent Memory: Task-Level Knowledge
The third component is persistent memory, a set of learnable parameters that are not data-dependent. These are learned during training and stay fixed at inference time. They encode general knowledge about the task itself.

```python
# Persistent memory: learned during training, fixed at inference (pseudocode)
class PersistentMemory:
    def __init__(self):
        # Learnable parameters, like extra "virtual tokens"
        self.memory_params = nn.Parameter(torch.randn(num_slots, dim))

    def read(self, query):
        # Always available, independent of input data
        return attention(query, self.memory_params, self.memory_params)
```

Think of it as your procedural memory. You don't need to re-learn how to ride a bike every time you get on one. Some knowledge is just baked in.
Putting It All Together
Input Sequence: [x₁, x₂, x₃, ..., x₁₀₀₀₀₀₀]
│
┌────────┴────────┐
│ Split into │
│ segments │
└────────┬────────┘
│
┌──────────────┼──────────────┐
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌──────────┐ ┌──────────────┐
│ Core │ │ Neural │ │ Persistent │
│ Attention │ │ Long-Term│ │ Memory │
│ (window=4K) │ │ Memory │ │ (learned │
│ │ │ (gradient│ │ params) │
│ Short-term │ │ updated │ │ │
│ precise │ │ at test │ │ Task-level │
│ dependencies│ │ time) │ │ knowledge │
└──────┬──────┘ └────┬─────┘ └──────┬───────┘
│ │ │
└──────────────┼──────────────┘
│
┌────┴────┐
│ Combine │
└────┬────┘
│
Output

5. Technical Deep Dive: How the Neural Memory Actually Works
5.1 Surprise-Based Learning
The most elegant part of Titans' memory is its surprise-based writing mechanism. Inspired by how human brains preferentially encode surprising events, the memory module pays more attention to unexpected inputs.
Here's the intuition: if you show the memory a token it already "knows" (can predict well), there's no need to update. But if a token is surprising (the memory's prediction is very wrong), it should be stored.
The "surprise" is measured as the loss of the memory network on a given input:
surprise(x) = || MemoryNetwork(k_x) - v_x ||²

where k_x and v_x are the key and value projections of the input. High loss = high surprise = write this to memory.
The memory update rule is then simply gradient descent on this surprise loss:
θ_new = θ_old - α · ∇_θ surprise(x)

This is remarkable because it means the memory is being trained at test time. The outer model (trained during pre-training) has learned how to learn. It's a meta-learning system. The pre-training phase teaches the model what constitutes good memory; the test-time gradient descent does the actual memorizing.
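The write rule can be exercised on a toy memory. This is a minimal sketch, not the paper's exact setup: the "memory" is just a linear map `W`, surprise is squared error, and each write is one gradient step performed at inference time.

```python
import numpy as np

W = np.zeros((4, 4))                 # memory parameters (start empty)
lr = 0.5                             # test-time learning rate

def write(W, k, v, lr):
    """One surprise-driven update: gradient step on 0.5·||W k − v||²."""
    surprise_grad = np.outer(W @ k - v, k)
    return W - lr * surprise_grad

k = np.array([1.0, 0.0, 0.0, 0.0])   # key (one-hot for readability)
v = np.array([0.0, 2.0, 0.0, 0.0])   # value to store

for _ in range(8):                   # a few surprise-driven writes
    W = write(W, k, v, lr)

print(W @ k)                         # read back: ≈ the stored value [0, 2, 0, 0]
```

After each write the prediction error (the surprise) shrinks, so repeated presentations of the same pair converge toward a stored association, and an already-known pair produces near-zero updates.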
5.2 Memory Decay: Forgetting on Purpose
Just like human memory, Titans implements forgetting. Not everything should be remembered forever. The decay mechanism works like weight decay in optimization:
```python
# Memory update with surprise and decay (pseudocode)
def update_memory(memory_params, token, decay_rate):
    k, v = project(token)
    # Forward pass: what does memory predict?
    prediction = memory_forward(memory_params, k)
    # Surprise = prediction error
    surprise = mse_loss(prediction, v)
    # Gradient of surprise w.r.t. memory parameters
    grad = compute_gradient(surprise, memory_params)
    # Update: learn from surprise, but also decay old memories
    memory_params = (1 - decay_rate) * memory_params - lr * grad
    return memory_params
```

The paper shows that this decay mechanism is actually a generalization of the forgetting gates used in modern recurrent models like Mamba and RWKV. Those models use a scalar or vector gate to control forgetting; Titans' approach covers all of those as special cases.
What's particularly clever is that the decay rate is adaptive. It depends on both the memory's current capacity utilization and the surprise level of incoming data. When memory is "full" and new data is highly surprising, more aggressive forgetting kicks in.
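The generalization claim can be illustrated with a minimal numerical check (my own toy, with illustrative names): a classic scalar forgetting gate falls out of Titans-style weight decay for a suitable choice of decay rate and gradient.

```python
import numpy as np

def gated_update(state, new_info, gate):
    """Recurrent-style forgetting gate: scalar blend of old and new."""
    return gate * state + (1 - gate) * new_info

def titans_style_update(params, grad, decay_rate, lr):
    """Titans decay: shrink the whole memory, then apply the surprise grad."""
    return (1 - decay_rate) * params - lr * grad

state = np.ones(3)
new_info = np.array([0.5, -0.2, 0.3])
g = 0.9  # gate value

a = gated_update(state, new_info, g)
b = titans_style_update(state, grad=-(1 - g) * new_info, decay_rate=1 - g, lr=1.0)
print(np.allclose(a, b))  # → True
```

The converse does not hold: the decay rate in Titans can depend on the data and applies to a full weight matrix, which a single scalar gate cannot express.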
5.3 The Momentum Connection
The authors make a beautiful mathematical observation: the complete memory update rule (with surprise, decay, and a smoothing term) is equivalent to mini-batch SGD with momentum and weight decay.
Memory update:  θ_t = (1-α)·θ_{t-1} - η·g_t + β·m_{t-1}

This is exactly SGD with weight decay (α), learning rate (η), and momentum (β) on the surprise loss.

This equivalence isn't just elegant, it's practical. Because mini-batch gradient descent can be parallelized (processing multiple tokens simultaneously in a batch), the memory module can be trained efficiently using tensor operations. The paper leverages this to create a fast, parallelizable training algorithm.
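Here is a numerical check of why the SGD view buys parallelism (toy scale, shapes my own): for a linear memory `W`, the per-token surprise gradients are outer products (W kᵢ − vᵢ) kᵢᵀ, so summing them over a mini-batch collapses into a single matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))    # linear memory parameters
K = rng.standard_normal((16, 8))   # 16 keys in a mini-batch
V = rng.standard_normal((16, 8))   # 16 values

# Sequential: accumulate one outer-product gradient per token
grad_seq = np.zeros_like(W)
for k, v in zip(K, V):
    grad_seq += np.outer(W @ k - v, k)

# Parallel: the same sum expressed as one matmul over the whole batch
grad_par = (K @ W.T - V).T @ K

print(np.allclose(grad_seq, grad_par))  # → True
```

The sequential loop and the batched matmul compute the same gradient, but only the latter maps onto accelerator-friendly tensor operations.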
5.4 Deep Memory: More Than a Single Layer
A crucial question the paper addresses: does the memory need to be deep (multiple layers)? The answer is yes. A single-layer linear memory (like what linear attention uses) is limited in what it can store. By making the memory a multi-layer MLP, Titans can store more complex, nonlinear patterns.
Think of it this way: a single-layer memory is like a flat key-value store. A deep memory is like a relational database. It can capture complex relationships between stored items.
The experiments show consistent improvements from 1-layer to 2-layer memory modules, with diminishing returns beyond 2-3 layers.
6. Architecture Variants: Three Ways to Wire Memory
The paper presents three variants of Titans, differing in how the long-term memory is integrated with the core attention:
MAL:  Input → [Neural Memory Layer] → [Attention Layer] → Output
                      ↓ ↑
                Read/Write to
               long-term memory

MAC:  Input ──────────────────┐
                              ├──→ [Attention over both] → Output
      Memory Output ──────────┘
       (retrieved context)

MAG:  Input ─┬─→ [Attention] ──┐
             │                 ├──→ Gate ──→ Output
             └─→ [Memory] ─────┘

MAL: Memory as a Layer
In MAL, the long-term memory operates as a pre-processing layer. The input first passes through the memory module (which reads from and writes to memory), and the enriched representation is then fed to the attention layer.
Software analogy: Think of it like a middleware layer in a web application. Requests pass through the memory layer first, which enriches them with historical context, before reaching the main processing logic.
MAL is the simplest variant. The memory sees every token and processes the sequence as a recurrent module, then hands off its output (enriched with historical context) to a standard attention layer.
MAC: Memory as a Context
In MAC, the memory module's output is concatenated with the input to the attention layer. The attention can then attend to both the current segment and the memory's retrieved context.
Software analogy: Like a database JOIN. The attention layer gets the current data plus relevant historical data retrieved from the memory, and can freely combine them.
MAC is the most powerful variant because the attention mechanism can explicitly decide how to weight current context vs. historical memory.
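A MAC-style combination can be sketched as follows (shapes are illustrative and projections/heads are omitted; this is my simplification, not the paper's exact formulation): retrieved memory tokens and persistent slots are prepended to the current segment, and attention runs over the concatenation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

rng = np.random.default_rng(0)
d = 8
segment = rng.standard_normal((16, d))      # current tokens
mem_tokens = rng.standard_normal((4, d))    # retrieved long-term context
persistent = rng.standard_normal((2, d))    # persistent memory slots

# The "JOIN": attention sees current data plus historical/task context
context = np.concatenate([persistent, mem_tokens, segment])  # (22, d)
scores = softmax(segment @ context.T / np.sqrt(d))           # (16, 22)
out = scores @ context                                       # (16, d)
print(out.shape)  # → (16, 8)
```

Because memory tokens sit in the attended context, each attention weight directly expresses "how much do I trust memory vs. the current segment" for that query.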
MAG: Memory as a Gate
In MAG, the memory module runs in parallel with the attention layer, and their outputs are combined via a learned gating mechanism.
Software analogy: Like a load balancer that dynamically routes between a cache (memory) and the database (attention) based on the query type.
The gating mechanism learns when to rely more on recent context (attention) vs. historical memory, making MAG particularly flexible.
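A MAG-style gate can be sketched like this (the gate parameterization and shapes are my assumptions, for illustration): the two branches run in parallel and a learned sigmoid gate blends them per dimension.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d = 8
attn_out = rng.standard_normal(d)         # short-term (attention) branch
mem_out = rng.standard_normal(d)          # long-term (memory) branch
W_gate = rng.standard_normal((2 * d, d))  # gate weights, learned in training

# Gate looks at both branches, then blends them per dimension, in (0, 1)
gate = sigmoid(np.concatenate([attn_out, mem_out]) @ W_gate)
output = gate * attn_out + (1 - gate) * mem_out
print(output.shape)  # → (8,)
```

Since the gate is a function of both branch outputs, the model can route queries that need fresh context through attention and queries that need history through memory, token by token.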
Comparison
| Variant | How Memory Integrates | Strengths | Best For |
|---|---|---|---|
| MAL | Sequential (memory → attention) | Simple, efficient | General-purpose |
| MAC | Concatenated context | Most flexible attention | Tasks needing explicit memory access |
| MAG | Parallel with gating | Dynamic weighting | Tasks with varying memory needs |
7. Results & Benchmarks
The experimental results are comprehensive and impressive. Here are the highlights:
Language Modeling
On standard language modeling benchmarks, Titans consistently outperform:
- All modern linear recurrent models (Mamba, RWKV, GLA, DeltaNet)
- Hybrid models (recurrent + sliding window attention combinations)
- Standard transformers with the same context window
The MAC variant typically performs best, followed by MAG and MAL.
Needle-in-a-Haystack
This is where Titans really shine. The needle-in-a-haystack test hides a specific piece of information ("the needle") inside a very long context ("the haystack") and tests whether the model can retrieve it.
Titans scale to over 2 million tokens while maintaining high retrieval accuracy. Standard transformers either can't handle this length or show significant degradation. Linear recurrent models struggle because they compress everything into a fixed-size state, making it hard to retrieve specific details.
Common-Sense Reasoning
On benchmarks like HellaSwag, PIQA, ARC, and WinoGrande, Titans show improvements over baselines, particularly on tasks that benefit from broader context understanding.
Time Series & Genomics
The architecture isn't limited to NLP. Titans also show strong results on:
- Time series forecasting, where long-term patterns matter
- DNA modeling, where sequences are extremely long and patterns span thousands of base pairs
Key Takeaway
The results demonstrate that Titans achieve a remarkable balance: the accuracy of full attention with the scalability of recurrent models. They don't just match transformers. They often exceed them while being able to handle sequences that are orders of magnitude longer.
8. Comparison with Other Approaches
Titans aren't the first attempt to augment transformers with memory. Here's how they compare:
Memorizing Transformers (2022)
Google's earlier work added a non-differentiable external memory (basically a k-nearest-neighbor lookup over past key-value pairs). The memory is a static cache, and no learning happens at retrieval time.
vs. Titans: Memorizing Transformers treat memory as passive storage. Titans' memory actively learns, adapting what and how it stores based on surprise signals.
Infini-attention (2024)
Also from Google, Infini-attention compresses past context into a fixed-size memory using linear attention updates, combining it with standard attention for the current segment.
vs. Titans: Infini-attention uses a linear compression scheme. Titans use a deep neural network as memory, which can capture more complex patterns. Titans' surprise-based writing is also more selective.
State-Space Models (Mamba, S4)
SSMs process sequences with O(N) complexity using clever recurrence structures. Mamba adds selectivity, the ability to filter information based on input.
vs. Titans: SSMs compress into a fixed-size state with no mechanism for selective, surprise-based storage. Titans keep the precision of attention for recent context while offloading long-term storage to a separate, deeper module.
RWKV
RWKV combines RNN-style recurrence with attention-like mechanisms, achieving linear complexity.
vs. Titans: Similar to the SSM comparison. RWKV compresses all history into a fixed-size state. Titans maintain a richer, learnable memory.
Summary
| Approach | Precision (recall) | Scalability (context) | Learning at test time? | Memory depth |
|---|---|---|---|---|
| Transformer | ★★★★★ | ★★☆☆☆ | No | N/A |
| Mamba/SSM | ★★★☆☆ | ★★★★★ | No | Single |
| Memorizing Trans. | ★★★★☆ | ★★★★☆ | No | N/A |
| Infini-attention | ★★★★☆ | ★★★★☆ | Partially | Single |
| RWKV | ★★★☆☆ | ★★★★★ | No | Single |
| Titans | ★★★★★ | ★★★★★ | Yes | Deep |

9. Practical Implications
For AI Agent Developers
Titans' architecture has obvious implications for long-running AI agents. Today's agents typically rely on external memory systems (databases, vector stores) to maintain context across sessions. With Titans-style memory, an agent could natively maintain a persistent understanding of its task, environment, and user, without any external plumbing.
Imagine a coding agent that remembers your entire codebase architecture, your coding style, the bugs it helped fix last month, and the design decisions you explained. All natively, without a vector database.
For Personalized AI
Current personalization approaches involve fine-tuning (expensive, static) or RAG over user history (fragile, requires retrieval engineering). A Titans-style model could naturally accumulate user preferences and patterns in its long-term memory during interaction. Truly adaptive personalization.
For Enterprise Applications
Processing entire document repositories, maintaining context across long business workflows, understanding multi-day email threads... all of these become more natural with architectures that can actually remember.
For Continuous Learning
Perhaps the most profound implication: Titans represent a step toward models that learn after deployment. The test-time gradient descent on the memory module is a form of online learning. While limited to the memory module's parameters, it's a meaningful departure from the "train once, deploy frozen" paradigm.
10. Limitations & Open Questions
Compute Cost of Test-Time Learning
Running gradient descent at inference time is not free. Each token requires a forward pass through the memory, a loss computation, a backward pass, and a parameter update. While the memory network is small (a few layers), this adds overhead compared to pure attention or pure recurrence.
The paper addresses this with parallelizable mini-batch training, but in practice, the tradeoff between memory quality and inference speed needs careful tuning.
Memory Consistency
When the memory network's weights are being updated continuously, there's a risk of catastrophic forgetting, where new information overwrites old but still-relevant information. The decay mechanism helps, but doesn't fully solve this for very long sequences with complex, interdependent facts.
Scalability of Deep Memory
The experiments use relatively shallow memory networks (1-2 layers). How well does this approach scale to much deeper memory modules, or to storing truly massive amounts of information (millions of facts)? This remains an open question.
Interpretability
What is the memory actually storing? Unlike an explicit key-value cache, the knowledge in a neural memory is distributed across weights, making it difficult to inspect, debug, or audit. For applications requiring transparency (healthcare, finance, legal), this opacity is a real problem.
Training Stability
Training a model that contains a meta-learning inner loop (the test-time memory updates) within an outer training loop adds complexity. The paper demonstrates that it works, but scaling this to frontier-model sizes (hundreds of billions of parameters) may present stability challenges.
Implementation Status
When the paper was published in late 2024, no production-ready implementation existed. Since then, the community has made progress. There are several open-source research implementations on GitHub, and a feature request on Hugging Face Transformers to integrate Titans natively. However, as of early 2026, no major model provider has shipped a production model using the full Titans architecture. The gap between research paper and production deployment remains significant, requiring optimized CUDA kernels, distributed training support, and integration with existing serving infrastructure.
11. The Road Ahead
When Might We See This in Production?
Given that Titans comes from Google Research, it's reasonable to expect that some form of this technology could appear in Google's products (Gemini, etc.) within the next year or two. The key bottlenecks are:
- Efficient implementation. The test-time gradient descent needs highly optimized kernels to be practical at scale.
- Scaling validation. The paper demonstrates results at relatively modest scale; proving it works for 100B+ parameter models is essential.
- Integration with existing infrastructure. Production LLM serving stacks (vLLM, TensorRT-LLM, etc.) would need fundamental changes to support dynamic weight updates.
What Needs to Happen
- Hardware support. Test-time gradient descent may benefit from specialized hardware or memory architectures.
- Hybrid approaches. We'll likely see Titans-style memory combined with other techniques (RAG, longer context windows) rather than replacing them entirely.
- Standardization. The community needs to converge on benchmarks specifically for long-term memory evaluation.
- Safety research. A model that learns at test time raises new safety considerations. What if it memorizes harmful patterns?
The Bigger Picture
Titans represent a philosophical shift in how we think about neural network memory. Instead of the binary choice between "attend to everything" (transformers) and "compress everything" (recurrent models), they introduce a third option: learn what to remember.
This is closer to how biological intelligence works. Your brain doesn't keep a verbatim transcript of everything you've ever experienced, and it doesn't throw away everything older than a few minutes. It selectively encodes, organizes, and retrieves based on relevance, surprise, and importance.
If the Titans approach, or something inspired by it, becomes mainstream, we may look back at this period as the time when AI started to truly remember.
References
- Behrouz, A., et al. "Titans: Learning to Memorize at Test Time." arXiv:2501.00663, December 2024. https://arxiv.org/abs/2501.00663
- Vaswani, A., et al. "Attention Is All You Need." NeurIPS, 2017.
- Gu, A. & Dao, T. "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv:2312.00752, 2023.
- Katharopoulos, A., et al. "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention." ICML, 2020.
- Wu, Y., et al. "Memorizing Transformers." ICLR, 2022.
- Munkhdalai, T., et al. "Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention." arXiv:2404.07143, 2024.
- Peng, B., et al. "RWKV: Reinventing RNNs for the Transformer Era." EMNLP, 2023.
- Squire, L.R. "Memory systems of the brain: A brief history and current perspective." Neurobiology of Learning and Memory, 2004.
- Hochreiter, S. & Schmidhuber, J. "Long Short-Term Memory." Neural Computation, 1997.
- Hopfield, J.J. "Neural networks and physical systems with emergent collective computational abilities." PNAS, 1982.