Beyond Bigger Context Windows: How RLM and Phase Handoff Solve Different Halves of the Same Problem

MIT’s RLM and our Phase Handoff solve two sides of the same problem.

Context windows have grown 250x in three years, from 4K tokens to over 1M. The assumption behind that expansion is straightforward: if the model can see more, it can do more. In practice, the opposite holds past a certain threshold: giving a model more input context doesn’t make it smarter. It makes it worse.

The MIT RLM paper showed that models can reason over inputs two orders of magnitude beyond their native context window, not by expanding the window, but by teaching the model to strategically sample it. At Blueshift, we built Phase Handoff to prevent context from accumulating across multi-step agent workflows, not by summarizing, but by giving the agent control over what it carries forward.

When we first read the RLM paper, the reaction was “they’re solving the other half of the problem.” Recursive Language Models tackle inputs too large to reason over in one pass. Phase Handoff tackles outputs that pile up across many passes. Together, they cover the full surface area of context degradation. This post explains what that means in practice.

The Shared Diagnosis: Context Rot Is Real

The numbers are worth stating plainly. At 200K tokens, with a 2K token current task, roughly 1% of the model’s attention budget goes to the thing you actually need it to do. We call this context rot. The RLM authors describe the same phenomenon: “linear memory costs, attention degradation at extreme lengths, and the inability to revisit or reorganize information once consumed”.

Both approaches reject the standard workarounds. Bigger windows delay the problem without solving it. Summarization is lossy. RAG struggles with reasoning continuity. The critical insight is the same in both: the model needs to control its own cognitive load.

RLM: Teaching Models to Read

Recursive Language Models address what we’d call the reading problem: how does a model reason effectively over an input that exceeds its ability to process in a single pass?

The mechanism is elegant. Instead of stuffing a massive document into the prompt, RLM stores it as a Python variable in an external REPL environment. The model receives only its query and instructions, not the 500K token dataset. It then writes code to strategically explore the context: peeking at slices, searching with regex, filtering for relevance. When it finds a relevant section, it can call itself recursively on that smaller chunk.
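The loop can be sketched in a few lines of hypothetical Python. Everything here is illustrative, not the paper’s actual API: the `llm` stub stands in for any chat-completion call, and `re.escape` stands in for the search code a real RLM writes for itself.

```python
import re

def llm(prompt: str) -> str:
    """Stub standing in for any chat-completion call."""
    return prompt[:200]

def rlm_answer(query: str, context: str, max_direct: int = 8_000) -> str:
    """Answer a query over a context that may be far too large for one pass.

    The full context stays in a Python variable; each call only ever reads
    small slices of it, mirroring RLM's REPL-based exploration.
    """
    if len(context) <= max_direct:
        # Small enough to reason over directly.
        return llm(f"Answer {query!r} using:\n{context}")

    # A real RLM writes its own search code here; re.escape is a
    # deterministic stand-in for a model-generated pattern.
    hits = [m.start() for m in re.finditer(re.escape(query), context)]

    # Recurse on a window around each hit, then synthesize the partials.
    partials = [rlm_answer(query, context[max(0, i - 500): i + 4_000])
                for i in hits[:5]]
    return llm(f"Combine these partial answers to {query!r}:\n" + "\n".join(partials))
```

The key property is visible in the structure: no single call ever receives more than a few thousand characters, no matter how large `context` grows.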

The performance improvements are substantial. RLM handles inputs up to two orders of magnitude beyond a model’s native context window. Even on shorter inputs, it outperforms vanilla frontier models on long-context tasks. The post-trained RLM-Qwen3-8B outperforms its base model by 28.3% on average. In benchmarks, RLM uses ~2-3K tokens per query versus 95K+ for traditional approaches, while maintaining or improving answer quality.

This is genuinely impressive for tasks where the bottleneck is ingestion: multi-hop question answering over large document sets, searching through hundreds of pages of financial filings, finding patterns in massive product catalogs, extracting structured data from enormous corpora. The model can’t hold enough input to reason over it effectively in one pass, and RLM gives it a way to be strategic about what it looks at.

Phase Handoff: Teaching Agents to Do

Phase Handoff addresses a different problem: how does an agent sustain coherent reasoning across a multi-step execution workflow that generates context as a byproduct?

Consider what our marketing agents do: analyze campaign performance, build an audience segment, design an email template, configure triggers, and launch a campaign. Each phase calls multiple tools, each tool returns 50 to 200KB of JSON, and the agent carries all of it forward into the next phase. By phase four, the agent is sitting on 200K+ tokens of accumulated context, most of it irrelevant to the current task.

The bottleneck here isn’t ingestion. No single tool output exceeds the model’s ability to reason in one pass. The problem is accumulation: five phases of tool outputs, intermediate reasoning, and superseded data piling up and drowning the current task in noise.

Phase Handoff gives the agent a structured mechanism to manage this. When the agent finishes a phase, it calls a phase_handoff tool that triggers a context fold:

  1. Tool outputs are cleared. 150K of raw JSON from the analysis phase, gone.
  2. New tools are loaded. Segment building tools replace analysis tools.
  3. Artifacts are preserved. The agent selects key findings worth carrying forward (~2KB of compressed semantic content).
  4. A journal entry is created. The transition is logged for audit but hidden from active context.
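In code, the fold might look roughly like this. This is a simplified sketch with illustrative field names, not our production implementation:

```python
from dataclasses import dataclass, field

@dataclass
class AgentContext:
    tool_outputs: list = field(default_factory=list)   # Transient Memory (Tier 1)
    active_tools: list = field(default_factory=list)
    artifacts: list = field(default_factory=list)      # Root Memory (Tier 3)
    journal: list = field(default_factory=list)        # audit log, hidden from prompt

def phase_handoff(ctx: AgentContext, next_tools: list, keep: list) -> AgentContext:
    """Fold context at a phase boundary: log, clear, swap, preserve."""
    # 4. Journal entry: the transition is logged for audit, not for the prompt.
    ctx.journal.append({
        "dropped_outputs": len(ctx.tool_outputs),
        "kept_artifacts": list(keep),
    })
    ctx.tool_outputs.clear()        # 1. Raw JSON from the finished phase: gone.
    ctx.active_tools = next_tools   # 2. Tools for the next phase replace the old set.
    ctx.artifacts.extend(keep)      # 3. ~2KB of distilled findings carry forward.
    return ctx
```

After the analysis phase, for example, the agent might call `phase_handoff(ctx, ["build_segment"], keep=["lapsed VIPs, 4.2% CVR"])`, dropping 150K of raw JSON while the one-line finding rides forward.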

This sits within a broader three-tier memory architecture: Transient Memory (tool outputs, cleared per API call), Message Memory (operational state, scoped to a single user turn), and Root Memory (curated artifacts, survives the entire conversation). Resource pointers keep references to created objects at ~50 bytes instead of embedding 50KB payloads in context.

The result: context oscillates between 60K and 150K tokens instead of growing monotonically. Task completion went from 34% to 89% on 5+ phase workflows. Cost dropped 65%.

Reading vs. Doing: Why the Distinction Matters

The difference between RLM and Phase Handoff isn’t just architectural. It reflects two fundamentally different failure modes of language models under pressure.

The reading failure happens when a single input overwhelms the model’s attention. A 500K token document fed into a 200K context window, or even a 1M window, suffers from attention degradation. The model can’t effectively reason over all of it in one pass. Important details get lost in the noise. RLM fixes this by letting the model strategically sample and recurse, never processing more than it can handle in any single call.

The doing failure happens when many individually manageable inputs accumulate across steps. Each tool call returns a reasonable 100KB of data. The model handles it fine. But by the fifth phase, the model is carrying 500KB of prior tool outputs that are no longer relevant, and the current task is competing with all of them for attention. Phase Handoff fixes this by clearing completed phase data and preserving only the semantic distillation the agent selects.

Here’s a concrete example that illustrates both failures in one workflow:

Phase 1 (reading problem): Agent needs to analyze 500K tokens of campaign performance data across 50 campaigns to identify the highest opportunity segment.

Phases 2 through 5 (doing problem): Agent takes the insight from Phase 1 and executes: builds a segment, designs an email template, configures triggers, and launches the campaign, each phase generating 50K to 150K tokens of tool outputs.

RLM would excel at Phase 1. The campaign data is too large to reason over in one pass, so the model writes code to search for conversion rates, filter by segment, and recursively drill into the top performers.

But RLM can’t solve what happens next. Its core mechanism is a read loop: store data as a variable, write code to explore it, recurse on relevant chunks, converge on an answer. Execution phases don’t work that way. Building a segment requires calling external APIs with side effects, not querying a stored dataset.

The REPL pattern assumes the work product is information extraction, not system mutation. RLM also has no concept of lifecycle management: no tool swapping between phases, no artifact selection, no mechanism for deciding what to carry forward versus discard. And its recursive calls are stateless by design, which is ideal for independent analysis sub-tasks but breaks down when Phase 3 needs to reference the segment created in Phase 2.

You could extend RLM to store tool output histories as REPL variables and extract relevant findings before each step, but at that point you’re reinventing Phase Handoff’s artifact compression inside the REPL pattern.

Phase Handoff would excel at Phases 2 through 5, folding context between each phase and keeping the agent in its cognitive sweet spot. But it doesn’t help within Phase 1 if the single analytics payload is too large for effective reasoning in one pass.

The two approaches aren’t interchangeable. You need both.

Where the Two Approaches Converge

Despite different mechanisms, the philosophical overlap is striking. We’ve identified four principles that both approaches share:

Agent autonomy over context. Both RLM and Phase Handoff give the model control over what it processes. RLM lets the model write code to select which parts of an input to examine. Phase Handoff lets the agent decide which findings to preserve as artifacts. Neither relies on framework imposed summarization or mechanical pruning. The model’s semantic understanding drives the decisions.

Smaller effective context at each step. RLM ensures the model only sees ~2-3K tokens of relevant context per recursive call, even when the total input is 500K+. Phase Handoff ensures the agent never exceeds ~150K tokens of total context, even across a 5+ phase workflow. Both achieve better results by processing less at any given moment, the opposite of the “bigger window” approach.

Lossless intent, lossy representation. Neither approach claims to be lossless in the information theoretic sense. RLM might miss a relevant section during its coded exploration. Phase Handoff’s artifacts are a compression of the full analysis. But both preserve the intent and semantic meaning far better than mechanical summarization, because the model is making informed decisions about what matters.

Composability with existing infrastructure. RLM works with any LLM via a REPL wrapper. Phase Handoff works within a standard tool-calling agent loop. Neither requires changes to the underlying model. Both are orchestration-layer innovations that make existing models more effective.

A Combined Architecture

Here is what a combined architecture looks like.

Analysis phases use RLM-style decomposition. When the agent needs to reason over a large dataset (campaign analytics, customer event histories, product catalog exploration), the context lives as a variable in a code sandbox. The model writes targeted queries, filters, and recursive calls to extract exactly the insights it needs, without ever loading the full dataset into its prompt.

Execution phases use Phase Handoff. Once the agent has its insights and moves into creation mode (building segments, designing templates, configuring campaigns), Phase Handoff manages the lifecycle. Tool outputs are cleared between phases. Artifacts carry forward the semantic thread. New tools are loaded as needed.

Transient Memory is the natural integration point. In our three-tier architecture, Transient Memory (Tier 1) is where large tool outputs land before being processed and cleared. An RLM-style mechanism would live within Transient Memory, giving the agent a way to handle oversized individual tool outputs that exceed effective single-pass reasoning.

Concretely: when a tool returns 500K tokens of campaign analytics, instead of embedding that payload in the prompt, the REPL variable becomes the Transient Memory holder. The model uses RLM-style recursive exploration to extract insights within that tier; those insights get promoted to Root Memory as artifacts, and the raw data is cleared. It is the same Transient Memory lifecycle as any other, with a smarter extraction step in between.

Resource pointers bridge both worlds. Whether the agent discovers a key resource via RLM-style analysis or creates one during a Phase Handoff execution phase, the pointer pattern keeps references lightweight. If RLM-driven analysis identifies an existing segment worth targeting, the framework stores a pointer to that segment (resource(seg-456)) the same way it would for agent-created resources. The full object lives in the database; context carries only a ~50-byte symbolic reference. The mechanism is the same regardless of whether the resource was found or built.
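A few lines are enough to sketch the pattern. The in-memory dict is a stand-in for the real database, and the function names are illustrative:

```python
RESOURCE_STORE: dict = {}  # stand-in for the real database

def store_resource(resource_id: str, payload: dict) -> str:
    """Persist the full object out of band; return a tiny symbolic reference."""
    RESOURCE_STORE[resource_id] = payload    # the full 50KB object lives here
    return f"resource({resource_id})"        # only this string enters context

def resolve(pointer: str) -> dict:
    """Dereference a pointer when a later phase needs the actual object."""
    resource_id = pointer[len("resource("):-1]
    return RESOURCE_STORE[resource_id]
```

The asymmetry is the point: writes are heavy and happen once, while the reference that circulates through every subsequent phase stays under 50 bytes.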

The combined flow for a marketing workflow might look like this:
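Sketched in hypothetical Python, with illustrative phase names and per-phase token counts rather than measurements:

```python
# Each entry: (phase, peak in-context tokens while that phase runs).
phases = [
    ("analyze (RLM over 500K payload)",  3_000),   # recursive calls see ~2-3K each
    ("build segment",                   90_000),   # outputs accumulate within a phase...
    ("design email template",          120_000),
    ("configure triggers",             110_000),
    ("launch campaign",                 80_000),
]

ARTIFACT_BUDGET = 2_000   # ~2KB of distilled findings survive each fold
CEILING = 150_000         # the oscillation ceiling we see in production

for name, peak in phases:
    # ...then the fold clears them, so the next phase starts near-empty.
    assert peak + ARTIFACT_BUDGET <= CEILING
    print(f"{name}: peak ~{peak:,} tokens, folds back to ~{ARTIFACT_BUDGET:,}")
```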

Total context never exceeds 150K at any phase. The 500K analytics dataset is handled effectively through decomposition. No information is lost that the agent deemed important. The full audit trail is preserved in journal entries.

What This Means for Agentic Software

We see context management following the trajectory that compute management followed in cloud infrastructure. First, everyone tried to solve it with more resources (bigger context windows, like bigger servers). Then frameworks emerged to manage resources more intelligently (orchestration layers, like Kubernetes). Now we’re entering the era of self-managing systems, where the agent itself decides how to allocate its cognitive resources.

RLM and Phase Handoff represent two facets of this shift. RLM tackles input side context management: how to reason over data that’s too large to ingest at once. Phase Handoff tackles lifecycle context management: how to sustain coherence across workflows that generate context as a byproduct.

The RLM paper demonstrates a genuine paradigm shift in how models interact with large inputs. And we’ve seen firsthand that lifecycle context management is the difference between a 34% and 89% task completion rate on real workflows. Production agents will need both. The “reading” problem doesn’t go away just because you solve the “doing” problem, and vice versa. Teams treating context as a first-class architectural concern now will have a structural advantage over those who bolt it on later.

We’re building toward this combined architecture in our Compass and Launchpad agents, which first understand your brand’s marketing program and then partner with you in executing it.

This is Part 2 of our series on agent architecture. Part 1, “Why We Built Our Own Agent Framework,” covers the context rot problem, Phase Handoff, three-tier memory, and how we compare to LangGraph, AutoGen, and CrewAI.

Blueshift’s agent framework powers Compass and Launchpad, AI agents for enterprise marketing automation. Learn more.

Written by:

Mehul Shah

Co-Founder and CTO

Mehul Shah is the co-founder and CTO of Blueshift, specializing in real-time data, AI, and scalable marketing systems. He focuses on building technology that enables personalized customer engagement at scale.