
A Lens, Not a Bucket
LLMs are remarkably capable. The real opportunity in front of product builders now isn’t making them answer better questions—it’s unlocking long‑horizon work.
Not single turns, but sustained execution across hours and domains. Hand an agent a goal, let it plan, act, adapt, and return with results.
At Blueshift, that meant an agent that could operate across our entire marketing automation surface: Segments, Templates, Campaigns, Analytics, Recommendations, Ideation. Six domains. ~90 tools. Tens of billions of emails a year.
If that’s the goal, you hit a wall: the context limit.
The way through is a reframing you have to internalize. A context window isn’t a bucket you fill—it’s a lens you point.
Today’s models don’t hold 10 million tokens. They move through them—like scrolling through this page. But even that analogy flatters the model.
What actually happens is closer to blinking: isolated flashes of attention, each one stateless unless you deliberately carry something forward.
When you scroll, you persist. Models don’t. Without explicit traces, each phase begins as a new actor that happens to share the same name.
At roughly 200k tokens per phase, 10 million tokens means ~50 phases—fifty separate model calls that need to behave like one coherent execution. At that scale, the problem isn’t capacity. It’s continuity. You need a way to carry forward what was done, what still matters, and why—without dragging everything along.
We call that mechanism PhaseHandoff. This post describes how it works—and why it makes long‑horizon agentic work possible.

The Ground We’re Standing On
Those ~90 tools have rich specs—decision trees, examples, parameter mappings—summing to ~350,000 tokens compressed. That alone doesn’t fit in context. Tool paging isn’t an optimization; it’s mandatory.
But tool specs are the easy part.
The real challenge is long-horizon execution—what the agent does. A question like “Analyze our entire email program” isn’t a prompt. It’s an unbounded traversal: campaign history, segment logic, template variants, performance data, user attributes. You don’t load that. You move through it.
Even simple tasks compound fast. Verifying one email template means rendering it across user profiles—exercising conditional logic, recommendation paths, layout variations. Each render cycle burns ~25K tokens. Twenty profiles, three iterations each: 60 cycles, 1.5 million tokens. For a single template.
Scale to a full program audit—dozens of templates, hundreds of segments, performance across campaigns—and you’re in eight-figure territory.
At this scale, the user isn’t waiting for a chat reply. They’ve kicked off a job. The main question isn’t speed. It’s whether coherence can be maintained as execution unfolds.

When Intent Evaporates
Given the scale and scope we’ve just described, our first response was mechanical: build a fleet of specialists—a Segment Agent, a Campaign Agent, a Template Agent—each owning a clean slice of the domain.
On paper, it looked sensible. In practice, something strange happened.
Nothing broke all at once. The system kept moving, outputs kept flowing, and every local decision looked reasonable. But as work passed from agent to agent, intent quietly evaporated.
The problem wasn’t capability. Each specialist could do its job. The problem was the handoffs.
Consider a concrete example. A Segment Agent identifies users at high churn risk—a nuanced determination based on purchase recency, support tickets, and engagement decay. It hands off a summary to the Campaign Agent:
segment_id: abc-123, intent: churn_prevention
From a systems perspective, this looks fine. The artifact exists. The label is correct. The pipeline continues.
But the reasoning is gone.
The Campaign Agent knows what the segment is called, but not why those users are at risk. It doesn’t know whether churn is driven by unmet expectations, recent negative interactions, or gradual disengagement. So it does the reasonable thing: it generates a generic “We miss you” email.
Nothing crashes. The campaign launches. The system works.
But the intent has evaporated at the boundary.
At first, we treated this as a coordination problem. Maybe handoffs needed to be richer. Maybe summaries needed to be more structured. Maybe agents needed better schemas for passing context. None of it helped.
The deeper issue was the domain itself.
Marketing automation is a low-observability domain. You don’t get immediate, definitive feedback that a decision was correct. Outcomes are delayed, noisy, and ambiguous. Multiple explanations fit the data, and failure is often silent.
In domains like this, correctness can’t guide you. Judgment has to accumulate over time.
That judgment is tacit. It lives in subtle tradeoffs: tone versus urgency, conversion versus trust, reacting to behavior versus shaping it. Does “We miss you” sound warm or desperate? Is three follow-ups persistence or harassment? You can’t compute these decisions from rules—they’re calibrated through experience with a specific audience.
Fragment identity, and that tacit judgment disappears. It’s not lost—it was never encoded. Each new agent starts from zero, re-deriving context from whatever traces survived the handoff. The system keeps moving, but strategy quietly flattens into generic action.
This is why coordination mechanisms didn’t help. We eventually realized we were solving the wrong problem: richer handoffs, better schemas, more context passed between agents were all attempts to improve coordination—and coordination wasn’t what was failing.
What was failing was continuity. In a low-observability domain, judgment has to accumulate over time—and it can’t accumulate if you keep swapping out the bearer of judgment.
The problem wasn’t routing or orchestration. It was identity.

The Airlock
In long-horizon work, handoffs are unavoidable. You’ll change tools, change focus, compact context, branch into verification loops, and return. Whether you hand off to a specialist or to your future self, you’re crossing a boundary.
And boundaries are where systems leak.
We learned this the hard way. Our earliest multi-agent designs didn’t fail because specialists were incapable. They failed because handoffs are the moment of maximum pressure loss. Intent, stance, and tacit judgment—especially in low-observability domains—don’t survive boundary crossings by default. Everything still “works,” but the system quietly depressurizes.
So the design goal stopped being “better coordination.” It became pressure integrity: a general-purpose way to cross seams with bounded pressure loss—so intent and judgment don’t decay at every transition.
That requirement has a name: an airlock.
An airlock doesn’t eliminate discontinuities. It makes them survivable. It forces a ritual at the boundary: capture what matters, seal it, then transition.
Practically, that meant one simple rule: don’t hand work to a different identity. Hand it back to yourself. Instead of coordinating many agents, preserve one identity and move it through phases. The handoff stops being a summary for a stranger. It becomes a journal—written by you, for you—that keeps intent contained while everything else changes.

Context Paging
Between airlocks, the system is in flight. It’s making tool calls, producing outputs, iterating, verifying, backtracking. That work generates exhaust: stale traces, half-relevant outputs, forgotten assumptions. If you don’t manage the internal environment, it doesn’t gracefully degrade—it rots.
That’s what zombie context is: old outputs the agent treats as current, hallucinated calls to tools that aren’t loaded anymore, confusion about what’s already been done. Nothing breaks loudly. The cabin just gets contaminated.
At scale, tool loading makes this unavoidable. Swapping from segment tools to campaign tools can be a 100,000+ token delta in what the model “knows.” This is context paging—same principle as an OS paging data through RAM. The context window is working memory. The tool catalog is disk. You can’t keep everything resident, so you page deliberately.
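To make the analogy concrete, here is a minimal Python sketch of that paging loop. The names (ToolPack, ContextPager) and token counts are illustrative, not our production code; the point is that a lightweight catalog stays resident while full specs are swapped in and out, and every swap has an explicit token delta.
# Hypothetical sketch of context paging: the lightweight catalog stays resident
# (working memory), full tool specs live in a store (disk) and are swapped in
# per phase, with the token delta of each swap made explicit.
from dataclasses import dataclass, field

@dataclass
class ToolPack:
    name: str
    spec_tokens: int  # size of the full specs when loaded into context

@dataclass
class ContextPager:
    catalog: dict                               # pack name -> one-line description (always resident)
    store: dict                                 # pack name -> ToolPack (full specs, paged)
    resident: set = field(default_factory=set)

    def swap(self, load, evict=None):
        """Page one pack in (and optionally one out); return the token delta."""
        delta = 0
        if evict and evict in self.resident:
            self.resident.discard(evict)
            delta -= self.store[evict].spec_tokens
        if load not in self.resident:
            self.resident.add(load)
            delta += self.store[load].spec_tokens
        return delta

pager = ContextPager(
    catalog={"segments": "audience tools", "campaigns": "journey tools"},
    store={"segments": ToolPack("segments", 60_000),
           "campaigns": ToolPack("campaigns", 110_000)},
)
pager.swap(load="segments")                            # phase 1: segment tools resident
print(pager.swap(load="campaigns", evict="segments"))  # phase 2: a 50,000-token delta in what the model "knows"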
But here’s what we didn’t expect: tool swaps and compaction create the same failure mode.
- Compaction discards context you might still be implicitly relying on.
- Tool swaps orphan the traces you are relying on—parameter names lose meaning, old outputs become misleading, “what happened” becomes uninterpretable.
Either way, you’ve crossed a seam. Either way, you risk depressurizing.
That’s why we stopped treating these as separate concerns. We unified them into a single operational procedure: when you cross a boundary, you seal the handoff and refresh the internal environment at the same time. The airlock needs machinery.
This coupling extends to memory. We didn’t design “memory” as a feature. Memory emerged as a byproduct of sealed transitions: if you’re forced to articulate what matters before crossing a boundary, what you articulate becomes the memory. The truly durable layer is the platform itself—segments, templates, campaigns. The handoff notes are what keep the agent oriented to them.
With that framing in place, we can name the primitive.

PhaseHandoff: The Atomic Transition
PhaseHandoff is the engineered airlock cycle. It’s the minimal sealed procedure we run whenever we cross a boundary—whether that’s a tool pack swap, a compaction event, or a mid-phase reset to prevent rot.
An airlock only works if it’s atomic. You don’t crack the hatch, wander off, and hope pressure holds. You close, seal, equalize, then open.
PhaseHandoff does the same in software:
- Checkpoint — write what you accomplished and what must survive
- Clear — discard tool outputs and stale traces
- Load — bring in the next tool suite for the next phase
A small set of core utilities stays resident across all phases. Everything else is paged.
{
  "next_phase_tools": ["create_campaign", "update_campaign", "get_campaign_full"],
  "handoff_note": "Created segment 'High-LTV Q4' (uuid: abc-123). 847 users matched. Ready to build campaign journey.",
  "current_phase_artifacts": ["Segment 'High-LTV Q4': abc-123"]
}
The forcing function is the note. Before the transition, compress what matters: what was accomplished, what comes next, why it matters. Each note accumulates into <your_handoff_notes> for the next phase. Artifacts are the hard anchors—IDs, URLs, counts—so judgment doesn’t float free and drift.
Handoff notes and artifacts serve different purposes: notes are narrative (“what I accomplished, what’s next”), artifacts are structured identifiers (UUIDs, URLs). Both accumulate across phases—notes in <your_handoff_notes>, artifacts in <accumulated_artifacts>—but artifacts are never wiped within the run. The truly durable memory is the platform itself: the segments, templates, and campaigns that persist long after the conversation ends.
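In sketch form, the whole cycle is small. The names below (AgentState, handle_phase_handoff, load_pack, the core tool names) are hypothetical stand-ins, not our actual API; what matters is the order of operations and what survives the clear.
# Hypothetical sketch of the airlock cycle: checkpoint, clear, load.
# Notes and artifacts accumulate across phases; tool outputs do not.
from dataclasses import dataclass, field

@dataclass
class AgentState:
    handoff_notes: list = field(default_factory=list)    # rendered as <your_handoff_notes>
    artifacts: list = field(default_factory=list)         # rendered as <accumulated_artifacts>
    resident_tools: list = field(default_factory=list)    # the current phase's tool suite
    tool_outputs: list = field(default_factory=list)      # working exhaust from this phase

CORE_TOOLS = ["phase_handoff", "ask_user", "final_response"]  # never paged out

def handle_phase_handoff(state, call, load_pack):
    # 1. Checkpoint: seal the note and artifacts before anything is discarded.
    state.handoff_notes.append(call["handoff_note"])
    state.artifacts.extend(call.get("current_phase_artifacts", []))
    # 2. Clear: drop stale tool outputs so they can't be mistaken for current state.
    state.tool_outputs.clear()
    # 3. Load: page in the next phase's tools; core utilities stay resident.
    state.resident_tools = CORE_TOOLS + load_pack(call["next_phase_tools"])
    return state

state = handle_phase_handoff(AgentState(), {
    "next_phase_tools": ["create_campaign", "update_campaign", "get_campaign_full"],
    "handoff_note": "Created segment 'High-LTV Q4' (uuid: abc-123). 847 users matched.",
    "current_phase_artifacts": ["Segment 'High-LTV Q4': abc-123"],
}, load_pack=lambda names: names)  # stand-in for fetching full specs by name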
A Concrete Example
User request: “Create a VIP segment, build a welcome campaign, and verify the email renders correctly for 20 different user profiles.”
Phase 1: Segment discovery and creation
- Load segment tools → query schema → create VIP segment
- Handoff note: “Created VIP segment (uuid: abc-123). Ready for campaign.”
Phase 2: Campaign build
- Load campaign + template tools → create email template → wire up journey
- Handoff note: “Campaign created with welcome template. Need to verify renders.”
Phase 3: Verification (mid-domain compaction)
- Render template for 20 profiles—each render adds tokens
- After 10 renders, context is noisy; PhaseHandoff with same tools
- Handoff note: “10/20 renders complete. All passed. Continuing.”
Phase 4: Completion
- Finish remaining renders, final response with links
Total execution: millions of tokens. Context at any moment: bounded by deliberate compaction.
When Traces Break
PhaseHandoff doesn’t eliminate failure—it changes the failure modes.
The most common: semantic drift. The agent writes “Created segment for high-value users” without the UUID. Three phases later, it references “the VIP segment”—but the platform has two segments with similar names. It picks the wrong one. We mitigate this with conventions around unique artifact identifiers in handoff notes, plus a deterministic resource manifest the agent can trust to auto-track every resource it touches.
Bad: “Created segment for high-value users”
Good: “Created segment ‘High-LTV Q4’ (uuid: abc-123). 847 users matched. Ready for campaign.”
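One way to build such a manifest, sketched with hypothetical names (ResourceManifest, record, render): IDs are captured deterministically from tool results, so later phases resolve “the VIP segment” against hard anchors rather than a fuzzy name in a note.
# Hypothetical resource manifest: every resource a tool call touches is recorded
# with its ID, so later phases resolve references against hard anchors instead of
# a fuzzy name in a handoff note.
from dataclasses import dataclass, field

@dataclass
class ResourceManifest:
    entries: list = field(default_factory=list)

    def record(self, kind, name, resource_id):
        """Called by the tool layer on every create/update result, not by the model."""
        self.entries.append({"kind": kind, "name": name, "id": resource_id})

    def render(self):
        """Injected into context each phase alongside the handoff notes."""
        return "\n".join(f"- {e['kind']} '{e['name']}' (id: {e['id']})" for e in self.entries)

manifest = ResourceManifest()
manifest.record("segment", "High-LTV Q4", "abc-123")  # captured from the tool result
manifest.record("segment", "High-LTV Q3", "def-456")  # the near-duplicate that invites drift
print(manifest.render())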
The second: premature compaction. The agent calls PhaseHandoff before work is actually complete, discarding context it still needs. We detect this through completion-rate monitoring and tune the prompts that govern handoff timing.
The system is robust, not foolproof. But the failures are legible—every phase transition is logged, every handoff note persisted. When something breaks, we can trace it.

Mission Dynamics
Processing a long document is loading cargo—you have it, you reference it. Long-horizon execution is an active mission. The terrain changes. Your equipment changes. The execution itself generates new terrain: each tool call, each decision point, each verification loop adds to what you’re moving through. You’re not reading context—you’re producing it, and you need to stay coherent as it accumulates.
Long-horizon execution needs stronger trace mechanisms than long-input processing. You’re sustaining coherent work across a journey that might span dozens of phases, and you must converge before the budget runs out.
Tool Ecosystem Management
The “long context” here is the agent’s own capability surface: ~90 tools whose full specs, roughly 350K tokens of them, can’t all be loaded simultaneously.
PhaseHandoff addresses this directly. Tools are organized into packs; a lightweight catalog of what’s available stays in context even when the full specs don’t—the agent has the map, just not the territory. When it needs campaign tools, it requests them by pack. The system gates visibility—server-enforced, not model-enforced. The agent can’t call tools it hasn’t been granted. Unloaded tools don’t exist. Write operations require typed approvals. Resource payloads are stored server-side and referenced by pointer.
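A rough sketch of that gating, with hypothetical names (ToolGateway, grant, call); the essential property is that the filter lives on the server, outside the model.
# Hypothetical server-side gate: the model only sees granted tools, calls to
# anything else are rejected before execution, and writes require an approval.
class ToolGateway:
    WRITE_TOOLS = {"create_campaign", "update_campaign", "launch_campaign"}

    def __init__(self, catalog):
        self.catalog = catalog   # pack name -> list of tool specs (the "territory")
        self.granted = set()

    def grant(self, pack):
        """Called at a phase transition: expose exactly one pack to the model."""
        self.granted = {spec["name"] for spec in self.catalog[pack]}
        return self.catalog[pack]          # only these specs enter the context window

    def call(self, name, args, approval=None):
        if name not in self.granted:
            raise PermissionError(f"{name} is not loaded in this phase")  # unloaded tools don't exist
        if name in self.WRITE_TOOLS and approval is None:
            raise PermissionError(f"{name} requires a typed approval")
        return {"ref": f"resource://{name}/result"}   # payload stays server-side; a pointer is returned

catalog = {"campaigns": [{"name": "create_campaign"}, {"name": "get_campaign_full"}]}
gate = ToolGateway(catalog)
gate.grant("campaigns")
gate.call("get_campaign_full", {"id": "abc-123"})   # allowed: read tool, currently granted
# gate.call("create_segment", {})                   # would raise: not part of the granted pack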
And because the core system prompt stays stable while we swap tool definitions at the tail, we maximize prompt cache hits. It’s a nice bonus: the architecture that preserves coherence also happens to be faster and cheaper.
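In sketch form, the assembly order is the whole trick: stable text first, append-only text next, volatile text last. Function and parameter names here are illustrative.
# Hypothetical prompt assembly: everything that changes per phase sits at the
# tail, so the long stable prefix stays byte-identical and cache-friendly.
def assemble_prompt(core_system_prompt, handoff_notes, artifacts, tool_specs):
    return "\n\n".join([
        core_system_prompt,   # stable across every phase -> shared cache prefix
        handoff_notes,        # append-only, so earlier phases still prefix-match
        artifacts,            # append-only as well
        tool_specs,           # swapped wholesale at each transition
    ])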
Load-Bearing Framing
Early versions of our phase context used objective language: “The system recorded 3 phase transitions.” The model treated it as external documentation—something to reference or ignore.
We changed it to ownership language: “You called phase_handoff and wrote these notes.”
In our testing, self-authored framing (“You wrote this”) was treated by the model as reasoning history it needed to honor. Objective framing (“The system recorded this”) was treated as external reference it could ignore. We haven’t run formal ablations, but the behavioral difference was consistent enough to become load-bearing.
The same information can function as “state you must honor” or “background you may reference,” depending entirely on how the prompt frames it.
For tool-using agents without a persistent execution environment, the framing is load-bearing infrastructure.
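The difference is a one-line change in how the phase context string is rendered. A sketch, with hypothetical function names; the information is identical in both versions, only the attribution changes.
# Hypothetical renderers for the same phase context. The information is
# identical; only the attribution changes.
def render_objective(notes):
    return "\n".join(f"The system recorded phase transition #{i + 1}: {note}"
                     for i, note in enumerate(notes))

def render_self_authored(notes):
    body = "\n".join(f'You called phase_handoff #{i + 1} and wrote: "{note}"'
                     for i, note in enumerate(notes))
    return f"<your_handoff_notes>\n{body}\n</your_handoff_notes>"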
Deliberate Forgetting
Forgetting is not the enemy. Incidental forgetting is.
Context windows are finite. Compaction will happen whether you design for it or not. The question is whether it happens chaotically (overflow, truncation, “lost in the middle”) or deliberately.
PhaseHandoff makes forgetting deliberate. The agent writes what to remember, declares artifacts, then the system clears tool outputs. The agent chooses what survives.
Handoff notes accumulate. Artifacts accumulate. These anchors cannot be forgotten within the run—they’re what prevent drift. PhaseHandoff agents can’t run infinitely (the accumulated notes eventually fill the window), but they run far longer than approaches without explicit compaction while staying grounded.
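A back-of-the-envelope sketch of that ceiling, using illustrative numbers rather than measured ones: the accumulated trail is the only within-run growth, so the phase count is bounded by whatever headroom remains after the resident pieces.
# Illustrative budget math, not measured numbers: accumulated notes and artifacts
# are the only within-run growth, so the phase ceiling is roughly the leftover
# headroom divided by the per-phase accumulation.
window          = 200_000   # context window per model call
core_prompt     = 10_000    # stable system prompt
tool_pack       = 100_000   # largest tool suite resident at once
working_room    = 60_000    # tool outputs and reasoning within a single phase
per_phase_trail = 300       # handoff note + artifact entries added each phase

headroom = window - core_prompt - tool_pack - working_room
print(headroom // per_phase_trail)   # ~100 phases before the trail alone exhausts the window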
The conventional framing treats large context windows as the goal and forgetting as the failure mode. We flip it:
Deliberate forgetting is the mechanism. Grounded continuity is the goal.
Context limits are a design constraint, not a problem to overcome.

Stigmergy, Across Time
Stigmergy is typically about space: ants distributed through a colony, each leaving pheromones for others. But it also applies across time. And for agents, it has to.
Humans carry implicit continuity across conscious moments—you don’t need a trace to remember who you are. Agents have no such luxury. Phase N and Phase N+1 share an identity frame but not actual state. Without explicit traces, nothing connects them. They’re different actors who happen to share a name.
The traces ARE the continuity. Handoff notes, artifacts, platform objects—these aren’t metadata about the agent’s work. They’re what makes the agent cohere across phases at all. When we say the framing is “load-bearing,” this is what we mean. It’s not infrastructure around the agent. It’s what constructs the agent’s persistence.
This is why self-authored framing matters:
<your_handoff_notes>
You called phase_handoff #1 and wrote: "Got schema, identified VIP segment"
You called phase_handoff #2 and wrote: "Created welcome email template"
</your_handoff_notes>
“You wrote this” isn’t a prompt trick. It’s literally true—and that’s the point. The model is both writer and reader. The trace is a journal, not a memo from a stranger.
This reframes what the prompt’s job actually is. The prime directive isn’t “do marketing automation.” It’s to maximize trace strength between handoffs so the agent remains coherent across phases. Domain expertise lives in the tools. Coherence lives in the prompt. The marketing work is what gets done. The stigmergy loop is what makes any long-horizon task work at all.
Progressive disclosure enables this architecture: domain knowledge arrives with the tools, so the core system prompt stays light—focused almost entirely on engineering the coherence loop.

The Long Horizon
The horizon extends as far as intent survives.
Stigmergy is the principle. Reduce is the shape. If you’re building agents for long-horizon work, you’ll need both in some form.
With PhaseHandoff, we’ve boiled it down to a single reducing function that embodies both. Each call accumulates the trail: intent, attempts, lessons—pushing in-context learning far beyond what’s possible out of the box.
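Stripped to its shape, the loop is a fold: one accumulator (the trail) carried over many phases. A minimal sketch, with hypothetical names.
# The reducing shape: one accumulator (the trail) folded over many phases.
from functools import reduce

def phase(trail, step):
    """One phase: act with the trail in view, then seal what must survive."""
    note, artifacts = step(trail)                     # plan, act, verify against the trail
    return {
        "notes": trail["notes"] + [note],             # intent, attempts, lessons
        "artifacts": trail["artifacts"] + artifacts,  # hard anchors: IDs, URLs, counts
    }

def run(steps):
    return reduce(phase, steps, {"notes": [], "artifacts": []})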
MIT’s work on Recursive Language Models arrived at the same primitive—and the same 10-million-token scale—from different constraints. Their input is static; ours is dynamic—execution that generates its own context. Same pattern, different terrain. When that happens, the pattern is probably fundamental.
Yours will look different. The question is the same: how well do your traces preserve intent across the boundaries you’re creating?
Get that right, and the horizon extends. Get it wrong, and context rots at every seam.
If you’ve built something comparable—particularly around tool-loading strategies or framing sensitivity—we want to hear what you’ve learned. Blueshift ships agentic AI for marketing automation, and we’re hiring engineers who want to push these patterns further.