Earlier this year, we launched Compass and Launchpad, a pair of AI agents that let marketers describe what they want in plain language and get back fully built, cross-channel campaigns.
Compass continuously scans customer behavior to surface revenue opportunities: under-engaged audiences, high-potential segments, optimizations with projected impact. Launchpad takes those insights and turns them into action: assembling audiences, messaging, dynamic content, and journey logic across email, SMS, push, and in-app channels. Early users launched nearly 10x more experiments and lifted goal metrics by an average of 36%.
Behind the scenes, these agents execute complex, multi-phase marketing workflows: analyzing campaign performance, building audience segments, designing personalized templates, configuring triggers, and launching campaigns: often spanning 5 or more phases and 20+ tool calls in a single conversation. Making that work reliably required us to solve a problem that none of the existing agent frameworks adequately address.
Before we started building the Compass and Launchpad agents, we looked at existing agentic frameworks. We found that most agents get worse the longer they work, and not because the models aren't capable or the context windows are too small. Somewhere around step four of a five-step workflow, the agent quietly loses its edge. Responses get vaguer. It forgets details from earlier steps. It hallucinates campaign names. It contradicts decisions it made three phases ago.
We call this context rot, and it’s the reason we built our own agent orchestration framework instead of reaching for LangChain, AutoGen, or CrewAI. This post explains why, what we built, and how it compares to the alternatives.

Blueshift’s Marketing Agentic Framework implements phase handoff to fold context between agent phases, uses a three-tier memory architecture, and supports domain-specific dynamic tool loading, so that only the tools relevant to the current phase are loaded.
The Problem: Context Rot Is Not a Context Window Problem
Here’s a typical task our marketing agents handle:
“Analyze Q4 campaign performance, identify the highest-value customer segment, build a re-engagement campaign targeting lapsed VIPs, design and code personalized email template with profile data and predictions, and launch it.”
A marketing professional does this in a day. An AI agent should be able to do it in less than an hour. But most agents fail: not at the beginning, but partway through, when accumulated context starts degrading reasoning quality.
We measured this systematically across our production workloads:
| Context Size | Task Success Rate |
| --- | --- |
| 0–20K tokens | 94% |
| 20–50K tokens | 81% |
| 50–100K tokens | 64% |
| 100–200K tokens | 47% |
| 200K+ tokens | 28% |
The model isn’t running out of room. It’s running out of focus. Every token of accumulated history (tool outputs, intermediate reasoning, superseded data) competes with the current task for the model’s attention. At 200K tokens of context with a 2K-token current task, roughly 1% of the model’s attention budget is devoted to the thing you actually need it to do.
We call this the attention tax, and it compounds with every phase.
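The arithmetic behind that 1% figure is easy to state directly. A minimal illustration (the function name is ours, not part of any framework):

```python
# Illustration of the "attention tax": the share of the context window
# actually occupied by the current task shrinks as history accumulates.
def task_attention_share(history_tokens: int, task_tokens: int) -> float:
    """Fraction of total context occupied by the current task."""
    return task_tokens / (history_tokens + task_tokens)

# A 2K-token task buried under 200K tokens of accumulated history:
share = task_attention_share(history_tokens=200_000, task_tokens=2_000)
print(f"{share:.1%}")  # prints "1.0%" -- the task is ~1% of the context
```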
This isn’t a theoretical concern. It’s the difference between a 34% task completion rate and an 89% one. We know, because that’s the before-and-after of our framework.
What We Tried First (And Why It Didn’t Work)
Before building our own framework, we evaluated every standard approach. Each one fails for a specific reason:
Bigger context windows don’t help because context rot is independent of window size. A 1M-token context window doesn’t improve attention quality at 200K tokens. You’ve moved the failure point, not eliminated it, and you’re paying 25x more for the privilege.
Summarization is lossy by design. When you compress “customers with LTV > $500, last purchase > 30 days, in regions US/CA/UK, excluding churned status” into “high-value lapsed customers,” you’ve destroyed the specifics that later phases need. Worse, the model can’t predict what will matter downstream. Summarization makes irreversible discard decisions.
RAG solves a different problem. It’s excellent for knowledge retrieval (“What’s our refund policy?”) but struggles with reasoning continuity (“What did we decide about segment criteria and why?”). Agent context is dynamic and generated at runtime; there’s no pre-indexed corpus to retrieve from.
Agent delegation helps partially: child agents start with fresh context, but it moves the problem rather than solving it. The parent agent still accumulates state from all child outputs. And state transfer between parent and child is itself a lossy operation.
None of these give the agent what it actually needs: control over its own cognitive load.
Phase Handoff: Context Folding for Agent Workflows
Phase Handoff is the core mechanism of our framework. The idea is simple but powerful: let the agent manage its own context the way a human expert manages a complex project.
A marketing professional doesn’t carry every data point from campaign analysis into email design. They extract the insights they need: “our VIP segment is 45K users, highest engagement on mobile, prefer minimal design”, and then work from those insights, not the raw data.
Phase Handoff gives agents this same capability. When the agent finishes a phase of work, it calls a special phase_handoff tool that triggers a context fold:
- Tool outputs are cleared. The 150K tokens of raw JSON from campaign analytics? Gone.
- New tools are loaded. Segment-building tools replace analysis tools, so the agent has exactly what it needs for the next phase, nothing more.
- Artifacts are preserved. The agent selects the key findings worth carrying forward: a few hundred tokens of compressed semantic content, not kilobytes of raw data.
- A journal entry is created. The transition is logged for audit but hidden from the model’s active context.
The result: the agent emerges into the next phase with fresh context capacity, the right tools, compressed artifacts from prior work, and full reasoning capability restored. Context oscillates between 60–150K tokens instead of growing monotonically toward degradation.
The agent never sees the mechanism. It just experiences a clean workspace with its notes intact.
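The four steps above can be sketched as a single fold operation. This is a minimal illustration of the pattern, assuming a simple in-memory context object; the class and function names (Context, phase_handoff) are ours, not Blueshift's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Context:
    tool_outputs: list = field(default_factory=list)  # transient, bulky raw data
    tools: list = field(default_factory=list)         # active tool set
    artifacts: list = field(default_factory=list)     # compressed key findings
    journal: list = field(default_factory=list)       # audit log, hidden from the model

def phase_handoff(ctx: Context, next_phase: str,
                  next_tools: list, carry_forward: list) -> Context:
    """Fold the context: clear bulky outputs, swap tools, keep curated artifacts."""
    ctx.journal.append(f"handoff -> {next_phase}")    # logged for audit only
    return Context(
        tool_outputs=[],                              # raw tool JSON is discarded
        tools=next_tools,                             # phase-appropriate tools only
        artifacts=ctx.artifacts + carry_forward,      # agent-selected findings survive
        journal=ctx.journal,
    )

# Analysis phase ends; the agent folds into the segmentation phase,
# carrying a few hundred tokens of findings instead of the raw analytics.
ctx = Context(tool_outputs=["150KB of analytics JSON"], tools=["analyze_campaigns"])
ctx = phase_handoff(ctx, "segmentation", ["build_segment"],
                    ["VIP segment: 45K users, mobile-first, minimal design"])
```

The key design point is that the agent, not the framework, chooses `carry_forward`: the fold is agent-controlled compression, not blind summarization.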
Three-Tier Memory: Matching Lifetime to Storage
Phase Handoff is part of a broader memory architecture. We found that treating all information uniformly, i.e. keeping it all in context or discarding it all, causes problems. Different information has fundamentally different lifetimes.

Blueshift’s Marketing Agentic Framework has a three-tier memory architecture that matches storage tier to the natural lifetime of information.
Tier 1: Transient Memory holds tool outputs and intermediate calculations. A single campaign analytics call might return 50–200KB of JSON. That data is needed for the current reasoning step and then becomes dead weight. Transient memory is cleared after each API call. Typical size: 50–200KB per step, zero after clearing.
Tier 2: Message Memory holds operational state that needs to persist within a single user turn but not beyond it. Resource pointers, phase gates, the current active tool set: this is the scaffolding of the workflow, not the semantic content. It’s managed automatically by the framework and filtered at read time.
Tier 3: Root Memory holds the genuinely permanent information: accumulated artifacts, brand guidelines, decided segment criteria, the historical mission log. This is the agent’s long-term memory: roughly 2KB of curated, agent-selected content that survives all phase transitions and represents the distilled intelligence of the entire conversation.
The philosophy is straightforward: the closer information is to the current task, the larger and more transient it should be. The more permanent the information, the smaller and more curated it must be. This mirrors how human experts work: you don’t memorize every spreadsheet you’ve ever opened, but you remember the conclusions you drew from them.
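One way to model the three tiers is as stores keyed by how long their contents live, with clearing hooks at the tool-call and turn boundaries. A sketch under those assumptions; class and method names are illustrative:

```python
class TieredMemory:
    """Three stores, each cleared at a different lifecycle boundary."""

    def __init__(self):
        self.transient = {}  # Tier 1: cleared after every tool call
        self.message = {}    # Tier 2: cleared at the end of each user turn
        self.root = {}       # Tier 3: survives the whole conversation

    def end_tool_call(self):
        self.transient.clear()   # raw tool output becomes dead weight immediately

    def end_turn(self):
        self.end_tool_call()
        self.message.clear()     # workflow scaffolding is discarded with the turn

    def promote(self, key, value):
        """Agent-curated findings move into permanent root memory."""
        self.root[key] = value

mem = TieredMemory()
mem.transient["analytics"] = "200KB of raw JSON"            # Tier 1
mem.message["active_tools"] = ["build_segment"]             # Tier 2
mem.promote("vip_segment", "45K users, mobile-first")       # Tier 3
mem.end_tool_call()   # Tier 1 emptied; Tiers 2 and 3 intact
```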
Resource Pointers: 50 Bytes Instead of 200KB
Long-horizon agent tasks create things: segments, campaigns, email templates. These resources can be large, often 200KB or more, and they’re referenced across multiple phases. The naive approach of embedding the full resource in context causes rapid bloat: one resource referenced four times costs 200KB of context.

Resource pointers store marketing assets efficiently without polluting the context space.
Our framework uses a pointer pattern instead. When a resource is created, we store it in the database and keep only a ~50-byte symbolic reference in context: resource(seg-123). When a later phase needs the resource, it dereferences the pointer on demand, fetching the latest state from the database.
The lifecycle works like this:
- Agent creates a resource (e.g., a segment) via an API tool.
- Framework stores the result in the database and generates a pointer.
- Context carries only the pointer, not the full payload.
- Later phases dereference on demand, getting the current state (not a stale copy).
- The pointer survives phase transitions, maintaining cross-phase continuity.
This gives us minimal context overhead (50 bytes, not 50KB), always-current data (no stale copies), and clean cross-phase continuity. A conversation-scoped manifest tracks what’s been created, capped at 100 resources, so even massive workflows stay bounded.
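The lifecycle above can be sketched with a small store where an in-memory dict stands in for the database; names and the `resource(...)` string format follow the example in the text, but the class itself is illustrative:

```python
class ResourceStore:
    """Store payloads once; carry only short pointers in context."""

    MAX_RESOURCES = 100            # conversation-scoped manifest cap

    def __init__(self):
        self._db = {}              # stand-in for the real database

    def create(self, resource_id: str, payload: dict) -> str:
        if len(self._db) >= self.MAX_RESOURCES:
            raise RuntimeError("resource manifest is full")
        self._db[resource_id] = payload        # full payload lives off-context
        return f"resource({resource_id})"      # ~50-byte pointer for the context

    def dereference(self, pointer: str) -> dict:
        resource_id = pointer[len("resource("):-1]
        return self._db[resource_id]           # always the current state

store = ResourceStore()
ptr = store.create("seg-123", {"criteria": "LTV > $500, lapsed 30d", "size": 45_000})
# A later phase dereferences on demand instead of carrying a stale 200KB copy:
segment = store.dereference(ptr)
```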
How We Compare to Open-Source Frameworks
We didn’t build this in a vacuum. We evaluated the major open-source agent frameworks and found they each solve different aspects of the problem while leaving critical gaps in context management for long-horizon tasks.
LangGraph
LangGraph provides excellent graph-based workflow orchestration and fine-grained control over agent execution flow. Its state machine model is well-suited for complex branching logic.
However, context management is largely DIY. The framework provides a checkpointing mechanism with a ~1GB limit, but it’s on the developer to implement context pruning, memory tiers, and dynamic tool loading. For teams with deep infrastructure expertise and the bandwidth to build these layers, LangGraph is a strong foundation.
For us, that would have meant building most of our framework on top of theirs anyway.
AutoGen
AutoGen excels at multi-agent conversation patterns; it’s arguably the best framework for scenarios where multiple specialized agents need to debate, negotiate, or collaborate. Its ListMemory implementation is straightforward to understand.
The challenge is that memory is append-only by default: there’s no built-in mechanism for selective pruning or tiered retention. In our testing, long-horizon tasks could overwhelm memory, and there’s no equivalent to phase-based context folding. AutoGen’s strength is multi-agent orchestration, not single-agent long-horizon coherence.
CrewAI
CrewAI offers the fastest path from idea to working prototype. Its role-based agent definitions and built-in task decomposition make it easy to get started. For context management, it provides automatic summarization, which is better than nothing, but inherently lossy. You can’t control what gets preserved and what gets discarded.
CrewAI is optimized for productivity and accessibility rather than precise context control, which makes it a great choice for simpler workflows but a challenging foundation for the kind of multi-phase, multi-tool tasks we run in production.
Where Blueshift’s Framework Fits
Our framework is purpose-built for a specific challenge: sustaining coherence across long-horizon, tool-heavy agent workflows. Phase Handoff gives us agent-controlled context folding (not framework-imposed summarization). Three-tier memory matches information lifetime to storage tier. Resource pointers keep context lean across phases. Dynamic tool loading prevents upfront context bloat.
The trade-off is domain specificity: we don’t have the multi-agent orchestration depth of AutoGen, and we’re not as general-purpose as LangGraph. We built for our use case, marketing automation workflows that routinely span 5+ phases across multiple domains with 20+ tool calls, and we optimize aggressively for that domain.
The Results
The proof is in the numbers. After deploying Phase Handoff to production:
| Metric | Before | After | Change |
| --- | --- | --- | --- |
| Task completion (5+ phases) | 34% | 89% | +162% |
| Average context at completion | 287K tokens | 95K tokens | -67% |
| Cost per workflow | $2.40 | $0.85 | -65% |
These numbers reflect our production marketing automation workflows: multi-phase tasks with heavy tool use across campaign analysis, segmentation, template design, and launch. Results will vary by domain and task complexity, but the directional improvement has been consistent across every workflow type we’ve tested.
The cost reduction comes from two sources: smaller contexts (costs scale linearly with token count) and fewer retries (failures are expensive). The completion rate improvement comes from keeping the agent in its cognitive sweet spot: always under 150K tokens of context, always with exactly the tools it needs.
What This Means for the Industry
The agent ecosystem is maturing rapidly, and we believe context management will emerge as the critical differentiator between agents that demo well and agents that work in production.
Bigger context windows are not the answer. They’re necessary but not sufficient: like giving someone a bigger desk when the real problem is that they can’t find anything on it. The models need structured memory, not just more memory.
The agents that win will be the ones that give agentic applications genuine cognitive load management: the ability to decide what’s relevant, preserve what matters, discard what doesn’t, and maintain coherent reasoning across arbitrarily long workflows.
We think we’re early to this insight, but we won’t be alone for long. The open-source frameworks are evolving quickly, and we expect to see tiered memory architectures and context folding patterns become standard practice within the next year.
The future of AI agents isn’t about how much they can remember: it’s about how intelligently they can tuck things away.
Blueshift’s agent framework powers marketing for enterprise brands. If you’re interested in how AI agents can transform customer engagement for your business, reach out to learn more.