Why Your AI Agent Can't Remember Yesterday
Context windows have limits, and most production agents hit them faster than you'd expect. Here's what actually happens when memory runs out — and what to do about it.
Every AI agent ships with an implicit promise: it'll keep track of where things stand. It'll remember what you asked for in step one when it's executing step seven. It won't lose the thread.
In controlled demos, this is true. In production, at scale, across long tasks — it falls apart in ways that are simultaneously predictable and hard to debug.
The Context Window Is Not Magic
Modern large language models operate inside a fixed context window — a bounded frame of tokens that represents "what the model can see right now." For most production agents running on GPT-4o or Claude Sonnet, that window is somewhere between 128k and 200k tokens.
That sounds enormous. It isn't, once you account for:
- System prompts (often 3–8k tokens for well-instrumented agents)
- Tool schemas (each tool definition eats tokens; a dozen tools = 2–5k tokens gone)
- Prior conversation turns (accumulates fast)
- Retrieved context from RAG pipelines (often 10–20k per retrieval)
- Intermediate reasoning traces if the model is using chain-of-thought
A sophisticated agent can burn through 50–60k tokens of overhead before it does a single piece of real work. For a long-running task — customer support escalation, multi-step code review, document drafting — you're regularly hitting the ceiling.
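The arithmetic is worth making explicit. A minimal back-of-the-envelope sketch, using illustrative figures drawn from the ranges above (the specific numbers are assumptions, not measurements):

```python
# Hypothetical token budget for a long-running agent session.
# All overhead figures are illustrative, taken from the ranges above.
CONTEXT_WINDOW = 128_000

overhead = {
    "system_prompt": 6_000,
    "tool_schemas": 4_000,       # ~a dozen tool definitions
    "rag_retrievals": 30_000,    # two retrievals at ~15k tokens each
    "reasoning_traces": 15_000,  # accumulated chain-of-thought
}

total_overhead = sum(overhead.values())
budget_left = CONTEXT_WINDOW - total_overhead

print(f"overhead: {total_overhead:,} tokens")
print(f"remaining for actual work: {budget_left:,} tokens")

# At a rough ~500 tokens per conversational turn (user + assistant),
# the remaining budget supports only so many turns before truncation:
print(f"~{budget_left // 500} turns")
```

Even with conservative numbers, more than 40% of the window is gone before the first user-visible turn — and every turn after that eats into what's left.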
What Failure Looks Like
The first symptom isn't an error. It's drift. The agent starts giving answers that are technically coherent but missing context from earlier in the conversation. It re-asks questions you've already answered. It forgets that it already tried a tool call that failed and runs it again.
Then comes hallucinated continuity: the model generates plausible-sounding references to things that happened earlier in the conversation — except they didn't happen in this context window. It's confabulating a past it can't actually see.
Finally, most modern inference APIs will hard-truncate or throw an error when context is exceeded. Your agent dies mid-task with a cryptic `context_length_exceeded` error, no state saved, nothing to resume from.
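The "nothing to resume from" part is the avoidable half. One defensive pattern is to checkpoint agent state *before* each model call, so a hard failure leaves something recoverable. A sketch, with the exception class and model call as stand-ins (real error types and signatures vary by provider):

```python
import json

class ContextLengthExceeded(Exception):
    """Stand-in for the provider-specific overflow error; names vary by API."""

def call_model(messages):
    # Placeholder for the real inference call; here it always overflows,
    # to demonstrate the recovery path.
    raise ContextLengthExceeded("context_length_exceeded")

def run_step(messages, checkpoint_path="agent_state.json"):
    # Persist state BEFORE the call, so a hard failure is resumable.
    with open(checkpoint_path, "w") as f:
        json.dump({"messages": messages}, f)
    try:
        return call_model(messages)
    except ContextLengthExceeded:
        # Instead of dying mid-task, reload the checkpoint and hand
        # control back to whatever supervises this agent.
        with open(checkpoint_path) as f:
            state = json.load(f)
        return {"resumable": True, "turns_saved": len(state["messages"])}
```

This doesn't prevent the overflow — the architectural fixes below do that — but it turns a dead-end crash into a resumable state.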
The Architectural Solutions
1. Explicit Memory Stores
Don't rely on the context window as your only memory. Route key facts — user goals, decisions made, artifacts produced — into an external store (Redis, Postgres, a vector DB) and retrieve them selectively. Your agent becomes a stateless processor; state lives outside it.
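A minimal sketch of the pattern, using an in-memory dict as the store (in production this would be Redis, Postgres, or a vector DB; the key names are illustrative):

```python
class MemoryStore:
    """Minimal external memory; swap the dict for Redis/Postgres in production."""

    def __init__(self):
        self._facts = {}

    def remember(self, key, value):
        self._facts[key] = value

    def recall(self, *keys):
        # Selectively retrieve only the facts this step needs,
        # instead of replaying the whole conversation history.
        return {k: self._facts[k] for k in keys if k in self._facts}

store = MemoryStore()
store.remember("user_goal", "migrate billing service to Postgres 16")
store.remember("decision:orm", "stick with SQLAlchemy, no rewrite")

# A later step rebuilds its context from the store, not from chat history.
context = store.recall("user_goal", "decision:orm")
```

The key design property: any step can be reconstructed from the store alone, which is what makes the agent genuinely stateless.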
2. Summarization Hooks
Before your context fills up, trigger a summarization step: have the model distill "what we've learned so far" into a compact representation that re-enters the next context. This is lossy — details get dropped — but it maintains narrative coherence.
3. Task Decomposition at the Boundary
Long tasks fail at scale because they're designed as single sessions. Break them into discrete, independently completable subtasks with explicit handoff contracts. Each subtask gets a fresh context. State is passed via structured data, not prose conversation history.
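A handoff contract can be as simple as a typed record passed between subtasks. A sketch, with illustrative field names (the fields you actually need depend on the task):

```python
from dataclasses import dataclass, field

@dataclass
class Handoff:
    """Structured contract passed between subtasks; fields are illustrative."""
    task_id: str
    goal: str
    artifacts: dict = field(default_factory=dict)   # outputs produced so far
    decisions: list = field(default_factory=list)   # choices later steps must honor

def run_subtask(name, handoff):
    # Each subtask opens a FRESH context, seeded only from the contract —
    # never from raw conversation history.
    handoff.artifacts[name] = f"{name} completed"
    return handoff

h = Handoff(task_id="t1", goal="multi-step code review")
h = run_subtask("draft_findings", h)
h = run_subtask("verify_findings", h)
```

Because the contract is structured data rather than prose, each subtask's context cost is bounded and predictable — the accumulated history never rides along.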
4. Context Monitoring
Instrument your agents. Emit token-count metrics at each step. Set alerts at 70% and 85% utilization. When you're approaching the limit, trigger graceful degradation — not crashes.
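The threshold logic above is a few lines of code. A minimal sketch (the action names and what "degrade" means — force a summarization pass, end the session cleanly — are up to your agent's design):

```python
def check_utilization(tokens_used, window=128_000, warn=0.70, critical=0.85):
    """Map current context utilization to an action, per the thresholds above."""
    utilization = tokens_used / window
    if utilization >= critical:
        return "degrade"   # e.g. trigger summarization or wind down gracefully
    if utilization >= warn:
        return "alert"     # emit a metric / page someone before it's a crash
    return "ok"
```

Run this check on every step, not just at session boundaries — a single large RAG retrieval can jump utilization from comfortable to critical in one call.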
The Deeper Issue
Memory failures aren't just an engineering problem. They're a design problem. Most agent frameworks are built around the assumption that a conversation is a good unit of work. For short tasks, it is. For long-horizon, multi-step work — the kind of work agents are increasingly being asked to do — it's the wrong abstraction.
The agents that hold up in production are the ones whose designers took memory constraints as a first-class design constraint, not an afterthought.
Start there.