Skip to content
Gary Wu
Go back

Context Is a Harness Artifact

Edit page

Thesis. LLMs are stateless. A “context window” is a transformer input-size cap, not memory. A “conversation” is a UX fiction the harness produces by resubmitting prior turns on every call. Every memory mechanism — every chat history, every retrieval hook, every CLAUDE.md, every skill registry, every “dreaming” pass — is a harness decision, not a model capability. So long-horizon autonomy is a question of harness design, not model capability.

This is the flat, technical claim most commentary about “agents” elides. Once you accept it, a lot of the contemporary debate — bigger windows vs. smaller, memory plugins, RAG vs. fine-tuning, “will GPT-5 remember you?” — collapses into a single engineering question: what does the harness do between complete() calls?


1. The stateless fact

Every inference API call to Claude, OpenAI, or Gemini is a pure function: complete(tokens) -> tokens. No hidden server-side state. No persistent identity across calls. The second call knows nothing about the first. This is not something you have to reverse-engineer — the providers document it explicitly.

Anthropic’s Messages API docs:

The Messages API can be used for either single queries or stateless multi-turn conversations. […] When creating a new Message, you specify the prior conversational turns with the messages parameter, and the model then generates the next Message in the conversation. (Messages API)

The word “stateless” is not buried. The client sends the complete messages array — every user turn, every assistant turn — on every request. Send half the array, the model sees half a conversation. Send a re-ordered array, the model sees a re-ordered conversation. Send an invented array, the model sees an invented conversation and answers as if it had happened.

OpenAI’s Chat Completions behave the same way; Google’s Gemini generateContent is structurally identical. The newer OpenAI Responses API adds an optional server-side previous_response_id and a Conversations object, but that is a server-side rewrite of the same pattern — OpenAI’s servers now hold the transcript on your behalf and resubmit it for you. The model call underneath is still stateless. (Responses conversation-state guide.)

The point generalizes: the model does not remember the last call because there is no “last call” from the model’s perspective. There is only the current tensor it is asked to complete. Everything we call “memory” — short-term, long-term, episodic, semantic, procedural — is built by something outside the model. That something is the harness.


2. The context window is an input cap, not a memory abstraction

“200K context window” or “1M context window” is an input-size constraint on the transformer. Attention is $O(n^2)$ in sequence length; the provider has chosen a maximum $n$. That is the entire technical content of the phrase. It is not a memory span — it is the maximum number of tokens you are allowed to send in one forward pass.

Anthropic says this plainly:

The “context window” refers to all the text a language model can reference when generating a response, including the response itself. […] This represents a “working memory” for the model. […] More context isn’t automatically better. As token count grows, accuracy and recall degrade, a phenomenon known as context rot. This makes curating what’s in context just as important as how much space is available. (context-window docs)

Two things are load-bearing. The phrase “working memory” is in scare quotes in the original — the docs frame it as a metaphor for the buffer the model attends to on this call, not a durable store. And capacity is not quality: documentation, research, and operator experience converge on context rot. The 1M-token model is not a 5× better rememberer than the 200K model. It is a model that will accept 5× more input and, past some task-dependent point, synthesize it less reliably.

The context window is an architectural parameter of the transformer — comparable to batch size in a training run. Nothing lives between requests. Treating “bigger window” as the path to “more persistent agents” confuses the input cap with the memory architecture. Those are different layers of the stack.


3. The conversation is a UX fiction

If every call is stateless, what is a “conversation”?

It is the harness doing two things, per turn: (1) append the new user message to a list it is maintaining; (2) submit the entire list as messages, get back the assistant reply, append that too. That is the whole mechanism. Anthropic’s context-window docs describe it verbatim:

Progressive token accumulation: As the conversation advances through turns, each user message and assistant response accumulates within the context window. Previous turns are preserved completely.

This is not a property of the model. It is a property of what the harness chooses to resubmit. The model is handed the full transcript on every call and has no way to know it is the twentieth call rather than the first. It is not “continuing” anything; it is freshly conditioning on a prompt the harness has assembled to look like a continuation.

Two harnesses can produce radically different “conversations” from the same model and the same user inputs by varying what they put in messages: append-everything produces linear growth to the window cap; summarize-then-keep-last-N produces a chat that forgets turn 2’s wording but keeps the gist; inject-different-system-prompt-per-call produces a persona that appears to change mid-stream; query-a-vector-store-and-prepend produces a chat that seems to remember things the model was never told. None of these is a model behavior. All of them are harness behaviors presented through the conversational surface. There is no canonical shape for “a conversation.” There is only whatever the harness decides to submit.


4. Every existing LLM system is a harness choice

Once you see the harness, every agent, chatbot, copilot, and “AI framework” becomes an arrangement of answers to roughly six questions:

  1. What gets kept verbatim in messages? (raw transcript, sliding window, compacted, or replaced by summary)
  2. What gets summarized, and when? (event-driven, scheduled, or never)
  3. What gets pulled in from outside the transcript? (retrieval, pinned files, skill injection, world state)
  4. What gets written back out, and who decides? (the model via tool calls, the harness after every turn, a separate consolidation pass)
  5. What survives the process dying? (filesystem, DB, external issue tracker, nothing)
  6. How does the next run pick up? (replay the transcript, load a brief, reconstruct state from the world)

Plot real systems against these axes and the landscape stops looking like a zoo and starts looking like a design space. The table below draws on a verified 10-framework survey (../../wiki/business/autonomous-agents-context-continuity.md), with star counts and file paths verified via gh api as of 2026-04-19.

SystemTranscript policySummarizationExternal retrievalWrite-backDurabilityRollover
Claude Code (docs)Linear accumulation; /compact only when forcedStructured-summary compaction, lossyCLAUDE.md re-injected at start and after compact; MCP tool names deferredAuto MEMORY.md written by model when “worth remembering”Session JSONL + CLAUDE.md + MEMORY.md on disk--continue/--resume replays JSONL
ChatGPT (consumer)Linear within a conversation; new chats start freshImplicit truncation at capOptional “Memories” feature injects user facts into system promptModel proposes memory writes, user can vetoServer-side conversation rows + Memories tableNew chat ≠ resume; the harness decides scope
Claude Agent SDK (repo)Full transcript replayed on resumeRelies on Claude Code compaction when hosted under itNone at SDK layerNone automaticJSONL transcripts on diskresume=<id>; fork_session=True
OpenClaw (repo)Compacted on threshold after a pre-compact save promptYes, LLM-driven; different model can be configuredActive-memory plugin injects retrieved notes before every reply as hidden prefixDreaming: scheduled nightly pass scores candidates and promotes to MEMORY.mdMarkdown workspace + SQLite + optional gitFresh window; re-inject curated brief, not transcript
Hermes Agent (repo)Middle turns compressed while preserving cache breakpointscontext_compressor.pyMemoryManager pre-turn / post-turn / async-prefetch hooks — first-class seamPost-turn sync_all; skill capture after complex tasksFTS5 SQLite state.db + ~/.hermes/memories/ filesSession DB with parent/child lineage across compressions
LangGraph (docs)User-composed: nodes read/write a typed state objectUser-composed (a summarize node, if any)Store API (semantic search, namespaced)Whatever your nodes writeCheckpointer (SQLite / Postgres) + Store APIthread_id → load latest checkpoint; time-travel to any prior checkpoint
Agent Zero (repo)Batch-orientedHierarchical spawn: subordinate in fresh context, returns summary_50_recall_memories.py runs FAISS KNN after every message_50_memorize_fragments.py extracts facts post-turn into FAISSFAISS index + chat filesNo explicit resume; next run’s auto-recall finds relevant vectors
Cloudflare Agents SDK (docs)No built-in transcript notion — agent owns its stateWhatever the agent implementsDeveloper-defined SQL queriesthis.state / this.sql / this.schedule()Durable Object SQLite + alarms; hibernates when idleStable DO ID: there is no rollover — the agent never ended
Prime (article)No user-visible transcript — one LLM call per wake, bounded by designNot needed; wake-level reasoning stays smallStructured state read from DO SQL + D1 + GitHub Issues before each wakeDecisions row in DO SQL; tracking-issue comment on GitHub; skill crystallization on escalationDO SQLite (working) + GitHub Issues (episodic) + CLAUDE.md (semantic) + D1 (shared)Hibernation — the DO reconstructs from its four memory stores on next wake

None of the differences in this table are differences in the model. They are differences in what the harness does between calls.

The dominant harness — Claude Code, ChatGPT, Cursor, most chat copilots — is roughly: keep appending the transcript; compact only when forced; retrieve only what the user explicitly pinned. Call it linear accumulation. It is the laziest harness that produces a coherent user experience, and it is the paved road.

The systems that break from it all do it differently:

Each is a different shape of memory the harness is imposing on a stateless model. The claim “the model remembers” is incoherent in all of them. The correct claim is “the harness remembers on the model’s behalf, and chooses what to show the model each time it is called.”


5. Why linear accumulation persists despite breaking

If linear accumulation is the worst harness, why is it everywhere? Three reasons, in descending order of honesty.

It’s the trivial implementation. If your first job is show the user something that looks like a conversation, the shortest path is: store the turns, send them all back, display the reply. The harness falls out of the UI requirement. You don’t notice you made a design decision; you notice you made a product.

It works until it doesn’t. Up to maybe 50K–150K tokens of transcript (task-dependent), linear accumulation produces decent-seeming results — the model “remembers earlier in the chat” because earlier in the chat is in the prompt. The failures come later: the window fills, compaction strips a detail, “context rot” silently degrades answers on long chats. By then the architecture is a load-bearing assumption.

The platform vendors amplified it. Anthropic’s Messages API and OpenAI’s Responses API (with previous_response_id and server-side Conversations) make linear accumulation the paved road. If the fastest way to ship a chat product is to hand the vendor’s SDK your transcript and let it deal with replay, that is what most teams will do. The vendors are now layering server-side compaction on top (Anthropic ships it as a beta feature) — an elegant admission that linear accumulation breaks, and a tacit acknowledgment that the harness work has to happen somewhere. They will sell it to you rather than leave you to build it.

Linear accumulation is not a principled answer to “how should a long-horizon agent remember?” It is the answer to “how do we ship a chat product this quarter?” The industry has been answering the second while claiming to answer the first.


6. The bigger-window race is a dead end

The capacity race — 8K → 32K → 128K → 200K → 1M — has been sold as progress toward longer-memory, more-autonomous agents. It is not that.

Capacity is not synthesis quality. Anthropic’s own docs are explicit: “more context isn’t automatically better. As token count grows, accuracy and recall degrade.” Research on long-context retrieval (MRCR, GraphWalks, the various “needle in a haystack” evaluations) shows models handle a few needles in a million tokens well but degrade on synthesis across many scattered signals — the thing you actually need for multi-day agent work.

Capacity does not solve durability. A 1M-token window is still a per-call input cap. It does not help you across two calls, two processes, two machines, two weeks. The problems linear accumulation hits — context rot on long chats, state lost when a session dies, no cross-thread memory — are orthogonal to window size. Making the window bigger delays the failure and makes it more expensive when it comes.

Capacity does not substitute for curation. A good harness answers “what are the 20K tokens this call should see?” The answer is almost never “all of them.” A harness that curates — the right CLAUDE.md, the right skills, the right prior-attempts rows, the right retrieved notes — beats one that dumps history into a larger window. The penalty for indiscriminate inclusion (context rot, cost, latency) scales with the window you fill.

Anthropic’s own engineering guidance has started to shift: the docs explicitly tell you to “design your state artifacts so that context recovery is fast when a new session starts,” and link out to a piece titled “Effective harnesses for long-running agents.” The vendor selling you the context window is telling you, in writing, that the context window is not the answer — the harness is.

The big-window race is a manufacturing race (can we build transformers that accept longer inputs?), not a cognition race (can agents sustain coherent long-horizon work?). Mistaking the first for the second is the single most expensive category error in the current agent-infrastructure conversation.


7. Implications: autonomy is harness work

If the model is stateless, the window is an input cap, and the conversation is a UX fiction, then every long-horizon capability you care about lives in the harness. That reframes what “building an AI agent” even means.

Memory is a subsystem, not a feature. You pick a persistence substrate (files, SQLite, Postgres, Durable Objects, GitHub Issues, a vector DB, a knowledge graph). You pick what gets written, when, and by whom — the model via tool calls, the harness after every turn, or a separate consolidation worker. You pick what gets read back, on what trigger. The model is a fixed dependency; the memory is your architecture. And only a handful of systems include a consolidation pass at all — OpenClaw’s three-phase dreaming, NanoBot’s Consolidator-plus-Dream, Honcho’s named dreamer module, Graphiti’s bi-temporal edge invalidation, A-MEM’s neighbor-updating writes (active-memory survey). Everything else accumulates without distilling. Without consolidation, a system stores trajectories but never learns the lesson.

Retrieval is harness-level. The model does not “remember X.” The harness shows the model X, at the moment it decides X is relevant. Agent Zero’s _50_recall_memories.py, OpenClaw’s active-memory plugin, Hermes’s MemoryManager.prefetch_all — all fire outside the model and inject into the next prompt. The model is none the wiser. That is the point.

Autonomy is goal-carrying. A “Claude Code session working on a task” is really a human carrying the goal across turns and prodding the model into each next step. When the human leaves, the goal evaporates, because the goal was never in the model. For the goal to persist, it has to live somewhere outside the model — a run sheet, a GitHub Issue, a DO’s working memory, a plan written to MEMORY.md. That “somewhere” is the harness. Prime and the autonomous-entity pattern treat this as foundational: the agent’s identity, current goal, and escalation history live in persistent stores, and the LLM call is a disposable reasoning step that reads them, adds to them, and returns.

Stated fully: if memory, retrieval, consolidation, rollover, and goal-carrying are all harness work, then the harness is doing the thing we’ve been calling “agency.” The model is a mutation worker the harness dispatches. The “agent” is the harness. (Developed further in ../thinking-is-substrate-self-modification/README.md and ../goal-generation-is-agency/README.md.)

The closest ML analogue: DSPy

The ML research community has been pointing at this for years. DSPy — Stanford NLP’s framework, started at Stanford in February 2022 — is explicit that what it does to an LLM program is compile it. From dspy.ai:

DSPy is a declarative framework for building modular AI software. […] Compile AI programs into effective prompts and weights.

You write AI programs in a module-plus-signature style, and DSPy’s optimizers (MIPROv2, BootstrapFewShot, etc.) produce the prompts that make those programs work on a given model. The prompt is the compiler output; the AI program is the source; the model is a target. That is exactly the relationship between a harness and an LLM: a harness is a program that, on every turn, compiles the current world — user turn, memory, retrieved facts, system prompt, skills, plan — into the prompt the model executes.

Stanford-DSPy names this openly and builds tooling around it. Most agent frameworks have been doing a degenerate version of the same thing — string-concatenating a transcript — without acknowledging it. Once you see the harness as a compiler (see ../the-harness-is-a-prompt-compiler/README.md), “what’s the best harness?” becomes “what’s the best compiler?” — and compilers are engineered, tested, optimized. They are not accidental.


8. Closing

The industry has spent two years arguing about model capability — benchmarks, evals, “GPT-5 will unlock this, Opus 4.7 will unlock that” — while the mechanisms that decide whether an agent can do multi-day work sat un-examined one layer up.

The stateless fact is not a bug; it is the correct factoring. The model should be a pure function; state and identity should live in a substrate designed for them. What has been missing is the recognition that the substrate is the architecture — every agent framework is implicitly answering a handful of harness-design questions, and the answers matter enormously.

Once you see that, three things follow:

  1. Stop racing context windows. You are racing the wrong axis.
  2. Stop asking “does the model remember?” It doesn’t. Ask “what does my harness show it, and when, and why?”
  3. Treat the harness as a first-class engineering artifact — with a schema for memory, a retrieval policy, a consolidation pass, a rollover protocol, and a curation function that chooses what goes into each complete() call.

The short name for that last step is: treat the harness as a prompt compiler. That is the subject of the companion articles in this series.


Sources

Vendor docs (primary). Anthropic Messages API (“stateless multi-turn conversations”); Anthropic Context windows (window as “working memory,” context rot); Anthropic Effective harnesses for long-running agents; OpenAI Conversation state (stateless Chat Completions + server-side Responses conversation object); DSPy / Stanford NLP dspy.ai (“compile AI programs into effective prompts and weights,” started Feb 2022).

Framework mechanisms cited. OpenClaw — compaction.md, dreaming.md, active-memory.md. Hermes Agent — memory_manager.py. Claude Code — context-window docs, agent-sdk sessions. LangGraph — persistence docs. Agent Zero — _50_recall_memories.py. Cloudflare Agents SDK — docs, DO hibernation.

Internal surveys (verified 2026-04-19, all GitHub URLs and star counts checked via gh api):


Edit page
Share this post on:

Previous Post
Two Classes of Agents: Codebase-Native vs Workers-Native
Next Post
Dreaming and the Effect Gate