Thesis. LLMs are stateless. A “context window” is a transformer input-size cap, not memory. A “conversation” is a UX fiction the harness produces by resubmitting prior turns on every call. Every memory mechanism — every chat history, every retrieval hook, every
CLAUDE.md, every skill registry, every “dreaming” pass — is a harness decision, not a model capability. So long-horizon autonomy is a question of harness design, not model capability.
This is the flat, technical claim most commentary about “agents” elides. Once you accept it, a lot of the contemporary debate — bigger windows vs. smaller, memory plugins, RAG vs. fine-tuning, “will GPT-5 remember you?” — collapses into a single engineering question: what does the harness do between complete() calls?
1. The stateless fact
Every inference API call to Claude, OpenAI, or Gemini is a pure function: complete(tokens) -> tokens. No hidden server-side state. No persistent identity across calls. The second call knows nothing about the first. This is not something you have to reverse-engineer — the providers document it explicitly.
Anthropic’s Messages API docs:
The Messages API can be used for either single queries or stateless multi-turn conversations. […] When creating a new
Message, you specify the prior conversational turns with themessagesparameter, and the model then generates the nextMessagein the conversation. (Messages API)
The word “stateless” is not buried. The client sends the complete messages array — every user turn, every assistant turn — on every request. Send half the array, the model sees half a conversation. Send a re-ordered array, the model sees a re-ordered conversation. Send an invented array, the model sees an invented conversation and answers as if it had happened.
OpenAI’s Chat Completions behave the same way; Google’s Gemini generateContent is structurally identical. The newer OpenAI Responses API adds an optional server-side previous_response_id and a Conversations object, but that is a server-side rewrite of the same pattern — OpenAI’s servers now hold the transcript on your behalf and resubmit it for you. The model call underneath is still stateless. (Responses conversation-state guide.)
The point generalizes: the model does not remember the last call because there is no “last call” from the model’s perspective. There is only the current tensor it is asked to complete. Everything we call “memory” — short-term, long-term, episodic, semantic, procedural — is built by something outside the model. That something is the harness.
2. The context window is an input cap, not a memory abstraction
“200K context window” or “1M context window” is an input-size constraint on the transformer. Attention is $O(n^2)$ in sequence length; the provider has chosen a maximum $n$. That is the entire technical content of the phrase. It is not a memory span — it is the maximum number of tokens you are allowed to send in one forward pass.
Anthropic says this plainly:
The “context window” refers to all the text a language model can reference when generating a response, including the response itself. […] This represents a “working memory” for the model. […] More context isn’t automatically better. As token count grows, accuracy and recall degrade, a phenomenon known as context rot. This makes curating what’s in context just as important as how much space is available. (context-window docs)
Two things are load-bearing. The phrase “working memory” is in scare quotes in the original — the docs frame it as a metaphor for the buffer the model attends to on this call, not a durable store. And capacity is not quality: documentation, research, and operator experience converge on context rot. The 1M-token model is not a 5× better rememberer than the 200K model. It is a model that will accept 5× more input and, past some task-dependent point, synthesize it less reliably.
The context window is an architectural parameter of the transformer — comparable to batch size in a training run. Nothing lives between requests. Treating “bigger window” as the path to “more persistent agents” confuses the input cap with the memory architecture. Those are different layers of the stack.
3. The conversation is a UX fiction
If every call is stateless, what is a “conversation”?
It is the harness doing two things, per turn: (1) append the new user message to a list it is maintaining; (2) submit the entire list as messages, get back the assistant reply, append that too. That is the whole mechanism. Anthropic’s context-window docs describe it verbatim:
Progressive token accumulation: As the conversation advances through turns, each user message and assistant response accumulates within the context window. Previous turns are preserved completely.
This is not a property of the model. It is a property of what the harness chooses to resubmit. The model is handed the full transcript on every call and has no way to know it is the twentieth call rather than the first. It is not “continuing” anything; it is freshly conditioning on a prompt the harness has assembled to look like a continuation.
Two harnesses can produce radically different “conversations” from the same model and the same user inputs by varying what they put in messages: append-everything produces linear growth to the window cap; summarize-then-keep-last-N produces a chat that forgets turn 2’s wording but keeps the gist; inject-different-system-prompt-per-call produces a persona that appears to change mid-stream; query-a-vector-store-and-prepend produces a chat that seems to remember things the model was never told. None of these is a model behavior. All of them are harness behaviors presented through the conversational surface. There is no canonical shape for “a conversation.” There is only whatever the harness decides to submit.
4. Every existing LLM system is a harness choice
Once you see the harness, every agent, chatbot, copilot, and “AI framework” becomes an arrangement of answers to roughly six questions:
- What gets kept verbatim in
messages? (raw transcript, sliding window, compacted, or replaced by summary) - What gets summarized, and when? (event-driven, scheduled, or never)
- What gets pulled in from outside the transcript? (retrieval, pinned files, skill injection, world state)
- What gets written back out, and who decides? (the model via tool calls, the harness after every turn, a separate consolidation pass)
- What survives the process dying? (filesystem, DB, external issue tracker, nothing)
- How does the next run pick up? (replay the transcript, load a brief, reconstruct state from the world)
Plot real systems against these axes and the landscape stops looking like a zoo and starts looking like a design space. The table below draws on a verified 10-framework survey (../../wiki/business/autonomous-agents-context-continuity.md), with star counts and file paths verified via gh api as of 2026-04-19.
| System | Transcript policy | Summarization | External retrieval | Write-back | Durability | Rollover |
|---|---|---|---|---|---|---|
| Claude Code (docs) | Linear accumulation; /compact only when forced | Structured-summary compaction, lossy | CLAUDE.md re-injected at start and after compact; MCP tool names deferred | Auto MEMORY.md written by model when “worth remembering” | Session JSONL + CLAUDE.md + MEMORY.md on disk | --continue/--resume replays JSONL |
| ChatGPT (consumer) | Linear within a conversation; new chats start fresh | Implicit truncation at cap | Optional “Memories” feature injects user facts into system prompt | Model proposes memory writes, user can veto | Server-side conversation rows + Memories table | New chat ≠ resume; the harness decides scope |
| Claude Agent SDK (repo) | Full transcript replayed on resume | Relies on Claude Code compaction when hosted under it | None at SDK layer | None automatic | JSONL transcripts on disk | resume=<id>; fork_session=True |
| OpenClaw (repo) | Compacted on threshold after a pre-compact save prompt | Yes, LLM-driven; different model can be configured | Active-memory plugin injects retrieved notes before every reply as hidden prefix | Dreaming: scheduled nightly pass scores candidates and promotes to MEMORY.md | Markdown workspace + SQLite + optional git | Fresh window; re-inject curated brief, not transcript |
| Hermes Agent (repo) | Middle turns compressed while preserving cache breakpoints | context_compressor.py | MemoryManager pre-turn / post-turn / async-prefetch hooks — first-class seam | Post-turn sync_all; skill capture after complex tasks | FTS5 SQLite state.db + ~/.hermes/memories/ files | Session DB with parent/child lineage across compressions |
| LangGraph (docs) | User-composed: nodes read/write a typed state object | User-composed (a summarize node, if any) | Store API (semantic search, namespaced) | Whatever your nodes write | Checkpointer (SQLite / Postgres) + Store API | thread_id → load latest checkpoint; time-travel to any prior checkpoint |
| Agent Zero (repo) | Batch-oriented | Hierarchical spawn: subordinate in fresh context, returns summary | _50_recall_memories.py runs FAISS KNN after every message | _50_memorize_fragments.py extracts facts post-turn into FAISS | FAISS index + chat files | No explicit resume; next run’s auto-recall finds relevant vectors |
| Cloudflare Agents SDK (docs) | No built-in transcript notion — agent owns its state | Whatever the agent implements | Developer-defined SQL queries | this.state / this.sql / this.schedule() | Durable Object SQLite + alarms; hibernates when idle | Stable DO ID: there is no rollover — the agent never ended |
| Prime (article) | No user-visible transcript — one LLM call per wake, bounded by design | Not needed; wake-level reasoning stays small | Structured state read from DO SQL + D1 + GitHub Issues before each wake | Decisions row in DO SQL; tracking-issue comment on GitHub; skill crystallization on escalation | DO SQLite (working) + GitHub Issues (episodic) + CLAUDE.md (semantic) + D1 (shared) | Hibernation — the DO reconstructs from its four memory stores on next wake |
None of the differences in this table are differences in the model. They are differences in what the harness does between calls.
The dominant harness — Claude Code, ChatGPT, Cursor, most chat copilots — is roughly: keep appending the transcript; compact only when forced; retrieve only what the user explicitly pinned. Call it linear accumulation. It is the laziest harness that produces a coherent user experience, and it is the paved road.
The systems that break from it all do it differently:
- OpenClaw runs a nightly “dreaming” pass that scores recent signals and promotes winners into a curated
MEMORY.md, then re-injects that brief (not the raw transcript) at session start. Context fill is routine, not an emergency — it runs an agent-initiated pre-compact save prompt first. (compaction.md,dreaming.md) - Hermes Agent formalizes retrieval as a first-class seam:
MemoryManager.prefetch_allandsync_allhooks fire before and after every LLM call, with backends pluggable behind the hook. (memory_manager.py) - Prime sidesteps “session” entirely. Each repo is a Cloudflare Durable Object that hibernates, wakes on alarm or webhook, reconstructs its picture from GitHub Issues +
CLAUDE.md+ D1, makes one bounded LLM call, writes back, and sleeps. No transcript to resume — only structured state the agent re-reads on every wake. (article) - LangGraph makes the checkpoint the primary object: each node write is a
StateSnapshotkeyed bythread_id. Any process with the Postgres backing the checkpointer can resume the same “conversation” weeks later, from any prior snapshot. The “conversation” is a typed state object, not a transcript.
Each is a different shape of memory the harness is imposing on a stateless model. The claim “the model remembers” is incoherent in all of them. The correct claim is “the harness remembers on the model’s behalf, and chooses what to show the model each time it is called.”
5. Why linear accumulation persists despite breaking
If linear accumulation is the worst harness, why is it everywhere? Three reasons, in descending order of honesty.
It’s the trivial implementation. If your first job is show the user something that looks like a conversation, the shortest path is: store the turns, send them all back, display the reply. The harness falls out of the UI requirement. You don’t notice you made a design decision; you notice you made a product.
It works until it doesn’t. Up to maybe 50K–150K tokens of transcript (task-dependent), linear accumulation produces decent-seeming results — the model “remembers earlier in the chat” because earlier in the chat is in the prompt. The failures come later: the window fills, compaction strips a detail, “context rot” silently degrades answers on long chats. By then the architecture is a load-bearing assumption.
The platform vendors amplified it. Anthropic’s Messages API and OpenAI’s Responses API (with previous_response_id and server-side Conversations) make linear accumulation the paved road. If the fastest way to ship a chat product is to hand the vendor’s SDK your transcript and let it deal with replay, that is what most teams will do. The vendors are now layering server-side compaction on top (Anthropic ships it as a beta feature) — an elegant admission that linear accumulation breaks, and a tacit acknowledgment that the harness work has to happen somewhere. They will sell it to you rather than leave you to build it.
Linear accumulation is not a principled answer to “how should a long-horizon agent remember?” It is the answer to “how do we ship a chat product this quarter?” The industry has been answering the second while claiming to answer the first.
6. The bigger-window race is a dead end
The capacity race — 8K → 32K → 128K → 200K → 1M — has been sold as progress toward longer-memory, more-autonomous agents. It is not that.
Capacity is not synthesis quality. Anthropic’s own docs are explicit: “more context isn’t automatically better. As token count grows, accuracy and recall degrade.” Research on long-context retrieval (MRCR, GraphWalks, the various “needle in a haystack” evaluations) shows models handle a few needles in a million tokens well but degrade on synthesis across many scattered signals — the thing you actually need for multi-day agent work.
Capacity does not solve durability. A 1M-token window is still a per-call input cap. It does not help you across two calls, two processes, two machines, two weeks. The problems linear accumulation hits — context rot on long chats, state lost when a session dies, no cross-thread memory — are orthogonal to window size. Making the window bigger delays the failure and makes it more expensive when it comes.
Capacity does not substitute for curation. A good harness answers “what are the 20K tokens this call should see?” The answer is almost never “all of them.” A harness that curates — the right CLAUDE.md, the right skills, the right prior-attempts rows, the right retrieved notes — beats one that dumps history into a larger window. The penalty for indiscriminate inclusion (context rot, cost, latency) scales with the window you fill.
Anthropic’s own engineering guidance has started to shift: the docs explicitly tell you to “design your state artifacts so that context recovery is fast when a new session starts,” and link out to a piece titled “Effective harnesses for long-running agents.” The vendor selling you the context window is telling you, in writing, that the context window is not the answer — the harness is.
The big-window race is a manufacturing race (can we build transformers that accept longer inputs?), not a cognition race (can agents sustain coherent long-horizon work?). Mistaking the first for the second is the single most expensive category error in the current agent-infrastructure conversation.
7. Implications: autonomy is harness work
If the model is stateless, the window is an input cap, and the conversation is a UX fiction, then every long-horizon capability you care about lives in the harness. That reframes what “building an AI agent” even means.
Memory is a subsystem, not a feature. You pick a persistence substrate (files, SQLite, Postgres, Durable Objects, GitHub Issues, a vector DB, a knowledge graph). You pick what gets written, when, and by whom — the model via tool calls, the harness after every turn, or a separate consolidation worker. You pick what gets read back, on what trigger. The model is a fixed dependency; the memory is your architecture. And only a handful of systems include a consolidation pass at all — OpenClaw’s three-phase dreaming, NanoBot’s Consolidator-plus-Dream, Honcho’s named dreamer module, Graphiti’s bi-temporal edge invalidation, A-MEM’s neighbor-updating writes (active-memory survey). Everything else accumulates without distilling. Without consolidation, a system stores trajectories but never learns the lesson.
Retrieval is harness-level. The model does not “remember X.” The harness shows the model X, at the moment it decides X is relevant. Agent Zero’s _50_recall_memories.py, OpenClaw’s active-memory plugin, Hermes’s MemoryManager.prefetch_all — all fire outside the model and inject into the next prompt. The model is none the wiser. That is the point.
Autonomy is goal-carrying. A “Claude Code session working on a task” is really a human carrying the goal across turns and prodding the model into each next step. When the human leaves, the goal evaporates, because the goal was never in the model. For the goal to persist, it has to live somewhere outside the model — a run sheet, a GitHub Issue, a DO’s working memory, a plan written to MEMORY.md. That “somewhere” is the harness. Prime and the autonomous-entity pattern treat this as foundational: the agent’s identity, current goal, and escalation history live in persistent stores, and the LLM call is a disposable reasoning step that reads them, adds to them, and returns.
Stated fully: if memory, retrieval, consolidation, rollover, and goal-carrying are all harness work, then the harness is doing the thing we’ve been calling “agency.” The model is a mutation worker the harness dispatches. The “agent” is the harness. (Developed further in ../thinking-is-substrate-self-modification/README.md and ../goal-generation-is-agency/README.md.)
The closest ML analogue: DSPy
The ML research community has been pointing at this for years. DSPy — Stanford NLP’s framework, started at Stanford in February 2022 — is explicit that what it does to an LLM program is compile it. From dspy.ai:
DSPy is a declarative framework for building modular AI software. […] Compile AI programs into effective prompts and weights.
You write AI programs in a module-plus-signature style, and DSPy’s optimizers (MIPROv2, BootstrapFewShot, etc.) produce the prompts that make those programs work on a given model. The prompt is the compiler output; the AI program is the source; the model is a target. That is exactly the relationship between a harness and an LLM: a harness is a program that, on every turn, compiles the current world — user turn, memory, retrieved facts, system prompt, skills, plan — into the prompt the model executes.
Stanford-DSPy names this openly and builds tooling around it. Most agent frameworks have been doing a degenerate version of the same thing — string-concatenating a transcript — without acknowledging it. Once you see the harness as a compiler (see ../the-harness-is-a-prompt-compiler/README.md), “what’s the best harness?” becomes “what’s the best compiler?” — and compilers are engineered, tested, optimized. They are not accidental.
8. Closing
The industry has spent two years arguing about model capability — benchmarks, evals, “GPT-5 will unlock this, Opus 4.7 will unlock that” — while the mechanisms that decide whether an agent can do multi-day work sat un-examined one layer up.
The stateless fact is not a bug; it is the correct factoring. The model should be a pure function; state and identity should live in a substrate designed for them. What has been missing is the recognition that the substrate is the architecture — every agent framework is implicitly answering a handful of harness-design questions, and the answers matter enormously.
Once you see that, three things follow:
- Stop racing context windows. You are racing the wrong axis.
- Stop asking “does the model remember?” It doesn’t. Ask “what does my harness show it, and when, and why?”
- Treat the harness as a first-class engineering artifact — with a schema for memory, a retrieval policy, a consolidation pass, a rollover protocol, and a curation function that chooses what goes into each
complete()call.
The short name for that last step is: treat the harness as a prompt compiler. That is the subject of the companion articles in this series.
Related articles
- The Harness Is a Prompt Compiler — the companion piece. Formalizes the harness as
f(H, G) -> P: a function from harness state and goal to the prompt submitted to the model. - Thinking Is Substrate Self-Modification — the deeper claim. If memory is where cognition lives, then “thinking” is the harness mutating its own substrate, and the LLM call is a regulated rewrite step.
- Goal Generation Is Agency — the capstone. Executors (models) are a commodity; the scarce, load-bearing capability in any autonomous system is whatever generates and carries the goal across time. That lives in the harness.
- Prime: Persistent Org-Level AI Agents on Cloudflare — a concrete harness design that takes these claims seriously. No session, no transcript, four-layer memory (DO SQLite / GitHub Issues /
CLAUDE.md/ D1), one LLM call per wake. - The Autonomous Entity Pattern — harness architecture generalized across domains (software orgs, brands, books, hospitals).
Sources
Vendor docs (primary). Anthropic Messages API (“stateless multi-turn conversations”); Anthropic Context windows (window as “working memory,” context rot); Anthropic Effective harnesses for long-running agents; OpenAI Conversation state (stateless Chat Completions + server-side Responses conversation object); DSPy / Stanford NLP dspy.ai (“compile AI programs into effective prompts and weights,” started Feb 2022).
Framework mechanisms cited. OpenClaw — compaction.md, dreaming.md, active-memory.md. Hermes Agent — memory_manager.py. Claude Code — context-window docs, agent-sdk sessions. LangGraph — persistence docs. Agent Zero — _50_recall_memories.py. Cloudflare Agents SDK — docs, DO hibernation.
Internal surveys (verified 2026-04-19, all GitHub URLs and star counts checked via gh api):
- Autonomous Agents: Context Continuity survey — 10 frameworks, five-question rubric per framework (compaction, persistence, retrieval, rollover, what compounds), with file-path citations for every mechanism.
- Active Memory SOTA Survey — 11 memory systems (Letta, Mem0, Graphiti, Cognee, A-MEM, MemoryOS, Honcho, LangMem, etc.), focused on write-propagation, cascade, and consolidation beyond linear accumulation.