The Harness Is a Prompt Compiler

Every LLM system — ChatGPT, Claude Code, your own agent — is implicitly a compiler. Given accumulated history and a goal, it produces a prompt. The quality of that compiler is the entire game. Models and harnesses improve on independent axes: models become more tolerant of messy inputs; harnesses become more precise in synthesizing them. The bigger-context-window race lets lazy compilers get away with more. It does not make the compilers better.

1. Every LLM System Is Secretly a Compiler
2. The Compilation Model: P = f(H, G)
3. Plotting Real Systems on the Quality-of-f Axis
4. Prior Art: DSPy, MIPRO, Self-Refine, Constitutional AI
5. Three Implications
6. What This Means for Builders: Invest in f
7. Related Articles and Sources

1. Every LLM System Is Secretly a Compiler

Open the network tab on ChatGPT or Claude Code during any “conversation.” Each time you hit enter, the client does not send your most recent message. It sends a blob — a fully reconstructed prompt containing the system instruction, every prior turn, every tool call, every tool result, sometimes pinned files, sometimes cached context segments — packaged into one request. The model never sees a “conversation.” It sees one prompt at a time.

That blob is not your message. It is a compiled artifact.

We talk about “context management,” “memory,” “sessions,” “multi-turn chat” as if they were first-class primitives. They are not. They are implementation details of a hidden compilation step: given some accumulating state and a goal, produce a prompt. The model runs the prompt. The harness compiles it.

Every LLM system ships a compiler, whether its authors think of it that way or not. ChatGPT’s compiler is trivial — concatenate recent turns, drop old ones when they no longer fit. Claude Code’s compiler is slightly richer — inject CLAUDE.md, include working-directory context, resume via session JSONL when asked. Advanced frameworks — Graphiti, OpenClaw, Honcho, Prime — run elaborate pipelines: retrieve from semantic stores, consolidate background memories, re-rank episodes, synthesize briefs. All of these are compilers. They differ on sophistication, not on kind.

Once you see it this way, a pile of hand-waved concepts collapses into one well-defined problem. “Memory” is what the compiler persists. “Retrieval” is how the compiler selects inputs. “Context” is the compiled output. “Session” — as argued below — is a cache key.

This article makes the compiler framing explicit, plots existing systems along a quality-of-compilation axis, and argues that the most important work in LLM systems right now is work on the compiler — not on the model, and not on the size of its input window.

2. The Compilation Model: P = f(H, G)

Formally:

H = accumulated history. All state the system has from prior interaction: chat turns, tool calls, user-authored files, skill libraries, issue threads, graph edges, embedding indexes, episodic logs. Grows monotonically. Lives in storage the harness controls.
G = goal. What the user (or an upstream agent) wants now. Usually a user message; sometimes a scheduled wake; sometimes a webhook.
f = the compiler. f(H, G) → P. Retrieval queries, filters, ranking, compression, templating, token budgeting, cache-key computation.
P = the compiled prompt. The artifact shipped to the model.
LLM(P) = runtime execution. The model returns text (and sometimes tool calls) given P.

The whole system is a two-phase pipeline:

(H, G) ──f──► P ──LLM──► output
                        │
                        ▼
                     (update H)
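
The two-phase pipeline above can be sketched in a few lines. This is a minimal illustration, not any particular system's implementation: `History`, `compile_prompt`, `step`, and `echo_llm` are hypothetical names, and the model call is a stub so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class History:
    """H: accumulated state the harness controls (here, just a list of text records)."""
    records: List[str] = field(default_factory=list)


def compile_prompt(history: History, goal: str) -> str:
    """f(H, G) -> P. This toy version includes every record, the laziest possible f."""
    context = "\n".join(f"- {r}" for r in history.records)
    return f"Context:\n{context}\n\nGoal: {goal}"


def step(history: History, goal: str, llm: Callable[[str], str]) -> str:
    """One full cycle: compile P, run the model on P, then write the result back into H."""
    prompt = compile_prompt(history, goal)
    output = llm(prompt)
    history.records.append(f"goal={goal!r} output={output!r}")
    return output


def echo_llm(prompt: str) -> str:
    """Stand-in for a real model call (hypothetical); reports the prompt length only."""
    return f"saw {len(prompt)} chars"


h = History(records=["user prefers short answers"])
first = step(h, "summarize the repo", echo_llm)
second = step(h, "list open issues", echo_llm)
```

Everything that distinguishes one system from another lives inside `compile_prompt`; `step` and the model call stay the same.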

Two observations follow.

The LLM step is vendor-fungible. There are ~six organizations in the world that produce frontier models. As a system builder, you rent one. You are not going to outcompete them on model quality. If your system beats another system on output quality, the difference lives in f — both systems can call the same LLM.

f is where the engineering is. Writing f well means choosing which H-slices to load, in what order, at what compression, with what framing, under what token budget, for which G. None of that is model-dependent. All of it is decidable at compile time. A better f today means better outputs tomorrow, regardless of which model you call.
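
As a rough sketch of the "which H-slices, under what token budget" decision, the snippet below greedily picks the most relevant slices that fit. The `Slice` type, the precomputed relevance scores, and the one-token-per-word estimate are all assumptions made for illustration; a real f would score against G and use the target model's tokenizer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Slice:
    """One candidate piece of H, with a relevance score already computed for the current G."""
    text: str
    relevance: float


def estimate_tokens(text: str) -> int:
    """Crude assumption: about one token per whitespace-separated word."""
    return len(text.split())


def select_slices(candidates: List[Slice], budget: int) -> List[Slice]:
    """Greedy selection: take the most relevant slices that still fit in the budget."""
    chosen: List[Slice] = []
    used = 0
    for s in sorted(candidates, key=lambda s: s.relevance, reverse=True):
        cost = estimate_tokens(s.text)
        if used + cost <= budget:
            chosen.append(s)
            used += cost
    return chosen


candidates = [
    Slice("deploy failed twice on the staging branch", relevance=0.9),
    Slice("user likes dark mode", relevance=0.1),
    Slice("last fix attempt reverted the cache change", relevance=0.8),
]
picked = select_slices(candidates, budget=14)  # keeps the two high-relevance slices
```

Greedy selection is only one possible policy; the point is that the choice is made in code, before the model sees anything.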

The analogy to classical compilers is close: GCC and Clang/LLVM compile the same C code for the same hardware, but differ in how they lower source to IR, which passes they run, and how they allocate registers. The CPU vendor doesn’t ship the optimizer. Same pattern here: the LLM vendor ships the runtime; the harness author writes the optimizer.

3. Plotting Real Systems on the Quality-of-f Axis

Rank existing systems on how much work their compiler does. “Quality” here is the sophistication of the synthesis step between H and P. A high-quality compiler does selective retrieval, compression, and goal-conditioned framing. A low-quality compiler concatenates.

| System | What H looks like | What f does | Quality of f |
| --- | --- | --- | --- |
| ChatGPT / Claude Code (chat mode) | Flat linear turns; optional pinned files | Concatenate recent turns until token limit, then drop or summarize | Low |
| Claude Code (with `CLAUDE.md`) | Linear turns + working directory + `CLAUDE.md` loaded at session start | Concatenate + prepend `CLAUDE.md`; opaque `/compact` on overflow | Low-to-modest |
| OpenClaw | Workspace files (`AGENTS.md`, `MEMORY.md`, daily logs), SQLite session transcripts, skills registry, active-memory plugin | Session-start injection of curated files + yesterday’s daily log + on-demand `memory_search` + pre-turn hidden active-memory pass | Modest-to-high |
| Prime (RepoPrime DO) | DO SQLite (working memory + decision log), GitHub issues (episodic), `CLAUDE.md` (semantic), D1 (shared signals/attempts) | Per-wake: load identity, query world state, fetch attempt history, synthesize into a Zod-typed `generateObject()` call with structured decision output | High: explicit, per-wake |
| Hermes Agent (with Honcho plugin) | Hermes provides session FTS5 DB + `MemoryManager` hooks + a plugin interface; Honcho plugs in to add peer representations, deriver/dreamer/dialectic decomposition, and a user-profile layer | `MemoryManager.prefetch_all()` / `sync_all()` fire pre- and post-turn and call whatever providers are registered; with Honcho, a background `dreamer` module consolidates durable state between turns | Highest: explicit synthesis and feedback loop |

Systems at the top of the table treat H as an undifferentiated stream, stuffing everything in and hoping the model attends to the right tokens. Systems at the bottom do the filtering in code, making the compiler do real work before the model sees anything.

Notes on each tier, with receipts.

ChatGPT / Claude Code chat. Default strategy: every turn, send the entire conversation back plus a system prompt. When context overflows, older turns drop or get silently summarized. H is append-only, f is identity-plus-truncation. Works only because modern frontier models tolerate enormous irrelevant context without collapsing.
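
An identity-plus-truncation compiler is short enough to write out in full. The sketch below is an assumption about the general shape, not the actual code of any chat product: `lazy_compile` is a hypothetical name, and words stand in for tokens.

```python
from typing import List


def lazy_compile(system_prompt: str, turns: List[str], max_words: int) -> str:
    """Keep the system prompt, then as many of the most recent turns as fit; drop the rest."""
    budget = max_words - len(system_prompt.split())
    kept: List[str] = []
    for turn in reversed(turns):          # walk from newest to oldest
        cost = len(turn.split())
        if cost > budget:
            break                         # this turn and everything older is dropped
        kept.append(turn)
        budget -= cost
    kept.reverse()                        # restore chronological order
    return "\n".join([system_prompt] + kept)


turns = ["user: hi", "assistant: hello there", "user: fix the failing test"]
prompt = lazy_compile("You are helpful.", turns, max_words=10)
```

Note that nothing here looks at the goal: which turns survive depends only on recency and length.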

Claude Code with CLAUDE.md. A small step up. The harness injects markdown files at session start and supports /compact for LLM-summarized older turns. The Agent SDK sessions docs describe persistence as JSONL transcripts under ~/.claude/projects/<encoded-cwd>/. Resume replays the entire transcript; the SDK has no first-class retrieval or ranking — the “smartness” in f is whatever the user wrote into CLAUDE.md.

OpenClaw. Substantial compiler. Workspace is a structured file set: AGENTS.md, SOUL.md, USER.md, MEMORY.md, daily logs, skills. The active-memory plugin gets “one bounded chance to surface relevant memory before the main reply is generated,” injected as a hidden system-context prefix. A nightly dreaming process consolidates daily logs into MEMORY.md. Explicit f work outside the model.

Prime. An explicit, Zod-typed compiler. Each wake cycle the RepoPrime Durable Object runs a deterministic buildContext pass — load CLAUDE.md, query D1 for signal states, fetch recent GitHub issues labeled automated, read attempt history — and synthesizes all of it into a structured generateObject() call with a DecisionSchema. One LLM call per wake. All selection, ranking, and framing happens in f. The schema forces the model’s output to be structured data, not prose.
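
The shape of such a per-wake pass can be illustrated as follows. Prime itself is described as using Zod and `generateObject()` in TypeScript; this Python sketch swaps in a dataclass and hand-written validation, and every name in it (`Decision`, `ALLOWED_ACTIONS`, `build_context`, `parse_decision`) is hypothetical, since the article does not show the real `DecisionSchema`.

```python
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Decision:
    """Hypothetical decision shape; the real DecisionSchema is not shown in the article."""
    action: str
    reason: str


ALLOWED_ACTIONS = {"open_issue", "retry", "wait"}  # illustrative values only


def build_context(identity: str, signals: Dict[str, str], attempts: List[str]) -> str:
    """Deterministic pass: the same inputs always produce the same prompt text."""
    signal_lines = [f"{name}: {state}" for name, state in sorted(signals.items())]
    return "\n".join(
        ["# Identity", identity, "# Signals", *signal_lines, "# Attempts", *attempts]
    )


def parse_decision(raw: str) -> Decision:
    """Reject any model output that is not the expected structured object."""
    data = json.loads(raw)
    if set(data) != {"action", "reason"}:
        raise ValueError(f"unexpected fields: {sorted(data)}")
    if data["action"] not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {data['action']}")
    return Decision(action=data["action"], reason=str(data["reason"]))


context = build_context("repo maintainer", {"ci": "red", "deploy": "green"}, ["retry #1 failed"])
decision = parse_decision('{"action": "open_issue", "reason": "ci red after retry"}')
```

The validation step is what turns the model's reply into data the next wake cycle can rely on, rather than prose that has to be re-interpreted.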

Hermes Agent and similar “memory-native” frameworks. The most elaborate compilers. Honcho decomposes into deriver (extract from writes), dreamer (background consolidation), dialectic (represent peers for retrieval), and webhooks (broadcast state changes). Each turn pulls from the peer-representation layer, not raw history. H is not a log; it is a living structure.

Chart these from “lazy compiler” on the left to “explicit synthesis compiler” on the right — ChatGPT at one end, Honcho-powered Hermes at the other. Interesting engineering happens across every point on that line. None of it has anything to do with the model.

4. Prior Art: DSPy, MIPRO, Self-Refine, Constitutional AI

The prompt-compiler framing is not new. It has a direct ancestor in academic ML, and at least one widely adopted framework uses the word compiler literally rather than metaphorically.

DSPy: the literal prompt compiler

Stanford NLP’s DSPy project began in early 2022 and has become the canonical “programmatic” framework for LLM systems. Its tagline is “Programming—not prompting—LMs.” The compilation framing is literal, not metaphorical:

“DSPy provides tools to compile high-level code with natural language annotations into the low-level computations, prompts, or weight updates that align your LM with your program’s structure and metrics, and if you change your code or your metrics, you can simply re-compile accordingly.”

“You use a DSPy optimizer to compile your code into high-quality instructions, automatic few-shot examples, or updated LM weights for your LM.”

Both quotes are from dspy.ai. The word “compile” is load-bearing. DSPy models your program as a graph of Signature and Module nodes, each declaring input/output behavior, and runs an optimizer that converts the graph into concrete prompts at compile time.

The optimizers are the interesting part. BootstrapFewShot uses a teacher module to generate complete demonstrations for every stage, plus labeled examples from a training set. MIPROv2 jointly optimizes instructions and few-shot examples: bootstrap candidate examples, propose instructions grounded in task dynamics, Bayesian-optimize over the combinatorial space. Other optimizers — COPRO, KNNFewShot, Optuna-backed variants — explore related tradeoffs.
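
The bootstrap idea can be shown with a toy in plain Python. This is deliberately not DSPy's API: `bootstrap_demos`, `render_prompt`, and `toy_teacher` are hypothetical names, and the "teacher" is a stub. It only illustrates the core move of running a teacher over training inputs, keeping the outputs a metric accepts, and emitting them as few-shot demonstrations in the compiled prompt.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (input, output)


def bootstrap_demos(
    trainset: List[Example],
    teacher: Callable[[str], str],
    metric: Callable[[str, str], bool],
    max_demos: int,
) -> List[Example]:
    """Run the teacher on training inputs and keep only outputs the metric accepts."""
    demos: List[Example] = []
    for question, expected in trainset:
        if len(demos) >= max_demos:
            break
        answer = teacher(question)
        if metric(answer, expected):
            demos.append((question, answer))
    return demos


def render_prompt(instruction: str, demos: List[Example], question: str) -> str:
    """The compiled prompt: instruction, selected demonstrations, then the new input."""
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in demos)
    return f"{instruction}\n{shots}\nQ: {question}\nA:"


def toy_teacher(question: str) -> str:
    """Stand-in for a teacher LM (hypothetical): upper-cases, but fails on long inputs."""
    return question.upper() if len(question) < 6 else "?"


trainset = [("abc", "ABC"), ("longer input", "LONGER INPUT"), ("xy", "XY")]
demos = bootstrap_demos(trainset, toy_teacher, lambda got, want: got == want, max_demos=2)
prompt = render_prompt("Upper-case the input.", demos, "hello")
```

The demonstrations in `prompt` were never written by hand; they are a build output of the training set, the teacher, and the metric.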

DSPy treats prompts as compiled artifacts, not human-authored assets. You write the program; the compiler writes the prompts. Change the program or the metric, re-compile. Swap the model, re-compile. Prompts are build outputs, not inputs.

Most of the systems in Section 3 have a DSPy-ish compiler hidden inside them — they just don’t say so, and they don’t use optimizers to search prompt space. Prime’s buildContext + buildPrompt is conceptually a DSPy program with a hand-tuned compiler. OpenClaw’s active-memory plugin could be expressed as a DSPy Signature with an optimized retrieval step. Honcho’s deriver/dreamer decomposition maps onto DSPy’s multi-stage pipeline model. DSPy makes explicit what the rest of the field does implicitly.

Self-Refine: the iterative loop

Self-Refine (Madaan et al., 2023) shows that iterative compilation beats one-shot compilation. The same LLM generates, critiques, refines — repeatedly. No training, no RL, just looping. Across seven tasks, outputs are preferred by both humans and automatic metrics over one-step generation. For the compiler framing: even when f is trivial concatenation, wrapping LLM(f(H, G)) in LLM(f(H, G, critique)) produces a better artifact. The compiler can use the LLM as a sub-step in its own compilation pass.
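
A minimal sketch of the generate, critique, refine loop follows. The three callables would each be a prompt to the same model in a real system; here they are toy stubs, and the empty-string stop signal is an assumption of this sketch rather than something taken from the paper.

```python
from typing import Callable


def self_refine(
    task: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], str],
    refine: Callable[[str, str, str], str],
    max_rounds: int = 3,
) -> str:
    """Generate once, then alternate critique and refine until the critic has no feedback."""
    draft = generate(task)
    for _ in range(max_rounds):
        feedback = critique(task, draft)
        if not feedback:          # empty feedback is this sketch's stop signal
            break
        draft = refine(task, draft, feedback)
    return draft


# Toy stand-ins for the three model calls (hypothetical).
def toy_generate(task: str) -> str:
    return "draft"


def toy_critique(task: str, draft: str) -> str:
    return "" if draft.endswith("!!") else "add emphasis"


def toy_refine(task: str, draft: str, feedback: str) -> str:
    return draft + "!"


result = self_refine("write a headline", toy_generate, toy_critique, toy_refine)
```

The `max_rounds` cap matters in practice: without it, a critic that always finds something to say would loop indefinitely.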

Constitutional AI: compilation with a policy layer

Constitutional AI (Bai et al., 2022) trains a model to be harmless using only a list of principles — a “constitution” — plus self-critique. RLAIF-based, but the artifact is a model whose behavior has been shaped by a compiler-like pass: principles applied as critiques, critiques used to rewrite outputs, rewritten outputs as training data. Policies are declarative at train time; implicit in weights at inference.

The through-line

DSPy says prompts are compiled. Self-Refine says compilation can be iterative and self-correcting. Constitutional AI says compilation can bake in a policy. Together they establish that the transformation (H, G) → P is a first-class engineering problem with a rich design space — not a hack around a chatbox.

5. Three Implications

If the compiler framing is right, three practical consequences follow. Each contradicts a widely held assumption in the current LLM discourse.

5.1. The context-window race is a dead end

The dominant vendor narrative for the last two years has been context window size. GPT-4 moved from 8k to 128k to (effectively) 1M. Claude shipped 200k, then 1M. Gemini shipped 2M. Each jump was framed as capability improvement: “now the model can read an entire codebase.”

From the compiler perspective, this race optimizes the wrong axis. A bigger window does not make f better. It makes f less necessary. If the entire history fits, you do not need to retrieve, rank, or compress. The laziest compiler now works at scales where before it would have failed.

That is a usability gain, not a synthesis gain. Stuffing a 1M-token window with everything the user has ever said typically produces a worse prompt than a 50k-token prompt assembled by an intelligent compiler from the same H and G. More tokens means more distractor mass: models use information less reliably as context grows, especially when the relevant passage sits in the middle of the input (the lost-in-the-middle effect).

The context-window race is a tolerance improvement, not a precision improvement. Tolerance widens the design envelope for f, lets compilers be less aggressive about compression, makes recoveries from bad retrieval less catastrophic. It does not reduce the importance of f. A better compiler on a small-context model will beat a worse compiler on a large-context model at almost any non-trivial task. The tolerance race benefits the lazy compiler most. The precision race — where f lives — is still wide open.

5.2. Models and harnesses improve on orthogonal axes

The most important implication, and the one most often missed.

Model improvements are almost entirely improvements on tolerance: handling longer inputs, noisier inputs, less-structured inputs, more ambiguous goals, more adversarial context. GPT-4 tolerates worse prompts than GPT-3.5. Sonnet tolerates more irrelevant context than Haiku. When vendors ship a new model, most of the delta is how much lazy-compiler behavior it compensates for.

Harness improvements are almost entirely improvements on precision: what goes into P, in what form, at what granularity, for what G. A better retrieval step. A better consolidation pass. A better way of pinning skills. None of this requires a new model. All of it can be layered on top of any sufficiently capable model.

These are orthogonal axes. They do not trade off. A perfectly tolerant model still benefits from a precise compiler — because precise compilation is how you make output useful, not how you make it legal. A perfectly precise compiler still benefits from a tolerant model — because no compiler is infallible and some mess always leaks through.

Vendor progress and harness progress are additive, and they are claimed by different people. The vendor’s roadmap does not block yours. Teams currently waiting for “GPT-5 to fix our agent problems” are waiting for the wrong thing. GPT-5 will make your compiler less punished for being lazy. It will not improve what your compiler produces.

5.3. “Session” is a cache key, not a primitive

The word session is overloaded. It’s used for at least four distinct things:

1. A UI container (the browser tab or terminal window).
2. A unit of state persistence (JSONL transcript on disk).
3. A unit of behavior scoping (what “memory” the agent has access to).
4. A cache key for KV-cache reuse (the identifier that lets the model server skip recomputing the attention keys and values for a shared prefix).

Only (4) is a model-level primitive. The others are UX conveniences. From the compiler’s point of view, a “session” is nothing more than a prefix that recurs across calls — an optimization target for the server, not a semantic unit of the system.
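
To make the "recurring prefix" view concrete, the sketch below derives a key from the stable leading segments of a prompt. It is an illustration only: serving stacks do their prefix matching internally on token sequences, and `prefix_cache_key` is a hypothetical helper, not any vendor's mechanism.

```python
import hashlib
from typing import List


def prefix_cache_key(stable_segments: List[str]) -> str:
    """Hash the segments that recur verbatim across calls (system prompt, pinned files)."""
    digest = hashlib.sha256()
    for segment in stable_segments:
        # Length-prefix each segment so ["ab", "c"] and ["a", "bc"] hash differently.
        encoded = segment.encode("utf-8")
        digest.update(len(encoded).to_bytes(8, "big"))
        digest.update(encoded)
    return digest.hexdigest()


system = "You are a repo assistant."
pinned = "CLAUDE.md contents here"

key_monday = prefix_cache_key([system, pinned])
key_friday = prefix_cache_key([system, pinned])        # different "session", same prefix
key_edited = prefix_cache_key([system, pinned + "!"])  # prefix changed, so the key changes
```

Two calls days apart share a key as long as the prefix is byte-identical, which is all the "session" amounts to at this level.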

Treating session-as-primitive leads to poor architectural choices. If the “end of a session” causes an agent to lose state, that is a property of the compiler, not of the agent. The compiler could just as easily load the prior session’s state from H and continue; nothing in the LLM runtime cares.

The Prime architecture makes this explicit. There is no session in Prime. There is only H (DO SQLite + GitHub Issues + CLAUDE.md + D1) and G (an alarm, a webhook, a human message). f reads H and G. There is no “end” to a session because there was never a session — just a sequence of wake cycles that compile prompts from continuously-updating state. Removing the session concept removes a category of bugs (state loss at “session boundaries”) and opens up architectural options (cross-wake planning, continuous agency, horizon-agnostic operation).

If you find yourself reasoning about “what happens when the session ends,” you have smuggled a cache key into your semantics. Refactor it out.

6. What This Means for Builders: Invest in f

If you are building an LLM system in 2026, the above reframes where to spend engineering time.

Stop treating prompts as artisanal. Hand-authoring giant prompts and shipping them as static strings is equivalent to hand-authoring assembly when a compiler exists. Model f as an explicit function from (H, G) to P. Even a trivial compiler — some retrieval, some templating, some token budgeting — will beat a hand-tuned 12k-token string on maintenance cost, model portability, and output quality. The moment you have more than one prompt, you have a compiler. Make it explicit.

Treat H as the substrate, not the log. The rich view of H is “everything the system can draw on to answer this goal”: databases, file trees, issue threads, embeddings, entity graphs, attempt histories, cached LLM outputs, skills, consolidated memories, user profiles, project conventions. Most of what you need is already in your system — it just isn’t in your prompt. The compiler’s job is to get it there.

Measure the compiler, not the model. When output quality degrades, the first question should be “did the compiler retrieve the right things?” — not “did the model change?” Log P. Diff it across runs. Bugs in output almost always come from f: wrong context, missing context, stale context, redundant context. If you’re not inspecting P, you’re debugging the wrong component.
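
Diffing P across runs needs nothing beyond the standard library. The sketch below uses Python's `difflib.unified_diff`; the prompts and the `diff_prompts` helper are made up for illustration.

```python
import difflib
from typing import List


def diff_prompts(previous: str, current: str) -> List[str]:
    """Return a unified diff between two compiled prompts, one entry per diff line."""
    return list(
        difflib.unified_diff(
            previous.splitlines(),
            current.splitlines(),
            fromfile="run_a",
            tofile="run_b",
            lineterm="",
        )
    )


run_a = "System: be concise\nMemory: user prefers Python\nGoal: fix the test"
run_b = "System: be concise\nGoal: fix the test"

changes = diff_prompts(run_a, run_b)
# Lines the second run lost, excluding the "--- run_a" file header.
removed = [line for line in changes if line.startswith("-") and not line.startswith("---")]
```

Here the diff shows that the memory line dropped out of the second run, which points at retrieval rather than at the model.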

Borrow the DSPy framing even if you don’t adopt DSPy. Most teams won’t build a full optimizer stack. They should still adopt the framing: prompts as compiled output, programs as graphs of typed modules, metrics as the thing you optimize against. Teams that eventually search over compiler configurations will compound faster than teams that don’t — for the same reason teams with CI/CD compound faster than teams deploying by hand.

Build the consolidation pass. The single highest-leverage f feature is scheduled consolidation — OpenClaw’s dreaming, Honcho’s dreamer module, Graphiti’s build_communities, A-MEM’s consolidate_memories. Raw H grows monotonically and becomes low-signal over time. Consolidation rewrites H into a higher-density form — summaries, entity graphs, skill documents, cluster representatives. Run it on a schedule, not on every turn. This is the step that separates systems that improve with use from systems that degrade with use.
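
A consolidation pass can be sketched as "group raw entries, then replace each large group with a summary". The version below is a simplified assumption, not how OpenClaw, Honcho, Graphiti, or A-MEM implement it: entries carry a precomputed `topic`, and `toy_summarize` stands in for an LLM call.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Entry:
    topic: str
    text: str


def consolidate(
    raw: List[Entry],
    summarize: Callable[[List[str]], str],
    min_group: int = 2,
) -> List[Entry]:
    """Replace each topic that has at least `min_group` raw entries with one summary entry."""
    by_topic: Dict[str, List[str]] = defaultdict(list)
    for entry in raw:
        by_topic[entry.topic].append(entry.text)

    result: List[Entry] = []
    for topic, texts in by_topic.items():
        if len(texts) >= min_group:
            result.append(Entry(topic, summarize(texts)))
        else:
            result.extend(Entry(topic, t) for t in texts)  # too few to merge; keep as-is
    return result


def toy_summarize(texts: List[str]) -> str:
    """Stand-in for an LLM summarization call (hypothetical)."""
    return f"{len(texts)} notes merged: " + "; ".join(texts)


log = [
    Entry("deploy", "staging failed"),
    Entry("deploy", "rollback fixed it"),
    Entry("style", "prefers tabs"),
]
compact = consolidate(log, toy_summarize)
```

Because the pass runs on a schedule rather than per turn, it can afford a slower, more careful summarizer than anything on the request path.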

Treat model swaps as cheap. A well-factored compiler makes model swaps a config change. A giant hand-authored prompt tuned to one model’s quirks makes them a rewrite. Route each (H, G) pair to whichever model is cheapest for that step (Workers AI for classification, Haiku for mechanical synthesis, Sonnet for research, Opus for orchestration) without the compiler caring. This is how API Mom and similar routing layers work, and it should be the baseline.
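
When routing is a table lookup, a model swap really is a config change. In the sketch below the step kinds and tiers follow the example in the paragraph above, while the identifier strings, the default, and the `route` and `dispatch` helpers are illustrative assumptions, not the API of any routing product.

```python
from typing import Dict

# Step kinds and tiers follow the article's example; the identifier strings are placeholders.
ROUTES: Dict[str, str] = {
    "classification": "workers-ai",
    "mechanical_synthesis": "haiku",
    "research": "sonnet",
    "orchestration": "opus",
}
DEFAULT_MODEL = "sonnet"  # assumption: fall back to a mid-tier model for unknown steps


def route(step_kind: str) -> str:
    """Pick a model tier for a step; the compiler that built P never needs to know."""
    return ROUTES.get(step_kind, DEFAULT_MODEL)


def dispatch(step_kind: str, prompt: str) -> Dict[str, str]:
    """Build the request a caller would send; swapping models means editing ROUTES only."""
    return {"model": route(step_kind), "prompt": prompt}


request = dispatch("classification", "Is this issue a bug or a feature request?")
```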

Don’t wait for the next model. The next frontier model will be more tolerant. It will not close the gap between a good compiler and a bad one. Your system improves when your compiler improves.

7. Related Articles and Sources

Companion articles (same series)

Context Is a Harness Artifact — Foundation for this article. Argues that “context” is not a property of the LLM; it is produced by the system surrounding the LLM.
Thinking Is Substrate Self-Modification — Deeper move: the compiler’s most valuable output is not the prompt, it is the mutations it applies to H along the way.
Memory as Lazy Queries Over the World — One-primitive frame. Treats “memory” not as a store but as a pull-model query re-evaluated each turn.

Prime: Persistent Org-Level AI Agents on Cloudflare — A concrete reference compiler. Per-wake buildContext + generateObject() with Zod-typed decision output.
Never Fail Twice: The Escalation Ladder That Learns — How successful H-enrichments become durable skills.
The Autonomous Entity Pattern — The meta-framework that the Prime article is an instance of.

Research surveys (internal)

Autonomous Agent Frameworks: Context Continuity Survey — 10-framework review of how OpenClaw, Claude Agent SDK, Hermes, LangGraph, Cloudflare Agents SDK, and others implement f. Verified source citations for each system’s compaction, persistence, and consolidation mechanisms.
Active Memory SOTA Survey — 11 production/research LLM-memory systems (Letta, Mem0, Graphiti, Cognee, A-MEM, MemoryOS, Honcho, LangMem, HippoRAG, EM-LLM, Anthropic Memory Tool). Cross-cutting pattern table.

External sources

DSPy (Stanford NLP). Home page: https://dspy.ai/ — source of the “compile high-level code with natural language annotations…” phrasing in Section 4. Repo: https://github.com/stanfordnlp/dspy. MIPROv2 and BootstrapFewShot optimizer docs.
Self-Refine. Madaan, A., et al. (2023). Self-Refine: Iterative Refinement with Self-Feedback. arXiv:2303.17651.
Constitutional AI. Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073.
Lost in the Middle. Liu, N. F., et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172. Models use information less reliably when it sits in the middle of a long context, even within the advertised window.
Frameworks referenced in Section 3. OpenClaw docs: agent-workspace, active-memory, dreaming, compaction. Claude Agent SDK sessions. Graphiti. Honcho. Cloudflare Agents SDK.

The harness is the compiler. The compiler is the work. Everything else is vendor-supplied.