Gary Wu

Memory as Lazy Queries Over the World

Thesis: In a stateful LLM system, memory is not a mirror of the world — it is a graph of lazy queries that resolve into the world on demand. Storage and computation are the same primitive: the content-addressed pipeline. Pointers are executable intents, not static references. A memory hierarchy (hot → warm → cold → internet) is just varying resolver cost. Caller-visible performance is honest engineering. This gives a single-user system effectively infinite memory at bounded storage cost.


1. The Staleness Problem

The first thing anyone notices about a memory system is that it lies.

You save a URL. A week later the page has moved, the server has 404’d, the CDN has rotated a query parameter, or the company was acquired and the whole domain redirects to a marketing page about “synergies.” You saved a pointer to a thing in the world, and the world moved. This is not a bug in the pointer format. It is a category error in what the pointer is for.

Classical memory models treat memory as a mirror: the system stores a faithful copy of some portion of reality, and retrieval is lookup. This is how filesystems work, how textbook caching layers work, and how most current LLM-agent memory frameworks work — Mem0, for instance, is an ADD-only store where “memories accumulate; nothing is overwritten” ([Mem0 docs][mem0]). The invariant is: what we wrote is what we read.

The mirror model has three structural problems. Staleness: a mirror is valid only for the instant it was taken. Capacity: mirrors scale with the thing being mirrored, so mirroring the web is not a personal-infrastructure project. Opacity: a mirrored blob has no provenance — you cannot re-derive it, verify it, or update it cheaply.

The fix is to stop mirroring. Don’t save the page; save the question that the page answered. When you want the answer, ask the question again. If nothing has changed, the cache returns instantly. If things have changed, you get the current answer. The storage cost is the question, not the answer. This is the entire thesis in a sentence: memory is a lazy query, not a stored result.


2. Pointers as Executable Intents

If memory is queries, then every “pointer” in the system is actually a small program. A URL is a degenerate case — a program whose only instruction is “HTTP GET this address.” A filesystem path is another degenerate case — “open and read this inode.” Both are executable. We just stopped thinking of them as executable because the execution was so cheap.

Treat every pointer as a deep-search instruction:

The common structure is: name + arguments + resolver. The resolver is a piece of code. Given the name and the arguments, it produces bytes. Two pointers with the same (name, arguments) tuple resolve to the same logical value, even if the resolver’s internal machinery differs.
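
A minimal sketch of the name + arguments + resolver structure, in TypeScript. The names `Pointer`, `register`, `deref`, the `page.read` resolver, and the in-memory `world` table are all hypothetical, and resolvers return strings rather than bytes to keep the example short.

```typescript
// Hypothetical types: a pointer is a name plus JSON-like arguments.
type Args = Record<string, string | number | boolean>;
interface Pointer { name: string; args: Args; }
type Resolver = (args: Args) => string;

// Registry mapping a pointer name to the code that resolves it.
const resolvers = new Map<string, Resolver>();

function register(name: string, resolver: Resolver): void {
  resolvers.set(name, resolver);
}

// Dereferencing a pointer runs its resolver; nothing is snapshotted here.
function deref(p: Pointer): string {
  const resolver = resolvers.get(p.name);
  if (!resolver) throw new Error(`no resolver registered for ${p.name}`);
  return resolver(p.args);
}

// A stand-in for "the world": a mutable table the resolver reads at call time.
const world = new Map<string, string>([["/docs/intro", "v1 of the intro"]]);
register("page.read", (args) => world.get(String(args.path)) ?? "not found");

// What gets saved is the question, not the answer.
const saved: Pointer = { name: "page.read", args: { path: "/docs/intro" } };
```

Calling `deref(saved)` before and after `world` changes returns two different strings from the same saved pointer, which is the behavior the mirror model cannot offer.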

This is the exact shape of a tool call in the Model Context Protocol ([MCP spec][mcp]). MCP tools are named capabilities invoked with a JSON argument blob; the server decides how to resolve them. The MCP design document frames tools as “functions the model can call,” but from the memory system’s perspective they are pointers whose dereference happens to run code. A pointer is a tool invocation; a tool invocation is a pointer. They are the same object.

Once pointers are programs, the staleness problem dissolves. A pointer is never stale because it does not pretend to be a snapshot. It is a recipe for producing a current snapshot. You grade pointers on two axes now:

Both are operational questions, not ontological ones. We can measure them. We can budget against them. We cannot sweep them under the rug the way the mirror model does.


3. One Primitive: The Content-Addressed Pipeline

Now collapse the last distinction.

In [The Media Store Pattern][media-store] I argued that content-addressable storage — bytes in, SHA-256 key out — is the right primitive for a pipeline because it makes deduplication free and inter-stage coupling cheap. Every capability in a Scram Jet pipeline talks to the same store. Keys flow; bytes stay put.

Extend the idea one step. If a pipeline stage is a pure function of its inputs — same inputs, same outputs — then its output is also content-addressable. And the identity of the output is SHA-256(stage-name + canonical-inputs). This is not a new observation. It is the foundational invariant of:

All three systems converged on the same primitive for the same reason: the hash of a recipe is a perfect handle. It is an identity, a cache key, a deduplication key, and a proof of derivation all at once.

So let us stop drawing a line between “storage” and “computation.” The primitive is:

Pipeline := { name, inputs, resolver }
Key(Pipeline) := SHA-256(name, canonical(inputs))

A pipeline produces bytes. Its key is computed from its definition. The store maps key → bytes. That is the whole model.
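
One way the key computation could look in TypeScript on Node.js, as a sketch rather than the author's implementation. The canonicalization chosen here (recursively sorted object keys, JSON encoding) and the names `canonical` and `pipelineKey` are assumptions; any stable serialization would serve.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [k: string]: Json };

// Canonical form: object keys are sorted recursively so that logically equal
// inputs serialize to the same string. (An assumed canonicalization.)
function canonical(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return "[" + value.map(canonical).join(",") + "]";
  const keys = Object.keys(value).sort();
  return "{" + keys.map((k) => JSON.stringify(k) + ":" + canonical(value[k] ?? null)).join(",") + "}";
}

// Key(Pipeline) := SHA-256(name, canonical(inputs)), hex-encoded.
// The name is JSON-quoted and newline-separated so it cannot bleed into the inputs.
function pipelineKey(name: string, inputs: Json): string {
  return createHash("sha256")
    .update(JSON.stringify(name))
    .update("\n")
    .update(canonical(inputs))
    .digest("hex");
}
```

Two calls whose inputs differ only in property order produce the same key, so the key can double as a cache and deduplication handle.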

Storage is degenerate computation

A plain blob of bytes is just a pipeline with zero cost, zero latency, and a constant body. The resolver is the identity function on a stored byte array. If you dereference the key, you get the bytes, and the “computation” is the read from R2.

This is not a metaphor. A file and a pipeline stage are the same object at different cost settings. When Nix serves a cached output, it is behaving as pure storage. When Nix has to re-execute the derivation, it is behaving as pure computation. The interface is identical in both cases: you hand it a derivation, it hands you back the store path. The user does not need to know or care whether the path existed before the call.

This is the first unification: storage and computation are the same primitive at different points on a cost curve.
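
The unification can be sketched as a single store whose `get` does not reveal whether the bytes were already there. The `Store` class, its method names, and the `runs` counter are hypothetical, and strings stand in for bytes.

```typescript
type Recipe = () => string;

class Store {
  private outputs = new Map<string, string>();
  private recipes = new Map<string, Recipe>();
  runs = 0; // counts resolver executions, for illustration only

  // A plain blob: materialized at write time, so a read never runs anything.
  putBlob(key: string, bytes: string): void {
    this.outputs.set(key, bytes);
  }

  // A pipeline: only the recipe is stored until someone asks for the output.
  putPipeline(key: string, recipe: Recipe): void {
    this.recipes.set(key, recipe);
  }

  // Same interface for both: hand over a key, get bytes back.
  get(key: string): string {
    const hit = this.outputs.get(key);
    if (hit !== undefined) return hit; // behaving as pure storage
    const recipe = this.recipes.get(key);
    if (!recipe) throw new Error(`unknown key ${key}`);
    this.runs += 1; // behaving as pure computation
    const out = recipe();
    this.outputs.set(key, out);
    return out;
  }
}
```

A blob is the case where the output map is filled before the first read; a pipeline is the case where it is filled by the first read.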


4. The Memory Hierarchy as Resolver-Cost Spectrum

Once every pointer is a pipeline and every pipeline has a resolver with a cost, the classical memory hierarchy (CPU registers → L1 → L2 → L3 → DRAM → disk → network) reappears — but with a different carving of tiers, shaped for agent cognition instead of for CPU microarchitecture.

| Tier | Name | Typical latency | Typical cost | Example resolver |
|---|---|---|---|---|
| L1 | Hot graph | 1 – 10 ms | ~0¢ | In-process knowledge graph (Mnemosi); key-value lookup; vector-nearest-neighbor over a loaded index |
| L2 | Warm cache | 50 – 500 ms | <0.01¢ | KV + R2 in the same region; previously-computed pipeline output |
| L3 | Cold archive | 1 – 10 s | 0.01 – 1¢ | Cross-region R2; large object reads; re-running a pipeline stage from inputs that are themselves in L2 |
| L4 | Internet / deep search | 5 s – 5 min | 1¢ – $1+ | HTTP fetch; search engine query; LLM-reranked deep research; tool invocation that itself calls another tool chain |

Each tier is defined by resolver cost, not storage medium. A piece of data can live in multiple tiers simultaneously: if the hot graph has an embedding for a document, R2 has the bytes, and the web has the live page, all three are resolvers for “what does this document say right now?” They differ in freshness, latency, and cost. The caller picks.

Letta’s MemGPT paper explicitly uses the virtual-memory / OS-paging metaphor for exactly this reason ([Packer et al., MemGPT, arXiv:2310.08560][memgpt]): main context is “RAM,” external memory is “disk,” the agent swaps pages. The framing here generalizes: once the “disk” can itself be a computation — the page is a pipeline output — you get a memory system spanning microseconds to the open web, with one API. The eight-orders-of-magnitude latency range is not a bug; it is why the model is useful.


5. Caller API: Latency, Freshness, Budget

The only honest way to expose a resolver-cost hierarchy is to let the caller negotiate the tradeoff explicitly.

A CPU’s memory hierarchy hides its tiers from the programmer because the latency ratios (L1 : DRAM ≈ 1 : 100) are small enough that speculative hardware can paper over them. Our ratios are 1 : 10,000,000 or worse. No amount of speculation hides that. The API has to surface it.

Here is the primitive:

interface ResolveOptions {
  max_latency_ms?: number;    // deadline for this call
  max_stale_s?: number;       // accept a cached answer this old; -1 = infinite
  force?: boolean;            // bypass all caches; re-run the pipeline
  budget_cents?: number;      // spend cap for this call (sub-resolvers must respect)
  on_downgrade?: 'error' | 'stale' | 'partial';
}

interface ResolveResult<T> {
  value: T;
  tier: 'L1' | 'L2' | 'L3' | 'L4';
  age_s: number;              // how old is this answer?
  cost_cents: number;         // what did it cost to produce?
  provenance: string[];       // which resolver(s) were invoked
}

// Signature only: `declare` keeps this valid TypeScript without a body.
declare function get<T>(
  pipeline_hash: string,
  opts: ResolveOptions = {},
): Promise<ResolveResult<T>>;

A few things fall out of this signature.

Latency-bounded resolution. The caller says “give me the best answer in 250 ms.” The resolver walks the tier list, picks the cheapest hit that fits the deadline, and returns. If nothing fits and on_downgrade = 'error', it throws; 'stale' returns the freshest cached value and flags age_s; 'partial' returns a best-effort subset (useful for aggregate queries).
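
A sketch of the tier walk described above, reduced to the selection step. The `Candidate` shape, the `pick` function, and the defaults (an absent `max_stale_s` means any age is acceptable; an absent `on_downgrade` means `'error'`) are assumptions, and `'partial'` is left out and treated like `'error'`.

```typescript
type Tier = "L1" | "L2" | "L3" | "L4";

interface Candidate {
  tier: Tier;
  est_latency_ms: number; // estimated time to answer from this tier
  est_cost_cents: number; // estimated spend to answer from this tier
  age_s: number;          // age of the answer this tier would return (0 = live)
}

interface PickOptions {
  max_latency_ms?: number;
  max_stale_s?: number; // -1 = infinite
  on_downgrade?: "error" | "stale" | "partial";
}

function pick(cands: Candidate[], opts: PickOptions): Candidate {
  const deadline = opts.max_latency_ms ?? Infinity;
  const maxStale =
    opts.max_stale_s === undefined || opts.max_stale_s < 0 ? Infinity : opts.max_stale_s;

  // Tiers that can answer before the deadline, then those that are also fresh enough.
  const inTime = cands.filter((c) => c.est_latency_ms <= deadline);
  const fresh = inTime.filter((c) => c.age_s <= maxStale);

  // Cheapest hit that satisfies both bounds.
  if (fresh.length > 0) {
    return fresh.reduce((a, b) => (b.est_cost_cents < a.est_cost_cents ? b : a));
  }
  // Downgrade: the freshest answer that still meets the deadline.
  if (opts.on_downgrade === "stale" && inTime.length > 0) {
    return inTime.reduce((a, b) => (b.age_s < a.age_s ? b : a));
  }
  throw new Error("no tier satisfies the deadline and freshness bounds");
}
```

The caller would still read `age_s` on the chosen candidate to learn that a downgrade happened, which is what the `ResolveResult` fields are for.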

Freshness as a first-class parameter. max_stale_s fixes the staleness problem at the system level. “The current Bitcoin price” wants max_stale_s: 60. “My grandmother’s birthday” wants -1 — any cached answer is fine forever. Most pointers in a personal memory system are closer to the grandmother case than the Bitcoin case, which is why caches work so well in practice.

Budget propagation. Deep-search resolvers call other resolvers. A naive design lets a single query spiral into a $20 LLM call. budget_cents is passed down the call graph; each sub-resolver checks its slice before committing — the same pattern used by every mature cost-aware orchestrator ([Cost-Aware Capability Orchestration][cost-aware]).
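
Budget propagation can be sketched with a shared budget object that every sub-resolver debits before doing work. The `Budget` class, the `search`, `summarize`, and `deepResearch` functions, and the prices (1¢ per search, 5¢ per summary) are hypothetical stand-ins, not real tools or real rates.

```typescript
class Budget {
  private remaining: number;
  constructor(cents: number) {
    this.remaining = cents;
  }
  // Reserve before committing; refuse rather than overspend.
  spend(cents: number): void {
    if (cents > this.remaining) {
      throw new Error(`budget exceeded: need ${cents}¢, have ${this.remaining}¢`);
    }
    this.remaining -= cents;
  }
  left(): number {
    return this.remaining;
  }
}

const SEARCH_CENTS = 1;    // assumed flat price of one search call
const SUMMARIZE_CENTS = 5; // assumed flat price of one LLM summary

function search(query: string, budget: Budget): string[] {
  budget.spend(SEARCH_CENTS);
  return [`${query} result A`, `${query} result B`];
}

function summarize(doc: string, budget: Budget): string {
  budget.spend(SUMMARIZE_CENTS);
  return doc.toUpperCase(); // placeholder for a real summary
}

// The parent passes the same budget down; each child checks its slice first.
function deepResearch(query: string, budget: Budget): string[] {
  const summaries: string[] = [];
  for (const hit of search(query, budget)) {
    if (budget.left() < SUMMARIZE_CENTS) break; // stop early instead of failing
    summaries.push(summarize(hit, budget));
  }
  return summaries;
}
```

With a budget of 8¢ this returns one summary and leaves 2¢ unspent; with a budget of 0¢ the first `search` refuses to run.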

Provenance as a return value. Every answer came from somewhere; the system can always tell the caller where. Free if you built on content-addressed pipelines (every output has a key, every key has a derivation); impossible to retrofit onto a mirror.

Three knobs: how long you’ll wait, how fresh you need it, how much you’ll pay. Every other memory-system design decision follows from these.


6. Pure vs Effectful Pipelines

A load-bearing distinction: not all pipelines are safe to re-run.

A pure pipeline is a function of its inputs. Re-running it produces the same bytes; a cache hit is semantically identical to a cache miss. Transcribing a specific audio file with a pinned model version, hashing a blob, rendering Markdown with a pinned renderer, and temperature-0 LLM summarization of a fixed document are all (approximately) pure.

An effectful pipeline has external interactions. Re-running it can produce different results (live web fetch), move money (a purchase), send messages (email, Telegram, webhook), or mutate remote state (post to social media). Both kinds are legitimate pipeline stages, but a memory system has to treat them differently. Pure stages can be aggressively cached, replayed, and parallelized. Effectful stages need idempotency keys, human-in-the-loop gates, or explicit consent to re-run.

The cleanest discipline I know is Nix’s: pure by default, impure declared explicitly (__impure = true on a derivation, behind the experimental impure-derivations feature). Every pipeline in a content-addressed system should declare its effect type. The resolver respects it: pure gets free caching; effectful does not cache past a TTL, is not speculatively executed, is not retried without consent.
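
A sketch of a resolver that respects a declared effect type. The `Stage` shape, the `EffectAwareCache` class, and the `consent` flag are hypothetical; `key` stands in for the pipeline hash, and time is passed in explicitly so the behavior is easy to follow.

```typescript
type Effect = "pure" | "effectful";

interface Stage {
  key: string;     // stands in for the pipeline hash
  effect: Effect;  // declared by the stage author
  ttl_s?: number;  // only meaningful for effectful stages
  run: () => string;
}

class EffectAwareCache {
  private entries = new Map<string, { value: string; stored_at_s: number }>();

  resolve(stage: Stage, now_s: number, consent = false): string {
    const hit = this.entries.get(stage.key);
    if (hit !== undefined) {
      const withinTtl = now_s - hit.stored_at_s <= (stage.ttl_s ?? 0);
      // Pure: a cache hit is as good as a re-run, forever.
      // Effectful: a cache hit is only valid inside its TTL.
      if (stage.effect === "pure" || withinTtl) return hit.value;
      // Past the TTL, an effectful stage is not re-run without consent.
      if (!consent) throw new Error(`re-running effectful stage ${stage.key} needs consent`);
    }
    const value = stage.run();
    this.entries.set(stage.key, { value, stored_at_s: now_s });
    return value;
  }
}
```

This sketch lets the first run of an effectful stage proceed without consent; a stricter gate would require consent there too.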

A richer treatment (dreaming, counterfactual simulation, the effect gate) lives in [Dreaming and the Effect Gate][dreaming]. The essential point here: a memory hierarchy that doesn’t distinguish pure from effectful is unsafe at the top tiers, because “cache the answer” is correct for pure reads and catastrophic for effectful ones.


7. MCP Tools Are Memory Primitives

Put the pieces together and a clarifying thing happens: the distinction between “memory” and “tool use” dissolves.

Consider the kinds of things an agent “remembers”:

  1. Facts about a user, stored in a knowledge graph.
  2. Files the user has uploaded.
  3. Past conversation turns.
  4. The contents of a web page the agent saw last week.
  5. The current state of the user’s calendar.
  6. The user’s GitHub repos.
  7. Current stock prices.
  8. The weather.

A mirror-model memory system draws a hard line between 1–4 (“memory”) and 5–8 (“tool use / live data”). The line is almost always wrong. Calendar contents are memory — the agent remembers them by asking the calendar. GitHub repos are memory — the agent remembers them by asking GitHub. The only difference between a knowledge-graph fact and a Google Calendar event is the cost and freshness of the resolver.

In an MCP-native agent, every piece of external state is addressable via a tool invocation; every tool invocation is a pointer; every pointer is a pipeline; every pipeline participates in the memory hierarchy. The knowledge graph is the hot tier because it is local. The calendar tool is the warm tier because it is one RPC away. Deep-research tools are the cold tier because they take seconds and cost cents.

MCP was the right abstraction for agents even though it was marketed as a tool-use protocol. Tool use is remembering. Memory is tool use. The unified primitive is the named, typed, cacheable capability invocation ([MCP spec][mcp]). The practical consequence: you do not design “how the agent remembers” and “how the agent calls tools” separately. You design one thing — the capability invocation — and layer a cost-aware cache on top. Letta’s core_memory tools, Anthropic’s Memory Tool (memory_20250818, a filesystem-style tool embedded in the Messages API — [docs][anthropic-memory]), and Mem0’s memory.add / memory.search are the same underlying object: a named capability the model calls. The only interesting engineering question is where the cache lives and what its eviction policy is.
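
A sketch of "one capability invocation plus a cost-aware cache". This is not the MCP SDK: `CallTool` is a hypothetical stand-in for whatever client function issues the tool call, and `withCache`, `invocationKey`, and the injected clock are assumptions made for illustration.

```typescript
type ToolArgs = Record<string, string | number | boolean>;
type CallTool = (name: string, args: ToolArgs) => string;

// Stable key for (name, arguments): argument order must not matter.
function invocationKey(name: string, args: ToolArgs): string {
  const sorted = Object.keys(args).sort().map((k) => [k, args[k]]);
  return JSON.stringify([name, sorted]);
}

// Wrap any tool caller so that repeated invocations inside the freshness
// window are served from memory instead of reaching the tool again.
function withCache(call: CallTool, max_stale_s: number, now_s: () => number): CallTool {
  const cache = new Map<string, { value: string; at_s: number }>();
  return (name, args) => {
    const key = invocationKey(name, args);
    const hit = cache.get(key);
    if (hit !== undefined && now_s() - hit.at_s <= max_stale_s) return hit.value;
    const value = call(name, args); // cache miss: the tool call is the "memory read"
    cache.set(key, { value, at_s: now_s() });
    return value;
  };
}
```

From the caller's side, "remember my calendar" and "call the calendar tool" are the same function call; only the cache decides which one actually happens.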


8. Prior Art: Nix, Bazel, Unison, Haskell, Lisp

None of this is new. The pieces have sat around for decades. The agent-memory context is the only novelty.

Nix (2003) first shipped the hash-addressed build artifact. A derivation is a pure function; its output path is determined by a hash of its inputs (strictly, this is input addressing; addressing an output by its own content is a later, experimental mode); the store caches outputs under those hashes; substituters fetch remote outputs when a local build would be redundant ([Nix manual][nix]). Gentoo’s binpkgs and Guix’s store are in the same lineage.

Bazel (open-sourced in 2015) and Buck2 (open-sourced in 2023, succeeding Buck) applied the same idea to polyglot monorepo builds at Google/Meta scale. Action digests are computed from the command, environment, and input tree; remote caches serve outputs by digest; a fresh check-out builds instantly if the cache is warm ([Bazel remote caching][bazel]; [Buck2 docs][buck2]). “Don’t rebuild what someone else already built” generalizes to “don’t re-derive what someone else already derived.”

Unison (2018 – ongoing) takes the boldest step: functions themselves are identified by the hash of their abstract syntax tree. A function is not a name; it is a content-addressed object. Refactoring, renaming, code distribution, and caching all collapse into the same primitive ([Unison: Why we built it][unison]) — the closest any production system has come to “code is data, and data is content-addressed” as a unified principle.

Haskell contributes lazy evaluation. A value is a thunk; reading it may trigger arbitrary computation; if already forced, reading is free. Exactly the semantics we want for memory resolution: I don’t care whether this pointer has been materialized; dereferencing produces the answer, and the system arranges the cost ([Haskell wiki][haskell-lazy]).

Lisp (1958 – ongoing) contributes the oldest and most radical idea: code is data. An S-expression is a list; a list is an S-expression. McCarthy’s original paper built the whole language around this equivalence. A memory system on content-addressed pipelines is Lisp’s “code is data” applied to the agent stack — a pointer is data, the same pointer is a program, and the resolver is just eval.

The agent-memory version follows mechanically: outputs are content-addressed (Nix); re-retrieval is cached by query digest (Bazel); queries themselves are content-addressed and portable (Unison); memory reads are lazy pipeline dereferences (Haskell); pointers are programs are pointers (Lisp). When five independent traditions converge on the same primitive, the primitive is probably real. Agent memory is the next place to apply it.


9. The Matrix Analogy, Applied Correctly

“I know Kung Fu” is the canonical pop-culture reference for instant knowledge acquisition. Applied to data, it’s incoherent — a brain has no slots shaped like helicopter-flying-skill. Applied to capabilities, it’s exact.

Kung Fu is not a static representation. It’s a pipeline: given a situation (an attack, a challenge), it produces an output (a response move). Neo asking for Kung Fu is asking the system to make a pipeline available, not to copy bytes.

neo.capabilities.add({
  name: "fly-helicopter",
  // Wrapped in a function so the call runs on first use, not at registration.
  resolver: () => mcp.call("pipelines.fly_helicopter", { model: "bell-206" }),
  freshness: { max_stale_s: 86400 },
});

The first time Neo needs to fly a helicopter, the resolver runs — possibly calling a flight-dynamics simulator, a physics tool, and an LLM for decision-making. The cost is paid when the capability is used, not when it is “loaded.” The “download” is capability registration; the resolver does the real work on demand. Because resolvers are content-addressed, the system can pre-warm common sub-queries (“how do helicopters yaw?”) without ever producing a complete “knowledge of helicopters” blob — because no such blob exists or needs to.

Asking “what does Neo know?” is a category error. The right question is “what capabilities can Neo invoke, at what cost, with what freshness guarantee?”


10. Effectively Infinite Memory at Bounded Storage Cost

A single-user agent produces some bounded amount of novel state per day: conversations, notes, generated artifacts. Call it N bytes. Over a decade, the novel state is ~3,650 × N — gigabytes, not petabytes. This is the irreducible core the system must actually store.

Everything else the agent appears to “know” is a pipeline: “what’s on my calendar?” invokes the calendar tool; “what did I commit last Tuesday?” invokes GitHub; “what does this article say?” invokes web fetch plus cache. None of these consume the user’s storage in the mirror sense. They consume resolver cost at call time. The memory is effectively infinite because the resolvers reach the entire internet, all installed tools, all reachable services. The storage cost is bounded because only the novel core, plus a utility-weighted LRU of pipeline outputs, ever touches durable storage.

The caches do real work. A-MEM’s consolidate_memories() pass, Graphiti’s community detection, Honcho’s dreamer module ([discussed in the Active Memory SOTA survey][sota-survey]) all reorganize the cache so that frequently-resolved pipelines become progressively cheaper — the same optimization a CPU’s branch predictor performs, scaled up seven orders of magnitude in latency and applied to knowledge instead of instructions.

The mirror model forces a choice between “remember everything” (expensive, stale) and “remember little” (cheap, forgetful). The lazy-query model refuses the choice: remember the questions, resolve them when asked. The world is your storage backend. The cache is your speed boost. The budget parameter is your governor. This is how a single-user system ends up with a memory that feels unbounded while costing a few dollars a month.



12. Sources

McCarthy, J. (1960). Recursive Functions of Symbolic Expressions and Their Computation by Machine, Part I. Communications of the ACM 3(4). The original Lisp paper; source of “code is data.”


Written 2026-04-19. Part of the README articles series on content-addressed computation, agent architecture, and capability-oriented systems. No code was shipped for this piece; it is the conceptual scaffolding that several concrete pieces — the Mnemosi active-substrate RFC, the dreaming pipeline, and the MCP-memory unification — will hang on.

