Gary Wu

Goal Generation Is Agency


Agency is the capacity to generate goals from ambient state, not the capacity to execute given goals. Nearly every system in the “AI agent” category today is a very capable executor; almost none does open-ended goal generation from rich ambient state. That missing capability is the distinguishing primitive. Once it exists, executors become a commodity layer you swap as the market evolves. An active-memory substrate is a plausible-but-not-yet-proven path to that primitive; a spike against our own wiki substrate produced roughly 65% validation: real signal where the substrate was mature, silence where it wasn’t.


1. “Agent” Is Doing Too Much Work

The word “agent” has been overloaded to the point where it carries almost no architectural signal. A queue-polling background worker is called an agent. A chat-completion call in a for-loop is called an agent. A Durable Object with an alarm, a Claude Code session, a cron-invoked shell script — all called agents.

Cluster these systems by what they actually do and a cleaner taxonomy falls out. Two roles, not one: the executor and the goal generator.

These are different computations with different inputs, success conditions, and failure modes.

An executor takes goal → plan → actions → outcome. Judged on whether outcome matches goal. The goal is exogenous: a human, a ticket, or a parent agent supplies it.

A goal generator takes ambient state → candidate goals → ranking → broadcast. Judged on whether the goals it surfaces are real, well-scoped, and worth doing when nobody asked. The substrate is exogenous; the goal is endogenous.
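The difference between the two roles shows up in their signatures. The sketch below is illustrative only: the names `Goal`, `Executor`, `GoalGenerator`, `EchoExecutor`, and `CiDriftGenerator` are hypothetical and not taken from any system discussed here. It shows that an executor receives a goal from outside, while a goal generator receives only ambient state and produces goals itself.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Goal:
    description: str
    source: str  # who or what produced the goal


class Executor(Protocol):
    """The goal is exogenous: a human, ticket, or parent agent hands it in."""

    def run(self, goal: Goal) -> bool: ...


class GoalGenerator(Protocol):
    """The substrate is exogenous: goals are the output, not the input."""

    def propose(self, ambient_state: dict[str, object]) -> list[Goal]: ...


class EchoExecutor:
    def run(self, goal: Goal) -> bool:
        # Toy executor: a real one would plan and act; this one only
        # reports success for any non-empty goal.
        return bool(goal.description)


class CiDriftGenerator:
    def propose(self, ambient_state: dict[str, object]) -> list[Goal]:
        # Toy generator: surfaces a goal only when the observed state calls for one.
        if ambient_state.get("ci_status") == "failing":
            return [Goal("fix failing CI", source="ambient:ci_status")]
        return []
```

Nothing in `CiDriftGenerator.propose` takes a goal as input, and nothing in `EchoExecutor.run` inspects ambient state. That separation is the whole point of the taxonomy.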

Dennett’s intentional stance draws roughly this line from the outside: a system is “agent-like” to the degree it’s usefully predictable as pursuing goals over time. Korsgaard, in Self-Constitution, pushes further: practical agency requires choosing what to pursue. Friston’s active-inference framing arrives at a related place: an agent reduces its own surprise by acting on the world, which implies it is the thing deciding which prediction error to reduce next. None of these frames reduces agency to “executes a given instruction.” All locate agency at the point where the goal is generated.

We can adopt that as a working definition without claiming to have solved philosophy:

Agency is the capacity to generate goals from ambient state, and to keep generating them, coherently, without being prompted.

Everything downstream of the generated goal — decomposition, tool selection, retries, escalation — is execution. Valuable, hard to do well, increasingly commoditized. But not the thing that makes a system agentic.

This article argues that almost every system currently marketed as an “agent” is, under this definition, an executor. That categorization is not a dismissal: executors are critically useful, and the best ones are extremely sophisticated. The categorization clarifies what the hard open problem actually is, and where the durable moat lives if you solve it.


2. Audit the Field: What Do Existing Systems Actually Do?

Before claiming a gap, we should demonstrate it. Below is a capability audit of ten systems that get called “agents” in 2025–2026 discourse. Each column is a concrete, verifiable capability. I am deliberately ranking capabilities from easy to hard:

  1. Reactive response. Receives an event and responds.
  2. Scheduled execution. Wakes on a timer and runs a fixed routine.
  3. Memory-augmented context. Retains state across invocations and retrieves it on future ones.
  4. Drift detection. Compares desired state to actual state and flags the delta.
  5. Subgoal decomposition. Given a goal, produces and executes subgoals.
  6. Open-ended goal generation. Given ambient state only, proposes novel goals worth pursuing.

The sources for each row are cited — either the framework’s own docs or our own internal survey of ten context-continuity systems (/Users/admin/Work/_readme/wiki/business/autonomous-agents-context-continuity.md) and eleven active-memory systems (/Users/admin/Work/_readme/wiki/business/active-memory-sota-survey.md).

| System | 1. Reactive | 2. Scheduled | 3. Memory | 4. Drift detect | 5. Subgoal decomp | 6. Open-ended goal gen |
| --- | --- | --- | --- | --- | --- | --- |
| Claude Code | Yes (user turn) | No | CLAUDE.md + MEMORY.md | No | Yes (TODO list) | No (waits for prompt) |
| Claude Agent SDK | Yes | No (caller) | JSONL session transcripts | No | Yes (query() loop) | No |
| OpenClaw | Yes (message) | Yes (dreaming cron) | Workspace files + SQLite + active-memory | No | Yes | No (conversational) |
| Letta / MemGPT | Yes | Yes (sleep-time compute) | Core blocks + archival vector | Pressure warning only | Yes (agent loop) | No (self-edits, doesn’t author goals) |
| Mem0 | Yes (add/search) | No | Vector + BM25 + entity links | No | N/A (memory layer) | No (not a goal system) |
| Zep / Graphiti | Yes (episode) | Yes (community rebuild) | Bi-temporal knowledge graph | Edge invalidation = fact-level drift | N/A | No (substrate, not planner) |
| MoltWorker | Yes (sandbox events) | Yes (wake-cron) | R2-as-filesystem + SDK sessions | No | Yes (nested agents) | No |
| AutoGPT / BabyAGI (2023) | N/A | Loop | Vector store | No | Yes (task list) | Attempted, failed (see §3) |
| “Prime” (our org-level pattern) | Yes (webhook) | Yes (DO alarm) | 4-layer (DO SQLite + GitHub issues + CLAUDE.md + D1) | Yes (signals vs desired) | Yes (dispatcher jobs) | Yes, in a bounded domain (repo standards) |
| Honcho (AGPL) | Yes | Yes (dreamer) | Relational + peer representations | Derivation deltas | N/A | Partial (derives, broadcasts, but doesn’t rank goals) |

A few observations from the table:

Two rows approach column 6, both with qualifications. Prime generates goals from repo-standards drift — ambient state is a set of signals it knows how to score, against a desired state it has opinions about. That is open-ended goal generation in a bounded domain. Honcho’s deriver + dreamer + webhooks stack emits derivation events — arguably proto-goals — but does not rank or prioritize across them in a way an executor could consume without further framing.

No system in this table does what a reasonable person would mean by “an agent that, given my whole operational state, figures out what I should work on today and sets it up to be worked on.” That is the gap.


3. AutoGPT / BabyAGI: The Failure That Taught Us What the Gap Looks Like

It is worth naming the one category that did claim column 6 and whose results are by now a well-documented cautionary tale. In 2023, AutoGPT (github.com/Significant-Gravitas/AutoGPT) and BabyAGI (github.com/yoheinakajima/babyagi) went viral on a simple loop:

  1. Given a high-level objective, ask the LLM: “what tasks should I do to achieve this?”
  2. Pick the top task.
  3. Execute it.
  4. Ask the LLM: “given what just happened, what new tasks should I add?”
  5. Re-prioritize.
  6. Goto 2.

The community built this in a weekend. It looked like open-ended goal generation. It was not. Within weeks, retrospectives from users, researchers, and the authors themselves documented the same failure mode: tasks were syntactically plausible but unmoored from reality. The system would spend hours creating sub-tasks that referred to files it had not read, facts it had not verified, URLs it had hallucinated. Tasks spawned subtasks, the tree grew, nothing shipped. Users would come back after an hour to an LLM bill, a log of a hundred “tasks,” and zero concrete progress.
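The six-step loop is small enough to sketch in full. The version below is not AutoGPT’s or BabyAGI’s actual code; it is a minimal reconstruction, assuming a stubbed `stub_propose` in place of the LLM call and a stubbed `stub_execute` in place of real tool use (both hypothetical names). It shows why the task tree grows: every execution adds more tasks than it removes, and nothing in the loop checks a task against the world.

```python
import itertools
from collections import deque
from typing import Callable, Optional

ProposeFn = Callable[[str, Optional[str]], list[str]]


def run_task_loop(objective: str, propose: ProposeFn,
                  execute: Callable[[str], str], max_steps: int):
    """Run the six-step loop, capped at max_steps so the demo terminates."""
    queue = deque(propose(objective, None))       # step 1: initial task list
    completed: list[str] = []
    for _ in range(max_steps):                    # step 6: go back to step 2
        if not queue:
            break
        task = queue.popleft()                    # step 2: pick the top task
        result = execute(task)                    # step 3: execute it
        completed.append(task)
        queue.extend(propose(objective, result))  # step 4: ask for new tasks
        # Step 5 (re-prioritize) is omitted; the queue stays first-in, first-out.
    return completed, list(queue)


_ids = itertools.count(1)


def stub_propose(objective: str, last_result: Optional[str]) -> list[str]:
    # Stand-in for the LLM call: always invents two tasks, checked against nothing.
    return [f"task-{next(_ids)}", f"task-{next(_ids)}"]


def stub_execute(task: str) -> str:
    # Stand-in for execution: reports success without touching the world.
    return f"done: {task}"


done, pending = run_task_loop("grow the business", stub_propose, stub_execute, max_steps=10)
print(len(done), len(pending))
```

With this stub, ten steps complete ten tasks and leave twelve pending: the backlog grows by one per step for as long as the loop runs.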

The postmortem is not “the prompt was bad.” It is structural, and it is the lesson that informs everything else in this article:

Open-ended goal generation without rich, grounded, continuously-updated ambient state produces hallucination dressed as autonomy.

AutoGPT’s loop asked the LLM to generate goals from the LLM’s own priors and its own running narrative. The substrate was the conversation plus whatever scratch files the LLM wrote. No facts ledger, no typed entities, no drift signals, no graph of what the user actually cares about. The LLM was forced to pretend it had ambient state, and it hallucinated that state turn by turn.

Contrast a human “generating goals from ambient state”: walks into a kitchen, notices the trash is full, sees the counter is sticky, remembers a package was supposed to arrive, hears a notification ping. The ambient state is enormously rich and continuously verifiable against reality. Goal generation is cheap precisely because the substrate is high-resolution and real-time.

The 2023 experiment revealed, by failing, that the substrate is the rate-limiting reagent. You cannot brute-force column 6 with a for-loop. The field internalized the lesson and largely retreated from column 6 entirely, focusing on columns 1–5 for the next two years. That is what the 2026 agent-framework landscape mostly looks like.


4. Prime as a Bounded-Domain Existence Proof

If you restrict the domain enough, open-ended goal generation works today. Our own Prime pattern (/Users/admin/Work/_readme/articles/org-prime-agent-architecture/README.md) is one such example, and it is useful to walk through why it works in order to see what a general-purpose version would need.

Prime is a hierarchy of persistent Durable Object agents (Org Prime, Repo Prime × N) monitoring a multi-repo GitHub org. Each Repo Prime has:

This system does generate goals. A new repo joins the org; Repo Prime wakes, scans, notices biome is missing, creates a tracking issue, dispatches a job to add biome.json, comments the outcome. Nobody asked. The goal was generated from ambient state.

Why does this work where AutoGPT didn’t? Four reasons:

  1. The substrate is real. Repo state comes from GitHub’s API, CI from GitHub Actions, file presence from GET /contents. Nothing is narrated by the LLM.
  2. The desired state is explicit. CLAUDE.md says what “good” looks like. The LLM never has to invent what the user wants.
  3. The goal vocabulary is bounded. {fix-ci, add-biome, add-commitlint, add-dependabot, merge-mulan-pr, ...} — finite catalog. The LLM is choosing from a menu.
  4. The loop is interrupted. Each wake produces at most five actions, then the DO hibernates. No unbounded recursion; re-grounds from the real world every wake.
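The four properties above can be sketched as a single wake function. This is not Prime’s implementation: `GoalKind`, `REQUIRED_FILES`, and `wake` are hypothetical names, the mapping stands in for whatever the desired-state doc actually specifies, and the file paths other than `biome.json` are assumed for illustration. Observed state is passed in as plain data, standing in for what the real system reads from GitHub.

```python
from enum import Enum


class GoalKind(Enum):
    # Bounded goal vocabulary: the generator can only pick from this menu.
    FIX_CI = "fix-ci"
    ADD_BIOME = "add-biome"
    ADD_COMMITLINT = "add-commitlint"
    ADD_DEPENDABOT = "add-dependabot"


# The loop is interrupted: at most five actions per wake.
MAX_ACTIONS_PER_WAKE = 5

# Explicit desired state (assumed shape): each required file maps to the
# goal that restores it when the file is missing.
REQUIRED_FILES = {
    "biome.json": GoalKind.ADD_BIOME,
    "commitlint.config.js": GoalKind.ADD_COMMITLINT,
    ".github/dependabot.yml": GoalKind.ADD_DEPENDABOT,
}


def wake(observed_files: set[str], ci_passing: bool) -> list[GoalKind]:
    """Compare observed repo state with desired state and emit bounded goals."""
    goals: list[GoalKind] = []
    if not ci_passing:
        goals.append(GoalKind.FIX_CI)
    for path, kind in REQUIRED_FILES.items():
        if path not in observed_files:
            goals.append(kind)
    # Truncate, then stop; the next wake re-reads the world from scratch.
    return goals[:MAX_ACTIONS_PER_WAKE]
```

For a repo that has `biome.json` but failing CI, `wake({"biome.json"}, ci_passing=False)` yields the CI fix followed by the two missing-config goals. A fully compliant repo yields an empty list, which is the correct output when nothing has drifted.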

This is open-ended goal generation inside a domain where the substrate is dense and the goal vocabulary is discrete. It is structurally similar to how MCTS and AlphaZero generate goals (moves) inside a bounded game-tree: the board state is dense and verifiable, the move vocabulary is finite, the loop re-grounds after every move. In-domain goal generation is tractable when the substrate is high-resolution and the action space is enumerable.

Prime is an existence proof. What it does not show is that the same pattern scales to open-ended domains — where the desired state isn’t a YAML file of standards, the goal vocabulary isn’t a fixed catalog, and the substrate isn’t a REST API. That’s the harder problem, and it’s the one the active-memory hypothesis tries to address.


5. The Active-Memory Hypothesis: Goal Generation as a Substrate Function

If Prime’s recipe is “dense verifiable substrate + desired-state doc + bounded goal vocabulary + interrupted loop,” the question for general-purpose agency becomes: can we build that recipe where the desired state is not a YAML file?

The active-memory survey (/Users/admin/Work/_readme/wiki/business/active-memory-sota-survey.md) outlines one plausible path, composed from three systems that each hold a piece:

Compose these and you get a substrate where writes propagate through pipelines (potentially mutating related nodes), contradictions get flagged rather than silently overwritten, scheduled consolidation passes produce candidate summaries and priorities, and threshold-crossing events fire webhooks that an executor can pick up.

On top of that substrate, goal generation becomes — plausibly — a read-query over the substrate. “What should we work on?” becomes: find the highest-heat unresolved thread, rank by staleness × importance, check against desired-state pages for the relevant product, return a typed goal object. The LLM is no longer hallucinating ambient state; it’s phrasing an SQL-plus-semantic-search query against a graph that has been actively curating itself.
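A minimal sketch of that read-query, under loud assumptions: the substrate is reduced to an in-memory list of `Thread` records, “heat” is approximated by the staleness × importance score, and the desired-state check is a simple lookup of whether the product has a desired-state page at all. `Thread`, `TypedGoal`, and `next_goal` are hypothetical names, not part of any surveyed system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Thread:
    title: str
    product: str
    resolved: bool
    days_stale: float
    importance: float  # assumed to be on a 0..1 scale
    citation: str      # where in the substrate this thread is recorded


@dataclass(frozen=True)
class TypedGoal:
    type: str
    reason: str
    citation: str
    score: float


def next_goal(threads: list[Thread],
              desired_state: dict[str, str]) -> Optional[TypedGoal]:
    """Return the top-ranked unresolved thread as a goal, or None."""
    # Only threads whose product has a desired-state page are eligible.
    candidates = [t for t in threads
                  if not t.resolved and t.product in desired_state]
    if not candidates:
        return None  # nothing grounded to propose
    top = max(candidates, key=lambda t: t.days_stale * t.importance)
    return TypedGoal(
        type="resolve-thread",
        reason=f"{top.title} (desired state: {desired_state[top.product]})",
        citation=top.citation,
        score=top.days_stale * top.importance,
    )
```

The function never asks a model what the state of the world is. It only ranks what the substrate already holds, and it returns `None` when the substrate has nothing eligible.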

This is not a solved architecture. It is a hypothesis with a clear shape:

Goal-generation quality is proportional to substrate quality. Build the substrate well enough and goal generation becomes a querying problem. Build it poorly and you reproduce AutoGPT.

The remainder of this article treats that hypothesis as something to test, not something to assume.


6. The Spike: Honest 65% Validation

On 2026-04-19 we ran a small spike (/Users/admin/Work/jane/wiki/6-research/goal-generation-spike-s1.md) against this hypothesis. The setup was as clean as we could make it:

The result: roughly 65% validation. Specifically, the self-rating from the spike itself was:

| Axis | Score | Notes |
| --- | --- | --- |
| Coherence | 7/10 | Two spine events (ship book 1, deploy intake chatbot) with weeks 2–4 compounding around them. Nothing for UberMesh or Scalable Media. |
| Grounding | 8/10 | Every goal cited a specific doc and line range; most claims validated against two different files. |
| Specificity | 7/10 | ~60% of goals could be dispatched to Haiku directly from the goal + citation; another ~25% to Sonnet. |
| Sequencing realism | 6/10 | Aggressive: pulled month-2 work into month-1 for a single-operator pipeline. |

Two things are worth saying clearly about this result.

Where it worked. Where the substrate was mature — Book Telic has an explicit revenue model, twelve-month plan, economics calculations, status.json showing books_published: 0 against a month-1 target of 1; First 300 has a published thesis, scored portfolio, ICP definition, a transfer document naming the next spike — the generated goals were concrete, well-sequenced, and cited sources the system could re-verify. They read like goals a human operator would produce. Anyone reading the spike’s “Week 1” section could dispatch the work without further briefing.

Where it didn’t. In UberMesh, there was no monetization path stated in the substrate itself. The spike correctly refused to invent one. In Scalable Media, the only monetization framing came from a three-day-old memory file, not from the repo’s own docs. The spike correctly flagged this as insufficient grounding. Under the AutoGPT-failure-mode test, this is the right behavior — silence beats hallucination. But it also means the plan was “one plan” for two products rather than four. The other two didn’t fail; they returned null.
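The “silence beats hallucination” behavior can be expressed as a grounding gate. The sketch below is an assumption about shape, not the spike’s actual logic: `Evidence`, `grounded_goal`, the `"repo-doc"` and `"memory"` labels, and the `min_repo_docs` threshold are all hypothetical. It shows a generator that returns null when the only support for a goal comes from memory files rather than the product’s own docs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Evidence:
    path: str
    kind: str  # "repo-doc" or "memory" (hypothetical labels)


def grounded_goal(description: str, evidence: list[Evidence],
                  min_repo_docs: int = 1) -> Optional[dict]:
    """Emit a goal only if enough of its support comes from the repo's own docs."""
    repo_docs = [e for e in evidence if e.kind == "repo-doc"]
    if len(repo_docs) < min_repo_docs:
        return None  # insufficient grounding: stay silent rather than invent
    return {
        "description": description,
        "citations": [e.path for e in repo_docs],
    }
```

A goal backed only by a memory file returns `None`; the same goal backed by one repo doc is emitted with that doc as its citation.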

The honest takeaway is not “goal generation works!” and not “goal generation fails!” It is structural:

The rate-limiting factor was substrate maturity, not architecture. Where the wiki had real revenue, strategy, status, and dependency docs, goal generation produced genuine signal. Where the wiki had aspiration without ground-level detail, goal generation silently skipped or produced LOW-confidence goals correctly marked as such.

This tells us what to fix. Not “write a smarter planner.” Rather: “make the substrate denser on the 30% of surface area where it’s thin.” Different kind of work, and it compounds — every page written improves the next goal-generation cycle.

Extrapolating: a substrate at 100% maturity (every product has revenue, status, dependency, and desired-state docs with cross-links) might produce goal generation at 85–90% validation. If that holds, getting from 65% to 90% is a matter of filling in substrate, not rewriting the planner. This first spike is consistent with the active-memory hypothesis without proving it.

One spike is not a proof. We have not tested a hardened substrate, substrate-evolution over time, or the same prompt against competitors. But we now have a result shaped like the hypothesis.


7. Agent-Agnosticism: Executors as a Commodity Layer

Grant the argument so far. Suppose agency — open-ended goal generation from ambient state — turns out to be solvable as a function of substrate quality plus a goal-ranking layer on top. What follows for system design?

The consequence is architectural:

Once you own goal generation and the substrate it runs on, the executor becomes a commodity.

This is a concrete claim about which interfaces matter and which don’t.

Today, a lot of engineering effort goes into picking “the right agent framework.” OpenClaw vs Claude Agent SDK vs MoltWorker vs LangGraph vs CrewAI vs AutoGen. Committing to one creates lock-in, usually at the shape-of-work level (conversational vs loop vs graph) rather than at the goal level.

If goal generation is separated cleanly — if a goal is an output of the substrate, shaped as a structured record with type, reason, citation, success criterion, and confidence — the executor on the other side can be anything that can consume that record. Claude Code is a great executor for goals wanting human-in-the-loop. Claude Agent SDK for scripted determinism. MoltWorker for sandboxed nested runs. OpenClaw for conversational. Prime for repo-shaped.

The question stops being “which agent framework do we commit to?” and becomes “which executor is best for this goal right now?” The answer changes week-to-week as the market evolves. You can run executors concurrently against the same goal stream, A/B them, route goals by capability match. This is the same move the API-Mom router makes at the model layer — model choice as routing, not commitment. One layer up, for agent frameworks.
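Routing goals by capability match can be sketched in a few lines. The goal-record fields `type`, `reason`, `citation`, `success_criterion`, and `confidence` follow the description above; the `needs` field, the capability tags, and the names `GoalRecord`, `ExecutorProfile`, and `route` are assumptions added for illustration. The executor profiles are a rough reading of the roles named in this section, not a statement of what each product supports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoalRecord:
    type: str
    reason: str
    citation: str
    success_criterion: str
    confidence: float
    needs: frozenset[str]  # hypothetical: capabilities the goal requires


@dataclass(frozen=True)
class ExecutorProfile:
    name: str
    capabilities: frozenset[str]


def route(goal: GoalRecord, executors: list[ExecutorProfile]) -> list[str]:
    """Return every executor whose capabilities cover what the goal needs."""
    return [e.name for e in executors if goal.needs <= e.capabilities]


EXECUTORS = [
    ExecutorProfile("Claude Code", frozenset({"human-in-the-loop"})),
    ExecutorProfile("Claude Agent SDK", frozenset({"scripted"})),
    ExecutorProfile("MoltWorker", frozenset({"sandboxed", "nested"})),
    ExecutorProfile("OpenClaw", frozenset({"conversational"})),
]
```

Because `route` returns a list rather than a single winner, the same goal can be handed to several executors at once for an A/B comparison, and replacing an executor is an edit to `EXECUTORS` rather than a change to the goal generator.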

The durable asset is not the executor. Executors are replaceable. The durable asset is:

  1. The substrate — the typed, content-addressed, actively-curated store of what’s true about the world.
  2. The goal generator — the continuous process that queries the substrate and emits goal records.
  3. The interface between them — the goal-record format, which is the only thing any executor needs to understand.

Everything below the goal record is commodity. Everything above it is frontend and human-in-the-loop UX.

This is the Kubernetes split: control plane owns desired state and drift; data plane is dumb and swappable. Container runtimes such as containerd and CRI-O are interchangeable behind the kubelet’s Container Runtime Interface. The substrate-plus-goal-generator is the control plane for autonomous operation; the executor zoo is the data plane. Prime’s article made this argument for the narrow case of repo standards; the claim here is that it generalizes.

One consequence worth flagging: if you bet your company on being a “best-in-class agent executor,” you are betting on a commoditizing layer. Your moat, if you have one, has to be upstream of the executor.


8. The Clean Definition

The definition the article has been working toward:

Agency is the capacity, given ambient state, to generate goals worth pursuing, and to continue doing so coherently over time.

Philosophically clean: it separates agency from execution rather than bundling them. Operationally clean: it gives you a test — can the system produce a ranked stream of goals without being prompted, such that the goals hold up under inspection? Practically clean: it tells you where to invest — upstream of the executor, in the substrate and the ranking layer.

Testable corollaries:

And a consequence:

Most of what the industry calls “agents” in 2026 are execution surfaces. They are useful. They are sophisticated. They are not agents in the sense this article means.

Categorization, not insult. The best execution surfaces — Claude Code, OpenClaw, Claude Agent SDK, MoltWorker — are valuable precisely because they are great at executing. The problem isn’t that they’re executors; it’s that “agent” has been applied to them in a way that obscures what’s actually missing.


9. What to Build If You Want Agency

The practical implications for builders separate cleanly by altitude:

If you are building an executor (an agent framework, a Claude-based coding tool, a sandbox runner): optimize for the handoff from a goal record. Accept structured goal inputs with success criteria and citations. Report outcomes back. Don’t try to also be a goal generator — that’s a different product. Make it easy for someone else’s goal generator to drive you.

If you are building a substrate (a memory system, a knowledge graph, a content-addressed store): the active-memory survey names the primitive you need — write propagation. Writes must trigger pipelines. Pipelines must mutate related entities. Scheduled consolidation passes must re-rank and re-embed. Threshold-crossing events must emit webhooks. Without these, you have a database, and goal generation over a database fails the AutoGPT test.
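Write propagation is easier to see in a toy store than in prose. The sketch below is a deliberately small in-memory illustration, not the design of any surveyed system: `ActiveStore`, its `link` and `write` methods, and the use of a per-key “heat” counter as the propagated quantity are all hypothetical. It covers three of the four requirements (writes trigger a pipeline, the pipeline mutates related entities, threshold crossings fire webhooks) and leaves scheduled consolidation out.

```python
from collections import defaultdict
from typing import Callable


class ActiveStore:
    """Toy key-value store where every write propagates to linked keys."""

    def __init__(self, heat_threshold: int):
        self.values: dict[str, object] = {}
        self.heat: dict[str, int] = defaultdict(int)
        self.links: dict[str, set[str]] = defaultdict(set)
        self.heat_threshold = heat_threshold
        self.webhooks: list[Callable[[str, int], None]] = []

    def link(self, a: str, b: str) -> None:
        # Record that a write to either key should also touch the other.
        self.links[a].add(b)
        self.links[b].add(a)

    def write(self, key: str, value: object) -> None:
        self.values[key] = value
        # Pipeline step: the write heats the key itself and every linked key.
        for touched in {key} | self.links[key]:
            before = self.heat[touched]
            self.heat[touched] = before + 1
            # Fire webhooks exactly once, on the write that crosses the threshold.
            if before < self.heat_threshold <= self.heat[touched]:
                for hook in self.webhooks:
                    hook(touched, self.heat[touched])
```

A plain database would stop at `self.values[key] = value`. Everything after that line is what makes the store active: a second write to one key can push a linked key over the threshold and notify an executor that never touched it.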

If you are building a goal generator: the substrate-quality problem is your problem. You cannot generate good goals from a thin substrate. Roadmap: adopt a bi-temporal edge schema, adopt the deriver/dreamer/webhooks shape, adopt a per-domain desired-state doc convention, adopt a bounded goal vocabulary per domain, run the spike, measure, fix the substrate where it produced null results, re-run.

If you are betting your company on any single agent framework: separate the layers. Keep goal generation and the substrate in your own hands. Use whoever’s executor is best this quarter; be ready to swap.

Most of all: stop asking “is this system autonomous?” and start asking “does this system generate its own goals from a substrate it continuously re-grounds in?” The first is a branding question. The second is an engineering question with a test attached.


10. Honest Caveats

Three things this article does not claim.

It does not claim goal generation is solved. Our own spike was 65% validation on a partially-mature substrate. That is a promising first data point, not a victory. The test that matters is running the same shape of spike on a substrate we have deliberately hardened, against a goal-generation stack we have deliberately built, and seeing whether 65% becomes 85%. We have not run that test.

It does not claim existing systems are “wrong.” OpenClaw, Letta, Claude Code, MoltWorker, Agent SDK — these are excellent at what they do. Re-categorizing them as executors rather than agents changes their position on the taxonomy; it does not reduce their value. Naming the layer clearly makes them more useful, because it’s then clear what they should integrate with.

It does not claim active memory is the only path. Active memory is a plausible substrate shape, informed by prior art and tested in a partial spike. Other shapes exist — RL over long horizons, model-internal memory (EM-LLM’s direction), human-in-the-loop-with-agent-assist, formal planning over typed domains (PDDL descendants). The active-memory bet is that goal generation becomes a substrate-query problem once the substrate has the shape described in §5. We could be wrong. The test is empirical.

These caveats are the precondition for the article’s central claim being useful: we draw a line between two distinct things (executors and goal generators), name the hard open problem (open-ended goal generation), point at one specific path through it (active memory), and report our own partial result honestly. If the path fails, the distinction still holds, and whatever solves goal generation will still be the thing that matters.




