Most AI agents treat failure as terminal. The right architecture treats failure as fuel — each failure enriches the next attempt, successful fixes become permanent skills, and the system gets progressively more capable without retraining a single model.
When an AI agent fails at a task, the standard response is: retry with the same model (which usually fails again), or escalate directly to a human (expensive, doesn’t scale). Neither approach learns anything. Neither gets cheaper over time.
This article presents a different model — one that mirrors how high-performing human organizations actually handle problems. It has three interlocking parts:
- The escalation ladder — a structured hierarchy of models, each more capable and expensive than the last, that a job climbs only on genuine failure
- The advisor/executor split — higher models diagnose and advise; cheaper models execute; expensive intelligence is never wasted on mechanical work
- Skill crystallization — when escalation succeeds, the advice that fixed it gets written back as a permanent skill, making lower models more capable for the next similar job
Together these form a self-improving capability loop: the more the system runs, the fewer escalations it needs.
Table of Contents
- The Problem with Retry Logic
- The Escalation Ladder
- The Advisor/Executor Split
- Accumulated Failure Context
- Skill Crystallization
- The Compound Effect
- How This Mirrors Human Organizations
- Connection to Claude Skills
- Implementation in API MOM
- Job Payload Schema
- The Escalation Loop in Code
- Skill Storage and Injection
- Metrics That Matter
- Anti-Patterns
- The Bigger Picture
- References
The Problem with Retry Logic
Most agent frameworks handle failure like this:
attempt() → fail → retry() → fail → retry() → fail → give up
Sometimes with backoff, sometimes with a slightly different prompt. But the fundamental problem is that each retry starts with the same information as the last attempt. There is no learning. The fifth attempt is as blind as the first.
The more sophisticated version escalates to a more capable model on failure:
haiku → fail → sonnet → fail → opus → fail → human
This is better — the job at least gets progressively more intelligence applied to it. But it still has two critical flaws:
Flaw 1: Expensive models do cheap work. When opus finally succeeds at writing a biome.json configuration, it spent its reasoning budget on a task that haiku could have done with better instructions. The expensive model’s intelligence was used on execution, not diagnosis.
Flaw 2: Nothing is learned. The next repo that needs the same biome.json fix will climb the exact same ladder. Opus will do the exact same work again. The system has zero memory of what worked.
The Escalation Ladder
The escalation ladder is not about retrying with more intelligence. It is about routing to the right level of capability for the specific type of problem.
Level 0: Static template (no AI — pure pattern matching)
Level 1: Fast LLM (e.g. Llama 8B via Cloudflare Workers AI)
Level 2: Capable LLM (e.g. Llama 70B or equivalent)
Level 3: Reasoning model (e.g. Claude Sonnet with thinking)
Level 4: Top-tier model (e.g. Claude Opus with adaptive thinking)
Level 5: Human (terminal — Telegram alert, manual review)
A job starts at Level 0. It only climbs if the current level fails. The vast majority of jobs should succeed at Level 0 or 1 — if your Level 3+ is handling routine work, something is wrong with your lower levels.
The ladder is also not symmetric. Climbing costs money and latency. The goal of the entire architecture is to keep jobs as low on the ladder as possible while ensuring they succeed.
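As a minimal sketch, the ladder can be written down as plain data plus a rule for climbing. The names `LADDER`, `LadderLevel`, and `nextLevel` are illustrative assumptions, not part of any existing API; the labels are taken from the list above.

```typescript
// Hypothetical encoding of the ladder; labels mirror the levels listed above.
type LevelKind = "template" | "llm" | "human"

interface LadderLevel {
  level: number
  label: string
  kind: LevelKind
}

const LADDER: LadderLevel[] = [
  { level: 0, label: "Static template", kind: "template" },
  { level: 1, label: "Fast LLM", kind: "llm" },
  { level: 2, label: "Capable LLM", kind: "llm" },
  { level: 3, label: "Reasoning model", kind: "llm" },
  { level: 4, label: "Top-tier model", kind: "llm" },
  { level: 5, label: "Human", kind: "human" },
]

// A job stays where it is on success and climbs one rung on failure.
// The human rung is terminal, so the result is capped at the top level.
function nextLevel(current: number, failed: boolean): number {
  if (!failed) return current
  const top = LADDER[LADDER.length - 1].level
  return Math.min(current + 1, top)
}
```

Keeping the ladder as data rather than branching logic means a level can be swapped for a different model without touching the climbing rule.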
The Advisor/Executor Split
Here is the key insight that most escalation architectures miss:
Higher models should advise, not execute.
When a Level 0 executor fails, the natural response is to send the same task to Level 1 and have it try. But this is wasteful. Level 1 is more expensive. Level 1 is slower. And if the task is a mechanical operation (write this file, apply this config), Level 0 is perfectly capable of doing it — it just needed better instructions.
The correct escalation is:
Level 0 (executor): tries, fails
→ escalate to Level 1 (advisor):
"Level 0 failed with this error. What should it do differently?"
→ Level 1 returns: new instructions, context, template
→ re-queue Level 0 with Level 1's instructions
Level 0 (executor): retries with enriched instructions → succeeds
Level 1 never touches the actual execution. It only produces better instructions. Level 0 does the work, as it always should.
This distinction — advisor vs executor — is the architectural heart of the system:
| Role | Responsibility | Cost profile |
|---|---|---|
| Executor | Implements instructions; creates files, runs commands, calls APIs | Low cost, high throughput |
| Advisor | Diagnoses failures; produces new instructions for the executor | High cost, low frequency |
Higher models are almost always advisors. Lower models are almost always executors. The only time a higher model executes directly is when the task genuinely requires its capabilities (complex reasoning, novel situations with no prior art).
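The split can be sketched as a single advise-then-retry cycle. This is an illustration under assumed types (`Executor`, `Advisor`, `ExecResult`, and `runWithAdvice` are hypothetical names), kept synchronous for brevity; a real system would call models asynchronously.

```typescript
// Hypothetical types: an executor applies instructions, an advisor only writes them.
interface ExecResult { ok: boolean; error?: string }
type Executor = (instructions: string) => ExecResult
type Advisor = (task: string, error: string) => string

// One cycle: the cheap executor tries, the advisor diagnoses the failure,
// and the same cheap executor retries with the enriched instructions.
// The advisor never performs the task itself.
function runWithAdvice(task: string, execute: Executor, advise: Advisor): ExecResult {
  const first = execute(task)
  if (first.ok) return first
  const instructions = advise(task, first.error ?? "unknown error")
  return execute(task + "\n" + instructions)
}
```

The advisor is called at most once per cycle and only after a failure, which is what keeps its cost profile "high cost, low frequency".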
Why this mirrors how organizations work
A staff engineer does not rewrite the junior’s code on every bug. They read the failure, write a detailed comment explaining what went wrong and what to do instead, and send it back. The junior implements the fix. This is faster, teaches the junior, and uses the staff engineer’s time on the work only they can do — diagnosis, architecture decisions, edge-case reasoning.
The same principle applies here. The “senior engineer” (Level 3-4 model) is expensive. Every token it spends on mechanical execution is a token not spent on genuine reasoning.
Accumulated Failure Context
When an advisor produces instructions for an executor, and the executor still fails, something important has happened: the advisor’s hypothesis was wrong, or incomplete. The next advisor up the chain needs to know this.
This is why failure context must accumulate, not reset.
// Job payload after two failed escalation cycles:
{
  escalation_history: [
    {
      level: 0,
      label: "Static template",
      attempted_at: 1742000000,
      error: "biome.json conflicts with existing tsconfig extends path"
    },
    {
      level: 0,
      label: "Static template (retry with L1 advice)",
      attempted_at: 1742000300,
      error: "monorepo root biome.json not picked up by packages/*/tsconfig"
    }
  ],
  escalation_advice: [
    {
      from_level: 1,
      label: "Llama 8B",
      instructions: "Repo uses monorepo structure. Add packages/* to biome.json extends.",
      reasoning: "package.json has workspaces field"
    }
  ]
}
When Level 2 receives this job, its advisor prompt is not “fix biome.json” — it is:
“A static template executor tried twice to add biome.json. First attempt failed because of tsconfig conflicts. A Level-1 advisor suggested adding packages/* to the extends field. The retry still failed — the monorepo root config wasn’t picked up. What should the executor do differently?”
The accumulated history transforms a vague “this failed” into a precise diagnosis tree. The Level 2 advisor can see not just what failed, but what was tried, what was advised, and why the advice was insufficient. It has far more signal to reason from.
The accumulation rule: Every escalation event appends to escalation_history. Every advisor response appends to escalation_advice. Neither list is ever truncated during the lifecycle of a single job.
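One way to turn the accumulated payload into an advisor prompt is sketched below. `buildAdvisorPrompt` and the trimmed `Attempt` and `Advice` shapes are assumptions for illustration. The sketch lists attempts first and advice second rather than interleaving them by time.

```typescript
// Trimmed, hypothetical versions of the payload entries shown above.
interface Attempt { level: number; label: string; attempted_at: number; error: string }
interface Advice { from_level: number; label: string; instructions: string; reasoning?: string }

// Render the whole failure narrative; no entry is dropped or summarized away.
function buildAdvisorPrompt(jobType: string, history: Attempt[], advice: Advice[]): string {
  const attempts = history.map((a, i) => `Attempt ${i + 1} (${a.label}): ${a.error}`)
  const given = advice.map(a => `Advice from level ${a.from_level} (${a.label}): ${a.instructions}`)
  return [
    `Job type: ${jobType}`,
    ...attempts,
    ...given,
    "What should the executor do differently?",
  ].join("\n")
}
```

Because both lists are rendered in full, an advisor further up the chain sees every earlier hypothesis alongside the error that disproved it.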
Skill Crystallization
Now comes the piece that makes the whole architecture self-improving.
When an advisor’s instructions lead to a successful fix, the system has discovered something valuable: a pattern + solution pair that worked for a real case. This should not be discarded.
The successful advice gets written to a skill registry:
// After Level-1 advice leads to a successful Level-0 fix:
await writeSkill({
  pattern: {
    job_type: "add-biome",
    repo_signals: ["monorepo", "typescript", "packages/*"],
  },
  instructions: "Use root biome.json with packages/* overrides. Template: ...",
  source: "escalation-l1",
  success_count: 1,
  last_used: Math.floor(Date.now() / 1000), // Unix seconds, matching attempted_at
})
The next time a Level-0 executor picks up an add-biome job on a monorepo TypeScript repo, it loads matching skills first:
const skills = await loadMatchingSkills(db, job.type, repoSignals)
// Returns: ["Use root biome.json with packages/* overrides..."]
const prompt = buildExecutorPrompt(job, skills)
// Prompt now includes: "[Known fix] Use root biome.json with packages/* overrides..."
The Level-0 executor succeeds on the first attempt, without escalation. The skill registry has permanently elevated its capability for this pattern.
What skills look like
A skill is a compact unit of crystallized expertise:
interface Skill {
  id: string
  pattern: {
    job_type: string
    repo_signals: string[]     // signals that must be present
    error_patterns?: string[]  // regex patterns that match failure messages
  }
  instructions: string         // what the executor should do
  template?: string            // exact file content or template to use
  source: string               // "escalation-l1" | "escalation-l2" | "human-fix"
  success_count: number
  failure_count: number        // if skill starts failing, surface for review
  confidence: number           // success_count / (success_count + failure_count)
  last_used: number
  created_at: number
}
Skills are matched by job type and repo signals. When multiple skills match, they are ranked by confidence and recency. High-confidence skills (applied successfully many times) are trusted; low-confidence skills (one success, no follow-up) are presented as hints rather than instructions.
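A sketch of matching and ranking follows. One caveat: the raw ratio `success_count / (success_count + failure_count)` scores a skill with a single success at 1.0, which would not make it "low-confidence". The sketch therefore swaps in add-one smoothing for ranking. That substitution is an assumption made here, not the formula the registry is stated to use, and `SkillLite`, `smoothedConfidence`, and `matchAndRank` are hypothetical names.

```typescript
// Hypothetical flattened view of a skill, enough for matching and ranking.
interface SkillLite {
  id: string
  job_type: string
  repo_signals: string[]
  success_count: number
  failure_count: number
  last_used: number
}

// Assumption: add-one smoothing, so one success with no failures
// scores 2/3 rather than 1.0 and cannot outrank a well-tested skill.
function smoothedConfidence(s: SkillLite): number {
  return (s.success_count + 1) / (s.success_count + s.failure_count + 2)
}

// A skill matches when the job type is equal and every signal it requires
// is present on the repo. Matches are ordered by confidence, then recency.
function matchAndRank(skills: SkillLite[], jobType: string, repoSignals: string[]): SkillLite[] {
  return skills
    .filter(s => s.job_type === jobType)
    .filter(s => s.repo_signals.every(sig => repoSignals.indexOf(sig) >= 0))
    .sort((a, b) =>
      smoothedConfidence(b) - smoothedConfidence(a) || b.last_used - a.last_used)
}
```

With this ranking, a skill that succeeded nine times out of ten is listed ahead of a newer skill with a single success.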
Skills degrade gracefully
A skill that was right for TypeScript 4.x repos may be wrong for TypeScript 5.x repos. If a skill causes new failures, its failure_count rises and its confidence falls. Skills below a confidence threshold are automatically flagged for review. A human (or higher model) can update or retire the skill.
This is the same lifecycle as documentation: it starts accurate, drifts as the world changes, needs maintenance. The difference is that the system surfaces drift automatically — you don’t discover it when a junior engineer gets confused.
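The degradation rule can be sketched as a pure update function. `SkillStats`, `recordOutcome`, and the `REVIEW_THRESHOLD` value of 0.6 are assumptions for illustration; the text above does not fix a threshold.

```typescript
// Hypothetical subset of a skill's counters plus a review flag.
interface SkillStats {
  success_count: number
  failure_count: number
  confidence: number
  flagged: boolean
}

const REVIEW_THRESHOLD = 0.6 // hypothetical cutoff; tune to your own tolerance

// Record one use of a skill and recompute confidence as
// success_count / (success_count + failure_count).
// Returns a new object; the input is not mutated.
function recordOutcome(s: SkillStats, succeeded: boolean): SkillStats {
  const success_count = s.success_count + (succeeded ? 1 : 0)
  const failure_count = s.failure_count + (succeeded ? 0 : 1)
  const confidence = success_count / (success_count + failure_count)
  return { success_count, failure_count, confidence, flagged: confidence < REVIEW_THRESHOLD }
}
```

A skill with three successes survives one failure (confidence 0.75) but is flagged after three (confidence 0.5), at which point a human or higher model can update or retire it.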
The Compound Effect
The three mechanisms — escalation ladder, advisor/executor split, skill crystallization — interact to produce a compounding capability improvement:
Week 1: 100 jobs, 40% succeed at L0, 30% need L1 advice, 20% need L2, 10% need L3+
→ 70 skills written to registry
Week 4: 100 jobs, 65% succeed at L0 (skills), 20% need L1, 10% need L2, 5% need L3+
→ 90 more skills written (new patterns encountered)
Week 12: 100 jobs, 85% succeed at L0, 10% need L1, 4% need L2, 1% need L3+
→ escalation cost down 75%, throughput up 3x, capability up
The system gets cheaper and more capable simultaneously — not because the models improved, but because accumulated expertise is being reused.
The key insight: Each failure is not a cost; it is an investment. The job fails, advice is generated, and that advice costs money. But the money buys a skill that removes the escalation cost from every future similar job. The return on an escalation event is roughly (number of future similar jobs × probability of escalation before the skill existed × cost per escalation), set against the one-time cost of the escalation that produced the skill.
For a system running hundreds of repos across dozens of job types, the ROI on early escalations is high. The 50th add-biome job on a monorepo should essentially be free.
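The return on a single escalation can be made concrete with a small calculation. Every input here is a hypothetical estimate, and `RoiInput` and `escalationRoi` are illustrative names. The sketch assumes the resulting skill removes the escalation entirely for matching jobs.

```typescript
// All inputs are hypothetical estimates, in whatever unit you bill in.
interface RoiInput {
  futureJobs: number           // expected future jobs matching this pattern
  pEscalation: number          // chance such a job escalated before the skill existed
  costPerEscalation: number    // average cost of one escalation cycle
  costOfThisEscalation: number // cost of the escalation that produced the skill
}

// saved: escalation spend avoided on future jobs.
// roi:   net saving relative to what the skill cost to produce.
function escalationRoi(i: RoiInput): { saved: number; roi: number } {
  const saved = i.futureJobs * i.pEscalation * i.costPerEscalation
  return { saved, roi: (saved - i.costOfThisEscalation) / i.costOfThisEscalation }
}
```

For example, with 50 future jobs, a 0.5 escalation probability, and a cost of 2 units per escalation, the skill avoids 50 units of spend for a 2-unit investment, an ROI of 24.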
How This Mirrors Human Organizations
This is not a novel framework invented for AI. It is a faithful mapping of how high-performing engineering organizations handle problems.
| Human org | This architecture |
|---|---|
| Junior engineer | Level 0 executor |
| Senior engineer | Level 1-2 advisor |
| Staff / principal | Level 3-4 advisor |
| Incident post-mortem | Escalation history |
| Runbook / playbook | Skill registry |
| “Write this up in the wiki” | Skill crystallization |
When a junior fails at a task, a senior doesn’t redo the work — they diagnose, explain, and send it back. The junior learns. The next time this pattern appears, the junior handles it without escalation. The senior’s time is preserved for novel problems.
The runbook is the organizational equivalent of the skill registry. When an on-call engineer hits a novel incident, diagnoses it, and fixes it, the right response is to write a runbook entry. The next engineer who hits the same incident reads the runbook and resolves it in minutes, not hours.
The failure of most engineering organizations — and most AI agent systems — is not doing this. Fixes happen, knowledge is not crystallized, the next engineer (or agent) starts from scratch.
Connection to Claude Skills
Claude Code’s /skills system is a direct implementation of this same idea at the prompt level. A skill is a reusable prompt that fires when a pattern matches. The skill registry contains expert-authored instructions for specific tasks. When you invoke /commit, you are not prompting from scratch — you are loading a skill that encodes commit-message best practices, staged file handling, and co-author formatting.
The difference in this architecture is that skills are not hand-authored — they are extracted from successful escalation cycles. The system authors its own skills. Over time, the skill registry becomes a body of operational knowledge that represents every novel problem the system has encountered and solved.
OpenClaw, one of the leading open-source agent frameworks, has 5,400+ skills in its registry — not because the developers wrote 5,400 prompt templates, but because the framework accumulated expertise from thousands of real agent runs. A system with this architecture can reach that scale automatically.
Implementation in API MOM
The escalation ladder, advisor/executor split, and skill crystallization are all concerns that belong in API MOM — the centralized AI routing layer — not in individual agents.
Here is why:
- Agents should not know about models. An agent declares an escalation level and a task. API MOM decides which model handles it.
- Skills are cross-agent. A skill for “add-biome to monorepo TypeScript repo” is useful to any agent that does add-biome jobs. Centralizing skills in API MOM makes them available to all agents.
- Billing and escalation cost are coupled. API MOM already tracks token spend per request. It is the right place to track escalation cost and ROI per skill.
API MOM endpoints for escalation
POST /execute
Body: { level, job_type, prompt, context, escalation_history, escalation_advice }
Response: { result, new_level?, advice?, skill_candidate? }
POST /advise
Body: { from_level, job_type, error, escalation_history, repo_context }
Response: { instructions, reasoning, suggested_executor_level }
POST /skills
Body: Skill
Response: { id }
GET /skills?job_type=add-biome&signals=monorepo,typescript
Response: Skill[]
The agent (e.g. mulan) calls /execute at level 0. If it fails, API MOM automatically calls /advise at level 1 and returns the advice to the agent. The agent re-queues at level 0 with the advice. On success, API MOM calls /skills to store the pattern. The agent never needs to know which model was used at any point.
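As a small illustration of the client side, the `GET /skills` query shown above can be built like this. `skillsUrl` is a hypothetical helper and the base URL in the usage note is a placeholder. The comma-separated `signals` format is taken from the example endpoint, and each signal is percent-encoded on the assumption that the server decodes it.

```typescript
// Build the GET /skills URL. Each value is percent-encoded individually so a
// signal such as "packages/*" cannot break the query string; the commas that
// separate signals are left as-is, matching the endpoint example above.
function skillsUrl(baseUrl: string, jobType: string, signals: string[]): string {
  const query = [
    "job_type=" + encodeURIComponent(jobType),
    "signals=" + signals.map(s => encodeURIComponent(s)).join(","),
  ].join("&")
  // Strip trailing slashes so the path is not doubled.
  return baseUrl.replace(/\/+$/, "") + "/skills?" + query
}
```

Calling `skillsUrl("https://api-mom.example/", "add-biome", ["monorepo", "typescript"])` reproduces the query from the endpoint list, with the placeholder host in front.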
Job Payload Schema
The full payload schema for a job that supports escalation:
interface EscalationAttempt {
  level: number
  label: string        // "Static template", "Llama 8B", "Sonnet+think", etc.
  attempted_at: number // Unix seconds
  error: string
}
interface EscalationAdvice {
  from_level: number
  label: string
  instructions: string
  reasoning?: string
  suggested_executor_level?: number // advisor can request an executor level above 0
}
interface JobPayload {
  // Core task data
  [key: string]: unknown
  // Escalation state (append-only during job lifecycle)
  escalation_history?: EscalationAttempt[]
  escalation_advice?: EscalationAdvice[]
  // Current execution context (updated on re-queue)
  model_spec?: ModelSpec
  active_skills?: string[] // skill IDs injected for this attempt
}
The key constraint: escalation_history and escalation_advice are append-only. No attempt is ever removed. The complete failure narrative is always available to any advisor in the chain.
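The append-only constraint can be checked before a payload update is persisted. This is a sketch: `isAppendOnly` is a hypothetical helper, and it compares entries by their JSON serialization, which assumes entries are plain data written with a stable key order.

```typescript
// True when `next` keeps every entry of `prev`, in order and unchanged,
// and only adds entries at the end.
// Caveat: JSON.stringify comparison is sensitive to object key order.
function isAppendOnly<T>(prev: T[], next: T[]): boolean {
  if (next.length < prev.length) return false
  return prev.every((item, i) => JSON.stringify(item) === JSON.stringify(next[i]))
}
```

Running this check on both `escalation_history` and `escalation_advice` before each write would catch a code path that truncates or rewrites the failure narrative.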
The Escalation Loop in Code
async function executeWithEscalation(
  job: Job,
  db: D1Database,
  apiMomUrl: string,
  apiMomKey: string,
): Promise<JobResult> {
  const maxLevel = 4 // highest model level; beyond this: human escalation
  let level = job.escalation_level ?? 0 // level the executor runs at
  const history: EscalationAttempt[] = job.payload.escalation_history ?? []
  const advice: EscalationAdvice[] = job.payload.escalation_advice ?? []
  // Advisors climb one rung per failed cycle, resuming above the last advisor consulted.
  // Tracking this separately from `level` is what lets the ladder climb and the loop end:
  // the executor may drop back to level 0 each cycle, but the advisor level only rises.
  let advisorLevel = advice.length > 0 ? advice[advice.length - 1].from_level + 1 : level + 1
  while (true) {
    // 1. Load matching skills for this job type + repo context
    const skills = await fetchSkills(apiMomUrl, apiMomKey, job.type, job.payload)
    // 2. Execute at current level (via API MOM)
    const result = await apiMomExecute(apiMomUrl, apiMomKey, {
      level,
      job_type: job.type,
      prompt: buildExecutorPrompt(job, skills),
      escalation_history: history,
      escalation_advice: advice,
    })
    if (result.success) {
      // 3a. Success: extract skill if this was an escalated attempt
      if (history.length > 0 && advice.length > 0) {
        await crystallizeSkill(apiMomUrl, apiMomKey, job, advice[advice.length - 1], result)
      }
      return result
    }
    // 3b. Failure: record attempt (Unix seconds), get advice from the next advisor up
    history.push({ level, label: ESCALATION_LABELS[level], attempted_at: Math.floor(Date.now() / 1000), error: result.error! })
    if (advisorLevel > maxLevel) break // no advisor left to consult
    const advisorResponse = await apiMomAdvise(apiMomUrl, apiMomKey, {
      from_level: advisorLevel,
      job_type: job.type,
      error: result.error!,
      escalation_history: history,
      repo_context: job.payload.repo_context as string,
    })
    advice.push({
      from_level: advisorLevel,
      label: ESCALATION_LABELS[advisorLevel],
      instructions: advisorResponse.instructions,
      reasoning: advisorResponse.reasoning,
      suggested_executor_level: advisorResponse.suggested_executor_level,
    })
    // 4. Re-queue at the lowest level the advisor thinks can handle this
    //    (defaults to level 0; the advisor may ask for a higher executor, e.g. level 1)
    level = advisorResponse.suggested_executor_level ?? 0
    advisorLevel += 1
    // Persist updated escalation state to D1
    await updateJobPayload(db, job.id, { escalation_history: history, escalation_advice: advice })
  }
  // All levels exhausted: needs human
  await escalateToHuman(db, job, history, advice)
  return { success: false, error: 'Escalation ladder exhausted — human review required' }
}
Skill Storage and Injection
async function crystallizeSkill(
  apiMomUrl: string,
  apiMomKey: string,
  job: Job,
  lastAdvice: EscalationAdvice,
  result: JobResult,
): Promise<void> {
  // Extract repo signals: what was true about this repo when the fix worked
  const repoSignals = extractRepoSignals(job.payload)
  const response = await fetch(`${apiMomUrl}/skills`, {
    method: 'POST',
    headers: { Authorization: `Bearer ${apiMomKey}`, 'Content-Type': 'application/json' },
    body: JSON.stringify({
      pattern: {
        job_type: job.type,
        repo_signals: repoSignals,
        error_patterns: extractErrorPatterns(job.payload.escalation_history),
      },
      instructions: lastAdvice.instructions,
      source: `escalation-l${lastAdvice.from_level}`,
      success_count: 1,
      failure_count: 0,
    }),
  })
  // The job itself succeeded, so a failed skill write is logged rather than thrown
  if (!response.ok) console.warn(`Skill write failed: HTTP ${response.status}`)
}
// Synchronous: it only formats strings, so callers can use the return value directly
function buildExecutorPrompt(job: Job, skills: Skill[]): string {
  const skillContext = skills.length > 0
    ? `\n\n## Known fixes for this pattern\n${skills.map(s =>
        `[Confidence: ${Math.round(s.confidence * 100)}%] ${s.instructions}`
      ).join('\n\n')}`
    : ''
  return `${getBasePrompt(job.type)}${skillContext}\n\nRepo: ${job.repo}\n${JSON.stringify(job.payload, null, 2)}`
}
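The helper `extractErrorPatterns` is called above but not defined. One possible shape is sketched here, purely as an assumption: it stores each distinct error message as an escaped literal regex. A real implementation would probably generalize paths and identifiers so that the pattern matches more than the exact message. `escapeRegExp` is a helper introduced for this sketch.

```typescript
// Escape regex metacharacters so a literal message can be stored as a pattern.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")
}

// Hypothetical implementation: one escaped pattern per distinct error message,
// in first-seen order. An undefined history yields an empty list.
function extractErrorPatterns(history: { error: string }[] | undefined): string[] {
  const patterns: string[] = []
  for (const attempt of history ?? []) {
    const pattern = escapeRegExp(attempt.error.trim())
    if (pattern.length > 0 && patterns.indexOf(pattern) < 0) patterns.push(pattern)
  }
  return patterns
}
```

Escaping matters because error messages routinely contain characters such as `*` and `(`, which would otherwise change the meaning of the stored pattern.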
Metrics That Matter
A system running this architecture should track:
| Metric | What it tells you |
|---|---|
| Escalation rate by level | What fraction of jobs require L1, L2, L3+ |
| Skill hit rate | What fraction of jobs had a matching skill |
| Skill effectiveness | Jobs with skill injected: success rate vs baseline |
| Cost per job by type | Where you’re spending money; where skills would pay off most |
| Advisor accuracy | When Level-N advice was followed, what was the subsequent success rate |
| Skill confidence decay | Skills whose confidence is falling — potential drift |
The escalation rate by level is the primary health indicator. A healthy system has a steep drop-off: most jobs succeed at L0, a small fraction need L1, a tiny fraction need L2+. If L2 is handling 20% of jobs, your L0/L1 skill registry is underdeveloped.
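The primary health indicator can be computed from finished job records. The record shape `FinishedJob` and the function name are assumptions; `max_level` is taken to mean the highest level a job needed before it succeeded, with 0 meaning no escalation.

```typescript
// Hypothetical record: the highest level a finished job needed (0 = none).
interface FinishedJob { id: string; max_level: number }

// Fraction of jobs whose highest level was 0, 1, 2, ... (array index = level).
// Levels outside the range are clamped into the first or last bucket.
function escalationRateByLevel(jobs: FinishedJob[], levels: number): number[] {
  const counts: number[] = []
  for (let i = 0; i < levels; i++) counts.push(0)
  for (const job of jobs) {
    const lvl = Math.min(Math.max(job.max_level, 0), levels - 1)
    counts[lvl] += 1
  }
  return counts.map(c => (jobs.length === 0 ? 0 : c / jobs.length))
}
```

A healthy distribution falls off steeply from index 0; a flat one suggests the lower levels lack skills.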
Anti-Patterns
The “just use GPT-4 for everything” anti-pattern
Using the most capable model for all jobs eliminates escalation cost but eliminates learning entirely. You pay top-tier prices for work that a static template could handle. The skill registry never builds. Costs scale linearly with volume.
The “retry with same model” anti-pattern
Re-running the same prompt on the same model after failure is almost never correct. If the model failed once, it will fail again with high probability: the same inputs tend to elicit the same flawed reasoning, even when sampling adds some variation. Retry is only appropriate for transient errors (rate limits, network failures), not reasoning failures.
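The distinction can be encoded as a small classifier that runs before any retry decision. The rule used here (HTTP 429, any 5xx status, or a network error counts as transient) is an assumption for illustration, as are the names `classifyFailure` and `nextAction`.

```typescript
type FailureKind = "transient" | "reasoning"

// Hypothetical classifier: rate limits (429), server errors (5xx), and
// network errors are transient; everything else is treated as a reasoning failure.
function classifyFailure(f: { status?: number; networkError?: boolean }): FailureKind {
  if (f.networkError) return "transient"
  if (f.status === 429) return "transient"
  if (f.status !== undefined && f.status >= 500 && f.status <= 599) return "transient"
  return "reasoning"
}

// Transient failures are retried at the same level; reasoning failures escalate.
function nextAction(kind: FailureKind): "retry-same-level" | "escalate" {
  return kind === "transient" ? "retry-same-level" : "escalate"
}
```

Routing through a classifier like this keeps rate-limit noise out of `escalation_history`, so advisors only see failures worth diagnosing.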
The “executor at every level” anti-pattern
Having each level of the escalation ladder independently execute the task (not advise) wastes expensive model capacity on execution and produces conflicting side effects. Level 3 should not be committing files to GitHub; it should be telling Level 0 exactly which file to commit.
The “discard history on escalation” anti-pattern
Wiping the failure history when re-queuing at a lower level with new instructions defeats the purpose of accumulated context. An advisor at Level 3 who cannot see what Level 1 already tried will suggest the same things Level 1 already suggested.
The “skills are permanent” anti-pattern
Skills must have confidence tracking and decay detection. A skill based on three successes from six months ago, applied to a different tech stack version, will generate incorrect advice with high confidence. The system should surface skill drift before it causes production failures.
The Bigger Picture
This architecture is a claim about how AI systems should grow in capability.
The dominant paradigm is model scaling: capability improves by training bigger models on more data. This is correct but external — you cannot control it, you pay for it at inference time, and it does not retain anything specific to your domain.
The alternative is operational learning: capability improves by accumulating expertise from real runs in your specific domain. The model weights do not change. But the effective capability of the system grows because the prompts get better, the skills get richer, and less is left to the model’s general reasoning.
This is how human expertise actually works. A doctor does not get better at diagnosing because their brain retrains. They get better because they accumulate cases, build pattern recognition, and develop heuristics that map symptoms to diagnoses faster than reasoning from first principles.
The skill registry is the operational memory of the system. The escalation ladder is the mechanism for generating new skills. The advisor/executor split is the allocation policy that ensures expensive reasoning is used on diagnosis, not execution.
Built correctly, this architecture produces a system where:
- Common cases are nearly free — handled by Level-0 executors with high-confidence skills, no AI cost at all
- Novel cases are expensive but educational — advisor models engage, generate instructions, those instructions become skills
- Truly hard cases escalate to humans — but with full context, so humans spend minutes not hours
- The system gets cheaper as it scales — more volume means more skills, which means lower average escalation level
This is the compounding return on operational AI investment. Not from better models — from better use of the models you already have.
References
- API MOM as Intelligent Router — routing by capability hint, free tier / subscription quota management
- Three-Layer AI Agent Architecture — Container / Brain / Wallet separation
- Recurring Automation Governance — the governance layer that keeps automation from running unchecked
- Mulan Dispatcher — reference implementation: escalation ladder in jobs.ts, skill learning in learn.ts
- OpenClaw Framework — 5,400+ skill registry demonstrating the scale this architecture can reach