Gary Wu

Never Fail Twice: The Escalation Ladder That Learns


Most AI agents treat failure as terminal. The right architecture treats failure as fuel — each failure enriches the next attempt, successful fixes become permanent skills, and the system gets progressively more capable without retraining a single model.

When an AI agent fails at a task, the standard response is: retry with the same model (which usually fails again), or escalate directly to a human (expensive, doesn’t scale). Neither approach learns anything. Neither gets cheaper over time.

This article presents a different model — one that mirrors how high-performing human organizations actually handle problems. It has three interlocking parts:

  1. The escalation ladder — a structured hierarchy of models, each more capable and expensive than the last, that a job climbs only on genuine failure
  2. The advisor/executor split — higher models diagnose and advise; cheaper models execute; expensive intelligence is never wasted on mechanical work
  3. Skill crystallization — when escalation succeeds, the advice that fixed it gets written back as a permanent skill, making lower models more capable for the next similar job

Together these form a self-improving capability loop: the more the system runs, the fewer escalations it needs.



The Problem with Retry Logic

Most agent frameworks handle failure like this:

attempt() → fail → retry() → fail → retry() → fail → give up

Sometimes with backoff, sometimes with a slightly different prompt. But the fundamental problem is that each retry starts with the same information as the last attempt. There is no learning. The fifth attempt is as blind as the first.

The more sophisticated version escalates to a more capable model on failure:

haiku → fail → sonnet → fail → opus → fail → human

This is better — the job at least gets progressively more intelligence applied to it. But it still has two critical flaws:

Flaw 1: Expensive models do cheap work. When Opus finally succeeds at writing a biome.json configuration, it has spent its reasoning budget on a task that Haiku could have done with better instructions. The expensive model's intelligence was used on execution, not diagnosis.

Flaw 2: Nothing is learned. The next repo that needs the same biome.json fix will climb the exact same ladder. Opus will do the exact same work again. The system has no memory of what worked.


The Escalation Ladder

The escalation ladder is not about retrying with more intelligence. It is about routing to the right level of capability for the specific type of problem.

Level 0: Static template      (no AI — pure pattern matching)
Level 1: Fast LLM             (e.g. Llama 8B via Cloudflare Workers AI)
Level 2: Capable LLM          (e.g. Llama 70B or equivalent)
Level 3: Reasoning model      (e.g. Claude Sonnet with thinking)
Level 4: Top-tier model       (e.g. Claude Opus with adaptive thinking)
Level 5: Human                (terminal — Telegram alert, manual review)

A job starts at Level 0. It only climbs if the current level fails. The vast majority of jobs should succeed at Level 0 or 1 — if your Level 3+ is handling routine work, something is wrong with your lower levels.

The ladder is also not symmetric. Climbing costs money and latency. The goal of the entire architecture is to keep jobs as low on the ladder as possible while ensuring they succeed.
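
The ladder above can be written down as plain data. The following is a minimal sketch, not the article's implementation: the `Rung` type, the `LADDER` constant, and `nextRung` are hypothetical names introduced for illustration, and the labels simply repeat the levels listed above.

```typescript
// Hypothetical ladder definition; labels follow the levels listed above.
type Rung = { level: number; label: string; kind: "template" | "llm" | "human" };

const LADDER: readonly Rung[] = [
  { level: 0, label: "Static template", kind: "template" },
  { level: 1, label: "Fast LLM", kind: "llm" },
  { level: 2, label: "Capable LLM", kind: "llm" },
  { level: 3, label: "Reasoning model", kind: "llm" },
  { level: 4, label: "Top-tier model", kind: "llm" },
  { level: 5, label: "Human", kind: "human" },
];

// Returns the rung to try after a failure at `current`, or null when the
// failure happened at the terminal (human) rung.
function nextRung(current: number): Rung | null {
  const next = LADDER.find((r) => r.level === current + 1);
  return next ?? null;
}
```

Keeping the ladder as data rather than as branching code means a rung can be swapped for a different model without touching the climbing logic.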


The Advisor/Executor Split

Here is the key insight that most escalation architectures miss:

Higher models should advise, not execute.

When a Level 0 executor fails, the natural response is to send the same task to Level 1 and have it try. But this is wasteful. Level 1 is more expensive. Level 1 is slower. And if the task is a mechanical operation (write this file, apply this config), Level 0 is perfectly capable of doing it — it just needed better instructions.

The correct escalation is:

Level 0 (executor): tries, fails
  → escalate to Level 1 (advisor):
      "Level 0 failed with this error. What should it do differently?"
  → Level 1 returns: new instructions, context, template
  → re-queue Level 0 with Level 1's instructions
Level 0 (executor): retries with enriched instructions → succeeds

Level 1 never touches the actual execution. It only produces better instructions. Level 0 does the work, as it always should.

This distinction — advisor vs executor — is the architectural heart of the system:

| Role | Responsibility | Cost profile |
| --- | --- | --- |
| Executor | Implements instructions; creates files, runs commands, calls APIs | Low cost, high throughput |
| Advisor | Diagnoses failures; produces new instructions for the executor | High cost, low frequency |

Higher models are almost always advisors. Lower models are almost always executors. The only time a higher model executes directly is when the task genuinely requires its capabilities (complex reasoning, novel situations with no prior art).
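
One advise-then-retry cycle can be sketched as follows. This is an illustration under stated assumptions: `Executor`, `Advisor`, and `runWithAdvice` are hypothetical names, and the functions are synchronous only to keep the sketch short (real model calls would be asynchronous).

```typescript
type ExecResult = { ok: true; output: string } | { ok: false; error: string };

// Hypothetical roles: the executor performs the task, the advisor only
// turns a failure into instructions.
type Executor = (task: string, instructions: string[]) => ExecResult;
type Advisor = (task: string, error: string) => string;

// The advisor never runs the task itself. Its output is handed to the same
// cheap executor, which retries with the extra instructions.
function runWithAdvice(task: string, execute: Executor, advise: Advisor): ExecResult {
  const first = execute(task, []);
  if (first.ok) return first;
  const instructions = advise(task, first.error);
  return execute(task, [instructions]);
}
```

The design choice worth noting is that the advisor's return type is a string of instructions, not a result: the type system itself prevents the expensive model from doing the execution.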

Why this mirrors how organizations work

A staff engineer does not rewrite the junior’s code on every bug. They read the failure, write a detailed comment explaining what went wrong and what to do instead, and send it back. The junior implements the fix. This is faster, teaches the junior, and uses the staff engineer’s time on the work only they can do — diagnosis, architecture decisions, edge-case reasoning.

The same principle applies here. The “senior engineer” (Level 3-4 model) is expensive. Every token it spends on mechanical execution is a token not spent on genuine reasoning.


Accumulated Failure Context

When an advisor produces instructions for an executor, and the executor still fails, something important has happened: the advisor’s hypothesis was wrong, or incomplete. The next advisor up the chain needs to know this.

This is why failure context must accumulate, not reset.

// Job payload after two failed escalation cycles:
{
  escalation_history: [
    {
      level: 0,
      label: "Static template",
      attempted_at: 1742000000,
      error: "biome.json conflicts with existing tsconfig extends path"
    },
    {
      level: 0,
      label: "Static template (retry with L1 advice)",
      attempted_at: 1742000300,
      error: "monorepo root biome.json not picked up by packages/*/tsconfig"
    }
  ],
  escalation_advice: [
    {
      from_level: 1,
      label: "Llama 8B",
      instructions: "Repo uses monorepo structure. Add packages/* to biome.json extends.",
      reasoning: "package.json has workspaces field"
    }
  ]
}

When Level 2 receives this job, its advisor prompt is not “fix biome.json” — it is:

“A static template executor tried twice to add biome.json. First attempt failed because of tsconfig conflicts. A Level-1 advisor suggested adding packages/* to the extends field. The retry still failed — the monorepo root config wasn’t picked up. What should the executor do differently?”

The accumulated history transforms a vague “this failed” into a precise diagnosis tree. The Level 2 advisor can see not just what failed, but what was tried, what was advised, and why the advice was insufficient. It has far more signal to reason from.

The accumulation rule: Every escalation event appends to escalation_history. Every advisor response appends to escalation_advice. Neither list is ever truncated during the lifecycle of a single job.
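
To make the idea concrete, here is a sketch of how an advisor prompt might be assembled from the accumulated lists. `buildAdvisorPrompt` is a hypothetical helper, and the exact prompt wording is an assumption; only the field names come from the payload shown above.

```typescript
interface EscalationAttempt { level: number; label: string; attempted_at: number; error: string }
interface EscalationAdvice { from_level: number; label: string; instructions: string; reasoning?: string }

// Hypothetical helper: every prior attempt and every prior piece of advice
// is rendered into the prompt, so the next advisor sees the full narrative.
function buildAdvisorPrompt(
  jobType: string,
  attempts: EscalationAttempt[],
  advice: EscalationAdvice[],
): string {
  const tried = attempts.map((a, i) => `Attempt ${i + 1} (${a.label}) failed: ${a.error}`);
  const advised = advice.map(
    (a) => `Level-${a.from_level} advisor (${a.label}) suggested: ${a.instructions}`,
  );
  return [
    `Job type: ${jobType}`,
    ...tried,
    ...advised,
    "What should the executor do differently?",
  ].join("\n");
}
```

Because nothing is filtered out, the prompt grows with each cycle; a production version would probably need a length budget, which this sketch leaves out.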


Skill Crystallization

Now comes the piece that makes the whole architecture self-improving.

When an advisor’s instructions lead to a successful fix, the system has discovered something valuable: a pattern + solution pair that worked for a real case. This should not be discarded.

The successful advice gets written to a skill registry:

// After Level-1 advice leads to a successful Level-0 fix:
await writeSkill({
  pattern: {
    job_type: "add-biome",
    repo_signals: ["monorepo", "typescript", "packages/*"],
  },
  instructions: "Use root biome.json with packages/* overrides. Template: ...",
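  source: "escalation-l1",                    // same format as the Skill.source values
  success_count: 1,
  failure_count: 0,
  last_used: Math.floor(Date.now() / 1000),   // Unix seconds, matching attempted_at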
})

The next time a Level-0 executor picks up an add-biome job on a monorepo TypeScript repo, it loads matching skills first:

const skills = await loadMatchingSkills(db, job.type, repoSignals)
// Returns: ["Use root biome.json with packages/* overrides..."]

const prompt = buildExecutorPrompt(job, skills)
// Prompt now includes: "[Known fix] Use root biome.json with packages/* overrides..."

The Level-0 executor succeeds on the first attempt, without escalation. The skill registry has permanently elevated its capability for this pattern.

What skills look like

A skill is a compact unit of crystallized expertise:

interface Skill {
  id: string
  pattern: {
    job_type: string
    repo_signals: string[]         // signals that must be present
    error_patterns?: string[]      // regex patterns that match failure messages
  }
  instructions: string             // what the executor should do
  template?: string                // exact file content or template to use
  source: string                   // "escalation-l1" | "escalation-l2" | "human-fix"
  success_count: number
  failure_count: number            // if skill starts failing, surface for review
  confidence: number               // success_count / (success_count + failure_count)
  last_used: number
  created_at: number
}

Skills are matched by job type and repo signals. When multiple skills match, they are ranked by confidence and recency. High-confidence skills (applied successfully many times) are trusted; low-confidence skills (one success, no follow-up) are presented as hints rather than instructions.
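
The matching and ranking rules in the paragraph above might look like this. It is a sketch under assumptions: `SkillRecord` is a trimmed copy of the `Skill` interface, and the two thresholds that separate a trusted fix from a hint are hypothetical values, not numbers from the article.

```typescript
// Trimmed version of the Skill interface above.
interface SkillRecord {
  job_type: string;
  repo_signals: string[];
  instructions: string;
  success_count: number;
  confidence: number;
  last_used: number; // Unix seconds
}

const TRUSTED_CONFIDENCE = 0.8;  // hypothetical cut-off
const TRUSTED_MIN_SUCCESSES = 3; // hypothetical cut-off

// A skill matches when the job type is equal and every signal the skill
// requires is present on the repo. Results are ordered by confidence,
// with recency as the tie-breaker.
function matchSkills(all: SkillRecord[], jobType: string, repoSignals: string[]): SkillRecord[] {
  return all
    .filter((s) => s.job_type === jobType && s.repo_signals.every((sig) => repoSignals.includes(sig)))
    .sort((a, b) => b.confidence - a.confidence || b.last_used - a.last_used);
}

// Well-proven skills are presented as instructions, the rest as hints.
function presentSkill(s: SkillRecord): string {
  const trusted = s.confidence >= TRUSTED_CONFIDENCE && s.success_count >= TRUSTED_MIN_SUCCESSES;
  return `${trusted ? "[Known fix]" : "[Hint]"} ${s.instructions}`;
}
```

Requiring a minimum success count alongside confidence matters because a skill with one success and no failures has a confidence of 1.0 but very little evidence behind it.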

Skills degrade gracefully

A skill that was right for TypeScript 4.x repos may be wrong for TypeScript 5.x repos. If a skill causes new failures, its failure_count rises and its confidence falls. Skills below a confidence threshold are automatically flagged for review. A human (or higher model) can update or retire the skill.

This is the same lifecycle as documentation: it starts accurate, drifts as the world changes, needs maintenance. The difference is that the system surfaces drift automatically — you don’t discover it when a junior engineer gets confused.
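
A sketch of the bookkeeping behind that drift detection follows. `recordOutcome` and the `REVIEW_THRESHOLD` value are assumptions made for illustration; the confidence formula is the one given in the `Skill` interface.

```typescript
interface SkillStats {
  success_count: number;
  failure_count: number;
  confidence: number; // success_count / (success_count + failure_count)
}

const REVIEW_THRESHOLD = 0.5; // hypothetical: below this, flag the skill for review

// Returns updated stats after one use of the skill. The input is not mutated.
function recordOutcome(skill: SkillStats, succeeded: boolean): SkillStats & { needs_review: boolean } {
  const success_count = skill.success_count + (succeeded ? 1 : 0);
  const failure_count = skill.failure_count + (succeeded ? 0 : 1);
  // At least one outcome has been recorded here, so the divisor is never zero.
  const confidence = success_count / (success_count + failure_count);
  return { success_count, failure_count, confidence, needs_review: confidence < REVIEW_THRESHOLD };
}
```

For example, a skill with three successes stays trusted after one failure (confidence 0.75) but is flagged once failures outnumber successes.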


The Compound Effect

The three mechanisms — escalation ladder, advisor/executor split, skill crystallization — interact to produce a compounding capability improvement:

Week 1:  100 jobs, 40% succeed at L0, 30% need L1 advice, 20% need L2, 10% need L3+
         → up to 60 skills written to registry (at most one per escalated job)

Week 4:  100 jobs, 65% succeed at L0 (skills), 20% need L1, 10% need L2, 5% need L3+
         → 90 more skills written (new patterns encountered)

Week 12: 100 jobs, 85% succeed at L0, 10% need L1, 4% need L2, 1% need L3+
         → escalation cost down 75%, throughput up 3x, capability up

The system gets cheaper and more capable simultaneously — not because the models improved, but because accumulated expertise is being reused.

The key insight: each failure is not a cost but an investment. The job fails, advice is generated, and that advice costs money. But the money buys a skill that removes the escalation cost for every future similar job. The payoff of an escalation event is roughly (number of future similar jobs × probability each would have escalated without the skill × cost per escalation), weighed against the one-time cost of the advice.

For a system running hundreds of repos across dozens of job types, the ROI on early escalations is high. The 50th add-biome job on a monorepo should essentially be free.
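
The arithmetic can be sketched in a few lines. This is a back-of-envelope model, not a measurement: `escalationPayoff` and its parameter names are hypothetical, and it assumes the crystallized skill removes the escalation entirely for matching jobs.

```typescript
interface PayoffInputs {
  futureJobs: number;            // similar jobs expected after this one
  escalationProbability: number; // chance each would escalate without the skill (0..1)
  costPerEscalation: number;     // advisor cost per escalated job
  costOfThisEscalation: number;  // what the advice that produced the skill cost
}

// Assumes the skill fully removes future escalations for matching jobs.
function escalationPayoff(p: PayoffInputs): { avoided: number; net: number; ratio: number } {
  const avoided = p.futureJobs * p.escalationProbability * p.costPerEscalation;
  return {
    avoided,
    net: avoided - p.costOfThisEscalation,
    ratio: avoided / p.costOfThisEscalation,
  };
}
```

With 50 future jobs, a 60% escalation probability, and the same unit cost for every escalation, the avoided cost is 30 times the unit cost, which is why early escalations on common job types pay back quickly.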


How This Mirrors Human Organizations

This is not a novel framework invented for AI. It is a faithful mapping of how high-performing engineering organizations handle problems.

| Human org | This architecture |
| --- | --- |
| Junior engineer | Level 0 executor |
| Senior engineer | Level 1-2 advisor |
| Staff / principal | Level 3-4 advisor |
| Incident post-mortem | Escalation history |
| Runbook / playbook | Skill registry |
| “Write this up in the wiki” | Skill crystallization |

When a junior fails at a task, a senior doesn’t redo the work — they diagnose, explain, and send it back. The junior learns. The next time this pattern appears, the junior handles it without escalation. The senior’s time is preserved for novel problems.

The runbook is the organizational equivalent of the skill registry. When an on-call engineer hits a novel incident, diagnoses it, and fixes it, the right response is to write a runbook entry. The next engineer who hits the same incident reads the runbook and resolves it in minutes, not hours.

The failure of most engineering organizations — and most AI agent systems — is not doing this. Fixes happen, knowledge is not crystallized, the next engineer (or agent) starts from scratch.


Connection to Claude Skills

Claude Code’s /skills system is a direct implementation of this same idea at the prompt level. A skill is a reusable prompt that fires when a pattern matches. The skill registry contains expert-authored instructions for specific tasks. When you invoke /commit, you are not prompting from scratch — you are loading a skill that encodes commit-message best practices, staged file handling, and co-author formatting.

The difference in this architecture is that skills are not hand-authored — they are extracted from successful escalation cycles. The system authors its own skills. Over time, the skill registry becomes a body of operational knowledge that represents every novel problem the system has encountered and solved.

OpenClaw, one of the leading open-source agent frameworks, has 5,400+ skills in its registry — not because the developers wrote 5,400 prompt templates, but because the framework accumulated expertise from thousands of real agent runs. A system with this architecture can reach that scale automatically.


Implementation in API MOM

The escalation ladder, advisor/executor split, and skill crystallization are all concerns that belong in API MOM — the centralized AI routing layer — not in individual agents.

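The reason is separation of concerns: when routing, advice, and skill storage live behind one service, an individual agent never needs to know which model handled any step.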

API MOM endpoints for escalation

POST /execute
  Body: { level, job_type, prompt, context, escalation_history, escalation_advice }
  Response: { result, new_level?, advice?, skill_candidate? }

POST /advise
  Body: { from_level, job_type, error, escalation_history, repo_context }
  Response: { instructions, reasoning, suggested_executor_level }

POST /skills
  Body: Skill
  Response: { id }

GET  /skills?job_type=add-biome&signals=monorepo,typescript
  Response: Skill[]

The agent (e.g. mulan) calls /execute at level 0. If it fails, API MOM automatically calls /advise at level 1 and returns the advice to the agent. The agent re-queues at level 0 with the advice. On success, API MOM calls /skills to store the pattern. The agent never needs to know which model was used at any point.
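
As a small illustration of the client side, the skill lookup URL from the endpoint list could be built like this. `skillsQueryUrl` and the `api-mom.example` host are hypothetical; only the path and the `job_type` and `signals` parameters come from the listing above.

```typescript
// Builds the GET /skills lookup URL. Signals are sent as one comma-separated
// value; URLSearchParams percent-encodes the commas, and the value decodes
// back to the same string on the server side.
function skillsQueryUrl(baseUrl: string, jobType: string, signals: string[]): string {
  // Note: a leading "/" resolves against the origin, so any path prefix
  // in baseUrl is dropped.
  const url = new URL("/skills", baseUrl);
  url.searchParams.set("job_type", jobType);
  url.searchParams.set("signals", signals.join(","));
  return url.toString();
}

const lookupUrl = skillsQueryUrl("https://api-mom.example", "add-biome", ["monorepo", "typescript"]);
```

Using the URL API instead of string concatenation keeps signals such as `packages/*` safely encoded.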


Job Payload Schema

The full payload schema for a job that supports escalation:

interface EscalationAttempt {
  level: number
  label: string            // "Static template", "Llama 8B", "Sonnet+think", etc.
  attempted_at: number
  error: string
}

interface EscalationAdvice {
  from_level: number
  label: string
  instructions: string
  reasoning?: string
  suggested_executor_level?: number  // executor level to re-queue at; omitted means level 0
}

interface JobPayload {
  // Core task data
  [key: string]: unknown

  // Escalation state (append-only during job lifecycle)
  escalation_history?: EscalationAttempt[]
  escalation_advice?: EscalationAdvice[]

  // Current execution context (updated on re-queue)
  model_spec?: ModelSpec
  active_skills?: string[]    // skill IDs injected for this attempt
}

The key constraint: escalation_history and escalation_advice are append-only. No attempt is ever removed. The complete failure narrative is always available to any advisor in the chain.
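
The append-only constraint can be checked at the point where a payload update is persisted. The guard below is a sketch: `isAppendOnly` is a hypothetical helper, and comparing entries through `JSON.stringify` assumes that entries are plain JSON objects whose key order does not change between writes.

```typescript
// True when `after` keeps every entry of `before`, unchanged and in the
// same position, and only adds entries at the end.
function isAppendOnly<T>(before: readonly T[], after: readonly T[]): boolean {
  if (after.length < before.length) return false;
  return before.every((item, i) => JSON.stringify(item) === JSON.stringify(after[i]));
}
```

A store could call this before writing `escalation_history` or `escalation_advice` and reject any update that returns false, which turns the constraint from a convention into something enforced.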


The Escalation Loop in Code

async function executeWithEscalation(
  job: Job,
  db: D1Database,
  apiMomUrl: string,
  apiMomKey: string,
): Promise<JobResult> {
  const maxLevel = 4  // beyond this: human escalation
  let level = job.escalation_level ?? 0
  const history: EscalationAttempt[] = job.payload.escalation_history ?? []
  const advice: EscalationAdvice[] = job.payload.escalation_advice ?? []
  // Keep the payload pointing at the live arrays so helpers that read job.payload see every attempt
  job.payload.escalation_history = history
  job.payload.escalation_advice = advice
  // The advisor tier is tracked separately from the executor level. It climbs one rung
  // per failed cycle, so re-queuing the executor at level 0 cannot loop forever.
  let advisorLevel = advice.length > 0 ? advice[advice.length - 1].from_level : level

  while (level <= maxLevel) {
    // 1. Load matching skills for this job type + repo context
    const skills = await fetchSkills(apiMomUrl, apiMomKey, job.type, job.payload)

    // 2. Execute at current level (via API MOM)
    const result = await apiMomExecute(apiMomUrl, apiMomKey, {
      level,
      job_type: job.type,
      prompt: buildExecutorPrompt(job, skills),
      escalation_history: history,
      escalation_advice: advice,
    })

    if (result.success) {
      // 3a. Success: extract skill if this was an escalated attempt
      if (history.length > 0 && advice.length > 0) {
        await crystallizeSkill(apiMomUrl, apiMomKey, job, advice[advice.length - 1], result)
      }
      return result
    }

    // 3b. Failure: record attempt (Unix seconds), then ask the next advisor up
    history.push({ level, label: ESCALATION_LABELS[level], attempted_at: Math.floor(Date.now() / 1000), error: result.error! })

    advisorLevel = Math.max(advisorLevel, level) + 1
    if (advisorLevel > maxLevel) break

    const advisorResponse = await apiMomAdvise(apiMomUrl, apiMomKey, {
      from_level: advisorLevel,
      job_type: job.type,
      error: result.error!,
      escalation_history: history,
      repo_context: job.payload.repo_context as string,
    })

    advice.push({
      from_level: advisorLevel,
      label: ESCALATION_LABELS[advisorLevel],
      instructions: advisorResponse.instructions,
      reasoning: advisorResponse.reasoning,
      suggested_executor_level: advisorResponse.suggested_executor_level,
    })

    // 4. Re-queue at the lowest level the advisor thinks can handle this
    //    (defaults to level 0; the advisor may name a higher one, e.g. "try level 1, not level 0")
    level = advisorResponse.suggested_executor_level ?? 0

    // Persist updated escalation state to D1
    await updateJobPayload(db, job.id, { escalation_history: history, escalation_advice: advice })
  }

  // All levels exhausted — needs human
  await escalateToHuman(db, job, history, advice)
  return { success: false, error: 'Escalation ladder exhausted — human review required' }
}

Skill Storage and Injection

async function crystallizeSkill(
  apiMomUrl: string,
  apiMomKey: string,
  job: Job,
  lastAdvice: EscalationAdvice,
  result: JobResult,
): Promise<void> {
  // Extract repo signals: what was true about this repo when the fix worked
  const repoSignals = extractRepoSignals(job.payload)

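  const res = await fetch(`${apiMomUrl}/skills`, {
    method: 'POST',
    headers: { Authorization: `Bearer ${apiMomKey}`, 'Content-Type': 'application/json' },
    body: JSON.stringify({
      pattern: {
        job_type: job.type,
        repo_signals: repoSignals,
        // Fall back to an empty list if the payload carries no history
        error_patterns: extractErrorPatterns(job.payload.escalation_history ?? []),
      },
      instructions: lastAdvice.instructions,
      source: `escalation-l${lastAdvice.from_level}`,
      success_count: 1,
      failure_count: 0,
    }),
  })
  // A failed registry write should not fail the job that just succeeded; log it instead
  if (!res.ok) console.warn(`Skill write failed with HTTP ${res.status}`)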
}

// Synchronous on purpose: callers pass the returned string straight into the request body
function buildExecutorPrompt(job: Job, skills: Skill[]): string {
  const skillContext = skills.length > 0
    ? `\n\n## Known fixes for this pattern\n${skills.map(s =>
        `[Confidence: ${Math.round(s.confidence * 100)}%] ${s.instructions}`
      ).join('\n\n')}`
    : ''

  return `${getBasePrompt(job.type)}${skillContext}\n\nRepo: ${job.repo}\n${JSON.stringify(job.payload, null, 2)}`
}

Metrics That Matter

A system running this architecture should track:

| Metric | What it tells you |
| --- | --- |
| Escalation rate by level | What fraction of jobs require L1, L2, L3+ |
| Skill hit rate | What fraction of jobs had a matching skill |
| Skill effectiveness | Jobs with skill injected: success rate vs baseline |
| Cost per job by type | Where you’re spending money; where skills would pay off most |
| Advisor accuracy | When Level-N advice was followed, what was the subsequent success rate |
| Skill confidence decay | Skills whose confidence is falling, a sign of potential drift |

The escalation rate by level is the primary health indicator. A healthy system has a steep drop-off: most jobs succeed at L0, a small fraction need L1, a tiny fraction need L2+. If L2 is handling 20% of jobs, your L0/L1 skill registry is underdeveloped.
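
The primary health indicator is simple to compute from finished jobs. In this sketch, `FinishedJob`, `escalationRateByLevel`, and `hasSteepDropOff` are hypothetical names, and `highest_level` is assumed to record the highest advisor level a job needed (0 meaning it succeeded without escalation).

```typescript
interface FinishedJob {
  id: string;
  highest_level: number; // 0 = succeeded at L0 with no escalation
}

// Fraction of jobs that topped out at each level.
function escalationRateByLevel(jobs: FinishedJob[]): Record<number, number> {
  const counts: Record<number, number> = {};
  for (const job of jobs) {
    counts[job.highest_level] = (counts[job.highest_level] ?? 0) + 1;
  }
  const rates: Record<number, number> = {};
  for (const level of Object.keys(counts)) {
    rates[Number(level)] = counts[Number(level)]! / jobs.length;
  }
  return rates;
}

// Healthy shape: each level handles no more jobs than the level below it.
function hasSteepDropOff(rates: Record<number, number>, maxLevel: number): boolean {
  for (let level = 1; level <= maxLevel; level++) {
    if ((rates[level] ?? 0) > (rates[level - 1] ?? 0)) return false;
  }
  return true;
}
```

The monotonic check is a coarse stand-in for "steep"; a stricter version might require each level to handle at most some fixed fraction of the one below.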


Anti-Patterns

The “just use GPT-4 for everything” anti-pattern

Using the most capable model for all jobs eliminates escalation cost but eliminates learning entirely. You pay top-tier prices for work that a static template could handle. The skill registry never builds. Costs scale linearly with volume.

The “retry with same model” anti-pattern

Re-running the same prompt on the same model after failure is almost never correct. If the model failed once, it will very likely fail again: the same reasoning applied to the same inputs tends to produce the same outputs, even with sampling noise. Retry is only appropriate for transient errors (rate limits, network failures), not reasoning failures.
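
One way to keep plain retries confined to transient errors is to classify each failure before deciding what to do. The classifier below is a sketch: `classifyFailure`, `nextAction`, and the marker list are assumptions, and a real system would likely rely on structured error codes rather than message matching.

```typescript
type FailureKind = "transient" | "reasoning";

interface Failure {
  httpStatus?: number;
  message: string;
}

// HTTP statuses that usually indicate a temporary condition
// (timeouts, rate limiting, upstream unavailability).
const TRANSIENT_STATUSES = [408, 429, 502, 503, 504];
// Hypothetical message markers for network-level failures.
const TRANSIENT_MARKERS = ["econnreset", "etimedout", "rate limit"];

function classifyFailure(f: Failure): FailureKind {
  if (f.httpStatus !== undefined && TRANSIENT_STATUSES.includes(f.httpStatus)) return "transient";
  const msg = f.message.toLowerCase();
  return TRANSIENT_MARKERS.some((m) => msg.includes(m)) ? "transient" : "reasoning";
}

// Transient failures are retried at the same level; everything else escalates.
function nextAction(f: Failure): "retry-same-level" | "escalate" {
  return classifyFailure(f) === "transient" ? "retry-same-level" : "escalate";
}
```

Defaulting unknown failures to "reasoning" errs on the side of escalating, which costs more per event but avoids burning retries on a problem the same model cannot solve.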

The “executor at every level” anti-pattern

Having each level of the escalation ladder independently execute the task (not advise) wastes expensive model capacity on execution and produces conflicting side effects. Level 3 should not be committing files to GitHub; it should be telling Level 0 exactly which file to commit.

The “discard history on escalation” anti-pattern

Wiping the failure history when re-queuing at a lower level with new instructions defeats the purpose of accumulated context. An advisor at Level 3 who cannot see what Level 1 already tried will suggest the same things Level 1 already suggested.

The “skills are permanent” anti-pattern

Skills must have confidence tracking and decay detection. A skill based on three successes from six months ago, applied to a different tech stack version, will generate incorrect advice with high confidence. The system should surface skill drift before it causes production failures.


The Bigger Picture

This architecture is a claim about how AI systems should grow in capability.

The dominant paradigm is model scaling: capability improves by training bigger models on more data. This is correct but external — you cannot control it, you pay for it at inference time, and it does not retain anything specific to your domain.

The alternative is operational learning: capability improves by accumulating expertise from real runs in your specific domain. The model weights do not change. But the effective capability of the system grows because the prompts get better, the skills get richer, and less is left to the model’s general reasoning.

This is how human expertise actually works. A doctor does not get better at diagnosing because their brain retrains. They get better because they accumulate cases, build pattern recognition, and develop heuristics that map symptoms to diagnoses faster than reasoning from first principles.

The skill registry is the operational memory of the system. The escalation ladder is the mechanism for generating new skills. The advisor/executor split is the allocation policy that ensures expensive reasoning is used on diagnosis, not execution.

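Built correctly, this architecture produces a system where each failure enriches the next attempt, successful fixes become permanent skills, and escalations become rarer the longer it runs.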

This is the compounding return on operational AI investment. Not from better models — from better use of the models you already have.

