API Mom as Intelligent Router

Org Status: 🟡 Dormant | Cloudflare: N/A | Last Audited: 2026-04-28

The layer between your agent and the LLM market — routes by capability, optimizes for cost, abstracts away every provider decision.

When you build a multi-agent system, you quickly discover that “which model should I use?” is the wrong question for agents to ask. Agents should describe what they need. A separate layer — the router — decides how to satisfy that need at the lowest cost, using whatever capacity is currently available.

This article describes API Mom as that layer: not just a proxy or cost meter, but an intelligent router that knows about free tiers, subscription quotas, provider health, and capability requirements — and makes the right call so your agents don’t have to.

The Problem
The Abstraction: Capability Hints, Not Model Names
Four Access Tiers
The Routing Algorithm
Tier 1: Cloudflare Workers AI
Tier 2: OpenRouter Free Models
Tier 3: Paid API (Claude, Gemini, GPT)
Tier 4: Subscription Quota via Runners
The Routing Table
Cost-Aware Degradation
Internal Architecture: Two Durable Objects
API Mom as Router: Implementation
Agent Integration: Calling the Router
OpenRouter Integration
Provider Health and Fallback Chains
What Goes Into API Mom vs What Stays in the Agent
References

Most agent implementations hardcode model selection:

// Wrong — agent knows too much about the model market
const { object } = await generateObject({
  model: anthropic('claude-sonnet-4-6'),
  schema: DecisionSchema,
  prompt: buildPrompt(context),
})

This creates several problems:

No free tier usage — Cloudflare Workers AI and OpenRouter have free models capable of handling triage and classification. Hardcoding Claude Sonnet pays $3/M tokens for work a free model could do.
No subscription leverage — Pro/Max Claude subscriptions have 5–20× the rate limits of the API. Code-fix work executed through a local runner under subscription costs zero API tokens. Hardcoding to the API never uses this.
No cost-aware fallback — when the daily budget runs low, the agent has no mechanism to downgrade to cheaper capacity.
Provider lock-in — if Anthropic has an outage or raises prices, every agent needs updating.
Surprise bills — no central enforcement of spend limits per agent, per project, per day.

The fix is not smarter agents. It is removing model selection from agents entirely.

Agents describe what they need. The router decides how to satisfy it.

// Right — agent declares capability requirements
const result = await router.complete({
  capability: 'structured-reasoning',  // what I need
  complexity: 'standard',              // how hard
  context_tokens: 2000,               // how much context
  function: 'generate-run-sheet',     // for cost attribution
  schema: RunSheetSchema,
  prompt: buildPrompt(context),
})

The agent has no knowledge of which model ran. It receives the result. API Mom handled everything else.

This means:

New free models appear → API Mom uses them automatically
Subscription runs low → API Mom downgrades gracefully
Provider outage → API Mom fails over without agent changes
Pricing changes → update the routing table in one place
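One way to make the capability hint concrete is a small typed request plus a helper that maps it onto the `X-Capability`, `X-Complexity`, `X-Function`, and `X-Project-Id` headers used later in this article. The `CapabilityHint` type and `toRouterHeaders` helper below are illustrative assumptions, not API Mom's actual types; the capability and complexity values are simply the ones that appear in this article's examples.

```typescript
// Hypothetical types; the values are taken from examples used in this article.
type Capability =
  | 'triage' | 'classification' | 'summarization' | 'structured-reasoning'
  | 'code-understanding' | 'long-context' | 'critical-decision' | 'general'
type Complexity = 'trivial' | 'standard'

interface CapabilityHint {
  capability: Capability
  complexity: Complexity
  functionName: string // cost-attribution label, e.g. 'generate-run-sheet'
  projectId: string
}

// Map a hint onto the routing headers. No model name appears anywhere.
function toRouterHeaders(hint: CapabilityHint): Record<string, string> {
  return {
    'X-Capability': hint.capability,
    'X-Complexity': hint.complexity,
    'X-Function': hint.functionName,
    'X-Project-Id': hint.projectId,
  }
}

const runSheetHeaders = toRouterHeaders({
  capability: 'structured-reasoning',
  complexity: 'standard',
  functionName: 'generate-run-sheet',
  projectId: 'default',
})
```

Because the hint is a closed union type in this sketch, a typo in a capability name fails at compile time instead of silently falling through to a default route.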

LLM access has fundamentally different cost models. The router must understand all four.

Tier 1: Free (Cloudflare Workers AI)

Cost: Free within the Workers AI free allocation; usage beyond it is billed on paid plans
Location: On-edge, same datacenter as your Worker
Models: @cf/meta/llama-3.3-70b-instruct, @cf/google/gemma-3-12b-it, others
Latency: Lowest possible — no egress
Capabilities: Classification, boolean checks, simple extraction, formatting
Not suited for: Complex reasoning, code generation, structured output with strict schemas

Tier 2: Free (OpenRouter Free Tier)

Cost: Free (rate-limited, provider-funded)
Available models (as of 2026): 50+ models including Llama 3.3 70B, Gemma 3, Mistral, DeepSeek, Qwen — see openrouter.ai/models?q=free
Capabilities: Wide range — some free models match paid mid-tier performance
Caveat: Rate limits vary. API Mom must track health and rotate between free models when one is throttled.

Tier 3: Paid API (per-token; prices below are per million input tokens)

Claude (Anthropic): Haiku ($0.80/M) → Sonnet ($3/M) → Opus ($15/M)
Gemini (Google): Flash ($0.15/M) → Flash Thinking → Pro ($1.25/M)
GPT (OpenAI): 4o-mini ($0.15/M) → 4o ($2.5/M)
When to use: Tasks requiring genuine reasoning, code understanding, or reliable structured output where free models are insufficient

Tier 4: Subscription Quota (via Runners)

Cost: Zero API tokens — consumed from Pro/Max subscription
Pro: 5× the API rate limits
Max: 20× the API rate limits
Mechanism: The local runner executes a Claude Code / Claude Agent SDK session under the subscription. This is not an API call — it is a separate execution channel entirely.
When to use: All actual code-fix work. CI debugging. Complex multi-file changes. Anything that would cost significant API tokens.

The counterintuitive result: The most capable work (code changes via Claude Code) is the cheapest, because it runs through the subscription, not the API.

Given: capability_hint, complexity, context_size, current_budget, daily_budget_remaining

1. Can Workers AI handle this capability?
   AND context_size < Workers AI limit?
   → Route to Workers AI (free, fastest)

2. Is there a healthy free OpenRouter model for this capability?
   AND free model quality sufficient for complexity?
   → Route to OpenRouter free tier

3. Is this a code-fix execution task (not a reasoning call)?
   → Delegate to runner (subscription quota, zero API cost)
     The agent makes this check upstream; the job goes to the dispatcher, not through API Mom

4. What is the cheapest paid model that satisfies the capability?
   AND daily_budget_remaining > estimated_cost?
   → Route to cheapest sufficient paid model

5. Budget exhausted for today?
   → Force-downgrade to free tier even if quality is reduced
   → Log degradation event

Priority: free → subscription → cheap paid → expensive paid. Budget exhaustion forces back to free.
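The five steps above can be sketched as a pure function. This is a minimal illustration under stated assumptions: the `RoutingInput` fields are hypothetical pre-computed answers to each question in the pseudocode, and picking `workers-ai` as the forced-downgrade target in step 5 is a choice made for this sketch, not something the algorithm prescribes.

```typescript
type Tier = 'workers-ai' | 'openrouter-free' | 'runner' | 'paid'

// Hypothetical input: each field is the answer to one question in the pseudocode.
interface RoutingInput {
  workersAiCapable: boolean        // step 1: capability and context size fit Workers AI
  freeModelHealthy: boolean        // step 2: a healthy, good-enough free OpenRouter model exists
  isCodeFixTask: boolean           // step 3: execution work, not a reasoning call
  estimatedCostUsd: number         // step 4: cost of the cheapest sufficient paid model
  dailyBudgetRemainingUsd: number
}

interface RoutingDecision {
  tier: Tier
  degraded: boolean // true when the budget forced a lower-quality tier
}

function chooseTier(input: RoutingInput): RoutingDecision {
  if (input.workersAiCapable) return { tier: 'workers-ai', degraded: false }
  if (input.freeModelHealthy) return { tier: 'openrouter-free', degraded: false }
  if (input.isCodeFixTask) return { tier: 'runner', degraded: false }
  if (input.dailyBudgetRemainingUsd > input.estimatedCostUsd) {
    return { tier: 'paid', degraded: false }
  }
  // Step 5: budget exhausted, force back to a free tier and flag the degradation
  return { tier: 'workers-ai', degraded: true }
}

const planning = chooseTier({
  workersAiCapable: false,
  freeModelHealthy: false,
  isCodeFixTask: false,
  estimatedCostUsd: 0.01,
  dailyBudgetRemainingUsd: 1.0,
})
```

Returning a `degraded` flag rather than logging inside the function keeps the decision testable; the caller can log the degradation event.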

Workers AI runs inside Cloudflare’s network. No external API call, no egress, lowest latency.

// In API Mom Worker
async function routeToWorkersAI(
  request: RouterRequest,
  env: Env,
): Promise<RouterResponse> {
  const model = selectWorkersAIModel(request.capability)

  const response = await env.AI.run(model, {
    messages: [
      { role: 'system', content: request.system ?? '' },
      { role: 'user', content: request.prompt },
    ],
    max_tokens: 512,
  })

  return {
    content: response.response ?? '',
    model: model,
    tier: 'workers-ai',
    cost_usd: 0,
  }
}

function selectWorkersAIModel(capability: string): string {
  // Larger model for reasoning, smaller for classification
  if (capability === 'classification' || capability === 'triage') {
    return '@cf/google/gemma-3-12b-it'
  }
  return '@cf/meta/llama-3.3-70b-instruct'
}

Capabilities best handled by Workers AI:

Signal triage: “is this repo missing biome.json?” → boolean
Classification: “what kind of failure is this CI log?”
Simple extraction: “list the failing test names from this output”
Formatting: “summarize this error in one sentence”

OpenRouter aggregates hundreds of models under a single OpenAI-compatible API. Many are free — provider-subsidized, rate-limited, but capable.

// OpenRouter is OpenAI-compatible — use the OpenAI provider with a custom base URL
import { createOpenAI } from '@ai-sdk/openai'

function createOpenRouterProvider(env: Env, model: string) {
  return createOpenAI({
    apiKey: env.OPENROUTER_API_KEY,
    baseURL: 'https://openrouter.ai/api/v1',
    defaultHeaders: {
      'HTTP-Referer': 'https://api-mom.workers.dev',
      'X-Title': 'API Mom Router',
    },
  })(model)
}

// Free models to rotate through (check openrouter.ai/models?q=free for current list)
const FREE_MODELS = [
  'meta-llama/llama-3.3-70b-instruct:free',
  'google/gemma-3-12b-it:free',
  'deepseek/deepseek-r1:free',
  'mistralai/mistral-7b-instruct:free',
  'qwen/qwen-2.5-72b-instruct:free',
]

API Mom must track which free models are currently healthy. When one hits its rate limit, rotate to the next. When all are throttled, escalate to paid tier.

This is issue garywu/api-mom#44 (free model routing — auto-discovery, fallback chain, tier tracking).
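The rotation behaviour described above can be sketched with an in-memory tracker. This is an assumption-laden illustration: `createRotation` and its method names are hypothetical, and a real implementation would persist throttle state (for example in the `provider_health` table) rather than in a `Map`.

```typescript
interface FreeModelRotation {
  markThrottled(model: string, untilMs: number): void
  next(nowMs: number): string | null
}

// Hypothetical helper: walks the list in order and skips models still throttled.
function createRotation(models: readonly string[]): FreeModelRotation {
  const throttledUntil = new Map<string, number>()
  return {
    markThrottled(model, untilMs) {
      throttledUntil.set(model, untilMs)
    },
    next(nowMs) {
      for (const model of models) {
        // A model is usable again once its throttle deadline has passed
        if ((throttledUntil.get(model) ?? 0) <= nowMs) return model
      }
      return null // every free model is throttled: caller escalates to the paid tier
    },
  }
}

const rotation = createRotation([
  'meta-llama/llama-3.3-70b-instruct:free',
  'google/gemma-3-12b-it:free',
])
rotation.markThrottled('meta-llama/llama-3.3-70b-instruct:free', 60_000)
```

Passing the current time in as `nowMs` instead of calling `Date.now()` inside keeps the rotation deterministic under test.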

When free tiers are insufficient, route to the cheapest model that satisfies capability requirements.

const PAID_ROUTES: Record<string, PaidRoute[]> = {
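  // Listed in preference order within each capability, not strictly cheapest-first;
  // sort by cost_per_m_input when the cheapest sufficient model is wanted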
  'structured-reasoning': [
    { provider: 'anthropic', model: 'claude-haiku-4-5-20251001', cost_per_m_input: 0.80 },
    { provider: 'google',    model: 'gemini-2.0-flash',          cost_per_m_input: 0.10 },
    { provider: 'anthropic', model: 'claude-sonnet-4-6',         cost_per_m_input: 3.00 },
  ],
  'code-understanding': [
    { provider: 'anthropic', model: 'claude-sonnet-4-6',         cost_per_m_input: 3.00 },
    { provider: 'anthropic', model: 'claude-opus-4-6',           cost_per_m_input: 15.0 },
  ],
  'long-context': [
    { provider: 'google',    model: 'gemini-2.5-pro',            cost_per_m_input: 1.25 },
    { provider: 'anthropic', model: 'claude-sonnet-4-6',         cost_per_m_input: 3.00 },
  ],
  'critical-decision': [
    { provider: 'anthropic', model: 'claude-opus-4-6',           cost_per_m_input: 15.0 },
    { provider: 'google',    model: 'gemini-2.5-pro',            cost_per_m_input: 1.25 },
  ],
}
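Selecting from such a table could look like the sketch below. It assumes every entry listed under a capability is sufficient for that capability, and it estimates cost from input tokens only, ignoring output tokens; `estimateInputCostUsd` and `cheapestAffordable` are hypothetical helper names.

```typescript
interface PaidRoute {
  provider: string
  model: string
  cost_per_m_input: number // USD per million input tokens
}

function estimateInputCostUsd(route: PaidRoute, inputTokens: number): number {
  return (inputTokens / 1_000_000) * route.cost_per_m_input
}

// Returns the cheapest route if its estimated cost fits the remaining budget, else null.
function cheapestAffordable(
  routes: readonly PaidRoute[],
  inputTokens: number,
  budgetRemainingUsd: number,
): PaidRoute | null {
  const cheapest = [...routes].sort((a, b) => a.cost_per_m_input - b.cost_per_m_input)[0]
  if (!cheapest) return null
  return estimateInputCostUsd(cheapest, inputTokens) <= budgetRemainingUsd ? cheapest : null
}

const structuredReasoning: PaidRoute[] = [
  { provider: 'anthropic', model: 'claude-haiku-4-5-20251001', cost_per_m_input: 0.8 },
  { provider: 'google', model: 'gemini-2.0-flash', cost_per_m_input: 0.1 },
  { provider: 'anthropic', model: 'claude-sonnet-4-6', cost_per_m_input: 3.0 },
]

const picked = cheapestAffordable(structuredReasoning, 2000, 1.0)
```

Sorting a copy leaves the table's declared order untouched, so the same table can still express a preference order for other callers.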

This tier is architecturally different from the others. It is not an LLM API call. It is a job delegation.

When an agent needs code-level work done (fix a TypeScript error, debug a failing CI run, create a PR), it does not call an LLM directly. It submits a job to the dispatcher, which routes it to the local runner, which executes a full Claude Agent SDK session on a machine running under a Pro or Max subscription.

Agent (Prime DO) → Dispatcher → Local Runner → Claude Agent SDK session
                                                ↑
                                     Pro: 5× rate limits
                                     Max: 20× rate limits
                                     Cost: $0 API tokens

The key properties:

The agent does not wait for this. It submits a job and hibernates.
The runner reports outcome back to the dispatcher, which notifies the agent.
The Claude session runs to completion under the subscription, handling retries internally.
No API key needed on the runner — the subscription is the authentication.

What this means for routing: API Mom should never be involved in runner execution. The routing decision is made upstream — if the task requires code execution, delegate to dispatcher, not to an LLM API call.

// In Prime's reasoning cycle
async function executeDecision(action: Action, env: Env) {
  if (action.type === 'fix-code' || action.type === 'debug-ci') {
    // Does NOT go through API Mom — goes to dispatcher
    await submitDispatcherJob(action, env)
    return
  }
  // Everything else goes through API Mom
  await routerCall(action, env)
}

Putting it all together — what goes where:

Task	Capability	Tier	Model/Channel	Cost
“Does this repo have biome.json?”	`triage`	Workers AI	Gemma 3 12B	Free
“What’s the CI failure type?”	`classification`	Workers AI	Gemma 3 12B	Free
“Summarize these 3 failures”	`summarization`	OpenRouter free	Llama/Mistral	Free
“Generate run sheet for 40 repos”	`structured-reasoning`	Paid API	Haiku / Gemini Flash	~$0.01
“Re-plan after 3 failures”	`structured-reasoning`	Paid API	Sonnet	~$0.05
“Fix TypeScript error in this file”	`code-execution`	Runner (subscription)	Claude Agent SDK	$0 API
“Debug failing CI across 5 files”	`code-execution`	Runner (subscription)	Claude Agent SDK	$0 API
“Critical architecture decision”	`critical-decision`	Paid API	Opus / Gemini Pro	~$0.20

API Mom tracks daily spend per project. When budget is running low, it automatically degrades:

async function selectTier(
  request: RouterRequest,
  db: D1Database,
): Promise<Tier> {
  const { spent_today, daily_limit } = await getDailyBudget(request.project_id, db)
  // Guard: a zero or missing limit would otherwise divide by zero
  const remaining_fraction = daily_limit > 0 ? 1 - spent_today / daily_limit : 0

  // Plenty of budget — use best fit
  if (remaining_fraction > 0.5) return selectBestFit(request)

  // Getting low — force to free tiers
  if (remaining_fraction > 0.1) return tryFreeTiers(request)

  // Almost out — Workers AI or nothing
  if (remaining_fraction > 0) return 'workers-ai'

  // Exhausted — reject with 429, log event
  throw new BudgetExhaustedException(request.project_id)
}

Agents receive an X-Budget-Remaining-Fraction header on every response. They can check this to decide whether to schedule non-urgent work for tomorrow.

This is the hierarchical budget system in garywu/api-mom#18.
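On the agent side, acting on the `X-Budget-Remaining-Fraction` header might look like the following sketch. The `shouldDefer` helper, the `Urgency` type, and the 0.5 default threshold are assumptions for illustration; the threshold mirrors the "plenty of budget" boundary in `selectTier`, but an agent could pick any value.

```typescript
type Urgency = 'urgent' | 'routine'

// Hypothetical agent-side helper: decide whether to push work to tomorrow.
function shouldDefer(
  headerValue: string | null,
  urgency: Urgency,
  threshold = 0.5,
): boolean {
  if (urgency === 'urgent') return false
  // Missing or empty header: nothing indicates budget pressure, so proceed
  if (headerValue === null || headerValue.trim() === '') return false
  const fraction = Number(headerValue)
  if (!Number.isFinite(fraction)) return false
  return fraction < threshold
}

const deferRoutine = shouldDefer('0.3', 'routine')
const deferUrgent = shouldDefer('0.3', 'urgent')
```

Treating a missing or malformed header as "do not defer" keeps agents working if the header is ever dropped by an intermediary.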

API Mom applies the same control plane / data plane split it serves — recursively, to itself.

Router DO (data plane — hot path)
  Every request → SQLite lookup → proxy → record
  No LLM calls. Ever.
  Latency: sub-millisecond routing decision

Optimizer DO (control plane — cold path)
  Wakes on alarm (hourly/daily)
  Reads Router DO metrics → one LLM call → updates routing table
  Hibernates. Zero cost between wakes.

Router DO (hot path)

The Router DO handles every incoming request. Its only job: read the routing table, pick a tier, forward the request, record the outcome. No reasoning, no LLM, no external calls beyond the forwarded request itself.

export class RouterDO extends Agent<Env, RouterState> {
  async onStart() {
    // Routing table lives in DO SQLite — in-memory fast on warm DO
    this.sql`CREATE TABLE IF NOT EXISTS routing_table (
      capability    TEXT NOT NULL,
      complexity    TEXT NOT NULL,
      tier          TEXT NOT NULL,   -- 'workers-ai' | 'openrouter-free' | 'paid'
      provider      TEXT NOT NULL,
      model         TEXT NOT NULL,
      priority      INTEGER NOT NULL DEFAULT 0,
      updated_at    INTEGER NOT NULL
    )`
    this.sql`CREATE TABLE IF NOT EXISTS provider_health (
      model         TEXT PRIMARY KEY,
      status        TEXT NOT NULL,   -- 'healthy' | 'throttled' | 'down'
      throttled_until INTEGER,
      error_count   INTEGER DEFAULT 0,
      last_checked  INTEGER NOT NULL
    )`
    this.sql`CREATE TABLE IF NOT EXISTS call_ledger (
      id            TEXT PRIMARY KEY,
      project_id    TEXT NOT NULL,
      capability    TEXT NOT NULL,
      tier          TEXT NOT NULL,
      model         TEXT NOT NULL,
      cost_usd      REAL NOT NULL DEFAULT 0,
      latency_ms    INTEGER,
      status        TEXT NOT NULL,   -- 'success' | 'error' | 'throttled'
      is_fallback   INTEGER NOT NULL DEFAULT 0,  -- 1 when the pass-through route served the call
      called_at     INTEGER NOT NULL
    )`
  }

  async onRequest(request: Request): Promise<Response> {
    if (request.method !== 'POST') return new Response('not found', { status: 404 })

    const capability = request.headers.get('X-Capability') ?? 'general'
    const complexity = request.headers.get('X-Complexity') ?? 'standard'
    const projectId  = request.headers.get('X-Project-Id') ?? 'default'

    // Pure SQLite lookup — no LLM, no external call
    const route = this.selectRoute(capability, complexity)
    const start = Date.now()

    try {
      const response = await this.forwardRequest(request, route)
      const latency = Date.now() - start

      this.recordOutcome({ projectId, capability, route, latency, status: 'success' })

      return new Response(response.body, {
        status: response.status,
        headers: {
          ...Object.fromEntries(response.headers),
          'X-Tier-Used':  route.tier,
          'X-Model-Used': route.model,
          'X-Latency-Ms': String(latency),
        },
      })
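    } catch (err) {
      // Record the failure so the Optimizer can see it, then mark the model unhealthy
      this.recordOutcome({ projectId, capability, route, latency: Date.now() - start, status: 'error' })
      this.markProviderUnhealthy(route.model)
      throw err
    }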
  }

  private selectRoute(capability: string, complexity: string): Route {
    // 1. Try routing table — smart path
    try {
      const routes = [...this.sql`
        SELECT * FROM routing_table
        WHERE capability = ${capability} AND complexity = ${complexity}
        AND model NOT IN (
          SELECT model FROM provider_health
          WHERE status = 'down'
             OR (status = 'throttled' AND throttled_until > ${Date.now()})
        )
        ORDER BY priority DESC
      `]
      if (routes.length > 0) return routes[0] as Route
    } catch {
      // Routing table corrupted or missing — fall through to dumb pass-through
      this.recordEvent('routing-table-error')
    }

    // 2. Dumb pass-through — always present, never depends on routing table
    // Configured via env vars, not routing table. Always works.
    return this.passthroughRoute()
  }

  private passthroughRoute(): Route {
    // Hard-coded fallback. Reads from env, never from SQLite.
    // This path must work even if the entire DO SQLite is corrupted.
    //
    // Primary fallback: Workers AI — on-edge, no external API call, no key needed.
    // If Workers AI fails, Cloudflare itself is down → Worker is also down → unreachable.
    // Secondary fallback (last resort): Anthropic API key from env.
    return {
      tier: 'workers-ai',
      provider: 'cloudflare',
      model: '@cf/meta/llama-3.3-70b-instruct',
      priority: -1,
      is_fallback: true,
    }
  }

  private lastResortRoute(): Route {
    // Only reached if Workers AI itself fails — extremely rare.
    // Anthropic API key configured in wrangler.jsonc secrets.
    return {
      tier: 'paid',
      provider: 'anthropic',
      model: this.env.LAST_RESORT_MODEL ?? 'claude-haiku-4-5-20251001',
      priority: -2,
      is_fallback: true,
      is_last_resort: true,
    }
  }
}

Optimizer DO (cold path)

The Optimizer wakes on alarm, reads the Router DO’s performance data, and updates the routing table. It makes exactly one LLM call per cycle — routed through the Router DO itself, which sends it to the free tier.

export class OptimizerDO extends Agent<Env, OptimizerState> {
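  async onStart() {
    // Wake every hour to review performance.
    // onStart runs on every wake, so only schedule when nothing is already pending.
    if (this.getSchedules().length === 0) {
      await this.schedule(3600, 'optimize', {})
    }
  }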

  async optimize(_: unknown) {
    // Read Router DO's performance data
    const routerDO = this.env.ROUTER_DO.get(
      this.env.ROUTER_DO.idFromName('singleton')
    )
    const metrics = await routerDO.fetch('https://router/metrics').then(r => r.json())

    // Discover new free models from OpenRouter
    const freeModels = await this.discoverFreeModels()

    // ONE LLM call — routed through Router DO (hits free tier)
    // The Optimizer uses the Router it's optimizing
    const { object: updates } = await generateObject({
      model: this.createRouterModel('triage', 'trivial', 'optimizer'),
      schema: RoutingUpdateSchema,
      prompt: buildOptimizerPrompt(metrics, freeModels),
    })

    // Apply updates to Router DO's routing table
    for (const update of updates.changes) {
      await routerDO.fetch('https://router/routing-table', {
        method: 'PATCH',
        body: JSON.stringify(update),
      })
    }

    // Schedule next optimization
    await this.schedule(3600, 'optimize', {})
  }
}

Fallback Is Not Optional

The routing table is an optimization layer. The dumb pass-through is the guaranteed baseline. They are not the same thing and one must never depend on the other.

Smart path (routing table):     present when Optimizer has run
Dumb pass-through (env vars):   always present, always works

If routing table empty:         dumb pass-through
If routing table corrupted:     dumb pass-through
If all providers throttled:     dumb pass-through
If Optimizer DO is down:        dumb pass-through
If DO SQLite fails:             dumb pass-through

The fallback is configured in wrangler.jsonc (env vars), never in the routing table. This is load-bearing: if the routing table is the problem, it cannot also be the solution.

// wrangler.jsonc — fallback hierarchy, none depend on routing table
{
  "ai": { "binding": "AI" },   // Workers AI binding — primary fallback, always present

  "vars": {
    "LAST_RESORT_MODEL": "claude-haiku-4-5-20251001"  // only if Workers AI fails
  },
  "secrets": ["ANTHROPIC_API_KEY"]  // last-resort key only — not the primary path
}

Primary fallback: Workers AI. On-edge, no external call, no API key, co-located with the Router DO. The design assumes that if Workers AI is unreachable, Cloudflare itself is likely down, in which case the Worker hosting the Router is down too and callers cannot reach any fallback. Workers AI can still fail on its own (a model error or an exhausted usage allocation, for example), which is the case the last-resort route covers.

Last resort: Anthropic API. Only exists for belt-and-suspenders. In practice, never used.

Every fallback use is logged. The Optimizer reads fallback frequency as a signal: high fallback rate means the routing table has a problem. It investigates and fixes. But during the investigation, callers never notice — they kept getting responses.

This is the same principle as a circuit breaker in distributed systems: the smart path is tried first; on failure, the system degrades gracefully to a known-good baseline rather than failing entirely.
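The ordering of smart path, pass-through, and last resort can be sketched as a generic "first successful attempt" helper. This is an illustration, not the Router DO's code: `Attempt` and `firstSuccessful` are hypothetical names, and the attempts in the usage example are stand-ins for real provider calls.

```typescript
interface Attempt<T> {
  name: string
  run: () => Promise<T>
}

interface AttemptResult<T> {
  name: string     // which attempt produced the value
  value: T
  failed: string[] // attempts that were tried and failed before it
}

// Try each attempt in order; return the first success along with what failed first.
async function firstSuccessful<T>(attempts: Attempt<T>[]): Promise<AttemptResult<T>> {
  const failed: string[] = []
  for (const attempt of attempts) {
    try {
      const value = await attempt.run()
      return { name: attempt.name, value, failed }
    } catch {
      failed.push(attempt.name) // keep going: the next attempt is the fallback
    }
  }
  throw new Error(`all routes failed: ${failed.join(', ')}`)
}

const outcome = firstSuccessful<string>([
  { name: 'routing-table', run: async () => { throw new Error('corrupted') } },
  { name: 'workers-ai-passthrough', run: async () => 'ok' },
  { name: 'last-resort', run: async () => 'unused' },
])
```

The `failed` list is what makes fallback use observable: it can be written to the ledger so that a rising fallback rate shows up as a signal.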

The Recursive Property

The Optimizer uses the Router to make its own LLM calls. The Router routes those calls to Workers AI or an OpenRouter free model. The system that improves the routing table uses the routing table to decide how to improve the routing table.

This means:

The Optimizer is always cheap — it routes to the cheapest available tier
As the routing table improves, the Optimizer’s own costs decrease
The system self-bootstraps toward lower cost over time

Optimizer wakes → calls Router → Router selects Workers AI (free)
→ LLM analyzes metrics → suggests routing table changes
→ Router now routes more calls to Workers AI
→ Next Optimizer wake: Router is better → Optimizer costs even less

The routing endpoint is OpenAI-compatible (issue garywu/api-mom#53), so any OpenAI-compatible client or AI SDK provider can point at it.

// POST /v1/chat/completions — unified routing endpoint
app.post('/v1/chat/completions', async (c) => {
  const body = await c.req.json()
  const capabilityHint = c.req.header('X-Capability') ?? 'general'
  const complexity = c.req.header('X-Complexity') ?? 'standard'
  const projectId = c.req.header('X-Project-Id') ?? 'default'
  const functionName = c.req.header('X-Function') ?? 'unknown'

  // Select tier
  const tier = await selectTier({ capability: capabilityHint, complexity, project_id: projectId }, c.env.DB)

  // Route to tier
  let response: LLMResponse
  switch (tier) {
    case 'workers-ai':
      response = await routeToWorkersAI(body, c.env)
      break
    case 'openrouter-free':
      response = await routeToOpenRouterFree(body, c.env)
      break
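    case 'paid':  // same tier name as the routing_table and provider_health schemas
      response = await routeToPaidAPI(body, capabilityHint, c.env)
      break
    default:
      // Unknown tier: fail loudly rather than use an unassigned response
      return c.json({ error: `unsupported tier: ${tier}` }, 500)
  }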

  // Record cost
  await recordCost({ project_id: projectId, function: functionName, tier, cost_usd: response.cost_usd }, c.env.DB)

  return c.json(formatAsOpenAIResponse(response), 200, {
    'X-Tier-Used': tier,
    'X-Model-Used': response.model,
    'X-Cost-Usd': response.cost_usd.toFixed(6),
    // Header values must be strings
    'X-Budget-Remaining-Fraction': String(await getBudgetFraction(projectId, c.env.DB)),
  })
})
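The handler above relies on a `formatAsOpenAIResponse` helper that is not shown. A minimal sketch of such a conversion follows, under these assumptions: `toChatCompletion` is a hypothetical stand-in, the `finish_reason` is always reported as `stop`, the timestamp-based `id` is a placeholder rather than a unique identifier, and token usage is omitted.

```typescript
interface LLMResponse {
  content: string
  model: string
  tier: string
  cost_usd: number
}

// Subset of the OpenAI chat completion response shape that clients typically read.
interface ChatCompletion {
  id: string
  object: 'chat.completion'
  created: number
  model: string
  choices: {
    index: number
    message: { role: 'assistant'; content: string }
    finish_reason: 'stop'
  }[]
}

function toChatCompletion(
  res: LLMResponse,
  createdSec: number = Math.floor(Date.now() / 1000),
): ChatCompletion {
  return {
    id: `chatcmpl-${createdSec}`, // placeholder id, not guaranteed unique
    object: 'chat.completion',
    created: createdSec,
    model: res.model, // the model that actually ran, not the one the caller named
    choices: [
      { index: 0, message: { role: 'assistant', content: res.content }, finish_reason: 'stop' },
    ],
  }
}

const completion = toChatCompletion(
  { content: 'hello', model: '@cf/google/gemma-3-12b-it', tier: 'workers-ai', cost_usd: 0 },
  1_700_000_000,
)
```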

The fallback pass-through is the first thing to implement and the first thing to test. Every other feature is built on top of it. If the fallback breaks, every routing failure becomes a caller-visible error instead of a degraded-but-working response.

// src/router.test.ts
import { describe, it, expect, beforeEach } from 'vitest'
import { RouterDO } from './router-do'
import { createMiniflareEnv } from './test-helpers'

describe('RouterDO fallback pass-through', () => {
  let env: Env

  beforeEach(async () => {
    env = await createMiniflareEnv({
      LAST_RESORT_MODEL: 'claude-haiku-4-5-20251001',  // same var name as wrangler.jsonc
      ANTHROPIC_API_KEY: 'test-key',
    })
  })

  it('uses Workers AI fallback when routing table is empty', async () => {
    const router = new RouterDO(env)
    // No routing table entries — DO SQLite is fresh/empty

    const route = router.selectRoute('structured-reasoning', 'standard')

    expect(route.is_fallback).toBe(true)
    expect(route.tier).toBe('workers-ai')           // Cloudflare-native, no egress
    expect(route.model).toBe('@cf/meta/llama-3.3-70b-instruct')
    expect(route.provider).toBe('cloudflare')
  })

  it('uses fallback when all providers are throttled', async () => {
    const router = new RouterDO(env)
    // Seed routing table with one route, but mark it throttled
    await router.seedRoutingTable([
      { capability: 'structured-reasoning', complexity: 'standard',
        tier: 'openrouter-free', model: 'meta-llama/llama-3.3-70b-instruct:free' }
    ])
    await router.markProviderThrottled('meta-llama/llama-3.3-70b-instruct:free', Date.now() + 3600_000)

    const route = router.selectRoute('structured-reasoning', 'standard')

    expect(route.is_fallback).toBe(true)
  })

  it('uses fallback when routing table is corrupted', async () => {
    const router = new RouterDO(env)
    // Corrupt the routing table
    await router.corruptRoutingTableForTest()

    // Must not throw — must return fallback
    const route = router.selectRoute('structured-reasoning', 'standard')

    expect(route.is_fallback).toBe(true)
  })

  it('uses fallback when Optimizer DO has never run', async () => {
    // Fresh deploy — Optimizer hasn't touched routing table yet
    const router = new RouterDO(env)

    const route = router.selectRoute('triage', 'trivial')

    // Even triage — which should go to Workers AI eventually — falls back on first run
    expect(route.is_fallback).toBe(true)
  })

  it('records every fallback use in call_ledger', async () => {
    const router = new RouterDO(env)

    router.selectRoute('structured-reasoning', 'standard')

    const events = [...router.sql`SELECT * FROM call_ledger WHERE is_fallback = 1`]
    expect(events).toHaveLength(1)
    expect(events[0].capability).toBe('structured-reasoning')
  })

  it('smart route takes over once routing table is populated', async () => {
    const router = new RouterDO(env)
    await router.seedRoutingTable([
      { capability: 'triage', complexity: 'trivial',
        tier: 'workers-ai', model: '@cf/google/gemma-3-12b-it', priority: 10 }
    ])

    const route = router.selectRoute('triage', 'trivial')

    expect(route.is_fallback).toBe(false)
    expect(route.tier).toBe('workers-ai')
    expect(route.model).toBe('@cf/google/gemma-3-12b-it')
  })
})

These six tests define the contract: the fallback works in every failure mode, it logs every use, and smart routes take over correctly once the Optimizer populates the table. All six must pass before any routing feature ships.

See garywu/api-mom#106 — implement and test fallback pass-through first.

From an agent (Prime DO or any Cloudflare Worker), routing through API Mom takes one small helper:

// API_MOM_URL and PROJECT_ID are set in wrangler.jsonc; the provider below is pointed at API Mom
// The agent never knows which model ran

import { createAnthropic } from '@ai-sdk/anthropic'
import { generateObject } from 'ai'

function createRouterModel(env: Env, capability: string, complexity: string, functionName: string) {
  const proxiedFetch = async (url: RequestInfo | URL, init?: RequestInit) => {
    const headers = new Headers(init?.headers)
    headers.set('X-Api-Key', env.API_MOM_KEY)
    headers.set('X-Capability', capability)
    headers.set('X-Complexity', complexity)
    headers.set('X-Function', functionName)
    headers.set('X-Project-Id', env.PROJECT_ID)
    return fetch(url, { ...init, headers })
  }

  // Points at API Mom — which model actually runs is API Mom's decision
  return createAnthropic({
    apiKey: 'routed',
    baseURL: `${env.API_MOM_URL}/v1`,
    fetch: proxiedFetch,
  })('claude-sonnet-4-6')  // This model ID is ignored — API Mom overrides it
}

// Usage in Prime's wake cycle
const model = createRouterModel(env, 'structured-reasoning', 'standard', 'generate-run-sheet')
const { object: runSheet } = await generateObject({
  model,
  schema: RunSheetSchema,
  prompt: buildRunSheetPrompt(context),
})

The agent passes capability and complexity. The model name in createAnthropic() is a placeholder — API Mom overrides the routing decision.

OpenRouter is the primary free model source. It provides an OpenAI-compatible API with 50+ free models.

// API Mom's OpenRouter free tier handler
async function routeToOpenRouterFree(
  body: ChatCompletionRequest,
  env: Env,
): Promise<LLMResponse> {
  const healthyModels = await getHealthyFreeModels(env.DB)

  for (const model of healthyModels) {
    try {
      const res = await fetch('https://openrouter.ai/api/v1/chat/completions', {
        method: 'POST',
        headers: {
          'Authorization': `Bearer ${env.OPENROUTER_API_KEY}`,
          'HTTP-Referer': 'https://api-mom.workers.dev',
          'X-Title': 'API Mom',
          'Content-Type': 'application/json',
        },
        body: JSON.stringify({ ...body, model }),
      })

      if (res.status === 429) {
        await markModelThrottled(model, env.DB)
        continue  // try next free model
      }

      if (!res.ok) {
        await markModelUnhealthy(model, env.DB)
        continue
      }

      const data = await res.json() as OpenAIResponse
      await markModelHealthy(model, env.DB)
      return { content: data.choices[0].message.content, model, tier: 'openrouter-free', cost_usd: 0 }

    } catch {
      await markModelUnhealthy(model, env.DB)
    }
  }

  // All free models exhausted — escalate
  throw new FreeTierExhaustedException()
}

API Mom periodically rediscovers healthy free models (issue garywu/api-mom#48).
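Discovery could be as simple as filtering a model listing for zero-priced entries. The sketch below assumes the listing (for example, the `data` array from OpenRouter's models endpoint) carries per-model `pricing.prompt` and `pricing.completion` values as numeric strings; treat that payload shape and the `filterFreeModels` name as assumptions to verify against the OpenRouter API reference.

```typescript
// Assumed shape of one entry in the model listing; only the fields used here.
interface ListedModel {
  id: string
  pricing: { prompt: string; completion: string }
}

// Keep only models whose prompt and completion prices are both zero.
function filterFreeModels(models: readonly ListedModel[]): string[] {
  return models
    .filter((m) => Number(m.pricing.prompt) === 0 && Number(m.pricing.completion) === 0)
    .map((m) => m.id)
}

// Hand-written sample data for illustration, not a real API response.
const sampleListing: ListedModel[] = [
  { id: 'meta-llama/llama-3.3-70b-instruct:free', pricing: { prompt: '0', completion: '0' } },
  { id: 'anthropic/claude-sonnet-4-6', pricing: { prompt: '0.000003', completion: '0.000015' } },
  { id: 'google/gemma-3-12b-it:free', pricing: { prompt: '0', completion: '0' } },
]

const discovered = filterFreeModels(sampleListing)
```

Filtering on price rather than on a `:free` suffix avoids depending on a naming convention, at the cost of depending on the pricing fields being present.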

-- API Mom D1 schema
CREATE TABLE provider_health (
  model TEXT PRIMARY KEY,
  tier TEXT NOT NULL,           -- 'workers-ai' | 'openrouter-free' | 'paid'
  status TEXT NOT NULL,         -- 'healthy' | 'throttled' | 'degraded' | 'down'
  throttled_until INTEGER,      -- epoch ms — when to retry
  last_checked INTEGER NOT NULL,
  error_count INTEGER DEFAULT 0
);

CREATE TABLE api_calls_ledger (
  id TEXT PRIMARY KEY,
  project_id TEXT NOT NULL,
  function_name TEXT NOT NULL,
  tier TEXT NOT NULL,
  model TEXT NOT NULL,
  input_tokens INTEGER,
  output_tokens INTEGER,
  cost_usd REAL NOT NULL DEFAULT 0,
  called_at INTEGER NOT NULL
);

CREATE TABLE daily_budgets (
  project_id TEXT PRIMARY KEY,
  daily_limit_usd REAL NOT NULL,
  spent_today_usd REAL NOT NULL DEFAULT 0,
  reset_at INTEGER NOT NULL     -- next midnight UTC
);
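The `reset_at` column implies a rollover step before each spend is recorded. A minimal sketch of that logic, written as pure functions over a row shaped like `daily_budgets`; `nextMidnightUtc` and `applySpend` are hypothetical helpers, and a real implementation would perform the update inside a single D1 statement or transaction.

```typescript
interface DailyBudget {
  project_id: string
  daily_limit_usd: number
  spent_today_usd: number
  reset_at: number // epoch ms of the next midnight UTC
}

// Epoch ms of the first midnight UTC strictly after the day containing nowMs.
function nextMidnightUtc(nowMs: number): number {
  const d = new Date(nowMs)
  // Date.UTC normalizes day overflow, so the last day of a month rolls into the next
  return Date.UTC(d.getUTCFullYear(), d.getUTCMonth(), d.getUTCDate() + 1)
}

// Reset the counter if the reset time has passed, then add the new cost.
function applySpend(budget: DailyBudget, costUsd: number, nowMs: number): DailyBudget {
  const current = nowMs >= budget.reset_at
    ? { ...budget, spent_today_usd: 0, reset_at: nextMidnightUtc(nowMs) }
    : budget
  return { ...current, spent_today_usd: current.spent_today_usd + costUsd }
}

const stale: DailyBudget = {
  project_id: 'default',
  daily_limit_usd: 5,
  spent_today_usd: 4,
  reset_at: Date.UTC(2026, 3, 28), // already in the past at noon on 2026-04-28
}
const afterSpend = applySpend(stale, 0.5, Date.UTC(2026, 3, 28, 12))
```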

Decision	Where it lives	Why
Which model to use	API Mom	Provider-agnostic, changes with market
Which free models are healthy	API Mom	Shared across all agents, needs central tracking
Whether budget allows a call	API Mom	Enforced centrally, not per-agent
What capability is needed	Agent	Agent knows its own task
Whether to use runner vs API	Agent	Architectural decision, not routing
How to interpret the response	Agent	Domain-specific
Retry on structured output failure	Vercel AI SDK	Handled by `generateObject()` internally

The agent describes the what and the how hard. API Mom decides the which model and how much it costs.

The routing hierarchy produces a remarkable property: the system stays intelligent at essentially zero cost during idle periods.

When Prime wakes on a 6-hour alarm and needs to triage 40 repos:

Signal checks (boolean: “does this file exist?”) → Workers AI → $0
Failure classification (“what kind of CI error is this?”) → OpenRouter free → $0
Run sheet generation (complex structured reasoning) → Claude Haiku → ~$0.01
Actual code fix → local runner, subscription → $0 API tokens

A full org maintenance cycle — 40 repos checked, 5 jobs dispatched — might cost $0.01–0.05 in API tokens, with the expensive work (code execution) consuming zero because it runs under subscription.

This is not “cheap AI.” It is intelligence that scales to zero. The system is always present, always reasoning, always acting — but consumes nothing when idle, and almost nothing when active. The cost curve looks like serverless compute, not a running LLM instance.

Cost vs Activity:
  Idle (no alarms firing):              $0.00 / hour
  Active triage (Workers AI):           $0.00 / cycle
  Active planning (free + Haiku):       $0.01 / cycle
  Active execution (runner):            $0.00 API / session
  Heavy re-planning (Sonnet):           $0.05 / cycle  ← rare
  Critical decision (Opus):            $0.20 / decision ← very rare

Compare to a naive implementation that calls Claude Sonnet for every decision: $3/M tokens × continuous operation = hundreds of dollars per month for a 40-repo org. The routing tier reduces this by 95%+ while delivering the same outcomes.
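The comparison above can be made concrete with a back-of-the-envelope calculation. Every input here is an assumption chosen for illustration, not a measurement: 20,000 input tokens per repo per cycle is a guess, output tokens are ignored, and the routed figure uses the upper end of the $0.01–0.05 per-cycle range quoted earlier. Under those inputs the naive path comes to roughly $288 per month against roughly $6 routed.

```typescript
// All constants are illustrative assumptions, not measured values.
const REPOS = 40
const INPUT_TOKENS_PER_REPO = 20_000    // assumed context per repo per cycle
const CYCLES_PER_DAY = 4                // one wake every 6 hours
const DAYS_PER_MONTH = 30
const SONNET_USD_PER_M_INPUT = 3
const ROUTED_USD_PER_CYCLE = 0.05       // upper end of the range quoted above

const cyclesPerMonth = CYCLES_PER_DAY * DAYS_PER_MONTH

// Naive: every repo check in every cycle goes to Sonnet.
function naiveMonthlyUsd(): number {
  const tokensPerCycle = REPOS * INPUT_TOKENS_PER_REPO
  return (tokensPerCycle / 1_000_000) * SONNET_USD_PER_M_INPUT * cyclesPerMonth
}

// Routed: only the planning call in each cycle costs API tokens.
function routedMonthlyUsd(): number {
  return ROUTED_USD_PER_CYCLE * cyclesPerMonth
}

const naive = naiveMonthlyUsd()
const routed = routedMonthlyUsd()
const reduction = 1 - routed / naive // fraction of spend avoided by routing
```

Changing `INPUT_TOKENS_PER_REPO` moves the naive figure linearly, so the size of the reduction depends heavily on how much context each check really needs.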

The routing logic described in this article is, in itself, a product.

Every team building AI agents faces the same problem: LLM costs are unpredictable, provider choices are fragmented, and the free capacity that exists (Workers AI, OpenRouter free tier, subscription quota) is invisible unless you build infrastructure to use it.

API Mom’s router is that infrastructure. As a service:

You bring: your agent code, your use case
API Mom brings: routing intelligence, free tier access, provider health tracking, cost attribution, budget enforcement
You pay: a fraction of what you’d pay calling providers directly
API Mom keeps: the margin between what you pay and what the optimal routing costs

The value proposition: pay API Mom less than you’d pay Anthropic directly, and get smarter routing as a bonus.

This works because:

Most tasks don’t need Claude Sonnet — they just default to it because that’s what developers hardcode
Free capacity exists at scale that individual teams can’t effectively aggregate
Subscription quota is systematically under-utilized — routing into it turns idle capacity into value
Provider health and fallback chains require operational infrastructure most teams won’t build

The product positioning: “Pay less for the same intelligence. Free tiers and subscription quota you already have, used automatically.”

This is issue garywu/atlas#369 (Atlas commercial thesis) — the same intelligence optimization logic that makes Prime cheap is the commercial moat for API Mom as a product.

Cloudflare

Cloudflare Workers AI — On-edge model inference. Free within Workers limits. env.AI.run(model, { messages }).
Workers AI: Available models — Full list. Look for text-generation type models.
Cloudflare D1 — Provider health table, cost ledger, daily budgets.
Cloudflare Agents SDK — agents package. The container that calls API Mom.

OpenRouter

OpenRouter — Aggregated model API. OpenAI-compatible endpoint.
OpenRouter: Free models — Current list of zero-cost models with rate limits.
OpenRouter: API reference — Authentication, rate limits, error codes.

AI SDKs

Vercel AI SDK — generateObject(), generateText(), provider abstraction. Custom fetch option enables transparent routing through API Mom.
@ai-sdk/openai — Used to talk to OpenRouter (OpenAI-compatible endpoint).
@ai-sdk/anthropic — Claude provider. Pointed at API Mom’s /v1/anthropic proxy.
Claude Agent SDK — @anthropic-ai/claude-agent-sdk. Subscription-based execution. Used by local runner, not by Prime directly.
Anthropic API: Rate limits — Pro: 5× limits. Max: 20× limits vs base API.

garywu/three-layer-ai-agent-architecture — Container / Brain / Wallet. API Mom as cost proxy. Origin of the three-layer pattern this article extends.
garywu/org-prime-agent-architecture — Prime agents that call the router. The Tier 4 runner delegation pattern in full context.
garywu/cloudflare-durable-objects-patterns — DO hibernation, SQLite, alarms. API Mom’s own infrastructure.
garywu/cloudflare-autonomous-pipeline — Dispatcher scheduling, D1 at scale.

#44 free model routing — Auto-discovery, fallback chain, tier tracking
#48 periodic free provider discovery — Quarterly sweep for new free tiers
#53 unified /v1/chat/completions endpoint — OpenAI-compatible routing endpoint
#18 hierarchical budget system — Project → provider → call-type spend caps