Gary Wu

API Mom as Intelligent Router

Org Status: 🟡 Dormant | Cloudflare: N/A | Last Audited: 2026-04-28


The layer between your agent and the LLM market: it routes by capability, optimizes for cost, and abstracts away every provider decision.

When you build a multi-agent system, you quickly discover that "which model should I use?" is the wrong question for agents to ask. Agents should describe what they need. A separate layer, the router, decides how to satisfy that need at the lowest cost, using whatever capacity is currently available.

This article describes API Mom as that layer: not just a proxy or cost meter, but an intelligent router that knows about free tiers, subscription quotas, provider health, and capability requirements, and makes the right call so your agents don't have to.



Most agent implementations hardcode model selection:

// Wrong: the agent knows too much about the model market
import { anthropic } from '@ai-sdk/anthropic'
import { generateObject } from 'ai'

const { object } = await generateObject({
  model: anthropic('claude-sonnet-4-6'),
  schema: DecisionSchema,
  prompt: buildPrompt(context),
})

This creates several problems:

  1. No free tier usage: Cloudflare Workers AI and OpenRouter have free models capable of handling triage and classification. Hardcoding Claude Sonnet pays $3/M tokens for work a free model could do.
  2. No subscription leverage: Pro/Max Claude subscriptions have 5–20× the rate limits of the API. Code-fix work executed through a local runner under subscription costs zero API tokens. Hardcoding to the API never uses this.
  3. No cost-aware fallback: when the daily budget runs low, the agent has no mechanism to downgrade to cheaper capacity.
  4. Provider lock-in: if Anthropic has an outage or raises prices, every agent needs updating.
  5. Surprise bills: no central enforcement of spend limits per agent, per project, per day.

The fix is not smarter agents. It is removing model selection from agents entirely.


Agents describe what they need. The router decides how to satisfy it.

// Right: the agent declares capability requirements
const result = await router.complete({
  capability: 'structured-reasoning',  // what I need
  complexity: 'standard',              // how hard
  context_tokens: 2000,               // how much context
  function: 'generate-run-sheet',     // for cost attribution
  schema: RunSheetSchema,
  prompt: buildPrompt(context),
})

The agent has no knowledge of which model ran. It receives the result. API Mom handled everything else.
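
One way to pin down that contract is a typed request with a small validator. The sketch below is an assumption, not API Mom's actual schema: the capability names are the ones that appear in this article, while the `complex` level, the `validateRouterRequest` and `isRouterRequest` helpers, and their rules are hypothetical.

```typescript
// Hypothetical request contract. Capability names come from this article;
// the 'complex' level and the validation rules are assumptions.
const CAPABILITIES = [
  'triage', 'classification', 'summarization', 'structured-reasoning',
  'code-understanding', 'long-context', 'critical-decision', 'general',
] as const
type Capability = (typeof CAPABILITIES)[number]

const COMPLEXITIES = ['trivial', 'standard', 'complex'] as const
type Complexity = (typeof COMPLEXITIES)[number]

type RouterRequest = {
  capability: Capability
  complexity: Complexity
  context_tokens: number
  function: string
  prompt: string
}

const nonEmpty = (v: unknown): boolean => typeof v === 'string' && v.trim().length > 0

// Returns a list of problems; an empty list means the request is routable.
function validateRouterRequest(input: Record<string, unknown>): string[] {
  const problems: string[] = []
  if (!(CAPABILITIES as readonly unknown[]).includes(input.capability)) {
    problems.push('capability must be a known capability, not a model name')
  }
  if (!(COMPLEXITIES as readonly unknown[]).includes(input.complexity)) {
    problems.push('complexity must be trivial, standard, or complex')
  }
  const tokens = input.context_tokens
  if (typeof tokens !== 'number' || !Number.isInteger(tokens) || tokens < 0) {
    problems.push('context_tokens must be a non-negative integer')
  }
  if (!nonEmpty(input.function)) problems.push('function is required for cost attribution')
  if (!nonEmpty(input.prompt)) problems.push('prompt is required')
  return problems
}

// Type guard built on the validator.
function isRouterRequest(input: unknown): input is RouterRequest {
  return typeof input === 'object' && input !== null &&
    validateRouterRequest(input as Record<string, unknown>).length === 0
}
```

Rejecting a model name in the `capability` field is the point of the exercise: an agent that tries to smuggle in a provider decision fails validation before any routing happens.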

This means:


LLM access has fundamentally different cost models. The router must understand all four.

Tier 1: Free (Cloudflare Workers AI)

Tier 2: Free (OpenRouter Free Tier)

Tier 3: Paid API (per-token)

Tier 4: Subscription Quota (via Runners)

The counterintuitive result: The most capable work (code changes via Claude Code) is the cheapest, because it runs through the subscription, not the API.


Given: capability_hint, complexity, context_size, current_budget, daily_budget_remaining

1. Can Workers AI handle this capability?
   AND context_size < Workers AI limit?
   → Route to Workers AI (free, fastest)

2. Is there a healthy free OpenRouter model for this capability?
   AND free model quality sufficient for complexity?
   → Route to OpenRouter free tier

3. Is this a code-fix execution task (not a reasoning call)?
   → Delegate to runner (subscription quota, zero API cost)

4. What is the cheapest paid model that satisfies the capability?
   AND daily_budget_remaining > estimated_cost?
   → Route to cheapest sufficient paid model

5. Budget exhausted for today?
   → Force-downgrade to free tier even if quality is reduced
   → Log degradation event

Priority: free → subscription → cheap paid → expensive paid. Budget exhaustion forces back to free.
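
The five steps above can be sketched as a single pure function. This is a minimal illustration under stated assumptions: the `RoutingInput` fields, the `runner` tier name, and the choice of Workers AI as the forced-downgrade target are all hypothetical, and the health and quality checks are reduced to precomputed lists.

```typescript
type Tier = 'workers-ai' | 'openrouter-free' | 'runner' | 'paid'

// Hypothetical inputs; a real router would derive these from its own state.
interface RoutingInput {
  capability: string
  contextTokens: number
  isCodeExecution: boolean          // step 3: an execution task, not a reasoning call
  workersAiCapabilities: string[]   // capabilities Workers AI is trusted with
  workersAiContextLimit: number     // in tokens
  freeTierCapabilities: string[]    // capabilities with a healthy, good-enough free model
  estimatedPaidCostUsd: number      // for the cheapest sufficient paid model
  dailyBudgetRemainingUsd: number
}

interface Decision {
  tier: Tier
  degraded: boolean  // true when quality was traded away to stay within budget
}

function decideTier(input: RoutingInput): Decision {
  // 1. Workers AI, if it supports the capability and the context fits
  if (input.workersAiCapabilities.includes(input.capability) &&
      input.contextTokens < input.workersAiContextLimit) {
    return { tier: 'workers-ai', degraded: false }
  }
  // 2. A healthy free OpenRouter model of sufficient quality
  if (input.freeTierCapabilities.includes(input.capability)) {
    return { tier: 'openrouter-free', degraded: false }
  }
  // 3. Code-fix execution is delegated to the runner (subscription quota)
  if (input.isCodeExecution) {
    return { tier: 'runner', degraded: false }
  }
  // 4. Cheapest sufficient paid model, if the budget covers the estimate
  if (input.dailyBudgetRemainingUsd > input.estimatedPaidCostUsd) {
    return { tier: 'paid', degraded: false }
  }
  // 5. Budget exhausted: force-downgrade to a free tier and flag the degradation
  return { tier: 'workers-ai', degraded: true }
}
```

Returning the `degraded` flag alongside the tier leaves the caller responsible for logging the degradation event, which keeps the decision itself free of side effects and easy to test.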


Workers AI runs inside Cloudflare's network. No external API call, no egress, lowest latency.

// In API Mom Worker
async function routeToWorkersAI(
  request: RouterRequest,
  env: Env,
): Promise<RouterResponse> {
  const model = selectWorkersAIModel(request.capability)

  const response = await env.AI.run(model, {
    messages: [
      { role: 'system', content: request.system ?? '' },
      { role: 'user', content: request.prompt },
    ],
    max_tokens: 512,
  })

  return {
    content: response.response ?? '',
    model: model,
    tier: 'workers-ai',
    cost_usd: 0,
  }
}

function selectWorkersAIModel(capability: string): string {
  // Larger model for reasoning, smaller for classification
  if (capability === 'classification' || capability === 'triage') {
    return '@cf/google/gemma-3-12b-it'
  }
  return '@cf/meta/llama-3.3-70b-instruct'
}

Capabilities best handled by Workers AI:


OpenRouter aggregates hundreds of models under a single OpenAI-compatible API. Many are free: provider-subsidized and rate-limited, but capable.

// OpenRouter is OpenAI-compatible: use the OpenAI provider with a custom base URL
import { createOpenAI } from '@ai-sdk/openai'

function createOpenRouterProvider(env: Env, model: string) {
  return createOpenAI({
    apiKey: env.OPENROUTER_API_KEY,
    baseURL: 'https://openrouter.ai/api/v1',
    // createOpenAI takes extra request headers under `headers`
    headers: {
      'HTTP-Referer': 'https://api-mom.workers.dev',
      'X-Title': 'API Mom Router',
    },
  }).chat(model)  // target /chat/completions, the endpoint OpenRouter implements
}

// Free models to rotate through (check openrouter.ai/models?q=free for current list)
const FREE_MODELS = [
  'meta-llama/llama-3.3-70b-instruct:free',
  'google/gemma-3-12b-it:free',
  'deepseek/deepseek-r1:free',
  'mistralai/mistral-7b-instruct:free',
  'qwen/qwen-2.5-72b-instruct:free',
]

API Mom must track which free models are currently healthy. When one hits its rate limit, rotate to the next. When all are throttled, escalate to paid tier.

This is issue garywu/api-mom#44 (free model routing β€” auto-discovery, fallback chain, tier tracking).
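
The rotation described above can be sketched as a small in-memory tracker. Everything here is illustrative: `FreeModelPool`, its method names, and the 60-second default cooldown are assumptions, and a real deployment would persist this state rather than hold it in memory.

```typescript
// Hypothetical in-memory tracker; the default cooldown is an assumption.
class FreeModelPool {
  private readonly models: string[]
  private readonly throttledUntil = new Map<string, number>()

  constructor(models: string[]) {
    this.models = [...models]
  }

  // Models that are not currently throttled, in configured order.
  healthy(nowMs: number): string[] {
    return this.models.filter(m => (this.throttledUntil.get(m) ?? 0) <= nowMs)
  }

  // Record a rate-limit hit; the model is skipped until the cooldown passes.
  markThrottled(model: string, nowMs: number, cooldownMs = 60_000): void {
    this.throttledUntil.set(model, nowMs + cooldownMs)
  }

  // First healthy model, or null to signal "escalate to the paid tier".
  pick(nowMs: number): string | null {
    return this.healthy(nowMs)[0] ?? null
  }
}
```

Passing the clock in as `nowMs` instead of calling `Date.now()` inside the class keeps the rotation deterministic under test.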


When free tiers are insufficient, route to the cheapest model that satisfies capability requirements.

const PAID_ROUTES: Record<string, PaidRoute[]> = {
  // Ordered cheapest-first within each capability tier, except 'critical-decision',
  // which lists the strongest model first
  'structured-reasoning': [
    { provider: 'google',    model: 'gemini-2.0-flash',          cost_per_m_input: 0.10 },
    { provider: 'anthropic', model: 'claude-haiku-4-5-20251001', cost_per_m_input: 0.80 },
    { provider: 'anthropic', model: 'claude-sonnet-4-6',         cost_per_m_input: 3.00 },
  ],
  'code-understanding': [
    { provider: 'anthropic', model: 'claude-sonnet-4-6',         cost_per_m_input: 3.00 },
    { provider: 'anthropic', model: 'claude-opus-4-6',           cost_per_m_input: 15.0 },
  ],
  'long-context': [
    { provider: 'google',    model: 'gemini-2.5-pro',            cost_per_m_input: 1.25 },
    { provider: 'anthropic', model: 'claude-sonnet-4-6',         cost_per_m_input: 3.00 },
  ],
  'critical-decision': [
    { provider: 'anthropic', model: 'claude-opus-4-6',           cost_per_m_input: 15.0 },
    { provider: 'google',    model: 'gemini-2.5-pro',            cost_per_m_input: 1.25 },
  ],
}
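
Selecting from such a table might look like the sketch below. It is an assumption about how the lookup could work, not the router's actual code: `pickCheapestAffordable` and `estimateInputCostUsd` are hypothetical helpers, the estimate covers input tokens only, and the sample routes reuse the per-million prices listed above.

```typescript
interface PaidRoute {
  provider: string
  model: string
  cost_per_m_input: number  // USD per million input tokens
}

// Input-token cost only; output tokens are ignored in this sketch.
function estimateInputCostUsd(route: PaidRoute, inputTokens: number): number {
  return (inputTokens / 1_000_000) * route.cost_per_m_input
}

// Cheapest healthy route whose estimated cost fits the remaining budget,
// or null when nothing qualifies.
function pickCheapestAffordable(
  routes: PaidRoute[],
  inputTokens: number,
  budgetRemainingUsd: number,
  unhealthyModels: Set<string> = new Set(),
): PaidRoute | null {
  const byPrice = [...routes].sort((a, b) => a.cost_per_m_input - b.cost_per_m_input)
  for (const route of byPrice) {
    if (unhealthyModels.has(route.model)) continue
    if (estimateInputCostUsd(route, inputTokens) <= budgetRemainingUsd) return route
  }
  return null
}

const structuredReasoningRoutes: PaidRoute[] = [
  { provider: 'anthropic', model: 'claude-haiku-4-5-20251001', cost_per_m_input: 0.80 },
  { provider: 'google', model: 'gemini-2.0-flash', cost_per_m_input: 0.10 },
  { provider: 'anthropic', model: 'claude-sonnet-4-6', cost_per_m_input: 3.00 },
]
```

Sorting inside the function means the selection does not depend on the table being maintained in price order by hand.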

This tier is architecturally different from the others. It is not an LLM API call. It is a job delegation.

When an agent needs code-level work done (fix a TypeScript error, debug a failing CI run, create a PR), it does not call an LLM directly. It submits a job to the dispatcher, which routes it to the local runner, which executes a full Claude Agent SDK session on a machine running under a Pro or Max subscription.

Agent (Prime DO) → Dispatcher → Local Runner → Claude Agent SDK session
                                                ↑
                                     Pro: 5× rate limits
                                     Max: 20× rate limits
                                     Cost: $0 API tokens

The key properties:

What this means for routing: API Mom should never be involved in runner execution. The routing decision is made upstream β€” if the task requires code execution, delegate to dispatcher, not to an LLM API call.

// In Prime's reasoning cycle
async function executeDecision(action: Action, env: Env) {
  if (action.type === 'fix-code' || action.type === 'debug-ci') {
    // Does NOT go through API Mom: goes to dispatcher
    await submitDispatcherJob(action, env)
    return
  }
  // Everything else goes through API Mom
  await routerCall(action, env)
}

Putting it all together, here is what goes where:

| Task | Capability | Tier | Model/Channel | Cost |
|---|---|---|---|---|
| "Does this repo have biome.json?" | triage | Workers AI | Gemma 3 12B | Free |
| "What's the CI failure type?" | classification | Workers AI | Gemma 3 12B | Free |
| "Summarize these 3 failures" | summarization | OpenRouter free | Llama/Mistral | Free |
| "Generate run sheet for 40 repos" | structured-reasoning | Paid API | Haiku / Gemini Flash | ~$0.01 |
| "Re-plan after 3 failures" | structured-reasoning | Paid API | Sonnet | ~$0.05 |
| "Fix TypeScript error in this file" | code-execution | Runner (subscription) | Claude Agent SDK | $0 API |
| "Debug failing CI across 5 files" | code-execution | Runner (subscription) | Claude Agent SDK | $0 API |
| "Critical architecture decision" | critical-decision | Paid API | Opus / Gemini Pro | ~$0.20 |

API Mom tracks daily spend per project. When budget is running low, it automatically degrades:

async function selectTier(
  request: RouterRequest,
  db: D1Database,
): Promise<Tier> {
  const { spent_today, daily_limit } = await getDailyBudget(request.project_id, db)
  const remaining_fraction = 1 - spent_today / daily_limit

  // Plenty of budget: use best fit
  if (remaining_fraction > 0.5) return selectBestFit(request)

  // Getting low: force to free tiers
  if (remaining_fraction > 0.1) return tryFreeTiers(request)

  // Almost out: Workers AI or nothing
  if (remaining_fraction > 0) return 'workers-ai'

  // Exhausted: reject with 429, log event
  throw new BudgetExhaustedException(request.project_id)
}

Agents receive an X-Budget-Remaining-Fraction header on every response. They can check this to decide whether to schedule non-urgent work for tomorrow.

This is the hierarchical budget system in garywu/api-mom#18.
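
On the agent side, reading the X-Budget-Remaining-Fraction header could look like the sketch below. The helper names and the 0.25 threshold are assumptions made for illustration; the `HeaderReader` type is a structural stand-in so that the `headers` object of a fetch `Response` can be passed in directly.

```typescript
// Anything with a Headers-like get(), including response.headers from fetch.
type HeaderReader = { get(name: string): string | null }

// Parses the header into a fraction in [0, 1], or null when absent or malformed.
function readBudgetFraction(headers: HeaderReader): number | null {
  const raw = headers.get('X-Budget-Remaining-Fraction')
  if (raw === null || raw.trim() === '') return null
  const value = Number(raw)
  return Number.isFinite(value) ? Math.min(1, Math.max(0, value)) : null
}

// Hypothetical policy: urgent work always runs; non-urgent work waits once
// the remaining budget drops below the threshold.
function shouldDefer(headers: HeaderReader, urgent: boolean, threshold = 0.25): boolean {
  if (urgent) return false
  const fraction = readBudgetFraction(headers)
  return fraction !== null && fraction < threshold
}
```

A missing or malformed header is treated as "unknown" rather than "empty", so an agent talking to an older router keeps working instead of deferring everything.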


API Mom applies the same control plane / data plane split it serves, recursively, to itself.

Router DO (data plane, hot path)
  Every request → SQLite lookup → proxy → record
  No LLM calls. Ever.
  Latency: sub-millisecond routing decision

Optimizer DO (control plane, cold path)
  Wakes on alarm (hourly/daily)
  Reads Router DO metrics → one LLM call → updates routing table
  Hibernates. Zero cost between wakes.

Router DO (hot path)

The Router DO handles every incoming request. Its only job: read the routing table, pick a tier, forward the request, record the outcome. No reasoning, no LLM, no external calls beyond the forwarded request itself.

export class RouterDO extends Agent<Env, RouterState> {
  async onStart() {
    // Routing table lives in DO SQLite, so lookups are in-memory fast on a warm DO
    this.sql`CREATE TABLE IF NOT EXISTS routing_table (
      capability    TEXT NOT NULL,
      complexity    TEXT NOT NULL,
      tier          TEXT NOT NULL,   -- 'workers-ai' | 'openrouter-free' | 'paid'
      provider      TEXT NOT NULL,
      model         TEXT NOT NULL,
      priority      INTEGER NOT NULL DEFAULT 0,
      updated_at    INTEGER NOT NULL
    )`
    this.sql`CREATE TABLE IF NOT EXISTS provider_health (
      model         TEXT PRIMARY KEY,
      status        TEXT NOT NULL,   -- 'healthy' | 'throttled' | 'down'
      throttled_until INTEGER,
      error_count   INTEGER DEFAULT 0,
      last_checked  INTEGER NOT NULL
    )`
    this.sql`CREATE TABLE IF NOT EXISTS call_ledger (
      id            TEXT PRIMARY KEY,
      project_id    TEXT NOT NULL,
      capability    TEXT NOT NULL,
      tier          TEXT NOT NULL,
      model         TEXT NOT NULL,
      cost_usd      REAL NOT NULL DEFAULT 0,
      latency_ms    INTEGER,
      is_fallback   INTEGER NOT NULL DEFAULT 0,  -- 1 when the pass-through route served the call
      status        TEXT NOT NULL,   -- 'success' | 'error' | 'throttled'
      called_at     INTEGER NOT NULL
    )`
  }

  async onRequest(request: Request): Promise<Response> {
    if (request.method !== 'POST') return new Response('not found', { status: 404 })

    const capability = request.headers.get('X-Capability') ?? 'general'
    const complexity = request.headers.get('X-Complexity') ?? 'standard'
    const projectId  = request.headers.get('X-Project-Id') ?? 'default'

    // Pure SQLite lookup: no LLM, no external call
    const route = this.selectRoute(capability, complexity)
    const start = Date.now()

    try {
      const response = await this.forwardRequest(request, route)
      const latency = Date.now() - start

      this.recordOutcome({ projectId, capability, route, latency, status: 'success' })

      return new Response(response.body, {
        status: response.status,
        headers: {
          ...Object.fromEntries(response.headers),
          'X-Tier-Used':  route.tier,
          'X-Model-Used': route.model,
          'X-Latency-Ms': String(latency),
        },
      })
    } catch (err) {
      this.markProviderUnhealthy(route.model)
      throw err
    }
  }

  // Not marked private so the fallback tests can call it directly
  selectRoute(capability: string, complexity: string): Route {
    // 1. Try routing table (smart path)
    try {
      const routes = [...this.sql`
        SELECT * FROM routing_table
        WHERE capability = ${capability} AND complexity = ${complexity}
        AND model NOT IN (
          SELECT model FROM provider_health
          WHERE status = 'throttled' AND throttled_until > ${Date.now()}
        )
        ORDER BY priority DESC
      `]
      if (routes.length > 0) return routes[0] as Route
    } catch {
      // Routing table corrupted or missing: fall through to dumb pass-through
      this.recordEvent('routing-table-error')
    }

    // 2. Dumb pass-through: always present, never depends on routing table
    // Configured via env vars, not routing table. Always works.
    return this.passthroughRoute()
  }

  private passthroughRoute(): Route {
    // Hard-coded fallback. Reads from env, never from SQLite.
    // This path must work even if the entire DO SQLite is corrupted.
    //
    // Primary fallback: Workers AI. On-edge, no external API call, no key needed.
    // If Workers AI fails, Cloudflare itself is down → Worker is also down → unreachable.
    // Secondary fallback (last resort): Anthropic API key from env.
    return {
      tier: 'workers-ai',
      provider: 'cloudflare',
      model: '@cf/meta/llama-3.3-70b-instruct',
      priority: -1,
      is_fallback: true,
    }
  }

  private lastResortRoute(): Route {
    // Only reached if Workers AI itself fails, which is extremely rare.
    // Anthropic API key configured in wrangler.jsonc secrets.
    return {
      tier: 'paid',
      provider: 'anthropic',
      model: this.env.LAST_RESORT_MODEL ?? 'claude-haiku-4-5-20251001',
      priority: -2,
      is_fallback: true,
      is_last_resort: true,
    }
  }
}

Optimizer DO (cold path)

The Optimizer wakes on alarm, reads the Router DO's performance data, and updates the routing table. It makes exactly one LLM call per cycle, routed through the Router DO itself, which sends it to the free tier.

export class OptimizerDO extends Agent<Env, OptimizerState> {
  async onStart() {
    // Wake every hour to review performance
    await this.schedule(3600, 'optimize', {})
  }

  async optimize(_: unknown) {
    // Read Router DO's performance data
    const routerDO = this.env.ROUTER_DO.get(
      this.env.ROUTER_DO.idFromName('singleton')
    )
    const metrics = await routerDO.fetch('https://router/metrics').then(r => r.json())

    // Discover new free models from OpenRouter
    const freeModels = await this.discoverFreeModels()

    // ONE LLM call, routed through Router DO (hits free tier)
    // The Optimizer uses the Router it's optimizing
    const { object: updates } = await generateObject({
      // createRouterModel is the standalone helper that takes env as its first argument
      model: createRouterModel(this.env, 'triage', 'trivial', 'optimizer'),
      schema: RoutingUpdateSchema,
      prompt: buildOptimizerPrompt(metrics, freeModels),
    })

    // Apply updates to Router DO's routing table
    for (const update of updates.changes) {
      await routerDO.fetch('https://router/routing-table', {
        method: 'PATCH',
        body: JSON.stringify(update),
      })
    }

    // Schedule next optimization
    await this.schedule(3600, 'optimize', {})
  }
}

Fallback Is Not Optional

The routing table is an optimization layer. The dumb pass-through is the guaranteed baseline. They are not the same thing and one must never depend on the other.

Smart path (routing table):     present when Optimizer has run
Dumb pass-through (env vars):   always present, always works

If routing table empty:         dumb pass-through
If routing table corrupted:     dumb pass-through
If all providers throttled:     dumb pass-through
If Optimizer DO is down:        dumb pass-through
If DO SQLite fails:             dumb pass-through

The fallback is configured in wrangler.jsonc (env vars), never in the routing table. This is load-bearing: if the routing table is the problem, it cannot also be the solution.

// wrangler.jsonc: fallback hierarchy, none of it depends on the routing table
{
  "ai": { "binding": "AI" },   // Workers AI binding: primary fallback, always present

  "vars": {
    "LAST_RESORT_MODEL": "claude-haiku-4-5-20251001"  // only if Workers AI fails
  }
  // ANTHROPIC_API_KEY (last-resort key only, not the primary path) is a secret:
  // set it with `wrangler secret put ANTHROPIC_API_KEY` rather than in this file
}

Primary fallback: Workers AI. On-edge, no external call, no API key, co-located with the Router DO. If Workers AI fails, Cloudflare itself is down, and so is the Worker hosting the Router, so callers can't reach the fallback anyway. Workers AI failing and the Router being reachable is not a real failure mode.

Last resort: Anthropic API. Only exists for belt-and-suspenders. In practice, never used.

Every fallback use is logged. The Optimizer reads fallback frequency as a signal: a high fallback rate means the routing table has a problem. It investigates and fixes. During the investigation, callers never notice, because they keep getting responses.

This is the same principle as a circuit breaker in distributed systems: the smart path is tried first; on failure, the system degrades gracefully to a known-good baseline rather than failing entirely.
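
Reduced to its core, the pattern is a lookup wrapped in a guaranteed baseline. The sketch below is an illustration rather than the Router DO's code: `selectWithFallback`, the `onFallback` callback, and the reason strings are hypothetical names, and the baseline is a constant so that it cannot depend on the routing table.

```typescript
interface Route {
  tier: string
  model: string
  is_fallback: boolean
}

// The baseline comes from static configuration, never from the routing table.
const BASELINE_ROUTE: Route = {
  tier: 'workers-ai',
  model: '@cf/meta/llama-3.3-70b-instruct',
  is_fallback: true,
}

// Try the smart path first; on a miss or an error, report why and degrade
// to the baseline instead of failing the request.
function selectWithFallback(
  lookup: () => Route | undefined,
  onFallback: (reason: string) => void,
): Route {
  try {
    const route = lookup()
    if (route) return route
    onFallback('no-route')
  } catch {
    onFallback('lookup-error')
  }
  return BASELINE_ROUTE
}
```

Because every fallback goes through `onFallback`, the same hook can feed the frequency signal the Optimizer reads.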

The Recursive Property

The Optimizer uses the Router to make its own LLM calls. The Router routes those calls to Workers AI or an OpenRouter free model. The system that improves the routing table uses the routing table to decide how to improve the routing table.

This means:

Optimizer wakes → calls Router → Router selects Workers AI (free)
→ LLM analyzes metrics → suggests routing table changes
→ Router now routes more calls to Workers AI
→ Next Optimizer wake: Router is better → Optimizer costs even less

The routing endpoint is OpenAI-compatible (issue garywu/api-mom#53), so any AI SDK provider that speaks the OpenAI chat-completions format can point at it.

// POST /v1/chat/completions: unified routing endpoint
app.post('/v1/chat/completions', async (c) => {
  const body = await c.req.json()
  const capabilityHint = c.req.header('X-Capability') ?? 'general'
  const complexity = c.req.header('X-Complexity') ?? 'standard'
  const projectId = c.req.header('X-Project-Id') ?? 'default'
  const functionName = c.req.header('X-Function') ?? 'unknown'

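  // Select tier; an exhausted budget is rejected with 429 (see selectTier)
  let tier: Tier
  try {
    tier = await selectTier({ capability: capabilityHint, complexity, project_id: projectId }, c.env.DB)
  } catch (err) {
    if (err instanceof BudgetExhaustedException) {
      return c.json({ error: 'daily budget exhausted' }, 429)
    }
    throw err
  }

  // Route to tier
  let response: LLMResponse
  switch (tier) {
    case 'workers-ai':
      response = await routeToWorkersAI(body, c.env)
      break
    case 'openrouter-free':
      response = await routeToOpenRouterFree(body, c.env)
      break
    case 'paid':  // same tier name as the routing table and ledger schemas
      response = await routeToPaidAPI(body, capabilityHint, c.env)
      break
    default:
      // Unknown tier: fail loudly rather than read an unassigned response
      throw new Error(`Unhandled tier: ${tier}`)
  }

  // Record cost
  await recordCost({ project_id: projectId, function: functionName, tier, cost_usd: response.cost_usd }, c.env.DB)

  return c.json(formatAsOpenAIResponse(response), 200, {
    'X-Tier-Used': tier,
    'X-Model-Used': response.model,
    'X-Cost-Usd': response.cost_usd.toFixed(6),
    // Header values must be strings
    'X-Budget-Remaining-Fraction': String(await getBudgetFraction(projectId, c.env.DB)),
  })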
})
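
The handler relies on a `formatAsOpenAIResponse` helper whose body is not shown. A minimal guess at it follows, assuming the `LLMResponse` shape returned by the tier handlers in this article; the generated `id` format is arbitrary, and the `usage` block is omitted because these handlers do not return token counts.

```typescript
interface LLMResponse {
  content: string
  model: string
  tier: string
  cost_usd: number
}

// Wraps a tier handler's result in the OpenAI chat-completion envelope.
// nowMs is injectable so the output is deterministic under test.
function formatAsOpenAIResponse(res: LLMResponse, nowMs: number = Date.now()) {
  return {
    id: `chatcmpl-${nowMs.toString(36)}`,
    object: 'chat.completion',
    created: Math.floor(nowMs / 1000),  // OpenAI uses epoch seconds, not milliseconds
    model: res.model,
    choices: [
      {
        index: 0,
        message: { role: 'assistant', content: res.content },
        finish_reason: 'stop',
      },
    ],
  }
}
```

Tier and cost stay out of the body and travel in the `X-Tier-Used` and `X-Cost-Usd` headers instead, so clients that only understand the OpenAI format are unaffected.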

The fallback pass-through is the first thing to implement and the first thing to test. Every other feature is built on top of it. If the fallback breaks, the whole system fails: callers get errors instead of degraded-but-working responses.

// src/router.test.ts
import { describe, it, expect, beforeEach } from 'vitest'
import { RouterDO } from './router-do'
import { createMiniflareEnv } from './test-helpers'

describe('RouterDO fallback pass-through', () => {
  let env: Env

  beforeEach(async () => {
    env = await createMiniflareEnv({
      LAST_RESORT_MODEL: 'claude-haiku-4-5-20251001',  // same var name as wrangler.jsonc
      ANTHROPIC_API_KEY: 'test-key',
    })
  })

  it('uses Workers AI fallback when routing table is empty', async () => {
    const router = new RouterDO(env)
    // No routing table entries: DO SQLite is fresh/empty

    const route = router.selectRoute('structured-reasoning', 'standard')

    expect(route.is_fallback).toBe(true)
    expect(route.tier).toBe('workers-ai')           // Cloudflare-native, no egress
    expect(route.model).toBe('@cf/meta/llama-3.3-70b-instruct')
    expect(route.provider).toBe('cloudflare')
  })

  it('uses fallback when all providers are throttled', async () => {
    const router = new RouterDO(env)
    // Seed routing table with one route, but mark it throttled
    await router.seedRoutingTable([
      { capability: 'structured-reasoning', complexity: 'standard',
        tier: 'openrouter-free', model: 'meta-llama/llama-3.3-70b-instruct:free' }
    ])
    await router.markProviderThrottled('meta-llama/llama-3.3-70b-instruct:free', Date.now() + 3600_000)

    const route = router.selectRoute('structured-reasoning', 'standard')

    expect(route.is_fallback).toBe(true)
  })

  it('uses fallback when routing table is corrupted', async () => {
    const router = new RouterDO(env)
    // Corrupt the routing table
    await router.corruptRoutingTableForTest()

    // Must not throw: must return fallback
    const route = router.selectRoute('structured-reasoning', 'standard')

    expect(route.is_fallback).toBe(true)
  })

  it('uses fallback when Optimizer DO has never run', async () => {
    // Fresh deploy: Optimizer hasn't touched routing table yet
    const router = new RouterDO(env)

    const route = router.selectRoute('triage', 'trivial')

    // Even triage, which should go to Workers AI eventually, falls back on first run
    expect(route.is_fallback).toBe(true)
  })

  it('records every fallback use in call_ledger', async () => {
    const router = new RouterDO(env)

    router.selectRoute('structured-reasoning', 'standard')

    const events = [...router.sql`SELECT * FROM call_ledger WHERE is_fallback = 1`]
    expect(events).toHaveLength(1)
    expect(events[0].capability).toBe('structured-reasoning')
  })

  it('smart route takes over once routing table is populated', async () => {
    const router = new RouterDO(env)
    await router.seedRoutingTable([
      { capability: 'triage', complexity: 'trivial',
        tier: 'workers-ai', model: '@cf/google/gemma-3-12b-it', priority: 10 }
    ])

    const route = router.selectRoute('triage', 'trivial')

    expect(route.is_fallback).toBeFalsy()  // routing_table rows carry no is_fallback column
    expect(route.tier).toBe('workers-ai')
    expect(route.model).toBe('@cf/google/gemma-3-12b-it')
  })
})

These six tests define the contract: the fallback works in every failure mode, it logs every use, and smart routes take over correctly once the Optimizer populates the table. All six must pass before any routing feature ships.

See garywu/api-mom#106 β€” implement and test fallback pass-through first.


From an agent (Prime DO or any Cloudflare Worker), routing through API Mom is one line:

// In wrangler.jsonc: point the Anthropic provider at API Mom
// The agent never knows which model ran

import { createAnthropic } from '@ai-sdk/anthropic'
import { generateObject } from 'ai'

function createRouterModel(env: Env, capability: string, complexity: string, functionName: string) {
  const proxiedFetch = async (url: RequestInfo | URL, init?: RequestInit) => {
    const headers = new Headers(init?.headers)
    headers.set('X-Api-Key', env.API_MOM_KEY)
    headers.set('X-Capability', capability)
    headers.set('X-Complexity', complexity)
    headers.set('X-Function', functionName)
    headers.set('X-Project-Id', env.PROJECT_ID)
    return fetch(url, { ...init, headers })
  }

  // Points at API Mom: which model actually runs is API Mom's decision
  return createAnthropic({
    apiKey: 'routed',
    baseURL: `${env.API_MOM_URL}/v1`,
    fetch: proxiedFetch,
  })('claude-sonnet-4-6')  // This model ID is ignored: API Mom overrides it
}

// Usage in Prime's wake cycle
const model = createRouterModel(env, 'structured-reasoning', 'standard', 'generate-run-sheet')
const { object: runSheet } = await generateObject({
  model,
  schema: RunSheetSchema,
  prompt: buildRunSheetPrompt(context),
})

The agent passes capability and complexity. The model name in createAnthropic() is a placeholder: API Mom overrides the routing decision.


OpenRouter is the primary free model source. It provides an OpenAI-compatible API with 50+ free models.

// API Mom's OpenRouter free tier handler
async function routeToOpenRouterFree(
  body: ChatCompletionRequest,
  env: Env,
): Promise<LLMResponse> {
  const healthyModels = await getHealthyFreeModels(env.DB)

  for (const model of healthyModels) {
    try {
      const res = await fetch('https://openrouter.ai/api/v1/chat/completions', {
        method: 'POST',
        headers: {
          'Authorization': `Bearer ${env.OPENROUTER_API_KEY}`,
          'HTTP-Referer': 'https://api-mom.workers.dev',
          'X-Title': 'API Mom',
          'Content-Type': 'application/json',
        },
        body: JSON.stringify({ ...body, model }),
      })

      if (res.status === 429) {
        await markModelThrottled(model, env.DB)
        continue  // try next free model
      }

      if (!res.ok) {
        await markModelUnhealthy(model, env.DB)
        continue
      }

      const data = await res.json() as OpenAIResponse
      await markModelHealthy(model, env.DB)
      return { content: data.choices[0].message.content, model, tier: 'openrouter-free', cost_usd: 0 }

    } catch {
      await markModelUnhealthy(model, env.DB)
    }
  }

  // All free models exhausted: escalate
  throw new FreeTierExhaustedException()
}

API Mom periodically rediscovers healthy free models (issue garywu/api-mom#48).
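
The filtering half of that rediscovery can be sketched without any network code. The assumption here is that the list comes from OpenRouter's model listing (GET /api/v1/models), where each entry carries an `id`, a `context_length`, and a `pricing` object with decimal-string prices; verify the exact shape against the live API. `selectFreeModels` and `isZeroPrice` are hypothetical helpers.

```typescript
// Subset of the fields in OpenRouter's model listing; treat the exact shape
// as an assumption and verify it against the live API.
interface OpenRouterModel {
  id: string
  context_length?: number
  pricing?: { prompt?: string; completion?: string }
}

// Prices arrive as decimal strings; only an explicit zero counts as free.
function isZeroPrice(value: string | undefined): boolean {
  return value !== undefined && value.trim() !== '' && Number(value) === 0
}

// IDs of models that are free for both prompt and completion tokens and
// meet the context requirement, largest context window first.
function selectFreeModels(models: OpenRouterModel[], minContextTokens = 0): string[] {
  return models
    .filter(m => isZeroPrice(m.pricing?.prompt) && isZeroPrice(m.pricing?.completion))
    .filter(m => (m.context_length ?? 0) >= minContextTokens)
    .sort((a, b) => (b.context_length ?? 0) - (a.context_length ?? 0))
    .map(m => m.id)
}
```

Models with missing pricing are excluded rather than assumed free, which errs on the side of never sending traffic to a model that might bill.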


-- API Mom D1 schema
CREATE TABLE provider_health (
  model TEXT PRIMARY KEY,
  tier TEXT NOT NULL,           -- 'workers-ai' | 'openrouter-free' | 'paid'
  status TEXT NOT NULL,         -- 'healthy' | 'throttled' | 'degraded' | 'down'
  throttled_until INTEGER,      -- epoch ms: when to retry
  last_checked INTEGER NOT NULL,
  error_count INTEGER DEFAULT 0
);

CREATE TABLE api_calls_ledger (
  id TEXT PRIMARY KEY,
  project_id TEXT NOT NULL,
  function_name TEXT NOT NULL,
  tier TEXT NOT NULL,
  model TEXT NOT NULL,
  input_tokens INTEGER,
  output_tokens INTEGER,
  cost_usd REAL NOT NULL DEFAULT 0,
  called_at INTEGER NOT NULL
);

CREATE TABLE daily_budgets (
  project_id TEXT PRIMARY KEY,
  daily_limit_usd REAL NOT NULL,
  spent_today_usd REAL NOT NULL DEFAULT 0,
  reset_at INTEGER NOT NULL     -- next midnight UTC
);
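
The `daily_budgets` row implies a rollover step at `reset_at`. A possible version of that logic is sketched below as pure functions over a row; `rollOver`, `nextMidnightUtc`, and `remainingFraction` are hypothetical names, and treating a non-positive limit as "no budget" is an assumption.

```typescript
// Mirrors the daily_budgets columns; timestamps are epoch milliseconds.
interface DailyBudget {
  project_id: string
  daily_limit_usd: number
  spent_today_usd: number
  reset_at: number
}

// Start of the next UTC day. Date.UTC normalizes a day overflow into the
// following month or year.
function nextMidnightUtc(nowMs: number): number {
  const d = new Date(nowMs)
  return Date.UTC(d.getUTCFullYear(), d.getUTCMonth(), d.getUTCDate() + 1)
}

// Zeroes the spend once reset_at has passed; otherwise returns the row unchanged.
function rollOver(budget: DailyBudget, nowMs: number): DailyBudget {
  if (nowMs < budget.reset_at) return budget
  return { ...budget, spent_today_usd: 0, reset_at: nextMidnightUtc(nowMs) }
}

// Fraction of today's budget still available, clamped to [0, 1].
function remainingFraction(budget: DailyBudget): number {
  if (budget.daily_limit_usd <= 0) return 0
  return Math.min(1, Math.max(0, 1 - budget.spent_today_usd / budget.daily_limit_usd))
}
```

Applying `rollOver` on read, before computing the fraction, avoids the need for a separate midnight job to reset every project.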

| Decision | Where it lives | Why |
|---|---|---|
| Which model to use | API Mom | Provider-agnostic, changes with market |
| Which free models are healthy | API Mom | Shared across all agents, needs central tracking |
| Whether budget allows a call | API Mom | Enforced centrally, not per-agent |
| What capability is needed | Agent | Agent knows its own task |
| Whether to use runner vs API | Agent | Architectural decision, not routing |
| How to interpret the response | Agent | Domain-specific |
| Retry on structured output failure | Vercel AI SDK | Handled by generateObject() internally |

The agent describes the what and the how hard. API Mom decides the which model and how much it costs.


The routing hierarchy produces a remarkable property: the system stays intelligent at essentially zero cost during idle periods.

When Prime wakes on a 6-hour alarm and needs to triage 40 repos:

A full org maintenance cycle (40 repos checked, 5 jobs dispatched) might cost $0.01–0.05 in API tokens, with the expensive work (code execution) consuming zero because it runs under subscription.

This is not "cheap AI." It is intelligence that scales to zero. The system is always present, always reasoning, always acting, but consumes nothing when idle, and almost nothing when active. The cost curve looks like serverless compute, not a running LLM instance.

Cost vs Activity:
  Idle (no alarms firing):              $0.00 / hour
  Active triage (Workers AI):           $0.00 / cycle
  Active planning (free + Haiku):       $0.01 / cycle
  Active execution (runner):            $0.00 API / session
  Heavy re-planning (Sonnet):           $0.05 / cycle  ← rare
  Critical decision (Opus):            $0.20 / decision ← very rare

Compare to a naive implementation that calls Claude Sonnet for every decision: $3/M tokens × continuous operation = hundreds of dollars per month for a 40-repo org. The routing tier reduces this by 95%+ while delivering the same outcomes.
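
The arithmetic behind that comparison is simple enough to write down. The volume below is a hypothetical figure chosen only to make the numbers concrete, not a measurement; the $3/M price is the Sonnet rate quoted above, and the 5% paid fraction mirrors the 95% reduction.

```typescript
// Monthly spend for a given daily token volume.
// paidFraction is the share of tokens that still reaches the paid model.
function monthlyCostUsd(
  tokensPerDay: number,
  pricePerMTokens: number,
  paidFraction = 1,
  days = 30,
): number {
  return (tokensPerDay / 1_000_000) * pricePerMTokens * paidFraction * days
}

// Hypothetical workload: 5M tokens per day, all of it on a $3/M model...
const naiveMonthlyUsd = monthlyCostUsd(5_000_000, 3)
// ...versus the same workload with 95% of tokens routed to free capacity.
const routedMonthlyUsd = monthlyCostUsd(5_000_000, 3, 0.05)
```

Under these assumed inputs the naive figure works out to about $450 per month and the routed figure to about $22.50; the ratio, not the absolute numbers, is what carries over to other workloads.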


The routing logic described in this article is, in itself, a product.

Every team building AI agents faces the same problem: LLM costs are unpredictable, provider choices are fragmented, and the free capacity that exists (Workers AI, OpenRouter free tier, subscription quota) is invisible unless you build infrastructure to use it.

API Mom's router is that infrastructure. As a service:

The value proposition: pay API Mom less than you'd pay Anthropic directly, and get smarter routing as a bonus.

This works because:

  1. Most tasks don't need Claude Sonnet; they default to it only because that's what developers hardcode
  2. Free capacity exists at scale that individual teams can't effectively aggregate
  3. Subscription quota is systematically under-utilized; routing into it turns idle capacity into value
  4. Provider health and fallback chains require operational infrastructure most teams won't build

The product positioning: "Pay less for the same intelligence. Free tiers and subscription quota you already have, used automatically."

This is issue garywu/atlas#369 (Atlas commercial thesis): the same intelligence optimization logic that makes Prime cheap is the commercial moat for API Mom as a product.



