Org Status: Dormant | Cloudflare: N/A | Last Audited: 2026-04-28
The layer between your agent and the LLM market: it routes by capability, optimizes for cost, and abstracts away every provider decision.
When you build a multi-agent system, you quickly discover that "which model should I use?" is the wrong question for agents to ask. Agents should describe what they need. A separate layer, the router, decides how to satisfy that need at the lowest cost, using whatever capacity is currently available.
This article describes API Mom as that layer: not just a proxy or cost meter, but an intelligent router that knows about free tiers, subscription quotas, provider health, and capability requirements, and makes the right call so your agents don't have to.
- The Problem
- The Abstraction: Capability Hints, Not Model Names
- Four Access Tiers
- The Routing Algorithm
- Tier 1: Cloudflare Workers AI
- Tier 2: OpenRouter Free Models
- Tier 3: Paid API (Claude, Gemini, GPT)
- Tier 4: Subscription Quota via Runners
- The Routing Table
- Cost-Aware Degradation
- Internal Architecture: Two Durable Objects
- API Mom as Router: Implementation
- Agent Integration: Calling the Router
- OpenRouter Integration
- Provider Health and Fallback Chains
- What Goes Into API Mom vs What Stays in the Agent
- References
Most agent implementations hardcode model selection:
// Wrong: agent knows too much about the model market
const { object } = await generateObject({
model: anthropic('claude-sonnet-4-6'),
schema: DecisionSchema,
prompt: buildPrompt(context),
})
This creates several problems:
- No free tier usage: Cloudflare Workers AI and OpenRouter have free models capable of handling triage and classification. Hardcoding Claude Sonnet pays $3/M tokens for work a free model could do.
- No subscription leverage: Pro/Max Claude subscriptions have 5-20× the rate limits of the API. Code-fix work executed through a local runner under subscription costs zero API tokens. Hardcoding to the API never uses this.
- No cost-aware fallback: when the daily budget runs low, the agent has no mechanism to downgrade to cheaper capacity.
- Provider lock-in: if Anthropic has an outage or raises prices, every agent needs updating.
- Surprise bills: no central enforcement of spend limits per agent, per project, per day.
The fix is not smarter agents. It is removing model selection from agents entirely.
Agents describe what they need. The router decides how to satisfy it.
// Right: agent declares capability requirements
const result = await router.complete({
capability: 'structured-reasoning', // what I need
complexity: 'standard', // how hard
context_tokens: 2000, // how much context
function: 'generate-run-sheet', // for cost attribution
schema: RunSheetSchema,
prompt: buildPrompt(context),
})
The agent has no knowledge of which model ran. It receives the result. API Mom handled everything else.
This means:
- New free models appear → API Mom uses them automatically
- Subscription runs low → API Mom downgrades gracefully
- Provider outage → API Mom fails over without agent changes
- Pricing changes → update the routing table in one place
LLM access comes in four tiers with fundamentally different cost models. The router must understand all four.
Tier 1: Free (Cloudflare Workers AI)
- Cost: Free within Workers usage limits
- Location: On-edge, same datacenter as your Worker
- Models: @cf/meta/llama-3.3-70b-instruct, @cf/google/gemma-3-12b-it, others
- Latency: Lowest possible, no egress
- Capabilities: Classification, boolean checks, simple extraction, formatting
- Not suited for: Complex reasoning, code generation, structured output with strict schemas
Tier 2: Free (OpenRouter Free Tier)
- Cost: Free (rate-limited, provider-funded)
- Available models (as of 2026): 50+ models including Llama 3.3 70B, Gemma 3, Mistral, DeepSeek, Qwen (see openrouter.ai/models?q=free)
- Capabilities: Wide range; some free models match paid mid-tier performance
- Caveat: Rate limits vary. API Mom must track health and rotate between free models when one is throttled.
Tier 3: Paid API (per-token)
- Claude (Anthropic): Haiku ($0.80/M) → Sonnet ($3/M) → Opus ($15/M)
- Gemini (Google): Flash ($0.15/M) → Flash Thinking → Pro ($1.25/M)
- GPT (OpenAI): 4o-mini ($0.15/M) → 4o ($2.5/M)
- When to use: Tasks requiring genuine reasoning, code understanding, or reliable structured output where free models are insufficient
Tier 4: Subscription Quota (via Runners)
- Cost: Zero API tokens, consumed from Pro/Max subscription
- Pro: 5× the API rate limits
- Max: 20× the API rate limits
- Mechanism: The local runner executes a Claude Code / Claude Agent SDK session under the subscription. This is not an API call; it is a separate execution channel entirely.
- When to use: All actual code-fix work. CI debugging. Complex multi-file changes. Anything that would cost significant API tokens.
The counterintuitive result: The most capable work (code changes via Claude Code) is the cheapest, because it runs through the subscription, not the API.
Given: capability_hint, complexity, context_size, current_budget, daily_budget_remaining
1. Can Workers AI handle this capability?
   AND context_size < Workers AI limit?
   → Route to Workers AI (free, fastest)
2. Is there a healthy free OpenRouter model for this capability?
   AND free model quality sufficient for complexity?
   → Route to OpenRouter free tier
3. Is this a code-fix execution task (not a reasoning call)?
   → Delegate to runner (subscription quota, zero API cost)
4. What is the cheapest paid model that satisfies the capability?
   AND daily_budget_remaining > estimated_cost?
   → Route to cheapest sufficient paid model
5. Budget exhausted for today?
   → Force-downgrade to free tier even if quality is reduced
   → Log degradation event
Priority: free → subscription → cheap paid → expensive paid. Budget exhaustion forces the request back to the free tiers.
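The five steps above can be sketched as a pure function. This is a minimal illustration, not API Mom's actual implementation: the `RoutingInput` fields, the `decideTier` name, and the `degraded` flag are all hypothetical, and the boolean inputs stand in for lookups (capability tables, provider health) that a real router would perform itself.

```typescript
// Hypothetical types for illustration; none of these names come from API Mom.
type Tier = "workers-ai" | "openrouter-free" | "runner" | "paid";

interface RoutingInput {
  workersAiCapable: boolean;      // step 1: capability fits a Workers AI model
  contextTokens: number;
  workersAiContextLimit: number;
  healthyFreeModel: boolean;      // step 2: an unthrottled free model exists
  freeQualitySufficient: boolean; // step 2: free quality is enough for the complexity
  isCodeExecution: boolean;       // step 3: code-fix work, not a reasoning call
  estimatedPaidCostUsd: number;   // step 4: cost of the cheapest sufficient paid model
  dailyBudgetRemainingUsd: number;
}

interface RoutingDecision {
  tier: Tier;
  degraded: boolean; // true when budget exhaustion forced a lower-quality tier
}

function decideTier(input: RoutingInput): RoutingDecision {
  // 1. Workers AI: free and fastest
  if (input.workersAiCapable && input.contextTokens < input.workersAiContextLimit) {
    return { tier: "workers-ai", degraded: false };
  }
  // 2. Healthy free OpenRouter model of sufficient quality
  if (input.healthyFreeModel && input.freeQualitySufficient) {
    return { tier: "openrouter-free", degraded: false };
  }
  // 3. Code execution is delegated to the runner (subscription quota)
  if (input.isCodeExecution) {
    return { tier: "runner", degraded: false };
  }
  // 4. Cheapest sufficient paid model, if the budget allows it
  if (input.dailyBudgetRemainingUsd > input.estimatedPaidCostUsd) {
    return { tier: "paid", degraded: false };
  }
  // 5. Budget exhausted: force a free tier and flag the degradation for logging
  return {
    tier: input.healthyFreeModel ? "openrouter-free" : "workers-ai",
    degraded: true,
  };
}
```

Keeping the decision pure (no I/O) makes each branch easy to unit-test; the caller is responsible for logging when `degraded` is true.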
Workers AI runs inside Cloudflare's network. No external API call, no egress, lowest latency.
// In API Mom Worker
async function routeToWorkersAI(
request: RouterRequest,
env: Env,
): Promise<RouterResponse> {
const model = selectWorkersAIModel(request.capability)
const response = await env.AI.run(model, {
messages: [
{ role: 'system', content: request.system ?? '' },
{ role: 'user', content: request.prompt },
],
max_tokens: 512,
})
return {
content: response.response ?? '',
model: model,
tier: 'workers-ai',
cost_usd: 0,
}
}
function selectWorkersAIModel(capability: string): string {
// Larger model for reasoning, smaller for classification
if (capability === 'classification' || capability === 'triage') {
return '@cf/google/gemma-3-12b-it'
}
return '@cf/meta/llama-3.3-70b-instruct'
}
Capabilities best handled by Workers AI:
- Signal triage: "is this repo missing biome.json?" → boolean
- Classification: "what kind of failure is this CI log?"
- Simple extraction: "list the failing test names from this output"
- Formatting: "summarize this error in one sentence"
OpenRouter aggregates hundreds of models under a single OpenAI-compatible API. Many are free: provider-subsidized, rate-limited, but capable.
// OpenRouter is OpenAI-compatible, so use the OpenAI provider with a custom base URL
import { createOpenAI } from '@ai-sdk/openai'
function createOpenRouterProvider(env: Env, model: string) {
return createOpenAI({
apiKey: env.OPENROUTER_API_KEY,
baseURL: 'https://openrouter.ai/api/v1',
headers: {
'HTTP-Referer': 'https://api-mom.workers.dev',
'X-Title': 'API Mom Router',
},
})(model)
}
// Free models to rotate through (check openrouter.ai/models?q=free for current list)
const FREE_MODELS = [
'meta-llama/llama-3.3-70b-instruct:free',
'google/gemma-3-12b-it:free',
'deepseek/deepseek-r1:free',
'mistralai/mistral-7b-instruct:free',
'qwen/qwen-2.5-72b-instruct:free',
]
API Mom must track which free models are currently healthy. When one hits its rate limit, rotate to the next. When all are throttled, escalate to paid tier.
This is issue garywu/api-mom#44 (free model routing: auto-discovery, fallback chain, tier tracking).
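The rotation described above can be sketched with a small in-memory tracker. This is an assumption-laden illustration: the `FreeModelRotation` class and its method names are hypothetical, and the article's design persists health in a `provider_health` table rather than in memory.

```typescript
// Hypothetical in-memory tracker; a persistent store would replace the Map.
class FreeModelRotation {
  private readonly models: readonly string[];
  // model id -> epoch ms until which the model is considered throttled
  private readonly throttledUntil = new Map<string, number>();

  constructor(models: readonly string[]) {
    this.models = models;
  }

  // Record a 429: skip this model until the cooldown has elapsed.
  markThrottled(model: string, now: number, cooldownMs: number): void {
    this.throttledUntil.set(model, now + cooldownMs);
  }

  // First model (in list order) whose cooldown has expired.
  // Returns undefined when every free model is throttled, which is the
  // signal to escalate to the paid tier.
  next(now: number): string | undefined {
    return this.models.find((m) => (this.throttledUntil.get(m) ?? 0) <= now);
  }
}
```

Passing `now` explicitly instead of calling `Date.now()` inside the class keeps the rotation deterministic in tests.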
When free tiers are insufficient, route to the cheapest model that satisfies capability requirements.
const PAID_ROUTES: Record<string, PaidRoute[]> = {
// Ordered by preference within each capability: cheapest-first,
// except 'critical-decision', which tries the strongest model first
'structured-reasoning': [
{ provider: 'google', model: 'gemini-2.0-flash', cost_per_m_input: 0.10 },
{ provider: 'anthropic', model: 'claude-haiku-4-5-20251001', cost_per_m_input: 0.80 },
{ provider: 'anthropic', model: 'claude-sonnet-4-6', cost_per_m_input: 3.00 },
],
'code-understanding': [
{ provider: 'anthropic', model: 'claude-sonnet-4-6', cost_per_m_input: 3.00 },
{ provider: 'anthropic', model: 'claude-opus-4-6', cost_per_m_input: 15.0 },
],
'long-context': [
{ provider: 'google', model: 'gemini-2.5-pro', cost_per_m_input: 1.25 },
{ provider: 'anthropic', model: 'claude-sonnet-4-6', cost_per_m_input: 3.00 },
],
'critical-decision': [
{ provider: 'anthropic', model: 'claude-opus-4-6', cost_per_m_input: 15.0 },
{ provider: 'google', model: 'gemini-2.5-pro', cost_per_m_input: 1.25 },
],
}
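One way to pick from such a table is to sort by input price and take the first route whose estimated cost fits the remaining budget. The sketch below is illustrative only: `PaidRoute` mirrors the entry shape used above, while `estimateInputCostUsd` and `pickAffordableRoute` are hypothetical helpers, and the estimate deliberately ignores output tokens.

```typescript
// Mirrors the entry shape of the routing table above.
interface PaidRoute {
  provider: string;
  model: string;
  cost_per_m_input: number; // USD per million input tokens
}

// Input-only estimate; output tokens are ignored in this sketch.
function estimateInputCostUsd(route: PaidRoute, inputTokens: number): number {
  return (inputTokens / 1_000_000) * route.cost_per_m_input;
}

// Cheapest route that fits the remaining budget, or undefined if none does.
function pickAffordableRoute(
  routes: readonly PaidRoute[],
  inputTokens: number,
  budgetRemainingUsd: number,
): PaidRoute | undefined {
  return [...routes] // copy so the caller's table order is left untouched
    .sort((a, b) => a.cost_per_m_input - b.cost_per_m_input)
    .find((r) => estimateInputCostUsd(r, inputTokens) <= budgetRemainingUsd);
}
```

For example, a 2,000-token prompt on a $0.80/M model is estimated at about $0.0016, so it fits a $0.01 budget but not a $0.001 one.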
This tier is architecturally different from the others. It is not an LLM API call. It is a job delegation.
When an agent needs code-level work done (fix a TypeScript error, debug a failing CI run, create a PR), it does not call an LLM directly. It submits a job to the dispatcher, which routes it to the local runner, which executes a full Claude Agent SDK session on a machine running under a Pro or Max subscription.
Agent (Prime DO) → Dispatcher → Local Runner → Claude Agent SDK session
                                                ↓
                                                Pro: 5× rate limits
                                                Max: 20× rate limits
                                                Cost: $0 API tokens
The key properties:
- The agent does not wait for this. It submits a job and hibernates.
- The runner reports outcome back to the dispatcher, which notifies the agent.
- The Claude session runs to completion under the subscription, handling retries internally.
- No API key needed on the runner; the subscription is the authentication.
What this means for routing: API Mom should never be involved in runner execution. The routing decision is made upstream: if the task requires code execution, delegate to the dispatcher, not to an LLM API call.
// In Prime's reasoning cycle
async function executeDecision(action: Action, env: Env) {
if (action.type === 'fix-code' || action.type === 'debug-ci') {
// Does NOT go through API Mom; goes to the dispatcher
await submitDispatcherJob(action, env)
return
}
// Everything else goes through API Mom
await routerCall(action, env)
}
Putting it all together, here is what goes where:
| Task | Capability | Tier | Model/Channel | Cost |
|---|---|---|---|---|
| "Does this repo have biome.json?" | triage | Workers AI | Gemma 3 12B | Free |
| "What's the CI failure type?" | classification | Workers AI | Gemma 3 12B | Free |
| "Summarize these 3 failures" | summarization | OpenRouter free | Llama/Mistral | Free |
| "Generate run sheet for 40 repos" | structured-reasoning | Paid API | Haiku / Gemini Flash | ~$0.01 |
| "Re-plan after 3 failures" | structured-reasoning | Paid API | Sonnet | ~$0.05 |
| "Fix TypeScript error in this file" | code-execution | Runner (subscription) | Claude Agent SDK | $0 API |
| "Debug failing CI across 5 files" | code-execution | Runner (subscription) | Claude Agent SDK | $0 API |
| "Critical architecture decision" | critical-decision | Paid API | Opus / Gemini Pro | ~$0.20 |
API Mom tracks daily spend per project. When budget is running low, it automatically degrades:
async function selectTier(
request: RouterRequest,
db: D1Database,
): Promise<Tier> {
const { spent_today, daily_limit } = await getDailyBudget(request.project_id, db)
const remaining_fraction = 1 - spent_today / daily_limit
// Plenty of budget: use best fit
if (remaining_fraction > 0.5) return selectBestFit(request)
// Getting low: force to free tiers
if (remaining_fraction > 0.1) return tryFreeTiers(request)
// Almost out: Workers AI or nothing
if (remaining_fraction > 0) return 'workers-ai'
// Exhausted: reject with 429, log event
throw new BudgetExhaustedException(request.project_id)
}
Agents receive an X-Budget-Remaining-Fraction header on every response. They can check this to decide whether to schedule non-urgent work for tomorrow.
This is the hierarchical budget system in garywu/api-mom#18.
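On the agent side, checking that header might look like the sketch below. The `shouldDeferWork` helper and its default threshold of 0.2 are assumptions made for illustration; only the header name comes from the article.

```typescript
// Hypothetical agent-side helper: decide whether non-urgent work should wait.
function shouldDeferWork(
  headers: Headers,
  urgent: boolean,
  threshold: number = 0.2, // assumed cut-off, tune per project
): boolean {
  if (urgent) return false; // urgent work always runs

  const raw = headers.get("X-Budget-Remaining-Fraction");
  // A missing or malformed header should not block work.
  if (raw === null || raw.trim() === "") return false;
  const fraction = Number(raw);
  if (!Number.isFinite(fraction)) return false;

  return fraction < threshold;
}
```

An agent could call this with the headers of its most recent router response and, when it returns true, reschedule the task for after the daily budget resets.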
API Mom applies the same control plane / data plane split it serves, recursively, to itself.
Router DO (data plane, hot path)
  Every request → SQLite lookup → proxy → record
  No LLM calls. Ever.
  Latency: sub-millisecond routing decision
Optimizer DO (control plane, cold path)
  Wakes on alarm (hourly/daily)
  Reads Router DO metrics → one LLM call → updates routing table
  Hibernates. Zero cost between wakes.
Router DO (hot path)
The Router DO handles every incoming request. Its only job: read the routing table, pick a tier, forward the request, record the outcome. No reasoning, no LLM, no external calls beyond the forwarded request itself.
export class RouterDO extends Agent<Env, RouterState> {
async onStart() {
// Routing table lives in DO SQLite; in-memory fast on a warm DO
this.sql`CREATE TABLE IF NOT EXISTS routing_table (
capability TEXT NOT NULL,
complexity TEXT NOT NULL,
tier TEXT NOT NULL, -- 'workers-ai' | 'openrouter-free' | 'paid'
provider TEXT NOT NULL,
model TEXT NOT NULL,
priority INTEGER NOT NULL DEFAULT 0,
updated_at INTEGER NOT NULL
)`
this.sql`CREATE TABLE IF NOT EXISTS provider_health (
model TEXT PRIMARY KEY,
status TEXT NOT NULL, -- 'healthy' | 'throttled' | 'down'
throttled_until INTEGER,
error_count INTEGER DEFAULT 0,
last_checked INTEGER NOT NULL
)`
this.sql`CREATE TABLE IF NOT EXISTS call_ledger (
id TEXT PRIMARY KEY,
project_id TEXT NOT NULL,
capability TEXT NOT NULL,
tier TEXT NOT NULL,
model TEXT NOT NULL,
cost_usd REAL NOT NULL DEFAULT 0,
latency_ms INTEGER,
status TEXT NOT NULL, -- 'success' | 'error' | 'throttled'
is_fallback INTEGER NOT NULL DEFAULT 0, -- 1 when the pass-through route served the call
called_at INTEGER NOT NULL
)`
}
async onRequest(request: Request): Promise<Response> {
if (request.method !== 'POST') return new Response('not found', { status: 404 })
const capability = request.headers.get('X-Capability') ?? 'general'
const complexity = request.headers.get('X-Complexity') ?? 'standard'
const projectId = request.headers.get('X-Project-Id') ?? 'default'
// Pure SQLite lookup: no LLM, no external call
const route = this.selectRoute(capability, complexity)
const start = Date.now()
try {
const response = await this.forwardRequest(request, route)
const latency = Date.now() - start
this.recordOutcome({ projectId, capability, route, latency, status: 'success' })
return new Response(response.body, {
status: response.status,
headers: {
...Object.fromEntries(response.headers),
'X-Tier-Used': route.tier,
'X-Model-Used': route.model,
'X-Latency-Ms': String(latency),
},
})
} catch (err) {
this.markProviderUnhealthy(route.model)
throw err
}
}
private selectRoute(capability: string, complexity: string): Route {
// 1. Try the routing table (smart path)
try {
const routes = [...this.sql`
SELECT * FROM routing_table
WHERE capability = ${capability} AND complexity = ${complexity}
AND model NOT IN (
SELECT model FROM provider_health
WHERE status = 'throttled' AND throttled_until > ${Date.now()}
)
ORDER BY priority DESC
`]
if (routes.length > 0) return routes[0] as Route
} catch {
// Routing table corrupted or missing; fall through to dumb pass-through
this.recordEvent('routing-table-error')
}
// 2. Dumb pass-through: always present, never depends on the routing table
// Configured via env vars, not routing table. Always works.
return this.passthroughRoute()
}
private passthroughRoute(): Route {
// Hard-coded fallback. Reads from env, never from SQLite.
// This path must work even if the entire DO SQLite is corrupted.
//
// Primary fallback: Workers AI (on-edge, no external API call, no key needed).
// If Workers AI fails, Cloudflare itself is down, so the Worker is also down and unreachable.
// Secondary fallback (last resort): Anthropic API key from env.
return {
tier: 'workers-ai',
provider: 'cloudflare',
model: '@cf/meta/llama-3.3-70b-instruct',
priority: -1,
is_fallback: true,
}
}
private lastResortRoute(): Route {
// Only reached if Workers AI itself fails, which is extremely rare.
// Anthropic API key configured in wrangler.jsonc secrets.
return {
tier: 'paid',
provider: 'anthropic',
model: this.env.LAST_RESORT_MODEL ?? 'claude-haiku-4-5-20251001',
priority: -2,
is_fallback: true,
is_last_resort: true,
}
}
}
Optimizer DO (cold path)
The Optimizer wakes on alarm, reads the Router DO's performance data, and updates the routing table. It makes exactly one LLM call per cycle, routed through the Router DO itself, which sends it to the free tier.
export class OptimizerDO extends Agent<Env, OptimizerState> {
async onStart() {
// Wake every hour to review performance
await this.schedule(3600, 'optimize', {})
}
async optimize(_: unknown) {
// Read Router DO's performance data
const routerDO = this.env.ROUTER_DO.get(
this.env.ROUTER_DO.idFromName('singleton')
)
const metrics = await routerDO.fetch('https://router/metrics').then(r => r.json())
// Discover new free models from OpenRouter
const freeModels = await this.discoverFreeModels()
// ONE LLM call, routed through Router DO (hits free tier)
// The Optimizer uses the Router it's optimizing
const { object: updates } = await generateObject({
model: this.createRouterModel('triage', 'trivial', 'optimizer'),
schema: RoutingUpdateSchema,
prompt: buildOptimizerPrompt(metrics, freeModels),
})
// Apply updates to Router DO's routing table
for (const update of updates.changes) {
await routerDO.fetch('https://router/routing-table', {
method: 'PATCH',
body: JSON.stringify(update),
})
}
// Schedule next optimization
await this.schedule(3600, 'optimize', {})
}
}
Fallback Is Not Optional
The routing table is an optimization layer. The dumb pass-through is the guaranteed baseline. They are not the same thing and one must never depend on the other.
Smart path (routing table): present when Optimizer has run
Dumb pass-through (env vars): always present, always works
If routing table empty: dumb pass-through
If routing table corrupted: dumb pass-through
If all providers throttled: dumb pass-through
If Optimizer DO is down: dumb pass-through
If DO SQLite fails: dumb pass-through
The fallback is configured in wrangler.jsonc (env vars), never in the routing table. This is load-bearing: if the routing table is the problem, it cannot also be the solution.
// wrangler.jsonc: fallback hierarchy, none of which depends on the routing table
{
"ai": { "binding": "AI" }, // Workers AI binding: primary fallback, always present
"vars": {
"LAST_RESORT_MODEL": "claude-haiku-4-5-20251001" // only if Workers AI fails
},
"secrets": ["ANTHROPIC_API_KEY"] // last-resort key only, not the primary path
}
Primary fallback: Workers AI. On-edge, no external call, no API key, co-located with the Router DO. If Workers AI fails, Cloudflare itself is down, which means the Worker hosting the Router is also down, so callers can't reach the fallback anyway. Workers AI failing while the Router stays reachable is not a real failure mode.
Last resort: Anthropic API. It exists only as a belt-and-suspenders measure and, in practice, is never used.
Every fallback use is logged. The Optimizer reads fallback frequency as a signal: a high fallback rate means the routing table has a problem. It investigates and fixes. But during the investigation, callers never notice: they keep getting responses.
This is the same principle as a circuit breaker in distributed systems: the smart path is tried first; on failure, the system degrades gracefully to a known-good baseline rather than failing entirely.
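The "smart path first, known-good baseline on failure" behavior can be sketched as a small wrapper. Note that this is a plain fallback wrapper rather than a full circuit breaker (it keeps no open/closed state), and the `resolveRoute` name, the simplified `Route` shape, and the reason strings are all hypothetical.

```typescript
// Simplified route shape for illustration.
interface Route {
  tier: string;
  model: string;
  is_fallback: boolean;
}

// Try the smart lookup; on an empty result or any thrown error, return the
// fallback and report why, so fallback frequency can be monitored.
function resolveRoute(
  smartLookup: () => Route | undefined,
  fallback: Route,
  onFallback: (reason: string) => void,
): Route {
  try {
    const route = smartLookup();
    if (route !== undefined) return route;
    onFallback("no-route"); // table empty, or every candidate throttled
  } catch {
    onFallback("lookup-error"); // table corrupted or storage failure
  }
  return fallback;
}
```

The important design property is that `fallback` is passed in as a plain value: it is never read from the same storage that `smartLookup` depends on.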
The Recursive Property
The Optimizer uses the Router to make its own LLM calls. The Router routes those calls to Workers AI or an OpenRouter free model. The system that improves the routing table uses the routing table to decide how to improve the routing table.
This means:
- The Optimizer is always cheap β it routes to the cheapest available tier
- As the routing table improves, the Optimizer's own costs decrease
- The system self-bootstraps toward lower cost over time
Optimizer wakes → calls Router → Router selects Workers AI (free)
  → LLM analyzes metrics → suggests routing table changes
  → Router now routes more calls to Workers AI
  → Next Optimizer wake: Router is better → Optimizer costs even less
The routing endpoint is OpenAI-compatible (issue garywu/api-mom#53) so any AI SDK provider can point at it.
// POST /v1/chat/completions: unified routing endpoint
app.post('/v1/chat/completions', async (c) => {
const body = await c.req.json()
const capabilityHint = c.req.header('X-Capability') ?? 'general'
const complexity = c.req.header('X-Complexity') ?? 'standard'
const projectId = c.req.header('X-Project-Id') ?? 'default'
const functionName = c.req.header('X-Function') ?? 'unknown'
// Select tier
const tier = await selectTier({ capability: capabilityHint, complexity, project_id: projectId }, c.env.DB)
// Route to tier
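let response: LLMResponse
switch (tier) {
case 'workers-ai':
response = await routeToWorkersAI(body, c.env)
break
case 'openrouter-free':
response = await routeToOpenRouterFree(body, c.env)
break
case 'paid': // same tier name the routing and health tables use
response = await routeToPaidAPI(body, capabilityHint, c.env)
break
default:
// Unknown tier: fail loudly instead of reading an unassigned response
throw new Error(`Unsupported tier: ${tier}`)
}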
// Record cost
await recordCost({ project_id: projectId, function: functionName, tier, cost_usd: response.cost_usd }, c.env.DB)
return c.json(formatAsOpenAIResponse(response), 200, {
'X-Tier-Used': tier,
'X-Model-Used': response.model,
'X-Cost-Usd': response.cost_usd.toFixed(6),
'X-Budget-Remaining-Fraction': String(await getBudgetFraction(projectId, c.env.DB)), // header values must be strings
})
})
The fallback pass-through is the first thing to implement and the first thing to test. Every other feature is built on top of it. If the fallback breaks, nothing catches a routing failure: callers get errors instead of degraded-but-working responses.
// src/router.test.ts
import { describe, it, expect, beforeEach } from 'vitest'
import { RouterDO } from './router-do'
import { createMiniflareEnv } from './test-helpers'
describe('RouterDO fallback pass-through', () => {
let env: Env
beforeEach(async () => {
env = await createMiniflareEnv({
LAST_RESORT_MODEL: 'claude-haiku-4-5-20251001', // same var name as wrangler.jsonc
ANTHROPIC_API_KEY: 'test-key',
})
})
it('uses Workers AI fallback when routing table is empty', async () => {
const router = new RouterDO(env)
// No routing table entries: DO SQLite is fresh/empty
const route = router.selectRoute('structured-reasoning', 'standard')
expect(route.is_fallback).toBe(true)
expect(route.tier).toBe('workers-ai') // Cloudflare-native, no egress
expect(route.model).toBe('@cf/meta/llama-3.3-70b-instruct')
expect(route.provider).toBe('cloudflare')
})
it('uses fallback when all providers are throttled', async () => {
const router = new RouterDO(env)
// Seed routing table with one route, but mark it throttled
await router.seedRoutingTable([
{ capability: 'structured-reasoning', complexity: 'standard',
tier: 'openrouter-free', model: 'meta-llama/llama-3.3-70b-instruct:free' }
])
await router.markProviderThrottled('meta-llama/llama-3.3-70b-instruct:free', Date.now() + 3600_000)
const route = router.selectRoute('structured-reasoning', 'standard')
expect(route.is_fallback).toBe(true)
})
it('uses fallback when routing table is corrupted', async () => {
const router = new RouterDO(env)
// Corrupt the routing table
await router.corruptRoutingTableForTest()
// Must not throw; must return fallback
const route = router.selectRoute('structured-reasoning', 'standard')
expect(route.is_fallback).toBe(true)
})
it('uses fallback when Optimizer DO has never run', async () => {
// Fresh deploy: Optimizer hasn't touched the routing table yet
const router = new RouterDO(env)
const route = router.selectRoute('triage', 'trivial')
// Even triage, which should go to Workers AI eventually, falls back on first run
expect(route.is_fallback).toBe(true)
})
it('records every fallback use in call_ledger', async () => {
const router = new RouterDO(env)
router.selectRoute('structured-reasoning', 'standard')
const events = [...router.sql`SELECT * FROM call_ledger WHERE is_fallback = 1`]
expect(events).toHaveLength(1)
expect(events[0].capability).toBe('structured-reasoning')
})
it('smart route takes over once routing table is populated', async () => {
const router = new RouterDO(env)
await router.seedRoutingTable([
{ capability: 'triage', complexity: 'trivial',
tier: 'workers-ai', model: '@cf/google/gemma-3-12b-it', priority: 10 }
])
const route = router.selectRoute('triage', 'trivial')
expect(route.is_fallback).toBe(false)
expect(route.tier).toBe('workers-ai')
expect(route.model).toBe('@cf/google/gemma-3-12b-it')
})
})
These six tests define the contract: the fallback works in every failure mode, it logs every use, and smart routes take over correctly once the Optimizer populates the table. All six must pass before any routing feature ships.
See garywu/api-mom#106: implement and test fallback pass-through first.
From an agent (Prime DO or any Cloudflare Worker), routing through API Mom is one line:
// In wrangler.jsonc: point the Anthropic provider at API Mom
// The agent never knows which model ran
import { createAnthropic } from '@ai-sdk/anthropic'
import { generateObject } from 'ai'
function createRouterModel(env: Env, capability: string, complexity: string, functionName: string) {
const proxiedFetch = async (url: RequestInfo | URL, init?: RequestInit) => {
const headers = new Headers(init?.headers)
headers.set('X-Api-Key', env.API_MOM_KEY)
headers.set('X-Capability', capability)
headers.set('X-Complexity', complexity)
headers.set('X-Function', functionName)
headers.set('X-Project-Id', env.PROJECT_ID)
return fetch(url, { ...init, headers })
}
// Points at API Mom; which model actually runs is API Mom's decision
return createAnthropic({
apiKey: 'routed',
baseURL: `${env.API_MOM_URL}/v1`,
fetch: proxiedFetch,
})('claude-sonnet-4-6') // This model ID is ignored; API Mom overrides it
}
// Usage in Prime's wake cycle
const model = createRouterModel(env, 'structured-reasoning', 'standard', 'generate-run-sheet')
const { object: runSheet } = await generateObject({
model,
schema: RunSheetSchema,
prompt: buildRunSheetPrompt(context),
})
The agent passes capability and complexity. The model name in createAnthropic() is a placeholder; API Mom makes the actual routing decision.
OpenRouter is the primary free model source. It provides an OpenAI-compatible API with 50+ free models.
// API Mom's OpenRouter free tier handler
async function routeToOpenRouterFree(
body: ChatCompletionRequest,
env: Env,
): Promise<LLMResponse> {
const healthyModels = await getHealthyFreeModels(env.DB)
for (const model of healthyModels) {
try {
const res = await fetch('https://openrouter.ai/api/v1/chat/completions', {
method: 'POST',
headers: {
'Authorization': `Bearer ${env.OPENROUTER_API_KEY}`,
'HTTP-Referer': 'https://api-mom.workers.dev',
'X-Title': 'API Mom',
'Content-Type': 'application/json',
},
body: JSON.stringify({ ...body, model }),
})
if (res.status === 429) {
await markModelThrottled(model, env.DB)
continue // try next free model
}
if (!res.ok) {
await markModelUnhealthy(model, env.DB)
continue
}
const data = await res.json() as OpenAIResponse
await markModelHealthy(model, env.DB)
return { content: data.choices[0].message.content, model, tier: 'openrouter-free', cost_usd: 0 }
} catch {
await markModelUnhealthy(model, env.DB)
}
}
// All free models exhausted: escalate
throw new FreeTierExhaustedException()
}
API Mom periodically rediscovers healthy free models (issue garywu/api-mom#48).
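The filtering half of that discovery step could look like the sketch below. It assumes each entry in OpenRouter's model list carries an `id` and a `pricing` object whose `prompt` and `completion` prices are numeric strings; that shape should be verified against the current OpenRouter API reference. The `ModelEntry` type and `filterFreeModels` name are hypothetical.

```typescript
// Assumed (unverified) shape of one entry in the provider's model list.
interface ModelEntry {
  id: string;
  pricing?: { prompt?: string; completion?: string };
}

// True only for a non-empty numeric string that parses to exactly zero.
function isZeroPrice(value: string | undefined): boolean {
  if (value === undefined || value.trim() === "") return false;
  return Number(value) === 0;
}

// Keep the ids of models whose prompt and completion prices are both zero.
function filterFreeModels(entries: readonly ModelEntry[]): string[] {
  return entries
    .filter((m) => isZeroPrice(m.pricing?.prompt) && isZeroPrice(m.pricing?.completion))
    .map((m) => m.id);
}
```

Entries with missing or empty pricing are treated as not free, so an incomplete listing can never promote a paid model into the free rotation.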
-- API Mom D1 schema
CREATE TABLE provider_health (
model TEXT PRIMARY KEY,
tier TEXT NOT NULL, -- 'workers-ai' | 'openrouter-free' | 'paid'
status TEXT NOT NULL, -- 'healthy' | 'throttled' | 'degraded' | 'down'
throttled_until INTEGER, -- epoch ms: when to retry
last_checked INTEGER NOT NULL,
error_count INTEGER DEFAULT 0
);
CREATE TABLE api_calls_ledger (
id TEXT PRIMARY KEY,
project_id TEXT NOT NULL,
function_name TEXT NOT NULL,
tier TEXT NOT NULL,
model TEXT NOT NULL,
input_tokens INTEGER,
output_tokens INTEGER,
cost_usd REAL NOT NULL DEFAULT 0,
called_at INTEGER NOT NULL
);
CREATE TABLE daily_budgets (
project_id TEXT PRIMARY KEY,
daily_limit_usd REAL NOT NULL,
spent_today_usd REAL NOT NULL DEFAULT 0,
reset_at INTEGER NOT NULL -- next midnight UTC
);
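The `reset_at` column implies a rollover step whenever a charge arrives after midnight UTC. A minimal sketch of that logic, operating on an in-memory copy of a `daily_budgets` row, is shown below; `nextMidnightUtc` and `applyCharge` are hypothetical helpers, and a real implementation would perform the update inside a D1 transaction.

```typescript
// Mirrors the columns of the daily_budgets table (project_id omitted).
interface DailyBudget {
  daily_limit_usd: number;
  spent_today_usd: number;
  reset_at: number; // epoch ms of the next midnight UTC
}

// Epoch ms of the first midnight UTC strictly after `now`.
function nextMidnightUtc(now: number): number {
  const d = new Date(now);
  // A day-of-month overflow rolls into the next month or year automatically.
  return Date.UTC(d.getUTCFullYear(), d.getUTCMonth(), d.getUTCDate() + 1);
}

// Reset the counter if the reset time has passed, then add the new charge.
function applyCharge(budget: DailyBudget, costUsd: number, now: number): DailyBudget {
  const current =
    now >= budget.reset_at
      ? { ...budget, spent_today_usd: 0, reset_at: nextMidnightUtc(now) }
      : budget;
  return { ...current, spent_today_usd: current.spent_today_usd + costUsd };
}
```

Returning a new object instead of mutating the row keeps the function easy to test and leaves persistence to the caller.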
| Decision | Where it lives | Why |
|---|---|---|
| Which model to use | API Mom | Provider-agnostic, changes with market |
| Which free models are healthy | API Mom | Shared across all agents, needs central tracking |
| Whether budget allows a call | API Mom | Enforced centrally, not per-agent |
| What capability is needed | Agent | Agent knows its own task |
| Whether to use runner vs API | Agent | Architectural decision, not routing |
| How to interpret the response | Agent | Domain-specific |
| Retry on structured output failure | Vercel AI SDK | Handled by generateObject() internally |
The agent describes the what and the how hard. API Mom decides the which model and how much it costs.
The routing hierarchy produces a remarkable property: the system stays intelligent at essentially zero cost during idle periods.
When Prime wakes on a 6-hour alarm and needs to triage 40 repos:
- Signal checks (boolean: "does this file exist?") → Workers AI → $0
- Failure classification ("what kind of CI error is this?") → OpenRouter free → $0
- Run sheet generation (complex structured reasoning) → Claude Haiku → ~$0.01
- Actual code fix → local runner, subscription → $0 API tokens
A full org maintenance cycle (40 repos checked, 5 jobs dispatched) might cost $0.01-0.05 in API tokens, with the expensive work (code execution) consuming zero because it runs under subscription.
This is not "cheap AI." It is intelligence that scales to zero. The system is always present, always reasoning, always acting, but consumes nothing when idle and almost nothing when active. The cost curve looks like serverless compute, not a running LLM instance.
Cost vs Activity:
Idle (no alarms firing): $0.00 / hour
Active triage (Workers AI): $0.00 / cycle
Active planning (free + Haiku): $0.01 / cycle
Active execution (runner): $0.00 API / session
Heavy re-planning (Sonnet): $0.05 / cycle (rare)
Critical decision (Opus): $0.20 / decision (very rare)
Compare to a naive implementation that calls Claude Sonnet for every decision: $3/M tokens × continuous operation = hundreds of dollars per month for a 40-repo org. The routing tier reduces this by 95%+ while delivering the same outcomes.
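To make the comparison concrete, here is a back-of-the-envelope calculation for a single cycle. The workload (40 repos at 2,000 input tokens each) is an assumption chosen for illustration, output tokens are ignored, and only the $3/M Sonnet price and the $0.01 routed figure come from the article.

```typescript
const SONNET_INPUT_USD_PER_M = 3.0; // Tier 3 input price quoted in the article

// Input-token cost in USD for a given per-million price.
function inputCostUsd(tokens: number, usdPerMillion: number): number {
  return (tokens / 1_000_000) * usdPerMillion;
}

// Assumed workload: 40 repos, 2,000 input tokens per repo, all sent to Sonnet.
const naivePerCycle = inputCostUsd(40 * 2_000, SONNET_INPUT_USD_PER_M);

// Low end of the article's routed estimate for the same cycle.
const routedPerCycle = 0.01;

// Fraction of the naive cost that routing avoids.
const reduction = 1 - routedPerCycle / naivePerCycle;

console.log(`naive: $${naivePerCycle.toFixed(2)} per cycle`);
console.log(`reduction: ${(reduction * 100).toFixed(1)}%`);
```

Under these assumptions the naive cycle costs roughly $0.24 and routing saves a little under 96%. The result is sensitive to the inputs: counting output tokens raises the naive cost, while a routed cycle at the $0.05 end of the range lowers the saving.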
The routing logic described in this article is, in itself, a product.
Every team building AI agents faces the same problem: LLM costs are unpredictable, provider choices are fragmented, and the free capacity that exists (Workers AI, OpenRouter free tier, subscription quota) is invisible unless you build infrastructure to use it.
API Mom's router is that infrastructure. As a service:
- You bring: your agent code, your use case
- API Mom brings: routing intelligence, free tier access, provider health tracking, cost attribution, budget enforcement
- You pay: a fraction of what you'd pay calling providers directly
- API Mom keeps: the margin between what you pay and what the optimal routing costs
The value proposition: pay API Mom less than you'd pay Anthropic directly, and get smarter routing as a bonus.
This works because:
- Most tasks don't need Claude Sonnet; they just default to it because that's what developers hardcode
- Free capacity exists at scale that individual teams can't effectively aggregate
- Subscription quota is systematically under-utilized; routing into it turns idle capacity into value
- Provider health and fallback chains require operational infrastructure most teams won't build
The product positioning: "Pay less for the same intelligence. Free tiers and subscription quota you already have, used automatically."
This is issue garywu/atlas#369 (Atlas commercial thesis): the same intelligence optimization logic that makes Prime cheap is the commercial moat for API Mom as a product.
Cloudflare
- Cloudflare Workers AI - On-edge model inference. Free within Workers limits. env.AI.run(model, messages).
- Workers AI: Available models - Full list. Look for text-generation type models.
- Cloudflare D1 - Provider health table, cost ledger, daily budgets.
- Cloudflare Agents SDK - agents package. The container that calls API Mom.
OpenRouter
- OpenRouter - Aggregated model API. OpenAI-compatible endpoint.
- OpenRouter: Free models - Current list of zero-cost models with rate limits.
- OpenRouter: API reference - Authentication, rate limits, error codes.
AI SDKs
- Vercel AI SDK - generateObject(), generateText(), provider abstraction. Custom fetch option enables transparent routing through API Mom.
- @ai-sdk/openai - Used to talk to OpenRouter (OpenAI-compatible endpoint).
- @ai-sdk/anthropic - Claude provider. Pointed at API Mom's /v1/anthropic proxy.
- Claude Agent SDK - @anthropic-ai/claude-agent-sdk. Subscription-based execution. Used by local runner, not by Prime directly.
- Anthropic API: Rate limits - Pro: 5× limits. Max: 20× limits vs base API.
Related Articles (garywu)
- garywu/three-layer-ai-agent-architecture - Container / Brain / Wallet. API Mom as cost proxy. Origin of the three-layer pattern this article extends.
- garywu/org-prime-agent-architecture - Prime agents that call the router. The Tier 4 runner delegation pattern in full context.
- garywu/cloudflare-durable-objects-patterns - DO hibernation, SQLite, alarms. API Mom's own infrastructure.
- garywu/cloudflare-autonomous-pipeline - Dispatcher scheduling, D1 at scale.
Related Issues (garywu/api-mom)
- #44 free model routing - Auto-discovery, fallback chain, tier tracking
- #48 periodic free provider discovery - Quarterly sweep for new free tiers
- #53 unified /v1/chat/completions endpoint - OpenAI-compatible routing endpoint
- #18 hierarchical budget system - Project → provider → call-type spend caps