Most production AI agents are a tangled mess — LLM calls mixed with state management, API keys scattered across services, cost tracking bolted on as an afterthought (if at all). The result: $47 in surprise Gemini bills, 9x cost undercounting, and zero visibility into what your agents are actually spending.
This article presents a clean separation: three composable layers that handle agent runtime, LLM interaction, and API cost control independently. Built on Cloudflare Workers, the Vercel AI SDK, and a centralized API proxy, this architecture lets you build agents that are stateful, intelligent, and financially observable — without coupling any of those concerns together.
What you’ll learn:
- How to separate agent state (Container) from LLM logic (Brain) from API metering (Wallet)
- The Cloudflare Agents SDK as a stateful runtime for millions of concurrent agents
- The Vercel AI SDK as a provider-agnostic LLM interface with structured output and tool calling
- The custom fetch pattern that routes all LLM calls through a cost-tracking proxy
- How a $47 surprise bill in 5 days exposed why the Wallet layer isn’t optional
- When you need all three layers vs. when you can skip one
- How this compares to LangChain, OpenAI Agents SDK, AWS Bedrock Agents, and LLM proxy solutions
Table of Contents
- The Problem
- Architecture Overview
- Layer 1: Agent Runtime (Container)
- Layer 2: LLM Interface (Brain)
- Layer 3: API Proxy / Metering (Wallet)
- The Composition Pattern
- Patterns
- Small Examples
- Example 1: Minimal Agent with State Sync
- Example 2: Custom Fetch Logger
- Example 3: Schema-First Tool Definition
- Example 4: Thinking Budget Control
- Example 5: Agent with Scheduled Self-Improvement
- Example 6: Idempotent Queue Consumer with Cost Tracking
- Example 7: Provider Fallback Chain
- Example 8: Real-Time Cost Dashboard via WebSocket
- Comparisons
- Anti-Patterns
- When to Skip a Layer
- Production Checklist
- Advanced: The Cloudflare AI Gateway Bridge
- Cost Reference
- Summary
- References
The Problem
Here is the typical evolution of an AI agent project:
Week 1: You write a Worker that calls the Gemini API with fetch(). It works. You parse the JSON response manually. You hardcode the API key as an environment variable. Cost tracking? “We’ll add that later.”
Week 3: You need structured output, so you write a Zod schema and validate the response yourself. Sometimes the LLM returns malformed JSON. You add retry logic. The retry logic has bugs. You’re now maintaining 200 lines of LLM interaction code that has nothing to do with your agent’s actual purpose.
Week 5: You need the agent to remember things between requests. You bolt on KV storage. Then you need WebSocket connections for real-time updates. Then scheduling. Then you realize KV doesn’t support SQL queries. You start wishing for a database. Your “simple agent” is now 1,500 lines of infrastructure code.
Week 8: Your Google Cloud bill arrives. $47 in 5 days. Your internal tracking shows $5. The 9x discrepancy? You forgot that Gemini’s thinking tokens cost $3.50/M — not the $0.60/M you used for regular output tokens. Six services hold their own API keys. Nobody knows which service spent what. You disable all cron jobs as an emergency measure.
This isn’t hypothetical. This is what happened across two production services in a real Cloudflare Workers deployment. The fix wasn’t “better monitoring” — it was architectural. Each concern (state, LLM interaction, cost control) needed its own layer with clear boundaries.
What changes if you get this right
| Concern | Tangled | Three-Layer |
|---|---|---|
| Agent state | KV hacks, lost on redeploy | Durable Object with SQLite, persists across restarts, deploys, and hibernation |
| LLM calls | Raw fetch(), manual JSON parsing | generateObject() with Zod, automatic retries |
| Provider switching | Rewrite all call sites | Change one import |
| Cost tracking | “Check the Google dashboard” | Per-call attribution with function-level tags |
| Cost control | Hope for the best | Daily spend limits, tier enforcement, automatic 429s |
| Concurrent agents | One singleton, maybe | Millions of independent instances, zero cost when idle |
| Real-time updates | Polling | WebSocket with automatic state sync |
Architecture Overview
The three-layer architecture separates concerns along natural boundaries:
┌─────────────────────────────────────────────────────────────┐
│ Your Application │
│ (Business logic, domain rules, what the agent actually does)│
└──────────────┬──────────────────────────────────────────────┘
│
┌──────────────▼──────────────────────────────────────────────┐
│ Layer 1: Agent Runtime (Container) │
│ Cloudflare Agents SDK · npm: agents │
│ │
│ Durable Object per agent instance │
│ Built-in SQLite · State sync · WebSocket │
│ Scheduling · Hibernation · Lifecycle hooks │
│ Cost when idle: $0 │
└──────────────┬──────────────────────────────────────────────┘
│
┌──────────────▼──────────────────────────────────────────────┐
│ Layer 2: LLM Interface (Brain) │
│ Vercel AI SDK · npm: ai + @ai-sdk/google │
│ │
│ streamText / generateText / generateObject │
│ Zod-validated structured output │
│ Tool calling with multi-step agent loops │
│ Provider abstraction (Google, Anthropic, OpenAI) │
│ Thinking token tracking │
└──────────────┬──────────────────────────────────────────────┘
│ custom fetch()
┌──────────────▼──────────────────────────────────────────────┐
│ Layer 3: API Proxy / Metering (Wallet) │
│ Centralized HTTP proxy · e.g., API Mom │
│ │
│ Cost tracking with per-call attribution │
│ API key management (single source of truth) │
│ Daily spend limits · Tier enforcement │
│ Caching · Rate limiting · Budget caps │
└──────────────┬──────────────────────────────────────────────┘
│
┌─────▼─────┐
│ Provider │ (Gemini, Claude, GPT, etc.)
└───────────┘
Key insight: Each layer is independently useful and independently replaceable. You can use the Agents SDK without the AI SDK (non-LLM agents). You can use the AI SDK without the Agents SDK (stateless LLM calls). You can use the proxy layer with any HTTP client. But when composed together, they create a production-grade agent platform where state, intelligence, and cost control are all first-class concerns.
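The custom fetch() arrow between Layer 2 and Layer 3 in the diagram is the only point where the brain touches the wallet. A minimal sketch of that seam follows, assuming a proxy that accepts the X-Project-Id, X-Function, and X-Api-Key headers used later in this article. The names createProxyFetch and FetchLike and the proxy URL are hypothetical, and the simplified FetchLike type stands in for the real fetch signature so the sketch runs without any SDK.

```typescript
// Simplified stand-in for the fetch signature (assumption: string URLs, plain-object headers).
type FetchInit = { method?: string; headers?: Record<string, string>; body?: string };
type FetchLike = (url: string, init?: FetchInit) => Promise<{ status: number }>;

interface ProxyFetchOptions {
  proxyBaseUrl: string; // hypothetical, e.g. "https://proxy.example.com/v1/gemini"
  projectId: string;    // sent as X-Project-Id
  functionName: string; // sent as X-Function
  proxyKey: string;     // sent as X-Api-Key; authenticates the caller to the proxy
}

// Returns a fetch-shaped function that re-targets every request at the proxy,
// drops any provider key, and attaches attribution headers.
function createProxyFetch(baseFetch: FetchLike, opts: ProxyFetchOptions): FetchLike {
  return async (url, init = {}) => {
    const original = new URL(url);
    const base = opts.proxyBaseUrl.replace(/\/$/, "");
    const target = base + original.pathname + original.search;

    const headers: Record<string, string> = {};
    for (const [name, value] of Object.entries(init.headers ?? {})) {
      // The proxy injects the real provider key, so never forward one from the client.
      if (name.toLowerCase() !== "x-goog-api-key") headers[name] = value;
    }
    headers["X-Project-Id"] = opts.projectId;
    headers["X-Function"] = opts.functionName;
    headers["X-Api-Key"] = opts.proxyKey;

    return baseFetch(target, { ...init, headers });
  };
}
```

The AI SDK provider factories accept a fetch option, so a function of this shape (adapted to the full fetch signature) can be passed in when the provider is created, and the rest of the LLM code stays unchanged.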
The TypeScript Shape
// The three layers, typed
// Layer 1: Agent Runtime — what the container looks like
interface AgentRuntime {
// State
state: AgentState;
setState(newState: AgentState): void;
sql: SqlStorage; // Built-in SQLite
// Communication
broadcast(message: string): void;
onConnect(connection: Connection): void;
onMessage(connection: Connection, message: string): void;
// Scheduling
schedule(cron: string, callback: string): void;
scheduleEvery(intervalMs: number): void;
// Lifecycle
onStart(): void;
onStop(): void;
}
// Layer 2: LLM Interface — what the brain looks like
interface LLMInterface {
// Text generation
generateText(options: GenerateTextOptions): Promise<GenerateTextResult>;
streamText(options: StreamTextOptions): StreamTextResult;
// Structured output
generateObject<T>(options: {
model: LanguageModel;
schema: ZodType<T>;
prompt: string;
}): Promise<{ object: T; usage: TokenUsage }>;
// Tool calling
tool(options: {
description: string;
parameters: ZodType;
execute: (args: unknown) => Promise<string>;
}): Tool;
}
// Layer 3: API Proxy — what the wallet looks like
interface APIProxy {
// Proxied fetch
fetch(url: string, options: RequestInit): Promise<Response>;
// Cost tracking
recordCall(params: {
project: string;
function: string;
service: string;
inputTokens: number;
outputTokens: number;
thinkingTokens: number;
costUsd: number;
}): Promise<void>;
// Budget enforcement
checkDailyLimit(project: string): Promise<{ allowed: boolean; spent: number; limit: number }>;
checkTierPermission(project: string, estimatedCost: number): Promise<boolean>;
}
Layer 1: Agent Runtime (Container)
The agent runtime answers: where does the agent live, how does it persist, and how do clients talk to it?
The Cloudflare Agents SDK (npm install agents) gives you a TypeScript class that runs on a Durable Object — a stateful micro-server with its own SQLite database, WebSocket connections, and scheduling system. Each agent instance is isolated, horizontally scalable, and costs nothing when idle.
Core Concepts
The Agent Class
Every agent extends the Agent class, parameterized by your environment bindings and state shape:
import { Agent, type Connection } from "agents";
interface Env {
AI: Ai;
RESEARCH_QUEUE: Queue;
DB: D1Database;
}
interface BrandAgentState {
brandSlug: string;
status: "idle" | "researching" | "generating" | "error";
lastResearchCycle: string | null;
pendingTasks: number;
totalContentGenerated: number;
costToday: number;
costBudget: number;
}
export class BrandAgent extends Agent<Env, BrandAgentState> {
initialState: BrandAgentState = {
brandSlug: "",
status: "idle",
lastResearchCycle: null,
pendingTasks: 0,
totalContentGenerated: 0,
costToday: 0,
costBudget: 5.0,
};
async onStart() {
// Runs when the agent starts or resumes from hibernation
this.sql`CREATE TABLE IF NOT EXISTS audit_log (
id INTEGER PRIMARY KEY AUTOINCREMENT,
action TEXT NOT NULL,
details TEXT,
cost_usd REAL DEFAULT 0,
created_at TEXT DEFAULT (datetime('now'))
)`;
}
async onMessage(connection: Connection, message: string) {
const { type, payload } = JSON.parse(message);
switch (type) {
case "audit":
await this.runAudit(payload.brandSlug);
break;
case "status":
connection.send(JSON.stringify({ type: "status", state: this.state }));
break;
}
}
private async runAudit(brandSlug: string) {
this.setState({
...this.state,
status: "researching",
brandSlug,
});
// Agent logic goes here — the runtime handles everything else
// State persists across restarts, deploys, and hibernation
// WebSocket clients get state updates automatically
// SQLite is always available for structured data
this.setState({ ...this.state, status: "idle" });
}
}
Routing
The Agents SDK routes HTTP and WebSocket requests to agent instances using a URL pattern:
import { routeAgentRequest } from "agents";
export default {
async fetch(request: Request, env: Env) {
// Routes to /agents/:agent/:name (the SDK's default prefix)
// e.g., /agents/brand-agent/niche-fi → BrandAgent instance "niche-fi"
return routeAgentRequest(request, env, { cors: true });
},
};
Each unique instance name gets its own Durable Object. The instance niche-fi is completely isolated from llc-tax. They have separate SQLite databases, separate state, separate WebSocket connections. You can have millions of them.
Built-in SQLite
Every agent instance has embedded SQLite accessed via this.sql. This is not D1 — it’s SQLite running directly inside the Durable Object with zero network latency:
// Create tables on startup
this.sql`CREATE TABLE IF NOT EXISTS memories (
id INTEGER PRIMARY KEY AUTOINCREMENT,
type TEXT NOT NULL,
content TEXT NOT NULL,
embedding BLOB,
created_at TEXT DEFAULT (datetime('now'))
)`;
// Query with tagged templates
const recent = [
...this.sql`SELECT * FROM memories
WHERE type = ${"observation"}
ORDER BY created_at DESC
LIMIT 10`
];
// Insert
this.sql`INSERT INTO memories (type, content)
VALUES (${"decision"}, ${JSON.stringify(decision)})`;
Key insight: The agent’s SQLite database is its long-term memory. State (this.state) is for real-time data that syncs to connected clients. SQLite is for everything else — conversation history, audit logs, embeddings, task queues. The distinction matters: state changes trigger WebSocket broadcasts, SQL writes don’t.
Scheduling and Alarms
Agents can schedule their own work. The scheduling system uses Durable Object alarms under the hood — guaranteed at-least-once execution with automatic retries:
export class MonitoringAgent extends Agent<Env, MonitorState> {
async onStart() {
// Run every 4 hours
this.schedule("0 */4 * * *", "runHealthCheck");
}
async runHealthCheck() {
this.setState({ ...this.state, status: "checking" });
const sites = [...this.sql`SELECT * FROM monitored_sites WHERE active = 1`];
for (const site of sites) {
// Time each check so latency_ms holds a real measurement
const start = Date.now();
const response = await fetch(site.url);
this.sql`INSERT INTO health_checks (site_id, status, latency_ms)
VALUES (${site.id}, ${response.status}, ${Date.now() - start})`;
}
this.setState({
...this.state,
status: "idle",
lastCheck: new Date().toISOString(),
});
}
}
Hibernation and Cost
This is the killer feature for running agents at scale. When a Durable Object has no active connections and no pending timers, it hibernates. Hibernated agents cost $0. They wake up instantly on the next request.
Pricing (Workers Paid plan):
- Requests: $0.15 per million (first million free)
- Duration: $12.50 per million GB-seconds (only while actively executing)
- WebSocket messages: 20:1 ratio (100 incoming messages = 5 billed requests)
- Hibernated: $0
This means you can run 100,000 brand agents. If only 50 are active at any given time, you pay for 50. The other 99,950 cost nothing. They wake up in milliseconds when needed.
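To make the pricing arithmetic concrete, here is a rough estimator built only from the figures quoted above. The workload numbers (requests and GB-seconds per agent) are invented for illustration, and any free duration allowance is ignored, so treat the result as a sketch and not a billing forecast.

```typescript
// Figures quoted above (Workers Paid plan); free duration allowances are ignored here.
const USD_PER_MILLION_REQUESTS = 0.15;
const FREE_REQUESTS = 1_000_000;
const USD_PER_MILLION_GB_SECONDS = 12.5;
const WS_MESSAGES_PER_BILLED_REQUEST = 20;

// Incoming WebSocket messages are billed at a 20:1 ratio.
function billedRequests(httpRequests: number, wsMessages: number): number {
  return httpRequests + Math.ceil(wsMessages / WS_MESSAGES_PER_BILLED_REQUEST);
}

// Hibernated agents contribute nothing: only requests and active GB-seconds appear.
function estimateMonthlyUsd(
  httpRequests: number,
  wsMessages: number,
  activeGbSeconds: number,
): number {
  const billable = Math.max(0, billedRequests(httpRequests, wsMessages) - FREE_REQUESTS);
  return (
    (billable / 1_000_000) * USD_PER_MILLION_REQUESTS +
    (activeGbSeconds / 1_000_000) * USD_PER_MILLION_GB_SECONDS
  );
}

// Made-up workload: 50 active agents, each with 10,000 requests and 200 GB-seconds a month.
const activeAgents = 50;
const estimate = estimateMonthlyUsd(activeAgents * 10_000, 0, activeAgents * 200);
console.log(`Estimated monthly cost: $${estimate.toFixed(4)}`);
```

The other 99,950 agents never appear in the calculation, which is the point: the bill scales with activity, not with the number of agents that exist.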
The AIChatAgent Subclass
For conversational agents, the SDK provides AIChatAgent with built-in message persistence and resumable streaming:
import { AIChatAgent } from "agents/ai-chat-agent";
import { streamText, tool, type StreamTextOnFinishCallback } from "ai";
import { createGoogleGenerativeAI } from "@ai-sdk/google";
import { z } from "zod";
export class ChatAgent extends AIChatAgent<Env, ChatState> {
async onChatMessage(
onFinish?: StreamTextOnFinishCallback,
options?: { abortSignal?: AbortSignal; body?: unknown }
) {
const google = createGoogleGenerativeAI({ apiKey: this.env.GEMINI_API_KEY });
const result = streamText({
model: google("gemini-2.5-flash"),
messages: this.messages, // Auto-loaded from SQLite
tools: {
searchKnowledge: tool({
description: "Search the agent's knowledge base",
parameters: z.object({ query: z.string() }),
execute: async ({ query }) => {
const results = [...this.sql`
SELECT content FROM memories
WHERE content LIKE ${'%' + query + '%'}
LIMIT 5
`];
return JSON.stringify(results);
},
}),
},
maxSteps: 5,
onFinish,
abortSignal: options?.abortSignal,
});
return result.toUIMessageStreamResponse();
}
}
AIChatAgent handles:
- Message persistence — conversations saved to SQLite automatically
- Resumable streaming — if a client disconnects mid-stream, it reconnects and picks up where it left off
- Chunk buffering — stores stream chunks so late-joining clients get the full response
- Automatic cleanup — destroy() cancels pending requests
Key insight: AIChatAgent is where Layer 1 (Container) and Layer 2 (Brain) naturally compose. The onChatMessage method is the seam. The agent runtime manages the conversation lifecycle. The AI SDK handles the LLM call. Neither knows about the other’s internals.
Wrangler Configuration
{
"name": "brand-agents",
"main": "src/index.ts",
"compatibility_date": "2025-03-01",
"durable_objects": {
"bindings": [
{
"name": "BRAND_AGENT",
"class_name": "BrandAgent"
},
{
"name": "CHAT_AGENT",
"class_name": "ChatAgent"
}
]
},
"migrations": [
{
"tag": "v1",
"new_sqlite_classes": ["BrandAgent", "ChatAgent"]
}
]
}
Layer 2: LLM Interface (Brain)
The LLM interface answers: how do you talk to language models, validate their output, and let them call tools?
The Vercel AI SDK (npm install ai) is a TypeScript toolkit that abstracts LLM providers behind a common interface. It handles structured output with Zod schemas, multi-step tool calling, streaming, retries, and provider switching — all the things you’d otherwise build (badly) yourself.
Why Not Raw Fetch?
Here’s what calling Gemini looks like without the AI SDK:
// The "just use fetch" approach — DON'T DO THIS
async function generateContent(prompt: string, schema: object): Promise<unknown> {
const response = await fetch(
`https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent`,
{
method: "POST",
headers: {
"Content-Type": "application/json",
"x-goog-api-key": env.GEMINI_API_KEY,
},
body: JSON.stringify({
contents: [{ parts: [{ text: prompt }] }],
generationConfig: {
responseMimeType: "application/json",
responseSchema: schema, // Not Zod — raw JSON Schema
},
}),
}
);
if (!response.ok) {
// What kind of error? Rate limit? Auth? Bad request? Model overloaded?
// You have to parse the error body to find out
const error = await response.json();
throw new Error(`Gemini error: ${error.error?.message}`);
}
const data = await response.json();
const text = data.candidates?.[0]?.content?.parts?.[0]?.text;
if (!text) {
throw new Error("No content in response");
}
// Parse JSON — but what if it's malformed?
let parsed;
try {
parsed = JSON.parse(text);
} catch {
// Retry? With what backoff? How many times?
throw new Error(`Invalid JSON from LLM: ${text.slice(0, 200)}`);
}
// Validate against schema — but you're using JSON Schema, not Zod
// No type inference, no runtime validation, no error messages
// You just... hope it's right
// Token counting? Thinking tokens?
const usage = data.usageMetadata;
// promptTokenCount, candidatesTokenCount, totalTokenCount
// But where are thoughtsTokenCount? Different field name in different API versions
// Also: are thinking tokens included in candidatesTokenCount or separate?
return parsed;
}
That’s 50 lines for a single call, no retries, no streaming, no tool calling, no type safety. Now multiply by every LLM call in your system.
Core Concepts
Provider Abstraction
The AI SDK defines a common LanguageModel interface. You create a model instance from any provider and pass it to the same functions:
import { generateText } from "ai";
import { createGoogleGenerativeAI } from "@ai-sdk/google";
import { createAnthropic } from "@ai-sdk/anthropic";
import { createOpenAI } from "@ai-sdk/openai";
// Same function, different providers
const google = createGoogleGenerativeAI({ apiKey: env.GEMINI_API_KEY });
const anthropic = createAnthropic({ apiKey: env.ANTHROPIC_API_KEY });
const openai = createOpenAI({ apiKey: env.OPENAI_API_KEY });
// Swap providers by changing one line
const { text } = await generateText({
model: google("gemini-2.5-flash"),
// model: anthropic("claude-sonnet-4-20250514"),
// model: openai("gpt-4o"),
prompt: "Analyze this brand's SEO performance...",
});
Key insight: Provider abstraction isn’t just about convenience — it’s about cost optimization. When Gemini Flash costs $0.15/M input tokens and Claude Sonnet costs $3/M, being able to route different tasks to different providers based on complexity is the difference between a $50/month and $500/month agent system.
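A small sketch of what cost-based routing could look like, using only the two input prices quoted above. The pickModel rule and the TaskComplexity labels are hypothetical; a real router would also account for output and thinking prices.

```typescript
type TaskComplexity = "routine" | "complex";

// Input prices per million tokens as quoted above; output pricing is left out for brevity.
const INPUT_USD_PER_MILLION: Record<string, number> = {
  "gemini-2.5-flash": 0.15,
  "claude-sonnet-4-20250514": 3.0,
};

// Hypothetical routing rule: cheap model by default, expensive model only when needed.
function pickModel(complexity: TaskComplexity): string {
  return complexity === "complex" ? "claude-sonnet-4-20250514" : "gemini-2.5-flash";
}

function inputCostUsd(modelId: string, inputTokens: number): number {
  const price = INPUT_USD_PER_MILLION[modelId];
  if (price === undefined) throw new Error(`No price configured for ${modelId}`);
  return (inputTokens / 1_000_000) * price;
}

// Same token volume, two routing decisions.
const monthlyInputTokens = 100_000_000;
for (const complexity of ["routine", "complex"] as const) {
  const model = pickModel(complexity);
  console.log(`${complexity}: ${model} costs $${inputCostUsd(model, monthlyInputTokens).toFixed(2)}`);
}
```

Because both providers sit behind the same LanguageModel interface, the only thing a router like this has to return is which model instance to pass to generateText.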
Structured Output with Zod
This is the AI SDK’s strongest feature. Define a Zod schema, get a validated, typed object back:
import { generateObject } from "ai";
import { z } from "zod";
const BrandAuditSchema = z.object({
overallScore: z.number().min(0).max(100).describe("Overall brand health 0-100"),
categories: z.array(z.object({
name: z.enum(["seo", "content", "design", "performance"]),
score: z.number().min(0).max(100),
issues: z.array(z.object({
severity: z.enum(["critical", "warning", "info"]),
description: z.string(),
recommendation: z.string(),
})),
})),
summary: z.string().describe("2-3 sentence summary of findings"),
topPriority: z.string().describe("Single most important thing to fix"),
});
type BrandAudit = z.infer<typeof BrandAuditSchema>;
async function auditBrand(siteData: string): Promise<BrandAudit> {
const { object, usage } = await generateObject({
model: google("gemini-2.5-flash"),
schema: BrandAuditSchema,
prompt: `Audit this brand's website and provide a structured assessment:\n\n${siteData}`,
temperature: 0.3,
maxRetries: 3, // Retries with schema validation on each attempt
});
// object is fully typed as BrandAudit
// No JSON.parse, no manual validation, no hope-based programming
return object;
}
The AI SDK handles:
- Converting your Zod schema to the provider’s native format (JSON Schema for Gemini, tool_use for Claude)
- Parsing the LLM response
- Validating against the schema
- Retrying if validation fails (up to maxRetries)
- Throwing NoObjectGeneratedError with the raw text if all retries fail (so you can debug)
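The parse, validate, retry cycle in the list above can be pictured with a dependency-free sketch. This is not the AI SDK's internal implementation: generateValidated, InvalidObjectError, and the hand-written isScore validator are all hypothetical stand-ins (a Zod schema would normally play the validator's role).

```typescript
type Validator<T> = (value: unknown) => value is T;

// Hypothetical error type carrying the last raw model output for debugging.
class InvalidObjectError extends Error {
  rawText: string;
  constructor(rawText: string) {
    super("Model never produced a valid object");
    this.rawText = rawText;
  }
}

// Calls the model up to maxRetries + 1 times until the reply parses and validates.
async function generateValidated<T>(
  callModel: () => Promise<string>,
  isValid: Validator<T>,
  maxRetries: number,
): Promise<T> {
  let lastText = "";
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    lastText = await callModel();
    try {
      const parsed: unknown = JSON.parse(lastText);
      if (isValid(parsed)) return parsed;
    } catch {
      // Malformed JSON: fall through and try again.
    }
  }
  throw new InvalidObjectError(lastText);
}

// Example validator for a one-field object.
interface Score {
  score: number;
}
const isScore: Validator<Score> = (v): v is Score =>
  typeof v === "object" && v !== null && typeof (v as { score?: unknown }).score === "number";
```

Keeping the raw text on the error is the useful design choice here: when every attempt fails, the caller can log exactly what the model returned.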
Tool Calling and Agent Loops
The AI SDK supports multi-step tool calling where the LLM can invoke tools, observe results, and decide what to do next:
import { generateText, tool } from "ai";
import { z } from "zod";
const result = await generateText({
model: google("gemini-2.5-flash"),
system: "You are a brand research agent. Use the available tools to gather data, then synthesize your findings.",
prompt: "Research the competitive landscape for 'bank statement to Excel converter' tools.",
tools: {
searchWeb: tool({
description: "Search the web for information",
parameters: z.object({
query: z.string().describe("Search query"),
}),
execute: async ({ query }) => {
const results = await fetch(`${env.API_MOM_URL}/v1/brave/search?q=${encodeURIComponent(query)}`, {
headers: { "X-Project-Id": "brand-agent", "X-Api-Key": env.API_MOM_KEY },
});
return await results.text();
},
}),
analyzeSerp: tool({
description: "Analyze a SERP to identify competitors and their positioning",
parameters: z.object({
serpData: z.string().describe("Raw SERP results to analyze"),
}),
execute: async ({ serpData }) => {
// Could be another LLM call, or custom logic
return `Analysis: Found 8 competitors. Top 3: ...`;
},
}),
saveFinding: tool({
description: "Save a research finding to the knowledge base",
parameters: z.object({
category: z.string(),
finding: z.string(),
confidence: z.number().min(0).max(1),
}),
execute: async ({ category, finding, confidence }) => {
// Save to agent's SQLite (via closure over the agent instance)
agent.sql`INSERT INTO findings (category, finding, confidence)
VALUES (${category}, ${finding}, ${confidence})`;
return "Saved.";
},
}),
},
maxSteps: 10, // LLM can call tools up to 10 times before final response
temperature: 0.3,
});
console.log(result.text); // Final synthesized response
console.log(result.steps.length); // How many tool-call rounds
console.log(result.usage); // Total tokens across all steps
The maxSteps parameter controls how many rounds the LLM can make. Each round: the LLM produces text and/or tool calls → tools execute → results go back to the LLM → repeat until the LLM produces text without tool calls (or hits the step limit).
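The round structure just described can be sketched without any SDK. The Model and Tools types and runAgentLoop below are hypothetical simplifications (history is a list of strings, the model is any async function), meant only to show where the step limit and the "no tool calls" exit sit in the loop.

```typescript
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}
interface ModelTurn {
  text: string;
  toolCalls: ToolCall[];
}
type Model = (history: string[]) => Promise<ModelTurn>;
type Tools = Record<string, (args: Record<string, unknown>) => Promise<string>>;

// One step = one model call, plus execution of any tools it asked for.
async function runAgentLoop(
  model: Model,
  tools: Tools,
  prompt: string,
  maxSteps: number,
): Promise<{ text: string; steps: number }> {
  const history: string[] = [`user: ${prompt}`];
  let text = "";
  let steps = 0;
  while (steps < maxSteps) {
    steps++;
    const turn = await model(history);
    text = turn.text;
    // Text without tool calls is the final answer.
    if (turn.toolCalls.length === 0) break;
    for (const call of turn.toolCalls) {
      const tool = tools[call.name];
      const result = tool ? await tool(call.args) : `error: unknown tool ${call.name}`;
      // Tool results go back into the history for the next round.
      history.push(`tool:${call.name}: ${result}`);
    }
  }
  return { text, steps };
}
```

Every step is another billed model call, so the step limit is a cost ceiling as much as a safety valve.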
Streaming
For real-time UIs, streamText sends tokens as they’re generated:
import { streamText } from "ai";
const result = streamText({
model: google("gemini-2.5-flash"),
messages: conversationHistory,
tools: { /* ... */ },
maxSteps: 5,
});
// In a Cloudflare Worker, return as a streaming response
return result.toTextStreamResponse();
// Or in an AIChatAgent, return UI message stream
return result.toUIMessageStreamResponse();
Thinking Token Tracking
Models like Gemini 2.5 Flash and Claude use “thinking” tokens that cost significantly more than regular output tokens. The AI SDK exposes this metadata:
const result = await generateText({
model: google("gemini-2.5-flash"),
prompt: "Complex reasoning task...",
providerOptions: {
google: {
thinkingConfig: { thinkingBudget: 2048 },
},
},
});
// Thinking tokens are in providerMetadata
const thinkingTokens =
result.providerMetadata?.google?.usageMetadata?.thoughtsTokenCount ?? 0;
// Or in AI SDK v6+
const thinkingTokensV6 = result.usage?.reasoningTokens ?? 0;
console.log(`Input: ${result.usage.inputTokens}`);
console.log(`Output: ${result.usage.outputTokens}`);
console.log(`Thinking: ${thinkingTokens}`); // These cost $3.50/M on Flash!
Key insight: Thinking tokens are the silent budget killer. On Gemini 2.5 Flash, thinking tokens cost $3.50/M — almost 6x the regular output token price of $0.60/M. If you price all output tokens at $0.60/M, your cost tracking will undercount by 3-9x. This is exactly what caused the $47 surprise bill. The AI SDK exposes thinking token counts, but you have to actually read them and price them correctly.
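A worked example of the undercount, using the Gemini 2.5 Flash prices quoted above. The token counts are invented for illustration, and the CallUsage shape (text and thinking tokens reported separately) is an assumption of this sketch.

```typescript
// Gemini 2.5 Flash prices per token, from the figures above.
const INPUT_USD = 0.15 / 1_000_000;
const OUTPUT_USD = 0.6 / 1_000_000;
const THINKING_USD = 3.5 / 1_000_000;

interface CallUsage {
  inputTokens: number;
  textTokens: number;     // visible output
  thinkingTokens: number; // reasoning tokens, billed at the higher rate
}

// The mistake: every output-side token priced at the regular output rate.
function naiveCostUsd(u: CallUsage): number {
  return u.inputTokens * INPUT_USD + (u.textTokens + u.thinkingTokens) * OUTPUT_USD;
}

// Thinking tokens priced separately.
function actualCostUsd(u: CallUsage): number {
  return (
    u.inputTokens * INPUT_USD +
    u.textTokens * OUTPUT_USD +
    u.thinkingTokens * THINKING_USD
  );
}

// Invented call: short prompt, short answer, heavy reasoning.
const call: CallUsage = { inputTokens: 1_000, textTokens: 500, thinkingTokens: 5_000 };
const undercount = actualCostUsd(call) / naiveCostUsd(call);
console.log(`naive $${naiveCostUsd(call).toFixed(5)} vs actual $${actualCostUsd(call).toFixed(5)}`);
console.log(`undercount factor: ${undercount.toFixed(1)}x`);
```

For this call the naive formula reports about a fifth of the real cost. If a tracker drops thinking tokens entirely instead of mispricing them, the gap can be wider still.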
The LLM Harness Pattern
In production, you don’t call generateObject directly everywhere. You wrap it in a harness that handles cost calculation, logging, and ledger recording:
interface LLMConfig {
apiKey: string;
model?: string;
temperature?: number;
maxTokens?: number;
maxRetries?: number;
thinkingBudget?: number;
db?: D1Database; // For cost ledger recording
contextType?: string; // "pipeline" | "api" | "cron"
contextId?: string; // "brand-audit-niche-fi"
}
interface LLMResult<T> {
data: T;
usage: {
inputTokens: number;
outputTokens: number;
thinkingTokens: number;
totalTokens: number;
};
costUsd: number;
finishReason: string;
durationMs: number;
warnings: string[];
}
const PRICING: Record<string, { input: number; output: number; thinking: number }> = {
"gemini-2.5-flash": {
input: 0.15 / 1_000_000,
output: 0.6 / 1_000_000,
thinking: 3.5 / 1_000_000,
},
"gemini-2.5-pro": {
input: 1.25 / 1_000_000,
output: 10.0 / 1_000_000,
thinking: 10.0 / 1_000_000,
},
"gemini-2.0-flash": {
input: 0.1 / 1_000_000,
output: 0.4 / 1_000_000,
thinking: 0,
},
};
function calculateCost(
model: string,
inputTokens: number,
outputTokens: number,
thinkingTokens: number = 0,
): number {
const pricing = PRICING[model] ?? PRICING["gemini-2.5-flash"];
const textOutputTokens = Math.max(0, outputTokens - thinkingTokens);
return (
inputTokens * pricing.input +
textOutputTokens * pricing.output +
thinkingTokens * pricing.thinking
);
}
function extractThinkingTokens(result: any): number {
// Vercel AI SDK: providerMetadata.google.usageMetadata.thoughtsTokenCount
const fromProvider = result.providerMetadata?.google?.usageMetadata?.thoughtsTokenCount;
if (fromProvider != null) return fromProvider;
// AI SDK v6: usage.reasoningTokens
const fromUsage = result.usage?.reasoningTokens;
if (fromUsage != null) return fromUsage;
return 0;
}
async function llmObject<T>(
config: LLMConfig,
schema: z.ZodType<T>,
prompt: string,
system?: string,
label = "llmObject",
): Promise<LLMResult<T>> {
const model = createModel(config);
const start = Date.now();
const result = await generateObject({
model,
schema,
prompt,
system,
temperature: config.temperature ?? 0.3,
maxOutputTokens: config.maxTokens ?? 8192,
maxRetries: config.maxRetries ?? 3,
});
const inputTokens = result.usage?.inputTokens ?? 0;
const outputTokens = result.usage?.outputTokens ?? 0;
const thinkingTokens = extractThinkingTokens(result);
const costUsd = calculateCost(config.model ?? "gemini-2.5-flash", inputTokens, outputTokens, thinkingTokens);
const durationMs = Date.now() - start;
console.log(
`[llm:${label}] OK in ${durationMs}ms | tokens: ${inputTokens}→${outputTokens} (${thinkingTokens} thinking) | cost: $${costUsd.toFixed(4)}`
);
return {
data: result.object,
usage: { inputTokens, outputTokens, thinkingTokens, totalTokens: inputTokens + outputTokens },
costUsd,
finishReason: result.finishReason,
durationMs,
warnings: [],
};
}
This pattern creates a clean boundary. Your business logic calls llmObject(config, schema, prompt) and gets back typed data plus cost information. It never touches fetch, never parses JSON, never worries about retries.
Layer 3: API Proxy / Metering (Wallet)
The API proxy layer answers: who spent how much, on what, and should they be allowed to?
This is the layer most teams skip. They put API keys in environment variables, call providers directly, and discover the cost when the bill arrives. The proxy layer makes cost a first-class concern — tracked per call, attributed to a function, enforced with daily limits.
Why a Centralized Proxy?
Consider a typical multi-service architecture:
Without proxy:
Service A ──[GEMINI_API_KEY_A]──→ Google
Service B ──[GEMINI_API_KEY_B]──→ Google
Service C ──[GEMINI_API_KEY_C]──→ Google
Cron Job ──[GEMINI_API_KEY_D]──→ Google
Cost visibility: Check Google Dashboard → $47 total → ??? per service
With proxy:
Service A ──[X-Project-Id: svc-a]──→ API Proxy ──[GEMINI_API_KEY]──→ Google
Service B ──[X-Project-Id: svc-b]──→ API Proxy
Service C ──[X-Project-Id: svc-c]──→ API Proxy
Cron Job ──[X-Project-Id: cron ]──→ API Proxy
Cost visibility: GET /v1/costs?period=day
→ svc-a: $12.40 (article-write: $8, design-audit: $4.40)
→ svc-b: $3.20 (keyword-research: $3.20)
→ cron: $31.40 (batch-generate: $31.40) ← PROBLEM IDENTIFIED
The $47 Cost Disaster (A Real Story)
Here’s what happened with zero proxy layer:
- pages-plus held its own GEMINI_API_KEY. It ran 9 cron jobs generating blog posts. Each post made 3 Gemini calls (outline + draft + edit). In 5 days: 193 posts × 3 calls × ~$0.06/call = $37.
- aso-mrr also held its own GEMINI_API_KEY. Its internal tracking showed $1.14 spent. Google billed $10.19. The 9x discrepancy: thinking tokens. The cost formula priced all output tokens at $0.60/M. But Gemini 2.5 Flash’s thinking tokens cost $3.50/M — almost 6x more. A call that “cost $0.02” actually cost $0.12.
- Combined: $47 in 5 days = $282/month run rate. Zero alerts. Zero dashboards. Discovered only by checking the Google Cloud billing console manually.
The fix: delete all API keys from individual services. Route everything through a centralized proxy. Track every call with function-level attribution. Enforce daily spend limits. Never enable a cron job until metering is verified.
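The incident figures are easy to re-derive, which is a useful habit when reconciling a provider bill against internal tracking. The sketch below uses only the numbers reported above; monthlyRunRateUsd assumes a flat 30-day month.

```typescript
// Figures reported above.
const posts = 193;
const callsPerPost = 3;
const pagesPlusUsd = 37;
const asoMrrTrackedUsd = 1.14;
const asoMrrBilledUsd = 10.19;

const totalCalls = posts * callsPerPost;                // 579 calls
const usdPerCall = pagesPlusUsd / totalCalls;           // about $0.064 per call
const discrepancy = asoMrrBilledUsd / asoMrrTrackedUsd; // about 8.9x

// Extrapolates observed spend to a 30-day month.
function monthlyRunRateUsd(spentUsd: number, days: number): number {
  return (spentUsd / days) * 30;
}

console.log(`${totalCalls} calls at ~$${usdPerCall.toFixed(3)} each`);
console.log(`tracked vs billed: ${discrepancy.toFixed(1)}x`);
console.log(`run rate: $${monthlyRunRateUsd(47, 5).toFixed(0)}/month`);
```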
Architecture of the Proxy Layer
// Simplified API proxy (e.g., "API Mom")
// In production, this is its own Cloudflare Worker with a D1 database
interface ProxyConfig {
projects: Map<string, {
name: string;
dailyLimitUsd: number;
tierPermission: 1 | 2 | 3; // Max cost tier allowed
apiKeys: Map<string, string>; // service → key
}>;
}
// Tier definitions
// Tier 1 (< $0.01/call): Flash models, cache reads, embeddings
// Tier 2 ($0.01 - $0.10/call): Thinking models, image gen
// Tier 3 (> $0.10/call): Pro models, long-context, multi-step
async function handleProxiedRequest(
request: Request,
config: ProxyConfig,
db: D1Database,
): Promise<Response> {
const projectId = request.headers.get("X-Project-Id");
const functionName = request.headers.get("X-Function") ?? "unknown";
const tags = request.headers.get("X-Tags");
// 1. Authenticate (a missing header is treated the same as an unknown project)
const project = projectId ? config.projects.get(projectId) : undefined;
if (!projectId || !project) return new Response("Unknown project", { status: 403 });
// 2. Check daily limit
const todaySpend = await getDailySpend(projectId, db);
if (todaySpend >= project.dailyLimitUsd) {
return Response.json(
{ error: "daily_limit_exceeded", spend: todaySpend, limit: project.dailyLimitUsd },
{ status: 429 }
);
}
// 3. Proxy the request to the real provider
const url = new URL(request.url);
const targetUrl = mapToProviderUrl(url.pathname);
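const apiKey = project.apiKeys.get(getServiceFromPath(url.pathname));
if (!apiKey) return new Response("No API key configured for service", { status: 500 });
const start = Date.now();
const response = await fetch(targetUrl, {
method: request.method,
headers: {
// Forward only what the provider needs; proxy credentials such as X-Api-Key stay here
"Content-Type": request.headers.get("Content-Type") ?? "application/json",
"x-goog-api-key": apiKey, // Inject the real API key
},
body: request.body,
});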
// 4. Parse response for token usage
const responseBody = await response.json();
const usage = extractUsage(responseBody);
const costUsd = calculateCost(usage);
const durationMs = Date.now() - start;
// 5. Record to ledger first: the provider has already billed this call,
// so it must be counted even if the tier check below rejects it
await db.prepare(`
INSERT INTO api_calls_ledger
(project_id, function, service, cost_usd,
input_tokens, output_tokens, thinking_tokens,
duration_ms, tags, status)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, 'ok')
`).bind(
projectId, functionName, "gemini", costUsd,
usage.inputTokens, usage.outputTokens, usage.thinkingTokens,
durationMs, tags
).run();
// 6. Check tier permission
if (costUsd > tierThreshold(project.tierPermission)) {
return Response.json(
{ error: "cost_tier_exceeded", estimatedCost: costUsd, maxTier: project.tierPermission },
{ status: 403 }
);
}
// 7. Return the response (re-create since we consumed the body)
return Response.json(responseBody, {
status: response.status,
headers: {
"X-Cost-Usd": costUsd.toFixed(6),
"X-Daily-Spend": (todaySpend + costUsd).toFixed(4),
"X-Daily-Limit": project.dailyLimitUsd.toString(),
},
});
}
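The proxy code relies on a tierThreshold helper that is not shown. One possible reading of the tier comments, sketched as a classification function: tierForCost and isCallAllowed are hypothetical names, and the boundary handling (exactly $0.01 and exactly $0.10 both fall in Tier 2) is an assumption drawn from those comments.

```typescript
type Tier = 1 | 2 | 3;

// Boundaries follow the tier comments: < $0.01, $0.01 to $0.10, > $0.10 per call.
function tierForCost(costUsd: number): Tier {
  if (costUsd < 0.01) return 1;
  if (costUsd <= 0.1) return 2;
  return 3;
}

// A project may make any call whose tier does not exceed its permission.
function isCallAllowed(costUsd: number, maxTier: Tier): boolean {
  return tierForCost(costUsd) <= maxTier;
}

console.log(isCallAllowed(0.004, 1)); // Flash-sized call on a Tier 1 project
console.log(isCallAllowed(0.25, 2));  // Pro-sized call on a Tier 2 project
```

Since the real cost is only known after the provider responds, a check like this can run before the call only against an estimate, for example one derived from the model name and prompt size.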
Cost Attribution Levels
The proxy tracks costs at multiple levels of granularity:
// The ledger schema
interface ApiCallLedger {
id: number;
timestamp: string;
project_id: string; // "pages-plus", "scalable-media"
function: string; // "article-write", "brand-audit", "keyword-research"
service: string; // "gemini", "brave", "perplexity"
endpoint: string; // "gemini-2.5-flash", "gemini-2.5-pro"
cost_usd: number;
input_tokens: number;
output_tokens: number;
thinking_tokens: number; // Separate column — never lump with output
cache_read_tokens: number;
duration_ms: number;
tags: string; // JSON: ["brand:niche-fi", "trigger:cron", "batch:2026-03-15"]
status: "ok" | "error";
error_message: string | null;
}
The function field is critical. Without it, you know “pages-plus spent $37” but not “the article-write function spent $32 and the design-audit function spent $5.” Function-level attribution is what turns cost data into actionable intelligence.
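Function-level attribution only works if every caller actually sends it. As a sketch of how the proxy side might read the `X-Project-Id`, `X-Function`, and `X-Tags` headers used by the client examples in this article, the hypothetical `readAttribution` helper below returns `null` when the required fields are missing, so the proxy can reject unattributed calls instead of logging them under an empty function name. The helper name and the choice to drop malformed tags are assumptions.

```typescript
interface Attribution {
  projectId: string;
  functionName: string;
  tags: string[];
}

// Returns null when required attribution is missing so the caller can reject the request.
function readAttribution(headers: Headers): Attribution | null {
  const projectId = headers.get("X-Project-Id")?.trim();
  const functionName = headers.get("X-Function")?.trim();
  if (!projectId || !functionName) return null;

  let tags: string[] = [];
  const rawTags = headers.get("X-Tags");
  if (rawTags) {
    try {
      const parsed: unknown = JSON.parse(rawTags);
      if (Array.isArray(parsed)) {
        // Keep only string tags; anything else is ignored.
        tags = parsed.filter((t): t is string => typeof t === "string");
      }
    } catch {
      // Malformed tags are dropped rather than failing the whole call.
    }
  }
  return { projectId, functionName, tags };
}
```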
Querying Costs
// GET /v1/costs?period=day&project=pages-plus
const costs = await db.prepare(`
SELECT
function,
service,
COUNT(*) as calls,
SUM(cost_usd) as total_cost,
SUM(input_tokens) as total_input,
SUM(output_tokens) as total_output,
SUM(thinking_tokens) as total_thinking,
AVG(duration_ms) as avg_duration
FROM api_calls_ledger
WHERE project_id = ?
AND timestamp >= datetime('now', '-1 day')
GROUP BY function, service
ORDER BY total_cost DESC
`).bind("pages-plus").all();
// Result:
// function | service | calls | total_cost | total_thinking
// article-write | gemini | 579 | $32.40 | 2,847,000
// design-audit | gemini | 38 | $4.60 | 412,000
// anchor-gen | gemini | 12 | $0.40 | 0
Daily Spend Limits
async function getDailySpend(projectId: string, db: D1Database): Promise<number> {
const result = await db.prepare(`
SELECT COALESCE(SUM(cost_usd), 0) as spend
FROM api_calls_ledger
WHERE project_id = ?
AND timestamp >= datetime('now', 'start of day')
AND status = 'ok'
`).bind(projectId).first<{ spend: number }>();
return result?.spend ?? 0;
}
// Client-side handling of 429
async function callLLMWithBudgetCheck(proxy: string, proxyKey: string, body: unknown): Promise<Response> {
const response = await fetch(`${proxy}/v1/gemini/generateContent`, {
method: "POST",
headers: {
"Content-Type": "application/json",
"X-Project-Id": "brand-agents",
"X-Function": "research",
"X-Api-Key": proxyKey, // passed in explicitly; `env` is not in scope in a standalone function
},
body: JSON.stringify(body),
});
if (response.status === 429) {
const { spend, limit } = (await response.json()) as { spend: number; limit: number };
console.warn(`[budget] Daily limit hit: $${spend.toFixed(2)} / $${limit}`);
// Queue for tomorrow, or use a cheaper model, or skip
return handleBudgetExhausted(spend, limit);
}
return response;
}
Key insight: The proxy layer’s value isn’t just tracking — it’s enforcement. Any service can track its own costs. But only a centralized proxy can prevent overspend across all services simultaneously. When a cron job tries to burn $30 overnight, the proxy returns 429 after $5. The cron handles the 429 gracefully. Nobody wakes up to a surprise bill.
The Composition Pattern
This is where the architecture comes together. The three layers compose through two integration points:
- Container → Brain: The AIChatAgent.onChatMessage() method calls streamText() / generateText() / generateObject() from the AI SDK
- Brain → Wallet: The AI SDK provider’s fetch option routes HTTP requests through the proxy
The Custom Fetch Bridge
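Every AI SDK provider factory (createGoogleGenerativeAI, createAnthropic, createOpenAI) accepts baseURL and fetch options. Together they form the seam where Layer 2 connects to Layer 3: baseURL points the provider at the proxy, and fetch attaches the attribution headers to every request: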
import { createGoogleGenerativeAI } from "@ai-sdk/google";
// Direct call — bypasses cost tracking
const googleDirect = createGoogleGenerativeAI({
apiKey: env.GEMINI_API_KEY,
});
// Proxied call — all requests go through API Mom
const googleProxied = createGoogleGenerativeAI({
apiKey: "proxied", // API Mom injects the real key
baseURL: `${env.API_MOM_URL}/v1/google`,
fetch: async (url: RequestInfo | URL, init?: RequestInit) => {
const headers = new Headers(init?.headers);
headers.set("X-Project-Id", "brand-agents");
headers.set("X-Function", "content-generation");
headers.set("X-Api-Key", env.API_MOM_KEY);
headers.set("X-Tags", JSON.stringify(["brand:niche-fi", "trigger:agent"]));
return fetch(url, { ...init, headers });
},
});
The beauty of this pattern: the AI SDK doesn’t know it’s talking to a proxy. It thinks it’s calling Google’s API. The proxy handles authentication, cost tracking, and budget enforcement transparently.
Full Stack Composition
Here’s a complete agent that uses all three layers:
import { AIChatAgent } from "agents/ai-chat-agent";
import { streamText, generateObject, tool, type StreamTextOnFinishCallback } from "ai";
import { createGoogleGenerativeAI } from "@ai-sdk/google";
import { z } from "zod";
interface Env {
API_MOM_URL: string;
API_MOM_KEY: string;
RESEARCH_QUEUE: Queue;
}
interface ResearchAgentState {
brandSlug: string;
status: "idle" | "researching" | "analyzing" | "reporting";
findingsCount: number;
costToday: number;
costBudget: number;
lastError: string | null;
}
export class ResearchAgent extends AIChatAgent<Env, ResearchAgentState> {
initialState: ResearchAgentState = {
brandSlug: "",
status: "idle",
findingsCount: 0,
costToday: 0,
costBudget: 2.0,
lastError: null,
};
// Layer 2: Create a proxied model instance
private createModel(functionName: string) {
return createGoogleGenerativeAI({
apiKey: "proxied", // Real key lives in API Mom
baseURL: `${this.env.API_MOM_URL}/v1/google`,
fetch: async (url, init) => {
const headers = new Headers(init?.headers);
headers.set("X-Project-Id", "research-agents");
headers.set("X-Function", functionName);
headers.set("X-Api-Key", this.env.API_MOM_KEY);
headers.set("X-Tags", JSON.stringify([
`brand:${this.state.brandSlug}`,
"trigger:chat",
]));
return fetch(url, { ...init, headers });
},
})("gemini-2.5-flash");
}
// Layer 1: Lifecycle
async onStart() {
this.sql`CREATE TABLE IF NOT EXISTS findings (
id INTEGER PRIMARY KEY AUTOINCREMENT,
query TEXT NOT NULL,
category TEXT,
content TEXT NOT NULL,
confidence REAL DEFAULT 0.5,
source_url TEXT,
created_at TEXT DEFAULT (datetime('now'))
)`;
this.sql`CREATE TABLE IF NOT EXISTS cost_log (
id INTEGER PRIMARY KEY AUTOINCREMENT,
function TEXT NOT NULL,
cost_usd REAL NOT NULL,
tokens_used INTEGER,
created_at TEXT DEFAULT (datetime('now'))
)`;
}
// Layer 1 → Layer 2: Chat with tools
async onChatMessage(onFinish?: StreamTextOnFinishCallback) {
const model = this.createModel("chat");
const result = streamText({
model,
system: `You are a research agent for the brand "${this.state.brandSlug}".
Use the available tools to research topics and save findings.
Always search before answering. Save important findings.`,
messages: this.messages,
tools: {
search: tool({
description: "Search the web for information on a topic",
parameters: z.object({
query: z.string().describe("Search query"),
}),
execute: async ({ query }) => {
// This call also goes through API Mom for cost tracking
const res = await fetch(
`${this.env.API_MOM_URL}/v1/brave/search?q=${encodeURIComponent(query)}`,
{
headers: {
"X-Project-Id": "research-agents",
"X-Function": "web-search",
"X-Api-Key": this.env.API_MOM_KEY,
},
}
);
return await res.text();
},
}),
saveFinding: tool({
description: "Save a research finding to the knowledge base",
parameters: z.object({
query: z.string(),
category: z.string(),
content: z.string(),
confidence: z.number().min(0).max(1),
sourceUrl: z.string().optional(),
}),
execute: async ({ query, category, content, confidence, sourceUrl }) => {
this.sql`INSERT INTO findings (query, category, content, confidence, source_url)
VALUES (${query}, ${category}, ${content}, ${confidence}, ${sourceUrl ?? null})`;
this.setState({
...this.state,
findingsCount: this.state.findingsCount + 1,
});
return `Saved finding in category "${category}" with confidence ${confidence}`;
},
}),
getFindings: tool({
description: "Retrieve previous research findings",
parameters: z.object({
category: z.string().optional(),
limit: z.number().default(10),
}),
execute: async ({ category, limit }) => {
const results = category
? [...this.sql`SELECT * FROM findings WHERE category = ${category} ORDER BY created_at DESC LIMIT ${limit}`]
: [...this.sql`SELECT * FROM findings ORDER BY created_at DESC LIMIT ${limit}`];
return JSON.stringify(results);
},
}),
},
maxSteps: 8,
onFinish: async (result) => {
// Track cost in agent state
const thinkingTokens = result.providerMetadata?.google?.usageMetadata?.thoughtsTokenCount ?? 0;
const cost = calculateCost(
"gemini-2.5-flash",
result.usage?.inputTokens ?? 0,
result.usage?.outputTokens ?? 0,
thinkingTokens,
);
this.sql`INSERT INTO cost_log (function, cost_usd, tokens_used)
VALUES ('chat', ${cost}, ${result.usage?.totalTokens ?? 0})`;
this.setState({
...this.state,
costToday: this.state.costToday + cost,
});
onFinish?.(result);
},
});
return result.toUIMessageStreamResponse();
}
// Structured output example — Layer 2 with schema validation
async analyzeCompetitors(keyword: string) {
const model = this.createModel("competitor-analysis");
const CompetitorSchema = z.object({
competitors: z.array(z.object({
name: z.string(),
url: z.string(),
strengths: z.array(z.string()),
weaknesses: z.array(z.string()),
estimatedTraffic: z.enum(["low", "medium", "high", "very-high"]),
})),
marketGaps: z.array(z.string()),
recommendation: z.string(),
});
const { object, usage } = await generateObject({
model,
schema: CompetitorSchema,
prompt: `Analyze the competitive landscape for "${keyword}".
Identify top competitors, their strengths and weaknesses,
and market gaps we could exploit.`,
maxRetries: 3,
});
return object;
}
}
The Data Flow
When a user sends a chat message to this agent, here’s exactly what happens:
1. WebSocket message arrives at the Cloudflare Worker
2. routeAgentRequest() routes to the correct ResearchAgent instance (Durable Object)
3. AIChatAgent loads conversation history from SQLite (Layer 1)
4. onChatMessage() is called
5. streamText() sends the prompt + tools to the AI SDK (Layer 2)
6. AI SDK creates an HTTP request to Gemini's API
7. Custom fetch() intercepts the request, adds proxy headers
8. Request goes to API Mom (Layer 3)
9. API Mom checks daily spend limit
10. API Mom injects the real GEMINI_API_KEY
11. API Mom forwards to Google's API
12. Google returns tokens + usage metadata
13. API Mom records cost to api_calls_ledger
14. API Mom forwards response back (with X-Cost-Usd header)
15. AI SDK parses the response, validates schema if applicable
16. If tool calls: execute tools, send results back to LLM (repeat 6-15)
17. Stream tokens back through WebSocket to client (Layer 1)
18. onFinish: record cost in agent's SQLite, update state
19. State change broadcasts to all connected WebSocket clients
20. Agent goes idle → eventually hibernates → $0 cost
Each layer handles its concern. The agent code in step 4 doesn’t know about API keys. The AI SDK in step 6 doesn’t know about Durable Objects. The proxy in step 9 doesn’t know about Zod schemas. They compose through narrow interfaces: function calls and HTTP.
Patterns
Pattern 1: Budget-Aware Agent
An agent that adjusts its behavior based on remaining budget. Uses the Wallet layer’s cost response headers to make real-time decisions.
export class BudgetAwareAgent extends Agent<Env, BudgetAgentState> {
initialState: BudgetAgentState = {
dailyBudget: 5.0,
spentToday: 0,
model: "gemini-2.5-flash",
taskQueue: [],
completedTasks: 0,
skippedTasks: 0,
};
private getModel() {
// Downgrade model when budget is tight
const remaining = this.state.dailyBudget - this.state.spentToday;
if (remaining < 0.50) {
return "gemini-2.0-flash"; // Cheapest: $0.10/M input
}
if (remaining < 2.0) {
return "gemini-2.5-flash"; // Middle: $0.15/M input, has thinking
}
return "gemini-2.5-pro"; // Best: $1.25/M input, best quality
}
private createProxiedModel(functionName: string) {
const modelId = this.getModel();
return createGoogleGenerativeAI({
apiKey: "proxied",
baseURL: `${this.env.API_MOM_URL}/v1/google`,
fetch: async (url, init) => {
const headers = new Headers(init?.headers);
headers.set("X-Project-Id", "budget-agent");
headers.set("X-Function", functionName);
headers.set("X-Api-Key", this.env.API_MOM_KEY);
const response = await fetch(url, { ...init, headers });
// Read cost from proxy response headers
const costUsd = parseFloat(response.headers.get("X-Cost-Usd") ?? "0");
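// Keep the previous value if the proxy did not send the header, rather than resetting to 0
const dailySpendHeader = response.headers.get("X-Daily-Spend");
const dailySpend = dailySpendHeader !== null ? parseFloat(dailySpendHeader) : this.state.spentToday;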
// Update agent state with real cost data
this.setState({
...this.state,
spentToday: dailySpend,
});
// Log cost to agent's SQLite for analysis
this.sql`INSERT INTO cost_events (function, model, cost_usd, daily_total)
VALUES (${functionName}, ${modelId}, ${costUsd}, ${dailySpend})`;
return response;
},
})(modelId);
}
async processTask(task: AgentTask) {
const remaining = this.state.dailyBudget - this.state.spentToday;
// Skip expensive tasks when budget is low
if (task.estimatedCost > remaining) {
this.sql`INSERT INTO skipped_tasks (task_id, reason, remaining_budget)
VALUES (${task.id}, 'budget_insufficient', ${remaining})`;
this.setState({
...this.state,
skippedTasks: this.state.skippedTasks + 1,
});
// Re-queue for tomorrow
this.schedule(tomorrowMidnight(), "retryTask");
return;
}
const model = this.createProxiedModel(task.function);
try {
const result = await generateObject({
model,
schema: task.outputSchema,
prompt: task.prompt,
maxRetries: 2,
});
this.setState({
...this.state,
completedTasks: this.state.completedTasks + 1,
});
return result.object;
} catch (err) {
if (err instanceof Error && err.message.includes("daily_limit_exceeded")) {
// Proxy rejected — we've hit the hard limit
console.warn("[budget] Hard limit hit, pausing until tomorrow");
this.setState({ ...this.state, status: "budget_exhausted" });
this.schedule(tomorrowMidnight(), "resetBudget");
}
throw err;
}
}
async resetBudget() {
this.setState({
...this.state,
spentToday: 0,
status: "idle",
});
// Re-process queued tasks
await this.processQueue();
}
}
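The agent above calls `tomorrowMidnight()` without defining it. A minimal sketch, assuming the budget day rolls over at 00:00 UTC (which matches SQLite's `datetime('now', 'start of day')` used by the proxy's daily-spend query); if your limits reset in a local time zone, this helper would need to change accordingly.

```typescript
// Hypothetical helper: the next 00:00:00 UTC strictly after `now`.
function tomorrowMidnight(now: Date = new Date()): Date {
  return new Date(
    Date.UTC(
      now.getUTCFullYear(),
      now.getUTCMonth(),
      now.getUTCDate() + 1, // Date.UTC normalizes overflow, so the 32nd rolls into the next month
    ),
  );
}
```

Taking `now` as a parameter keeps the function deterministic in tests while the default preserves the zero-argument call style used in the agent.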
Pattern 2: Multi-Provider Routing
Route different tasks to different LLM providers based on task complexity, leveraging the AI SDK’s provider abstraction.
type TaskComplexity = "trivial" | "standard" | "complex" | "critical";
interface ModelRoute {
provider: "google" | "anthropic" | "openai";
model: string;
costPerMillionInput: number;
costPerMillionOutput: number;
bestFor: string[];
}
const MODEL_ROUTES: Record<TaskComplexity, ModelRoute> = {
trivial: {
provider: "google",
model: "gemini-2.0-flash",
costPerMillionInput: 0.10,
costPerMillionOutput: 0.40,
bestFor: ["classification", "extraction", "formatting"],
},
standard: {
provider: "google",
model: "gemini-2.5-flash",
costPerMillionInput: 0.15,
costPerMillionOutput: 0.60,
bestFor: ["summarization", "content-generation", "tool-calling"],
},
complex: {
provider: "anthropic",
model: "claude-sonnet-4-20250514",
costPerMillionInput: 3.0,
costPerMillionOutput: 15.0,
bestFor: ["reasoning", "code-generation", "nuanced-analysis"],
},
critical: {
provider: "google",
model: "gemini-2.5-pro",
costPerMillionInput: 1.25,
costPerMillionOutput: 10.0,
bestFor: ["long-context", "multi-step-reasoning", "high-stakes-decisions"],
},
};
function createRoutedModel(
env: Env,
complexity: TaskComplexity,
functionName: string,
) {
const route = MODEL_ROUTES[complexity];
// Custom fetch that routes through the proxy
const proxiedFetch = async (url: RequestInfo | URL, init?: RequestInit) => {
const headers = new Headers(init?.headers);
headers.set("X-Project-Id", "routed-agents");
headers.set("X-Function", functionName);
headers.set("X-Api-Key", env.API_MOM_KEY);
headers.set("X-Tags", JSON.stringify([`complexity:${complexity}`, `provider:${route.provider}`]));
return fetch(url, { ...init, headers });
};
switch (route.provider) {
case "google":
return createGoogleGenerativeAI({
apiKey: "proxied",
baseURL: `${env.API_MOM_URL}/v1/google`,
fetch: proxiedFetch,
})(route.model);
case "anthropic":
return createAnthropic({
apiKey: "proxied",
baseURL: `${env.API_MOM_URL}/v1/anthropic`,
fetch: proxiedFetch,
})(route.model);
case "openai":
return createOpenAI({
apiKey: "proxied",
baseURL: `${env.API_MOM_URL}/v1/openai`,
fetch: proxiedFetch,
})(route.model);
}
}
// Usage in an agent
async function processWithRouting(task: AgentTask, env: Env) {
const complexity = await classifyComplexity(task, env);
const model = createRoutedModel(env, complexity, task.function);
const { object } = await generateObject({
model,
schema: task.schema,
prompt: task.prompt,
maxRetries: 3,
});
return object;
}
// Complexity classifier: itself a trivial LLM call
async function classifyComplexity(task: AgentTask, env: Env): Promise<TaskComplexity> {
const model = createRoutedModel(env, "trivial", "classify-complexity");
const { object } = await generateObject({
model,
schema: z.object({
complexity: z.enum(["trivial", "standard", "complex", "critical"]),
reasoning: z.string(),
}),
prompt: `Classify the complexity of this task:\n\n${task.prompt.slice(0, 500)}`,
});
return object.complexity;
}
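The per-million prices stored in `MODEL_ROUTES` are not used by `createRoutedModel` itself. One way to put them to work is a pre-flight check that downgrades a task when the requested tier would not fit the remaining budget. The sketch below is an assumption-laden illustration: `pickAffordableTier` is a hypothetical helper, the token counts are caller-supplied estimates, thinking tokens are ignored, and "priciest tier that still fits" is used as a rough stand-in for "most capable".

```typescript
type Tier = "trivial" | "standard" | "complex" | "critical";

interface RoutePricing {
  costPerMillionInput: number;
  costPerMillionOutput: number;
}

function estimateCostUsd(p: RoutePricing, inputTokens: number, outputTokens: number): number {
  return (inputTokens * p.costPerMillionInput + outputTokens * p.costPerMillionOutput) / 1_000_000;
}

// Keep the requested tier if its estimate fits; otherwise fall back to the
// priciest tier that still fits, or null when nothing does.
function pickAffordableTier(
  pricing: Record<Tier, RoutePricing>,
  requested: Tier,
  inputTokens: number,
  outputTokens: number,
  remainingUsd: number,
): Tier | null {
  const estimate = (t: Tier) => estimateCostUsd(pricing[t], inputTokens, outputTokens);
  if (estimate(requested) <= remainingUsd) return requested;
  const affordable = (Object.keys(pricing) as Tier[])
    .filter((t) => estimate(t) <= remainingUsd)
    .sort((a, b) => estimate(b) - estimate(a));
  return affordable[0] ?? null;
}
```

A `null` result is the signal to skip or defer the task rather than send a request the proxy would reject anyway.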
Pattern 3: Agent-to-Agent Communication via Queues
Multiple agents that coordinate through Cloudflare Queues, with each agent independently managing its own state and cost budget.
// Research Agent — finds information
export class ResearchWorkerAgent extends Agent<Env, ResearchState> {
initialState: ResearchState = {
status: "idle",
assignedKeywords: [],
completedKeywords: [],
costToday: 0,
};
async onStart() {
this.sql`CREATE TABLE IF NOT EXISTS research_results (
keyword TEXT PRIMARY KEY,
serp_data TEXT,
competitor_data TEXT,
opportunity_score REAL,
researched_at TEXT DEFAULT (datetime('now'))
)`;
}
// Triggered by queue message from Coordinator
async handleResearchRequest(keyword: string, correlationId: string) {
this.setState({ ...this.state, status: "researching" });
const model = this.createProxiedModel("keyword-research");
// Step 1: Search
const searchResults = await this.searchWeb(keyword);
// Step 2: Analyze with LLM
const { object: analysis } = await generateObject({
model,
schema: KeywordAnalysisSchema,
prompt: `Analyze the SERP for "${keyword}":\n${searchResults}`,
});
// Step 3: Save results
this.sql`INSERT OR REPLACE INTO research_results
(keyword, serp_data, competitor_data, opportunity_score)
VALUES (${keyword}, ${JSON.stringify(searchResults)},
${JSON.stringify(analysis.competitors)},
${analysis.opportunityScore})`;
// Step 4: Emit completion event
await this.env.EVENTS_QUEUE.send({
event_id: crypto.randomUUID(),
type: "research.completed",
source: "research-agent",
timestamp: new Date().toISOString(),
correlation_id: correlationId,
payload: {
keyword,
opportunityScore: analysis.opportunityScore,
competitorCount: analysis.competitors.length,
},
});
this.setState({
...this.state,
status: "idle",
completedKeywords: [...this.state.completedKeywords, keyword],
});
}
}
// Content Agent — generates content based on research
export class ContentWorkerAgent extends Agent<Env, ContentState> {
initialState: ContentState = {
status: "idle",
articlesGenerated: 0,
costToday: 0,
};
// Triggered by research.completed event
async handleResearchCompleted(event: DomainMessage<ResearchCompletedPayload>) {
const { keyword, opportunityScore } = event.payload;
// Only generate content for high-opportunity keywords
if (opportunityScore < 0.6) {
console.log(`[content] Skipping "${keyword}" — score ${opportunityScore} below threshold`);
return;
}
this.setState({ ...this.state, status: "generating" });
// Use a more capable model for content generation
const model = this.createProxiedModel("article-write");
const { object: article } = await generateObject({
model,
schema: ArticleSchema,
prompt: `Write a comprehensive article about "${keyword}" targeting
users searching for this term. Include practical advice,
comparisons, and actionable steps.`,
maxRetries: 3,
});
// Emit publish command
await this.env.PUBLISH_QUEUE.send({
event_id: crypto.randomUUID(),
type: "content.publish",
source: "content-agent",
timestamp: new Date().toISOString(),
correlation_id: event.correlation_id,
payload: {
keyword,
title: article.title,
slug: article.slug,
content: article.content, // Normally stored in R2, reference in message
},
});
this.setState({
...this.state,
status: "idle",
articlesGenerated: this.state.articlesGenerated + 1,
});
}
}
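Both agents exchange messages with the same envelope, but the `DomainMessage` type is never shown. The sketch below infers its shape from the fields sent to the queues above (`event_id`, `type`, `source`, `timestamp`, `correlation_id`, `payload`); the `isDomainMessage` type guard is an illustrative addition for consumers that want to reject malformed messages before acting on them.

```typescript
// Envelope shape inferred from the queue messages in this pattern.
interface DomainMessage<P = unknown> {
  event_id: string;
  type: string;
  source: string;
  timestamp: string; // ISO 8601
  correlation_id: string;
  payload: P;
}

// Hypothetical runtime check; it validates the envelope only, not the payload.
function isDomainMessage(value: unknown): value is DomainMessage {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.event_id === "string" &&
    typeof v.type === "string" &&
    typeof v.source === "string" &&
    typeof v.timestamp === "string" &&
    typeof v.correlation_id === "string" &&
    "payload" in v
  );
}
```

Payload validation is left to the handler for each event type, for example with the Zod schemas used elsewhere in this article.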
Pattern 4: Cloudflare AI Gateway Integration
Instead of a custom proxy, use Cloudflare’s AI Gateway for caching, rate limiting, and observability. This works as a lightweight Layer 3 when you don’t need custom cost attribution.
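import { AIChatAgent } from "agents/ai-chat-agent";
import { streamText } from "ai";
import { createAiGateway } from "ai-gateway-provider";
import { createGoogleGenerativeAI } from "ai-gateway-provider/providers/google";
// onChatMessage() and this.messages come from AIChatAgent, not the base Agent class
export class GatewayAgent extends AIChatAgent<Env, AgentState> {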
private createModel() {
// Route through Cloudflare AI Gateway
const aigateway = createAiGateway({
binding: this.env.AI.gateway("my-gateway"),
options: {
cacheTtl: 300, // Cache responses for 5 minutes
},
});
const google = createGoogleGenerativeAI({
apiKey: this.env.GEMINI_API_KEY,
});
// Compose: AI Gateway wraps the Google provider
return aigateway(google("gemini-2.5-flash"));
}
async onChatMessage() {
const model = this.createModel();
const result = streamText({
model,
messages: this.messages,
maxSteps: 5,
});
return result.toUIMessageStreamResponse();
}
}
The AI Gateway provides:
- Caching: Identical prompts hit the cache instead of the API ($0)
- Rate limiting: Prevent abuse at the gateway level
- Logging: Every request logged with latency, tokens, and cost
- Analytics: Dashboard showing usage patterns per model
- Fallbacks: If one provider fails, automatically try another
But it doesn’t provide:
- Function-level cost attribution
- Daily spend limits per project
- Cross-service cost aggregation
- Custom cost formulas (like thinking token pricing)
This is why many production systems use both: AI Gateway for caching and fallbacks, a custom proxy for attribution and enforcement.
Pattern 5: Structured Output Pipeline
A pipeline that chains multiple LLM calls, each with its own schema, building on the output of the previous step.
const OutlineSchema = z.object({
title: z.string(),
sections: z.array(z.object({
heading: z.string(),
keyPoints: z.array(z.string()),
estimatedWordCount: z.number(),
})),
targetWordCount: z.number(),
targetAudience: z.string(),
});
const DraftSchema = z.object({
title: z.string(),
content: z.string().describe("Full article in markdown"),
wordCount: z.number(),
metaDescription: z.string().max(160),
tags: z.array(z.string()),
});
const EditSchema = z.object({
content: z.string().describe("Edited article in markdown"),
changes: z.array(z.object({
type: z.enum(["grammar", "clarity", "seo", "structure"]),
description: z.string(),
})),
readabilityScore: z.number().min(0).max(100),
});
async function articlePipeline(
keyword: string,
research: string,
env: Env,
): Promise<{ article: z.infer<typeof EditSchema>; totalCost: number }> {
let totalCost = 0;
const createModel = (fn: string) => createGoogleGenerativeAI({
apiKey: "proxied",
baseURL: `${env.API_MOM_URL}/v1/google`,
fetch: async (url, init) => {
const headers = new Headers(init?.headers);
headers.set("X-Project-Id", "content-pipeline");
headers.set("X-Function", fn);
headers.set("X-Api-Key", env.API_MOM_KEY);
const response = await fetch(url, { ...init, headers });
totalCost += parseFloat(response.headers.get("X-Cost-Usd") ?? "0");
return response;
},
})("gemini-2.5-flash");
// Step 1: Outline (cheap, fast)
const { object: outline } = await generateObject({
model: createModel("outline"),
schema: OutlineSchema,
prompt: `Create an article outline for "${keyword}" based on this research:\n${research}`,
});
// Step 2: Draft (main cost — longer output)
const { object: draft } = await generateObject({
model: createModel("draft"),
schema: DraftSchema,
prompt: `Write a full article following this outline:\n${JSON.stringify(outline)}`,
maxOutputTokens: 16384,
});
// Step 3: Edit (moderate cost — reviews full article)
const { object: edited } = await generateObject({
model: createModel("edit"),
schema: EditSchema,
prompt: `Edit this article for clarity, SEO, and readability:\n${draft.content}`,
});
console.log(`[pipeline] Article for "${keyword}" complete. Total cost: $${totalCost.toFixed(4)}`);
return { article: edited, totalCost };
}
Small Examples
Example 1: Minimal Agent with State Sync
The simplest possible agent — state syncs to all connected clients in real-time.
import { Agent, type Connection } from "agents";
interface CounterState {
count: number;
lastUpdated: string | null;
}
export class CounterAgent extends Agent<Env, CounterState> {
initialState: CounterState = { count: 0, lastUpdated: null };
async onMessage(connection: Connection, message: string) {
const { action } = JSON.parse(message);
if (action === "increment") {
this.setState({
count: this.state.count + 1,
lastUpdated: new Date().toISOString(),
});
}
if (action === "reset") {
this.setState({ count: 0, lastUpdated: new Date().toISOString() });
}
// State automatically broadcasts to all connected WebSocket clients
}
}
Example 2: Custom Fetch Logger
Intercept all AI SDK requests to log request/response details — useful for debugging.
function createLoggingModel(apiKey: string, modelId: string) {
return createGoogleGenerativeAI({
apiKey,
fetch: async (url, init) => {
const startMs = Date.now();
const requestBody = init?.body ? JSON.parse(init.body as string) : null;
console.log(`[llm:request] ${url}`);
console.log(`[llm:request] Prompt tokens (est): ${estimateTokens(requestBody)}`);
const response = await fetch(url, init);
const durationMs = Date.now() - startMs;
// Clone to read body without consuming the stream
const cloned = response.clone();
const responseBody = await cloned.json();
const usage = responseBody.usageMetadata;
console.log(`[llm:response] ${durationMs}ms | ${response.status}`);
console.log(`[llm:response] Input: ${usage?.promptTokenCount ?? "?"} | Output: ${usage?.candidatesTokenCount ?? "?"} | Thinking: ${usage?.thoughtsTokenCount ?? 0}`);
return response;
},
})(modelId);
}
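The logger above calls `estimateTokens`, which is not defined in the example. A rough sketch follows, assuming the common rule of thumb of about four characters per token for English text. This is only a heuristic: it counts the JSON structure of the request as well as the prompt, and it can be far off for code or non-English text, so treat the number as a log annotation and rely on the provider's reported usage for billing.

```typescript
// Hypothetical helper: a crude size-based token estimate for log output.
function estimateTokens(requestBody: unknown, charsPerToken = 4): number {
  if (requestBody == null) return 0;
  const text =
    typeof requestBody === "string" ? requestBody : JSON.stringify(requestBody) ?? "";
  return Math.ceil(text.length / charsPerToken);
}
```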
Example 3: Schema-First Tool Definition
Define tools using Zod schemas for type safety: the arguments the model supplies are validated against the schema at runtime, and execute receives them fully typed.
import { generateText, tool } from "ai";
import { google } from "@ai-sdk/google";
import { z } from "zod";
const weatherTool = tool({
description: "Get current weather for a location",
parameters: z.object({
city: z.string().describe("City name"),
units: z.enum(["celsius", "fahrenheit"]).default("celsius"),
}),
execute: async ({ city, units }) => {
// city is typed as string, units as "celsius" | "fahrenheit"
const response = await fetch(
`https://api.weather.example.com/v1/current?city=${encodeURIComponent(city)}&units=${units}`
);
const data = await response.json();
return `${city}: ${data.temperature}°${units === "celsius" ? "C" : "F"}, ${data.condition}`;
},
});
// Use in generateText
const { text } = await generateText({
model: google("gemini-2.5-flash"),
tools: { weather: weatherTool },
prompt: "What's the weather in Tokyo?",
maxSteps: 3,
});
Example 4: Thinking Budget Control
Limit thinking tokens to control cost — useful when you want speed over depth.
async function quickClassification(text: string) {
const { object } = await generateObject({
model: google("gemini-2.5-flash"),
schema: z.object({
category: z.enum(["positive", "negative", "neutral"]),
confidence: z.number().min(0).max(1),
}),
prompt: `Classify the sentiment: "${text}"`,
providerOptions: {
google: {
thinkingConfig: { thinkingBudget: 0 }, // No thinking — fastest, cheapest
},
},
});
return object;
}
async function deepAnalysis(text: string) {
const { object } = await generateObject({
model: google("gemini-2.5-flash"),
schema: z.object({
sentiment: z.enum(["positive", "negative", "neutral", "mixed"]),
themes: z.array(z.string()),
reasoning: z.string(),
suggestedActions: z.array(z.string()),
}),
prompt: `Deeply analyze this text for sentiment, themes, and actionable insights:\n\n${text}`,
providerOptions: {
google: {
thinkingConfig: { thinkingBudget: 4096 }, // Allow thinking — better quality
},
},
});
return object;
}
Example 5: Agent with Scheduled Self-Improvement
An agent that periodically reviews its own performance and adjusts its strategy.
export class AdaptiveAgent extends Agent<Env, AdaptiveState> {
initialState: AdaptiveState = {
strategy: "balanced",
successRate: 0,
avgCostPerTask: 0,
tasksProcessed: 0,
};
async onStart() {
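// Self-review every 6 hours. onStart runs on every wake-up, so register the cron only once.
const alreadyScheduled = this.getSchedules({ type: "cron" }).some((s) => s.callback === "selfReview");
if (!alreadyScheduled) {
await this.schedule("0 */6 * * *", "selfReview");
}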
this.sql`CREATE TABLE IF NOT EXISTS task_outcomes (
id INTEGER PRIMARY KEY AUTOINCREMENT,
task_type TEXT,
model_used TEXT,
cost_usd REAL,
success INTEGER,
quality_score REAL,
created_at TEXT DEFAULT (datetime('now'))
)`;
}
async selfReview() {
const stats = [...this.sql`
SELECT
model_used,
COUNT(*) as total,
SUM(success) as successes,
AVG(cost_usd) as avg_cost,
AVG(quality_score) as avg_quality
FROM task_outcomes
WHERE created_at > datetime('now', '-6 hours')
GROUP BY model_used
`];
if (stats.length === 0) return;
// Use the cheapest model for self-reflection
const model = this.createProxiedModel("self-review");
const { object: review } = await generateObject({
model,
schema: z.object({
recommendation: z.enum(["use_cheaper_model", "use_better_model", "stay_current"]),
reasoning: z.string(),
suggestedThinkingBudget: z.number(),
}),
prompt: `Review agent performance stats and recommend adjustments:\n${JSON.stringify(stats)}`,
});
this.setState({
...this.state,
strategy: review.recommendation,
});
this.sql`INSERT INTO task_outcomes (task_type, model_used, cost_usd, success, quality_score)
VALUES ('self-review', 'gemini-2.0-flash', 0.001, 1, 1.0)`;
}
}
Example 6: Idempotent Queue Consumer with Cost Tracking
A queue consumer that deduplicates messages and tracks cost per processed item.
export default {
async queue(batch: MessageBatch<DomainMessage>, env: Env) {
for (const msg of batch.messages) {
const { event_id, type, payload } = msg.body;
// Idempotency check
const existing = await env.DB.prepare(
`SELECT 1 FROM processed_events WHERE event_id = ?`
).bind(event_id).first();
if (existing) {
msg.ack();
continue;
}
try {
let costUsd = 0;
if (type === "content.generate") {
const model = createGoogleGenerativeAI({
apiKey: "proxied",
baseURL: `${env.API_MOM_URL}/v1/google`,
fetch: async (url, init) => {
const headers = new Headers(init?.headers);
headers.set("X-Project-Id", "content-worker");
headers.set("X-Function", "content-generate");
headers.set("X-Api-Key", env.API_MOM_KEY);
headers.set("X-Tags", JSON.stringify([`event:${event_id}`]));
const res = await fetch(url, { ...init, headers });
costUsd += parseFloat(res.headers.get("X-Cost-Usd") ?? "0");
return res;
},
})("gemini-2.5-flash");
await generateObject({
model,
schema: ArticleSchema,
prompt: `Generate content for: ${payload.keyword}`,
});
}
// Mark as processed
await env.DB.prepare(
`INSERT INTO processed_events (event_id, type, cost_usd, processed_at)
VALUES (?, ?, ?, datetime('now'))`
).bind(event_id, type, costUsd).run();
msg.ack();
} catch (err) {
console.error(`[queue] Failed ${event_id}: ${err}`);
msg.retry({ delaySeconds: 30 });
}
}
},
};
Example 7: Provider Fallback Chain
Try providers in order — if one fails (rate limit, outage), fall back to the next.
async function generateWithFallback<T>(
schema: z.ZodType<T>,
prompt: string,
env: Env,
): Promise<{ data: T; provider: string; cost: number }> {
const providers = [
{
name: "google",
create: () => createGoogleGenerativeAI({
apiKey: "proxied",
baseURL: `${env.API_MOM_URL}/v1/google`,
fetch: proxyFetch(env, "google-primary"),
})("gemini-2.5-flash"),
},
{
name: "anthropic",
create: () => createAnthropic({
apiKey: "proxied",
baseURL: `${env.API_MOM_URL}/v1/anthropic`,
fetch: proxyFetch(env, "anthropic-fallback"),
})("claude-sonnet-4-20250514"),
},
{
name: "openai",
create: () => createOpenAI({
apiKey: "proxied",
baseURL: `${env.API_MOM_URL}/v1/openai`,
fetch: proxyFetch(env, "openai-fallback"),
})("gpt-4o"),
},
];
for (const provider of providers) {
try {
const model = provider.create();
const { object } = await generateObject({ model, schema, prompt, maxRetries: 2 });
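// cost is a placeholder here; the real value would have to come from the proxy's X-Cost-Usd header
return { data: object, provider: provider.name, cost: 0 };
} catch (err) {
const message = err instanceof Error ? err.message : String(err);
console.warn(`[fallback] ${provider.name} failed: ${message}`);
if (provider === providers[providers.length - 1]) {
throw new Error(`All providers failed. Last error: ${message}`);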
}
continue;
}
}
throw new Error("Unreachable");
}
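The fallback chain relies on a `proxyFetch(env, functionName)` helper that is not shown. A minimal sketch follows, assuming it does the same header injection as the inline `fetch` wrappers elsewhere in this article. The project id `"fallback-agents"`, the optional `onCost` callback, and the injectable `baseFetch` parameter are all assumptions added for illustration and testability.

```typescript
interface ProxyEnv {
  API_MOM_KEY: string;
}

type FetchLike = (
  input: Parameters<typeof fetch>[0],
  init?: Parameters<typeof fetch>[1],
) => Promise<Response>;

// Hypothetical helper matching the proxyFetch(env, functionName) calls above.
function proxyFetch(
  env: ProxyEnv,
  functionName: string,
  onCost?: (costUsd: number) => void,
  baseFetch: FetchLike = (input, init) => fetch(input, init),
): FetchLike {
  return async (input, init) => {
    const headers = new Headers(init?.headers);
    headers.set("X-Project-Id", "fallback-agents"); // assumed project id
    headers.set("X-Function", functionName);
    headers.set("X-Api-Key", env.API_MOM_KEY);

    const response = await baseFetch(input, { ...init, headers });

    // Report the proxy's cost header when it is present and numeric.
    const cost = Number.parseFloat(response.headers.get("X-Cost-Usd") ?? "");
    if (Number.isFinite(cost)) onCost?.(cost);
    return response;
  };
}
```

Passing an accumulator as `onCost` is one way the fallback function could report a real per-call cost instead of a fixed value, including the cost of attempts that failed before a later provider succeeded.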
Example 8: Real-Time Cost Dashboard via WebSocket
An agent that tracks cost across all agent instances and broadcasts to a monitoring dashboard.
export class CostMonitorAgent extends Agent<Env, CostMonitorState> {
initialState: CostMonitorState = {
totalToday: 0,
byProject: {},
alerts: [],
lastUpdated: null,
};
async onStart() {
// Poll cost data every 5 minutes
this.schedule("*/5 * * * *", "refreshCosts");
}
async refreshCosts() {
const response = await fetch(`${this.env.API_MOM_URL}/v1/costs?period=day`, {
headers: { "X-Api-Key": this.env.API_MOM_ADMIN_KEY },
});
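// Keep the last known state if the proxy request fails
if (!response.ok) return;
// Response shape assumed from the fields read below
const costs = (await response.json()) as {
totalSpend: number;
projects: { name: string; spent: number; dailyLimit: number }[];
};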
const alerts: string[] = [];
// Check for projects approaching limits
for (const project of costs.projects) {
const usagePercent = (project.spent / project.dailyLimit) * 100;
if (usagePercent > 80) {
alerts.push(`${project.name}: ${usagePercent.toFixed(0)}% of daily budget ($${project.spent.toFixed(2)}/$${project.dailyLimit})`);
}
}
this.setState({
totalToday: costs.totalSpend,
byProject: Object.fromEntries(costs.projects.map(p => [p.name, p.spent])),
alerts,
lastUpdated: new Date().toISOString(),
});
// State update automatically broadcasts to all connected dashboard clients
// via WebSocket — no explicit push needed
}
}
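The alert loop in `refreshCosts` divides by `dailyLimit`, so a project with no limit configured would produce an `Infinity%` alert. A minimal sketch of the same threshold check as a standalone function follows. The `ProjectCost` shape mirrors the fields the agent reads, while `buildBudgetAlerts` and its `thresholdPercent` parameter are hypothetical names introduced here for illustration.

```typescript
// Shape assumed from the fields read in refreshCosts().
interface ProjectCost {
  name: string;
  spent: number;
  dailyLimit: number;
}

// Returns one alert line per project whose spend exceeds the threshold.
function buildBudgetAlerts(projects: ProjectCost[], thresholdPercent = 80): string[] {
  const alerts: string[] = [];
  for (const p of projects) {
    // A zero or missing limit means "no budget configured": skip it
    // instead of dividing by zero.
    if (!(p.dailyLimit > 0)) continue;
    const usagePercent = (p.spent / p.dailyLimit) * 100;
    if (usagePercent > thresholdPercent) {
      alerts.push(
        `${p.name}: ${usagePercent.toFixed(0)}% of daily budget ($${p.spent.toFixed(2)}/$${p.dailyLimit.toFixed(2)})`,
      );
    }
  }
  return alerts;
}

const demoAlerts = buildBudgetAlerts([
  { name: "blog", spent: 9, dailyLimit: 10 },
  { name: "seo", spent: 2, dailyLimit: 10 },
  { name: "lab", spent: 5, dailyLimit: 0 },
]);
console.log(demoAlerts); // [ 'blog: 90% of daily budget ($9.00/$10.00)' ]
```

Keeping the check pure also makes it easy to unit test without a Durable Object or a running proxy.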
Comparisons
Agent Runtime Frameworks
| Framework | Runtime Model | State Persistence | Scaling Model | Cost When Idle | Language | Edge/Serverless |
|---|---|---|---|---|---|---|
| Cloudflare Agents SDK | Durable Objects (micro-servers) | Built-in SQLite + key-value state | Millions of instances, auto-scale | $0 (hibernation) | TypeScript | Yes (global edge) |
| OpenAI Agents SDK | In-process (client manages runtime) | None (bring your own) | Manual (run more processes) | N/A (not a host) | Python/TypeScript | No |
| LangGraph | In-process or LangGraph Cloud | Checkpointing (pluggable store) | LangGraph Cloud or manual | Depends on host | Python/TypeScript | Cloud only |
| CrewAI | In-process | None built-in | Manual | Depends on host | Python | No |
| AutoGen / MS Agent Framework | In-process, conversation-based | Session-level | Manual | Depends on host | Python | No |
| AWS Bedrock Agents | Managed service | Session memory (managed) | Auto-scale | Per-second billing (no true $0) | API-based | No (AWS regions) |
Verdict: If you need agents that hibernate to $0, run on the edge, and scale to millions of instances without infrastructure management, the Cloudflare Agents SDK is unique. If you need a rich ecosystem of pre-built agent patterns and don’t care about runtime, LangGraph or OpenAI Agents SDK give you more out of the box.
LLM Interface Libraries
| Library | Structured Output | Tool Calling | Streaming | Provider Abstraction | Custom Fetch | Bundle Size | Edge Compatible |
|---|---|---|---|---|---|---|---|
| Vercel AI SDK | Zod schemas → generateObject | tool() + maxSteps loops | streamText | 25+ providers | Yes (per-provider) | ~67 KB gzipped | Yes |
| LangChain JS | Output parsers (less type-safe) | Agent executors, tool chains | Streaming callbacks | 50+ providers | Via custom LLM | ~101 KB gzipped | Limited |
| OpenAI SDK | JSON mode + function calling | Native function calling | Streaming helpers | OpenAI only | No | ~20 KB gzipped | Yes |
| Raw fetch | Manual JSON.parse + validation | Manual tool loop | Manual SSE parsing | One at a time | N/A | 0 KB | Yes |
| @cloudflare/ai-gateway SDK | Zod schemas | Tool calling | Streaming | Multi-provider via Gateway | N/A | Small | Yes |
Verdict: The Vercel AI SDK is the best fit for TypeScript-first, edge-compatible applications. It gives you type-safe structured output without the weight of LangChain, and provider abstraction without the lock-in of the OpenAI SDK. The custom fetch option is the key feature that enables the proxy layer integration.
LLM Proxy / Gateway Solutions
| Solution | Self-Hosted | Cost Tracking | Daily Limits | Function Attribution | Custom Cost Formulas | Thinking Token Support | Open Source |
|---|---|---|---|---|---|---|---|
| Custom Proxy (API Mom pattern) | Yes (your Worker) | Full control | Yes | Yes (X-Function header) | Yes | Yes | Your code |
| Cloudflare AI Gateway | Managed | Dashboard only | Rate limiting | No | No | Limited | No |
| LiteLLM | Yes | Per-model tracking | Yes | Limited | Via callbacks | Partial | Yes |
| Helicone | Yes (or hosted) | Per-request traces | Alerting | Via headers | Limited | Partial | Yes |
| Portkey | Hosted ($49/mo+) | Full traces | Budget caps | Via metadata | Via plugins | Yes | No |
| Direct API calls | N/A | None | None | None | None | None | N/A |
Verdict: For maximum control and Cloudflare-native integration, a custom proxy Worker is ideal — you own the cost formula, the attribution schema, and the enforcement logic. For teams that don’t want to build infrastructure, Helicone or Portkey provide good observability out of the box. Cloudflare AI Gateway is a good middle ground for caching and fallbacks but lacks the attribution depth needed for multi-service cost control.
Full Architecture Comparison
| Approach | State | LLM | Cost Control | Complexity | Monthly Cost (10K calls) |
|---|---|---|---|---|---|
| Three-Layer (this article) | Durable Objects | AI SDK | Custom proxy | Medium | ~$15 infra + LLM costs |
| Monolithic Worker | KV / D1 | Raw fetch | None | Low initially, high later | ~$5 infra + unknown LLM |
| LangChain + VPS | Redis / Postgres | LangChain | LiteLLM proxy | High | ~$50 server + LLM costs |
| AWS Bedrock Agents | Managed | Managed | CloudWatch | Medium | ~$100+ (multi-layer billing) |
| OpenAI Agents + Vercel | Vercel KV | OpenAI SDK | OpenAI dashboard | Low-Medium | ~$20 Vercel + LLM costs |
Anti-Patterns
| Don’t | Do Instead | Why |
|---|---|---|
| Put API keys in every worker’s wrangler.jsonc | Single proxy holds all API keys | One place to rotate, audit, and limit keys |
| Call fetch("https://generativelanguage.googleapis.com/...") directly | Use createGoogleGenerativeAI() from @ai-sdk/google | Type safety, retries, structured output, provider switching |
| Parse LLM JSON with JSON.parse() and hope | Use generateObject() with a Zod schema | Automatic validation, typed results, retry on schema failure |
| Price all output tokens at the same rate | Separate thinking tokens ($3.50/M) from output tokens ($0.60/M) | 3-9x cost undercount if you don’t |
| Store conversation history in KV | Use AIChatAgent (auto-persists to SQLite) | KV can’t query, can’t paginate, can’t search |
| Create one giant “orchestrator” agent | Use multiple specialized agents communicating via Queues | Isolation, independent scaling, independent budgets |
| Run expensive LLM calls in crons without cost ceilings | Set daily spend limits in the proxy layer | A cron running every 15 minutes can burn $30 overnight |
| Trust internal cost tracking without reconciliation | Compare proxy totals against provider billing monthly | Internal tracking has bugs — the provider bill is the truth |
| Use setInterval() in Durable Objects | Use this.schedule() or alarm-based scheduling | setInterval prevents hibernation, costs money 24/7 |
| Embed LLM calling logic in the agent class | Create a shared LLM harness module | DRY, consistent cost tracking, single place for pricing updates |
| Skip the proxy layer “because we only have one service” | Add the proxy from day one | You will have more services. Retrofitting cost tracking is 10x harder |
| Use maxSteps: 100 for tool-calling agents | Start with maxSteps: 5-10, increase deliberately | Each step is a full LLM call. 100 steps = 100x cost. |
| Retry forever on LLM failures | Use maxRetries: 2-3 and handle NoObjectGeneratedError | Infinite retries = infinite cost |
| Let the AI SDK hit the provider directly from all environments | Route dev/staging through the same proxy with separate project IDs | Dev cost is invisible otherwise; it also shares rate limits |
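The thinking-token row is the easiest of these to get wrong without noticing, so a worked calculation may help. The sketch below is an illustration, not the proxy's actual cost formula: the `TokenUsage` and `Pricing` shapes and both function names are hypothetical, and the rates are the Gemini 2.5 Flash figures quoted in this article.

```typescript
// Hypothetical shapes for illustration; real usage objects vary by provider.
interface TokenUsage {
  inputTokens: number;
  outputTokens: number; // visible output only
  thinkingTokens: number; // reasoning tokens, billed at a separate rate
}

interface Pricing {
  inputPerM: number; // USD per million tokens
  outputPerM: number;
  thinkingPerM: number;
}

// Rates quoted in this article for Gemini 2.5 Flash.
const GEMINI_25_FLASH: Pricing = { inputPerM: 0.15, outputPerM: 0.6, thinkingPerM: 3.5 };

// Naive formula: treats thinking tokens as ordinary output tokens.
function naiveCostUsd(u: TokenUsage, p: Pricing): number {
  const outputLike = u.outputTokens + u.thinkingTokens;
  return (u.inputTokens * p.inputPerM + outputLike * p.outputPerM) / 1_000_000;
}

// Corrected formula: prices thinking tokens at their own rate.
function actualCostUsd(u: TokenUsage, p: Pricing): number {
  return (
    (u.inputTokens * p.inputPerM +
      u.outputTokens * p.outputPerM +
      u.thinkingTokens * p.thinkingPerM) /
    1_000_000
  );
}

const usage: TokenUsage = { inputTokens: 1_000, outputTokens: 500, thinkingTokens: 4_000 };
const naive = naiveCostUsd(usage, GEMINI_25_FLASH);
const actual = actualCostUsd(usage, GEMINI_25_FLASH);
console.log(`naive=$${naive.toFixed(5)} actual=$${actual.toFixed(5)}`);
console.log(`undercount factor: ${(actual / naive).toFixed(1)}x`);
```

For this made-up call the naive formula reports about $0.00285 against a real cost of about $0.01445, roughly a 5x undercount. The factor grows as the share of thinking tokens grows, which is how the 3-9x range in the table arises.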
When to Skip a Layer
You don’t always need all three layers. Here’s when each is optional:
Skip Layer 1 (Agent Runtime) When:
- Stateless LLM calls: Your Worker receives a request, calls an LLM, returns the response. No memory, no scheduling, no WebSockets. A plain Worker with the AI SDK is fine.
- Batch processing: You’re processing a queue of items through an LLM. The queue consumer doesn’t need to be a Durable Object; a regular Worker with a queue() handler works.
- Simple API endpoints: POST /api/summarize that takes text and returns a summary. No state to manage.
Use Layer 1 when you need: persistence across requests, real-time WebSocket connections, scheduling, or millions of independent agent instances.
Skip Layer 2 (LLM Interface) When:
- No LLM calls: Your agent does deterministic work — monitors health, manages queues, aggregates metrics. Not every agent needs AI.
- Workers AI only: If you’re using only Cloudflare Workers AI models, you can call them directly via env.AI.run(). The AI SDK adds value primarily for external providers.
- Simple completions: If you just need env.AI.run("@cf/meta/llama-3.1-8b-instruct", { messages }) with no structured output, no tools, and no provider switching, the raw binding is simpler.
Use Layer 2 when you need: structured output (Zod schemas), tool calling, provider abstraction, or streaming to UI clients.
Skip Layer 3 (API Proxy) When:
- Development only: You’re prototyping and haven’t committed to a production architecture yet. Direct API calls with hardcoded keys are fine for exploration.
- Single service, single provider: You have one Worker making a few LLM calls per day with a hard-coded budget. The overhead of a proxy isn’t justified.
- Cloudflare AI Gateway is sufficient: You need caching and rate limiting but not function-level cost attribution.
Use Layer 3 when you need: multi-service cost attribution, daily spend limits, centralized API key management, or when you’ve been surprised by a bill even once.
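As a rough illustration of what the Wallet layer asks of calling code, the sketch below wraps `fetch` so that every request carries the attribution headers and a spent budget surfaces as a typed error. The `X-Project-Id` and `X-Function` headers and the 429 `daily_limit_exceeded` response come from this article. The names `createProxyFetch` and `DailyLimitExceededError`, and the assumption that the proxy returns the reason as `{ "error": "daily_limit_exceeded" }` in a JSON body, are hypothetical.

```typescript
// Thrown when the proxy rejects a call because the daily budget is spent.
class DailyLimitExceededError extends Error {
  readonly projectId: string;
  constructor(projectId: string) {
    super(`Daily spend limit exceeded for project ${projectId}`);
    this.name = "DailyLimitExceededError";
    this.projectId = projectId;
  }
}

type FetchLike = (input: string | URL | Request, init?: RequestInit) => Promise<Response>;

// Returns a fetch-compatible function that tags every call for attribution.
// `baseFetch` is injectable so the wrapper can be exercised without a network.
function createProxyFetch(
  projectId: string,
  functionName: string,
  baseFetch: FetchLike = fetch,
): FetchLike {
  return async (input, init) => {
    const headers = new Headers(init?.headers);
    headers.set("X-Project-Id", projectId);
    headers.set("X-Function", functionName);

    const res = await baseFetch(input, { ...init, headers });

    // Assumption: the proxy answers 429 with a JSON body naming the reason.
    if (res.status === 429) {
      const body = (await res.clone().json().catch(() => null)) as { error?: string } | null;
      if (body?.error === "daily_limit_exceeded") {
        throw new DailyLimitExceededError(projectId);
      }
    }
    return res;
  };
}
```

A function like this can be passed as the `fetch` option of an AI SDK provider factory, the same slot `proxyFetch(env, ...)` fills in Example 7. Callers can then catch `DailyLimitExceededError` and degrade (skip the job, serve cached content) rather than retrying into a hard limit.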
The Progressive Adoption Path
Day 1: Layer 2 only (AI SDK in a Worker)
↓
Week 2: Layer 1 + Layer 2 (AI SDK in a Durable Object agent)
↓
Month 1: All three layers (agent + AI SDK + proxy)
You don’t have to build all three on day one. Start with the AI SDK for type-safe LLM calls. Add the agent runtime when you need state. Add the proxy when you need cost control. Each layer snaps into place without requiring changes to the others.
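The "use Layer N when you need" rules above can be condensed into a small decision helper. This is only a sketch of the article's own criteria. The `Needs` fields and the `requiredLayers` function are hypothetical names, not part of any SDK.

```typescript
type Layer = "runtime" | "llm" | "proxy";

// Each flag corresponds to one of the "use this layer when you need" criteria.
interface Needs {
  persistence?: boolean;
  webSockets?: boolean;
  scheduling?: boolean;
  structuredOutput?: boolean;
  toolCalling?: boolean;
  providerSwitching?: boolean;
  costAttribution?: boolean;
  dailySpendLimits?: boolean;
  centralizedKeys?: boolean;
}

// Maps a set of requirements to the layers worth adopting.
function requiredLayers(needs: Needs): Layer[] {
  const layers: Layer[] = [];
  if (needs.persistence || needs.webSockets || needs.scheduling) layers.push("runtime");
  if (needs.structuredOutput || needs.toolCalling || needs.providerSwitching) layers.push("llm");
  if (needs.costAttribution || needs.dailySpendLimits || needs.centralizedKeys) layers.push("proxy");
  return layers;
}

// Day 1: type-safe LLM calls only.
console.log(requiredLayers({ structuredOutput: true })); // [ 'llm' ]
// Month 1: stateful agent with budgets.
console.log(requiredLayers({ persistence: true, toolCalling: true, dailySpendLimits: true }));
// [ 'runtime', 'llm', 'proxy' ]
```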
Production Checklist
Before deploying agents with all three layers:
Layer 1 (Container)
- Agent class extends Agent or AIChatAgent from the agents package
- initialState defined with sensible defaults
- SQLite tables created in onStart() with IF NOT EXISTS
- wrangler.jsonc has durable_objects bindings and migrations with new_sqlite_classes
- routeAgentRequest() configured in the Worker’s fetch handler
- No setInterval() or setTimeout() (use this.schedule() instead)
- State updates use this.setState() for real-time sync
- Long-term data stored in SQLite, not state
Layer 2 (Brain)
- AI SDK provider created with create*() factory
- All structured output uses generateObject() + Zod schema
- Tool parameters and return types defined with Zod
- maxSteps set deliberately (not unbounded)
- maxRetries set (2-3 for structured output, 1 for streaming)
- Thinking tokens extracted and priced separately
- NoObjectGeneratedError caught and handled
Layer 3 (Wallet)
- No API keys in any worker’s wrangler.jsonc (only in the proxy)
- All LLM calls route through the proxy via custom fetch
- Every proxied call includes X-Project-Id and X-Function
- Daily spend limit configured per project
- Workers handle 429 daily_limit_exceeded gracefully
- Cost formula prices thinking tokens separately
- Monthly reconciliation against provider billing dashboard
Advanced: The Cloudflare AI Gateway Bridge
For teams using Cloudflare’s AI Gateway, you can layer it between your custom proxy and the provider for a hybrid approach:
App → Agents SDK → AI SDK → Custom Proxy → CF AI Gateway → Provider
(state) (LLM) (cost) (cache/rate) (API)
import { createAiGateway } from "ai-gateway-provider";
import { createGoogleGenerativeAI } from "ai-gateway-provider/providers/google";
function createFullStackModel(env: Env, functionName: string) {
// Layer 3a: Cloudflare AI Gateway (caching, rate limiting, fallbacks)
const aigateway = createAiGateway({
binding: env.AI.gateway("production-gateway"),
options: { cacheTtl: 300 },
});
// Layer 3b: Google provider (routes through AI Gateway)
const google = createGoogleGenerativeAI({
apiKey: env.GEMINI_API_KEY,
});
// Route the model through AI Gateway.
// Note: this sketch wires only the gateway hop; the cost-proxy hop (Layer 3c)
// and the functionName attribution are not shown here.
const gatewayModel = aigateway(google("gemini-2.5-flash"));
// The AI Gateway handles caching and rate limiting
// Your proxy handles cost attribution and budget enforcement
// The AI SDK handles structured output and tool calling
// The Agents SDK handles state and lifecycle
return gatewayModel;
}
This hybrid gives you:
- AI Gateway: Response caching ($0 for repeated prompts), rate limiting, provider fallbacks, usage analytics
- Custom proxy: Function-level cost attribution, daily spend limits, centralized key management
- AI SDK: Structured output, tool calling, streaming, provider abstraction
- Agents SDK: Durable state, WebSocket, scheduling, hibernation
Cost Reference
Quick reference for making model routing decisions:
| Model | Input ($/M tokens) | Output ($/M tokens) | Thinking ($/M tokens) | Best For |
|---|---|---|---|---|
| Gemini 2.0 Flash | $0.10 | $0.40 | N/A | Classification, extraction, simple tasks |
| Gemini 2.5 Flash | $0.15 | $0.60 | $3.50 | General purpose, tool calling, content |
| Gemini 2.5 Pro | $1.25 | $10.00 | $10.00 | Complex reasoning, long context |
| Claude Sonnet 4 | $3.00 | $15.00 | N/A | Code generation, nuanced analysis |
| Claude Haiku 3.5 | $0.80 | $4.00 | N/A | Fast classification, extraction |
| GPT-4o | $2.50 | $10.00 | N/A | Multi-modal, general purpose |
| GPT-4o Mini | $0.15 | $0.60 | N/A | Budget-friendly general tasks |
| Workers AI (Llama 3) | Free daily allocation, then usage-based | Free daily allocation, then usage-based | N/A | Prototyping, non-critical tasks |
Key insight: The cost difference between Gemini 2.0 Flash ($0.10/M input) and Claude Sonnet 4 ($3.00/M input) is 30x. If you’re routing every task through the same expensive model, you’re burning money on classification tasks that a cheap model handles just as well. Multi-provider routing isn’t a nice-to-have — it’s a cost optimization multiplier.
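A routing table is one simple way to act on this. The sketch below maps task categories to the models the table above recommends for them, using the table's prices. The `TaskKind` categories, the `ROUTES` map, and the function names are hypothetical. The Google model IDs other than `gemini-2.5-flash` are assumed to follow the same naming scheme as the IDs used elsewhere in this article.

```typescript
type TaskKind = "classification" | "extraction" | "content" | "reasoning" | "code";

interface ModelRoute {
  provider: "google" | "anthropic";
  model: string;
  inputPerM: number; // USD per million input tokens, from the table above
  outputPerM: number; // USD per million output tokens
}

// One route per task category, following the "Best For" column.
const ROUTES: Record<TaskKind, ModelRoute> = {
  classification: { provider: "google", model: "gemini-2.0-flash", inputPerM: 0.1, outputPerM: 0.4 },
  extraction: { provider: "google", model: "gemini-2.0-flash", inputPerM: 0.1, outputPerM: 0.4 },
  content: { provider: "google", model: "gemini-2.5-flash", inputPerM: 0.15, outputPerM: 0.6 },
  reasoning: { provider: "google", model: "gemini-2.5-pro", inputPerM: 1.25, outputPerM: 10 },
  code: { provider: "anthropic", model: "claude-sonnet-4-20250514", inputPerM: 3, outputPerM: 15 },
};

function pickModel(task: TaskKind): ModelRoute {
  return ROUTES[task];
}

// How many times more the input tokens cost on one route than on another.
function inputCostMultiple(expensive: TaskKind, cheap: TaskKind): number {
  return ROUTES[expensive].inputPerM / ROUTES[cheap].inputPerM;
}

console.log(pickModel("classification").model); // gemini-2.0-flash
console.log(Math.round(inputCostMultiple("code", "classification"))); // 30
```

Because the route is plain data, moving a task category to a cheaper model is a one-line change that leaves the agent code untouched.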
Summary
The three-layer architecture isn’t about adding complexity — it’s about preventing the complexity that inevitably emerges when state, LLM interaction, and cost control are tangled together.
- Layer 1 (Container): The Cloudflare Agents SDK gives each agent a Durable Object with SQLite, WebSocket, and scheduling. Agents hibernate to $0 when idle. Millions can run concurrently. State survives everything.
- Layer 2 (Brain): The Vercel AI SDK provides generateObject() with Zod schemas, tool() with multi-step loops, streamText() for real-time UI, and provider abstraction to swap Google/Anthropic/OpenAI without changing your agent code.
- Layer 3 (Wallet): A centralized API proxy tracks every LLM call with function-level cost attribution, enforces daily spend limits, manages API keys in one place, and prevents the $47 surprise bills that happen when services hold their own keys.
The layers compose through two narrow interfaces:
- AIChatAgent.onChatMessage() calls AI SDK functions (Container → Brain)
- AI SDK’s fetch option routes through the proxy (Brain → Wallet)
Start with Layer 2 on day one. Add Layer 1 when you need state. Add Layer 3 before your first production deployment. Each layer is independently useful and independently replaceable. That’s the whole point.
References
Cloudflare Agents SDK
- Cloudflare Agents Documentation — Official docs covering getting started, API reference, and guides
- Cloudflare Agents SDK on GitHub — Source code and examples for the agents package
- agents on npm — The main SDK package (replaces deprecated agents-sdk)
- Agents API Reference — Agent class methods, state management, scheduling, WebSocket
- Chat Agents API Reference — AIChatAgent, message persistence, resumable streaming
- Using AI Models with Agents — streamText, generateText, provider configuration
- Agents Starter Kit — Full working example with tools and streaming chat
- Building agents with OpenAI and Cloudflare’s Agents SDK — Blog post showing OpenAI integration patterns
Cloudflare Durable Objects
- Durable Objects Documentation — Overview, concepts, and best practices
- Durable Objects Pricing — Per-request, duration, hibernation, WebSocket billing
- Durable Objects Alarms — At-least-once scheduling with exponential backoff
- Durable Object Lifecycle — Hibernation conditions, idle behavior
Vercel AI SDK
- AI SDK Documentation — Official docs for the ai npm package
- AI SDK on GitHub — Source code for the AI Toolkit for TypeScript
- Generating Structured Data — generateObject with Zod schemas
- generateObject API Reference — Full options and return types
- Intercepting Fetch Requests — Custom fetch for proxy routing (the Layer 2→3 bridge)
- Google Generative AI Provider — createGoogleGenerativeAI options including custom fetch
Cloudflare AI Gateway
- AI Gateway Overview — Caching, rate limiting, analytics for AI API calls
- Vercel AI SDK Integration — Using AI Gateway with the Vercel AI SDK
- ai-gateway-provider on GitHub — Community provider for AI SDK + AI Gateway
- ai-gateway-provider (Cloudflare AI Gateway Provider) — AI SDK community provider documentation
- @cloudflare/ai-gateway on npm — Official SDK for AI Gateway
LLM Proxy and Cost Tracking
- LiteLLM — Open-source LLM gateway with 100+ provider support
- Helicone — Open-source LLM observability platform with Rust-based performance
- Portkey — AI gateway with full traces, budget caps, and governance
- Top 5 LLM Gateways Comparison (Helicone) — Detailed comparison of gateway solutions
Alternative Agent Frameworks
- OpenAI Agents SDK — Lightweight agent framework with handoffs and guardrails
- LangChain — Comprehensive framework for LLM applications and agents
- LangGraph — Stateful, multi-actor agent orchestration
- CrewAI — Role-based multi-agent collaboration framework
- AWS Bedrock Agents — Managed agent service with multi-layer billing
- A Practical Guide to Building Agents (OpenAI) — OpenAI’s production agent guide
Framework Comparisons
- LangChain vs Vercel AI SDK vs OpenAI SDK: 2026 Guide (Strapi) — Comprehensive comparison of agent frameworks
- AI Framework Comparison (Komelin) — Vercel AI SDK, Mastra, LangChain, and Genkit
- CrewAI vs LangGraph vs AutoGen vs OpenAgents — Multi-agent framework comparison
- Building for Agentic AI — Agent SDKs & Design Patterns (GovTech) — Practical patterns for agent SDKs
Cloudflare Workers Platform
- Workers Pricing — CPU time billing, free tier, paid plan details
- Cloudflare Queues — At-least-once message delivery for event-driven architectures
- Workers AI — Built-in AI models, no API key needed