The Three-Layer AI Agent Architecture

Most production AI agents are a tangled mess — LLM calls mixed with state management, API keys scattered across services, cost tracking bolted on as an afterthought (if at all). The result: $47 in surprise Gemini bills, 9x cost undercounting, and zero visibility into what your agents are actually spending.

This article presents a clean separation: three composable layers that handle agent runtime, LLM interaction, and API cost control independently. Built on Cloudflare Workers, the Vercel AI SDK, and a centralized API proxy, this architecture lets you build agents that are stateful, intelligent, and financially observable — without coupling any of those concerns together.

What you’ll learn:

How to separate agent state (Container) from LLM logic (Brain) from API metering (Wallet)
The Cloudflare Agents SDK as a stateful runtime for millions of concurrent agents
The Vercel AI SDK as a provider-agnostic LLM interface with structured output and tool calling
The custom fetch pattern that routes all LLM calls through a cost-tracking proxy
How a $47 surprise bill in 5 days exposed why the Wallet layer isn’t optional
When you need all three layers vs. when you can skip one
How this compares to LangChain, OpenAI Agents SDK, AWS Bedrock Agents, and LLM proxy solutions

The Problem
- What changes if you get this right
Architecture Overview
- The TypeScript Shape
Layer 1: Agent Runtime (Container)
- Core Concepts
- Wrangler Configuration
Layer 2: LLM Interface (Brain)
Layer 3: API Proxy / Metering (Wallet)
The Composition Pattern
Patterns
Small Examples
Comparisons
Anti-Patterns
When to Skip a Layer
Production Checklist
Advanced: The Cloudflare AI Gateway Bridge
Cost Reference
Summary
References

The Problem

Here is the typical evolution of an AI agent project:

Week 1: You write a Worker that calls the Gemini API with fetch(). It works. You parse the JSON response manually. You hardcode the API key as an environment variable. Cost tracking? “We’ll add that later.”

Week 3: You need structured output, so you write a Zod schema and validate the response yourself. Sometimes the LLM returns malformed JSON. You add retry logic. The retry logic has bugs. You’re now maintaining 200 lines of LLM interaction code that has nothing to do with your agent’s actual purpose.

Week 5: You need the agent to remember things between requests. You bolt on KV storage. Then you need WebSocket connections for real-time updates. Then scheduling. Then you realize KV doesn’t support SQL queries. You start wishing for a database. Your “simple agent” is now 1,500 lines of infrastructure code.

Week 8: Your Google Cloud bill arrives. $47 in 5 days. Your internal tracking shows $5. The 9x discrepancy? You forgot that Gemini’s thinking tokens cost $3.50/M — not the $0.60/M you used for regular output tokens. Six services hold their own API keys. Nobody knows which service spent what. You disable all cron jobs as an emergency measure.

This isn’t hypothetical. This is what happened across two production services in a real Cloudflare Workers deployment. The fix wasn’t “better monitoring” — it was architectural. Each concern (state, LLM interaction, cost control) needed its own layer with clear boundaries.

What changes if you get this right

Concern	Tangled	Three-Layer
Agent state	KV hacks, lost on redeploy	Durable Object with SQLite, survives everything
LLM calls	Raw `fetch()`, manual JSON parsing	`generateObject()` with Zod, automatic retries
Provider switching	Rewrite all call sites	Change one import
Cost tracking	“Check the Google dashboard”	Per-call attribution with function-level tags
Cost control	Hope for the best	Daily spend limits, tier enforcement, automatic 429s
Concurrent agents	One singleton, maybe	Millions of independent instances, zero cost when idle
Real-time updates	Polling	WebSocket with automatic state sync

Architecture Overview

The three-layer architecture separates concerns along natural boundaries:

┌─────────────────────────────────────────────────────────────┐
│                     Your Application                        │
│  (Business logic, domain rules, what the agent actually does)│
└──────────────┬──────────────────────────────────────────────┘
               │
┌──────────────▼──────────────────────────────────────────────┐
│  Layer 1: Agent Runtime (Container)                         │
│  Cloudflare Agents SDK  ·  npm: agents                      │
│                                                             │
│  Durable Object per agent instance                          │
│  Built-in SQLite  ·  State sync  ·  WebSocket               │
│  Scheduling  ·  Hibernation  ·  Lifecycle hooks              │
│  Cost when idle: $0                                          │
└──────────────┬──────────────────────────────────────────────┘
               │
┌──────────────▼──────────────────────────────────────────────┐
│  Layer 2: LLM Interface (Brain)                             │
│  Vercel AI SDK  ·  npm: ai + @ai-sdk/google                 │
│                                                             │
│  streamText / generateText / generateObject                  │
│  Zod-validated structured output                             │
│  Tool calling with multi-step agent loops                    │
│  Provider abstraction (Google, Anthropic, OpenAI)            │
│  Thinking token tracking                                     │
└──────────────┬──────────────────────────────────────────────┘
               │  custom fetch()
┌──────────────▼──────────────────────────────────────────────┐
│  Layer 3: API Proxy / Metering (Wallet)                     │
│  Centralized HTTP proxy  ·  e.g., API Mom                   │
│                                                             │
│  Cost tracking with per-call attribution                     │
│  API key management (single source of truth)                 │
│  Daily spend limits  ·  Tier enforcement                     │
│  Caching  ·  Rate limiting  ·  Budget caps                   │
└──────────────┬──────────────────────────────────────────────┘
               │
         ┌─────▼─────┐
         │  Provider  │  (Gemini, Claude, GPT, etc.)
         └───────────┘

Key insight: Each layer is independently useful and independently replaceable. You can use the Agents SDK without the AI SDK (non-LLM agents). You can use the AI SDK without the Agents SDK (stateless LLM calls). You can use the proxy layer with any HTTP client. But when composed together, they create a production-grade agent platform where state, intelligence, and cost control are all first-class concerns.

The TypeScript Shape

// The three layers, typed

// Layer 1: Agent Runtime — what the container looks like
interface AgentRuntime {
  // State
  state: AgentState;
  setState(newState: AgentState): void;
  sql: SqlStorage;  // Built-in SQLite

  // Communication
  broadcast(message: string): void;
  onConnect(connection: Connection): void;
  onMessage(connection: Connection, message: string): void;

  // Scheduling
  schedule(cron: string, callback: string): void;
  scheduleEvery(intervalMs: number): void;

  // Lifecycle
  onStart(): void;
  onStop(): void;
}

// Layer 2: LLM Interface — what the brain looks like
interface LLMInterface {
  // Text generation
  generateText(options: GenerateTextOptions): Promise<GenerateTextResult>;
  streamText(options: StreamTextOptions): StreamTextResult;

  // Structured output
  generateObject<T>(options: {
    model: LanguageModel;
    schema: ZodType<T>;
    prompt: string;
  }): Promise<{ object: T; usage: TokenUsage }>;

  // Tool calling
  tool(options: {
    description: string;
    parameters: ZodType;
    execute: (args: unknown) => Promise<string>;
  }): Tool;
}

// Layer 3: API Proxy — what the wallet looks like
interface APIProxy {
  // Proxied fetch
  fetch(url: string, options: RequestInit): Promise<Response>;

  // Cost tracking
  recordCall(params: {
    project: string;
    function: string;
    service: string;
    inputTokens: number;
    outputTokens: number;
    thinkingTokens: number;
    costUsd: number;
  }): Promise<void>;

  // Budget enforcement
  checkDailyLimit(project: string): Promise<{ allowed: boolean; spent: number; limit: number }>;
  checkTierPermission(project: string, estimatedCost: number): Promise<boolean>;
}

Layer 1: Agent Runtime (Container)

The agent runtime answers: where does the agent live, how does it persist, and how do clients talk to it?

The Cloudflare Agents SDK (npm install agents) gives you a TypeScript class that runs on a Durable Object — a stateful micro-server with its own SQLite database, WebSocket connections, and scheduling system. Each agent instance is isolated, horizontally scalable, and costs nothing when idle.

Core Concepts

The Agent Class

Every agent extends the Agent class, parameterized by your environment bindings and state shape:

import { Agent, type Connection } from "agents";

interface Env {
  AI: Ai;
  RESEARCH_QUEUE: Queue;
  DB: D1Database;
}

interface BrandAgentState {
  brandSlug: string;
  status: "idle" | "researching" | "generating" | "error";
  lastResearchCycle: string | null;
  pendingTasks: number;
  totalContentGenerated: number;
  costToday: number;
  costBudget: number;
}

export class BrandAgent extends Agent<Env, BrandAgentState> {
  initialState: BrandAgentState = {
    brandSlug: "",
    status: "idle",
    lastResearchCycle: null,
    pendingTasks: 0,
    totalContentGenerated: 0,
    costToday: 0,
    costBudget: 5.0,
  };

  async onStart() {
    // Runs when the agent starts or resumes from hibernation
    this.sql`CREATE TABLE IF NOT EXISTS audit_log (
      id INTEGER PRIMARY KEY AUTOINCREMENT,
      action TEXT NOT NULL,
      details TEXT,
      cost_usd REAL DEFAULT 0,
      created_at TEXT DEFAULT (datetime('now'))
    )`;
  }

  async onMessage(connection: Connection, message: string) {
    const { type, payload } = JSON.parse(message);

    switch (type) {
      case "audit":
        await this.runAudit(payload.brandSlug);
        break;
      case "status":
        connection.send(JSON.stringify({ type: "status", state: this.state }));
        break;
    }
  }

  private async runAudit(brandSlug: string) {
    this.setState({
      ...this.state,
      status: "researching",
      brandSlug,
    });

    // Agent logic goes here — the runtime handles everything else
    // State persists across restarts, deploys, and hibernation
    // WebSocket clients get state updates automatically
    // SQLite is always available for structured data

    this.setState({ ...this.state, status: "idle" });
  }
}

Routing

The Agents SDK routes HTTP and WebSocket requests to agent instances using a URL pattern:

import { routeAgentRequest } from "agents";

export default {
  async fetch(request: Request, env: Env) {
    // Routes to /agents/:agent/:name
    // e.g., /agents/brand-agent/niche-fi → BrandAgent instance "niche-fi"
    return routeAgentRequest(request, env, { cors: true });
  },
};

Each unique instance name gets its own Durable Object. The instance niche-fi is completely isolated from llc-tax. They have separate SQLite databases, separate state, separate WebSocket connections. You can have millions of them.

Built-in SQLite

Every agent instance has embedded SQLite accessed via this.sql. This is not D1 — it’s SQLite running directly inside the Durable Object with zero network latency:

// Create tables on startup
this.sql`CREATE TABLE IF NOT EXISTS memories (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  type TEXT NOT NULL,
  content TEXT NOT NULL,
  embedding BLOB,
  created_at TEXT DEFAULT (datetime('now'))
)`;

// Query with tagged templates
const recent = [
  ...this.sql`SELECT * FROM memories
              WHERE type = ${"observation"}
              ORDER BY created_at DESC
              LIMIT 10`
];

// Insert
this.sql`INSERT INTO memories (type, content)
         VALUES (${"decision"}, ${JSON.stringify(decision)})`;

Key insight: The agent’s SQLite database is its long-term memory. State (this.state) is for real-time data that syncs to connected clients. SQLite is for everything else — conversation history, audit logs, embeddings, task queues. The distinction matters: state changes trigger WebSocket broadcasts, SQL writes don’t.

Scheduling and Alarms

Agents can schedule their own work. The scheduling system uses Durable Object alarms under the hood — guaranteed at-least-once execution with automatic retries:

export class MonitoringAgent extends Agent<Env, MonitorState> {
  async onStart() {
    // Run every 4 hours
    this.schedule("0 */4 * * *", "runHealthCheck");
  }

  async runHealthCheck() {
    this.setState({ ...this.state, status: "checking" });

    const sites = [...this.sql`SELECT * FROM monitored_sites WHERE active = 1`];

    for (const site of sites) {
      // Time each request so latency_ms reflects this site only
      const start = Date.now();
      const response = await fetch(site.url);
      this.sql`INSERT INTO health_checks (site_id, status, latency_ms)
               VALUES (${site.id}, ${response.status}, ${Date.now() - start})`;
    }

    this.setState({
      ...this.state,
      status: "idle",
      lastCheck: new Date().toISOString(),
    });
  }
}

Hibernation and Cost

This is the killer feature for running agents at scale. When a Durable Object has no request in flight and no timer about to fire, it hibernates: it is evicted from memory while any idle WebSocket connections stay attached. Hibernated agents cost $0. They wake up instantly on the next request or message.

Pricing (Workers Paid plan):

Requests: $0.15 per million (first million free)
Duration: $12.50 per million GB-seconds (only while actively executing)
WebSocket messages: 20:1 ratio (100 incoming messages = 5 billed requests)
Hibernated: $0

This means you can run 100,000 brand agents. If only 50 are active at any given time, you pay for 50. The other 99,950 cost nothing. They wake up in milliseconds when needed.
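To make the pricing arithmetic concrete, here is a rough estimator built only from the four figures listed above. The function name, the `MonthlyUsage` shape, and the sample workload are illustrative assumptions, and the sketch ignores any included duration allowance, so treat the result as an upper-bound estimate rather than a bill.

```typescript
// Figures from the pricing list above (Workers Paid plan)
const REQUEST_PRICE_PER_MILLION = 0.15;
const FREE_REQUESTS = 1_000_000;
const DURATION_PRICE_PER_MILLION_GB_S = 12.5;
const WS_MESSAGES_PER_BILLED_REQUEST = 20; // the 20:1 ratio

// Hypothetical usage shape for one month across the whole fleet
interface MonthlyUsage {
  httpRequests: number;
  incomingWsMessages: number;
  activeGbSeconds: number; // only time spent actively executing
}

function estimateMonthlyCost(u: MonthlyUsage): number {
  // WebSocket messages are billed as requests at 20:1
  const billedRequests =
    u.httpRequests + u.incomingWsMessages / WS_MESSAGES_PER_BILLED_REQUEST;
  const paidRequests = Math.max(0, billedRequests - FREE_REQUESTS);
  const requestCost = (paidRequests / 1_000_000) * REQUEST_PRICE_PER_MILLION;
  const durationCost =
    (u.activeGbSeconds / 1_000_000) * DURATION_PRICE_PER_MILLION_GB_S;
  return requestCost + durationCost;
}

// A fully hibernated fleet: no requests, no active time
const idleFleetCost = estimateMonthlyCost({
  httpRequests: 0,
  incomingWsMessages: 0,
  activeGbSeconds: 0,
});

// A made-up busy month: 3M HTTP requests, 2M WebSocket messages, 1M GB-seconds
const busyFleetCost = estimateMonthlyCost({
  httpRequests: 3_000_000,
  incomingWsMessages: 2_000_000,
  activeGbSeconds: 1_000_000,
});

console.log(`idle: $${idleFleetCost.toFixed(2)}, busy: $${busyFleetCost.toFixed(3)}`);
```

The cost depends only on traffic and active execution time, not on how many agent instances exist, which is why the idle 99,950 agents add nothing.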

The AIChatAgent Subclass

For conversational agents, the SDK provides AIChatAgent with built-in message persistence and resumable streaming:

import { AIChatAgent } from "agents/ai-chat-agent";
import { streamText, tool, type StreamTextOnFinishCallback, type ToolSet } from "ai";
import { createGoogleGenerativeAI } from "@ai-sdk/google";
import { z } from "zod";

export class ChatAgent extends AIChatAgent<Env, ChatState> {
  async onChatMessage(
    onFinish?: StreamTextOnFinishCallback<ToolSet>,
    options?: { abortSignal?: AbortSignal; body?: unknown }
  ) {
    const google = createGoogleGenerativeAI({ apiKey: this.env.GEMINI_API_KEY });

    const result = streamText({
      model: google("gemini-2.5-flash"),
      messages: this.messages,  // Auto-loaded from SQLite
      tools: {
        searchKnowledge: tool({
          description: "Search the agent's knowledge base",
          parameters: z.object({ query: z.string() }),
          execute: async ({ query }) => {
            const results = [...this.sql`
              SELECT content FROM memories
              WHERE content LIKE ${'%' + query + '%'}
              LIMIT 5
            `];
            return JSON.stringify(results);
          },
        }),
      },
      maxSteps: 5,
      onFinish,
      abortSignal: options?.abortSignal,
    });

    return result.toUIMessageStreamResponse();
  }
}

AIChatAgent handles:

Message persistence — conversations saved to SQLite automatically
Resumable streaming — if a client disconnects mid-stream, it reconnects and picks up where it left off
Chunk buffering — stores stream chunks so late-joining clients get the full response
Automatic cleanup — destroy() cancels pending requests

Key insight: AIChatAgent is where Layer 1 (Container) and Layer 2 (Brain) naturally compose. The onChatMessage method is the seam. The agent runtime manages the conversation lifecycle. The AI SDK handles the LLM call. Neither knows about the other’s internals.

Wrangler Configuration

{
  "name": "brand-agents",
  "main": "src/index.ts",
  "compatibility_date": "2025-03-01",
  "durable_objects": {
    "bindings": [
      {
        "name": "BRAND_AGENT",
        "class_name": "BrandAgent"
      },
      {
        "name": "CHAT_AGENT",
        "class_name": "ChatAgent"
      }
    ]
  },
  "migrations": [
    {
      "tag": "v1",
      "new_sqlite_classes": ["BrandAgent", "ChatAgent"]
    }
  ]
}

Layer 2: LLM Interface (Brain)

The LLM interface answers: how do you talk to language models, validate their output, and let them call tools?

The Vercel AI SDK (npm install ai) is a TypeScript toolkit that abstracts LLM providers behind a common interface. It handles structured output with Zod schemas, multi-step tool calling, streaming, retries, and provider switching — all the things you’d otherwise build (badly) yourself.

Why Not Raw Fetch?

Here’s what calling Gemini looks like without the AI SDK:

// The "just use fetch" approach — DON'T DO THIS

async function generateContent(prompt: string, schema: object): Promise<unknown> {
  const response = await fetch(
    `https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent`,
    {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "x-goog-api-key": env.GEMINI_API_KEY,
      },
      body: JSON.stringify({
        contents: [{ parts: [{ text: prompt }] }],
        generationConfig: {
          responseMimeType: "application/json",
          responseSchema: schema, // Not Zod — raw JSON Schema
        },
      }),
    }
  );

  if (!response.ok) {
    // What kind of error? Rate limit? Auth? Bad request? Model overloaded?
    // You have to parse the error body to find out
    const error = await response.json();
    throw new Error(`Gemini error: ${error.error?.message}`);
  }

  const data = await response.json();
  const text = data.candidates?.[0]?.content?.parts?.[0]?.text;

  if (!text) {
    throw new Error("No content in response");
  }

  // Parse JSON — but what if it's malformed?
  let parsed;
  try {
    parsed = JSON.parse(text);
  } catch {
    // Retry? With what backoff? How many times?
    throw new Error(`Invalid JSON from LLM: ${text.slice(0, 200)}`);
  }

  // Validate against schema — but you're using JSON Schema, not Zod
  // No type inference, no runtime validation, no error messages
  // You just... hope it's right

  // Token counting? Thinking tokens?
  const usage = data.usageMetadata;
  // promptTokenCount, candidatesTokenCount, totalTokenCount
  // But where are thoughtsTokenCount? Different field name in different API versions
  // Also: are thinking tokens included in candidatesTokenCount or separate?

  return parsed;
}

That’s 50 lines for a single call, no retries, no streaming, no tool calling, no type safety. Now multiply by every LLM call in your system.

Core Concepts

Provider Abstraction

The AI SDK defines a common LanguageModel interface. You create a model instance from any provider and pass it to the same functions:

import { generateText } from "ai";
import { createGoogleGenerativeAI } from "@ai-sdk/google";
import { createAnthropic } from "@ai-sdk/anthropic";
import { createOpenAI } from "@ai-sdk/openai";

// Same function, different providers
const google = createGoogleGenerativeAI({ apiKey: env.GEMINI_API_KEY });
const anthropic = createAnthropic({ apiKey: env.ANTHROPIC_API_KEY });
const openai = createOpenAI({ apiKey: env.OPENAI_API_KEY });

// Swap providers by changing one line
const { text } = await generateText({
  model: google("gemini-2.5-flash"),
  // model: anthropic("claude-sonnet-4-20250514"),
  // model: openai("gpt-4o"),
  prompt: "Analyze this brand's SEO performance...",
});

Key insight: Provider abstraction isn’t just about convenience — it’s about cost optimization. When Gemini Flash costs $0.15/M input tokens and Claude Sonnet costs $3/M, being able to route different tasks to different providers based on complexity is the difference between a $50/month and $500/month agent system.
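A back-of-the-envelope sketch of that routing argument follows. The two input prices come from the paragraph above; the complexity labels, the 100M-token monthly workload, and the 90/10 split are hypothetical, and output tokens are left out to keep the arithmetic short.

```typescript
type Complexity = "simple" | "complex";

// Model per task class (the routing rule itself is an assumption)
const MODEL_FOR: Record<Complexity, string> = {
  simple: "gemini-2.5-flash",
  complex: "claude-sonnet-4-20250514",
};

// USD per million input tokens, as quoted above
const INPUT_PRICE_PER_MILLION: Record<Complexity, number> = {
  simple: 0.15,
  complex: 3.0,
};

// Monthly input-token cost for a workload split by task complexity
function monthlyInputCost(tokens: Record<Complexity, number>): number {
  return (
    (tokens.simple / 1_000_000) * INPUT_PRICE_PER_MILLION.simple +
    (tokens.complex / 1_000_000) * INPUT_PRICE_PER_MILLION.complex
  );
}

// Everything sent to the expensive model
const allComplexCost = monthlyInputCost({ simple: 0, complex: 100_000_000 });

// 90% of tokens routed to the cheap model
const routedCost = monthlyInputCost({ simple: 90_000_000, complex: 10_000_000 });

console.log(`simple tasks use ${MODEL_FOR.simple}, complex tasks use ${MODEL_FOR.complex}`);
console.log(`unrouted: $${allComplexCost.toFixed(2)}, routed: $${routedCost.toFixed(2)}`);
```

Under these assumptions the routed workload costs about a seventh of the unrouted one, which is the order-of-magnitude gap the paragraph describes.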

Structured Output with Zod

This is the AI SDK’s strongest feature. Define a Zod schema, get a validated, typed object back:

import { generateObject } from "ai";
import { z } from "zod";

const BrandAuditSchema = z.object({
  overallScore: z.number().min(0).max(100).describe("Overall brand health 0-100"),
  categories: z.array(z.object({
    name: z.enum(["seo", "content", "design", "performance"]),
    score: z.number().min(0).max(100),
    issues: z.array(z.object({
      severity: z.enum(["critical", "warning", "info"]),
      description: z.string(),
      recommendation: z.string(),
    })),
  })),
  summary: z.string().describe("2-3 sentence summary of findings"),
  topPriority: z.string().describe("Single most important thing to fix"),
});

type BrandAudit = z.infer<typeof BrandAuditSchema>;

async function auditBrand(siteData: string): Promise<BrandAudit> {
  const { object, usage } = await generateObject({
    model: google("gemini-2.5-flash"),
    schema: BrandAuditSchema,
    prompt: `Audit this brand's website and provide a structured assessment:\n\n${siteData}`,
    temperature: 0.3,
    maxRetries: 3,  // Retries failed API calls (rate limits, transient errors)
  });

  // object is fully typed as BrandAudit
  // No JSON.parse, no manual validation, no hope-based programming
  return object;
}

The AI SDK handles:

Converting your Zod schema to the provider’s native format (JSON Schema for Gemini, tool_use for Claude)
Parsing the LLM response
Validating against the schema
Retrying failed API calls such as rate limits and transient provider errors (up to maxRetries)
Throwing NoObjectGeneratedError with the raw text if the response cannot be parsed or fails validation (so you can debug)
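The parse-then-validate steps in that list can be pictured with a dependency-free sketch. This is not the AI SDK's implementation: `ObjectGenerationError`, `parseStructured`, and the hand-written type guard are hypothetical stand-ins for `NoObjectGeneratedError` and a Zod schema, kept small so the control flow is visible.

```typescript
// Stand-in for the SDK's error type: carries the raw model text for debugging
class ObjectGenerationError extends Error {
  rawText: string;
  constructor(message: string, rawText: string) {
    super(message);
    this.name = "ObjectGenerationError";
    this.rawText = rawText;
  }
}

interface AuditSummary {
  overallScore: number;
  summary: string;
}

// Hand-written guard playing the role of a schema
function isAuditSummary(value: unknown): value is AuditSummary {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.overallScore === "number" &&
    v.overallScore >= 0 &&
    v.overallScore <= 100 &&
    typeof v.summary === "string"
  );
}

// Parse the model's text, then validate its shape; fail loudly with the raw text
function parseStructured<T>(rawText: string, guard: (v: unknown) => v is T): T {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawText);
  } catch {
    throw new ObjectGenerationError("Response is not valid JSON", rawText);
  }
  if (!guard(parsed)) {
    throw new ObjectGenerationError("Response does not match the schema", rawText);
  }
  return parsed;
}

const goodAudit = parseStructured(
  '{"overallScore": 72, "summary": "Solid SEO, weak content."}',
  isAuditSummary,
);
console.log(goodAudit.overallScore);
```

The point of the error carrying `rawText` is the same as in the real SDK: when generation fails, you can log exactly what the model produced instead of a bare parse error.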

Tool Calling and Agent Loops

The AI SDK supports multi-step tool calling where the LLM can invoke tools, observe results, and decide what to do next:

import { generateText, tool } from "ai";
import { z } from "zod";

const result = await generateText({
  model: google("gemini-2.5-flash"),
  system: "You are a brand research agent. Use the available tools to gather data, then synthesize your findings.",
  prompt: "Research the competitive landscape for 'bank statement to Excel converter' tools.",
  tools: {
    searchWeb: tool({
      description: "Search the web for information",
      parameters: z.object({
        query: z.string().describe("Search query"),
      }),
      execute: async ({ query }) => {
        const results = await fetch(`${env.API_MOM_URL}/v1/brave/search?q=${encodeURIComponent(query)}`, {
          headers: { "X-Project-Id": "brand-agent", "X-Api-Key": env.API_MOM_KEY },
        });
        return await results.text();
      },
    }),
    analyzeSerp: tool({
      description: "Analyze a SERP to identify competitors and their positioning",
      parameters: z.object({
        serpData: z.string().describe("Raw SERP results to analyze"),
      }),
      execute: async ({ serpData }) => {
        // Could be another LLM call, or custom logic
        return `Analysis: Found 8 competitors. Top 3: ...`;
      },
    }),
    saveFinding: tool({
      description: "Save a research finding to the knowledge base",
      parameters: z.object({
        category: z.string(),
        finding: z.string(),
        confidence: z.number().min(0).max(1),
      }),
      execute: async ({ category, finding, confidence }) => {
        // Save to agent's SQLite (via closure over the agent instance)
        agent.sql`INSERT INTO findings (category, finding, confidence)
                  VALUES (${category}, ${finding}, ${confidence})`;
        return "Saved.";
      },
    }),
  },
  maxSteps: 10,  // LLM can call tools up to 10 times before final response
  temperature: 0.3,
});

console.log(result.text);           // Final synthesized response
console.log(result.steps.length);   // How many tool-call rounds
console.log(result.usage);          // Total tokens across all steps

The maxSteps parameter controls how many rounds the LLM can make. Each round: the LLM produces text and/or tool calls → tools execute → results go back to the LLM → repeat until the LLM produces text without tool calls (or hits the step limit).
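The loop described above can be sketched without any SDK. Everything here is a simplified model for illustration: the `Model` and `ToolFn` types, `runAgentLoop`, and the scripted model are hypothetical, and the loop is synchronous whereas real tools and model calls are async.

```typescript
type ToolCall = { toolName: string; args: Record<string, string> };
type ModelTurn = { text: string; toolCalls: ToolCall[] };
type ToolFn = (args: Record<string, string>) => string;
type Model = (transcript: string[]) => ModelTurn;

function runAgentLoop(
  model: Model,
  tools: Record<string, ToolFn>,
  prompt: string,
  maxSteps: number,
): { text: string; steps: number; transcript: string[] } {
  const transcript = [`user: ${prompt}`];
  let steps = 0;
  let text = "";

  while (steps < maxSteps) {
    steps++;
    const turn = model(transcript);
    text = turn.text;

    // No tool calls means the model has produced its final answer
    if (turn.toolCalls.length === 0) break;

    // Execute each requested tool and feed the result back to the model
    for (const call of turn.toolCalls) {
      const toolFn = tools[call.toolName];
      const result = toolFn ? toolFn(call.args) : `unknown tool: ${call.toolName}`;
      transcript.push(`tool ${call.toolName}: ${result}`);
    }
  }
  return { text, steps, transcript };
}

// Scripted stand-in for an LLM: search once, then answer
const scriptedModel: Model = (transcript) =>
  transcript.some((line) => line.startsWith("tool "))
    ? { text: "Found 2 competitors.", toolCalls: [] }
    : { text: "", toolCalls: [{ toolName: "searchWeb", args: { query: "bank statement converter" } }] };

const scriptedTools: Record<string, ToolFn> = {
  searchWeb: (args) => `results for ${args.query}`,
};

const loopResult = runAgentLoop(scriptedModel, scriptedTools, "Research competitors", 10);
const cappedResult = runAgentLoop(scriptedModel, scriptedTools, "Research competitors", 1);
console.log(loopResult.steps, loopResult.text);
```

With a budget of 10 steps the scripted model finishes in two rounds. With a budget of 1 it runs the tool but never gets to synthesize an answer, which is the failure mode to watch for when the step limit is set too low.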

Streaming

For real-time UIs, streamText sends tokens as they’re generated:

import { streamText } from "ai";

const result = streamText({
  model: google("gemini-2.5-flash"),
  messages: conversationHistory,
  tools: { /* ... */ },
  maxSteps: 5,
});

// In a Cloudflare Worker, return as a streaming response
return result.toTextStreamResponse();

// Or, in an AIChatAgent, return a UI message stream instead:
// return result.toUIMessageStreamResponse();

Thinking Token Tracking

Models like Gemini 2.5 Flash and Claude use “thinking” tokens that cost significantly more than regular output tokens. The AI SDK exposes this metadata:

const result = await generateText({
  model: google("gemini-2.5-flash"),
  prompt: "Complex reasoning task...",
  providerOptions: {
    google: {
      thinkingConfig: { thinkingBudget: 2048 },
    },
  },
});

// Thinking tokens are in providerMetadata
const thinkingTokens =
  result.providerMetadata?.google?.usageMetadata?.thoughtsTokenCount ?? 0;

// Or in AI SDK v6+
const thinkingTokensV6 = result.usage?.reasoningTokens ?? 0;

console.log(`Input: ${result.usage.inputTokens}`);
console.log(`Output: ${result.usage.outputTokens}`);
console.log(`Thinking: ${thinkingTokens}`);  // These cost $3.50/M on Flash!

Key insight: Thinking tokens are the silent budget killer. On Gemini 2.5 Flash, thinking tokens cost $3.50/M — almost 6x the regular output token price of $0.60/M. If you price all output tokens at $0.60/M, your cost tracking will undercount by 3-9x. This is exactly what caused the $47 surprise bill. The AI SDK exposes thinking token counts, but you have to actually read them and price them correctly.
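A small worked example shows how quickly the gap opens up. The three prices are the Gemini 2.5 Flash figures quoted above; the `CallUsage` shape and the token counts for the sample call are made up for illustration.

```typescript
// USD per token, from the Gemini 2.5 Flash prices quoted above
const INPUT_PRICE = 0.15 / 1_000_000;
const OUTPUT_PRICE = 0.6 / 1_000_000;
const THINKING_PRICE = 3.5 / 1_000_000;

interface CallUsage {
  inputTokens: number;
  textOutputTokens: number; // visible answer tokens
  thinkingTokens: number;   // reasoning tokens, billed at the higher rate
}

// Naive formula: every generated token priced as regular output
function naiveCost(u: CallUsage): number {
  return (
    u.inputTokens * INPUT_PRICE +
    (u.textOutputTokens + u.thinkingTokens) * OUTPUT_PRICE
  );
}

// Corrected formula: thinking tokens priced separately
function actualCost(u: CallUsage): number {
  return (
    u.inputTokens * INPUT_PRICE +
    u.textOutputTokens * OUTPUT_PRICE +
    u.thinkingTokens * THINKING_PRICE
  );
}

// Hypothetical call: short answer, long reasoning trace
const sampleCall: CallUsage = {
  inputTokens: 1_000,
  textOutputTokens: 500,
  thinkingTokens: 4_000,
};

const naive = naiveCost(sampleCall);
const actual = actualCost(sampleCall);
const undercountFactor = actual / naive;

console.log(
  `tracked: $${naive.toFixed(5)}, billed: $${actual.toFixed(5)}, factor: ${undercountFactor.toFixed(1)}x`
);
```

For this sample call the tracked cost is $0.00285 against a billed cost of $0.01445, roughly a 5x undercount. The more of a call's output is reasoning, the closer the factor gets to the full price ratio.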

The LLM Harness Pattern

In production, you don’t call generateObject directly everywhere. You wrap it in a harness that handles cost calculation, logging, and ledger recording:

interface LLMConfig {
  apiKey: string;
  model?: string;
  temperature?: number;
  maxTokens?: number;
  maxRetries?: number;
  thinkingBudget?: number;
  db?: D1Database;          // For cost ledger recording
  contextType?: string;     // "pipeline" | "api" | "cron"
  contextId?: string;       // "brand-audit-niche-fi"
}

interface LLMResult<T> {
  data: T;
  usage: {
    inputTokens: number;
    outputTokens: number;
    thinkingTokens: number;
    totalTokens: number;
  };
  costUsd: number;
  finishReason: string;
  durationMs: number;
  warnings: string[];
}

const PRICING: Record<string, { input: number; output: number; thinking: number }> = {
  "gemini-2.5-flash": {
    input: 0.15 / 1_000_000,
    output: 0.6 / 1_000_000,
    thinking: 3.5 / 1_000_000,
  },
  "gemini-2.5-pro": {
    input: 1.25 / 1_000_000,
    output: 10.0 / 1_000_000,
    thinking: 10.0 / 1_000_000,
  },
  "gemini-2.0-flash": {
    input: 0.1 / 1_000_000,
    output: 0.4 / 1_000_000,
    thinking: 0,
  },
};

function calculateCost(
  model: string,
  inputTokens: number,
  outputTokens: number,
  thinkingTokens: number = 0,
): number {
  const pricing = PRICING[model] ?? PRICING["gemini-2.5-flash"];
  const textOutputTokens = Math.max(0, outputTokens - thinkingTokens);
  return (
    inputTokens * pricing.input +
    textOutputTokens * pricing.output +
    thinkingTokens * pricing.thinking
  );
}

function extractThinkingTokens(result: any): number {
  // Vercel AI SDK: providerMetadata.google.usageMetadata.thoughtsTokenCount
  const fromProvider = result.providerMetadata?.google?.usageMetadata?.thoughtsTokenCount;
  if (fromProvider != null) return fromProvider;

  // AI SDK v6: usage.reasoningTokens
  const fromUsage = result.usage?.reasoningTokens;
  if (fromUsage != null) return fromUsage;

  return 0;
}

async function llmObject<T>(
  config: LLMConfig,
  schema: z.ZodType<T>,
  prompt: string,
  system?: string,
  label = "llmObject",
): Promise<LLMResult<T>> {
  const model = createModel(config);  // helper that builds the provider model from config (defined elsewhere)
  const start = Date.now();

  const result = await generateObject({
    model,
    schema,
    prompt,
    system,
    temperature: config.temperature ?? 0.3,
    maxOutputTokens: config.maxTokens ?? 8192,
    maxRetries: config.maxRetries ?? 3,
  });

  const inputTokens = result.usage?.inputTokens ?? 0;
  const outputTokens = result.usage?.outputTokens ?? 0;
  const thinkingTokens = extractThinkingTokens(result);
  const costUsd = calculateCost(config.model ?? "gemini-2.5-flash", inputTokens, outputTokens, thinkingTokens);
  const durationMs = Date.now() - start;

  console.log(
    `[llm:${label}] OK in ${durationMs}ms | tokens: ${inputTokens}→${outputTokens} (${thinkingTokens} thinking) | cost: $${costUsd.toFixed(4)}`
  );

  return {
    data: result.object,
    usage: { inputTokens, outputTokens, thinkingTokens, totalTokens: inputTokens + outputTokens },
    costUsd,
    finishReason: result.finishReason,
    durationMs,
    warnings: [],
  };
}

This pattern creates a clean boundary. Your business logic calls llmObject(config, schema, prompt) and gets back typed data plus cost information. It never touches fetch, never parses JSON, never worries about retries.

Layer 3: API Proxy / Metering (Wallet)

The API proxy layer answers: who spent how much, on what, and should they be allowed to?

This is the layer most teams skip. They put API keys in environment variables, call providers directly, and discover the cost when the bill arrives. The proxy layer makes cost a first-class concern — tracked per call, attributed to a function, enforced with daily limits.

Why a Centralized Proxy?

Consider a typical multi-service architecture:

Without proxy:
  Service A  ──[GEMINI_API_KEY_A]──→  Google
  Service B  ──[GEMINI_API_KEY_B]──→  Google
  Service C  ──[GEMINI_API_KEY_C]──→  Google
  Cron Job   ──[GEMINI_API_KEY_D]──→  Google

  Cost visibility: Check Google Dashboard → $47 total → ??? per service

With proxy:
  Service A  ──[X-Project-Id: svc-a]──→  API Proxy  ──[GEMINI_API_KEY]──→  Google
  Service B  ──[X-Project-Id: svc-b]──→  API Proxy
  Service C  ──[X-Project-Id: svc-c]──→  API Proxy
  Cron Job   ──[X-Project-Id: cron ]──→  API Proxy

  Cost visibility: GET /v1/costs?period=day
  → svc-a: $12.40 (article-write: $8, design-audit: $4.40)
  → svc-b: $3.20 (keyword-research: $3.20)
  → cron:  $31.40 (batch-generate: $31.40) ← PROBLEM IDENTIFIED
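The per-project breakdown in that report is just a grouped sum over ledger rows. Here is a minimal sketch of the aggregation, assuming a `LedgerRow` shape and a `summarizeCosts` helper that are illustrative only. A real ledger would hold one row per call; the sample rows are collapsed to one per function and reuse the figures from the diagram.

```typescript
// Hypothetical ledger row: one recorded API call (or a pre-summed group)
interface LedgerRow {
  projectId: string;
  functionName: string;
  costUsd: number;
}

interface ProjectCosts {
  total: number;
  byFunction: Map<string, number>;
}

// Group spend by project, then by function within each project
function summarizeCosts(rows: LedgerRow[]): Map<string, ProjectCosts> {
  const report = new Map<string, ProjectCosts>();
  for (const row of rows) {
    let entry = report.get(row.projectId);
    if (!entry) {
      entry = { total: 0, byFunction: new Map<string, number>() };
      report.set(row.projectId, entry);
    }
    entry.total += row.costUsd;
    entry.byFunction.set(
      row.functionName,
      (entry.byFunction.get(row.functionName) ?? 0) + row.costUsd,
    );
  }
  return report;
}

const sampleRows: LedgerRow[] = [
  { projectId: "svc-a", functionName: "article-write", costUsd: 8.0 },
  { projectId: "svc-a", functionName: "design-audit", costUsd: 4.4 },
  { projectId: "svc-b", functionName: "keyword-research", costUsd: 3.2 },
  { projectId: "cron", functionName: "batch-generate", costUsd: 31.4 },
];

const costReport = summarizeCosts(sampleRows);

// Find the project responsible for the most spend
let topProject = "";
let topSpend = 0;
for (const [projectId, costs] of costReport) {
  if (costs.total > topSpend) {
    topSpend = costs.total;
    topProject = projectId;
  }
}
console.log(`top spender: ${topProject} ($${topSpend.toFixed(2)})`);
```

None of this is possible when each service calls the provider with its own key, because the provider's dashboard only ever sees the total.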

The $47 Cost Disaster (A Real Story)

Here’s what happened with zero proxy layer:

pages-plus held its own GEMINI_API_KEY. It ran 9 cron jobs generating blog posts. Each post made 3 Gemini calls (outline + draft + edit). In 5 days: 193 posts × 3 calls = 579 calls at ~$0.064/call ≈ $37.
aso-mrr also held its own GEMINI_API_KEY. Its internal tracking showed $1.14 spent. Google billed $10.19. The 9x discrepancy: thinking tokens. The cost formula priced all output tokens at $0.60/M. But Gemini 2.5 Flash’s thinking tokens cost $3.50/M — almost 6x more. A call that “cost $0.02” actually cost $0.12.
Combined: $47 in 5 days = $282/month run rate. Zero alerts. Zero dashboards. Discovered only by checking the Google Cloud billing console manually.

The fix: delete all API keys from individual services. Route everything through a centralized proxy. Track every call with function-level attribution. Enforce daily spend limits. Never enable a cron job until metering is verified.

Architecture of the Proxy Layer

// Simplified API proxy (e.g., "API Mom")
// In production, this is its own Cloudflare Worker with a D1 database

interface ProxyConfig {
  projects: Map<string, {
    name: string;
    dailyLimitUsd: number;
    tierPermission: 1 | 2 | 3;  // Max cost tier allowed
    apiKeys: Map<string, string>;  // service → key
  }>;
}

// Tier definitions
// Tier 1 (< $0.01/call): Flash models, cache reads, embeddings
// Tier 2 ($0.01 - $0.10/call): Thinking models, image gen
// Tier 3 (> $0.10/call): Pro models, long-context, multi-step

async function handleProxiedRequest(
  request: Request,
  config: ProxyConfig,
  db: D1Database,
): Promise<Response> {
  const projectId = request.headers.get("X-Project-Id");
  const functionName = request.headers.get("X-Function") ?? "unknown";
  const tags = request.headers.get("X-Tags");

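  // 1. Authenticate (a missing header is rejected the same way as an unknown project)
  const project = projectId ? config.projects.get(projectId) : undefined;
  if (!projectId || !project) return new Response("Unknown project", { status: 403 });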

  // 2. Check daily limit
  const todaySpend = await getDailySpend(projectId, db);
  if (todaySpend >= project.dailyLimitUsd) {
    return Response.json(
      { error: "daily_limit_exceeded", spend: todaySpend, limit: project.dailyLimitUsd },
      { status: 429 }
    );
  }

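  // 3. Proxy the request to the real provider
  const url = new URL(request.url);
  const targetUrl = mapToProviderUrl(url.pathname);
  const apiKey = project.apiKeys.get(getServiceFromPath(url.pathname));
  if (!apiKey) return new Response("No API key configured for service", { status: 502 });

  // Forward caller headers, minus the proxy's own credentials and routing headers,
  // so the proxy key is never leaked to the upstream provider
  const upstreamHeaders = new Headers(request.headers);
  for (const name of ["X-Api-Key", "X-Project-Id", "X-Function", "X-Tags", "Host"]) {
    upstreamHeaders.delete(name);
  }
  upstreamHeaders.set("x-goog-api-key", apiKey);  // Inject the real API key

  const start = Date.now();
  const response = await fetch(targetUrl, {
    method: request.method,
    headers: upstreamHeaders,
    body: request.body,
  });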

  // 4. Parse response for token usage
  const responseBody = await response.json();
  const usage = extractUsage(responseBody);
  const costUsd = calculateCost(usage);
  const durationMs = Date.now() - start;

  // 5. Record to ledger first. The provider has already billed this call,
  //    so it must be counted even if the tier check below rejects it.
  await db.prepare(`
    INSERT INTO api_calls_ledger
      (project_id, function, service, cost_usd,
       input_tokens, output_tokens, thinking_tokens,
       duration_ms, tags, status)
    VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, 'ok')
  `).bind(
    projectId, functionName, "gemini", costUsd,
    usage.inputTokens, usage.outputTokens, usage.thinkingTokens,
    durationMs, tags
  ).run();

  // 6. Check tier permission. This is a post-hoc check: the cost is only
  //    known after the provider responds, so it flags the overrun to the caller.
  if (costUsd > tierThreshold(project.tierPermission)) {
    return Response.json(
      { error: "cost_tier_exceeded", estimatedCost: costUsd, maxTier: project.tierPermission },
      { status: 403 }
    );
  }

  // 7. Return the response (re-create since we consumed the body)
  return Response.json(responseBody, {
    status: response.status,
    headers: {
      "X-Cost-Usd": costUsd.toFixed(6),
      "X-Daily-Spend": (todaySpend + costUsd).toFixed(4),
      "X-Daily-Limit": project.dailyLimitUsd.toString(),
    },
  });
}
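handleProxiedRequest leans on helpers that are not shown. As a rough sketch, extractUsage and tierThreshold might look like the following. The usageMetadata field names follow Gemini's generateContent response, and the tier ceilings come from the tier comments above. The Usage shape and the choice to leave Tier 3 uncapped are assumptions made for illustration.

```typescript
interface Usage {
  inputTokens: number;
  outputTokens: number;
  thinkingTokens: number;
}

// Pull token counts out of a Gemini generateContent response body.
// Missing fields default to 0 so an error body does not crash the proxy.
function extractUsage(body: unknown): Usage {
  const meta: Record<string, unknown> =
    (body as { usageMetadata?: Record<string, unknown> } | null)?.usageMetadata ?? {};
  const num = (v: unknown): number => (typeof v === "number" ? v : 0);
  return {
    inputTokens: num(meta.promptTokenCount),
    outputTokens: num(meta.candidatesTokenCount),
    thinkingTokens: num(meta.thoughtsTokenCount),
  };
}

// Maximum per-call cost (USD) each tier may incur, per the tier comments.
// Tier 3 has no ceiling in this sketch (an assumption).
const TIER_CEILING_USD: Record<1 | 2 | 3, number> = {
  1: 0.01,
  2: 0.1,
  3: Number.POSITIVE_INFINITY,
};

function tierThreshold(tier: 1 | 2 | 3): number {
  return TIER_CEILING_USD[tier];
}
```

Keeping thinkingTokens as its own field from the moment of extraction is what lets the cost formula price it separately later.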

Cost Attribution Levels

The proxy tracks costs at multiple levels of granularity:

// The ledger schema
interface ApiCallLedger {
  id: number;
  timestamp: string;
  project_id: string;        // "pages-plus", "scalable-media"
  function: string;           // "article-write", "brand-audit", "keyword-research"
  service: string;            // "gemini", "brave", "perplexity"
  endpoint: string;           // "gemini-2.5-flash", "gemini-2.5-pro"
  cost_usd: number;
  input_tokens: number;
  output_tokens: number;
  thinking_tokens: number;    // Separate column — never lump with output
  cache_read_tokens: number;
  duration_ms: number;
  tags: string;               // JSON: ["brand:niche-fi", "trigger:cron", "batch:2026-03-15"]
  status: "ok" | "error";
  error_message: string | null;
}

The function field is critical. Without it, you know “pages-plus spent $37” but not “the article-write function spent $32 and the design-audit function spent $5.” Function-level attribution is what turns cost data into actionable intelligence.

Querying Costs

// GET /v1/costs?period=day&project=pages-plus
const costs = await db.prepare(`
  SELECT
    function,
    service,
    COUNT(*) as calls,
    SUM(cost_usd) as total_cost,
    SUM(input_tokens) as total_input,
    SUM(output_tokens) as total_output,
    SUM(thinking_tokens) as total_thinking,
    AVG(duration_ms) as avg_duration
  FROM api_calls_ledger
  WHERE project_id = ?
    AND timestamp >= datetime('now', '-1 day')
  GROUP BY function, service
  ORDER BY total_cost DESC
`).bind("pages-plus").all();

// Result:
// function          | service | calls | total_cost | total_thinking
// article-write     | gemini  | 579   | $32.40     | 2,847,000
// design-audit      | gemini  | 38    | $4.60      | 412,000
// anchor-gen        | gemini  | 12    | $0.40      | 0

Daily Spend Limits

async function getDailySpend(projectId: string, db: D1Database): Promise<number> {
  const result = await db.prepare(`
    SELECT COALESCE(SUM(cost_usd), 0) as spend
    FROM api_calls_ledger
    WHERE project_id = ?
      AND timestamp >= datetime('now', 'start of day')
      AND status = 'ok'
  `).bind(projectId).first<{ spend: number }>();

  return result?.spend ?? 0;
}

// Client-side handling of 429
async function callLLMWithBudgetCheck(
  proxy: string,
  proxyKey: string,  // e.g. env.PROXY_KEY, passed in so the function has no hidden globals
  body: unknown,
): Promise<Response> {
  const response = await fetch(`${proxy}/v1/gemini/generateContent`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "X-Project-Id": "brand-agents",
      "X-Function": "research",
      "X-Api-Key": proxyKey,
    },
    body: JSON.stringify(body),
  });

  if (response.status === 429) {
    const { spend, limit } = (await response.json()) as { spend: number; limit: number };
    console.warn(`[budget] Daily limit hit: $${spend.toFixed(2)} / $${limit}`);
    // Queue for tomorrow, or use a cheaper model, or skip
    return handleBudgetExhausted(spend, limit);
  }

  return response;
}

Key insight: The proxy layer’s value isn’t just tracking — it’s enforcement. Any service can track its own costs. But only a centralized proxy can prevent overspend across all services simultaneously. When a cron job tries to burn $30 overnight, the proxy returns 429 after $5. The cron handles the 429 gracefully. Nobody wakes up to a surprise bill.
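To make the enforcement behavior concrete, here is a tiny in-memory model of the gate. It is a sketch only: the real check is the getDailySpend query against D1, and the BudgetGate class and its method names are invented for this illustration.

```typescript
// In-memory stand-in for the proxy's daily limit check (illustrative only).
class BudgetGate {
  private spentUsd = 0;
  private readonly dailyLimitUsd: number;

  constructor(dailyLimitUsd: number) {
    this.dailyLimitUsd = dailyLimitUsd;
  }

  // Mirrors the daily limit check: refuse new calls once the limit is reached.
  admit(): boolean {
    return this.spentUsd < this.dailyLimitUsd;
  }

  // Mirrors the ledger insert: add a completed call's cost to today's total.
  record(costUsd: number): void {
    this.spentUsd += costUsd;
  }

  get spend(): number {
    return this.spentUsd;
  }
}

// A runaway job that wants 500 calls at $0.06 each ($30) against a $5 limit.
const gate = new BudgetGate(5);
let completedCalls = 0;
for (let i = 0; i < 500; i++) {
  if (!gate.admit()) break; // the proxy would answer 429 here; the caller backs off
  gate.record(0.06);
  completedCalls++;
}
console.log(`completed ${completedCalls} of 500 calls, spent $${gate.spend.toFixed(2)}`);
// prints "completed 84 of 500 calls, spent $5.04"
```

Note the small overshoot: the check runs before the call and the cost is only known afterwards, so the limit can be exceeded by up to one call's cost (more under concurrency).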

The Composition Pattern

This is where the architecture comes together. The three layers compose through two integration points:

Container → Brain: The AIChatAgent.onChatMessage() method calls streamText() / generateText() / generateObject() from the AI SDK
Brain → Wallet: The AI SDK provider’s fetch option routes HTTP requests through the proxy

The Custom Fetch Bridge

Every AI SDK provider factory (createGoogleGenerativeAI, createAnthropic, createOpenAI) accepts a fetch option. This is the seam where Layer 2 connects to Layer 3:

import { createGoogleGenerativeAI } from "@ai-sdk/google";

// Direct call — bypasses cost tracking
const googleDirect = createGoogleGenerativeAI({
  apiKey: env.GEMINI_API_KEY,
});

// Proxied call — all requests go through API Mom
const googleProxied = createGoogleGenerativeAI({
  apiKey: "proxied",  // API Mom injects the real key
  baseURL: `${env.API_MOM_URL}/v1/google`,
  fetch: async (url: RequestInfo | URL, init?: RequestInit) => {
    const headers = new Headers(init?.headers);
    headers.set("X-Project-Id", "brand-agents");
    headers.set("X-Function", "content-generation");
    headers.set("X-Api-Key", env.API_MOM_KEY);
    headers.set("X-Tags", JSON.stringify(["brand:niche-fi", "trigger:agent"]));

    return fetch(url, { ...init, headers });
  },
});

The beauty of this pattern: the AI SDK doesn’t know it’s talking to a proxy. It thinks it’s calling Google’s API. The proxy handles authentication, cost tracking, and budget enforcement transparently.
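Because the same header-injection closure recurs in every example, it can be factored into a small wrapper. This is one possible shape, not part of the AI SDK: ProxyMeta, buildProxyHeaders, and withProxyHeaders are hypothetical names, and the header names are the ones the proxy in this article expects.

```typescript
// Attribution metadata the proxy expects on every request (names are illustrative).
interface ProxyMeta {
  projectId: string;
  functionName: string;
  proxyKey: string;
  tags?: string[];
}

type FetchLike = (input: Parameters<typeof fetch>[0], init?: RequestInit) => Promise<Response>;

// Merge the proxy headers into whatever headers the SDK already set.
function buildProxyHeaders(meta: ProxyMeta, existing?: RequestInit["headers"]): Headers {
  const headers = new Headers(existing);
  headers.set("X-Project-Id", meta.projectId);
  headers.set("X-Function", meta.functionName);
  headers.set("X-Api-Key", meta.proxyKey);
  if (meta.tags && meta.tags.length > 0) {
    headers.set("X-Tags", JSON.stringify(meta.tags));
  }
  return headers;
}

// Wrap a fetch implementation so every outgoing request is tagged.
// baseFetch is injectable, which makes the wrapper easy to test with a stub.
function withProxyHeaders(meta: ProxyMeta, baseFetch: FetchLike = fetch): FetchLike {
  return (input, init) =>
    baseFetch(input, { ...init, headers: buildProxyHeaders(meta, init?.headers) });
}
```

With this in place, a provider would be configured with something like fetch: withProxyHeaders({ projectId: "brand-agents", functionName: "content-generation", proxyKey: env.API_MOM_KEY }), and the attribution logic lives in one spot instead of in every agent.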

Full Stack Composition

Here’s a complete agent that uses all three layers:

import { AIChatAgent } from "agents/ai-chat-agent";
import {
  streamText,
  generateObject,
  tool,
  stepCountIs,
  type StreamTextOnFinishCallback,
  type ToolSet,
} from "ai";
import { createGoogleGenerativeAI } from "@ai-sdk/google";
import { z } from "zod";

interface Env {
  API_MOM_URL: string;
  API_MOM_KEY: string;
  RESEARCH_QUEUE: Queue;
}

interface ResearchAgentState {
  brandSlug: string;
  status: "idle" | "researching" | "analyzing" | "reporting";
  findingsCount: number;
  costToday: number;
  costBudget: number;
  lastError: string | null;
}

export class ResearchAgent extends AIChatAgent<Env, ResearchAgentState> {
  initialState: ResearchAgentState = {
    brandSlug: "",
    status: "idle",
    findingsCount: 0,
    costToday: 0,
    costBudget: 2.0,
    lastError: null,
  };

  // Layer 2: Create a proxied model instance
  private createModel(functionName: string) {
    return createGoogleGenerativeAI({
      apiKey: "proxied",  // Real key lives in API Mom
      baseURL: `${this.env.API_MOM_URL}/v1/google`,
      fetch: async (url, init) => {
        const headers = new Headers(init?.headers);
        headers.set("X-Project-Id", "research-agents");
        headers.set("X-Function", functionName);
        headers.set("X-Api-Key", this.env.API_MOM_KEY);
        headers.set("X-Tags", JSON.stringify([
          `brand:${this.state.brandSlug}`,
          "trigger:chat",
        ]));
        return fetch(url, { ...init, headers });
      },
    })("gemini-2.5-flash");
  }

  // Layer 1: Lifecycle
  async onStart() {
    this.sql`CREATE TABLE IF NOT EXISTS findings (
      id INTEGER PRIMARY KEY AUTOINCREMENT,
      query TEXT NOT NULL,
      category TEXT,
      content TEXT NOT NULL,
      confidence REAL DEFAULT 0.5,
      source_url TEXT,
      created_at TEXT DEFAULT (datetime('now'))
    )`;

    this.sql`CREATE TABLE IF NOT EXISTS cost_log (
      id INTEGER PRIMARY KEY AUTOINCREMENT,
      function TEXT NOT NULL,
      cost_usd REAL NOT NULL,
      tokens_used INTEGER,
      created_at TEXT DEFAULT (datetime('now'))
    )`;
  }

  // Layer 1 → Layer 2: Chat with tools
  async onChatMessage(onFinish: StreamTextOnFinishCallback<ToolSet>) {
    const model = this.createModel("chat");

    const result = streamText({
      model,
      system: `You are a research agent for the brand "${this.state.brandSlug}".
               Use the available tools to research topics and save findings.
               Always search before answering. Save important findings.`,
      messages: this.messages,
      tools: {
        search: tool({
          description: "Search the web for information on a topic",
          inputSchema: z.object({
            query: z.string().describe("Search query"),
          }),
          execute: async ({ query }) => {
            // This call also goes through API Mom for cost tracking
            const res = await fetch(
              `${this.env.API_MOM_URL}/v1/brave/search?q=${encodeURIComponent(query)}`,
              {
                headers: {
                  "X-Project-Id": "research-agents",
                  "X-Function": "web-search",
                  "X-Api-Key": this.env.API_MOM_KEY,
                },
              }
            );
            return await res.text();
          },
        }),

        saveFinding: tool({
          description: "Save a research finding to the knowledge base",
          inputSchema: z.object({
            query: z.string(),
            category: z.string(),
            content: z.string(),
            confidence: z.number().min(0).max(1),
            sourceUrl: z.string().optional(),
          }),
          execute: async ({ query, category, content, confidence, sourceUrl }) => {
            this.sql`INSERT INTO findings (query, category, content, confidence, source_url)
                     VALUES (${query}, ${category}, ${content}, ${confidence}, ${sourceUrl ?? null})`;

            this.setState({
              ...this.state,
              findingsCount: this.state.findingsCount + 1,
            });

            return `Saved finding in category "${category}" with confidence ${confidence}`;
          },
        }),

        getFindings: tool({
          description: "Retrieve previous research findings",
          inputSchema: z.object({
            category: z.string().optional(),
            limit: z.number().default(10),
          }),
          execute: async ({ category, limit }) => {
            const results = category
              ? [...this.sql`SELECT * FROM findings WHERE category = ${category} ORDER BY created_at DESC LIMIT ${limit}`]
              : [...this.sql`SELECT * FROM findings ORDER BY created_at DESC LIMIT ${limit}`];
            return JSON.stringify(results);
          },
        }),
      },
      stopWhen: stepCountIs(8),  // AI SDK 5 form of maxSteps; stepCountIs is exported by "ai"
      onFinish: async (result) => {
        // Track cost in agent state
        const thinkingTokens = result.providerMetadata?.google?.usageMetadata?.thoughtsTokenCount ?? 0;
        const cost = calculateCost(
          "gemini-2.5-flash",
          result.usage?.inputTokens ?? 0,
          result.usage?.outputTokens ?? 0,
          thinkingTokens,
        );

        this.sql`INSERT INTO cost_log (function, cost_usd, tokens_used)
                 VALUES ('chat', ${cost}, ${result.usage?.totalTokens ?? 0})`;

        this.setState({
          ...this.state,
          costToday: this.state.costToday + cost,
        });

        onFinish?.(result);
      },
    });

    return result.toUIMessageStreamResponse();
  }

  // Structured output example — Layer 2 with schema validation
  async analyzeCompetitors(keyword: string) {
    const model = this.createModel("competitor-analysis");

    const CompetitorSchema = z.object({
      competitors: z.array(z.object({
        name: z.string(),
        url: z.string(),
        strengths: z.array(z.string()),
        weaknesses: z.array(z.string()),
        estimatedTraffic: z.enum(["low", "medium", "high", "very-high"]),
      })),
      marketGaps: z.array(z.string()),
      recommendation: z.string(),
    });

    const { object, usage } = await generateObject({
      model,
      schema: CompetitorSchema,
      prompt: `Analyze the competitive landscape for "${keyword}".
               Identify top competitors, their strengths and weaknesses,
               and market gaps we could exploit.`,
      maxRetries: 3,
    });

    return object;
  }
}

The Data Flow

When a user sends a chat message to this agent, here’s exactly what happens:

1. WebSocket message arrives at the Cloudflare Worker
2. routeAgentRequest() routes to the correct ResearchAgent instance (Durable Object)
3. AIChatAgent loads conversation history from SQLite (Layer 1)
4. onChatMessage() is called
5. streamText() sends the prompt + tools to the AI SDK (Layer 2)
6. AI SDK creates an HTTP request to Gemini's API
7. Custom fetch() intercepts the request, adds proxy headers
8. Request goes to API Mom (Layer 3)
9. API Mom checks daily spend limit
10. API Mom injects the real GEMINI_API_KEY
11. API Mom forwards to Google's API
12. Google returns tokens + usage metadata
13. API Mom records cost to api_calls_ledger
14. API Mom forwards response back (with X-Cost-Usd header)
15. AI SDK parses the response, validates schema if applicable
16. If tool calls: execute tools, send results back to LLM (repeat 6-15)
17. Stream tokens back through WebSocket to client (Layer 1)
18. onFinish: record cost in agent's SQLite, update state
19. State change broadcasts to all connected WebSocket clients
20. Agent goes idle → eventually hibernates → $0 cost

Each layer handles its concern. The agent code in step 4 doesn’t know about API keys. The AI SDK in step 6 doesn’t know about Durable Objects. The proxy in step 9 doesn’t know about Zod schemas. They compose through narrow interfaces: function calls and HTTP.

Patterns

Pattern 1: Budget-Aware Agent

An agent that adjusts its behavior based on remaining budget. Uses the Wallet layer’s cost response headers to make real-time decisions.

export class BudgetAwareAgent extends Agent<Env, BudgetAgentState> {
  initialState: BudgetAgentState = {
    dailyBudget: 5.0,
    spentToday: 0,
    model: "gemini-2.5-flash",
    status: "idle",  // Becomes "budget_exhausted" when the proxy enforces its hard limit
    taskQueue: [],
    completedTasks: 0,
    skippedTasks: 0,
  };

  private getModel() {
    // Downgrade model when budget is tight
    const remaining = this.state.dailyBudget - this.state.spentToday;

    if (remaining < 0.50) {
      return "gemini-2.0-flash";  // Cheapest: $0.10/M input
    }
    if (remaining < 2.0) {
      return "gemini-2.5-flash";  // Middle: $0.15/M input, has thinking
    }
    return "gemini-2.5-pro";     // Best: $1.25/M input, best quality
  }

  private createProxiedModel(functionName: string) {
    const modelId = this.getModel();
    return createGoogleGenerativeAI({
      apiKey: "proxied",
      baseURL: `${this.env.API_MOM_URL}/v1/google`,
      fetch: async (url, init) => {
        const headers = new Headers(init?.headers);
        headers.set("X-Project-Id", "budget-agent");
        headers.set("X-Function", functionName);
        headers.set("X-Api-Key", this.env.API_MOM_KEY);

        const response = await fetch(url, { ...init, headers });

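        // Read cost from proxy response headers (absent on some error responses)
        const costUsd = parseFloat(response.headers.get("X-Cost-Usd") ?? "0");
        const dailySpendHeader = response.headers.get("X-Daily-Spend");
        const dailySpend = dailySpendHeader !== null
          ? parseFloat(dailySpendHeader)
          : this.state.spentToday;  // Keep the last known total rather than resetting to 0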

        // Update agent state with real cost data
        this.setState({
          ...this.state,
          spentToday: dailySpend,
        });

        // Log cost to agent's SQLite for analysis
        this.sql`INSERT INTO cost_events (function, model, cost_usd, daily_total)
                 VALUES (${functionName}, ${modelId}, ${costUsd}, ${dailySpend})`;

        return response;
      },
    })(modelId);
  }

  async processTask(task: AgentTask) {
    const remaining = this.state.dailyBudget - this.state.spentToday;

    // Skip expensive tasks when budget is low
    if (task.estimatedCost > remaining) {
      this.sql`INSERT INTO skipped_tasks (task_id, reason, remaining_budget)
               VALUES (${task.id}, 'budget_insufficient', ${remaining})`;

      this.setState({
        ...this.state,
        skippedTasks: this.state.skippedTasks + 1,
      });

      // Re-queue for tomorrow
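      // Pass the task id as the payload so retryTask (not shown) knows what to re-run
      this.schedule(tomorrowMidnight(), "retryTask", { taskId: task.id });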
      return;
    }

    const model = this.createProxiedModel(task.function);

    try {
      const result = await generateObject({
        model,
        schema: task.outputSchema,
        prompt: task.prompt,
        maxRetries: 2,
      });

      this.setState({
        ...this.state,
        completedTasks: this.state.completedTasks + 1,
      });

      return result.object;
    } catch (err) {
      if (err instanceof Error && err.message.includes("daily_limit_exceeded")) {
        // Proxy rejected — we've hit the hard limit
        console.warn("[budget] Hard limit hit, pausing until tomorrow");
        this.setState({ ...this.state, status: "budget_exhausted" });
        this.schedule(tomorrowMidnight(), "resetBudget");
      }
      throw err;
    }
  }

  async resetBudget() {
    this.setState({
      ...this.state,
      spentToday: 0,
      status: "idle",
    });
    // Re-process queued tasks
    await this.processQueue();
  }
}
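The agent above calls tomorrowMidnight() without defining it. A minimal version could look like this. It uses UTC on the assumption that the budget day should line up with the proxy's getDailySpend query, since SQLite's datetime('now', 'start of day') is evaluated in UTC. The injectable now parameter is an addition for testability.

```typescript
// Next midnight in UTC, returned as a Date suitable for this.schedule().
// `now` is injectable so the function can be tested deterministically.
function tomorrowMidnight(now: Date = new Date()): Date {
  // Date.UTC normalizes day overflow, so "the 32nd" rolls into the next month.
  return new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate() + 1));
}

console.log(tomorrowMidnight(new Date("2026-03-15T18:30:00Z")).toISOString());
// prints "2026-03-16T00:00:00.000Z"
```

If the team thinks about budgets in a local time zone instead, both this helper and the proxy's query would need to change together.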

Pattern 2: Multi-Provider Routing

Route different tasks to different LLM providers based on task complexity, leveraging the AI SDK’s provider abstraction.

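import { generateObject } from "ai";
import { createGoogleGenerativeAI } from "@ai-sdk/google";
import { createAnthropic } from "@ai-sdk/anthropic";
import { createOpenAI } from "@ai-sdk/openai";
import { z } from "zod";

type TaskComplexity = "trivial" | "standard" | "complex" | "critical";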

interface ModelRoute {
  provider: "google" | "anthropic" | "openai";
  model: string;
  costPerMillionInput: number;
  costPerMillionOutput: number;
  bestFor: string[];
}

const MODEL_ROUTES: Record<TaskComplexity, ModelRoute> = {
  trivial: {
    provider: "google",
    model: "gemini-2.0-flash",
    costPerMillionInput: 0.10,
    costPerMillionOutput: 0.40,
    bestFor: ["classification", "extraction", "formatting"],
  },
  standard: {
    provider: "google",
    model: "gemini-2.5-flash",
    costPerMillionInput: 0.15,
    costPerMillionOutput: 0.60,
    bestFor: ["summarization", "content-generation", "tool-calling"],
  },
  complex: {
    provider: "anthropic",
    model: "claude-sonnet-4-20250514",
    costPerMillionInput: 3.0,
    costPerMillionOutput: 15.0,
    bestFor: ["reasoning", "code-generation", "nuanced-analysis"],
  },
  critical: {
    provider: "google",
    model: "gemini-2.5-pro",
    costPerMillionInput: 1.25,
    costPerMillionOutput: 10.0,
    bestFor: ["long-context", "multi-step-reasoning", "high-stakes-decisions"],
  },
};

function createRoutedModel(
  env: Env,
  complexity: TaskComplexity,
  functionName: string,
) {
  const route = MODEL_ROUTES[complexity];

  // Custom fetch that routes through the proxy
  const proxiedFetch = async (url: RequestInfo | URL, init?: RequestInit) => {
    const headers = new Headers(init?.headers);
    headers.set("X-Project-Id", "routed-agents");
    headers.set("X-Function", functionName);
    headers.set("X-Api-Key", env.API_MOM_KEY);
    headers.set("X-Tags", JSON.stringify([`complexity:${complexity}`, `provider:${route.provider}`]));
    return fetch(url, { ...init, headers });
  };

  switch (route.provider) {
    case "google":
      return createGoogleGenerativeAI({
        apiKey: "proxied",
        baseURL: `${env.API_MOM_URL}/v1/google`,
        fetch: proxiedFetch,
      })(route.model);

    case "anthropic":
      return createAnthropic({
        apiKey: "proxied",
        baseURL: `${env.API_MOM_URL}/v1/anthropic`,
        fetch: proxiedFetch,
      })(route.model);

    case "openai":
      return createOpenAI({
        apiKey: "proxied",
        baseURL: `${env.API_MOM_URL}/v1/openai`,
        fetch: proxiedFetch,
      })(route.model);
  }
}

// Usage in an agent
async function processWithRouting(task: AgentTask, env: Env) {
  // classifyComplexity is async, so it must be awaited before indexing MODEL_ROUTES
  const complexity = await classifyComplexity(task, env);
  const model = createRoutedModel(env, complexity, task.function);

  const { object } = await generateObject({
    model,
    schema: task.schema,
    prompt: task.prompt,
    maxRetries: 3,
  });

  return object;
}

// Complexity classifier: itself a trivial LLM call
async function classifyComplexity(task: AgentTask, env: Env): Promise<TaskComplexity> {
  const model = createRoutedModel(env, "trivial", "classify-complexity");

  const { object } = await generateObject({
    model,
    schema: z.object({
      complexity: z.enum(["trivial", "standard", "complex", "critical"]),
      reasoning: z.string(),
    }),
    prompt: `Classify the complexity of this task:\n\n${task.prompt.slice(0, 500)}`,
  });

  return object.complexity;
}
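The cost fields in ModelRoute are not used by the routing code above. One way to put them to work is a pre-flight estimate, so a caller can compare routes before committing to one. The sketch below is an assumption about how that might look: estimateCostUsd is a hypothetical helper, the token counts are illustrative, and thinking tokens are ignored.

```typescript
// The two pricing fields from ModelRoute, repeated here so the sketch is self-contained.
interface RoutePricing {
  costPerMillionInput: number;
  costPerMillionOutput: number;
}

// Rough pre-flight estimate; does not account for thinking tokens or caching.
function estimateCostUsd(
  route: RoutePricing,
  inputTokens: number,
  expectedOutputTokens: number,
): number {
  const micros =
    inputTokens * route.costPerMillionInput +
    expectedOutputTokens * route.costPerMillionOutput;
  return micros / 1_000_000;
}

// Same job (4,000 tokens in, 1,500 out) priced on the "standard" and "complex" routes.
const standardCost = estimateCostUsd({ costPerMillionInput: 0.15, costPerMillionOutput: 0.6 }, 4_000, 1_500);
const complexCost = estimateCostUsd({ costPerMillionInput: 3.0, costPerMillionOutput: 15.0 }, 4_000, 1_500);
console.log(standardCost.toFixed(4), complexCost.toFixed(4)); // "0.0015" "0.0345"
```

A 23x difference per call is small in isolation but decisive at batch scale, which is the argument for classifying tasks before routing them.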

Pattern 3: Agent-to-Agent Communication via Queues

Multiple agents that coordinate through Cloudflare Queues, with each agent independently managing its own state and cost budget.

// Research Agent — finds information
export class ResearchWorkerAgent extends Agent<Env, ResearchState> {
  initialState: ResearchState = {
    status: "idle",
    assignedKeywords: [],
    completedKeywords: [],
    costToday: 0,
  };

  async onStart() {
    this.sql`CREATE TABLE IF NOT EXISTS research_results (
      keyword TEXT PRIMARY KEY,
      serp_data TEXT,
      competitor_data TEXT,
      opportunity_score REAL,
      researched_at TEXT DEFAULT (datetime('now'))
    )`;
  }

  // Triggered by queue message from Coordinator
  async handleResearchRequest(keyword: string, correlationId: string) {
    this.setState({ ...this.state, status: "researching" });

    const model = this.createProxiedModel("keyword-research");

    // Step 1: Search
    const searchResults = await this.searchWeb(keyword);

    // Step 2: Analyze with LLM
    const { object: analysis } = await generateObject({
      model,
      schema: KeywordAnalysisSchema,
      prompt: `Analyze the SERP for "${keyword}":\n${searchResults}`,
    });

    // Step 3: Save results
    this.sql`INSERT OR REPLACE INTO research_results
             (keyword, serp_data, competitor_data, opportunity_score)
             VALUES (${keyword}, ${JSON.stringify(searchResults)},
                     ${JSON.stringify(analysis.competitors)},
                     ${analysis.opportunityScore})`;

    // Step 4: Emit completion event
    await this.env.EVENTS_QUEUE.send({
      event_id: crypto.randomUUID(),
      type: "research.completed",
      source: "research-agent",
      timestamp: new Date().toISOString(),
      correlation_id: correlationId,
      payload: {
        keyword,
        opportunityScore: analysis.opportunityScore,
        competitorCount: analysis.competitors.length,
      },
    });

    this.setState({
      ...this.state,
      status: "idle",
      completedKeywords: [...this.state.completedKeywords, keyword],
    });
  }
}

// Content Agent — generates content based on research
export class ContentWorkerAgent extends Agent<Env, ContentState> {
  initialState: ContentState = {
    status: "idle",
    articlesGenerated: 0,
    costToday: 0,
  };

  // Triggered by research.completed event
  async handleResearchCompleted(event: DomainMessage<ResearchCompletedPayload>) {
    const { keyword, opportunityScore } = event.payload;

    // Only generate content for high-opportunity keywords
    if (opportunityScore < 0.6) {
      console.log(`[content] Skipping "${keyword}" — score ${opportunityScore} below threshold`);
      return;
    }

    this.setState({ ...this.state, status: "generating" });

    // Use a more capable model for content generation
    const model = this.createProxiedModel("article-write");

    const { object: article } = await generateObject({
      model,
      schema: ArticleSchema,
      prompt: `Write a comprehensive article about "${keyword}" targeting
               users searching for this term. Include practical advice,
               comparisons, and actionable steps.`,
      maxRetries: 3,
    });

    // Emit publish command
    await this.env.PUBLISH_QUEUE.send({
      event_id: crypto.randomUUID(),
      type: "content.publish",
      source: "content-agent",
      timestamp: new Date().toISOString(),
      correlation_id: event.correlation_id,
      payload: {
        keyword,
        title: article.title,
        slug: article.slug,
        content: article.content,  // Normally stored in R2, reference in message
      },
    });

    this.setState({
      ...this.state,
      status: "idle",
      articlesGenerated: this.state.articlesGenerated + 1,
    });
  }
}
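Both agents assume a shared DomainMessage envelope and some consumer that hands each event to the right handler. Neither is shown, so the following is a guess at their shape: the field names match the messages sent above, while isDomainMessage and routeEvent are hypothetical helpers.

```typescript
// Envelope shared by every event; field names match the queue messages above.
interface DomainMessage<T = unknown> {
  event_id: string;
  type: string;
  source: string;
  timestamp: string; // ISO 8601
  correlation_id: string;
  payload: T;
}

type EventHandler = (event: DomainMessage) => void | Promise<void>;

// Runtime check for messages arriving from a queue, where nothing is typed.
function isDomainMessage(value: unknown): value is DomainMessage {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const stringFields = ["event_id", "type", "source", "timestamp", "correlation_id"];
  return stringFields.every((k) => typeof v[k] === "string") && "payload" in v;
}

// Look up the handler for an event type.
// Malformed messages and unknown types both return undefined.
function routeEvent(
  value: unknown,
  handlers: Map<string, EventHandler>,
): EventHandler | undefined {
  if (!isDomainMessage(value)) return undefined;
  return handlers.get(value.type);
}
```

In a Worker's queue consumer, each message body would pass through routeEvent, with a handler such as "research.completed" forwarding to ContentWorkerAgent.handleResearchCompleted. Messages with no handler can be acknowledged and logged rather than retried.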

Pattern 4: Cloudflare AI Gateway Integration

Instead of a custom proxy, use Cloudflare’s AI Gateway for caching, rate limiting, and observability. This works as a lightweight Layer 3 when you don’t need custom cost attribution.

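import { AIChatAgent } from "agents/ai-chat-agent";
import { streamText, stepCountIs } from "ai";
import { createAiGateway } from "ai-gateway-provider";
import { createGoogleGenerativeAI } from "ai-gateway-provider/providers/google";

// AIChatAgent (not the base Agent) provides this.messages and onChatMessage()
export class GatewayAgent extends AIChatAgent<Env, AgentState> {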
  private createModel() {
    // Route through Cloudflare AI Gateway
    const aigateway = createAiGateway({
      binding: this.env.AI.gateway("my-gateway"),
      options: {
        cacheTtl: 300,  // Cache responses for 5 minutes
      },
    });

    const google = createGoogleGenerativeAI({
      apiKey: this.env.GEMINI_API_KEY,
    });

    // Compose: AI Gateway wraps the Google provider
    return aigateway(google("gemini-2.5-flash"));
  }

  async onChatMessage() {
    const model = this.createModel();

    const result = streamText({
      model,
      messages: this.messages,
      stopWhen: stepCountIs(5),  // AI SDK 5 form of maxSteps; stepCountIs is exported by "ai"
    });

    return result.toUIMessageStreamResponse();
  }
}

The AI Gateway provides:

Caching: Identical prompts hit the cache instead of the API ($0)
Rate limiting: Prevent abuse at the gateway level
Logging: Every request logged with latency, tokens, and cost
Analytics: Dashboard showing usage patterns per model
Fallbacks: If one provider fails, automatically try another

But it doesn’t provide:

Function-level cost attribution
Daily spend limits per project
Cross-service cost aggregation
Custom cost formulas (like thinking token pricing)

This is why many production systems use both: AI Gateway for caching and fallbacks, a custom proxy for attribution and enforcement.

Pattern 5: Structured Output Pipeline

A pipeline that chains multiple LLM calls, each with its own schema, building on the output of the previous step.

const OutlineSchema = z.object({
  title: z.string(),
  sections: z.array(z.object({
    heading: z.string(),
    keyPoints: z.array(z.string()),
    estimatedWordCount: z.number(),
  })),
  targetWordCount: z.number(),
  targetAudience: z.string(),
});

const DraftSchema = z.object({
  title: z.string(),
  content: z.string().describe("Full article in markdown"),
  wordCount: z.number(),
  metaDescription: z.string().max(160),
  tags: z.array(z.string()),
});

const EditSchema = z.object({
  content: z.string().describe("Edited article in markdown"),
  changes: z.array(z.object({
    type: z.enum(["grammar", "clarity", "seo", "structure"]),
    description: z.string(),
  })),
  readabilityScore: z.number().min(0).max(100),
});

async function articlePipeline(
  keyword: string,
  research: string,
  env: Env,
): Promise<{ article: z.infer<typeof EditSchema>; totalCost: number }> {
  let totalCost = 0;

  const createModel = (fn: string) => createGoogleGenerativeAI({
    apiKey: "proxied",
    baseURL: `${env.API_MOM_URL}/v1/google`,
    fetch: async (url, init) => {
      const headers = new Headers(init?.headers);
      headers.set("X-Project-Id", "content-pipeline");
      headers.set("X-Function", fn);
      headers.set("X-Api-Key", env.API_MOM_KEY);
      const response = await fetch(url, { ...init, headers });
      totalCost += parseFloat(response.headers.get("X-Cost-Usd") ?? "0");
      return response;
    },
  })("gemini-2.5-flash");

  // Step 1: Outline (cheap, fast)
  const { object: outline } = await generateObject({
    model: createModel("outline"),
    schema: OutlineSchema,
    prompt: `Create an article outline for "${keyword}" based on this research:\n${research}`,
  });

  // Step 2: Draft (main cost — longer output)
  const { object: draft } = await generateObject({
    model: createModel("draft"),
    schema: DraftSchema,
    prompt: `Write a full article following this outline:\n${JSON.stringify(outline)}`,
    maxOutputTokens: 16384,
  });

  // Step 3: Edit (moderate cost — reviews full article)
  const { object: edited } = await generateObject({
    model: createModel("edit"),
    schema: EditSchema,
    prompt: `Edit this article for clarity, SEO, and readability:\n${draft.content}`,
  });

  console.log(`[pipeline] Article for "${keyword}" complete. Total cost: $${totalCost.toFixed(4)}`);

  return { article: edited, totalCost };
}

Small Examples

Example 1: Minimal Agent with State Sync

The simplest possible agent — state syncs to all connected clients in real-time.

import { Agent, type Connection } from "agents";

interface CounterState {
  count: number;
  lastUpdated: string | null;
}

export class CounterAgent extends Agent<Env, CounterState> {
  initialState: CounterState = { count: 0, lastUpdated: null };

  async onMessage(connection: Connection, message: string) {
    const { action } = JSON.parse(message);

    if (action === "increment") {
      this.setState({
        count: this.state.count + 1,
        lastUpdated: new Date().toISOString(),
      });
    }

    if (action === "reset") {
      this.setState({ count: 0, lastUpdated: new Date().toISOString() });
    }
    // State automatically broadcasts to all connected WebSocket clients
  }
}

Example 2: Custom Fetch Logger

Intercept all AI SDK requests to log request/response details — useful for debugging.

function createLoggingModel(apiKey: string, modelId: string) {
  return createGoogleGenerativeAI({
    apiKey,
    fetch: async (url, init) => {
      const startMs = Date.now();
      const requestBody = init?.body ? JSON.parse(init.body as string) : null;

      console.log(`[llm:request] ${url}`);
      console.log(`[llm:request] Prompt tokens (est): ${estimateTokens(requestBody)}`);

      const response = await fetch(url, init);
      const durationMs = Date.now() - startMs;

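      // Clone so the SDK can still read the original body. Only parse JSON responses:
      // streaming calls return server-sent events, which .json() cannot parse.
      if (response.headers.get("content-type")?.includes("application/json")) {
        const responseBody = await response.clone().json();
        const usage = responseBody.usageMetadata;

        console.log(`[llm:response] ${durationMs}ms | ${response.status}`);
        console.log(`[llm:response] Input: ${usage?.promptTokenCount ?? "?"} | Output: ${usage?.candidatesTokenCount ?? "?"} | Thinking: ${usage?.thoughtsTokenCount ?? 0}`);
      } else {
        console.log(`[llm:response] ${durationMs}ms | ${response.status} (streamed, usage not logged)`);
      }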

      return response;
    },
  })(modelId);
}
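The logger calls estimateTokens(), which is not defined in this article. A crude stand-in is shown below. It assumes roughly four characters per token, a common rule of thumb for English text and not a real tokenizer, so it is only good enough for log lines.

```typescript
// Very rough token estimate: about 4 characters per token for English text.
// A heuristic for logging only; real counts come from usageMetadata in the response.
function estimateTokens(requestBody: unknown): number {
  if (requestBody == null) return 0;
  const text =
    typeof requestBody === "string" ? requestBody : JSON.stringify(requestBody) ?? "";
  return Math.ceil(text.length / 4);
}

console.log(estimateTokens("abcdefgh")); // 2
```

Because the whole JSON body is counted, including keys and punctuation, the estimate tends to run high for small prompts.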

Example 3: Schema-First Tool Definition

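Define tools using Zod schemas for type safety. The schema validates the arguments the model supplies at runtime and types the execute() parameters at compile time. The return value is typed by TypeScript inference but is not validated.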

import { generateText, tool, stepCountIs } from "ai";
import { google } from "@ai-sdk/google";
import { z } from "zod";

const weatherTool = tool({
  description: "Get current weather for a location",
  inputSchema: z.object({  // AI SDK 5 name for the tool's parameter schema
    city: z.string().describe("City name"),
    units: z.enum(["celsius", "fahrenheit"]).default("celsius"),
  }),
  execute: async ({ city, units }) => {
    // city is typed as string, units as "celsius" | "fahrenheit"
    const response = await fetch(
      `https://api.weather.example.com/v1/current?city=${encodeURIComponent(city)}&units=${units}`
    );
    const data = await response.json();
    return `${city}: ${data.temperature}°${units === "celsius" ? "C" : "F"}, ${data.condition}`;
  },
});

// Use in generateText
const { text } = await generateText({
  model: google("gemini-2.5-flash"),
  tools: { weather: weatherTool },
  prompt: "What's the weather in Tokyo?",
  stopWhen: stepCountIs(3),  // AI SDK 5 form of maxSteps: 3
});

Example 4: Thinking Budget Control

Limit thinking tokens to control cost — useful when you want speed over depth.

async function quickClassification(text: string) {
  const { object } = await generateObject({
    model: google("gemini-2.5-flash"),
    schema: z.object({
      category: z.enum(["positive", "negative", "neutral"]),
      confidence: z.number().min(0).max(1),
    }),
    prompt: `Classify the sentiment: "${text}"`,
    providerOptions: {
      google: {
        thinkingConfig: { thinkingBudget: 0 },  // No thinking — fastest, cheapest
      },
    },
  });
  return object;
}

async function deepAnalysis(text: string) {
  const { object } = await generateObject({
    model: google("gemini-2.5-flash"),
    schema: z.object({
      sentiment: z.enum(["positive", "negative", "neutral", "mixed"]),
      themes: z.array(z.string()),
      reasoning: z.string(),
      suggestedActions: z.array(z.string()),
    }),
    prompt: `Deeply analyze this text for sentiment, themes, and actionable insights:\n\n${text}`,
    providerOptions: {
      google: {
        thinkingConfig: { thinkingBudget: 4096 },  // Allow thinking — better quality
      },
    },
  });
  return object;
}

Example 5: Agent with Scheduled Self-Improvement

An agent that periodically reviews its own performance and adjusts its strategy.

export class AdaptiveAgent extends Agent<Env, AdaptiveState> {
  initialState: AdaptiveState = {
    strategy: "balanced",
    successRate: 0,
    avgCostPerTask: 0,
    tasksProcessed: 0,
  };

  async onStart() {
    // Self-review every 6 hours
    this.schedule("0 */6 * * *", "selfReview");

    this.sql`CREATE TABLE IF NOT EXISTS task_outcomes (
      id INTEGER PRIMARY KEY AUTOINCREMENT,
      task_type TEXT,
      model_used TEXT,
      cost_usd REAL,
      success INTEGER,
      quality_score REAL,
      created_at TEXT DEFAULT (datetime('now'))
    )`;
  }

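  async selfReview() {
    // Exclude the 'self-review' rows this method writes, so its own
    // hardcoded success and quality values do not skew the next review
    const stats = [...this.sql`
      SELECT
        model_used,
        COUNT(*) as total,
        SUM(success) as successes,
        AVG(cost_usd) as avg_cost,
        AVG(quality_score) as avg_quality
      FROM task_outcomes
      WHERE created_at > datetime('now', '-6 hours')
        AND task_type != 'self-review'
      GROUP BY model_used
    `];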

    if (stats.length === 0) return;

    // Use the cheapest model for self-reflection
    const model = this.createProxiedModel("self-review");

    const { object: review } = await generateObject({
      model,
      schema: z.object({
        recommendation: z.enum(["use_cheaper_model", "use_better_model", "stay_current"]),
        reasoning: z.string(),
        suggestedThinkingBudget: z.number(),
      }),
      prompt: `Review agent performance stats and recommend adjustments:\n${JSON.stringify(stats)}`,
    });

    this.setState({
      ...this.state,
      strategy: review.recommendation,
    });

    this.sql`INSERT INTO task_outcomes (task_type, model_used, cost_usd, success, quality_score)
             VALUES ('self-review', 'gemini-2.0-flash', 0.001, 1, 1.0)`;
  }
}

Example 6: Idempotent Queue Consumer with Cost Tracking

A queue consumer that deduplicates messages and tracks cost per processed item.

export default {
  async queue(batch: MessageBatch<DomainMessage>, env: Env) {
    for (const msg of batch.messages) {
      const { event_id, type, payload } = msg.body;

      // Idempotency check
      const existing = await env.DB.prepare(
        `SELECT 1 FROM processed_events WHERE event_id = ?`
      ).bind(event_id).first();

      if (existing) {
        msg.ack();
        continue;
      }

      try {
        let costUsd = 0;

        if (type === "content.generate") {
          const model = createGoogleGenerativeAI({
            apiKey: "proxied",
            baseURL: `${env.API_MOM_URL}/v1/google`,
            fetch: async (url, init) => {
              const headers = new Headers(init?.headers);
              headers.set("X-Project-Id", "content-worker");
              headers.set("X-Function", "content-generate");
              headers.set("X-Api-Key", env.API_MOM_KEY);
              headers.set("X-Tags", JSON.stringify([`event:${event_id}`]));
              const res = await fetch(url, { ...init, headers });
              costUsd += parseFloat(res.headers.get("X-Cost-Usd") ?? "0");
              return res;
            },
          })("gemini-2.5-flash");

          await generateObject({
            model,
            schema: ArticleSchema,
            prompt: `Generate content for: ${payload.keyword}`,
          });
        }

        // Mark as processed
        await env.DB.prepare(
          `INSERT INTO processed_events (event_id, type, cost_usd, processed_at)
           VALUES (?, ?, ?, datetime('now'))`
        ).bind(event_id, type, costUsd).run();

        msg.ack();
      } catch (err) {
        console.error(`[queue] Failed ${event_id}: ${err}`);
        msg.retry({ delaySeconds: 30 });
      }
    }
  },
};
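
The consumer above retries every failure after a fixed 30 seconds. One possible variation, sketched below, is to back off exponentially using the attempt counter that Cloudflare Queues exposes on each message as msg.attempts (starting at 1). The helper name and the 30-second base and 900-second cap are assumptions for illustration.

```typescript
// Hypothetical helper: exponential backoff for queue retries.
// attempts is 1 on the first delivery, 2 on the first retry, and so on.
function retryDelaySeconds(
  attempts: number,
  baseSeconds = 30,
  maxSeconds = 900,
): number {
  const exponent = Math.max(0, attempts - 1);
  return Math.min(maxSeconds, baseSeconds * 2 ** exponent);
}
```

In the catch block this would replace the fixed delay with msg.retry({ delaySeconds: retryDelaySeconds(msg.attempts) }), which spaces out repeated LLM calls during a provider outage instead of re-spending on them every 30 seconds.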

Example 7: Provider Fallback Chain

Try providers in order — if one fails (rate limit, outage), fall back to the next.

async function generateWithFallback<T>(
  schema: z.ZodType<T>,
  prompt: string,
  env: Env,
): Promise<{ data: T; provider: string; cost: number }> {
  const providers = [
    {
      name: "google",
      create: () => createGoogleGenerativeAI({
        apiKey: "proxied",
        baseURL: `${env.API_MOM_URL}/v1/google`,
        fetch: proxyFetch(env, "google-primary"),
      })("gemini-2.5-flash"),
    },
    {
      name: "anthropic",
      create: () => createAnthropic({
        apiKey: "proxied",
        baseURL: `${env.API_MOM_URL}/v1/anthropic`,
        fetch: proxyFetch(env, "anthropic-fallback"),
      })("claude-sonnet-4-20250514"),
    },
    {
      name: "openai",
      create: () => createOpenAI({
        apiKey: "proxied",
        baseURL: `${env.API_MOM_URL}/v1/openai`,
        fetch: proxyFetch(env, "openai-fallback"),
      })("gpt-4o"),
    },
  ];

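  let lastError: unknown;

  for (const provider of providers) {
    try {
      const model = provider.create();
      const { object } = await generateObject({ model, schema, prompt, maxRetries: 2 });
      // cost is a placeholder here: the proxy records the real spend per call
      return { data: object, provider: provider.name, cost: 0 };
    } catch (err) {
      // err is `unknown` in strict TypeScript, so narrow before reading .message
      lastError = err;
      const message = err instanceof Error ? err.message : String(err);
      console.warn(`[fallback] ${provider.name} failed: ${message}`);
    }
  }

  const message = lastError instanceof Error ? lastError.message : String(lastError);
  throw new Error(`All providers failed. Last error: ${message}`);
}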

Example 8: Real-Time Cost Dashboard via WebSocket

An agent that tracks cost across all agent instances and broadcasts to a monitoring dashboard.

export class CostMonitorAgent extends Agent<Env, CostMonitorState> {
  initialState: CostMonitorState = {
    totalToday: 0,
    byProject: {},
    alerts: [],
    lastUpdated: null,
  };

  async onStart() {
    // Poll cost data every 5 minutes
    this.schedule("*/5 * * * *", "refreshCosts");
  }

  async refreshCosts() {
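    const response = await fetch(`${this.env.API_MOM_URL}/v1/costs?period=day`, {
      headers: { "X-Api-Key": this.env.API_MOM_ADMIN_KEY },
    });
    if (!response.ok) {
      // Keep the previous state rather than broadcasting a broken update
      console.error(`[cost-monitor] Cost fetch failed: ${response.status}`);
      return;
    }

    const costs = (await response.json()) as {
      totalSpend: number;
      projects: { name: string; spent: number; dailyLimit: number }[];
    };
    const alerts: string[] = [];

    // Check for projects approaching limits (skip projects with no limit set)
    for (const project of costs.projects) {
      if (project.dailyLimit <= 0) continue;
      const usagePercent = (project.spent / project.dailyLimit) * 100;
      if (usagePercent > 80) {
        alerts.push(`${project.name}: ${usagePercent.toFixed(0)}% of daily budget ($${project.spent.toFixed(2)}/$${project.dailyLimit})`);
      }
    }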

    this.setState({
      totalToday: costs.totalSpend,
      byProject: Object.fromEntries(costs.projects.map(p => [p.name, p.spent])),
      alerts,
      lastUpdated: new Date().toISOString(),
    });

    // State update automatically broadcasts to all connected dashboard clients
    // via WebSocket — no explicit push needed
  }
}

Comparisons

Agent Runtime Frameworks

Framework	Runtime Model	State Persistence	Scaling Model	Cost When Idle	Language	Edge/Serverless
Cloudflare Agents SDK	Durable Objects (micro-servers)	Built-in SQLite + key-value state	Millions of instances, auto-scale	$0 (hibernation)	TypeScript	Yes (global edge)
OpenAI Agents SDK	In-process (client manages runtime)	None (bring your own)	Manual (run more processes)	N/A (not a host)	Python/TypeScript	No
LangGraph	In-process or LangGraph Cloud	Checkpointing (pluggable store)	LangGraph Cloud or manual	Depends on host	Python/TypeScript	Cloud only
CrewAI	In-process	None built-in	Manual	Depends on host	Python	No
AutoGen / MS Agent Framework	In-process, conversation-based	Session-level	Manual	Depends on host	Python	No
AWS Bedrock Agents	Managed service	Session memory (managed)	Auto-scale	Per-second billing (no true $0)	API-based	No (AWS regions)

Verdict: If you need agents that hibernate to $0, run on the edge, and scale to millions of instances without infrastructure management, the Cloudflare Agents SDK is the only framework in this comparison that covers all three. If you need a rich ecosystem of pre-built agent patterns and don’t care where the agent runs, LangGraph or the OpenAI Agents SDK give you more out of the box.

LLM Interface Libraries

Library	Structured Output	Tool Calling	Streaming	Provider Abstraction	Custom Fetch	Bundle Size	Edge Compatible
Vercel AI SDK	Zod schemas → `generateObject`	`tool()` + `maxSteps` loops	`streamText`	25+ providers	Yes (per-provider)	~67 KB gzipped	Yes
LangChain JS	Output parsers (less type-safe)	Agent executors, tool chains	Streaming callbacks	50+ providers	Via custom LLM	~101 KB gzipped	Limited
OpenAI SDK	JSON mode + function calling	Native function calling	Streaming helpers	OpenAI only	No	~20 KB gzipped	Yes
Raw fetch	Manual JSON.parse + validation	Manual tool loop	Manual SSE parsing	One at a time	N/A	0 KB	Yes
@cloudflare/ai-gateway SDK	Zod schemas	Tool calling	Streaming	Multi-provider via Gateway	N/A	Small	Yes

Verdict: The Vercel AI SDK is the best fit for TypeScript-first, edge-compatible applications. It gives you type-safe structured output without the weight of LangChain, and provider abstraction without the lock-in of the OpenAI SDK. The custom fetch option is the key feature that enables the proxy layer integration.

LLM Proxy / Gateway Solutions

Solution	Self-Hosted	Cost Tracking	Daily Limits	Function Attribution	Custom Cost Formulas	Thinking Token Support	Open Source
Custom Proxy (API Mom pattern)	Yes (your Worker)	Full control	Yes	Yes (X-Function header)	Yes	Yes	Your code
Cloudflare AI Gateway	Managed	Dashboard only	Rate limiting	No	No	Limited	No
LiteLLM	Yes	Per-model tracking	Yes	Limited	Via callbacks	Partial	Yes
Helicone	Yes (or hosted)	Per-request traces	Alerting	Via headers	Limited	Partial	Yes
Portkey	Hosted ($49/mo+)	Full traces	Budget caps	Via metadata	Via plugins	Yes	No
Direct API calls	N/A	None	None	None	None	None	N/A

Verdict: For maximum control and Cloudflare-native integration, a custom proxy Worker is ideal — you own the cost formula, the attribution schema, and the enforcement logic. For teams that don’t want to build infrastructure, Helicone or Portkey provide good observability out of the box. Cloudflare AI Gateway is a good middle ground for caching and fallbacks but lacks the attribution depth needed for multi-service cost control.

Full Architecture Comparison

Approach	State	LLM	Cost Control	Complexity	Monthly Cost (10K calls)
Three-Layer (this article)	Durable Objects	AI SDK	Custom proxy	Medium	~$15 infra + LLM costs
Monolithic Worker	KV / D1	Raw fetch	None	Low initially, high later	~$5 infra + unknown LLM
LangChain + VPS	Redis / Postgres	LangChain	LiteLLM proxy	High	~$50 server + LLM costs
AWS Bedrock Agents	Managed	Managed	CloudWatch	Medium	~$100+ (multi-layer billing)
OpenAI Agents + Vercel	Vercel KV	OpenAI SDK	OpenAI dashboard	Low-Medium	~$20 Vercel + LLM costs

Anti-Patterns

Don’t	Do Instead	Why
Put API keys in every worker’s `wrangler.jsonc`	Single proxy holds all API keys	One place to rotate, audit, and limit keys
Call `fetch("https://generativelanguage.googleapis.com/...")` directly	Use `createGoogleGenerativeAI()` from `@ai-sdk/google`	Type safety, retries, structured output, provider switching
Parse LLM JSON with `JSON.parse()` and hope	Use `generateObject()` with a Zod schema	Automatic validation, typed results, retry on schema failure
Price all output tokens at the same rate	Separate thinking tokens ($3.50/M) from output tokens ($0.60/M)	3-9x cost undercount if you don’t
Store conversation history in KV	Use `AIChatAgent` (auto-persists to SQLite)	KV can’t query, can’t paginate, can’t search
Create one giant “orchestrator” agent	Use multiple specialized agents communicating via Queues	Isolation, independent scaling, independent budgets
Run expensive LLM calls in crons without cost ceilings	Set daily spend limits in the proxy layer	A cron running every 15 minutes can burn $30 overnight
Trust internal cost tracking without reconciliation	Compare proxy totals against provider billing monthly	Internal tracking has bugs — the provider bill is the truth
Use `setInterval()` in Durable Objects	Use `this.schedule()` or alarm-based scheduling	`setInterval` prevents hibernation, costs money 24/7
Embed LLM calling logic in the agent class	Create a shared LLM harness module	DRY, consistent cost tracking, single place for pricing updates
Skip the proxy layer “because we only have one service”	Add the proxy from day one	You will have more services. Retrofitting cost tracking is 10x harder
Use `maxSteps: 100` for tool-calling agents	Start with `maxSteps: 5-10`, increase deliberately	Each step is a full LLM call that resends the growing message history, so a run that uses all 100 steps costs at least 100x a single call.
Retry forever on LLM failures	Use `maxRetries: 2-3` and handle `NoObjectGeneratedError`	Infinite retries = infinite cost
Let the AI SDK hit the provider directly from all environments	Route dev/staging through the same proxy with separate project IDs	Dev cost is invisible otherwise; it also shares rate limits
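
The thinking-token row above is easiest to see with numbers. The sketch below prices one call two ways, using this article's Gemini 2.5 Flash rates ($0.15 input, $0.60 output, $3.50 thinking per million tokens) and the usageMetadata field names from Example 2. The function names and the sample token counts are assumptions chosen for illustration.

```typescript
// USD per million tokens, from this article's Gemini 2.5 Flash figures
const PRICE_PER_M = { input: 0.15, output: 0.6, thinking: 3.5 };

// Field names follow the Gemini usageMetadata object
type Usage = {
  promptTokenCount: number;
  candidatesTokenCount: number;
  thoughtsTokenCount?: number;
};

// Prices thinking tokens at their own rate
function costUsd(usage: Usage): number {
  const thinking = usage.thoughtsTokenCount ?? 0;
  return (
    (usage.promptTokenCount * PRICE_PER_M.input +
      usage.candidatesTokenCount * PRICE_PER_M.output +
      thinking * PRICE_PER_M.thinking) /
    1_000_000
  );
}

// The undercounting mistake: thinking tokens billed at the plain output rate
function naiveCostUsd(usage: Usage): number {
  const thinking = usage.thoughtsTokenCount ?? 0;
  return (
    (usage.promptTokenCount * PRICE_PER_M.input +
      (usage.candidatesTokenCount + thinking) * PRICE_PER_M.output) /
    1_000_000
  );
}

// Sample call: 1,000 input, 500 output, 2,000 thinking tokens
const sample: Usage = {
  promptTokenCount: 1000,
  candidatesTokenCount: 500,
  thoughtsTokenCount: 2000,
};
const undercountRatio = costUsd(sample) / naiveCostUsd(sample);
```

For this sample the correct cost is $0.00745 against a naive $0.00165, an undercount of about 4.5x. The ratio grows as the share of thinking tokens in a call grows.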

When to Skip a Layer

You don’t always need all three layers. Here’s when each is optional:

Skip Layer 1 (Agent Runtime) When:

Stateless LLM calls: Your Worker receives a request, calls an LLM, returns the response. No memory, no scheduling, no WebSockets. A plain Worker with the AI SDK is fine.
Batch processing: You’re processing a queue of items through an LLM. The queue consumer doesn’t need to be a Durable Object — a regular Worker with queue() handler works.
Simple API endpoints: POST /api/summarize that takes text and returns a summary. No state to manage.

Use Layer 1 when you need: persistence across requests, real-time WebSocket connections, scheduling, or millions of independent agent instances.

Skip Layer 2 (LLM Interface) When:

No LLM calls: Your agent does deterministic work — monitors health, manages queues, aggregates metrics. Not every agent needs AI.
Workers AI only: If you’re using only Cloudflare Workers AI models, you can call them directly via env.AI.run(). The AI SDK adds value primarily for external providers.
Simple completions: If you just need env.AI.run("@cf/meta/llama-3.1-8b-instruct", { messages }) with no structured output, no tools, and no provider switching, the raw binding is simpler.

Use Layer 2 when you need: structured output (Zod schemas), tool calling, provider abstraction, or streaming to UI clients.

Skip Layer 3 (API Proxy) When:

Development only: You’re prototyping and haven’t committed to a production architecture yet. Direct API calls with hardcoded keys are fine for exploration.
Single service, single provider: You have one Worker making a few LLM calls per day with a hard-coded budget. The overhead of a proxy isn’t justified.
Cloudflare AI Gateway is sufficient: You need caching and rate limiting but not function-level cost attribution.

Use Layer 3 when you need: multi-service cost attribution, daily spend limits, centralized API key management, or when you’ve been surprised by a bill even once.
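
The three "use this layer when" lists above can be read as a simple decision rule: a layer is needed as soon as any one of its triggers applies. A sketch of that rule follows; the type, the flag names, and the function are hypothetical and only restate the lists in code.

```typescript
// Requirement flags mirror the "Use Layer N when you need" lists.
type Requirements = {
  // Layer 1 (Agent Runtime) triggers
  persistence?: boolean;
  webSockets?: boolean;
  scheduling?: boolean;
  manyInstances?: boolean;
  // Layer 2 (LLM Interface) triggers
  structuredOutput?: boolean;
  toolCalling?: boolean;
  providerAbstraction?: boolean;
  streamingToUi?: boolean;
  // Layer 3 (API Proxy) triggers
  multiServiceAttribution?: boolean;
  dailySpendLimits?: boolean;
  centralizedKeys?: boolean;
  surprisedByBill?: boolean;
};

type Layers = { runtime: boolean; llm: boolean; proxy: boolean };

function layersNeeded(r: Requirements): Layers {
  // A layer is needed if at least one of its triggers is set
  const anySet = (...flags: Array<boolean | undefined>) => flags.some(Boolean);
  return {
    runtime: anySet(r.persistence, r.webSockets, r.scheduling, r.manyInstances),
    llm: anySet(r.structuredOutput, r.toolCalling, r.providerAbstraction, r.streamingToUi),
    proxy: anySet(r.multiServiceAttribution, r.dailySpendLimits, r.centralizedKeys, r.surprisedByBill),
  };
}
```

For example, a stateless summarize endpoint that only needs structured output yields the LLM layer alone, which matches the plain Worker plus AI SDK case described above.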

The Progressive Adoption Path

Day 1:  Layer 2 only (AI SDK in a Worker)
        ↓
Week 2: Layer 1 + Layer 2 (AI SDK in a Durable Object agent)
        ↓
Month 1: All three layers (agent + AI SDK + proxy)

You don’t have to build all three on day one. Start with the AI SDK for type-safe LLM calls. Add the agent runtime when you need state. Add the proxy when you need cost control. Each layer snaps into place without requiring changes to the others.

Production Checklist

Before deploying agents with all three layers:

Layer 1 (Container)

Agent class extends Agent or AIChatAgent from agents package
initialState defined with sensible defaults
SQLite tables created in onStart() with IF NOT EXISTS
wrangler.jsonc has durable_objects bindings and migrations with new_sqlite_classes
routeAgentRequest() configured in the Worker’s fetch handler
No setInterval() or setTimeout() (use this.schedule() instead)
State updates use this.setState() for real-time sync
Long-term data stored in SQLite, not state

Layer 2 (Brain)

AI SDK provider created with create*() factory
All structured output uses generateObject() + Zod schema
Tool parameters and return types defined with Zod
maxSteps set deliberately (not unbounded)
maxRetries set (2-3 for structured output, 1 for streaming)
Thinking tokens extracted and priced separately
NoObjectGeneratedError caught and handled

Layer 3 (Wallet)

No API keys in any worker’s wrangler.jsonc (only in the proxy)
All LLM calls route through the proxy via custom fetch
Every proxied call includes X-Project-Id and X-Function
Daily spend limit configured per project
Workers handle 429 daily_limit_exceeded gracefully
Cost formula prices thinking tokens separately
Monthly reconciliation against provider billing dashboard
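
The checklist item about handling 429 daily_limit_exceeded deserves a concrete shape, because a budget stop and an ordinary rate limit call for different reactions. The sketch below classifies a proxy response from its status and body. The article names the status and the error code, but the JSON layout assumed here ({ "error": "daily_limit_exceeded" }) and the function name are hypothetical.

```typescript
type ProxyOutcome = "ok" | "budget_exhausted" | "rate_limited" | "error";

// Classify a proxy response without retrying blindly.
// Assumed error body: { "error": "daily_limit_exceeded" }
function classifyProxyResponse(status: number, bodyText: string): ProxyOutcome {
  if (status >= 200 && status < 300) return "ok";

  if (status === 429) {
    try {
      const body = JSON.parse(bodyText) as { error?: unknown };
      if (body.error === "daily_limit_exceeded") return "budget_exhausted";
    } catch {
      // Non-JSON or unexpected body: treat as an ordinary rate limit
    }
    return "rate_limited";
  }

  return "error";
}
```

A caller would typically back off and retry on rate_limited, but stop calling until the daily window resets on budget_exhausted, since retries cannot succeed and only add noise.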

Advanced: The Cloudflare AI Gateway Bridge

For teams using Cloudflare’s AI Gateway, you can place it between your custom proxy and the provider for a hybrid approach:

App → Agents SDK → AI SDK → Custom Proxy → CF AI Gateway → Provider
         (state)    (LLM)     (cost)        (cache/rate)    (API)

import { createAiGateway } from "ai-gateway-provider";
import { createGoogleGenerativeAI } from "ai-gateway-provider/providers/google";

function createFullStackModel(env: Env, functionName: string) {
  // Layer 3a: Cloudflare AI Gateway (caching, rate limiting, fallbacks)
  const aigateway = createAiGateway({
    binding: env.AI.gateway("production-gateway"),
    options: { cacheTtl: 300 },
  });

  // Layer 3b: Google provider (routes through AI Gateway)
  const google = createGoogleGenerativeAI({
    apiKey: env.GEMINI_API_KEY,
  });

  // Route the Google model through the AI Gateway. This snippet wires up
  // only the gateway leg: functionName is not used yet, and the cost
  // attribution headers (X-Project-Id, X-Function) still have to be applied
  // by the custom proxy layer.
  const gatewayModel = aigateway(google("gemini-2.5-flash"));

  // Division of labor in the hybrid setup:
  // - AI Gateway: caching and rate limiting
  // - Custom proxy: cost attribution and budget enforcement
  // - AI SDK: structured output and tool calling
  // - Agents SDK: state and lifecycle
  return gatewayModel;
}

This hybrid gives you:

AI Gateway: Response caching ($0 for repeated prompts), rate limiting, provider fallbacks, usage analytics
Custom proxy: Function-level cost attribution, daily spend limits, centralized key management
AI SDK: Structured output, tool calling, streaming, provider abstraction
Agents SDK: Durable state, WebSocket, scheduling, hibernation

Cost Reference

Quick reference for making model routing decisions:

Model	Input ($/M tokens)	Output ($/M tokens)	Thinking ($/M tokens)	Best For
Gemini 2.0 Flash	$0.10	$0.40	N/A	Classification, extraction, simple tasks
Gemini 2.5 Flash	$0.15	$0.60	$3.50	General purpose, tool calling, content
Gemini 2.5 Pro	$1.25	$10.00	$10.00	Complex reasoning, long context
Claude Sonnet 4	$3.00	$15.00	N/A	Code generation, nuanced analysis
Claude Haiku 3.5	$0.80	$4.00	N/A	Fast classification, extraction
GPT-4o	$2.50	$10.00	N/A	Multi-modal, general purpose
GPT-4o Mini	$0.15	$0.60	N/A	Budget-friendly general tasks
Workers AI (Llama 3)	Free daily allocation (Cloudflare)	Free daily allocation	N/A	Prototyping, non-critical tasks

Key insight: The cost difference between Gemini 2.0 Flash ($0.10/M input) and Claude Sonnet 4 ($3.00/M input) is 30x. If you’re routing every task through the same expensive model, you’re burning money on classification tasks that a cheap model handles just as well. Multi-provider routing isn’t a nice-to-have — it’s a cost optimization multiplier.
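
A routing table is one way to act on that 30x gap. The sketch below maps task kinds to models following the "Best For" column and carries the input prices from the table above. The task names, the mapping, and the function names are illustrative assumptions, not a recommendation for any particular workload.

```typescript
type TaskKind =
  | "classification"
  | "extraction"
  | "content"
  | "tool_calling"
  | "complex_reasoning"
  | "code_generation";

// inputPricePerM is USD per million input tokens, from the table above
type Route = { modelId: string; inputPricePerM: number };

const ROUTES: Record<TaskKind, Route> = {
  classification: { modelId: "gemini-2.0-flash", inputPricePerM: 0.1 },
  extraction: { modelId: "gemini-2.0-flash", inputPricePerM: 0.1 },
  content: { modelId: "gemini-2.5-flash", inputPricePerM: 0.15 },
  tool_calling: { modelId: "gemini-2.5-flash", inputPricePerM: 0.15 },
  complex_reasoning: { modelId: "gemini-2.5-pro", inputPricePerM: 1.25 },
  code_generation: { modelId: "claude-sonnet-4-20250514", inputPricePerM: 3.0 },
};

function routeModel(task: TaskKind): Route {
  return ROUTES[task];
}

// Input-side cost only; output and thinking tokens are priced separately
function inputCostUsd(task: TaskKind, inputTokens: number): number {
  return (inputTokens / 1_000_000) * ROUTES[task].inputPricePerM;
}
```

With this table, one million input tokens of classification work costs $0.10 on the routed model, against $3.00 if the same traffic were sent to the code-generation model.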

Summary

The three-layer architecture isn’t about adding complexity — it’s about preventing the complexity that inevitably emerges when state, LLM interaction, and cost control are tangled together.

Layer 1 (Container): The Cloudflare Agents SDK gives each agent a Durable Object with SQLite, WebSocket, and scheduling. Agents hibernate to $0 when idle, millions can run concurrently, and state persists across hibernation and restarts.
Layer 2 (Brain): The Vercel AI SDK provides generateObject() with Zod schemas, tool() with multi-step loops, streamText() for real-time UI, and provider abstraction to swap Google/Anthropic/OpenAI without changing your agent code.
Layer 3 (Wallet): A centralized API proxy tracks every LLM call with function-level cost attribution, enforces daily spend limits, manages API keys in one place, and prevents the $47 surprise bills that happen when services hold their own keys.

The layers compose through two narrow interfaces:

AIChatAgent.onChatMessage() calls AI SDK functions (Container → Brain)
AI SDK’s fetch option routes through the proxy (Brain → Wallet)

Start with Layer 2 on day one. Add Layer 1 when you need state. Add Layer 3 before your first production deployment. Each layer is independently useful and independently replaceable. That’s the whole point.

References

Cloudflare Agents SDK

Cloudflare Agents Documentation — Official docs covering getting started, API reference, and guides
Cloudflare Agents SDK on GitHub — Source code and examples for the agents package
agents on npm — The main SDK package (replaces deprecated agents-sdk)
Agents API Reference — Agent class methods, state management, scheduling, WebSocket
Chat Agents API Reference — AIChatAgent, message persistence, resumable streaming
Using AI Models with Agents — streamText, generateText, provider configuration
Agents Starter Kit — Full working example with tools and streaming chat
Building agents with OpenAI and Cloudflare’s Agents SDK — Blog post showing OpenAI integration patterns

Cloudflare Durable Objects

Durable Objects Documentation — Overview, concepts, and best practices
Durable Objects Pricing — Per-request, duration, hibernation, WebSocket billing
Durable Objects Alarms — At-least-once scheduling with exponential backoff
Durable Object Lifecycle — Hibernation conditions, idle behavior

Vercel AI SDK

AI SDK Documentation — Official docs for the ai npm package
AI SDK on GitHub — Source code for the AI Toolkit for TypeScript
Generating Structured Data — generateObject with Zod schemas
generateObject API Reference — Full options and return types
Intercepting Fetch Requests — Custom fetch for proxy routing (the Layer 2→3 bridge)
Google Generative AI Provider — createGoogleGenerativeAI options including custom fetch

Cloudflare AI Gateway

AI Gateway Overview — Caching, rate limiting, analytics for AI API calls
Vercel AI SDK Integration — Using AI Gateway with the Vercel AI SDK
ai-gateway-provider on GitHub — Community provider for AI SDK + AI Gateway
ai-gateway-provider (Cloudflare AI Gateway Provider) — AI SDK community provider documentation
@cloudflare/ai-gateway on npm — Official SDK for AI Gateway

LLM Proxy and Cost Tracking

LiteLLM — Open-source LLM gateway with 100+ provider support
Helicone — Open-source LLM observability platform with Rust-based performance
Portkey — AI gateway with full traces, budget caps, and governance
Top 5 LLM Gateways Comparison (Helicone) — Detailed comparison of gateway solutions

Alternative Agent Frameworks

OpenAI Agents SDK — Lightweight agent framework with handoffs and guardrails
LangChain — Comprehensive framework for LLM applications and agents
LangGraph — Stateful, multi-actor agent orchestration
CrewAI — Role-based multi-agent collaboration framework
AWS Bedrock Agents — Managed agent service with multi-layer billing
A Practical Guide to Building Agents (OpenAI) — OpenAI’s production agent guide

Framework Comparisons

LangChain vs Vercel AI SDK vs OpenAI SDK: 2026 Guide (Strapi) — Comprehensive comparison of agent frameworks
AI Framework Comparison (Komelin) — Vercel AI SDK, Mastra, LangChain, and Genkit
CrewAI vs LangGraph vs AutoGen vs OpenAgents — Multi-agent framework comparison
Building for Agentic AI — Agent SDKs & Design Patterns (GovTech) — Practical patterns for agent SDKs

Cloudflare Workers Platform

Workers Pricing — CPU time billing, free tier, paid plan details
Cloudflare Queues — At-least-once message delivery for event-driven architectures
Workers AI — Built-in AI models, no API key needed