Gary Wu

The Three-Layer AI Agent Architecture


Most production AI agents are a tangled mess — LLM calls mixed with state management, API keys scattered across services, cost tracking bolted on as an afterthought (if at all). The result: $47 in surprise Gemini bills, 9x cost undercounting, and zero visibility into what your agents are actually spending.

This article presents a clean separation: three composable layers that handle agent runtime, LLM interaction, and API cost control independently. Built on Cloudflare Workers, the Vercel AI SDK, and a centralized API proxy, this architecture lets you build agents that are stateful, intelligent, and financially observable — without coupling any of those concerns together.

What you’ll learn:



The Problem

Here is the typical evolution of an AI agent project:

Week 1: You write a Worker that calls the Gemini API with fetch(). It works. You parse the JSON response manually. You hardcode the API key as an environment variable. Cost tracking? “We’ll add that later.”

Week 3: You need structured output, so you write a Zod schema and validate the response yourself. Sometimes the LLM returns malformed JSON. You add retry logic. The retry logic has bugs. You’re now maintaining 200 lines of LLM interaction code that has nothing to do with your agent’s actual purpose.

Week 5: You need the agent to remember things between requests. You bolt on KV storage. Then you need WebSocket connections for real-time updates. Then scheduling. Then you realize KV doesn’t support SQL queries. You start wishing for a database. Your “simple agent” is now 1,500 lines of infrastructure code.

Week 8: Your Google Cloud bill arrives. $47 in 5 days. Your internal tracking shows $5. The 9x discrepancy? You forgot that Gemini’s thinking tokens cost $3.50/M — not the $0.60/M you used for regular output tokens. Six services hold their own API keys. Nobody knows which service spent what. You disable all cron jobs as an emergency measure.

This isn’t hypothetical. This is what happened across two production services in a real Cloudflare Workers deployment. The fix wasn’t “better monitoring” — it was architectural. Each concern (state, LLM interaction, cost control) needed its own layer with clear boundaries.

What changes if you get this right

| Concern | Tangled | Three-Layer |
| --- | --- | --- |
| Agent state | KV hacks, lost on redeploy | Durable Object with SQLite, survives everything |
| LLM calls | Raw fetch(), manual JSON parsing | generateObject() with Zod, automatic retries |
| Provider switching | Rewrite all call sites | Change one import |
| Cost tracking | “Check the Google dashboard” | Per-call attribution with function-level tags |
| Cost control | Hope for the best | Daily spend limits, tier enforcement, automatic 429s |
| Concurrent agents | One singleton, maybe | Millions of independent instances, zero cost when idle |
| Real-time updates | Polling | WebSocket with automatic state sync |

Architecture Overview

The three-layer architecture separates concerns along natural boundaries:

┌─────────────────────────────────────────────────────────────┐
│                     Your Application                        │
│  (Business logic, domain rules, what the agent actually does)│
└──────────────┬──────────────────────────────────────────────┘

┌──────────────▼──────────────────────────────────────────────┐
│  Layer 1: Agent Runtime (Container)                         │
│  Cloudflare Agents SDK  ·  npm: agents                      │
│                                                             │
│  Durable Object per agent instance                          │
│  Built-in SQLite  ·  State sync  ·  WebSocket               │
│  Scheduling  ·  Hibernation  ·  Lifecycle hooks              │
│  Cost when idle: $0                                          │
└──────────────┬──────────────────────────────────────────────┘

┌──────────────▼──────────────────────────────────────────────┐
│  Layer 2: LLM Interface (Brain)                             │
│  Vercel AI SDK  ·  npm: ai + @ai-sdk/google                 │
│                                                             │
│  streamText / generateText / generateObject                  │
│  Zod-validated structured output                             │
│  Tool calling with multi-step agent loops                    │
│  Provider abstraction (Google, Anthropic, OpenAI)            │
│  Thinking token tracking                                     │
└──────────────┬──────────────────────────────────────────────┘
               │  custom fetch()
┌──────────────▼──────────────────────────────────────────────┐
│  Layer 3: API Proxy / Metering (Wallet)                     │
│  Centralized HTTP proxy  ·  e.g., API Mom                   │
│                                                             │
│  Cost tracking with per-call attribution                     │
│  API key management (single source of truth)                 │
│  Daily spend limits  ·  Tier enforcement                     │
│  Caching  ·  Rate limiting  ·  Budget caps                   │
└──────────────┬──────────────────────────────────────────────┘

         ┌─────▼─────┐
         │  Provider  │  (Gemini, Claude, GPT, etc.)
         └───────────┘

Key insight: Each layer is independently useful and independently replaceable. You can use the Agents SDK without the AI SDK (non-LLM agents). You can use the AI SDK without the Agents SDK (stateless LLM calls). You can use the proxy layer with any HTTP client. But when composed together, they create a production-grade agent platform where state, intelligence, and cost control are all first-class concerns.

The TypeScript Shape

// The three layers, typed

// Layer 1: Agent Runtime — what the container looks like
interface AgentRuntime {
  // State
  state: AgentState;
  setState(newState: AgentState): void;
  sql: SqlStorage;  // Built-in SQLite

  // Communication
  broadcast(message: string): void;
  onConnect(connection: Connection): void;
  onMessage(connection: Connection, message: string): void;

  // Scheduling
  schedule(cron: string, callback: string): void;
  scheduleEvery(intervalMs: number): void;

  // Lifecycle
  onStart(): void;
  onStop(): void;
}

// Layer 2: LLM Interface — what the brain looks like
interface LLMInterface {
  // Text generation
  generateText(options: GenerateTextOptions): Promise<GenerateTextResult>;
  streamText(options: StreamTextOptions): StreamTextResult;

  // Structured output
  generateObject<T>(options: {
    model: LanguageModel;
    schema: ZodType<T>;
    prompt: string;
  }): Promise<{ object: T; usage: TokenUsage }>;

  // Tool calling
  tool(options: {
    description: string;
    parameters: ZodType;
    execute: (args: unknown) => Promise<string>;
  }): Tool;
}

// Layer 3: API Proxy — what the wallet looks like
interface APIProxy {
  // Proxied fetch
  fetch(url: string, options: RequestInit): Promise<Response>;

  // Cost tracking
  recordCall(params: {
    project: string;
    function: string;
    service: string;
    inputTokens: number;
    outputTokens: number;
    thinkingTokens: number;
    costUsd: number;
  }): Promise<void>;

  // Budget enforcement
  checkDailyLimit(project: string): Promise<{ allowed: boolean; spent: number; limit: number }>;
  checkTierPermission(project: string, estimatedCost: number): Promise<boolean>;
}

Layer 1: Agent Runtime (Container)

The agent runtime answers: where does the agent live, how does it persist, and how do clients talk to it?

The Cloudflare Agents SDK (npm install agents) gives you a TypeScript class that runs on a Durable Object — a stateful micro-server with its own SQLite database, WebSocket connections, and scheduling system. Each agent instance is isolated, horizontally scalable, and costs nothing when idle.

Core Concepts

The Agent Class

Every agent extends the Agent class, parameterized by your environment bindings and state shape:

import { Agent, type Connection } from "agents";

interface Env {
  AI: Ai;
  RESEARCH_QUEUE: Queue;
  DB: D1Database;
}

interface BrandAgentState {
  brandSlug: string;
  status: "idle" | "researching" | "generating" | "error";
  lastResearchCycle: string | null;
  pendingTasks: number;
  totalContentGenerated: number;
  costToday: number;
  costBudget: number;
}

export class BrandAgent extends Agent<Env, BrandAgentState> {
  initialState: BrandAgentState = {
    brandSlug: "",
    status: "idle",
    lastResearchCycle: null,
    pendingTasks: 0,
    totalContentGenerated: 0,
    costToday: 0,
    costBudget: 5.0,
  };

  async onStart() {
    // Runs when the agent starts or resumes from hibernation
    this.sql`CREATE TABLE IF NOT EXISTS audit_log (
      id INTEGER PRIMARY KEY AUTOINCREMENT,
      action TEXT NOT NULL,
      details TEXT,
      cost_usd REAL DEFAULT 0,
      created_at TEXT DEFAULT (datetime('now'))
    )`;
  }

  async onMessage(connection: Connection, message: string) {
    const { type, payload } = JSON.parse(message);

    switch (type) {
      case "audit":
        await this.runAudit(payload.brandSlug);
        break;
      case "status":
        connection.send(JSON.stringify({ type: "status", state: this.state }));
        break;
    }
  }

  private async runAudit(brandSlug: string) {
    this.setState({
      ...this.state,
      status: "researching",
      brandSlug,
    });

    // Agent logic goes here — the runtime handles everything else
    // State persists across restarts, deploys, and hibernation
    // WebSocket clients get state updates automatically
    // SQLite is always available for structured data

    this.setState({ ...this.state, status: "idle" });
  }
}

Routing

The Agents SDK routes HTTP and WebSocket requests to agent instances using a URL pattern:

import { routeAgentRequest } from "agents";

export default {
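  async fetch(request: Request, env: Env) {
    // Routes to /agents/:agent/:name by default
    // e.g., /agents/brand-agent/niche-fi → BrandAgent instance "niche-fi"
    // routeAgentRequest resolves to null when no agent matches, so fall back to a 404
    return (
      (await routeAgentRequest(request, env, { cors: true })) ??
      new Response("Not found", { status: 404 })
    );
  },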
};

Each unique instance name gets its own Durable Object. The instance niche-fi is completely isolated from llc-tax. They have separate SQLite databases, separate state, separate WebSocket connections. You can have millions of them.

Built-in SQLite

Every agent instance has embedded SQLite accessed via this.sql. This is not D1 — it’s SQLite running directly inside the Durable Object with zero network latency:

// Create tables on startup
this.sql`CREATE TABLE IF NOT EXISTS memories (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  type TEXT NOT NULL,
  content TEXT NOT NULL,
  embedding BLOB,
  created_at TEXT DEFAULT (datetime('now'))
)`;

// Query with tagged templates
const recent = [
  ...this.sql`SELECT * FROM memories
              WHERE type = ${"observation"}
              ORDER BY created_at DESC
              LIMIT 10`
];

// Insert
this.sql`INSERT INTO memories (type, content)
         VALUES (${"decision"}, ${JSON.stringify(decision)})`;

Key insight: The agent’s SQLite database is its long-term memory. State (this.state) is for real-time data that syncs to connected clients. SQLite is for everything else — conversation history, audit logs, embeddings, task queues. The distinction matters: state changes trigger WebSocket broadcasts, SQL writes don’t.
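
The split between broadcast state and silent storage can be modeled in a few lines. This is a toy sketch, not the Agents SDK: ToyAgent, subscribe, and appendLog are hypothetical names, and an in-memory array stands in for the SQLite database.

```typescript
// Toy model of the state/SQLite split. Illustrative only: this is not the
// Agents SDK, and ToyAgent, subscribe, and appendLog are hypothetical names.
type Listener<S> = (state: S) => void;

class ToyAgent<S> {
  state: S;
  private listeners: Listener<S>[] = [];
  private log: string[] = []; // stands in for rows in the agent's SQLite database

  constructor(initial: S) {
    this.state = initial;
  }

  // A connected client registers to receive state updates.
  subscribe(listener: Listener<S>): void {
    this.listeners.push(listener);
  }

  // State writes fan out to every subscriber, like a WebSocket broadcast.
  setState(next: S): void {
    this.state = next;
    for (const listener of this.listeners) listener(next);
  }

  // Durable writes are silent: nothing is pushed to clients.
  appendLog(entry: string): void {
    this.log.push(entry);
  }

  logSize(): number {
    return this.log.length;
  }
}

const toy = new ToyAgent({ status: "idle" });
let broadcasts = 0;
toy.subscribe(() => {
  broadcasts += 1;
});

toy.appendLog("audit started");          // no broadcast
toy.setState({ status: "researching" }); // one broadcast
toy.appendLog("audit finished");         // no broadcast
```

After these three calls, the single subscriber has been notified once while two log entries have been stored. High-frequency writes such as audit rows therefore belong in SQL rather than in state.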

Scheduling and Alarms

Agents can schedule their own work. The scheduling system uses Durable Object alarms under the hood — guaranteed at-least-once execution with automatic retries:

export class MonitoringAgent extends Agent<Env, MonitorState> {
  async onStart() {
    // Run every 4 hours
    this.schedule("0 */4 * * *", "runHealthCheck");
  }

  async runHealthCheck() {
    this.setState({ ...this.state, status: "checking" });

    const sites = [...this.sql`SELECT * FROM monitored_sites WHERE active = 1`];

    for (const site of sites) {
      const start = Date.now();  // Start the latency timer before the request
      const response = await fetch(site.url);
      this.sql`INSERT INTO health_checks (site_id, status, latency_ms)
               VALUES (${site.id}, ${response.status}, ${Date.now() - start})`;
    }

    this.setState({
      ...this.state,
      status: "idle",
      lastCheck: new Date().toISOString(),
    });
  }
}
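
Because execution is at-least-once, a scheduled handler may occasionally be delivered twice for the same tick. One way to make that harmless is to key each run by its time window and skip windows already handled. The sketch below is an assumption about how you might do this, not part of the SDK: windowKey and HealthCheckRunner are hypothetical, and the Set would be a SQLite table in a real agent.

```typescript
const WINDOW_MS = 4 * 60 * 60 * 1000; // matches the "every 4 hours" cron above

// Hypothetical helper: a stable key for the 4-hour window a timestamp falls in.
function windowKey(timestampMs: number): number {
  return Math.floor(timestampMs / WINDOW_MS);
}

class HealthCheckRunner {
  private completed = new Set<number>(); // in a real agent this would live in SQLite
  runs = 0;

  // Returns true if the check ran, false if this window was already handled.
  runOnce(timestampMs: number): boolean {
    const key = windowKey(timestampMs);
    if (this.completed.has(key)) return false;
    this.runs += 1; // the actual health check would go here
    this.completed.add(key);
    return true;
  }
}

const runner = new HealthCheckRunner();
const firedAt = Date.UTC(2026, 2, 15, 8, 0, 0);

const firstRun = runner.runOnce(firedAt);              // runs
const duplicate = runner.runOnce(firedAt + 1_000);     // duplicate delivery, skipped
const nextWindow = runner.runOnce(firedAt + WINDOW_MS); // next window, runs
```

The duplicate delivery is ignored, so the check body executes twice across the two windows rather than three times.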

Hibernation and Cost

This is the killer feature for running agents at scale. When a Durable Object has no active connections and no pending timers, it hibernates. Hibernated agents cost $0. They wake up instantly on the next request.

Pricing (Workers Paid plan):

This means you can run 100,000 brand agents. If only 50 are active at any given time, you pay for 50. The other 99,950 cost nothing. They wake up in milliseconds when needed.

The AIChatAgent Subclass

For conversational agents, the SDK provides AIChatAgent with built-in message persistence and resumable streaming:

import { AIChatAgent } from "agents/ai-chat-agent";
import { streamText, tool, type StreamTextOnFinishCallback, type ToolSet } from "ai";
import { createGoogleGenerativeAI } from "@ai-sdk/google";
import { z } from "zod";

export class ChatAgent extends AIChatAgent<Env, ChatState> {
  async onChatMessage(
    onFinish?: StreamTextOnFinishCallback<ToolSet>,
    options?: { abortSignal?: AbortSignal; body?: unknown }
  ) {
    const google = createGoogleGenerativeAI({ apiKey: this.env.GEMINI_API_KEY });

    const result = streamText({
      model: google("gemini-2.5-flash"),
      messages: this.messages,  // Auto-loaded from SQLite
      tools: {
        searchKnowledge: tool({
          description: "Search the agent's knowledge base",
          parameters: z.object({ query: z.string() }),
          execute: async ({ query }) => {
            const results = [...this.sql`
              SELECT content FROM memories
              WHERE content LIKE ${'%' + query + '%'}
              LIMIT 5
            `];
            return JSON.stringify(results);
          },
        }),
      },
      maxSteps: 5,
      onFinish,
      abortSignal: options?.abortSignal,
    });

    return result.toUIMessageStreamResponse();
  }
}

AIChatAgent handles:

Key insight: AIChatAgent is where Layer 1 (Container) and Layer 2 (Brain) naturally compose. The onChatMessage method is the seam. The agent runtime manages the conversation lifecycle. The AI SDK handles the LLM call. Neither knows about the other’s internals.

Wrangler Configuration

{
  "name": "brand-agents",
  "main": "src/index.ts",
  "compatibility_date": "2025-03-01",
  "durable_objects": {
    "bindings": [
      {
        "name": "BRAND_AGENT",
        "class_name": "BrandAgent"
      },
      {
        "name": "CHAT_AGENT",
        "class_name": "ChatAgent"
      }
    ]
  },
  "migrations": [
    {
      "tag": "v1",
      "new_sqlite_classes": ["BrandAgent", "ChatAgent"]
    }
  ]
}

Layer 2: LLM Interface (Brain)

The LLM interface answers: how do you talk to language models, validate their output, and let them call tools?

The Vercel AI SDK (npm install ai) is a TypeScript toolkit that abstracts LLM providers behind a common interface. It handles structured output with Zod schemas, multi-step tool calling, streaming, retries, and provider switching — all the things you’d otherwise build (badly) yourself.

Why Not Raw Fetch?

Here’s what calling Gemini looks like without the AI SDK:

// The "just use fetch" approach — DON'T DO THIS

async function generateContent(prompt: string, schema: object): Promise<unknown> {
  const response = await fetch(
    `https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent`,
    {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "x-goog-api-key": env.GEMINI_API_KEY,
      },
      body: JSON.stringify({
        contents: [{ parts: [{ text: prompt }] }],
        generationConfig: {
          responseMimeType: "application/json",
          responseSchema: schema, // Not Zod — raw JSON Schema
        },
      }),
    }
  );

  if (!response.ok) {
    // What kind of error? Rate limit? Auth? Bad request? Model overloaded?
    // You have to parse the error body to find out
    const error = await response.json();
    throw new Error(`Gemini error: ${error.error?.message}`);
  }

  const data = await response.json();
  const text = data.candidates?.[0]?.content?.parts?.[0]?.text;

  if (!text) {
    throw new Error("No content in response");
  }

  // Parse JSON — but what if it's malformed?
  let parsed;
  try {
    parsed = JSON.parse(text);
  } catch {
    // Retry? With what backoff? How many times?
    throw new Error(`Invalid JSON from LLM: ${text.slice(0, 200)}`);
  }

  // Validate against schema — but you're using JSON Schema, not Zod
  // No type inference, no runtime validation, no error messages
  // You just... hope it's right

  // Token counting? Thinking tokens?
  const usage = data.usageMetadata;
  // promptTokenCount, candidatesTokenCount, totalTokenCount
  // But where is thoughtsTokenCount? The field name differs across API versions
  // Also: are thinking tokens included in candidatesTokenCount or separate?

  return parsed;
}

That’s 50 lines for a single call, no retries, no streaming, no tool calling, no type safety. Now multiply by every LLM call in your system.

Core Concepts

Provider Abstraction

The AI SDK defines a common LanguageModel interface. You create a model instance from any provider and pass it to the same functions:

import { generateText } from "ai";
import { createGoogleGenerativeAI } from "@ai-sdk/google";
import { createAnthropic } from "@ai-sdk/anthropic";
import { createOpenAI } from "@ai-sdk/openai";

// Same function, different providers
const google = createGoogleGenerativeAI({ apiKey: env.GEMINI_API_KEY });
const anthropic = createAnthropic({ apiKey: env.ANTHROPIC_API_KEY });
const openai = createOpenAI({ apiKey: env.OPENAI_API_KEY });

// Swap providers by changing one line
const { text } = await generateText({
  model: google("gemini-2.5-flash"),
  // model: anthropic("claude-sonnet-4-20250514"),
  // model: openai("gpt-4o"),
  prompt: "Analyze this brand's SEO performance...",
});

Key insight: Provider abstraction isn’t just about convenience — it’s about cost optimization. When Gemini Flash costs $0.15/M input tokens and Claude Sonnet costs $3/M, being able to route different tasks to different providers based on complexity is the difference between a $50/month and $500/month agent system.
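
A rough sketch of that routing idea, using only the two input prices quoted above. The routing rule, the 100M-token workload, and the 10% share of complex tasks are all assumptions for illustration, and pickModel and inputCostUsd are hypothetical helpers.

```typescript
type Complexity = "simple" | "complex";

// Input prices in USD per million tokens, taken from the figures quoted above.
const INPUT_PRICE_PER_M: Record<string, number> = {
  "gemini-2.5-flash": 0.15,
  "claude-sonnet": 3.0,
};

// Hypothetical routing rule: cheap model by default, expensive model only when needed.
function pickModel(complexity: Complexity): string {
  return complexity === "complex" ? "claude-sonnet" : "gemini-2.5-flash";
}

function inputCostUsd(model: string, inputTokens: number): number {
  const price = INPUT_PRICE_PER_M[model];
  if (price === undefined) throw new Error(`No price configured for ${model}`);
  return (inputTokens / 1_000_000) * price;
}

// Assumed workload: 100M input tokens a month, 10% of it complex.
const monthlyTokens = 100_000_000;
const allSonnet = inputCostUsd("claude-sonnet", monthlyTokens);
const routed =
  inputCostUsd(pickModel("simple"), monthlyTokens * 0.9) +
  inputCostUsd(pickModel("complex"), monthlyTokens * 0.1);

console.log(`All Sonnet: $${allSonnet.toFixed(2)}, routed: $${routed.toFixed(2)}`);
```

Under these assumptions, input cost drops from about $300 to about $43.50 a month. Output and thinking tokens are ignored here and would need the same treatment.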

Structured Output with Zod

This is the AI SDK’s strongest feature. Define a Zod schema, get a validated, typed object back:

import { generateObject } from "ai";
import { z } from "zod";

const BrandAuditSchema = z.object({
  overallScore: z.number().min(0).max(100).describe("Overall brand health 0-100"),
  categories: z.array(z.object({
    name: z.enum(["seo", "content", "design", "performance"]),
    score: z.number().min(0).max(100),
    issues: z.array(z.object({
      severity: z.enum(["critical", "warning", "info"]),
      description: z.string(),
      recommendation: z.string(),
    })),
  })),
  summary: z.string().describe("2-3 sentence summary of findings"),
  topPriority: z.string().describe("Single most important thing to fix"),
});

type BrandAudit = z.infer<typeof BrandAuditSchema>;

async function auditBrand(siteData: string): Promise<BrandAudit> {
  const { object, usage } = await generateObject({
    model: google("gemini-2.5-flash"),
    schema: BrandAuditSchema,
    prompt: `Audit this brand's website and provide a structured assessment:\n\n${siteData}`,
    temperature: 0.3,
    maxRetries: 3,  // Retries failed API calls (rate limits, transient provider errors)
  });

  // object is fully typed as BrandAudit
  // No JSON.parse, no manual validation, no hope-based programming
  return object;
}

The AI SDK handles:

Tool Calling and Agent Loops

The AI SDK supports multi-step tool calling where the LLM can invoke tools, observe results, and decide what to do next:

import { generateText, tool } from "ai";
import { z } from "zod";

const result = await generateText({
  model: google("gemini-2.5-flash"),
  system: "You are a brand research agent. Use the available tools to gather data, then synthesize your findings.",
  prompt: "Research the competitive landscape for 'bank statement to Excel converter' tools.",
  tools: {
    searchWeb: tool({
      description: "Search the web for information",
      parameters: z.object({
        query: z.string().describe("Search query"),
      }),
      execute: async ({ query }) => {
        const results = await fetch(`${env.API_MOM_URL}/v1/brave/search?q=${encodeURIComponent(query)}`, {
          headers: { "X-Project-Id": "brand-agent", "X-Api-Key": env.API_MOM_KEY },
        });
        return await results.text();
      },
    }),
    analyzeSerp: tool({
      description: "Analyze a SERP to identify competitors and their positioning",
      parameters: z.object({
        serpData: z.string().describe("Raw SERP results to analyze"),
      }),
      execute: async ({ serpData }) => {
        // Could be another LLM call, or custom logic
        return `Analysis: Found 8 competitors. Top 3: ...`;
      },
    }),
    saveFinding: tool({
      description: "Save a research finding to the knowledge base",
      parameters: z.object({
        category: z.string(),
        finding: z.string(),
        confidence: z.number().min(0).max(1),
      }),
      execute: async ({ category, finding, confidence }) => {
        // Save to agent's SQLite (via closure over the agent instance)
        agent.sql`INSERT INTO findings (category, finding, confidence)
                  VALUES (${category}, ${finding}, ${confidence})`;
        return "Saved.";
      },
    }),
  },
  maxSteps: 10,  // LLM can call tools up to 10 times before final response
  temperature: 0.3,
});

console.log(result.text);           // Final synthesized response
console.log(result.steps.length);   // How many tool-call rounds
console.log(result.usage);          // Total tokens across all steps

The maxSteps parameter controls how many rounds the LLM can make. Each round: the LLM produces text and/or tool calls → tools execute → results go back to the LLM → repeat until the LLM produces text without tool calls (or hits the step limit).
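
The loop described above can be sketched with a scripted stand-in for the model. This is not the AI SDK's implementation: ModelTurn, FakeModel, and runLoop are hypothetical, and everything is synchronous for brevity, whereas real model and tool calls are asynchronous.

```typescript
// Simplified, synchronous model of a multi-step tool loop.
// Not the AI SDK's implementation: ModelTurn, FakeModel, and runLoop are hypothetical.
type ToolCall = { tool: string; input: string };
type ModelTurn = { text: string; toolCalls: ToolCall[] };
type FakeModel = (history: string[]) => ModelTurn;
type Tools = Record<string, (input: string) => string>;

function runLoop(model: FakeModel, tools: Tools, prompt: string, maxSteps: number) {
  const history: string[] = [prompt];
  let steps = 0;
  let text = "";

  while (steps < maxSteps) {
    steps += 1;
    const turn = model(history);
    text = turn.text;

    // No tool calls means the model has produced its final answer.
    if (turn.toolCalls.length === 0) break;

    // Execute each requested tool and feed the result back to the model.
    for (const call of turn.toolCalls) {
      const run = tools[call.tool];
      const output = run ? run(call.input) : `Unknown tool: ${call.tool}`;
      history.push(`${call.tool} -> ${output}`);
    }
  }
  return { text, steps, history };
}

// A scripted model: search once, then answer.
const scripted: FakeModel = (history) =>
  history.length === 1
    ? { text: "", toolCalls: [{ tool: "searchWeb", input: "converters" }] }
    : { text: `Done after seeing: ${history[1]}`, toolCalls: [] };

const searchTools: Tools = { searchWeb: (q) => `3 results for ${q}` };
const outcome = runLoop(scripted, searchTools, "Research", 10);
const capped = runLoop(scripted, searchTools, "Research", 1);
```

With a limit of 10 the scripted model finishes in two steps. With a limit of 1 the loop stops after the tool round and returns no final text, which is worth checking for when you set a low step limit.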

Streaming

For real-time UIs, streamText sends tokens as they’re generated:

import { streamText } from "ai";

const result = streamText({
  model: google("gemini-2.5-flash"),
  messages: conversationHistory,
  tools: { /* ... */ },
  maxSteps: 5,
});

// In a Cloudflare Worker, return as a streaming response:
return result.toTextStreamResponse();

// Or, in an AIChatAgent, return a UI message stream instead:
// return result.toUIMessageStreamResponse();

Thinking Token Tracking

Reasoning models such as Gemini 2.5 Flash and Claude emit “thinking” tokens alongside regular output, and some providers bill them at a much higher rate than regular output tokens. The AI SDK exposes this metadata:

const result = await generateText({
  model: google("gemini-2.5-flash"),
  prompt: "Complex reasoning task...",
  providerOptions: {
    google: {
      thinkingConfig: { thinkingBudget: 2048 },
    },
  },
});

// Thinking tokens are in providerMetadata
const thinkingTokens =
  result.providerMetadata?.google?.usageMetadata?.thoughtsTokenCount ?? 0;

// Or in AI SDK v6+
const thinkingTokensV6 = result.usage?.reasoningTokens ?? 0;

console.log(`Input: ${result.usage.inputTokens}`);
console.log(`Output: ${result.usage.outputTokens}`);
console.log(`Thinking: ${thinkingTokens}`);  // These cost $3.50/M on Flash!

Key insight: Thinking tokens are the silent budget killer. On Gemini 2.5 Flash, thinking tokens cost $3.50/M — almost 6x the regular output token price of $0.60/M. If you price all output tokens at $0.60/M, your cost tracking will undercount by 3-9x. This is exactly what caused the $47 surprise bill. The AI SDK exposes thinking token counts, but you have to actually read them and price them correctly.
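
To see how the undercount arises, here is a worked example using the Flash prices quoted above. The token counts are invented for illustration; the two "mistakes" are assumptions about how a tracker might go wrong, not a reconstruction of the original bug.

```typescript
// Per-token prices for Gemini 2.5 Flash, using the figures quoted in this article.
const INPUT_RATE = 0.15 / 1_000_000;
const OUTPUT_RATE = 0.6 / 1_000_000;
const THINKING_RATE = 3.5 / 1_000_000;

// Hypothetical call: the token counts are made up for illustration.
const sampleCall = { inputTokens: 10_000, textTokens: 1_000, thinkingTokens: 5_000 };

// Correct: each token class at its own rate.
const actualCost =
  sampleCall.inputTokens * INPUT_RATE +
  sampleCall.textTokens * OUTPUT_RATE +
  sampleCall.thinkingTokens * THINKING_RATE;

// Mistake A: thinking tokens are counted, but priced as regular output.
const flatRateCost =
  sampleCall.inputTokens * INPUT_RATE +
  (sampleCall.textTokens + sampleCall.thinkingTokens) * OUTPUT_RATE;

// Mistake B: thinking tokens are never read, so they are not priced at all.
const ignoredCost =
  sampleCall.inputTokens * INPUT_RATE + sampleCall.textTokens * OUTPUT_RATE;

console.log(`Mistake A undercounts by ${(actualCost / flatRateCost).toFixed(1)}x`);
console.log(`Mistake B undercounts by ${(actualCost / ignoredCost).toFixed(1)}x`);
```

For this token mix the true cost is $0.0196. Mistake A records $0.0051 (about 3.8x too low) and mistake B records $0.0021 (about 9.3x too low). The exact factor depends on how thinking-heavy the call is.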

The LLM Harness Pattern

In production, you don’t call generateObject directly everywhere. You wrap it in a harness that handles cost calculation, logging, and ledger recording:

interface LLMConfig {
  apiKey: string;
  model?: string;
  temperature?: number;
  maxTokens?: number;
  maxRetries?: number;
  thinkingBudget?: number;
  db?: D1Database;          // For cost ledger recording
  contextType?: string;     // "pipeline" | "api" | "cron"
  contextId?: string;       // "brand-audit-niche-fi"
}

interface LLMResult<T> {
  data: T;
  usage: {
    inputTokens: number;
    outputTokens: number;
    thinkingTokens: number;
    totalTokens: number;
  };
  costUsd: number;
  finishReason: string;
  durationMs: number;
  warnings: string[];
}

const PRICING: Record<string, { input: number; output: number; thinking: number }> = {
  "gemini-2.5-flash": {
    input: 0.15 / 1_000_000,
    output: 0.6 / 1_000_000,
    thinking: 3.5 / 1_000_000,
  },
  "gemini-2.5-pro": {
    input: 1.25 / 1_000_000,
    output: 10.0 / 1_000_000,
    thinking: 10.0 / 1_000_000,
  },
  "gemini-2.0-flash": {
    input: 0.1 / 1_000_000,
    output: 0.4 / 1_000_000,
    thinking: 0,
  },
};

function calculateCost(
  model: string,
  inputTokens: number,
  outputTokens: number,
  thinkingTokens: number = 0,
): number {
  const pricing = PRICING[model] ?? PRICING["gemini-2.5-flash"];
  const textOutputTokens = Math.max(0, outputTokens - thinkingTokens);
  return (
    inputTokens * pricing.input +
    textOutputTokens * pricing.output +
    thinkingTokens * pricing.thinking
  );
}

function extractThinkingTokens(result: any): number {
  // Vercel AI SDK: providerMetadata.google.usageMetadata.thoughtsTokenCount
  const fromProvider = result.providerMetadata?.google?.usageMetadata?.thoughtsTokenCount;
  if (fromProvider != null) return fromProvider;

  // AI SDK v6: usage.reasoningTokens
  const fromUsage = result.usage?.reasoningTokens;
  if (fromUsage != null) return fromUsage;

  return 0;
}

async function llmObject<T>(
  config: LLMConfig,
  schema: z.ZodType<T>,
  prompt: string,
  system?: string,
  label = "llmObject",
): Promise<LLMResult<T>> {
  const model = createModel(config);
  const start = Date.now();

  const result = await generateObject({
    model,
    schema,
    prompt,
    system,
    temperature: config.temperature ?? 0.3,
    maxOutputTokens: config.maxTokens ?? 8192,
    maxRetries: config.maxRetries ?? 3,
  });

  const inputTokens = result.usage?.inputTokens ?? 0;
  const outputTokens = result.usage?.outputTokens ?? 0;
  const thinkingTokens = extractThinkingTokens(result);
  const costUsd = calculateCost(config.model ?? "gemini-2.5-flash", inputTokens, outputTokens, thinkingTokens);
  const durationMs = Date.now() - start;

  console.log(
    `[llm:${label}] OK in ${durationMs}ms | tokens: ${inputTokens} in / ${outputTokens} out (${thinkingTokens} thinking) | cost: $${costUsd.toFixed(4)}`
  );

  return {
    data: result.object,
    usage: { inputTokens, outputTokens, thinkingTokens, totalTokens: inputTokens + outputTokens },
    costUsd,
    finishReason: result.finishReason,
    durationMs,
    warnings: [],
  };
}

This pattern creates a clean boundary. Your business logic calls llmObject(config, schema, prompt) and gets back typed data plus cost information. It never touches fetch, never parses JSON, never worries about retries.


Layer 3: API Proxy / Metering (Wallet)

The API proxy layer answers: who spent how much, on what, and should they be allowed to?

This is the layer most teams skip. They put API keys in environment variables, call providers directly, and discover the cost when the bill arrives. The proxy layer makes cost a first-class concern — tracked per call, attributed to a function, enforced with daily limits.

Why a Centralized Proxy?

Consider a typical multi-service architecture:

Without proxy:
  Service A  ──[GEMINI_API_KEY_A]──→  Google
  Service B  ──[GEMINI_API_KEY_B]──→  Google
  Service C  ──[GEMINI_API_KEY_C]──→  Google
  Cron Job   ──[GEMINI_API_KEY_D]──→  Google

  Cost visibility: Check Google Dashboard → $47 total → ??? per service

With proxy:
  Service A  ──[X-Project-Id: svc-a]──→  API Proxy  ──[GEMINI_API_KEY]──→  Google
  Service B  ──[X-Project-Id: svc-b]──→  API Proxy
  Service C  ──[X-Project-Id: svc-c]──→  API Proxy
  Cron Job   ──[X-Project-Id: cron ]──→  API Proxy

  Cost visibility: GET /v1/costs?period=day
  → svc-a: $12.40 (article-write: $8, design-audit: $4.40)
  → svc-b: $3.20 (keyword-research: $3.20)
  → cron:  $31.40 (batch-generate: $31.40) ← PROBLEM IDENTIFIED

The $47 Cost Disaster (A Real Story)

Here’s what happened with zero proxy layer:

  1. pages-plus held its own GEMINI_API_KEY. It ran 9 cron jobs generating blog posts. Each post made 3 Gemini calls (outline + draft + edit). In 5 days: 193 posts × 3 calls × ~$0.06/call = $37.

  2. aso-mrr also held its own GEMINI_API_KEY. Its internal tracking showed $1.14 spent. Google billed $10.19. The 9x discrepancy: thinking tokens. The cost formula priced all output tokens at $0.60/M. But Gemini 2.5 Flash’s thinking tokens cost $3.50/M — almost 6x more. A call that “cost $0.02” actually cost $0.12.

  3. Combined: $47 in 5 days = $282/month run rate. Zero alerts. Zero dashboards. Discovered only by checking the Google Cloud billing console manually.
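
The figures in this story can be cross-checked with a few lines of arithmetic. All inputs below are the numbers reported above; only the variable names are new.

```typescript
// Inputs as reported in the story above (USD).
const postsGenerated = 193;
const callsPerPost = 3;
const pagesPlusBilled = 37;
const asoMrrBilled = 10.19;
const asoMrrTracked = 1.14;
const billingDays = 5;

const pagesPlusCalls = postsGenerated * callsPerPost;   // total Gemini calls
const costPerCall = pagesPlusBilled / pagesPlusCalls;   // implied average per call
const combinedBill = pagesPlusBilled + asoMrrBilled;    // both services together
const undercountFactor = asoMrrBilled / asoMrrTracked;  // billed vs. tracked
const monthlyRunRate = (Math.round(combinedBill) / billingDays) * 30;

console.log({ pagesPlusCalls, costPerCall, combinedBill, undercountFactor, monthlyRunRate });
```

This gives 579 calls at roughly $0.064 each, a combined bill of $47.19, an undercount of about 8.9x for aso-mrr, and a $282 monthly run rate.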

The fix: delete all API keys from individual services. Route everything through a centralized proxy. Track every call with function-level attribution. Enforce daily spend limits. Never enable a cron job until metering is verified.
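
On the calling side, the fix means each service holds only a proxy credential and attaches attribution headers to every request. A minimal sketch follows. The header names are the ones used in this article, while ProxyClientConfig, buildProxiedRequest, the example URL, and the request path are hypothetical.

```typescript
// Hypothetical client-side helper. The header names come from this article;
// ProxyClientConfig and buildProxiedRequest are illustrative.
interface ProxyClientConfig {
  proxyBaseUrl: string; // the proxy deployment URL
  projectId: string;
  proxyKey: string;     // credential for the proxy, not a provider key
}

interface OutgoingRequest {
  url: string;
  headers: Record<string, string>;
}

function buildProxiedRequest(
  config: ProxyClientConfig,
  path: string,
  functionName: string,
  tags: string[] = [],
): OutgoingRequest {
  // Normalize slashes so base and path join cleanly.
  const base = config.proxyBaseUrl.replace(/\/+$/, "");
  const suffix = path.startsWith("/") ? path : `/${path}`;
  return {
    url: `${base}${suffix}`,
    headers: {
      "Content-Type": "application/json",
      "X-Project-Id": config.projectId,
      "X-Api-Key": config.proxyKey,
      "X-Function": functionName,
      "X-Tags": JSON.stringify(tags),
    },
  };
}

const proxied = buildProxiedRequest(
  { proxyBaseUrl: "https://proxy.example.com/", projectId: "pages-plus", proxyKey: "proxy-secret" },
  "v1/gemini/generate", // hypothetical path
  "article-write",
  ["trigger:cron"],
);
```

No provider key appears anywhere in the calling service, so revoking or rotating the Gemini key becomes a change in one place.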

Architecture of the Proxy Layer

// Simplified API proxy (e.g., "API Mom")
// In production, this is its own Cloudflare Worker with a D1 database

interface ProxyConfig {
  projects: Map<string, {
    name: string;
    dailyLimitUsd: number;
    tierPermission: 1 | 2 | 3;  // Max cost tier allowed
    apiKeys: Map<string, string>;  // service → key
  }>;
}

// Tier definitions
// Tier 1 (< $0.01/call): Flash models, cache reads, embeddings
// Tier 2 ($0.01 - $0.10/call): Thinking models, image gen
// Tier 3 (> $0.10/call): Pro models, long-context, multi-step

async function handleProxiedRequest(
  request: Request,
  config: ProxyConfig,
  db: D1Database,
): Promise<Response> {
  const projectId = request.headers.get("X-Project-Id");
  const functionName = request.headers.get("X-Function") ?? "unknown";
  const tags = request.headers.get("X-Tags");

  // 1. Authenticate (a missing header is treated the same as an unknown project)
  const project = projectId ? config.projects.get(projectId) : undefined;
  if (!projectId || !project) return new Response("Unknown project", { status: 403 });

  // 2. Check daily limit
  const todaySpend = await getDailySpend(projectId, db);
  if (todaySpend >= project.dailyLimitUsd) {
    return Response.json(
      { error: "daily_limit_exceeded", spend: todaySpend, limit: project.dailyLimitUsd },
      { status: 429 }
    );
  }

  // 3. Proxy the request to the real provider
  const url = new URL(request.url);
  const targetUrl = mapToProviderUrl(url.pathname);
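  const apiKey = project.apiKeys.get(getServiceFromPath(url.pathname));
  if (!apiKey) return new Response("No API key configured for service", { status: 500 });

  const start = Date.now();
  const response = await fetch(targetUrl, {
    method: request.method,
    headers: {
      // Forward only what the provider needs. The caller's proxy credentials
      // and attribution headers must not leak upstream.
      "Content-Type": request.headers.get("Content-Type") ?? "application/json",
      "x-goog-api-key": apiKey,  // Inject the real API key
    },
    body: request.body,
  });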

  // 4. Parse response for token usage
  const responseBody = await response.json();
  const usage = extractUsage(responseBody);
  const costUsd = calculateCost(usage);
  const durationMs = Date.now() - start;

  // 5. Record to ledger first. The provider has already billed this call,
  //    so it must be tracked even if the tier check below rejects it.
  await db.prepare(`
    INSERT INTO api_calls_ledger
      (project_id, function, service, cost_usd,
       input_tokens, output_tokens, thinking_tokens,
       duration_ms, tags, status)
    VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, 'ok')
  `).bind(
    projectId, functionName, "gemini", costUsd,
    usage.inputTokens, usage.outputTokens, usage.thinkingTokens,
    durationMs, tags
  ).run();

  // 6. Check tier permission (after the fact: the response is withheld, the spend is not)
  if (costUsd > tierThreshold(project.tierPermission)) {
    return Response.json(
      { error: "cost_tier_exceeded", estimatedCost: costUsd, maxTier: project.tierPermission },
      { status: 403 }
    );
  }

  // 7. Return the response (re-create since we consumed the body)
  return Response.json(responseBody, {
    status: response.status,
    headers: {
      "X-Cost-Usd": costUsd.toFixed(6),
      "X-Daily-Spend": (todaySpend + costUsd).toFixed(4),
      "X-Daily-Limit": project.dailyLimitUsd.toString(),
    },
  });
}
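The handler calls `mapToProviderUrl` and `getServiceFromPath` without showing them. Here is a minimal sketch of how they might work, assuming proxy paths of the form `/v1/<service>/<rest>`. The `PROVIDER_BASE_URLS` table is an illustrative assumption, not the proxy's actual routing table.

```typescript
// Hypothetical routing table: the first path segment after /v1/ selects the upstream API.
const PROVIDER_BASE_URLS: Record<string, string> = {
  google: "https://generativelanguage.googleapis.com/v1beta",
  anthropic: "https://api.anthropic.com/v1",
  openai: "https://api.openai.com/v1",
};

// "/v1/google/models/x:generateContent" -> "google"
function getServiceFromPath(pathname: string): string {
  const match = pathname.match(/^\/v1\/([^/]+)(\/|$)/);
  if (!match) throw new Error(`Unrecognized proxy path: ${pathname}`);
  return match[1];
}

// Swap the proxy prefix for the provider's base URL and keep the rest of the path.
function mapToProviderUrl(pathname: string): string {
  const service = getServiceFromPath(pathname);
  const base = PROVIDER_BASE_URLS[service];
  if (!base) throw new Error(`No upstream configured for service: ${service}`);
  const rest = pathname.slice(`/v1/${service}`.length);
  return `${base}${rest}`;
}
```

Because the handler passes only `url.pathname`, a query string (as in the Brave search call later in this article) would have to be appended separately from `url.search`.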

Cost Attribution Levels

The proxy tracks costs at multiple levels of granularity:

// The ledger schema
interface ApiCallLedger {
  id: number;
  timestamp: string;
  project_id: string;        // "pages-plus", "scalable-media"
  function: string;           // "article-write", "brand-audit", "keyword-research"
  service: string;            // "gemini", "brave", "perplexity"
  endpoint: string;           // "gemini-2.5-flash", "gemini-2.5-pro"
  cost_usd: number;
  input_tokens: number;
  output_tokens: number;
  thinking_tokens: number;    // Separate column — never lump with output
  cache_read_tokens: number;
  duration_ms: number;
  tags: string;               // JSON: ["brand:niche-fi", "trigger:cron", "batch:2026-03-15"]
  status: "ok" | "error";
  error_message: string | null;
}

The function field is critical. Without it, you know “pages-plus spent $37” but not “the article-write function spent $32 and the design-audit function spent $5.” Function-level attribution is what turns cost data into actionable intelligence.
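The separate `thinking_tokens` column matters for the same reason: thinking tokens are priced differently from regular output. Below is a small sketch of the arithmetic, using the Gemini 2.5 Flash rates quoted in this article ($0.15/M input, $0.60/M output, $3.50/M thinking). The `TokenUsage` and `Pricing` shapes and both function names are illustrative assumptions, and the rates should be checked against current pricing.

```typescript
interface TokenUsage { inputTokens: number; outputTokens: number; thinkingTokens: number; }
interface Pricing { inputPerM: number; outputPerM: number; thinkingPerM: number; }

// Rates as quoted in this article (USD per million tokens); treat them as assumptions.
const FLASH_PRICING: Pricing = { inputPerM: 0.15, outputPerM: 0.6, thinkingPerM: 3.5 };

// Correct: thinking tokens are billed at their own rate.
function costWithThinking(u: TokenUsage, p: Pricing): number {
  return (
    u.inputTokens * p.inputPerM +
    u.outputTokens * p.outputPerM +
    u.thinkingTokens * p.thinkingPerM
  ) / 1_000_000;
}

// The mistake: lumping thinking tokens in with output tokens at the output rate.
function costLumped(u: TokenUsage, p: Pricing): number {
  return (
    u.inputTokens * p.inputPerM +
    (u.outputTokens + u.thinkingTokens) * p.outputPerM
  ) / 1_000_000;
}

const sample: TokenUsage = { inputTokens: 1_000_000, outputTokens: 200_000, thinkingTokens: 412_000 };
const actual = costWithThinking(sample, FLASH_PRICING);
const undercounted = costLumped(sample, FLASH_PRICING);
console.log(`actual $${actual.toFixed(4)} vs lumped $${undercounted.toFixed(4)}`);
```

The gap between the two numbers grows with the share of thinking tokens, which is why the ledger stores them in their own column.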

Querying Costs

// GET /v1/costs?period=day&project=pages-plus
const costs = await db.prepare(`
  SELECT
    function,
    service,
    COUNT(*) as calls,
    SUM(cost_usd) as total_cost,
    SUM(input_tokens) as total_input,
    SUM(output_tokens) as total_output,
    SUM(thinking_tokens) as total_thinking,
    AVG(duration_ms) as avg_duration
  FROM api_calls_ledger
  WHERE project_id = ?
    AND timestamp >= datetime('now', '-1 day')
  GROUP BY function, service
  ORDER BY total_cost DESC
`).bind("pages-plus").all();

// Result:
// function          | service | calls | total_cost | total_thinking
// article-write     | gemini  | 579   | $32.40     | 2,847,000
// design-audit      | gemini  | 38    | $4.60      | 412,000
// anchor-gen        | gemini  | 12    | $0.40      | 0
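The query above hardcodes a one-day window and the project name, while the endpoint comment suggests `period` and `project` parameters. One way those parameters could be mapped onto the query is sketched below, with a whitelist so user input never reaches the SQL string. The period names other than `day` and the `parseCostQuery` helper are assumptions.

```typescript
// Hypothetical whitelist: request parameter -> SQLite datetime() modifier.
const PERIOD_MODIFIERS: Record<string, string> = {
  day: "-1 day",
  week: "-7 days",
  month: "-1 month",
};

interface CostQuery { project: string; modifier: string; }

function parseCostQuery(requestUrl: string): CostQuery {
  const params = new URL(requestUrl).searchParams;
  const project = params.get("project");
  if (!project) throw new Error("Missing required query parameter: project");

  const period = params.get("period") ?? "day";
  // Own-property check, so inherited keys such as "toString" are rejected.
  if (!Object.prototype.hasOwnProperty.call(PERIOD_MODIFIERS, period)) {
    throw new Error(`Unsupported period: ${period}`);
  }
  return { project, modifier: PERIOD_MODIFIERS[period] };
}
```

Both values can then be bound as parameters, for example `timestamp >= datetime('now', ?)` with `.bind(project, modifier)` in the matching placeholder order.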

Daily Spend Limits

async function getDailySpend(projectId: string, db: D1Database): Promise<number> {
  const result = await db.prepare(`
    SELECT COALESCE(SUM(cost_usd), 0) as spend
    FROM api_calls_ledger
    WHERE project_id = ?
      AND timestamp >= datetime('now', 'start of day')
      AND status = 'ok'
  `).bind(projectId).first<{ spend: number }>();

  return result?.spend ?? 0;
}

// Client-side handling of 429
async function callLLMWithBudgetCheck(proxy: string, proxyKey: string, body: unknown): Promise<Response> {
  const response = await fetch(`${proxy}/v1/gemini/generateContent`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "X-Project-Id": "brand-agents",
      "X-Function": "research",
      "X-Api-Key": proxyKey,  // e.g. env.PROXY_KEY, passed in by the caller
    },
    body: JSON.stringify(body),
  });

  if (response.status === 429) {
    const { spend, limit } = await response.json();
    console.warn(`[budget] Daily limit hit: $${spend.toFixed(2)} / $${limit}`);
    // Queue for tomorrow, or use a cheaper model, or skip
    return handleBudgetExhausted(spend, limit);
  }

  return response;
}

Key insight: The proxy layer’s value isn’t just tracking — it’s enforcement. Any service can track its own costs. But only a centralized proxy can prevent overspend across all services simultaneously. When a cron job tries to burn $30 overnight, the proxy returns 429 after $5. The cron handles the 429 gracefully. Nobody wakes up to a surprise bill.
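To make the enforcement behavior concrete, here is a toy simulation of the cron scenario. It is a sketch only: `BudgetGate` is a hypothetical in-memory stand-in for the ledger-backed check, and it mirrors the handler's order of operations (check the limit before the call, record the cost after).

```typescript
// In-memory stand-in for the proxy's daily limit; the real check sums api_calls_ledger.
class BudgetGate {
  private spend = 0;
  private readonly dailyLimitUsd: number;

  constructor(dailyLimitUsd: number) {
    this.dailyLimitUsd = dailyLimitUsd;
  }

  // Reject up front if the limit is already reached; otherwise allow and record the cost.
  tryCall(costUsd: number): 200 | 429 {
    if (this.spend >= this.dailyLimitUsd) return 429;
    this.spend += costUsd;
    return 200;
  }

  get spentUsd(): number {
    return this.spend;
  }
}

function simulateCron(gate: BudgetGate, calls: number, costPerCallUsd: number) {
  let ok = 0;
  let rejected = 0;
  for (let i = 0; i < calls; i++) {
    if (gate.tryCall(costPerCallUsd) === 200) ok++;
    else rejected++;
  }
  return { ok, rejected, spentUsd: gate.spentUsd };
}

// A cron job attempting 30 calls at $1 each against a $5 daily limit.
const overnight = simulateCron(new BudgetGate(5), 30, 1);
console.log(overnight);
```

One consequence of checking before the call is that the cost of the final allowed call is only known afterwards, so spend can exceed the limit by at most one call's cost.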


The Composition Pattern

This is where the architecture comes together. The three layers compose through two integration points:

  1. Container → Brain: The AIChatAgent.onChatMessage() method calls streamText() / generateText() / generateObject() from the AI SDK
  2. Brain → Wallet: The AI SDK provider’s fetch option routes HTTP requests through the proxy

The Custom Fetch Bridge

Every AI SDK provider factory (createGoogleGenerativeAI, createAnthropic, createOpenAI) accepts a fetch option. This is the seam where Layer 2 connects to Layer 3:

import { createGoogleGenerativeAI } from "@ai-sdk/google";

// Direct call — bypasses cost tracking
const googleDirect = createGoogleGenerativeAI({
  apiKey: env.GEMINI_API_KEY,
});

// Proxied call — all requests go through API Mom
const googleProxied = createGoogleGenerativeAI({
  apiKey: "proxied",  // API Mom injects the real key
  baseURL: `${env.API_MOM_URL}/v1/google`,
  fetch: async (url: RequestInfo | URL, init?: RequestInit) => {
    const headers = new Headers(init?.headers);
    headers.set("X-Project-Id", "brand-agents");
    headers.set("X-Function", "content-generation");
    headers.set("X-Api-Key", env.API_MOM_KEY);
    headers.set("X-Tags", JSON.stringify(["brand:niche-fi", "trigger:agent"]));

    return fetch(url, { ...init, headers });
  },
});

The beauty of this pattern: the AI SDK doesn’t know it’s talking to a proxy. It thinks it’s calling Google’s API. The proxy handles authentication, cost tracking, and budget enforcement transparently.
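Since the same header-setting code recurs in every example that follows, it can be factored into a small helper. The sketch below assumes the header names used in this article; `ProxyMeta`, `buildProxyHeaders`, and `createProxiedFetch` are hypothetical names, not part of the AI SDK.

```typescript
interface ProxyMeta {
  projectId: string;
  functionName: string;
  proxyKey: string;
  tags?: string[];
}

// Merge attribution headers into whatever headers the caller (the AI SDK) already set.
function buildProxyHeaders(existing: HeadersInit | undefined, meta: ProxyMeta): Headers {
  const headers = new Headers(existing);
  headers.set("X-Project-Id", meta.projectId);
  headers.set("X-Function", meta.functionName);
  headers.set("X-Api-Key", meta.proxyKey);
  if (meta.tags && meta.tags.length > 0) {
    headers.set("X-Tags", JSON.stringify(meta.tags));
  }
  return headers;
}

// Returns a fetch-compatible function to pass as a provider's `fetch` option.
// `baseFetch` is injectable so the wrapper can be exercised without network access.
function createProxiedFetch(meta: ProxyMeta, baseFetch: typeof fetch = fetch): typeof fetch {
  return (input, init) =>
    baseFetch(input, { ...init, headers: buildProxyHeaders(init?.headers, meta) });
}
```

With this in place, a provider could be configured with `fetch: createProxiedFetch({ projectId: "brand-agents", functionName: "content-generation", proxyKey: env.API_MOM_KEY })`.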

Full Stack Composition

Here’s a complete agent that uses all three layers:

import { AIChatAgent } from "agents/ai-chat-agent";
import { streamText, generateObject, tool, type StreamTextOnFinishCallback } from "ai";
import { createGoogleGenerativeAI } from "@ai-sdk/google";
import { z } from "zod";

interface Env {
  API_MOM_URL: string;
  API_MOM_KEY: string;
  RESEARCH_QUEUE: Queue;
}

interface ResearchAgentState {
  brandSlug: string;
  status: "idle" | "researching" | "analyzing" | "reporting";
  findingsCount: number;
  costToday: number;
  costBudget: number;
  lastError: string | null;
}

export class ResearchAgent extends AIChatAgent<Env, ResearchAgentState> {
  initialState: ResearchAgentState = {
    brandSlug: "",
    status: "idle",
    findingsCount: 0,
    costToday: 0,
    costBudget: 2.0,
    lastError: null,
  };

  // Layer 2: Create a proxied model instance
  private createModel(functionName: string) {
    return createGoogleGenerativeAI({
      apiKey: "proxied",  // Real key lives in API Mom
      baseURL: `${this.env.API_MOM_URL}/v1/google`,
      fetch: async (url, init) => {
        const headers = new Headers(init?.headers);
        headers.set("X-Project-Id", "research-agents");
        headers.set("X-Function", functionName);
        headers.set("X-Api-Key", this.env.API_MOM_KEY);
        headers.set("X-Tags", JSON.stringify([
          `brand:${this.state.brandSlug}`,
          "trigger:chat",
        ]));
        return fetch(url, { ...init, headers });
      },
    })("gemini-2.5-flash");
  }

  // Layer 1: Lifecycle
  async onStart() {
    this.sql`CREATE TABLE IF NOT EXISTS findings (
      id INTEGER PRIMARY KEY AUTOINCREMENT,
      query TEXT NOT NULL,
      category TEXT,
      content TEXT NOT NULL,
      confidence REAL DEFAULT 0.5,
      source_url TEXT,
      created_at TEXT DEFAULT (datetime('now'))
    )`;

    this.sql`CREATE TABLE IF NOT EXISTS cost_log (
      id INTEGER PRIMARY KEY AUTOINCREMENT,
      function TEXT NOT NULL,
      cost_usd REAL NOT NULL,
      tokens_used INTEGER,
      created_at TEXT DEFAULT (datetime('now'))
    )`;
  }

  // Layer 1 → Layer 2: Chat with tools
  async onChatMessage(onFinish?: StreamTextOnFinishCallback) {
    const model = this.createModel("chat");

    const result = streamText({
      model,
      system: `You are a research agent for the brand "${this.state.brandSlug}".
               Use the available tools to research topics and save findings.
               Always search before answering. Save important findings.`,
      messages: this.messages,
      tools: {
        search: tool({
          description: "Search the web for information on a topic",
          parameters: z.object({
            query: z.string().describe("Search query"),
          }),
          execute: async ({ query }) => {
            // This call also goes through API Mom for cost tracking
            const res = await fetch(
              `${this.env.API_MOM_URL}/v1/brave/search?q=${encodeURIComponent(query)}`,
              {
                headers: {
                  "X-Project-Id": "research-agents",
                  "X-Function": "web-search",
                  "X-Api-Key": this.env.API_MOM_KEY,
                },
              }
            );
            return await res.text();
          },
        }),

        saveFinding: tool({
          description: "Save a research finding to the knowledge base",
          parameters: z.object({
            query: z.string(),
            category: z.string(),
            content: z.string(),
            confidence: z.number().min(0).max(1),
            sourceUrl: z.string().optional(),
          }),
          execute: async ({ query, category, content, confidence, sourceUrl }) => {
            this.sql`INSERT INTO findings (query, category, content, confidence, source_url)
                     VALUES (${query}, ${category}, ${content}, ${confidence}, ${sourceUrl ?? null})`;

            this.setState({
              ...this.state,
              findingsCount: this.state.findingsCount + 1,
            });

            return `Saved finding in category "${category}" with confidence ${confidence}`;
          },
        }),

        getFindings: tool({
          description: "Retrieve previous research findings",
          parameters: z.object({
            category: z.string().optional(),
            limit: z.number().default(10),
          }),
          execute: async ({ category, limit }) => {
            const results = category
              ? [...this.sql`SELECT * FROM findings WHERE category = ${category} ORDER BY created_at DESC LIMIT ${limit}`]
              : [...this.sql`SELECT * FROM findings ORDER BY created_at DESC LIMIT ${limit}`];
            return JSON.stringify(results);
          },
        }),
      },
      maxSteps: 8,
      onFinish: async (result) => {
        // Track cost in agent state
        const thinkingTokens = result.providerMetadata?.google?.usageMetadata?.thoughtsTokenCount ?? 0;
        const cost = calculateCost(
          "gemini-2.5-flash",
          result.usage?.inputTokens ?? 0,
          result.usage?.outputTokens ?? 0,
          thinkingTokens,
        );

        this.sql`INSERT INTO cost_log (function, cost_usd, tokens_used)
                 VALUES ('chat', ${cost}, ${result.usage?.totalTokens ?? 0})`;

        this.setState({
          ...this.state,
          costToday: this.state.costToday + cost,
        });

        onFinish?.(result);
      },
    });

    return result.toUIMessageStreamResponse();
  }

  // Structured output example — Layer 2 with schema validation
  async analyzeCompetitors(keyword: string) {
    const model = this.createModel("competitor-analysis");

    const CompetitorSchema = z.object({
      competitors: z.array(z.object({
        name: z.string(),
        url: z.string(),
        strengths: z.array(z.string()),
        weaknesses: z.array(z.string()),
        estimatedTraffic: z.enum(["low", "medium", "high", "very-high"]),
      })),
      marketGaps: z.array(z.string()),
      recommendation: z.string(),
    });

    const { object, usage } = await generateObject({
      model,
      schema: CompetitorSchema,
      prompt: `Analyze the competitive landscape for "${keyword}".
               Identify top competitors, their strengths and weaknesses,
               and market gaps we could exploit.`,
      maxRetries: 3,
    });

    return object;
  }
}

The Data Flow

When a user sends a chat message to this agent, here’s exactly what happens:

1. WebSocket message arrives at the Cloudflare Worker
2. routeAgentRequest() routes to the correct ResearchAgent instance (Durable Object)
3. AIChatAgent loads conversation history from SQLite (Layer 1)
4. onChatMessage() is called
5. streamText() sends the prompt + tools to the AI SDK (Layer 2)
6. AI SDK creates an HTTP request to Gemini's API
7. Custom fetch() intercepts the request, adds proxy headers
8. Request goes to API Mom (Layer 3)
9. API Mom checks daily spend limit
10. API Mom injects the real GEMINI_API_KEY
11. API Mom forwards to Google's API
12. Google returns tokens + usage metadata
13. API Mom records cost to api_calls_ledger
14. API Mom forwards response back (with X-Cost-Usd header)
15. AI SDK parses the response, validates schema if applicable
16. If tool calls: execute tools, send results back to LLM (repeat 6-15)
17. Stream tokens back through WebSocket to client (Layer 1)
18. onFinish: record cost in agent's SQLite, update state
19. State change broadcasts to all connected WebSocket clients
20. Agent goes idle → eventually hibernates → $0 cost

Each layer handles its concern. The agent code in step 4 doesn’t know about API keys. The AI SDK in step 6 doesn’t know about Durable Objects. The proxy in step 9 doesn’t know about Zod schemas. They compose through narrow interfaces: function calls and HTTP.


Patterns

Pattern 1: Budget-Aware Agent

An agent that adjusts its behavior based on remaining budget. Uses the Wallet layer’s cost response headers to make real-time decisions.

export class BudgetAwareAgent extends Agent<Env, BudgetAgentState> {
  initialState: BudgetAgentState = {
    dailyBudget: 5.0,
    spentToday: 0,
    model: "gemini-2.5-flash",
    taskQueue: [],
    completedTasks: 0,
    skippedTasks: 0,
  };

  private getModel() {
    // Downgrade model when budget is tight
    const remaining = this.state.dailyBudget - this.state.spentToday;

    if (remaining < 0.50) {
      return "gemini-2.0-flash";  // Cheapest: $0.10/M input
    }
    if (remaining < 2.0) {
      return "gemini-2.5-flash";  // Middle: $0.15/M input, has thinking
    }
    return "gemini-2.5-pro";     // Best: $1.25/M input, best quality
  }

  private createProxiedModel(functionName: string) {
    const modelId = this.getModel();
    return createGoogleGenerativeAI({
      apiKey: "proxied",
      baseURL: `${this.env.API_MOM_URL}/v1/google`,
      fetch: async (url, init) => {
        const headers = new Headers(init?.headers);
        headers.set("X-Project-Id", "budget-agent");
        headers.set("X-Function", functionName);
        headers.set("X-Api-Key", this.env.API_MOM_KEY);

        const response = await fetch(url, { ...init, headers });

        // Read cost from proxy response headers
        const costUsd = parseFloat(response.headers.get("X-Cost-Usd") ?? "0");
        const dailySpend = parseFloat(response.headers.get("X-Daily-Spend") ?? "0");

        // Update agent state with real cost data
        this.setState({
          ...this.state,
          spentToday: dailySpend,
        });

        // Log cost to agent's SQLite for analysis
        this.sql`INSERT INTO cost_events (function, model, cost_usd, daily_total)
                 VALUES (${functionName}, ${modelId}, ${costUsd}, ${dailySpend})`;

        return response;
      },
    })(modelId);
  }

  async processTask(task: AgentTask) {
    const remaining = this.state.dailyBudget - this.state.spentToday;

    // Skip expensive tasks when budget is low
    if (task.estimatedCost > remaining) {
      this.sql`INSERT INTO skipped_tasks (task_id, reason, remaining_budget)
               VALUES (${task.id}, 'budget_insufficient', ${remaining})`;

      this.setState({
        ...this.state,
        skippedTasks: this.state.skippedTasks + 1,
      });

      // Re-queue for tomorrow
      this.schedule(tomorrowMidnight(), "retryTask");
      return;
    }

    const model = this.createProxiedModel(task.function);

    try {
      const result = await generateObject({
        model,
        schema: task.outputSchema,
        prompt: task.prompt,
        maxRetries: 2,
      });

      this.setState({
        ...this.state,
        completedTasks: this.state.completedTasks + 1,
      });

      return result.object;
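    } catch (err) {
      // `err` is `unknown` under strict TypeScript, so narrow it before reading `.message`
      if (err instanceof Error && err.message.includes("daily_limit_exceeded")) {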
        // Proxy rejected — we've hit the hard limit
        console.warn("[budget] Hard limit hit, pausing until tomorrow");
        this.setState({ ...this.state, status: "budget_exhausted" });
        this.schedule(tomorrowMidnight(), "resetBudget");
      }
      throw err;
    }
  }

  async resetBudget() {
    this.setState({
      ...this.state,
      spentToday: 0,
      status: "idle",
    });
    // Re-process queued tasks
    await this.processQueue();
  }
}
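The agent above calls `tomorrowMidnight()` without defining it. A minimal sketch follows, assuming the reset should line up with the proxy's daily window. SQLite's `datetime('now', 'start of day')` works in UTC, so the helper computes the next UTC midnight rather than local midnight.

```typescript
// Next UTC midnight strictly after `now`.
// Date.UTC normalizes overflow, so day 31 + 1 rolls into the next month correctly.
function tomorrowMidnight(now: Date = new Date()): Date {
  return new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate() + 1));
}

// Seconds remaining until the budget window resets, useful for a delay-based schedule.
function secondsUntilReset(now: Date = new Date()): number {
  return Math.ceil((tomorrowMidnight(now).getTime() - now.getTime()) / 1000);
}
```

Accepting `now` as a parameter keeps the helper deterministic in tests.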

Pattern 2: Multi-Provider Routing

Route different tasks to different LLM providers based on task complexity, leveraging the AI SDK’s provider abstraction.

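import { generateObject } from "ai";
import { createGoogleGenerativeAI } from "@ai-sdk/google";
import { createAnthropic } from "@ai-sdk/anthropic";
import { createOpenAI } from "@ai-sdk/openai";
import { z } from "zod";

type TaskComplexity = "trivial" | "standard" | "complex" | "critical";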

interface ModelRoute {
  provider: "google" | "anthropic" | "openai";
  model: string;
  costPerMillionInput: number;
  costPerMillionOutput: number;
  bestFor: string[];
}

const MODEL_ROUTES: Record<TaskComplexity, ModelRoute> = {
  trivial: {
    provider: "google",
    model: "gemini-2.0-flash",
    costPerMillionInput: 0.10,
    costPerMillionOutput: 0.40,
    bestFor: ["classification", "extraction", "formatting"],
  },
  standard: {
    provider: "google",
    model: "gemini-2.5-flash",
    costPerMillionInput: 0.15,
    costPerMillionOutput: 0.60,
    bestFor: ["summarization", "content-generation", "tool-calling"],
  },
  complex: {
    provider: "anthropic",
    model: "claude-sonnet-4-20250514",
    costPerMillionInput: 3.0,
    costPerMillionOutput: 15.0,
    bestFor: ["reasoning", "code-generation", "nuanced-analysis"],
  },
  critical: {
    provider: "google",
    model: "gemini-2.5-pro",
    costPerMillionInput: 1.25,
    costPerMillionOutput: 10.0,
    bestFor: ["long-context", "multi-step-reasoning", "high-stakes-decisions"],
  },
};

function createRoutedModel(
  env: Env,
  complexity: TaskComplexity,
  functionName: string,
) {
  const route = MODEL_ROUTES[complexity];

  // Custom fetch that routes through the proxy
  const proxiedFetch = async (url: RequestInfo | URL, init?: RequestInit) => {
    const headers = new Headers(init?.headers);
    headers.set("X-Project-Id", "routed-agents");
    headers.set("X-Function", functionName);
    headers.set("X-Api-Key", env.API_MOM_KEY);
    headers.set("X-Tags", JSON.stringify([`complexity:${complexity}`, `provider:${route.provider}`]));
    return fetch(url, { ...init, headers });
  };

  switch (route.provider) {
    case "google":
      return createGoogleGenerativeAI({
        apiKey: "proxied",
        baseURL: `${env.API_MOM_URL}/v1/google`,
        fetch: proxiedFetch,
      })(route.model);

    case "anthropic":
      return createAnthropic({
        apiKey: "proxied",
        baseURL: `${env.API_MOM_URL}/v1/anthropic`,
        fetch: proxiedFetch,
      })(route.model);

    case "openai":
      return createOpenAI({
        apiKey: "proxied",
        baseURL: `${env.API_MOM_URL}/v1/openai`,
        fetch: proxiedFetch,
      })(route.model);
  }
}

// Usage in an agent
async function processWithRouting(task: AgentTask, env: Env) {
  const complexity = classifyComplexity(task);
  const model = createRoutedModel(env, complexity, task.function);

  const { object } = await generateObject({
    model,
    schema: task.schema,
    prompt: task.prompt,
    maxRetries: 3,
  });

  return object;
}

// Complexity classifier — itself a trivial LLM call
async function classifyComplexity(task: AgentTask): Promise<TaskComplexity> {
  const model = createRoutedModel(env, "trivial", "classify-complexity");

  const { object } = await generateObject({
    model,
    schema: z.object({
      complexity: z.enum(["trivial", "standard", "complex", "critical"]),
      reasoning: z.string(),
    }),
    prompt: `Classify the complexity of this task:\n\n${task.prompt.slice(0, 500)}`,
  });

  return object.complexity;
}
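The `costPerMillionInput` and `costPerMillionOutput` fields in `MODEL_ROUTES` are declared but never read. One possible use is a pre-flight estimate that rejects a route when it would exceed a per-task cap. This is a sketch: `estimateRouteCost` and `fitsBudget` are hypothetical helpers, and the estimate ignores thinking tokens, so it will run low for thinking-enabled models.

```typescript
interface RoutePricing {
  costPerMillionInput: number;
  costPerMillionOutput: number;
}

// Rough pre-flight estimate from token counts; excludes thinking tokens.
function estimateRouteCost(
  route: RoutePricing,
  inputTokens: number,
  expectedOutputTokens: number,
): number {
  return (
    inputTokens * route.costPerMillionInput +
    expectedOutputTokens * route.costPerMillionOutput
  ) / 1_000_000;
}

function fitsBudget(
  route: RoutePricing,
  inputTokens: number,
  expectedOutputTokens: number,
  capUsd: number,
): boolean {
  return estimateRouteCost(route, inputTokens, expectedOutputTokens) <= capUsd;
}

// Prices copied from the "complex" and "standard" routes above.
const complexRoute: RoutePricing = { costPerMillionInput: 3.0, costPerMillionOutput: 15.0 };
const standardRoute: RoutePricing = { costPerMillionInput: 0.15, costPerMillionOutput: 0.6 };
```

A router could fall back from the classified complexity to the next cheaper route whenever `fitsBudget` returns false.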

Pattern 3: Agent-to-Agent Communication via Queues

Multiple agents that coordinate through Cloudflare Queues, with each agent independently managing its own state and cost budget.

// Research Agent — finds information
export class ResearchWorkerAgent extends Agent<Env, ResearchState> {
  initialState: ResearchState = {
    status: "idle",
    assignedKeywords: [],
    completedKeywords: [],
    costToday: 0,
  };

  async onStart() {
    this.sql`CREATE TABLE IF NOT EXISTS research_results (
      keyword TEXT PRIMARY KEY,
      serp_data TEXT,
      competitor_data TEXT,
      opportunity_score REAL,
      researched_at TEXT DEFAULT (datetime('now'))
    )`;
  }

  // Triggered by queue message from Coordinator
  async handleResearchRequest(keyword: string, correlationId: string) {
    this.setState({ ...this.state, status: "researching" });

    const model = this.createProxiedModel("keyword-research");

    // Step 1: Search
    const searchResults = await this.searchWeb(keyword);

    // Step 2: Analyze with LLM
    const { object: analysis } = await generateObject({
      model,
      schema: KeywordAnalysisSchema,
      prompt: `Analyze the SERP for "${keyword}":\n${searchResults}`,
    });

    // Step 3: Save results
    this.sql`INSERT OR REPLACE INTO research_results
             (keyword, serp_data, competitor_data, opportunity_score)
             VALUES (${keyword}, ${JSON.stringify(searchResults)},
                     ${JSON.stringify(analysis.competitors)},
                     ${analysis.opportunityScore})`;

    // Step 4: Emit completion event
    await this.env.EVENTS_QUEUE.send({
      event_id: crypto.randomUUID(),
      type: "research.completed",
      source: "research-agent",
      timestamp: new Date().toISOString(),
      correlation_id: correlationId,
      payload: {
        keyword,
        opportunityScore: analysis.opportunityScore,
        competitorCount: analysis.competitors.length,
      },
    });

    this.setState({
      ...this.state,
      status: "idle",
      completedKeywords: [...this.state.completedKeywords, keyword],
    });
  }
}

// Content Agent — generates content based on research
export class ContentWorkerAgent extends Agent<Env, ContentState> {
  initialState: ContentState = {
    status: "idle",
    articlesGenerated: 0,
    costToday: 0,
  };

  // Triggered by research.completed event
  async handleResearchCompleted(event: DomainMessage<ResearchCompletedPayload>) {
    const { keyword, opportunityScore } = event.payload;

    // Only generate content for high-opportunity keywords
    if (opportunityScore < 0.6) {
      console.log(`[content] Skipping "${keyword}" — score ${opportunityScore} below threshold`);
      return;
    }

    this.setState({ ...this.state, status: "generating" });

    // Use a more capable model for content generation
    const model = this.createProxiedModel("article-write");

    const { object: article } = await generateObject({
      model,
      schema: ArticleSchema,
      prompt: `Write a comprehensive article about "${keyword}" targeting
               users searching for this term. Include practical advice,
               comparisons, and actionable steps.`,
      maxRetries: 3,
    });

    // Emit publish command
    await this.env.PUBLISH_QUEUE.send({
      event_id: crypto.randomUUID(),
      type: "content.publish",
      source: "content-agent",
      timestamp: new Date().toISOString(),
      correlation_id: event.correlation_id,
      payload: {
        keyword,
        title: article.title,
        slug: article.slug,
        content: article.content,  // Normally stored in R2, reference in message
      },
    });

    this.setState({
      ...this.state,
      status: "idle",
      articlesGenerated: this.state.articlesGenerated + 1,
    });
  }
}
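Both agents send messages with the same envelope fields, and `ContentWorkerAgent` refers to a `DomainMessage<T>` type that is not shown. A plausible shape, inferred from the fields used in the two `send()` calls, is sketched below together with a small factory. The `makeEvent` helper and its `EventOptions` argument are assumptions added for testability.

```typescript
// Envelope shared by every queue message; field names match the send() calls above.
interface DomainMessage<T> {
  event_id: string;        // unique per message, used for deduplication
  type: string;            // e.g. "research.completed"
  source: string;          // emitting agent
  timestamp: string;       // ISO 8601
  correlation_id: string;  // ties together all messages in one workflow
  payload: T;
}

interface EventOptions {
  eventId?: string;  // override for deterministic tests
  now?: Date;
}

function makeEvent<T>(
  type: string,
  source: string,
  correlationId: string,
  payload: T,
  opts: EventOptions = {},
): DomainMessage<T> {
  return {
    event_id: opts.eventId ?? crypto.randomUUID(),
    type,
    source,
    timestamp: (opts.now ?? new Date()).toISOString(),
    correlation_id: correlationId,
    payload,
  };
}
```

Passing the incoming `correlation_id` through to every outgoing event is what lets a research request be traced all the way to the published article.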

Pattern 4: Cloudflare AI Gateway Integration

Instead of a custom proxy, use Cloudflare’s AI Gateway for caching, rate limiting, and observability. This works as a lightweight Layer 3 when you don’t need custom cost attribution.

import { createAiGateway } from "ai-gateway-provider";
import { createGoogleGenerativeAI } from "ai-gateway-provider/providers/google";

// AIChatAgent (not the base Agent) provides this.messages and onChatMessage()
export class GatewayAgent extends AIChatAgent<Env, AgentState> {
  private createModel() {
    // Route through Cloudflare AI Gateway
    const aigateway = createAiGateway({
      binding: this.env.AI.gateway("my-gateway"),
      options: {
        cacheTtl: 300,  // Cache responses for 5 minutes
      },
    });

    const google = createGoogleGenerativeAI({
      apiKey: this.env.GEMINI_API_KEY,
    });

    // Compose: AI Gateway wraps the Google provider
    return aigateway(google("gemini-2.5-flash"));
  }

  async onChatMessage() {
    const model = this.createModel();

    const result = streamText({
      model,
      messages: this.messages,
      maxSteps: 5,
    });

    return result.toUIMessageStreamResponse();
  }
}

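The AI Gateway provides caching, rate limiting, observability, and fallbacks.

But it doesn’t provide function-level cost attribution or spend-limit enforcement.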

This is why many production systems use both: AI Gateway for caching and fallbacks, a custom proxy for attribution and enforcement.

Pattern 5: Structured Output Pipeline

A pipeline that chains multiple LLM calls, each with its own schema, building on the output of the previous step.

const OutlineSchema = z.object({
  title: z.string(),
  sections: z.array(z.object({
    heading: z.string(),
    keyPoints: z.array(z.string()),
    estimatedWordCount: z.number(),
  })),
  targetWordCount: z.number(),
  targetAudience: z.string(),
});

const DraftSchema = z.object({
  title: z.string(),
  content: z.string().describe("Full article in markdown"),
  wordCount: z.number(),
  metaDescription: z.string().max(160),
  tags: z.array(z.string()),
});

const EditSchema = z.object({
  content: z.string().describe("Edited article in markdown"),
  changes: z.array(z.object({
    type: z.enum(["grammar", "clarity", "seo", "structure"]),
    description: z.string(),
  })),
  readabilityScore: z.number().min(0).max(100),
});

async function articlePipeline(
  keyword: string,
  research: string,
  env: Env,
): Promise<{ article: z.infer<typeof EditSchema>; totalCost: number }> {
  let totalCost = 0;

  const createModel = (fn: string) => createGoogleGenerativeAI({
    apiKey: "proxied",
    baseURL: `${env.API_MOM_URL}/v1/google`,
    fetch: async (url, init) => {
      const headers = new Headers(init?.headers);
      headers.set("X-Project-Id", "content-pipeline");
      headers.set("X-Function", fn);
      headers.set("X-Api-Key", env.API_MOM_KEY);
      const response = await fetch(url, { ...init, headers });
      totalCost += parseFloat(response.headers.get("X-Cost-Usd") ?? "0");
      return response;
    },
  })("gemini-2.5-flash");

  // Step 1: Outline (cheap, fast)
  const { object: outline } = await generateObject({
    model: createModel("outline"),
    schema: OutlineSchema,
    prompt: `Create an article outline for "${keyword}" based on this research:\n${research}`,
  });

  // Step 2: Draft (main cost — longer output)
  const { object: draft } = await generateObject({
    model: createModel("draft"),
    schema: DraftSchema,
    prompt: `Write a full article following this outline:\n${JSON.stringify(outline)}`,
    maxOutputTokens: 16384,
  });

  // Step 3: Edit (moderate cost — reviews full article)
  const { object: edited } = await generateObject({
    model: createModel("edit"),
    schema: EditSchema,
    prompt: `Edit this article for clarity, SEO, and readability:\n${draft.content}`,
  });

  console.log(`[pipeline] Article for "${keyword}" complete. Total cost: $${totalCost.toFixed(4)}`);

  return { article: edited, totalCost };
}
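The pipeline above only accumulates a single `totalCost`. To see which step dominates, the same `X-Cost-Usd` header can be recorded per function instead. A minimal sketch, where `createCostMeter` is a hypothetical helper:

```typescript
function createCostMeter() {
  const byFunction = new Map<string, number>();

  return {
    // `headerValue` is the raw X-Cost-Usd header; missing or malformed values count as 0.
    record(fn: string, headerValue: string | null): void {
      const parsed = Number.parseFloat(headerValue ?? "0");
      const cost = Number.isFinite(parsed) ? parsed : 0;
      byFunction.set(fn, (byFunction.get(fn) ?? 0) + cost);
    },
    total(): number {
      return Array.from(byFunction.values()).reduce((sum, c) => sum + c, 0);
    },
    breakdown(): Record<string, number> {
      return Object.fromEntries(byFunction);
    },
  };
}
```

Inside the custom `fetch`, the accumulation line would become `meter.record(fn, response.headers.get("X-Cost-Usd"))`, and the pipeline could return `meter.breakdown()` alongside the total.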

Small Examples

Example 1: Minimal Agent with State Sync

The simplest possible agent — state syncs to all connected clients in real-time.

import { Agent, type Connection } from "agents";

interface CounterState {
  count: number;
  lastUpdated: string | null;
}

export class CounterAgent extends Agent<Env, CounterState> {
  initialState: CounterState = { count: 0, lastUpdated: null };

  async onMessage(connection: Connection, message: string) {
    const { action } = JSON.parse(message);

    if (action === "increment") {
      this.setState({
        count: this.state.count + 1,
        lastUpdated: new Date().toISOString(),
      });
    }

    if (action === "reset") {
      this.setState({ count: 0, lastUpdated: new Date().toISOString() });
    }
    // State automatically broadcasts to all connected WebSocket clients
  }
}
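`JSON.parse` throws on malformed input, so a single bad WebSocket frame would surface as an error in `onMessage`. A defensive parser is one way to harden this. The sketch below is an addition, not part of the original agent, and `parseAction` is a hypothetical helper name.

```typescript
type CounterAction = "increment" | "reset";

// Returns the action if the message is valid JSON with a known action, otherwise null.
function parseAction(message: unknown): CounterAction | null {
  if (typeof message !== "string") return null; // ignore binary frames
  try {
    const parsed: unknown = JSON.parse(message);
    if (typeof parsed !== "object" || parsed === null) return null;
    const action = (parsed as { action?: unknown }).action;
    return action === "increment" || action === "reset" ? action : null;
  } catch {
    return null; // malformed JSON
  }
}
```

In the agent, `onMessage` would call `parseAction(message)` and return early on `null` instead of destructuring the parsed value directly.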

Example 2: Custom Fetch Logger

Intercept all AI SDK requests to log request/response details — useful for debugging.

function createLoggingModel(apiKey: string, modelId: string) {
  return createGoogleGenerativeAI({
    apiKey,
    fetch: async (url, init) => {
      const startMs = Date.now();
      const requestBody = init?.body ? JSON.parse(init.body as string) : null;

      console.log(`[llm:request] ${url}`);
      console.log(`[llm:request] Prompt tokens (est): ${estimateTokens(requestBody)}`);

      const response = await fetch(url, init);
      const durationMs = Date.now() - startMs;

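      // Clone so the AI SDK can still read the original body. Only parse JSON responses:
      // streaming calls return server-sent events, which .json() cannot parse
      const isJson = response.headers.get("content-type")?.includes("application/json");
      const responseBody = isJson ? await response.clone().json() : null;
      const usage = responseBody?.usageMetadata;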

      console.log(`[llm:response] ${durationMs}ms | ${response.status}`);
      console.log(`[llm:response] Input: ${usage?.promptTokenCount ?? "?"} | Output: ${usage?.candidatesTokenCount ?? "?"} | Thinking: ${usage?.thoughtsTokenCount ?? 0}`);

      return response;
    },
  })(modelId);
}
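The logger calls `estimateTokens`, which is not defined above. A rough stand-in is sketched here, assuming the Gemini request shape (`contents[].parts[].text`) and the common heuristic of about four characters per token. It is not a real tokenizer, and it ignores system instructions and tool definitions.

```typescript
interface GeminiRequestBody {
  contents?: Array<{ parts?: Array<{ text?: string }> }>;
}

// Heuristic estimate: total prompt characters divided by 4, rounded up.
function estimateTokens(body: GeminiRequestBody | null): number {
  if (!body?.contents) return 0;
  let chars = 0;
  for (const content of body.contents) {
    for (const part of content.parts ?? []) {
      chars += part.text?.length ?? 0;
    }
  }
  return Math.ceil(chars / 4);
}
```

For logging purposes the estimate only needs to be in the right ballpark; the response's `usageMetadata` gives the billed numbers.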

Example 3: Schema-First Tool Definition

Define tools using Zod schemas for type safety: the schema validates the arguments the model supplies at runtime, and the argument types of execute are inferred from it.

import { generateText, tool } from "ai";
import { google } from "@ai-sdk/google";
import { z } from "zod";

const weatherTool = tool({
  description: "Get current weather for a location",
  parameters: z.object({
    city: z.string().describe("City name"),
    units: z.enum(["celsius", "fahrenheit"]).default("celsius"),
  }),
  execute: async ({ city, units }) => {
    // city is typed as string, units as "celsius" | "fahrenheit"
    // Encode the city so names with spaces or "&" don't break the query string
    const response = await fetch(
      `https://api.weather.example.com/v1/current?city=${encodeURIComponent(city)}&units=${units}`
    );
    const data = await response.json();
    return `${city}: ${data.temperature}°${units === "celsius" ? "C" : "F"}, ${data.condition}`;
  },
});

// Use in generateText
const { text } = await generateText({
  model: google("gemini-2.5-flash"),
  tools: { weather: weatherTool },
  prompt: "What's the weather in Tokyo?",
  maxSteps: 3,
});

Example 4: Thinking Budget Control

Limit thinking tokens to control cost — useful when you want speed over depth.

async function quickClassification(text: string) {
  const { object } = await generateObject({
    model: google("gemini-2.5-flash"),
    schema: z.object({
      category: z.enum(["positive", "negative", "neutral"]),
      confidence: z.number().min(0).max(1),
    }),
    prompt: `Classify the sentiment: "${text}"`,
    providerOptions: {
      google: {
        thinkingConfig: { thinkingBudget: 0 },  // No thinking — fastest, cheapest
      },
    },
  });
  return object;
}

async function deepAnalysis(text: string) {
  const { object } = await generateObject({
    model: google("gemini-2.5-flash"),
    schema: z.object({
      sentiment: z.enum(["positive", "negative", "neutral", "mixed"]),
      themes: z.array(z.string()),
      reasoning: z.string(),
      suggestedActions: z.array(z.string()),
    }),
    prompt: `Deeply analyze this text for sentiment, themes, and actionable insights:\n\n${text}`,
    providerOptions: {
      google: {
        thinkingConfig: { thinkingBudget: 4096 },  // Allow thinking — better quality
      },
    },
  });
  return object;
}
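A thinking budget is also a cost ceiling, and the arithmetic is worth making explicit. The sketch below assumes the $3.50/M thinking rate quoted earlier in this article. Both helper names are hypothetical, and the rate should be verified against current pricing.

```typescript
// Rate quoted in this article (USD per million thinking tokens); treat as an assumption.
const THINKING_USD_PER_MILLION = 3.5;

// Worst-case thinking spend if every call uses its full budget.
function maxThinkingCostUsd(thinkingBudget: number, calls = 1): number {
  return (thinkingBudget * calls * THINKING_USD_PER_MILLION) / 1_000_000;
}

// Largest thinking budget (in tokens) that keeps one call's thinking cost under a cap.
function thinkingBudgetForCap(capUsd: number): number {
  return Math.floor((capUsd * 1_000_000) / THINKING_USD_PER_MILLION);
}

console.log(maxThinkingCostUsd(4096).toFixed(6));      // one deepAnalysis call
console.log(maxThinkingCostUsd(4096, 579).toFixed(2)); // 579 calls, as in the article-write row
console.log(thinkingBudgetForCap(0.01));               // budget for a 1-cent cap
```

A budget of 4096 looks negligible per call, but at cron-job volumes the worst case adds up to dollars per day.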

Example 5: Agent with Scheduled Self-Improvement

An agent that periodically reviews its own performance and adjusts its strategy.

export class AdaptiveAgent extends Agent<Env, AdaptiveState> {
  initialState: AdaptiveState = {
    strategy: "balanced",
    successRate: 0,
    avgCostPerTask: 0,
    tasksProcessed: 0,
  };

  async onStart() {
    // Self-review every 6 hours
    this.schedule("0 */6 * * *", "selfReview");

    this.sql`CREATE TABLE IF NOT EXISTS task_outcomes (
      id INTEGER PRIMARY KEY AUTOINCREMENT,
      task_type TEXT,
      model_used TEXT,
      cost_usd REAL,
      success INTEGER,
      quality_score REAL,
      created_at TEXT DEFAULT (datetime('now'))
    )`;
  }

  async selfReview() {
    const stats = [...this.sql`
      SELECT
        model_used,
        COUNT(*) as total,
        SUM(success) as successes,
        AVG(cost_usd) as avg_cost,
        AVG(quality_score) as avg_quality
      FROM task_outcomes
      WHERE created_at > datetime('now', '-6 hours')
      GROUP BY model_used
    `];

    if (stats.length === 0) return;

    // Use the cheapest model for self-reflection
    const model = this.createProxiedModel("self-review");

    const { object: review } = await generateObject({
      model,
      schema: z.object({
        recommendation: z.enum(["use_cheaper_model", "use_better_model", "stay_current"]),
        reasoning: z.string(),
        suggestedThinkingBudget: z.number(),
      }),
      prompt: `Review agent performance stats and recommend adjustments:\n${JSON.stringify(stats)}`,
    });

    this.setState({
      ...this.state,
      strategy: review.recommendation,
    });

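    // Log the review itself as a task. The model name and cost below are hardcoded
    // placeholders; a real run would record the values reported by the proxy.
    this.sql`INSERT INTO task_outcomes (task_type, model_used, cost_usd, success, quality_score)
             VALUES ('self-review', 'gemini-2.0-flash', 0.001, 1, 1.0)`;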
  }
}

Example 6: Idempotent Queue Consumer with Cost Tracking

A queue consumer that deduplicates messages and tracks cost per processed item.

export default {
  async queue(batch: MessageBatch<DomainMessage>, env: Env) {
    for (const msg of batch.messages) {
      const { event_id, type, payload } = msg.body;

      // Idempotency check
      const existing = await env.DB.prepare(
        `SELECT 1 FROM processed_events WHERE event_id = ?`
      ).bind(event_id).first();

      if (existing) {
        msg.ack();
        continue;
      }

      try {
        let costUsd = 0;

        if (type === "content.generate") {
          const model = createGoogleGenerativeAI({
            apiKey: "proxied",
            baseURL: `${env.API_MOM_URL}/v1/google`,
            fetch: async (url, init) => {
              const headers = new Headers(init?.headers);
              headers.set("X-Project-Id", "content-worker");
              headers.set("X-Function", "content-generate");
              headers.set("X-Api-Key", env.API_MOM_KEY);
              headers.set("X-Tags", JSON.stringify([`event:${event_id}`]));
              const res = await fetch(url, { ...init, headers });
              costUsd += parseFloat(res.headers.get("X-Cost-Usd") ?? "0");
              return res;
            },
          })("gemini-2.5-flash");

          await generateObject({
            model,
            schema: ArticleSchema,
            prompt: `Generate content for: ${payload.keyword}`,
          });
        }

        // Mark as processed
        await env.DB.prepare(
          `INSERT INTO processed_events (event_id, type, cost_usd, processed_at)
           VALUES (?, ?, ?, datetime('now'))`
        ).bind(event_id, type, costUsd).run();

        msg.ack();
      } catch (err) {
        console.error(`[queue] Failed ${event_id}: ${err}`);
        msg.retry({ delaySeconds: 30 });
      }
    }
  },
};

Example 7: Provider Fallback Chain

Try providers in order — if one fails (rate limit, outage), fall back to the next.

async function generateWithFallback<T>(
  schema: z.ZodType<T>,
  prompt: string,
  env: Env,
): Promise<{ data: T; provider: string; cost: number }> {
  const providers = [
    {
      name: "google",
      create: () => createGoogleGenerativeAI({
        apiKey: "proxied",
        baseURL: `${env.API_MOM_URL}/v1/google`,
        fetch: proxyFetch(env, "google-primary"),
      })("gemini-2.5-flash"),
    },
    {
      name: "anthropic",
      create: () => createAnthropic({
        apiKey: "proxied",
        baseURL: `${env.API_MOM_URL}/v1/anthropic`,
        fetch: proxyFetch(env, "anthropic-fallback"),
      })("claude-sonnet-4-20250514"),
    },
    {
      name: "openai",
      create: () => createOpenAI({
        apiKey: "proxied",
        baseURL: `${env.API_MOM_URL}/v1/openai`,
        fetch: proxyFetch(env, "openai-fallback"),
      })("gpt-4o"),
    },
  ];

  let lastError = "unknown error";

  for (const provider of providers) {
    try {
      const model = provider.create();
      const { object } = await generateObject({ model, schema, prompt, maxRetries: 2 });
      // cost is a placeholder: capture X-Cost-Usd in the custom fetch (as in Example 6) to report real spend
      return { data: object, provider: provider.name, cost: 0 };
    } catch (err) {
      // err is `unknown` under strict TypeScript, so narrow it before reading .message
      lastError = err instanceof Error ? err.message : String(err);
      console.warn(`[fallback] ${provider.name} failed: ${lastError}`);
    }
  }

  // Every provider in the chain failed
  throw new Error(`All providers failed. Last error: ${lastError}`);
}

Example 8: Real-Time Cost Dashboard via WebSocket

An agent that tracks cost across all agent instances and broadcasts to a monitoring dashboard.

export class CostMonitorAgent extends Agent<Env, CostMonitorState> {
  initialState: CostMonitorState = {
    totalToday: 0,
    byProject: {},
    alerts: [],
    lastUpdated: null,
  };

  async onStart() {
    // Poll cost data every 5 minutes
    this.schedule("*/5 * * * *", "refreshCosts");
  }

  async refreshCosts() {
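    const response = await fetch(`${this.env.API_MOM_URL}/v1/costs?period=day`, {
      headers: { "X-Api-Key": this.env.API_MOM_ADMIN_KEY },
    });
    // Keep the previous state if the proxy is unreachable or returns an error
    if (!response.ok) return;

    // Response shape assumed from the fields used below
    const costs = (await response.json()) as {
      totalSpend: number;
      projects: { name: string; spent: number; dailyLimit: number }[];
    };
    const alerts: string[] = [];

    // Check for projects approaching limits (skip projects with no limit set)
    for (const project of costs.projects) {
      if (project.dailyLimit <= 0) continue;
      const usagePercent = (project.spent / project.dailyLimit) * 100;
      if (usagePercent > 80) {
        alerts.push(`${project.name}: ${usagePercent.toFixed(0)}% of daily budget ($${project.spent.toFixed(2)}/$${project.dailyLimit})`);
      }
    }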

    this.setState({
      totalToday: costs.totalSpend,
      byProject: Object.fromEntries(costs.projects.map(p => [p.name, p.spent])),
      alerts,
      lastUpdated: new Date().toISOString(),
    });

    // State update automatically broadcasts to all connected dashboard clients
    // via WebSocket — no explicit push needed
  }
}

Comparisons

Agent Runtime Frameworks

| Framework | Runtime Model | State Persistence | Scaling Model | Cost When Idle | Language | Edge/Serverless |
| --- | --- | --- | --- | --- | --- | --- |
| Cloudflare Agents SDK | Durable Objects (micro-servers) | Built-in SQLite + key-value state | Millions of instances, auto-scale | $0 (hibernation) | TypeScript | Yes (global edge) |
| OpenAI Agents SDK | In-process (client manages runtime) | None (bring your own) | Manual (run more processes) | N/A (not a host) | Python/TypeScript | No |
| LangGraph | In-process or LangGraph Cloud | Checkpointing (pluggable store) | LangGraph Cloud or manual | Depends on host | Python/TypeScript | Cloud only |
| CrewAI | In-process | None built-in | Manual | Depends on host | Python | No |
| AutoGen / MS Agent Framework | In-process, conversation-based | Session-level | Manual | Depends on host | Python | No |
| AWS Bedrock Agents | Managed service | Session memory (managed) | Auto-scale | Per-second billing (no true $0) | API-based | No (AWS regions) |

Verdict: If you need agents that hibernate to $0, run on the edge, and scale to millions of instances without infrastructure management, the Cloudflare Agents SDK is the only framework in this table that does all three. If you need a rich ecosystem of pre-built agent patterns and don’t mind hosting the runtime yourself, LangGraph or the OpenAI Agents SDK give you more out of the box.

LLM Interface Libraries

| Library | Structured Output | Tool Calling | Streaming | Provider Abstraction | Custom Fetch | Bundle Size | Edge Compatible |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Vercel AI SDK | Zod schemas → generateObject | tool() + maxSteps loops | streamText | 25+ providers | Yes (per-provider) | ~67 KB gzipped | Yes |
| LangChain JS | Output parsers (less type-safe) | Agent executors, tool chains | Streaming callbacks | 50+ providers | Via custom LLM | ~101 KB gzipped | Limited |
| OpenAI SDK | JSON mode + function calling | Native function calling | Streaming helpers | OpenAI only | No | ~20 KB gzipped | Yes |
| Raw fetch | Manual JSON.parse + validation | Manual tool loop | Manual SSE parsing | One at a time | N/A | 0 KB | Yes |
| @cloudflare/ai-gateway SDK | Zod schemas | Tool calling | Streaming | Multi-provider via Gateway | N/A | Small | Yes |

Verdict: The Vercel AI SDK is the best fit for TypeScript-first, edge-compatible applications. It gives you type-safe structured output without the weight of LangChain, and provider abstraction without the lock-in of the OpenAI SDK. The custom fetch option is the key feature that enables the proxy layer integration.

LLM Proxy / Gateway Solutions

| Solution | Self-Hosted | Cost Tracking | Daily Limits | Function Attribution | Custom Cost Formulas | Thinking Token Support | Open Source |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Custom Proxy (API Mom pattern) | Yes (your Worker) | Full control | Yes | Yes (X-Function header) | Yes | Yes | Your code |
| Cloudflare AI Gateway | Managed | Dashboard only | Rate limiting | No | No | Limited | No |
| LiteLLM | Yes | Per-model tracking | Yes | Limited | Via callbacks | Partial | Yes |
| Helicone | Yes (or hosted) | Per-request traces | Alerting | Via headers | Limited | Partial | Yes |
| Portkey | Hosted ($49/mo+) | Full traces | Budget caps | Via metadata | Via plugins | Yes | No |
| Direct API calls | N/A | None | None | None | None | None | N/A |

Verdict: For maximum control and Cloudflare-native integration, a custom proxy Worker is ideal — you own the cost formula, the attribution schema, and the enforcement logic. For teams that don’t want to build infrastructure, Helicone or Portkey provide good observability out of the box. Cloudflare AI Gateway is a good middle ground for caching and fallbacks but lacks the attribution depth needed for multi-service cost control.

Full Architecture Comparison

| Approach | State | LLM | Cost Control | Complexity | Monthly Cost (10K calls) |
| --- | --- | --- | --- | --- | --- |
| Three-Layer (this article) | Durable Objects | AI SDK | Custom proxy | Medium | ~$15 infra + LLM costs |
| Monolithic Worker | KV / D1 | Raw fetch | None | Low initially, high later | ~$5 infra + unknown LLM |
| LangChain + VPS | Redis / Postgres | LangChain | LiteLLM proxy | High | ~$50 server + LLM costs |
| AWS Bedrock Agents | Managed | Managed | CloudWatch | Medium | ~$100+ (multi-layer billing) |
| OpenAI Agents + Vercel | Vercel KV | OpenAI SDK | OpenAI dashboard | Low-Medium | ~$20 Vercel + LLM costs |

Anti-Patterns

| Don’t | Do Instead | Why |
| --- | --- | --- |
| Put API keys in every worker’s wrangler.jsonc | Single proxy holds all API keys | One place to rotate, audit, and limit keys |
| Call fetch("https://generativelanguage.googleapis.com/...") directly | Use createGoogleGenerativeAI() from @ai-sdk/google | Type safety, retries, structured output, provider switching |
| Parse LLM JSON with JSON.parse() and hope | Use generateObject() with a Zod schema | Automatic validation, typed results, retry on schema failure |
| Price all output tokens at the same rate | Separate thinking tokens ($3.50/M) from output tokens ($0.60/M) | 3-9x cost undercount if you don’t |
| Store conversation history in KV | Use AIChatAgent (auto-persists to SQLite) | KV can’t query, can’t paginate, can’t search |
| Create one giant “orchestrator” agent | Use multiple specialized agents communicating via Queues | Isolation, independent scaling, independent budgets |
| Run expensive LLM calls in crons without cost ceilings | Set daily spend limits in the proxy layer | A cron running every 15 minutes can burn $30 overnight |
| Trust internal cost tracking without reconciliation | Compare proxy totals against provider billing monthly | Internal tracking has bugs; the provider bill is the truth |
| Use setInterval() in Durable Objects | Use this.schedule() or alarm-based scheduling | setInterval prevents hibernation, costs money 24/7 |
| Embed LLM calling logic in the agent class | Create a shared LLM harness module | DRY, consistent cost tracking, single place for pricing updates |
| Skip the proxy layer “because we only have one service” | Add the proxy from day one | You will have more services. Retrofitting cost tracking is 10x harder |
| Use maxSteps: 100 for tool-calling agents | Start with maxSteps: 5-10, increase deliberately | Each step is a full LLM call. 100 steps = 100x cost. |
| Retry forever on LLM failures | Use maxRetries: 2-3 and handle NoObjectGeneratedError | Infinite retries = infinite cost |
| Let the AI SDK hit the provider directly from all environments | Route dev/staging through the same proxy with separate project IDs | Dev cost is invisible otherwise; it also shares rate limits |
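
The thinking-token row is the easiest one to get wrong, so here is a small worked comparison of the two pricing formulas. It is a sketch under stated assumptions: the rates are this article's Gemini 2.5 Flash figures, the usage report is assumed to give thinking tokens separately from visible output tokens, and the `TokenUsage` shape and function names are hypothetical.

```typescript
interface TokenUsage {
  inputTokens: number;
  outputTokens: number; // visible output only, excluding thinking
  thinkingTokens: number;
}

// USD per million tokens (this article's Gemini 2.5 Flash figures).
const RATES = { input: 0.15, output: 0.6, thinking: 3.5 };

// Naive formula: thinking tokens are billed at the regular output rate.
function naiveCostUsd(u: TokenUsage): number {
  const outputLike = u.outputTokens + u.thinkingTokens;
  return (u.inputTokens * RATES.input + outputLike * RATES.output) / 1_000_000;
}

// Separated formula: thinking tokens get their own rate.
function separatedCostUsd(u: TokenUsage): number {
  return (
    (u.inputTokens * RATES.input +
      u.outputTokens * RATES.output +
      u.thinkingTokens * RATES.thinking) /
    1_000_000
  );
}

// Hypothetical call: short prompt, short answer, a lot of thinking.
const sampleUsage: TokenUsage = { inputTokens: 1000, outputTokens: 500, thinkingTokens: 2000 };
const naive = naiveCostUsd(sampleUsage);         // about $0.00165
const separated = separatedCostUsd(sampleUsage); // about $0.00745
const undercountFactor = separated / naive;      // roughly 4.5x
```

The factor depends entirely on how much of the output is thinking: with no thinking tokens the two formulas agree, and the gap widens as the thinking share grows.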

When to Skip a Layer

You don’t always need all three layers. Here’s when each is optional:

Skip Layer 1 (Agent Runtime) When:

Use Layer 1 when you need: persistence across requests, real-time WebSocket connections, scheduling, or millions of independent agent instances.

Skip Layer 2 (LLM Interface) When:

Use Layer 2 when you need: structured output (Zod schemas), tool calling, provider abstraction, or streaming to UI clients.

Skip Layer 3 (API Proxy) When:

Use Layer 3 when you need: multi-service cost attribution, daily spend limits, centralized API key management, or when you’ve been surprised by a bill even once.
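
As an illustration of what a daily spend limit in the proxy boils down to, here is a minimal sketch of the decision itself. The function name, the `BudgetDecision` shape, and the idea of passing a pre-call cost estimate are all assumptions for illustration. Where the running total is stored and how a denial is turned into an HTTP response are left to your proxy.

```typescript
interface BudgetDecision {
  allowed: boolean;
  remainingUsd: number;
  reason?: string;
}

// Decide whether a call may proceed, given today's spend and a daily cap.
// estimatedCostUsd is a pre-call guess; the real cost is only known afterwards.
function checkDailyBudget(
  spentTodayUsd: number,
  dailyLimitUsd: number,
  estimatedCostUsd = 0,
): BudgetDecision {
  const remainingUsd = Math.max(0, dailyLimitUsd - spentTodayUsd);
  if (spentTodayUsd + estimatedCostUsd > dailyLimitUsd) {
    return { allowed: false, remainingUsd, reason: "daily limit exceeded" };
  }
  return { allowed: true, remainingUsd };
}

// $4.20 spent of a $5.00 cap, next call estimated at $0.05: allowed
const withinBudget = checkDailyBudget(4.2, 5, 0.05);
// $4.99 spent of a $5.00 cap, same estimate: denied
const overBudget = checkDailyBudget(4.99, 5, 0.05);
```

Checking before the call rather than after is the design choice that matters: a limit that only fires once the spend is recorded will always overshoot by at least one call.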

The Progressive Adoption Path

Day 1:  Layer 2 only (AI SDK in a Worker)

Week 2: Layer 1 + Layer 2 (AI SDK in a Durable Object agent)

Month 1: All three layers (agent + AI SDK + proxy)

You don’t have to build all three on day one. Start with the AI SDK for type-safe LLM calls. Add the agent runtime when you need state. Add the proxy when you need cost control. Each layer snaps into place without requiring changes to the others.


Production Checklist

Before deploying agents with all three layers:

Layer 1 (Container)

Layer 2 (Brain)

Layer 3 (Wallet)


Advanced: The Cloudflare AI Gateway Bridge

For teams using Cloudflare’s AI Gateway, you can layer it between the AI SDK and your custom proxy for a hybrid approach:

App → Agents SDK → AI SDK → Custom Proxy → CF AI Gateway → Provider
         (state)    (LLM)     (cost)        (cache/rate)    (API)

import { createAiGateway } from "ai-gateway-provider";
import { createGoogleGenerativeAI } from "ai-gateway-provider/providers/google";

function createFullStackModel(env: Env, functionName: string) {
  // Layer 3a: Cloudflare AI Gateway (caching, rate limiting, fallbacks)
  const aigateway = createAiGateway({
    binding: env.AI.gateway("production-gateway"),
    options: { cacheTtl: 300 },
  });

  // Layer 3b: Google provider (routes through AI Gateway)
  const google = createGoogleGenerativeAI({
    apiKey: env.GEMINI_API_KEY,
  });

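  // Route the Google model through the AI Gateway.
  // Note: nothing in this snippet attaches the cost proxy yet, and functionName is
  // unused. Attribution headers such as X-Function still have to be added
  // separately, for example with a custom fetch as in Example 6.
  const gatewayModel = aigateway(google("gemini-2.5-flash"));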

  // The AI Gateway handles caching and rate limiting
  // Your proxy handles cost attribution and budget enforcement
  // The AI SDK handles structured output and tool calling
  // The Agents SDK handles state and lifecycle
  return gatewayModel;
}

This hybrid gives you:


Cost Reference

Quick reference for making model routing decisions:

| Model | Input ($/M tokens) | Output ($/M tokens) | Thinking ($/M tokens) | Best For |
| --- | --- | --- | --- | --- |
| Gemini 2.0 Flash | $0.10 | $0.40 | N/A | Classification, extraction, simple tasks |
| Gemini 2.5 Flash | $0.15 | $0.60 | $3.50 | General purpose, tool calling, content |
| Gemini 2.5 Pro | $1.25 | $10.00 | $10.00 | Complex reasoning, long context |
| Claude Sonnet 4 | $3.00 | $15.00 | N/A | Code generation, nuanced analysis |
| Claude Haiku 3.5 | $0.80 | $4.00 | N/A | Fast classification, extraction |
| GPT-4o | $2.50 | $10.00 | N/A | Multi-modal, general purpose |
| GPT-4o Mini | $0.15 | $0.60 | N/A | Budget-friendly general tasks |
| Workers AI (Llama 3) | Free (Cloudflare) | Free | N/A | Prototyping, non-critical tasks |

Key insight: The cost difference between Gemini 2.0 Flash ($0.10/M input) and Claude Sonnet 4 ($3.00/M input) is 30x. If you’re routing every task through the same expensive model, you’re burning money on classification tasks that a cheap model handles just as well. Multi-provider routing isn’t a nice-to-have — it’s a cost optimization multiplier.
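
A routing table can be as small as a lookup keyed by task kind. The sketch below copies input prices from the Cost Reference table; the task kinds and which model each one maps to are illustrative assumptions, not a benchmark-backed recommendation, and `routeModel` and `inputSavingsFactor` are hypothetical names.

```typescript
type TaskKind = "classification" | "extraction" | "content" | "reasoning" | "code";

interface ModelChoice {
  model: string;
  inputUsdPerMillion: number;
}

// Prices from the Cost Reference table; the task-to-model mapping is an assumption.
const ROUTES: Record<TaskKind, ModelChoice> = {
  classification: { model: "gemini-2.0-flash", inputUsdPerMillion: 0.1 },
  extraction: { model: "gemini-2.0-flash", inputUsdPerMillion: 0.1 },
  content: { model: "gemini-2.5-flash", inputUsdPerMillion: 0.15 },
  reasoning: { model: "gemini-2.5-pro", inputUsdPerMillion: 1.25 },
  code: { model: "claude-sonnet-4-20250514", inputUsdPerMillion: 3.0 },
};

// Pick the model assigned to a task kind.
function routeModel(kind: TaskKind): ModelChoice {
  return ROUTES[kind];
}

// How many times cheaper (input tokens only) the assigned route is
// compared with sending the same task to the most expensive route here.
function inputSavingsFactor(kind: TaskKind): number {
  return ROUTES.code.inputUsdPerMillion / ROUTES[kind].inputUsdPerMillion;
}

const classifier = routeModel("classification");      // gemini-2.0-flash
const savings = inputSavingsFactor("classification"); // 30, the 30x gap described above
```

Keeping the prices next to the routes means a pricing change is a one-line edit, and the same table can feed the proxy's cost formula.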


Summary

The three-layer architecture isn’t about adding complexity — it’s about preventing the complexity that inevitably emerges when state, LLM interaction, and cost control are tangled together.

The layers compose through two narrow interfaces:

  1. AIChatAgent.onChatMessage() calls AI SDK functions (Container → Brain)
  2. AI SDK’s fetch option routes through the proxy (Brain → Wallet)

Start with Layer 2 on day one. Add Layer 1 when you need state. Add Layer 3 before your first production deployment. Each layer is independently useful and independently replaceable. That’s the whole point.


