Production AI Anti-Patterns

Org Status: 🟡 Dormant Cloudflare: N/A Last Audited: 2026-04-28

Ten production anti-patterns that cost us $282/month in invisible LLM spend, buried bugs in 1,000-line god modules, and left 223 of 226 API endpoints wide open to malformed input. Every anti-pattern here comes from a real codebase running real traffic. Every fix has been shipped and measured.

This is not a theoretical taxonomy. It is a post-mortem.

What you will learn:

How manual JSON parsing from LLM responses creates a maintenance nightmare — and how generateObject() with Zod schemas eliminates it entirely
Why raw LLM calls without cost tracking led to a $47 spend in 5 days across two projects, and the proxy architecture that prevents it
The thinking token pricing blindspot that caused a 9x cost undercount ($1.14 tracked vs $10.19 billed)
Why god modules, missing input validation, and duplicated types compound into unmaintainable AI codebases
How manual state machines for agents fail at recovery — and what stateful agent runtimes (CF Agents SDK, Temporal, Inngest) provide instead
The scattered API key problem that required deleting 14 keys across 4 workers in a single emergency session
Why console.log is not logging, and how Pino in browser mode gives you structured observability on edge runtimes with zero infrastructure
How uncontrolled cron spending generated 193 blog posts in 5 days with no human review, no cost attribution, and no kill switch

The Problem
Core Concepts
Anti-Pattern 1: Manual JSON Parsing from LLM Responses
Anti-Pattern 2: Raw LLM Calls Without Cost Tracking
Anti-Pattern 3: Thinking Token Pricing Blindspot
Anti-Pattern 4: God Modules
Anti-Pattern 5: No Input Validation on API Endpoints
Anti-Pattern 6: Duplicated Type Definitions
Anti-Pattern 7: Manual State Machines for Agents
Anti-Pattern 8: Scattered API Keys
Anti-Pattern 9: No Structured Logging
Anti-Pattern 10: Uncontrolled Cron Spending
Small Examples
Comparisons
The Anti-Pattern Summary Table
Putting It All Together: The Modern Production AI Stack
References

The Problem

You built an LLM-powered application. It works in development. You deploy it. Within a week, you discover:

You cannot explain your AI spend. The Gemini dashboard says $47 in 5 days across two projects. The one project that tracked spend at all recorded $1.14 against $10.19 billed. The delta is not a rounding error: it is a 9x undercount caused by not pricing thinking tokens.
You cannot trust your outputs. Half your LLM responses are wrapped in markdown code fences (```json … ```). Your regex-based cleanup handles 4 of the 7 variations providers emit. The other 3 crash silently and return `undefined` to the database.
You cannot find the bug. The relevant logic lives in a 1,126-line file that handles SEO analysis, content scoring, keyword extraction, SERP parsing, and site audits. When something breaks, you read the whole file.
You cannot stop the bleeding. Nine crons run every 5-30 minutes, each triggering LLM calls. There is no budget awareness, no kill switch, no human in the loop. The pipeline generated 193 blog posts in 5 days. Nobody reviewed them. Nobody knew they existed.

These are not hypothetical risks. These are the bugs we shipped, the money we burned, and the emergency sessions we ran to fix them. The core issue is always the same: LLM applications have different failure modes than traditional software, and traditional engineering practices do not cover them.

Traditional software is deterministic. You call a function, you get a return value, you check the type. LLM software is probabilistic. You send a prompt, you get a response that might be JSON, might have the right fields, might cost $0.003 or $0.30 depending on whether the model decided to “think” about it. The output shape, the cost, and the latency are all variable — and if your code assumes they are fixed, you will get surprised in production.

The modern fix is not one tool. It is a stack:

Structured output (Vercel AI SDK + Zod) replaces manual parsing
Cost proxy (centralized gateway) replaces raw provider calls
Token-tier metering replaces naive token counting
Single-responsibility modules replace god files
Schema validation at boundaries replaces request.json() trust
Shared type packages replace copy-paste interfaces
Stateful agent runtimes replace hand-rolled state machines
Centralized key management replaces scattered secrets
Structured logging (Pino browser mode) replaces console.log
Budget-aware pipelines replace unbounded crons

Core Concepts

Before diving into the anti-patterns, here are three foundational concepts that underpin every fix.

Structured Output

The idea that an LLM call should return a typed object, not a string you have to parse.

import { generateObject } from "ai";
import { google } from "@ai-sdk/google";
import { z } from "zod";

// The schema IS the specification
const ArticleSchema = z.object({
  title: z.string().describe("SEO-optimized article title"),
  slug: z.string().describe("URL-safe slug"),
  sections: z
    .array(
      z.object({
        heading: z.string(),
        content: z.string(),
        wordCount: z.number(),
      })
    )
    .describe("Article sections in order"),
  seoMeta: z.object({
    description: z.string().max(160),
    keywords: z.array(z.string()).max(10),
  }),
});

type Article = z.infer<typeof ArticleSchema>;

const { object: article } = await generateObject({
  model: google("gemini-2.5-flash"),
  schema: ArticleSchema,
  prompt: `Write an article about TypeScript monorepo patterns`,
});

// article is fully typed as Article
// No parsing. No try/catch. No regex cleanup.
console.log(article.title); // string, guaranteed
console.log(article.sections[0].wordCount); // number, guaranteed

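Key insight: When you use generateObject(), the schema is sent to the model as a response format constraint, and the model generates tokens that conform to it. The SDK then validates the response against the schema before returning. If the response cannot be parsed or fails validation, the call throws a typed error (NoObjectGeneratedError in recent AI SDK versions) instead of returning malformed data; the built-in maxRetries setting covers failed API calls, not schema failures. A malformed response never reaches your application logic disguised as a valid object.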

Cost Attribution

The idea that every LLM call should record what it cost, who triggered it, and why.

interface CostRecord {
  // Identity
  project_id: string; // "pages-plus"
  function: string; // "article-write"
  service: string; // "gemini"
  model: string; // "gemini-2.5-flash"

  // Token tiers -- each priced differently
  input_tokens: number;
  output_tokens: number;
  thinking_tokens: number; // $3.50/M for Gemini Flash Thinking
  cache_read_tokens: number; // $0.0375/M
  cache_write_tokens: number;

  // Cost
  cost_usd: number; // Calculated from token tiers + model pricing

  // Context
  tags: string[]; // ["brand:llc-tax", "trigger:cron", "batch:2026-03-15"]
  timestamp: string; // ISO 8601
}

Key insight: The cost of an LLM call is not (input_tokens + output_tokens) * price_per_token. Models with thinking/reasoning modes have 3-5 different token tiers, each with different prices. If you only track two tiers, you will undercount by 2-10x.
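To make the attribution half of this concrete, here is a minimal sketch of rolling cost records up by tag. The `CostEntry` shape is a hypothetical, trimmed-down subset of the `CostRecord` interface above, and `spendByTag` is an illustrative helper name rather than part of any library.

```typescript
// Hypothetical, trimmed-down subset of CostRecord for illustration.
interface CostEntry {
  function: string;
  cost_usd: number;
  tags: string[];
}

// Sum cost per tag so spend can be traced to a brand, trigger, or batch.
function spendByTag(entries: CostEntry[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const entry of entries) {
    for (const tag of entry.tags) {
      totals.set(tag, (totals.get(tag) ?? 0) + entry.cost_usd);
    }
  }
  return totals;
}

const sampleEntries: CostEntry[] = [
  { function: "article-write", cost_usd: 0.25, tags: ["brand:llc-tax", "trigger:cron"] },
  { function: "seo-analysis", cost_usd: 0.5, tags: ["brand:llc-tax", "trigger:manual"] },
];

const tagTotals = spendByTag(sampleEntries);
console.log(tagTotals.get("brand:llc-tax")); // 0.75
console.log(tagTotals.get("trigger:cron")); // 0.25
```

With tags like `trigger:cron` on every record, the question "how much did unattended jobs spend?" becomes a lookup instead of an investigation.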

Boundary Validation

The idea that every system boundary — HTTP endpoints, queue consumers, function arguments — should validate its input against a schema.

import { z } from "zod";

const PublishRequestSchema = z.object({
  brand_slug: z
    .string()
    .regex(/^[a-z0-9-]+$/)
    .min(1)
    .max(64),
  content_id: z.string().uuid(),
  publish_to: z.enum(["blog", "social", "newsletter"]),
  schedule_at: z.string().datetime().optional(),
  metadata: z
    .record(z.string(), z.unknown())
    .optional()
    .default({}),
});

type PublishRequest = z.infer<typeof PublishRequestSchema>;

// In the handler
app.post("/v1/publish", async (c) => {
  const result = PublishRequestSchema.safeParse(await c.req.json());

  if (!result.success) {
    return c.json(
      {
        error: "validation_failed",
        issues: result.error.issues.map((i) => ({
          path: i.path.join("."),
          message: i.message,
        })),
      },
      400
    );
  }

  // result.data is fully typed as PublishRequest
  // No casting, no `as any`, no runtime surprises
  return c.json(await publishContent(result.data, c.env));
});

Key insight: TypeScript types disappear at runtime. When your Worker receives a JSON body from the internet, TypeScript cannot guarantee its shape. Zod bridges compile-time and runtime: one schema gives you both the TypeScript type (via z.infer) and the runtime validator (via .parse()/.safeParse()). Define once, validate everywhere.
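The gap between compile time and runtime can be shown without any library. The sketch below uses a hypothetical, cut-down `PublishInput` type and a hand-written guard, `isPublishInput`, standing in for a Zod schema purely to show what an `as` assertion does and does not check.

```typescript
// Hypothetical, cut-down request type for illustration.
interface PublishInput {
  brand_slug: string;
  publish_to: "blog" | "social" | "newsletter";
}

const rawBody = '{"brand_slug": 42, "publish_to": "fax"}';

// The assertion compiles, but nothing is checked at runtime.
const trusted = JSON.parse(rawBody) as PublishInput;
console.log(typeof (trusted.brand_slug as unknown)); // "number", despite the declared type

// A runtime guard does the work the type annotation cannot.
function isPublishInput(value: unknown): value is PublishInput {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.brand_slug === "string" &&
    /^[a-z0-9-]+$/.test(v.brand_slug) &&
    (v.publish_to === "blog" || v.publish_to === "social" || v.publish_to === "newsletter")
  );
}

console.log(isPublishInput(JSON.parse(rawBody))); // false
console.log(isPublishInput({ brand_slug: "llc-tax", publish_to: "blog" })); // true
```

A schema library generates this kind of guard from a single definition, which is why the hand-written version is only useful as an illustration.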

Anti-Pattern 1: Manual JSON Parsing from LLM Responses

The Problem

You ask an LLM for structured data. It returns a string. Sometimes the string is valid JSON. Sometimes it is JSON wrapped in markdown code fences. Sometimes it has a preamble like “Here’s the JSON you requested:” before the actual object. Sometimes the keys are in a different order. Sometimes there are extra fields. Sometimes there are missing fields.

Your code looks like this:

// Found in 10+ files across 4 production projects
async function generateArticle(prompt: string, env: Env): Promise<Article> {
  const response = await fetch(
    "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent",
    {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "x-goog-api-key": env.GEMINI_API_KEY,
      },
      body: JSON.stringify({
        contents: [{ parts: [{ text: prompt }] }],
        generationConfig: {
          responseMimeType: "application/json",
        },
      }),
    }
  );

  const data = await response.json();
  const text = data.candidates?.[0]?.content?.parts?.[0]?.text;

  if (!text) {
    throw new Error("No response from Gemini");
  }

  // The cleanup gauntlet
  let cleaned = text
    .replace(/```json\n?/g, "")
    .replace(/```\n?/g, "")
    .replace(/^\s*Here.*?:\s*/i, "")
    .trim();

  try {
    const parsed = JSON.parse(cleaned);

    // Manual field validation
    if (!parsed.title || typeof parsed.title !== "string") {
      throw new Error("Missing or invalid title");
    }
    if (!Array.isArray(parsed.sections)) {
      throw new Error("Missing or invalid sections");
    }
    // 20 more lines of manual checks...

    return parsed as Article;
  } catch (err) {
    console.log("Failed to parse LLM response:", cleaned.substring(0, 200));
    throw new Error(`JSON parse failed: ${err.message}`);
  }
}

This pattern has five distinct failure modes:

Regex misses a variant. The model outputs ```JSON (capital J) or ```json5 or wraps the response in <json> tags. Your regex does not handle it. JSON.parse throws. The operation fails silently or crashes.
Field validation is incomplete. You check for title and sections but forget to check that each section has a heading and content. The partial object propagates downstream and crashes in a different function with an unhelpful error.
Type assertion lies. parsed as Article tells TypeScript “trust me, this is an Article.” TypeScript obeys. At runtime, it might be missing three fields. The type assertion bypasses every safety net TypeScript provides.
No retry logic. If the model returns malformed JSON once, this code throws. There is no retry with a different prompt, no retry with a stricter system message, no fallback to a different model.
No cost tracking. The Gemini response includes token counts in usageMetadata, but this code never reads them, so you have no idea what the call cost. Multiply by 9 crons running every 5-30 minutes, and you get the $47-in-5-days disaster.
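The first failure mode is easy to reproduce. The sketch below reuses the same three replacements as the cleanup code above and runs them against a few response variants. The helper names (`naiveCleanup`, `parses`) are illustrative, and the fence marker is built programmatically only so the example does not contain literal fences.

```typescript
// Build the fence marker programmatically so this example contains no literal fences.
const FENCE = "`".repeat(3);

// Same three replacements as the cleanup gauntlet above.
function naiveCleanup(text: string): string {
  return text
    .replace(new RegExp(FENCE + "json\\n?", "g"), "")
    .replace(new RegExp(FENCE + "\\n?", "g"), "")
    .replace(/^\s*Here.*?:\s*/i, "")
    .trim();
}

function parses(text: string): boolean {
  try {
    JSON.parse(naiveCleanup(text));
    return true;
  } catch {
    return false;
  }
}

const payload = '{"title":"ok"}';

console.log(parses(FENCE + "json\n" + payload + "\n" + FENCE)); // true: the variant the regex expects
console.log(parses(FENCE + "JSON\n" + payload + "\n" + FENCE)); // false: uppercase tag survives cleanup
console.log(parses("<json>" + payload + "</json>")); // false: tag wrapper is not handled
console.log(parses("Sure! " + payload)); // false: preamble that does not start with "Here"

// Worse than a crash: a fence inside a string value is silently removed.
const mangled = JSON.parse(naiveCleanup('{"content":"use ' + FENCE + 'ts blocks"}'));
console.log(mangled.content); // "use ts blocks"
```

The last case is the dangerous one: the call succeeds, the data is wrong, and nothing is logged.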

The Fix: `generateObject()` with Zod

import { generateObject } from "ai";
import { google } from "@ai-sdk/google";
import { z } from "zod";

const ArticleSchema = z.object({
  title: z.string().min(10).max(200),
  slug: z
    .string()
    .regex(/^[a-z0-9-]+$/)
    .max(100),
  sections: z
    .array(
      z.object({
        heading: z.string().min(1),
        content: z.string().min(50),
        wordCount: z.number().int().positive(),
      })
    )
    .min(3)
    .max(20),
  seoMeta: z.object({
    description: z.string().max(160),
    keywords: z.array(z.string()).min(1).max(10),
  }),
  readingTimeMinutes: z.number().positive(),
});

type Article = z.infer<typeof ArticleSchema>;

async function generateArticle(prompt: string): Promise<Article> {
  const { object, usage } = await generateObject({
    model: google("gemini-2.5-flash"),
    schema: ArticleSchema,
    prompt,
  });

  // object is Article -- fully validated, fully typed
  // usage carries token counts for cost tracking
  // (promptTokens/completionTokens in AI SDK 4; inputTokens/outputTokens from AI SDK 5)
  return object;
}

What changed:

Before (manual)	After (generateObject)
Raw `fetch()` to provider API	Provider-agnostic SDK call
Regex cleanup of markdown fences	No cleanup needed — response is never a raw string
`JSON.parse()` in try/catch	SDK handles parsing and validation
Manual field-by-field validation	Zod schema validates structure, types, and constraints
`as Article` type assertion	`z.infer<typeof ArticleSchema>` — type derived from schema
No retry on malformed response	Validation failure throws a typed error, so retry policy is explicit and lives in one place
No token counts available	`usage` object with prompt/completion token counts
40+ lines of parsing boilerplate	0 lines of parsing code

How It Works Under the Hood

When you call generateObject(), the Vercel AI SDK:

Converts your Zod schema to a JSON Schema and sends it to the model as a response_format constraint (for models that support structured output) or as a function call schema (for models that support function calling).
The model generates tokens constrained to the schema. This is not post-processing. The model’s token sampling is guided by the schema structure, making it far more reliable than asking for JSON in the prompt.
Validates the response against your Zod schema. If validation fails (e.g., a string is too long, a number is negative), the SDK throws a typed error (NoObjectGeneratedError in recent versions) that carries the raw text, so the caller can retry, repair, or fail explicitly.
Returns a fully typed object. The return type is z.infer<typeof YourSchema> — you never touch JSON.parse or as.

// How each failure mode is handled with generateObject():
//
// 1. Model returns markdown-wrapped JSON    -> structured output mode avoids this
// 2. Model returns extra fields             -> Zod strips them (.strict() to reject)
// 3. Model returns wrong types              -> Zod validation fails, SDK throws a typed error
// 4. Model returns partial object           -> Zod validation fails, SDK throws a typed error
// 5. Model returns nothing                  -> SDK throws with clear error
//
// You write no parsing code. You catch one error type and decide whether to retry.
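Where explicit control over retries on validation failure is wanted (a bounded number of attempts, with the validation error fed back into the next prompt), the loop can be sketched as follows. This is a library-agnostic illustration: `generateValidated` and `requireTitle` are hypothetical names, `generate` stands in for whatever call produces the model output, and `validate` is assumed to throw on invalid input the way a Zod schema's `.parse()` does.

```typescript
// Hypothetical helper: call `generate`, validate the result, and retry with the
// validation error as feedback. `validate` must throw on invalid input.
async function generateValidated<T>(
  generate: (feedback: string | null) => Promise<unknown>,
  validate: (raw: unknown) => T,
  maxAttempts: number = 3
): Promise<T> {
  let feedback: string | null = null;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await generate(feedback);
    try {
      return validate(raw);
    } catch (err) {
      // Keep the reason so the next attempt can include it in the prompt.
      feedback = err instanceof Error ? err.message : String(err);
    }
  }
  throw new Error(`Validation failed after ${maxAttempts} attempts: ${feedback}`);
}

// Stand-in validator for illustration; a real one would be schema.parse.
function requireTitle(raw: unknown): { title: string } {
  const title = (raw as { title?: unknown } | null)?.title;
  if (typeof title !== "string") throw new Error("title must be a string");
  return { title };
}
```

Bounding the attempts matters for cost as much as for correctness: every retry is another billed call, so an unbounded loop turns a schema bug into a spending bug.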

When You Still Need Manual Parsing

There are two legitimate cases:

Streaming partial objects. If you need to display results as they stream in, streamObject() gives you partial objects during generation. But the final object is still validated.
Legacy provider APIs. If you are using a provider the AI SDK does not support, you are back to raw fetch(). But even then, use Zod to validate after parsing — do not use as.

// If you MUST parse manually, at least validate with Zod
const raw = JSON.parse(responseText);
const result = ArticleSchema.safeParse(raw);

if (!result.success) {
  console.error("Validation failed:", result.error.issues);
  // Retry, fall back, or fail explicitly -- never pass invalid data downstream
  throw new Error(`LLM response failed validation: ${result.error.message}`);
}

// result.data is Article -- safe to use
return result.data;

Anti-Pattern 2: Raw LLM Calls Without Cost Tracking

The Problem

Every LLM call costs money. The cost varies by model, by token count, by whether the model used “thinking” mode, by whether prompt caching kicked in. When you call provider APIs directly from your application code, you have no record of what anything cost.

// The pattern that burned $47 in 5 days
// Found in pages-plus: 6 direct Gemini calls, 9 crons, zero cost tracking

async function generateBlogPost(keyword: string, env: Env): Promise<BlogPost> {
  // Direct call to Gemini -- no proxy, no metering, no attribution
  const response = await fetch(
    `https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent`,
    {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "x-goog-api-key": env.GEMINI_API_KEY,
      },
      body: JSON.stringify({
        contents: [
          {
            parts: [
              {
                text: `Write a comprehensive blog post about: ${keyword}`,
              },
            ],
          },
        ],
      }),
    }
  );

  const data = await response.json();
  // No cost tracking. No token counting. No attribution.
  // This call cost somewhere between $0.01 and $0.30.
  // Nobody will ever know.
  return parseBlogPost(data);
}

War Story: The $47-in-5-Days Gemini Disaster

Here is what happened:

pages-plus had 9 cron triggers in its wrangler.jsonc. They ran every 5-30 minutes.
Each cron triggered content generation: blog outlines, blog drafts, blog edits, anchor text generation, meta description generation, internal link suggestions.
Each generation called Gemini directly using env.GEMINI_API_KEY.
There was no api_calls_ledger table. No cost tracking. No daily limit.
In 5 days, the pipeline generated 193 blog posts. No human reviewed them. No human knew they existed.
The Gemini billing dashboard showed $37 from pages-plus alone.
A second project (aso-mrr) contributed another $10.19.
Total: $47 in 5 days. $282/month run rate.
Discovered by manually checking the Google Cloud billing dashboard during a routine ops session.
Emergency response: All 13 crons across both projects disabled the same day. 14 API keys deleted from worker configurations. New standard written. 12 issues created to implement metering before any cron could be re-enabled.

The Fix: Centralized LLM Proxy with Per-Call Metering

The architecture that prevents this:

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  pages-plus  │     │scalable-media│     │   aso-mrr    │
│  (no API key)│     │ (no API key) │     │ (no API key) │
└──────┬───────┘     └──────┬───────┘     └──────┬───────┘
       │                    │                    │
       │   X-Project-Id     │   X-Function       │   X-Tags
       │   X-Api-Key        │   X-Api-Key        │   X-Api-Key
       │                    │                    │
       ▼                    ▼                    ▼
┌─────────────────────────────────────────────────────────┐
│                      API Proxy                          │
│                                                         │
│  ┌─────────────┐  ┌──────────────┐  ┌───────────────┐  │
│  │ Auth +      │  │ Cost Calc    │  │ Daily Limit   │  │
│  │ Routing     │  │ (all tiers)  │  │ Enforcement   │  │
│  └─────────────┘  └──────────────┘  └───────────────┘  │
│                                                         │
│  ┌─────────────┐  ┌──────────────┐  ┌───────────────┐  │
│  │ Provider    │  │ Ledger       │  │ Cache         │  │
│  │ Keys        │  │ (D1/Postgres)│  │ (optional)    │  │
│  └─────────────┘  └──────────────┘  └───────────────┘  │
└──────────────────────────┬──────────────────────────────┘
                           │
                           ▼
            ┌──────────────────────────┐
            │  Provider APIs           │
            │  (Gemini, OpenAI, etc.)  │
            └──────────────────────────┘

The proxy worker:

// api-proxy/src/routes/gemini.ts
import { Hono } from "hono";
import { z } from "zod";

const app = new Hono<{ Bindings: Env }>();

const GeminiProxySchema = z.object({
  model: z.string(),
  contents: z.array(z.unknown()),
  generationConfig: z.record(z.string(), z.unknown()).optional(),
});

app.post("/v1/gemini/generateContent", async (c) => {
  const projectId = c.req.header("X-Project-Id");
  const apiKey = c.req.header("X-Api-Key");
  const functionTag = c.req.header("X-Function") ?? "unknown";
  const tags = c.req.header("X-Tags")?.split(",") ?? [];

  // 1. Authenticate the calling project
  const project = await authenticateProject(projectId, apiKey, c.env.DB);
  if (!project) {
    return c.json({ error: "unauthorized" }, 401);
  }

  // 2. Check daily spend limit BEFORE making the call
  const todaySpend = await getDailySpend(projectId, "gemini", c.env.DB);
  if (todaySpend >= project.daily_limit_usd) {
    return c.json(
      {
        error: "daily_limit_exceeded",
        spend: todaySpend,
        limit: project.daily_limit_usd,
      },
      429
    );
  }

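  // 3. Parse and validate the request; reject malformed bodies with a 400
  const parsed = GeminiProxySchema.safeParse(await c.req.json());
  if (!parsed.success) {
    return c.json({ error: "validation_failed", issues: parsed.error.issues }, 400);
  }
  const body = parsed.data;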

  // 4. Forward to Gemini using the proxy's API key (not the caller's)
  const geminiResponse = await fetch(
    `https://generativelanguage.googleapis.com/v1beta/models/${body.model}:generateContent`,
    {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "x-goog-api-key": c.env.GEMINI_API_KEY, // Only the proxy has the key
      },
      body: JSON.stringify({
        contents: body.contents,
        generationConfig: body.generationConfig,
      }),
    }
  );

  const result = await geminiResponse.json();

  // 5. Extract token counts from ALL tiers
  const usage = result.usageMetadata;
  const cost = calculateGeminiCost(body.model, {
    input: usage?.promptTokenCount ?? 0,
    output: usage?.candidatesTokenCount ?? 0,
    thinking: usage?.thoughtsTokenCount ?? 0,
    cacheRead: usage?.cachedContentTokenCount ?? 0,
  });

  // 6. Record in the ledger -- every call, every tier, every tag
  await c.env.DB.prepare(
    `INSERT INTO api_calls_ledger
     (id, project_id, service, model, function, tags,
      input_tokens, output_tokens, thinking_tokens, cache_read_tokens,
      cost_usd, created_at)
     VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)`
  )
    .bind(
      crypto.randomUUID(),
      projectId,
      "gemini",
      body.model,
      functionTag,
      JSON.stringify(tags),
      usage?.promptTokenCount ?? 0,
      usage?.candidatesTokenCount ?? 0,
      usage?.thoughtsTokenCount ?? 0,
      usage?.cachedContentTokenCount ?? 0,
      cost,
      new Date().toISOString()
    )
    .run();

  // 7. Return the response to the caller
  return c.json(result);
});

function calculateGeminiCost(
  model: string,
  tokens: {
    input: number;
    output: number;
    thinking: number;
    cacheRead: number;
  }
): number {
  // Gemini 2.5 Flash pricing (as of 2026-03)
  const pricing: Record<string, { input: number; output: number; thinking: number; cacheRead: number }> = {
    "gemini-2.5-flash": {
      input: 0.15, // per 1M tokens
      output: 0.60,
      thinking: 3.50,
      cacheRead: 0.0375,
    },
    "gemini-2.5-pro": {
      input: 1.25,
      output: 10.0,
      thinking: 10.0,
      cacheRead: 0.3125,
    },
  };

  const p = pricing[model] ?? pricing["gemini-2.5-flash"];

  return (
    (tokens.input * p.input +
      tokens.output * p.output +
      tokens.thinking * p.thinking +
      tokens.cacheRead * p.cacheRead) /
    1_000_000
  );
}
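The daily-limit check in step 2 relies on a `getDailySpend` helper that is not shown here. As a rough illustration of what it has to compute, the sketch below sums today's spend from an in-memory array. The `SpendRow` shape and the function names are hypothetical; a real proxy would run the equivalent aggregate query against the ledger table instead.

```typescript
// Hypothetical in-memory stand-in for the ledger; a real proxy would query its database.
interface SpendRow {
  project_id: string;
  service: string;
  cost_usd: number;
  created_at: string; // ISO 8601, UTC
}

// Sum today's (UTC) spend for one project and service.
function dailySpend(rows: SpendRow[], projectId: string, service: string, now: Date): number {
  const today = now.toISOString().slice(0, 10); // "YYYY-MM-DD"
  return rows
    .filter(
      (r) =>
        r.project_id === projectId &&
        r.service === service &&
        r.created_at.slice(0, 10) === today
    )
    .reduce((sum, r) => sum + r.cost_usd, 0);
}

// Mirrors the check in step 2: block once spend reaches the limit.
function isOverLimit(spend: number, dailyLimitUsd: number): boolean {
  return spend >= dailyLimitUsd;
}

const ledgerRows: SpendRow[] = [
  { project_id: "pages-plus", service: "gemini", cost_usd: 0.5, created_at: "2026-03-15T08:00:00.000Z" },
  { project_id: "pages-plus", service: "gemini", cost_usd: 0.25, created_at: "2026-03-15T09:30:00.000Z" },
  { project_id: "pages-plus", service: "gemini", cost_usd: 4, created_at: "2026-03-14T23:59:00.000Z" },
  { project_id: "aso-mrr", service: "gemini", cost_usd: 1, created_at: "2026-03-15T10:00:00.000Z" },
];

const nowUtc = new Date("2026-03-15T12:00:00.000Z");
const spentToday = dailySpend(ledgerRows, "pages-plus", "gemini", nowUtc);
console.log(spentToday); // 0.75
console.log(isOverLimit(spentToday, 0.5)); // true
```

Note that a check-then-call limit is not atomic: concurrent requests can each pass the check before any of them is recorded, so the limit should be set with some headroom for in-flight calls.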

The calling worker now looks like this:

// pages-plus/src/domain/content.ts
// No GEMINI_API_KEY. No direct provider calls. No cost math.

async function generateBlogPost(keyword: string, env: Env): Promise<BlogPost | null> {
  const response = await fetch(`${env.API_PROXY_URL}/v1/gemini/generateContent`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "X-Project-Id": "pages-plus",
      "X-Api-Key": env.API_PROXY_KEY,
      "X-Function": "blog-post-generate",
      "X-Tags": `brand:${keyword},trigger:cron`,
    },
    body: JSON.stringify({
      model: "gemini-2.5-flash",
      contents: [
        { parts: [{ text: `Write a blog post about: ${keyword}` }] },
      ],
    }),
  });

  if (response.status === 429) {
    // Daily limit hit -- stop gracefully, don't retry
    console.log(`Daily spend limit reached for pages-plus`);
    return null;
  }

  return parseBlogPost(await response.json());
}

Available Gateway Options

You do not have to build this yourself. Several production-ready options exist:

Cloudflare AI Gateway — Free tier, sits between your app and provider APIs. Provides analytics, caching, rate limiting, and logging. One line of code to add. Best if you are already on Cloudflare.

LiteLLM Proxy — Open-source Python proxy supporting 100+ LLM providers with OpenAI-compatible API. Cost tracking per key/user/team. Tag-based budget management. Best if you want self-hosted control with multi-provider support.

Custom proxy — Build your own (as shown above) when you need specific attribution logic, integration with your existing database, or edge-native deployment. Best when existing gateways do not match your metering requirements.

Anti-Pattern 3: Thinking Token Pricing Blindspot

The Problem

Modern LLMs have multiple token tiers with wildly different prices. When you track cost as (input + output) * price, you are using a formula from 2023. The real formula in 2025+ has 3-5 tiers:

Token Tier	Gemini 2.5 Flash Price	What It Is
Input	$0.15/M	Prompt tokens sent to the model
Output	$0.60/M	Generated response tokens
Thinking	$3.50/M	Internal reasoning tokens (not shown to user)
Cache Read	$0.0375/M	Tokens served from prompt cache
Cache Write	$0.15/M	Tokens written to prompt cache

Thinking tokens cost 5.8x more than output tokens. And for complex tasks (SEO analysis, content planning, multi-step reasoning), thinking tokens can outnumber output tokens 3-to-1.

War Story: The 9x Undercount

aso-mrr tracked its Gemini costs in an internal ledger.
The cost formula was: (input_tokens * 0.15 + output_tokens * 0.60) / 1_000_000
The formula did not include thinking tokens.
Over 5 days, the internal ledger showed $1.14 in total spend.
The Gemini billing dashboard showed $10.19.
That is a 9x undercount.
Root cause: Gemini Flash Thinking charges $3.50/M for thinking tokens. Those tokens were being generated on every call but not counted in the cost formula. For reasoning-heavy tasks like app store optimization analysis, thinking tokens dominated the bill.
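The figures in this story also let you estimate how many thinking tokens went unpriced. The arithmetic below is a back-of-the-envelope sketch that assumes the entire gap between the ledger and the bill is thinking tokens at $3.50/M, which is what the root cause suggests but which the story does not break down further.

```typescript
// Figures from the war story above.
const trackedUsd = 1.14;
const billedUsd = 10.19;
const thinkingPricePerMillion = 3.5;

// Assumption: the whole gap is unpriced thinking tokens.
const gapUsd = billedUsd - trackedUsd;
const undercountFactor = billedUsd / trackedUsd;
const impliedThinkingTokens = (gapUsd / thinkingPricePerMillion) * 1_000_000;

console.log(gapUsd.toFixed(2)); // "9.05"
console.log(undercountFactor.toFixed(1)); // "8.9"
console.log(Math.round(impliedThinkingTokens / 1000) + "k tokens"); // "2586k tokens"
```

Under that assumption, roughly 2.6 million thinking tokens were generated in 5 days without appearing anywhere in the internal ledger.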

The Fix: Track All Token Tiers

interface TokenUsage {
  input: number;
  output: number;
  thinking: number;
  cacheRead: number;
  cacheWrite: number;
}

interface ModelPricing {
  input: number; // per 1M tokens
  output: number;
  thinking: number;
  cacheRead: number;
  cacheWrite: number;
}

// Pricing table -- update when providers change prices
const MODEL_PRICING: Record<string, ModelPricing> = {
  "gemini-2.5-flash": {
    input: 0.15,
    output: 0.60,
    thinking: 3.50,
    cacheRead: 0.0375,
    cacheWrite: 0.15,
  },
  "gemini-2.5-pro": {
    input: 1.25,
    output: 10.0,
    thinking: 10.0,
    cacheRead: 0.3125,
    cacheWrite: 1.25,
  },
  "claude-sonnet-4-20250514": {
    input: 3.0,
    output: 15.0,
    thinking: 0, // Claude bills extended-thinking tokens as output tokens; they are already counted in output, so 0 here avoids double counting
    cacheRead: 0.30,
    cacheWrite: 3.75,
  },
  "gpt-4o": {
    input: 2.50,
    output: 10.0,
    thinking: 0, // GPT-4o does not have a separate thinking tier
    cacheRead: 1.25,
    cacheWrite: 2.50,
  },
};

function calculateCost(model: string, usage: TokenUsage): number {
  const pricing = MODEL_PRICING[model];
  if (!pricing) {
    console.warn(`Unknown model pricing: ${model}, using gemini-2.5-flash as fallback`);
    return calculateCost("gemini-2.5-flash", usage);
  }

  return (
    (usage.input * pricing.input +
      usage.output * pricing.output +
      usage.thinking * pricing.thinking +
      usage.cacheRead * pricing.cacheRead +
      usage.cacheWrite * pricing.cacheWrite) /
    1_000_000
  );
}

// Extract token counts from Gemini response
function extractGeminiUsage(response: GeminiResponse): TokenUsage {
  const meta = response.usageMetadata;
  return {
    input: meta?.promptTokenCount ?? 0,
    output: meta?.candidatesTokenCount ?? 0,
    thinking: meta?.thoughtsTokenCount ?? 0, // THIS IS THE ONE PEOPLE MISS
    cacheRead: meta?.cachedContentTokenCount ?? 0,
    cacheWrite: 0, // Gemini does not report cache writes in response
  };
}

// Example: how omitting thinking tokens undercounts a single call
const usage: TokenUsage = {
  input: 5000,
  output: 2000,
  thinking: 8000, // 4x the output -- typical for reasoning tasks
  cacheRead: 0,
  cacheWrite: 0,
};

const wrongCost = (usage.input * 0.15 + usage.output * 0.60) / 1_000_000;
// $0.00195 -- what the broken formula reported

const correctCost = calculateCost("gemini-2.5-flash", usage);
// $0.02995 -- about 15.4x higher

// Over hundreds of calls per day, this delta becomes tens of dollars

The Ledger Schema

CREATE TABLE api_calls_ledger (
  id TEXT PRIMARY KEY,
  project_id TEXT NOT NULL,
  service TEXT NOT NULL,        -- 'gemini', 'openai', 'anthropic'
  model TEXT NOT NULL,          -- 'gemini-2.5-flash'
  function TEXT NOT NULL,       -- 'article-write', 'seo-analysis'
  tags TEXT DEFAULT '[]',       -- JSON array: ['brand:llc-tax', 'trigger:cron']

  -- Token tiers -- NEVER lump these together
  input_tokens INTEGER DEFAULT 0,
  output_tokens INTEGER DEFAULT 0,
  thinking_tokens INTEGER DEFAULT 0,
  cache_read_tokens INTEGER DEFAULT 0,
  cache_write_tokens INTEGER DEFAULT 0,

  -- Calculated cost
  cost_usd REAL NOT NULL,

  -- Metadata
  latency_ms INTEGER,
  status_code INTEGER,
  created_at TEXT NOT NULL
);

-- Indexes for querying

CREATE INDEX idx_ledger_project_date ON api_calls_ledger(project_id, created_at);
CREATE INDEX idx_ledger_function ON api_calls_ledger(function, created_at);
CREATE INDEX idx_ledger_service ON api_calls_ledger(service, created_at);

Monthly Reconciliation Query

-- Compare tracked costs to provider bill
SELECT
  service,
  model,
  COUNT(*) as calls,
  SUM(input_tokens) as total_input,
  SUM(output_tokens) as total_output,
  SUM(thinking_tokens) as total_thinking,
  ROUND(SUM(cost_usd), 2) as tracked_cost,
  -- Compare this to your provider's billing dashboard
  -- If difference > 20%, the pricing formula is wrong
  strftime('%Y-%m', created_at) as month
FROM api_calls_ledger
GROUP BY service, model, month
ORDER BY tracked_cost DESC;
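The 20% rule in the query comment can be turned into an automated check. The sketch below is one possible shape for it; `reconcile` and the `Reconciliation` type are hypothetical names, and the billed figure is assumed to be entered by hand from the provider's dashboard.

```typescript
interface Reconciliation {
  driftRatio: number; // |billed - tracked| / billed
  needsReview: boolean;
}

// Flag a month when tracked cost differs from the provider bill by more than `tolerance`.
function reconcile(trackedUsd: number, billedUsd: number, tolerance: number = 0.2): Reconciliation {
  if (billedUsd <= 0) {
    // Nothing billed: any tracked spend at all is a discrepancy worth a look.
    return { driftRatio: trackedUsd > 0 ? Infinity : 0, needsReview: trackedUsd > 0 };
  }
  const driftRatio = Math.abs(billedUsd - trackedUsd) / billedUsd;
  return { driftRatio, needsReview: driftRatio > tolerance };
}

console.log(reconcile(1.14, 10.19).needsReview); // true: the 9x undercount case
console.log(reconcile(9.5, 10).needsReview); // false: 5% drift is within tolerance
```

Run against the aso-mrr numbers, this check would have flagged the missing thinking-token tier at the first reconciliation rather than at an ad hoc billing review.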

Anti-Pattern 4: God Modules

The Problem

AI applications grow fast. You start with a function that calls an LLM. Then you add preprocessing. Then postprocessing. Then validation. Then caching. Then a different LLM call for a related task. Then another. Before you know it, you have a 1,126-line file that does five unrelated things.

// src/lib/seo-engine.ts -- 1,126 lines
// This file does ALL of the following:
//
// 1. SEO analysis (fetch page, parse HTML, score SEO factors)
// 2. Content scoring (call Gemini to rate content quality)
// 3. Keyword extraction (parse content, TF-IDF, call Gemini for related terms)
// 4. SERP parsing (fetch Google results, parse structured data)
// 5. Site audits (crawl pages, check redirects, validate meta tags)
//
// When a bug appears in keyword extraction, you read 1,126 lines.
// When you want to test SERP parsing, you import a module that also
// initializes Gemini clients, HTML parsers, and HTTP pools.
// When you refactor content scoring, you risk breaking site audits
// because they share 4 utility functions defined at the bottom.

export async function analyzeSEO(url: string, env: Env) {
  // 200 lines of fetching and parsing...
}

export async function scoreContent(content: string, env: Env) {
  // 150 lines of Gemini calls and scoring...
}

export async function extractKeywords(text: string, env: Env) {
  // 180 lines of NLP and LLM calls...
}

export async function parseSERP(query: string, env: Env) {
  // 250 lines of Google result parsing...
}

export async function auditSite(domain: string, env: Env) {
  // 346 lines of crawling and validation...
}

// Plus 50 lines of shared utilities at the bottom
function cleanHtml(html: string): string { /* ... */ }
function extractMetaTags(html: string): MetaTags { /* ... */ }
function normalizeUrl(url: string): string { /* ... */ }
function calculateScore(factors: Factor[]): number { /* ... */ }

Why This Matters More for AI Code

God modules are bad in any codebase. They are especially bad in AI codebases because:

LLM calls are expensive to test. If your keyword extraction test imports a module that also initializes a Gemini client for content scoring, your test either pays for an unnecessary LLM call or requires mocking infrastructure you should not need.
LLM integrations change frequently. Provider APIs change, models get updated, pricing changes. When the Gemini API adds a new parameter, you edit a 1,126-line file. The blast radius is five features.
Prompt engineering requires iteration. Improving a prompt for content scoring should not require reading through SERP parsing code. When prompts live in god modules, iteration is slow because the context load is high.
Cost attribution is impossible. If five functions in one file all call Gemini, and your cost tracking tags by file/module, you cannot distinguish a $0.01 keyword extraction from a $0.30 site audit.
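
The cost attribution point can be made concrete with a small sketch. It assumes every LLM call is tagged with the feature that made it rather than the file it lives in; CostLedger, record, and totals are hypothetical names, not part of the original codebase.

```typescript
// Hypothetical sketch: attribute LLM spend per feature instead of per file.
type Feature = "keyword-extraction" | "content-scoring" | "site-audit";

interface CostEntry {
  feature: Feature;
  costUsd: number;
}

class CostLedger {
  private entries: CostEntry[] = [];

  // Record one LLM call against the feature that made it.
  record(feature: Feature, costUsd: number): void {
    this.entries.push({ feature, costUsd });
  }

  // Sum spend per feature so the expensive ones stand out.
  totals(): Map<Feature, number> {
    const out = new Map<Feature, number>();
    for (const entry of this.entries) {
      out.set(entry.feature, (out.get(entry.feature) ?? 0) + entry.costUsd);
    }
    return out;
  }
}

const ledger = new CostLedger();
ledger.record("keyword-extraction", 0.01);
ledger.record("site-audit", 0.3);
ledger.record("keyword-extraction", 0.01);
```

Once the god module is split, the feature tag falls out of the module boundary for free: each module records under its own name.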

The Fix: Single-Responsibility Modules Under 300 Lines

src/
  domain/
    seo-analysis.ts        (120 lines)
    content-scoring.ts     (90 lines)
    keyword-extraction.ts  (110 lines)
    serp-parser.ts         (140 lines)
    site-audit.ts          (180 lines)
  shared/
    html-utils.ts          (40 lines)
    scoring.ts             (30 lines)
    url-utils.ts           (20 lines)

Each module:

// src/domain/keyword-extraction.ts -- 110 lines
// Single responsibility: extract keywords from text using NLP + LLM

import { generateObject } from "ai";
import { google } from "@ai-sdk/google";
import { z } from "zod";
import { cleanHtml } from "../shared/html-utils";

const KeywordResultSchema = z.object({
  primary: z.array(
    z.object({
      term: z.string(),
      relevance: z.number().min(0).max(1),
      searchVolume: z.enum(["high", "medium", "low", "unknown"]),
    })
  ),
  related: z.array(z.string()),
  topics: z.array(z.string()),
});

export type KeywordResult = z.infer<typeof KeywordResultSchema>;

export async function extractKeywords(
  text: string,
  options?: { maxKeywords?: number }
): Promise<KeywordResult> {
  const cleaned = cleanHtml(text);
  const max = options?.maxKeywords ?? 20;

  const { object } = await generateObject({
    model: google("gemini-2.5-flash"),
    schema: KeywordResultSchema,
    prompt: `Extract the top ${max} keywords from this text. Rate each by relevance (0-1) and estimate search volume.\n\nText:\n${cleaned}`,
  });

  return object;
}

// src/domain/content-scoring.ts -- 90 lines
// Single responsibility: score content quality using LLM evaluation

import { generateObject } from "ai";
import { google } from "@ai-sdk/google";
import { z } from "zod";

const ContentScoreSchema = z.object({
  overall: z.number().min(0).max(100),
  factors: z.object({
    readability: z.number().min(0).max(100),
    depth: z.number().min(0).max(100),
    accuracy: z.number().min(0).max(100),
    actionability: z.number().min(0).max(100),
  }),
  suggestions: z.array(z.string()).max(5),
});

export type ContentScore = z.infer<typeof ContentScoreSchema>;

export async function scoreContent(
  content: string,
  context?: { targetAudience?: string; keyword?: string }
): Promise<ContentScore> {
  const { object } = await generateObject({
    model: google("gemini-2.5-flash"),
    schema: ContentScoreSchema,
    prompt: `Score this content on readability, depth, accuracy, and actionability (0-100 each).
${context?.keyword ? `Target keyword: ${context.keyword}` : ""}
${context?.targetAudience ? `Target audience: ${context.targetAudience}` : ""}

Content:
${content.substring(0, 10000)}`,
  });

  return object;
}

The Splitting Heuristic

When deciding how to split a god module:

Group by data dependency. Functions that share the same inputs and outputs belong together. SEO analysis and SERP parsing both work on URLs, but one fetches pages and one fetches search results — different data sources, different modules.
Group by change frequency. Prompt engineering for content scoring changes weekly. HTML parsing utilities change yearly. They should not be in the same file.
Group by test boundary. If you need to mock Gemini to test keyword extraction but not SERP parsing, they should be in different modules so SERP parsing tests do not need Gemini mocks.
Target under 300 lines. This is a heuristic, not a measured threshold: roughly 300 lines is about as much code as a developer can keep in their head while debugging. Beyond that, you start scrolling, and scrolling costs context.
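
The line budget is easy to enforce mechanically. A minimal sketch, assuming file contents have already been read into a map of path to source text (findOversizedModules is a hypothetical helper, and the sample sources are synthetic):

```typescript
// Hypothetical helper: flag modules whose line count exceeds a budget.
interface ModuleSize {
  path: string;
  lines: number;
}

function findOversizedModules(
  sources: Record<string, string>,
  maxLines = 300
): ModuleSize[] {
  return Object.keys(sources)
    .map((path) => ({ path, lines: (sources[path] ?? "").split("\n").length }))
    .filter((m) => m.lines > maxLines)
    .sort((a, b) => b.lines - a.lines); // largest offender first
}

// Synthetic sources standing in for real files read from disk.
const sampleSources: Record<string, string> = {
  "src/lib/seo-engine.ts": new Array(1126).fill("// line").join("\n"),
  "src/domain/keyword-extraction.ts": new Array(110).fill("// line").join("\n"),
};

const oversized = findOversizedModules(sampleSources);
```

Run as a CI step, a check like this turns the 300-line target from a convention into a failing build.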

Anti-Pattern 5: No Input Validation on API Endpoints

The Problem

Your Worker has 226 API endpoints. 223 of them look like this:

app.post("/v1/brands/:slug/content", async (c) => {
  const body = await c.req.json();
  // body is `any` -- TypeScript does not know its shape
  // If body.title is undefined, it becomes NULL in the database
  // If body.sections is a string instead of an array, the .map() call crashes
  // If body.metadata.seo.keywords has 500 entries, the database write times out

  const content = await createContent(body, c.env);
  return c.json(content, 201);
});

The 3 that have validation look like this:

app.post("/v1/auth/login", async (c) => {
  const body = await c.req.json();

  // Manual validation -- tedious, incomplete, and no type narrowing
  if (!body.email || typeof body.email !== "string") {
    return c.json({ error: "email is required" }, 400);
  }
  if (!body.password || typeof body.password !== "string") {
    return c.json({ error: "password is required" }, 400);
  }
  if (body.password.length < 8) {
    return c.json({ error: "password must be at least 8 characters" }, 400);
  }

  // body is still `any` after all these checks
  // TypeScript does not narrow `any` through manual checks
  const user = await authenticateUser(body.email, body.password, c.env);
  return c.json(user);
});

This is dangerous in AI applications specifically because:

LLM prompts are constructed from user input. If body.keyword is actually an object instead of a string, your prompt becomes "Write about [object Object]". The LLM processes garbage. You pay for the tokens.
Database writes accept anything. If body.sections is a 50,000-character string instead of an array of section objects, D1 stores it. When another service reads it and calls .map(), the system crashes downstream.
Queue messages propagate invalid data. An unvalidated request body gets wrapped in a queue message and sent to another service. That service also does not validate. The bad data now lives in two databases.
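
The first failure mode is easy to reproduce. The sketch below shows how a template literal silently coerces an object, followed by a minimal guard; buildPromptUnsafe and buildPrompt are hypothetical names used only for illustration.

```typescript
// Template literals coerce any value to a string, so a malformed body
// produces a syntactically valid but meaningless prompt.
function buildPromptUnsafe(keyword: unknown): string {
  return `Write about ${keyword}`;
}

// Hypothetical guard: reject anything that is not a non-empty string
// before a single token is paid for.
function buildPrompt(keyword: unknown): string {
  if (typeof keyword !== "string" || keyword.trim().length === 0) {
    throw new TypeError("keyword must be a non-empty string");
  }
  return `Write about ${keyword.trim()}`;
}
```

Calling buildPromptUnsafe({ term: "seo" }) returns "Write about [object Object]" without any error, which is exactly why the bug stays invisible until the bill arrives.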

The Fix: Zod at Every Boundary

import { z } from "zod";
import { Hono } from "hono";
import type { Context, Next } from "hono";

const app = new Hono();

// Define the schema ONCE -- it's both the validator and the type
const CreateContentSchema = z.object({
  title: z.string().min(5).max(200),
  keyword: z
    .string()
    .min(1)
    .max(100)
    .describe("Primary SEO keyword for content generation"),
  sections: z
    .array(
      z.object({
        heading: z.string().min(1).max(200),
        instructions: z.string().max(1000).optional(),
      })
    )
    .min(1)
    .max(20),
  metadata: z
    .object({
      seo: z
        .object({
          keywords: z.array(z.string().max(50)).max(10).default([]),
          description: z.string().max(160).optional(),
        })
        .default({}),
      publishAt: z.string().datetime().optional(),
    })
    .default({}),
  generateWithAI: z.boolean().default(true),
});

type CreateContentInput = z.infer<typeof CreateContentSchema>;

// Middleware for validation
function validate<T extends z.ZodSchema>(schema: T) {
  return async (c: Context, next: Next) => {
    const result = schema.safeParse(await c.req.json());
    if (!result.success) {
      return c.json(
        {
          error: "validation_failed",
          issues: result.error.issues.map((issue) => ({
            path: issue.path.join("."),
            code: issue.code,
            message: issue.message,
          })),
        },
        400
      );
    }
    c.set("validated", result.data);
    return next();
  };
}

app.post(
  "/v1/brands/:slug/content",
  validate(CreateContentSchema),
  async (c) => {
    const input = c.get("validated") as CreateContentInput;
    // input is validated and constrained; the cast only restores the static
    // type that c.get() does not carry (typing Hono's Variables removes it)
    // input.title is a string between 5-200 chars
    // input.sections is an array with 1-20 items
    // input.metadata.seo.keywords has at most 10 entries
    // No manual checks, no surprises

    const content = await createContent(input, c.env);
    return c.json(content, 201);
  }
);

Validation Middleware for Hono

Here is a reusable middleware pattern for Hono (a widely used router on Cloudflare Workers):

// src/middleware/validation.ts
import { z } from "zod";
import type { Context, MiddlewareHandler } from "hono";

export function validateBody<T extends z.ZodSchema>(
  schema: T
): MiddlewareHandler {
  return async (c, next) => {
    let body: unknown;
    try {
      body = await c.req.json();
    } catch {
      return c.json({ error: "invalid_json", message: "Request body is not valid JSON" }, 400);
    }

    const result = schema.safeParse(body);
    if (!result.success) {
      return c.json(
        {
          error: "validation_failed",
          issues: result.error.issues.map((i) => ({
            path: i.path.join("."),
            code: i.code,
            message: i.message,
          })),
        },
        400
      );
    }

    c.set("body", result.data);
    return next();
  };
}

export function validateParams<T extends z.ZodSchema>(
  schema: T
): MiddlewareHandler {
  return async (c, next) => {
    const result = schema.safeParse(c.req.param());
    if (!result.success) {
      return c.json(
        {
          error: "invalid_params",
          issues: result.error.issues.map((i) => ({
            path: i.path.join("."),
            message: i.message,
          })),
        },
        400
      );
    }

    c.set("params", result.data);
    return next();
  };
}

export function validateQuery<T extends z.ZodSchema>(
  schema: T
): MiddlewareHandler {
  return async (c, next) => {
    const result = schema.safeParse(c.req.query());
    if (!result.success) {
      return c.json(
        {
          error: "invalid_query",
          issues: result.error.issues.map((i) => ({
            path: i.path.join("."),
            message: i.message,
          })),
        },
        400
      );
    }

    c.set("query", result.data);
    return next();
  };
}

Usage across the codebase becomes consistent:

const BrandSlugParams = z.object({
  slug: z.string().regex(/^[a-z0-9-]+$/).min(1).max(64),
});

const ListQuerySchema = z.object({
  limit: z.coerce.number().int().min(1).max(100).default(20),
  offset: z.coerce.number().int().min(0).default(0),
  status: z.enum(["draft", "published", "archived"]).optional(),
});

app.get(
  "/v1/brands/:slug/content",
  validateParams(BrandSlugParams),
  validateQuery(ListQuerySchema),
  async (c) => {
    const { slug } = c.get("params");
    const { limit, offset, status } = c.get("query");
    // All validated and constrained. For static types as well, declare
    // "params" and "query" in the app's Hono Variables.
    return c.json(await listContent(slug, { limit, offset, status }, c.env));
  }
);
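
Query strings always arrive as strings, which is why ListQuerySchema uses z.coerce. The dependency-free sketch below approximates the same rules by hand to show what the schema is doing; it is not Zod's implementation, and parseIntField and parseListQuery are hypothetical names.

```typescript
type ContentStatus = "draft" | "published" | "archived";

interface ListQuery {
  limit: number;
  offset: number;
  status?: ContentStatus;
}

type ParseResult =
  | { success: true; data: ListQuery }
  | { success: false; issues: string[] };

// Coerce one numeric field, apply bounds, and fall back to a default
// only when the parameter is absent.
function parseIntField(
  raw: string | undefined,
  name: string,
  min: number,
  max: number,
  fallback: number,
  issues: string[]
): number {
  if (raw === undefined) return fallback;
  const n = Number(raw);
  if (!Number.isInteger(n) || n < min || n > max) {
    issues.push(`${name} is out of range or not an integer`);
    return fallback;
  }
  return n;
}

function parseListQuery(raw: Record<string, string | undefined>): ParseResult {
  const issues: string[] = [];
  const limit = parseIntField(raw.limit, "limit", 1, 100, 20, issues);
  const offset = parseIntField(raw.offset, "offset", 0, Number.MAX_SAFE_INTEGER, 0, issues);

  const data: ListQuery = { limit, offset };
  const s = raw.status;
  if (s !== undefined) {
    if (s === "draft" || s === "published" || s === "archived") {
      data.status = s;
    } else {
      issues.push("status must be draft, published, or archived");
    }
  }

  if (issues.length > 0) return { success: false, issues };
  return { success: true, data };
}
```

The hand-written version is roughly forty lines for three fields, which is the argument for letting a schema library carry this work across 226 endpoints.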

Anti-Pattern 6: Duplicated Type Definitions

The Problem

When multiple workers call the same LLM API, each worker defines its own response types:

// pages-plus/src/types/gemini.ts
interface GeminiApiResponse {
  candidates: Array<{
    content: { parts: Array<{ text: string }> };
    finishReason: string;
  }>;
  usageMetadata: {
    promptTokenCount: number;
    candidatesTokenCount: number;
    totalTokenCount: number;
  };
}

// aso-mrr/src/types/gemini.ts -- COPY-PASTED, slightly different
interface GeminiResponse {
  candidates: {
    content: { parts: { text: string }[] };
    finishReason: string;
  }[];
  usageMetadata: {
    promptTokenCount: number;
    candidatesTokenCount: number;
    // Missing: thoughtsTokenCount -- this copy doesn't know about thinking tokens
    totalTokenCount: number;
  };
}

// scalable-media/src/lib/gemini.ts -- yet another copy
type GeminiResult = {
  candidates: Array<{
    content: { parts: Array<{ text: string }> };
  }>;
  usageMetadata?: {
    promptTokenCount?: number;
    candidatesTokenCount?: number;
  };
};

// gatherfeed/src/services/ai.ts -- and another
// This one has thoughtsTokenCount but misspelled as thoughtTokenCount

Six files. Four slightly different versions. When Google adds a new field to the response, you update one copy and forget the others. When thinking tokens shipped, three of the four versions lacked thoughtsTokenCount entirely and the fourth misspelled it, so none of them counted thinking tokens. That is how you get a 9x cost undercount.
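
The arithmetic behind an undercount is simple: a type that lacks the thinking field prices those tokens at zero. The sketch below uses the token counts from a single request and per-million rates that are assumptions for illustration, not current provider pricing; costUsd, TokenUsage, and Rates are hypothetical names.

```typescript
// Token counts as reported by the provider; field names are illustrative.
interface TokenUsage {
  input: number;
  output: number;
  thinking: number;
}

// USD per million tokens. Assumed rates for the example only.
interface Rates {
  inputPerM: number;
  outputPerM: number;
  thinkingPerM: number;
}

function costUsd(usage: TokenUsage, rates: Rates): number {
  return (
    (usage.input * rates.inputPerM +
      usage.output * rates.outputPerM +
      usage.thinking * rates.thinkingPerM) /
    1_000_000
  );
}

const usage: TokenUsage = { input: 2340, output: 1200, thinking: 3800 };
const rates: Rates = { inputPerM: 0.15, outputPerM: 0.6, thinkingPerM: 0.6 };

// A response type without the thinking field effectively sets it to zero.
const tracked = costUsd({ ...usage, thinking: 0 }, rates);
const billed = costUsd(usage, rates);
```

With these assumed rates, the tracked figure is about $0.0011 and the billed figure about $0.0034, roughly a 3x gap on one request. A higher thinking rate or a larger share of thinking tokens widens the gap further.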

The Fix: Shared Type Modules

Option A: Monorepo shared package

// packages/shared/src/types/llm.ts
import { z } from "zod";

// Define once with Zod -- get runtime validation AND TypeScript type
export const GeminiUsageSchema = z.object({
  promptTokenCount: z.number().default(0),
  candidatesTokenCount: z.number().default(0),
  thoughtsTokenCount: z.number().default(0),
  cachedContentTokenCount: z.number().default(0),
  totalTokenCount: z.number().default(0),
});

export const GeminiResponseSchema = z.object({
  candidates: z.array(
    z.object({
      content: z.object({
        parts: z.array(z.object({ text: z.string() })),
      }),
      finishReason: z.string(),
    })
  ),
  usageMetadata: GeminiUsageSchema.optional(),
});

export type GeminiUsage = z.infer<typeof GeminiUsageSchema>;
export type GeminiResponse = z.infer<typeof GeminiResponseSchema>;

// Helper to extract text from response
export function extractText(response: GeminiResponse): string | null {
  return response.candidates?.[0]?.content?.parts?.[0]?.text ?? null;
}

// Helper to extract usage with defaults for all tiers
export function extractUsage(response: GeminiResponse): GeminiUsage {
  return GeminiUsageSchema.parse(response.usageMetadata ?? {});
}

Every worker imports from the shared package:

// pages-plus/src/domain/content.ts
import { GeminiResponseSchema, extractText, extractUsage } from "@acme/shared/types/llm";

// aso-mrr/src/services/analysis.ts
import { GeminiResponseSchema, extractText, extractUsage } from "@acme/shared/types/llm";

// One definition. One import. One place to update.

Option B: Eliminate the type entirely with AI SDK

The better fix is to stop interacting with provider APIs directly. When you use the Vercel AI SDK, you never see a GeminiApiResponse. The SDK returns a normalized GenerateObjectResult or GenerateTextResult with consistent types across all providers.

import { generateObject } from "ai";
import { google } from "@ai-sdk/google";

const { object, usage } = await generateObject({
  model: google("gemini-2.5-flash"),
  schema: MySchema,
  prompt: "...",
});

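// usage has the same shape for every provider. AI SDK 4.x reports
// { promptTokens, completionTokens, totalTokens }; 5.x renames these to
// inputTokens / outputTokens.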
// No GeminiApiResponse. No provider-specific types. No copy-paste.

Key insight: The best way to eliminate duplicated types is to eliminate the need for the type entirely. If you are defining GeminiApiResponse in your codebase, you are at the wrong abstraction level. Use an SDK that abstracts the provider.

Anti-Pattern 7: Manual State Machines for Agents

The Problem

You are building an AI agent that performs multi-step tasks: research a topic, generate an outline, write content, review for quality, publish. Each step depends on the previous step. Some steps may fail and need retries. The whole sequence may take minutes. And you need to know where you are if the process crashes and restarts.

The manual approach:

// The hand-rolled state machine -- no persistence, no recovery, no observability

type AgentState = "idle" | "researching" | "outlining" | "writing" | "reviewing" | "publishing";

interface AgentContext {
  state: AgentState;
  keyword: string;
  researchData?: ResearchResult;
  outline?: ArticleOutline;
  draft?: string;
  reviewScore?: number;
  error?: string;
  retryCount: number;
}

async function runAgent(keyword: string, env: Env): Promise<void> {
  const ctx: AgentContext = {
    state: "idle",
    keyword,
    retryCount: 0,
  };

  try {
    // Step 1: Research
    ctx.state = "researching";
    ctx.researchData = await doResearch(keyword, env);

    // Step 2: Outline
    ctx.state = "outlining";
    ctx.outline = await generateOutline(ctx.researchData, env);

    // Step 3: Write
    ctx.state = "writing";
    ctx.draft = await writeDraft(ctx.outline, env);

    // Step 4: Review
    ctx.state = "reviewing";
    ctx.reviewScore = await reviewContent(ctx.draft, env);

    if (ctx.reviewScore < 70) {
      // Retry writing -- but what if it fails again?
      // What if the process crashes here?
      // What if we've already spent $0.50 on research and outlining?
      ctx.state = "writing";
      ctx.draft = await writeDraft(ctx.outline, env);
      ctx.reviewScore = await reviewContent(ctx.draft, env);
    }

    // Step 5: Publish
    ctx.state = "publishing";
    await publishContent(ctx.draft, env);
  } catch (err) {
    ctx.error = err instanceof Error ? err.message : String(err);
    // The state is in memory. If the process crashes, it's gone.
    // There's no way to resume from the last successful step.
    // The entire pipeline must restart from scratch.
    // All the LLM calls (and their costs) are wasted.
    console.log("Agent failed at state:", ctx.state, err);
  }
}

The problems:

No persistence. If the Worker crashes after step 3 (writing), all progress is lost. The research and outline cost money. That money is wasted.
No observability. You cannot query “how many agents are in the reviewing state right now?” or “what was the last successful step for keyword X?”
No recovery. When the process restarts, it starts from step 1. There is no way to resume from step 4.
No concurrency control. Two cron ticks could start two agents for the same keyword. Both run to completion. You pay twice.
No backpressure. If step 3 (writing) takes 30 seconds but step 1 (research) takes 2 seconds, research output piles up faster than writing can consume it, and nothing limits how many writing requests hit the LLM at once.

The Fix: Stateful Agent Runtimes

Cloudflare Agents SDK (Durable Objects with built-in state):

import { Agent } from "agents";
import { generateObject } from "ai";
import { google } from "@ai-sdk/google";
import { z } from "zod";
// ResearchSchema and OutlineSchema are Zod schemas assumed to be defined elsewhere.

interface ContentAgentState {
  status: "idle" | "researching" | "outlining" | "writing" | "reviewing" | "publishing" | "done" | "failed";
  keyword: string;
  researchData?: unknown;
  outline?: unknown;
  draft?: string;
  reviewScore?: number;
  completedSteps: string[];
  totalCost: number;
  error?: string;
}

const initialState: ContentAgentState = {
  status: "idle",
  keyword: "",
  completedSteps: [],
  totalCost: 0,
};

export class ContentAgent extends Agent<Env, ContentAgentState> {
  initialState = initialState;

  async startPipeline(keyword: string) {
    this.setState({ ...this.state, keyword, status: "researching" });

    try {
      // Step 1: Research -- state persists automatically
      if (!this.state.completedSteps.includes("research")) {
        const research = await this.doResearch(keyword);
        this.setState({
          ...this.state,
          researchData: research,
          completedSteps: [...this.state.completedSteps, "research"],
          status: "outlining",
        });
      }

      // Step 2: Outline
      if (!this.state.completedSteps.includes("outline")) {
        const outline = await this.generateOutline();
        this.setState({
          ...this.state,
          outline,
          completedSteps: [...this.state.completedSteps, "outline"],
          status: "writing",
        });
      }

      // Step 3: Write
      if (!this.state.completedSteps.includes("write")) {
        const draft = await this.writeDraft();
        this.setState({
          ...this.state,
          draft,
          completedSteps: [...this.state.completedSteps, "write"],
          status: "reviewing",
        });
      }

      // Step 4: Review
      if (!this.state.completedSteps.includes("review")) {
        const score = await this.reviewContent();
        this.setState({
          ...this.state,
          reviewScore: score,
          completedSteps: [...this.state.completedSteps, "review"],
          status: score >= 70 ? "publishing" : "writing",
        });

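        if (score < 70) {
          // Cap retries so a persistently low score cannot loop (and bill) forever.
          // "retry" markers live in completedSteps, so the count survives restarts.
          const retries = this.state.completedSteps.filter((s) => s === "retry").length;
          if (retries >= 2) {
            throw new Error(`Review score ${score} still below 70 after ${retries} retries`);
          }
          // Remove write and review steps so both re-run
          this.setState({
            ...this.state,
            completedSteps: [
              ...this.state.completedSteps.filter(
                (s) => s !== "write" && s !== "review"
              ),
              "retry",
            ],
          });
          // Re-run from write step
          return this.startPipeline(keyword);
        }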
      }

      // Step 5: Publish
      if (!this.state.completedSteps.includes("publish")) {
        await this.publishContent();
        this.setState({
          ...this.state,
          completedSteps: [...this.state.completedSteps, "publish"],
          status: "done",
        });
      }
    } catch (err) {
      this.setState({
        ...this.state,
        status: "failed",
        error: err instanceof Error ? err.message : String(err),
      });
      // State is persisted even on failure.
      // On retry, completedSteps tells us where to resume.
      // Already-spent LLM costs are not wasted.
    }
  }

  private async doResearch(keyword: string) {
    const { object, usage } = await generateObject({
      model: google("gemini-2.5-flash"),
      schema: ResearchSchema,
      prompt: `Research the topic: ${keyword}`,
    });

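    // Rates are illustrative (USD per million tokens) and this formula omits
    // thinking tokens; see Anti-Pattern 3 before relying on it.
    this.setState({
      ...this.state,
      totalCost:
        this.state.totalCost +
        (usage.promptTokens * 0.15 + usage.completionTokens * 0.6) / 1_000_000,
    });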

    return object;
  }

  private async generateOutline() {
    const { object } = await generateObject({
      model: google("gemini-2.5-flash"),
      schema: OutlineSchema,
      prompt: `Create an outline based on this research: ${JSON.stringify(this.state.researchData)}`,
    });
    return object;
  }

  private async writeDraft() {
    const { object } = await generateObject({
      model: google("gemini-2.5-flash"),
      schema: z.object({ content: z.string() }),
      prompt: `Write the full article from this outline: ${JSON.stringify(this.state.outline)}`,
    });
    return object.content;
  }

  private async reviewContent(): Promise<number> {
    const { object } = await generateObject({
      model: google("gemini-2.5-flash"),
      schema: z.object({ score: z.number().min(0).max(100), feedback: z.string() }),
      prompt: `Score this content 0-100: ${this.state.draft?.substring(0, 5000)}`,
    });
    return object.score;
  }

  private async publishContent() {
    await this.env.PUBLISH_QUEUE.send({
      event_id: crypto.randomUUID(),
      type: "content.publish",
      source: "content-agent",
      timestamp: new Date().toISOString(),
      payload: {
        keyword: this.state.keyword,
        content: this.state.draft,
      },
    });
  }
}

What the Agents SDK gives you:

Concern	Manual State Machine	CF Agents SDK
State persistence	In-memory (lost on crash)	SQLite-backed (survives restarts)
Recovery	Start from scratch	Resume from last completed step
Concurrency	No dedup, double-runs	One DO instance per keyword
Observability	`console.log`	Query `this.state` via HTTP
Scheduling	Manual `setTimeout`	Built-in `this.schedule()` with alarms
Cost tracking	Manual counter	State tracks `totalCost` durably
Testing	Need to mock everything	Test each step independently
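
The recovery row reduces to one pure function: given the persisted list of completed steps, which step runs next? A minimal sketch, independent of any runtime (nextStep and remainingSteps are hypothetical names; the step names match the completedSteps values used in the agent above):

```typescript
const PIPELINE = ["research", "outline", "write", "review", "publish"] as const;
type Step = (typeof PIPELINE)[number];

// Returns the first step not yet completed, or null when the run is done.
function nextStep(completedSteps: readonly string[]): Step | null {
  for (const step of PIPELINE) {
    if (completedSteps.indexOf(step) === -1) return step;
  }
  return null;
}

// Steps that still have to run (and be paid for) when resuming.
function remainingSteps(completedSteps: readonly string[]): Step[] {
  return PIPELINE.filter((step) => completedSteps.indexOf(step) === -1);
}
```

Because the function is pure, resume behavior can be unit-tested without a Durable Object, an LLM mock, or a network call.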

When Agents SDK Is Not the Right Fit

The Cloudflare Agents SDK runs on Cloudflare’s edge network. If you are not on Cloudflare, or if your workflows span multiple cloud providers, consider:

Temporal — Battle-tested durable execution. Workflows survive infrastructure failures. TypeScript SDK available. Best for complex, long-running workflows across services.
Inngest — Event-driven, serverless-friendly. Steps are durable. Good TypeScript support. Best for event-triggered multi-step pipelines.
LangGraph — Purpose-built for AI agent graphs. State management between nodes. Best for complex agent architectures with branching logic. But can be brittle in production — state management bugs and retry complexity are common complaints.

Anti-Pattern 8: Scattered API Keys

The Problem

Each Worker manages its own API keys for external providers:

// pages-plus/wrangler.jsonc
{
  "vars": {
    "GEMINI_API_KEY": "...",
    "OPENAI_API_KEY": "...",
    "BRAVE_API_KEY": "..."
  }
}

// aso-mrr/wrangler.jsonc
{
  "vars": {
    "GEMINI_API_KEY": "...",  // Same key or different? Who knows.
    "DATAFORSEO_LOGIN": "...",
    "DATAFORSEO_PASSWORD": "..."
  }
}

// scalable-media/wrangler.jsonc
{
  "vars": {
    "GEMINI_API_KEY": "...",  // Third copy of the key
    "PERPLEXITY_API_KEY": "...",
    "TWITTER_BEARER_TOKEN": "..."
  }
}

// gatherfeed/wrangler.jsonc
{
  "vars": {
    "GEMINI_API_KEY": "...",  // Fourth copy
    "BRAVE_API_KEY": "...",   // Second copy
    "YOUTUBE_API_KEY": "..."
  }
}

The problems compound:

Key rotation requires N deployments. When you rotate a Gemini key, you update 4 workers. If you miss one, it breaks in production.
No usage attribution. All 4 workers use the same Gemini key. The billing dashboard shows total spend but not which worker spent what.
No rate limiting. Each worker hits Gemini independently. Four workers each making requests at the provider’s rate limit means 4x the intended rate.
Security surface. Every Worker’s environment has every key it needs. If any Worker is compromised, all its keys are exposed. In our case, that was 14 keys across 4 workers.
No centralized kill switch. When you discover $47 in unexpected spending, you cannot flip one switch. You must update and deploy 4 workers to remove the keys.

The Fix: Centralized Key Management

One service holds all provider keys. Every other service authenticates to this proxy with a project-specific credential.

// api-proxy/src/providers/registry.ts

interface ProviderConfig {
  name: string;
  baseUrl: string;
  authHeader: string;
  keyEnvVar: string;
  rateLimit: { requests: number; windowMs: number };
}

const PROVIDERS: Record<string, ProviderConfig> = {
  gemini: {
    name: "Google Gemini",
    baseUrl: "https://generativelanguage.googleapis.com/v1beta",
    authHeader: "x-goog-api-key",
    keyEnvVar: "GEMINI_API_KEY",
    rateLimit: { requests: 60, windowMs: 60_000 },
  },
  openai: {
    name: "OpenAI",
    baseUrl: "https://api.openai.com/v1",
    authHeader: "Authorization",
    keyEnvVar: "OPENAI_API_KEY",
    rateLimit: { requests: 100, windowMs: 60_000 },
  },
  brave: {
    name: "Brave Search",
    baseUrl: "https://api.search.brave.com/res/v1",
    authHeader: "X-Subscription-Token",
    keyEnvVar: "BRAVE_API_KEY",
    rateLimit: { requests: 15, windowMs: 1_000 },
  },
  perplexity: {
    name: "Perplexity",
    baseUrl: "https://api.perplexity.ai",
    authHeader: "Authorization",
    keyEnvVar: "PERPLEXITY_API_KEY",
    rateLimit: { requests: 20, windowMs: 60_000 },
  },
};

// Project auth -- each calling service gets ONE credential
interface ProjectAuth {
  projectId: string;
  apiKey: string;
  dailyLimitUsd: number;
  allowedProviders: string[];
  costTier: 1 | 2 | 3;
}
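
The rateLimit field in ProviderConfig implies a limiter somewhere in the proxy. The source does not show it, so the following is a minimal fixed-window sketch under stated assumptions: counters are held in memory and the clock is injectable for testing. FixedWindowLimiter and tryAcquire are hypothetical names.

```typescript
interface RateLimit {
  requests: number;
  windowMs: number;
}

// Fixed-window counter keyed by provider. In-memory only: a real proxy
// would need shared storage so every instance sees the same counts.
class FixedWindowLimiter {
  private windows = new Map<string, { start: number; count: number }>();
  private readonly now: () => number;

  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  // Returns true and counts the request if the provider is under its limit.
  tryAcquire(provider: string, limit: RateLimit): boolean {
    if (limit.requests <= 0) return false;
    const t = this.now();
    const w = this.windows.get(provider);
    if (!w || t - w.start >= limit.windowMs) {
      // First request, or the previous window has expired: start a new one.
      this.windows.set(provider, { start: t, count: 1 });
      return true;
    }
    if (w.count >= limit.requests) return false;
    w.count += 1;
    return true;
  }
}
```

A fixed window allows short bursts at window boundaries; a sliding window or token bucket smooths that out at the cost of more state.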

Worker configurations become minimal:

// pages-plus/wrangler.jsonc -- AFTER consolidation
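{
  "vars": {
    // NO provider API keys. Only the proxy URL lives in vars.
    "API_PROXY_URL": "https://api-proxy.your-domain.com"
  }
  // API_PROXY_KEY is a credential: set it with `wrangler secret put API_PROXY_KEY`
  // instead of committing it in vars.
}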

Key rotation is now a single operation:

wrangler secret put GEMINI_API_KEY --name api-proxy

Kill switch is now a single operation:

// Disable a project's access to all providers
// One API call. Immediate effect. No deployments.
await db
  .update(projects)
  .set({ active: false })
  .where(eq(projects.projectId, "pages-plus"));
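
The kill switch only works if the proxy checks the flag on every request. A minimal sketch of that check, assuming the project row carries the active flag from the update above plus the dailyLimitUsd and allowedProviders fields from ProjectAuth (ProjectRecord, Decision, and authorize are hypothetical names):

```typescript
interface ProjectRecord {
  projectId: string;
  active: boolean;
  dailyLimitUsd: number;
  allowedProviders: string[];
}

type Decision =
  | { allowed: true }
  | {
      allowed: false;
      reason: "project_disabled" | "provider_not_allowed" | "daily_limit_reached";
    };

// Evaluate the cheapest, most decisive checks first.
function authorize(
  project: ProjectRecord,
  provider: string,
  spentTodayUsd: number
): Decision {
  if (!project.active) {
    return { allowed: false, reason: "project_disabled" };
  }
  if (project.allowedProviders.indexOf(provider) === -1) {
    return { allowed: false, reason: "provider_not_allowed" };
  }
  if (spentTodayUsd >= project.dailyLimitUsd) {
    return { allowed: false, reason: "daily_limit_reached" };
  }
  return { allowed: true };
}
```

Returning a reason code rather than a bare boolean lets the proxy log why a request was refused, which matters when the refusal is the first sign of runaway spend.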

Anti-Pattern 9: No Structured Logging

The Problem

Your Workers use console.log for observability:

// What logging looks like in most Workers

app.post("/v1/content/generate", async (c) => {
  console.log("Received generate request");

  try {
    const body = await c.req.json();
    console.log("Generating content for:", body.keyword);

    const result = await generateContent(body, c.env);
    console.log("Content generated successfully");

    return c.json(result);
  } catch (err) {
    console.log("Error generating content:", err instanceof Error ? err.message : err);
    // Which request? What were the inputs? What was the state?
    // The log says "Error generating content: fetch failed"
    // Good luck debugging that in production.
    return c.json({ error: "internal_error" }, 500);
  }
});

The problems:

No context propagation. Each console.log is a standalone string. There is no request ID linking them together. When 50 requests happen in parallel, the logs are interleaved and useless.
No structured data. Logs are strings, not objects. You cannot filter by status:error or function:content-generate or cost>0.10 because those fields do not exist.
No level management. Everything is console.log. You cannot turn off debug logs in production or escalate warnings to error. There is no severity.
No performance data. You do not know how long each operation took, what it cost, or what the token counts were. Those values exist in your code but never reach the logs.

The Fix: Pino in Browser Mode on Cloudflare Workers

Pino is one of the fastest Node.js loggers. Its browser mode works on Cloudflare Workers because it writes through the console methods (which Workers capture) while still giving you structured JSON, child loggers, and log levels.

// src/lib/logger.ts
import pino from "pino";

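export function createLogger(service: string) {
  const logger = pino({
    level: "info",
    browser: {
      asObject: true,
      write: {
        info: (o: object) => console.log(JSON.stringify(o)),
        warn: (o: object) => console.warn(JSON.stringify(o)),
        error: (o: object) => console.error(JSON.stringify(o)),
        debug: (o: object) => console.debug(JSON.stringify(o)),
      },
    },
    timestamp: pino.stdTimeFunctions.isoTime,
  });

  // Bind the service name through a child logger. Child bindings are merged
  // in both the Node and browser builds, whereas the `base` option may be
  // ignored by the browser build.
  return logger.child({ service });
}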

Usage with Hono middleware:

// src/middleware/logging.ts
import { createLogger } from "../lib/logger";
import type { MiddlewareHandler } from "hono";

export function requestLogger(service: string): MiddlewareHandler {
  const baseLogger = createLogger(service);

  return async (c, next) => {
    const requestId = c.req.header("x-request-id") ?? crypto.randomUUID();
    const start = Date.now();

    // Create a child logger with request context
    const log = baseLogger.child({
      requestId,
      method: c.req.method,
      path: c.req.path,
    });

    // Attach to the context so handlers can use it
    c.set("log", log);
    c.set("requestId", requestId);

    await next();

    // Hono catches handler errors itself (routing them to app.onError) and
    // exposes them on c.error, so a try/catch around next() would not see them.
    const duration = Date.now() - start;
    if (c.error) {
      log.error({
        msg: "request failed",
        status: c.res.status,
        duration,
        error: c.error.message,
        stack: c.error.stack,
      });
    } else {
      log.info({
        msg: "request completed",
        status: c.res.status,
        duration,
      });
    }
  };
}

Application wiring:

// src/index.ts
import { Hono } from "hono";
import { requestLogger } from "./middleware/logging";

const app = new Hono<{ Bindings: Env }>();

app.use("*", requestLogger("pages-plus"));

app.post("/v1/content/generate", async (c) => {
  const log = c.get("log");

  const body = await c.req.json();
  log.info({ msg: "generating content", keyword: body.keyword });

  const start = Date.now();
  const result = await generateContent(body, c.env);
  const duration = Date.now() - start;

  log.info({
    msg: "content generated",
    keyword: body.keyword,
    wordCount: result.wordCount,
    duration,
    cost: result.cost,
    model: result.model,
    tokens: result.usage,
  });

  return c.json(result);
});

What the output looks like:

{
  "level": 30,
  "time": "2026-03-15T10:23:45.123Z",
  "service": "pages-plus",
  "requestId": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "method": "POST",
  "path": "/v1/content/generate",
  "msg": "content generated",
  "keyword": "bank statement to excel",
  "wordCount": 1847,
  "duration": 4523,
  "cost": 0.0034,
  "model": "gemini-2.5-flash",
  "tokens": {
    "input": 2340,
    "output": 1200,
    "thinking": 3800
  }
}

Now you can:

Correlate: Find all logs for request a1b2c3d4 across middleware, handler, and domain functions
Filter: Show only requests where cost > 0.10 or duration > 5000
Aggregate: Calculate average cost per keyword, p99 latency per endpoint
Alert: Trigger when error rate exceeds threshold or daily cost exceeds budget
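
Because every line is JSON, the Filter and Aggregate steps above need no special tooling. The following is a minimal sketch, assuming log lines shaped like the sample output; the `LogLine` type and the helper names are illustrative and not part of Pino:

```typescript
// Hypothetical shape: only the fields used below, mirroring the sample output.
interface LogLine {
  requestId: string;
  msg: string;
  keyword?: string;
  duration?: number;
  cost?: number;
}

// Parse newline-delimited JSON, skipping lines that are not JSON objects.
// A Zod schema would be the stricter choice; the cast keeps this dependency-free.
function parseLogLines(raw: string): LogLine[] {
  const lines: LogLine[] = [];
  for (const text of raw.split("\n")) {
    if (text.trim() === "") continue;
    try {
      const parsed: unknown = JSON.parse(text);
      if (typeof parsed === "object" && parsed !== null) {
        lines.push(parsed as LogLine);
      }
    } catch {
      // Ignore non-JSON output such as stray console.log strings
    }
  }
  return lines;
}

// Filter: requests that were expensive or slow.
function expensiveOrSlow(lines: LogLine[], maxCost: number, maxMs: number): LogLine[] {
  return lines.filter((l) => (l.cost ?? 0) > maxCost || (l.duration ?? 0) > maxMs);
}

// Aggregate: average cost per keyword.
function averageCostByKeyword(lines: LogLine[]): Map<string, number> {
  const sums = new Map<string, { total: number; count: number }>();
  for (const l of lines) {
    if (l.keyword === undefined || l.cost === undefined) continue;
    const entry = sums.get(l.keyword) ?? { total: 0, count: 0 };
    entry.total += l.cost;
    entry.count += 1;
    sums.set(l.keyword, entry);
  }
  const averages = new Map<string, number>();
  sums.forEach((entry, keyword) => averages.set(keyword, entry.total / entry.count));
  return averages;
}
```

In production the same queries would usually run in a log platform, but the point stands: none of this is possible when the log line is a free-form string.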

Child Loggers for Deep Context

The real power of Pino is child loggers. Each child inherits its parent’s context and adds its own:

// In the request handler
const log = c.get("log"); // Has requestId, method, path

// Pass to domain function
const contentLog = log.child({ function: "content-generate", keyword });

// Inside domain function, create deeper children
const llmLog = contentLog.child({ provider: "gemini", model: "gemini-2.5-flash" });
llmLog.info({ msg: "calling LLM", promptLength: prompt.length });
// Output has: requestId + method + path + function + keyword + provider + model + msg + promptLength
// ALL context propagated automatically. Zero extra work.

The Problem

Your wrangler.jsonc has 9 cron triggers:

{
  "triggers": {
    "crons": [
      "*/15 * * * *",
      "*/30 * * * *",
      "0 */2 * * *",
      "0 */4 * * *",
      "0 6 * * *",
      "0 12 * * *",
      "0 18 * * *",
      "30 8 * * *",
      "0 0 * * *"
    ]
  }
}

Each cron triggers LLM calls. Some generate article outlines. Some generate full drafts. Some generate meta descriptions. Some generate internal link suggestions. None of them check how much has been spent today. None of them have a kill switch.
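
It helps to quantify how often those nine schedules fire. Here is a rough sketch, assuming each expression uses only `*`, `*/n`, or a single number in its minute and hour fields (true of the list above). It is deliberately not a general cron parser, and the function names are illustrative:

```typescript
// Count how many values in [0, size) a simple cron field matches.
// Supports only "*", "*/n", and a single number; anything else throws.
function matchesPerField(field: string, size: number): number {
  if (field === "*") return size;
  const step = /^\*\/(\d+)$/.exec(field);
  if (step) {
    const n = Number(step[1]);
    if (n <= 0) throw new Error(`invalid step in "${field}"`);
    return Math.ceil(size / n);
  }
  if (/^\d+$/.test(field) && Number(field) < size) return 1;
  throw new Error(`unsupported cron field: ${field}`);
}

// Runs per day for a five-field expression whose day, month,
// and weekday fields are all "*".
function runsPerDay(expr: string): number {
  const [minute, hour, dayOfMonth, month, dayOfWeek] = expr.trim().split(/\s+/);
  if (dayOfMonth !== "*" || month !== "*" || dayOfWeek !== "*") {
    throw new Error(`only daily schedules are supported: ${expr}`);
  }
  return matchesPerField(minute, 60) * matchesPerField(hour, 24);
}

const crons = [
  "*/15 * * * *",
  "*/30 * * * *",
  "0 */2 * * *",
  "0 */4 * * *",
  "0 6 * * *",
  "0 12 * * *",
  "0 18 * * *",
  "30 8 * * *",
  "0 0 * * *",
];

const totalRunsPerDay = crons.reduce((sum, expr) => sum + runsPerDay(expr), 0);
```

For this list the total is 167 invocations per day (96 + 48 + 12 + 6 + 5), or about 835 over five days, before counting how many LLM calls each invocation makes.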

War Story: The 193 Ghost Posts

Here is the sequence of events:

Monday: Deploy content pipeline with 9 crons. Each cron handles one stage of content generation.
Monday-Friday: Crons run 24/7. Each generates content using Gemini. No cost tracking. No quality gate. No human review.
Friday: During a routine ops session, check the Gemini billing dashboard. See $37 from pages-plus.
Investigation: The pipeline generated 193 blog posts in 5 days.
- No human ever reviewed the content quality
- No human knew the posts existed (they were in draft status in D1)
- No cost was attributed to any specific function
- No daily limit existed to stop the pipeline
Emergency response:
- All 9 crons in pages-plus disabled immediately
- All 4 crons in aso-mrr disabled
- 14 API keys removed from worker configurations
- New P0 standard written: API Cost Metering Standard
- 12 issues created to implement metering before re-enabling any cron

The Fix: Budget-Aware Pipelines

Every cron that triggers LLM calls must check the budget before proceeding:

// src/crons/content-pipeline.ts

import type { ScheduledEvent } from "@cloudflare/workers-types";

interface BudgetCheck {
  allowed: boolean;
  spent: number;
  limit: number;
  remaining: number;
}

async function checkBudget(
  projectId: string,
  env: Env
): Promise<BudgetCheck> {
  const response = await fetch(
    `${env.API_PROXY_URL}/v1/costs?period=day&project=${projectId}`,
    {
      headers: {
        "X-Project-Id": projectId,
        "X-Api-Key": env.API_PROXY_KEY,
      },
    }
  );

  if (!response.ok) {
    // If we cannot check the budget, do not proceed
    return { allowed: false, spent: 0, limit: 0, remaining: 0 };
  }

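  // response.json() is untyped; fail closed if the payload is not what we expect
  const data = (await response.json()) as {
    total_cost_usd?: unknown;
    daily_limit_usd?: unknown;
  } | null;
  const spent = data?.total_cost_usd;
  const limit = data?.daily_limit_usd;
  if (typeof spent !== "number" || typeof limit !== "number") {
    return { allowed: false, spent: 0, limit: 0, remaining: 0 };
  }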

  return {
    allowed: spent < limit * 0.8, // Stop at 80% to leave headroom
    spent,
    limit,
    remaining: limit - spent,
  };
}

export async function handleScheduled(
  event: ScheduledEvent,
  env: Env
): Promise<void> {
  const log = createLogger("pages-plus").child({ trigger: "cron", cron: event.cron });

  // 1. Budget check BEFORE any LLM call
  const budget = await checkBudget("pages-plus", env);

  if (!budget.allowed) {
    log.warn({
      msg: "cron skipped: budget limit approaching",
      spent: budget.spent,
      limit: budget.limit,
      remaining: budget.remaining,
    });
    return; // Exit gracefully. Do nothing. Cost: $0.
  }

  log.info({
    msg: "cron executing",
    budget: {
      spent: budget.spent,
      limit: budget.limit,
      remaining: budget.remaining,
    },
  });

  // 2. Process with cost awareness
  const pendingItems = await getPendingContentItems(env);

  // Estimate cost before processing
  const estimatedCostPerItem = 0.03; // Based on historical average
  const maxItems = Math.floor(budget.remaining / estimatedCostPerItem);
  const itemsToProcess = pendingItems.slice(0, Math.min(maxItems, 10)); // Cap at 10 per run

  log.info({
    msg: "processing items",
    pending: pendingItems.length,
    processing: itemsToProcess.length,
    maxByBudget: maxItems,
  });

  // Count outcomes separately so the summary log does not report failures as processed
  let succeeded = 0;
  let failed = 0;

  for (const item of itemsToProcess) {
    try {
      await processContentItem(item, env);
      succeeded++;
    } catch (err) {
      failed++;
      log.error({
        msg: "item processing failed",
        itemId: item.id,
        error: err instanceof Error ? err.message : String(err),
      });
      // Continue with next item, don't fail the whole batch
    }
  }

  log.info({
    msg: "cron completed",
    processed: succeeded,
    failed,
    skipped: pendingItems.length - itemsToProcess.length,
  });
}
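
One subtlety in the handler above: `checkBudget` stops at 80% of the daily limit, but `maxItems` is derived from the full remaining budget, so a single run can still eat into the headroom. A possible refinement is sketched below as a pure function so it can be unit tested. `planRun`, `RunPlan`, and the option names are hypothetical and not part of the original pipeline:

```typescript
interface RunPlan {
  allowed: boolean;
  maxItems: number;
  usableBudget: number; // dollars left before the headroom threshold
}

function planRun(
  spent: number,
  dailyLimit: number,
  estimatedCostPerItem: number,
  options: { headroomRatio?: number; perRunCap?: number } = {}
): RunPlan {
  const { headroomRatio = 0.8, perRunCap = 10 } = options;

  // Fail closed on nonsensical inputs (NaN, zero, negative)
  if (!(dailyLimit > 0) || !(estimatedCostPerItem > 0) || !(spent >= 0)) {
    return { allowed: false, maxItems: 0, usableBudget: 0 };
  }

  // Budget available before crossing the headroom threshold, never negative
  const usableBudget = Math.max(0, dailyLimit * headroomRatio - spent);

  // Items the usable budget can pay for, capped per run
  const byBudget = Math.floor(usableBudget / estimatedCostPerItem);
  const maxItems = Math.min(byBudget, perRunCap);

  return { allowed: maxItems > 0, maxItems, usableBudget };
}
```

The estimate is still only an estimate: if real per-item cost runs well above the historical average, the run can overshoot, which is why the proxy-side daily limit remains the hard stop.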

Kill Switch Pattern

Every automated pipeline needs a manual kill switch that does not require a deployment:

// Check a KV flag before running ANY automated pipeline
async function isPipelineEnabled(
  pipeline: string,
  env: Env
): Promise<boolean> {
  // KV read is fast, cheap, and can be updated without deployment
  const flag = await env.KV.get(`pipeline:${pipeline}:enabled`);

  // Default to DISABLED -- pipelines must be explicitly enabled
  // This is the opposite of the usual default, and it's intentional.
  // An unset flag means "we haven't verified this pipeline yet."
  return flag === "true";
}

// In the cron handler
export async function handleScheduled(
  event: ScheduledEvent,
  env: Env
): Promise<void> {
  // Kill switch check -- before budget check, before anything
  if (!(await isPipelineEnabled("content-generation", env))) {
    // Silent return. No log spam. The pipeline is off.
    return;
  }

  // Budget check
  const budget = await checkBudget("pages-plus", env);
  if (!budget.allowed) return;

  // ... actual work
}

// To disable a pipeline in an emergency:
// wrangler kv key put "pipeline:content-generation:enabled" "false" --namespace-id <id>
// No deployment, no code change. KV is eventually consistent, so allow up to
// about a minute for the new value to reach every location.

Cron Governance Checklist

Before enabling any cron that triggers LLM calls:

// src/crons/governance.ts

interface CronGovernanceCheck {
  pipeline: string;
  checks: {
    killSwitchExists: boolean; // Can you disable without deploying?
    budgetCheckExists: boolean; // Does it check spend before calling LLMs?
    dailyLimitSet: boolean; // Is there a dollar limit per day?
    costAttribution: boolean; // Does each call tag its function + project?
    maxItemsCapped: boolean; // Is there a per-run item limit?
    humanReviewGate: boolean; // Do outputs get reviewed before publishing?
    errorHandling: boolean; // Does failure in one item not kill the batch?
    logging: boolean; // Is there structured logging for observability?
  };
}

// Every check must be true before the cron is enabled.
// This is the standard that was written after the $47 incident.
// It exists because we did not have it before.
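
The checklist can also be enforced in code rather than by convention. Below is a small sketch under that assumption; the `failingChecks` and `canEnableCron` helpers are hypothetical, and the checks type is repeated so the block stands alone:

```typescript
// Same fields as CronGovernanceCheck["checks"], repeated for self-containment.
type GovernanceChecks = {
  killSwitchExists: boolean;
  budgetCheckExists: boolean;
  dailyLimitSet: boolean;
  costAttribution: boolean;
  maxItemsCapped: boolean;
  humanReviewGate: boolean;
  errorHandling: boolean;
  logging: boolean;
};

// Names of every check that is not yet satisfied.
function failingChecks(checks: GovernanceChecks): string[] {
  const names = Object.keys(checks) as Array<keyof GovernanceChecks>;
  return names.filter((name) => !checks[name]);
}

// A cron may be enabled only when nothing is failing.
function canEnableCron(checks: GovernanceChecks): boolean {
  return failingChecks(checks).length === 0;
}
```

A deploy script or CI step could call `canEnableCron` and refuse to set the kill-switch flag to "true" while `failingChecks` returns anything.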

Example 1: Extracting Structured Data from Unstructured Text

import { generateObject } from "ai";
import { google } from "@ai-sdk/google";
import { z } from "zod";

const InvoiceSchema = z.object({
  vendor: z.string(),
  invoiceNumber: z.string(),
  date: z.string().date(),
  lineItems: z.array(
    z.object({
      description: z.string(),
      quantity: z.number(),
      unitPrice: z.number(),
      total: z.number(),
    })
  ),
  subtotal: z.number(),
  tax: z.number(),
  total: z.number(),
  currency: z.string().length(3),
});

async function extractInvoice(rawText: string) {
  const { object } = await generateObject({
    model: google("gemini-2.5-flash"),
    schema: InvoiceSchema,
    prompt: `Extract invoice data from this text:\n\n${rawText}`,
  });

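  // If this call returns, object has passed InvoiceSchema validation:
  // every lineItems[i].total is a number and currency is a 3-char string.
  // No regex, no manual parsing. generateObject throws if the model's
  // output cannot be validated, so handle that error at the call site.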
  return object;
}

Example 2: Cost-Aware Model Selection

function selectModel(task: {
  complexity: "low" | "medium" | "high";
  budgetRemaining: number;
  requiresReasoning: boolean;
}): string {
  // If budget is tight, always use the cheapest model
  if (task.budgetRemaining < 0.50) {
    return "gemini-2.5-flash"; // $0.15/M input, $0.60/M output
  }

  // High-complexity reasoning tasks justify thinking tokens
  if (task.complexity === "high" && task.requiresReasoning) {
    // But only if budget allows -- thinking tokens are 5.8x output
    if (task.budgetRemaining > 2.0) {
      return "gemini-2.5-pro"; // $1.25/M input, $10/M output
    }
  }

  // Default: Flash handles 90% of tasks adequately
  return "gemini-2.5-flash";
}
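
The price comments above become more concrete with a worked calculation. This sketch assumes the Flash rates quoted in the comments ($0.15/M input, $0.60/M output) and derives a thinking rate from the article's "5.8x output" figure. The type and function names are illustrative, and current provider pricing should be checked before relying on any of these numbers:

```typescript
interface TokenCounts {
  input: number;
  output: number;
  thinking: number;
}

interface PerMillionRates {
  input: number;
  output: number;
  thinking: number;
}

// Assumed rates in USD per million tokens (see lead-in for where they come from)
const FLASH_RATES: PerMillionRates = {
  input: 0.15,
  output: 0.6,
  thinking: 0.6 * 5.8,
};

// Cost of one call. Set includeThinking to false to reproduce the naive
// (input + output only) formula that undercounts.
function costUsd(
  tokens: TokenCounts,
  rates: PerMillionRates,
  includeThinking: boolean = true
): number {
  const base = tokens.input * rates.input + tokens.output * rates.output;
  const thinking = includeThinking ? tokens.thinking * rates.thinking : 0;
  return (base + thinking) / 1_000_000;
}

// Token counts from the sample log line earlier in the article
const sample: TokenCounts = { input: 2340, output: 1200, thinking: 3800 };
const fullCost = costUsd(sample, FLASH_RATES);
const naiveCost = costUsd(sample, FLASH_RATES, false);
```

Under these assumed rates the full cost is about $0.0143 and the naive cost about $0.0011, so for this single call the thinking tokens account for more than 90% of the bill.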

Example 3: Zod Schema for Queue Message Validation

import { z } from "zod";

const DomainMessageSchema = z.object({
  event_id: z.string().uuid(),
  type: z.string().regex(/^[a-z]+\.[a-z]+$/), // e.g., "content.published"
  source: z.string().min(1),
  timestamp: z.string().datetime(),
  correlation_id: z.string().optional(),
  payload: z.record(z.unknown()),
});

type DomainMessage = z.infer<typeof DomainMessageSchema>;

// In queue consumer
async function handleBatch(batch: MessageBatch<unknown>) {
  for (const msg of batch.messages) {
    const result = DomainMessageSchema.safeParse(msg.body);

    if (!result.success) {
      console.error("Invalid queue message:", result.error.issues);
      msg.ack(); // Discard malformed messages, don't retry
      continue;
    }

    const message = result.data;
    await processMessage(message);
    msg.ack();
  }
}

Example 4: Shared Type Package Structure

// packages/shared/src/index.ts
// Explicit named exports only -- no `export *` wildcards

export {
  GeminiResponseSchema,
  GeminiUsageSchema,
  extractText,
  extractUsage,
} from "./types/gemini";

export type { GeminiResponse, GeminiUsage } from "./types/gemini";

export {
  DomainMessageSchema,
  type DomainMessage,
} from "./types/messages";

export {
  calculateCost,
  MODEL_PRICING,
} from "./cost/calculator";

export type { TokenUsage, ModelPricing } from "./cost/calculator";

// packages/shared/package.json
{
  "name": "@acme/shared",
  "version": "1.0.0",
  "exports": {
    ".": "./src/index.ts",
    "./types/*": "./src/types/*.ts",
    "./cost/*": "./src/cost/*.ts"
  }
}

// pages-plus/package.json
{
  "dependencies": {
    "@acme/shared": "workspace:*"
  }
}

Example 5: Retry with Exponential Backoff for LLM Calls

async function withRetry<T>(
  fn: () => Promise<T>,
  options: {
    maxRetries?: number;
    baseDelayMs?: number;
    maxDelayMs?: number;
    retryOn?: (error: unknown) => boolean;
  } = {}
): Promise<T> {
  const {
    maxRetries = 3,
    baseDelayMs = 1000,
    maxDelayMs = 30000,
    retryOn = () => true,
  } = options;

  let lastError: unknown;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;

      if (attempt === maxRetries || !retryOn(err)) {
        throw err;
      }

      const delay = Math.min(baseDelayMs * 2 ** attempt, maxDelayMs);
      const jitter = delay * (0.5 + Math.random() * 0.5);
      await new Promise((resolve) => setTimeout(resolve, jitter));
    }
  }

  throw lastError;
}

// Usage with LLM calls
const result = await withRetry(
  () =>
    generateObject({
      model: google("gemini-2.5-flash"),
      schema: ArticleSchema,
      prompt: "...",
    }),
  {
    maxRetries: 2,
    retryOn: (err) => {
      // Retry on rate limits and server errors
      // Do NOT retry on validation errors (schema mismatch)
      if (err instanceof Error) {
        return err.message.includes("429") || err.message.includes("500");
      }
      return false;
    },
  }
);
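
To see what the delay formula in `withRetry` actually produces, the calculation can be pulled out into a pure function with an injectable random source. This is a sketch for illustration; `backoffDelayMs` is a hypothetical helper, not something the original code defines:

```typescript
// Same formula as withRetry: exponential delay, capped, then scaled by a
// jitter factor in [0.5, 1.0). The random source is injectable for testing.
function backoffDelayMs(
  attempt: number,
  baseDelayMs: number = 1000,
  maxDelayMs: number = 30000,
  random: () => number = Math.random
): number {
  const capped = Math.min(baseDelayMs * 2 ** attempt, maxDelayMs);
  return capped * (0.5 + random() * 0.5);
}

// Upper and lower bounds of the wait before each retry, using the defaults
const upperBounds = [0, 1, 2].map((attempt) => backoffDelayMs(attempt, 1000, 30000, () => 1));
const lowerBounds = [0, 1, 2].map((attempt) => backoffDelayMs(attempt, 1000, 30000, () => 0));
```

With the defaults (three retries), the waits fall between 0.5 and 1 s, 1 and 2 s, and 2 and 4 s, so a call that fails every attempt spends at most 7 seconds sleeping. The jitter keeps many workers that hit a rate limit at the same moment from retrying in lockstep.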

Example 6: Pino Child Logger Chain

import pino from "pino";

const root = pino({
  browser: {
    asObject: true,
    write: {
      info: (o: object) => console.log(JSON.stringify(o)),
      error: (o: object) => console.error(JSON.stringify(o)),
    },
  },
  base: { service: "scalable-media" },
});

// Request-level context
const requestLog = root.child({ requestId: "abc-123" });

// Function-level context
const contentLog = requestLog.child({ function: "content-generate" });

// LLM call-level context
const llmLog = contentLog.child({
  provider: "gemini",
  model: "gemini-2.5-flash",
});

llmLog.info({ msg: "calling model", promptTokens: 2340 });
// Output: { service, requestId, function, provider, model, msg, promptTokens }
// Every field from every ancestor is included automatically.

llmLog.info({
  msg: "model responded",
  outputTokens: 1200,
  thinkingTokens: 3800,
  cost: 0.0145,
  latencyMs: 4523,
});
// Full trace from service to specific LLM call, all in one log line.

Example 7: Validation Error Formatting for API Consumers

import { z } from "zod";

function formatZodError(error: z.ZodError): {
  error: string;
  issues: Array<{ path: string; code: string; message: string; expected?: string; received?: string }>;
} {
  return {
    error: "validation_failed",
    issues: error.issues.map((issue) => {
      const base = {
        path: issue.path.join("."),
        code: issue.code,
        message: issue.message,
      };

      if (issue.code === "invalid_type") {
        return {
          ...base,
          expected: issue.expected,
          received: issue.received,
        };
      }

      return base;
    }),
  };
}

// Usage
const result = CreateContentSchema.safeParse(body);
if (!result.success) {
  return c.json(formatZodError(result.error), 400);
}

// Response to the client:
// {
//   "error": "validation_failed",
//   "issues": [
//     { "path": "sections.0.heading", "code": "too_small", "message": "String must contain at least 1 character(s)" },
//     { "path": "metadata.seo.keywords", "code": "too_big", "message": "Array must contain at most 10 element(s)" }
//   ]
// }

Example 8: Daily Cost Dashboard Query

// GET /v1/costs/dashboard
app.get("/v1/costs/dashboard", async (c) => {
  const db = c.env.DB;

  const [byProject, byFunction, byModel, dailyTrend] = await Promise.all([
    // Cost by project (today)
    db
      .prepare(
        `SELECT project_id, ROUND(SUM(cost_usd), 2) as cost,
         COUNT(*) as calls, SUM(thinking_tokens) as thinking
         FROM api_calls_ledger
         WHERE created_at >= date('now')
         GROUP BY project_id ORDER BY cost DESC`
      )
      .all(),

    // Cost by function (today)
    db
      .prepare(
        `SELECT function, ROUND(SUM(cost_usd), 2) as cost,
         COUNT(*) as calls
         FROM api_calls_ledger
         WHERE created_at >= date('now')
         GROUP BY function ORDER BY cost DESC`
      )
      .all(),

    // Cost by model (today)
    db
      .prepare(
        `SELECT model, ROUND(SUM(cost_usd), 2) as cost,
         COUNT(*) as calls,
         SUM(input_tokens) as input_tokens,
         SUM(output_tokens) as output_tokens,
         SUM(thinking_tokens) as thinking_tokens
         FROM api_calls_ledger
         WHERE created_at >= date('now')
         GROUP BY model ORDER BY cost DESC`
      )
      .all(),

    // Daily trend (last 7 days)
    db
      .prepare(
        `SELECT date(created_at) as day, ROUND(SUM(cost_usd), 2) as cost,
         COUNT(*) as calls
         FROM api_calls_ledger
         WHERE created_at >= date('now', '-7 days')
         GROUP BY day ORDER BY day`
      )
      .all(),
  ]);

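  // .all() resolves to { results, success, meta }; return only the rows
  return c.json({
    byProject: byProject.results,
    byFunction: byFunction.results,
    byModel: byModel.results,
    dailyTrend: dailyTrend.results,
  });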
});

LLM Response Parsing

Approach	How It Works	Pros	Cons
Manual regex + JSON.parse	Strip markdown fences, trim, parse, cast	No dependencies	Fragile, incomplete coverage, no retries, `as` type assertions lie
Provider function calling	Define functions, model returns structured call	Provider-native, no SDK dependency	Provider-specific API, manual validation still needed, inconsistent across providers
Vercel AI SDK `generateObject()`	Zod schema sent as response format, SDK validates + retries	Type-safe, provider-agnostic, automatic retries, zero parsing code	Adds dependency (~50KB), requires Zod schema upfront, deeply nested schemas can cause issues
Instructor (Python)	Pydantic models as output schemas, retries on validation failure	Pythonic, good error messages, patches provider clients	Python only, monkey-patches clients, adds complexity
Outlines / Guidance	Constrained generation at token level	Guaranteed valid output, fastest	Requires model server access, not for API-based models

Recommendation: Use generateObject() for TypeScript projects. It handles the 95% case (structured output from API-based models) with zero boilerplate. If you are in Python, Instructor is the equivalent. If you run your own model server, Outlines gives you guaranteed-valid generation.

Cost Tracking

Approach	Setup	Token Tier Support	Attribution	Daily Limits	Real-Time
No tracking	None	None	None	None	No
Provider dashboards	Already available	Full (provider knows)	Project-level (by API key)	Manual alerts only	Yes
Cloudflare AI Gateway	One URL change	Full	By gateway ID	Rate limiting available	Yes
LiteLLM Proxy	Self-hosted proxy	Full	By key/user/team/tag	Per-key budgets	Yes
Portkey	SDK + dashboard	Full	By virtual key/metadata	Budget alerts	Yes
Custom proxy	Build + deploy	You implement it	Fully customizable	You implement it	You implement it

Recommendation: Start with Cloudflare AI Gateway if you are already on Cloudflare — it is free and requires one line of code. Move to LiteLLM or a custom proxy when you need function-level cost attribution with custom tags. Provider dashboards are not enough — they show total spend but not why each dollar was spent.

Agent State Management

Runtime	Language	State Persistence	Recovery Model	Deployment	Best For
Manual state machine	Any	In-memory (lost on crash)	None — restart from scratch	Wherever your code runs	Prototypes, simple sequences
CF Agents SDK	TypeScript	SQLite per agent (durable)	Resume from last checkpoint	Cloudflare edge, global	Per-entity agents (one per user/brand), edge-native
Temporal	TS, Python, Go, Java	Event-sourced (durable)	Replay from event history	Self-hosted or cloud	Mission-critical workflows, multi-service orchestration
Inngest	TypeScript	Step-level (durable)	Resume from last step	Serverless, event-driven	Event-triggered pipelines, serverless-friendly
LangGraph	Python, TypeScript	Configurable (Redis, Postgres)	Checkpoint-based	Self-hosted	Complex agent graphs with branching, AI-specific abstractions

Recommendation: If you are on Cloudflare, the Agents SDK is the obvious choice: it runs on Durable Objects with built-in SQLite state, hibernation, and global distribution. If you are not on Cloudflare, use Temporal for complex multi-service workflows or Inngest for simpler event-driven pipelines. Choose LangGraph if you need graph-based agent routing, but be prepared for production challenges with state management and debugging.

Logging on Edge/Serverless

Library	Edge/Worker Support	Structured Output	Child Loggers	Size	Notes
`console.log`	Native	No (strings only)	No	0KB	Not logging, just printing
Pino (browser mode)	Yes (browser mode)	Yes (JSON)	Yes	~15KB	Fastest structured logger, browser mode uses console internally
Winston	No (Node.js only)	Yes	Yes (`logger.child()`)	~200KB	Too heavy, Node.js APIs, not edge-compatible
Custom JSON wrapper	Yes	Yes	If you build it	1-5KB	Full control, maintenance burden
Workers Logs	Native (CF only)	Yes	No	0KB	Cloudflare-specific, automatic, but limited control

Recommendation: Pino in browser mode for Cloudflare Workers. It gives you structured JSON, child loggers with context propagation, log levels, and it works by writing to console.log internally — which Workers capture. Zero infrastructure to set up. If you want Cloudflare-native, Workers Logs is automatic but less flexible.

Input Validation

Approach	Runtime Validation	Type Safety	Error Messages	Schema Reuse	Bundle Size
None (`request.json()`)	No	No (`any`)	N/A (crashes)	N/A	0KB
Manual `if` checks	Yes (incomplete)	No (no narrowing)	Custom but inconsistent	Copy-paste	0KB
Zod	Yes (complete)	Yes (`z.infer`)	Structured, detailed	Schema objects	~14KB
tRPC	Yes (via Zod)	Yes (end-to-end)	Structured	Full stack	~30KB
Valibot	Yes (complete)	Yes (similar to Zod)	Structured	Schema objects	~5KB
ArkType	Yes (complete)	Yes (type-first)	Structured	Schema objects	~40KB

Recommendation: Zod is the standard choice. It has the largest ecosystem and works with the AI SDK, tRPC, React Hook Form, and most TypeScript tools. If bundle size is critical (edge workers), Valibot offers similar functionality at roughly a third of the size. Choose tRPC if you control both client and server and want end-to-end type safety.

#	Don’t	Do Instead	Why
1	Parse LLM responses with regex + `JSON.parse` + `as`	Use `generateObject()` with a Zod schema	Regex misses variants, `as` lies to TypeScript, no retries on malformed responses
2	Call provider APIs directly with raw `fetch()`	Route through a centralized proxy with per-call metering	No cost tracking = $47 in 5 days with zero attribution
3	Calculate cost as `(input + output) * price`	Track ALL token tiers: input, output, thinking, cache_read, cache_write	Thinking tokens cost 5.8x output tokens, missing them causes 9x undercounts
4	Let files grow past 300 lines with mixed concerns	Split by single responsibility, target <300 lines per module	God modules hide coupling, make testing expensive, slow prompt iteration
5	Use `await c.req.json()` without validation	Validate with Zod at every system boundary (HTTP, queue, function)	Unvalidated input propagates through queues and databases, crashes downstream
6	Copy-paste `GeminiApiResponse` interface across files	Shared type packages, or use AI SDK (no provider types needed)	Copies drift. One copy misses `thoughtsTokenCount`. 9x undercount follows.
7	Hand-roll state machines with in-memory state	Use a stateful agent runtime (CF Agents SDK, Temporal, Inngest)	No persistence = restart from scratch on crash. Wasted LLM spend.
8	Put provider API keys in every worker	Centralize keys in one proxy service	Rotation requires N deployments. No kill switch. No centralized rate limiting.
9	Use `console.log("something happened")` for observability	Pino browser mode with child loggers for context propagation	Strings are not searchable. No request correlation. No cost/latency data.
10	Run crons that call LLMs without budget checks	Budget-aware pipelines with daily limits, cost checks, and kill switches	193 ghost posts in 5 days. $37 in untracked spend. No human in the loop.

Here is the architecture that addresses all 10 anti-patterns simultaneously:

┌─────────────────────────────────────────────────────────────────────┐
│                         APPLICATION LAYER                          │
│                                                                    │
│  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐          │
│  │ Content       │  │ Analysis      │  │ Agent         │          │
│  │ Pipeline      │  │ Service       │  │ Service       │          │
│  │               │  │               │  │               │          │
│  │ Zod schemas   │  │ Zod schemas   │  │ CF Agents SDK │          │
│  │ generateObject│  │ generateObject│  │ Durable state │          │
│  │ Pino logging  │  │ Pino logging  │  │ Pino logging  │          │
│  │ Budget checks │  │ Budget checks │  │ Budget checks │          │
│  └───────┬───────┘  └───────┬───────┘  └───────┬───────┘          │
│          │                  │                  │                    │
│          └──────────────────┼──────────────────┘                   │
│                             │                                      │
│                    X-Project-Id + X-Function + X-Tags              │
│                             │                                      │
│                             ▼                                      │
│  ┌──────────────────────────────────────────────────────────┐      │
│  │                    API PROXY                              │     │
│  │                                                          │      │
│  │  Auth → Rate Limit → Budget Check → Forward → Meter      │     │
│  │                                                          │      │
│  │  - One service holds all provider API keys               │      │
│  │  - Every call logged to api_calls_ledger                 │      │
│  │  - All token tiers tracked (input/output/thinking/cache) │      │
│  │  - Daily spend limits enforced per project               │      │
│  │  - Kill switch via project.active flag                   │      │
│  └──────────────────────────┬───────────────────────────────┘      │
│                             │                                      │
└─────────────────────────────┼──────────────────────────────────────┘
                              │
                              ▼
              ┌───────────────────────────────┐
              │      PROVIDER APIs            │
              │  Gemini | OpenAI | Anthropic  │
              │  Brave | Perplexity | etc.    │
              └───────────────────────────────┘

The key properties of this stack:

Every LLM call is structured. Zod schemas define the output. generateObject() validates it. No regex, no JSON.parse, no as.
Every LLM call is metered. The proxy records project, function, tags, all token tiers, and calculated cost. Monthly reconciliation catches formula drift.
Every LLM call is budgeted. Daily limits stop runaway spending. Kill switches disable pipelines without deployment. Crons check budget before calling.
Every module is small. Under 300 lines. Single responsibility. Independently testable. Prompts are easy to find and iterate.
Every boundary is validated. HTTP endpoints, queue consumers, function arguments — Zod schemas at every transition point. Invalid data is rejected at the edge, not in the database.
Every type is defined once. Shared packages or SDK abstractions. No copy-paste drift. One update propagates everywhere.
Every agent has durable state. CF Agents SDK or equivalent runtime. Crash recovery resumes from the last checkpoint. No wasted LLM spend.
Every key is centralized. One service, one rotation point, one kill switch. No scattered secrets.
Every log is structured. Pino with child loggers. Request IDs propagate through the call chain. Cost, latency, and token counts are logged fields, not embedded strings.
Every cron is governed. Budget check before execution. Item caps per run. Kill switch via KV. Human review gate before publishing.
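
The metering property only holds if every caller tags its requests. Below is a small sketch of a helper that builds the attribution headers named in the diagram. The comma-separated `X-Tags` encoding and the `attributionHeaders` name are assumptions, since the proxy's actual contract is not shown here:

```typescript
interface Attribution {
  projectId: string; // e.g. "pages-plus"
  fn: string; // the calling function, e.g. "content-generate"
  tags?: string[]; // optional free-form labels
}

// Build the headers a caller would attach to every request sent to the proxy.
// Throws rather than sending an unattributed call.
function attributionHeaders(a: Attribution, apiKey: string): Record<string, string> {
  if (a.projectId.trim() === "" || a.fn.trim() === "") {
    throw new Error("projectId and fn are required for cost attribution");
  }

  const headers: Record<string, string> = {
    "X-Project-Id": a.projectId,
    "X-Function": a.fn,
    "X-Api-Key": apiKey,
  };

  // Assumed encoding: comma-separated, with empty tags dropped
  const tags = (a.tags ?? []).map((t) => t.trim()).filter((t) => t !== "");
  if (tags.length > 0) {
    headers["X-Tags"] = tags.join(",");
  }

  return headers;
}
```

Failing loudly on a missing project or function name is a deliberate choice: an unattributed call is exactly the kind of spend the ledger cannot explain later.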

This is not theoretical. This is the stack that replaced the one that burned $47 in 5 days.

Official Documentation

Vercel AI SDK - generateObject — API reference for structured output generation with schema validation
Vercel AI SDK - Generating Structured Data — Guide to structured data extraction with Zod schemas
Vercel AI SDK 6 — Unified generateObject/generateText with multi-step tool calling and structured output
Zod Documentation — TypeScript-first schema validation with static type inference
Cloudflare AI Gateway — Managed LLM proxy with analytics, caching, rate limiting
Cloudflare AI Gateway Pricing — Core features free, Logpush extra
Cloudflare Agents SDK — Stateful AI agents on Durable Objects with built-in SQL, scheduling, WebSocket
Cloudflare Agents API Reference — Agent class methods, state management, lifecycle hooks
Cloudflare Agent Class Internals — How Agents extend Durable Objects
Cloudflare Durable Objects — Stateful micro-servers with SQLite, the foundation for Agents
Cloudflare Workers Logs — Native structured logging for Workers
Gemini API Pricing — Token tier pricing including thinking tokens for Flash/Pro models
Pino Logger — Low overhead Node.js logger with browser mode for edge runtimes
LiteLLM — Open-source LLM proxy supporting 100+ providers with cost tracking
LiteLLM Spend Tracking — Per-key/user/team cost attribution and budget management
LiteLLM Tag Budgets — Tag-based spend tracking for cost center attribution
Temporal — Durable execution platform for mission-critical workflows
Inngest — Event-driven serverless workflow orchestration
LangGraph — Graph-based AI agent orchestration framework

Blog Posts and Guides

The Complete Guide to LLM Observability (Portkey) — Traces, metrics, events for production LLM apps
LLM Cost Tracking Solution (TrueFoundry) — Observability, governance, and cost optimization for LLM operations
Cloudflare AI Gateway Pricing Explained — Detailed pricing breakdown and comparison
Comparing Inngest and Temporal for State Management — Side-by-side comparison for distributed systems
The Ultimate Guide to TypeScript Orchestration — Temporal vs Trigger.dev vs Inngest comparison
Orchestrating Multi-Step Agents: Temporal/Dagster/LangGraph Patterns — Patterns for long-running agent work
Prototype to Production-Ready Agentic AI (Temporal) — Moving LangGraph agents to Temporal for production
We Tested 8 LangGraph Alternatives (ZenML) — Practical comparison of agent orchestration frameworks
Building AI Agents with MCP, AuthN/AuthZ, and Durable Objects — Cloudflare’s vision for agent infrastructure
Pino on Cloudflare Workers (GitHub Issue #2035) — Browser mode usage for Workers environments
Structured Outputs with Vercel AI SDK — Practical guide to generateObject with Zod
Professional Validation with Zod (CodeSignal) — Production-readiness patterns with Zod
LiteLLM Cost Tracking: Multi-Model Expense Management (Statsig) — Real-world cost tracking with LiteLLM
Monitor LiteLLM with Datadog — LLM observability integration

Companion Articles

The Three-Layer AI Agent Architecture — Agent runtime + LLM interface + API proxy, the architectural pattern this article’s fixes implement
Event-Driven Architecture on Cloudflare Workers — Queues, fan-out, idempotency, consumer middleware
Cost Observability for Cloudflare Workers — D1 row reads incident, Analytics Engine, budget governor patterns
Building an Autonomous Data Pipeline on Cloudflare Workers — Workers + D1 + Queues + DO + R2, priority scheduler, cost story

Libraries and Tools

Vercel AI SDK (npm: ai) — TypeScript SDK for building AI applications, structured output, streaming
Zod (npm: zod) — TypeScript-first schema validation, runtime + compile-time safety
Pino (npm: pino) — Low-overhead structured logger for Node.js and browser environments
LiteLLM (GitHub) — Python SDK and proxy server for 100+ LLM APIs with cost tracking
Cloudflare Agents (GitHub) — Build and deploy stateful AI agents on Cloudflare’s edge network
Hono — Lightweight web framework for Cloudflare Workers, Deno, Bun, Node.js
hono-pino (JSR) — Pino integration middleware for Hono

Every anti-pattern in this article was shipped to production, discovered through operational pain, and fixed with the techniques described. The $47 was real. The 193 ghost posts were real. The 9x undercount was real. The fixes are also real, and they are running in production today.
