Gary Wu

Production AI Anti-Patterns


Org Status: 🟑 Dormant Cloudflare: N/A Last Audited: 2026-04-28


Ten production anti-patterns that cost us $282/month in invisible LLM spend, buried bugs in 1,000-line god modules, and left 223 of 226 API endpoints wide open to malformed input. Every anti-pattern here comes from a real codebase running real traffic. Every fix has been shipped and measured.

This is not a theoretical taxonomy. It is a post-mortem.

What you will learn:


  1. The Problem
  2. Core Concepts
  3. Anti-Pattern 1: Manual JSON Parsing from LLM Responses
  4. Anti-Pattern 2: Raw LLM Calls Without Cost Tracking
  5. Anti-Pattern 3: Thinking Token Pricing Blindspot
  6. Anti-Pattern 4: God Modules
  7. Anti-Pattern 5: No Input Validation on API Endpoints
  8. Anti-Pattern 6: Duplicated Type Definitions
  9. Anti-Pattern 7: Manual State Machines for Agents
  10. Anti-Pattern 8: Scattered API Keys
  11. Anti-Pattern 9: No Structured Logging
  12. Anti-Pattern 10: Uncontrolled Cron Spending
  13. Small Examples
  14. Comparisons
  15. The Anti-Pattern Summary Table
  16. Putting It All Together: The Modern Production AI Stack
  17. References

You built an LLM-powered application. It works in development. You deploy it. Within a week, you discover:

  1. You cannot explain your AI spend. The Gemini dashboard says $47 in 5 days. Your internal tracking says $1.14. The delta is not a rounding error; it is a 9x undercount caused by not pricing thinking tokens.

  2. You cannot trust your outputs. Half your LLM responses are wrapped in markdown code fences (```json … ```). Your regex-based cleanup handles 4 of the 7 variations providers emit. The other 3 crash silently and return `undefined` to the database.

  3. You cannot find the bug. The relevant logic lives in a 1,126-line file that handles SEO analysis, content scoring, keyword extraction, SERP parsing, and site audits. When something breaks, you read the whole file.

  4. You cannot stop the bleeding. Nine crons run every 5-30 minutes, each triggering LLM calls. There is no budget awareness, no kill switch, no human in the loop. The pipeline generated 193 blog posts in 5 days. Nobody reviewed them. Nobody knew they existed.

These are not hypothetical risks. These are the bugs we shipped, the money we burned, and the emergency sessions we ran to fix them. The core issue is always the same: LLM applications have different failure modes than traditional software, and traditional engineering practices do not cover them.

Traditional software is deterministic. You call a function, you get a return value, you check the type. LLM software is probabilistic. You send a prompt, you get a response that might be JSON, might have the right fields, might cost $0.003 or $0.30 depending on whether the model decided to "think" about it. The output shape, the cost, and the latency are all variable, and if your code assumes they are fixed, you will get surprised in production.

The modern fix is not one tool. It is a stack:


Before diving into the anti-patterns, here are three foundational concepts that underpin every fix.

Structured Output

The idea that an LLM call should return a typed object, not a string you have to parse.

import { generateObject } from "ai";
import { google } from "@ai-sdk/google";
import { z } from "zod";

// The schema IS the specification
const ArticleSchema = z.object({
  title: z.string().describe("SEO-optimized article title"),
  slug: z.string().describe("URL-safe slug"),
  sections: z
    .array(
      z.object({
        heading: z.string(),
        content: z.string(),
        wordCount: z.number(),
      })
    )
    .describe("Article sections in order"),
  seoMeta: z.object({
    description: z.string().max(160),
    keywords: z.array(z.string()).max(10),
  }),
});

type Article = z.infer<typeof ArticleSchema>;

const { object: article } = await generateObject({
  model: google("gemini-2.5-flash"),
  schema: ArticleSchema,
  prompt: `Write an article about TypeScript monorepo patterns`,
});

// article is fully typed as Article
// No parsing. No try/catch. No regex cleanup.
console.log(article.title); // string, guaranteed
console.log(article.sections[0].wordCount); // number, guaranteed

Key insight: When you use generateObject(), the schema is sent to the model as a response format constraint, so the model generates tokens that conform to the schema. The SDK then validates the response against the schema before returning. If validation fails, the call throws a typed error (NoObjectGeneratedError in the AI SDK) instead of returning a malformed object, so bad data cannot slip downstream silently.

Cost Attribution

The idea that every LLM call should record what it cost, who triggered it, and why.

interface CostRecord {
  // Identity
  project_id: string; // "pages-plus"
  function: string; // "article-write"
  service: string; // "gemini"
  model: string; // "gemini-2.5-flash"

  // Token tiers -- each priced differently
  input_tokens: number;
  output_tokens: number;
  thinking_tokens: number; // $3.50/M for Gemini Flash Thinking
  cache_read_tokens: number; // $0.0375/M
  cache_write_tokens: number;

  // Cost
  cost_usd: number; // Calculated from token tiers + model pricing

  // Context
  tags: string[]; // ["brand:llc-tax", "trigger:cron", "batch:2026-03-15"]
  timestamp: string; // ISO 8601
}

Key insight: The cost of an LLM call is not (input_tokens + output_tokens) * price_per_token. Models with thinking/reasoning modes have 3-5 different token tiers, each with different prices. If you only track two tiers, you will undercount by 2-10x.

Boundary Validation

The idea that every system boundary (HTTP endpoints, queue consumers, function arguments) should validate its input against a schema.

import { z } from "zod";

const PublishRequestSchema = z.object({
  brand_slug: z
    .string()
    .regex(/^[a-z0-9-]+$/)
    .min(1)
    .max(64),
  content_id: z.string().uuid(),
  publish_to: z.enum(["blog", "social", "newsletter"]),
  schedule_at: z.string().datetime().optional(),
  metadata: z
    .record(z.string(), z.unknown())
    .optional()
    .default({}),
});

type PublishRequest = z.infer<typeof PublishRequestSchema>;

// In the handler
app.post("/v1/publish", async (c) => {
  const result = PublishRequestSchema.safeParse(await c.req.json());

  if (!result.success) {
    return c.json(
      {
        error: "validation_failed",
        issues: result.error.issues.map((i) => ({
          path: i.path.join("."),
          message: i.message,
        })),
      },
      400
    );
  }

  // result.data is fully typed as PublishRequest
  // No casting, no `as any`, no runtime surprises
  return c.json(await publishContent(result.data, c.env));
});

Key insight: TypeScript types disappear at runtime. When your Worker receives a JSON body from the internet, TypeScript cannot guarantee its shape. Zod bridges compile-time and runtime: one schema gives you both the TypeScript type (via z.infer) and the runtime validator (via .parse()/.safeParse()). Define once, validate everywhere.
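To make the "types disappear at runtime" point concrete, here is a minimal sketch without any library. The trimmed-down `PublishRequest` interface and the `isPublishRequest` guard are illustrative assumptions, not code from the production system; a Zod schema gives you the equivalent check without writing it by hand.

```typescript
// Trimmed-down, hypothetical version of the request type for illustration.
interface PublishRequest {
  brand_slug: string;
  publish_to: "blog" | "social" | "newsletter";
}

// What actually arrives over the wire: brand_slug is a number, publish_to is missing.
const untrusted: unknown = JSON.parse('{"brand_slug": 42}');

// A cast compiles, but it performs no check at runtime.
const casted = untrusted as PublishRequest;
const castedSlugType = typeof casted.brand_slug; // "number", despite the declared string type

// Hand-written runtime guard: the check a schema library would generate for you.
function isPublishRequest(value: unknown): value is PublishRequest {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const slug = v["brand_slug"];
  const target = v["publish_to"];
  return (
    typeof slug === "string" &&
    /^[a-z0-9-]+$/.test(slug) &&
    (target === "blog" || target === "social" || target === "newsletter")
  );
}

const rejected = isPublishRequest(untrusted);
const accepted = isPublishRequest({ brand_slug: "llc-tax", publish_to: "blog" });
console.log({ castedSlugType, rejected, accepted });
```

The cast and the guard see the same bytes; only the guard notices that the payload is wrong.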


Anti-Pattern 1: Manual JSON Parsing from LLM Responses

The Problem

You ask an LLM for structured data. It returns a string. Sometimes the string is valid JSON. Sometimes it is JSON wrapped in markdown code fences. Sometimes it has a preamble like "Here's the JSON you requested:" before the actual object. Sometimes the keys are in a different order. Sometimes there are extra fields. Sometimes there are missing fields.

Your code looks like this:

// Found in 10+ files across 4 production projects
async function generateArticle(prompt: string, env: Env): Promise<Article> {
  const response = await fetch(
    "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent",
    {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "x-goog-api-key": env.GEMINI_API_KEY,
      },
      body: JSON.stringify({
        contents: [{ parts: [{ text: prompt }] }],
        generationConfig: {
          responseMimeType: "application/json",
        },
      }),
    }
  );

  const data = await response.json();
  const text = data.candidates?.[0]?.content?.parts?.[0]?.text;

  if (!text) {
    throw new Error("No response from Gemini");
  }

  // The cleanup gauntlet
  let cleaned = text
    .replace(/```json\n?/g, "")
    .replace(/```\n?/g, "")
    .replace(/^\s*Here.*?:\s*/i, "")
    .trim();

  try {
    const parsed = JSON.parse(cleaned);

    // Manual field validation
    if (!parsed.title || typeof parsed.title !== "string") {
      throw new Error("Missing or invalid title");
    }
    if (!Array.isArray(parsed.sections)) {
      throw new Error("Missing or invalid sections");
    }
    // 20 more lines of manual checks...

    return parsed as Article;
  } catch (err) {
    console.log("Failed to parse LLM response:", cleaned.substring(0, 200));
    throw new Error(`JSON parse failed: ${err.message}`);
  }
}

This pattern has five distinct failure modes:

  1. Regex misses a variant. The model outputs ```JSON (capital J) or ```json5 or wraps the response in <json> tags. Your regex does not handle it. JSON.parse throws. The operation fails silently or crashes.

  2. Field validation is incomplete. You check for title and sections but forget to check that each section has a heading and content. The partial object propagates downstream and crashes in a different function with an unhelpful error.

  3. Type assertion lies. parsed as Article tells TypeScript "trust me, this is an Article." TypeScript obeys. At runtime, it might be missing three fields. The type assertion bypasses every safety net TypeScript provides.

  4. No retry logic. If the model returns malformed JSON once, this code throws. There is no retry with a different prompt, no retry with a stricter system message, no fallback to a different model.

  5. No cost tracking. The Gemini response carries token counts in usageMetadata, but this code never reads them. You have no idea what this call cost. Multiply by 9 crons running every 15 minutes, and you get the $47-in-5-days disaster.
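Failure mode 1 is easy to reproduce. The sketch below isolates the same three-step cleanup chain and runs it against a handful of response shapes. The sample strings are made up for illustration, and the fence marker is built with `repeat` only so the example does not contain a literal code fence.

```typescript
const FENCE = "`".repeat(3); // three backticks, built indirectly

// The same cleanup chain as the manual-parsing example above.
function cleanup(text: string): string {
  return text
    .replace(new RegExp(FENCE + "json\\n?", "g"), "")
    .replace(new RegExp(FENCE + "\\n?", "g"), "")
    .replace(/^\s*Here.*?:\s*/i, "")
    .trim();
}

// True when the cleaned text parses as JSON.
function survivesCleanup(text: string): boolean {
  try {
    JSON.parse(cleanup(text));
    return true;
  } catch {
    return false;
  }
}

// Hypothetical response variants; only the first two match what the regexes expect.
const variants: Array<[string, string]> = [
  ["lowercaseFence", FENCE + 'json\n{"title":"x"}\n' + FENCE],
  ["preamble", 'Here is the JSON you requested:\n{"title":"x"}'],
  ["uppercaseFence", FENCE + 'JSON\n{"title":"x"}\n' + FENCE],
  ["json5Fence", FENCE + 'json5\n{"title":"x"}\n' + FENCE],
  ["xmlTags", '<json>{"title":"x"}</json>'],
];

const results = new Map<string, boolean>();
for (const [name, text] of variants) {
  results.set(name, survivesCleanup(text));
}
console.log(results);
```

The uppercase fence leaves a stray `JSON` prefix, the json5 fence leaves a stray `5`, and the tag wrapper is never touched, so all three reach JSON.parse unparseable.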

The Fix: generateObject() with Zod

import { generateObject } from "ai";
import { google } from "@ai-sdk/google";
import { z } from "zod";

const ArticleSchema = z.object({
  title: z.string().min(10).max(200),
  slug: z
    .string()
    .regex(/^[a-z0-9-]+$/)
    .max(100),
  sections: z
    .array(
      z.object({
        heading: z.string().min(1),
        content: z.string().min(50),
        wordCount: z.number().int().positive(),
      })
    )
    .min(3)
    .max(20),
  seoMeta: z.object({
    description: z.string().max(160),
    keywords: z.array(z.string()).min(1).max(10),
  }),
  readingTimeMinutes: z.number().positive(),
});

type Article = z.infer<typeof ArticleSchema>;

async function generateArticle(prompt: string): Promise<Article> {
  const { object, usage } = await generateObject({
    model: google("gemini-2.5-flash"),
    schema: ArticleSchema,
    prompt,
  });

  // object is Article -- fully validated, fully typed
  // usage carries token counts for cost tracking
  // (promptTokens/completionTokens in AI SDK 4; inputTokens/outputTokens in AI SDK 5)
  return object;
}

What changed:

| Before (manual) | After (generateObject) |
| --- | --- |
| Raw fetch() to provider API | Provider-agnostic SDK call |
| Regex cleanup of markdown fences | No cleanup needed; response is never a raw string |
| JSON.parse() in try/catch | SDK handles parsing and validation |
| Manual field-by-field validation | Zod schema validates structure, types, and constraints |
| as Article type assertion | z.infer<typeof ArticleSchema>, type derived from schema |
| No retry on malformed response | Validation failure throws one typed error you can catch and retry |
| No token counts available | usage object with token counts |
| 40+ lines of parsing boilerplate | 0 lines of parsing code |

How It Works Under the Hood

When you call generateObject(), the Vercel AI SDK:

  1. Converts your Zod schema to a JSON Schema and sends it to the model as a response_format constraint (for models that support structured output) or as a function call schema (for models that support function calling).

  2. The model generates tokens constrained to the schema. This is not post-processing. The model's token sampling is guided by the schema structure, making it far more reliable than asking for JSON in the prompt.

  3. Validates the response against your Zod schema. If validation fails (e.g., a string is too long, a number is negative), the SDK throws NoObjectGeneratedError so the caller can log, retry, or fall back. The SDK's built-in maxRetries setting covers failed API calls, not schema mismatches.

  4. Returns a fully typed object. The return type is z.infer<typeof YourSchema>, so you never touch JSON.parse or as.

// How each failure mode surfaces with generateObject():
//
// 1. Model returns markdown-wrapped JSON    -> structured output bypasses this
// 2. Model returns extra fields             -> Zod strips them (.strict() to reject)
// 3. Model returns wrong types              -> Zod validation fails, SDK throws NoObjectGeneratedError
// 4. Model returns partial object           -> Zod validation fails, SDK throws NoObjectGeneratedError
// 5. Model returns nothing                  -> SDK throws with clear error
//
// You write none of the detection logic. You only decide what to do with one typed error.
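Whether a schema-validation failure is retried for you depends on the SDK and its version, so it can be worth owning the retry policy explicitly. The following is a minimal sketch, assuming a hypothetical `withRetries` helper (not part of the AI SDK) and a stand-in `flakyGenerate` function that simulates an LLM call failing validation twice.

```typescript
// Hypothetical helper, not part of the AI SDK: retry an async call up to
// maxAttempts times, then rethrow the last error.
async function withRetries<T>(
  operation: (attempt: number) => Promise<T>,
  maxAttempts: number = 3
): Promise<T> {
  let lastError: unknown = new Error("maxAttempts must be at least 1");
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation(attempt);
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  throw lastError;
}

// Stand-in for an LLM call that fails validation twice before succeeding.
let calls = 0;
async function flakyGenerate(): Promise<{ title: string }> {
  calls += 1;
  if (calls < 3) {
    throw new Error(`validation failed on call ${calls}`);
  }
  return { title: "TypeScript monorepo patterns" };
}

const pending = withRetries(() => flakyGenerate(), 3);
pending.then((article) => console.log(article.title, `after ${calls} calls`));
```

In real code the operation would wrap generateObject(). Every retry is another billed call, so a bounded attempt count matters more here than it does for a plain HTTP retry.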

When You Still Need Manual Parsing

There are two legitimate cases:

  1. Streaming partial objects. If you need to display results as they stream in, streamObject() gives you partial objects during generation. But the final object is still validated.

  2. Legacy provider APIs. If you are using a provider the AI SDK does not support, you are back to raw fetch(). But even then, use Zod to validate after parsing; do not use as.

// If you MUST parse manually, at least validate with Zod
const raw = JSON.parse(responseText);
const result = ArticleSchema.safeParse(raw);

if (!result.success) {
  console.error("Validation failed:", result.error.issues);
  // Retry, fall back, or fail explicitly -- never pass invalid data downstream
  throw new Error(`LLM response failed validation: ${result.error.message}`);
}

// result.data is Article -- safe to use
return result.data;

Anti-Pattern 2: Raw LLM Calls Without Cost Tracking

The Problem

Every LLM call costs money. The cost varies by model, by token count, by whether the model used "thinking" mode, by whether prompt caching kicked in. When you call provider APIs directly from your application code, you have no record of what anything cost.

// The pattern that burned $47 in 5 days
// Found in pages-plus: 6 direct Gemini calls, 9 crons, zero cost tracking

async function generateBlogPost(keyword: string, env: Env): Promise<BlogPost> {
  // Direct call to Gemini -- no proxy, no metering, no attribution
  const response = await fetch(
    `https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent`,
    {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "x-goog-api-key": env.GEMINI_API_KEY,
      },
      body: JSON.stringify({
        contents: [
          {
            parts: [
              {
                text: `Write a comprehensive blog post about: ${keyword}`,
              },
            ],
          },
        ],
      }),
    }
  );

  const data = await response.json();
  // No cost tracking. No token counting. No attribution.
  // This call cost somewhere between $0.01 and $0.30.
  // Nobody will ever know.
  return parseBlogPost(data);
}

War Story: The $47-in-5-Days Gemini Disaster

Here is what happened:

The Fix: Centralized LLM Proxy with Per-Call Metering

The architecture that prevents this:

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  pages-plus  │     │scalable-media│     │   aso-mrr    │
│ (no API key) │     │ (no API key) │     │ (no API key) │
└──────┬───────┘     └──────┬───────┘     └──────┬───────┘
       │                    │                    │
       │   X-Project-Id     │   X-Function       │   X-Tags
       │   X-Api-Key        │   X-Api-Key        │   X-Api-Key
       │                    │                    │
       ▼                    ▼                    ▼
┌────────────────────────────────────────────────────────┐
│                      API Proxy                         │
│                                                        │
│  ┌─────────────┐  ┌──────────────┐  ┌───────────────┐  │
│  │ Auth +      │  │ Cost Calc    │  │ Daily Limit   │  │
│  │ Routing     │  │ (all tiers)  │  │ Enforcement   │  │
│  └─────────────┘  └──────────────┘  └───────────────┘  │
│                                                        │
│  ┌─────────────┐  ┌──────────────┐  ┌───────────────┐  │
│  │ Provider    │  │ Ledger       │  │ Cache         │  │
│  │ Keys        │  │ (D1/Postgres)│  │ (optional)    │  │
│  └─────────────┘  └──────────────┘  └───────────────┘  │
└──────────────────────────┬─────────────────────────────┘
                           │
                           ▼
            ┌──────────────────────────┐
            │  Provider APIs           │
            │  (Gemini, OpenAI, etc.)  │
            └──────────────────────────┘

The proxy worker:

// api-proxy/src/routes/gemini.ts
import { Hono } from "hono";
import { z } from "zod";

const app = new Hono<{ Bindings: Env }>();

const GeminiProxySchema = z.object({
  model: z.string(),
  contents: z.array(z.unknown()),
  generationConfig: z.record(z.unknown()).optional(),
});

app.post("/v1/gemini/generateContent", async (c) => {
  const projectId = c.req.header("X-Project-Id");
  const apiKey = c.req.header("X-Api-Key");
  const functionTag = c.req.header("X-Function") ?? "unknown";
  const tags = c.req.header("X-Tags")?.split(",") ?? [];

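  // 1. Authenticate the calling project (reject missing credentials up front)
  if (!projectId || !apiKey) {
    return c.json({ error: "unauthorized" }, 401);
  }
  const project = await authenticateProject(projectId, apiKey, c.env.DB);
  if (!project) {
    return c.json({ error: "unauthorized" }, 401);
  }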

  // 2. Check daily spend limit BEFORE making the call
  const todaySpend = await getDailySpend(projectId, "gemini", c.env.DB);
  if (todaySpend >= project.daily_limit_usd) {
    return c.json(
      {
        error: "daily_limit_exceeded",
        spend: todaySpend,
        limit: project.daily_limit_usd,
      },
      429
    );
  }

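  // 3. Parse and validate the request; malformed input is a 400, not an unhandled 500
  const parsed = GeminiProxySchema.safeParse(
    await c.req.json().catch(() => null)
  );
  if (!parsed.success) {
    return c.json({ error: "validation_failed", issues: parsed.error.issues }, 400);
  }
  const body = parsed.data;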

  // 4. Forward to Gemini using the proxy's API key (not the caller's)
  const geminiResponse = await fetch(
    `https://generativelanguage.googleapis.com/v1beta/models/${body.model}:generateContent`,
    {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "x-goog-api-key": c.env.GEMINI_API_KEY, // Only the proxy has the key
      },
      body: JSON.stringify({
        contents: body.contents,
        generationConfig: body.generationConfig,
      }),
    }
  );

  const result = await geminiResponse.json();

  // 5. Extract token counts from ALL tiers
  const usage = result.usageMetadata;
  const cost = calculateGeminiCost(body.model, {
    input: usage?.promptTokenCount ?? 0,
    output: usage?.candidatesTokenCount ?? 0,
    thinking: usage?.thoughtsTokenCount ?? 0,
    cacheRead: usage?.cachedContentTokenCount ?? 0,
  });

  // 6. Record in the ledger -- every call, every tier, every tag
  await c.env.DB.prepare(
    `INSERT INTO api_calls_ledger
     (id, project_id, service, model, function, tags,
      input_tokens, output_tokens, thinking_tokens, cache_read_tokens,
      cost_usd, created_at)
     VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)`
  )
    .bind(
      crypto.randomUUID(),
      projectId,
      "gemini",
      body.model,
      functionTag,
      JSON.stringify(tags),
      usage?.promptTokenCount ?? 0,
      usage?.candidatesTokenCount ?? 0,
      usage?.thoughtsTokenCount ?? 0,
      usage?.cachedContentTokenCount ?? 0,
      cost,
      new Date().toISOString()
    )
    .run();

  // 7. Return the response to the caller
  return c.json(result);
});

function calculateGeminiCost(
  model: string,
  tokens: {
    input: number;
    output: number;
    thinking: number;
    cacheRead: number;
  }
): number {
  // Gemini 2.5 Flash pricing (as of 2026-03)
  const pricing: Record<string, { input: number; output: number; thinking: number; cacheRead: number }> = {
    "gemini-2.5-flash": {
      input: 0.15, // per 1M tokens
      output: 0.60,
      thinking: 3.50,
      cacheRead: 0.0375,
    },
    "gemini-2.5-pro": {
      input: 1.25,
      output: 10.0,
      thinking: 10.0,
      cacheRead: 0.3125,
    },
  };

  const p = pricing[model] ?? pricing["gemini-2.5-flash"];

  return (
    (tokens.input * p.input +
      tokens.output * p.output +
      tokens.thinking * p.thinking +
      tokens.cacheRead * p.cacheRead) /
    1_000_000
  );
}
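The proxy handler calls `getDailySpend` without showing it. As a rough illustration of the check in step 2, here is an in-memory sketch; `LedgerRow`, `sumDailySpend`, and `isOverDailyLimit` are hypothetical names, and a real implementation would run an equivalent SUM query against the ledger table rather than scanning an array.

```typescript
// Hypothetical subset of a ledger row, for illustration only.
interface LedgerRow {
  project_id: string;
  service: string;
  cost_usd: number;
  created_at: string; // ISO 8601 timestamp in UTC
}

// Sum today's spend (UTC day) for one project and one service.
function sumDailySpend(
  rows: LedgerRow[],
  projectId: string,
  service: string,
  now: Date
): number {
  const today = now.toISOString().slice(0, 10); // "YYYY-MM-DD"
  let total = 0;
  for (const row of rows) {
    if (
      row.project_id === projectId &&
      row.service === service &&
      row.created_at.startsWith(today)
    ) {
      total += row.cost_usd;
    }
  }
  return total;
}

// Mirrors the handler's comparison: block once spend reaches the limit.
function isOverDailyLimit(spendUsd: number, dailyLimitUsd: number): boolean {
  return spendUsd >= dailyLimitUsd;
}

const sampleRows: LedgerRow[] = [
  { project_id: "pages-plus", service: "gemini", cost_usd: 0.5, created_at: "2026-03-15T01:00:00.000Z" },
  { project_id: "pages-plus", service: "gemini", cost_usd: 0.25, created_at: "2026-03-15T09:30:00.000Z" },
  { project_id: "pages-plus", service: "gemini", cost_usd: 1.0, created_at: "2026-03-14T23:59:00.000Z" },
  { project_id: "aso-mrr", service: "gemini", cost_usd: 2.0, created_at: "2026-03-15T02:00:00.000Z" },
];

const spendToday = sumDailySpend(sampleRows, "pages-plus", "gemini", new Date("2026-03-15T12:00:00Z"));
console.log(spendToday, isOverDailyLimit(spendToday, 0.75));
```

Because the check runs before the provider call, a concurrent burst can still overshoot the limit by a few calls. That is usually acceptable for a spend guard, but it is not a hard cap.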

The calling worker now looks like this:

// pages-plus/src/domain/content.ts
// No GEMINI_API_KEY. No direct provider calls. No cost math.

async function generateBlogPost(keyword: string, env: Env): Promise<BlogPost | null> {
  const response = await fetch(`${env.API_PROXY_URL}/v1/gemini/generateContent`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "X-Project-Id": "pages-plus",
      "X-Api-Key": env.API_PROXY_KEY,
      "X-Function": "blog-post-generate",
      "X-Tags": `brand:${keyword},trigger:cron`,
    },
    body: JSON.stringify({
      model: "gemini-2.5-flash",
      contents: [
        { parts: [{ text: `Write a blog post about: ${keyword}` }] },
      ],
    }),
  });

  if (response.status === 429) {
    // Daily limit hit -- stop gracefully, don't retry
    console.log(`Daily spend limit reached for pages-plus`);
    return null;
  }

  return parseBlogPost(await response.json());
}

Available Gateway Options

You do not have to build this yourself. Several production-ready options exist:

Cloudflare AI Gateway: Free tier, sits between your app and provider APIs. Provides analytics, caching, rate limiting, and logging. One line of code to add. Best if you are already on Cloudflare.

LiteLLM Proxy: Open-source Python proxy supporting 100+ LLM providers with OpenAI-compatible API. Cost tracking per key/user/team. Tag-based budget management. Best if you want self-hosted control with multi-provider support.

Custom proxy: Build your own (as shown above) when you need specific attribution logic, integration with your existing database, or edge-native deployment. Best when existing gateways do not match your metering requirements.


Anti-Pattern 3: Thinking Token Pricing Blindspot

The Problem

Modern LLMs have multiple token tiers with wildly different prices. When you track cost as (input + output) * price, you are using a formula from 2023. The real formula in 2025+ has 3-5 tiers:

| Token Tier | Gemini 2.5 Flash Price | What It Is |
| --- | --- | --- |
| Input | $0.15/M | Prompt tokens sent to the model |
| Output | $0.60/M | Generated response tokens |
| Thinking | $3.50/M | Internal reasoning tokens (not shown to user) |
| Cache Read | $0.0375/M | Tokens served from prompt cache |
| Cache Write | $0.15/M | Tokens written to prompt cache |

Thinking tokens cost about 5.8x as much as output tokens ($3.50 vs. $0.60 per million). And for complex tasks (SEO analysis, content planning, multi-step reasoning), thinking tokens can outnumber output tokens 3-to-1.

War Story: The 9x Undercount

The Fix: Track All Token Tiers

interface TokenUsage {
  input: number;
  output: number;
  thinking: number;
  cacheRead: number;
  cacheWrite: number;
}

interface ModelPricing {
  input: number; // per 1M tokens
  output: number;
  thinking: number;
  cacheRead: number;
  cacheWrite: number;
}

// Pricing table -- update when providers change prices
const MODEL_PRICING: Record<string, ModelPricing> = {
  "gemini-2.5-flash": {
    input: 0.15,
    output: 0.60,
    thinking: 3.50,
    cacheRead: 0.0375,
    cacheWrite: 0.15,
  },
  "gemini-2.5-pro": {
    input: 1.25,
    output: 10.0,
    thinking: 10.0,
    cacheRead: 0.3125,
    cacheWrite: 1.25,
  },
  "claude-sonnet-4-20250514": {
    input: 3.0,
    output: 15.0,
    thinking: 0, // Claude charges thinking at output rate when extended thinking enabled
    cacheRead: 0.30,
    cacheWrite: 3.75,
  },
  "gpt-4o": {
    input: 2.50,
    output: 10.0,
    thinking: 0, // GPT-4o does not have a separate thinking tier
    cacheRead: 1.25,
    cacheWrite: 2.50,
  },
};

function calculateCost(model: string, usage: TokenUsage): number {
  const pricing = MODEL_PRICING[model];
  if (!pricing) {
    console.warn(`Unknown model pricing: ${model}, using gemini-2.5-flash as fallback`);
    return calculateCost("gemini-2.5-flash", usage);
  }

  return (
    (usage.input * pricing.input +
      usage.output * pricing.output +
      usage.thinking * pricing.thinking +
      usage.cacheRead * pricing.cacheRead +
      usage.cacheWrite * pricing.cacheWrite) /
    1_000_000
  );
}

// Extract token counts from Gemini response
function extractGeminiUsage(response: GeminiResponse): TokenUsage {
  const meta = response.usageMetadata;
  return {
    input: meta?.promptTokenCount ?? 0,
    output: meta?.candidatesTokenCount ?? 0,
    thinking: meta?.thoughtsTokenCount ?? 0, // THIS IS THE ONE PEOPLE MISS
    cacheRead: meta?.cachedContentTokenCount ?? 0,
    cacheWrite: 0, // Gemini does not report cache writes in response
  };
}

// Example: what the undercount looks like for one reasoning-heavy call
const usage: TokenUsage = {
  input: 5000,
  output: 2000,
  thinking: 8000, // 4x the output -- typical for reasoning tasks
  cacheRead: 0,
  cacheWrite: 0,
};

const wrongCost = (usage.input * 0.15 + usage.output * 0.60) / 1_000_000;
// $0.00195 -- what the broken formula reported

const correctCost = calculateCost("gemini-2.5-flash", usage);
// $0.02995 -- about 15.4x higher

// Over hundreds of calls per day, this delta becomes tens of dollars

The Ledger Schema

CREATE TABLE api_calls_ledger (
  id TEXT PRIMARY KEY,
  project_id TEXT NOT NULL,
  service TEXT NOT NULL,        -- 'gemini', 'openai', 'anthropic'
  model TEXT NOT NULL,          -- 'gemini-2.5-flash'
  function TEXT NOT NULL,       -- 'article-write', 'seo-analysis'
  tags TEXT DEFAULT '[]',       -- JSON array: ['brand:llc-tax', 'trigger:cron']

  -- Token tiers -- NEVER lump these together
  input_tokens INTEGER DEFAULT 0,
  output_tokens INTEGER DEFAULT 0,
  thinking_tokens INTEGER DEFAULT 0,
  cache_read_tokens INTEGER DEFAULT 0,
  cache_write_tokens INTEGER DEFAULT 0,

  -- Calculated cost
  cost_usd REAL NOT NULL,

  -- Metadata
  latency_ms INTEGER,
  status_code INTEGER,
  created_at TEXT NOT NULL
);

-- Indexes for querying (the PRIMARY KEY already enforces uniqueness on id)
CREATE INDEX idx_ledger_project_date ON api_calls_ledger(project_id, created_at);
CREATE INDEX idx_ledger_function ON api_calls_ledger(function, created_at);
CREATE INDEX idx_ledger_service ON api_calls_ledger(service, created_at);

Monthly Reconciliation Query

-- Compare tracked costs to provider bill
SELECT
  service,
  model,
  COUNT(*) as calls,
  SUM(input_tokens) as total_input,
  SUM(output_tokens) as total_output,
  SUM(thinking_tokens) as total_thinking,
  ROUND(SUM(cost_usd), 2) as tracked_cost,
  -- Compare this to your provider's billing dashboard
  -- If difference > 20%, the pricing formula is wrong
  strftime('%Y-%m', created_at) as month
FROM api_calls_ledger
GROUP BY service, model, month
ORDER BY tracked_cost DESC;
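The 20% rule in the query comment can be turned into an automated check. This is a minimal sketch, assuming a hypothetical `reconcile` helper that takes the tracked total from the ledger and the billed total copied from the provider dashboard.

```typescript
interface Reconciliation {
  driftRatio: number; // |billed - tracked| / billed
  withinTolerance: boolean;
}

// Hypothetical helper: compare ledger-tracked spend against the provider bill.
function reconcile(
  trackedUsd: number,
  billedUsd: number,
  tolerance: number = 0.2
): Reconciliation {
  if (billedUsd <= 0) {
    // Nothing billed: any tracked spend is itself a discrepancy.
    const driftRatio = trackedUsd === 0 ? 0 : Number.POSITIVE_INFINITY;
    return { driftRatio, withinTolerance: trackedUsd === 0 };
  }
  const driftRatio = Math.abs(billedUsd - trackedUsd) / billedUsd;
  return { driftRatio, withinTolerance: driftRatio <= tolerance };
}

// The figures quoted at the top of this article: $1.14 tracked vs. $47 billed.
const incident = reconcile(1.14, 47);
// An illustrative healthy month: $9.50 tracked vs. $10.00 billed.
const healthy = reconcile(9.5, 10);
console.log(incident, healthy);
```

Running a check like this on a schedule, and alerting when it fails, turns a pricing-formula bug into a same-week discovery instead of a surprise on the invoice.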

Anti-Pattern 4: God Modules

The Problem

AI applications grow fast. You start with a function that calls an LLM. Then you add preprocessing. Then postprocessing. Then validation. Then caching. Then a different LLM call for a related task. Then another. Before you know it, you have a 1,126-line file that does five unrelated things.

// src/lib/seo-engine.ts -- 1,126 lines
// This file does ALL of the following:
//
// 1. SEO analysis (fetch page, parse HTML, score SEO factors)
// 2. Content scoring (call Gemini to rate content quality)
// 3. Keyword extraction (parse content, TF-IDF, call Gemini for related terms)
// 4. SERP parsing (fetch Google results, parse structured data)
// 5. Site audits (crawl pages, check redirects, validate meta tags)
//
// When a bug appears in keyword extraction, you read 1,126 lines.
// When you want to test SERP parsing, you import a module that also
// initializes Gemini clients, HTML parsers, and HTTP pools.
// When you refactor content scoring, you risk breaking site audits
// because they share 4 utility functions defined at the bottom.

export async function analyzeSEO(url: string, env: Env) {
  // 200 lines of fetching and parsing...
}

export async function scoreContent(content: string, env: Env) {
  // 150 lines of Gemini calls and scoring...
}

export async function extractKeywords(text: string, env: Env) {
  // 180 lines of NLP and LLM calls...
}

export async function parseSERP(query: string, env: Env) {
  // 250 lines of Google result parsing...
}

export async function auditSite(domain: string, env: Env) {
  // 346 lines of crawling and validation...
}

// Plus 50 lines of shared utilities at the bottom
function cleanHtml(html: string): string { /* ... */ }
function extractMetaTags(html: string): MetaTags { /* ... */ }
function normalizeUrl(url: string): string { /* ... */ }
function calculateScore(factors: Factor[]): number { /* ... */ }

Why This Matters More for AI Code

God modules are bad in any codebase. They are especially bad in AI codebases because:

  1. LLM calls are expensive to test. If your keyword extraction test imports a module that also initializes a Gemini client for content scoring, your test either pays for an unnecessary LLM call or requires mocking infrastructure you should not need.

  2. LLM integrations change frequently. Provider APIs change, models get updated, pricing changes. When the Gemini API adds a new parameter, you edit a 1,126-line file. The blast radius is five features.

  3. Prompt engineering requires iteration. Improving a prompt for content scoring should not require reading through SERP parsing code. When prompts live in god modules, iteration is slow because the context load is high.

  4. Cost attribution is impossible. If five functions in one file all call Gemini, and your cost tracking tags by file/module, you cannot distinguish a $0.01 keyword extraction from a $0.30 site audit.
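Point 4 is easiest to see with numbers. The sketch below uses a hypothetical `CallRecord` shape (the field names are illustrative, not the ledger schema) and the two example costs from the list above to show what is lost when spend is tagged only at the module level.

```typescript
// Hypothetical per-call record; module is the source file, fn the logical task.
interface CallRecord {
  module: string;
  fn: string;
  cost_usd: number;
}

// Sum cost per distinct value of the chosen tag.
function totalBy(records: CallRecord[], key: "module" | "fn"): Map<string, number> {
  const totals = new Map<string, number>();
  for (const record of records) {
    const bucket = record[key];
    totals.set(bucket, (totals.get(bucket) ?? 0) + record.cost_usd);
  }
  return totals;
}

const records: CallRecord[] = [
  { module: "seo-engine", fn: "keyword-extraction", cost_usd: 0.01 },
  { module: "seo-engine", fn: "site-audit", cost_usd: 0.3 },
];

const byModule = totalBy(records, "module"); // one bucket: the whole god module
const byFunction = totalBy(records, "fn"); // one bucket per task
console.log(byModule.size, byFunction.size);
```

With module-level tags, the cheap call and the expensive call collapse into a single line item. Splitting the file gives each task its own tag for free.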

The Fix: Single-Responsibility Modules Under 300 Lines

src/
  domain/
    seo-analysis.ts        (120 lines)
    content-scoring.ts     (90 lines)
    keyword-extraction.ts  (110 lines)
    serp-parser.ts         (140 lines)
    site-audit.ts          (180 lines)
  shared/
    html-utils.ts          (40 lines)
    scoring.ts             (30 lines)
    url-utils.ts           (20 lines)

Each module:

// src/domain/keyword-extraction.ts -- 110 lines
// Single responsibility: extract keywords from text using NLP + LLM

import { generateObject } from "ai";
import { google } from "@ai-sdk/google";
import { z } from "zod";
import { cleanHtml } from "../shared/html-utils";

const KeywordResultSchema = z.object({
  primary: z.array(
    z.object({
      term: z.string(),
      relevance: z.number().min(0).max(1),
      searchVolume: z.enum(["high", "medium", "low", "unknown"]),
    })
  ),
  related: z.array(z.string()),
  topics: z.array(z.string()),
});

export type KeywordResult = z.infer<typeof KeywordResultSchema>;

export async function extractKeywords(
  text: string,
  options?: { maxKeywords?: number }
): Promise<KeywordResult> {
  const cleaned = cleanHtml(text);
  const max = options?.maxKeywords ?? 20;

  const { object } = await generateObject({
    model: google("gemini-2.5-flash"),
    schema: KeywordResultSchema,
    prompt: `Extract the top ${max} keywords from this text. Rate each by relevance (0-1) and estimate search volume.\n\nText:\n${cleaned}`,
  });

  return object;
}

// src/domain/content-scoring.ts -- 90 lines
// Single responsibility: score content quality using LLM evaluation

import { generateObject } from "ai";
import { google } from "@ai-sdk/google";
import { z } from "zod";

const ContentScoreSchema = z.object({
  overall: z.number().min(0).max(100),
  factors: z.object({
    readability: z.number().min(0).max(100),
    depth: z.number().min(0).max(100),
    accuracy: z.number().min(0).max(100),
    actionability: z.number().min(0).max(100),
  }),
  suggestions: z.array(z.string()).max(5),
});

export type ContentScore = z.infer<typeof ContentScoreSchema>;

export async function scoreContent(
  content: string,
  context?: { targetAudience?: string; keyword?: string }
): Promise<ContentScore> {
  const { object } = await generateObject({
    model: google("gemini-2.5-flash"),
    schema: ContentScoreSchema,
    prompt: `Score this content on readability, depth, accuracy, and actionability (0-100 each).
${context?.keyword ? `Target keyword: ${context.keyword}` : ""}
${context?.targetAudience ? `Target audience: ${context.targetAudience}` : ""}

Content:
${content.substring(0, 10000)}`,
  });

  return object;
}

The Splitting Heuristic

When deciding how to split a god module:

  1. Group by data dependency. Functions that share the same inputs and outputs belong together. SEO analysis and SERP parsing both work on URLs, but one fetches pages and the other fetches search results: different data sources, different modules.

  2. Group by change frequency. Prompt engineering for content scoring changes weekly. HTML parsing utilities change yearly. They should not be in the same file.

  3. Group by test boundary. If you need to mock Gemini to test keyword extraction but not SERP parsing, they should be in different modules so SERP parsing tests do not need Gemini mocks.

  4. Target under 300 lines. This is not arbitrary. 300 lines is roughly the amount of code a developer can hold in working memory while debugging. Above 300 lines, you start scrolling, which means you start losing context.
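The 300-line target is easy to enforce mechanically. As a minimal sketch, assuming file contents are already loaded into a map (the function name `findOversizedModules` and the sample paths are hypothetical), a check like this could run in CI:

```typescript
// Flag modules whose line count exceeds a limit (300 by default, per the heuristic above).
function findOversizedModules(
  sources: Record<string, string>, // path -> file contents
  limit = 300
): { path: string; lines: number }[] {
  return Object.entries(sources)
    .map(([path, text]) => ({ path, lines: text.split("\n").length }))
    .filter((m) => m.lines > limit)
    .sort((a, b) => b.lines - a.lines); // worst offenders first
}

// Synthetic inputs standing in for real files.
const sample = {
  "src/domain/keyword-extraction.ts": "x\n".repeat(109) + "x", // 110 lines
  "src/legacy/seo.ts": "x\n".repeat(1125) + "x", // 1,126 lines
};

const oversized = findOversizedModules(sample);
// Only src/legacy/seo.ts is reported.
```

A real version would read files from disk and probably skip generated code and test fixtures; those details are left out here.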


Anti-Pattern 5: No Input Validation on API Endpoints

The Problem

Your Worker has 226 API endpoints. 223 of them look like this:

app.post("/v1/brands/:slug/content", async (c) => {
  const body = await c.req.json();
  // body is `any` -- TypeScript does not know its shape
  // If body.title is undefined, it becomes NULL in the database
  // If body.sections is a string instead of an array, the .map() call crashes
  // If body.metadata.seo.keywords has 500 entries, the database write times out

  const content = await createContent(body, c.env);
  return c.json(content, 201);
});

The 3 that have validation look like this:

app.post("/v1/auth/login", async (c) => {
  const body = await c.req.json();

  // Manual validation -- tedious, incomplete, and no type narrowing
  if (!body.email || typeof body.email !== "string") {
    return c.json({ error: "email is required" }, 400);
  }
  if (!body.password || typeof body.password !== "string") {
    return c.json({ error: "password is required" }, 400);
  }
  if (body.password.length < 8) {
    return c.json({ error: "password must be at least 8 characters" }, 400);
  }

  // body is still `any` after all these checks
  // TypeScript does not narrow `any` through manual checks
  const user = await authenticateUser(body.email, body.password, c.env);
  return c.json(user);
});

This is dangerous in AI applications specifically because:

  1. LLM prompts are constructed from user input. If body.keyword is actually an object instead of a string, your prompt becomes Write about [object Object]. The LLM processes garbage. You pay for the tokens.

  2. Database writes accept anything. If body.sections is a 50,000-character string instead of an array of section objects, D1 stores it. When another service reads it and calls .map(), the system crashes downstream.

  3. Queue messages propagate invalid data. An unvalidated request body gets wrapped in a queue message and sent to another service. That service also does not validate. The bad data now lives in two databases.
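The first failure mode takes only a few lines to reproduce. This dependency-free sketch (function names are illustrative) shows the garbage prompt and a hand-written type guard that rejects the input before any tokens are spent. A schema library does the same job with less code; the guard is here only to make the mechanism visible.

```typescript
// Builds a prompt the way an unvalidated handler would.
function buildPromptUnsafe(body: any): string {
  return `Write about ${body.keyword}`;
}

// Type guard: narrows `unknown` to a shape with a non-empty string keyword.
function hasKeyword(body: unknown): body is { keyword: string } {
  if (typeof body !== "object" || body === null) return false;
  const keyword = (body as Record<string, unknown>).keyword;
  return typeof keyword === "string" && keyword.length > 0;
}

function buildPromptSafe(body: unknown): string | null {
  // After the guard, body.keyword is typed as string.
  return hasKeyword(body) ? `Write about ${body.keyword}` : null;
}

const malformed = { keyword: { text: "seo tips" } };

const unsafePrompt = buildPromptUnsafe(malformed); // "Write about [object Object]"
const safePrompt = buildPromptSafe(malformed); // null: rejected before any LLM call
const goodPrompt = buildPromptSafe({ keyword: "seo tips" }); // "Write about seo tips"
```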

The Fix: Zod at Every Boundary

import { z } from "zod";
import { Hono } from "hono";
import type { Context, Next } from "hono";

// Define the schema ONCE -- it's both the validator and the type
const CreateContentSchema = z.object({
  title: z.string().min(5).max(200),
  keyword: z
    .string()
    .min(1)
    .max(100)
    .describe("Primary SEO keyword for content generation"),
  sections: z
    .array(
      z.object({
        heading: z.string().min(1).max(200),
        instructions: z.string().max(1000).optional(),
      })
    )
    .min(1)
    .max(20),
  metadata: z
    .object({
      seo: z
        .object({
          keywords: z.array(z.string().max(50)).max(10).default([]),
          description: z.string().max(160).optional(),
        })
        .default({}),
      publishAt: z.string().datetime().optional(),
    })
    .default({}),
  generateWithAI: z.boolean().default(true),
});

type CreateContentInput = z.infer<typeof CreateContentSchema>;

// Middleware for validation
function validate<T extends z.ZodSchema>(schema: T) {
  return async (c: Context, next: Next) => {
    const result = schema.safeParse(await c.req.json());
    if (!result.success) {
      return c.json(
        {
          error: "validation_failed",
          issues: result.error.issues.map((issue) => ({
            path: issue.path.join("."),
            code: issue.code,
            message: issue.message,
          })),
        },
        400
      );
    }
    c.set("validated", result.data);
    return next();
  };
}

app.post(
  "/v1/brands/:slug/content",
  validate(CreateContentSchema),
  async (c) => {
    const input = c.get("validated") as CreateContentInput;
    // input is fully typed, fully validated, fully constrained
    // input.title is a string between 5-200 chars
    // input.sections is an array with 1-20 items
    // input.metadata.seo.keywords has at most 10 entries
    // One cast at the boundary, then no manual checks and no surprises

    const content = await createContent(input, c.env);
    return c.json(content, 201);
  }
);

Validation Middleware for Hono

Here is a reusable middleware pattern for Hono (the most common router on Cloudflare Workers):

// src/middleware/validation.ts
import { z } from "zod";
import type { MiddlewareHandler } from "hono";

export function validateBody<T extends z.ZodSchema>(
  schema: T
): MiddlewareHandler {
  return async (c, next) => {
    let body: unknown;
    try {
      body = await c.req.json();
    } catch {
      return c.json({ error: "invalid_json", message: "Request body is not valid JSON" }, 400);
    }

    const result = schema.safeParse(body);
    if (!result.success) {
      return c.json(
        {
          error: "validation_failed",
          issues: result.error.issues.map((i) => ({
            path: i.path.join("."),
            code: i.code,
            message: i.message,
          })),
        },
        400
      );
    }

    c.set("body", result.data);
    return next();
  };
}

export function validateParams<T extends z.ZodSchema>(
  schema: T
): MiddlewareHandler {
  return async (c, next) => {
    const result = schema.safeParse(c.req.param());
    if (!result.success) {
      return c.json(
        {
          error: "invalid_params",
          issues: result.error.issues.map((i) => ({
            path: i.path.join("."),
            message: i.message,
          })),
        },
        400
      );
    }

    c.set("params", result.data);
    return next();
  };
}

export function validateQuery<T extends z.ZodSchema>(
  schema: T
): MiddlewareHandler {
  return async (c, next) => {
    const result = schema.safeParse(c.req.query());
    if (!result.success) {
      return c.json(
        {
          error: "invalid_query",
          issues: result.error.issues.map((i) => ({
            path: i.path.join("."),
            message: i.message,
          })),
        },
        400
      );
    }

    c.set("query", result.data);
    return next();
  };
}

Usage across the codebase becomes consistent:

const BrandSlugParams = z.object({
  slug: z.string().regex(/^[a-z0-9-]+$/).min(1).max(64),
});

const ListQuerySchema = z.object({
  limit: z.coerce.number().int().min(1).max(100).default(20),
  offset: z.coerce.number().int().min(0).default(0),
  status: z.enum(["draft", "published", "archived"]).optional(),
});

app.get(
  "/v1/brands/:slug/content",
  validateParams(BrandSlugParams),
  validateQuery(ListQuerySchema),
  async (c) => {
    const { slug } = c.get("params");
    const { limit, offset, status } = c.get("query");
    // All validated. All typed. All constrained.
    return c.json(await listContent(slug, { limit, offset, status }, c.env));
  }
);
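Query values always arrive as strings, which is why `ListQuerySchema` uses `z.coerce`. The sketch below approximates what the `limit` rule does, without any dependencies. It mirrors the intent of the schema, not Zod's implementation, and `parseIntParam` is a hypothetical helper.

```typescript
type ParseResult = { ok: true; value: number } | { ok: false; message: string };

// Approximates z.coerce.number().int().min(min).max(max).default(fallback).
function parseIntParam(
  raw: string | undefined,
  opts: { min: number; max: number; fallback: number }
): ParseResult {
  if (raw === undefined) return { ok: true, value: opts.fallback };
  const n = Number(raw); // "50" -> 50, "abc" -> NaN
  if (!Number.isInteger(n)) return { ok: false, message: "expected an integer" };
  if (n < opts.min || n > opts.max) {
    return { ok: false, message: `expected a value in ${opts.min}..${opts.max}` };
  }
  return { ok: true, value: n };
}

const limitRule = { min: 1, max: 100, fallback: 20 };

const missing = parseIntParam(undefined, limitRule); // default applies
const valid = parseIntParam("50", limitRule); // coerced to the number 50
const tooBig = parseIntParam("5000", limitRule); // rejected: out of range
const notANumber = parseIntParam("abc", limitRule); // rejected: NaN is not an integer
```

Without the upper bound, a request for `?limit=5000` would reach the database as-is.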

Anti-Pattern 6: Duplicated Type Definitions

The Problem

When multiple workers call the same LLM API, each worker defines its own response types:

// pages-plus/src/types/gemini.ts
interface GeminiApiResponse {
  candidates: Array<{
    content: { parts: Array<{ text: string }> };
    finishReason: string;
  }>;
  usageMetadata: {
    promptTokenCount: number;
    candidatesTokenCount: number;
    totalTokenCount: number;
  };
}

// aso-mrr/src/types/gemini.ts -- COPY-PASTED, slightly different
interface GeminiResponse {
  candidates: {
    content: { parts: { text: string }[] };
    finishReason: string;
  }[];
  usageMetadata: {
    promptTokenCount: number;
    candidatesTokenCount: number;
    // Missing: thoughtsTokenCount -- this copy doesn't know about thinking tokens
    totalTokenCount: number;
  };
}

// scalable-media/src/lib/gemini.ts -- yet another copy
type GeminiResult = {
  candidates: Array<{
    content: { parts: Array<{ text: string }> };
  }>;
  usageMetadata?: {
    promptTokenCount?: number;
    candidatesTokenCount?: number;
  };
};

// gatherfeed/src/services/ai.ts -- and another
// This one has thoughtsTokenCount but misspelled as thoughtTokenCount

Six files. Four slightly different versions. When Google adds a new field to the response, you update one copy and forget the others. When thinking tokens ship, three of the four copies miss the field. That is how you get a 9x cost undercount.
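The misspelled field is the most instructive copy, because nothing fails. As a rough sketch, assuming thinking tokens are billed alongside output tokens and using the token counts from the sample log entry later in this article, a drifted copy silently drops most of the billable output:

```typescript
// Loosely typed usage block, as a hand-copied type with optional fields would allow.
type UsageBlock = Record<string, number | undefined>;

const usageFromApi: UsageBlock = {
  promptTokenCount: 2340,
  candidatesTokenCount: 1200,
  thoughtsTokenCount: 3800,
};

// Correct copy: thinking tokens count toward billed output (assumption stated above).
function billedOutputTokens(u: UsageBlock): number {
  return (u.candidatesTokenCount ?? 0) + (u.thoughtsTokenCount ?? 0);
}

// Drifted copy: the misspelled key reads undefined, and `?? 0` hides the bug.
function billedOutputTokensDrifted(u: UsageBlock): number {
  return (u.candidatesTokenCount ?? 0) + (u.thoughtTokenCount ?? 0);
}

const correct = billedOutputTokens(usageFromApi); // 5000
const drifted = billedOutputTokensDrifted(usageFromApi); // 1200
```

With a single shared, strictly typed definition, the misspelling becomes a compile error instead of a billing surprise.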

The Fix: Shared Type Modules

Option A: Monorepo shared package

// packages/shared/src/types/llm.ts
import { z } from "zod";

// Define once with Zod -- get runtime validation AND TypeScript type
export const GeminiUsageSchema = z.object({
  promptTokenCount: z.number().default(0),
  candidatesTokenCount: z.number().default(0),
  thoughtsTokenCount: z.number().default(0),
  cachedContentTokenCount: z.number().default(0),
  totalTokenCount: z.number().default(0),
});

export const GeminiResponseSchema = z.object({
  candidates: z.array(
    z.object({
      content: z.object({
        parts: z.array(z.object({ text: z.string() })),
      }),
      finishReason: z.string(),
    })
  ),
  usageMetadata: GeminiUsageSchema.optional(),
});

export type GeminiUsage = z.infer<typeof GeminiUsageSchema>;
export type GeminiResponse = z.infer<typeof GeminiResponseSchema>;

// Helper to extract text from response
export function extractText(response: GeminiResponse): string | null {
  return response.candidates?.[0]?.content?.parts?.[0]?.text ?? null;
}

// Helper to extract usage with defaults for all tiers
export function extractUsage(response: GeminiResponse): GeminiUsage {
  return GeminiUsageSchema.parse(response.usageMetadata ?? {});
}

Every worker imports from the shared package:

// pages-plus/src/domain/content.ts
import { GeminiResponseSchema, extractText, extractUsage } from "@acme/shared/types/llm";

// aso-mrr/src/services/analysis.ts
import { GeminiResponseSchema, extractText, extractUsage } from "@acme/shared/types/llm";

// One definition. One import. One place to update.

Option B: Eliminate the type entirely with AI SDK

The better fix is to stop interacting with provider APIs directly. When you use the Vercel AI SDK, you never see a GeminiApiResponse. The SDK returns a normalized GenerateObjectResult or GenerateTextResult with consistent types across all providers.

import { generateObject } from "ai";
import { google } from "@ai-sdk/google";

const { object, usage } = await generateObject({
  model: google("gemini-2.5-flash"),
  schema: MySchema,
  prompt: "...",
});

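// usage has the same shape for every provider. In AI SDK 4.x that is
// { promptTokens, completionTokens, totalTokens }; 5.x renames these to
// inputTokens and outputTokens.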
// No GeminiApiResponse. No provider-specific types. No copy-paste.

Key insight: The best way to eliminate duplicated types is to eliminate the need for the type entirely. If you are defining GeminiApiResponse in your codebase, you are at the wrong abstraction level. Use an SDK that abstracts the provider.


Anti-Pattern 7: Manual State Machines for Agents

The Problem

You are building an AI agent that performs multi-step tasks: research a topic, generate an outline, write content, review for quality, publish. Each step depends on the previous step. Some steps may fail and need retries. The whole sequence may take minutes. And you need to know where you are if the process crashes and restarts.

The manual approach:

// The hand-rolled state machine -- no persistence, no recovery, no observability

type AgentState = "idle" | "researching" | "outlining" | "writing" | "reviewing" | "publishing";

interface AgentContext {
  state: AgentState;
  keyword: string;
  researchData?: ResearchResult;
  outline?: ArticleOutline;
  draft?: string;
  reviewScore?: number;
  error?: string;
  retryCount: number;
}

async function runAgent(keyword: string, env: Env): Promise<void> {
  const ctx: AgentContext = {
    state: "idle",
    keyword,
    retryCount: 0,
  };

  try {
    // Step 1: Research
    ctx.state = "researching";
    ctx.researchData = await doResearch(keyword, env);

    // Step 2: Outline
    ctx.state = "outlining";
    ctx.outline = await generateOutline(ctx.researchData, env);

    // Step 3: Write
    ctx.state = "writing";
    ctx.draft = await writeDraft(ctx.outline, env);

    // Step 4: Review
    ctx.state = "reviewing";
    ctx.reviewScore = await reviewContent(ctx.draft, env);

    if (ctx.reviewScore < 70) {
      // Retry writing -- but what if it fails again?
      // What if the process crashes here?
      // What if we've already spent $0.50 on research and outlining?
      ctx.state = "writing";
      ctx.draft = await writeDraft(ctx.outline, env);
      ctx.reviewScore = await reviewContent(ctx.draft, env);
    }

    // Step 5: Publish
    ctx.state = "publishing";
    await publishContent(ctx.draft, env);
  } catch (err) {
    ctx.error = err.message;
    // The state is in memory. If the process crashes, it's gone.
    // There's no way to resume from the last successful step.
    // The entire pipeline must restart from scratch.
    // All the LLM calls (and their costs) are wasted.
    console.log("Agent failed at state:", ctx.state, err);
  }
}

The problems:

  1. No persistence. If the Worker crashes after step 3 (writing), all progress is lost. The research and outline cost money. That money is wasted.
  2. No observability. You cannot query "how many agents are in the reviewing state right now?" or "what was the last successful step for keyword X?"
  3. No recovery. When the process restarts, it starts from step 1. There is no way to resume from step 4.
  4. No concurrency control. Two cron ticks could start two agents for the same keyword. Both run to completion. You pay twice.
  5. No backpressure. If step 3 (writing) takes 30 seconds but step 1 (research) takes 2 seconds, you can saturate the LLM with writing requests while research builds up.
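Items 1 and 3 both come down to checkpointing: record each finished step somewhere durable, and skip finished steps on the next attempt. Here is a minimal sketch of that idea. The `Checkpoint` object is an in-memory stand-in for durable storage, and the steps are synchronous only to keep the example short; real LLM steps would be async.

```typescript
// Stand-in for durable storage (a database row, a Durable Object, a KV entry).
interface Checkpoint {
  completed: string[];
  outputs: Record<string, string>;
}

type Step = { name: string; run: (outputs: Record<string, string>) => string };

// Runs steps in order, skipping any that a previous attempt already finished.
// The checkpoint is updated after every step, so a crash loses at most one step.
function runWithCheckpoint(steps: Step[], checkpoint: Checkpoint): Checkpoint {
  for (const step of steps) {
    if (checkpoint.completed.includes(step.name)) continue;
    const result = step.run(checkpoint.outputs); // may throw
    checkpoint.outputs[step.name] = result;
    checkpoint.completed.push(step.name);
  }
  return checkpoint;
}

// Demo: "write" fails on the first attempt, then succeeds on resume.
let researchCalls = 0;
let writeAttempts = 0;
const steps: Step[] = [
  { name: "research", run: () => { researchCalls++; return "notes"; } },
  {
    name: "write",
    run: (o) => {
      writeAttempts++;
      if (writeAttempts === 1) throw new Error("simulated crash");
      return `draft from ${o.research}`;
    },
  },
];

const saved: Checkpoint = { completed: [], outputs: {} };
try {
  runWithCheckpoint(steps, saved);
} catch {
  // First attempt crashed during "write"; "research" is already recorded.
}
runWithCheckpoint(steps, saved); // resumes at "write"; research is not repeated
```

The research step runs once even though the pipeline ran twice, which is exactly the cost the manual state machine cannot protect.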

The Fix: Stateful Agent Runtimes

Cloudflare Agents SDK (Durable Objects with built-in state):

import { Agent } from "agents";
import { generateObject } from "ai";
import { google } from "@ai-sdk/google";
import { z } from "zod";

interface ContentAgentState {
  status: "idle" | "researching" | "outlining" | "writing" | "reviewing" | "publishing" | "done" | "failed";
  keyword: string;
  researchData?: unknown;
  outline?: unknown;
  draft?: string;
  reviewScore?: number;
  completedSteps: string[];
  totalCost: number;
  error?: string;
}

const initialState: ContentAgentState = {
  status: "idle",
  keyword: "",
  completedSteps: [],
  totalCost: 0,
};

export class ContentAgent extends Agent<Env, ContentAgentState> {
  initialState = initialState;

  async startPipeline(keyword: string) {
    this.setState({ ...this.state, keyword, status: "researching" });

    try {
      // Step 1: Research -- state persists automatically
      if (!this.state.completedSteps.includes("research")) {
        const research = await this.doResearch(keyword);
        this.setState({
          ...this.state,
          researchData: research,
          completedSteps: [...this.state.completedSteps, "research"],
          status: "outlining",
        });
      }

      // Step 2: Outline
      if (!this.state.completedSteps.includes("outline")) {
        const outline = await this.generateOutline();
        this.setState({
          ...this.state,
          outline,
          completedSteps: [...this.state.completedSteps, "outline"],
          status: "writing",
        });
      }

      // Step 3: Write
      if (!this.state.completedSteps.includes("write")) {
        const draft = await this.writeDraft();
        this.setState({
          ...this.state,
          draft,
          completedSteps: [...this.state.completedSteps, "write"],
          status: "reviewing",
        });
      }

      // Step 4: Review
      if (!this.state.completedSteps.includes("review")) {
        const score = await this.reviewContent();
        this.setState({
          ...this.state,
          reviewScore: score,
          completedSteps: [...this.state.completedSteps, "review"],
          status: score >= 70 ? "publishing" : "writing",
        });

        if (score < 70) {
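          // Remove the write and review steps so both run again.
          // Note: this retry is unbounded as written. Cap it (for example with a
          // retry counter in state) so a draft that never passes cannot loop forever.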
          this.setState({
            ...this.state,
            completedSteps: this.state.completedSteps.filter(
              (s) => s !== "write" && s !== "review"
            ),
          });
          // Re-run from write step
          return this.startPipeline(keyword);
        }
      }

      // Step 5: Publish
      if (!this.state.completedSteps.includes("publish")) {
        await this.publishContent();
        this.setState({
          ...this.state,
          completedSteps: [...this.state.completedSteps, "publish"],
          status: "done",
        });
      }
    } catch (err) {
      this.setState({
        ...this.state,
        status: "failed",
        error: err instanceof Error ? err.message : String(err),
      });
      // State is persisted even on failure.
      // On retry, completedSteps tells us where to resume.
      // Already-spent LLM costs are not wasted.
    }
  }

  private async doResearch(keyword: string) {
    const { object, usage } = await generateObject({
      model: google("gemini-2.5-flash"),
      schema: ResearchSchema,
      prompt: `Research the topic: ${keyword}`,
    });

    this.setState({
      ...this.state,
      totalCost:
        this.state.totalCost +
        (usage.promptTokens * 0.15 + usage.completionTokens * 0.6) / 1_000_000,
    });

    return object;
  }

  private async generateOutline() {
    const { object } = await generateObject({
      model: google("gemini-2.5-flash"),
      schema: OutlineSchema,
      prompt: `Create an outline based on this research: ${JSON.stringify(this.state.researchData)}`,
    });
    return object;
  }

  private async writeDraft() {
    const { object } = await generateObject({
      model: google("gemini-2.5-flash"),
      schema: z.object({ content: z.string() }),
      prompt: `Write the full article from this outline: ${JSON.stringify(this.state.outline)}`,
    });
    return object.content;
  }

  private async reviewContent(): Promise<number> {
    const { object } = await generateObject({
      model: google("gemini-2.5-flash"),
      schema: z.object({ score: z.number().min(0).max(100), feedback: z.string() }),
      prompt: `Score this content 0-100: ${this.state.draft?.substring(0, 5000)}`,
    });
    return object.score;
  }

  private async publishContent() {
    await this.env.PUBLISH_QUEUE.send({
      event_id: crypto.randomUUID(),
      type: "content.publish",
      source: "content-agent",
      timestamp: new Date().toISOString(),
      payload: {
        keyword: this.state.keyword,
        content: this.state.draft,
      },
    });
  }
}
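One gap in the class above is that the review branch re-enters the pipeline every time the score is below 70, with no upper bound. A small pure function can make that decision explicit. This is a sketch, not part of the Agents SDK: the name `nextStatusAfterReview` and the default cap of 2 rewrites are assumptions, and the caller would need to store the rewrite count in agent state.

```typescript
type NextStatus = "publishing" | "writing" | "failed";

// Decide what follows a review, given the score and how many rewrites already ran.
function nextStatusAfterReview(
  score: number,
  rewritesSoFar: number,
  opts: { passScore: number; maxRewrites: number } = { passScore: 70, maxRewrites: 2 }
): NextStatus {
  if (score >= opts.passScore) return "publishing";
  // Below the bar: rewrite only while the budget of retries lasts.
  return rewritesSoFar < opts.maxRewrites ? "writing" : "failed";
}

const passes = nextStatusAfterReview(85, 0); // "publishing"
const retry = nextStatusAfterReview(40, 0); // "writing"
const givesUp = nextStatusAfterReview(40, 2); // "failed"
```

Keeping the decision in a pure function also means it can be unit tested without mocking a model.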

What the Agents SDK gives you:

| Concern | Manual State Machine | CF Agents SDK |
| --- | --- | --- |
| State persistence | In-memory (lost on crash) | SQLite-backed (survives restarts) |
| Recovery | Start from scratch | Resume from last completed step |
| Concurrency | No dedup, double-runs | One DO instance per keyword |
| Observability | console.log | Query this.state via HTTP |
| Scheduling | Manual setTimeout | Built-in this.schedule() with alarms |
| Cost tracking | Manual counter | State tracks totalCost durably |
| Testing | Need to mock everything | Test each step independently |

When Agents SDK Is Not the Right Fit

The Cloudflare Agents SDK runs on Cloudflare's edge network. If you are not on Cloudflare, or if your workflows span multiple cloud providers, consider:


Anti-Pattern 8: Scattered API Keys

The Problem

Each Worker manages its own API keys for external providers:

// pages-plus/wrangler.jsonc
{
  "vars": {
    "GEMINI_API_KEY": "...",
    "OPENAI_API_KEY": "...",
    "BRAVE_API_KEY": "..."
  }
}

// aso-mrr/wrangler.jsonc
{
  "vars": {
    "GEMINI_API_KEY": "...",  // Same key or different? Who knows.
    "DATAFORSEO_LOGIN": "...",
    "DATAFORSEO_PASSWORD": "..."
  }
}

// scalable-media/wrangler.jsonc
{
  "vars": {
    "GEMINI_API_KEY": "...",  // Third copy of the key
    "PERPLEXITY_API_KEY": "...",
    "TWITTER_BEARER_TOKEN": "..."
  }
}

// gatherfeed/wrangler.jsonc
{
  "vars": {
    "GEMINI_API_KEY": "...",  // Fourth copy
    "BRAVE_API_KEY": "...",   // Second copy
    "YOUTUBE_API_KEY": "..."
  }
}

The problems compound:

  1. Key rotation requires N deployments. When you rotate a Gemini key, you update 4 workers. If you miss one, it breaks in production.

  2. No usage attribution. All 4 workers use the same Gemini key. The billing dashboard shows total spend but not which worker spent what.

  3. No rate limiting. Each worker hits Gemini independently. Four workers each making requests at the provider's rate limit means 4x the intended rate.

  4. Security surface. Every Worker's environment has every key it needs. If any Worker is compromised, all its keys are exposed. In our case, that was 14 keys across 4 workers.

  5. No centralized kill switch. When you discover $47 in unexpected spending, you cannot flip one switch. You must update and deploy 4 workers to remove the keys.

The Fix: Centralized Key Management

One service holds all provider keys. Every other service authenticates to this proxy with a project-specific credential.

// api-proxy/src/providers/registry.ts

interface ProviderConfig {
  name: string;
  baseUrl: string;
  authHeader: string;
  keyEnvVar: string;
  rateLimit: { requests: number; windowMs: number };
}

const PROVIDERS: Record<string, ProviderConfig> = {
  gemini: {
    name: "Google Gemini",
    baseUrl: "https://generativelanguage.googleapis.com/v1beta",
    authHeader: "x-goog-api-key",
    keyEnvVar: "GEMINI_API_KEY",
    rateLimit: { requests: 60, windowMs: 60_000 },
  },
  openai: {
    name: "OpenAI",
    baseUrl: "https://api.openai.com/v1",
    authHeader: "Authorization",
    keyEnvVar: "OPENAI_API_KEY",
    rateLimit: { requests: 100, windowMs: 60_000 },
  },
  brave: {
    name: "Brave Search",
    baseUrl: "https://api.search.brave.com/res/v1",
    authHeader: "X-Subscription-Token",
    keyEnvVar: "BRAVE_API_KEY",
    rateLimit: { requests: 15, windowMs: 1_000 },
  },
  perplexity: {
    name: "Perplexity",
    baseUrl: "https://api.perplexity.ai",
    authHeader: "Authorization",
    keyEnvVar: "PERPLEXITY_API_KEY",
    rateLimit: { requests: 20, windowMs: 60_000 },
  },
};

// Project auth -- each calling service gets ONE credential
interface ProjectAuth {
  projectId: string;
  apiKey: string;
  dailyLimitUsd: number;
  allowedProviders: string[];
  costTier: 1 | 2 | 3;
}
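The `rateLimit` field in each provider config only helps if something enforces it. Here is a minimal fixed-window limiter as one possible sketch. The class name is hypothetical and the clock is injected so the logic can be tested. State lives in memory, so in a real Worker it would have to sit somewhere shared, such as a Durable Object; that part is not shown.

```typescript
interface RateLimit {
  requests: number;
  windowMs: number;
}

// Fixed-window counter keyed by provider name.
class FixedWindowLimiter {
  private windows = new Map<string, { start: number; count: number }>();
  private now: () => number;

  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  // Returns true if the request fits in the current window, false otherwise.
  tryAcquire(provider: string, limit: RateLimit): boolean {
    const t = this.now();
    const w = this.windows.get(provider);
    if (!w || t - w.start >= limit.windowMs) {
      // No window yet, or the previous one expired: start a fresh window.
      this.windows.set(provider, { start: t, count: 1 });
      return true;
    }
    if (w.count >= limit.requests) return false;
    w.count += 1;
    return true;
  }
}

let clock = 0;
const limiter = new FixedWindowLimiter(() => clock);
const braveLimit = { requests: 15, windowMs: 1_000 }; // values from the registry above

const results: boolean[] = [];
for (let i = 0; i < 16; i++) results.push(limiter.tryAcquire("brave", braveLimit));
// The first 15 calls pass; the 16th is rejected.

clock = 1_000; // a new window begins
const afterReset = limiter.tryAcquire("brave", braveLimit);
```

A fixed window allows short bursts at window boundaries. A sliding window or token bucket is smoother but needs more state.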

Worker configurations become minimal:

// pages-plus/wrangler.jsonc -- AFTER consolidation
{
  "vars": {
    // NO provider API keys. Only the proxy credential.
    "API_PROXY_URL": "https://api-proxy.your-domain.com",
    "API_PROXY_KEY": "..."
  }
}

Key rotation is now a single operation:

wrangler secret put GEMINI_API_KEY --name api-proxy

Kill switch is now a single operation:

// Disable a project's access to all providers
// One API call. Immediate effect. No deployments.
await db
  .update(projects)
  .set({ active: false })
  .where(eq(projects.projectId, "pages-plus"));
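For the kill switch to take effect immediately, the proxy has to consult the project record on every request. The following is a sketch of that gate under a few assumptions: `ProjectRecord` is a hypothetical shape that combines the `active` flag with fields from `ProjectAuth`, and the spend figure is supplied by whatever cost store the proxy uses.

```typescript
interface ProjectRecord {
  projectId: string;
  active: boolean; // the kill switch
  dailyLimitUsd: number;
  allowedProviders: string[];
}

type Decision = { allowed: true } | { allowed: false; reason: string };

// Gate evaluated by the proxy before forwarding any request to a provider.
function authorize(
  project: ProjectRecord,
  provider: string,
  spentTodayUsd: number
): Decision {
  if (!project.active) return { allowed: false, reason: "project_disabled" };
  if (!project.allowedProviders.includes(provider)) {
    return { allowed: false, reason: "provider_not_allowed" };
  }
  if (spentTodayUsd >= project.dailyLimitUsd) {
    return { allowed: false, reason: "daily_limit_reached" };
  }
  return { allowed: true };
}

// Illustrative values only.
const pagesPlus: ProjectRecord = {
  projectId: "pages-plus",
  active: true,
  dailyLimitUsd: 5,
  allowedProviders: ["gemini", "brave"],
};

const ok = authorize(pagesPlus, "gemini", 1.2);
const killed = authorize({ ...pagesPlus, active: false }, "gemini", 1.2);
const overBudget = authorize(pagesPlus, "gemini", 5);
const wrongProvider = authorize(pagesPlus, "openai", 0);
```

Because the check runs per request, flipping `active` needs no deployment in any calling Worker.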

Anti-Pattern 9: No Structured Logging

The Problem

Your Workers use console.log for observability:

// What logging looks like in most Workers

app.post("/v1/content/generate", async (c) => {
  console.log("Received generate request");

  try {
    const body = await c.req.json();
    console.log("Generating content for:", body.keyword);

    const result = await generateContent(body, c.env);
    console.log("Content generated successfully");

    return c.json(result);
  } catch (err) {
    console.log("Error generating content:", err.message);
    // Which request? What were the inputs? What was the state?
    // The log says "Error generating content: fetch failed"
    // Good luck debugging that in production.
    return c.json({ error: "internal_error" }, 500);
  }
});

The problems:

  1. No context propagation. Each console.log is a standalone string. There is no request ID linking them together. When 50 requests happen in parallel, the logs are interleaved and useless.

  2. No structured data. Logs are strings, not objects. You cannot filter by status:error or function:content-generate or cost>0.10 because those fields do not exist.

  3. No level management. Everything is console.log. You cannot turn off debug logs in production or escalate warnings to error. There is no severity.

  4. No performance data. You do not know how long each operation took, what it cost, or what the token counts were. Those values exist in your code but never reach the logs.

The Fix: Pino in Browser Mode on Cloudflare Workers

Pino is one of the fastest Node.js loggers. Its browser mode works on Cloudflare Workers because it writes through the console methods (which Workers capture) while still giving you structured JSON, child loggers, and log levels.

// src/lib/logger.ts
import pino from "pino";

export function createLogger(service: string) {
  return pino({
    level: "info",
    browser: {
      asObject: true,
      write: {
        info: (o: object) => console.log(JSON.stringify(o)),
        warn: (o: object) => console.warn(JSON.stringify(o)),
        error: (o: object) => console.error(JSON.stringify(o)),
        debug: (o: object) => console.debug(JSON.stringify(o)),
      },
    },
    base: { service },
    timestamp: pino.stdTimeFunctions.isoTime,
  });
}

Usage with Hono middleware:

// src/middleware/logging.ts
import { createLogger } from "../lib/logger";
import type { MiddlewareHandler } from "hono";

export function requestLogger(service: string): MiddlewareHandler {
  const baseLogger = createLogger(service);

  return async (c, next) => {
    const requestId = c.req.header("x-request-id") ?? crypto.randomUUID();
    const start = Date.now();

    // Create a child logger with request context
    const log = baseLogger.child({
      requestId,
      method: c.req.method,
      path: c.req.path,
    });

    // Attach to the context so handlers can use it
    c.set("log", log);
    c.set("requestId", requestId);

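    try {
      await next();
    } catch (err) {
      // Only non-Error throws reach this block; Hono handles thrown Errors itself.
      log.error({
        msg: "request failed",
        status: 500,
        duration: Date.now() - start,
        error: String(err),
      });
      throw err;
    }

    const duration = Date.now() - start;
    if (c.error) {
      // Hono catches Errors thrown by handlers, runs onError, and exposes the Error on c.error.
      log.error({
        msg: "request failed",
        status: c.res.status,
        duration,
        error: c.error.message,
        stack: c.error.stack,
      });
    } else {
      log.info({
        msg: "request completed",
        status: c.res.status,
        duration,
      });
    }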
  };
}

Application wiring:

// src/index.ts
import { Hono } from "hono";
import { requestLogger } from "./middleware/logging";

const app = new Hono<{ Bindings: Env }>();

app.use("*", requestLogger("pages-plus"));

app.post("/v1/content/generate", async (c) => {
  const log = c.get("log");

  const body = await c.req.json();
  log.info({ msg: "generating content", keyword: body.keyword });

  const start = Date.now();
  const result = await generateContent(body, c.env);
  const duration = Date.now() - start;

  log.info({
    msg: "content generated",
    keyword: body.keyword,
    wordCount: result.wordCount,
    duration,
    cost: result.cost,
    model: result.model,
    tokens: result.usage,
  });

  return c.json(result);
});

What the output looks like:

{
  "level": 30,
  "time": "2026-03-15T10:23:45.123Z",
  "service": "pages-plus",
  "requestId": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "method": "POST",
  "path": "/v1/content/generate",
  "msg": "content generated",
  "keyword": "bank statement to excel",
  "wordCount": 1847,
  "duration": 4523,
  "cost": 0.0034,
  "model": "gemini-2.5-flash",
  "tokens": {
    "input": 2340,
    "output": 1200,
    "thinking": 3800
  }
}

Now you can:

Child Loggers for Deep Context

The real power of Pino is child loggers. Each child inherits its parent's context and adds its own:

// In the request handler
const log = c.get("log"); // Has requestId, method, path

// Pass to domain function
const contentLog = log.child({ function: "content-generate", keyword });

// Inside domain function, create deeper children
const llmLog = contentLog.child({ provider: "gemini", model: "gemini-2.5-flash" });
llmLog.info({ msg: "calling LLM", promptLength: prompt.length });
// Output has: requestId + method + path + function + keyword + provider + model + msg + promptLength
// ALL context propagated automatically. Zero extra work.
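The mechanism behind child loggers is small enough to sketch without any dependency. The code below is not Pino's implementation; `createMiniLogger` is a hypothetical toy that shows only how bindings accumulate from parent to child.

```typescript
type Fields = Record<string, unknown>;

interface MiniLogger {
  child(extra: Fields): MiniLogger;
  info(fields: Fields): Fields;
}

// Each logger closes over its bindings; a child merges its own on top.
function createMiniLogger(bindings: Fields = {}): MiniLogger {
  return {
    child: (extra) => createMiniLogger({ ...bindings, ...extra }),
    info: (fields) => {
      const line = { level: 30, ...bindings, ...fields };
      console.log(JSON.stringify(line)); // one JSON object per log line
      return line; // returned so the merged result can be inspected
    },
  };
}

const requestLog = createMiniLogger({ service: "pages-plus" }).child({
  requestId: "req-1",
});
const llmLog = requestLog.child({ provider: "gemini" });

const line = llmLog.info({ msg: "calling LLM", promptLength: 42 });
// line contains service, requestId, provider, msg, and promptLength
```

The parent logger is never mutated, so two concurrent requests cannot leak context into each other.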

Anti-Pattern 10: Uncontrolled Cron Spending

The Problem

Your wrangler.jsonc has 9 cron triggers:

{
  "triggers": {
    "crons": [
      "*/15 * * * *",
      "*/30 * * * *",
      "0 */2 * * *",
      "0 */4 * * *",
      "0 6 * * *",
      "0 12 * * *",
      "0 18 * * *",
      "30 8 * * *",
      "0 0 * * *"
    ]
  }
}

Each cron triggers LLM calls. Some generate article outlines. Some generate full drafts. Some generate meta descriptions. Some generate internal link suggestions. None of them check how much has been spent today. None of them have a kill switch.
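It helps to turn that list into a daily invocation count. The sketch below handles only the three field forms used above (`*`, `*/n`, and a plain number) and assumes the day, month, and weekday fields are all `*`. It is a back-of-the-envelope calculator, not a cron parser.

```typescript
// Count how many values in [0, max) a single cron field matches.
function countMatches(field: string, max: number): number {
  if (field === "*") return max;
  if (field.startsWith("*/")) {
    const step = Number(field.slice(2));
    return Math.ceil(max / step); // */15 over 60 minutes -> 0, 15, 30, 45
  }
  return 1; // a single fixed value
}

// Runs per day for an expression whose last three fields are all "*".
function runsPerDay(expr: string): number {
  const [minute = "*", hour = "*"] = expr.split(" ");
  return countMatches(minute, 60) * countMatches(hour, 24);
}

const crons = [
  "*/15 * * * *",
  "*/30 * * * *",
  "0 */2 * * *",
  "0 */4 * * *",
  "0 6 * * *",
  "0 12 * * *",
  "0 18 * * *",
  "30 8 * * *",
  "0 0 * * *",
];

const perCron = crons.map(runsPerDay); // [96, 48, 12, 6, 1, 1, 1, 1, 1]
const totalPerDay = perCron.reduce((a, b) => a + b, 0); // 167
```

Nine lines of configuration add up to 167 invocations a day, each one free to call an LLM with no budget check in front of it.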

War Story: The 193 Ghost Posts

Here is the sequence of events:

  1. Monday: Deploy content pipeline with 9 crons. Each cron handles one stage of content generation.
  2. Monday-Friday: Crons run 24/7. Each generates content using Gemini. No cost tracking. No quality gate. No human review.
  3. Friday: During a routine ops session, check the Gemini billing dashboard. See $37 from pages-plus.
  4. Investigation: The pipeline generated 193 blog posts in 5 days.
    • No human ever reviewed the content quality
    • No human knew the posts existed (they were in draft status in D1)
    • No cost was attributed to any specific function
    • No daily limit existed to stop the pipeline
  5. Emergency response:
    • All 9 crons in pages-plus disabled immediately
    • All 4 crons in aso-mrr disabled
    • 14 API keys removed from worker configurations
    • New P0 standard written: API Cost Metering Standard
    • 12 issues created to implement metering before re-enabling any cron

The Fix: Budget-Aware Pipelines

Every cron that triggers LLM calls must check the budget before proceeding:

// src/crons/content-pipeline.ts

import type { ScheduledEvent } from "@cloudflare/workers-types";

interface BudgetCheck {
  allowed: boolean;
  spent: number;
  limit: number;
  remaining: number;
}

async function checkBudget(
  projectId: string,
  env: Env
): Promise<BudgetCheck> {
  const response = await fetch(
    `${env.API_PROXY_URL}/v1/costs?period=day&project=${projectId}`,
    {
      headers: {
        "X-Project-Id": projectId,
        "X-Api-Key": env.API_PROXY_KEY,
      },
    }
  );

  if (!response.ok) {
    // If we cannot check the budget, do not proceed
    return { allowed: false, spent: 0, limit: 0, remaining: 0 };
  }

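  // response.json() is untyped, so check the fields before trusting them
  const data = (await response.json()) as Record<string, unknown>;
  const spent = Number(data.total_cost_usd);
  const limit = Number(data.daily_limit_usd);

  if (!Number.isFinite(spent) || !Number.isFinite(limit)) {
    // Malformed proxy response: fail closed, same as an unreachable proxy
    return { allowed: false, spent: 0, limit: 0, remaining: 0 };
  }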

  return {
    allowed: spent < limit * 0.8, // Stop at 80% to leave headroom
    spent,
    limit,
    remaining: limit - spent,
  };
}

export async function handleScheduled(
  event: ScheduledEvent,
  env: Env
): Promise<void> {
  const log = createLogger("pages-plus").child({ trigger: "cron", cron: event.cron });

  // 1. Budget check BEFORE any LLM call
  const budget = await checkBudget("pages-plus", env);

  if (!budget.allowed) {
    log.warn({
      msg: "cron skipped: budget limit approaching",
      spent: budget.spent,
      limit: budget.limit,
      remaining: budget.remaining,
    });
    return; // Exit gracefully. Do nothing. Cost: $0.
  }

  log.info({
    msg: "cron executing",
    budget: {
      spent: budget.spent,
      limit: budget.limit,
      remaining: budget.remaining,
    },
  });

  // 2. Process with cost awareness
  const pendingItems = await getPendingContentItems(env);

  // Estimate cost before processing
  const estimatedCostPerItem = 0.03; // Based on historical average
  const maxItems = Math.floor(budget.remaining / estimatedCostPerItem);
  const itemsToProcess = pendingItems.slice(0, Math.min(maxItems, 10)); // Cap at 10 per run

  log.info({
    msg: "processing items",
    pending: pendingItems.length,
    processing: itemsToProcess.length,
    maxByBudget: maxItems,
  });

  // Count outcomes separately so the summary does not report failures as processed
  let succeeded = 0;
  let failed = 0;

  for (const item of itemsToProcess) {
    try {
      await processContentItem(item, env);
      succeeded++;
    } catch (err) {
      failed++;
      log.error({
        msg: "item processing failed",
        itemId: item.id,
        error: err instanceof Error ? err.message : String(err),
      });
      // Continue with next item, don't fail the whole batch
    }
  }

  log.info({
    msg: "cron completed",
    processed: succeeded,
    failed,
    skipped: pendingItems.length - itemsToProcess.length,
  });
}
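
The item-count arithmetic in the handler can also be pulled out into a pure function, which makes it easy to unit-test without a proxy or a database. This is a sketch under assumptions: `planRun` and its parameter names are hypothetical, and it measures headroom against the same 80% soft limit that `checkBudget` uses rather than against the full daily limit.

```typescript
interface RunPlan {
  allowed: boolean;
  itemsToProcess: number;
  headroomUsd: number;
}

// Hypothetical planner: how many items fit under the soft limit for this run.
function planRun(opts: {
  spentUsd: number;
  dailyLimitUsd: number;
  pendingItems: number;
  estimatedCostPerItemUsd: number;
  perRunCap: number;
  softLimitRatio?: number; // assumed default of 0.8, matching the 80% stop above
}): RunPlan {
  const ratio = opts.softLimitRatio ?? 0.8;
  // Dollars left before the soft limit is reached, never negative
  const headroomUsd = Math.max(0, opts.dailyLimitUsd * ratio - opts.spentUsd);
  const maxByBudget = Math.floor(headroomUsd / opts.estimatedCostPerItemUsd);
  // The smallest of: what is pending, what the budget allows, the per-run cap
  const itemsToProcess = Math.max(
    0,
    Math.min(opts.pendingItems, maxByBudget, opts.perRunCap)
  );
  return { allowed: itemsToProcess > 0, itemsToProcess, headroomUsd };
}

const base = {
  dailyLimitUsd: 5,
  pendingItems: 40,
  estimatedCostPerItemUsd: 0.03,
  perRunCap: 10,
};

const early = planRun({ ...base, spentUsd: 3.2 }); // plenty of headroom: the cap wins
const late = planRun({ ...base, spentUsd: 3.95 }); // little headroom: the budget wins
const over = planRun({ ...base, spentUsd: 4.2 }); // past the soft limit: nothing runs
console.log(early, late, over);
```

With a $5 daily limit, the first call is capped at 10 items by the per-run limit, the second is cut to a single item by the remaining budget, and the third is refused outright.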

Kill Switch Pattern

Every automated pipeline needs a manual kill switch that does not require a deployment:

// Check a KV flag before running ANY automated pipeline
async function isPipelineEnabled(
  pipeline: string,
  env: Env
): Promise<boolean> {
  // KV read is fast, cheap, and can be updated without deployment
  const flag = await env.KV.get(`pipeline:${pipeline}:enabled`);

  // Default to DISABLED -- pipelines must be explicitly enabled
  // This is the opposite of the usual default, and it's intentional.
  // An unset flag means "we haven't verified this pipeline yet."
  return flag === "true";
}

// In the cron handler
export async function handleScheduled(
  event: ScheduledEvent,
  env: Env
): Promise<void> {
  // Kill switch check -- before budget check, before anything
  if (!(await isPipelineEnabled("content-generation", env))) {
    // Silent return. No log spam. The pipeline is off.
    return;
  }

  // Budget check
  const budget = await checkBudget("pages-plus", env);
  if (!budget.allowed) return;

  // ... actual work
}

// To disable a pipeline in emergency:
// wrangler kv key put "pipeline:content-generation:enabled" "false" --namespace-id <id>
// No deployment. No code change. KV is eventually consistent, so allow
// up to about a minute for the new value to reach every location.

Cron Governance Checklist

Before enabling any cron that triggers LLM calls:

// src/crons/governance.ts

interface CronGovernanceCheck {
  pipeline: string;
  checks: {
    killSwitchExists: boolean; // Can you disable without deploying?
    budgetCheckExists: boolean; // Does it check spend before calling LLMs?
    dailyLimitSet: boolean; // Is there a dollar limit per day?
    costAttribution: boolean; // Does each call tag its function + project?
    maxItemsCapped: boolean; // Is there a per-run item limit?
    humanReviewGate: boolean; // Do outputs get reviewed before publishing?
    errorHandling: boolean; // Does failure in one item not kill the batch?
    logging: boolean; // Is there structured logging for observability?
  };
}

// Every check must be true before the cron is enabled.
// This is the standard that was written after the $47 incident.
// It exists because we did not have it before.
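
The "every check must be true" rule is simple enough to enforce in code rather than by convention. The sketch below is one possible shape, not the original standard's implementation: `GovernanceChecks` repeats the field names of the `checks` object above, while `failingChecks` and `canEnableCron` are hypothetical helpers introduced for illustration.

```typescript
// Same field names as the `checks` object in CronGovernanceCheck above.
interface GovernanceChecks {
  killSwitchExists: boolean;
  budgetCheckExists: boolean;
  dailyLimitSet: boolean;
  costAttribution: boolean;
  maxItemsCapped: boolean;
  humanReviewGate: boolean;
  errorHandling: boolean;
  logging: boolean;
}

// Names of every check that is still false, in declaration order.
function failingChecks(checks: GovernanceChecks): Array<keyof GovernanceChecks> {
  const names = Object.keys(checks) as Array<keyof GovernanceChecks>;
  return names.filter((name) => !checks[name]);
}

// A cron may be enabled only when nothing is failing.
function canEnableCron(checks: GovernanceChecks): boolean {
  return failingChecks(checks).length === 0;
}

// Illustrative values: a pipeline with no daily limit and no review gate yet.
const contentPipeline: GovernanceChecks = {
  killSwitchExists: true,
  budgetCheckExists: true,
  dailyLimitSet: false,
  costAttribution: true,
  maxItemsCapped: true,
  humanReviewGate: false,
  errorHandling: true,
  logging: true,
};

console.log(canEnableCron(contentPipeline), failingChecks(contentPipeline));
```

Returning the list of failing checks, rather than a bare boolean, gives whoever is trying to re-enable the cron a concrete to-do list.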

Small Examples

Example 1: Extracting Structured Data from Unstructured Text

import { generateObject } from "ai";
import { google } from "@ai-sdk/google";
import { z } from "zod";

const InvoiceSchema = z.object({
  vendor: z.string(),
  invoiceNumber: z.string(),
  date: z.string().date(),
  lineItems: z.array(
    z.object({
      description: z.string(),
      quantity: z.number(),
      unitPrice: z.number(),
      total: z.number(),
    })
  ),
  subtotal: z.number(),
  tax: z.number(),
  total: z.number(),
  currency: z.string().length(3),
});

async function extractInvoice(rawText: string) {
  const { object } = await generateObject({
    model: google("gemini-2.5-flash"),
    schema: InvoiceSchema,
    prompt: `Extract invoice data from this text:\n\n${rawText}`,
  });

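  // object has passed InvoiceSchema validation:
  // object.lineItems[0].total is a number, object.currency is a 3-char string.
  // No regex. No manual parsing. If the output cannot be validated,
  // generateObject throws instead of returning a malformed object.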
  return object;
}

Example 2: Cost-Aware Model Selection

function selectModel(task: {
  complexity: "low" | "medium" | "high";
  budgetRemaining: number;
  requiresReasoning: boolean;
}): string {
  // If budget is tight, always use the cheapest model
  if (task.budgetRemaining < 0.50) {
    return "gemini-2.5-flash"; // $0.15/M input, $0.60/M output
  }

  // High-complexity reasoning tasks justify thinking tokens
  if (task.complexity === "high" && task.requiresReasoning) {
    // But only if budget allows -- thinking tokens are 5.8x output
    if (task.budgetRemaining > 2.0) {
      return "gemini-2.5-pro"; // $1.25/M input, $10/M output
    }
  }

  // Default: Flash handles 90% of tasks adequately
  return "gemini-2.5-flash";
}
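
To make the price comments concrete, the sketch below estimates the cost of a single Flash call using the token counts that appear in the Pino example further down (2,340 input, 1,200 output, 3,800 thinking). The rates are assumptions taken from the comments above ($0.15/M input, $0.60/M output, thinking priced at 5.8x the output rate). `estimateCostUsd` is an illustrative name, not the `calculateCost` helper from the shared package, and current provider pricing should be checked before reusing these numbers.

```typescript
interface TokenCounts {
  input: number;
  output: number;
  thinking: number;
}

interface RatesPerMillionUsd {
  input: number;
  output: number;
  thinking: number;
}

// Assumed rates, derived from the comments in selectModel above.
const FLASH_RATES: RatesPerMillionUsd = {
  input: 0.15,
  output: 0.6,
  thinking: 0.6 * 5.8,
};

// Price every token tier separately, then convert from per-million to dollars.
function estimateCostUsd(tokens: TokenCounts, rates: RatesPerMillionUsd): number {
  const perMillion =
    tokens.input * rates.input +
    tokens.output * rates.output +
    tokens.thinking * rates.thinking;
  return perMillion / 1e6;
}

const call: TokenCounts = { input: 2340, output: 1200, thinking: 3800 };
const fullCost = estimateCostUsd(call, FLASH_RATES);
// The "(input + output) * price" shortcut is the same formula with thinking ignored
const naiveCost = estimateCostUsd({ ...call, thinking: 0 }, FLASH_RATES);

console.log(fullCost.toFixed(4), naiveCost.toFixed(4), (fullCost / naiveCost).toFixed(1));
```

For this one call the full estimate is about $0.0143 against about $0.0011 when thinking tokens are ignored, a gap of roughly 13x. The exact multiple depends on how much the model thinks per call, which is why a budget check like `budgetRemaining` only works if the tracked spend includes every tier.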

Example 3: Zod Schema for Queue Message Validation

import { z } from "zod";

const DomainMessageSchema = z.object({
  event_id: z.string().uuid(),
  type: z.string().regex(/^[a-z]+\.[a-z]+$/), // e.g., "content.published"
  source: z.string().min(1),
  timestamp: z.string().datetime(),
  correlation_id: z.string().optional(),
  payload: z.record(z.unknown()),
});

type DomainMessage = z.infer<typeof DomainMessageSchema>;

// In queue consumer
async function handleBatch(batch: MessageBatch<unknown>) {
  for (const msg of batch.messages) {
    const result = DomainMessageSchema.safeParse(msg.body);

    if (!result.success) {
      console.error("Invalid queue message:", result.error.issues);
      msg.ack(); // Discard malformed messages, don't retry
      continue;
    }

    const message = result.data;
    await processMessage(message);
    msg.ack();
  }
}

Example 4: Shared Type Package Structure

// packages/shared/src/index.ts
// Explicit named exports -- no `export *` wildcard re-exports

export {
  GeminiResponseSchema,
  GeminiUsageSchema,
  extractText,
  extractUsage,
} from "./types/gemini";

export type { GeminiResponse, GeminiUsage } from "./types/gemini";

export {
  DomainMessageSchema,
  type DomainMessage,
} from "./types/messages";

export {
  calculateCost,
  MODEL_PRICING,
} from "./cost/calculator";

export type { TokenUsage, ModelPricing } from "./cost/calculator";

// packages/shared/package.json
{
  "name": "@acme/shared",
  "version": "1.0.0",
  "exports": {
    ".": "./src/index.ts",
    "./types/*": "./src/types/*.ts",
    "./cost/*": "./src/cost/*.ts"
  }
}

// pages-plus/package.json
{
  "dependencies": {
    "@acme/shared": "workspace:*"
  }
}

Example 5: Retry with Exponential Backoff for LLM Calls

async function withRetry<T>(
  fn: () => Promise<T>,
  options: {
    maxRetries?: number;
    baseDelayMs?: number;
    maxDelayMs?: number;
    retryOn?: (error: unknown) => boolean;
  } = {}
): Promise<T> {
  const {
    maxRetries = 3,
    baseDelayMs = 1000,
    maxDelayMs = 30000,
    retryOn = () => true,
  } = options;

  let lastError: unknown;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;

      if (attempt === maxRetries || !retryOn(err)) {
        throw err;
      }

      const delay = Math.min(baseDelayMs * 2 ** attempt, maxDelayMs);
      const jitter = delay * (0.5 + Math.random() * 0.5);
      await new Promise((resolve) => setTimeout(resolve, jitter));
    }
  }

  throw lastError;
}

// Usage with LLM calls
const result = await withRetry(
  () =>
    generateObject({
      model: google("gemini-2.5-flash"),
      schema: ArticleSchema,
      prompt: "...",
    }),
  {
    maxRetries: 2,
    retryOn: (err) => {
      // Retry on rate limits and server errors
      // Do NOT retry on validation errors (schema mismatch)
      if (err instanceof Error) {
        // Match 429 and any 5xx status as a whole number, not as a substring
        return /\b(429|5\d\d)\b/.test(err.message);
      }
      return false;
    },
  }
);
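
The delay formula in `withRetry` is easier to reason about with the numbers written out. The helper below (a hypothetical `backoffBounds`, not part of the original code) lists the shortest and longest sleep before each retry, using the fact that the jitter factor `0.5 + Math.random() * 0.5` always falls in the range [0.5, 1).

```typescript
interface DelayBounds {
  retry: number; // 1-based retry number
  minMs: number; // shortest possible sleep (jitter factor 0.5)
  maxMs: number; // upper bound on the sleep (jitter factor approaching 1)
}

// Mirrors the delay calculation in withRetry: exponential growth, then a cap.
function backoffBounds(
  maxRetries: number,
  baseDelayMs = 1000,
  maxDelayMs = 30000
): DelayBounds[] {
  const bounds: DelayBounds[] = [];
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const delay = Math.min(baseDelayMs * 2 ** attempt, maxDelayMs);
    bounds.push({ retry: attempt + 1, minMs: delay * 0.5, maxMs: delay });
  }
  return bounds;
}

const schedule = backoffBounds(3);
const worstCaseTotalMs = schedule.reduce((sum, b) => sum + b.maxMs, 0);
console.log(schedule, worstCaseTotalMs);
```

With the defaults, three retries sleep for at most 1, 2, and 4 seconds, so a call that fails every time adds under 7 seconds of waiting on top of the request time itself. The 30-second cap only starts to matter from the sixth retry onward.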

Example 6: Pino Child Logger Chain

import pino from "pino";

const root = pino({
  browser: {
    asObject: true,
    write: {
      info: (o: object) => console.log(JSON.stringify(o)),
      error: (o: object) => console.error(JSON.stringify(o)),
    },
  },
  base: { service: "scalable-media" },
});

// Request-level context
const requestLog = root.child({ requestId: "abc-123" });

// Function-level context
const contentLog = requestLog.child({ function: "content-generate" });

// LLM call-level context
const llmLog = contentLog.child({
  provider: "gemini",
  model: "gemini-2.5-flash",
});

llmLog.info({ msg: "calling model", promptTokens: 2340 });
// Output: { service, requestId, function, provider, model, msg, promptTokens }
// Every field from every ancestor is included automatically.

llmLog.info({
  msg: "model responded",
  outputTokens: 1200,
  thinkingTokens: 3800,
  cost: 0.0145,
  latencyMs: 4523,
});
// Full trace from service to specific LLM call, all in one log line.

Example 7: Validation Error Formatting for API Consumers

import { z } from "zod";

function formatZodError(error: z.ZodError): {
  error: string;
  issues: Array<{ path: string; code: string; message: string; expected?: string; received?: string }>;
} {
  return {
    error: "validation_failed",
    issues: error.issues.map((issue) => {
      const base = {
        path: issue.path.join("."),
        code: issue.code,
        message: issue.message,
      };

      if (issue.code === "invalid_type") {
        return {
          ...base,
          expected: issue.expected,
          received: issue.received,
        };
      }

      return base;
    }),
  };
}

// Usage
const result = CreateContentSchema.safeParse(body);
if (!result.success) {
  return c.json(formatZodError(result.error), 400);
}

// Response to the client:
// {
//   "error": "validation_failed",
//   "issues": [
//     { "path": "sections.0.heading", "code": "too_small", "message": "String must contain at least 1 character(s)" },
//     { "path": "metadata.seo.keywords", "code": "too_big", "message": "Array must contain at most 10 element(s)" }
//   ]
// }

Example 8: Daily Cost Dashboard Query

// GET /v1/costs/dashboard
app.get("/v1/costs/dashboard", async (c) => {
  const db = c.env.DB;

  const [byProject, byFunction, byModel, dailyTrend] = await Promise.all([
    // Cost by project (today)
    db
      .prepare(
        `SELECT project_id, ROUND(SUM(cost_usd), 2) as cost,
         COUNT(*) as calls, SUM(thinking_tokens) as thinking
         FROM api_calls_ledger
         WHERE created_at >= date('now')
         GROUP BY project_id ORDER BY cost DESC`
      )
      .all(),

    // Cost by function (today)
    db
      .prepare(
        `SELECT function, ROUND(SUM(cost_usd), 2) as cost,
         COUNT(*) as calls
         FROM api_calls_ledger
         WHERE created_at >= date('now')
         GROUP BY function ORDER BY cost DESC`
      )
      .all(),

    // Cost by model (today)
    db
      .prepare(
        `SELECT model, ROUND(SUM(cost_usd), 2) as cost,
         COUNT(*) as calls,
         SUM(input_tokens) as input_tokens,
         SUM(output_tokens) as output_tokens,
         SUM(thinking_tokens) as thinking_tokens
         FROM api_calls_ledger
         WHERE created_at >= date('now')
         GROUP BY model ORDER BY cost DESC`
      )
      .all(),

    // Daily trend (last 7 days)
    db
      .prepare(
        `SELECT date(created_at) as day, ROUND(SUM(cost_usd), 2) as cost,
         COUNT(*) as calls
         FROM api_calls_ledger
         WHERE created_at >= date('now', '-7 days')
         GROUP BY day ORDER BY day`
      )
      .all(),
  ]);

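  // D1's .all() resolves to { results, success, meta }; return only the rows
  return c.json({
    byProject: byProject.results,
    byFunction: byFunction.results,
    byModel: byModel.results,
    dailyTrend: dailyTrend.results,
  });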
});

Comparisons

LLM Response Parsing

| Approach | How It Works | Pros | Cons |
| --- | --- | --- | --- |
| Manual regex + JSON.parse | Strip markdown fences, trim, parse, cast | No dependencies | Fragile, incomplete coverage, no retries, `as` type assertions lie |
| Provider function calling | Define functions, model returns structured call | Provider-native, no SDK dependency | Provider-specific API, manual validation still needed, inconsistent across providers |
| Vercel AI SDK generateObject() | Zod schema sent as response format, SDK validates + retries | Type-safe, provider-agnostic, automatic retries, zero parsing code | Adds dependency (~50KB), requires Zod schema upfront, deeply nested schemas can cause issues |
| Instructor (Python) | Pydantic models as output schemas, retries on validation failure | Pythonic, good error messages, patches provider clients | Python only, monkey-patches clients, adds complexity |
| Outlines / Guidance | Constrained generation at token level | Guaranteed valid output, fastest | Requires model server access, not for API-based models |

Recommendation: Use generateObject() for TypeScript projects. It handles the 95% case (structured output from API-based models) with zero boilerplate. If you are in Python, Instructor is the equivalent. If you run your own model server, Outlines gives you guaranteed-valid generation.

Cost Tracking

| Approach | Setup | Token Tier Support | Attribution | Daily Limits | Real-Time |
| --- | --- | --- | --- | --- | --- |
| No tracking | None | None | None | None | No |
| Provider dashboards | Already available | Full (provider knows) | Project-level (by API key) | Manual alerts only | Yes |
| Cloudflare AI Gateway | One URL change | Full | By gateway ID | Rate limiting available | Yes |
| LiteLLM Proxy | Self-hosted proxy | Full | By key/user/team/tag | Per-key budgets | Yes |
| Portkey | SDK + dashboard | Full | By virtual key/metadata | Budget alerts | Yes |
| Custom proxy | Build + deploy | You implement it | Fully customizable | You implement it | You implement it |

Recommendation: Start with Cloudflare AI Gateway if you are already on Cloudflare: it is free and requires one line of code. Move to LiteLLM or a custom proxy when you need function-level cost attribution with custom tags. Provider dashboards are not enough: they show total spend but not why each dollar was spent.

Agent State Management

| Runtime | Language | State Persistence | Recovery Model | Deployment | Best For |
| --- | --- | --- | --- | --- | --- |
| Manual state machine | Any | In-memory (lost on crash) | None (restart from scratch) | Wherever your code runs | Prototypes, simple sequences |
| CF Agents SDK | TypeScript | SQLite per agent (durable) | Resume from last checkpoint | Cloudflare edge, global | Per-entity agents (one per user/brand), edge-native |
| Temporal | TS, Python, Go, Java | Event-sourced (durable) | Replay from event history | Self-hosted or cloud | Mission-critical workflows, multi-service orchestration |
| Inngest | TypeScript | Step-level (durable) | Resume from last step | Serverless, event-driven | Event-triggered pipelines, serverless-friendly |
| LangGraph | Python, TypeScript | Configurable (Redis, Postgres) | Checkpoint-based | Self-hosted | Complex agent graphs with branching, AI-specific abstractions |

Recommendation: If you are on Cloudflare, the Agents SDK is the obvious choice: it runs on Durable Objects with built-in SQLite state, hibernation, and global distribution. If you are not on Cloudflare, use Temporal for complex multi-service workflows or Inngest for simpler event-driven pipelines. Choose LangGraph if you need graph-based agent routing, but be prepared for production challenges with state management and debugging.
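
The "Recovery Model" column is the one that decides how much LLM spend a crash wastes. The sketch below illustrates step-level resume with a plain in-memory object standing in for durable storage. It is not the API of the Agents SDK, Temporal, Inngest, or LangGraph: `runWithCheckpoints`, `CheckpointStore`, and the step names are all hypothetical, and the steps are synchronous only to keep the example small.

```typescript
interface Step {
  name: string;
  run: (input: string) => string;
}

// Stand-in for durable storage: step name -> saved output.
interface CheckpointStore {
  [stepName: string]: string;
}

// Runs steps in order, skipping any step whose output is already saved.
function runWithCheckpoints(
  steps: Step[],
  store: CheckpointStore,
  input: string
): string {
  let current = input;
  for (const step of steps) {
    const saved = store[step.name];
    if (saved !== undefined) {
      current = saved; // Completed in an earlier run: reuse, do not re-execute
      continue;
    }
    current = step.run(current); // May throw; earlier checkpoints survive
    store[step.name] = current;
  }
  return current;
}

// Demo: the "draft" step crashes once, then succeeds on the second run.
let outlineCalls = 0;
let draftShouldFail = true;

const steps: Step[] = [
  {
    name: "outline",
    run: (topic) => {
      outlineCalls++; // Stands in for a paid LLM call
      return `outline(${topic})`;
    },
  },
  {
    name: "draft",
    run: (outline) => {
      if (draftShouldFail) {
        draftShouldFail = false;
        throw new Error("simulated crash");
      }
      return `draft(${outline})`;
    },
  },
];

const store: CheckpointStore = {};
let firstRunFailed = false;
try {
  runWithCheckpoints(steps, store, "cron budgets");
} catch (err) {
  firstRunFailed = true;
}
const finalOutput = runWithCheckpoints(steps, store, "cron budgets");
console.log({ firstRunFailed, outlineCalls, finalOutput });
```

The second run picks up at the failed step, so the outline step (and the LLM call it stands in for) executes only once. A manual state machine with in-memory state would have paid for it twice.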

Logging on Edge/Serverless

| Library | Edge/Worker Support | Structured Output | Child Loggers | Size | Notes |
| --- | --- | --- | --- | --- | --- |
| console.log | Native | No (strings only) | No | 0KB | Not logging, just printing |
| Pino (browser mode) | Yes (browser mode) | Yes (JSON) | Yes | ~15KB | Fastest structured logger, browser mode uses console internally |
| Winston | No (Node.js only) | Yes | No (but has metadata) | ~200KB | Too heavy, Node.js APIs, not edge-compatible |
| Custom JSON wrapper | Yes | Yes | If you build it | 1-5KB | Full control, maintenance burden |
| Workers Logs | Native (CF only) | Yes | No | 0KB | Cloudflare-specific, automatic, but limited control |

Recommendation: Pino in browser mode for Cloudflare Workers. It gives you structured JSON, child loggers with context propagation, and log levels, and it works by writing to console.log internally, which Workers capture. Zero infrastructure to set up. If you want Cloudflare-native, Workers Logs is automatic but less flexible.
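
Child-logger context propagation is a small idea, and seeing it without a library makes the "Custom JSON wrapper" row in the table concrete. The sketch below is a hand-rolled illustration, not Pino's implementation: `createJsonLogger` and its `sink` parameter are assumptions introduced here so that the output can be captured and inspected.

```typescript
type Fields = Record<string, unknown>;

interface JsonLogger {
  info(fields: Fields): void;
  child(bindings: Fields): JsonLogger;
}

// Each logger carries its bindings; a child is a new logger with merged bindings.
function createJsonLogger(
  bindings: Fields,
  sink: (line: string) => void = console.log
): JsonLogger {
  return {
    info(fields) {
      // Per-call fields are merged last, so they win over inherited bindings
      sink(JSON.stringify({ level: "info", ...bindings, ...fields }));
    },
    child(extra) {
      return createJsonLogger({ ...bindings, ...extra }, sink);
    },
  };
}

// Capture lines in an array instead of printing, to make the result inspectable.
const lines: string[] = [];
const rootLog = createJsonLogger({ service: "scalable-media" }, (line) => {
  lines.push(line);
});

const llmLog = rootLog
  .child({ requestId: "abc-123" })
  .child({ function: "content-generate" })
  .child({ provider: "gemini", model: "gemini-2.5-flash" });

llmLog.info({ msg: "calling model", promptTokens: 2340 });
rootLog.info({ msg: "root is unchanged" });
console.log(lines);
```

The child line carries every ancestor's fields, while the root logger's own output stays free of request-level context. Pino adds levels, serializers, and performance work on top of this, which is the maintenance burden the table refers to.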

Input Validation

| Approach | Runtime Validation | Type Safety | Error Messages | Schema Reuse | Bundle Size |
| --- | --- | --- | --- | --- | --- |
| None (request.json()) | No | No (any) | N/A (crashes) | N/A | 0KB |
| Manual if checks | Yes (incomplete) | No (no narrowing) | Custom but inconsistent | Copy-paste | 0KB |
| Zod | Yes (complete) | Yes (z.infer) | Structured, detailed | Schema objects | ~14KB |
| tRPC | Yes (via Zod) | Yes (end-to-end) | Structured | Full stack | ~30KB |
| Valibot | Yes (complete) | Yes (similar to Zod) | Structured | Schema objects | ~5KB |
| ArkType | Yes (complete) | Yes (type-first) | Structured | Schema objects | ~40KB |

Recommendation: Zod is the standard choice. It has the largest ecosystem and works with AI SDK, tRPC, React Hook Form, and most TypeScript tools. If bundle size is critical (edge workers), Valibot offers similar functionality at roughly a third of the size. Choose tRPC if you control both client and server and want end-to-end type safety.


The Anti-Pattern Summary Table

| # | Don't | Do Instead | Why |
| --- | --- | --- | --- |
| 1 | Parse LLM responses with regex + JSON.parse + `as` | Use generateObject() with a Zod schema | Regex misses variants, `as` lies to TypeScript, no retries on malformed responses |
| 2 | Call provider APIs directly with raw fetch() | Route through a centralized proxy with per-call metering | No cost tracking = $47 in 5 days with zero attribution |
| 3 | Calculate cost as (input + output) * price | Track ALL token tiers: input, output, thinking, cache_read, cache_write | Thinking tokens cost 5.8x output tokens, missing them causes 9x undercounts |
| 4 | Let files grow past 300 lines with mixed concerns | Split by single responsibility, target <300 lines per module | God modules hide coupling, make testing expensive, slow prompt iteration |
| 5 | Use await c.req.json() without validation | Validate with Zod at every system boundary (HTTP, queue, function) | Unvalidated input propagates through queues and databases, crashes downstream |
| 6 | Copy-paste GeminiApiResponse interface across files | Shared type packages, or use AI SDK (no provider types needed) | Copies drift. One copy misses thoughtsTokenCount. 9x undercount follows. |
| 7 | Hand-roll state machines with in-memory state | Use a stateful agent runtime (CF Agents SDK, Temporal, Inngest) | No persistence = restart from scratch on crash. Wasted LLM spend. |
| 8 | Put provider API keys in every worker | Centralize keys in one proxy service | Rotation requires N deployments. No kill switch. No centralized rate limiting. |
| 9 | Use console.log("something happened") for observability | Pino browser mode with child loggers for context propagation | Strings are not searchable. No request correlation. No cost/latency data. |
| 10 | Run crons that call LLMs without budget checks | Budget-aware pipelines with daily limits, cost checks, and kill switches | 193 ghost posts in 5 days. $37 in untracked spend. No human in the loop. |

Putting It All Together: The Modern Production AI Stack

Here is the architecture that addresses all 10 anti-patterns simultaneously:

┌────────────────────────────────────────────────────────────────────┐
│                         APPLICATION LAYER                          │
│                                                                    │
│  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐           │
│  │ Content       │  │ Analysis      │  │ Agent         │           │
│  │ Pipeline      │  │ Service       │  │ Service       │           │
│  │               │  │               │  │               │           │
│  │ Zod schemas   │  │ Zod schemas   │  │ CF Agents SDK │           │
│  │ generateObject│  │ generateObject│  │ Durable state │           │
│  │ Pino logging  │  │ Pino logging  │  │ Pino logging  │           │
│  │ Budget checks │  │ Budget checks │  │ Budget checks │           │
│  └───────┬───────┘  └───────┬───────┘  └───────┬───────┘           │
│          │                  │                  │                   │
│          └──────────────────┼──────────────────┘                   │
│                             │                                      │
│             X-Project-Id + X-Function + X-Tags                     │
│                             │                                      │
│                             ▼                                      │
│  ┌──────────────────────────────────────────────────────────┐      │
│  │                        API PROXY                         │      │
│  │                                                          │      │
│  │  Auth → Rate Limit → Budget Check → Forward → Meter      │      │
│  │                                                          │      │
│  │  - One service holds all provider API keys               │      │
│  │  - Every call logged to api_calls_ledger                 │      │
│  │  - All token tiers tracked (input/output/thinking/cache) │      │
│  │  - Daily spend limits enforced per project               │      │
│  │  - Kill switch via project.active flag                   │      │
│  └──────────────────────────┬───────────────────────────────┘      │
│                             │                                      │
└─────────────────────────────┼──────────────────────────────────────┘
                              │
                              ▼
              ┌───────────────────────────────┐
              │      PROVIDER APIs            │
              │  Gemini | OpenAI | Anthropic  │
              │  Brave | Perplexity | etc.    │
              └───────────────────────────────┘

The key properties of this stack:

  1. Every LLM call is structured. Zod schemas define the output. generateObject() validates it. No regex, no JSON.parse, no as type assertions.

  2. Every LLM call is metered. The proxy records project, function, tags, all token tiers, and calculated cost. Monthly reconciliation catches formula drift.

  3. Every LLM call is budgeted. Daily limits stop runaway spending. Kill switches disable pipelines without deployment. Crons check budget before calling.

  4. Every module is small. Under 300 lines. Single responsibility. Independently testable. Prompts are easy to find and iterate.

  5. Every boundary is validated. HTTP endpoints, queue consumers, function arguments: Zod schemas at every transition point. Invalid data is rejected at the edge, not in the database.

  6. Every type is defined once. Shared packages or SDK abstractions. No copy-paste drift. One update propagates everywhere.

  7. Every agent has durable state. CF Agents SDK or equivalent runtime. Crash recovery resumes from the last checkpoint. No wasted LLM spend.

  8. Every key is centralized. One service, one rotation point, one kill switch. No scattered secrets.

  9. Every log is structured. Pino with child loggers. Request IDs propagate through the call chain. Cost, latency, and token counts are logged fields, not embedded strings.

  10. Every cron is governed. Budget check before execution. Item caps per run. Kill switch via KV. Human review gate before publishing.

This is not theoretical. This is the stack that replaced the one that burned $47 in 5 days.



Every anti-pattern in this article was shipped to production, discovered through operational pain, and fixed with the techniques described. The $47 was real. The 193 ghost posts were real. The 9x undercount was real. The fixes are also real, and they are running in production today.

