Org Status: 🟡 Dormant | Cloudflare: N/A | Last Audited: 2026-04-28
Ten production anti-patterns that cost us $282/month in invisible LLM spend, buried bugs in 1,000-line god modules, and left 223 of 226 API endpoints wide open to malformed input. Every anti-pattern here comes from a real codebase running real traffic. Every fix has been shipped and measured.
This is not a theoretical taxonomy. It is a post-mortem.
What you will learn:
- How manual JSON parsing from LLM responses creates a maintenance nightmare, and how `generateObject()` with Zod schemas eliminates it entirely
- Why raw LLM calls without cost tracking led to a $47 spend in 5 days across two projects, and the proxy architecture that prevents it
- The thinking token pricing blindspot that caused a 9x cost undercount ($1.14 tracked vs $10.19 billed)
- Why god modules, missing input validation, and duplicated types compound into unmaintainable AI codebases
- How manual state machines for agents fail at recovery, and what stateful agent runtimes (CF Agents SDK, Temporal, Inngest) provide instead
- The scattered API key problem that required deleting 14 keys across 4 workers in a single emergency session
- Why `console.log` is not logging, and how Pino in browser mode gives you structured observability on edge runtimes with zero infrastructure
- How uncontrolled cron spending generated 193 blog posts in 5 days with no human review, no cost attribution, and no kill switch
- The Problem
- Core Concepts
- Anti-Pattern 1: Manual JSON Parsing from LLM Responses
- Anti-Pattern 2: Raw LLM Calls Without Cost Tracking
- Anti-Pattern 3: Thinking Token Pricing Blindspot
- Anti-Pattern 4: God Modules
- Anti-Pattern 5: No Input Validation on API Endpoints
- Anti-Pattern 6: Duplicated Type Definitions
- Anti-Pattern 7: Manual State Machines for Agents
- Anti-Pattern 8: Scattered API Keys
- Anti-Pattern 9: No Structured Logging
- Anti-Pattern 10: Uncontrolled Cron Spending
- Small Examples
- Comparisons
- The Anti-Pattern Summary Table
- Putting It All Together: The Modern Production AI Stack
- References
The Problem
You built an LLM-powered application. It works in development. You deploy it. Within a week, you discover:
- You cannot explain your AI spend. The Gemini dashboard says $47 in 5 days across two projects. For one of them, internal tracking says $1.14 where the bill says $10.19. The delta is not a rounding error: it is a 9x undercount caused by not pricing thinking tokens.
- You cannot trust your outputs. Half your LLM responses are wrapped in markdown code fences (```json … ```). Your regex-based cleanup handles 4 of the 7 variations providers emit. The other 3 fail silently and write `undefined` to the database.
- You cannot find the bug. The relevant logic lives in a 1,126-line file that handles SEO analysis, content scoring, keyword extraction, SERP parsing, and site audits. When something breaks, you read the whole file.
- You cannot stop the bleeding. Nine crons run every 5-30 minutes, each triggering LLM calls. There is no budget awareness, no kill switch, no human in the loop. The pipeline generated 193 blog posts in 5 days. Nobody reviewed them. Nobody knew they existed.
These are not hypothetical risks. These are the bugs we shipped, the money we burned, and the emergency sessions we ran to fix them. The core issue is always the same: LLM applications have different failure modes than traditional software, and traditional engineering practices do not cover them.
Traditional software is deterministic. You call a function, you get a return value, you check the type. LLM software is probabilistic. You send a prompt, you get a response that might be JSON, might have the right fields, might cost $0.003 or $0.30 depending on whether the model decided to "think" about it. The output shape, the cost, and the latency are all variable, and if your code assumes they are fixed, you will be surprised in production.
The modern fix is not one tool. It is a stack:
- Structured output (Vercel AI SDK + Zod) replaces manual parsing
- Cost proxy (centralized gateway) replaces raw provider calls
- Token-tier metering replaces naive token counting
- Single-responsibility modules replace god files
- Schema validation at boundaries replaces `request.json()` trust
- Shared type packages replace copy-paste interfaces
- Stateful agent runtimes replace hand-rolled state machines
- Centralized key management replaces scattered secrets
- Structured logging (Pino browser mode) replaces `console.log`
- Budget-aware pipelines replace unbounded crons
Core Concepts
Before diving into the anti-patterns, here are three foundational concepts that underpin every fix.
Structured Output
The idea that an LLM call should return a typed object, not a string you have to parse.
import { generateObject } from "ai";
import { google } from "@ai-sdk/google";
import { z } from "zod";
// The schema IS the specification
const ArticleSchema = z.object({
title: z.string().describe("SEO-optimized article title"),
slug: z.string().describe("URL-safe slug"),
sections: z
.array(
z.object({
heading: z.string(),
content: z.string(),
wordCount: z.number(),
})
)
.describe("Article sections in order"),
seoMeta: z.object({
description: z.string().max(160),
keywords: z.array(z.string()).max(10),
}),
});
type Article = z.infer<typeof ArticleSchema>;
const { object: article } = await generateObject({
model: google("gemini-2.5-flash"),
schema: ArticleSchema,
prompt: `Write an article about TypeScript monorepo patterns`,
});
// article is fully typed as Article
// No parsing. No try/catch. No regex cleanup.
console.log(article.title); // string, guaranteed
console.log(article.sections[0].wordCount); // number, guaranteed
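Key insight: When you use `generateObject()`, the schema is sent to the model as a response format constraint where the provider supports one. The SDK then parses the response and validates it against the schema before returning. If the response cannot be parsed or fails validation, the call throws a typed error (`NoObjectGeneratedError` in current AI SDK versions) instead of handing you a malformed object, so a bad response never reaches your downstream code and failure handling lives in one place.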
Cost Attribution
The idea that every LLM call should record what it cost, who triggered it, and why.
interface CostRecord {
// Identity
project_id: string; // "pages-plus"
function: string; // "article-write"
service: string; // "gemini"
model: string; // "gemini-2.5-flash"
// Token tiers -- each priced differently
input_tokens: number;
output_tokens: number;
thinking_tokens: number; // $3.50/M for Gemini Flash Thinking
cache_read_tokens: number; // $0.0375/M
cache_write_tokens: number;
// Cost
cost_usd: number; // Calculated from token tiers + model pricing
// Context
tags: string[]; // ["brand:llc-tax", "trigger:cron", "batch:2026-03-15"]
timestamp: string; // ISO 8601
}
Key insight: The cost of an LLM call is not `(input_tokens + output_tokens) * price_per_token`. Models with thinking/reasoning modes have 3-5 different token tiers, each with different prices. If you only track two tiers, you will undercount by 2-10x.
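To make the attribution idea concrete, here is a minimal sketch that rolls recorded costs up by function and by trigger tag. `LedgerRow` is a trimmed, hypothetical version of `CostRecord`, and `spendBy` is an illustrative helper rather than part of any library; the sample amounts are made up.

```typescript
// Trimmed, hypothetical version of CostRecord: only the fields a rollup needs.
interface LedgerRow {
  project_id: string;
  function: string;
  cost_usd: number;
  tags: string[];
}

// Sum cost_usd per key. A row may contribute to several keys (one per tag, say).
function spendBy(
  rows: LedgerRow[],
  keysOf: (row: LedgerRow) => string[]
): Map<string, number> {
  const totals = new Map<string, number>();
  for (const row of rows) {
    for (const key of keysOf(row)) {
      totals.set(key, (totals.get(key) ?? 0) + row.cost_usd);
    }
  }
  return totals;
}

// Made-up sample data for illustration.
const rows: LedgerRow[] = [
  { project_id: "pages-plus", function: "article-write", cost_usd: 0.25, tags: ["trigger:cron"] },
  { project_id: "pages-plus", function: "article-write", cost_usd: 0.5, tags: ["trigger:cron"] },
  { project_id: "pages-plus", function: "seo-analysis", cost_usd: 0.125, tags: ["trigger:manual"] },
];

// "What did each function cost?" and "How much did crons spend?"
const byFunction = spendBy(rows, (r) => [r.function]);
const byTrigger = spendBy(rows, (r) => r.tags.filter((t) => t.startsWith("trigger:")));
```

With records shaped like this, "who triggered it and why" becomes a grouping key instead of a forensic exercise.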
Boundary Validation
The idea that every system boundary (HTTP endpoints, queue consumers, function arguments) should validate its input against a schema.
import { z } from "zod";
const PublishRequestSchema = z.object({
brand_slug: z
.string()
.regex(/^[a-z0-9-]+$/)
.min(1)
.max(64),
content_id: z.string().uuid(),
publish_to: z.enum(["blog", "social", "newsletter"]),
schedule_at: z.string().datetime().optional(),
metadata: z
.record(z.string(), z.unknown())
.optional()
.default({}),
});
type PublishRequest = z.infer<typeof PublishRequestSchema>;
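// In the handler (`app` is a Hono instance; a malformed JSON body becomes null and fails validation)
app.post("/v1/publish", async (c) => {
const body = await c.req.json().catch(() => null);
const result = PublishRequestSchema.safeParse(body);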
if (!result.success) {
return c.json(
{
error: "validation_failed",
issues: result.error.issues.map((i) => ({
path: i.path.join("."),
message: i.message,
})),
},
400
);
}
// result.data is fully typed as PublishRequest
// No casting, no `as any`, no runtime surprises
return c.json(await publishContent(result.data, c.env));
});
Key insight: TypeScript types disappear at runtime. When your Worker receives a JSON body from the internet, TypeScript cannot guarantee its shape. Zod bridges compile-time and runtime: one schema gives you both the TypeScript type (via `z.infer`) and the runtime validator (via `.parse()` / `.safeParse()`). Define once, validate everywhere.
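The "types disappear at runtime" point can be seen in a few dependency-free lines. This is a sketch under stated assumptions: the `Article` shape is cut down for illustration, and the hand-written `isArticle` guard is a stand-in for a schema library such as Zod, not a recommendation to write guards by hand.

```typescript
interface Article {
  title: string;
  sections: { heading: string; content: string }[];
}

// Suppose the model returned a number where a string was expected.
const raw = '{"title": 42, "sections": []}';

// Compiles cleanly: `as` silences the type checker, it does not check anything.
const unchecked = JSON.parse(raw) as Article;

// Hand-written runtime guard, standing in for a schema validator.
function isArticle(value: unknown): value is Article {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.title === "string" &&
    Array.isArray(v.sections) &&
    v.sections.every(
      (s) =>
        typeof s === "object" &&
        s !== null &&
        typeof (s as Record<string, unknown>).heading === "string" &&
        typeof (s as Record<string, unknown>).content === "string"
    )
  );
}

// The static type says string; the runtime value is a number.
const runtimeTypeOfTitle = typeof unchecked.title;

// The guard inspects the actual value and rejects it.
const guardAccepts = isArticle(JSON.parse(raw));
```

The assertion and the guard look at the same bytes; only the guard actually inspects them.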
Anti-Pattern 1: Manual JSON Parsing from LLM Responses
The Problem
You ask an LLM for structured data. It returns a string. Sometimes the string is valid JSON. Sometimes it is JSON wrapped in markdown code fences. Sometimes it has a preamble like "Here's the JSON you requested:" before the actual object. Sometimes the keys are in a different order. Sometimes there are extra fields. Sometimes there are missing fields.
Your code looks like this:
// Found in 10+ files across 4 production projects
async function generateArticle(prompt: string, env: Env): Promise<Article> {
const response = await fetch(
"https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent",
{
method: "POST",
headers: {
"Content-Type": "application/json",
"x-goog-api-key": env.GEMINI_API_KEY,
},
body: JSON.stringify({
contents: [{ parts: [{ text: prompt }] }],
generationConfig: {
responseMimeType: "application/json",
},
}),
}
);
const data = await response.json();
const text = data.candidates?.[0]?.content?.parts?.[0]?.text;
if (!text) {
throw new Error("No response from Gemini");
}
// The cleanup gauntlet
let cleaned = text
.replace(/```json\n?/g, "")
.replace(/```\n?/g, "")
.replace(/^\s*Here.*?:\s*/i, "")
.trim();
try {
const parsed = JSON.parse(cleaned);
// Manual field validation
if (!parsed.title || typeof parsed.title !== "string") {
throw new Error("Missing or invalid title");
}
if (!Array.isArray(parsed.sections)) {
throw new Error("Missing or invalid sections");
}
// 20 more lines of manual checks...
return parsed as Article;
} catch (err) {
console.log("Failed to parse LLM response:", cleaned.substring(0, 200));
throw new Error(`JSON parse failed: ${err.message}`);
}
}
This pattern has five distinct failure modes:
- Regex misses a variant. The model outputs ```JSON (capital J) or ```json5, or wraps the response in `<json>` tags. Your regex does not handle it. `JSON.parse` throws, and the whole operation fails.
- Field validation is incomplete. You check for `title` and `sections` but forget to check that each section has a `heading` and `content`. The partial object propagates downstream and crashes in a different function with an unhelpful error.
- Type assertion lies. `parsed as Article` tells TypeScript "trust me, this is an Article." TypeScript obeys. At runtime, it might be missing three fields. The type assertion bypasses every safety net TypeScript provides.
- No retry logic. If the model returns malformed JSON once, this code throws. There is no retry with a different prompt, no retry with a stricter system message, no fallback to a different model.
- No cost tracking. The code discards the `usageMetadata` that comes back with the raw `fetch()` response, so no token counts are ever recorded. You have no idea what this call cost. Multiply by 9 crons running every 15 minutes, and you get the $47-in-5-days disaster.
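The first failure mode is easy to reproduce. The sketch below reuses the same three replacements as the cleanup gauntlet above and runs them over a few response shapes; `cleanup`, `parses`, and the sample payload are illustrative names, and the fence marker is built programmatically only so the example can sit inside a fenced block.

```typescript
// Three backticks, built at runtime.
const FENCE = "`".repeat(3);

// Same three replacements as the "cleanup gauntlet" above.
function cleanup(text: string): string {
  return text
    .replace(new RegExp(FENCE + "json\\n?", "g"), "")
    .replace(new RegExp(FENCE + "\\n?", "g"), "")
    .replace(/^\s*Here.*?:\s*/i, "")
    .trim();
}

// True when the cleaned text survives JSON.parse.
function parses(text: string): boolean {
  try {
    JSON.parse(cleanup(text));
    return true;
  } catch {
    return false;
  }
}

const payload = '{"title":"ok"}';

const results = {
  // Handled: the two shapes the regexes were written for.
  lowercaseFence: parses(`${FENCE}json\n${payload}\n${FENCE}`),
  preamble: parses(`Here's the JSON you requested:\n${payload}`),
  // Missed: leftovers ("JSON", "5", "<json>") reach JSON.parse and it throws.
  uppercaseFence: parses(`${FENCE}JSON\n${payload}\n${FENCE}`),
  json5Fence: parses(`${FENCE}json5\n${payload}\n${FENCE}`),
  xmlTags: parses(`<json>${payload}</json>`),
};
```

Each missed variant can be patched with another regex, which is exactly how the gauntlet grows.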
The Fix: generateObject() with Zod
import { generateObject } from "ai";
import { google } from "@ai-sdk/google";
import { z } from "zod";
const ArticleSchema = z.object({
title: z.string().min(10).max(200),
slug: z
.string()
.regex(/^[a-z0-9-]+$/)
.max(100),
sections: z
.array(
z.object({
heading: z.string().min(1),
content: z.string().min(50),
wordCount: z.number().int().positive(),
})
)
.min(3)
.max(20),
seoMeta: z.object({
description: z.string().max(160),
keywords: z.array(z.string()).min(1).max(10),
}),
readingTimeMinutes: z.number().positive(),
});
type Article = z.infer<typeof ArticleSchema>;
async function generateArticle(prompt: string): Promise<Article> {
const { object, usage } = await generateObject({
model: google("gemini-2.5-flash"),
schema: ArticleSchema,
prompt,
});
// object is Article -- fully validated, fully typed
// usage carries token counts for cost tracking
// (promptTokens/completionTokens in AI SDK 4; inputTokens/outputTokens in AI SDK 5)
return object;
}
What changed:
| Before (manual) | After (generateObject) |
|---|---|
| Raw `fetch()` to provider API | Provider-agnostic SDK call |
| Regex cleanup of markdown fences | No cleanup needed: the response is never a raw string |
| `JSON.parse()` in try/catch | SDK handles parsing and validation |
| Manual field-by-field validation | Zod schema validates structure, types, and constraints |
| `as Article` type assertion | `z.infer<typeof ArticleSchema>`: type derived from schema |
| No retry on malformed response | SDK throws a typed error on validation failure, so retry or fallback lives in one place |
| Token counts discarded | `usage` object with prompt/completion token counts |
| 40+ lines of parsing boilerplate | 0 lines of parsing code |
How It Works Under the Hood
When you call generateObject(), the Vercel AI SDK:
- Converts your Zod schema to a JSON Schema and sends it to the model as a `response_format` constraint (for models that support structured output) or as a function call schema (for models that support function calling).
- The model generates tokens constrained to the schema. With providers that enforce the schema during decoding, this is not post-processing: the model's token sampling is guided by the schema structure, making it far more reliable than asking for JSON in the prompt.
- Validates the response against your Zod schema. If validation fails (e.g., a string is too long, a number is negative), the SDK throws rather than returning the object. As far as the SDK documentation describes it, `maxRetries` covers failed API calls, not schema failures, so re-prompting with the error message is a caller-side decision.
- Returns a fully typed object. The return type is `z.infer<typeof YourSchema>`, so you never touch `JSON.parse` or `as`.
// The SDK handles all of these failure modes automatically:
//
// 1. Model returns markdown-wrapped JSON -> structured output bypasses this
// 2. Model returns extra fields -> Zod strips them (.strict() to reject)
// 3. Model returns wrong types -> Zod validation catches it, SDK throws
// 4. Model returns partial object -> Zod validation catches it, SDK throws
// 5. Model returns nothing -> SDK throws with clear error
//
// You write no parsing code. You handle one typed error in one place.
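Where retry-on-invalid-output is something you have to own yourself (for example around a raw provider call), the loop can be sketched as follows. This is a dependency-free illustration, not the SDK's internals: `generateValidated`, `validateTitle`, and `fakeModel` are hypothetical names, and the validator stands in for a Zod schema.

```typescript
// Result of validating one raw model response.
type Validation<T> = { ok: true; value: T } | { ok: false; error: string };

// Call the model, validate, and on failure retry with the validation error
// appended to the original prompt. Throws after maxAttempts failures.
async function generateValidated<T>(
  callModel: (prompt: string) => Promise<string>,
  validate: (raw: string) => Validation<T>,
  prompt: string,
  maxAttempts = 3
): Promise<T> {
  let currentPrompt = prompt;
  let lastError = "no attempts made";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await callModel(currentPrompt);
    const result = validate(raw);
    if (result.ok) return result.value;
    lastError = result.error;
    currentPrompt = `${prompt}\n\nYour previous answer was rejected: ${result.error}`;
  }
  throw new Error(`Validation failed after ${maxAttempts} attempts: ${lastError}`);
}

// Stand-in validator: parse JSON and require a string field named title.
function validateTitle(raw: string): Validation<{ title: string }> {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { ok: false, error: "response was not valid JSON" };
  }
  const title = (parsed as { title?: unknown } | null)?.title;
  return typeof title === "string"
    ? { ok: true, value: { title } }
    : { ok: false, error: "missing string field: title" };
}

// Fake model for the example: first answer is malformed, second is valid.
const prompts: string[] = [];
const fakeModel = async (p: string): Promise<string> => {
  prompts.push(p);
  return prompts.length === 1 ? "Sure! Here you go" : '{"title":"Monorepo patterns"}';
};
```

Bounding the attempts matters as much as retrying: every retry is another billed call, so an unbounded loop turns a parsing bug into a cost bug.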
When You Still Need Manual Parsing
There are two legitimate cases:
- Streaming partial objects. If you need to display results as they stream in, `streamObject()` gives you partial objects during generation. But the final object is still validated.
- Legacy provider APIs. If you are using a provider the AI SDK does not support, you are back to raw `fetch()`. But even then, use Zod to validate after parsing; do not use `as`.
// If you MUST parse manually, at least validate with Zod
const raw = JSON.parse(responseText);
const result = ArticleSchema.safeParse(raw);
if (!result.success) {
console.error("Validation failed:", result.error.issues);
// Retry, fall back, or fail explicitly -- never pass invalid data downstream
throw new Error(`LLM response failed validation: ${result.error.message}`);
}
// result.data is Article -- safe to use
return result.data;
Anti-Pattern 2: Raw LLM Calls Without Cost Tracking
The Problem
Every LLM call costs money. The cost varies by model, by token count, by whether the model used "thinking" mode, by whether prompt caching kicked in. When you call provider APIs directly from your application code, you have no record of what anything cost.
// The pattern that burned $47 in 5 days
// Found in pages-plus: 6 direct Gemini calls, 9 crons, zero cost tracking
async function generateBlogPost(keyword: string, env: Env): Promise<BlogPost> {
// Direct call to Gemini -- no proxy, no metering, no attribution
const response = await fetch(
`https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent`,
{
method: "POST",
headers: {
"Content-Type": "application/json",
"x-goog-api-key": env.GEMINI_API_KEY,
},
body: JSON.stringify({
contents: [
{
parts: [
{
text: `Write a comprehensive blog post about: ${keyword}`,
},
],
},
],
}),
}
);
const data = await response.json();
// No cost tracking. No token counting. No attribution.
// This call cost somewhere between $0.01 and $0.30.
// Nobody will ever know.
return parseBlogPost(data);
}
War Story: The $47-in-5-Days Gemini Disaster
Here is what happened:
- pages-plus had 9 cron triggers in its `wrangler.jsonc`. They ran every 5-30 minutes.
- Each cron triggered content generation: blog outlines, blog drafts, blog edits, anchor text generation, meta description generation, internal link suggestions.
- Each generation called Gemini directly using `env.GEMINI_API_KEY`.
- There was no `api_calls_ledger` table. No cost tracking. No daily limit.
- In 5 days, the pipeline generated 193 blog posts. No human reviewed them. No human knew they existed.
- The Gemini billing dashboard showed $37 from pages-plus alone.
- A second project (aso-mrr) contributed another $10.19.
- Total: $47 in 5 days. $282/month run rate.
- Discovered by manually checking the Google Cloud billing dashboard during a routine ops session.
- Emergency response: All 13 crons across both projects disabled the same day. 14 API keys deleted from worker configurations. New standard written. 12 issues created to implement metering before any cron could be re-enabled.
The Fix: Centralized LLM Proxy with Per-Call Metering
The architecture that prevents this:
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  pages-plus  │     │scalable-media│     │   aso-mrr    │
│ (no API key) │     │ (no API key) │     │ (no API key) │
└──────┬───────┘     └──────┬───────┘     └──────┬───────┘
       │                    │                    │
       │ X-Project-Id       │ X-Function         │ X-Tags
       │ X-Api-Key          │ X-Api-Key          │ X-Api-Key
       │                    │                    │
       ▼                    ▼                    ▼
┌────────────────────────────────────────────────────────┐
│                       API Proxy                        │
│                                                        │
│  ┌─────────────┐  ┌──────────────┐  ┌───────────────┐  │
│  │   Auth +    │  │  Cost Calc   │  │  Daily Limit  │  │
│  │   Routing   │  │ (all tiers)  │  │  Enforcement  │  │
│  └─────────────┘  └──────────────┘  └───────────────┘  │
│                                                        │
│  ┌─────────────┐  ┌──────────────┐  ┌───────────────┐  │
│  │  Provider   │  │    Ledger    │  │     Cache     │  │
│  │    Keys     │  │(D1/Postgres) │  │  (optional)   │  │
│  └─────────────┘  └──────────────┘  └───────────────┘  │
└───────────────────────────┬────────────────────────────┘
                            │
                            ▼
               ┌──────────────────────────┐
               │      Provider APIs       │
               │  (Gemini, OpenAI, etc.)  │
               └──────────────────────────┘
The proxy worker:
// api-proxy/src/routes/gemini.ts
import { Hono } from "hono";
import { z } from "zod";
const app = new Hono<{ Bindings: Env }>();
const GeminiProxySchema = z.object({
model: z.string().regex(/^[a-z0-9.-]+$/), // interpolated into the upstream URL, so restrict its shape
contents: z.array(z.unknown()),
generationConfig: z.record(z.string(), z.unknown()).optional(),
});
app.post("/v1/gemini/generateContent", async (c) => {
const projectId = c.req.header("X-Project-Id");
const apiKey = c.req.header("X-Api-Key");
const functionTag = c.req.header("X-Function") ?? "unknown";
const tags = c.req.header("X-Tags")?.split(",") ?? [];
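// 1. Authenticate the calling project (reject early if credentials are missing)
if (!projectId || !apiKey) {
return c.json({ error: "unauthorized" }, 401);
}
const project = await authenticateProject(projectId, apiKey, c.env.DB);
if (!project) {
return c.json({ error: "unauthorized" }, 401);
}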
// 2. Check daily spend limit BEFORE making the call
const todaySpend = await getDailySpend(projectId, "gemini", c.env.DB);
if (todaySpend >= project.daily_limit_usd) {
return c.json(
{
error: "daily_limit_exceeded",
spend: todaySpend,
limit: project.daily_limit_usd,
},
429
);
}
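// 3. Parse and validate the request (malformed input is a 400, not an unhandled exception)
const parsed = GeminiProxySchema.safeParse(await c.req.json().catch(() => null));
if (!parsed.success) {
return c.json({ error: "validation_failed", issues: parsed.error.issues }, 400);
}
const body = parsed.data;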
// 4. Forward to Gemini using the proxy's API key (not the caller's)
const geminiResponse = await fetch(
`https://generativelanguage.googleapis.com/v1beta/models/${body.model}:generateContent`,
{
method: "POST",
headers: {
"Content-Type": "application/json",
"x-goog-api-key": c.env.GEMINI_API_KEY, // Only the proxy has the key
},
body: JSON.stringify({
contents: body.contents,
generationConfig: body.generationConfig,
}),
}
);
const result = await geminiResponse.json();
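// 5. Extract token counts from ALL tiers
// Gemini's promptTokenCount includes cached tokens, so subtract them to avoid pricing them twice
const usage = result.usageMetadata;
const cachedTokens = usage?.cachedContentTokenCount ?? 0;
const cost = calculateGeminiCost(body.model, {
input: Math.max(0, (usage?.promptTokenCount ?? 0) - cachedTokens),
output: usage?.candidatesTokenCount ?? 0,
thinking: usage?.thoughtsTokenCount ?? 0,
cacheRead: cachedTokens,
});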
// 6. Record in the ledger -- every call, every tier, every tag
await c.env.DB.prepare(
`INSERT INTO api_calls_ledger
(id, project_id, service, model, function, tags,
input_tokens, output_tokens, thinking_tokens, cache_read_tokens,
cost_usd, created_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)`
)
.bind(
crypto.randomUUID(),
projectId,
"gemini",
body.model,
functionTag,
JSON.stringify(tags),
usage?.promptTokenCount ?? 0,
usage?.candidatesTokenCount ?? 0,
usage?.thoughtsTokenCount ?? 0,
usage?.cachedContentTokenCount ?? 0,
cost,
new Date().toISOString()
)
.run();
// 7. Return the response to the caller
return c.json(result);
});
function calculateGeminiCost(
model: string,
tokens: {
input: number;
output: number;
thinking: number;
cacheRead: number;
}
): number {
// Gemini 2.5 Flash pricing (as of 2026-03)
const pricing: Record<string, { input: number; output: number; thinking: number; cacheRead: number }> = {
"gemini-2.5-flash": {
input: 0.15, // per 1M tokens
output: 0.60,
thinking: 3.50,
cacheRead: 0.0375,
},
"gemini-2.5-pro": {
input: 1.25,
output: 10.0,
thinking: 10.0,
cacheRead: 0.3125,
},
};
const p = pricing[model] ?? pricing["gemini-2.5-flash"];
return (
(tokens.input * p.input +
tokens.output * p.output +
tokens.thinking * p.thinking +
tokens.cacheRead * p.cacheRead) /
1_000_000
);
}
The calling worker now looks like this:
// pages-plus/src/domain/content.ts
// No GEMINI_API_KEY. No direct provider calls. No cost math.
async function generateBlogPost(keyword: string, env: Env): Promise<BlogPost | null> {
const response = await fetch(`${env.API_PROXY_URL}/v1/gemini/generateContent`, {
method: "POST",
headers: {
"Content-Type": "application/json",
"X-Project-Id": "pages-plus",
"X-Api-Key": env.API_PROXY_KEY,
"X-Function": "blog-post-generate",
"X-Tags": `brand:${keyword},trigger:cron`,
},
body: JSON.stringify({
model: "gemini-2.5-flash",
contents: [
{ parts: [{ text: `Write a blog post about: ${keyword}` }] },
],
}),
});
if (response.status === 429) {
// Daily limit hit -- stop gracefully, don't retry
console.log(`Daily spend limit reached for pages-plus`);
return null;
}
return parseBlogPost(await response.json());
}
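The proxy's step 2 relies on a `getDailySpend` helper that is not shown. As a rough, dependency-free sketch of the logic it needs, the version below works over an in-memory array instead of D1; `SpendRow`, `checkBudget`, and the sample rows are hypothetical, and a real implementation would more likely be a `SUM(cost_usd)` query over `api_calls_ledger` filtered by project, service, and day.

```typescript
// Hypothetical ledger row: only the fields the budget check needs.
interface SpendRow {
  project_id: string;
  service: string;
  cost_usd: number;
  created_at: string; // ISO 8601 in UTC, as written by toISOString()
}

// Total spend for one project and service on the UTC calendar day of `now`.
function getDailySpend(rows: SpendRow[], projectId: string, service: string, now: Date): number {
  const day = now.toISOString().slice(0, 10); // "YYYY-MM-DD"
  return rows
    .filter((r) => r.project_id === projectId && r.service === service && r.created_at.startsWith(day))
    .reduce((sum, r) => sum + r.cost_usd, 0);
}

// Decide before the provider call whether the request may proceed.
// Mirrors the proxy's rule: spend >= limit means the request is refused.
function checkBudget(spentToday: number, dailyLimitUsd: number): { allowed: boolean; remainingUsd: number } {
  return {
    allowed: spentToday < dailyLimitUsd,
    remainingUsd: Math.max(0, dailyLimitUsd - spentToday),
  };
}

// Made-up sample data for illustration.
const ledger: SpendRow[] = [
  { project_id: "pages-plus", service: "gemini", cost_usd: 1.5, created_at: "2026-03-15T01:00:00.000Z" },
  { project_id: "pages-plus", service: "gemini", cost_usd: 0.5, created_at: "2026-03-15T09:30:00.000Z" },
  { project_id: "pages-plus", service: "gemini", cost_usd: 4.0, created_at: "2026-03-14T23:59:00.000Z" },
  { project_id: "aso-mrr", service: "gemini", cost_usd: 3.0, created_at: "2026-03-15T02:00:00.000Z" },
];

const spentToday = getDailySpend(ledger, "pages-plus", "gemini", new Date("2026-03-15T12:00:00Z"));
const atLimit = checkBudget(spentToday, 2);
const underLimit = checkBudget(spentToday, 5);
```

Note that a read-then-call check like this is not atomic: concurrent requests can each pass the check and overshoot the limit slightly, which is usually acceptable for a daily cap but worth knowing.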
Available Gateway Options
You do not have to build this yourself. Several production-ready options exist:
Cloudflare AI Gateway: Free tier, sits between your app and provider APIs. Provides analytics, caching, rate limiting, and logging. One line of code to add. Best if you are already on Cloudflare.
LiteLLM Proxy: Open-source Python proxy supporting 100+ LLM providers with OpenAI-compatible API. Cost tracking per key/user/team. Tag-based budget management. Best if you want self-hosted control with multi-provider support.
Custom proxy: Build your own (as shown above) when you need specific attribution logic, integration with your existing database, or edge-native deployment. Best when existing gateways do not match your metering requirements.
Anti-Pattern 3: Thinking Token Pricing Blindspot
The Problem
Modern LLMs have multiple token tiers with wildly different prices. When you track cost as `(input + output) * price`, you are using a formula from 2023. The real formula in 2025+ has 3-5 tiers:
| Token Tier | Gemini 2.5 Flash Price | What It Is |
|---|---|---|
| Input | $0.15/M | Prompt tokens sent to the model |
| Output | $0.60/M | Generated response tokens |
| Thinking | $3.50/M | Internal reasoning tokens (not shown to user) |
| Cache Read | $0.0375/M | Tokens served from prompt cache |
| Cache Write | $0.15/M | Tokens written to prompt cache |
Thinking tokens cost 5.8x as much as output tokens ($3.50/M vs $0.60/M). And for complex tasks (SEO analysis, content planning, multi-step reasoning), thinking tokens can outnumber output tokens 3-to-1.
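Combining those two numbers gives the effective price of a visible output token. The helper below is a small illustrative calculation using the table's prices and the 3-to-1 ratio mentioned above; `effectiveOutputPrice` is a made-up name, not a provider API.

```typescript
// Effective price per 1M visible output tokens when each output token is
// accompanied by `thinkingPerOutput` thinking tokens billed at their own rate.
function effectiveOutputPrice(
  outputPrice: number,
  thinkingPrice: number,
  thinkingPerOutput: number
): number {
  return outputPrice + thinkingPerOutput * thinkingPrice;
}

const nominal = 0.6; // $/M output tokens, from the table above
const effective = effectiveOutputPrice(nominal, 3.5, 3); // 3 thinking tokens per output token
const multiplier = effective / nominal;
```

At a 3-to-1 ratio, each million visible output tokens effectively costs about $11.10 rather than $0.60, roughly 18.5x the sticker price. That gap is what a two-tier formula hides.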
War Story: The 9x Undercount
- aso-mrr tracked its Gemini costs in an internal ledger.
- The cost formula was: `(input_tokens * 0.15 + output_tokens * 0.60) / 1_000_000`
- The formula did not include thinking tokens.
- Over 5 days, the internal ledger showed $1.14 in total spend.
- The Gemini billing dashboard showed $10.19.
- That is a 9x undercount.
- Root cause: Gemini Flash Thinking charges $3.50/M for thinking tokens. Those tokens were being generated on every call but not counted in the cost formula. For reasoning-heavy tasks like app store optimization analysis, thinking tokens dominated the bill.
The Fix: Track All Token Tiers
interface TokenUsage {
input: number;
output: number;
thinking: number;
cacheRead: number;
cacheWrite: number;
}
interface ModelPricing {
input: number; // per 1M tokens
output: number;
thinking: number;
cacheRead: number;
cacheWrite: number;
}
// Pricing table -- update when providers change prices
const MODEL_PRICING: Record<string, ModelPricing> = {
"gemini-2.5-flash": {
input: 0.15,
output: 0.60,
thinking: 3.50,
cacheRead: 0.0375,
cacheWrite: 0.15,
},
"gemini-2.5-pro": {
input: 1.25,
output: 10.0,
thinking: 10.0,
cacheRead: 0.3125,
cacheWrite: 1.25,
},
"claude-sonnet-4-20250514": {
input: 3.0,
output: 15.0,
thinking: 0, // Claude bills extended-thinking tokens as output tokens (already counted in output), so no separate tier
cacheRead: 0.30,
cacheWrite: 3.75,
},
"gpt-4o": {
input: 2.50,
output: 10.0,
thinking: 0, // GPT-4o does not have a separate thinking tier
cacheRead: 1.25,
cacheWrite: 2.50,
},
};
function calculateCost(model: string, usage: TokenUsage): number {
const pricing = MODEL_PRICING[model];
if (!pricing) {
console.warn(`Unknown model pricing: ${model}, using gemini-2.5-flash as fallback`);
return calculateCost("gemini-2.5-flash", usage);
}
return (
(usage.input * pricing.input +
usage.output * pricing.output +
usage.thinking * pricing.thinking +
usage.cacheRead * pricing.cacheRead +
usage.cacheWrite * pricing.cacheWrite) /
1_000_000
);
}
// Extract token counts from Gemini response
function extractGeminiUsage(response: GeminiResponse): TokenUsage {
const meta = response.usageMetadata;
const cacheRead = meta?.cachedContentTokenCount ?? 0;
return {
// promptTokenCount includes cached tokens; subtract so they are not priced twice
input: Math.max(0, (meta?.promptTokenCount ?? 0) - cacheRead),
output: meta?.candidatesTokenCount ?? 0,
thinking: meta?.thoughtsTokenCount ?? 0, // THIS IS THE ONE PEOPLE MISS
cacheRead,
cacheWrite: 0, // Gemini does not report cache writes in response
};
}
// Example: how a single reasoning-heavy call gets undercounted
// (the 9x figure above is the aggregate; individual calls can be worse)
const usage: TokenUsage = {
input: 5000,
output: 2000,
thinking: 8000, // 4x the output -- typical for reasoning tasks
cacheRead: 0,
cacheWrite: 0,
};
const wrongCost = (usage.input * 0.15 + usage.output * 0.60) / 1_000_000;
// $0.00195 -- what the broken formula reported
const correctCost = calculateCost("gemini-2.5-flash", usage);
// $0.02995 -- about 15.4x higher
// Over hundreds of calls per day, this delta becomes tens of dollars
The Ledger Schema
CREATE TABLE api_calls_ledger (
id TEXT PRIMARY KEY,
project_id TEXT NOT NULL,
service TEXT NOT NULL, -- 'gemini', 'openai', 'anthropic'
model TEXT NOT NULL, -- 'gemini-2.5-flash'
function TEXT NOT NULL, -- 'article-write', 'seo-analysis'
tags TEXT DEFAULT '[]', -- JSON array: ['brand:llc-tax', 'trigger:cron']
-- Token tiers -- NEVER lump these together
input_tokens INTEGER DEFAULT 0,
output_tokens INTEGER DEFAULT 0,
thinking_tokens INTEGER DEFAULT 0,
cache_read_tokens INTEGER DEFAULT 0,
cache_write_tokens INTEGER DEFAULT 0,
-- Calculated cost
cost_usd REAL NOT NULL,
-- Metadata
latency_ms INTEGER,
status_code INTEGER,
created_at TEXT NOT NULL
);
-- Indexes for querying (id is already unique as the primary key)
CREATE INDEX idx_ledger_project_date ON api_calls_ledger(project_id, created_at);
CREATE INDEX idx_ledger_function ON api_calls_ledger(function, created_at);
CREATE INDEX idx_ledger_service ON api_calls_ledger(service, created_at);
Monthly Reconciliation Query
-- Compare tracked costs to provider bill
SELECT
service,
model,
COUNT(*) as calls,
SUM(input_tokens) as total_input,
SUM(output_tokens) as total_output,
SUM(thinking_tokens) as total_thinking,
ROUND(SUM(cost_usd), 2) as tracked_cost,
-- Compare this to your provider's billing dashboard
-- If difference > 20%, the pricing formula is wrong
strftime('%Y-%m', created_at) as month
FROM api_calls_ledger
GROUP BY service, model, month
ORDER BY tracked_cost DESC;
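The comparison step in that query's comments can be automated. The following is a minimal sketch of the check, using the 20% threshold mentioned above; measuring drift relative to the billed amount is an assumption of this example, and `reconcile` is a hypothetical helper.

```typescript
interface Reconciliation {
  ratio: number;        // billed / tracked (Infinity when nothing was tracked)
  driftPct: number;     // |billed - tracked| as a percentage of billed
  needsReview: boolean; // true when drift exceeds the tolerance
}

// Compare ledger totals with the provider's bill for the same period.
function reconcile(trackedUsd: number, billedUsd: number, tolerancePct = 20): Reconciliation {
  const ratio = trackedUsd > 0 ? billedUsd / trackedUsd : Infinity;
  const driftPct = billedUsd > 0 ? (Math.abs(billedUsd - trackedUsd) / billedUsd) * 100 : 0;
  return { ratio, driftPct, needsReview: driftPct > tolerancePct };
}

// The aso-mrr numbers from the war story: $1.14 tracked vs $10.19 billed.
const asoMrr = reconcile(1.14, 10.19);

// A healthy month for comparison (made-up numbers).
const healthy = reconcile(9.8, 10.0);
```

Run against the war-story numbers, the check flags the month for review with a billed-to-tracked ratio just under 9, which is the 9x undercount in numeric form.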
Anti-Pattern 4: God Modules
The Problem
AI applications grow fast. You start with a function that calls an LLM. Then you add preprocessing. Then postprocessing. Then validation. Then caching. Then a different LLM call for a related task. Then another. Before you know it, you have a 1,126-line file that does five unrelated things.
// src/lib/seo-engine.ts -- 1,126 lines
// This file does ALL of the following:
//
// 1. SEO analysis (fetch page, parse HTML, score SEO factors)
// 2. Content scoring (call Gemini to rate content quality)
// 3. Keyword extraction (parse content, TF-IDF, call Gemini for related terms)
// 4. SERP parsing (fetch Google results, parse structured data)
// 5. Site audits (crawl pages, check redirects, validate meta tags)
//
// When a bug appears in keyword extraction, you read 1,126 lines.
// When you want to test SERP parsing, you import a module that also
// initializes Gemini clients, HTML parsers, and HTTP pools.
// When you refactor content scoring, you risk breaking site audits
// because they share 4 utility functions defined at the bottom.
export async function analyzeSEO(url: string, env: Env) {
// 200 lines of fetching and parsing...
}
export async function scoreContent(content: string, env: Env) {
// 150 lines of Gemini calls and scoring...
}
export async function extractKeywords(text: string, env: Env) {
// 180 lines of NLP and LLM calls...
}
export async function parseSERP(query: string, env: Env) {
// 250 lines of Google result parsing...
}
export async function auditSite(domain: string, env: Env) {
// 346 lines of crawling and validation...
}
// Plus 50 lines of shared utilities at the bottom
function cleanHtml(html: string): string { /* ... */ }
function extractMetaTags(html: string): MetaTags { /* ... */ }
function normalizeUrl(url: string): string { /* ... */ }
function calculateScore(factors: Factor[]): number { /* ... */ }
Why This Matters More for AI Code
God modules are bad in any codebase. They are especially bad in AI codebases because:
- LLM calls are expensive to test. If your keyword extraction test imports a module that also initializes a Gemini client for content scoring, your test either pays for an unnecessary LLM call or requires mocking infrastructure you should not need.
- LLM integrations change frequently. Provider APIs change, models get updated, pricing changes. When the Gemini API adds a new parameter, you edit a 1,126-line file. The blast radius is five features.
- Prompt engineering requires iteration. Improving a prompt for content scoring should not require reading through SERP parsing code. When prompts live in god modules, iteration is slow because the context load is high.
- Cost attribution is impossible. If five functions in one file all call Gemini, and your cost tracking tags by file/module, you cannot distinguish a $0.01 keyword extraction from a $0.30 site audit.
The Fix: Single-Responsibility Modules Under 300 Lines
src/
domain/
seo-analysis.ts (120 lines)
content-scoring.ts (90 lines)
keyword-extraction.ts (110 lines)
serp-parser.ts (140 lines)
site-audit.ts (180 lines)
shared/
html-utils.ts (40 lines)
scoring.ts (30 lines)
url-utils.ts (20 lines)
Each module:
// src/domain/keyword-extraction.ts -- 110 lines
// Single responsibility: extract keywords from text using NLP + LLM
import { generateObject } from "ai";
import { google } from "@ai-sdk/google";
import { z } from "zod";
import { cleanHtml } from "../shared/html-utils";
const KeywordResultSchema = z.object({
primary: z.array(
z.object({
term: z.string(),
relevance: z.number().min(0).max(1),
searchVolume: z.enum(["high", "medium", "low", "unknown"]),
})
),
related: z.array(z.string()),
topics: z.array(z.string()),
});
export type KeywordResult = z.infer<typeof KeywordResultSchema>;
export async function extractKeywords(
text: string,
options?: { maxKeywords?: number }
): Promise<KeywordResult> {
const cleaned = cleanHtml(text);
const max = options?.maxKeywords ?? 20;
const { object } = await generateObject({
model: google("gemini-2.5-flash"),
schema: KeywordResultSchema,
prompt: `Extract the top ${max} keywords from this text. Rate each by relevance (0-1) and estimate search volume.\n\nText:\n${cleaned}`,
});
return object;
}
// src/domain/content-scoring.ts -- 90 lines
// Single responsibility: score content quality using LLM evaluation
import { generateObject } from "ai";
import { google } from "@ai-sdk/google";
import { z } from "zod";
const ContentScoreSchema = z.object({
overall: z.number().min(0).max(100),
factors: z.object({
readability: z.number().min(0).max(100),
depth: z.number().min(0).max(100),
accuracy: z.number().min(0).max(100),
actionability: z.number().min(0).max(100),
}),
suggestions: z.array(z.string()).max(5),
});
export type ContentScore = z.infer<typeof ContentScoreSchema>;
export async function scoreContent(
content: string,
context?: { targetAudience?: string; keyword?: string }
): Promise<ContentScore> {
const { object } = await generateObject({
model: google("gemini-2.5-flash"),
schema: ContentScoreSchema,
prompt: `Score this content on readability, depth, accuracy, and actionability (0-100 each).
${context?.keyword ? `Target keyword: ${context.keyword}` : ""}
${context?.targetAudience ? `Target audience: ${context.targetAudience}` : ""}
Content:
${content.substring(0, 10000)}`,
});
return object;
}
The Splitting Heuristic
When deciding how to split a god module:
- Group by data dependency. Functions that share the same inputs and outputs belong together. SEO analysis and SERP parsing both work on URLs, but one fetches pages and one fetches search results: different data sources, different modules.
- Group by change frequency. Prompt engineering for content scoring changes weekly. HTML parsing utilities change yearly. They should not be in the same file.
- Group by test boundary. If you need to mock Gemini to test keyword extraction but not SERP parsing, they should be in different modules so SERP parsing tests do not need Gemini mocks.
- Target under 300 lines. This is not arbitrary. 300 lines is roughly the amount of code a developer can hold in working memory while debugging. Above 300 lines, you start scrolling, which means you start losing context.
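The test-boundary rule is easiest to see with a pure function sitting next to an LLM-dependent one. The sketch below is illustrative only: `parseSerpTitles`, `extractKeywordsWith`, and the `LlmCall` type are hypothetical names, not code from the modules above, and the regex is deliberately naive.

```typescript
// serp-parser side: pure string processing, so its tests need no LLM mock.
function parseSerpTitles(html: string): string[] {
  const titles: string[] = [];
  const pattern = /<h3[^>]*>(.*?)<\/h3>/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    titles.push((match[1] ?? "").trim());
  }
  return titles;
}

// keyword-extraction side: depends on an LLM, so the call is injected and
// only this module's tests have to supply a fake.
type LlmCall = (prompt: string) => Promise<string[]>;

async function extractKeywordsWith(llm: LlmCall, text: string): Promise<string[]> {
  const keywords = await llm(`Extract keywords from:\n${text}`);
  return keywords.map((keyword) => keyword.toLowerCase());
}
```

Kept in separate modules, the parser's tests stay fast and offline, and a change to the prompt cannot break them.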
The Problem
Your Worker has 226 API endpoints. 223 of them look like this:
app.post("/v1/brands/:slug/content", async (c) => {
const body = await c.req.json();
// body is `any` -- TypeScript does not know its shape
// If body.title is undefined, it becomes NULL in the database
// If body.sections is a string instead of an array, the .map() call crashes
// If body.metadata.seo.keywords has 500 entries, the database write times out
const content = await createContent(body, c.env);
return c.json(content, 201);
});
The 3 that have validation look like this:
app.post("/v1/auth/login", async (c) => {
const body = await c.req.json();
// Manual validation -- tedious, incomplete, and no type narrowing
if (!body.email || typeof body.email !== "string") {
return c.json({ error: "email is required" }, 400);
}
if (!body.password || typeof body.password !== "string") {
return c.json({ error: "password is required" }, 400);
}
if (body.password.length < 8) {
return c.json({ error: "password must be at least 8 characters" }, 400);
}
// body is still `any` after all these checks
// TypeScript does not narrow `any` through manual checks
const user = await authenticateUser(body.email, body.password, c.env);
return c.json(user);
});
This is dangerous in AI applications specifically because:
- LLM prompts are constructed from user input. If `body.keyword` is actually an object instead of a string, your prompt becomes `Write about [object Object]`. The LLM processes garbage. You pay for the tokens.
- Database writes accept anything. If `body.sections` is a 50,000-character string instead of an array of section objects, D1 stores it. When another service reads it and calls `.map()`, the system crashes downstream.
- Queue messages propagate invalid data. An unvalidated request body gets wrapped in a queue message and sent to another service. That service also does not validate. The bad data now lives in two databases.
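The first failure mode takes only a few lines to reproduce. This is a minimal sketch: the JSON payload is made up, and `buildPrompt` is a hypothetical hand-written guard standing in for what a schema check does at the boundary.

```typescript
// Simulates `await c.req.json()` returning an unexpected shape.
const body: any = JSON.parse('{"keyword": {"term": "bank statement to excel"}}');

// No validation: the object is coerced with String(), not serialized.
const badPrompt = `Write about ${body.keyword}`;
// badPrompt is "Write about [object Object]"

// Minimal guard: refuse to build a prompt unless the field is a non-empty string.
function buildPrompt(keyword: unknown): string {
  if (typeof keyword !== "string" || keyword.length === 0) {
    throw new TypeError("keyword must be a non-empty string");
  }
  return `Write about ${keyword}`;
}
```

The template literal never throws, which is exactly the problem: the request succeeds, the tokens are billed, and nothing signals that the input was wrong.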
The Fix: Zod at Every Boundary
import { z } from "zod";
import { Hono } from "hono";
import type { Context, Next } from "hono";
// Define the schema ONCE -- it's both the validator and the type
const CreateContentSchema = z.object({
title: z.string().min(5).max(200),
keyword: z
.string()
.min(1)
.max(100)
.describe("Primary SEO keyword for content generation"),
sections: z
.array(
z.object({
heading: z.string().min(1).max(200),
instructions: z.string().max(1000).optional(),
})
)
.min(1)
.max(20),
metadata: z
.object({
seo: z
.object({
keywords: z.array(z.string().max(50)).max(10).default([]),
description: z.string().max(160).optional(),
})
.default({}),
publishAt: z.string().datetime().optional(),
})
.default({}),
generateWithAI: z.boolean().default(true),
});
type CreateContentInput = z.infer<typeof CreateContentSchema>;
// Middleware for validation
function validate<T extends z.ZodSchema>(schema: T) {
return async (c: Context, next: Next) => {
const result = schema.safeParse(await c.req.json());
if (!result.success) {
return c.json(
{
error: "validation_failed",
issues: result.error.issues.map((issue) => ({
path: issue.path.join("."),
code: issue.code,
message: issue.message,
})),
},
400
);
}
c.set("validated", result.data);
return next();
};
}
app.post(
"/v1/brands/:slug/content",
validate(CreateContentSchema),
async (c) => {
const input = c.get("validated") as CreateContentInput;
// input is fully typed, fully validated, fully constrained
// input.title is a string between 5-200 chars
// input.sections is an array with 1-20 items
// input.metadata.seo.keywords has at most 10 entries
// No casting, no manual checks, no surprises
const content = await createContent(input, c.env);
return c.json(content, 201);
}
);
Validation Middleware for Hono
Here is a reusable middleware pattern for Hono (a popular router on Cloudflare Workers):
// src/middleware/validation.ts
import { z } from "zod";
import type { MiddlewareHandler } from "hono";
export function validateBody<T extends z.ZodSchema>(
schema: T
): MiddlewareHandler {
return async (c, next) => {
let body: unknown;
try {
body = await c.req.json();
} catch {
return c.json({ error: "invalid_json", message: "Request body is not valid JSON" }, 400);
}
const result = schema.safeParse(body);
if (!result.success) {
return c.json(
{
error: "validation_failed",
issues: result.error.issues.map((i) => ({
path: i.path.join("."),
code: i.code,
message: i.message,
})),
},
400
);
}
c.set("body", result.data);
return next();
};
}
export function validateParams<T extends z.ZodSchema>(
schema: T
): MiddlewareHandler {
return async (c, next) => {
const result = schema.safeParse(c.req.param());
if (!result.success) {
return c.json(
{
error: "invalid_params",
issues: result.error.issues.map((i) => ({
path: i.path.join("."),
message: i.message,
})),
},
400
);
}
c.set("params", result.data);
return next();
};
}
export function validateQuery<T extends z.ZodSchema>(
schema: T
): MiddlewareHandler {
return async (c, next) => {
const result = schema.safeParse(c.req.query());
if (!result.success) {
return c.json(
{
error: "invalid_query",
issues: result.error.issues.map((i) => ({
path: i.path.join("."),
message: i.message,
})),
},
400
);
}
c.set("query", result.data);
return next();
};
}
Usage across the codebase becomes consistent:
const BrandSlugParams = z.object({
slug: z.string().regex(/^[a-z0-9-]+$/).min(1).max(64),
});
const ListQuerySchema = z.object({
limit: z.coerce.number().int().min(1).max(100).default(20),
offset: z.coerce.number().int().min(0).default(0),
status: z.enum(["draft", "published", "archived"]).optional(),
});
app.get(
"/v1/brands/:slug/content",
validateParams(BrandSlugParams),
validateQuery(ListQuerySchema),
async (c) => {
const { slug } = c.get("params");
const { limit, offset, status } = c.get("query");
// All validated. All typed. All constrained.
return c.json(await listContent(slug, { limit, offset, status }, c.env));
}
);
The Problem
When multiple workers call the same LLM API, each worker defines its own response types:
// pages-plus/src/types/gemini.ts
interface GeminiApiResponse {
candidates: Array<{
content: { parts: Array<{ text: string }> };
finishReason: string;
}>;
usageMetadata: {
promptTokenCount: number;
candidatesTokenCount: number;
totalTokenCount: number;
};
}
// aso-mrr/src/types/gemini.ts -- COPY-PASTED, slightly different
interface GeminiResponse {
candidates: {
content: { parts: { text: string }[] };
finishReason: string;
}[];
usageMetadata: {
promptTokenCount: number;
candidatesTokenCount: number;
// Missing: thoughtsTokenCount -- this copy doesn't know about thinking tokens
totalTokenCount: number;
};
}
// scalable-media/src/lib/gemini.ts -- yet another copy
type GeminiResult = {
candidates: Array<{
content: { parts: Array<{ text: string }> };
}>;
usageMetadata?: {
promptTokenCount?: number;
candidatesTokenCount?: number;
};
};
// gatherfeed/src/services/ai.ts -- and another
// This one has thoughtsTokenCount but misspelled as thoughtTokenCount
Six files. Four slightly different versions. When Google adds a new field to the response, you update one copy and forget the others. When thinking tokens ship, three of the four copies miss the field. That is how you get a 9x cost undercount.
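To make the undercount concrete, here is a sketch of what the misspelled copy does at runtime. The rates are hypothetical, the token counts are sample values, and billing thinking tokens at the output rate is an assumption made for illustration; `trackedCost` and `billedCost` are illustrative names.

```typescript
// Usage block as the provider sends it (field names from the types above).
interface UsageMetadata {
  promptTokenCount: number;
  candidatesTokenCount: number;
  thoughtsTokenCount?: number;
}

// Hypothetical USD rates per 1M tokens, for illustration only.
const INPUT_RATE = 0.15;
const OUTPUT_RATE = 0.6;

// The misspelled copy: it looks up a key the payload never contains,
// always gets undefined, and silently prices thinking tokens at zero.
function trackedCost(usage: UsageMetadata): number {
  const raw = usage as unknown as Record<string, number | undefined>;
  const thinking = raw["thoughtTokenCount"] ?? 0; // typo: missing "s"
  const output = usage.candidatesTokenCount + thinking;
  return (usage.promptTokenCount * INPUT_RATE + output * OUTPUT_RATE) / 1_000_000;
}

// The corrected copy reads the field the provider actually sends.
function billedCost(usage: UsageMetadata): number {
  const output = usage.candidatesTokenCount + (usage.thoughtsTokenCount ?? 0);
  return (usage.promptTokenCount * INPUT_RATE + output * OUTPUT_RATE) / 1_000_000;
}

const sampleUsage: UsageMetadata = {
  promptTokenCount: 2340,
  candidatesTokenCount: 1200,
  thoughtsTokenCount: 3800,
};
```

With this sample, `trackedCost` reports roughly $0.0011 and `billedCost` roughly $0.0034, about a 3x gap caused by one missing letter. The gap widens as the share of thinking tokens grows.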
The Fix: Shared Type Modules
Option A: Monorepo shared package
// packages/shared/src/types/llm.ts
import { z } from "zod";
// Define once with Zod -- get runtime validation AND TypeScript type
export const GeminiUsageSchema = z.object({
promptTokenCount: z.number().default(0),
candidatesTokenCount: z.number().default(0),
thoughtsTokenCount: z.number().default(0),
cachedContentTokenCount: z.number().default(0),
totalTokenCount: z.number().default(0),
});
export const GeminiResponseSchema = z.object({
candidates: z.array(
z.object({
content: z.object({
parts: z.array(z.object({ text: z.string() })),
}),
finishReason: z.string(),
})
),
usageMetadata: GeminiUsageSchema.optional(),
});
export type GeminiUsage = z.infer<typeof GeminiUsageSchema>;
export type GeminiResponse = z.infer<typeof GeminiResponseSchema>;
// Helper to extract text from response
export function extractText(response: GeminiResponse): string | null {
return response.candidates?.[0]?.content?.parts?.[0]?.text ?? null;
}
// Helper to extract usage with defaults for all tiers
export function extractUsage(response: GeminiResponse): GeminiUsage {
return GeminiUsageSchema.parse(response.usageMetadata ?? {});
}
Every worker imports from the shared package:
// pages-plus/src/domain/content.ts
import { GeminiResponseSchema, extractText, extractUsage } from "@acme/shared/types/llm";
// aso-mrr/src/services/analysis.ts
import { GeminiResponseSchema, extractText, extractUsage } from "@acme/shared/types/llm";
// One definition. One import. One place to update.
Option B: Eliminate the type entirely with AI SDK
The better fix is to stop interacting with provider APIs directly. When you use the Vercel AI SDK, you never see a GeminiApiResponse. The SDK returns a normalized GenerateObjectResult or GenerateTextResult with consistent types across all providers.
import { generateObject } from "ai";
import { google } from "@ai-sdk/google";
const { object, usage } = await generateObject({
model: google("gemini-2.5-flash"),
schema: MySchema,
prompt: "...",
});
// usage has the same shape for every provider: { promptTokens, completionTokens, totalTokens }
// in AI SDK 4.x (renamed to inputTokens / outputTokens in 5.x)
// No GeminiApiResponse. No provider-specific types. No copy-paste.
Key insight: The best way to eliminate duplicated types is to eliminate the need for the type entirely. If you are defining `GeminiApiResponse` in your codebase, you are at the wrong abstraction level. Use an SDK that abstracts the provider.
The Problem
You are building an AI agent that performs multi-step tasks: research a topic, generate an outline, write content, review for quality, publish. Each step depends on the previous step. Some steps may fail and need retries. The whole sequence may take minutes. And you need to know where you are if the process crashes and restarts.
The manual approach:
// The hand-rolled state machine -- no persistence, no recovery, no observability
type AgentState = "idle" | "researching" | "outlining" | "writing" | "reviewing" | "publishing";
interface AgentContext {
state: AgentState;
keyword: string;
researchData?: ResearchResult;
outline?: ArticleOutline;
draft?: string;
reviewScore?: number;
error?: string;
retryCount: number;
}
async function runAgent(keyword: string, env: Env): Promise<void> {
const ctx: AgentContext = {
state: "idle",
keyword,
retryCount: 0,
};
try {
// Step 1: Research
ctx.state = "researching";
ctx.researchData = await doResearch(keyword, env);
// Step 2: Outline
ctx.state = "outlining";
ctx.outline = await generateOutline(ctx.researchData, env);
// Step 3: Write
ctx.state = "writing";
ctx.draft = await writeDraft(ctx.outline, env);
// Step 4: Review
ctx.state = "reviewing";
ctx.reviewScore = await reviewContent(ctx.draft, env);
if (ctx.reviewScore < 70) {
// Retry writing -- but what if it fails again?
// What if the process crashes here?
// What if we've already spent $0.50 on research and outlining?
ctx.state = "writing";
ctx.draft = await writeDraft(ctx.outline, env);
ctx.reviewScore = await reviewContent(ctx.draft, env);
}
// Step 5: Publish
ctx.state = "publishing";
await publishContent(ctx.draft, env);
} catch (err) {
ctx.error = err instanceof Error ? err.message : String(err);
// The state is in memory. If the process crashes, it's gone.
// There's no way to resume from the last successful step.
// The entire pipeline must restart from scratch.
// All the LLM calls (and their costs) are wasted.
console.log("Agent failed at state:", ctx.state, err);
}
}
The problems:
- No persistence. If the Worker crashes after step 3 (writing), all progress is lost. The research and outline cost money. That money is wasted.
- No observability. You cannot query "how many agents are in the reviewing state right now?" or "what was the last successful step for keyword X?"
- No recovery. When the process restarts, it starts from step 1. There is no way to resume from step 4.
- No concurrency control. Two cron ticks could start two agents for the same keyword. Both run to completion. You pay twice.
- No backpressure. If step 3 (writing) takes 30 seconds but step 1 (research) takes 2 seconds, research finishes far faster than writing can drain it, so writing requests pile up and saturate the LLM provider.
The Fix: Stateful Agent Runtimes
Cloudflare Agents SDK (Durable Objects with built-in state):
import { Agent } from "agents";
import { generateObject } from "ai";
import { google } from "@ai-sdk/google";
import { z } from "zod";
interface ContentAgentState {
status: "idle" | "researching" | "outlining" | "writing" | "reviewing" | "publishing" | "done" | "failed";
keyword: string;
researchData?: unknown;
outline?: unknown;
draft?: string;
reviewScore?: number;
completedSteps: string[];
totalCost: number;
error?: string;
}
const initialState: ContentAgentState = {
status: "idle",
keyword: "",
completedSteps: [],
totalCost: 0,
};
export class ContentAgent extends Agent<Env, ContentAgentState> {
initialState = initialState;
async startPipeline(keyword: string) {
this.setState({ ...this.state, keyword, status: "researching" });
try {
// Step 1: Research -- state persists automatically
if (!this.state.completedSteps.includes("research")) {
const research = await this.doResearch(keyword);
this.setState({
...this.state,
researchData: research,
completedSteps: [...this.state.completedSteps, "research"],
status: "outlining",
});
}
// Step 2: Outline
if (!this.state.completedSteps.includes("outline")) {
const outline = await this.generateOutline();
this.setState({
...this.state,
outline,
completedSteps: [...this.state.completedSteps, "outline"],
status: "writing",
});
}
// Step 3: Write
if (!this.state.completedSteps.includes("write")) {
const draft = await this.writeDraft();
this.setState({
...this.state,
draft,
completedSteps: [...this.state.completedSteps, "write"],
status: "reviewing",
});
}
// Step 4: Review
if (!this.state.completedSteps.includes("review")) {
const score = await this.reviewContent();
this.setState({
...this.state,
reviewScore: score,
completedSteps: [...this.state.completedSteps, "review"],
status: score >= 70 ? "publishing" : "writing",
});
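if (score < 70) {
// Clear both write and review so the draft is rewritten and re-scored.
// Caution: nothing caps this retry. A draft that never reaches 70
// recurses indefinitely and pays for new LLM calls each time, so track
// an attempt counter in state before relying on this in production.
this.setState({
...this.state,
completedSteps: this.state.completedSteps.filter(
(s) => s !== "write" && s !== "review"
),
});
// Re-run from write step
return this.startPipeline(keyword);
}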
}
// Step 5: Publish
if (!this.state.completedSteps.includes("publish")) {
await this.publishContent();
this.setState({
...this.state,
completedSteps: [...this.state.completedSteps, "publish"],
status: "done",
});
}
} catch (err) {
this.setState({
...this.state,
status: "failed",
error: err instanceof Error ? err.message : String(err),
});
// State is persisted even on failure.
// On retry, completedSteps tells us where to resume.
// Already-spent LLM costs are not wasted.
}
}
private async doResearch(keyword: string) {
const { object, usage } = await generateObject({
model: google("gemini-2.5-flash"),
schema: ResearchSchema,
prompt: `Research the topic: ${keyword}`,
});
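// Rates are illustrative USD per 1M tokens; check current provider pricing,
// and confirm whether completionTokens includes thinking tokens for your
// SDK version (see Anti-Pattern 3).
this.setState({
...this.state,
totalCost:
this.state.totalCost +
(usage.promptTokens * 0.15 + usage.completionTokens * 0.6) / 1_000_000,
});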
return object;
}
private async generateOutline() {
const { object } = await generateObject({
model: google("gemini-2.5-flash"),
schema: OutlineSchema,
prompt: `Create an outline based on this research: ${JSON.stringify(this.state.researchData)}`,
});
return object;
}
private async writeDraft() {
const { object } = await generateObject({
model: google("gemini-2.5-flash"),
schema: z.object({ content: z.string() }),
prompt: `Write the full article from this outline: ${JSON.stringify(this.state.outline)}`,
});
return object.content;
}
private async reviewContent(): Promise<number> {
const { object } = await generateObject({
model: google("gemini-2.5-flash"),
schema: z.object({ score: z.number().min(0).max(100), feedback: z.string() }),
prompt: `Score this content 0-100: ${this.state.draft?.substring(0, 5000)}`,
});
return object.score;
}
private async publishContent() {
await this.env.PUBLISH_QUEUE.send({
event_id: crypto.randomUUID(),
type: "content.publish",
source: "content-agent",
timestamp: new Date().toISOString(),
payload: {
keyword: this.state.keyword,
content: this.state.draft,
},
});
}
}
What the Agents SDK gives you:
| Concern | Manual State Machine | CF Agents SDK |
|---|---|---|
| State persistence | In-memory (lost on crash) | SQLite-backed (survives restarts) |
| Recovery | Start from scratch | Resume from last completed step |
| Concurrency | No dedup, double-runs | One DO instance per keyword |
| Observability | console.log | Query this.state via HTTP |
| Scheduling | Manual setTimeout | Built-in this.schedule() with alarms |
| Cost tracking | Manual counter | State tracks totalCost durably |
| Testing | Need to mock everything | Test each step independently |
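The "resume from last completed step" row is the core idea, and it does not depend on any particular runtime. Below is a minimal sketch of checkpointed execution. It is not the Agents SDK's implementation: `runPipeline`, `PipelineState`, and the `load`/`save` callbacks are hypothetical, and steps are synchronous only to keep the sketch short.

```typescript
// Stand-in for durable storage (a Durable Object's state, a database row, ...).
interface PipelineState {
  completedSteps: string[];
  results: Record<string, string>;
}

type Step = { name: string; run: (state: PipelineState) => string };

// Runs steps in order and skips any step already recorded as complete.
// State is saved after every step, so a crash loses at most the step in flight.
function runPipeline(
  steps: Step[],
  load: () => PipelineState,
  save: (state: PipelineState) => void
): PipelineState {
  let state = load();
  for (const step of steps) {
    if (state.completedSteps.includes(step.name)) continue;
    const output = step.run(state); // may throw: the saved state stays at the last checkpoint
    state = {
      completedSteps: [...state.completedSteps, step.name],
      results: { ...state.results, [step.name]: output },
    };
    save(state);
  }
  return state;
}
```

Calling `runPipeline` again after a failure re-reads the saved state and starts at the first step that is not in `completedSteps`, so earlier LLM calls are not repeated or re-billed.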
When Agents SDK Is Not the Right Fit
The Cloudflare Agents SDK runs on Cloudflare's edge network. If you are not on Cloudflare, or if your workflows span multiple cloud providers, consider:
- Temporal: Battle-tested durable execution. Workflows survive infrastructure failures. TypeScript SDK available. Best for complex, long-running workflows across services.
- Inngest: Event-driven, serverless-friendly. Steps are durable. Good TypeScript support. Best for event-triggered multi-step pipelines.
- LangGraph: Purpose-built for AI agent graphs. State management between nodes. Best for complex agent architectures with branching logic. It can be brittle in production, though: state management bugs and retry complexity are common complaints.
The Problem
Each Worker manages its own API keys for external providers:
// pages-plus/wrangler.jsonc
{
"vars": {
"GEMINI_API_KEY": "...",
"OPENAI_API_KEY": "...",
"BRAVE_API_KEY": "..."
}
}
// aso-mrr/wrangler.jsonc
{
"vars": {
"GEMINI_API_KEY": "...", // Same key or different? Who knows.
"DATAFORSEO_LOGIN": "...",
"DATAFORSEO_PASSWORD": "..."
}
}
// scalable-media/wrangler.jsonc
{
"vars": {
"GEMINI_API_KEY": "...", // Third copy of the key
"PERPLEXITY_API_KEY": "...",
"TWITTER_BEARER_TOKEN": "..."
}
}
// gatherfeed/wrangler.jsonc
{
"vars": {
"GEMINI_API_KEY": "...", // Fourth copy
"BRAVE_API_KEY": "...", // Second copy
"YOUTUBE_API_KEY": "..."
}
}
The problems compound:
- Key rotation requires N deployments. When you rotate a Gemini key, you update 4 workers. If you miss one, it breaks in production.
- No usage attribution. All 4 workers use the same Gemini key. The billing dashboard shows total spend but not which worker spent what.
- No rate limiting. Each worker hits Gemini independently. Four workers each making requests at the provider's rate limit means 4x the intended rate.
- Security surface. Every Worker's environment has every key it needs. If any Worker is compromised, all its keys are exposed. In our case, that was 14 keys across 4 workers.
- No centralized kill switch. When you discover $47 in unexpected spending, you cannot flip one switch. You must update and deploy 4 workers to remove the keys.
The Fix: Centralized Key Management
One service holds all provider keys. Every other service authenticates to this proxy with a project-specific credential.
// api-proxy/src/providers/registry.ts
interface ProviderConfig {
name: string;
baseUrl: string;
authHeader: string;
keyEnvVar: string;
rateLimit: { requests: number; windowMs: number };
}
const PROVIDERS: Record<string, ProviderConfig> = {
gemini: {
name: "Google Gemini",
baseUrl: "https://generativelanguage.googleapis.com/v1beta",
authHeader: "x-goog-api-key",
keyEnvVar: "GEMINI_API_KEY",
rateLimit: { requests: 60, windowMs: 60_000 },
},
openai: {
name: "OpenAI",
baseUrl: "https://api.openai.com/v1",
authHeader: "Authorization",
keyEnvVar: "OPENAI_API_KEY",
rateLimit: { requests: 100, windowMs: 60_000 },
},
brave: {
name: "Brave Search",
baseUrl: "https://api.search.brave.com/res/v1",
authHeader: "X-Subscription-Token",
keyEnvVar: "BRAVE_API_KEY",
rateLimit: { requests: 15, windowMs: 1_000 },
},
perplexity: {
name: "Perplexity",
baseUrl: "https://api.perplexity.ai",
authHeader: "Authorization",
keyEnvVar: "PERPLEXITY_API_KEY",
rateLimit: { requests: 20, windowMs: 60_000 },
},
};
// Project auth -- each calling service gets ONE credential
interface ProjectAuth {
projectId: string;
apiKey: string;
dailyLimitUsd: number;
allowedProviders: string[];
costTier: 1 | 2 | 3;
}
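With those two shapes in place, the proxy's per-request decision reduces to three checks. The sketch below is one way it could look, not the proxy's actual code: `authorize`, the `Decision` type, the status codes, and the `spentTodayUsd` argument are assumptions.

```typescript
interface ProjectAuth {
  projectId: string;
  apiKey: string;
  dailyLimitUsd: number;
  allowedProviders: string[];
  costTier: 1 | 2 | 3;
}

type Decision =
  | { allowed: true }
  | { allowed: false; httpStatus: 401 | 403 | 429; reason: string };

// Decides whether a calling service may reach a provider through the proxy.
// Note: a production proxy should compare keys in constant time.
function authorize(
  project: ProjectAuth | undefined,
  presentedKey: string,
  provider: string,
  spentTodayUsd: number
): Decision {
  if (!project || project.apiKey !== presentedKey) {
    return { allowed: false, httpStatus: 401, reason: "invalid_credentials" };
  }
  if (!project.allowedProviders.includes(provider)) {
    return { allowed: false, httpStatus: 403, reason: "provider_not_allowed" };
  }
  if (spentTodayUsd >= project.dailyLimitUsd) {
    return { allowed: false, httpStatus: 429, reason: "daily_limit_reached" };
  }
  return { allowed: true };
}
```

Because every provider call passes through this one function, usage attribution and spending limits live in a single place instead of four.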
Worker configurations become minimal:
// pages-plus/wrangler.jsonc -- AFTER consolidation
{
"vars": {
// NO provider API keys. Only the proxy credential.
"API_PROXY_URL": "https://api-proxy.your-domain.com",
"API_PROXY_KEY": "..."
}
}
Key rotation is now a single operation:
wrangler secret put GEMINI_API_KEY --name api-proxy
Kill switch is now a single operation:
// Disable a project's access to all providers
// One API call. Immediate effect. No deployments.
await db
.update(projects)
.set({ active: false })
.where(eq(projects.projectId, "pages-plus"));
The Problem
Your Workers use console.log for observability:
// What logging looks like in most Workers
app.post("/v1/content/generate", async (c) => {
console.log("Received generate request");
try {
const body = await c.req.json();
console.log("Generating content for:", body.keyword);
const result = await generateContent(body, c.env);
console.log("Content generated successfully");
return c.json(result);
} catch (err) {
console.log("Error generating content:", err instanceof Error ? err.message : String(err));
// Which request? What were the inputs? What was the state?
// The log says "Error generating content: fetch failed"
// Good luck debugging that in production.
return c.json({ error: "internal_error" }, 500);
}
});
The problems:
- No context propagation. Each `console.log` is a standalone string. There is no request ID linking them together. When 50 requests happen in parallel, the logs are interleaved and useless.
- No structured data. Logs are strings, not objects. You cannot filter by `status:error` or `function:content-generate` or `cost>0.10` because those fields do not exist.
- No level management. Everything is `console.log`. You cannot turn off debug logs in production or escalate warnings to error. There is no severity.
- No performance data. You do not know how long each operation took, what it cost, or what the token counts were. Those values exist in your code but never reach the logs.
The Fix: Pino in Browser Mode on Cloudflare Workers
Pino is one of the fastest Node.js loggers. Its browser mode works on Cloudflare Workers because it writes through the console methods (which Workers capture) while still giving you structured JSON, child loggers, and log levels.
// src/lib/logger.ts
import pino from "pino";
export function createLogger(service: string) {
return pino({
level: "info",
browser: {
asObject: true,
write: {
info: (o: object) => console.log(JSON.stringify(o)),
warn: (o: object) => console.warn(JSON.stringify(o)),
error: (o: object) => console.error(JSON.stringify(o)),
debug: (o: object) => console.debug(JSON.stringify(o)),
},
},
base: { service },
timestamp: pino.stdTimeFunctions.isoTime,
});
}
Usage with Hono middleware:
// src/middleware/logging.ts
import { createLogger } from "../lib/logger";
import type { MiddlewareHandler } from "hono";
export function requestLogger(service: string): MiddlewareHandler {
const baseLogger = createLogger(service);
return async (c, next) => {
const requestId = c.req.header("x-request-id") ?? crypto.randomUUID();
const start = Date.now();
// Create a child logger with request context
const log = baseLogger.child({
requestId,
method: c.req.method,
path: c.req.path,
});
// Attach to the context so handlers can use it
c.set("log", log);
c.set("requestId", requestId);
try {
await next();
const duration = Date.now() - start;
log.info({
msg: "request completed",
status: c.res.status,
duration,
});
} catch (err) {
const duration = Date.now() - start;
log.error({
msg: "request failed",
status: 500,
duration,
error: err instanceof Error ? err.message : String(err),
stack: err instanceof Error ? err.stack : undefined,
});
throw err;
}
};
}
Application wiring:
// src/index.ts
import { Hono } from "hono";
import { requestLogger } from "./middleware/logging";
const app = new Hono<{ Bindings: Env }>();
app.use("*", requestLogger("pages-plus"));
app.post("/v1/content/generate", async (c) => {
const log = c.get("log");
const body = await c.req.json();
log.info({ msg: "generating content", keyword: body.keyword });
const start = Date.now();
const result = await generateContent(body, c.env);
const duration = Date.now() - start;
log.info({
msg: "content generated",
keyword: body.keyword,
wordCount: result.wordCount,
duration,
cost: result.cost,
model: result.model,
tokens: result.usage,
});
return c.json(result);
});
What the output looks like:
{
"level": 30,
"time": "2026-03-15T10:23:45.123Z",
"service": "pages-plus",
"requestId": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"method": "POST",
"path": "/v1/content/generate",
"msg": "content generated",
"keyword": "bank statement to excel",
"wordCount": 1847,
"duration": 4523,
"cost": 0.0034,
"model": "gemini-2.5-flash",
"tokens": {
"input": 2340,
"output": 1200,
"thinking": 3800
}
}
Now you can:
- Correlate: Find all logs for request `a1b2c3d4` across middleware, handler, and domain functions
- Filter: Show only requests where `cost > 0.10` or `duration > 5000`
- Aggregate: Calculate average cost per keyword, p99 latency per endpoint
- Alert: Trigger when error rate exceeds threshold or daily cost exceeds budget
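Once each log line is a JSON object, the filter and aggregate cases are ordinary array operations. The sketch below assumes newline-delimited JSON with the field names from the sample output; `parseLogLines`, `costlierThan`, and `meanDuration` are hypothetical helpers, and in practice a log platform's query language would do this work.

```typescript
interface LogLine {
  requestId: string;
  path: string;
  msg: string;
  duration?: number;
  cost?: number;
}

// Parses newline-delimited JSON, one log entry per line.
function parseLogLines(raw: string): LogLine[] {
  return raw
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as LogLine);
}

// Filter: entries whose cost exceeds a threshold in USD.
function costlierThan(lines: LogLine[], usd: number): LogLine[] {
  return lines.filter((line) => (line.cost ?? 0) > usd);
}

// Aggregate: mean duration in ms for one path, or null when nothing matches.
function meanDuration(lines: LogLine[], path: string): number | null {
  const durations = lines
    .filter((line) => line.path === path && typeof line.duration === "number")
    .map((line) => line.duration as number);
  if (durations.length === 0) return null;
  return durations.reduce((sum, d) => sum + d, 0) / durations.length;
}
```

None of this is possible against `console.log("Content generated successfully")`, because there are no fields to filter on.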
Child Loggers for Deep Context
The real power of Pino is child loggers. Each child inherits its parent's context and adds its own:
// In the request handler
const log = c.get("log"); // Has requestId, method, path
// Pass to domain function
const contentLog = log.child({ function: "content-generate", keyword });
// Inside domain function, create deeper children
const llmLog = contentLog.child({ provider: "gemini", model: "gemini-2.5-flash" });
llmLog.info({ msg: "calling LLM", promptLength: prompt.length });
// Output has: requestId + method + path + function + keyword + provider + model + msg + promptLength
// ALL context propagated automatically. Zero extra work.
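The inheritance itself is a small idea: a child is a new logger whose bound fields are the parent's plus its own. The toy logger below shows only that mechanism. It is not Pino's implementation, and `createMiniLogger` and its `sink` parameter are hypothetical.

```typescript
type Fields = Record<string, unknown>;

interface MiniLogger {
  info(fields: Fields): void;
  child(bindings: Fields): MiniLogger;
}

// Each logger closes over its bound fields. child() copies them into a new
// logger, so the parent is never mutated.
function createMiniLogger(bindings: Fields, sink: (line: string) => void): MiniLogger {
  return {
    info: (fields) => sink(JSON.stringify({ level: 30, ...bindings, ...fields })),
    child: (extra) => createMiniLogger({ ...bindings, ...extra }, sink),
  };
}
```

A call such as `createMiniLogger({ service: "pages-plus" }, console.log).child({ requestId }).child({ provider: "gemini" })` yields a logger whose every line carries all three fields, which is the behavior the Pino example above relies on.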
The Problem
Your wrangler.jsonc has 9 cron triggers:
{
"triggers": {
"crons": [
"*/15 * * * *",
"*/30 * * * *",
"0 */2 * * *",
"0 */4 * * *",
"0 6 * * *",
"0 12 * * *",
"0 18 * * *",
"30 8 * * *",
"0 0 * * *"
]
}
}
Each cron triggers LLM calls. Some generate article outlines. Some generate full drafts. Some generate meta descriptions. Some generate internal link suggestions. None of them check how much has been spent today. None of them have a kill switch.
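It helps to turn those nine expressions into a number. The sketch below counts daily invocations; `fieldMatches` and `runsPerDay` are hypothetical helpers that handle only the field forms used in this config (`*`, `*/n`, and a single number) and assume the day, month, and weekday fields are all `*`.

```typescript
// Counts how many values in [0, max] a single cron field matches.
// Supports only "*", "*/n", and a plain number.
function fieldMatches(field: string, max: number): number {
  if (field === "*") return max + 1;
  if (field.startsWith("*/")) {
    const step = Number(field.slice(2));
    return Math.ceil((max + 1) / step);
  }
  return 1;
}

// Daily invocations = matching minutes per hour x matching hours per day.
function runsPerDay(cron: string): number {
  const [minute = "*", hour = "*"] = cron.trim().split(/\s+/);
  return fieldMatches(minute, 59) * fieldMatches(hour, 23);
}

const cronExpressions = [
  "*/15 * * * *",
  "*/30 * * * *",
  "0 */2 * * *",
  "0 */4 * * *",
  "0 6 * * *",
  "0 12 * * *",
  "0 18 * * *",
  "30 8 * * *",
  "0 0 * * *",
];

const totalRunsPerDay = cronExpressions.reduce((sum, cron) => sum + runsPerDay(cron), 0);
```

The total is 167 invocations per day (96 + 48 + 12 + 6 + 5), or 835 over five days, each one free to call an LLM with no spending check.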
War Story: The 193 Ghost Posts
Here is the sequence of events:
- Monday: Deploy content pipeline with 9 crons. Each cron handles one stage of content generation.
- Monday-Friday: Crons run 24/7. Each generates content using Gemini. No cost tracking. No quality gate. No human review.
- Friday: During a routine ops session, check the Gemini billing dashboard. See $37 from pages-plus.
- Investigation: The pipeline generated 193 blog posts in 5 days.
  - No human ever reviewed the content quality
  - No human knew the posts existed (they were in draft status in D1)
  - No cost was attributed to any specific function
  - No daily limit existed to stop the pipeline
- Emergency response:
  - All 9 crons in pages-plus disabled immediately
  - All 4 crons in aso-mrr disabled
  - 14 API keys removed from worker configurations
  - New P0 standard written: API Cost Metering Standard
  - 12 issues created to implement metering before re-enabling any cron
The Fix: Budget-Aware Pipelines
Every cron that triggers LLM calls must check the budget before proceeding:
// src/crons/content-pipeline.ts
import type { ScheduledEvent } from "@cloudflare/workers-types";
import { createLogger } from "../lib/logger";
interface BudgetCheck {
allowed: boolean;
spent: number;
limit: number;
remaining: number;
}
async function checkBudget(
projectId: string,
env: Env
): Promise<BudgetCheck> {
const response = await fetch(
`${env.API_PROXY_URL}/v1/costs?period=day&project=${projectId}`,
{
headers: {
"X-Project-Id": projectId,
"X-Api-Key": env.API_PROXY_KEY,
},
}
);
if (!response.ok) {
// If we cannot check the budget, do not proceed
return { allowed: false, spent: 0, limit: 0, remaining: 0 };
}
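// response.json() is untyped; these field names are the proxy's contract
const data = (await response.json()) as {
total_cost_usd: number;
daily_limit_usd: number;
};
const spent = data.total_cost_usd;
const limit = data.daily_limit_usd;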
return {
allowed: spent < limit * 0.8, // Stop at 80% to leave headroom
spent,
limit,
remaining: limit - spent,
};
}
export async function handleScheduled(
event: ScheduledEvent,
env: Env
): Promise<void> {
const log = createLogger("pages-plus").child({ trigger: "cron", cron: event.cron });
// 1. Budget check BEFORE any LLM call
const budget = await checkBudget("pages-plus", env);
if (!budget.allowed) {
log.warn({
msg: "cron skipped: budget limit approaching",
spent: budget.spent,
limit: budget.limit,
remaining: budget.remaining,
});
return; // Exit gracefully. Do nothing. Cost: $0.
}
log.info({
msg: "cron executing",
budget: {
spent: budget.spent,
limit: budget.limit,
remaining: budget.remaining,
},
});
// 2. Process with cost awareness
const pendingItems = await getPendingContentItems(env);
// Estimate cost before processing
const estimatedCostPerItem = 0.03; // Based on historical average
const maxItems = Math.floor(budget.remaining / estimatedCostPerItem);
const itemsToProcess = pendingItems.slice(0, Math.min(maxItems, 10)); // Cap at 10 per run
log.info({
msg: "processing items",
pending: pendingItems.length,
processing: itemsToProcess.length,
maxByBudget: maxItems,
});
for (const item of itemsToProcess) {
try {
await processContentItem(item, env);
} catch (err) {
log.error({
msg: "item processing failed",
itemId: item.id,
error: err instanceof Error ? err.message : String(err),
});
// Continue with next item, don't fail the whole batch
}
}
log.info({
msg: "cron completed",
processed: itemsToProcess.length,
skipped: pendingItems.length - itemsToProcess.length,
});
}
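The batch-size arithmetic in the handler is worth isolating so it can be tested without a cron, a proxy, or an LLM. `planBatch` below is a hypothetical extraction of that logic with one added guard for a non-positive budget; the dollar figures in the usage note are example inputs.

```typescript
// Returns how many pending items this run may process: the smallest of the
// pending count, what the remaining budget can pay for, and the per-run cap.
function planBatch(
  pending: number,
  remainingUsd: number,
  estimatedCostPerItem: number,
  perRunCap: number
): number {
  // Guard: a zero or negative budget (or a bad estimate) processes nothing.
  if (remainingUsd <= 0 || estimatedCostPerItem <= 0) return 0;
  const maxByBudget = Math.floor(remainingUsd / estimatedCostPerItem);
  return Math.min(pending, maxByBudget, perRunCap);
}
```

With 40 items pending at an estimated $0.03 each and a cap of 10, a remaining budget of $1.00 allows 10 items (the cap binds), while a remaining budget of $0.10 allows 3 (the budget binds).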
Kill Switch Pattern
Every automated pipeline needs a manual kill switch that does not require a deployment:
// Check a KV flag before running ANY automated pipeline
async function isPipelineEnabled(
pipeline: string,
env: Env
): Promise<boolean> {
// KV read is fast, cheap, and can be updated without deployment
const flag = await env.KV.get(`pipeline:${pipeline}:enabled`);
// Default to DISABLED -- pipelines must be explicitly enabled
// This is the opposite of the usual default, and it's intentional.
// An unset flag means "we haven't verified this pipeline yet."
return flag === "true";
}
// In the cron handler
export async function handleScheduled(
event: ScheduledEvent,
env: Env
): Promise<void> {
// Kill switch check -- before budget check, before anything
if (!(await isPipelineEnabled("content-generation", env))) {
// Silent return. No log spam. The pipeline is off.
return;
}
// Budget check
const budget = await checkBudget("pages-plus", env);
if (!budget.allowed) return;
// ... actual work
}
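// To disable a pipeline in an emergency:
// wrangler kv key put "pipeline:content-generation:enabled" "false" --namespace-id <id>
// (On Wrangler v4 and later, add --remote so the write targets the deployed namespace.)
// No deployment. No code change. KV is eventually consistent, so allow up to
// about a minute for the new value to reach every location.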
Cron Governance Checklist
Before enabling any cron that triggers LLM calls:
// src/crons/governance.ts
interface CronGovernanceCheck {
pipeline: string;
checks: {
killSwitchExists: boolean; // Can you disable without deploying?
budgetCheckExists: boolean; // Does it check spend before calling LLMs?
dailyLimitSet: boolean; // Is there a dollar limit per day?
costAttribution: boolean; // Does each call tag its function + project?
maxItemsCapped: boolean; // Is there a per-run item limit?
humanReviewGate: boolean; // Do outputs get reviewed before publishing?
errorHandling: boolean; // Does failure in one item not kill the batch?
logging: boolean; // Is there structured logging for observability?
};
}
// Every check must be true before the cron is enabled.
// This is the standard that was written after the $47 incident.
// It exists because we did not have it before.
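The checklist is only useful if something enforces it. A minimal sketch of a gate that lists whichever checks are still false; `failingChecks` and `canEnable` are hypothetical helpers, and the interface is repeated so the snippet stands alone:

```typescript
interface CronGovernanceCheck {
  pipeline: string;
  checks: {
    killSwitchExists: boolean;
    budgetCheckExists: boolean;
    dailyLimitSet: boolean;
    costAttribution: boolean;
    maxItemsCapped: boolean;
    humanReviewGate: boolean;
    errorHandling: boolean;
    logging: boolean;
  };
}

// Names of every check that is still false, in declaration order.
function failingChecks(check: CronGovernanceCheck): string[] {
  return Object.entries(check.checks)
    .filter(([, passed]) => !passed)
    .map(([name]) => name);
}

// A cron may be enabled only when nothing is failing.
function canEnable(check: CronGovernanceCheck): boolean {
  return failingChecks(check).length === 0;
}

// Illustrative values only: everything is in place except human review.
const contentGeneration: CronGovernanceCheck = {
  pipeline: "content-generation",
  checks: {
    killSwitchExists: true,
    budgetCheckExists: true,
    dailyLimitSet: true,
    costAttribution: true,
    maxItemsCapped: true,
    humanReviewGate: false,
    errorHandling: true,
    logging: true,
  },
};
```

Returning the names of the failing checks, rather than a bare boolean, lets a deploy script or log line say exactly what is missing.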
Example 1: Extracting Structured Data from Unstructured Text
import { generateObject } from "ai";
import { google } from "@ai-sdk/google";
import { z } from "zod";
const InvoiceSchema = z.object({
vendor: z.string(),
invoiceNumber: z.string(),
date: z.string().date(),
lineItems: z.array(
z.object({
description: z.string(),
quantity: z.number(),
unitPrice: z.number(),
total: z.number(),
})
),
subtotal: z.number(),
tax: z.number(),
total: z.number(),
currency: z.string().length(3),
});
async function extractInvoice(rawText: string) {
const { object } = await generateObject({
model: google("gemini-2.5-flash"),
schema: InvoiceSchema,
prompt: `Extract invoice data from this text:\n\n${rawText}`,
});
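// object.lineItems[0].total is a number, guaranteed by schema validation
// object.currency is a 3-char string, guaranteed by schema validation
// No regex. No manual parsing. If the output cannot be validated,
// generateObject() throws, so callers handle one error path instead of many.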
return object;
}
Example 2: Cost-Aware Model Selection
function selectModel(task: {
complexity: "low" | "medium" | "high";
budgetRemaining: number;
requiresReasoning: boolean;
}): string {
// If budget is tight, always use the cheapest model
if (task.budgetRemaining < 0.50) {
return "gemini-2.5-flash"; // $0.15/M input, $0.60/M output
}
// High-complexity reasoning tasks justify thinking tokens
if (task.complexity === "high" && task.requiresReasoning) {
// But only if budget allows -- thinking tokens are 5.8x output
if (task.budgetRemaining > 2.0) {
return "gemini-2.5-pro"; // $1.25/M input, $10/M output
}
}
// Default: Flash handles 90% of tasks adequately
return "gemini-2.5-flash";
}
Example 3: Zod Schema for Queue Message Validation
import { z } from "zod";
const DomainMessageSchema = z.object({
event_id: z.string().uuid(),
type: z.string().regex(/^[a-z]+\.[a-z]+$/), // e.g., "content.published"
source: z.string().min(1),
timestamp: z.string().datetime(),
correlation_id: z.string().optional(),
payload: z.record(z.unknown()),
});
type DomainMessage = z.infer<typeof DomainMessageSchema>;
// In queue consumer
async function handleBatch(batch: MessageBatch<unknown>) {
for (const msg of batch.messages) {
const result = DomainMessageSchema.safeParse(msg.body);
if (!result.success) {
console.error("Invalid queue message:", result.error.issues);
msg.ack(); // Discard malformed messages, don't retry
continue;
}
const message = result.data;
await processMessage(message);
msg.ack();
}
}
Example 4: Shared Type Package Structure
// packages/shared/src/index.ts
// Explicit named exports -- no `export *` wildcard re-exports
export {
GeminiResponseSchema,
GeminiUsageSchema,
extractText,
extractUsage,
} from "./types/gemini";
export type { GeminiResponse, GeminiUsage } from "./types/gemini";
export {
DomainMessageSchema,
type DomainMessage,
} from "./types/messages";
export {
calculateCost,
MODEL_PRICING,
} from "./cost/calculator";
export type { TokenUsage, ModelPricing } from "./cost/calculator";
// packages/shared/package.json
{
"name": "@acme/shared",
"version": "1.0.0",
"exports": {
".": "./src/index.ts",
"./types/*": "./src/types/*.ts",
"./cost/*": "./src/cost/*.ts"
}
}
// pages-plus/package.json
{
"dependencies": {
"@acme/shared": "workspace:*"
}
}
Example 5: Retry with Exponential Backoff for LLM Calls
async function withRetry<T>(
fn: () => Promise<T>,
options: {
maxRetries?: number;
baseDelayMs?: number;
maxDelayMs?: number;
retryOn?: (error: unknown) => boolean;
} = {}
): Promise<T> {
const {
maxRetries = 3,
baseDelayMs = 1000,
maxDelayMs = 30000,
retryOn = () => true,
} = options;
let lastError: unknown;
for (let attempt = 0; attempt <= maxRetries; attempt++) {
try {
return await fn();
} catch (err) {
lastError = err;
if (attempt === maxRetries || !retryOn(err)) {
throw err;
}
const delay = Math.min(baseDelayMs * 2 ** attempt, maxDelayMs);
const jitter = delay * (0.5 + Math.random() * 0.5);
await new Promise((resolve) => setTimeout(resolve, jitter));
}
}
throw lastError;
}
// Usage with LLM calls
const result = await withRetry(
() =>
generateObject({
model: google("gemini-2.5-flash"),
schema: ArticleSchema,
prompt: "...",
}),
{
maxRetries: 2,
retryOn: (err) => {
// Retry on rate limits and server errors
// Do NOT retry on validation errors (schema mismatch)
if (err instanceof Error) {
return err.message.includes("429") || err.message.includes("500");
}
return false;
},
}
);
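To make the delay schedule concrete, the sketch below reproduces only the arithmetic from `withRetry`, with the random source injectable so the bounds are easy to check. `backoffDelay` and `withJitter` are hypothetical helpers extracted for illustration:

```typescript
// Delay before retry number `attempt` (0-based), before jitter is applied.
function backoffDelay(
  attempt: number,
  baseDelayMs = 1000,
  maxDelayMs = 30000
): number {
  return Math.min(baseDelayMs * 2 ** attempt, maxDelayMs);
}

// Jitter keeps the wait between 50% and 100% of the computed delay.
function withJitter(
  delayMs: number,
  random: () => number = Math.random
): number {
  return delayMs * (0.5 + random() * 0.5);
}

const schedule = [0, 1, 2, 3, 4, 5].map((attempt) => backoffDelay(attempt));
// schedule: [1000, 2000, 4000, 8000, 16000, 30000]
```

The sixth entry shows the cap at work: the uncapped value would be 32000 ms, but `maxDelayMs` holds it at 30000 ms.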
Example 6: Pino Child Logger Chain
import pino from "pino";
const root = pino({
browser: {
asObject: true,
write: {
info: (o: object) => console.log(JSON.stringify(o)),
error: (o: object) => console.error(JSON.stringify(o)),
},
},
base: { service: "scalable-media" },
});
// Request-level context
const requestLog = root.child({ requestId: "abc-123" });
// Function-level context
const contentLog = requestLog.child({ function: "content-generate" });
// LLM call-level context
const llmLog = contentLog.child({
provider: "gemini",
model: "gemini-2.5-flash",
});
llmLog.info({ msg: "calling model", promptTokens: 2340 });
// Output: { service, requestId, function, provider, model, msg, promptTokens }
// Every field from every ancestor is included automatically.
llmLog.info({
msg: "model responded",
outputTokens: 1200,
thinkingTokens: 3800,
cost: 0.0145,
latencyMs: 4523,
});
// Full trace from service to specific LLM call, all in one log line.
Example 7: Validation Error Formatting for API Consumers
import { z } from "zod";
function formatZodError(error: z.ZodError): {
error: string;
issues: Array<{ path: string; code: string; message: string; expected?: string; received?: string }>;
} {
return {
error: "validation_failed",
issues: error.issues.map((issue) => {
const base = {
path: issue.path.join("."),
code: issue.code,
message: issue.message,
};
if (issue.code === "invalid_type") {
return {
...base,
expected: issue.expected,
received: issue.received,
};
}
return base;
}),
};
}
// Usage
const result = CreateContentSchema.safeParse(body);
if (!result.success) {
return c.json(formatZodError(result.error), 400);
}
// Response to the client:
// {
// "error": "validation_failed",
// "issues": [
// { "path": "sections.0.heading", "code": "too_small", "message": "String must contain at least 1 character(s)" },
// { "path": "metadata.seo.keywords", "code": "too_big", "message": "Array must contain at most 10 element(s)" }
// ]
// }
Example 8: Daily Cost Dashboard Query
// GET /v1/costs/dashboard
app.get("/v1/costs/dashboard", async (c) => {
const db = c.env.DB;
const [byProject, byFunction, byModel, dailyTrend] = await Promise.all([
// Cost by project (today)
db
.prepare(
`SELECT project_id, ROUND(SUM(cost_usd), 2) as cost,
COUNT(*) as calls, SUM(thinking_tokens) as thinking
FROM api_calls_ledger
WHERE created_at >= date('now')
GROUP BY project_id ORDER BY cost DESC`
)
.all(),
// Cost by function (today)
db
.prepare(
`SELECT function, ROUND(SUM(cost_usd), 2) as cost,
COUNT(*) as calls
FROM api_calls_ledger
WHERE created_at >= date('now')
GROUP BY function ORDER BY cost DESC`
)
.all(),
// Cost by model (today)
db
.prepare(
`SELECT model, ROUND(SUM(cost_usd), 2) as cost,
COUNT(*) as calls,
SUM(input_tokens) as input_tokens,
SUM(output_tokens) as output_tokens,
SUM(thinking_tokens) as thinking_tokens
FROM api_calls_ledger
WHERE created_at >= date('now')
GROUP BY model ORDER BY cost DESC`
)
.all(),
// Daily trend (last 7 days)
db
.prepare(
`SELECT date(created_at) as day, ROUND(SUM(cost_usd), 2) as cost,
COUNT(*) as calls
FROM api_calls_ledger
WHERE created_at >= date('now', '-7 days')
GROUP BY day ORDER BY day`
)
.all(),
]);
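// D1's .all() resolves to { results, success, meta }; return only the rows
return c.json({
byProject: byProject.results,
byFunction: byFunction.results,
byModel: byModel.results,
dailyTrend: dailyTrend.results,
});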
});
LLM Response Parsing
| Approach | How It Works | Pros | Cons |
|---|---|---|---|
| Manual regex + `JSON.parse` | Strip markdown fences, trim, parse, cast | No dependencies | Fragile, incomplete coverage, no retries, `as` type assertions lie |
| Provider function calling | Define functions, model returns structured call | Provider-native, no SDK dependency | Provider-specific API, manual validation still needed, inconsistent across providers |
| Vercel AI SDK `generateObject()` | Zod schema sent as response format, SDK validates + retries | Type-safe, provider-agnostic, automatic retries, zero parsing code | Adds dependency (~50KB), requires Zod schema upfront, deeply nested schemas can cause issues |
| Instructor (Python) | Pydantic models as output schemas, retries on validation failure | Pythonic, good error messages, patches provider clients | Python only, monkey-patches clients, adds complexity |
| Outlines / Guidance | Constrained generation at token level | Guaranteed valid output, fastest | Requires model server access, not for API-based models |
Recommendation: Use `generateObject()` for TypeScript projects. It handles the 95% case (structured output from API-based models) with zero boilerplate. If you are in Python, Instructor is the equivalent. If you run your own model server, Outlines gives you guaranteed-valid generation.
Cost Tracking
| Approach | Setup | Token Tier Support | Attribution | Daily Limits | Real-Time |
|---|---|---|---|---|---|
| No tracking | None | None | None | None | No |
| Provider dashboards | Already available | Full (provider knows) | Project-level (by API key) | Manual alerts only | Yes |
| Cloudflare AI Gateway | One URL change | Full | By gateway ID | Rate limiting available | Yes |
| LiteLLM Proxy | Self-hosted proxy | Full | By key/user/team/tag | Per-key budgets | Yes |
| Portkey | SDK + dashboard | Full | By virtual key/metadata | Budget alerts | Yes |
| Custom proxy | Build + deploy | You implement it | Fully customizable | You implement it | You implement it |
Recommendation: Start with Cloudflare AI Gateway if you are already on Cloudflare; it is free and requires one line of code. Move to LiteLLM or a custom proxy when you need function-level cost attribution with custom tags. Provider dashboards are not enough: they show total spend but not why each dollar was spent.
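The "one URL change" amounts to pointing the provider client at the gateway instead of the provider's own host. A sketch assuming the gateway URL layout Cloudflare documents (`https://gateway.ai.cloudflare.com/v1/{account_id}/{gateway_id}/{provider}`); the account and gateway IDs are placeholders, the helper name is hypothetical, and the provider slug plus any path suffix your SDK expects should be checked against the current AI Gateway docs:

```typescript
// Provider slugs as they appear in the gateway URL (assumed subset).
type GatewayProvider = "openai" | "anthropic" | "google-ai-studio";

// Build the base URL a provider client would use instead of the
// provider's own API host.
function gatewayBaseUrl(
  accountId: string,
  gatewayId: string,
  provider: GatewayProvider
): string {
  return `https://gateway.ai.cloudflare.com/v1/${accountId}/${gatewayId}/${provider}`;
}

// Placeholder IDs: substitute your own account and gateway.
const baseUrl = gatewayBaseUrl("ACCOUNT_ID", "GATEWAY_ID", "openai");
```

Most provider SDKs accept a base URL option at construction time, which is where this value would go; the rest of the calling code stays unchanged.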
Agent State Management
| Runtime | Language | State Persistence | Recovery Model | Deployment | Best For |
|---|---|---|---|---|---|
| Manual state machine | Any | In-memory (lost on crash) | None (restart from scratch) | Wherever your code runs | Prototypes, simple sequences |
| CF Agents SDK | TypeScript | SQLite per agent (durable) | Resume from last checkpoint | Cloudflare edge, global | Per-entity agents (one per user/brand), edge-native |
| Temporal | TS, Python, Go, Java | Event-sourced (durable) | Replay from event history | Self-hosted or cloud | Mission-critical workflows, multi-service orchestration |
| Inngest | TypeScript | Step-level (durable) | Resume from last step | Serverless, event-driven | Event-triggered pipelines, serverless-friendly |
| LangGraph | Python, TypeScript | Configurable (Redis, Postgres) | Checkpoint-based | Self-hosted | Complex agent graphs with branching, AI-specific abstractions |
Recommendation: If you are on Cloudflare, the Agents SDK is the obvious choice: it runs on Durable Objects with built-in SQLite state, hibernation, and global distribution. If you are not on Cloudflare, use Temporal for complex multi-service workflows or Inngest for simpler event-driven pipelines. Choose LangGraph if you need graph-based agent routing, but be prepared for production challenges with state management and debugging.
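The practical difference between "restart from scratch" and "resume from last checkpoint" is easiest to see in code. The sketch below is runtime-agnostic and does not use any of the SDKs in the table: a `Map` stands in for durable storage, the steps are synchronous for brevity, and every name in it is hypothetical.

```typescript
type Step = { name: string; run: () => string };

// Stand-in for durable storage (SQLite in a Durable Object, an event log, ...).
const checkpoints = new Map<string, string[]>();

// Runs steps in order, skipping any whose output is already checkpointed.
// Returns the names of the steps actually executed on this invocation.
function runWithCheckpoints(runId: string, steps: Step[]): string[] {
  const done = checkpoints.get(runId) ?? [];
  const executed: string[] = [];
  for (let i = done.length; i < steps.length; i++) {
    const output = steps[i].run(); // may throw; earlier checkpoints survive
    done.push(output);
    checkpoints.set(runId, [...done]);
    executed.push(steps[i].name);
  }
  return executed;
}

// Simulate a crash in the second step on its first attempt.
let draftAttempts = 0;
const steps: Step[] = [
  { name: "research", run: () => "notes" },
  {
    name: "draft",
    run: () => {
      draftAttempts++;
      if (draftAttempts === 1) throw new Error("simulated crash");
      return "draft v1";
    },
  },
  { name: "publish", run: () => "published" },
];

let firstRunFailed = false;
try {
  runWithCheckpoints("run-1", steps);
} catch {
  firstRunFailed = true;
}
const resumed = runWithCheckpoints("run-1", steps);
// resumed: ["draft", "publish"] -- "research" is not executed a second time
```

If each step were an LLM call, the skipped "research" step is the spend a durable runtime saves on recovery; a manual in-memory state machine would pay for it again.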
Logging on Edge/Serverless
| Library | Edge/Worker Support | Structured Output | Child Loggers | Size | Notes |
|---|---|---|---|---|---|
| `console.log` | Native | No (strings only) | No | 0KB | Not logging, just printing |
| Pino (browser mode) | Yes (browser mode) | Yes (JSON) | Yes | ~15KB | Fastest structured logger, browser mode uses console internally |
| Winston | No (Node.js only) | Yes | Yes (`logger.child()`) | ~200KB | Too heavy, Node.js APIs, not edge-compatible |
| Custom JSON wrapper | Yes | Yes | If you build it | 1-5KB | Full control, maintenance burden |
| Workers Logs | Native (CF only) | Yes | No | 0KB | Cloudflare-specific, automatic, but limited control |
Recommendation: Pino in browser mode for Cloudflare Workers. It gives you structured JSON, child loggers with context propagation, and log levels, and it works by writing to `console.log` internally, which Workers capture. Zero infrastructure to set up. If you want Cloudflare-native, Workers Logs is automatic but less flexible.
Input Validation
| Approach | Runtime Validation | Type Safety | Error Messages | Schema Reuse | Bundle Size |
|---|---|---|---|---|---|
| None (`request.json()`) | No | No (`any`) | N/A (crashes) | N/A | 0KB |
| Manual `if` checks | Yes (incomplete) | No (no narrowing) | Custom but inconsistent | Copy-paste | 0KB |
| Zod | Yes (complete) | Yes (z.infer) | Structured, detailed | Schema objects | ~14KB |
| tRPC | Yes (via Zod) | Yes (end-to-end) | Structured | Full stack | ~30KB |
| Valibot | Yes (complete) | Yes (similar to Zod) | Structured | Schema objects | ~5KB |
| ArkType | Yes (complete) | Yes (type-first) | Structured | Schema objects | ~40KB |
Recommendation: Zod is the standard choice. It has the largest ecosystem, works with AI SDK, tRPC, React Hook Form, and most TypeScript tools. If bundle size is critical (edge workers), Valibot offers similar functionality at 1/3 the size. tRPC if you control both client and server and want end-to-end type safety.
| # | Don't | Do Instead | Why |
|---|---|---|---|
| 1 | Parse LLM responses with regex + `JSON.parse` + `as` | Use `generateObject()` with a Zod schema | Regex misses variants, `as` lies to TypeScript, no retries on malformed responses |
| 2 | Call provider APIs directly with raw `fetch()` | Route through a centralized proxy with per-call metering | No cost tracking = $47 in 5 days with zero attribution |
| 3 | Calculate cost as `(input + output) * price` | Track ALL token tiers: input, output, thinking, cache_read, cache_write | Thinking tokens cost 5.8x output tokens, missing them causes 9x undercounts |
| 4 | Let files grow past 300 lines with mixed concerns | Split by single responsibility, target <300 lines per module | God modules hide coupling, make testing expensive, slow prompt iteration |
| 5 | Use `await c.req.json()` without validation | Validate with Zod at every system boundary (HTTP, queue, function) | Unvalidated input propagates through queues and databases, crashes downstream |
| 6 | Copy-paste `GeminiApiResponse` interface across files | Shared type packages, or use AI SDK (no provider types needed) | Copies drift. One copy misses `thoughtsTokenCount`. 9x undercount follows. |
| 7 | Hand-roll state machines with in-memory state | Use a stateful agent runtime (CF Agents SDK, Temporal, Inngest) | No persistence = restart from scratch on crash. Wasted LLM spend. |
| 8 | Put provider API keys in every worker | Centralize keys in one proxy service | Rotation requires N deployments. No kill switch. No centralized rate limiting. |
| 9 | Use `console.log("something happened")` for observability | Pino browser mode with child loggers for context propagation | Strings are not searchable. No request correlation. No cost/latency data. |
| 10 | Run crons that call LLMs without budget checks | Budget-aware pipelines with daily limits, cost checks, and kill switches | 193 ghost posts in 5 days. $37 in untracked spend. No human in the loop. |
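Row 3 is the one that is easiest to get wrong silently, so here is the difference worked through. The per-million prices below are illustrative placeholders (the thinking rate is set to roughly 5.8x the output rate, matching the ratio quoted in the table), and `naiveCost` / `fullCost` are hypothetical names rather than the shared package's `calculateCost`:

```typescript
interface TokenUsage {
  input: number;
  output: number;
  thinking: number;
  cacheRead: number;
  cacheWrite: number;
}

// USD per one million tokens, per tier. Placeholder values only.
type TierPricing = Record<keyof TokenUsage, number>;

const EXAMPLE_PRICING: TierPricing = {
  input: 0.15,
  output: 0.6,
  thinking: 3.5,
  cacheRead: 0.04,
  cacheWrite: 0.15,
};

// The anti-pattern: only input and output tokens are priced.
function naiveCost(u: TokenUsage, p: TierPricing): number {
  return (u.input * p.input + u.output * p.output) / 1e6;
}

// The fix: every tier contributes to the total.
function fullCost(u: TokenUsage, p: TierPricing): number {
  const tiers = Object.keys(u) as Array<keyof TokenUsage>;
  return tiers.reduce((sum, tier) => sum + u[tier] * p[tier], 0) / 1e6;
}

// Token counts borrowed from the Pino logging example.
const usage: TokenUsage = {
  input: 2340,
  output: 1200,
  thinking: 3800,
  cacheRead: 0,
  cacheWrite: 0,
};

const naive = naiveCost(usage, EXAMPLE_PRICING);
const full = fullCost(usage, EXAMPLE_PRICING);
```

Under these placeholder prices the naive formula reports about $0.0011 for the call while the full formula reports about $0.0144. The thinking tier dominates the bill, which is why leaving it out produces an order-of-magnitude undercount rather than a rounding error.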
Here is the architecture that addresses all 10 anti-patterns simultaneously:
┌──────────────────────────────────────────────────────────────────────┐
│                          APPLICATION LAYER                           │
│                                                                      │
│  ┌───────────────┐     ┌───────────────┐     ┌───────────────┐       │
│  │ Content       │     │ Analysis      │     │ Agent         │       │
│  │ Pipeline      │     │ Service       │     │ Service       │       │
│  │               │     │               │     │               │       │
│  │ Zod schemas   │     │ Zod schemas   │     │ CF Agents SDK │       │
│  │ generateObject│     │ generateObject│     │ Durable state │       │
│  │ Pino logging  │     │ Pino logging  │     │ Pino logging  │       │
│  │ Budget checks │     │ Budget checks │     │ Budget checks │       │
│  └───────┬───────┘     └───────┬───────┘     └───────┬───────┘       │
│          │                     │                     │               │
│          └─────────────────────┼─────────────────────┘               │
│                                │                                     │
│               X-Project-Id + X-Function + X-Tags                     │
│                                │                                     │
│                                ▼                                     │
│  ┌──────────────────────────────────────────────────────────────┐    │
│  │                          API PROXY                           │    │
│  │                                                              │    │
│  │  Auth → Rate Limit → Budget Check → Forward → Meter          │    │
│  │                                                              │    │
│  │  - One service holds all provider API keys                   │    │
│  │  - Every call logged to api_calls_ledger                     │    │
│  │  - All token tiers tracked (input/output/thinking/cache)     │    │
│  │  - Daily spend limits enforced per project                   │    │
│  │  - Kill switch via project.active flag                       │    │
│  └─────────────────────────────┬────────────────────────────────┘    │
│                                │                                     │
└────────────────────────────────┼─────────────────────────────────────┘
                                 │
                                 ▼
                 ┌───────────────────────────────┐
                 │         PROVIDER APIs         │
                 │  Gemini | OpenAI | Anthropic  │
                 │   Brave | Perplexity | etc.   │
                 └───────────────────────────────┘
The key properties of this stack:
- Every LLM call is structured. Zod schemas define the output. `generateObject()` validates it. No regex, no `JSON.parse`, no `as`.
- Every LLM call is metered. The proxy records project, function, tags, all token tiers, and calculated cost. Monthly reconciliation catches formula drift.
- Every LLM call is budgeted. Daily limits stop runaway spending. Kill switches disable pipelines without deployment. Crons check budget before calling.
- Every module is small. Under 300 lines. Single responsibility. Independently testable. Prompts are easy to find and iterate.
- Every boundary is validated. HTTP endpoints, queue consumers, function arguments: Zod schemas at every transition point. Invalid data is rejected at the edge, not in the database.
- Every type is defined once. Shared packages or SDK abstractions. No copy-paste drift. One update propagates everywhere.
- Every agent has durable state. CF Agents SDK or equivalent runtime. Crash recovery resumes from the last checkpoint. No wasted LLM spend.
- Every key is centralized. One service, one rotation point, one kill switch. No scattered secrets.
- Every log is structured. Pino with child loggers. Request IDs propagate through the call chain. Cost, latency, and token counts are logged fields, not embedded strings.
- Every cron is governed. Budget check before execution. Item caps per run. Kill switch via KV. Human review gate before publishing.
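The monthly reconciliation mentioned in the metering point can be as small as one ratio. A sketch using the figures from this article ($1.14 tracked against $10.19 billed); the 10% tolerance and the function name are assumptions:

```typescript
interface Reconciliation {
  ratio: number;    // billed / tracked
  drifted: boolean; // true when the gap exceeds the tolerance
}

// Compare what the ledger recorded with what the provider invoiced.
function reconcile(
  trackedUsd: number,
  billedUsd: number,
  tolerance = 0.1
): Reconciliation {
  if (trackedUsd <= 0) {
    // Nothing tracked: any bill at all means the ledger is missing calls.
    return { ratio: Infinity, drifted: billedUsd > 0 };
  }
  const ratio = billedUsd / trackedUsd;
  return { ratio, drifted: Math.abs(ratio - 1) > tolerance };
}

const incident = reconcile(1.14, 10.19);
// incident.ratio is about 8.9, the "9x undercount"; incident.drifted is true
```

A ratio near 1 means the cost formula still matches the invoice. Anything far from 1 points to a missing token tier or a stale price table.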
This is not theoretical. This is the stack that replaced the one that burned $47 in 5 days.
Official Documentation
- Vercel AI SDK - generateObject: API reference for structured output generation with schema validation
- Vercel AI SDK - Generating Structured Data: Guide to structured data extraction with Zod schemas
- Vercel AI SDK 6: Unified generateObject/generateText with multi-step tool calling and structured output
- Zod Documentation: TypeScript-first schema validation with static type inference
- Cloudflare AI Gateway: Managed LLM proxy with analytics, caching, rate limiting
- Cloudflare AI Gateway Pricing: Core features free, Logpush extra
- Cloudflare Agents SDK: Stateful AI agents on Durable Objects with built-in SQL, scheduling, WebSocket
- Cloudflare Agents API Reference: Agent class methods, state management, lifecycle hooks
- Cloudflare Agent Class Internals: How Agents extend Durable Objects
- Cloudflare Durable Objects: Stateful micro-servers with SQLite, the foundation for Agents
- Cloudflare Workers Logs: Native structured logging for Workers
- Gemini API Pricing: Token tier pricing including thinking tokens for Flash/Pro models
- Pino Logger: Low overhead Node.js logger with browser mode for edge runtimes
- LiteLLM: Open-source LLM proxy supporting 100+ providers with cost tracking
- LiteLLM Spend Tracking: Per-key/user/team cost attribution and budget management
- LiteLLM Tag Budgets: Tag-based spend tracking for cost center attribution
- Temporal: Durable execution platform for mission-critical workflows
- Inngest: Event-driven serverless workflow orchestration
- LangGraph: Graph-based AI agent orchestration framework
Blog Posts and Guides
- The Complete Guide to LLM Observability (Portkey): Traces, metrics, events for production LLM apps
- LLM Cost Tracking Solution (TrueFoundry): Observability, governance, and cost optimization for LLM operations
- Cloudflare AI Gateway Pricing Explained: Detailed pricing breakdown and comparison
- Comparing Inngest and Temporal for State Management: Side-by-side comparison for distributed systems
- The Ultimate Guide to TypeScript Orchestration: Temporal vs Trigger.dev vs Inngest comparison
- Orchestrating Multi-Step Agents: Temporal/Dagster/LangGraph Patterns: Patterns for long-running agent work
- Prototype to Production-Ready Agentic AI (Temporal): Moving LangGraph agents to Temporal for production
- We Tested 8 LangGraph Alternatives (ZenML): Practical comparison of agent orchestration frameworks
- Building AI Agents with MCP, AuthN/AuthZ, and Durable Objects: Cloudflare's vision for agent infrastructure
- Pino on Cloudflare Workers (GitHub Issue #2035): Browser mode usage for Workers environments
- Structured Outputs with Vercel AI SDK: Practical guide to generateObject with Zod
- Professional Validation with Zod (CodeSignal): Production-readiness patterns with Zod
- LiteLLM Cost Tracking: Multi-Model Expense Management (Statsig): Real-world cost tracking with LiteLLM
- Monitor LiteLLM with Datadog: LLM observability integration
Companion Articles
- The Three-Layer AI Agent Architecture: Agent runtime + LLM interface + API proxy, the architectural pattern this article's fixes implement
- Event-Driven Architecture on Cloudflare Workers: Queues, fan-out, idempotency, consumer middleware
- Cost Observability for Cloudflare Workers: D1 row reads incident, Analytics Engine, budget governor patterns
- Building an Autonomous Data Pipeline on Cloudflare Workers: Workers + D1 + Queues + DO + R2, priority scheduler, cost story
Libraries and Tools
- Vercel AI SDK (npm: `ai`): TypeScript SDK for building AI applications, structured output, streaming
- Zod (npm: `zod`): TypeScript-first schema validation, runtime + compile-time safety
- Pino (npm: `pino`): Low-overhead structured logger for Node.js and browser environments
- LiteLLM (GitHub): Python SDK and proxy server for 100+ LLM APIs with cost tracking
- Cloudflare Agents (GitHub): Build and deploy stateful AI agents on Cloudflare's edge network
- Hono: Lightweight web framework for Cloudflare Workers, Deno, Bun, Node.js
- hono-pino (JSR): Pino integration middleware for Hono
Every anti-pattern in this article was shipped to production, discovered through operational pain, and fixed with the techniques described. The $47 was real. The 193 ghost posts were real. The 9x undercount was real. The fixes are also real, and they are running in production today.