Cost-Aware Orchestration: Budget as a First-Class Constraint

Every pipeline step has a cost. Every capability provider has a price. The orchestrator must treat budget the way it treats a timeout: as a hard constraint that shapes every routing decision before the invoice arrives.

The Problem: Invisible Cost
Budget as Constraint, Not Afterthought
The Cost Gradient Within a Single Capability
Per-Step Cost Attribution
The Budget Governor Pattern
Cost-Aware Degradation
Caching Saves Money
Real Pricing Data (2026)
The Four-Tier Model
Monitoring and Alerting
References

The Problem: Invisible Cost

Lunar.dev’s 2024 State of API Consumption report found that 36% of engineers spend more time troubleshooting APIs than building new features. But that figure measures developer time, not money. The money problem is worse: few teams can reliably attribute a $10,000 API bill to the specific product features or pipeline runs that generated it.

The cost of a pipeline is invisible until the invoice arrives. By then it is too late to ask which pipeline caused it, which provider charged the most, or whether a cheaper provider would have produced the same result.

This is an architectural failure, not a finance failure. Systems that cannot observe their own cost cannot control it. Engineers optimize what they can measure. If cost per invocation is not measurable at design time, it will not be optimized. It will accumulate silently — per render, per TTS call, per AI video generation — until a monthly statement forces a reckoning.

The conventional response is to add cost monitoring after the fact: export billing data to a dashboard, tag cloud resources, add labels to API calls. This is useful but insufficient. Retroactive monitoring explains what happened. It does not change what happens next.

The correct fix is to make cost a first-class constraint that the orchestrator enforces in real time — the same way it enforces timeouts, retry limits, and provider health checks. Budget must be visible to the routing layer before any capability is invoked. If the budget for a step is exhausted, the router must know this and act on it: downgrade to a free provider, queue the task, or reject it cleanly.

Budget as Constraint, Not Afterthought

A pipeline orchestrator already treats certain values as hard constraints. Timeout means: if this step has not completed in N seconds, abort it. Retry limit means: if this step has failed K times, stop trying. Budget must occupy the same conceptual position: if this step would exceed the allocated spend, do not invoke the expensive provider — route elsewhere.

This requires three things the typical orchestrator does not have:

Cost metadata per provider. The registry must know what each provider charges per invocation, not just its endpoint URL.
Budget declaration at the invoke site. Each pipeline step must declare how much it is willing to spend.
Budget enforcement in the router. Before invoking a provider, the router checks the declared budget and the remaining global/daily budget, then selects accordingly.

The YAML representation is straightforward:

pipeline:
  - invoke:
      capability: tts-generate
      budget_cents: 2
      prefer: cheapest

  - invoke:
      capability: ffmpeg-render
      budget_cents: 50
      prefer: cheapest

  - invoke:
      capability: ai-video-generate
      budget_cents: 25
      prefer: local-first
      fallback: reject

The budget_cents field is a hard cap per invocation. The prefer hint is a soft routing preference that applies within that budget. The router reads both and selects a provider that satisfies the cap — or raises a budget error if none exists.

This is the minimum surface area. Beyond budget_cents and prefer, a production system uses two more fields: fallback, shown on the last step above, which says what to do if no provider fits the budget (queue, degrade, or reject), and priority (high-priority pipelines may be permitted to exceed caps, while batch pipelines are held to strict limits).
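As a rough sketch, a step declaration could be typed and validated as follows. The names InvokeStep and normalizeStep, the 'normal' priority value, and the defaults applied here are assumptions for illustration; the source only names the fields budget_cents, prefer, fallback, and priority.

```typescript
// Hypothetical shape of one `invoke` step; field names follow the YAML above.
type Prefer = 'cheapest' | 'fastest' | 'local-first';
type Fallback = 'queue' | 'degrade' | 'reject';
type Priority = 'high' | 'normal' | 'batch'; // 'normal' is an assumed middle value

interface InvokeStep {
  capability: string;
  budget_cents: number; // hard cap per invocation
  prefer?: Prefer;      // soft routing hint within the cap
  fallback?: Fallback;  // what to do when no provider fits the cap
  priority?: Priority;  // whether the cap may be relaxed
}

// Apply defaults and reject malformed budgets before the router sees the step.
// The defaults chosen here (cheapest / reject / normal) are assumptions.
function normalizeStep(step: InvokeStep): Required<InvokeStep> {
  if (!Number.isInteger(step.budget_cents) || step.budget_cents < 0) {
    throw new RangeError(`budget_cents must be a non-negative integer, got ${step.budget_cents}`);
  }
  return {
    capability: step.capability,
    budget_cents: step.budget_cents,
    prefer: step.prefer ?? 'cheapest',
    fallback: step.fallback ?? 'reject',
    priority: step.priority ?? 'normal',
  };
}

const ttsStep = normalizeStep({ capability: 'tts-generate', budget_cents: 2 });
```

Defaulting fallback to reject keeps an under-specified step from silently queuing or degrading; a different default may suit batch workloads better.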

The Cost Gradient Within a Single Capability

A key insight that makes cost-aware routing tractable: for most capabilities, a free or near-free provider exists. Within a single capability, per-invocation prices in the examples below run from $0.00 for a local provider to $0.25 or more for on-demand cloud capacity.

ffmpeg-render:
  Stargate (local GPU)         → $0.00/render
  Shotstack (commercial)       → $0.04/render
  RunPod (spot instance)       → $0.08/render
  RunPod (on-demand)           → $0.25/render
  Creatomate (commercial)      → ~$0.04/render

tts-generate:
  Edge TTS (local, offline)    → $0.00/call
  Fish Audio (paid)            → $0.80/hr of audio
  ElevenLabs (paid)            → $0.30/1K characters

ai-video-generate:
  Local Wan2 (own GPU)         → $0.00/clip
  fal.ai Wan2 (cloud)          → $0.20/clip
  Runway Gen-4 (commercial)    → $0.50/5 seconds

stock-search:
  Pexels (free, API)           → $0.00/search
  Pixabay (free, API)          → $0.00/search
  Storyblocks (subscription)   → $24,000/year flat

The gradient means the routing decision is almost never binary. It is not “use the expensive provider or fail.” It is: start at the free tier, move up only when quality or availability requires it, and never exceed the declared budget.

API Mom’s CapabilityRegistryDO stores a cost_model per provider. Each provider record includes cost_cents_per_invocation (or an equivalent unit), is_free, requires_local_gpu, availability, and health_status. The router reads this at decision time to build a candidate list sorted by cost, filtered by availability and budget.

interface ProviderRecord {
  provider_id: string;
  capability: string;
  cost_cents_per_invocation: number;
  is_free: boolean;
  requires_local_gpu: boolean;
  availability: 'always' | 'spot' | 'local-only';
  health_status: 'healthy' | 'degraded' | 'down';
}

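// Thrown when no healthy provider fits the declared per-invocation cap.
class BudgetExceededError extends Error {}

async function selectProvider(
  capability: string,
  budget_cents: number,
  prefer: 'cheapest' | 'fastest' | 'local-first',
  registry: CapabilityRegistryDO,
): Promise<ProviderRecord> {
  const candidates = await registry.getProviders(capability);

  // Filter: healthy, within budget
  const eligible = candidates.filter(
    (p) => p.health_status === 'healthy' && p.cost_cents_per_invocation <= budget_cents,
  );

  if (eligible.length === 0) {
    throw new BudgetExceededError(`No provider for ${capability} within ${budget_cents}¢`);
  }

  // Sort by preference; cost breaks ties so the result is deterministic
  const byCost = (a: ProviderRecord, b: ProviderRecord) =>
    a.cost_cents_per_invocation - b.cost_cents_per_invocation;
  if (prefer === 'cheapest') {
    eligible.sort(byCost);
  } else if (prefer === 'local-first') {
    // Local providers first, then cheapest within each group
    eligible.sort(
      (a, b) => Number(b.requires_local_gpu) - Number(a.requires_local_gpu) || byCost(a, b),
    );
  }
  // 'fastest' would need latency data, which ProviderRecord does not carry,
  // so registry order is kept for that hint.

  return eligible[0];
}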

The prefer: cheapest hint selects Stargate (local GPU, free) when the local runner is online, Shotstack as the first paid fallback. If the declared budget is less than Shotstack’s $0.04, only local/free providers remain in the candidate list.
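The behavior described above can be checked with a small, self-contained sketch. The prices come from the ffmpeg-render list earlier in this section; the provider ids, the Candidate type, and the withinBudget helper are hypothetical names used only for this example.

```typescript
// Prices in cents, taken from the ffmpeg-render list above. Ids are illustrative.
interface Candidate {
  provider_id: string;
  cost_cents: number;
  healthy: boolean;
}

const ffmpegRender: Candidate[] = [
  { provider_id: 'stargate', cost_cents: 0, healthy: true },
  { provider_id: 'shotstack', cost_cents: 4, healthy: true },
  { provider_id: 'runpod-spot', cost_cents: 8, healthy: true },
  { provider_id: 'runpod-on-demand', cost_cents: 25, healthy: true },
];

// Return the ids of healthy providers that fit the cap, cheapest first.
// filter() returns a new array, so the sort does not mutate the input.
function withinBudget(candidates: Candidate[], budgetCents: number): string[] {
  return candidates
    .filter((c) => c.healthy && c.cost_cents <= budgetCents)
    .sort((a, b) => a.cost_cents - b.cost_cents)
    .map((c) => c.provider_id);
}

const tightBudget = withinBudget(ffmpegRender, 3);   // only the free local runner fits
const normalBudget = withinBudget(ffmpegRender, 50); // every provider fits, free one first

// If the local runner goes offline, Shotstack becomes the first choice.
const localDown = ffmpegRender.map((c) =>
  c.provider_id === 'stargate' ? { ...c, healthy: false } : c,
);
const fallbackOrder = withinBudget(localDown, 50);
```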

Per-Step Cost Attribution

Invisible cost is the root problem. Making cost visible requires instrumentation at the invocation level, not the billing level. Every capability invoke must return cost metadata alongside its payload.

{
  "result": { "url": "https://r2.example.com/renders/abc123.mp4" },
  "meta": {
    "cost_cents": 4,
    "provider_id": "shotstack",
    "cache_hit": false,
    "latency_ms": 2340,
    "invocation_id": "inv_01j9k2m4n8p"
  }
}

The pipeline runner aggregates this metadata across steps:

interface PipelineRunRecord {
  run_id: string;
  pipeline_id: string;
  started_at: number;
  completed_at: number;
  status: 'success' | 'failed' | 'budget_rejected';

  // Cost breakdown
  total_cost_cents: number;
  cost_by_step: Record<string, number>;      // step_name → cents
  cost_by_provider: Record<string, number>;  // provider_id → cents
  cost_by_capability: Record<string, number>; // capability → cents

  // Cache savings
  cache_hits: number;
  cache_misses: number;
  saved_by_cache_cents: number;
}

With this record, you can answer questions that are otherwise unanswerable from billing data alone:

“This pipeline costs $0.43 on average, $0.12 when the cache is warm.”
“The ai-video-generate step accounts for 70% of pipeline cost.”
“We spent $42 on fal.ai this week; the local Wan2 runner saved an estimated $380.”
“Pipeline render-weekly-summary exceeded its $1.00 budget on 12 of 30 runs.”

This is cost observability at the right level of abstraction — pipeline runs, not cloud billing line items.
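One way the runner might fold per-step metadata into the breakdown fields of a run record is sketched below. StepCost, CostBreakdown, bump, and aggregateCosts are illustrative names, and the sample step names and provider ids are made up for the example.

```typescript
// One entry per executed step; mirrors the `meta` object returned by an invoke.
interface StepCost {
  step_name: string;
  capability: string;
  provider_id: string;
  cost_cents: number;
  cache_hit: boolean;
}

interface CostBreakdown {
  total_cost_cents: number;
  cost_by_step: Record<string, number>;
  cost_by_provider: Record<string, number>;
  cost_by_capability: Record<string, number>;
  cache_hits: number;
  cache_misses: number;
}

// Add `cents` to a bucket, creating the bucket on first use.
function bump(bucket: Record<string, number>, key: string, cents: number): void {
  bucket[key] = (bucket[key] ?? 0) + cents;
}

function aggregateCosts(steps: StepCost[]): CostBreakdown {
  const out: CostBreakdown = {
    total_cost_cents: 0,
    cost_by_step: {},
    cost_by_provider: {},
    cost_by_capability: {},
    cache_hits: 0,
    cache_misses: 0,
  };
  for (const s of steps) {
    out.total_cost_cents += s.cost_cents;
    bump(out.cost_by_step, s.step_name, s.cost_cents);
    bump(out.cost_by_provider, s.provider_id, s.cost_cents);
    bump(out.cost_by_capability, s.capability, s.cost_cents);
    if (s.cache_hit) out.cache_hits += 1;
    else out.cache_misses += 1;
  }
  return out;
}

const breakdown = aggregateCosts([
  { step_name: 'narration', capability: 'tts-generate', provider_id: 'edge-tts', cost_cents: 0, cache_hit: true },
  { step_name: 'b-roll', capability: 'ai-video-generate', provider_id: 'fal-ai', cost_cents: 20, cache_hit: false },
  { step_name: 'final-render', capability: 'ffmpeg-render', provider_id: 'shotstack', cost_cents: 4, cache_hit: false },
]);
```

In this sample run the ai-video-generate step accounts for 20 of 24 cents, which is the kind of statement the quotes above rely on.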

The Budget Governor Pattern

Per-invocation budget caps prevent any single step from overspending. But you also need aggregate enforcement: daily caps per capability, per-pipeline caps, and global caps. These require a stateful controller.

Pattern 3 from Cloudflare Durable Objects Patterns applies directly. A BudgetGovernorDO holds spend state for all capabilities and enforces limits across invocations. Because a Durable Object is single-threaded with transactional storage, budget accounting is race-free — two concurrent pipeline runs cannot both read “budget remaining: $0.50” and each spend $0.40.

// capability-budget-governor.ts
import { DurableObject } from 'cloudflare:workers';

interface CapabilityBudget {
  capability: string;
  daily_limit_cents: number;
  spent_today_cents: number;
  reset_at: number; // Epoch milliseconds of the next midnight UTC (compared against Date.now())
}

interface GlobalBudget {
  daily_limit_cents: number;
  spent_today_cents: number;
  reset_at: number;
}

interface GovernorState {
  capability_budgets: Record<string, CapabilityBudget>;
  pipeline_budgets: Record<string, { limit_cents: number; spent_cents: number }>;
  global: GlobalBudget;
}

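// Env is the Worker's bindings type.
export class CapabilityBudgetGovernor extends DurableObject<Env> {
  // Safe default until budgets are configured: a zero global limit rejects paid work.
  private state: GovernorState = {
    capability_budgets: {},
    pipeline_budgets: {},
    global: { daily_limit_cents: 0, spent_today_cents: 0, reset_at: 0 },
  };

  constructor(ctx: DurableObjectState, env: Env) {
    super(ctx, env);
    // Load persisted state before any request is handled.
    ctx.blockConcurrencyWhile(async () => {
      this.state = (await ctx.storage.get<GovernorState>('state')) ?? this.state;
    });
  }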

  async checkAndCharge(params: {
    capability: string;
    pipeline_id: string;
    estimated_cents: number;
  }): Promise<{ approved: boolean; reason?: string }> {
    await this.maybeResetDaily();

    const { capability, pipeline_id, estimated_cents } = params;
    const cap = this.state.capability_budgets[capability];
    const pipeline = this.state.pipeline_budgets[pipeline_id];
    const global = this.state.global;

    // Check capability daily limit
    if (cap && cap.spent_today_cents + estimated_cents > cap.daily_limit_cents) {
      return {
        approved: false,
        reason: `Capability ${capability} daily budget exhausted (${cap.spent_today_cents}¢ of ${cap.daily_limit_cents}¢)`,
      };
    }

    // Check per-pipeline limit
    if (pipeline && pipeline.spent_cents + estimated_cents > pipeline.limit_cents) {
      return {
        approved: false,
        reason: `Pipeline ${pipeline_id} budget exceeded (${pipeline.spent_cents}¢ of ${pipeline.limit_cents}¢)`,
      };
    }

    // Check global daily limit
    if (global.spent_today_cents + estimated_cents > global.daily_limit_cents) {
      return {
        approved: false,
        reason: `Global daily budget exhausted (${global.spent_today_cents}¢ of ${global.daily_limit_cents}¢)`,
      };
    }

    // Charge (optimistic — actual cost reported after invocation)
    if (cap) cap.spent_today_cents += estimated_cents;
    if (pipeline) pipeline.spent_cents += estimated_cents;
    global.spent_today_cents += estimated_cents;

    await this.ctx.storage.put('state', this.state);
    return { approved: true };
  }

  async reconcile(params: {
    capability: string;
    pipeline_id: string;
    estimated_cents: number;
    actual_cents: number;
  }): Promise<void> {
    // Correct the optimistic charge with the actual cost
    const delta = params.actual_cents - params.estimated_cents;
    const cap = this.state.capability_budgets[params.capability];
    const pipeline = this.state.pipeline_budgets[params.pipeline_id];

    if (cap) cap.spent_today_cents += delta;
    if (pipeline) pipeline.spent_cents += delta;
    this.state.global.spent_today_cents += delta;

    await this.ctx.storage.put('state', this.state);
  }

  private async maybeResetDaily(): Promise<void> {
    const now = Date.now();
    if (now >= this.state.global.reset_at) {
      // New day — reset all daily counters
      for (const cap of Object.values(this.state.capability_budgets)) {
        cap.spent_today_cents = 0;
        cap.reset_at = nextMidnightUTC();
      }
      this.state.global.spent_today_cents = 0;
      this.state.global.reset_at = nextMidnightUTC();
      await this.ctx.storage.put('state', this.state);
    }
  }
}
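The optimistic charge in checkAndCharge implies a reserve, invoke, reconcile sequence on the caller's side. A minimal sketch of that sequence follows. The Governor interface and invokeWithBudget are hypothetical names, the run callback stands in for the real provider call, and the assumption that a failed call costs nothing should be checked against each provider's billing rules.

```typescript
// Minimal view of the governor that a caller needs; mirrors checkAndCharge and reconcile.
interface Governor {
  checkAndCharge(p: {
    capability: string;
    pipeline_id: string;
    estimated_cents: number;
  }): Promise<{ approved: boolean; reason?: string }>;
  reconcile(p: {
    capability: string;
    pipeline_id: string;
    estimated_cents: number;
    actual_cents: number;
  }): Promise<void>;
}

// Reserve the estimate, run the provider call, then settle the difference.
// `run` is a hypothetical callback that performs the invocation and reports its real cost.
async function invokeWithBudget<T>(
  governor: Governor,
  capability: string,
  pipeline_id: string,
  estimated_cents: number,
  run: () => Promise<{ result: T; actual_cents: number }>,
): Promise<T> {
  const decision = await governor.checkAndCharge({ capability, pipeline_id, estimated_cents });
  if (!decision.approved) {
    throw new Error(decision.reason ?? 'budget rejected');
  }

  let actual_cents = 0; // assumption: a failed call is not billed
  try {
    const outcome = await run();
    actual_cents = outcome.actual_cents;
    return outcome.result;
  } finally {
    // Runs on success and on failure, so a failed call releases its reservation.
    await governor.reconcile({ capability, pipeline_id, estimated_cents, actual_cents });
  }
}
```

Reconciling in a finally block keeps a thrown provider error from leaving the estimate charged against the daily budget.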

Example budget configuration — maintained as code, applied at governor initialization:

const budgetConfig = {
  capabilities: {
    'ffmpeg-render': { daily_limit_cents: 500 },       // $5/day
    'ai-video-generate': { daily_limit_cents: 2000 },  // $20/day
    'tts-generate': { daily_limit_cents: 300 },        // $3/day
    'stock-search': { daily_limit_cents: 0 },          // free only
  },
  pipelines: {
    'render-weekly-summary': { limit_cents: 100 },     // $1/run
    'render-social-clip': { limit_cents: 50 },         // $0.50/run
    'batch-tts-podcast': { limit_cents: 200 },         // $2/run
  },
  global: {
    daily_limit_cents: 5000,                           // $50/day total
  },
};

The governor is the enforcement point. The routing layer is the decision point. Both are separate from the capability providers themselves. This separation means you can change budget policy without touching pipeline logic.
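A configuration like the one above still has to be turned into governor state with zeroed counters and a reset time. A possible sketch is below; BudgetConfig, InitialState, and buildInitialState are illustrative names, and this nextMidnightUTC is one plausible reading of the helper that the governor code calls but does not define.

```typescript
// Next 00:00 UTC strictly after `now`, in epoch milliseconds (comparable with Date.now()).
// Date.UTC normalizes day overflow, so the last day of a month rolls into the next month.
function nextMidnightUTC(now: number = Date.now()): number {
  const d = new Date(now);
  return Date.UTC(d.getUTCFullYear(), d.getUTCMonth(), d.getUTCDate() + 1);
}

interface BudgetConfig {
  capabilities: Record<string, { daily_limit_cents: number }>;
  pipelines: Record<string, { limit_cents: number }>;
  global: { daily_limit_cents: number };
}

interface InitialState {
  capability_budgets: Record<
    string,
    { capability: string; daily_limit_cents: number; spent_today_cents: number; reset_at: number }
  >;
  pipeline_budgets: Record<string, { limit_cents: number; spent_cents: number }>;
  global: { daily_limit_cents: number; spent_today_cents: number; reset_at: number };
}

// Expand declarative limits into governor state with all counters at zero.
function buildInitialState(config: BudgetConfig, now: number = Date.now()): InitialState {
  const reset_at = nextMidnightUTC(now);
  const state: InitialState = {
    capability_budgets: {},
    pipeline_budgets: {},
    global: { daily_limit_cents: config.global.daily_limit_cents, spent_today_cents: 0, reset_at },
  };
  for (const [capability, { daily_limit_cents }] of Object.entries(config.capabilities)) {
    state.capability_budgets[capability] = { capability, daily_limit_cents, spent_today_cents: 0, reset_at };
  }
  for (const [pipelineId, { limit_cents }] of Object.entries(config.pipelines)) {
    state.pipeline_budgets[pipelineId] = { limit_cents, spent_cents: 0 };
  }
  return state;
}
```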

Cost-Aware Degradation

Budget exhaustion is not a binary state. It is a gradient. The routing layer should respond proportionally, degrading gracefully rather than failing hard.

Four bands define the degradation curve, separated by thresholds at 80%, 90%, and 98% of the daily budget:

Global daily budget: $50.00

$50.00 → $10.00  (0–80% spent)
  Status: NORMAL
  Routing: Standard; honor each step's prefer hint, allow paid providers

$10.00 → $5.00   (80–90% spent)
  Status: WARN
  Routing: Prefer cheapest providers across all capabilities
  Action: Log degradation events, alert on-call

$5.00  → $1.00   (90–98% spent)
  Status: CONSTRAINED
  Routing: Free and local providers only
  Action: Queue paid-only tasks for next day, warn pipeline callers

$1.00  → $0.00   (98–100% spent)
  Status: EXHAUSTED
  Routing: Reject new invocations that cannot be served free
  Action: Alert human, surface budget status in API responses
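The bands above can be derived from the governor's counters with a small pure function. This is a sketch: budgetStatus is a hypothetical name, and treating each threshold as inclusive on its lower edge (exactly 80% spent counts as WARN) is an assumption, since the ranges above do not say which side owns the boundary.

```typescript
type BudgetStatus = 'normal' | 'warn' | 'constrained' | 'exhausted';

// Map spend against the daily limit to a status band.
// Integer comparisons avoid floating-point surprises at the band edges.
function budgetStatus(spentCents: number, limitCents: number): BudgetStatus {
  if (limitCents <= 0) return 'exhausted'; // no paid budget configured
  const used = spentCents * 100;
  if (used >= limitCents * 98) return 'exhausted';
  if (used >= limitCents * 90) return 'constrained';
  if (used >= limitCents * 80) return 'warn';
  return 'normal';
}

// With the $50.00 daily budget from the table above:
const atStartOfDay = budgetStatus(0, 5000);    // 'normal'
const atFortyDollars = budgetStatus(4000, 5000); // 'warn'
```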

The router reads current budget status from the governor on each invocation. This is a single DO read — negligible latency. Based on status, it applies a routing override:

type BudgetStatus = 'normal' | 'warn' | 'constrained' | 'exhausted';

// Thrown when the global budget is spent and no free provider can serve the request.
class BudgetExhaustedError extends Error {}

function applyBudgetOverride(
  candidates: ProviderRecord[],
  status: BudgetStatus,
): ProviderRecord[] {
  switch (status) {
    case 'normal':
      return candidates; // No override

    case 'warn':
      // Sort cheapest first, do not filter (copy so the caller's array is not mutated)
      return [...candidates].sort(
        (a, b) => a.cost_cents_per_invocation - b.cost_cents_per_invocation,
      );

    case 'constrained':
      // Free and local providers only
      return candidates.filter((p) => p.is_free || p.requires_local_gpu);

    case 'exhausted': {
      // Free only; if none, throw
      const free = candidates.filter((p) => p.is_free);
      if (free.length === 0) {
        throw new BudgetExhaustedError('Global daily budget exhausted, no free provider available');
      }
      return free;
    }
  }
}

The degradation cascade maps directly to the four-tier model. constrained forces everything to Tier 1 (free, always available) and Tier 2 (free with limits). exhausted is a circuit breaker that protects the budget from further drain while keeping the system partially functional.

Caching Saves Money

Content-addressed caching is the most reliable cost optimization available. For a deterministic capability, if the input to an invocation is identical to a previous invocation, the output will be identical. Pay once, serve indefinitely.

The caching layer operates on the capability inputs, not the provider responses:

async function invokeWithCache(
  capability: string,
  inputs: Record<string, unknown>,
  provider: ProviderRecord,
  env: { CACHE: KVNamespace },
): Promise<InvokeResult> {
  // Cache key from capability + inputs. JSON.stringify is sensitive to key order,
  // so callers must build inputs with a consistent key order for the key to be stable.
  const cacheKey = await sha256(`${capability}:${JSON.stringify(inputs)}`);

  // Check the KV cache
  const cached = await env.CACHE.get<InvokeResult>(cacheKey, 'json');
  if (cached) {
    return {
      ...cached,
      meta: { ...cached.meta, cache_hit: true, cost_cents: 0 },
    };
  }

  // Cache miss: invoke provider
  const result = await invoke(capability, inputs, provider);

  // Store result (TTL varies by capability; 0 means never expire).
  // KV rejects expirationTtl values below 60 seconds, so the option is omitted for 0.
  const ttl = CACHE_TTL[capability] ?? 86400; // Default 24h
  await env.CACHE.put(
    cacheKey,
    JSON.stringify(result),
    ttl > 0 ? { expirationTtl: ttl } : undefined,
  );

  return result;
}

const CACHE_TTL: Record<string, number> = {
  'tts-generate': 0,         // No expiry: TTS output is permanent
  'stock-search': 86400,     // 24h: stock libraries update daily
  'ai-video-generate': 0,    // No expiry: same prompt and seed, same clip
  'ffmpeg-render': 3600,     // 1h: rendered files may be overwritten
};
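Because the cache key is derived from serialized inputs, two logically identical input objects with different key order would miss each other. One way to avoid that is to serialize with sorted keys before hashing, sketched below. stableStringify, sha256Hex, and cacheKeyFor are hypothetical helpers, not part of the original code; the sketch assumes plain JSON data (no Date, Map, or class instances) and a runtime that exposes the Web Crypto API as the global crypto, as Workers does.

```typescript
// Serialize with object keys sorted, so {a:1,b:2} and {b:2,a:1} yield the same string.
function stableStringify(value: unknown): string {
  if (value === null || typeof value !== 'object') {
    // JSON.stringify returns undefined for undefined and functions; map those to null.
    return JSON.stringify(value) ?? 'null';
  }
  if (Array.isArray(value)) {
    return `[${value.map((item) => stableStringify(item)).join(',')}]`;
  }
  const obj = value as Record<string, unknown>;
  const entries = Object.keys(obj)
    .sort()
    .filter((k) => obj[k] !== undefined) // match JSON.stringify, which drops undefined fields
    .map((k) => `${JSON.stringify(k)}:${stableStringify(obj[k])}`);
  return `{${entries.join(',')}}`;
}

// Hex-encoded SHA-256 via Web Crypto.
async function sha256Hex(text: string): Promise<string> {
  const digest = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(text));
  return Array.from(new Uint8Array(digest), (b) => b.toString(16).padStart(2, '0')).join('');
}

async function cacheKeyFor(capability: string, inputs: Record<string, unknown>): Promise<string> {
  return sha256Hex(`${capability}:${stableStringify(inputs)}`);
}

const sameA = stableStringify({ voice: 'en-US', text: 'Hello' });
const sameB = stableStringify({ text: 'Hello', voice: 'en-US' });
```

Here sameA and sameB are the same string, so both input objects hash to the same cache key.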

The savings compound quickly in practice:

TTS: “Hello and welcome to this week’s podcast” costs $0.015 to generate with ElevenLabs. Generated once, cached forever, used in every episode introduction. Cost amortized to effectively zero.
Stock search: A Pexels search for “mountain sunrise” returns the same results within a 24-hour window. On recurring pipelines, caching can eliminate most repeat searches.
AI video: Given a deterministic prompt and seed, Wan2 generates the same clip. A promotional video that uses the same B-roll prompt across 50 renders pays for the generation once.
Structured data: Financial data fetched at pipeline start can be cached for the pipeline’s duration, preventing repeated API calls across steps.

The cache hit ratio is a leading indicator of cost efficiency. A ratio below 50% on a recurring pipeline means inputs are not stable enough for caching to help — which usually means prompts or search terms are being generated dynamically with unnecessary variation (timestamps, session IDs, random seeds).

Real Pricing Data (2026)

Cost-aware routing requires current pricing data. Pricing changes. The registry should store prices with a price_updated_at timestamp and alert when data is stale (>30 days) or when provider price changes exceed a threshold.

Capability	Free Provider	Cheapest Paid	Premium
TTS	Edge TTS ($0.00)	Fish Audio ($0.80/hr)	ElevenLabs ($0.30/1K chars)
AI Video	Local Wan2 ($0.00)	fal.ai ($0.20/clip)	Runway Gen-4 ($0.50/5s)
Stock Footage	Pexels ($0.00)	Pixabay ($0.00)	Storyblocks ($24K/yr)
Render	Local FFmpeg ($0.00)	Shotstack ($0.04/render)	Creatomate (~$0.04/render)
LLM (standard)	Workers AI ($0.00)	Gemini Flash ($0.15/1M tokens)	Claude Sonnet ($3.00/1M tokens)
LLM (reasoning)	OpenRouter free	GPT-4o-mini ($0.15/1M tokens)	Claude Opus ($15.00/1M tokens)
Image generation	—	fal.ai Flux (~$0.003/img)	DALL-E 3 ($0.04/img)

A few observations from this table:

Free providers are production-grade. Edge TTS produces acceptable voice output for most use cases. Pexels and Pixabay have extensive libraries. Local FFmpeg handles render workloads that commercial providers charge $0.04/render for. The free tier is not a fallback of last resort — it is the primary tier.

Stock footage is a category exception. Pexels and Pixabay are both free, both API-accessible, and cover most search queries. Storyblocks at $24K/year is only justifiable at enterprise scale for exclusive or premium footage. Cost-aware routing should prefer Pexels → Pixabay with no paid fallback unless a specific clip is unavailable.

AI video has a steep premium curve. fal.ai at $0.20/clip is a reasonable paid tier. Runway at $0.50/5 seconds implies roughly $0.50-2.00 per typical clip of 5 to 20 seconds, a 2.5–10x premium over fal.ai for quality improvements that may not be visible at 1080p on social platforms.

Pricing data goes stale fast. These figures are from early 2026 and will drift. The registry’s cost_model must be updatable without a deployment. A Durable Object that serves as the authoritative pricing source, updated via admin API, is preferable to hardcoded constants in the router.
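The staleness rule mentioned at the start of this section (alert when a price is more than 30 days old) could be checked with something like the following. PricedProvider and stalePrices are illustrative names, the sample timestamps are made up, and price_updated_at is assumed to be stored as epoch milliseconds.

```typescript
interface PricedProvider {
  provider_id: string;
  cost_cents_per_invocation: number;
  price_updated_at: number; // epoch milliseconds (assumed storage format)
}

const MAX_PRICE_AGE_MS = 30 * 24 * 60 * 60 * 1000; // 30 days

// Return the ids of providers whose recorded price is older than the threshold.
function stalePrices(providers: PricedProvider[], now: number = Date.now()): string[] {
  return providers
    .filter((p) => now - p.price_updated_at > MAX_PRICE_AGE_MS)
    .map((p) => p.provider_id);
}

const checkedAt = Date.UTC(2026, 2, 1); // 1 March 2026
const stale = stalePrices(
  [
    { provider_id: 'fal-ai', cost_cents_per_invocation: 20, price_updated_at: Date.UTC(2026, 0, 10) },
    { provider_id: 'shotstack', cost_cents_per_invocation: 4, price_updated_at: Date.UTC(2026, 1, 20) },
  ],
  checkedAt,
);
```

In this sample only the fal-ai record, last updated 50 days before the check, is reported as stale.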

The Four-Tier Model

The four-tier model from API Mom as Intelligent Router applies directly to capability routing, not just LLM routing. Most capabilities have a free tier, a limited-free tier, a paid-per-call tier, and potentially a subscription tier.

Tier 1 — Free, always available
  Local GPU (Wan2, FFmpeg, Edge TTS)
  Pexels, Pixabay (free API)
  Cloudflare Workers AI (LLM)
  Cost: $0.00

Tier 2 — Free with limits
  OpenRouter free models (rate-limited)
  Cost: $0.00, but may queue or throttle

Tier 3 — Paid per-call
  fal.ai (AI video, image generation)
  ElevenLabs, Fish Audio (TTS)
  Shotstack, Creatomate (render)
  Claude/GPT/Gemini API (LLM)
  Cost: from about $0.003/image up to $15.00/1M tokens, depending on the unit

Tier 4 — Subscription
  Claude Max via local runner (LLM, zero API tokens)
  Storyblocks (stock footage, flat annual rate)
  Cost: Amortized from subscription

Routing priority: Tier 1 → Tier 4 → Tier 3 cheapest → Tier 3 expensive.

Tier 4 is counterintuitive. Subscription capacity is effectively free at the margin: the subscription fee is fixed regardless of usage. LLM work routed through a Claude Max subscription via the local runner costs zero API tokens. This makes Tier 4 preferable to Tier 3 for any workload the subscription can handle.

Budget exhaustion changes the priority: Tier 1 → Tier 2 → Tier 4 → reject. Tier 3 is cut off. Tier 4 remains available because it does not draw from the API budget.
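The two priority orders can be expressed as rank tables, as in the sketch below. Everything here is illustrative: the type and function names, the provider ids, and the cost figures. The normal order stated above does not mention Tier 2, so placing it directly after Tier 1, mirroring the exhausted order, is an assumption.

```typescript
type Tier = 1 | 2 | 3 | 4;

interface TieredProvider {
  provider_id: string;
  tier: Tier;
  cost_cents: number; // per the provider's own unit; only used to order within a tier
}

// Rank of each tier per budget state; lower rank is tried first.
// A tier that is absent from a table is excluded in that state.
const NORMAL_ORDER: Partial<Record<Tier, number>> = { 1: 0, 2: 1, 4: 2, 3: 3 }; // Tier 2 slot is assumed
const EXHAUSTED_ORDER: Partial<Record<Tier, number>> = { 1: 0, 2: 1, 4: 2 };    // Tier 3 is cut off

function routeOrder(providers: TieredProvider[], exhausted: boolean): TieredProvider[] {
  const order = exhausted ? EXHAUSTED_ORDER : NORMAL_ORDER;
  return providers
    .filter((p) => order[p.tier] !== undefined)
    // Sort by tier rank, then cheapest first within a tier.
    .sort((a, b) => order[a.tier]! - order[b.tier]! || a.cost_cents - b.cost_cents);
}

const llmProviders: TieredProvider[] = [
  { provider_id: 'claude-sonnet', tier: 3, cost_cents: 300 },
  { provider_id: 'gemini-flash', tier: 3, cost_cents: 15 },
  { provider_id: 'claude-max-runner', tier: 4, cost_cents: 0 },
  { provider_id: 'openrouter-free', tier: 2, cost_cents: 0 },
  { provider_id: 'workers-ai', tier: 1, cost_cents: 0 },
];

const normalRoute = routeOrder(llmProviders, false).map((p) => p.provider_id);
const exhaustedRoute = routeOrder(llmProviders, true).map((p) => p.provider_id);
```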

Monitoring and Alerting

Cost-aware routing requires cost-aware observability. Four metrics define whether the system is healthy from a cost perspective.

Daily spend vs budget — alert at 80%

The primary safeguard. When global spend crosses 80% of the daily limit, the system should log a structured alert and notify on-call. This gives time to investigate before the constrained mode kicks in at 90%. Budget alerts should include a breakdown by capability and provider so the cause is immediately visible.

interface BudgetAlert {
  type: 'budget_warn' | 'budget_constrained' | 'budget_exhausted';
  global_spent_cents: number;
  global_limit_cents: number;
  pct_consumed: number;
  top_capabilities: Array<{ capability: string; spent_cents: number }>;
  timestamp: number;
}

Cost per pipeline run — alert if 2× average

Each pipeline has an expected cost range. A single run that costs 2× the rolling average indicates an anomaly: a cache miss on a normally-cached step, a routing failure that fell through to an expensive provider, or a new code path that invokes a capability not previously used. Alert on the run record, not the billing statement.
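A check along these lines could run when each run record is written. This is only a sketch: isCostAnomaly is a hypothetical name, the rolling window is whatever list of recent run costs the caller passes in, and treating a cost of exactly 2× the average as an anomaly is an assumption.

```typescript
// Flag a run whose cost is at least `factor` times the average of recent runs.
function isCostAnomaly(recentRunCosts: number[], currentCost: number, factor = 2): boolean {
  if (recentRunCosts.length === 0) return false; // no baseline yet
  const avg = recentRunCosts.reduce((sum, c) => sum + c, 0) / recentRunCosts.length;
  // The currentCost > 0 guard keeps an all-free pipeline from alerting on a free run.
  return currentCost > 0 && currentCost >= avg * factor;
}

const recentCents = [40, 45, 44]; // average 43 cents
const spike = isCostAnomaly(recentCents, 90);  // 90 >= 86
const normal = isCostAnomaly(recentCents, 80); // 80 < 86
```

A pipeline that normally runs free (average 0) is flagged by any non-zero run, which matches the "fell through to an expensive provider" case described above.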

Cache hit ratio — alert if below 50%

A cache hit ratio below 50% on a recurring pipeline is a signal that inputs are varying unnecessarily. The most common causes: prompts include a timestamp or UUID, search queries include a session ID, random seeds are not pinned. A low cache ratio means you are paying for results you have already computed. Alert and investigate inputs.
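The ratio can be computed directly from the cache_hits and cache_misses counters on recent run records. The sketch below uses hypothetical names (RunCacheStats, cacheHitRatio, shouldAlertOnCacheRatio) and returns null when there is no data, so that an idle pipeline is not reported as a 0% hit rate.

```typescript
interface RunCacheStats {
  cache_hits: number;
  cache_misses: number;
}

// Hit ratio across a set of runs, or null when no cacheable invocations occurred.
function cacheHitRatio(runs: RunCacheStats[]): number | null {
  let hits = 0;
  let total = 0;
  for (const r of runs) {
    hits += r.cache_hits;
    total += r.cache_hits + r.cache_misses;
  }
  return total === 0 ? null : hits / total;
}

function shouldAlertOnCacheRatio(runs: RunCacheStats[], threshold = 0.5): boolean {
  const ratio = cacheHitRatio(runs);
  return ratio !== null && ratio < threshold;
}

const lastRuns: RunCacheStats[] = [
  { cache_hits: 1, cache_misses: 3 },
  { cache_hits: 2, cache_misses: 2 },
];
const ratio = cacheHitRatio(lastRuns);           // 3 hits out of 8 lookups
const alert = shouldAlertOnCacheRatio(lastRuns); // below the 50% threshold
```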

Provider cost drift — alert if price changes

The registry stores cost_cents_per_invocation per provider. When the router fetches this value and it differs from the value recorded in the last invocation result, flag the delta. Provider price changes are common and rarely announced prominently. Detecting them automatically means you discover a 50% price increase the day it takes effect, not the day the invoice arrives.

async function detectPriceDrift(
  provider_id: string,
  expected_cents: number,
  actual_cents: number,
): Promise<void> {
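  // A free provider (expected 0) that starts charging always counts as drift;
  // handling it separately also avoids dividing by zero.
  const drift =
    expected_cents === 0
      ? (actual_cents > 0 ? Infinity : 0)
      : Math.abs(actual_cents - expected_cents) / expected_cents;
  if (drift > 0.1) {
    // More than 10% drift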
    await alertChannel.send({
      type: 'provider_price_drift',
      provider_id,
      expected_cents,
      actual_cents,
      drift_pct: (drift * 100).toFixed(1),
    });
    // Update registry with actual price
    await registry.updateProviderCost(provider_id, actual_cents);
  }
}

These four metrics — daily spend, per-run cost, cache ratio, price drift — constitute a minimum viable cost observability system. Taken together, they eliminate the conditions that produce surprise invoices: invisible accumulation, cache failures, routing anomalies, and silent price changes.

References

API Mom as Intelligent Router — Four-tier routing model, cost-aware LLM dispatch, degradation algorithm
Cloudflare Durable Objects Patterns — Pattern 3: Budget Governor; single-writer budget accounting
Lunar.dev, State of API Consumption 2024 — 36% of engineers spend more time troubleshooting APIs than building; cost attribution gap in enterprise API usage
fal.ai pricing — AI video and image generation per-invocation costs
ElevenLabs pricing — TTS per-character costs
Shotstack pricing — Video render per-credit pricing
OpenRouter models — Free model catalog, updated continuously