Gary Wu

Why Every Capability Should Assume 80% Reliability


No media processing service is reliable enough to call once and trust. The architecture that survives production is not the one that expects everything to work — it is the one that expects everything to fail and handles it gracefully. Build for 80% reliability, instrument everything, retry selectively, and keep a fallback chain for every capability that matters.



1. The Uncomfortable Truth

Here is the data, collected from real production pipelines from 2024 through early 2026:

| Capability | Observed failure mode | Effective reliability |
| --- | --- | --- |
| OpenAI TTS (tts-1) | Truncates output to 50–60% of intended content on long inputs | ~60% |
| ElevenLabs | Non-deterministic output; unauthorized billing charges exceeding $2K; account lockouts without warning | ~70% |
| Edge TTS | Long text truncation; audio degradation after extended continuous use | ~80% |
| FFmpeg | 8 GB memory consumed in under 60 seconds on corrupted input streams | Dependent on input validation |
| fal.ai (Wan2) | Internal server errors; generation timeouts with no partial output | ~75% |
| Whisper | Timing drift begins at 20+ minute audio; transcripts desync from media | ~85% for short audio |
| Pexels API | Hard 200 req/hr limit; burst requests trigger 429 with no header warning | Rate-dependent |
| Cloudflare R2 | March 2025 incident: 100% write failure for over one hour | 99.9%+ (but incidents happen) |
| Sora | Entire service shut down March 24, 2026 | 0% (service terminated) |

These are not edge cases. They are the normal operating conditions of production media pipelines. Every capability in that table was relied upon by pipelines that expected it to work. Every one of them failed in ways that were not recoverable without deliberate architectural preparation.

The OpenAI TTS truncation issue is particularly instructive. The API returns HTTP 200. The response contains audio. A naive validator sees success. A duration-aware validator catches that a 500-word narration runs about 100 seconds when, at roughly 150 words per minute, it should run about 200, and the silent truncation is exposed. Without that check, the pipeline ships a video where the narration stops halfway through and the remaining footage plays in silence.

No one filed a support ticket. The model just stopped generating audio after hitting an internal length boundary, returned what it had, and called it done.

This is the shape of production failure in media pipelines: not crashes, not 500 errors, not connection timeouts — silent partial success. The call completes. The output is wrong. Only validation catches it.


2. The Design Principle

There is a simple asymmetry in how architectural assumptions fail:

If you build for 100% reliability and get 80%, your system crashes. Pipelines stall. Steps that depend on the failed capability time out. Without retry logic, a single transient error kills the entire job. Without fallbacks, a provider outage kills every job until it recovers.

If you build for 80% reliability and get 95%, your system is graceful. Retries fire and succeed on the first or second attempt. Fallbacks are available but unused. Monitoring shows a healthy retry rate. The pipeline operator sees nothing alarming.

The two approaches also bill you at different times. Building for 80% reliability costs engineering time upfront: validation logic, retry policies, fallback chains, monitoring dashboards. Building for 100% reliability costs nothing upfront and costs catastrophically later, at a moment you cannot control.

The principle: always design for the worst case you can plausibly expect, not the average case you would prefer.

For a capability like TTS, that means assuming any given call may produce truncated, silent, or distorted output. The architecture that survives this assumption is not complex — it is just validation, retry, and fallback applied consistently. The architecture that ignores this assumption looks simpler right up until the moment it fails in a way that cannot be patched without a redesign.

The tight loop article makes a related point about systems: “Don’t fix the bug. Fix the system that let the bug live.” In capability reliability terms, this translates directly: don’t fix the truncated audio file. Fix the system that allowed a truncated audio file to make it past the TTS step without detection.


3. Validation at Every Boundary

Every capability output must be validated before the pipeline advances to the next step. The validation is not optional and is not delegated to downstream steps — it happens at the boundary, immediately after the capability completes, before anything else runs.

The pattern:

import { promises as fs } from "node:fs";

interface ValidationResult {
  valid: boolean;
  reason?: string;
  measured?: Record<string, number | string>;
}

async function validateTTSOutput(
  outputPath: string,
  inputText: string
): Promise<ValidationResult> {
  // stat() rejects when the file is missing; map that to null
  const stats = await fs.stat(outputPath).catch(() => null);
  if (!stats) return { valid: false, reason: "output file missing" };
  if (stats.size === 0) return { valid: false, reason: "output file is empty" };

  const duration = await getAudioDuration(outputPath); // ffprobe
  const expectedDuration = estimateTTSDuration(inputText); // ~150 words/min
  // Guard against division by zero when there is no text to narrate
  if (expectedDuration <= 0) return { valid: false, reason: "input text is empty" };
  const ratio = duration / expectedDuration;

  // Below 75% of the expected length, assume the provider truncated the audio
  if (ratio < 0.75) {
    return {
      valid: false,
      reason: "audio duration too short, likely truncated",
      measured: { duration, expectedDuration, ratio }
    };
  }

  return { valid: true, measured: { duration, expectedDuration, ratio } };
}
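The validator above calls estimateTTSDuration without defining it. One minimal way to fill it in, assuming the roughly 150 words per minute mentioned in the code comment. The function body, the wordsPerMinute parameter, and the looksTruncated helper are illustrative assumptions, not the pipeline's actual implementation:

```typescript
// Hypothetical helper: estimate narration length in seconds from word count.
// The 150 words/min default comes from the comment in validateTTSOutput;
// real speaking rates vary by voice and provider.
function estimateTTSDuration(text: string, wordsPerMinute = 150): number {
  const words = text.trim().split(/\s+/).filter(w => w.length > 0).length;
  return (words / wordsPerMinute) * 60;
}

// Hypothetical ratio check mirroring the 0.75 threshold in validateTTSOutput.
function looksTruncated(actualSec: number, text: string, minRatio = 0.75): boolean {
  const expectedSec = estimateTTSDuration(text);
  if (expectedSec === 0) return false; // nothing to narrate, nothing to truncate
  return actualSec / expectedSec < minRatio;
}
```

A word-count estimate is crude, which is why the threshold sits at 0.75 rather than close to 1.0: it has to tolerate normal variation in pacing while still catching output that stops halfway.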

Validation requirements by capability type:

TTS (text-to-speech)

FFmpeg render

async function validateFFmpegOutput(
  outputPath: string,
  expectedDurationSec: number,
  expectedResolution: { width: number; height: number }
): Promise<ValidationResult> {
  const probe = await ffprobe(outputPath);
  if (!probe) return { valid: false, reason: "ffprobe failed — file unreadable" };

  const video = probe.streams.find(s => s.codec_type === "video");
  if (!video) return { valid: false, reason: "no video stream in output" };

  const duration = parseFloat(probe.format.duration ?? "0");
  if (duration < 0.5) return { valid: false, reason: "duration near zero" };

  const durationRatio = duration / expectedDurationSec;
  if (durationRatio < 0.9 || durationRatio > 1.1) {
    return { valid: false, reason: "duration out of expected range", measured: { duration, expectedDurationSec } };
  }

  if (video.width !== expectedResolution.width || video.height !== expectedResolution.height) {
    return { valid: false, reason: "resolution mismatch", measured: { width: video.width, height: video.height } };
  }

  return { valid: true };
}

AI video generation

Stock search
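For stock search, a boundary check could be as small as the sketch below. The StockResult shape and the thresholds passed in opts are assumptions made for illustration, not a documented provider schema:

```typescript
interface ValidationResult {
  valid: boolean;
  reason?: string;
  measured?: Record<string, number | string>;
}

// Hypothetical result shape; real stock APIs return richer objects.
interface StockResult {
  url: string;
  durationSec: number;
  width: number;
}

// Counts results that are actually usable, then compares against a minimum.
function validateStockResults(
  results: StockResult[],
  opts: { minCount: number; minDurationSec: number; minWidth: number }
): ValidationResult {
  const usable = results.filter(
    r =>
      r.url.startsWith("https://") &&
      r.durationSec >= opts.minDurationSec &&
      r.width >= opts.minWidth
  );
  const measured = { returned: results.length, usable: usable.length };
  if (usable.length < opts.minCount) {
    return { valid: false, reason: "too few usable stock results", measured };
  }
  return { valid: true, measured };
}
```

The point is the same as for TTS: a 200 response with a non-empty array is not success until the items in it can actually be used by the next step.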

The validation layer is cheap. A full validation pass on a rendered video via ffprobe takes under 100 milliseconds. The cost of skipping it is a downstream step that silently processes corrupted input and produces corrupted output — compounding the damage across every step that follows.


4. Retry Strategies Per Capability Type

Not all capabilities are safe to retry in the same way. Treating them uniformly leads to either missed recovery opportunities or unintended side effects.

Idempotent capabilities — TTS, render, thumbnail generation

Same input always produces equivalent output (or close enough). Safe to retry without concern. The risk is cost and latency, not correctness.

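const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

async function retryIdempotent<T>(
  fn: () => Promise<T>,
  // validate may be async, since validators such as validateTTSOutput await ffprobe
  validate: (result: T) => ValidationResult | Promise<ValidationResult>,
  maxAttempts = 3
): Promise<T> {
  let lastReason = "unknown";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const result = await fn();
      const validation = await validate(result);
      if (validation.valid) return result;
      lastReason = validation.reason ?? "validation failed";
    } catch (err) {
      // A thrown error counts as a failed attempt, same as a failed validation
      lastReason = err instanceof Error ? err.message : String(err);
    }

    if (attempt < maxAttempts) {
      const backoffMs = Math.pow(2, attempt) * 1000; // 2s, then 4s with the default 3 attempts
      await sleep(backoffMs);
    }
  }
  throw new Error(`Failed after ${maxAttempts} attempts: ${lastReason}`);
}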

Non-idempotent capabilities — AI video generation, image generation

Same input does not produce the same output. Each retry generates different content. This is sometimes acceptable (any valid video is fine) and sometimes not (the pipeline is expecting a specific shot that matched a script segment). The pipeline author must specify the policy explicitly — do not inherit a default.

type NonIdempotentRetryPolicy =
  | { mode: "any-valid"; maxAttempts: number }      // retry freely, accept any valid output
  | { mode: "fail-fast" }                            // one attempt only
  | { mode: "budget-gated"; budgetUsd: number };     // retry only if budget allows
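A sketch of how a runner might consult that policy before making another attempt. The mayRetry function and its bookkeeping arguments (attemptsMade, spentUsd, costPerAttemptUsd) are hypothetical names introduced here:

```typescript
type NonIdempotentRetryPolicy =
  | { mode: "any-valid"; maxAttempts: number }
  | { mode: "fail-fast" }
  | { mode: "budget-gated"; budgetUsd: number };

// Decide whether one more attempt is allowed under the declared policy.
// attemptsMade counts calls already made for this step.
function mayRetry(
  policy: NonIdempotentRetryPolicy,
  attemptsMade: number,
  spentUsd: number,
  costPerAttemptUsd: number
): boolean {
  switch (policy.mode) {
    case "any-valid":
      return attemptsMade < policy.maxAttempts;
    case "fail-fast":
      return false;
    case "budget-gated":
      // Retry only if the next attempt still fits inside the budget
      return spentUsd + costPerAttemptUsd <= policy.budgetUsd;
  }
}
```

Because the policy is a discriminated union, adding a fourth mode later makes the switch fail to compile until the new case is handled, which is one way to enforce "do not inherit a default".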

Rate-limited capabilities — Pexels, fal.ai, stock APIs

Retrying immediately after a 429 makes things worse. The correct behavior: read the Retry-After header or X-RateLimit-Reset header, wait until the rate limit window resets, then retry.

// Assumed error shape: the API client throws an object carrying the HTTP
// status and the response headers. sleep and getRateLimitResetMs are helpers
// defined elsewhere.
interface RateLimitError {
  status: number;
  headers: Headers;
}

async function retryRateLimited<T>(
  fn: () => Promise<T>,
  maxAttempts = 3
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (e) {
      const err = e as Partial<RateLimitError>;
      // Only 429 is retried here; every other error propagates immediately
      if (err.status !== 429 || !err.headers) throw e;
      if (attempt === maxAttempts) break;

      // Wait until the window resets (Retry-After or X-RateLimit-Reset)
      const resetMs = getRateLimitResetMs(err.headers);
      await sleep(resetMs);
    }
  }
  throw new Error("Rate limit retry exhausted");
}
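The retry helper above leans on getRateLimitResetMs, which is not shown. A possible shape for it follows, under two loudly stated assumptions: Retry-After carries a delay in seconds, and X-RateLimit-Reset carries a UNIX timestamp in seconds. Providers differ on the second point (some send seconds remaining instead), so check the provider's documentation before relying on this. The fallback and cap values are also illustrative:

```typescript
// Minimal header accessor so the sketch works with fetch's Headers or a stub.
interface HeaderBag {
  get(name: string): string | null;
}

// Parse a non-negative number of seconds; anything else yields null.
// An HTTP-date in Retry-After is not handled and also yields null.
function parseSeconds(value: string | null): number | null {
  if (value === null || value.trim() === "") return null;
  const n = Number(value);
  return Number.isFinite(n) && n >= 0 ? n : null;
}

function getRateLimitResetMs(
  headers: HeaderBag,
  nowMs: number = Date.now(),
  fallbackMs = 60000,      // used when neither header is usable
  maxWaitMs = 15 * 60000   // never sleep longer than this
): number {
  // Retry-After takes precedence when present
  const retryAfterSec = parseSeconds(headers.get("Retry-After"));
  if (retryAfterSec !== null) return Math.min(retryAfterSec * 1000, maxWaitMs);

  // Assumption: value is a UNIX timestamp in seconds
  const resetEpochSec = parseSeconds(headers.get("X-RateLimit-Reset"));
  if (resetEpochSec !== null) {
    const waitMs = Math.max(0, resetEpochSec * 1000 - nowMs);
    return Math.min(waitMs, maxWaitMs);
  }
  return fallbackMs;
}
```

The cap matters: a reset timestamp far in the future should turn into a fallback to another provider, not a pipeline that sleeps for hours.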

Expensive capabilities — Runway, Kling, high-cost video generation

These cost real money per attempt. Auto-retry is not the right default. The pipeline must explicitly authorize a retry — either via a budget parameter or via a human-in-the-loop escalation.

// Assumed to be defined elsewhere in the pipeline codebase
declare const COST_PER_ATTEMPT: number; // USD per invocation
declare function escalateToHuman(req: { reason?: string; costIfRetried: number }): Promise<void>;

async function runExpensiveCapability<T>(
  fn: () => Promise<T>,
  validate: (result: T) => ValidationResult,
  options: { maxCostUsd: number; currentCostUsd: number }
): Promise<T> {
  // Exactly one paid attempt; nothing below retries automatically
  const result = await fn();
  const validation = validate(result);
  if (validation.valid) return result;

  const retryWouldExceedBudget = options.currentCostUsd + COST_PER_ATTEMPT > options.maxCostUsd;
  if (retryWouldExceedBudget) {
    throw new Error(`Validation failed and retry would exceed budget. Failing step. Reason: ${validation.reason}`);
  }

  // Budget allows a retry, but a human must authorize it first
  await escalateToHuman({ reason: validation.reason, costIfRetried: COST_PER_ATTEMPT });
  throw new Error("Escalated to human; pipeline paused");
}

5. Fallback Chains

When a primary capability fails validation after all retries are exhausted, the next option is not failure — it is the next provider in the fallback chain. A fallback chain is a prioritized list of providers that can satisfy the same capability, ordered from preferred (highest quality, potentially highest cost) to most available (lowest quality, always available).

TTS fallback chain:

text-to-speech:
  Try:        ElevenLabs (best quality, highest cost, non-idempotent)
  Fallback 1: OpenAI TTS (good quality, idempotent)
  Fallback 2: Edge TTS (acceptable quality, free, always available)
  Terminal:   fail the step with structured error

AI video fallback chain:

ai-video-generation:
  Try:        fal.ai Wan2 (fast, affordable)
  Fallback 1: Kling (slower, higher quality)
  Fallback 2: Runway (premium, expensive)
  Terminal:   fail with escalation to human

Stock footage fallback chain:

stock-search:
  Try:        Pexels (free tier, 200/hr)
  Fallback 1: Pixabay (free, different catalog)
  Fallback 2: Storyblocks (subscription, large catalog)
  Terminal:   return empty results, pipeline handles gracefully

API Mom’s CapabilityRegistryDO manages these chains automatically. The pipeline YAML declares text-to-speech as a capability dependency — not elevenlabs. API Mom resolves the capability to the highest-priority available provider, attempts the call, validates the result using the declared validation schema, and cascades to the next provider on failure. The pipeline code never sees provider names.

// Pipeline code — capability-aware, not provider-aware
const audio = await capabilityRegistry.invoke("text-to-speech", {
  text: scriptSegment,
  voice: "narrator",
  validate: {
    minDurationRatio: 0.75,
    maxDurationRatio: 1.30
  }
});

// API Mom internals — provider resolution, retry, fallback
// The pipeline author does not write this

The separation matters. When ElevenLabs locks out your account or charges unauthorized fees, the pipeline does not need to change. API Mom marks ElevenLabs as unhealthy, and all future text-to-speech requests route to OpenAI TTS until ElevenLabs is restored or removed from the chain.
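To make the cascade concrete, here is a minimal sketch of a chain executor. It is not API Mom's internals, which are not shown in this article. The Provider interface, the healthy flag, and invokeWithFallback are hypothetical names, and per-provider retries are left out to keep the example short:

```typescript
interface ValidationResult {
  valid: boolean;
  reason?: string;
}

// Hypothetical provider contract: same input and output type across the chain.
interface Provider<I, O> {
  name: string;
  healthy: boolean;
  call(input: I): Promise<O>;
}

interface Attempt {
  provider: string;
  ok: boolean;
  reason?: string;
}

// Try providers in priority order; return the first output that validates.
async function invokeWithFallback<I, O>(
  chain: Provider<I, O>[],
  input: I,
  validate: (output: O) => ValidationResult
): Promise<{ output: O; provider: string; attempts: Attempt[] }> {
  const attempts: Attempt[] = [];
  for (const provider of chain) {
    if (!provider.healthy) {
      attempts.push({ provider: provider.name, ok: false, reason: "marked unhealthy" });
      continue;
    }
    try {
      const output = await provider.call(input);
      const result = validate(output);
      if (result.valid) {
        attempts.push({ provider: provider.name, ok: true });
        return { output, provider: provider.name, attempts };
      }
      attempts.push({ provider: provider.name, ok: false, reason: result.reason });
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      attempts.push({ provider: provider.name, ok: false, reason });
    }
  }
  // Terminal: structured error listing what was tried and why each failed
  throw new Error(`All providers failed: ${JSON.stringify(attempts)}`);
}
```

The attempts array doubles as the raw material for the fallback-rate metric: every entry before the successful one is a provider that was skipped or failed validation.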


6. The Sora Lesson

On March 24, 2026, OpenAI shut down the Sora service.

Not a degradation. Not a rate limit. Not a temporary outage. The entire service ceased to exist.

Any pipeline that had hardcoded sora as its video generation provider stopped working permanently. The job would submit, the provider call would fail, and without a fallback chain the pipeline would fail with it. Even a pipeline with a fallback chain could stall if the chain advanced only on errors it classified as transient: this error was permanent, so logic keyed to transient failures would never hand off to the next provider.

The lesson is not specific to Sora. It applies to any third-party capability:

Platforms can disappear overnight. A startup goes bankrupt. An API is deprecated in favor of a newer model. A service pivots away from the market you depend on. The gap between “working yesterday” and “gone today” can be a single business decision by a company whose priorities are not yours.

The capability architecture is the protection. The pipeline YAML references ai-video-generation. API Mom routes to whatever provider is alive. When Sora disappears from the registry, all pipelines that referenced ai-video-generation continue working — routed automatically to the next available provider. The only change is a single line in the provider registry: marking Sora as permanently unavailable.

# pipeline.yaml — what you write
steps:
  - capability: ai-video-generation
    params:
      duration: 5
      style: cinematic

# What you do NOT write
steps:
  - provider: sora  # <-- this breaks permanently on March 24, 2026
    endpoint: https://sora.openai.com/...

The capability contract also protects against the inverse: a new, better provider becomes available. The pipeline automatically benefits without any changes. The abstraction layer that feels like overhead during early development becomes a compounding structural advantage over the life of the pipeline.
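A toy registry shows why retiring a provider is a one-line change. This is an illustrative sketch, not the CapabilityRegistryDO implementation; the class, the status values, and the provider identifiers are assumptions:

```typescript
type ProviderStatus = "healthy" | "unhealthy" | "retired";

interface ProviderEntry {
  name: string;
  priority: number; // lower number = tried first
  status: ProviderStatus;
}

class CapabilityRegistry {
  private chains = new Map<string, ProviderEntry[]>();

  register(capability: string, entry: ProviderEntry): void {
    const chain = this.chains.get(capability) ?? [];
    chain.push(entry);
    chain.sort((a, b) => a.priority - b.priority);
    this.chains.set(capability, chain);
  }

  setStatus(capability: string, name: string, status: ProviderStatus): void {
    const entry = this.chains.get(capability)?.find(p => p.name === name);
    if (!entry) throw new Error(`unknown provider ${name} for ${capability}`);
    entry.status = status;
  }

  // Provider names in the order they should be tried; pipelines never see these.
  resolve(capability: string): string[] {
    return (this.chains.get(capability) ?? [])
      .filter(p => p.status === "healthy")
      .map(p => p.name);
  }
}

const registry = new CapabilityRegistry();
registry.register("ai-video-generation", { name: "sora", priority: 0, status: "healthy" });
registry.register("ai-video-generation", { name: "fal-wan2", priority: 1, status: "healthy" });
registry.register("ai-video-generation", { name: "kling", priority: 2, status: "healthy" });

// The single change when a service shuts down:
registry.setStatus("ai-video-generation", "sora", "retired");
```

Adding a better provider later is the mirror image: one register call with a lower priority number, and every pipeline that declares the capability picks it up.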


7. Memory as a Weapon

FFmpeg processing a corrupted input stream can consume 8 GB of memory in under 60 seconds. This is not a theoretical case — it is a documented behavior when certain container formats have corrupted or missing duration metadata, causing FFmpeg to buffer indefinitely while trying to determine stream length before beginning processing.

On a machine running multiple concurrent pipeline jobs, this behavior can exhaust all available memory, trigger the OOM killer, and terminate unrelated processes. The affected pipeline fails. Every other pipeline on the machine may also fail or be corrupted.

Memory, in this context, is a weapon held by the input. Any input from an external source — a user upload, a downloaded file from a third-party API, a generated file from an AI video service — is potentially corrupted. The capability script must treat it as such.

The required defensive protocol for any FFmpeg capability:

#!/bin/bash
set -euo pipefail

INPUT="$1"
OUTPUT="$2"

# 1. Validate input before any processing
# Per-run probe file: a fixed /tmp path would collide across concurrent jobs
PROBE_JSON=$(mktemp)
trap 'rm -f "$PROBE_JSON"' EXIT

if ! ffprobe -v error -select_streams v:0 \
    -show_entries stream=codec_type,duration \
    -of json "$INPUT" > "$PROBE_JSON" 2>/dev/null; then
  echo '{"error": "input_unreadable"}' >&2
  exit 1
fi

DURATION=$(jq -r '.streams[0].duration // "unknown"' "$PROBE_JSON")
if [ "$DURATION" = "unknown" ] || [ "$DURATION" = "N/A" ]; then
  echo '{"error": "missing_duration_metadata"}' >&2
  exit 1
fi

# 2. Set memory limit (500 MB for this capability)
ulimit -v $((500 * 1024))  # 500 MB virtual memory limit

# 3. Set hard timeout — kill after 5 minutes regardless
timeout 300 ffmpeg \
  -i "$INPUT" \
  -c:v libx264 -preset fast \
  -c:a aac \
  -movflags +faststart \
  "$OUTPUT" \
  || { echo '{"error": "ffmpeg_failed_or_timeout"}' >&2; exit 1; }

# 4. Validate output
if [ ! -s "$OUTPUT" ]; then
  echo '{"error": "output_empty"}' >&2
  exit 1
fi

The SIGTERM handler for temp file cleanup:

TEMP_DIR=$(mktemp -d)
cleanup() {
  rm -rf "$TEMP_DIR"
}
trap cleanup EXIT SIGTERM SIGINT

This is not defensive programming. It is correct programming for an environment where inputs are not trusted. The validation before processing is the most important step — it avoids the corrupted-stream memory consumption entirely by refusing to process inputs that lack valid metadata.

The timeout is the backstop. If validation passes but processing encounters a pathological case, the hard 5-minute kill prevents runaway resource consumption. Five minutes is generous for any media processing job that should complete in under 90 seconds. A job that takes longer is either processing input too large for this capability, or it is stuck.
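When the pre-check runs from pipeline code rather than from the shell script, the same gate can be applied to ffprobe's JSON output. The sketch below assumes the JSON comes from the ffprobe invocation shown in the script (-of json with -show_entries stream=codec_type,duration). The checkProbeJson function is hypothetical, and the no_video_stream error code is an addition not present in the script:

```typescript
interface ProbeCheck {
  ok: boolean;
  error?: string;
  durationSec?: number;
}

interface ProbeStream {
  codec_type?: string;
  duration?: string; // ffprobe prints durations as strings
}

// Refuse any input whose video stream lacks a finite, positive duration.
function checkProbeJson(jsonText: string): ProbeCheck {
  let parsed: unknown;
  try {
    parsed = JSON.parse(jsonText);
  } catch {
    return { ok: false, error: "input_unreadable" };
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { ok: false, error: "input_unreadable" };
  }

  const raw = (parsed as { streams?: unknown }).streams;
  const streams: ProbeStream[] = Array.isArray(raw) ? raw : [];
  const video = streams.find(s => s.codec_type === "video");
  if (!video) return { ok: false, error: "no_video_stream" };

  // Missing field and "N/A" both end up as NaN and are rejected
  const durationSec = Number(video.duration);
  if (video.duration === undefined || !Number.isFinite(durationSec) || durationSec <= 0) {
    return { ok: false, error: "missing_duration_metadata" };
  }
  return { ok: true, durationSec };
}
```

Keeping the check as a pure function over the JSON text means it can be unit-tested without FFmpeg installed, which is where most of the pathological cases get pinned down.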


8. ElevenLabs Horror Stories

ElevenLabs is a capable TTS provider with demonstrably superior voice quality compared to alternatives. It is also the provider that has produced the most severe production incidents in real pipelines.

Incident 1: Unauthorized billing

A pipeline running overnight generated significantly more API calls than planned due to a retry loop bug. ElevenLabs has no rate limit on the free tier for the operations that were called. The billing exceeded $2,000 in a single night. The account was charged before any alert could fire.

Incident 2: Account lockout

Following unusual billing patterns, ElevenLabs locked the account without prior notice. All pipelines using ElevenLabs as their primary TTS provider stopped working. No programmatic notification was sent; the failure was discovered when pipeline operators noticed audio missing from generated content.

Incident 3: Non-deterministic output

The same script, sent twice with identical parameters including voice ID and stability settings, produced measurably different output files. Duration differed by 8%. Pronunciation of specific words differed. This is expected behavior for neural TTS, but pipelines that relied on consistent output for downstream processing (audio-video synchronization, transcription verification) failed in ways that were hard to trace.

The mitigations, applied now:

Never depend on a single TTS provider. ElevenLabs sits at position 1 in the fallback chain. OpenAI TTS is position 2. Edge TTS is position 3. If ElevenLabs fails validation after retries, the chain advances automatically.

Monitor billing programmatically. ElevenLabs provides a usage API. Poll it on every pipeline run, or on a schedule. Alert if the per-hour spend rate exceeds a threshold.

async function checkElevenLabsBilling(): Promise<void> {
  const usage = await elevenlabs.getUsageStats({ period: "1h" });
  if (usage.charactersUsed > CHAR_LIMIT_PER_HOUR) {
    await alert({
      severity: "critical",
      message: `ElevenLabs usage exceeding threshold: ${usage.charactersUsed} chars/hr`,
      action: "pause_pipelines_using_elevenlabs"
    });
  }
}

Validate output consistency. For idempotent use cases where the same voice and text are expected to produce comparable output, validate that successive calls produce audio within an acceptable duration range of each other. A ratio outside 0.85–1.15 between two calls for the same input indicates unexpected non-determinism.
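The consistency check reduces to a few lines. The function name is hypothetical; the 0.85 and 1.15 bounds are the ones stated above:

```typescript
// Compare the durations of two renders of the same text and voice.
// Returns false when the ratio falls outside the accepted band.
function durationsConsistent(
  firstSec: number,
  secondSec: number,
  minRatio = 0.85,
  maxRatio = 1.15
): boolean {
  if (firstSec <= 0 || secondSec <= 0) return false; // empty audio is never consistent
  const ratio = firstSec / secondSec;
  return ratio >= minRatio && ratio <= maxRatio;
}
```

Under these bounds, the 8% duration difference from Incident 3 would pass. The check is meant to catch gross divergence such as a truncated second render, not ordinary neural TTS variation.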

Keep Edge TTS as a free, always-available fallback. Edge TTS (Microsoft Edge's online text-to-speech service, accessed via the edge-tts CLI) has no cost per character, no account lockout risk, and no billing surprises. Its quality is acceptable for most production use cases. It is not the first choice, but the first choice does not matter if the fallback is not there when needed.


9. The Compound Effect of Unreliability

Consider a pipeline with five sequential steps. Each step depends on the previous. No retry logic, no fallback chains.

If each step is 90% reliable, the pipeline success rate is:

0.9 × 0.9 × 0.9 × 0.9 × 0.9 = 0.59

A pipeline where each step works nine times out of ten will succeed less than six times out of ten. Forty-one percent failure rate, with no intervention, on a system where every individual component looks healthy.

The math is unforgiving at scale. A pipeline that runs 100 times per day generates 41 failures per day — each requiring investigation, resubmission, or manual remediation.

With retry (2 attempts per step, same provider):

Each step’s effective success rate rises from 90% to approximately:

P(step succeeds) = 1 - P(both attempts fail)
                 = 1 - (0.1 × 0.1)
                 = 1 - 0.01
                 = 0.99

Pipeline success rate:

0.99^5 = 0.95

From 59% to 95% — a 36-point improvement from a single retry per step.

With retry plus fallback chain:

A fallback chain catches failures that retries cannot — provider outages, rate limits, validation failures that indicate a systematic issue with the primary provider rather than a transient error.

If the fallback provider has 90% reliability for cases that the primary failed, the effective per-step success rate rises further:

P(step succeeds) = P(primary succeeds) + P(primary fails) × P(fallback succeeds)
                 = 0.99 + 0.01 × 0.90
                 = 0.999

Pipeline success rate:

0.999^5 ≈ 0.995

From 59% to 99.5%. The same five-step pipeline, no changes to the underlying capability providers, no improvements to their individual reliability — just retry logic and fallback chains applied consistently.

| Configuration | Per-step reliability | 5-step pipeline success |
| --- | --- | --- |
| No retry, no fallback | 90% | 59% |
| 2 attempts, no fallback | 99% | 95% |
| 2 attempts + fallback chain | 99.9% | 99.5% |
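The arithmetic behind the table fits in three small functions, which makes it easy to plug in your own measured rates. The function names are illustrative. The model assumes attempts fail independently, which holds for transient errors but not for systematic ones such as truncation on every long input:

```typescript
// Probability that at least one of `attempts` independent tries succeeds.
function withRetries(pSuccess: number, attempts: number): number {
  return 1 - Math.pow(1 - pSuccess, attempts);
}

// Primary succeeds, or primary fails and the fallback succeeds.
function withFallback(pPrimary: number, pFallback: number): number {
  return pPrimary + (1 - pPrimary) * pFallback;
}

// All sequential steps must succeed.
function pipelineSuccess(perStep: number, steps: number): number {
  return Math.pow(perStep, steps);
}

const baseline = pipelineSuccess(0.9, 5);                // about 0.59
const retried = pipelineSuccess(withRetries(0.9, 2), 5); // about 0.95
const full = pipelineSuccess(withFallback(withRetries(0.9, 2), 0.9), 5); // about 0.995
```

When failures are correlated, for example a provider that truncates every long input, retries against the same provider add little and the fallback term carries most of the improvement.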

This is why retry plus fallback at every step is non-negotiable — not as a best practice, but as a mathematical requirement for operating a multi-step pipeline with acceptable success rates.

The same math is harsher at lower baselines. A pipeline where each step has an 80% individual success rate (realistic for AI video generation + TTS + stock search combined) has a baseline success rate of 0.8^5 ≈ 33%. Without retry and fallback, one in three pipeline runs completes. With two attempts per step and a 90%-reliable fallback, per-step success rises to 0.96 + 0.04 × 0.90 = 0.996, and the same pipeline completes about 98% of the time.

The engineering cost of retry and fallback chains is fixed. The operational cost of not having them scales linearly with pipeline volume.


10. Monitoring What Matters

The monitoring layer must be specific enough to trigger on real problems and quiet enough to avoid alert fatigue. Four metrics cover the essential signal for capability reliability:

Per-capability success rate

Track success and failure for every capability independently. Do not roll up into a single pipeline success metric — that obscures which capability is degrading.

Alert threshold: success rate drops below 80% over a 15-minute window.

interface CapabilityMetric {
  capabilityName: string;
  provider: string;
  timestamp: number;
  success: boolean;
  latencyMs: number;
  validationResult?: ValidationResult;
  retryCount: number;
  usedFallback: boolean;
}

// Emit after every capability invocation
await metrics.emit(capabilityMetric);

Per-capability latency P99

Mean latency is misleading. P99 is the latency that 99% of calls meet or beat, so it captures the long tail that burns pipeline throughput. A TTS call that averages 800ms but has a P99 of 45 seconds indicates a systematic problem (probably timeouts hitting, being retried, succeeding on retry) that the mean hides.

Alert threshold: P99 doubles compared to 24-hour baseline.
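A small sketch of the P99 check in application code, using the nearest-rank method. The function names are illustrative, and a SQL PERCENTILE_CONT interpolates between samples, so the two can differ slightly on small windows:

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}

// Alert when the current P99 is at least double the 24-hour baseline.
function p99Doubled(currentP99Ms: number, baselineP99Ms: number): boolean {
  return baselineP99Ms > 0 && currentP99Ms >= 2 * baselineP99Ms;
}

// 98 fast calls and 2 calls that hit a timeout-and-retry path
const latencies = [...Array(98).fill(800), 45000, 45000];
const mean = latencies.reduce((sum, v) => sum + v, 0) / latencies.length;
const p99 = percentile(latencies, 99);
```

In this sample the mean is under two seconds while the P99 is 45 seconds: two slow calls in a hundred barely move the average but define the tail.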

Retry rate

A healthy system has a low retry rate. An elevated retry rate means the primary provider is degrading and retries are masking it. This is exactly the scenario where monitoring needs to surface the problem before it gets worse.

Alert threshold: retry rate exceeds 20% of total calls for a given capability over any 30-minute window.

Fallback rate

A fallback being triggered means the primary provider has failed validation after all retries. This is a serious signal — not a transient error but a provider that is systematically failing.

Alert threshold: primary provider fails and fallback is triggered in more than 10% of calls for a given capability over any 60-minute window.

-- Dashboard query example (PostgreSQL-style syntax)
SELECT
  capability_name,
  provider,
  COUNT(*) AS total_calls,
  SUM(CASE WHEN success THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS success_rate,
  PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY latency_ms) AS p99_latency_ms,
  -- Share of calls that needed at least one retry, matching the alert definition
  SUM(CASE WHEN retry_count > 0 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS retry_rate,
  SUM(CASE WHEN used_fallback THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS fallback_rate
FROM capability_metrics
WHERE timestamp > NOW() - INTERVAL '1 hour'
GROUP BY capability_name, provider
ORDER BY success_rate ASC;
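Three of the four thresholds can also be evaluated directly over emitted metrics. The sketch below assumes the metrics passed in belong to a single capability and that timestamp is epoch milliseconds. The evaluateAlerts function and the AlertKind names are hypothetical, and the P99 check is omitted because it needs a 24-hour baseline:

```typescript
interface CapabilityMetric {
  capabilityName: string;
  provider: string;
  timestamp: number; // assumed: epoch milliseconds
  success: boolean;
  latencyMs: number;
  retryCount: number;
  usedFallback: boolean;
}

type AlertKind = "success_rate" | "retry_rate" | "fallback_rate";

// Fraction of rows matching the predicate, or null when there is no traffic.
function fraction(
  rows: CapabilityMetric[],
  pred: (m: CapabilityMetric) => boolean
): number | null {
  if (rows.length === 0) return null;
  return rows.filter(pred).length / rows.length;
}

function evaluateAlerts(metrics: CapabilityMetric[], nowMs: number): AlertKind[] {
  const within = (minutes: number) =>
    metrics.filter(m => m.timestamp > nowMs - minutes * 60000 && m.timestamp <= nowMs);
  const alerts: AlertKind[] = [];

  const successRate = fraction(within(15), m => m.success);
  if (successRate !== null && successRate < 0.8) alerts.push("success_rate");

  const retryRate = fraction(within(30), m => m.retryCount > 0);
  if (retryRate !== null && retryRate > 0.2) alerts.push("retry_rate");

  const fallbackRate = fraction(within(60), m => m.usedFallback);
  if (fallbackRate !== null && fallbackRate > 0.1) alerts.push("fallback_rate");

  return alerts;
}
```

Returning null for an empty window keeps a quiet capability from paging anyone: no traffic is not the same as a 0% success rate.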

The alert system is the tight loop operating at the capability level. The autonomous entity pattern describes escalation up through human levels when automated recovery fails. Monitoring these four metrics provides the signal that triggers escalation before a degraded capability causes widespread pipeline failure.

What not to monitor:

Do not alert on individual retries. Do not alert on individual fallback uses. Do not alert on single-call failures. These are noise at the call level and signal only at the rate level. The difference between a system that is healthy and a system that is degrading is not any single event — it is the pattern over time.


The Architecture That Survives

The pattern across all ten sections is consistent:

Assume failure. Validate at every boundary. Retry selectively. Fall back gracefully. Monitor rates, not events.

A pipeline built on these principles does not become more complex as it encounters more providers, more failure modes, more edge cases. It becomes more capable — because each new failure mode produces a validation rule, a retry policy, or a fallback entry that prevents the same failure from affecting any future run.

The autonomous entity pattern describes how accumulated fixes become skills. The tight loop principle describes how the system that catches bugs is more valuable than the fix for any individual bug. The compound math of unreliability shows why these properties are not optional at scale.

The 80% assumption is not pessimism. It is calibration. A system designed for the real world, not the ideal one.




