The Tight Loop: Observability and Action

“Don’t fix the bug. Fix the system that let the bug live.” — the tight loop principle

Table of Contents

1. The Core Insight
2. The Observe-Orient-Decide-Act Loop (OODA)
3. Control Theory and PID Controllers
4. Erlang/OTP: Industrial-Strength Self-Healing
5. Kubernetes Reconciliation Loops
6. Fast CI/CD Feedback and the Andon Cord
7. Chaos Engineering
8. Progressive Delivery and Canary Deployments
9. Observability: The Instrument Panel
10. DORA Metrics: Measuring the Loop
11. Circuit Breakers and Bulkheads
12. Minimal Viable Test Subsets
13. AI Agent Observability
14. AI Evaluation: The Eval Loop
15. The Strangler Fig Pattern
16. Applying the Tight Loop to Autonomous Agent Systems
17. Anti-Patterns
18. The Meta-Loop: Continuously Enhancing This Article
19. Tools and Libraries Reference

1. The Core Insight

The tight loop is a discipline, not a technique.

Most engineers, when they encounter a failure in a running system, reach for a fix to the immediate problem. The tight loop practitioner does something different: they ask why the system couldn’t find and fix this itself, then fix the system’s ability to do that.

The tight loop has three properties:

Latency is the enemy. The longer the gap between cause and observation — between action and feedback — the more work is wasted, the more damage accumulates, and the harder root cause becomes to identify. Compress this gap to seconds or minutes, not hours or days.
The system must observe itself. External monitoring is table stakes. Real tight-loop systems emit structured telemetry that lets the system — not a human — detect drift and trigger correction.
Scope before you run. Don’t iterate over the full problem space when a representative subset reveals the same failure modes in a fraction of the time. Use the smallest input that exercises the broken path.

The loop itself:

  commit ──→ observe ──→ classify ──→ act ──→ commit
     ↑                                          │
     └──────────────────────────────────────────┘

Each turn of the loop should take seconds to minutes, not hours. If a turn takes longer, the first fix is to shorten the loop, not to fix whatever error the long loop produced.

2. The Observe-Orient-Decide-Act Loop (OODA)

John Boyd designed the OODA loop for aerial combat. A pilot who can complete an OODA cycle faster than their opponent gets inside the opponent’s decision cycle and wins before the opponent can respond.

The same dynamic applies to software systems.

  ┌─────────────────────────────────────────────────────────┐
  │                        OODA LOOP                         │
  │                                                          │
  │  OBSERVE ──→ ORIENT ──→ DECIDE ──→ ACT                  │
  │     ↑                              │                     │
  │     └──────────────────────────────┘                     │
  │                                                          │
  │  Observe:  Collect signals (metrics, logs, traces)       │
  │  Orient:   Make sense of signals in context              │
  │  Decide:   Choose an action from the action space        │
  │  Act:      Execute and inject the result back into Obs.  │
  └─────────────────────────────────────────────────────────┘

Applying OODA to systems:

OODA Phase	Tight Loop Equivalent
Observe	Structured logs, metrics, distributed traces, health endpoints
Orient	Alerting rules, anomaly detection, policy engines
Decide	Runbooks, auto-remediation scripts, LLM-based diagnosis
Act	Helm rollback, feature flag toggle, job queue push, PR creation

The key insight from Boyd: speed through the loop matters more than quality of any individual decision. A pilot who makes good decisions slowly loses to a pilot who makes OK decisions fast. For systems, this means: prefer reversible actions that close the loop quickly over perfect actions that take hours to compute.

References:

Boyd, John. “Patterns of Conflict” (1986 briefing slides, declassified)
Chet Richards, Certain to Win (2004) — best accessible treatment of OODA
Mark Osborne, OODA in Software Engineering (2019)

3. Control Theory and PID Controllers

Before software, engineers built tight loops in analog circuits and mechanical systems. Control theory gives us a rigorous vocabulary.

The classic feedback control loop:

  setpoint ──→ [comparator] ──→ [controller] ──→ [plant]
                   ↑                                │
                   └──────────── [sensor] ──────────┘

Setpoint: the desired state (e.g., p99 latency < 100ms, error rate < 0.1%)
Plant: the system being controlled (the service, the fleet, the pipeline)
Sensor: observability infrastructure (metrics, logs, traces)
Controller: the correction logic (alerting rules, auto-scaler, AI agent)
Comparator: computes the error, i.e. the gap between the setpoint and the state reported by the sensor

PID Controllers

The most common controller is the PID (Proportional-Integral-Derivative):

P (Proportional): correction proportional to current error. Fast response, can overshoot.
I (Integral): correction proportional to accumulated error over time. Eliminates steady-state drift, can cause oscillation.
D (Derivative): correction proportional to rate of change of error. Damps oscillation, amplifies noise.

correction = Kp * error + Ki * ∫error dt + Kd * d(error)/dt

Software analogues:

Control Concept	Software Equivalent
Setpoint	SLO / error budget target
Proportional	Scale replicas proportionally to request queue depth
Integral	Carry over unresolved incidents to next sprint budget
Derivative	Alert when error rate is increasing, not just when it’s high
Deadband	Don’t act unless deviation > threshold (avoid flapping)
Anti-windup	Cap backlog so old accumulated issues don’t distort current action
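The PID formula, together with the deadband and anti-windup rows above, can be sketched in a few lines. This is a minimal illustration, not a production controller: the class name, the config fields, and the choice to clamp the integral symmetrically are all assumptions made for the example.

```typescript
// Hypothetical configuration; gains and limits must be tuned for the real plant.
interface PidConfig {
  kp: number
  ki: number
  kd: number
  deadband: number      // errors smaller than this are ignored (avoids flapping)
  integralLimit: number // anti-windup: clamp on the accumulated error
}

class PidController {
  private readonly config: PidConfig
  private integral = 0
  private previousError: number | null = null

  constructor(config: PidConfig) {
    this.config = config
  }

  // Returns the correction for one control step lasting dtSeconds.
  update(setpoint: number, measured: number, dtSeconds: number): number {
    const error = setpoint - measured
    if (Math.abs(error) < this.config.deadband) {
      this.previousError = error
      return 0
    }
    // Integral term: accumulate error over time, clamped so old error cannot dominate.
    const limit = this.config.integralLimit
    this.integral = Math.max(-limit, Math.min(limit, this.integral + error * dtSeconds))
    // Derivative term: rate of change of the error (zero on the first step).
    const derivative =
      this.previousError === null ? 0 : (error - this.previousError) / dtSeconds
    this.previousError = error
    const { kp, ki, kd } = this.config
    return kp * error + ki * this.integral + kd * derivative
  }
}
```

Injecting the step length rather than reading a clock keeps the controller deterministic, which makes it easy to replay recorded metrics through it when tuning gains.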

The Integral term and tech debt: Technical debt is the integral of error. Systems that only react to current problems (pure P control) accumulate drift until it’s catastrophic. Tight-loop teams explicitly budget for integral correction — this is what “paying down tech debt” means in control theory terms.

References:

Åström & Wittenmark, Computer-Controlled Systems (classic textbook)
Netflix Tech Blog: Auto-Tuning PID Controllers for Microservices (2018)
Google SRE Book, Chapter 6: “Monitoring Distributed Systems”

4. Erlang/OTP: Industrial-Strength Self-Healing

Erlang was designed in the 1980s for telephone switches that needed nine nines of uptime (31ms downtime per year). The solution was radical: let it crash, then supervise the crash.

Supervisor Trees

Every Erlang/OTP system is a tree of processes. Each internal node is a supervisor whose only job is to monitor its children and restart them on failure. Leaves are workers.

                    [Application Supervisor]
                           /        \
             [Worker Sup]            [Scanner Sup]
             /     |     \               /     \
        [W1]    [W2]    [W3]         [S1]    [S2]

Restart strategies:

one_for_one: restart only the failed child
one_for_all: if any child fails, restart all siblings (used when children share state)
rest_for_one: restart the failed child and all children started after it (used for dependency chains)

“Let it crash” philosophy

Rather than defensive programming (check every error, handle every nil, catch every exception), Erlang programs assume the happy path and crash on anything unexpected. The supervisor tree catches the crash, restarts the process from a known-good state, and logs the crash for later analysis.

This is not abandoning reliability — it’s a different theory of reliability:

Crash fast → minimal damage to shared state
Restart clean → known-good starting point
Log everything → root cause analysis post-hoc
Supervisor knows the context → escalate up the tree if restart loops

The circuit breaker built-in: OTP supervisors are configured with a restart intensity and a period (called intensity and period in Erlang's supervisor flags, max_restarts and max_seconds in Elixir). If more than max_restarts restarts occur within max_seconds, the supervisor terminates its children and then itself, propagating the failure up the tree. This is automatic circuit breaking.
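The restart-limit behavior can be sketched outside the BEAM as well. The following is a simplified, synchronous model and not how OTP is implemented: the Supervisor class, its run method, and the injectable clock are hypothetical names chosen for illustration.

```typescript
class Supervisor {
  private crashTimes: number[] = []
  private readonly maxRestarts: number
  private readonly maxSeconds: number
  private readonly nowSeconds: () => number

  constructor(
    maxRestarts: number,
    maxSeconds: number,
    nowSeconds: () => number = () => Date.now() / 1000,
  ) {
    this.maxRestarts = maxRestarts
    this.maxSeconds = maxSeconds
    this.nowSeconds = nowSeconds
  }

  // Runs the child until it returns normally, restarting it after each crash.
  // Throws once the restart limit is exceeded so that a parent can take over.
  run(child: () => void): void {
    for (;;) {
      try {
        child()
        return
      } catch {
        const t = this.nowSeconds()
        // Keep only crashes inside the sliding window, then record this one.
        this.crashTimes = this.crashTimes.filter(ts => t - ts < this.maxSeconds)
        this.crashTimes.push(t)
        if (this.crashTimes.length > this.maxRestarts) {
          throw new Error(
            `restart limit exceeded: ${this.crashTimes.length} crashes within ${this.maxSeconds}s`,
          )
        }
        // Otherwise fall through and restart the child from a clean state.
      }
    }
  }
}
```

A parent supervisor would wrap a call to run in its own run, which is how a persistent failure escalates one level at a time instead of retrying forever at the leaf.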

Applying to modern systems:

Erlang/OTP Concept	Modern Equivalent
Supervisor tree	Kubernetes Pod + Deployment + HPA
`one_for_one`	Pod restart policy
Max restart threshold	CrashLoopBackOff + alerting
“Let it crash”	Structured exception logging + dead letter queues
Process isolation	Container isolation / actor model (Akka, Orleans)
Hot code reload	Rolling deployments, feature flags

Libraries and tools:

Elixir: brings OTP to a modern Ruby-like syntax; same supervisor model
Akka (Scala/Java): actor model with supervisor hierarchies
Microsoft Orleans (.NET): virtual actor model with grain persistence
Temporal (Go/Java/TypeScript): workflow orchestration with durable execution and configurable automatic retries

5. Kubernetes Reconciliation Loops

Kubernetes is the most widely deployed implementation of the tight loop pattern at infrastructure scale.

The core principle: desired state vs. actual state

Every Kubernetes controller follows the same pattern:

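// Pseudocode: getDesiredState, getActualState, computeDiff, apply, and sleep
// stand in for a controller's real I/O.
async function controlLoop() {
  while (true) {
    const desired = await getDesiredState()  // spec, read via the API server (stored in etcd)
    const actual  = await getActualState()   // status, observed from the real world
    const diff    = computeDiff(desired, actual)
    if (diff) await apply(diff)
    await sleep(resyncPeriod)
  }
}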

This is the reconciliation loop. Controllers don’t store “what I last did” — they compute “what I need to do now” from first principles on every iteration. This makes them idempotent and convergent: run them 100 times on a healthy cluster and nothing changes; run them once after a failure and they fix it.

Key properties:

Level-triggered, not edge-triggered: controllers react to current state, not events. This means missed events (network partition, API server restart) don’t leave the system in a broken state indefinitely.
Optimistic concurrency: controllers use resourceVersion to detect conflicts. If the resource changed since they last read it, they start over. No distributed locks required.
Eventual consistency: Kubernetes makes no guarantees about when the actual state matches desired, only that it will converge. This is the right contract for distributed systems.

Writing your own controller

The Kubernetes controller-runtime library (Go) implements the scaffolding:

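// controller-runtime example. MyResource and computeDesired are placeholders
// for your own custom resource type and status logic.
import (
    "context"
    "reflect"
    "time"

    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// MyReconciler embeds client.Client, which supplies Get and Status().
type MyReconciler struct {
    client.Client
}

func (r *MyReconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
    // 1. Fetch the object; NotFound means it was deleted, so there is nothing to do
    obj := &MyResource{}
    if err := r.Get(ctx, req.NamespacedName, obj); err != nil {
        return reconcile.Result{}, client.IgnoreNotFound(err)
    }

    // 2. Compute the desired status from the current spec
    desired := r.computeDesired(obj)

    // 3. Diff and apply
    if !reflect.DeepEqual(obj.Status, desired) {
        obj.Status = desired
        return reconcile.Result{}, r.Status().Update(ctx, obj)
    }

    // 4. Re-queue after resync period
    return reconcile.Result{RequeueAfter: 5 * time.Minute}, nil
}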

Applying this pattern outside Kubernetes:

Any system with a “desired config” and a “running reality” can use reconciliation loops:

Infrastructure-as-Code: Terraform plan/apply is one reconciliation pass
Database schema management: Flyway/Liquibase migrates toward desired schema
Agent dispatchers: scan repos for violations (actual), compare to policy (desired), create jobs (apply)
Feature flag systems: compare enabled users (desired) to current rollout (actual), adjust weights
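Outside Kubernetes, the same desired-versus-actual comparison can be written as a pure function. The sketch below assumes resources are identified by plain string IDs; planReconcile and applyPlan are hypothetical names, and a real applyPlan would call external systems rather than mutate a set.

```typescript
interface ReconcilePlan {
  toCreate: string[]
  toDelete: string[]
}

// Derives the plan from current state only, never from a record of past actions.
function planReconcile(desired: Set<string>, actual: Set<string>): ReconcilePlan {
  const toCreate = Array.from(desired).filter(id => !actual.has(id)).sort()
  const toDelete = Array.from(actual).filter(id => !desired.has(id)).sort()
  return { toCreate, toDelete }
}

// Stand-in for the side-effecting step; here it just returns the new actual state.
function applyPlan(actual: Set<string>, plan: ReconcilePlan): Set<string> {
  const next = new Set(actual)
  for (const id of plan.toCreate) next.add(id)
  for (const id of plan.toDelete) next.delete(id)
  return next
}
```

Because the plan is recomputed from scratch each pass, running planReconcile again after applyPlan yields an empty plan, which is the convergence property described above.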

References:

Kubebuilder Book — best tutorial on writing controllers
Programming Kubernetes (Hausenblas & Schimanski, 2019)
Bilgin Ibryam, Kubernetes Patterns — especially Chapter 27: Controller

6. Fast CI/CD Feedback and the Andon Cord

Toyota Production System and the Andon Cord

Toyota’s Andon cord (now a button) allows any worker on the assembly line to stop the entire line when they detect a defect. Counterintuitively, this increases throughput: problems are fixed immediately rather than propagating through 500 more cars before detection.

Software equivalent: when a test fails in CI, block all merges to that branch until it’s fixed. This is not a slowdown — it’s the mechanism that prevents defect accumulation.

The key insight: stopping to fix is faster than continuing with a broken foundation.

The Fast Feedback Gradient

Not all tests are equal. Organize your test suite into layers with dramatically different execution times:

Layer 0: Static analysis         → < 5s    (linting, type checking, formatting)
Layer 1: Unit tests              → < 30s   (pure functions, no I/O)
Layer 2: Integration tests       → < 3min  (real DB, fake external services)
Layer 3: Contract tests          → < 5min  (consumer-driven contracts)
Layer 4: End-to-end tests        → < 15min (real infrastructure, smoke paths)
Layer 5: Load/chaos tests        → hours   (run nightly or pre-release only)

Run layers sequentially and fail fast. Don’t run a 10-minute E2E suite on a commit that fails the 5-second linter.
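As a rough sketch of that fail-fast ordering, the runner below executes layers cheapest-first and stops at the first failure. The Layer shape and runLayers are hypothetical; in practice each run function would shell out to a linter or test runner.

```typescript
interface Layer {
  name: string
  run: () => boolean // true when the layer passes
}

interface LayerOutcome {
  passed: string[]
  failedAt: string | null
}

// Runs layers in order and returns as soon as one fails,
// so slower layers never start on a commit that is already known to be bad.
function runLayers(layers: Layer[]): LayerOutcome {
  const passed: string[] = []
  for (const layer of layers) {
    if (!layer.run()) {
      return { passed, failedAt: layer.name }
    }
    passed.push(layer.name)
  }
  return { passed, failedAt: null }
}
```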

Tools that implement fast feedback:

Nx / Turborepo: monorepo-aware build caching. Only rebuild/retest packages that changed.
Bazel: hermetic builds with per-target caching. Google runs 100,000+ tests per day.
Jest --watch: instant re-run of tests related to changed files during development.
vitest: same philosophy, faster startup, native ESM.
cargo-nextest: Rust test runner, up to 3× faster than cargo test.
pytest-xdist: parallel test execution for Python.
GitHub Actions path filters: only trigger jobs when relevant files change.

Test Impact Analysis

Rather than running all tests on every commit, track which tests exercise which code paths and run only affected tests.

Microsoft TAOS: used internally, reduces test runs by 80%+
Launchable: SaaS layer over any test runner, ML-based test selection
pytest-testmon: Python test impact analysis using coverage data
Bazel query: bazel query 'rdeps(//..., //path/to/changed:target)'
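The coverage-based variant of test impact analysis reduces to a set intersection. The sketch below assumes a coverage map has already been recorded from a previous run; the map format and selectAffectedTests are illustrative and do not reflect how any of the tools above store their data.

```typescript
// coverageMap: test name -> source files that test touched on its last recorded run.
function selectAffectedTests(
  coverageMap: Record<string, string[]>,
  changedFiles: string[],
): string[] {
  const changed = new Set(changedFiles)
  return Object.keys(coverageMap)
    .filter(test => (coverageMap[test] ?? []).some(file => changed.has(file)))
    .sort()
}
```

A test with no coverage entry is never selected by this function, so new or unmapped tests need a fallback rule (for example, always run them) to avoid silently skipping them.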

7. Chaos Engineering

Chaos engineering is the practice of deliberately injecting failures into systems to discover weaknesses before they manifest under real load.

The GameDay model (Netflix)

Netflix’s Simian Army (since superseded by newer tooling) popularized this style of experiment, which teams often run during scheduled “GameDays”:

Define steady state (the system’s normal behavior — a measurable hypothesis)
Inject a failure (kill a node, increase latency, exhaust a resource)
Observe whether steady state is maintained
Fix anything that broke

Key tools:

Tool	Scope	Notes
Chaos Monkey	EC2/AWS instance termination	Original Netflix tool
Chaos Gorilla	Kill entire availability zone
Gremlin	SaaS chaos platform	CPU, memory, network, disk attacks
LitmusChaos	Kubernetes-native	CRD-based experiments
Chaos Mesh	Kubernetes-native	CNCF project
toxiproxy	Network proxy with failure injection	Great for local dev
Pumba	Docker container chaos

Principles (from Principles of Chaos Engineering, principlesofchaos.org):

Build a hypothesis around steady-state behavior
Vary real-world events (not just crashes — latency, data corruption, dependency unavailability)
Run experiments in production (chaos in staging doesn’t find production failure modes)
Automate experiments to run continuously
Minimize blast radius (use canaries, feature flags, scope injection narrowly)

The tight loop application: chaos engineering is how you verify your tight loop closes correctly. The question isn’t “does the system crash?” but “how long does it take the loop to detect and recover?”

8. Progressive Delivery and Canary Deployments

Progressive delivery separates deployment (code is running in production) from release (users are getting the new code). This lets you verify changes on real traffic with a small blast radius before full rollout.

Delivery strategies:

Blue/Green:     100% old → switch → 100% new (all-or-nothing)
Canary:         1% → 10% → 25% → 50% → 100% (gradual)
A/B:            Split by user segment (for feature testing)
Shadow:         New system processes requests but doesn't return results
Ring:           Internal → dogfood → early adopters → general availability

Automated canary analysis:

The key is automating the decision to advance or rollback:

Deploy to 1% of traffic
Compare error rate, latency, business metrics between canary and baseline
If canary is statistically worse → automatic rollback
If canary matches or improves → advance to next percentage
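One way to make "statistically worse" concrete is a one-sided two-proportion z-test on error rates. This is a swapped-in textbook technique used here for illustration, not the method any of the tools below necessarily use; the minimum sample size and the 1.645 threshold (roughly 95% one-sided confidence) are assumptions.

```typescript
interface TrafficSample {
  requests: number
  errors: number
}

type CanaryVerdict = 'advance' | 'rollback' | 'wait'

function judgeCanary(
  baseline: TrafficSample,
  canary: TrafficSample,
  minRequests = 500,
  zThreshold = 1.645,
): CanaryVerdict {
  // Too little traffic to say anything: keep the canary at its current weight.
  if (baseline.requests < minRequests || canary.requests < minRequests) return 'wait'

  const baselineRate = baseline.errors / baseline.requests
  const canaryRate = canary.errors / canary.requests
  const pooled = (baseline.errors + canary.errors) / (baseline.requests + canary.requests)
  const stdErr = Math.sqrt(pooled * (1 - pooled) * (1 / baseline.requests + 1 / canary.requests))

  // Zero variance means both sides have identical rates (all success or all failure).
  if (stdErr === 0) return 'advance'

  const z = (canaryRate - baselineRate) / stdErr
  return z > zThreshold ? 'rollback' : 'advance'
}
```

The same structure extends to latency or business metrics by swapping the test statistic; the important part is that the advance-or-rollback decision is computed, not eyeballed.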

Tools:

Argo Rollouts: Kubernetes progressive delivery with analysis templates
Flagger: same concept, works with Istio/Linkerd/NGINX
LaunchDarkly / Unleash: feature flags for progressive release
Harness: CD platform with built-in canary analysis
Spinnaker: Netflix’s CD platform, canary analysis built in

The metric that matters: don’t just watch error rate. Define a single north star metric per canary that captures business value (conversion, revenue, engagement) alongside technical health.

9. Observability: The Instrument Panel

You cannot run a tight loop without seeing the system’s state. Observability is the prerequisite.

The Three Pillars

Pillar	What it captures	Best for
Metrics	Numerical aggregates over time	Alerting, dashboards, trending
Logs	Discrete events with context	Debugging specific incidents
Traces	Request flow across services	Latency diagnosis, dependency mapping

These three are complementary, not alternatives.

The USE Method (Brendan Gregg)

For every resource in your system:

Utilization: what fraction of time the resource is busy (e.g., CPU at 70%)
Saturation: how much work is queued waiting for the resource (e.g., run-queue length, disk queue depth)
Errors: error count for that resource

USE gives you a systematic checklist. Run it against CPU, memory, disk I/O, network I/O, and any application-specific resources (thread pools, connection pools, queue depth).

The RED Method (Tom Wilkie)

For every service endpoint:

Rate: requests per second
Errors: error rate (%)
Duration: latency distribution (p50, p95, p99)

RED is the service-facing view; USE is the infrastructure-facing view. Use both.
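To make the RED definitions concrete, the sketch below derives all three from a window of raw request records. It assumes records are already collected in memory and uses the nearest-rank percentile definition; real systems usually compute these from histograms in the metrics backend, and the type and function names are illustrative.

```typescript
interface RequestRecord {
  durationMs: number
  ok: boolean
}

interface RedSummary {
  ratePerSecond: number
  errorRate: number
  p50Ms: number
  p95Ms: number
  p99Ms: number
}

// Nearest-rank percentile over an ascending array; NaN when there is no data.
function percentile(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) return NaN
  const rank = Math.ceil((p * sortedAsc.length) / 100)
  const index = Math.min(sortedAsc.length - 1, Math.max(0, rank - 1))
  return sortedAsc[index] ?? NaN
}

function redMetrics(records: RequestRecord[], windowSeconds: number): RedSummary {
  const durations = records.map(r => r.durationMs).sort((a, b) => a - b)
  const errors = records.filter(r => !r.ok).length
  return {
    ratePerSecond: records.length / windowSeconds,
    errorRate: records.length === 0 ? 0 : errors / records.length,
    p50Ms: percentile(durations, 50),
    p95Ms: percentile(durations, 95),
    p99Ms: percentile(durations, 99),
  }
}
```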

Google’s Four Golden Signals:

Latency (duration of successful requests)
Traffic (demand on the system)
Errors (rate of failing requests)
Saturation (how “full” the service is)

OpenTelemetry (OTel)

OTel is the emerging standard for instrumentation. It provides vendor-agnostic APIs and SDKs that export metrics, logs, and traces to any backend (Prometheus, Jaeger, Tempo, Honeycomb, Datadog, etc.).

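// OTel instrumentation example
import { context, metrics, trace, SpanStatusCode } from '@opentelemetry/api'

const tracer = trace.getTracer('my-service')
const meter = metrics.getMeter('my-service')
const jobsCreated = meter.createCounter('jobs_created_total')

// Placeholder for the application's own per-repo work.
declare function processRepo(repo: string): Promise<unknown[]>

async function reconcile(repos: string[]) {
  const span = tracer.startSpan('reconcile')
  // Child spans are parented through a Context, not through a span option.
  const ctx = trace.setSpan(context.active(), span)
  try {
    for (const repo of repos) {
      const jobSpan = tracer.startSpan('process_repo', undefined, ctx)
      try {
        const jobs = await processRepo(repo)
        jobsCreated.add(jobs.length, { repo })
      } finally {
        jobSpan.end() // end the child span even if processRepo throws
      }
    }
  } catch (err) {
    span.recordException(err as Error)
    span.setStatus({ code: SpanStatusCode.ERROR })
    throw err
  } finally {
    span.end()
  }
}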

Key observability backends:

Tool	Pillar	Notes
Prometheus	Metrics	Pull-based, PromQL, ecosystem standard
Grafana	Dashboards	Connects to any datasource
Loki	Logs	Label-based, LogQL, Grafana native
Tempo	Traces	Integrates with Loki/Prometheus
Honeycomb	All three	Exceptional for high-cardinality queries
Datadog	All three	SaaS, expensive but complete
Jaeger	Traces	OSS, CNCF project

10. DORA Metrics: Measuring the Loop

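DORA (DevOps Research and Assessment) identified four metrics that predict both organizational performance and software delivery performance. The performance bands below are approximate; the exact thresholds shift between annual State of DevOps reports.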

Metric	Elite	High	Medium	Low
Deployment frequency	Multiple/day	Weekly	Monthly	<Monthly
Lead time for changes	< 1 hour	1 day	1 week	1 month
Change failure rate	< 5%	5-15%	10-15%	>15%
Time to restore service	< 1 hour	< 1 day	1 week	>1 month

Reading these metrics through the tight loop lens:

Deployment frequency measures loop cadence. More deployments = smaller changes = less risk per change.
Lead time for changes measures loop latency. Commit to production in minutes, not weeks.
Change failure rate measures loop accuracy. Good tests + canaries = fewer breaks.
Time to restore service (MTTR) measures loop recovery speed. Observability + runbooks + auto-remediation = fast recovery.

DORA’s key finding: these four metrics cluster together. Elite performers are elite on all four simultaneously. You cannot optimize one in isolation.

Additional metric: SPACE framework (GitHub, 2021)

DORA measures flow. SPACE measures developer productivity more holistically:

Satisfaction and well-being
Performance (outcomes, not output)
Activity (frequency of actions)
Communication and collaboration
Efficiency and flow

11. Circuit Breakers and Bulkheads

These patterns from Michael Nygard’s Release It! are essential tight-loop primitives.

Circuit Breaker

A circuit breaker monitors calls to an external dependency and “trips” (opens the circuit) when failures exceed a threshold. While open, calls fail fast without attempting the dependency. After a timeout, it half-opens and lets a test call through.

States: CLOSED → (threshold exceeded) → OPEN → (timeout) → HALF-OPEN → (success) → CLOSED
                                           ↑                              │
                                           └──────── (failure) ───────────┘

This is the software equivalent of a household circuit breaker: prevents one failing component from cascading into system-wide failure.
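The state machine above fits in a small class. The sketch below is synchronous and counts consecutive failures, which is one of several common tripping rules (libraries often use a failure rate over a sliding window instead); the class and parameter names are hypothetical.

```typescript
type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN'

class CircuitBreaker {
  private state: BreakerState = 'CLOSED'
  private consecutiveFailures = 0
  private openedAt = 0
  private readonly failureThreshold: number
  private readonly resetTimeoutMs: number
  private readonly now: () => number

  constructor(failureThreshold: number, resetTimeoutMs: number, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold
    this.resetTimeoutMs = resetTimeoutMs
    this.now = now
  }

  // OPEN becomes HALF_OPEN once the reset timeout has elapsed.
  getState(): BreakerState {
    if (this.state === 'OPEN' && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = 'HALF_OPEN'
    }
    return this.state
  }

  call<T>(fn: () => T): T {
    const stateBeforeCall = this.getState()
    if (stateBeforeCall === 'OPEN') {
      throw new Error('circuit open: failing fast')
    }
    try {
      const result = fn()
      // Any success closes the circuit and clears the failure count.
      this.consecutiveFailures = 0
      this.state = 'CLOSED'
      return result
    } catch (err) {
      this.consecutiveFailures += 1
      // A failed trial call, or too many failures in a row, opens the circuit.
      if (stateBeforeCall === 'HALF_OPEN' || this.consecutiveFailures >= this.failureThreshold) {
        this.state = 'OPEN'
        this.openedAt = this.now()
      }
      throw err
    }
  }
}
```

The injectable clock is only there to make the timeout testable without sleeping.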

Implementation:

resilience4j (Java/Kotlin): circuit breaker, retry, rate limiter, bulkhead
Polly (.NET): retry, circuit breaker, timeout, bulkhead
ts-circuit-breaker / opossum (Node.js)
pybreaker (Python)
Envoy proxy: circuit breaking at the network level, no application code changes

Bulkheads

Named after ship bulkheads that isolate compartments from flooding. In software: isolate thread pools, connection pools, or process queues by caller or resource. If one tenant hammers the DB, other tenants’ connections aren’t starved.

Without bulkheads:              With bulkheads:
[All requests] → [Pool 10]      [API requests]    → [Pool 5]
                                [Background jobs] → [Pool 3]
                                [Admin requests]  → [Pool 2]
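A bulkhead can be as simple as a counter per pool. The sketch below rejects immediately when a pool is full rather than queueing, which is one possible policy; the Bulkhead class and its method names are hypothetical.

```typescript
class Bulkhead {
  readonly name: string
  private readonly capacity: number
  private inUse = 0

  constructor(name: string, capacity: number) {
    this.name = name
    this.capacity = capacity
  }

  // Returns false instead of waiting when the compartment is full.
  tryAcquire(): boolean {
    if (this.inUse >= this.capacity) return false
    this.inUse += 1
    return true
  }

  release(): void {
    if (this.inUse > 0) this.inUse -= 1
  }

  available(): number {
    return this.capacity - this.inUse
  }
}
```

With one instance per caller class (sized 5, 3, and 2 as in the diagram), exhausting the background pool leaves the API and admin pools untouched.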

Timeout pyramid:

Every external call needs a timeout. Set timeouts at every layer:

User request timeout:       500ms
Service call timeout:       200ms
Database query timeout:     100ms
External API timeout:        50ms

If inner timeouts aren’t set, a slow database query holds the service call slot, which holds the user request slot, until the entire connection pool is exhausted.

12. Minimal Viable Test Subsets

The tight loop lives or dies by the time to get feedback. One of the most powerful optimizations is not “run the tests faster” but “run fewer tests that still tell you what you need to know.”

The philosophy:

Run the smallest set of tests that gives you confidence the specific thing you changed is correct.

Principles:

Scope to the change. A change to the JSON serializer doesn’t need to run the authentication tests. Map change → affected tests using coverage or static analysis.
Use fixtures, not production data. A 3-repo subset reveals the same parser bugs as a 41-repo scan but runs in 2 seconds instead of 8 minutes.
Smoke tests for fast sanity. A handful of critical-path tests that run in under 30 seconds. If these fail, don’t run anything else.
Regression pack for known failures. Every bug you’ve ever fixed should have a regression test. These tests encode your system’s error history.
Statistical sampling. For large datasets, random sampling of N items often reveals the same failure modes as exhaustive scanning. Start with N=10, scale up only if needed.

Test pyramid vs. test diamond:

The classic “test pyramid” (many unit tests, fewer integration, minimal E2E) is right for most systems. But for systems with significant integration complexity (API-heavy systems, cloud workers), a “diamond” shape often makes more sense:

Pyramid:                    Diamond:
    [E2E]                       [E2E]
   [integ ]                  [integ integ]
  [unit unit unit]           [  unit unit  ]
                              [smoke smoke]

The diamond adds a thick integration band and a smoke band at the bottom because unit tests of an API client test mocks, not the API.

Fast iteration workflow:

# Step 1: Smoke test — 3 repos, 5 seconds
curl -X POST .../debug/reconcile -d '{"repos": ["a", "b", "c"]}'

# Step 2: If smoke passes, run against 10 representative repos
# Step 3: If that passes, merge and let the full cron verify at scale

13. AI Agent Observability

Standard APM tools weren’t designed for LLM-based systems. AI agents introduce new failure modes:

Hallucinated tool calls: the agent calls a tool that doesn’t exist or with wrong parameters
Context window exhaustion: long conversations lose early context silently
Prompt injection: user input containing instructions that hijack agent behavior
Reasoning loops: agent loops on the same decision without progress
Cost runaway: unbounded agent loops burn tokens at an accelerating rate, because each turn resends the growing context
Latency variance: p99 latency for LLM calls can be 10× p50

What to instrument:

// Minimum viable AI agent telemetry
interface AgentSpan {
  session_id: string
  turn_number: number
  model: string
  input_tokens: number
  output_tokens: number
  cached_tokens: number
  cost_usd: number
  tool_calls: ToolCallRecord[]
  latency_ms: number
  finish_reason: 'end_turn' | 'max_tokens' | 'tool_use' | 'error'
  error?: string
}

interface ToolCallRecord {
  tool_name: string
  duration_ms: number
  success: boolean
  error?: string
}

AI Observability Platforms:

Platform	Strengths	Notes
LangSmith	Deep LangChain integration, evaluation suite, prompt hub	Best-in-class if using LangChain
Langfuse	Open source, self-hostable, framework-agnostic	Best choice for custom agents
Helicone	Proxy-based (no SDK changes), cost analytics	Drop-in for any OpenAI-compatible API
Arize Phoenix	OSS, ML observability background, embeddings viz	Strong for RAG/embedding workflows
AgentOps	Agent-specific metrics (session replay, loop detection)	Designed for multi-step agents
Weights & Biases Weave	Versioning + tracing + evaluation in one	Strong if already using W&B
Braintrust	Eval-first, dataset management, score tracking	Best for systematic LLM eval workflows
PromptLayer	Prompt versioning + A/B testing	Simpler, less overhead

Cost control as a tight loop:

Budget → [token counter] → [budget guard] → [agent]
                ↑                               │
                └──────── [usage report] ────────┘

If accumulated_cost > daily_budget × 0.8:
  → switch to cheaper model (Haiku instead of Sonnet)
  → reduce context window
  → alert human
  → pause non-critical jobs
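A minimal version of the token counter and budget guard might look like the following. The prices in the usage note are made-up placeholders, not real model pricing, and BudgetGuard with its two modes is an illustrative simplification of the four actions listed above.

```typescript
interface TokenUsage {
  inputTokens: number
  outputTokens: number
}

// Hypothetical pricing shape: USD per million tokens.
interface Pricing {
  inputUsdPerMTok: number
  outputUsdPerMTok: number
}

type BudgetMode = 'normal' | 'degraded'

class BudgetGuard {
  private spentUsd = 0
  private readonly dailyBudgetUsd: number
  private readonly softLimitRatio: number

  constructor(dailyBudgetUsd: number, softLimitRatio = 0.8) {
    this.dailyBudgetUsd = dailyBudgetUsd
    this.softLimitRatio = softLimitRatio
  }

  // Usage report: feed every completed model call back into the guard.
  record(usage: TokenUsage, pricing: Pricing): void {
    this.spentUsd +=
      (usage.inputTokens / 1_000_000) * pricing.inputUsdPerMTok +
      (usage.outputTokens / 1_000_000) * pricing.outputUsdPerMTok
  }

  spent(): number {
    return this.spentUsd
  }

  // 'degraded' is the signal to switch models, trim context, alert, and pause non-critical jobs.
  mode(): BudgetMode {
    return this.spentUsd > this.dailyBudgetUsd * this.softLimitRatio ? 'degraded' : 'normal'
  }
}
```

For example, with a 10 USD budget and placeholder prices of 3 and 15 USD per million input and output tokens, a call using 1,000,000 input and 200,000 output tokens costs about 6 USD and leaves the guard in normal mode; a second identical call pushes it past the 8 USD soft limit.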

Agent-specific alert thresholds:

Turn count > 20 without terminal state → likely stuck
Same tool call 3× in a row → likely looping
Token ratio (output/input) > 0.8 → suspiciously verbose
finish_reason: max_tokens → output token limit reached, result likely truncated
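These thresholds translate directly into a check over a session's turn history. The sketch below uses a simplified turn record and treats "terminal state" as a last turn that finished with end_turn; both are assumptions for illustration, as are the alert names.

```typescript
type FinishReason = 'end_turn' | 'max_tokens' | 'tool_use' | 'error'

interface TurnRecord {
  toolCalls: string[] // tool names invoked during this turn, in order
  inputTokens: number
  outputTokens: number
  finishReason: FinishReason
}

function detectAgentAlerts(turns: TurnRecord[]): string[] {
  const alerts: string[] = []

  // More than 20 turns and the latest one still has not ended the session.
  const last = turns[turns.length - 1]
  if (turns.length > 20 && last !== undefined && last.finishReason !== 'end_turn') {
    alerts.push('likely_stuck')
  }

  // The same tool three times in a row across the whole session.
  const calls: string[] = []
  for (const turn of turns) calls.push(...turn.toolCalls)
  for (let i = 2; i < calls.length; i++) {
    if (calls[i] === calls[i - 1] && calls[i] === calls[i - 2]) {
      alerts.push('likely_looping')
      break
    }
  }

  if (turns.some(t => t.inputTokens > 0 && t.outputTokens / t.inputTokens > 0.8)) {
    alerts.push('verbose_output')
  }
  if (turns.some(t => t.finishReason === 'max_tokens')) {
    alerts.push('truncated_output')
  }
  return alerts
}
```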

14. AI Evaluation: The Eval Loop

Evaluating LLM-based systems is itself a tight loop discipline. The eval loop closes the gap between “model behavior” and “desired behavior.”

The Eval Hierarchy

Level 0: Assertion-based evals      — deterministic, binary pass/fail
Level 1: LLM-as-judge evals         — use a model to grade model output
Level 2: Human preference evals     — human labels ground truth
Level 3: Downstream metric evals    — measure business/task outcomes

Use all four levels, but optimize for Level 0 first. Deterministic evals run in milliseconds and never disagree with themselves.

Eval-as-CI

Treat evals like tests: run them on every commit, gate merges on them. The key is separating two concerns:

Regression pack: a frozen dataset of inputs → expected outputs. If a model change breaks regression, block the deploy.
Exploration set: new examples from production that stress-test the current model. This expands the regression pack over time.

# .github/workflows/eval.yml
- name: Run eval suite
  run: |
    braintrust eval src/evals/main.eval.ts
  env:
    BRAINTRUST_API_KEY: ${{ secrets.BRAINTRUST_API_KEY }}

LLM-as-Judge Pattern

When the expected output is complex (prose, code, structured reasoning), use a stronger model to grade a weaker model’s output:

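// Score mirrors the JSON shape requested in the prompt below.
interface Score {
  correctness: number
  completeness: number
  conciseness: number
  reasoning: string
}

// Placeholder for whichever LLM client you use; `complete` is not a real SDK method.
declare const claude: { complete(prompt: string): Promise<string> }

async function gradeResponse(
  question: string,
  expected: string,
  actual: string,
): Promise<Score> {
  const prompt = `
    Question: ${question}
    Expected: ${expected}
    Actual: ${actual}

    Rate the actual response on:
    1. Correctness (0-1): Does it answer the question correctly?
    2. Completeness (0-1): Does it cover all aspects of expected?
    3. Conciseness (0-1): Is it appropriately concise?

    Return JSON: {"correctness": N, "completeness": N, "conciseness": N, "reasoning": "..."}
  `
  const response = await claude.complete(prompt)
  // The judge's output is untrusted text; validate it before relying on the scores.
  return JSON.parse(response) as Score
}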

Caveats for LLM-as-judge:

Same model grading itself introduces bias (models prefer their own style)
Judges are sensitive to prompt wording — test the judge too
Use GPT-4 to judge Claude, Claude to judge GPT-4 — reduces same-family bias
Always include a “reasoning” field to make judgments auditable
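Testing the judge starts with refusing malformed judge output. The sketch below validates the JSON shape used in the grading prompt before any score is trusted; JudgeScore and parseJudgeScore are hypothetical names, and the 0 to 1 range check follows the rubric in that prompt.

```typescript
interface JudgeScore {
  correctness: number
  completeness: number
  conciseness: number
  reasoning: string
}

function parseJudgeScore(raw: string): JudgeScore {
  let data: unknown
  try {
    data = JSON.parse(raw)
  } catch {
    throw new Error('judge returned invalid JSON')
  }
  if (typeof data !== 'object' || data === null) {
    throw new Error('judge output is not an object')
  }
  const obj = data as Record<string, unknown>

  // Each score must be a real number inside the rubric's range.
  const score = (key: string): number => {
    const value = obj[key]
    if (typeof value !== 'number' || Number.isNaN(value) || value < 0 || value > 1) {
      throw new Error(`judge field "${key}" must be a number in [0, 1]`)
    }
    return value
  }

  const reasoning = obj['reasoning']
  if (typeof reasoning !== 'string' || reasoning.length === 0) {
    throw new Error('judge must explain its reasoning')
  }

  return {
    correctness: score('correctness'),
    completeness: score('completeness'),
    conciseness: score('conciseness'),
    reasoning,
  }
}
```

Rejected judge outputs are worth counting as a metric of their own: a rising rejection rate usually means the judge prompt or model has drifted.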

Evaluation Frameworks:

Framework	Language	Notes
Braintrust	TypeScript/Python	Best-in-class eval dataset management, CI integration
LangSmith Evals	Python	Tight LangChain integration, shareable datasets
Promptfoo	TypeScript	YAML-configured evals, provider-agnostic
Evals (OpenAI)	Python	OpenAI’s own framework for evaluating models
RAGAS	Python	Specialized for RAG evaluation
TruLens	Python	LLM observability + feedback functions
DeepEval	Python	Comprehensive metric library (G-Eval, RAGAS, toxicity)

Benchmark Suites for Agents:

Benchmark	What it tests
SWE-bench	Solving real GitHub issues (software engineering)
SWE-bench Verified	Human-verified subset of SWE-bench
AgentBench	Multi-domain agent tasks (OS, DB, web, code)
GAIA	Real-world assistant tasks with tool use
WebArena	Web browsing and navigation
OSWorld	Desktop computer use
TAU-bench	Tool agent benchmark with real APIs

Golden Dataset Curation

The most important eval practice is building and maintaining a golden dataset:

Collect from production: log all agent inputs and outputs in production
Label failures: when a user corrects the agent or gives negative feedback, tag that input
Label successes: randomly sample good outputs and verify them
Balance the set: ensure coverage of edge cases, not just the happy path
Version the dataset: evals are only meaningful relative to a specific dataset version
Rotate in new examples: a frozen dataset decays as the system changes

Minimum Viable Eval

Don’t wait for a comprehensive eval suite. Ship a minimum viable eval that runs in < 2 minutes:

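// minimum-viable-eval.ts
import { strict as assert } from 'node:assert'

// Placeholder for the policy under test; replace with your own implementation.
declare function evaluatePolicy(input: string): Promise<{ type: string; priority: number }>

const TEST_CASES = [
  {
    input: "repo has no CI workflow",
    expectedJobType: "add-workflow",
    expectedPriority: 2,
  },
  {
    input: "CI is failing on main",
    expectedJobType: "fix-ci",
    expectedPriority: 1,
  },
  // 5-10 cases covering critical paths
]

// Top-level await: run this file as an ES module.
for (const tc of TEST_CASES) {
  const result = await evaluatePolicy(tc.input)
  assert(result.type === tc.expectedJobType, `Wrong job type for: ${tc.input}`)
  assert(result.priority === tc.expectedPriority, `Wrong priority for: ${tc.input}`)
}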

Constitutional AI and RLHF in the eval loop

Constitutional AI (Anthropic, 2022) trains models using a set of principles (“the constitution”) evaluated by the model itself. RLHF (Reinforcement Learning from Human Feedback) fine-tunes models using human preference signals.

For application-level systems, these techniques manifest as:

Self-critique prompts: ask the model to evaluate its own output before returning it
Preference datasets: collect human ratings to few-shot prompt the judge
Red-teaming: adversarial evaluation specifically seeking failure modes

Red-Teaming for AI Agents

Red-teaming agents requires finding inputs that cause:

Prompt injection (user tricks agent into executing unintended actions)
Tool misuse (agent calls destructive tools inappropriately)
Hallucinated authority (agent claims permissions it doesn’t have)
Context manipulation (long conversation history causes drift from initial instructions)

Automate red-teaming by generating adversarial inputs with a separate model specifically tasked to find failures.
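
A minimal sketch of that automation, assuming both the agent and the attacker model are supplied as functions. The `Agent`, `Attacker`, and `redTeam` names and the trace format are hypothetical, and the check shown covers only the tool-misuse category; the other failure modes listed above would need their own detectors.

```typescript
// All types below are hypothetical; adapt them to your agent's real trace format.
interface ToolCall { tool: string; args: Record<string, unknown>; }
interface AgentRun { output: string; toolCalls: ToolCall[]; }
interface Finding { input: string; tool: string; }

type Agent = (input: string) => Promise<AgentRun>;
// The attacker wraps a separate model prompted to produce adversarial inputs.
type Attacker = (goal: string, count: number) => Promise<string[]>;

// Run each generated attack against the agent and record every call
// to a tool the agent should never invoke for untrusted input.
async function redTeam(
  agent: Agent,
  attacker: Attacker,
  goal: string,
  forbiddenTools: string[],
  count: number = 5
): Promise<Finding[]> {
  const findings: Finding[] = [];
  const inputs = await attacker(goal, count);
  for (const input of inputs) {
    const run = await agent(input);
    for (const call of run.toolCalls) {
      if (forbiddenTools.indexOf(call.tool) !== -1) {
        findings.push({ input, tool: call.tool });
      }
    }
  }
  return findings;
}
```

Each finding is a candidate for the golden dataset: once the agent is fixed, the adversarial input becomes a regression case.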

15. The Strangler Fig Pattern

When you need to replace an existing system while keeping it running, the Strangler Fig pattern (Martin Fowler, 2004) applies the tight loop approach to migration.

The pattern:

Phase 1: New system runs alongside old, handles nothing
         [all traffic] → [old system]
         [new system]   (idle)

Phase 2: Route specific, low-risk paths to new system
         [path A] → [new system]
         [path B] → [old system]

Phase 3: Gradually move paths
         [path A, C, D] → [new system]
         [path B]       → [old system]

Phase 4: Old system dies naturally (strangled by the fig)
         [all traffic] → [new system]

The tight loop application: don’t big-bang migrate. Each “move a path to the new system” step is one loop iteration. Observe the new system under real load, fix issues, expand scope.
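
One way to picture a single iteration is a router that sends migrated path prefixes to the new system and everything else to the old one. This is a sketch under assumptions: `StranglerRouter` and its methods are illustrative names, and a real deployment would usually do this at a proxy or gateway rather than in application code.

```typescript
type Backend = "old" | "new";

// Routes by path prefix. Migrating or rolling back one prefix is one loop iteration.
class StranglerRouter {
  private migrated: string[] = [];

  // Move a path prefix to the new system.
  migrate(prefix: string): void {
    if (this.migrated.indexOf(prefix) === -1) this.migrated.push(prefix);
  }

  // Send a prefix back to the old system if the new one misbehaves.
  rollback(prefix: string): void {
    this.migrated = this.migrated.filter((p) => p !== prefix);
  }

  // Exact match or a sub-path of a migrated prefix goes to the new system.
  route(path: string): Backend {
    const hit = this.migrated.some(
      (p) => path === p || path.indexOf(p + "/") === 0
    );
    return hit ? "new" : "old";
  }
}
```

Keeping `rollback` as cheap as `migrate` is what makes each step safe to take: if the new system fails under real load, the path returns to the old system in one call.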

Where it applies:

Migrating from monolith to microservices
Replacing a legacy database with a new schema
Upgrading an agent framework without service interruption
Moving from one model provider to another

16. Applying the Tight Loop to Autonomous Agent Systems

This section synthesizes the above into a concrete architecture for self-healing agent systems — the specific context that motivated this article.

The Two-Tier Agent Architecture

┌─────────────────────────────────────────────────────────────┐
│                         BRAIN (Policy Plane)                 │
│                                                              │
│  [Scanner] → [Policy Engine] → [Job Queue] → [Dispatcher]   │
│      ↑                                            │          │
│      └────────────── [Result Ingest] ─────────────┘          │
└─────────────────────────────────────────────────────────────┘
                              │
                         (job dispatch)
                              │
┌─────────────────────────────────────────────────────────────┐
│                        WORKERS (Execution Plane)             │
│                                                              │
│  [RepoPrime] [CI-Fixer] [Compliance-Bot] [PR-Merger]         │
│      │           │            │               │              │
│      └───────────┴────────────┴───────────────┘              │
│                       (result reports)                        │
└─────────────────────────────────────────────────────────────┘

The tight loop for each tier:

Brain loop (minutes):

Scan repos for signals (CI status, file presence, branch protection)
Evaluate policy (desired state = signals all green)
For each violation, create a job with priority and type
Dispatch to eligible runners

Worker loop (seconds to minutes per job):

Claim a job
Execute the fix (run Claude, push a PR, trigger a workflow)
Report result back to Brain
Brain re-scans to verify the fix closed the gap
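
The Brain steps above can be sketched as a pure function from scanned signals to new jobs. The `RepoSignals` and `Job` shapes and the `violationsFor` and `reconcile` names are assumptions for illustration; the job types and priorities mirror the cases in the Minimum Viable Eval section.

```typescript
// Hypothetical signal and job shapes for this sketch.
interface RepoSignals {
  repo: string;
  hasCiWorkflow: boolean;
  ciPassing: boolean; // ignored when there is no workflow
}

interface Job {
  repo: string;
  type: "add-workflow" | "fix-ci";
  priority: number; // 1 is most urgent
}

// Policy: the desired state is "workflow present and CI green".
function violationsFor(s: RepoSignals): Job[] {
  if (!s.hasCiWorkflow) return [{ repo: s.repo, type: "add-workflow", priority: 2 }];
  if (!s.ciPassing) return [{ repo: s.repo, type: "fix-ci", priority: 1 }];
  return [];
}

// One Brain iteration: evaluate every repo, skip jobs that are already
// pending, and return the new jobs ordered by priority for dispatch.
function reconcile(signals: RepoSignals[], pending: Job[]): Job[] {
  const key = (j: Job) => `${j.repo}:${j.type}`;
  const seen: Record<string, boolean> = {};
  for (const j of pending) seen[key(j)] = true;

  const created: Job[] = [];
  for (const s of signals) {
    for (const job of violationsFor(s)) {
      if (seen[key(job)]) continue;
      seen[key(job)] = true;
      created.push(job);
    }
  }
  return created.sort((a, b) => a.priority - b.priority);
}
```

Because `reconcile` has no side effects, the same function can back both the cron run and a debug run over a handful of repos.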

Observability the loop needs:

// Every reconciliation run should emit:
interface ReconcileEvent {
  timestamp: number
  trigger: 'cron' | 'webhook' | 'debug'
  repos_scanned: number
  repos_with_violations: number
  jobs_created: { type: string, priority: number, repo: string }[]
  jobs_already_pending: number  // deduplicated
  scan_duration_ms: number
  errors: { repo: string, error: string }[]
}

// Every job completion should emit:
interface JobCompletionEvent {
  job_id: string
  job_type: string
  repo: string
  runner_id: string
  duration_ms: number
  success: boolean
  pr_url?: string
  error?: string
  retry_count: number
}

The Fast-Iteration Subset

The single most impactful tight-loop technique for agent systems:

# Instead of waiting for the full cron (30 repos, 8 minutes):
curl -X POST .../debug/reconcile -H 'Content-Type: application/json' -d '{"repos": ["owner/repo-a", "owner/repo-b", "owner/repo-c"]}'

# Choose repos that cover:
# - A repo with CI failing (tests fix-ci path)
# - A repo missing a file (tests add-workflow path)
# - A repo that's clean (tests no-duplicate-job path)
# - A repo with known edge cases (tests error handling)

Self-healing failure modes:

Failure	Detection	Remediation
Job stuck in `assigned` > 10min	Cron checks claimed_at	Re-queue to `pending`
Runner offline	heartbeat cutoff (90s)	Remove from eligible runners
Scan error for a repo	error count in reconcile result	Alert + skip repo
Job loop (same job created repeatedly)	Check `jobs WHERE repo=? AND type=? AND status IN ('pending', 'assigned', 'running')`	Deduplicate before insert
Worker crash mid-job	Job stays `running`, runner goes offline	Timeout → re-queue
GitHub API rate limit	429 response	Exponential backoff + pause scan
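
Three rows of the table (stuck `assigned` jobs, offline runners, and worker crashes mid-job) could be handled by one periodic sweep. The sketch below assumes a job row with `claimedAt` and `runnerId` fields and a runner row with `lastHeartbeatAt`; those names, and the `sweep` function itself, are illustrative. The thresholds come from the table.

```typescript
type JobStatus = "pending" | "assigned" | "running" | "done";

// Hypothetical row shapes; timestamps are epoch milliseconds.
interface JobRow {
  id: string;
  status: JobStatus;
  claimedAt: number | null;
  runnerId: string | null;
}

interface Runner {
  id: string;
  lastHeartbeatAt: number;
}

const ASSIGNED_TIMEOUT_MS = 10 * 60 * 1000; // stuck in `assigned` > 10 min
const HEARTBEAT_CUTOFF_MS = 90 * 1000;      // heartbeat cutoff (90s)

function isOnline(r: Runner, now: number): boolean {
  return now - r.lastHeartbeatAt <= HEARTBEAT_CUTOFF_MS;
}

// Returns a copy of the jobs with stuck or orphaned ones reset to pending.
function sweep(jobs: JobRow[], runners: Runner[], now: number): JobRow[] {
  const online: Record<string, boolean> = {};
  for (const r of runners) online[r.id] = isOnline(r, now);

  return jobs.map((job): JobRow => {
    const stuckAssigned =
      job.status === "assigned" &&
      job.claimedAt !== null &&
      now - job.claimedAt > ASSIGNED_TIMEOUT_MS;
    const orphanedRunning =
      job.status === "running" &&
      (job.runnerId === null || !online[job.runnerId]);
    if (stuckAssigned || orphanedRunning) {
      return { id: job.id, status: "pending", claimedAt: null, runnerId: null };
    }
    return job;
  });
}
```

Passing `now` in as a parameter keeps the sweep deterministic, so the timeout logic can be tested without waiting ten minutes.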

17. Anti-Patterns

Anti-pattern 1: The Manual Fix

Symptom: A bug occurs. You fix the specific instance of the bug, not the class of bug.

Example: The scanner crashes on repo X because it has no description. You add ?? '' to the description lookup. Next week, repo Y has no homepage and the scanner crashes the same way.

Fix: Identify the class of failure (missing optional fields), add defensive defaults for all of them, add a test for the failure class.
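
A sketch of fixing the class rather than the instance: normalize every optional field once, at the boundary. The `RawRepo` and `Repo` shapes are assumptions for illustration, not the actual GitHub API types.

```typescript
// Hypothetical raw shape: every optional field may be missing or null.
interface RawRepo {
  name: string;
  description?: string | null;
  homepage?: string | null;
  topics?: string[] | null;
}

// Normalized shape: downstream code never sees null or undefined.
interface Repo {
  name: string;
  description: string;
  homepage: string;
  topics: string[];
}

// Apply defaults for all optional fields in one place.
function normalizeRepo(raw: RawRepo): Repo {
  return {
    name: raw.name,
    description: raw.description ?? "",
    homepage: raw.homepage ?? "",
    topics: raw.topics ?? [],
  };
}
```

The matching test then targets the failure class: feed in one repo per missing field and check that nothing downstream receives null or undefined.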

Anti-pattern 2: Waiting for the Full Run

Symptom: You make a change, then wait 30 minutes for the cron to tell you if it worked.

Fix: Build a debug/test endpoint that runs the loop on a configurable subset. The first thing to build when setting up a new loop is the fast iteration path.
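
The core of such an endpoint is small. A framework-free sketch, assuming a request body with an optional `repos` array (the body shape and the `resolveTargets` name are hypothetical):

```typescript
// Hypothetical request body for a debug reconcile endpoint.
interface DebugReconcileBody {
  repos?: unknown;
}

interface Targets {
  repos: string[];
  error: string | null;
}

// Decide which repos a run should scan: the requested subset if one is
// given, otherwise everything (the cron behaviour).
function resolveTargets(allRepos: string[], body: DebugReconcileBody): Targets {
  if (body.repos === undefined) return { repos: allRepos, error: null };
  if (!Array.isArray(body.repos) || body.repos.length === 0) {
    return { repos: [], error: "repos must be a non-empty array" };
  }
  const invalid = body.repos.filter(
    (r) => typeof r !== "string" || allRepos.indexOf(r) === -1
  );
  if (invalid.length > 0) {
    return { repos: [], error: `unknown repos: ${invalid.join(", ")}` };
  }
  return { repos: body.repos as string[], error: null };
}
```

Rejecting unknown repos instead of silently skipping them keeps a typo from producing a run that scans nothing and reports success.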

Anti-pattern 3: Alert Without Action

Symptom: The system detects a problem and sends a notification to a human, who then manually decides what to do.

Fix: Every alert should have a corresponding runbook, and every runbook should have an automated version. Humans should handle the exceptions, not the routine cases.

Anti-pattern 4: The Infinite Retry

Symptom: The system retries failed jobs indefinitely with no backoff and no limit. One broken API endpoint saturates the job queue.

Fix: Exponential backoff + max retry count + dead letter queue. After N failures, move the job to a dead letter queue for human inspection.
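
The three parts of that fix fit in a few lines. This is a sketch with assumed defaults (1 s base delay, 60 s cap, 5 retries) and hypothetical names; production code would typically also add random jitter to the delay, which is omitted here to keep the example deterministic.

```typescript
type FailureOutcome =
  | { action: "retry"; delayMs: number; nextRetryCount: number }
  | { action: "dead-letter"; reason: string };

// Exponential backoff: base * 2^retryCount, capped so delays stay bounded.
function backoffDelayMs(retryCount: number, baseMs: number = 1000, capMs: number = 60000): number {
  return Math.min(capMs, baseMs * Math.pow(2, retryCount));
}

// Decide what to do after a job fails. retryCount is the number of
// retries already attempted for this job.
function onFailure(retryCount: number, error: string, maxRetries: number = 5): FailureOutcome {
  if (retryCount >= maxRetries) {
    return { action: "dead-letter", reason: `gave up after ${retryCount} retries: ${error}` };
  }
  return {
    action: "retry",
    delayMs: backoffDelayMs(retryCount),
    nextRetryCount: retryCount + 1,
  };
}
```

With these defaults a job waits 1 s, 2 s, 4 s, 8 s, and 16 s between attempts and then moves to the dead letter queue, so one broken endpoint costs each job at most five retries.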

Anti-pattern 5: Observing at the Wrong Granularity

Symptom: You measure “reconciliation succeeded” but not “how many repos were scanned, how many had violations, how many jobs were created.”

Fix: Emit structured events at every step of the loop. Aggregate metrics tell you whether the system is healthy; structured events tell you why.
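
One way to get both views is to emit per-step events and derive the aggregates from them. The `LoopEvent` variants and the `summarize` helper below are assumptions for this sketch; the summary field names follow the reconciliation event shape in section 16.

```typescript
// Hypothetical per-step events emitted during one reconciliation run.
type LoopEvent =
  | { step: "scan"; repo: string; error?: string }
  | { step: "violation"; repo: string; type: string }
  | { step: "job_created"; repo: string; type: string; priority: number };

interface Summary {
  repos_scanned: number;
  repos_with_violations: number;
  jobs_created: number;
  errors: { repo: string; error: string }[];
}

// Derive the aggregate view from the event stream, so the metrics and
// the events can never disagree.
function summarize(events: LoopEvent[]): Summary {
  const violating: Record<string, boolean> = {};
  const summary: Summary = {
    repos_scanned: 0,
    repos_with_violations: 0,
    jobs_created: 0,
    errors: [],
  };
  for (const e of events) {
    if (e.step === "scan") {
      summary.repos_scanned++;
      if (e.error !== undefined) summary.errors.push({ repo: e.repo, error: e.error });
    } else if (e.step === "violation") {
      violating[e.repo] = true;
    } else {
      summary.jobs_created++;
    }
  }
  summary.repos_with_violations = Object.keys(violating).length;
  return summary;
}
```

When the summary looks wrong, the raw events are still there to show which repo and which step produced the number.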

Anti-pattern 6: Changing Multiple Things at Once

Symptom: You deploy three changes simultaneously. Something breaks. You don’t know which change caused it.

Fix: One change per loop iteration. If you must batch changes, use feature flags to control which changes are active in production and roll them out one at a time.

Anti-pattern 7: Testing Only the Happy Path

Symptom: Your test subset only covers repos that work correctly. The broken repo is the first one discovered by the cron.

Fix: Your test subset must include at least one example of each known failure mode. The subset is a regression pack, not a demo.
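
Whether a subset qualifies as a regression pack can be checked mechanically. The sketch below assumes a hand-maintained catalogue of repos tagged with the loop paths they exercise; `RepoFixture`, `buildRegressionPack`, and the path names are illustrative.

```typescript
// Hypothetical catalogue entry: a repo and the loop paths it is known to exercise.
interface RepoFixture {
  repo: string;
  exercises: string[];
}

interface RegressionPack {
  repos: string[];     // the subset to pass to the debug run
  uncovered: string[]; // required paths that no fixture exercises
}

// Greedy pick: for each required path not yet covered, take the first
// fixture that exercises it. Paths with no fixture are reported, not ignored.
function buildRegressionPack(fixtures: RepoFixture[], requiredPaths: string[]): RegressionPack {
  const repos: string[] = [];
  const uncovered: string[] = [];
  const covered: Record<string, boolean> = {};
  for (const path of requiredPaths) {
    if (covered[path]) continue;
    let found: RepoFixture | undefined;
    for (const f of fixtures) {
      if (f.exercises.indexOf(path) !== -1) {
        found = f;
        break;
      }
    }
    if (!found) {
      uncovered.push(path);
      continue;
    }
    repos.push(found.repo);
    for (const p of found.exercises) covered[p] = true;
  }
  return { repos, uncovered };
}
```

A non-empty `uncovered` list is the signal to go find or construct a repo that reproduces that failure mode before trusting the subset.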

18. The Meta-Loop: Continuously Enhancing This Article

This article is itself a tight loop artifact. It should be enhanced each time:

A new tight loop pattern is discovered in practice
A new tool is adopted that changes how the loop is implemented
An anti-pattern is identified in actual work
A section becomes outdated as tooling evolves

Planned additions:

Case study: the Mulan dispatcher tight loop (concrete example of S1-S4)
Temporal workflows as a managed tight loop runtime
The HITL (Human In The Loop) pattern — when and how to put humans back in
Cost-aware loops — token budgets, model tiering, eval gating on cost
Multi-agent loop coordination — when multiple loops interact (handoff protocols)
Eval-driven prompt engineering — using eval results to refine prompts systematically
The “shadow mode” pattern for agent systems — run new agent on real traffic, compare to old agent’s output, measure diff before switching
Canary releases for prompt changes — deploy new prompt to 5% of traffic, measure eval scores, roll forward or back

19. Tools and Libraries Reference

Self-Healing Infrastructure

Tool	Category	Link
Kubernetes controller-runtime	Reconciliation loops	sigs.k8s.io/controller-runtime
Argo Rollouts	Progressive delivery	argoproj.github.io/argo-rollouts
Flagger	Canary automation	flagger.app
Temporal	Workflow orchestration	temporal.io
Resilience4j	Circuit breaker (JVM)	resilience4j.readme.io
Polly	Circuit breaker (.NET)	pollydocs.org
opossum	Circuit breaker (Node)	nodeshift/opossum
Gremlin	Chaos engineering	gremlin.com
LitmusChaos	Kubernetes chaos	litmuschaos.io
toxiproxy	Network failure injection	shopify/toxiproxy

Observability

Tool	Category	Link
OpenTelemetry	Instrumentation standard	opentelemetry.io
Prometheus	Metrics	prometheus.io
Grafana	Dashboards	grafana.com
Jaeger	Distributed tracing	jaegertracing.io
Honeycomb	High-cardinality observability	honeycomb.io
Loki	Log aggregation	grafana.com/oss/loki

AI/LLM Observability

Tool	Category	Link
Langfuse	LLM tracing + eval	langfuse.com
LangSmith	LangChain observability	smith.langchain.com
Helicone	Proxy-based LLM observability	helicone.ai
Arize Phoenix	ML + LLM observability	arize.com/phoenix
AgentOps	Agent session replay	agentops.ai
Weights & Biases Weave	LLM tracing + eval	wandb.ai/site/weave
Braintrust	LLM evaluation	braintrust.dev

Evaluation Frameworks

Tool	Language	Link
Braintrust	TypeScript/Python	braintrust.dev
Promptfoo	TypeScript	promptfoo.dev
DeepEval	Python	confident-ai.com
RAGAS	Python	ragas.io
TruLens	Python	trulens.org
LangSmith Evals	Python	smith.langchain.com

Fast CI/CD

Tool	Category	Notes
Nx	Monorepo build cache	Supports affected-only builds
Turborepo	Monorepo task runner	Vercel-backed
Bazel	Hermetic build system	Google-scale
Launchable	Test impact analysis	ML-based selection
GitHub Actions path filters	CI scoping	`on.push.paths`

Last updated: 2026-03-20 — S3 of garywu/mulan epic #123

Next update triggers: S4 completion (Brain DO), first successful self-healing loop, new eval tool adoption