“Don’t fix the bug. Fix the system that let the bug live.” — the tight loop principle
Table of Contents
- 1. The Core Insight
- 2. The Observe-Orient-Decide-Act Loop (OODA)
- 3. Control Theory and PID Controllers
- 4. Erlang/OTP: Industrial-Strength Self-Healing
- 5. Kubernetes Reconciliation Loops
- 6. Fast CI/CD Feedback and the Andon Cord
- 7. Chaos Engineering
- 8. Progressive Delivery and Canary Deployments
- 9. Observability: The Instrument Panel
- 10. DORA Metrics: Measuring the Loop
- 11. Circuit Breakers and Bulkheads
- 12. Minimal Viable Test Subsets
- 13. AI Agent Observability
- 14. AI Evaluation: The Eval Loop
- 15. The Strangler Fig Pattern
- 16. Applying the Tight Loop to Autonomous Agent Systems
- 17. Anti-Patterns
- 18. The Meta-Loop: Continuously Enhancing This Article
- 19. Tools and Libraries Reference
1. The Core Insight
The tight loop is a discipline, not a technique.
Most engineers, when they encounter a failure in a running system, reach for a fix to the immediate problem. The tight loop practitioner does something different: they ask why the system couldn’t find and fix this itself, then fix the system’s ability to do that.
The tight loop has three properties:
- Latency is the enemy. The longer the gap between cause and observation (between action and feedback), the more work is wasted, the more damage accumulates, and the harder root cause becomes to identify. Compress this gap to seconds or minutes, not hours or days.
- The system must observe itself. External monitoring is table stakes. Real tight-loop systems emit structured telemetry that lets the system, not a human, detect drift and trigger correction.
- Scope before you run. Don’t iterate over the full problem space when a representative subset reveals the same failure modes in a fraction of the time. Use the smallest input that exercises the broken path.
The loop itself:
commit ──→ observe ──→ classify ──→ act ──→ commit
↑ │
└──────────────────────────────────────────┘
Each turn of the loop should take seconds to minutes, not hours. If a turn takes longer, the first fix is to shorten the loop, not to fix whatever error the long loop produced.
2. The Observe-Orient-Decide-Act Loop (OODA)
John Boyd designed the OODA loop for aerial combat. A pilot who can complete an OODA cycle faster than their opponent gets inside the opponent’s decision cycle and wins before the opponent can respond.
The same dynamic applies to software systems.
┌─────────────────────────────────────────────────────────┐
│ OODA LOOP │
│ │
│ OBSERVE ──→ ORIENT ──→ DECIDE ──→ ACT │
│ ↑ │ │
│ └──────────────────────────────┘ │
│ │
│ Observe: Collect signals (metrics, logs, traces) │
│ Orient: Make sense of signals in context │
│ Decide: Choose an action from the action space │
│ Act: Execute and inject the result back into Obs. │
└─────────────────────────────────────────────────────────┘
Applying OODA to systems:
| OODA Phase | Tight Loop Equivalent |
|---|---|
| Observe | Structured logs, metrics, distributed traces, health endpoints |
| Orient | Alerting rules, anomaly detection, policy engines |
| Decide | Runbooks, auto-remediation scripts, LLM-based diagnosis |
| Act | Helm rollback, feature flag toggle, job queue push, PR creation |
The key insight from Boyd: speed through the loop matters more than quality of any individual decision. A pilot who makes good decisions slowly loses to a pilot who makes OK decisions fast. For systems, this means: prefer reversible actions that close the loop quickly over perfect actions that take hours to compute.
References:
- Boyd, John. “Patterns of Conflict” (1986 briefing slides, declassified)
- Chet Richards, Certain to Win (2004) — best accessible treatment of OODA
- Mark Osborne, OODA in Software Engineering (2019)
3. Control Theory and PID Controllers
Before software, engineers built tight loops in analog circuits and mechanical systems. Control theory gives us a rigorous vocabulary.
The classic feedback control loop:
setpoint ──→ [comparator] ──→ [controller] ──→ [plant]
↑ │
└──────────── [sensor] ──────────┘
- Setpoint: the desired state (e.g., p99 latency < 100ms, error rate < 0.1%)
- Plant: the system being controlled (the service, the fleet, the pipeline)
- Sensor: observability infrastructure (metrics, logs, traces)
- Controller: the correction logic (alerting rules, auto-scaler, AI agent)
- Comparator: the gap between current state and setpoint
PID Controllers
The most common controller is the PID (Proportional-Integral-Derivative):
- P (Proportional): correction proportional to current error. Fast response, can overshoot.
- I (Integral): correction proportional to accumulated error over time. Eliminates steady-state drift, can cause oscillation.
- D (Derivative): correction proportional to rate of change of error. Damps oscillation, amplifies noise.
correction = Kp * error + Ki * ∫error dt + Kd * d(error)/dt
Software analogues:
| Control Concept | Software Equivalent |
|---|---|
| Setpoint | SLO / error budget target |
| Proportional | Scale replicas proportionally to request queue depth |
| Integral | Carry over unresolved incidents to next sprint budget |
| Derivative | Alert when error rate is increasing, not just when it’s high |
| Deadband | Don’t act unless deviation > threshold (avoid flapping) |
| Anti-windup | Cap backlog so old accumulated issues don’t distort current action |
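The formula and the deadband and anti-windup rows above can be sketched as a small discrete controller. This is a minimal illustration, not a tuned implementation: the names `PidConfig` and `PidController`, the gains, and the limits are all assumptions made for the example.

```typescript
// Discrete PID sketch. All gains and limits are illustrative, not tuned values.
interface PidConfig {
  kp: number; // proportional gain
  ki: number; // integral gain
  kd: number; // derivative gain
  deadband: number; // errors smaller than this are treated as zero
  integralLimit: number; // anti-windup clamp on the accumulated error
}

class PidController {
  private readonly config: PidConfig;
  private integral = 0;
  private previousError: number | null = null;

  constructor(config: PidConfig) {
    this.config = config;
  }

  // One control step: returns the correction for a sample taken dtSeconds after the last one.
  update(setpoint: number, measured: number, dtSeconds: number): number {
    let error = setpoint - measured;
    // Deadband: ignore small deviations so the controller does not flap.
    if (Math.abs(error) < this.config.deadband) error = 0;

    // Integral term, clamped so old accumulated error cannot dominate (anti-windup).
    const limit = this.config.integralLimit;
    this.integral = Math.max(-limit, Math.min(limit, this.integral + error * dtSeconds));

    // Derivative term; zero on the first step because there is no previous error yet.
    const derivative =
      this.previousError === null ? 0 : (error - this.previousError) / dtSeconds;
    this.previousError = error;

    return (
      this.config.kp * error +
      this.config.ki * this.integral +
      this.config.kd * derivative
    );
  }
}
```

In a service setting the correction might be mapped to a replica delta or a rate limit; that mapping is left out because it depends entirely on the plant being controlled.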
The Integral term and tech debt: Technical debt is the integral of error. Systems that only react to current problems (pure P control) accumulate drift until it’s catastrophic. Tight-loop teams explicitly budget for integral correction — this is what “paying down tech debt” means in control theory terms.
References:
- Åström & Wittenmark, Computer-Controlled Systems (classic textbook)
- Netflix Tech Blog: Auto-Tuning PID Controllers for Microservices (2018)
- Google SRE Book, Chapter 6: “Monitoring Distributed Systems”
4. Erlang/OTP: Industrial-Strength Self-Healing
Erlang was designed in the 1980s for telephone switches that needed extreme availability; Ericsson’s AXD301 switch is often cited as reaching nine nines of uptime (about 31 ms of downtime per year). The solution was radical: let it crash, then supervise the crash.
Supervisor Trees
Every Erlang/OTP system is a tree of processes. Each internal node is a supervisor whose only job is to monitor its children and restart them on failure. Leaves are workers.
[Application Supervisor]
/ \
[Worker Sup] [Scanner Sup]
/ | \ / \
[W1] [W2] [W3] [S1] [S2]
Restart strategies:
- one_for_one: restart only the failed child
- one_for_all: if any child fails, restart all siblings (used when children share state)
- rest_for_one: restart the failed child and all children started after it (used for dependency chains)
“Let it crash” philosophy
Rather than defensive programming (check every error, handle every nil, catch every exception), Erlang programs assume the happy path and crash on anything unexpected. The supervisor tree catches the crash, restarts the process from a known-good state, and logs the crash for later analysis.
This is not abandoning reliability — it’s a different theory of reliability:
- Crash fast → minimal damage to shared state
- Restart clean → known-good starting point
- Log everything → root cause analysis post-hoc
- Supervisor knows the context → escalate up the tree if restart loops
The circuit breaker built-in: OTP supervisors have max_restarts and max_seconds parameters. If a child crashes more than max_restarts times in max_seconds, the supervisor itself crashes, propagating the failure up the tree. This is automatic circuit breaking.
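The restart threshold described above can be approximated outside the BEAM with a sliding window of crash timestamps. The sketch below is a rough TypeScript analogue, not OTP's implementation; `RestartLimiter` and its method name are hypothetical, and time is passed in explicitly so the logic stays testable.

```typescript
// Sliding-window restart limiter, loosely modeled on the max_restarts / max_seconds
// rule described above. Illustrative only; not how OTP itself is implemented.
type CrashVerdict = "restart" | "escalate";

class RestartLimiter {
  private readonly maxRestarts: number;
  private readonly maxSeconds: number;
  private crashTimesMs: number[] = [];

  constructor(maxRestarts: number, maxSeconds: number) {
    this.maxRestarts = maxRestarts;
    this.maxSeconds = maxSeconds;
  }

  // Record a child crash at nowMs and decide whether to restart the child
  // or give up and propagate the failure to the parent supervisor.
  onChildCrash(nowMs: number): CrashVerdict {
    const windowStartMs = nowMs - this.maxSeconds * 1000;
    // Keep only crashes that are still inside the window, then add this one.
    this.crashTimesMs = this.crashTimesMs.filter((t) => t > windowStartMs);
    this.crashTimesMs.push(nowMs);
    return this.crashTimesMs.length > this.maxRestarts ? "escalate" : "restart";
  }
}
```

With `new RestartLimiter(3, 5)`, the fourth crash inside five seconds returns "escalate"; crashes spaced further apart keep returning "restart".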
Applying to modern systems:
| Erlang/OTP Concept | Modern Equivalent |
|---|---|
| Supervisor tree | Kubernetes Pod + Deployment + HPA |
| one_for_one | Pod restart policy |
| Max restart threshold | CrashLoopBackOff + alerting |
| “Let it crash” | Structured exception logging + dead letter queues |
| Process isolation | Container isolation / actor model (Akka, Orleans) |
| Hot code reload | Rolling deployments, feature flags |
Libraries and tools:
- Elixir: brings OTP to a modern Ruby-like syntax; same supervisor model
- Akka (Scala/Java): actor model with supervisor hierarchies
- Microsoft Orleans (.NET): virtual actor model with grain persistence
- Temporal (Go/Java/TypeScript): workflow orchestration with automatic retry trees
5. Kubernetes Reconciliation Loops
Kubernetes is the most widely deployed implementation of the tight loop pattern at infrastructure scale.
The core principle: desired state vs. actual state
Every Kubernetes controller follows the same pattern:
while (true) {
const desired = getDesiredState() // read the spec via the API server (backed by etcd)
const actual = getActualState() // query the real world (status)
const diff = reconcile(desired, actual)
if (diff) apply(diff)
await sleep(resyncPeriod)
}
This is the reconciliation loop. Controllers don’t store “what I last did” — they compute “what I need to do now” from first principles on every iteration. This makes them idempotent and convergent: run them 100 times on a healthy cluster and nothing changes; run them once after a failure and they fix it.
Key properties:
- Level-triggered, not edge-triggered: controllers react to current state, not events. This means missed events (network partition, API server restart) don’t leave the system in a broken state indefinitely.
- Optimistic concurrency: controllers use resourceVersion to detect conflicts. If the resource changed since they last read it, they start over. No distributed locks required.
- Eventual consistency: Kubernetes makes no guarantees about when the actual state matches desired, only that it will converge. This is the right contract for distributed systems.
Writing your own controller
The Kubernetes controller-runtime library (Go) implements the scaffolding:
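// controller-runtime example
// MyResource and MyReconciler are your own types: MyResource must implement
// client.Object, and MyReconciler is assumed to embed client.Client.
import (
	"context"
	"reflect"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

func (r *MyReconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
	// 1. Fetch the object; a deleted object is not an error
	obj := &MyResource{}
	if err := r.Get(ctx, req.NamespacedName, obj); err != nil {
		return reconcile.Result{}, client.IgnoreNotFound(err)
	}
	// 2. Compute desired state
	desired := r.computeDesired(obj)
	// 3. Diff and apply; the status update itself triggers another reconcile
	if !reflect.DeepEqual(obj.Status, desired) {
		obj.Status = desired
		return reconcile.Result{}, r.Status().Update(ctx, obj)
	}
	// 4. Re-queue after resync period
	return reconcile.Result{RequeueAfter: 5 * time.Minute}, nil
}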
Applying this pattern outside Kubernetes:
Any system with a “desired config” and a “running reality” can use reconciliation loops:
- Infrastructure-as-Code: Terraform plan/apply is one reconciliation pass
- Database schema management: Flyway/Liquibase migrates toward desired schema
- Agent dispatchers: scan repos for violations (actual), compare to policy (desired), create jobs (apply)
- Feature flag systems: compare enabled users (desired) to current rollout (actual), adjust weights
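Outside Kubernetes, the same desired-versus-actual comparison can be sketched in a few lines. This is a minimal illustration under simplifying assumptions: state is modeled as a plain name-to-version record, and `planChanges` and `applyChanges` are hypothetical names, not part of any library.

```typescript
// Level-triggered reconciliation over named resources.
// State maps a resource name to a config version; both are illustrative.
type ResourceState = Record<string, string>;

interface ChangePlan {
  create: string[];
  update: string[];
  remove: string[];
}

// Compute what must change, from scratch, using only desired and actual state.
function planChanges(desired: ResourceState, actual: ResourceState): ChangePlan {
  const result: ChangePlan = { create: [], update: [], remove: [] };
  for (const key of Object.keys(desired)) {
    if (!(key in actual)) result.create.push(key);
    else if (actual[key] !== desired[key]) result.update.push(key);
  }
  for (const key of Object.keys(actual)) {
    if (!(key in desired)) result.remove.push(key);
  }
  return result;
}

// Apply a plan and return the new actual state (a pure stand-in for real side effects).
function applyChanges(desired: ResourceState, actual: ResourceState, changes: ChangePlan): ResourceState {
  const next: ResourceState = { ...actual };
  for (const key of changes.create.concat(changes.update)) next[key] = desired[key];
  for (const key of changes.remove) delete next[key];
  return next;
}
```

Because the plan is recomputed from current state every time, running it again after a successful apply yields an empty plan, which is the idempotence property described earlier.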
References:
- Kubebuilder Book — best tutorial on writing controllers
- Programming Kubernetes (Hausenblas & Schimanski, 2019)
- Bilgin Ibryam, Kubernetes Patterns — especially Chapter 27: Controller
6. Fast CI/CD Feedback and the Andon Cord
Toyota Production System and the Andon Cord
Toyota’s Andon cord (now a button) allows any worker on the assembly line to stop the entire line when they detect a defect. Counterintuitively, this increases throughput: problems are fixed immediately rather than propagating through 500 more cars before detection.
Software equivalent: when a test fails in CI, block all merges to that branch until it’s fixed. This is not a slowdown — it’s the mechanism that prevents defect accumulation.
The key insight: stopping to fix is faster than continuing with a broken foundation.
The Fast Feedback Gradient
Not all tests are equal. Organize your test suite into layers with dramatically different execution times:
Layer 0: Static analysis → < 5s (linting, type checking, formatting)
Layer 1: Unit tests → < 30s (pure functions, no I/O)
Layer 2: Integration tests → < 3min (real DB, fake external services)
Layer 3: Contract tests → < 5min (consumer-driven contracts)
Layer 4: End-to-end tests → < 15min (real infrastructure, smoke paths)
Layer 5: Load/chaos tests → hours (run nightly or pre-release only)
Run layers sequentially and fail fast. Don’t run a 10-minute E2E suite on a commit that fails the 5-second linter.
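The fail-fast ordering can be sketched as a loop that stops at the first failing layer. This is an illustrative outline only: `Layer`, `runLayers`, and the synchronous `run` callback are assumptions made to keep the example self-contained; a real pipeline would spawn processes or CI jobs.

```typescript
// Run test layers in order and stop at the first failure.
interface Layer {
  label: string;
  run: () => boolean; // returns true when the layer passes (stand-in for a real runner)
}

interface PipelineResult {
  passed: boolean;
  executed: string[]; // labels of the layers that actually ran
  failedAt: string | null;
}

function runLayers(layers: Layer[]): PipelineResult {
  const executed: string[] = [];
  for (const layer of layers) {
    executed.push(layer.label);
    if (!layer.run()) {
      // Fail fast: slower layers after this one never start.
      return { passed: false, executed, failedAt: layer.label };
    }
  }
  return { passed: true, executed, failedAt: null };
}
```

Ordering the array from cheapest to most expensive layer is what makes the early return pay off.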
Tools that implement fast feedback:
- Nx / Turborepo: monorepo-aware build caching. Only rebuild/retest packages that changed.
- Bazel: hermetic builds with per-target caching. Google runs 100,000+ tests per day.
- Jest --watch mode: instant re-run of tests related to changed files during development.
- vitest: same philosophy, faster startup, native ESM.
- cargo-nextest: Rust test runner, up to 3× faster than cargo test.
- pytest-xdist: parallel test execution for Python.
- GitHub Actions path filters: only trigger jobs when relevant files change.
Test Impact Analysis
Rather than running all tests on every commit, track which tests exercise which code paths and run only affected tests.
- Microsoft TAOS: used internally, reduces test runs by 80%+
- Launchable: SaaS layer over any test runner, ML-based test selection
- pytest-testmon: Python test impact analysis using coverage data
- Bazel query: bazel query 'rdeps(//..., //path/to/changed:target)' lists every target in the workspace that depends on the changed target
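The core of coverage-based test selection is a lookup from changed files to the tests that touch them. The sketch below assumes a precomputed coverage map (test file to source files); `CoverageMap` and `selectAffectedTests` are hypothetical names, and none of the tools listed above is claimed to work exactly this way.

```typescript
// Coverage map: test file -> source files it exercises.
// In practice this would be derived from coverage data; here it is hand-written.
type CoverageMap = Record<string, string[]>;

// Return the tests that exercise at least one changed file, sorted for stable output.
function selectAffectedTests(coverage: CoverageMap, changedFiles: string[]): string[] {
  const changed: Record<string, boolean> = {};
  for (const file of changedFiles) changed[file] = true;
  return Object.keys(coverage)
    .filter((testFile) => coverage[testFile].some((source) => changed[source] === true))
    .sort();
}
```

A selection like this is only as good as the coverage map, so a periodic full run is still needed to catch dependencies the map missed.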
7. Chaos Engineering
Chaos engineering is the practice of deliberately injecting failures into systems to discover weaknesses before they manifest under real load.
The GameDay model (Netflix)
Netflix’s Simian Army (now evolved) runs experiments during “GameDays”:
- Define steady state (the system’s normal behavior — a measurable hypothesis)
- Inject a failure (kill a node, increase latency, exhaust a resource)
- Observe whether steady state is maintained
- Fix anything that broke
Key tools:
| Tool | Scope | Notes |
|---|---|---|
| Chaos Monkey | EC2/AWS instance termination | Original Netflix tool |
| Chaos Gorilla | Kill entire availability zone | |
| Gremlin | SaaS chaos platform | CPU, memory, network, disk attacks |
| LitmusChaos | Kubernetes-native | CRD-based experiments |
| Chaos Mesh | Kubernetes-native | CNCF project |
| toxiproxy | Network proxy with failure injection | Great for local dev |
| Pumba | Docker container chaos | |
Principles (from Principles of Chaos Engineering, principlesofchaos.org):
- Build a hypothesis around steady-state behavior
- Vary real-world events (not just crashes — latency, data corruption, dependency unavailability)
- Run experiments in production (chaos in staging doesn’t find production failure modes)
- Automate experiments to run continuously
- Minimize blast radius (use canaries, feature flags, scope injection narrowly)
The tight loop application: chaos engineering is how you verify your tight loop closes correctly. The question isn’t “does the system crash?” but “how long does it take the loop to detect and recover?”
8. Progressive Delivery and Canary Deployments
Progressive delivery separates deployment (code is running in production) from release (users are getting the new code). This lets you verify changes on real traffic with a small blast radius before full rollout.
Delivery strategies:
Blue/Green: 100% old → switch → 100% new (all-or-nothing)
Canary: 1% → 10% → 25% → 50% → 100% (gradual)
A/B: Split by user segment (for feature testing)
Shadow: New system processes requests but doesn't return results
Ring: Internal → dogfood → early adopters → general availability
Automated canary analysis:
The key is automating the decision to advance or rollback:
- Deploy to 1% of traffic
- Compare error rate, latency, business metrics between canary and baseline
- If canary is statistically worse → automatic rollback
- If canary matches or improves → advance to next percentage
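One way to make "statistically worse" concrete is a one-sided two-proportion z-test on error counts. That test is a choice made for this sketch, not something the tools below are claimed to use; `decideCanary`, the minimum sample size, and the 1.645 threshold (roughly 95% one-sided) are all assumptions.

```typescript
interface TrafficSample {
  requests: number;
  errors: number;
}

type CanaryDecision = "advance" | "rollback" | "wait";

// Compare canary and baseline error rates with a one-sided two-proportion z-test.
function decideCanary(
  baseline: TrafficSample,
  canary: TrafficSample,
  minRequests = 500, // assumed minimum sample before deciding
  zThreshold = 1.645, // assumed cutoff, about 95% one-sided confidence
): CanaryDecision {
  if (baseline.requests < minRequests || canary.requests < minRequests) return "wait";

  const baselineRate = baseline.errors / baseline.requests;
  const canaryRate = canary.errors / canary.requests;
  const pooledRate =
    (baseline.errors + canary.errors) / (baseline.requests + canary.requests);
  const standardError = Math.sqrt(
    pooledRate * (1 - pooledRate) * (1 / baseline.requests + 1 / canary.requests),
  );

  // Zero standard error means both sides have the same rate (all 0% or all 100%).
  if (standardError === 0) return "advance";

  const z = (canaryRate - baselineRate) / standardError;
  return z > zThreshold ? "rollback" : "advance";
}
```

The "wait" result matters as much as the other two: deciding on a handful of requests would make the loop fast but wrong.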
Tools:
- Argo Rollouts: Kubernetes progressive delivery with analysis templates
- Flagger: same concept, works with Istio/Linkerd/NGINX
- LaunchDarkly / Unleash: feature flags for progressive release
- Harness: CD platform with built-in canary analysis
- Spinnaker: Netflix’s CD platform, canary analysis built in
The metric that matters: don’t just watch error rate. Define a single north star metric per canary that captures business value (conversion, revenue, engagement) alongside technical health.
9. Observability: The Instrument Panel
You cannot run a tight loop without seeing the system’s state. Observability is the prerequisite.
The Three Pillars
| Pillar | What it captures | Best for |
|---|---|---|
| Metrics | Numerical aggregates over time | Alerting, dashboards, trending |
| Logs | Discrete events with context | Debugging specific incidents |
| Traces | Request flow across services | Latency diagnosis, dependency mapping |
These three are complementary, not alternatives.
The USE Method (Brendan Gregg)
For every resource in your system:
- Utilization: what fraction of time the resource is busy (CPU 70%, disk queue depth)
- Saturation: how much work is queued waiting for the resource
- Errors: error count for that resource
USE gives you a systematic checklist. Run it against CPU, memory, disk I/O, network I/O, and any application-specific resources (thread pools, connection pools, queue depth).
The RED Method (Tom Wilkie)
For every service endpoint:
- Rate: requests per second
- Errors: error rate (%)
- Duration: latency distribution (p50, p95, p99)
RED is the service-facing view; USE is the infrastructure-facing view. Use both.
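As a small worked example, the three RED numbers can be computed from raw request records for one time window. This is a sketch for intuition only: real systems usually derive these from histograms in the metrics backend, and `RequestRecord`, `redMetrics`, and the nearest-rank percentile are choices made here.

```typescript
interface RequestRecord {
  durationMs: number;
  ok: boolean;
}

// Nearest-rank percentile over an ascending-sorted array; returns 0 for an empty array.
function percentile(sortedValues: number[], p: number): number {
  if (sortedValues.length === 0) return 0;
  const rank = Math.ceil((p / 100) * sortedValues.length);
  return sortedValues[Math.max(0, rank - 1)];
}

// Rate, Errors, Duration for one endpoint over a window of windowSeconds.
function redMetrics(records: RequestRecord[], windowSeconds: number) {
  const durations = records.map((r) => r.durationMs).sort((a, b) => a - b);
  const errorCount = records.filter((r) => !r.ok).length;
  return {
    ratePerSecond: records.length / windowSeconds,
    errorPercent: records.length === 0 ? 0 : (errorCount / records.length) * 100,
    p50Ms: percentile(durations, 50),
    p95Ms: percentile(durations, 95),
    p99Ms: percentile(durations, 99),
  };
}
```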
Google’s Four Golden Signals:
- Latency (duration of successful requests)
- Traffic (demand on the system)
- Errors (rate of failing requests)
- Saturation (how “full” the service is)
OpenTelemetry (OTel)
OTel is the emerging standard for instrumentation: a vendor-agnostic set of APIs and SDKs that export metrics, logs, and traces to any backend (Prometheus, Jaeger, Tempo, Honeycomb, Datadog, etc.).
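// OTel instrumentation example
import { context, metrics, trace, SpanStatusCode } from '@opentelemetry/api'

const tracer = trace.getTracer('my-service')
const meter = metrics.getMeter('my-service')
const jobsCreated = meter.createCounter('jobs_created_total')

// processRepo is application code (not shown); assumed to return the jobs it created.
declare function processRepo(repo: string): Promise<unknown[]>

async function reconcile(repos: string[]) {
  const span = tracer.startSpan('reconcile')
  // Child spans are parented through a Context, not a `parent` option.
  const ctx = trace.setSpan(context.active(), span)
  try {
    for (const repo of repos) {
      const jobSpan = tracer.startSpan('process_repo', undefined, ctx)
      try {
        const jobs = await processRepo(repo)
        jobsCreated.add(jobs.length, { repo })
      } finally {
        jobSpan.end() // end the child span even if processRepo throws
      }
    }
  } catch (err) {
    span.recordException(err as Error)
    span.setStatus({ code: SpanStatusCode.ERROR })
    throw err
  } finally {
    span.end()
  }
}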
Key observability backends:
| Tool | Pillar | Notes |
|---|---|---|
| Prometheus | Metrics | Pull-based, PromQL, ecosystem standard |
| Grafana | Dashboards | Connects to any datasource |
| Loki | Logs | Label-based, LogQL, Grafana native |
| Tempo | Traces | Integrates with Loki/Prometheus |
| Honeycomb | All three | Exceptional for high-cardinality queries |
| Datadog | All three | SaaS, expensive but complete |
| Jaeger | Traces | OSS, CNCF project |
10. DORA Metrics: Measuring the Loop
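DORA (DevOps Research and Assessment) identified four metrics that characterize software delivery performance, which in turn predicts organizational performance. The tier thresholds below are approximate; the exact cutoffs shift between annual State of DevOps reports.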
| Metric | Elite | High | Medium | Low |
|---|---|---|---|---|
| Deployment frequency | Multiple/day | Weekly | Monthly | <Monthly |
| Lead time for changes | < 1 hour | 1 day | 1 week | 1 month |
| Change failure rate | < 5% | 5-15% | 10-15% | >15% |
| Time to restore service | < 1 hour | < 1 day | 1 week | >1 month |
Reading these metrics through the tight loop lens:
- Deployment frequency measures loop cadence. More deployments = smaller changes = less risk per change.
- Lead time for changes measures loop latency. Commit to production in minutes, not weeks.
- Change failure rate measures loop accuracy. Good tests + canaries = fewer breaks.
- Time to restore service (MTTR) measures loop recovery speed. Observability + runbooks + auto-remediation = fast recovery.
DORA’s key finding: these four metrics cluster together. Elite performers are elite on all four simultaneously. You cannot optimize one in isolation.
Additional metric: SPACE framework (GitHub, 2021)
DORA measures flow. SPACE measures developer productivity more holistically:
- Satisfaction and well-being
- Performance (outcomes, not output)
- Activity (frequency of actions)
- Communication and collaboration
- Efficiency and flow
11. Circuit Breakers and Bulkheads
These patterns from Michael Nygard’s Release It! are essential tight-loop primitives.
Circuit Breaker
A circuit breaker monitors calls to an external dependency and “trips” (opens the circuit) when failures exceed a threshold. While open, calls fail fast without attempting the dependency. After a timeout, it half-opens and lets a test call through.
States: CLOSED → (threshold exceeded) → OPEN → (timeout) → HALF-OPEN → (success) → CLOSED
↑ │
└──────── (failure) ───────────┘
This is the software equivalent of a household circuit breaker: prevents one failing component from cascading into system-wide failure.
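The state machine above fits in a small class. The following is a minimal sketch rather than a replacement for the libraries listed below: `CircuitBreaker`, its constructor parameters, and the injectable clock are assumptions, and for brevity the half-open state lets every call through until the first result arrives instead of exactly one test call.

```typescript
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";

class CircuitBreaker {
  private readonly failureThreshold: number;
  private readonly resetTimeoutMs: number;
  private readonly now: () => number;
  private state: BreakerState = "CLOSED";
  private consecutiveFailures = 0;
  private openedAtMs = 0;

  // `now` is injectable so the timeout can be tested without real waiting.
  constructor(failureThreshold: number, resetTimeoutMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
  }

  // Current state; an OPEN breaker becomes HALF_OPEN once the reset timeout has elapsed.
  getState(): BreakerState {
    if (this.state === "OPEN" && this.now() - this.openedAtMs >= this.resetTimeoutMs) {
      this.state = "HALF_OPEN";
    }
    return this.state;
  }

  onSuccess(): void {
    this.consecutiveFailures = 0;
    this.state = "CLOSED";
  }

  onFailure(): void {
    this.consecutiveFailures += 1;
    // A failed probe in HALF_OPEN reopens immediately; otherwise trip at the threshold.
    if (this.state === "HALF_OPEN" || this.consecutiveFailures >= this.failureThreshold) {
      this.state = "OPEN";
      this.openedAtMs = this.now();
    }
  }

  // Wrap a call to the dependency: fail fast while OPEN, record the outcome otherwise.
  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.getState() === "OPEN") throw new Error("circuit open: failing fast");
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (err) {
      this.onFailure();
      throw err;
    }
  }
}
```

Usage would look like `await breaker.call(() => fetchFromDependency())`, where `fetchFromDependency` is whatever function talks to the external service.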
Implementation:
- resilience4j (Java/Kotlin): circuit breaker, retry, rate limiter, bulkhead
- Polly (.NET): retry, circuit breaker, timeout, bulkhead
- ts-circuit-breaker / opossum (Node.js)
- pybreaker (Python)
- Envoy proxy: circuit breaking at the network level, no application code changes
Bulkheads
Named after ship bulkheads that isolate compartments from flooding. In software: isolate thread pools, connection pools, or process queues by caller or resource. If one tenant hammers the DB, other tenants’ connections aren’t starved.
Without bulkheads: With bulkheads:
[All requests] → [Pool 10] [API requests] → [Pool 5]
[Background jobs] → [Pool 3]
[Admin requests] → [Pool 2]
Timeout pyramid:
Every external call needs a timeout. Set timeouts at every layer:
User request timeout: 500ms
Service call timeout: 200ms
Database query timeout: 100ms
External API timeout: 50ms
If inner timeouts aren’t set, a slow database query holds the service call slot, which holds the user request slot, until the entire connection pool is exhausted.
12. Minimal Viable Test Subsets
The tight loop lives or dies by the time to get feedback. One of the most powerful optimizations is not “run the tests faster” but “run fewer tests that still tell you what you need to know.”
The philosophy:
Run the smallest set of tests that gives you confidence the specific thing you changed is correct.
Principles:
- Scope to the change. A change to the JSON serializer doesn’t need to run the authentication tests. Map change → affected tests using coverage or static analysis.
- Use fixtures, not production data. A 3-repo subset reveals the same parser bugs as a 41-repo scan but runs in 2 seconds instead of 8 minutes.
- Smoke tests for fast sanity. A handful of critical-path tests that run in under 30 seconds. If these fail, don’t run anything else.
- Regression pack for known failures. Every bug you’ve ever fixed should have a regression test. These tests encode your system’s error history.
- Statistical sampling. For large datasets, random sampling of N items often reveals the same failure modes as exhaustive scanning. Start with N=10, scale up only if needed.
Test pyramid vs. test diamond:
The classic “test pyramid” (many unit tests, fewer integration, minimal E2E) is right for most systems. But for systems with significant integration complexity (API-heavy systems, cloud workers), a “diamond” shape often makes more sense:
Pyramid: Diamond:
[E2E] [E2E]
[integ ] [integ integ]
[unit unit unit] [ unit unit ]
[smoke smoke]
The diamond adds a thick integration band and a smoke band at the bottom because unit tests of an API client test mocks, not the API.
Fast iteration workflow:
# Step 1: Smoke test — 3 repos, 5 seconds
curl -X POST .../debug/reconcile -d '{"repos": ["a", "b", "c"]}'
# Step 2: If smoke passes, run against 10 representative repos
# Step 3: If that passes, merge and let the full cron verify at scale
13. AI Agent Observability
Standard APM tools weren’t designed for LLM-based systems. AI agents introduce new failure modes:
- Hallucinated tool calls: the agent calls a tool that doesn’t exist or with wrong parameters
- Context window exhaustion: long conversations lose early context silently
- Prompt injection: user input containing instructions that hijack agent behavior
- Reasoning loops: agent loops on the same decision without progress
- Cost runaway: unbounded agent loops burning tokens exponentially
- Latency variance: p99 latency for LLM calls can be 10× p50
What to instrument:
// Minimum viable AI agent telemetry
interface AgentSpan {
session_id: string
turn_number: number
model: string
input_tokens: number
output_tokens: number
cached_tokens: number
cost_usd: number
tool_calls: ToolCallRecord[]
latency_ms: number
finish_reason: 'end_turn' | 'max_tokens' | 'tool_use' | 'error'
error?: string
}
interface ToolCallRecord {
tool_name: string
duration_ms: number
success: boolean
error?: string
}
AI Observability Platforms:
| Platform | Strengths | Notes |
|---|---|---|
| LangSmith | Deep LangChain integration, evaluation suite, prompt hub | Best-in-class if using LangChain |
| Langfuse | Open source, self-hostable, framework-agnostic | Best choice for custom agents |
| Helicone | Proxy-based (no SDK changes), cost analytics | Drop-in for any OpenAI-compatible API |
| Arize Phoenix | OSS, ML observability background, embeddings viz | Strong for RAG/embedding workflows |
| AgentOps | Agent-specific metrics (session replay, loop detection) | Designed for multi-step agents |
| Weights & Biases Weave | Versioning + tracing + evaluation in one | Strong if already using W&B |
| Braintrust | Eval-first, dataset management, score tracking | Best for systematic LLM eval workflows |
| PromptLayer | Prompt versioning + A/B testing | Simpler, less overhead |
Cost control as a tight loop:
Budget → [token counter] → [budget guard] → [agent]
↑ │
└──────── [usage report] ────────┘
If accumulated_cost > daily_budget × 0.8:
→ switch to cheaper model (Haiku instead of Sonnet)
→ reduce context window
→ alert human
→ pause non-critical jobs
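The guard rule above reduces to a pure function that the dispatcher can call before each agent turn. This is a sketch under stated assumptions: `budgetGuard`, the action names, and the choice to return all four actions at once are illustrative, not a prescribed policy.

```typescript
interface BudgetStatus {
  accumulatedCostUsd: number;
  dailyBudgetUsd: number;
}

type GuardAction =
  | "switch_to_cheaper_model"
  | "reduce_context_window"
  | "alert_human"
  | "pause_non_critical_jobs";

// Returns the mitigations to apply; an empty list means spending is within the soft limit.
function budgetGuard(status: BudgetStatus, softLimitFraction = 0.8): GuardAction[] {
  const softLimitUsd = status.dailyBudgetUsd * softLimitFraction;
  if (status.accumulatedCostUsd <= softLimitUsd) return [];
  return [
    "switch_to_cheaper_model",
    "reduce_context_window",
    "alert_human",
    "pause_non_critical_jobs",
  ];
}
```

Keeping the guard free of side effects means the caller decides how to carry out each action, and the rule itself can be covered by a deterministic test.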
Agent-specific alert thresholds:
- Turn count > 20 without terminal state → likely stuck
- Same tool call 3× in a row → likely looping
- Token ratio (output/input) > 0.8 → suspiciously verbose
- finish_reason: max_tokens → output hit the token limit, result likely truncated
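The thresholds above translate directly into a check over a session's turn history. The sketch below is illustrative: `TurnRecord` is a simplified stand-in for the telemetry shown earlier, "same tool call" is assumed to mean same tool name and same serialized arguments, and the alert names are made up for the example.

```typescript
interface TurnRecord {
  toolName: string | null; // null when the turn made no tool call
  toolArgs: string; // serialized arguments, used to detect identical calls
  inputTokens: number;
  outputTokens: number;
  finishReason: string;
}

type AgentAlert = "likely_stuck" | "likely_looping" | "suspiciously_verbose" | "likely_truncated";

function detectAgentAlerts(turns: TurnRecord[], reachedTerminalState: boolean): AgentAlert[] {
  const alerts: AgentAlert[] = [];

  // Turn count > 20 without a terminal state.
  if (turns.length > 20 && !reachedTerminalState) alerts.push("likely_stuck");

  // Same tool call three times in a row.
  for (let i = 2; i < turns.length; i++) {
    const a = turns[i - 2];
    const b = turns[i - 1];
    const c = turns[i];
    const sameName = c.toolName !== null && a.toolName === c.toolName && b.toolName === c.toolName;
    if (sameName && a.toolArgs === c.toolArgs && b.toolArgs === c.toolArgs) {
      alerts.push("likely_looping");
      break;
    }
  }

  // Output/input token ratio above 0.8 across the session.
  let inputTotal = 0;
  let outputTotal = 0;
  for (const t of turns) {
    inputTotal += t.inputTokens;
    outputTotal += t.outputTokens;
  }
  if (inputTotal > 0 && outputTotal / inputTotal > 0.8) alerts.push("suspiciously_verbose");

  // Last turn stopped because it ran out of output tokens.
  if (turns.length > 0 && turns[turns.length - 1].finishReason === "max_tokens") {
    alerts.push("likely_truncated");
  }
  return alerts;
}
```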
14. AI Evaluation: The Eval Loop
Evaluating LLM-based systems is itself a tight loop discipline. The eval loop closes the gap between “model behavior” and “desired behavior.”
The Eval Hierarchy
Level 0: Assertion-based evals — deterministic, binary pass/fail
Level 1: LLM-as-judge evals — use a model to grade model output
Level 2: Human preference evals — human labels ground truth
Level 3: Downstream metric evals — measure business/task outcomes
Use all four levels, but optimize for Level 0 first. Deterministic evals run in milliseconds and never disagree with themselves.
Eval-as-CI
Treat evals like tests: run them on every commit, gate merges on them. The key is separating two concerns:
- Regression pack: a frozen dataset of inputs → expected outputs. If a model change breaks regression, block the deploy.
- Exploration set: new examples from production that stress-test the current model. This expands the regression pack over time.
# .github/workflows/eval.yml
- name: Run eval suite
run: |
braintrust eval src/evals/main.eval.ts
env:
BRAINTRUST_API_KEY: ${{ secrets.BRAINTRUST_API_KEY }}
LLM-as-Judge Pattern
When the expected output is complex (prose, code, structured reasoning), use a stronger model to grade a weaker model’s output:
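// Shape of the judge's JSON reply.
interface Score {
  correctness: number
  completeness: number
  conciseness: number
  reasoning: string
}

// `claude` stands in for your own LLM client wrapper; `complete` is a placeholder, not a real SDK method.
declare const claude: { complete(prompt: string): Promise<string> }

async function gradeResponse(
  question: string,
  expected: string,
  actual: string,
): Promise<Score> {
  const prompt = `
Question: ${question}
Expected: ${expected}
Actual: ${actual}
Rate the actual response on:
1. Correctness (0-1): Does it answer the question correctly?
2. Completeness (0-1): Does it cover all aspects of expected?
3. Conciseness (0-1): Is it appropriately concise?
Return JSON: {"correctness": N, "completeness": N, "conciseness": N, "reasoning": "..."}
`
  const response = await claude.complete(prompt)
  // Judges can return malformed output; fail loudly rather than score garbage.
  const score = JSON.parse(response) as Score
  if (typeof score.correctness !== 'number') throw new Error(`Unparseable judge output: ${response}`)
  return score
}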
Caveats for LLM-as-judge:
- Same model grading itself introduces bias (models prefer their own style)
- Judges are sensitive to prompt wording — test the judge too
- Use GPT-4 to judge Claude, Claude to judge GPT-4 — reduces same-family bias
- Always include a “reasoning” field to make judgments auditable
Evaluation Frameworks:
| Framework | Language | Notes |
|---|---|---|
| Braintrust | TypeScript/Python | Best-in-class eval dataset management, CI integration |
| LangSmith Evals | Python | Tight LangChain integration, shareable datasets |
| Promptfoo | TypeScript | YAML-configured evals, provider-agnostic |
| Evals (OpenAI) | Python | OpenAI’s own framework and benchmark registry for evaluating models |
| RAGAS | Python | Specialized for RAG evaluation |
| TruLens | Python | LLM observability + feedback functions |
| DeepEval | Python | Comprehensive metric library (G-Eval, RAGAS, toxicity) |
Benchmark Suites for Agents:
| Benchmark | What it tests |
|---|---|
| SWE-bench | Solving real GitHub issues (software engineering) |
| SWE-bench Verified | Human-verified subset of SWE-bench |
| AgentBench | Multi-domain agent tasks (OS, DB, web, code) |
| GAIA | Real-world assistant tasks with tool use |
| WebArena | Web browsing and navigation |
| OSWorld | Desktop computer use |
| TAU-bench | Tool agent benchmark with real APIs |
Golden Dataset Curation
The most important eval practice is building and maintaining a golden dataset:
- Collect from production: log all agent inputs and outputs in production
- Label failures: when a user corrects the agent or gives negative feedback, tag that input
- Label successes: randomly sample good outputs and verify them
- Balance the set: ensure coverage of edge cases, not just the happy path
- Version the dataset: evals are only meaningful relative to a specific dataset version
- Rotate in new examples: a frozen dataset decays as the system changes
Minimum Viable Eval
Don’t wait for a comprehensive eval suite. Ship a minimum viable eval that runs in < 2 minutes:
// minimum-viable-eval.ts
const TEST_CASES = [
{
input: "repo has no CI workflow",
expectedJobType: "add-workflow",
expectedPriority: 2,
},
{
input: "CI is failing on main",
expectedJobType: "fix-ci",
expectedPriority: 1,
},
// 5-10 cases covering critical paths
]
for (const tc of TEST_CASES) {
const result = await evaluatePolicy(tc.input)
assert(result.type === tc.expectedJobType, `Wrong job type for: ${tc.input}`)
assert(result.priority === tc.expectedPriority, `Wrong priority for: ${tc.input}`)
}
Constitutional AI and RLHF in the eval loop
Constitutional AI (Anthropic, 2022) trains models using a set of principles (“the constitution”) evaluated by the model itself. RLHF (Reinforcement Learning from Human Feedback) fine-tunes models using human preference signals.
For application-level systems, these techniques manifest as:
- Self-critique prompts: ask the model to evaluate its own output before returning it
- Preference datasets: collect human ratings to few-shot prompt the judge
- Red-teaming: adversarial evaluation specifically seeking failure modes
Red-Teaming for AI Agents
Red-teaming agents requires finding inputs that cause:
- Prompt injection (user tricks agent into executing unintended actions)
- Tool misuse (agent calls destructive tools inappropriately)
- Hallucinated authority (agent claims permissions it doesn’t have)
- Context manipulation (long conversation history causes drift from initial instructions)
Automate red-teaming by generating adversarial inputs with a separate model specifically tasked to find failures.
15. The Strangler Fig Pattern
When you need to replace an existing system while keeping it running, the Strangler Fig pattern (Martin Fowler, 2004) applies the tight loop approach to migration.
The pattern:
Phase 1: New system runs alongside old, handles nothing
[all traffic] → [old system]
[new system] (idle)
Phase 2: Route specific, low-risk paths to new system
[path A] → [new system]
[path B] → [old system]
Phase 3: Gradually move paths
[path A, C, D] → [new system]
[path B] → [old system]
Phase 4: Old system dies naturally (strangled by the fig)
[all traffic] → [new system]
The tight loop application: don’t big-bang migrate. Each “move a path to the new system” step is one loop iteration. Observe the new system under real load, fix issues, expand scope.
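The phases above reduce to a routing table that grows one entry per loop iteration. The following is a minimal sketch, assuming path-prefix routing; `route`, `migrate`, `rollback`, and the example prefixes are hypothetical and stand in for whatever proxy or gateway configuration actually sits in front of the two systems.

```typescript
type Backend = "old" | "new";

// Path prefixes migrated so far. The starting entry is illustrative only.
const migratedPrefixes: string[] = ["/health"];

// Route a request path: migrated prefixes go to the new system,
// everything else stays on the old one.
function route(path: string, migrated: readonly string[] = migratedPrefixes): Backend {
  const isMigrated = migrated.some((p) => path === p || path.startsWith(p + "/"));
  return isMigrated ? "new" : "old";
}

// One loop iteration: move a single prefix to the new system.
function migrate(prefix: string): void {
  if (migratedPrefixes.indexOf(prefix) === -1) migratedPrefixes.push(prefix);
}

// The escape hatch: send a prefix back to the old system if the new one
// misbehaves under real load.
function rollback(prefix: string): void {
  const i = migratedPrefixes.indexOf(prefix);
  if (i !== -1) migratedPrefixes.splice(i, 1);
}
```

Keeping `rollback` as cheap as `migrate` is what makes each step safe to take: a bad iteration costs one table edit, not a redeploy.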
Where it applies:
- Migrating from monolith to microservices
- Replacing a legacy database with a new schema
- Upgrading an agent framework without service interruption
- Moving from one model provider to another
16. Applying the Tight Loop to Autonomous Agent Systems
This section synthesizes the above into a concrete architecture for self-healing agent systems — the specific context that motivated this article.
The Two-Tier Agent Architecture
┌─────────────────────────────────────────────────────────────┐
│ BRAIN (Policy Plane) │
│ │
│ [Scanner] → [Policy Engine] → [Job Queue] → [Dispatcher] │
│ ↑ │ │
│ └────────────── [Result Ingest] ─────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
(job dispatch)
│
┌─────────────────────────────────────────────────────────────┐
│ WORKERS (Execution Plane) │
│ │
│ [RepoPrime] [CI-Fixer] [Compliance-Bot] [PR-Merger] │
│ │ │ │ │ │
│ └───────────┴────────────┴───────────────┘ │
│ (result reports) │
└─────────────────────────────────────────────────────────────┘
The tight loop for each tier:
Brain loop (minutes):
- Scan repos for signals (CI status, file presence, branch protection)
- Evaluate policy (desired state = signals all green)
- For each violation, create a job with priority and type
- Dispatch to eligible runners
Worker loop (seconds to minutes per job):
- Claim a job
- Execute the fix (run Claude, push a PR, trigger a workflow)
- Report result back to Brain
- The Brain's next scan verifies that the fix closed the gap
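The Brain loop above can be sketched as a pure function from observed signals to new jobs. This is an illustration under assumptions, not the article's actual implementation: `RepoSignals`, `violationsFor`, and `reconcile` are hypothetical names, the two job types and their priorities are borrowed from the minimum viable eval cases, and a lower priority number is assumed to mean more urgent.

```typescript
// Hypothetical signal and job shapes.
interface RepoSignals {
  repo: string;
  hasCiWorkflow: boolean;
  ciPassing: boolean;
}
interface Job {
  repo: string;
  type: "add-workflow" | "fix-ci";
  priority: number; // assumption: lower number = more urgent
}

// Compare observed signals with the desired state (everything green)
// and return one job per violation.
function violationsFor(s: RepoSignals): Job[] {
  if (!s.hasCiWorkflow) return [{ repo: s.repo, type: "add-workflow", priority: 2 }];
  if (!s.ciPassing) return [{ repo: s.repo, type: "fix-ci", priority: 1 }];
  return [];
}

// One Brain iteration: evaluate every scanned repo, skip jobs that are
// already pending, and return the new jobs in dispatch order.
function reconcile(signals: RepoSignals[], pending: Job[]): Job[] {
  const key = (j: Job): string => `${j.repo}:${j.type}`;
  const seen = new Set(pending.map(key));
  const created: Job[] = [];
  for (const s of signals) {
    for (const job of violationsFor(s)) {
      if (seen.has(key(job))) continue; // deduplicate against the queue
      seen.add(key(job));
      created.push(job);
    }
  }
  return created.sort((a, b) => a.priority - b.priority);
}
```

Because `reconcile` has no side effects, the same function can serve the cron path and a debug path that runs on a handful of repos.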
Observability the loop needs:
// Every reconciliation run should emit:
interface ReconcileRunEvent {
timestamp: number,
trigger: 'cron' | 'webhook' | 'debug',
repos_scanned: number,
repos_with_violations: number,
jobs_created: { type: string, priority: number, repo: string }[],
jobs_already_pending: number, // deduplicated
scan_duration_ms: number,
errors: { repo: string, error: string }[],
}
// Every job completion should emit:
interface JobCompletionEvent {
job_id: string,
job_type: string,
repo: string,
runner_id: string,
duration_ms: number,
success: boolean,
pr_url?: string,
error?: string,
retry_count: number,
}
The fast-iteration subset
The single most impactful tight-loop technique for agent systems:
# Instead of waiting for the full cron (30 repos, 8 minutes):
curl -X POST .../debug/reconcile -d '{"repos": ["owner/repo-a", "owner/repo-b", "owner/repo-c"]}'
# Choose repos that cover:
# - A repo with CI failing (tests fix-ci path)
# - A repo missing a file (tests add-workflow path)
# - A repo that's clean (tests no-duplicate-job path)
# - A repo with known edge cases (tests error handling)
Self-healing failure modes:
| Failure | Detection | Remediation |
|---|---|---|
| Job stuck in assigned > 10min | Cron checks claimed_at | Re-queue to pending |
| Runner offline | heartbeat cutoff (90s) | Remove from eligible runners |
| Scan error for a repo | error count in reconcile result | Alert + skip repo |
| Job loop (same job created repeatedly) | Check jobs WHERE repo=? AND type=? AND status IN (pending, running) | Deduplicate before insert |
| Worker crash mid-job | Job stays running, runner goes offline | Timeout → re-queue |
| GitHub API rate limit | 429 response | Exponential backoff + pause scan |
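Three rows of the table (stuck jobs, offline runners, and worker crashes) can be handled by one periodic sweep. The sketch below assumes an in-memory job list with hypothetical field names (`claimedAt`, `runnerId`, `lastHeartbeat`) and a `pending / assigned / running / done` status set; the 10-minute and 90-second thresholds come from the table. A real system would run the same logic as queries against its job store.

```typescript
type JobStatus = "pending" | "assigned" | "running" | "done";
interface QueuedJob {
  id: string;
  status: JobStatus;
  claimedAt: number | null; // epoch ms when a runner claimed the job
  runnerId: string | null;
}
interface Runner {
  id: string;
  lastHeartbeat: number; // epoch ms
}

const ASSIGNED_TIMEOUT_MS = 10 * 60 * 1000; // 10 minutes, from the table
const HEARTBEAT_CUTOFF_MS = 90 * 1000; // 90 seconds, from the table

// Runners whose last heartbeat is within the cutoff stay eligible.
function eligibleRunners(runners: Runner[], now: number): Runner[] {
  return runners.filter((r) => now - r.lastHeartbeat <= HEARTBEAT_CUTOFF_MS);
}

// Re-queue jobs stuck in `assigned` past the timeout, and jobs still
// `running` on a runner that is no longer online. Returns re-queued ids.
function sweep(jobs: QueuedJob[], runners: Runner[], now: number): string[] {
  const online = new Set(eligibleRunners(runners, now).map((r) => r.id));
  const requeued: string[] = [];
  for (const job of jobs) {
    const stuck =
      job.status === "assigned" &&
      job.claimedAt !== null &&
      now - job.claimedAt > ASSIGNED_TIMEOUT_MS;
    const orphaned =
      job.status === "running" && (job.runnerId === null || !online.has(job.runnerId));
    if (stuck || orphaned) {
      job.status = "pending";
      job.claimedAt = null;
      job.runnerId = null;
      requeued.push(job.id);
    }
  }
  return requeued;
}
```

Passing `now` in as a parameter, rather than reading the clock inside `sweep`, keeps the timeout logic testable without waiting ten minutes.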
17. Anti-Patterns
Anti-pattern 1: The Manual Fix
Symptom: A bug occurs. You fix the specific instance of the bug, not the class of bug.
Example: The scanner crashes on repo X because it has no description. You add ?? '' for repo X. Next week, repo Y has no homepage and crashes the same way.
Fix: Identify the class of failure (missing optional fields), add defensive defaults for all of them, add a test for the failure class.
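Fixing the class rather than the instance usually means normalizing at the boundary. The sketch below assumes a hypothetical `RawRepo` shape in which any optional text field may be missing or null; the field names follow the example above plus one extra (`language`) added for illustration.

```typescript
// Hypothetical shape of a repo record as returned by an API:
// any optional text field may be absent or null.
interface RawRepo {
  name: string;
  description?: string | null;
  homepage?: string | null;
  language?: string | null;
}

// The shape the rest of the scanner works with: no optional fields.
interface Repo {
  name: string;
  description: string;
  homepage: string;
  language: string;
}

// Normalize every optional field in one place, so a newly missing field
// cannot crash the scanner one repo at a time.
function normalizeRepo(raw: RawRepo): Repo {
  return {
    name: raw.name,
    description: raw.description ?? "",
    homepage: raw.homepage ?? "",
    language: raw.language ?? "",
  };
}
```

Because `Repo` has no optional fields, the compiler flags any new field that is added without a default, which is the test for the failure class.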
Anti-pattern 2: Waiting for the Full Run
Symptom: You make a change, then wait 30 minutes for the cron to tell you if it worked.
Fix: Build a debug/test endpoint that runs the loop on a configurable subset. The first thing to build when setting up a new loop is the fast iteration path.
Anti-pattern 3: Alert Without Action
Symptom: The system detects a problem and sends a notification to a human, who then manually decides what to do.
Fix: Every alert should have a corresponding runbook. Every runbook should have an automated version. Humans in the loop should be exception handlers, not the rule.
Anti-pattern 4: The Infinite Retry
Symptom: The system retries failed jobs indefinitely with no backoff and no limit. One broken API endpoint saturates the job queue.
Fix: Exponential backoff + max retry count + dead letter queue. After N failures, move the job to a dead letter queue for human inspection.
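The fix above can be expressed as a single decision function. This is a minimal sketch: `RetryPolicy`, `backoffDelay`, and `onFailure` are hypothetical names, and `retryCount` is assumed to be the number of retries already attempted (matching the `retry_count` field in the job completion event).

```typescript
interface RetryPolicy {
  baseDelayMs: number;
  maxDelayMs: number;
  maxRetries: number;
}

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
function backoffDelay(attempt: number, p: RetryPolicy): number {
  return Math.min(p.baseDelayMs * Math.pow(2, attempt), p.maxDelayMs);
}

type Decision =
  | { action: "retry"; delayMs: number }
  | { action: "dead-letter" };

// Decide what to do with a job that has just failed after `retryCount`
// previous retries: retry with backoff, or park it for human inspection.
function onFailure(retryCount: number, p: RetryPolicy): Decision {
  if (retryCount >= p.maxRetries) return { action: "dead-letter" };
  return { action: "retry", delayMs: backoffDelay(retryCount, p) };
}
```

In practice a random jitter is often added to the delay so that many failed jobs do not retry in lockstep; it is left out here to keep the sketch deterministic.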
Anti-pattern 5: Observing at the Wrong Granularity
Symptom: You measure “reconciliation succeeded” but not “how many repos were scanned, how many had violations, how many jobs were created.”
Fix: Emit structured events at every step of the loop. Aggregate metrics tell you whether the system is healthy; structured events tell you why.
Anti-pattern 6: Changing Multiple Things at Once
Symptom: You deploy three changes simultaneously. Something breaks. You don’t know which change caused it.
Fix: One change per loop iteration. If you must batch changes, use feature flags to control which changes are active in production and roll them out one at a time.
Anti-pattern 7: Testing Only the Happy Path
Symptom: Your test subset only covers repos that work correctly. The broken repo is the first one discovered by the cron.
Fix: Your test subset must include at least one example of each known failure mode. The subset is a regression pack, not a demo.
18. The Meta-Loop: Continuously Enhancing This Article
This article is itself a tight loop artifact. It should be enhanced each time:
- A new tight loop pattern is discovered in practice
- A new tool is adopted that changes how the loop is implemented
- An anti-pattern is identified in actual work
- A section becomes outdated as tooling evolves
Planned additions:
- Case study: the Mulan dispatcher tight loop (concrete example of S1-S4)
- Temporal workflows as a managed tight loop runtime
- The HITL (Human In The Loop) pattern: when and how to put humans back in
- Cost-aware loops: token budgets, model tiering, eval gating on cost
- Multi-agent loop coordination — when multiple loops interact (handoff protocols)
- Eval-driven prompt engineering — using eval results to refine prompts systematically
- The “shadow mode” pattern for agent systems — run new agent on real traffic, compare to old agent’s output, measure diff before switching
- Canary releases for prompt changes — deploy new prompt to 5% of traffic, measure eval scores, roll forward or back
19. Tools and Libraries Reference
Self-Healing Infrastructure
| Tool | Category | Link |
|---|---|---|
| Kubernetes controller-runtime | Reconciliation loops | sigs.k8s.io/controller-runtime |
| Argo Rollouts | Progressive delivery | argoproj.github.io/argo-rollouts |
| Flagger | Canary automation | flagger.app |
| Temporal | Workflow orchestration | temporal.io |
| Resilience4j | Circuit breaker (JVM) | resilience4j.readme.io |
| Polly | Circuit breaker (.NET) | pollydocs.org |
| opossum | Circuit breaker (Node) | nodeshift/opossum |
| Gremlin | Chaos engineering | gremlin.com |
| LitmusChaos | Kubernetes chaos | litmuschaos.io |
| toxiproxy | Network failure injection | shopify/toxiproxy |
Observability
| Tool | Category | Link |
|---|---|---|
| OpenTelemetry | Instrumentation standard | opentelemetry.io |
| Prometheus | Metrics | prometheus.io |
| Grafana | Dashboards | grafana.com |
| Jaeger | Distributed tracing | jaegertracing.io |
| Honeycomb | High-cardinality observability | honeycomb.io |
| Loki | Log aggregation | grafana.com/oss/loki |
AI/LLM Observability
| Tool | Category | Link |
|---|---|---|
| Langfuse | LLM tracing + eval | langfuse.com |
| LangSmith | LangChain observability | smith.langchain.com |
| Helicone | Proxy-based LLM observability | helicone.ai |
| Arize Phoenix | ML + LLM observability | arize.com/phoenix |
| AgentOps | Agent session replay | agentops.ai |
| Weights & Biases Weave | LLM tracing + eval | wandb.ai/site/weave |
| Braintrust | LLM evaluation | braintrust.dev |
Evaluation Frameworks
| Tool | Language | Link |
|---|---|---|
| Braintrust | TypeScript/Python | braintrust.dev |
| Promptfoo | TypeScript | promptfoo.dev |
| DeepEval | Python | confident-ai.com |
| RAGAS | Python | ragas.io |
| TruLens | Python | trulens.org |
| LangSmith Evals | Python | smith.langchain.com |
Fast CI/CD
| Tool | Category | Notes |
|---|---|---|
| Nx | Monorepo build cache | Supports affected-only builds |
| Turborepo | Monorepo task runner | Vercel-backed |
| Bazel | Hermetic build system | Google-scale |
| Launchable | Test impact analysis | ML-based selection |
| GitHub Actions path filters | CI scoping | on.push.paths |
Last updated: 2026-03-20 — S3 of garywu/mulan epic #123
Next update triggers: S4 completion (Brain DO), first successful self-healing loop, new eval tool adoption