Gary Wu

The Tight Loop: Observability and Action


“Don’t fix the bug. Fix the system that let the bug live.” — the tight loop principle



1. The Core Insight

The tight loop is a discipline, not a technique.

Most engineers, when they encounter a failure in a running system, reach for a fix to the immediate problem. The tight loop practitioner does something different: they ask why the system couldn’t find and fix this itself, then fix the system’s ability to do that.

The tight loop has three properties:

  1. Latency is the enemy. The longer the gap between cause and observation (between action and feedback), the more work is wasted, the more damage accumulates, and the harder the root cause becomes to identify. Compress this gap to seconds or minutes, not hours or days.

  2. The system must observe itself. External monitoring is table stakes. Real tight-loop systems emit structured telemetry that lets the system — not a human — detect drift and trigger correction.

  3. Scope before you run. Don’t iterate over the full problem space when a representative subset reveals the same failure modes in a fraction of the time. Use the smallest input that exercises the broken path.

The loop itself:

  commit ──→ observe ──→ classify ──→ act ──→ commit
     ↑                                          │
     └──────────────────────────────────────────┘

Each turn of the loop should take seconds to minutes, not hours. If a turn takes longer, the first fix is to shorten the loop, not to fix whatever error the long loop produced.


2. The Observe-Orient-Decide-Act Loop (OODA)

John Boyd designed the OODA loop for aerial combat. A pilot who can complete an OODA cycle faster than their opponent gets inside the opponent’s decision cycle and wins before the opponent can respond.

The same dynamic applies to software systems.

  ┌─────────────────────────────────────────────────────────┐
  │                        OODA LOOP                         │
  │                                                          │
  │  OBSERVE ──→ ORIENT ──→ DECIDE ──→ ACT                  │
  │     ↑                              │                     │
  │     └──────────────────────────────┘                     │
  │                                                          │
  │  Observe:  Collect signals (metrics, logs, traces)       │
  │  Orient:   Make sense of signals in context              │
  │  Decide:   Choose an action from the action space        │
  │  Act:      Execute and inject the result back into Obs.  │
  └─────────────────────────────────────────────────────────┘

Applying OODA to systems:

| OODA Phase | Tight Loop Equivalent |
| --- | --- |
| Observe | Structured logs, metrics, distributed traces, health endpoints |
| Orient | Alerting rules, anomaly detection, policy engines |
| Decide | Runbooks, auto-remediation scripts, LLM-based diagnosis |
| Act | Helm rollback, feature flag toggle, job queue push, PR creation |

The key insight from Boyd: speed through the loop matters more than quality of any individual decision. A pilot who makes good decisions slowly loses to a pilot who makes OK decisions fast. For systems, this means: prefer reversible actions that close the loop quickly over perfect actions that take hours to compute.



3. Control Theory and PID Controllers

Before software, engineers built tight loops in analog circuits and mechanical systems. Control theory gives us a rigorous vocabulary.

The classic feedback control loop:

  setpoint ──→ [comparator] ──→ [controller] ──→ [plant]
                   ↑                                │
                   └──────────── [sensor] ──────────┘

PID Controllers

The most common controller is the PID (Proportional-Integral-Derivative):

correction = Kp * error + Ki * ∫error dt + Kd * d(error)/dt
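The formula can be sketched in discrete time as a small stateful function. This is an illustrative sketch, not a tuned controller: the gain values, the `integralLimit` clamp (a simple form of anti-windup), and the name `createPid` are all assumptions made for the example.

```typescript
interface PidGains { kp: number; ki: number; kd: number }

// Returns an update function that holds the integral and previous error as state.
function createPid(gains: PidGains, integralLimit: number) {
  let integral = 0;
  let previousError: number | null = null;

  // dt is the time since the previous update, in seconds.
  return function update(setpoint: number, measured: number, dt: number): number {
    const error = setpoint - measured;
    // Integral term, clamped to +/- integralLimit (anti-windup).
    integral = Math.max(-integralLimit, Math.min(integralLimit, integral + error * dt));
    // Derivative term: zero on the first sample, when there is no previous error.
    const derivative = previousError === null ? 0 : (error - previousError) / dt;
    previousError = error;
    return gains.kp * error + gains.ki * integral + gains.kd * derivative;
  };
}

// Hypothetical gains; real values come from tuning against the actual system.
const update = createPid({ kp: 2, ki: 0.5, kd: 1 }, 100);
const first = update(10, 4, 1);  // error 6, integral 6, derivative 0
const second = update(10, 7, 1); // error 3, integral 9, derivative -3
console.log(first, second);      // prints 15 7.5
```

The second correction is smaller than a pure proportional-plus-integral response would give because the derivative term sees the error shrinking and damps the output.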

Software analogues:

| Control Concept | Software Equivalent |
| --- | --- |
| Setpoint | SLO / error budget target |
| Proportional | Scale replicas proportionally to request queue depth |
| Integral | Carry over unresolved incidents to next sprint budget |
| Derivative | Alert when error rate is increasing, not just when it’s high |
| Deadband | Don’t act unless deviation > threshold (avoid flapping) |
| Anti-windup | Cap backlog so old accumulated issues don’t distort current action |

The Integral term and tech debt: Technical debt is the integral of error. Systems that only react to current problems (pure P control) accumulate drift until it’s catastrophic. Tight-loop teams explicitly budget for integral correction — this is what “paying down tech debt” means in control theory terms.



4. Erlang/OTP: Industrial-Strength Self-Healing

Erlang was designed at Ericsson in the 1980s for telephone switches. The Erlang-based AXD301 switch is often cited as reaching nine nines of availability (roughly 31 ms of downtime per year). The solution was radical: let it crash, then supervise the crash.

Supervisor Trees

Every Erlang/OTP system is a tree of processes. Each internal node is a supervisor whose only job is to monitor its children and restart them on failure. Leaves are workers.

                    [Application Supervisor]
                           /        \
             [Worker Sup]            [Scanner Sup]
             /     |     \               /     \
        [W1]    [W2]    [W3]         [S1]    [S2]

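Restart strategies:

  - one_for_one: restart only the child that crashed
  - one_for_all: restart every child when any one of them crashes
  - rest_for_one: restart the crashed child and every child started after it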

“Let it crash” philosophy

Rather than defensive programming (check every error, handle every nil, catch every exception), Erlang programs assume the happy path and crash on anything unexpected. The supervisor tree catches the crash, restarts the process from a known-good state, and logs the crash for later analysis.

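This is not abandoning reliability. It is a different theory of reliability: recovering quickly to a known-good state is more dependable than trying to anticipate every failure in advance.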

The circuit breaker built-in: OTP supervisors have max_restarts and max_seconds parameters. If a child crashes more than max_restarts times in max_seconds, the supervisor itself crashes, propagating the failure up the tree. This is automatic circuit breaking.
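The restart-intensity rule can be sketched outside Erlang as a sliding window over crash timestamps. This is a TypeScript illustration of the idea, not OTP's implementation; `createRestartTracker` and the policy values are hypothetical.

```typescript
// Restart-intensity check, modeled on OTP's max_restarts / max_seconds.
interface RestartPolicy { maxRestarts: number; maxSeconds: number }

function createRestartTracker(policy: RestartPolicy) {
  let restarts: number[] = []; // timestamps (in seconds) of recent crashes

  // Records a crash at time `now`. Returns 'restart' if the child may be
  // restarted, or 'escalate' if the supervisor should give up and fail upward.
  return function onChildCrash(now: number): 'restart' | 'escalate' {
    // Keep only crashes that fall inside the sliding window.
    restarts = restarts.filter((t) => now - t <= policy.maxSeconds);
    restarts.push(now);
    return restarts.length > policy.maxRestarts ? 'escalate' : 'restart';
  };
}

// Four crashes inside five seconds exceed maxRestarts = 3.
const onCrash = createRestartTracker({ maxRestarts: 3, maxSeconds: 5 });
const decisions = [0, 1, 2, 3].map((t) => onCrash(t));
console.log(decisions.join(',')); // prints restart,restart,restart,escalate

// The same number of crashes spread over 30 seconds never escalates.
const onSlowCrash = createRestartTracker({ maxRestarts: 3, maxSeconds: 5 });
const slowDecisions = [0, 10, 20, 30].map((t) => onSlowCrash(t));
```

Escalating instead of restarting forever is what turns a crash loop into a signal that the parent, or eventually a human, has to handle.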

Applying to modern systems:

| Erlang/OTP Concept | Modern Equivalent |
| --- | --- |
| Supervisor tree | Kubernetes Pod + Deployment + HPA |
| one_for_one | Pod restart policy |
| Max restart threshold | CrashLoopBackOff + alerting |
| “Let it crash” | Structured exception logging + dead letter queues |
| Process isolation | Container isolation / actor model (Akka, Orleans) |
| Hot code reload | Rolling deployments, feature flags |

Libraries and tools:


5. Kubernetes Reconciliation Loops

Kubernetes is the most widely deployed implementation of the tight loop pattern at infrastructure scale.

The core principle: desired state vs. actual state

Every Kubernetes controller follows the same pattern:

// Pseudocode: the shape of every controller's main loop
while (true) {
  const desired = getDesiredState()  // read the spec via the API server (backed by etcd)
  const actual  = getActualState()   // observe the real world (status)
  const diff    = computeDiff(desired, actual)
  if (diff) apply(diff)
  await sleep(resyncPeriod)
}

This is the reconciliation loop. Controllers don’t store “what I last did” — they compute “what I need to do now” from first principles on every iteration. This makes them idempotent and convergent: run them 100 times on a healthy cluster and nothing changes; run them once after a failure and they fix it.

Key properties:

  1. Level-triggered, not edge-triggered: controllers react to current state, not events. This means missed events (network partition, API server restart) don’t leave the system in a broken state indefinitely.

  2. Optimistic concurrency: controllers use resourceVersion to detect conflicts. If the resource changed since they last read it, they start over. No distributed locks required.

  3. Eventual consistency: Kubernetes makes no guarantees about when the actual state matches desired, only that it will converge. This is the right contract for distributed systems.

Writing your own controller

The Kubernetes controller-runtime library (Go) implements the scaffolding:

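// controller-runtime example (MyResource and computeDesired are placeholders)
import (
    "context"
    "reflect"
    "time"

    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// MyReconciler is assumed to embed client.Client, which provides Get and Status().
func (r *MyReconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
    // 1. Fetch the object; a deleted object is not an error
    obj := &MyResource{}
    if err := r.Get(ctx, req.NamespacedName, obj); err != nil {
        return reconcile.Result{}, client.IgnoreNotFound(err)
    }

    // 2. Compute the status the object should have
    desired := r.computeDesired(obj)

    // 3. Diff and apply; the update triggers a new reconcile via the watch
    if !reflect.DeepEqual(obj.Status, desired) {
        obj.Status = desired
        return reconcile.Result{}, r.Status().Update(ctx, obj)
    }

    // 4. Nothing to change: re-queue after the resync period
    return reconcile.Result{RequeueAfter: 5 * time.Minute}, nil
}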

Applying this pattern outside Kubernetes:

Any system with a “desired config” and a “running reality” can use a reconciliation loop.
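A minimal sketch of that idea, assuming state can be modeled as a map from resource name to config version. The `State` shape, `planReconcile`, and `applyPlan` are hypothetical names for illustration; a real system would call its own APIs where `applyPlan` mutates the map.

```typescript
type State = Record<string, string>; // resource name -> config version

interface Plan { create: string[]; update: string[]; remove: string[] }

// Computes what to do purely from current desired and actual state.
// No memory of previous runs is needed (level-triggered).
function planReconcile(desired: State, actual: State): Plan {
  const plan: Plan = { create: [], update: [], remove: [] };
  for (const name of Object.keys(desired)) {
    if (!(name in actual)) plan.create.push(name);
    else if (actual[name] !== desired[name]) plan.update.push(name);
  }
  for (const name of Object.keys(actual)) {
    if (!(name in desired)) plan.remove.push(name);
  }
  return plan;
}

// Stand-in for the side effects a real system would perform.
function applyPlan(plan: Plan, desired: State, actual: State): void {
  for (const name of [...plan.create, ...plan.update]) actual[name] = desired[name];
  for (const name of plan.remove) delete actual[name];
}

const desired: State = { web: 'v2', worker: 'v1' };
const actual: State = { web: 'v1', cron: 'v1' };

const firstPlan = planReconcile(desired, actual);
applyPlan(firstPlan, desired, actual);
// A second pass over a converged system finds nothing to do (idempotent).
const secondPlan = planReconcile(desired, actual);
console.log(JSON.stringify(firstPlan));
// prints {"create":["worker"],"update":["web"],"remove":["cron"]}
```

Because the plan is recomputed from scratch each time, a missed event or a crashed run costs at most one loop period before the next pass repairs the drift.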



6. Fast CI/CD Feedback and the Andon Cord

Toyota Production System and the Andon Cord

Toyota’s Andon cord (now a button) allows any worker on the assembly line to stop the entire line when they detect a defect. Counterintuitively, this increases throughput: problems are fixed immediately rather than propagating through 500 more cars before detection.

Software equivalent: when a test fails in CI, block all merges to that branch until it’s fixed. This is not a slowdown — it’s the mechanism that prevents defect accumulation.

The key insight: stopping to fix is faster than continuing with a broken foundation.

The Fast Feedback Gradient

Not all tests are equal. Organize your test suite into layers with dramatically different execution times:

Layer 0: Static analysis         → < 5s    (linting, type checking, formatting)
Layer 1: Unit tests              → < 30s   (pure functions, no I/O)
Layer 2: Integration tests       → < 3min  (real DB, fake external services)
Layer 3: Contract tests          → < 5min  (consumer-driven contracts)
Layer 4: End-to-end tests        → < 15min (real infrastructure, smoke paths)
Layer 5: Load/chaos tests        → hours   (run nightly or pre-release only)

Run layers sequentially and fail fast. Don’t run a 10-minute E2E suite on a commit that fails the 5-second linter.
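The fail-fast ordering can be sketched as a runner that stops at the first failing layer. This is a simplified, synchronous illustration; the `Layer` shape and the stubbed `run` functions are assumptions, and a real pipeline would shell out to the actual tools.

```typescript
interface Layer { name: string; run: () => boolean }

// Runs layers in order, cheapest first, and stops at the first failure.
function runLayers(layers: Layer[]): { passed: string[]; failedAt: string | null } {
  const passed: string[] = [];
  for (const layer of layers) {
    if (!layer.run()) return { passed, failedAt: layer.name };
    passed.push(layer.name);
  }
  return { passed, failedAt: null };
}

// Stub layers that record whether they were executed.
const executed: string[] = [];
const layers: Layer[] = [
  { name: 'static-analysis', run: () => { executed.push('static-analysis'); return true; } },
  { name: 'unit', run: () => { executed.push('unit'); return false; } },
  { name: 'integration', run: () => { executed.push('integration'); return true; } },
];

const outcome = runLayers(layers);
console.log(outcome.failedAt, executed.join(',')); // prints unit static-analysis,unit
```

The integration layer never starts, so the time to a red signal is the cost of the two cheapest layers rather than the whole suite.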

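Tools that implement fast feedback: Nx, Turborepo, Bazel, Launchable, and GitHub Actions path filters (see the Fast CI/CD table in section 19).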

Test Impact Analysis

Rather than running all tests on every commit, track which tests exercise which code paths and run only affected tests.
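One way to sketch the selection step, assuming a coverage map from test file to the source files it exercises is already available. The file names and the `affectedTests` helper are hypothetical; in practice the map would come from a coverage tool or the build graph.

```typescript
// Coverage map: test file -> source files it exercises (hypothetical data).
const coverage: Record<string, string[]> = {
  'serializer.test.ts': ['src/json/serializer.ts', 'src/json/escape.ts'],
  'auth.test.ts': ['src/auth/session.ts'],
  'api.test.ts': ['src/json/serializer.ts', 'src/auth/session.ts'],
};

// Returns the tests that touch at least one changed file, in stable order.
function affectedTests(changedFiles: string[], map: Record<string, string[]>): string[] {
  return Object.keys(map)
    .filter((test) => map[test].some((src) => changedFiles.indexOf(src) !== -1))
    .sort();
}

const forEscapeChange = affectedTests(['src/json/escape.ts'], coverage);
const forSessionChange = affectedTests(['src/auth/session.ts'], coverage);
console.log(forEscapeChange.join(','));  // prints serializer.test.ts
console.log(forSessionChange.join(',')); // prints api.test.ts,auth.test.ts
```

A stale coverage map silently skips tests, so it is worth pairing this with a periodic full run that also refreshes the map.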


7. Chaos Engineering

Chaos engineering is the practice of deliberately injecting failures into systems to discover weaknesses before they manifest under real load.

The GameDay model (Netflix)

Netflix popularized the practice with its Simian Army tooling. A typical experiment, often run as a scheduled “GameDay”, follows four steps:

  1. Define steady state (the system’s normal behavior — a measurable hypothesis)
  2. Inject a failure (kill a node, increase latency, exhaust a resource)
  3. Observe whether steady state is maintained
  4. Fix anything that broke

Key tools:

| Tool | Scope | Notes |
| --- | --- | --- |
| Chaos Monkey | EC2/AWS instance termination | Original Netflix tool |
| Chaos Gorilla | Kill entire availability zone | |
| Gremlin | SaaS chaos platform | CPU, memory, network, disk attacks |
| LitmusChaos | Kubernetes-native | CRD-based experiments |
| Chaos Mesh | Kubernetes-native | CNCF project |
| toxiproxy | Network proxy with failure injection | Great for local dev |
| Pumba | Docker container chaos | |

Principles (from the Chaos Engineering handbook):

  1. Build a hypothesis around steady-state behavior
  2. Vary real-world events (not just crashes — latency, data corruption, dependency unavailability)
  3. Run experiments in production (chaos in staging doesn’t find production failure modes)
  4. Automate experiments to run continuously
  5. Minimize blast radius (use canaries, feature flags, scope injection narrowly)

The tight loop application: chaos engineering is how you verify your tight loop closes correctly. The question isn’t “does the system crash?” but “how long does it take the loop to detect and recover?”


8. Progressive Delivery and Canary Deployments

Progressive delivery separates deployment (code is running in production) from release (users are getting the new code). This lets you verify changes on real traffic with a small blast radius before full rollout.

Delivery strategies:

Blue/Green:     100% old → switch → 100% new (all-or-nothing)
Canary:         1% → 10% → 25% → 50% → 100% (gradual)
A/B:            Split by user segment (for feature testing)
Shadow:         New system processes requests but doesn't return results
Ring:           Internal → dogfood → early adopters → general availability

Automated canary analysis:

The key is automating the decision to advance or rollback:

  1. Deploy to 1% of traffic
  2. Compare error rate, latency, business metrics between canary and baseline
  3. If canary is statistically worse → automatic rollback
  4. If canary matches or improves → advance to next percentage
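The “statistically worse” check in step 3 could be sketched as a one-sided two-proportion z-test on error rates. This is one possible test, not the method any particular tool uses; the `Sample` shape, `judgeCanary`, and the 1.645 threshold (roughly 95% one-sided confidence) are assumptions for illustration.

```typescript
interface Sample { requests: number; errors: number }
type Verdict = 'advance' | 'rollback';

// One-sided two-proportion z-test: is the canary's error rate significantly
// higher than the baseline's?
function judgeCanary(baseline: Sample, canary: Sample, zThreshold = 1.645): Verdict {
  const pBase = baseline.errors / baseline.requests;
  const pCanary = canary.errors / canary.requests;
  const pooled = (baseline.errors + canary.errors) / (baseline.requests + canary.requests);
  const stdErr = Math.sqrt(pooled * (1 - pooled) * (1 / baseline.requests + 1 / canary.requests));
  // Pooled rate of exactly 0 or 1: the two sides are indistinguishable.
  if (stdErr === 0) return 'advance';
  const z = (pCanary - pBase) / stdErr;
  return z > zThreshold ? 'rollback' : 'advance';
}

const baseline: Sample = { requests: 100000, errors: 100 }; // 0.1% error rate
const noisy = judgeCanary(baseline, { requests: 1000, errors: 2 });   // 0.2%, within noise
const broken = judgeCanary(baseline, { requests: 1000, errors: 10 }); // 1.0%, clearly worse
console.log(noisy, broken); // prints advance rollback
```

With only 1,000 canary requests, a doubling of the error rate is not yet distinguishable from noise, which is why the traffic percentage and the observation window have to be sized together.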

Tools: Argo Rollouts and Flagger (see the Self-Healing Infrastructure table in section 19).

The metric that matters: don’t just watch error rate. Define a single north star metric per canary that captures business value (conversion, revenue, engagement) alongside technical health.


9. Observability: The Instrument Panel

You cannot run a tight loop without seeing the system’s state. Observability is the prerequisite.

The Three Pillars

| Pillar | What it captures | Best for |
| --- | --- | --- |
| Metrics | Numerical aggregates over time | Alerting, dashboards, trending |
| Logs | Discrete events with context | Debugging specific incidents |
| Traces | Request flow across services | Latency diagnosis, dependency mapping |

These three are complementary, not alternatives.

The USE Method (Brendan Gregg)

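For every resource in your system, check:

  - Utilization: the share of time the resource is busy
  - Saturation: the amount of work queued because the resource cannot keep up
  - Errors: the count of error events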

USE gives you a systematic checklist. Run it against CPU, memory, disk I/O, network I/O, and any application-specific resources (thread pools, connection pools, queue depth).

The RED Method (Tom Wilkie)

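For every service endpoint, measure:

  - Rate: requests per second
  - Errors: the number of those requests that fail
  - Duration: the distribution of time those requests take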

RED is the service-facing view; USE is the infrastructure-facing view. Use both.

Google’s Four Golden Signals:

  1. Latency (duration of successful requests)
  2. Traffic (demand on the system)
  3. Errors (rate of failing requests)
  4. Saturation (how “full” the service is)

OpenTelemetry (OTel)

OTel is the emerging standard for instrumentation: a vendor-agnostic set of APIs and SDKs that export metrics, logs, and traces to any backend (Prometheus, Jaeger, Tempo, Honeycomb, Datadog, etc.).

// OTel instrumentation example
import { context, metrics, trace, SpanStatusCode } from '@opentelemetry/api'

const tracer = trace.getTracer('my-service')
const meter = metrics.getMeter('my-service')
const jobsCreated = meter.createCounter('jobs_created_total')

async function reconcile(repos: string[]) {
  const span = tracer.startSpan('reconcile')
  // Make `span` the parent of every span started with this context
  const ctx = trace.setSpan(context.active(), span)
  try {
    for (const repo of repos) {
      const jobSpan = tracer.startSpan('process_repo', undefined, ctx)
      try {
        const jobs = await processRepo(repo) // processRepo: application code, not shown
        jobsCreated.add(jobs.length, { repo })
      } finally {
        jobSpan.end() // end the child span even if processRepo throws
      }
    }
  } catch (err) {
    span.recordException(err as Error)
    span.setStatus({ code: SpanStatusCode.ERROR })
    throw err
  } finally {
    span.end()
  }
}

Key observability backends:

| Tool | Pillar | Notes |
| --- | --- | --- |
| Prometheus | Metrics | Pull-based, PromQL, ecosystem standard |
| Grafana | Dashboards | Connects to any datasource |
| Loki | Logs | Label-based, LogQL, Grafana native |
| Tempo | Traces | Integrates with Loki/Prometheus |
| Honeycomb | All three | Exceptional for high-cardinality queries |
| Datadog | All three | SaaS, expensive but complete |
| Jaeger | Traces | OSS, CNCF project |

10. DORA Metrics: Measuring the Loop

DORA (DevOps Research and Assessment) identified four metrics that predict both organizational performance and software delivery performance:

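| Metric | Elite | High | Medium | Low |
| --- | --- | --- | --- | --- |
| Deployment frequency | Multiple/day | Weekly | Monthly | <Monthly |
| Lead time for changes | < 1 hour | 1 day | 1 week | 1 month |
| Change failure rate | < 5% | 5-15% | 10-15% | >15% |
| Time to restore service | < 1 hour | < 1 day | 1 week | >1 month |

The band boundaries shift between the annual State of DevOps reports, and adjacent change-failure bands overlap in some years, so treat these values as approximate.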

Reading these metrics through the tight loop lens:

DORA’s key finding: these four metrics cluster together. Elite performers are elite on all four simultaneously. You cannot optimize one in isolation.

Additional metric: SPACE framework (GitHub, 2021)

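DORA measures flow. SPACE measures developer productivity more holistically, across five dimensions:

  - Satisfaction and well-being
  - Performance
  - Activity
  - Communication and collaboration
  - Efficiency and flow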


11. Circuit Breakers and Bulkheads

These patterns from Michael Nygard’s Release It! are essential tight-loop primitives.

Circuit Breaker

A circuit breaker monitors calls to an external dependency and “trips” (opens the circuit) when failures exceed a threshold. While open, calls fail fast without attempting the dependency. After a timeout, it half-opens and lets a test call through.

States: CLOSED → (threshold exceeded) → OPEN → (timeout) → HALF-OPEN → (success) → CLOSED
                                           ↑                              │
                                           └──────── (failure) ───────────┘

This is the software equivalent of a household circuit breaker: prevents one failing component from cascading into system-wide failure.
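The state machine above can be sketched in a few lines. This is a synchronous, single-threaded illustration with an injected clock so the transitions are easy to follow; `createBreaker` and its parameters are hypothetical, and a production breaker would wrap async calls and usually track failures over a time window.

```typescript
type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

function createBreaker(failureThreshold: number, resetTimeoutMs: number, now: () => number) {
  let state: BreakerState = 'CLOSED';
  let failures = 0;
  let openedAt = 0;

  function call<T>(fn: () => T): T {
    if (state === 'OPEN') {
      // While open, fail fast without touching the dependency.
      if (now() - openedAt < resetTimeoutMs) throw new Error('circuit open: failing fast');
      state = 'HALF_OPEN'; // timeout elapsed: let one trial call through
    }
    try {
      const result = fn();
      state = 'CLOSED'; // any success closes the circuit and resets the count
      failures = 0;
      return result;
    } catch (err) {
      failures += 1;
      // A failed trial call, or too many consecutive failures, opens the circuit.
      if (state === 'HALF_OPEN' || failures >= failureThreshold) {
        state = 'OPEN';
        openedAt = now();
      }
      throw err;
    }
  }

  return { call, getState: (): BreakerState => state };
}

let clock = 0; // fake clock in milliseconds
const breaker = createBreaker(2, 1000, () => clock);
const failing = (): string => { throw new Error('dependency down'); };

function attempt(fn: () => string): string {
  try { return breaker.call(fn); } catch (err) { return (err as Error).message; }
}

const log = [attempt(failing), attempt(failing), attempt(() => 'ok')];
clock = 1500; // past the reset timeout: the next call is the half-open trial
log.push(attempt(() => 'ok'));
console.log(log.join(' | '));
// prints dependency down | dependency down | circuit open: failing fast | ok
```

The third call would have succeeded, but the breaker rejects it without trying. That is the trade the pattern makes: a few rejected good calls in exchange for not piling load onto a dependency that is already failing.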

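Implementation: Resilience4j (JVM), Polly (.NET), and opossum (Node); see the Self-Healing Infrastructure table in section 19.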

Bulkheads

Named after ship bulkheads that isolate compartments from flooding. In software: isolate thread pools, connection pools, or process queues by caller or resource. If one tenant hammers the DB, other tenants’ connections aren’t starved.

Without bulkheads:              With bulkheads:
[All requests] → [Pool 10]      [API requests]    → [Pool 5]
                                [Background jobs] → [Pool 3]
                                [Admin requests]  → [Pool 2]
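The partitioning in the diagram can be sketched as per-caller slot counters. The compartment names and sizes mirror the diagram; `createBulkheads`, `tryAcquire`, and `release` are hypothetical names, and a real pool would queue or time out rather than reject immediately.

```typescript
// Each caller class gets its own fixed number of slots.
function createBulkheads(limits: Record<string, number>) {
  const inUse: Record<string, number> = {};
  for (const name of Object.keys(limits)) inUse[name] = 0;

  return {
    // Takes a slot and returns true if this compartment has one free.
    tryAcquire(name: string): boolean {
      if (!(name in limits)) throw new Error(`unknown bulkhead: ${name}`);
      if (inUse[name] >= limits[name]) return false;
      inUse[name] += 1;
      return true;
    },
    // Returns a slot to its compartment.
    release(name: string): void {
      if (inUse[name] > 0) inUse[name] -= 1;
    },
  };
}

const pools = createBulkheads({ api: 5, background: 3, admin: 2 });

// Background jobs exhaust their own compartment...
const backgroundGrants = [1, 2, 3, 4].map(() => pools.tryAcquire('background'));
// ...but API requests still get a connection.
const apiGranted = pools.tryAcquire('api');
console.log(backgroundGrants.join(','), apiGranted); // prints true,true,true,false true

// Releasing a slot makes it available to the same compartment again.
pools.release('background');
const afterRelease = pools.tryAcquire('background');
```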

Timeout pyramid:

Every external call needs a timeout. Set timeouts at every layer:

User request timeout:       500ms
Service call timeout:       200ms
Database query timeout:     100ms
External API timeout:        50ms

If inner timeouts aren’t set, a slow database query holds the service call slot, which holds the user request slot, until the entire connection pool is exhausted.


12. Minimal Viable Test Subsets

The tight loop lives or dies by the time to get feedback. One of the most powerful optimizations is not “run the tests faster” but “run fewer tests that still tell you what you need to know.”

The philosophy:

Run the smallest set of tests that gives you confidence the specific thing you changed is correct.

Principles:

  1. Scope to the change. A change to the JSON serializer doesn’t need to run the authentication tests. Map change → affected tests using coverage or static analysis.

  2. Use fixtures, not production data. A 3-repo subset reveals the same parser bugs as a 41-repo scan but runs in 2 seconds instead of 8 minutes.

  3. Smoke tests for fast sanity. A handful of critical-path tests that run in under 30 seconds. If these fail, don’t run anything else.

  4. Regression pack for known failures. Every bug you’ve ever fixed should have a regression test. These tests encode your system’s error history.

  5. Statistical sampling. For large datasets, random sampling of N items often reveals the same failure modes as exhaustive scanning. Start with N=10, scale up only if needed.
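Principle 5 can be sketched as a seeded random sample, so that a subset which exposes a failure can be reproduced exactly. The repo names are placeholders, and the small linear congruential generator is an assumption chosen for reproducibility, not statistical quality.

```typescript
// Small deterministic PRNG (linear congruential generator) so runs are repeatable.
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296; // value in [0, 1)
  };
}

// Partial Fisher-Yates shuffle: picks n distinct items without mutating the input.
function sample<T>(items: T[], n: number, random: () => number): T[] {
  const pool = items.slice();
  const count = Math.min(n, pool.length);
  for (let i = 0; i < count; i++) {
    const j = i + Math.floor(random() * (pool.length - i));
    const tmp = pool[i];
    pool[i] = pool[j];
    pool[j] = tmp;
  }
  return pool.slice(0, count);
}

// Placeholder repo list; in practice this comes from the scanner's inventory.
const repos: string[] = [];
for (let i = 1; i <= 41; i++) repos.push(`owner/repo-${i}`);

const subset = sample(repos, 10, seededRandom(42));
const again = sample(repos, 10, seededRandom(42)); // same seed, same subset
console.log(subset.length, subset.join(',') === again.join(',')); // prints 10 true
```

Logging the seed alongside each run keeps the sample random across runs while still letting you replay the exact subset that failed.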

Test pyramid vs. test diamond:

The classic “test pyramid” (many unit tests, fewer integration, minimal E2E) is right for most systems. But for systems with significant integration complexity (API-heavy systems, cloud workers), a “diamond” shape often makes more sense:

Pyramid:                    Diamond:
    [E2E]                       [E2E]
   [integ ]                  [integ integ]
  [unit unit unit]           [  unit unit  ]
                              [smoke smoke]

The diamond adds a thick integration band and a smoke band at the bottom because unit tests of an API client test mocks, not the API.

Fast iteration workflow:

# Step 1: Smoke test — 3 repos, 5 seconds
curl -X POST .../debug/reconcile -d '{"repos": ["a", "b", "c"]}'

# Step 2: If smoke passes, run against 10 representative repos
# Step 3: If that passes, merge and let the full cron verify at scale

13. AI Agent Observability

Standard APM tools weren’t designed for LLM-based systems. AI agents introduce new failure modes:

What to instrument:

// Minimum viable AI agent telemetry
interface AgentSpan {
  session_id: string
  turn_number: number
  model: string
  input_tokens: number
  output_tokens: number
  cached_tokens: number
  cost_usd: number
  tool_calls: ToolCallRecord[]
  latency_ms: number
  finish_reason: 'end_turn' | 'max_tokens' | 'tool_use' | 'error'
  error?: string
}

interface ToolCallRecord {
  tool_name: string
  duration_ms: number
  success: boolean
  error?: string
}

AI Observability Platforms:

| Platform | Strengths | Notes |
| --- | --- | --- |
| LangSmith | Deep LangChain integration, evaluation suite, prompt hub | Best-in-class if using LangChain |
| Langfuse | Open source, self-hostable, framework-agnostic | Best choice for custom agents |
| Helicone | Proxy-based (no SDK changes), cost analytics | Drop-in for any OpenAI-compatible API |
| Arize Phoenix | OSS, ML observability background, embeddings viz | Strong for RAG/embedding workflows |
| AgentOps | Agent-specific metrics (session replay, loop detection) | Designed for multi-step agents |
| Weights & Biases Weave | Versioning + tracing + evaluation in one | Strong if already using W&B |
| Braintrust | Eval-first, dataset management, score tracking | Best for systematic LLM eval workflows |
| PromptLayer | Prompt versioning + A/B testing | Simpler, less overhead |

Cost control as a tight loop:

Budget → [token counter] → [budget guard] → [agent]
                ↑                               │
                └──────── [usage report] ────────┘

If accumulated_cost > daily_budget × 0.8:
  → switch to cheaper model (Haiku instead of Sonnet)
  → reduce context window
  → alert human
  → pause non-critical jobs
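The guard rule above could be sketched as a pure decision function. The 0.8 threshold comes from the rule itself; the model names, the `critical` flag, and the `BudgetDecision` shape are placeholders, not any provider's API.

```typescript
type GuardAction = 'proceed' | 'degrade' | 'pause';

interface BudgetDecision { action: GuardAction; model: string; alertHuman: boolean }

// Decides how a job may run given spend so far. Model names are placeholders.
function budgetGuard(accumulatedCostUsd: number, dailyBudgetUsd: number, critical: boolean): BudgetDecision {
  const softLimit = dailyBudgetUsd * 0.8;
  if (accumulatedCostUsd <= softLimit) {
    return { action: 'proceed', model: 'default-model', alertHuman: false };
  }
  // Past 80% of budget: alert a human and pause anything non-critical.
  if (!critical) return { action: 'pause', model: 'none', alertHuman: true };
  // Critical work continues on a cheaper model (and, in practice, a smaller context).
  return { action: 'degrade', model: 'cheaper-model', alertHuman: true };
}

const underBudget = budgetGuard(50, 100, false);
const overNonCritical = budgetGuard(85, 100, false);
const overCritical = budgetGuard(85, 100, true);
console.log(underBudget.action, overNonCritical.action, overCritical.action);
// prints proceed pause degrade
```

Keeping the guard pure makes it trivial to test and lets the same function run both at dispatch time and inside the agent before each model call.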

Agent-specific alert thresholds:


14. AI Evaluation: The Eval Loop

Evaluating LLM-based systems is itself a tight loop discipline. The eval loop closes the gap between “model behavior” and “desired behavior.”

The Eval Hierarchy

Level 0: Assertion-based evals      — deterministic, binary pass/fail
Level 1: LLM-as-judge evals         — use a model to grade model output
Level 2: Human preference evals     — human labels ground truth
Level 3: Downstream metric evals    — measure business/task outcomes

Use all four levels, but optimize for Level 0 first. Deterministic evals run in milliseconds and never disagree with themselves.

Eval-as-CI

Treat evals like tests: run them on every commit, gate merges on them. The key is separating two concerns:

  1. Regression pack: a frozen dataset of inputs → expected outputs. If a model change breaks regression, block the deploy.
  2. Exploration set: new examples from production that stress-test the current model. This expands the regression pack over time.

# .github/workflows/eval.yml
- name: Run eval suite
  run: |
    npx braintrust eval src/evals/main.eval.ts
  env:
    BRAINTRUST_API_KEY: ${{ secrets.BRAINTRUST_API_KEY }}

LLM-as-Judge Pattern

When the expected output is complex (prose, code, structured reasoning), use a stronger model to grade a weaker model’s output:

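// Shape of the judge's verdict, matching the JSON requested in the prompt
interface Score {
  correctness: number
  completeness: number
  conciseness: number
  reasoning: string
}

async function gradeResponse(
  question: string,
  expected: string,
  actual: string,
): Promise<Score> {
  const prompt = `
    Question: ${question}
    Expected: ${expected}
    Actual: ${actual}

    Rate the actual response on:
    1. Correctness (0-1): Does it answer the question correctly?
    2. Completeness (0-1): Does it cover all aspects of expected?
    3. Conciseness (0-1): Is it appropriately concise?

    Return JSON: {"correctness": N, "completeness": N, "conciseness": N, "reasoning": "..."}
  `
  // `claude.complete` stands in for whatever LLM client you use
  const response = await claude.complete(prompt)
  // JSON.parse throws if the judge returns malformed JSON; treat that as a failed grade
  return JSON.parse(response) as Score
}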

Caveats for LLM-as-judge:

Evaluation Frameworks:

| Framework | Language | Notes |
| --- | --- | --- |
| Braintrust | TypeScript/Python | Best-in-class eval dataset management, CI integration |
| LangSmith Evals | Python | Tight LangChain integration, shareable datasets |
| Promptfoo | TypeScript | YAML-configured evals, provider-agnostic |
| Evals (OpenAI) | Python | OpenAI’s own framework for evaluating models, with a registry of benchmarks |
| RAGAS | Python | Specialized for RAG evaluation |
| TruLens | Python | LLM observability + feedback functions |
| DeepEval | Python | Comprehensive metric library (G-Eval, RAGAS, toxicity) |

Benchmark Suites for Agents:

| Benchmark | What it tests |
| --- | --- |
| SWE-bench | Solving real GitHub issues (software engineering) |
| SWE-bench Verified | Human-verified subset of SWE-bench |
| AgentBench | Multi-domain agent tasks (OS, DB, web, code) |
| GAIA | Real-world assistant tasks with tool use |
| WebArena | Web browsing and navigation |
| OSWorld | Desktop computer use |
| TAU-bench | Tool-agent-user interaction with simulated users and domain APIs |

Golden Dataset Curation

The most important eval practice is building and maintaining a golden dataset:

  1. Collect from production: log all agent inputs and outputs in production
  2. Label failures: when a user corrects the agent or gives negative feedback, tag that input
  3. Label successes: randomly sample good outputs and verify them
  4. Balance the set: ensure coverage of edge cases, not just the happy path
  5. Version the dataset: evals are only meaningful relative to a specific dataset version
  6. Rotate in new examples: a frozen dataset decays as the system changes

Minimum Viable Eval

Don’t wait for a comprehensive eval suite. Ship a minimum viable eval that runs in < 2 minutes:

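// minimum-viable-eval.ts
import assert from 'node:assert/strict'
// evaluatePolicy is the function under test (application code, not shown)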
const TEST_CASES = [
  {
    input: "repo has no CI workflow",
    expectedJobType: "add-workflow",
    expectedPriority: 2,
  },
  {
    input: "CI is failing on main",
    expectedJobType: "fix-ci",
    expectedPriority: 1,
  },
  // 5-10 cases covering critical paths
]

for (const tc of TEST_CASES) {
  const result = await evaluatePolicy(tc.input)
  assert(result.type === tc.expectedJobType, `Wrong job type for: ${tc.input}`)
  assert(result.priority === tc.expectedPriority, `Wrong priority for: ${tc.input}`)
}

Constitutional AI and RLHF in the eval loop

Constitutional AI (Anthropic, 2022) trains models using a set of principles (“the constitution”) evaluated by the model itself. RLHF (Reinforcement Learning from Human Feedback) fine-tunes models using human preference signals.

For application-level systems, these techniques manifest as:

Red-Teaming for AI Agents

Red-teaming agents requires finding inputs that cause:

Automate red-teaming by generating adversarial inputs with a separate model specifically tasked to find failures.


15. The Strangler Fig Pattern

When you need to replace an existing system while keeping it running, the Strangler Fig pattern (Martin Fowler, 2004) applies the tight loop approach to migration.

The pattern:

Phase 1: New system runs alongside old, handles nothing
         [all traffic] → [old system]
         [new system]   (idle)

Phase 2: Route specific, low-risk paths to new system
         [path A] → [new system]
         [path B] → [old system]

Phase 3: Gradually move paths
         [path A, C, D] → [new system]
         [path B]       → [old system]

Phase 4: Old system dies naturally (strangled by the fig)
         [all traffic] → [new system]

The tight loop application: don’t big-bang migrate. Each “move a path to the new system” step is one loop iteration. Observe the new system under real load, fix issues, expand scope.
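The routing layer that makes this possible can be sketched as a list of migrated path prefixes that grows by one entry per loop iteration. The paths and `createRouter` are hypothetical; in practice this logic usually lives in a reverse proxy or API gateway configuration.

```typescript
type Backend = 'old' | 'new';

// Routes a request to the new system only if its path has been migrated.
function createRouter(migratedPrefixes: string[]) {
  return (path: string): Backend => {
    const migrated = migratedPrefixes.some(
      // Match the prefix itself or anything below it, but not look-alike paths.
      (prefix) => path === prefix || path.slice(0, prefix.length + 1) === prefix + '/',
    );
    return migrated ? 'new' : 'old';
  };
}

const phase2 = createRouter(['/path-a']);
const phase3 = createRouter(['/path-a', '/path-c', '/path-d']);

console.log(phase2('/path-a/items'), phase2('/path-b'), phase3('/path-c'));
// prints new old new
```

Rolling a path back is the same operation in reverse: remove the prefix and traffic returns to the old system, which keeps every migration step reversible.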

Where it applies:


16. Applying the Tight Loop to Autonomous Agent Systems

This section synthesizes the above into a concrete architecture for self-healing agent systems — the specific context that motivated this article.

The Two-Tier Agent Architecture

┌─────────────────────────────────────────────────────────────┐
│                         BRAIN (Policy Plane)                 │
│                                                              │
│  [Scanner] → [Policy Engine] → [Job Queue] → [Dispatcher]   │
│      ↑                                            │          │
│      └────────────── [Result Ingest] ─────────────┘          │
└─────────────────────────────────────────────────────────────┘

                         (job dispatch)

┌─────────────────────────────────────────────────────────────┐
│                        WORKERS (Execution Plane)             │
│                                                              │
│  [RepoPrime] [CI-Fixer] [Compliance-Bot] [PR-Merger]         │
│      │           │            │               │              │
│      └───────────┴────────────┴───────────────┘              │
│                       (result reports)                        │
└─────────────────────────────────────────────────────────────┘

The tight loop for each tier:

Brain loop (minutes):

  1. Scan repos for signals (CI status, file presence, branch protection)
  2. Evaluate policy (desired state = signals all green)
  3. For each violation, create a job with priority and type
  4. Dispatch to eligible runners

Worker loop (seconds to minutes per job):

  1. Claim a job
  2. Execute the fix (run Claude, push a PR, trigger a workflow)
  3. Report result back to Brain
  4. Brain re-scans to verify the fix closed the gap

Observability the loop needs:

// Every reconciliation run should emit:
{
  timestamp: number,
  trigger: 'cron' | 'webhook' | 'debug',
  repos_scanned: number,
  repos_with_violations: number,
  jobs_created: { type: string, priority: number, repo: string }[],
  jobs_already_pending: number,  // deduplicated
  scan_duration_ms: number,
  errors: { repo: string, error: string }[],
}

// Every job completion should emit:
{
  job_id: string,
  job_type: string,
  repo: string,
  runner_id: string,
  duration_ms: number,
  success: boolean,
  pr_url?: string,
  error?: string,
  retry_count: number,
}

The fast-iteration subset

The single most impactful tight-loop technique for agent systems:

# Instead of waiting for the full cron (30 repos, 8 minutes):
curl -X POST .../debug/reconcile -d '{"repos": ["owner/repo-a", "owner/repo-b", "owner/repo-c"]}'

# Choose repos that cover:
# - A repo with CI failing (tests fix-ci path)
# - A repo missing a file (tests add-workflow path)
# - A repo that's clean (tests no-duplicate-job path)
# - A repo with known edge cases (tests error handling)

Self-healing failure modes:

| Failure | Detection | Remediation |
| --- | --- | --- |
| Job stuck in assigned > 10min | Cron checks claimed_at | Re-queue to pending |
| Runner offline | heartbeat cutoff (90s) | Remove from eligible runners |
| Scan error for a repo | error count in reconcile result | Alert + skip repo |
| Job loop (same job created repeatedly) | Check jobs WHERE repo=? AND type=? AND status IN (pending, running) | Deduplicate before insert |
| Worker crash mid-job | Job stays running, runner goes offline | Timeout → re-queue |
| GitHub API rate limit | 429 response | Exponential backoff + pause scan |
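Two of these remediations, re-queueing stuck jobs and deduplicating before insert, can be sketched over an in-memory job list. The `Job` shape and status names are assumptions for illustration; a real Brain would run the equivalent queries against its job store.

```typescript
type JobStatus = 'pending' | 'assigned' | 'done';

interface Job { id: string; repo: string; type: string; status: JobStatus; claimedAt?: number }

const STUCK_AFTER_MS = 10 * 60 * 1000; // 10 minutes

// Re-queues jobs that were claimed but not finished within the timeout.
function requeueStuck(jobs: Job[], now: number): Job[] {
  return jobs.map((job) =>
    job.status === 'assigned' && job.claimedAt !== undefined && now - job.claimedAt > STUCK_AFTER_MS
      ? { ...job, status: 'pending' as const, claimedAt: undefined }
      : job,
  );
}

// Inserts a job only if no open job exists for the same repo and type.
function insertIfAbsent(jobs: Job[], candidate: Job): Job[] {
  const duplicate = jobs.some(
    (j) => j.repo === candidate.repo && j.type === candidate.type && j.status !== 'done',
  );
  return duplicate ? jobs : [...jobs, candidate];
}

const jobs: Job[] = [
  { id: '1', repo: 'owner/repo-a', type: 'fix-ci', status: 'assigned', claimedAt: 0 },
  { id: '2', repo: 'owner/repo-b', type: 'fix-ci', status: 'assigned', claimedAt: 500000 },
];

const healed = requeueStuck(jobs, 11 * 60 * 1000); // 11 minutes after job 1 was claimed
const afterDuplicate = insertIfAbsent(healed, { id: '3', repo: 'owner/repo-a', type: 'fix-ci', status: 'pending' });
const afterNew = insertIfAbsent(healed, { id: '4', repo: 'owner/repo-c', type: 'fix-ci', status: 'pending' });
console.log(healed.map((j) => j.status).join(','), afterDuplicate.length, afterNew.length);
// prints pending,assigned 2 3
```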

17. Anti-Patterns

Anti-pattern 1: The Manual Fix

Symptom: A bug occurs. You fix the specific instance of the bug, not the class of bug.

Example: The scanner crashes on repo X because it has no description. You add ?? '' for repo X. Next week, repo Y has no homepage and crashes the same way.

Fix: Identify the class of failure (missing optional fields), add defensive defaults for all of them, add a test for the failure class.


Anti-pattern 2: Waiting for the Full Run

Symptom: You make a change, then wait 30 minutes for the cron to tell you if it worked.

Fix: Build a debug/test endpoint that runs the loop on a configurable subset. The first thing to build when setting up a new loop is the fast iteration path.


Anti-pattern 3: Alert Without Action

Symptom: The system detects a problem and sends a notification to a human, who then manually decides what to do.

Fix: Every alert should have a corresponding runbook. Every runbook should have an automated version. Humans in the loop should be exception handlers, not the rule.


Anti-pattern 4: The Infinite Retry

Symptom: The system retries failed jobs indefinitely with no backoff and no limit. One broken API endpoint saturates the job queue.

Fix: Exponential backoff + max retry count + dead letter queue. After N failures, move the job to a dead letter queue for human inspection.
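A sketch of that fix as two small pure functions. The policy values and the names `backoffDelay` and `afterFailure` are illustrative assumptions; a production version would usually add random jitter to the delay so retries do not synchronize.

```typescript
interface RetryPolicy { baseDelayMs: number; maxDelayMs: number; maxAttempts: number }

// Delay before retry number `attempt` (1-based), doubling each time up to a cap.
function backoffDelay(attempt: number, policy: RetryPolicy): number {
  return Math.min(policy.maxDelayMs, policy.baseDelayMs * Math.pow(2, attempt - 1));
}

type NextStep = { kind: 'retry'; delayMs: number } | { kind: 'dead-letter' };

// Decides what to do after a job has failed `failures` times in total.
function afterFailure(failures: number, policy: RetryPolicy): NextStep {
  if (failures >= policy.maxAttempts) return { kind: 'dead-letter' };
  return { kind: 'retry', delayMs: backoffDelay(failures, policy) };
}

const policy: RetryPolicy = { baseDelayMs: 1000, maxDelayMs: 30000, maxAttempts: 5 };
const steps = [1, 2, 3, 4, 5].map((n) => afterFailure(n, policy));
console.log(steps.map((s) => (s.kind === 'retry' ? String(s.delayMs) : s.kind)).join(','));
// prints 1000,2000,4000,8000,dead-letter
```

The dead letter queue is what keeps the loop honest: a job that cannot succeed stops consuming capacity and becomes a visible item for a human to inspect.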


Anti-pattern 5: Observing at the Wrong Granularity

Symptom: You measure “reconciliation succeeded” but not “how many repos were scanned, how many had violations, how many jobs were created.”

Fix: Emit structured events at every step of the loop. Aggregate metrics tell you the system is healthy; structured events tell you why it’s healthy or not.


Anti-pattern 6: Changing Multiple Things at Once

Symptom: You deploy three changes simultaneously. Something breaks. You don’t know which change caused it.

Fix: One change per loop iteration. If you must batch changes, use feature flags to control which changes are active in production and roll them out one at a time.


Anti-pattern 7: Testing Only the Happy Path

Symptom: Your test subset only covers repos that work correctly. The broken repo is the first one discovered by the cron.

Fix: Your test subset must include at least one example of each known failure mode. The subset is a regression pack, not a demo.


18. The Meta-Loop: Continuously Enhancing This Article

This article is itself a tight loop artifact. It should be enhanced each time:

Planned additions:


19. Tools and Libraries Reference

Self-Healing Infrastructure

| Tool | Category | Link |
| --- | --- | --- |
| Kubernetes controller-runtime | Reconciliation loops | k8s.io/controller-runtime |
| Argo Rollouts | Progressive delivery | argoproj.github.io/argo-rollouts |
| Flagger | Canary automation | flagger.app |
| Temporal | Workflow orchestration | temporal.io |
| Resilience4j | Circuit breaker (JVM) | resilience4j.readme.io |
| Polly | Circuit breaker (.NET) | pollydocs.org |
| opossum | Circuit breaker (Node) | nodeshift/opossum |
| Gremlin | Chaos engineering | gremlin.com |
| LitmusChaos | Kubernetes chaos | litmuschaos.io |
| toxiproxy | Network failure injection | shopify/toxiproxy |

Observability

| Tool | Category | Link |
| --- | --- | --- |
| OpenTelemetry | Instrumentation standard | opentelemetry.io |
| Prometheus | Metrics | prometheus.io |
| Grafana | Dashboards | grafana.com |
| Jaeger | Distributed tracing | jaegertracing.io |
| Honeycomb | High-cardinality observability | honeycomb.io |
| Loki | Log aggregation | grafana.com/oss/loki |

AI/LLM Observability

| Tool | Category | Link |
| --- | --- | --- |
| Langfuse | LLM tracing + eval | langfuse.com |
| LangSmith | LangChain observability | smith.langchain.com |
| Helicone | Proxy-based LLM observability | helicone.ai |
| Arize Phoenix | ML + LLM observability | arize.com/phoenix |
| AgentOps | Agent session replay | agentops.ai |
| Weights & Biases Weave | LLM tracing + eval | wandb.ai/site/weave |
| Braintrust | LLM evaluation | braintrust.dev |

Evaluation Frameworks

| Tool | Language | Link |
| --- | --- | --- |
| Braintrust | TypeScript/Python | braintrust.dev |
| Promptfoo | TypeScript | promptfoo.dev |
| DeepEval | Python | confident-ai.com |
| RAGAS | Python | ragas.io |
| TruLens | Python | trulens.org |
| LangSmith Evals | Python | smith.langchain.com |

Fast CI/CD

| Tool | Category | Notes |
| --- | --- | --- |
| Nx | Monorepo build cache | Supports affected-only builds |
| Turborepo | Monorepo task runner | Vercel-backed |
| Bazel | Hermetic build system | Google-scale |
| Launchable | Test impact analysis | ML-based selection |
| GitHub Actions path filters | CI scoping | on.push.paths |

Last updated: 2026-03-20 — S3 of garywu/mulan epic #123

Next update triggers: S4 completion (Brain DO), first successful self-healing loop, new eval tool adoption

