The Recurring Run Review

“One API call is an experiment. One thousand API calls per day is an operational commitment.”

1. The Phase Transition Problem
2. Why Recurring Is Categorically Different
3. The Cost Explosion Model
- The Formula
- Cost Dimensions to Model
4. The AI Call Multiplier
- The $1-per-call Scenario
- AI Call Rules for Recurring Jobs
5. Known Limits: The Constraint Database
6. The Review Gate: Formal Checklist
7. The Automation Registry
- Registry Entry Schema
- Registry Location
8. Required Infrastructure: Circuit Breakers and Kill Switches
9. Observability Is Mandatory, Not Optional
- Minimum: The Structured Log Line
- Dashboard Requirement
10. The Smart Scan Principle
11. Governance in Practice: A Worked Example
12. The Automation Lifecycle
- The Quarterly Review
13. Related Articles
Summary

1. The Phase Transition Problem

There is a hard boundary between something that runs once and something that runs repeatedly. Most engineers treat this as a smooth continuum. It is not.

When you run a script manually, you are present. You observe the output. You see the error. You stop it. The blast radius is bounded by your attention span.

When you schedule that same script to run every five minutes, you have made a fundamentally different commitment. You are no longer present for each execution. The script runs while you sleep, while you are in meetings, while you are on vacation. Every assumption embedded in that script — about data size, API availability, cost per call, downstream effects — now compounds silently.

This is the phase transition: from experiment to operational system. The mistake is treating it as a deployment decision. It is not. It is an architectural decision that requires formal review.

The rule is simple: anything that runs more than once enters a review process before it runs the second time.

2. Why Recurring Is Categorically Different

Compounding Effects

A single bad API call costs one error. A bad API call on a five-minute cron costs 288 errors per day. At monthly scale that is 8,640 errors, each potentially triggering downstream state changes, retries, secondary calls, and notifications.

Silent Failure Accumulates

One-off scripts fail loudly — you are watching. Recurring jobs fail quietly — nobody is watching. By the time anyone notices, the damage is done: rate limits exhausted, data corrupted, costs inflated, downstream systems confused.

State Drift

Recurring jobs interact with state. Every run reads something, writes something, or signals something. Over time, these interactions accumulate into a state machine that no individual designed. The job does not just run — it shapes the world it runs in.

The Discovery Problem

Once a recurring job exists, it is difficult to find. It does not appear in your code editor. It does not show up in a pull request. It runs invisibly on a schedule that nobody remembers setting. The Cloudflare autonomous pipeline article documents this directly: an unmonitored * * * * * cron burned $19.63/month in D1 row reads before anyone noticed.

3. The Cost Explosion Model

Every recurring job has a cost function. Before scheduling anything, you must write it down.

The Formula

total_cost_per_month =
  (calls_per_run × cost_per_call) × (runs_per_hour × 730)

A job that makes 14 API calls per repo, scans 41 repos, and runs every 5 minutes:

14 × 41 = 574 API calls per run
574 × 12 = 6,888 calls per hour
6,888 × 730 = 5,028,240 calls per month

GitHub’s limit is 5,000 calls per hour. This single job, run naively, consumes 138% of the hourly limit and triggers rate limiting on every single cycle.
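
The arithmetic above is small enough to encode as a function and rerun before every schedule change. The following is a minimal sketch of that idea; `projectCalls` and `HOURS_PER_MONTH` are illustrative names, not part of any existing codebase.

```typescript
// Hypothetical helper: project API usage for a polling job from its schedule.
const HOURS_PER_MONTH = 730 // same constant the formula above uses

interface CallProjection {
  callsPerRun: number
  callsPerHour: number
  callsPerMonth: number
  pctOfHourlyLimit: number // 100 means the hourly limit is fully consumed
}

function projectCalls(
  callsPerItem: number,
  itemCount: number,
  runsPerHour: number,
  hourlyLimit: number,
): CallProjection {
  const callsPerRun = callsPerItem * itemCount
  const callsPerHour = callsPerRun * runsPerHour
  return {
    callsPerRun,
    callsPerHour,
    callsPerMonth: callsPerHour * HOURS_PER_MONTH,
    pctOfHourlyLimit: (callsPerHour / hourlyLimit) * 100,
  }
}

// The naive org scanner: 14 calls/repo, 41 repos, every 5 minutes, GitHub REST limit.
const naive = projectCalls(14, 41, 12, 5_000)
console.log(naive.callsPerHour, Math.round(naive.pctOfHourlyLimit)) // 6888 138
```

Changing the third argument from 12 to 2 reproduces the 30-minute schedule, which is the quickest way to see how much headroom an interval change removes.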

This is not a hypothetical. This is exactly what happened when the mulan org scanner was configured to run every 5 minutes without a smart scan optimization. The rate limit hit was not a surprise — it was predictable from the cost function. The surprise was that nobody computed the function before deploying.

Cost Dimensions to Model

Dimension	Unit	Common Limits
GitHub REST API	requests/hour	5,000/hr (authenticated)
GitHub GraphQL API	points/hour	5,000/hr
GitHub secondary rate limit	content creations/min	~30/min (soft)
Cloudflare Worker CPU	ms/invocation	10ms (free), 30s (paid)
Cloudflare D1 reads	rows/day	5M (free), unlimited (paid)
Cloudflare Worker requests	req/day	100K (free), unlimited (paid)
External LLM API	tokens or requests	varies by provider + tier
Claude via Agent SDK	requests	shared with Pro/Max quota

Write the cost function. Compute the monthly projection. If any dimension exceeds 50% of its limit at steady state, the job needs redesign before it runs.
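
The 50% rule can be applied mechanically once each dimension's steady-state usage is written down. The sketch below assumes a hypothetical `DimensionUsage` shape; the D1 figure uses the 41 rows per run from the registry example in §7 and 288 runs per day.

```typescript
// Hypothetical gate: flag every cost dimension that exceeds 50% of its limit.
interface DimensionUsage {
  name: string
  steadyStateUsage: number // in the dimension's own unit (requests/hour, rows/day, ...)
  limit: number // same unit as steadyStateUsage
}

const MAX_FRACTION_OF_LIMIT = 0.5 // the 50% rule stated above

function dimensionsNeedingRedesign(dims: DimensionUsage[]): string[] {
  return dims
    .filter((d) => d.steadyStateUsage / d.limit > MAX_FRACTION_OF_LIMIT)
    .map((d) => d.name)
}

const offenders = dimensionsNeedingRedesign([
  { name: 'github_rest', steadyStateUsage: 6_888, limit: 5_000 }, // naive scanner, per hour
  { name: 'd1_reads', steadyStateUsage: 41 * 288, limit: 5_000_000 }, // rows per day
])
console.log(offenders) // only github_rest is flagged
```

An empty result means every modeled dimension sits at or below half of its limit; it says nothing about dimensions that were never listed.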

4. The AI Call Multiplier

The cost model above applies to all recurring jobs. But AI calls introduce a qualitatively different problem: the cost per call is orders of magnitude higher than standard API calls, and it varies unpredictably by input size.

The $1-per-call Scenario

Consider a job that calls Claude Sonnet to analyze a repository. At current API pricing:

Input: ~10,000 tokens (repo context, prompt)
Output: ~2,000 tokens (analysis, recommendations)
Cost: approximately $0.03–$0.15 per call at standard API rates

That sounds cheap. Now schedule it naively across a portfolio:

$0.10/call × 41 repos × 12 runs/hour = $49.20/hour = $35,916/month

With a $1/call scenario (extended thinking, long context, high-stakes analysis):

$1.00/call × 41 repos × 12 runs/hour = $492/hour = $359,160/month

This is not an edge case. This is the direct consequence of treating AI calls like ordinary API calls — cheap per invocation, catastrophic at scale.
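
The same projection works for AI spend if the per-call cost is derived from token counts. This is a sketch under stated assumptions: the prices of $3 and $15 per million input and output tokens are illustrative placeholders, and `costPerCallUsd` and `monthlyCostUsd` are hypothetical names. Check the provider's current price list before relying on any figure.

```typescript
// Assumed prices in USD per million tokens; replace with the provider's current rates.
interface ModelPrice {
  inputPerMTok: number
  outputPerMTok: number
}

function costPerCallUsd(inputTokens: number, outputTokens: number, price: ModelPrice): number {
  return (inputTokens * price.inputPerMTok + outputTokens * price.outputPerMTok) / 1_000_000
}

// Same shape as the formula in §3: per-call cost × items × runs/hour × 730 hours.
function monthlyCostUsd(perCallUsd: number, itemCount: number, runsPerHour: number): number {
  return perCallUsd * itemCount * runsPerHour * 730
}

const assumedPrice: ModelPrice = { inputPerMTok: 3, outputPerMTok: 15 } // illustrative only
const perCall = costPerCallUsd(10_000, 2_000, assumedPrice) // about $0.06 under these prices
console.log(perCall.toFixed(2), Math.round(monthlyCostUsd(0.1, 41, 12)))
```

Under the assumed prices the 10,000-in, 2,000-out call lands inside the $0.03 to $0.15 range quoted above, and the monthly figure for $0.10 per call matches the $35,916 computed by hand.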

AI Call Rules for Recurring Jobs

Rule 1: AI calls must never be in the hot path of a cron loop over many items. The hot path (the inner loop over items per cycle) must be free of AI calls. AI calls belong in asynchronous jobs that are dispatched one at a time, rate-limited, and queued. The org-prime-agent-architecture article describes this separation explicitly: the Brain DO produces a run sheet, the Dispatcher queues individual jobs, and the local runner processes one job at a time with a 20-minute hard timeout per AI call.

Rule 2: Every AI call in a recurring context needs an explicit budget.

ai_calls:
  model: claude-sonnet
  max_calls_per_day: 50
  max_tokens_per_call: 8000
  estimated_daily_cost: $2.50
  kill_switch: true   # halt job if daily budget exceeded

Rule 3: AI calls need output caching. If the input (repo state, document, prompt) has not changed since the last call, return the cached output. Never call an AI twice with the same input. Cache key = hash of the input content.

Rule 4: Use the cheapest model that produces acceptable output. The org-prime-agent-architecture article defines a model routing policy: Haiku for structured extraction, Sonnet for planning and code changes, Opus for complex multi-step reasoning. Never use Opus where Haiku produces acceptable results.

Rule 5: Thinking mode is a cost multiplier, not a free upgrade. Extended thinking at budget_tokens: 10000 can increase the cost of a call by up to 10×, because thinking tokens are billed in addition to the visible output. It belongs in low-frequency, high-stakes decisions, not in any loop.

Rule 6: Prefer local models for hot-path operations. When a job runs frequently over many items, consider whether a local model (running on available GPU, zero marginal cost per call) can replace the cloud API call for that specific task. Reserve API calls for tasks that genuinely require frontier model capability.
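
Rules 2 and 4 combine naturally into one pre-call decision: pick the cheapest tier for the task, and refuse the call if it would breach the budget. The following is a minimal sketch; the task categories, the `AiBudget` shape, and `planAiCall` are assumptions made for illustration, and the tier mapping simply mirrors the routing policy described in Rule 4.

```typescript
// Hypothetical task categories; the mapping mirrors the routing policy in Rule 4.
type TaskKind = 'structured_extraction' | 'planning' | 'code_change' | 'multi_step_reasoning'
type ModelTier = 'haiku' | 'sonnet' | 'opus'

const ROUTING: Record<TaskKind, ModelTier> = {
  structured_extraction: 'haiku',
  planning: 'sonnet',
  code_change: 'sonnet',
  multi_step_reasoning: 'opus',
}

interface AiBudget {
  maxCallsPerDay: number
  maxTokensPerCall: number
}

// Returns the model tier to use, or null when the call must be skipped (Rule 2).
function planAiCall(
  kind: TaskKind,
  estimatedTokens: number,
  callsMadeToday: number,
  budget: AiBudget,
): ModelTier | null {
  if (callsMadeToday >= budget.maxCallsPerDay) return null
  if (estimatedTokens > budget.maxTokensPerCall) return null
  return ROUTING[kind]
}

const budget: AiBudget = { maxCallsPerDay: 50, maxTokensPerCall: 8000 } // values from the YAML above
console.log(planAiCall('structured_extraction', 4000, 0, budget)) // haiku
```

Returning null instead of throwing keeps the decision cheap to test and forces the caller to handle the skipped-call path explicitly.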

5. Known Limits: The Constraint Database

Every external system has documented limits. Before any recurring job goes live, the relevant constraints must be recorded in a shared reference. This is the constraint database — a maintained document that lives in the brain repo alongside the automation registry.

GitHub API Constraints

Limit	Value	Notes
REST API rate limit	5,000 req/hr	Per authenticated user, not per app
GraphQL rate limit	5,000 points/hr	Each field has a cost; connections cost more
Secondary rate limit	~30 content creations/min	Soft; triggers 403 with retry-after header
Search API	30 req/min	Separate quota from REST
Actions API	1,000 req/hr	For workflow dispatch and run queries
Repository contents API	counted in REST	Each file fetch = 1 REST call

Cloudflare Platform Constraints

Limit	Free Tier	Paid Tier
Worker requests	100K/day	Unlimited
Worker CPU per request	10ms	30s
Subrequests per Worker invocation	50	1,000
D1 row reads	5M/day	Unlimited
D1 row writes	100K/day	Unlimited
Durable Object requests	1M/month	$0.15/M after
Cron triggers	5/account	Unlimited
DO alarm precision	1 minute	1 minute

AI Provider Constraints

Provider	Limit Type	Value
Anthropic API (Tier 1)	Tokens/min	50,000 TPM
Anthropic API (Tier 4)	Tokens/min	400,000 TPM
Anthropic API (Tier 1)	Requests/min	50 RPM
Claude Code (Pro/Max)	Requests	Shared quota, no published hard number
Claude Agent SDK	Concurrency	One session per binary instance

The constraint database rule: Every recurring job must reference the specific limits it operates within. “I didn’t know there was a limit” is not an acceptable post-mortem. Add any limit you discover to the database before you hit it again.

6. The Review Gate: Formal Checklist

Before any job, script, workflow, or agent is scheduled to run more than once, it must pass this checklist. The completed checklist becomes part of the job’s automation registry entry. A job that cannot answer all questions does not get scheduled.

Section A: Classification

Job ID: unique name across all services (e.g. mulan-org-scanner)
Run frequency: what is the schedule? (cron expression, event trigger, alarm interval)
Trigger type: Cron / event-driven / DO alarm / manual-with-retry / queue consumer
Item count: how many items does it process per run? (repos, records, files…)
Termination condition: under what conditions does this job permanently stop running?

Section B: External API Calls

GitHub REST: calls per run ___ · calls per hour ___ · % of 5,000/hr limit ___
GitHub GraphQL: points per run ___ · % of 5,000/hr limit ___
GitHub secondary: content creations per run ___ · within 30/min limit? ___
Other APIs: (list each with calls/run and limit)
Smart fetch: does the job skip API calls when data has not changed? (required if >20% of any limit)
Conditional requests: does the job use ETags or If-Modified-Since headers?
Scaling question: how does call count scale as item count grows? At 10× items, is the cost still within limits?
Event-driven alternative: could resources report in when they change, replacing polling entirely? (required if item count may exceed ~100)
Backoff policy: what happens on a 429 or 403 rate limit response?

Section C: AI Calls

Model: which model is called ___ · is it the cheapest model that produces acceptable output? ___
AI calls: calls per run ___ · max calls per day ___ · max tokens per call ___
Estimated cost: per call ___ · per day ___ · per month ___
Hot path: are AI calls kept out of the per-item loop and dispatched through a rate-limited queue?
Caching: is output cached by a hash of the input, so unchanged input never triggers a second call?
Thinking mode: is extended thinking off, or limited to low-frequency, high-stakes decisions?

Section D: State and Side Effects

Database reads: which tables · rows per run ___
Database writes: which tables · rows per run ___
File writes: which repos · which files · how many per run ___
GitHub content creation: issues / PRs / comments / commits — how many per run ___
Notifications: Telegram / Slack / email — per run ___
Idempotent: if the job runs twice with identical input, does it produce the same result?
Deduplication: what prevents creating duplicate issues, PRs, or comments?

Section E: Observability

Structured log line: every run emits one parseable summary line (format: ___)
Success metric: there is a measurable output confirming the run succeeded
Dashboard entry: the job appears in the automation registry dashboard
Alert on failure: notification fires if the job fails or produces no output for N cycles

Section F: Kill Switches (all required)

Immediate kill switch exists: env var / feature flag / D1 config (name: ___)
Kill switch tested: confirmed that setting it stops the job within one cycle
Budget kill switch: job halts automatically if daily AI cost exceeds budget
Rate limit kill switch: job backs off automatically on 429/403 responses
Graceful degradation: job serves cached data when external dependencies are unavailable

Section G: Sign-Off

Cost function computed and within 50% of all relevant limits at steady state
All constraints from §5 checked against this job’s usage
AI calls budgeted (if present) with daily cap and cache policy
Kill switch tested in staging or manually triggered
One full run observed end-to-end before scheduling
Registry entry created in .automation/registry.yml

A job that fails any item in sections A–F does not run on a schedule until the item is addressed.
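
The conditional-requests item in Section B can be sketched as follows. This is an illustration under assumptions, not the scanner's actual code: `EtagCache` and `fetchIfChanged` are hypothetical names, and the fetch function is injectable only so the sketch can be exercised offline. The If-None-Match and 304 mechanics are standard HTTP, and GitHub documents that authorized conditional requests answered with 304 do not count against the primary rate limit.

```typescript
// Minimal sketch of a conditional GET backed by an in-memory ETag cache.
type EtagCache = Map<string, { etag: string; body: string }>

async function fetchIfChanged(
  url: string,
  token: string,
  cache: EtagCache,
  fetchFn: typeof fetch = fetch, // injectable for offline testing
): Promise<{ body: string; fromCache: boolean }> {
  const cached = cache.get(url)
  const headers: Record<string, string> = { Authorization: `Bearer ${token}` }
  if (cached) headers['If-None-Match'] = cached.etag

  const res = await fetchFn(url, { headers })

  // 304 Not Modified: the server confirms the cached copy is still current.
  if (res.status === 304 && cached) return { body: cached.body, fromCache: true }

  const body = await res.text()
  const etag = res.headers.get('etag')
  if (res.ok && etag) cache.set(url, { etag, body }) // only cache successful responses
  return { body, fromCache: false }
}
```

In a Worker the in-memory map would not survive between invocations, so a real job would keep the ETag and body in D1, KV, or Durable Object storage instead.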

7. The Automation Registry

Every recurring job in the org must have an entry in the automation registry. This is a YAML file maintained in the brain or ops repo. It is the authoritative source of what is running, why, and at what cost.

Registry Entry Schema

- id: mulan-org-scanner
  type: cloudflare-worker-cron
  repo: garywu/mulan
  deployed_at: 2026-03-20
  schedule: "*/5 * * * *"
  status: active   # active | deprecated | suspended

  purpose: >
    Scans all registered repos for org health signals (CI, README, tooling standards).
    Generates scored dashboard in garywu/_readme. Dispatches fix jobs to the dispatcher queue.

  cost_model:
    github_rest_calls_per_run: 82        # smart scan: 2/repo unchanged, 14/repo changed
    github_rest_calls_per_hour: 984      # 82 × 12 runs/hour
    github_rest_pct_of_limit: "20%"
    ai_calls_per_run: 0
    ai_cost_per_month: "$0.00"
    d1_reads_per_run: 41
    d1_writes_per_run: 41

  limits_checked:
    - github_rest: "5,000/hr"
    - github_graphql: "5,000/hr"
    - cloudflare_subrequests: "1,000/invocation"

  kill_switches:
    - type: env_var
      name: SCANNER_DISABLED
      effect: skip scanning, serve cached data only
    - type: rate_limit_response
      effect: exponential backoff up to 5 min, alert via Telegram

  observability:
    log_format: "[reconciler] scanned N repos: X full, Y fast | api_calls=N | status=ok"
    dashboard: "garywu/_readme (auto-generated)"
    alert_on_failure: telegram

  review:
    reviewed_at: 2026-03-20
    checklist_version: v1
    status: approved
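
Because the registry is the artifact that proves review happened, it is worth checking that its derived numbers agree with each other. The sketch below is one possible consistency check; `runsPerHour` and `costModelErrors` are hypothetical helpers, and the cron parser deliberately handles only `*` and `*/N` in the minute field.

```typescript
// Hypothetical subset of a registry entry, mirroring the YAML fields above.
interface CostModel {
  github_rest_calls_per_run: number
  github_rest_calls_per_hour: number
  github_rest_pct_of_limit: string // e.g. "20%"
}

// Derive runs/hour from the minute field of a five-field cron expression.
// Returns null for anything other than "*" or "*/N".
function runsPerHour(cron: string): number | null {
  const minuteField = cron.trim().split(/\s+/)[0] ?? ''
  if (minuteField === '*') return 60
  const m = /^\*\/(\d+)$/.exec(minuteField)
  if (!m) return null
  const step = Number(m[1])
  return step > 0 && step <= 60 ? Math.ceil(60 / step) : null
}

function costModelErrors(cron: string, model: CostModel, hourlyLimit: number): string[] {
  const runs = runsPerHour(cron)
  if (runs === null) return ['unsupported schedule; compute runs/hour by hand']
  const errors: string[] = []
  const expectedPerHour = model.github_rest_calls_per_run * runs
  if (expectedPerHour !== model.github_rest_calls_per_hour) {
    errors.push(`calls_per_hour should be ${expectedPerHour}`)
  }
  const expectedPct = `${Math.round((expectedPerHour / hourlyLimit) * 100)}%`
  if (expectedPct !== model.github_rest_pct_of_limit) {
    errors.push(`pct_of_limit should be ${expectedPct}`)
  }
  return errors
}
```

Run against the entry above (82 calls per run, 984 per hour, "20%", limit 5,000), the check returns no errors. Tightening the schedule without updating the cost model would surface immediately.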

Registry Location

garywu/brain/
  .automation/
    registry.yml    # all recurring jobs across the org
    limits.yml      # known platform limits, reviewed quarterly
    budget.yml      # monthly AI spend targets per service

The registry is consumed by:

Git Sure — shows “what’s monitoring this repo” in per-repo pages
The org dashboard — displays all active services and their API usage
The review process — the registry entry is the artifact that proves review happened

8. Required Infrastructure: Circuit Breakers and Kill Switches

Every recurring job must implement, at minimum:

1. Rate Limit Awareness

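// notify() is an assumed helper that alerts the operator (Telegram or equivalent).
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

async function ghFetchWithBackoff(url: string, token: string, retriesLeft = 1): Promise<Response> {
  const res = await fetch(url, {
    // GitHub rejects requests without a User-Agent; the value here is illustrative
    headers: { Authorization: `Bearer ${token}`, 'User-Agent': 'mulan-org-scanner' },
  })

  // A 403 is a rate limit only when GitHub says so; a permissions 403 must not be retried
  const retryAfter = res.headers.get('retry-after')
  const rateLimited =
    res.status === 429 ||
    (res.status === 403 && (retryAfter !== null || res.headers.get('x-ratelimit-remaining') === '0'))
  if (!rateLimited || retriesLeft <= 0) return res

  const resetAt = res.headers.get('x-ratelimit-reset') // epoch seconds
  const rawWaitMs = retryAfter
    ? parseInt(retryAfter, 10) * 1000
    : resetAt
      ? parseInt(resetAt, 10) * 1000 - Date.now()
      : 60_000
  // Fall back to 60s on unparsable headers; otherwise clamp to the range 0..5 min
  const waitMs = Number.isFinite(rawWaitMs) ? Math.min(Math.max(rawWaitMs, 0), 300_000) : 60_000

  console.error(`[rate-limit] ${url.slice(0, 60)}: backing off ${Math.ceil(waitMs / 1000)}s`)
  await notify(`⚠️ Rate limit hit, waiting ${Math.ceil(waitMs / 1000)}s`)
  await sleep(waitMs)
  return ghFetchWithBackoff(url, token, retriesLeft - 1) // one retry by default, then give up
}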

2. AI Budget Circuit Breaker

async function callAIWithBudget(
  prompt: string,
  db: D1Database,
  dailyBudgetUsd: number,
): Promise<string | null> {
  const today = new Date().toISOString().split('T')[0]
  const spent = await db
    .prepare(`SELECT COALESCE(SUM(cost_usd), 0) AS total FROM ai_call_log WHERE date = ?`)
    .bind(today)
    .first<{ total: number }>()

  if ((spent?.total ?? 0) >= dailyBudgetUsd) {
    console.warn(`[ai-budget] $${dailyBudgetUsd} daily cap reached — skipping AI call`)
    return null  // caller must handle null gracefully
  }

  const result = await callAI(prompt)
  const costUsd = result.usage.total_tokens * 0.000003 // estimate; calibrate per model

  await db
    .prepare(`INSERT INTO ai_call_log (date, model, tokens, cost_usd) VALUES (?, ?, ?, ?)`)
    .bind(today, result.model, result.usage.total_tokens, costUsd)
    .run()

  return result.content
}

3. Input Change Detection (Cache-Before-Call)

async function processIfChanged(
  cacheKey: string,
  input: string,
  db: D1Database,
  processor: () => Promise<string>,
): Promise<string> {
  // sha256 is an assumed helper that returns a hex digest of the input
  const inputHash = await sha256(input)

  const cached = await db
    .prepare(`SELECT output, input_hash FROM process_cache WHERE key = ?`)
    .bind(cacheKey)
    .first<{ output: string; input_hash: string }>()

  if (cached?.input_hash === inputHash) {
    return cached.output  // cache hit — no expensive call
  }

  const output = await processor()  // only runs when input changed

  await db
    .prepare(`INSERT OR REPLACE INTO process_cache (key, input_hash, output, updated_at)
              VALUES (?, ?, ?, unixepoch())`)
    .bind(cacheKey, inputHash, output)
    .run()

  return output
}

4. Graceful Degradation

When a recurring job cannot complete its full function (rate limited, AI budget exceeded, external API unavailable), it must:

Complete what it can using local or cached data
Log the degraded state with clear reason
Notify the operator (Telegram or equivalent)
Continue to the next cycle at the normal interval — do not spin
Never write partial state as complete state
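
The section title promises kill switches, and the degradation rules above need a single place where the decision to degrade is made. The following is a minimal sketch of that decision, taken before any work starts. `JobEnv`, `RunDecision`, and `decideRun` are hypothetical names; `SCANNER_DISABLED` is the env var from the registry example in §7.

```typescript
// Hypothetical env shape; SCANNER_DISABLED is the kill switch named in the registry example.
interface JobEnv {
  SCANNER_DISABLED?: string
}

type RunDecision =
  | { mode: 'normal' }
  | { mode: 'cached_only'; reason: 'kill_switch' | 'rate_limited' }

// Decide the run mode up front so that flipping the switch takes effect on the next cycle.
function decideRun(env: JobEnv, rateLimitedUntilMs: number, nowMs: number): RunDecision {
  const flag = (env.SCANNER_DISABLED ?? '').trim().toLowerCase()
  if (flag === '1' || flag === 'true') return { mode: 'cached_only', reason: 'kill_switch' }
  if (nowMs < rateLimitedUntilMs) return { mode: 'cached_only', reason: 'rate_limited' }
  return { mode: 'normal' }
}

console.log(decideRun({ SCANNER_DISABLED: 'true' }, 0, Date.now()).mode) // cached_only
```

The `reason` field exists so the degraded state can be logged and sent to the operator without the caller having to reconstruct why the run was skipped.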

9. Observability Is Mandatory, Not Optional

The tight loop article defines the core principle: a system must observe itself. For recurring jobs, this means every single run emits a structured summary before it exits.

Minimum: The Structured Log Line

[mulan-org-scanner] cycle=42 repos=41 full=3 fast=38 jobs_created=5
  api_calls=89 api_pct=1.8% duration_ms=12400 status=ok

This single line answers every operational question about the run:

What ran and how many items were processed
How much API budget was consumed
How long it took
Whether it succeeded

Without this line, you cannot answer “is this job healthy?” without reading full execution logs. That is too slow for incident response.
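
One way to keep the line parseable is to generate it from a typed summary rather than formatting it by hand in each job. The sketch below assumes a hypothetical `RunSummary` shape whose fields follow the example line; the `degraded` and `failed` status values are assumptions, since the example only shows `ok`.

```typescript
// Hypothetical summary shape; field names follow the example log line above.
interface RunSummary {
  job: string
  cycle: number
  repos: number
  full: number
  fast: number
  jobsCreated: number
  apiCalls: number
  hourlyLimit: number
  durationMs: number
  status: 'ok' | 'degraded' | 'failed'
}

// Emit the summary as a single line of key=value pairs.
function formatSummary(s: RunSummary): string {
  const apiPct = ((s.apiCalls / s.hourlyLimit) * 100).toFixed(1)
  return (
    `[${s.job}] cycle=${s.cycle} repos=${s.repos} full=${s.full} fast=${s.fast} ` +
    `jobs_created=${s.jobsCreated} api_calls=${s.apiCalls} api_pct=${apiPct}% ` +
    `duration_ms=${s.durationMs} status=${s.status}`
  )
}

const line = formatSummary({
  job: 'mulan-org-scanner', cycle: 42, repos: 41, full: 3, fast: 38,
  jobsCreated: 5, apiCalls: 89, hourlyLimit: 5_000, durationMs: 12_400, status: 'ok',
})
console.log(line)
```

Calling this in a `finally` block at the end of the run is one way to make sure the line is emitted on failed runs too.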

Dashboard Requirement

Any recurring job that runs for more than one week must appear in a monitoring dashboard. At minimum, the dashboard shows:

Last run time and status
API call count trend (7-day rolling)
AI cost trend (7-day, if applicable)
Success rate (last 100 runs)
Item count trend (growing? shrinking? stable?)

The activity section of the garywu/_readme dashboard (generated by the mulan dispatcher) is an example of this applied to the job system itself: it shows what’s running now, how many fixes happened in the last hour, and the hourly histogram.

10. The Smart Scan Principle

The single most effective cost reduction for polling-based recurring jobs is skipping work when nothing has changed. This applies at every layer.

Layer 1: Push Timestamp Gating

For jobs that check the state of external resources (files, repos, APIs):

if resource.last_modified == cached.last_modified:
    skip expensive checks
    optionally refresh the single most volatile signal (e.g. CI status)
    cost: 1-2 API calls
else:
    full scan
    update cache
    cost: N API calls (full scan)

Applied to the mulan org scanner with 41 repos:

Scenario	API calls/run	Calls/hour	% of limit
Naive (always full scan)	574	6,888	138%
Smart scan (3 repos changed)	118	1,416	28%
Smart scan (0 repos changed)	82	984	20%
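
The gating pseudocode and the table above translate into a small planning step that runs before any expensive call. This is a sketch, assuming a hypothetical `RepoState` shape in which `pushedAt` is the last-push timestamp reported by the API and `cachedPushedAt` is the value stored by the previous run.

```typescript
interface RepoState {
  name: string
  pushedAt: string
  cachedPushedAt: string | null // null means the repo has never been scanned
}

const FULL_SCAN_CALLS = 14 // per changed repo
const FAST_SCAN_CALLS = 2 // per unchanged repo

// Split repos into full and fast scans and estimate the API cost of the run.
function planScan(repos: RepoState[]): { full: string[]; fast: string[]; apiCalls: number } {
  const full: string[] = []
  const fast: string[] = []
  for (const repo of repos) {
    if (repo.pushedAt === repo.cachedPushedAt) {
      fast.push(repo.name)
    } else {
      full.push(repo.name)
    }
  }
  return { full, fast, apiCalls: full.length * FULL_SCAN_CALLS + fast.length * FAST_SCAN_CALLS }
}

// 41 repos, of which the first 3 have been pushed since the last run.
const repos: RepoState[] = Array.from({ length: 41 }, (_, i) => ({
  name: `repo-${i}`,
  pushedAt: i < 3 ? '2026-03-20T10:00:00Z' : '2026-03-19T10:00:00Z',
  cachedPushedAt: '2026-03-19T10:00:00Z',
}))
console.log(planScan(repos).apiCalls) // 118
```

Because the plan is computed before any call is made, its `apiCalls` estimate can also feed the structured log line and the rate limit check.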

Layer 2: Content Deduplication

For jobs that write output (commits, issues, notifications):

if hash(new_content) == hash(existing_content):
    skip write entirely

A dashboard that regenerates every 5 minutes but commits only when its data has changed makes zero GitHub write API calls on quiet cycles, and quiet cycles are the majority of all cycles.

Layer 3: Tiered Polling Frequency

Not all signals need the same polling frequency:

Signal	Natural change rate	Appropriate poll interval
CI status	Changes within minutes of a push	Every run
File-based signals (README, config)	Changes only with a push	Only when `pushed_at` changes
Repo metadata (description, topics)	Changes rarely	Weekly
Registry (repo list)	Changes on explicit edit	On file change only

Structure your jobs to reflect these natural rates. Polling file signals at CI frequency wastes 95% of your API budget.

Layer 4: Event-Driven Flip (the final form)

Push-timestamp gating cuts the cost of an unchanged repo from 14 calls to 2, but every repo is still touched on every cycle, so the total remains O(repos). That is the deeper problem: polling does not scale to hundreds of repos at this polling frequency.

At 200 repos with */5 * * * * and smart scan:

Quiet cycle: 2 calls × 200 repos = 400 calls/run × 12 = 4,800/hr (96% of limit)
At 500 repos: impossible, regardless of how smart the scan is

The polling model is fundamentally O(repos). The correct architecture is O(changes).
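
The ceiling can be computed directly rather than discovered in production. The two functions below are a hypothetical sketch of the comparison: one gives the repo count at which quiet polling cycles alone consume the limit, the other gives the event-driven cost, which depends only on change volume.

```typescript
// How many repos can a polling job cover before quiet cycles alone consume the limit?
function maxPollableRepos(hourlyLimit: number, callsPerRepo: number, runsPerHour: number): number {
  return Math.floor(hourlyLimit / (callsPerRepo * runsPerHour))
}

// Event-driven cost depends on change volume, not on repo count.
function eventDrivenCallsPerDay(pushesPerDay: number, callsPerScan: number): number {
  return pushesPerDay * callsPerScan
}

console.log(maxPollableRepos(5_000, 2, 12)) // 208
console.log(eventDrivenCallsPerDay(50, 14)) // 700
```

Passing 2,500 as the limit applies the 50% rule from §3 and gives a ceiling of 104 repos for a 5-minute smart scan.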

The event-driven flip:

Instead of the scanner reaching out to repos, repos report in when they change:

Repo push → CI completes → notify-readme.yml fires
                                    ↓
                     POST /notify {repo: "garywu/atlas"}
                                    ↓
                     Accumulator (DO storage dirty set)
                     adds repo, arms 5-minute alarm
                     if alarm already set: just accumulate
                                    ↓
                     5-minute alarm fires (batch window closed)
                                    ↓
              scan only the N repos that reported changes
                                    ↓
                       one README commit (or skip if no diff)
                       dirty set cleared

The cron becomes a low-frequency backup sweep (once per day at 3am) to catch things that change without pushes: CI runs completing asynchronously, Dependabot PRs, branch protection changes, new repos added to the registry.

Cost comparison at scale:

Repos	Pushes/day	Polling */5 smart	Event-driven
41	10	~984/hr	10 × 14 = 140/day
200	20	~4,800/hr (96%)	20 × 14 = 280/day
1,000	50	impossible	50 × 14 = 700/day

Event-driven scanning at 1,000 repos costs about 700 calls per day, fewer than smart polling at 41 repos spends in a single hour (~984).

Implementation with Durable Object storage:

// BrainDO: accumulate dirty repos, batch scan fires at next cron tick
async notify(repo: string): Promise<void> {
  const dirty: string[] = await this.state.storage.get('dirty_repos') ?? []
  if (!dirty.includes(repo)) {
    dirty.push(repo)
    await this.state.storage.put('dirty_repos', dirty)
  }
}

async consumeDirty(): Promise<string[]> {
  const dirty: string[] = await this.state.storage.get('dirty_repos') ?? []
  if (dirty.length > 0) await this.state.storage.delete('dirty_repos')
  return dirty
}

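// Cron handler: pop dirty set, scan only changed repos.
// brainDO is assumed to be a stub for the BrainDO instance, resolved from env elsewhere.
async function runScheduled(env: Env): Promise<void> {
  const dirtyRepos = await brainDO.consumeDirty()

  // Daily full sweep: catches things that change without pushes. The cron fires
  // every 5 minutes, so match only the first tick of the 03:00 UTC hour;
  // checking the hour alone would repeat the full sweep up to 12 times.
  const now = new Date()
  if (now.getUTCHours() === 3 && now.getUTCMinutes() < 5) {
    await runReconciler(env) // the full scan also covers any dirty repos
    return
  }

  if (dirtyRepos.length > 0) {
    // O(changes): only scan repos that reported changes
    await runReconciler(env, dirtyRepos)
  }
}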

The cron still fires every 5 minutes, but when the dirty set is empty it exits in milliseconds with zero GitHub API calls. The 5-minute poll becomes the batch window for the event-driven notifications — repos that push multiple times in a 5-minute window are deduplicated into one scan.

Prerequisites for the flip:

Each repo needs a push notification hook (e.g. notify-readme.yml GitHub Actions workflow)
A receiver that accumulates notifications (DO storage, a D1 table, or a queue)
The cron or alarm reads from the accumulator, not from the registry

The notification hook should be deployed to repos as part of initial setup — not as a one-time manual step. This is a standard compliance item, not an enhancement.

11. Governance in Practice: A Worked Example

The mulan org scanner followed a trajectory that illustrates exactly what the review gate prevents.

Phase 1 — Manual experiment: Ran once to generate a dashboard. 574 API calls. Worked fine.

Phase 2 — First schedule (30-min cron): No review. No cost function computed.

574 × 2 runs/hour = 1,148 calls/hour = 23% of limit
Acceptable, but the margin was not known or documented

Phase 3 — Interval tightened (5-min cron): No review. Rate limit math not redone.

574 × 12 runs/hour = 6,888 calls/hour = 138% of limit
Every cycle hit the rate limit and blocked fix-ci jobs downstream

Phase 4 — Debugging: 20 minutes identifying root cause. fix-ci jobs were failing with “API rate limit already exceeded” — a symptom two layers removed from the cause.

Phase 5 — Corrective action:

Smart scan: 574 → 82 calls per run on quiet cycles (86% reduction)
Content deduplication: zero dashboard commits on unchanged data
Constraint database entry created for GitHub REST and GraphQL limits
Automation registry entry created with the verified cost model

Phase 6 — Event-driven flip:

repos report in via notify-readme.yml → POST /notify → BrainDO dirty set
cron pops dirty set: if empty, exits in milliseconds with 0 GitHub API calls
daily 3am sweep catches things that change without pushes
cost: ~140 API calls/day from scanning (was ~984/hour)
scales to 1,000 repos — O(changes), not O(repos)

What a review gate would have done at Phase 3: The cost function at 5-minute intervals returns 6,888 calls/hour against a 5,000/hour limit. This is immediately visible. The gate blocks the deployment until smart scan is implemented, which drops the number to ~984 calls/hour (20% of limit). The rate limiting incident never happens.

The review gate costs 30 minutes to complete. The incident cost more than an hour of debugging and produced silent job failures during the debugging window. The gate has positive expected value on the very first job it catches.

The scaling question would have been caught at Phase 2: Section B of the review checklist asks: “How does cost scale as item count grows?” At 41 repos, polling is manageable. At 200 repos with the same architecture, it is not. The event-driven flip (§10, Layer 4) is the answer to this question, and the checklist forces it to be asked before deployment, not after hitting the wall at scale.

12. The Automation Lifecycle

Every recurring job moves through defined phases with explicit transitions:

EXPERIMENT → REVIEW → APPROVED → MONITORED → DEPRECATED → REMOVED

EXPERIMENT: Runs manually, on demand. No schedule. No registry entry required. This phase has no governance burden — move fast, break things, iterate.

REVIEW: Owner intends to schedule the job. Checklist (§6) is in progress. Job is not yet running on a schedule.

APPROVED: Checklist complete. Registry entry created. Kill switch tested. Job is now running on schedule.

MONITORED: Running in production for more than one week. Dashboard entry active. Quarterly cost review scheduled.

DEPRECATED: Marked for removal in the registry. Still running, but with a removal date set. Downstream consumers have been notified.

REMOVED: Job deleted. Schedule removed. Registry entry archived with reason and date. Dependent systems updated.
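
The lifecycle is simple enough to enforce in code, for example when validating a change to a registry entry. The sketch below assumes that only the forward transitions shown in the diagram are legal, which the text implies but does not state; `canTransition` and `mayRunOnSchedule` are hypothetical names.

```typescript
type Phase = 'EXPERIMENT' | 'REVIEW' | 'APPROVED' | 'MONITORED' | 'DEPRECATED' | 'REMOVED'

// Assumption: only the forward transitions in the diagram above are legal.
const NEXT_PHASE: Record<Phase, Phase | null> = {
  EXPERIMENT: 'REVIEW',
  REVIEW: 'APPROVED',
  APPROVED: 'MONITORED',
  MONITORED: 'DEPRECATED',
  DEPRECATED: 'REMOVED',
  REMOVED: null,
}

function canTransition(from: Phase, to: Phase): boolean {
  return NEXT_PHASE[from] === to
}

// Per the phase descriptions, only these phases may have an active schedule.
function mayRunOnSchedule(phase: Phase): boolean {
  return phase === 'APPROVED' || phase === 'MONITORED' || phase === 'DEPRECATED'
}

console.log(canTransition('EXPERIMENT', 'APPROVED')) // false: review cannot be skipped
```

A stricter or looser transition table, for instance one that lets an APPROVED job be deprecated directly, is a policy choice the text leaves open.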

The Quarterly Review

All APPROVED and MONITORED jobs are reviewed every three months:

Is the job still needed? (What would break if it stopped?)
Has the cost model drifted? (Items processed may have grown)
Are the platform limits still accurate? (Providers change limits)
Can the job be made cheaper? (Better caching, model routing, batching)
Are there newer primitives that change the architecture? (e.g. webhooks replacing polling)

The Brain DO in the org-prime-agent-architecture performs an equivalent review for runner capacity — adjusting allocations weekly based on observed success rates. The quarterly automation review is the same pattern applied at the governance level.

13. Related Articles

The Tight Loop: Systems must observe and correct themselves. The recurring run review is the tight loop applied to automation governance. Treat each scheduled job as a system with its own feedback loop, not a fire-and-forget script.
Org Prime Agent Architecture: How persistent AI agents structure their own recurring operations within defined cost and rate limits. The Brain DO’s model routing policy (Haiku/Sonnet/Opus based on task complexity) is the AI cost governance pattern applied at the agent level. The adaptive capacity system (adjusting runner slots based on 7-day success rates) is the self-correcting circuit breaker applied to agent throughput.
Building an Autonomous Data Pipeline on Cloudflare Workers: Production case study. The * * * * * cron that burned $19.63/month in D1 row reads is the canonical example of the Phase Transition Problem (§1) in production. The priority-based work scheduler and cost ceiling patterns in that article are direct implementations of the principles in §8.

Summary

The phase transition from “runs once” to “runs repeatedly” is the most consequential decision in automation engineering. It requires formal process because the consequences are compounding, silent, and sometimes irreversible.

The non-negotiable requirements:

Requirement	Why
Cost function computed before scheduling	Rate limit disasters are always predictable in advance
AI calls budgeted separately	They cost 100–10,000× more per call than REST API calls
Smart scan / input change detection	80%+ API reduction on typical cycles
Kill switch tested before first scheduled run	If you can’t stop it, you don’t control it
Structured log line on every run	Without it, “is this healthy?” takes too long to answer
Registry entry created	Invisible automation is unmanaged automation
Constraint database referenced	“I didn’t know there was a limit” is preventable

The cost of running this review for a new job: approximately 30 minutes.

The cost of not running it: rate limits exhausted, surprise bills, silent job failures, downstream systems confused, and a debugging session measured in hours rather than seconds.

This article is part of the garywu engineering knowledge base. See also: git-sure#29 (global automation registry implementation).

Last updated: 2026-03-20.