“One API call is an experiment. One thousand API calls per day is an operational commitment.”
Table of Contents
- 1. The Phase Transition Problem
- 2. Why Recurring Is Categorically Different
- 3. The Cost Explosion Model
- 4. The AI Call Multiplier
- 5. Known Limits: The Constraint Database
- 6. The Review Gate: Formal Checklist
- 7. The Automation Registry
- 8. Required Infrastructure: Circuit Breakers and Kill Switches
- 9. Observability Is Mandatory, Not Optional
- 10. The Smart Scan Principle
- 11. Governance in Practice: A Worked Example
- 12. The Automation Lifecycle
- 13. Related Articles
- Summary
1. The Phase Transition Problem
There is a hard boundary between something that runs once and something that runs repeatedly. Most engineers treat this as a smooth continuum. It is not.
When you run a script manually, you are present. You observe the output. You see the error. You stop it. The blast radius is bounded by your attention span.
When you schedule that same script to run every five minutes, you have made a fundamentally different commitment. You are no longer present for each execution. The script runs while you sleep, while you are in meetings, while you are on vacation. Every assumption embedded in that script — about data size, API availability, cost per call, downstream effects — now compounds silently.
This is the phase transition: from experiment to operational system. The mistake is treating it as a deployment decision. It is not. It is an architectural decision that requires formal review.
The rule is simple: anything that will run repeatedly without you watching enters a review process before its first scheduled run.
2. Why Recurring Is Categorically Different
Compounding Effects
A single bad API call costs one error. A bad API call on a five-minute cron costs 288 errors per day. At monthly scale that is 8,640 errors, each potentially triggering downstream state changes, retries, secondary calls, and notifications.
Silent Failure Accumulates
One-off scripts fail loudly — you are watching. Recurring jobs fail quietly — nobody is watching. By the time anyone notices, the damage is done: rate limits exhausted, data corrupted, costs inflated, downstream systems confused.
State Drift
Recurring jobs interact with state. Every run reads something, writes something, or signals something. Over time, these interactions accumulate into a state machine that no individual designed. The job does not just run — it shapes the world it runs in.
The Discovery Problem
Once a recurring job exists, it is difficult to find. It does not appear in your code editor. It does not show up in a pull request. It runs invisibly on a schedule that nobody remembers setting. The Cloudflare autonomous pipeline article documents this directly: an unmonitored * * * * * cron burned $19.63/month in D1 row reads before anyone noticed.
3. The Cost Explosion Model
Every recurring job has a cost function. Before scheduling anything, you must write it down.
The Formula
total_cost_per_month =
(calls_per_run × cost_per_call) × (runs_per_hour × 730)
A job that makes 14 API calls per repo, scans 41 repos, and runs every 5 minutes:
14 × 41 = 574 API calls per run
574 × 12 = 6,888 calls per hour
6,888 × 730 = 5,028,240 calls per month
GitHub’s limit is 5,000 calls per hour. This single job, run naively, demands 138% of the hourly limit, so it is guaranteed to exhaust the quota every hour it runs.
This is not a hypothetical. This is exactly what happened when the mulan org scanner was configured to run every 5 minutes without a smart scan optimization. The rate limit hit was not a surprise — it was predictable from the cost function. The surprise was that nobody computed the function before deploying.
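The projection above is small enough to keep as code next to the job, so it gets recomputed whenever the schedule or the item count changes. A minimal sketch follows; the names `projectCalls` and `HOURS_PER_MONTH` are illustrative and do not come from the mulan codebase.

```typescript
// 730 is the hours-per-month constant used in the formula above.
const HOURS_PER_MONTH = 730;

interface CallProjection {
  callsPerRun: number;
  callsPerHour: number;
  callsPerMonth: number;
  pctOfHourlyLimit: number;
}

// Projects API call volume for a job that makes a fixed number of calls per item.
function projectCalls(
  callsPerItem: number,
  itemCount: number,
  runsPerHour: number,
  hourlyLimit: number,
): CallProjection {
  const callsPerRun = callsPerItem * itemCount;
  const callsPerHour = callsPerRun * runsPerHour;
  return {
    callsPerRun,
    callsPerHour,
    callsPerMonth: callsPerHour * HOURS_PER_MONTH,
    pctOfHourlyLimit: (callsPerHour / hourlyLimit) * 100,
  };
}

// The naive scanner: 14 calls per repo, 41 repos, every 5 minutes, 5,000/hr limit.
const naiveScanner = projectCalls(14, 41, 12, 5000);
console.log(naiveScanner);
```

Running the same function with a 30-minute schedule (`runsPerHour = 2`) gives the Phase 2 numbers discussed later in the article.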
Cost Dimensions to Model
| Dimension | Unit | Common Limits |
|---|---|---|
| GitHub REST API | requests/hour | 5,000/hr (authenticated) |
| GitHub GraphQL API | points/hour | 5,000/hr |
| GitHub secondary rate limit | content creations/min | ~30/min (soft) |
| Cloudflare Worker CPU | ms/invocation | 10ms (free), 30s (paid) |
| Cloudflare D1 reads | rows/day | 5M (free), unlimited (paid) |
| Cloudflare Worker requests | req/day | 100K (free), unlimited (paid) |
| External LLM API | tokens or requests | varies by provider + tier |
| Claude via Agent SDK | requests | shared with Pro/Max quota |
Write the cost function. Compute the monthly projection. If any dimension exceeds 50% of its limit at steady state, the job needs redesign before it runs.
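The 50% rule can be checked mechanically once each dimension has a steady-state number. The sketch below is one way to do it; the dimension names and the `dimensionsOverThreshold` helper are hypothetical, and the usage figures are taken from examples elsewhere in this article.

```typescript
interface DimensionUsage {
  dimension: string;
  steadyStateUsage: number; // must be in the same unit as `limit`
  limit: number;
}

// Returns the dimensions whose steady-state usage exceeds the given fraction of the limit.
function dimensionsOverThreshold(usages: DimensionUsage[], threshold = 0.5): string[] {
  return usages
    .filter((u) => u.steadyStateUsage / u.limit > threshold)
    .map((u) => u.dimension);
}

const flagged = dimensionsOverThreshold([
  { dimension: "github_rest_per_hour (smart scan)", steadyStateUsage: 984, limit: 5000 },
  { dimension: "github_rest_per_hour (naive)", steadyStateUsage: 6888, limit: 5000 },
  { dimension: "d1_row_reads_per_day", steadyStateUsage: 41 * 288, limit: 5_000_000 },
]);
console.log(flagged);
```

Any non-empty result means the job goes back for redesign before it is scheduled.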
4. The AI Call Multiplier
The cost model above applies to all recurring jobs. But AI calls introduce a qualitatively different problem: the cost per call is orders of magnitude higher than standard API calls, and it varies unpredictably by input size.
The $1-per-call Scenario
Consider a job that calls Claude Sonnet to analyze a repository. At current API pricing:
- Input: ~10,000 tokens (repo context, prompt)
- Output: ~2,000 tokens (analysis, recommendations)
- Cost: approximately $0.03–$0.15 per call at standard API rates
That sounds cheap. Now schedule it naively across a portfolio:
$0.10/call × 41 repos × 12 runs/hour = $49.20/hour = $35,916/month
With a $1/call scenario (extended thinking, long context, high-stakes analysis):
$1.00/call × 41 repos × 12 runs/hour = $492/hour = $359,160/month
This is not an edge case. This is the direct consequence of treating AI calls like ordinary API calls — cheap per invocation, catastrophic at scale.
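The same arithmetic as a sketch, so the projection can be rerun when token counts or prices change. Prices are passed in as parameters; the values 3 and 15 USD per million tokens below are placeholders for illustration, not a quoted price list, and the function names are hypothetical.

```typescript
const HOURS_IN_MONTH = 730; // same constant as the cost formula in section 3

// Cost of one call, given token counts and prices in USD per million tokens.
function costPerCallUsd(
  inputTokens: number,
  outputTokens: number,
  inputUsdPerMTok: number,
  outputUsdPerMTok: number,
): number {
  return (
    (inputTokens / 1_000_000) * inputUsdPerMTok +
    (outputTokens / 1_000_000) * outputUsdPerMTok
  );
}

// Monthly cost when every item gets one AI call on every run.
function monthlyAiCostUsd(usdPerCall: number, itemsPerRun: number, runsPerHour: number): number {
  return usdPerCall * itemsPerRun * runsPerHour * HOURS_IN_MONTH;
}

const perCall = costPerCallUsd(10_000, 2_000, 3, 15); // placeholder prices
const naiveMonthly = monthlyAiCostUsd(0.1, 41, 12);
const worstCaseMonthly = monthlyAiCostUsd(1, 41, 12);
console.log(perCall.toFixed(2), naiveMonthly.toFixed(0), worstCaseMonthly.toFixed(0));
```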
AI Call Rules for Recurring Jobs
Rule 1: AI calls must never be in the hot path of a cron loop over many items. The hot path (the inner loop over items per cycle) must be free of AI calls. AI calls belong in asynchronous jobs that are dispatched one at a time, rate-limited, and queued. The org-prime-agent-architecture article describes this separation explicitly: the Brain DO produces a run sheet, the Dispatcher queues individual jobs, and the local runner processes one job at a time with a 20-minute hard timeout per AI call.
Rule 2: Every AI call in a recurring context needs an explicit budget.
ai_calls:
  model: claude-sonnet
  max_calls_per_day: 50
  max_tokens_per_call: 8000
  estimated_daily_cost: $2.50
  kill_switch: true # halt job if daily budget exceeded
Rule 3: AI calls need output caching. If the input (repo state, document, prompt) has not changed since the last call, return the cached output. Never call an AI twice with the same input. Cache key = hash of the input content.
Rule 4: Use the cheapest model that produces acceptable output. The org-prime-agent-architecture article defines a model routing policy: Haiku for structured extraction, Sonnet for planning and code changes, Opus for complex multi-step reasoning. Never use Opus where Haiku produces acceptable results.
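A routing policy like this fits in a lookup table. The sketch below assumes four task categories derived from the sentence above; the category names, the `routeModel` helper, and the idea of passing in a list of tiers already shown to be acceptable are all illustrative, not the Brain DO's actual interface.

```typescript
type TaskKind = "structured-extraction" | "planning" | "code-change" | "multi-step-reasoning";
type ModelTier = "haiku" | "sonnet" | "opus";

// Cheapest first. The ordering reflects the policy above, not a price list.
const TIERS_CHEAPEST_FIRST: ModelTier[] = ["haiku", "sonnet", "opus"];

const DEFAULT_ROUTE: Record<TaskKind, ModelTier> = {
  "structured-extraction": "haiku",
  planning: "sonnet",
  "code-change": "sonnet",
  "multi-step-reasoning": "opus",
};

// Uses the default tier unless a cheaper tier is known to be acceptable for this task.
function routeModel(task: TaskKind, acceptableTiers: ModelTier[] = []): ModelTier {
  const fallback = DEFAULT_ROUTE[task];
  const cheapestAcceptable = TIERS_CHEAPEST_FIRST.find((t) => acceptableTiers.indexOf(t) !== -1);
  if (cheapestAcceptable === undefined) return fallback;
  const cheaper =
    TIERS_CHEAPEST_FIRST.indexOf(cheapestAcceptable) < TIERS_CHEAPEST_FIRST.indexOf(fallback);
  return cheaper ? cheapestAcceptable : fallback;
}

console.log(routeModel("planning"), routeModel("planning", ["haiku"]));
```

The function never routes upward: a more expensive tier has to be a deliberate change to the table, which keeps the justification visible in review.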
Rule 5: Thinking mode is a cost multiplier, not a free upgrade.
Extended thinking at budget_tokens: 10000 can multiply the cost of a call by 10. It belongs in low-frequency, high-stakes decisions, not in any loop.
Rule 6: Prefer local models for hot-path operations. When a job runs frequently over many items, consider whether a local model (running on available GPU, zero marginal cost per call) can replace the cloud API call for that specific task. Reserve API calls for tasks that genuinely require frontier model capability.
5. Known Limits: The Constraint Database
Every external system has documented limits. Before any recurring job goes live, the relevant constraints must be recorded in a shared reference. This is the constraint database — a maintained document that lives in the brain repo alongside the automation registry.
GitHub API Constraints
| Limit | Value | Notes |
|---|---|---|
| REST API rate limit | 5,000 req/hr | Per authenticated user, not per app |
| GraphQL rate limit | 5,000 points/hr | Each field has a cost; connections cost more |
| Secondary rate limit | ~30 content creations/min | Soft; triggers 403 with retry-after header |
| Search API | 30 req/min | Separate quota from REST |
| Actions API | 1,000 req/hr | For workflow dispatch and run queries |
| Repository contents API | counted in REST | Each file fetch = 1 REST call |
Cloudflare Platform Constraints
| Limit | Free Tier | Paid Tier |
|---|---|---|
| Worker requests | 100K/day | Unlimited |
| Worker CPU per request | 10ms | 30s |
| Subrequests per Worker invocation | 50 | 1,000 |
| D1 row reads | 5M/day | Unlimited |
| D1 row writes | 100K/day | Unlimited |
| Durable Object requests | 1M/month | $0.15/M after |
| Cron triggers | 5/account | Unlimited |
| DO alarm precision | 1 minute | 1 minute |
AI Provider Constraints
| Provider | Limit Type | Value |
|---|---|---|
| Anthropic API (Tier 1) | Tokens/min | 50,000 TPM |
| Anthropic API (Tier 4) | Tokens/min | 400,000 TPM |
| Anthropic API (Tier 1) | Requests/min | 50 RPM |
| Claude Code (Pro/Max) | Requests | Shared quota, no published hard number |
| Claude Agent SDK | Concurrency | One session per binary instance |
The constraint database rule: Every recurring job must reference the specific limits it operates within. “I didn’t know there was a limit” is not an acceptable post-mortem. Add any limit you discover to the database before you hit it again.
6. The Review Gate: Formal Checklist
Before any job, script, workflow, or agent is scheduled to run more than once, it must pass this checklist. The completed checklist becomes part of the job’s automation registry entry. A job that cannot answer all questions does not get scheduled.
Section A: Classification
- Job ID: unique name across all services (e.g. mulan-org-scanner)
- Run frequency: what is the schedule? (cron expression, event trigger, alarm interval)
- Trigger type: Cron / event-driven / DO alarm / manual-with-retry / queue consumer
- Item count: how many items does it process per run? (repos, records, files…)
- Termination condition: under what conditions does this job permanently stop running?
Section B: External API Calls
- GitHub REST: calls per run ___ · calls per hour ___ · % of 5,000/hr limit ___
- GitHub GraphQL: points per run ___ · % of 5,000/hr limit ___
- GitHub secondary: content creations per run ___ · within 30/min limit? ___
- Other APIs: (list each with calls/run and limit)
- Smart fetch: does the job skip API calls when data has not changed? (required if >20% of any limit)
- Conditional requests: does the job use ETags or If-Modified-Since headers?
- Scaling question: how does call count scale as item count grows? At 10× items, is the cost still within limits?
- Event-driven alternative: could resources report in when they change, replacing polling entirely? (required if item count may exceed ~100)
- Backoff policy: what happens on a 429 or 403 rate limit response?
Section C: AI Calls
- Makes AI calls: Yes / No (if No, skip this section)
- Model: haiku / sonnet / opus / local / other: ___
- Thinking mode: disabled / adaptive / extended (budget tokens: ___)
- Calls per run: ___
- Estimated input tokens per call: ___
- Estimated output tokens per call: ___
- Estimated cost per call: $___
- Estimated cost per run: $___
- Estimated cost per month: $___
- Input change detection: does it skip the AI call if input is unchanged since last call?
- Output cache: is the result cached against an input hash?
- Daily budget cap: $___ (job halts if exceeded)
- Model routing justification: why this model and not a cheaper one?
- Could a local model replace this call? Yes / No / Partially
Section D: State and Side Effects
- Database reads: which tables · rows per run ___
- Database writes: which tables · rows per run ___
- File writes: which repos · which files · how many per run ___
- GitHub content creation: issues / PRs / comments / commits — how many per run ___
- Notifications: Telegram / Slack / email — per run ___
- Idempotent: if the job runs twice with identical input, does it produce the same result?
- Deduplication: what prevents creating duplicate issues, PRs, or comments?
Section E: Observability
- Structured log line: every run emits one parseable summary line (format: ___)
- Success metric: there is a measurable output confirming the run succeeded
- Dashboard entry: the job appears in the automation registry dashboard
- Alert on failure: notification fires if the job fails or produces no output for N cycles
Section F: Kill Switches (all required)
- Immediate kill switch exists: env var / feature flag / D1 config (name: ___)
- Kill switch tested: confirmed that setting it stops the job within one cycle
- Budget kill switch: job halts automatically if daily AI cost exceeds budget
- Rate limit kill switch: job backs off automatically on 429/403 responses
- Graceful degradation: job serves cached data when external dependencies are unavailable
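The immediate kill switch is the simplest item in this section to implement and to test. The sketch below assumes an environment variable named `SCANNER_DISABLED` (the name used in the registry example in section 7); the set of values treated as "on" and the `runCycle` wrapper are assumptions for illustration.

```typescript
type EnvVars = Record<string, string | undefined>;

// Treats "1", "true", and "yes" (any case) as on. The accepted values are an assumption.
function isKillSwitchOn(env: EnvVars, switchName: string): boolean {
  const raw = (env[switchName] ?? "").trim().toLowerCase();
  return raw === "1" || raw === "true" || raw === "yes";
}

interface CycleResult {
  mode: "scanned" | "cached-only";
  apiCalls: number;
}

// Checks the switch at the top of every cycle, so flipping it takes effect within one cycle.
function runCycle(env: EnvVars, scan: () => number): CycleResult {
  if (isKillSwitchOn(env, "SCANNER_DISABLED")) {
    return { mode: "cached-only", apiCalls: 0 };
  }
  return { mode: "scanned", apiCalls: scan() };
}

console.log(runCycle({ SCANNER_DISABLED: "true" }, () => 82));
```

Testing the switch means calling the cycle with the variable set and confirming the scan function is never invoked.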
Section G: Sign-Off
- Cost function computed, with every relevant dimension at or below 50% of its limit at steady state
- All constraints from §5 checked against this job’s usage
- AI calls budgeted (if present) with daily cap and cache policy
- Kill switch tested in staging or manually triggered
- One full run observed end-to-end before scheduling
- Registry entry created in .automation/registry.yml
A job that fails any item in sections A–F does not run on a schedule until the item is addressed.
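If the checklist is stored as data, the gate itself becomes a function. A minimal sketch, assuming each item is recorded with its section letter and a free-text answer; the `ChecklistItem` shape and helper names are hypothetical.

```typescript
type ChecklistSection = "A" | "B" | "C" | "D" | "E" | "F";

interface ChecklistItem {
  section: ChecklistSection;
  question: string;
  answer: string | null; // null or blank means unanswered
}

// Items that still block scheduling.
function blockingItems(items: ChecklistItem[]): ChecklistItem[] {
  return items.filter((item) => item.answer === null || item.answer.trim() === "");
}

function maySchedule(items: ChecklistItem[]): boolean {
  return blockingItems(items).length === 0;
}

const draft: ChecklistItem[] = [
  { section: "A", question: "Job ID", answer: "mulan-org-scanner" },
  { section: "B", question: "Backoff policy on 429/403", answer: null },
  { section: "F", question: "Kill switch tested", answer: "yes, SCANNER_DISABLED" },
];
console.log(maySchedule(draft), blockingItems(draft).map((i) => i.question));
```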
7. The Automation Registry
Every recurring job in the org must have an entry in the automation registry. This is a YAML file maintained in the brain or ops repo. It is the authoritative source of what is running, why, and at what cost.
Registry Entry Schema
- id: mulan-org-scanner
  type: cloudflare-worker-cron
  repo: garywu/mulan
  deployed_at: 2026-03-20
  schedule: "*/5 * * * *"
  status: active # active | deprecated | suspended
  purpose: >
    Scans all registered repos for org health signals (CI, README, tooling standards).
    Generates scored dashboard in garywu/_readme. Dispatches fix jobs to the dispatcher queue.
  cost_model:
    github_rest_calls_per_run: 82 # smart scan: 2/repo unchanged, 14/repo changed
    github_rest_calls_per_hour: 984 # 82 × 12 runs/hour
    github_rest_pct_of_limit: "20%"
    ai_calls_per_run: 0
    ai_cost_per_month: "$0.00"
    d1_reads_per_run: 41
    d1_writes_per_run: 41
  limits_checked:
    - github_rest: "5,000/hr"
    - github_graphql: "5,000/hr"
    - cloudflare_subrequests: "1,000/invocation"
  kill_switches:
    - type: env_var
      name: SCANNER_DISABLED
      effect: skip scanning, serve cached data only
    - type: rate_limit_response
      effect: exponential backoff up to 5 min, alert via Telegram
  observability:
    log_format: "[reconciler] scanned N repos: X full, Y fast | api_calls=N | status=ok"
    dashboard: "garywu/_readme (auto-generated)"
    alert_on_failure: telegram
  review:
    reviewed_at: 2026-03-20
    checklist_version: v1
    status: approved
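A registry entry is only useful if its cost model stays consistent with its schedule. The sketch below checks that the per-hour figure matches the per-run figure multiplied by the schedule's runs per hour. It understands only `*/N * * * *` and `* * * * *` schedules, which is an assumption made to keep it short; the `CostModel` shape and function names are illustrative.

```typescript
interface CostModel {
  restCallsPerRun: number;
  restCallsPerHour: number;
  restPctOfLimit: number;
}

// Runs per hour for "*/N * * * *" or "* * * * *"; null for anything else.
function runsPerHour(cron: string): number | null {
  const fields = cron.trim().split(/\s+/);
  if (fields.length !== 5 || fields.slice(1).some((f) => f !== "*")) return null;
  const minute = fields[0] ?? "";
  if (minute === "*") return 60;
  const match = /^\*\/(\d+)$/.exec(minute);
  if (!match) return null;
  const step = Number(match[1]);
  return step >= 1 && step <= 59 ? Math.ceil(60 / step) : null;
}

// Returns a list of inconsistencies; an empty list means the entry is coherent.
function costModelErrors(cron: string, model: CostModel, hourlyLimit: number): string[] {
  const runs = runsPerHour(cron);
  if (runs === null) return [`unsupported schedule: ${cron}`];
  const errors: string[] = [];
  const expectedPerHour = model.restCallsPerRun * runs;
  if (model.restCallsPerHour !== expectedPerHour) {
    errors.push(`calls/hour should be ${expectedPerHour}`);
  }
  const expectedPct = (expectedPerHour / hourlyLimit) * 100;
  if (Math.abs(model.restPctOfLimit - expectedPct) > 1) {
    errors.push(`pct of limit should be about ${expectedPct.toFixed(1)}%`);
  }
  return errors;
}

const registryCheck = costModelErrors(
  "*/5 * * * *",
  { restCallsPerRun: 82, restCallsPerHour: 984, restPctOfLimit: 20 },
  5000,
);
console.log(registryCheck);
```

A check like this catches the case where someone tightens the schedule but leaves the cost model untouched.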
Registry Location
garywu/brain/
  .automation/
    registry.yml  # all recurring jobs across the org
    limits.yml    # known platform limits, reviewed quarterly
    budget.yml    # monthly AI spend targets per service
The registry is consumed by:
- Git Sure — shows “what’s monitoring this repo” in per-repo pages
- The org dashboard — displays all active services and their API usage
- The review process — the registry entry is the artifact that proves review happened
8. Required Infrastructure: Circuit Breakers and Kill Switches
Every recurring job must implement, at minimum:
1. Rate Limit Awareness
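// Fetches a GitHub API URL. On a rate-limit response it waits once and retries once.
// notify() and sleep() are helpers assumed to be defined elsewhere in the Worker.
async function ghFetchWithBackoff(url: string, token: string, retried = false): Promise<Response> {
  const res = await fetch(url, { headers: { Authorization: `Bearer ${token}` } })
  const retryAfter = res.headers.get('retry-after')
  const resetAt = res.headers.get('x-ratelimit-reset')
  // A 403 counts as a rate limit only when GitHub signals one; otherwise it is a permissions error.
  const rateLimited =
    res.status === 429 ||
    (res.status === 403 && (retryAfter !== null || res.headers.get('x-ratelimit-remaining') === '0'))
  if (!rateLimited || retried) return res

  const rawWaitMs = retryAfter
    ? parseInt(retryAfter, 10) * 1000
    : resetAt
      ? parseInt(resetAt, 10) * 1000 - Date.now()
      : 60_000
  const waitMs = Math.min(Math.max(rawWaitMs, 1_000), 300_000) // clamp to 1s..5min
  console.error(`[rate-limit] ${url.slice(0, 60)}: backing off ${Math.ceil(waitMs / 1000)}s`)
  await notify(`⚠️ Rate limit hit, waiting ${Math.ceil(waitMs / 1000)}s`)
  await sleep(waitMs)
  return ghFetchWithBackoff(url, token, true) // single retry: the second attempt returns whatever it gets
}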
2. AI Budget Circuit Breaker
// Runs an AI call only if today's logged spend is still under the daily budget.
// callAI() is assumed to be defined elsewhere and to return { content, model, usage }.
async function callAIWithBudget(
  prompt: string,
  db: D1Database,
  dailyBudgetUsd: number,
): Promise<string | null> {
  const today = new Date().toISOString().split('T')[0] // UTC date, e.g. 2026-03-20
  const spent = await db
    .prepare(`SELECT COALESCE(SUM(cost_usd), 0) AS total FROM ai_call_log WHERE date = ?`)
    .bind(today)
    .first<{ total: number }>()
  if ((spent?.total ?? 0) >= dailyBudgetUsd) {
    console.warn(`[ai-budget] $${dailyBudgetUsd} daily cap reached, skipping AI call`)
    return null // caller must handle null gracefully
  }
  const result = await callAI(prompt)
  const costUsd = result.usage.total_tokens * 0.000003 // estimate; calibrate per model
  // Log the spend so the next call sees it in the daily total.
  await db
    .prepare(`INSERT INTO ai_call_log (date, model, tokens, cost_usd) VALUES (?, ?, ?, ?)`)
    .bind(today, result.model, result.usage.total_tokens, costUsd)
    .run()
  return result.content
}
3. Input Change Detection (Cache-Before-Call)
// Runs `processor` only when the input differs from the last processed input.
// sha256() is assumed to be a helper that returns a hex digest of a string.
async function processIfChanged(
  cacheKey: string,
  input: string,
  db: D1Database,
  processor: () => Promise<string>,
): Promise<string | null> {
  const inputHash = await sha256(input)
  const cached = await db
    .prepare(`SELECT output, input_hash FROM process_cache WHERE key = ?`)
    .bind(cacheKey)
    .first<{ output: string; input_hash: string }>()
  if (cached?.input_hash === inputHash) {
    return cached.output // cache hit: no expensive call
  }
  const output = await processor() // only runs when input changed
  // Overwrite the cache row so the next identical input is a hit.
  await db
    .prepare(`INSERT OR REPLACE INTO process_cache (key, input_hash, output, updated_at)
              VALUES (?, ?, ?, unixepoch())`)
    .bind(cacheKey, inputHash, output)
    .run()
  return output
}
4. Graceful Degradation
When a recurring job cannot complete its full function (rate limited, AI budget exceeded, external API unavailable), it must:
- Complete what it can using local or cached data
- Log the degraded state with clear reason
- Notify the operator (Telegram or equivalent)
- Continue to the next cycle at the normal interval — do not spin
- Never write partial state as complete state
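These rules can be sketched as a wrapper around the fetch step. The version below is synchronous for brevity (a Worker version would be async), and the `JobSnapshot` shape, the `complete` flag, and the `notifyOperator` callback are assumptions for illustration.

```typescript
interface JobSnapshot<T> {
  data: T;
  complete: boolean; // false means the data came from cache, not a fresh run
  degradedReason?: string;
}

// Tries the fresh path; on failure, falls back to the last good data and flags it.
function runWithDegradation<T>(
  fetchFresh: () => T,
  lastGood: JobSnapshot<T>,
  notifyOperator: (message: string) => void,
): JobSnapshot<T> {
  try {
    return { data: fetchFresh(), complete: true };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    console.warn(`[degraded] serving cached data: ${reason}`);
    notifyOperator(`Job degraded: ${reason}`);
    // Marked incomplete so nothing downstream records cached data as a fresh result.
    return { data: lastGood.data, complete: false, degradedReason: reason };
  }
}

const sent: string[] = [];
const degraded = runWithDegradation<number[]>(
  () => {
    throw new Error("rate limited");
  },
  { data: [1, 2, 3], complete: true },
  (message) => sent.push(message),
);
console.log(degraded, sent);
```

The wrapper returns instead of retrying, so the next attempt happens at the next scheduled cycle rather than in a tight loop.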
9. Observability Is Mandatory, Not Optional
The tight loop article defines the core principle: a system must observe itself. For recurring jobs, this means every single run emits a structured summary before it exits.
Minimum: The Structured Log Line
[mulan-org-scanner] cycle=42 repos=41 full=3 fast=38 jobs_created=5
api_calls=89 api_pct=1.8% duration_ms=12400 status=ok
This single line answers every operational question about the run:
- What ran and how many items were processed
- How much API budget was consumed
- How long it took
- Whether it succeeded
Without this line, you cannot answer “is this job healthy?” without reading full execution logs. That is too slow for incident response.
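One way to keep the line consistent is to generate it from a typed record rather than by string concatenation scattered through the job. A sketch, using the field names from the example above; the `RunSummary` shape, the `error` status value, and both helpers are illustrative.

```typescript
interface RunSummary {
  job: string;
  cycle: number;
  repos: number;
  full: number;
  fast: number;
  jobsCreated: number;
  apiCalls: number;
  hourlyLimit: number;
  durationMs: number;
  ok: boolean;
}

// Renders one key=value summary line per run.
function formatSummary(s: RunSummary): string {
  const apiPct = ((s.apiCalls / s.hourlyLimit) * 100).toFixed(1);
  return [
    `[${s.job}]`,
    `cycle=${s.cycle}`,
    `repos=${s.repos}`,
    `full=${s.full}`,
    `fast=${s.fast}`,
    `jobs_created=${s.jobsCreated}`,
    `api_calls=${s.apiCalls}`,
    `api_pct=${apiPct}%`,
    `duration_ms=${s.durationMs}`,
    `status=${s.ok ? "ok" : "error"}`,
  ].join(" ");
}

// Splits key=value tokens back out, for alerting or dashboards.
function parseSummary(line: string): Record<string, string> {
  const fields: Record<string, string> = {};
  for (const token of line.split(" ")) {
    const eq = token.indexOf("=");
    if (eq > 0) fields[token.slice(0, eq)] = token.slice(eq + 1);
  }
  return fields;
}

const summaryLine = formatSummary({
  job: "mulan-org-scanner", cycle: 42, repos: 41, full: 3, fast: 38,
  jobsCreated: 5, apiCalls: 89, hourlyLimit: 5000, durationMs: 12400, ok: true,
});
console.log(summaryLine);
```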
Dashboard Requirement
Any recurring job that runs for more than one week must appear in a monitoring dashboard. At minimum, the dashboard shows:
- Last run time and status
- API call count trend (7-day rolling)
- AI cost trend (7-day, if applicable)
- Success rate (last 100 runs)
- Item count trend (growing? shrinking? stable?)
The activity section of the garywu/_readme dashboard (generated by the mulan dispatcher) is an example of this applied to the job system itself: it shows what’s running now, how many fixes happened in the last hour, and the hourly histogram.
10. The Smart Scan Principle
The single most effective cost reduction for polling-based recurring jobs is skipping work when nothing has changed. This applies at every layer.
Layer 1: Push Timestamp Gating
For jobs that check the state of external resources (files, repos, APIs):
if resource.last_modified == cached.last_modified:
    skip expensive checks
    optionally refresh the single most volatile signal (e.g. CI status)
    cost: 1-2 API calls
else:
    full scan
    update cache
    cost: N API calls (full scan)
Applied to the mulan org scanner with 41 repos:
| Scenario | API calls/run | Calls/hour | % of limit |
|---|---|---|---|
| Naive (always full scan) | 574 | 6,888 | 138% |
| Smart scan (3 repos changed) | 118 | 1,416 | 28% |
| Smart scan (0 repos changed) | 82 | 984 | 20% |
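The table's numbers fall out of a simple partition of repos into changed and unchanged. A sketch under the assumption that the gating signal is each repo's `pushed_at` timestamp and that the per-repo costs are 2 calls (fast) and 14 calls (full); the `planScan` function and its types are hypothetical.

```typescript
interface RepoState {
  repo: string;
  pushedAt: string;
}

interface ScanPlan {
  full: string[];
  fast: string[];
  apiCalls: number;
}

const FAST_CALLS_PER_REPO = 2;
const FULL_CALLS_PER_REPO = 14;

// Repos whose pushedAt matches the cached value get the fast path; all others get a full scan.
function planScan(current: RepoState[], cachedPushedAt: Map<string, string>): ScanPlan {
  const full: string[] = [];
  const fast: string[] = [];
  for (const r of current) {
    (cachedPushedAt.get(r.repo) === r.pushedAt ? fast : full).push(r.repo);
  }
  const apiCalls = full.length * FULL_CALLS_PER_REPO + fast.length * FAST_CALLS_PER_REPO;
  return { full, fast, apiCalls };
}

// 41 repos, all cached; then 3 of them receive a push.
const scannedRepos: RepoState[] = Array.from({ length: 41 }, (_, i) => ({
  repo: `repo-${i}`,
  pushedAt: "2026-03-19T00:00:00Z",
}));
const pushedAtCache = new Map<string, string>();
for (const r of scannedRepos) pushedAtCache.set(r.repo, r.pushedAt);

const quietPlan = planScan(scannedRepos, pushedAtCache);
const busyPlan = planScan(
  scannedRepos.map((r, i) => (i < 3 ? { ...r, pushedAt: "2026-03-20T00:00:00Z" } : r)),
  pushedAtCache,
);
const coldPlan = planScan(scannedRepos, new Map<string, string>());
console.log(quietPlan.apiCalls, busyPlan.apiCalls, coldPlan.apiCalls);
```

An empty cache degrades to the naive full scan, which is the right behavior on a first run.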
Layer 2: Content Deduplication
For jobs that write output (commits, issues, notifications):
if hash(new_content) == hash(existing_content):
    skip write entirely
A dashboard that regenerates every 5 minutes but commits only when the data changed consumes zero GitHub write API calls on quiet cycles, and quiet cycles are the majority.
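A sketch of the write guard follows. It uses a 32-bit FNV-1a fingerprint as a stand-in for whatever hash the real job uses, which is an assumption; a cryptographic hash such as SHA-256 is the safer choice where collisions matter. The `makeDedupedWriter` helper is hypothetical, and in a Worker the last fingerprint would need to be persisted between invocations rather than held in memory.

```typescript
// Non-cryptographic 32-bit FNV-1a over UTF-16 code units, rendered as 8 hex digits.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("00000000" + hash.toString(16)).slice(-8);
}

// Wraps a commit function so it only fires when the content fingerprint changes.
function makeDedupedWriter(commit: (content: string) => void): (content: string) => boolean {
  let lastFingerprint: string | null = null;
  return (content: string): boolean => {
    const fingerprint = fnv1a(content);
    if (fingerprint === lastFingerprint) return false; // unchanged: skip the write entirely
    commit(content);
    lastFingerprint = fingerprint;
    return true;
  };
}

const committed: string[] = [];
const writeDashboard = makeDedupedWriter((content) => committed.push(content));
const writeResults = [
  writeDashboard("score: 91"),
  writeDashboard("score: 91"), // quiet cycle
  writeDashboard("score: 93"),
];
console.log(writeResults, committed.length);
```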
Layer 3: Tiered Polling Frequency
Not all signals need the same polling frequency:
| Signal | Natural change rate | Appropriate poll interval |
|---|---|---|
| CI status | Changes within minutes of a push | Every run |
| File-based signals (README, config) | Changes only with a push | Only when pushed_at changes |
| Repo metadata (description, topics) | Changes rarely | Weekly |
| Registry (repo list) | Changes on explicit edit | On file change only |
Structure your jobs to reflect these natural rates. Polling file signals at CI frequency wastes 95% of your API budget.
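The table translates into a per-repo decision about which signals are due on a given run. A minimal sketch; the signal names, the `RepoPollState` shape, and the `signalsDue` helper are illustrative, and the registry row is omitted because it is driven by a file change rather than by the per-repo loop.

```typescript
type PollSignal = "ci_status" | "file_signals" | "repo_metadata";

interface RepoPollState {
  pushedAtChanged: boolean;
  lastMetadataCheckMs: number; // epoch milliseconds of the last metadata refresh
}

const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

// Decides which signals to refresh for one repo on this run.
function signalsDue(state: RepoPollState, nowMs: number): PollSignal[] {
  const due: PollSignal[] = ["ci_status"]; // every run
  if (state.pushedAtChanged) due.push("file_signals"); // only after a push
  if (nowMs - state.lastMetadataCheckMs >= WEEK_MS) due.push("repo_metadata"); // weekly
  return due;
}

const nowMs = Date.UTC(2026, 2, 20);
console.log(signalsDue({ pushedAtChanged: false, lastMetadataCheckMs: nowMs - 1000 }, nowMs));
```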
Layer 4: Event-Driven Flip (the final form)
Push-timestamp gating cuts the per-repo cost on quiet cycles from 14 calls to 2, but the total still grows linearly with the number of repos. That is the deeper problem: polling does not scale to hundreds of repos at any useful frequency.
At 200 repos with */5 * * * * and smart scan:
- Quiet cycle: 2 calls × 200 repos = 400 calls/run × 12 = 4,800/hr (96% of limit)
- At 500 repos: impossible, regardless of how smart the scan is
The polling model is fundamentally O(repos). The correct architecture is O(changes).
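The ceiling is easy to compute from the quiet-cycle cost. The sketch below assumes 2 calls per unchanged repo and 12 runs per hour, as in the figures above; the function names are hypothetical.

```typescript
// Hourly call volume when every repo takes the fast path on every run.
function quietCycleCallsPerHour(repos: number, fastCallsPerRepo = 2, runsPerHour = 12): number {
  return repos * fastCallsPerRepo * runsPerHour;
}

// Largest repo count whose quiet-cycle polling still fits inside the hourly limit.
function maxReposUnderLimit(hourlyLimit: number, fastCallsPerRepo = 2, runsPerHour = 12): number {
  return Math.floor(hourlyLimit / (fastCallsPerRepo * runsPerHour));
}

console.log(quietCycleCallsPerHour(200), quietCycleCallsPerHour(500), maxReposUnderLimit(5000));
```

The ceiling assumes no repo ever changes and leaves no headroom for any other job sharing the token, so the practical limit is considerably lower.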
The event-driven flip:
Instead of the scanner reaching out to repos, repos report in when they change:
Repo push → CI completes → notify-readme.yml fires
↓
POST /notify {repo: "garywu/atlas"}
↓
Accumulator (DO storage dirty set)
adds repo, arms 5-minute alarm
if alarm already set: just accumulate
↓
5-minute alarm fires (batch window closed)
↓
scan only the N repos that reported changes
↓
one README commit (or skip if no diff)
dirty set cleared
The cron becomes a low-frequency backup sweep (once per day at 3am) to catch things that change without pushes: CI runs completing asynchronously, Dependabot PRs, branch protection changes, new repos added to the registry.
Cost comparison at scale:
| Repos | Pushes/day | Smart polling at */5 | Event-driven |
|---|---|---|---|
| 41 | 10 | ~984/hr | 10 × 14 = 140/day |
| 200 | 20 | ~4,800/hr (96%) | 20 × 14 = 280/day |
| 1,000 | 50 | impossible | 50 × 14 = 700/day |
Event-driven scanning at 1,000 repos costs about 700 calls per day, which is less than a single hour of smart polling at 41 repos (~984 calls).
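The two columns of the table come from two different cost functions: one driven by repo count, the other by push count. A sketch, assuming 14 calls per full scan and 2 calls per unchanged repo as elsewhere in this article; the function names are illustrative, and the daily backup sweep is left out, as it is in the table.

```typescript
// Polling cost depends on how many repos exist, regardless of activity.
function pollingCallsPerDay(repos: number, fastCallsPerRepo = 2, runsPerHour = 12): number {
  return repos * fastCallsPerRepo * runsPerHour * 24;
}

// Event-driven cost depends only on how many pushes happen.
function eventDrivenCallsPerDay(pushesPerDay: number, callsPerFullScan = 14): number {
  return pushesPerDay * callsPerFullScan;
}

const scenarios = [
  { repos: 41, pushesPerDay: 10 },
  { repos: 200, pushesPerDay: 20 },
  { repos: 1000, pushesPerDay: 50 },
].map((s) => ({
  ...s,
  polling: pollingCallsPerDay(s.repos),
  eventDriven: eventDrivenCallsPerDay(s.pushesPerDay),
}));
console.log(scenarios);
```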
Implementation with Durable Object storage:
// BrainDO methods: accumulate dirty repos; the batch scan fires at the next cron tick
async notify(repo: string): Promise<void> {
  const dirty = (await this.state.storage.get<string[]>('dirty_repos')) ?? []
  if (!dirty.includes(repo)) {
    dirty.push(repo)
    await this.state.storage.put('dirty_repos', dirty)
  }
}

// Returns the accumulated repos and clears the set
async consumeDirty(): Promise<string[]> {
  const dirty = (await this.state.storage.get<string[]>('dirty_repos')) ?? []
  if (dirty.length > 0) await this.state.storage.delete('dirty_repos')
  return dirty
}

// Cron handler: pop dirty set, scan only changed repos.
// brainDO is the Durable Object stub, obtained elsewhere (not shown).
async function runScheduled(env: Env): Promise<void> {
  const dirtyRepos = await brainDO.consumeDirty()
  const now = new Date()
  // Daily full sweep on the first tick of the 03:00 UTC hour only. Checking the hour
  // alone would repeat the sweep on all 12 ticks of that hour. The full scan also
  // covers any dirty repos, so it is checked first.
  if (now.getUTCHours() === 3 && now.getUTCMinutes() < 5) {
    await runReconciler(env)
    return
  }
  if (dirtyRepos.length > 0) {
    // O(changes): only scan repos that reported changes
    await runReconciler(env, dirtyRepos)
  }
}
In this implementation the cron tick stands in for the DO alarm: it still fires every 5 minutes, but when the dirty set is empty it exits in milliseconds with zero GitHub API calls. The 5-minute tick becomes the batch window for the event-driven notifications, so repos that push multiple times within one window are deduplicated into one scan.
Prerequisites for the flip:
- Each repo needs a push notification hook (e.g. a notify-readme.yml GitHub Actions workflow)
- A receiver that accumulates notifications (DO storage, a D1 table, or a queue)
- The cron or alarm reads from the accumulator, not from the registry
The notification hook should be deployed to repos as part of initial setup — not as a one-time manual step. This is a standard compliance item, not an enhancement.
11. Governance in Practice: A Worked Example
The mulan org scanner followed a trajectory that illustrates exactly what the review gate prevents.
Phase 1 — Manual experiment: Ran once to generate a dashboard. 574 API calls. Worked fine.
Phase 2 — First schedule (30-min cron): No review. No cost function computed.
- 574 × 2 runs/hour = 1,148 calls/hour = 23% of limit
- Acceptable, but the margin was not known or documented
Phase 3 — Interval tightened (5-min cron): No review. Rate limit math not redone.
- 574 × 12 runs/hour = 6,888 calls/hour = 138% of limit
- Every cycle hit the rate limit and blocked fix-ci jobs downstream
Phase 4 — Debugging: 20 minutes identifying root cause. fix-ci jobs were failing with “API rate limit already exceeded” — a symptom two layers removed from the cause.
Phase 5 — Corrective action:
- Smart scan: 574 → 82 calls per run on quiet cycles (86% reduction)
- Content deduplication: zero dashboard commits on unchanged data
- Constraint database entry created for GitHub REST and GraphQL limits
- Automation registry entry created with the verified cost model
Phase 6 — Event-driven flip:
- repos report in via notify-readme.yml → POST /notify → BrainDO dirty set
- cron pops dirty set: if empty, exits in milliseconds with 0 GitHub API calls
- daily 3am sweep catches things that change without pushes
- cost: ~140 API calls/day from scanning (was ~984/hour)
- scales to 1,000 repos — O(changes), not O(repos)
What a review gate would have done at Phase 3: The cost function at 5-minute intervals returns 6,888 calls/hour against a 5,000/hour limit. This is immediately visible. The gate blocks the deployment until smart scan is implemented, which drops the number to ~984 calls/hour (20% of limit). The rate limiting incident never happens.
The review gate costs 30 minutes to complete. The incident cost more than an hour of debugging and produced silent job failures during the debugging window. The gate has positive expected value on the very first job it catches.
The scaling question would have been caught at Phase 2: Section B of the review checklist asks: “How does cost scale as item count grows?” At 41 repos, polling is manageable. At 200 repos with the same architecture, it is not. The event-driven flip (§10, Layer 4) is the answer to this question, and the checklist forces it to be asked before deployment, not after hitting the wall at scale.
12. The Automation Lifecycle
Every recurring job moves through defined phases with explicit transitions:
EXPERIMENT → REVIEW → APPROVED → MONITORED → DEPRECATED → REMOVED
EXPERIMENT: Runs manually, on demand. No schedule. No registry entry required. This phase has no governance burden — move fast, break things, iterate.
REVIEW: Owner intends to schedule the job. Checklist (§6) is in progress. Job is not yet running on a schedule.
APPROVED: Checklist complete. Registry entry created. Kill switch tested. Job is now running on schedule.
MONITORED: Running in production for more than one week. Dashboard entry active. Quarterly cost review scheduled.
DEPRECATED: Marked for removal in the registry. Still running, but with a removal date set. Downstream consumers have been notified.
REMOVED: Job deleted. Schedule removed. Registry entry archived with reason and date. Dependent systems updated.
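The lifecycle can be enforced as a small state machine in whatever tool edits the registry. The sketch below models only the forward chain shown above; whether a job may skip a phase or move backwards (for example, from REVIEW back to EXPERIMENT) is not specified here, so it is deliberately not modeled. The `canTransition` and `advance` helpers are hypothetical.

```typescript
type Phase = "EXPERIMENT" | "REVIEW" | "APPROVED" | "MONITORED" | "DEPRECATED" | "REMOVED";

const PHASE_ORDER: Phase[] = [
  "EXPERIMENT",
  "REVIEW",
  "APPROVED",
  "MONITORED",
  "DEPRECATED",
  "REMOVED",
];

// Only a single step forward along the chain is allowed.
function canTransition(from: Phase, to: Phase): boolean {
  return PHASE_ORDER.indexOf(to) === PHASE_ORDER.indexOf(from) + 1;
}

// Returns the new phase, or throws if the move skips or reverses a step.
function advance(current: Phase, to: Phase): Phase {
  if (!canTransition(current, to)) {
    throw new Error(`illegal transition ${current} -> ${to}`);
  }
  return to;
}

console.log(canTransition("REVIEW", "APPROVED"), canTransition("EXPERIMENT", "APPROVED"));
```

Rejecting EXPERIMENT to APPROVED is the point: it is the transition that skips the review gate.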
The Quarterly Review
All APPROVED and MONITORED jobs are reviewed every three months:
- Is the job still needed? (What would break if it stopped?)
- Has the cost model drifted? (Items processed may have grown)
- Are the platform limits still accurate? (Providers change limits)
- Can the job be made cheaper? (Better caching, model routing, batching)
- Are there newer primitives that change the architecture? (e.g. webhooks replacing polling)
The Brain DO in the org-prime-agent-architecture performs an equivalent review for runner capacity — adjusting allocations weekly based on observed success rates. The quarterly automation review is the same pattern applied at the governance level.
13. Related Articles
- The Tight Loop: Systems must observe and correct themselves. The recurring run review is the tight loop applied to automation governance: treat each scheduled job as a system with its own feedback loop, not a fire-and-forget script.
- Org Prime Agent Architecture: How persistent AI agents structure their own recurring operations within defined cost and rate limits. The Brain DO’s model routing policy (Haiku/Sonnet/Opus based on task complexity) is the AI cost governance pattern applied at the agent level. The adaptive capacity system (adjusting runner slots based on 7-day success rates) is the self-correcting circuit breaker applied to agent throughput.
- Building an Autonomous Data Pipeline on Cloudflare Workers: Production case study. The * * * * * cron that burned $19.63/month in D1 row reads is the canonical example of the Phase Transition Problem (§1) in production. The priority-based work scheduler and cost ceiling patterns in that article are direct implementations of the principles in §8.
Summary
The phase transition from “runs once” to “runs repeatedly” is the most consequential decision in automation engineering. It requires formal process because the consequences are compounding, silent, and sometimes irreversible.
The non-negotiable requirements:
| Requirement | Why |
|---|---|
| Cost function computed before scheduling | Rate limit disasters are always predictable in advance |
| AI calls budgeted separately | They cost 100–10,000× more per call than REST API calls |
| Smart scan / input change detection | 80%+ API reduction on typical cycles |
| Kill switch tested before first scheduled run | If you can’t stop it, you don’t control it |
| Structured log line on every run | Without it, “is this healthy?” takes too long to answer |
| Registry entry created | Invisible automation is unmanaged automation |
| Constraint database referenced | “I didn’t know there was a limit” is preventable |
The cost of running this review for a new job: approximately 30 minutes.
The cost of not running it: rate limits exhausted, surprise bills, silent job failures, downstream systems confused, and a debugging session measured in hours rather than seconds.
This article is part of the garywu engineering knowledge base. See also: git-sure#29 (global automation registry implementation).
Last updated: 2026-03-20.