Gary Wu

The Recurring Run Review


“One API call is an experiment. One thousand API calls per day is an operational commitment.”



1. The Phase Transition Problem

There is a hard boundary between something that runs once and something that runs repeatedly. Most engineers treat this as a smooth continuum. It is not.

When you run a script manually, you are present. You observe the output. You see the error. You stop it. The blast radius is bounded by your attention span.

When you schedule that same script to run every five minutes, you have made a fundamentally different commitment. You are no longer present for each execution. The script runs while you sleep, while you are in meetings, while you are on vacation. Every assumption embedded in that script — about data size, API availability, cost per call, downstream effects — now compounds silently.

This is the phase transition: from experiment to operational system. The mistake is treating it as a deployment decision. It is not. It is an architectural decision that requires formal review.

The rule is simple: anything that runs more than once enters a review process before it runs the second time.


2. Why Recurring Is Categorically Different

Compounding Effects

A single bad API call costs one error. A bad API call on a five-minute cron costs 288 errors per day. At monthly scale that is 8,640 errors, each potentially triggering downstream state changes, retries, secondary calls, and notifications.

Silent Failure Accumulates

One-off scripts fail loudly — you are watching. Recurring jobs fail quietly — nobody is watching. By the time anyone notices, the damage is done: rate limits exhausted, data corrupted, costs inflated, downstream systems confused.

State Drift

Recurring jobs interact with state. Every run reads something, writes something, or signals something. Over time, these interactions accumulate into a state machine that no individual designed. The job does not just run — it shapes the world it runs in.

The Discovery Problem

Once a recurring job exists, it is difficult to find. It does not appear in your code editor. It does not show up in a pull request. It runs invisibly on a schedule that nobody remembers setting. The Cloudflare autonomous pipeline article documents this directly: an unmonitored * * * * * cron burned $19.63/month in D1 row reads before anyone noticed.


3. The Cost Explosion Model

Every recurring job has a cost function. Before scheduling anything, you must write it down.

The Formula

total_cost_per_month =
  (calls_per_run × cost_per_call) × (runs_per_hour × 730)

A job that makes 14 API calls per repo, scans 41 repos, and runs every 5 minutes:

14 × 41 = 574 API calls per run
574 × 12 = 6,888 calls per hour
6,888 × 730 = 5,028,240 calls per month

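GitHub’s limit is 5,000 calls per hour. This single job, run naively, demands 138% of the hourly limit: the quota is exhausted on the ninth run of each hour (8 × 574 = 4,592 calls fit, the ninth run does not), and every remaining run in that hour is rate limited.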

This is not a hypothetical. This is exactly what happened when the mulan org scanner was configured to run every 5 minutes without a smart scan optimization. The rate limit hit was not a surprise — it was predictable from the cost function. The surprise was that nobody computed the function before deploying.

Cost Dimensions to Model

| Dimension | Unit | Common Limits |
|---|---|---|
| GitHub REST API | requests/hour | 5,000/hr (authenticated) |
| GitHub GraphQL API | points/hour | 5,000/hr |
| GitHub secondary rate limit | content creations/min | ~30/min (soft) |
| Cloudflare Worker CPU | ms/invocation | 10ms (free), 30s (paid) |
| Cloudflare D1 reads | rows/day | 5M (free), unlimited (paid) |
| Cloudflare Worker requests | req/day | 100K (free), unlimited (paid) |
| External LLM API | tokens or requests | varies by provider + tier |
| Claude via Agent SDK | requests | shared with Pro/Max quota |

Write the cost function. Compute the monthly projection. If any dimension exceeds 50% of its limit at steady state, the job needs redesign before it runs.
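The projection is mechanical enough to script. What follows is a minimal sketch, not the tooling used for the mulan scanner: `projectUsage`, `JobLoad`, and the field names are hypothetical, while the 730 hours per month and the 50% threshold are taken from the text above.

```typescript
// Hypothetical helper: projects usage of one rate-limited dimension.
interface JobLoad {
  callsPerItem: number // e.g. 14 API calls per repo
  items: number        // e.g. 41 repos
  runsPerHour: number  // e.g. 12 for a 5-minute cron
}

interface UsageProjection {
  callsPerRun: number
  callsPerHour: number
  callsPerMonth: number
  pctOfHourlyLimit: number
  needsRedesign: boolean
}

const HOURS_PER_MONTH = 730

function projectUsage(load: JobLoad, hourlyLimit: number): UsageProjection {
  const callsPerRun = load.callsPerItem * load.items
  const callsPerHour = callsPerRun * load.runsPerHour
  const pctOfHourlyLimit = (callsPerHour / hourlyLimit) * 100
  return {
    callsPerRun,
    callsPerHour,
    callsPerMonth: callsPerHour * HOURS_PER_MONTH,
    pctOfHourlyLimit,
    // Rule from the text: above 50% of a limit at steady state means redesign.
    needsRedesign: pctOfHourlyLimit > 50,
  }
}

// The naive scanner: 14 calls per repo, 41 repos, every 5 minutes.
const naiveScan = projectUsage({ callsPerItem: 14, items: 41, runsPerHour: 12 }, 5000)
console.log(naiveScan.callsPerHour, Math.round(naiveScan.pctOfHourlyLimit)) // prints: 6888 138
```

Running the same function with `callsPerItem: 2` gives 984 calls per hour, which is under the 50% threshold.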


4. The AI Call Multiplier

The cost model above applies to all recurring jobs. But AI calls introduce a qualitatively different problem: the cost per call is orders of magnitude higher than standard API calls, and it varies unpredictably by input size.

The $1-per-call Scenario

Consider a job that calls Claude Sonnet to analyze a repository. Assume each call costs roughly $0.10 at current API pricing.

That sounds cheap. Now schedule it naively across a portfolio:

$0.10/call × 41 repos × 12 runs/hour = $49.20/hour = $35,916/month

With a $1/call scenario (extended thinking, long context, high-stakes analysis):

$1.00/call × 41 repos × 12 runs/hour = $492/hour = $359,160/month

This is not an edge case. This is the direct consequence of treating AI calls like ordinary API calls — cheap per invocation, catastrophic at scale.

AI Call Rules for Recurring Jobs

Rule 1: AI calls must never be in the hot path of a cron loop over many items. The hot path (the inner loop over items per cycle) must be free of AI calls. AI calls belong in asynchronous jobs that are dispatched one at a time, rate-limited, and queued. The org-prime-agent-architecture article describes this separation explicitly: the Brain DO produces a run sheet, the Dispatcher queues individual jobs, and the local runner processes one job at a time with a 20-minute hard timeout per AI call.

Rule 2: Every AI call in a recurring context needs an explicit budget.

ai_calls:
  model: claude-sonnet
  max_calls_per_day: 50
  max_tokens_per_call: 8000
  estimated_daily_cost: $2.50
  kill_switch: true   # halt job if daily budget exceeded

Rule 3: AI calls need output caching. If the input (repo state, document, prompt) has not changed since the last call, return the cached output. Never call an AI twice with the same input. Cache key = hash of the input content.

Rule 4: Use the cheapest model that produces acceptable output. The org-prime-agent-architecture article defines a model routing policy: Haiku for structured extraction, Sonnet for planning and code changes, Opus for complex multi-step reasoning. Never use Opus where Haiku produces acceptable results.
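A routing policy of this kind reduces to a lookup table. The sketch below is illustrative only: the task category names, `routeModel`, and the `maxTier` cap are assumptions, and the tier labels are placeholders rather than real API model identifiers.

```typescript
type ModelTier = "haiku" | "sonnet" | "opus"
type TaskKind =
  | "structured_extraction"
  | "planning"
  | "code_change"
  | "multi_step_reasoning"

// Mapping taken from the policy described above; category names are hypothetical.
const ROUTING_POLICY: Record<TaskKind, ModelTier> = {
  structured_extraction: "haiku",
  planning: "sonnet",
  code_change: "sonnet",
  multi_step_reasoning: "opus",
}

// Cheapest first, so a budget cap can only downgrade a request.
const TIER_ORDER: ModelTier[] = ["haiku", "sonnet", "opus"]

function routeModel(task: TaskKind, maxTier: ModelTier = "opus"): ModelTier {
  const wanted = ROUTING_POLICY[task]
  const capped = Math.min(TIER_ORDER.indexOf(wanted), TIER_ORDER.indexOf(maxTier))
  return TIER_ORDER[capped] ?? wanted
}
```

For example, `routeModel("multi_step_reasoning", "sonnet")` returns `"sonnet"`: a recurring job can be capped below Opus without touching the policy table.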

Rule 5: Thinking mode is a cost multiplier, not a free upgrade. Extended thinking at budget_tokens: 10000 can 10× the cost of a call. It belongs in low-frequency, high-stakes decisions — not in any loop.

Rule 6: Prefer local models for hot-path operations. When a job runs frequently over many items, consider whether a local model (running on available GPU, zero marginal cost per call) can replace the cloud API call for that specific task. Reserve API calls for tasks that genuinely require frontier model capability.


5. Known Limits: The Constraint Database

Every external system has documented limits. Before any recurring job goes live, the relevant constraints must be recorded in a shared reference. This is the constraint database — a maintained document that lives in the brain repo alongside the automation registry.

GitHub API Constraints

| Limit | Value | Notes |
|---|---|---|
| REST API rate limit | 5,000 req/hr | Per authenticated user, not per app |
| GraphQL rate limit | 5,000 points/hr | Each field has a cost; connections cost more |
| Secondary rate limit | ~30 content creations/min | Soft; triggers 403 with retry-after header |
| Search API | 30 req/min | Separate quota from REST |
| Actions API | 1,000 req/hr | For workflow dispatch and run queries |
| Repository contents API | counted in REST | Each file fetch = 1 REST call |

Cloudflare Platform Constraints

| Limit | Free Tier | Paid Tier |
|---|---|---|
| Worker requests | 100K/day | Unlimited |
| Worker CPU per request | 10ms | 30s |
| Subrequests per Worker invocation | 50 | 1,000 |
| D1 row reads | 5M/day | Unlimited |
| D1 row writes | 100K/day | Unlimited |
| Durable Object requests | 1M/month | $0.15/M after |
| Cron triggers | 5/account | Unlimited |
| DO alarm precision | 1 minute | 1 minute |

AI Provider Constraints

| Provider | Limit Type | Value |
|---|---|---|
| Anthropic API (Tier 1) | Tokens/min | 50,000 TPM |
| Anthropic API (Tier 4) | Tokens/min | 400,000 TPM |
| Anthropic API (Tier 1) | Requests/min | 50 RPM |
| Claude Code (Pro/Max) | Requests | Shared quota, no published hard number |
| Claude Agent SDK | Concurrency | One session per binary instance |

The constraint database rule: Every recurring job must reference the specific limits it operates within. “I didn’t know there was a limit” is not an acceptable post-mortem. Add any limit you discover to the database before you hit it again.


6. The Review Gate: Formal Checklist

Before any job, script, workflow, or agent is scheduled to run more than once, it must pass this checklist. The completed checklist becomes part of the job’s automation registry entry. A job that cannot answer all questions does not get scheduled.


Section A: Classification


Section B: External API Calls


Section C: AI Calls


Section D: State and Side Effects


Section E: Observability


Section F: Kill Switches (all required)


Section G: Sign-Off

A job that fails any item in sections A–F does not run on a schedule until the item is addressed.


7. The Automation Registry

Every recurring job in the org must have an entry in the automation registry. This is a YAML file maintained in the brain or ops repo. It is the authoritative source of what is running, why, and at what cost.

Registry Entry Schema

- id: mulan-org-scanner
  type: cloudflare-worker-cron
  repo: garywu/mulan
  deployed_at: 2026-03-20
  schedule: "*/5 * * * *"
  status: active   # active | deprecated | suspended

  purpose: >
    Scans all registered repos for org health signals (CI, README, tooling standards).
    Generates scored dashboard in garywu/_readme. Dispatches fix jobs to the dispatcher queue.

  cost_model:
    github_rest_calls_per_run: 82        # smart scan: 2/repo unchanged, 14/repo changed
    github_rest_calls_per_hour: 984      # 82 × 12 runs/hour
    github_rest_pct_of_limit: "20%"
    ai_calls_per_run: 0
    ai_cost_per_month: "$0.00"
    d1_reads_per_run: 41
    d1_writes_per_run: 41

  limits_checked:
    - github_rest: "5,000/hr"
    - github_graphql: "5,000/hr"
    - cloudflare_subrequests: "1,000/invocation"

  kill_switches:
    - type: env_var
      name: SCANNER_DISABLED
      effect: skip scanning, serve cached data only
    - type: rate_limit_response
      effect: exponential backoff up to 5 min, alert via Telegram

  observability:
    log_format: "[reconciler] scanned N repos: X full, Y fast | api_calls=N | status=ok"
    dashboard: "garywu/_readme (auto-generated)"
    alert_on_failure: telegram

  review:
    reviewed_at: 2026-03-20
    checklist_version: v1
    status: approved
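A registry is only authoritative if entries are checked before a job is scheduled. The following sketch validates a parsed entry against a few of the rules in this article. It is an assumption-laden illustration: `RegistryEntry` covers only a subset of the schema above with camelCased field names, and `validateEntry` is a hypothetical helper, not part of any existing tool.

```typescript
// Hypothetical subset of a registry entry, mirroring fields from the schema above.
interface RegistryEntry {
  id: string
  schedule: string
  status: "active" | "deprecated" | "suspended"
  killSwitches: { type: string; effect: string }[]
  review: { status: string; checklistVersion: string }
  githubRestPctOfLimit: number // 20 for "20%"
}

// Returns a list of problems; an empty list means the entry may be scheduled.
function validateEntry(entry: RegistryEntry): string[] {
  const problems: string[] = []
  if (entry.killSwitches.length === 0) {
    problems.push("no kill switch declared")
  }
  if (entry.review.status !== "approved") {
    problems.push("review not approved")
  }
  if (entry.githubRestPctOfLimit > 50) {
    problems.push("GitHub REST usage above 50% of limit")
  }
  if (entry.schedule.trim().split(/\s+/).length !== 5) {
    problems.push("schedule is not a 5-field cron expression")
  }
  return problems
}
```

A check like this could run in CI on the registry file, so that an unreviewed or unbounded job fails the build rather than the rate limit.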

Registry Location

garywu/brain/
  .automation/
    registry.yml    # all recurring jobs across the org
    limits.yml      # known platform limits, reviewed quarterly
    budget.yml      # monthly AI spend targets per service

The registry is consumed by:


8. Required Infrastructure: Circuit Breakers and Kill Switches

Every recurring job must implement, at minimum:

1. Rate Limit Awareness

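// Assumes notify(message) is defined elsewhere and posts to the operator channel.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

async function ghFetchWithBackoff(url: string, token: string, retriesLeft = 1): Promise<Response> {
  const res = await fetch(url, { headers: { Authorization: `Bearer ${token}` } })

  // A 403 counts as a rate limit only when GitHub signals one; other 403s are permission errors.
  const rateLimited =
    res.status === 429 ||
    (res.status === 403 &&
      (res.headers.has('retry-after') || res.headers.get('x-ratelimit-remaining') === '0'))
  if (!rateLimited || retriesLeft <= 0) return res  // bounded: never retry forever

  const retryAfter = res.headers.get('retry-after')     // seconds to wait
  const resetAt = res.headers.get('x-ratelimit-reset')  // epoch seconds when the quota resets
  const rawWaitMs = retryAfter
    ? parseInt(retryAfter, 10) * 1000
    : resetAt
      ? parseInt(resetAt, 10) * 1000 - Date.now()
      : 60_000
  // Guard against unparsable headers and past reset times, then cap at 5 min.
  const waitMs = Math.min(Math.max(Number.isFinite(rawWaitMs) ? rawWaitMs : 60_000, 1_000), 300_000)

  console.error(`[rate-limit] ${url.slice(0, 60)}: backing off ${Math.ceil(waitMs / 1000)}s`)
  await notify(`⚠️ Rate limit hit, waiting ${Math.ceil(waitMs / 1000)}s`)
  await sleep(waitMs)
  return ghFetchWithBackoff(url, token, retriesLeft - 1)  // single retry by default
}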

2. AI Budget Circuit Breaker

async function callAIWithBudget(
  prompt: string,
  db: D1Database,
  dailyBudgetUsd: number,
): Promise<string | null> {
  const today = new Date().toISOString().split('T')[0]
  const spent = await db
    .prepare(`SELECT COALESCE(SUM(cost_usd), 0) AS total FROM ai_call_log WHERE date = ?`)
    .bind(today)
    .first<{ total: number }>()

  if ((spent?.total ?? 0) >= dailyBudgetUsd) {
    console.warn(`[ai-budget] $${dailyBudgetUsd} daily cap reached — skipping AI call`)
    return null  // caller must handle null gracefully
  }

  const result = await callAI(prompt)
  const costUsd = result.usage.total_tokens * 0.000003 // estimate; calibrate per model

  await db
    .prepare(`INSERT INTO ai_call_log (date, model, tokens, cost_usd) VALUES (?, ?, ?, ?)`)
    .bind(today, result.model, result.usage.total_tokens, costUsd)
    .run()

  return result.content
}

3. Input Change Detection (Cache-Before-Call)

async function processIfChanged(
  cacheKey: string,
  input: string,
  db: D1Database,
  processor: () => Promise<string>,
): Promise<string> {
  const inputHash = await sha256(input)

  const cached = await db
    .prepare(`SELECT output, input_hash FROM process_cache WHERE key = ?`)
    .bind(cacheKey)
    .first<{ output: string; input_hash: string }>()

  if (cached?.input_hash === inputHash) {
    return cached.output  // cache hit — no expensive call
  }

  const output = await processor()  // only runs when input changed

  await db
    .prepare(`INSERT OR REPLACE INTO process_cache (key, input_hash, output, updated_at)
              VALUES (?, ?, ?, unixepoch())`)
    .bind(cacheKey, inputHash, output)
    .run()

  return output
}

4. Graceful Degradation

When a recurring job cannot complete its full function (rate limited, AI budget exceeded, external API unavailable), it must:

  1. Complete what it can using local or cached data
  2. Log the degraded state with clear reason
  3. Notify the operator (Telegram or equivalent)
  4. Continue to the next cycle at the normal interval — do not spin
  5. Never write partial state as complete state
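The five rules above can be sketched as a single end-of-cycle function. This is an illustration under stated assumptions: `finishCycle`, `StepResult`, and `CycleReport` are hypothetical names, the cache is a plain in-memory record, and `notify` is injected by the caller (Telegram or otherwise).

```typescript
type StepResult<T> = { ok: true; value: T } | { ok: false; reason: string }

interface CycleReport {
  status: "ok" | "degraded"
  reasons: string[]
  data: Record<string, string> // fresh values, or cached ones where a step failed
  markComplete: boolean
}

function finishCycle(
  fresh: Record<string, StepResult<string>>,
  cache: Record<string, string>,
  notify: (message: string) => void,
): CycleReport {
  const data: Record<string, string> = {}
  const reasons: string[] = []

  for (const [key, result] of Object.entries(fresh)) {
    if (result.ok) {
      data[key] = result.value
    } else {
      reasons.push(`${key}: ${result.reason}`)
      // Rule 1: complete what we can from cached data.
      if (key in cache) data[key] = cache[key]
    }
  }

  const degraded = reasons.length > 0
  if (degraded) {
    console.warn(`[degraded] ${reasons.join("; ")}`) // Rule 2: log the reason
    notify(`Job degraded: ${reasons.join("; ")}`)    // Rule 3: tell the operator
  }

  // Rule 4: return normally so the scheduler waits for the next cycle (no retry loop here).
  // Rule 5: a degraded cycle is never marked complete.
  return { status: degraded ? "degraded" : "ok", reasons, data, markComplete: !degraded }
}
```

The caller persists `data` as complete state only when `markComplete` is true; otherwise it serves the merged result but leaves the last complete snapshot untouched.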

9. Observability Is Mandatory, Not Optional

The tight loop article defines the core principle: a system must observe itself. For recurring jobs, this means every single run emits a structured summary before it exits.

Minimum: The Structured Log Line

[mulan-org-scanner] cycle=42 repos=41 full=3 fast=38 jobs_created=5
  api_calls=89 api_pct=1.8% duration_ms=12400 status=ok

This single line answers every operational question about the run:

Without this line, you cannot answer “is this job healthy?” without reading full execution logs. That is too slow for incident response.
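Emitting the line from one typed function keeps its fields from drifting between jobs. A minimal sketch follows; `CycleSummary` and `formatSummary` are hypothetical names, and the field set simply mirrors the example line above.

```typescript
interface CycleSummary {
  job: string
  cycle: number
  repos: number
  full: number
  fast: number
  jobsCreated: number
  apiCalls: number
  hourlyLimit: number // used to derive api_pct
  durationMs: number
  status: "ok" | "degraded" | "failed"
}

// Builds the single-line summary; api_pct is this run's share of the hourly limit.
function formatSummary(s: CycleSummary): string {
  const apiPct = ((s.apiCalls / s.hourlyLimit) * 100).toFixed(1)
  return (
    `[${s.job}] cycle=${s.cycle} repos=${s.repos} full=${s.full} fast=${s.fast} ` +
    `jobs_created=${s.jobsCreated} api_calls=${s.apiCalls} api_pct=${apiPct}% ` +
    `duration_ms=${s.durationMs} status=${s.status}`
  )
}
```

Because every key is `name=value` on one line, the output stays greppable and can be parsed by a dashboard without a log pipeline.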

Dashboard Requirement

Any recurring job that runs for more than one week must appear in a monitoring dashboard. At minimum, the dashboard shows:

The activity section of the garywu/_readme dashboard (generated by the mulan dispatcher) is an example of this applied to the job system itself: it shows what’s running now, how many fixes happened in the last hour, and the hourly histogram.


10. The Smart Scan Principle

The single most effective cost reduction for polling-based recurring jobs is skipping work when nothing has changed. This applies at every layer.

Layer 1: Push Timestamp Gating

For jobs that check the state of external resources (files, repos, APIs):

if resource.last_modified == cached.last_modified:
    skip expensive checks
    optionally refresh the single most volatile signal (e.g. CI status)
    cost: 1-2 API calls
else:
    full scan
    update cache
    cost: N API calls (full scan)

Applied to the mulan org scanner with 41 repos:

| Scenario | API calls/run | Calls/hour | % of limit |
|---|---|---|---|
| Naive (always full scan) | 574 | 6,888 | 138% |
| Smart scan (3 repos changed) | 118 | 1,416 | 28% |
| Smart scan (0 repos changed) | 82 | 984 | 20% |
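The gating decision itself is a comparison against a cached timestamp. The sketch below assumes a `pushedAt` string per repo and the per-repo costs used in the table (14 calls for a full scan, 2 for a fast check); `planScan` and `RepoState` are hypothetical names.

```typescript
interface RepoState {
  name: string
  pushedAt: string // ISO timestamp of the last push
}

const FULL_SCAN_CALLS = 14 // per changed repo
const FAST_CHECK_CALLS = 2 // per unchanged repo

// Splits repos into full and fast scans and estimates the API cost of the run.
function planScan(current: RepoState[], cachedPushedAt: Map<string, string>) {
  const full: string[] = []
  const fast: string[] = []
  for (const repo of current) {
    if (cachedPushedAt.get(repo.name) === repo.pushedAt) {
      fast.push(repo.name) // nothing pushed since last scan
    } else {
      full.push(repo.name) // new repo or new push
    }
  }
  const apiCalls = full.length * FULL_SCAN_CALLS + fast.length * FAST_CHECK_CALLS
  return { full, fast, apiCalls }
}
```

With 41 repos and 3 changed, the plan costs 3 × 14 + 38 × 2 = 118 calls, matching the second row of the table.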

Layer 2: Content Deduplication

For jobs that write output (commits, issues, notifications):

if hash(new_content) == hash(existing_content):
    skip write entirely

A dashboard that regenerates every 5 minutes but only commits when data changed consumes zero GitHub write API calls on quiet cycles — the cycles that comprise the majority of all cycles.
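One way to implement the check is to store the hash of the last content written and compare against it before every write. This is a sketch assuming a Node.js runtime (`node:crypto`); inside a Worker the equivalent would be the asynchronous Web Crypto API. `contentHash` and `decideWrite` are hypothetical names.

```typescript
import { createHash } from "node:crypto"

// Hex SHA-256 digest of a UTF-8 string.
function contentHash(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex")
}

interface WriteDecision {
  write: boolean
  hash: string // store this after a successful write
}

// lastWrittenHash is the hash saved after the previous write, or null on the first run.
function decideWrite(newContent: string, lastWrittenHash: string | null): WriteDecision {
  const hash = contentHash(newContent)
  return { write: hash !== lastWrittenHash, hash }
}
```

Storing the hash rather than the content means the comparison costs no API call at all on a quiet cycle.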

Layer 3: Tiered Polling Frequency

Not all signals need the same polling frequency:

| Signal | Natural change rate | Appropriate poll interval |
|---|---|---|
| CI status | Changes within minutes of a push | Every run |
| File-based signals (README, config) | Changes only with a push | Only when pushed_at changes |
| Repo metadata (description, topics) | Changes rarely | Weekly |
| Registry (repo list) | Changes on explicit edit | On file change only |

Structure your jobs to reflect these natural rates. Polling file signals at CI frequency wastes 95% of your API budget.
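The tiers can be expressed as per-signal rules evaluated at the start of each run. The sketch below covers three of the four rows in the table (the registry row is left out for brevity); the signal keys, `PollRule`, and `signalsDue` are hypothetical names.

```typescript
type PollRule =
  | { kind: "every_run" }
  | { kind: "on_push" }
  | { kind: "interval"; everyMs: number }

const WEEK_MS = 7 * 24 * 60 * 60 * 1000

const POLL_RULES: Record<string, PollRule> = {
  ci_status: { kind: "every_run" },
  file_signals: { kind: "on_push" },
  repo_metadata: { kind: "interval", everyMs: WEEK_MS },
}

// Returns the signals worth checking on this run for one repo.
function signalsDue(
  nowMs: number,
  pushedSinceLastRun: boolean,
  lastCheckedMs: Record<string, number>,
): string[] {
  return Object.entries(POLL_RULES)
    .filter(([signal, rule]) => {
      switch (rule.kind) {
        case "every_run":
          return true
        case "on_push":
          return pushedSinceLastRun
        case "interval":
          // A signal that was never checked is treated as overdue.
          return nowMs - (lastCheckedMs[signal] ?? 0) >= rule.everyMs
      }
    })
    .map(([signal]) => signal)
}
```

On a quiet cycle with recently checked metadata, only `ci_status` is returned, which is the cheap path the table describes.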

Layer 4: Event-Driven Flip (the final form)

Push-timestamp gating cuts the per-repo cost on quiet cycles from 14 calls to 2, but every repo is still polled on every cycle. That leaves a deeper problem: polling does not scale to hundreds of repos at any frequency.

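At 200 repos with */5 * * * * and smart scan, even a cycle in which nothing changed costs 200 × 2 = 400 calls. That is 4,800 calls/hour, or 96% of the GitHub limit, before a single repo is fully scanned.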

The polling model is fundamentally O(repos). The correct architecture is O(changes).

The event-driven flip:

Instead of the scanner reaching out to repos, repos report in when they change:

Repo push → CI completes → notify-readme.yml fires
        ↓
POST /notify {repo: "garywu/atlas"}
        ↓
Accumulator (DO storage dirty set)
adds repo, arms 5-minute alarm
if alarm already set: just accumulate
        ↓
5-minute alarm fires (batch window closed)
        ↓
scan only the N repos that reported changes
        ↓
one README commit (or skip if no diff)
dirty set cleared

The cron becomes a low-frequency backup sweep (once per day at 3am) to catch things that change without pushes: CI runs completing asynchronously, Dependabot PRs, branch protection changes, new repos added to the registry.

Cost comparison at scale:

| Repos | Pushes/day | Polling (*/5, smart scan) | Event-driven |
|---|---|---|---|
| 41 | 10 | ~984/hr | 10 × 14 = 140/day |
| 200 | 20 | ~4,800/hr (96%) | 20 × 14 = 280/day |
| 1,000 | 50 | impossible | 50 × 14 = 700/day |

Event-driven scanning at 1,000 repos uses fewer API calls per day (700) than smart polling at 41 repos uses per hour (~984).

Implementation with Durable Object storage:

// BrainDO: accumulate dirty repos, batch scan fires at next cron tick
async notify(repo: string): Promise<void> {
  const dirty: string[] = await this.state.storage.get('dirty_repos') ?? []
  if (!dirty.includes(repo)) {
    dirty.push(repo)
    await this.state.storage.put('dirty_repos', dirty)
  }
}

async consumeDirty(): Promise<string[]> {
  const dirty: string[] = await this.state.storage.get('dirty_repos') ?? []
  if (dirty.length > 0) await this.state.storage.delete('dirty_repos')
  return dirty
}

// Cron handler: pop dirty set, scan only changed repos
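// Assumes brainDO is a stub for the Durable Object above, already in scope.
async function runScheduled(env: Env): Promise<void> {
  const now = new Date()

  // Daily full sweep: only the first 5-minute tick of the 03:00 UTC hour,
  // so the sweep runs once per day rather than on every empty tick in that hour.
  if (now.getUTCHours() === 3 && now.getUTCMinutes() < 5) {
    await brainDO.consumeDirty()  // the full sweep covers any dirty repos as well
    await runReconciler(env)
    return
  }

  // O(changes): scan only the repos that reported changes
  const dirtyRepos = await brainDO.consumeDirty()
  if (dirtyRepos.length > 0) await runReconciler(env, dirtyRepos)
}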

The cron still fires every 5 minutes, but when the dirty set is empty it exits in milliseconds with zero GitHub API calls. The 5-minute poll becomes the batch window for the event-driven notifications — repos that push multiple times in a 5-minute window are deduplicated into one scan.

Prerequisites for the flip:

  1. Each repo needs a push notification hook (e.g. notify-readme.yml GitHub Actions workflow)
  2. A receiver that accumulates notifications (DO storage, a D1 table, or a queue)
  3. The cron or alarm reads from the accumulator, not from the registry

The notification hook should be deployed to repos as part of initial setup — not as a one-time manual step. This is a standard compliance item, not an enhancement.


11. Governance in Practice: A Worked Example

The mulan org scanner followed a trajectory that illustrates exactly what the review gate prevents.

Phase 1 — Manual experiment: Ran once to generate a dashboard. 574 API calls. Worked fine.

Phase 2 — First schedule (30-min cron): No review. No cost function computed.

Phase 3 — Interval tightened (5-min cron): No review. Rate limit math not redone.

Phase 4 — Debugging: 20 minutes identifying root cause. fix-ci jobs were failing with “API rate limit already exceeded” — a symptom two layers removed from the cause.

Phase 5 — Corrective action:

Phase 6 — Event-driven flip:

What a review gate would have done at Phase 3: The cost function at 5-minute intervals returns 6,888 calls/hour against a 5,000/hour limit. This is immediately visible. The gate blocks the deployment until smart scan is implemented, which drops the number to ~984 calls/hour (20% of limit). The rate limiting incident never happens.

The review gate costs 30 minutes to complete. The incident cost more than an hour of debugging and produced silent job failures during the debugging window. The gate has positive expected value on the very first job it catches.

The scaling question would have been caught at Phase 2: Section B of the review checklist asks: “How does cost scale as item count grows?” At 41 repos, polling is manageable. At 200 repos with the same architecture, it is not. The event-driven flip (§10, Layer 4) is the answer to this question, and the checklist forces it to be asked before deployment, not after hitting the wall at scale.


12. The Automation Lifecycle

Every recurring job moves through defined phases with explicit transitions:

EXPERIMENT → REVIEW → APPROVED → MONITORED → DEPRECATED → REMOVED

EXPERIMENT: Runs manually, on demand. No schedule. No registry entry required. This phase has no governance burden — move fast, break things, iterate.

REVIEW: Owner intends to schedule the job. Checklist (§6) is in progress. Job is not yet running on a schedule.

APPROVED: Checklist complete. Registry entry created. Kill switch tested. Job is now running on schedule.

MONITORED: Running in production for more than one week. Dashboard entry active. Quarterly cost review scheduled.

DEPRECATED: Marked for removal in the registry. Still running, but with a removal date set. Downstream consumers have been notified.

REMOVED: Job deleted. Schedule removed. Registry entry archived with reason and date. Dependent systems updated.
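The lifecycle can be enforced as a small state machine so that a job cannot skip REVIEW. The sketch below assumes strictly linear transitions, as drawn in the chain above; a real process may also allow moves such as REVIEW back to EXPERIMENT. `advancePhase` and `mayRunOnSchedule` are hypothetical names.

```typescript
type Phase =
  | "EXPERIMENT"
  | "REVIEW"
  | "APPROVED"
  | "MONITORED"
  | "DEPRECATED"
  | "REMOVED"

// Assumption: each phase has exactly one legal successor.
const NEXT_PHASE: Record<Phase, Phase | null> = {
  EXPERIMENT: "REVIEW",
  REVIEW: "APPROVED",
  APPROVED: "MONITORED",
  MONITORED: "DEPRECATED",
  DEPRECATED: "REMOVED",
  REMOVED: null,
}

function advancePhase(current: Phase, target: Phase): Phase {
  if (NEXT_PHASE[current] !== target) {
    throw new Error(`illegal transition ${current} -> ${target}`)
  }
  return target
}

// Per the phase descriptions: only these phases run on a schedule.
function mayRunOnSchedule(phase: Phase): boolean {
  return phase === "APPROVED" || phase === "MONITORED" || phase === "DEPRECATED"
}
```

A deploy script could call `mayRunOnSchedule` with the phase recorded in the registry and refuse to install a cron trigger for anything still in EXPERIMENT or REVIEW.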

The Quarterly Review

All APPROVED and MONITORED jobs are reviewed every three months:

  1. Is the job still needed? (What would break if it stopped?)
  2. Has the cost model drifted? (Items processed may have grown)
  3. Are the platform limits still accurate? (Providers change limits)
  4. Can the job be made cheaper? (Better caching, model routing, batching)
  5. Are there newer primitives that change the architecture? (e.g. webhooks replacing polling)

The Brain DO in the org-prime-agent-architecture performs an equivalent review for runner capacity — adjusting allocations weekly based on observed success rates. The quarterly automation review is the same pattern applied at the governance level.



Summary

The phase transition from “runs once” to “runs repeatedly” is the most consequential decision in automation engineering. It requires formal process because the consequences are compounding, silent, and sometimes irreversible.

The non-negotiable requirements:

| Requirement | Why |
|---|---|
| Cost function computed before scheduling | Rate limit disasters are always predictable in advance |
| AI calls budgeted separately | They cost 100–10,000× more per call than REST API calls |
| Smart scan / input change detection | 80%+ API reduction on typical cycles |
| Kill switch tested before first scheduled run | If you can’t stop it, you don’t control it |
| Structured log line on every run | Without it, “is this healthy?” takes too long to answer |
| Registry entry created | Invisible automation is unmanaged automation |
| Constraint database referenced | “I didn’t know there was a limit” is preventable |

The cost of running this review for a new job: approximately 30 minutes.

The cost of not running it: rate limits exhausted, surprise bills, silent job failures, downstream systems confused, and a debugging session measured in hours rather than seconds.


This article is part of the garywu engineering knowledge base. See also: git-sure#29 (global automation registry implementation).

Last updated: 2026-03-20.

