Gary Wu

Progress Visibility


Org Status: 🟡 Dormant Cloudflare: N/A Last Audited: 2026-04-28



Never launch a process that takes more than ~10 seconds without continuous progress output. If you can’t tell whether it’s alive in 10 seconds, don’t launch it.

That’s it. Everything below helps you enforce it.


Before you run any operation expected to take longer than 10 seconds, answer these questions. If any answer is “no,” fix it before launching.

If you’re writing the long-running code yourself: implement the reporter first, before the logic. The reporter is not optional polish — it is how you know your code works.


This is the simplest implementation that satisfies the checklist. Copy it, adapt it, ship it.

The Interface

interface ProgressReporter {
  tick(done: number, total: number, label?: string): void;
  rate(): number;       // items per second
  eta(): number | null; // seconds remaining, or null if unknown
  finish(): void;
}
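
The interface promises rate() and eta(), and both can be derived from counts a reporter already tracks. Here is a minimal sketch, assuming an overall average (items done divided by total elapsed time) is good enough. The `RateTracker` name, the `update` method, and the injectable `now` clock are illustrative assumptions, not part of the original design.

```typescript
// Hypothetical helper: derives rate and ETA from cumulative counts.
// `now` is injectable so the arithmetic can be checked without real waiting.
class RateTracker {
  private readonly startMs: number;
  private done = 0;
  private total = 0;

  constructor(private readonly now: () => number = () => Date.now()) {
    this.startMs = this.now();
  }

  update(done: number, total: number): void {
    this.done = done;
    this.total = total;
  }

  // Items per second, averaged since construction.
  rate(): number {
    const elapsedSec = (this.now() - this.startMs) / 1000;
    return elapsedSec > 0 ? this.done / elapsedSec : 0;
  }

  // Seconds remaining, or null when the total or the rate is unknown.
  eta(): number | null {
    const r = this.rate();
    if (this.total <= 0 || r <= 0) return null;
    return (this.total - this.done) / r;
  }
}

// Simulated clock: 250 of 1000 items after 10 seconds.
let clockMs = 0;
const tracker = new RateTracker(() => clockMs);
clockMs = 10_000;
tracker.update(250, 1000);
const trackerRate = tracker.rate(); // 250 items / 10 s = 25 items per second
const trackerEta = tracker.eta();   // 750 remaining / 25 per second = 30 seconds
```

Returning null instead of a guessed number keeps "ETA unknown" honest when no work has been measured yet.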

The Ticker Pattern (TypeScript/Node.js)

import { performance } from 'perf_hooks';

class ProgressReporter {
  private startMs: number;
  private lastTickMs: number;
  private lastDone: number;
  private timer: NodeJS.Timeout | null = null;

  constructor(private tickIntervalMs = 5000) {
    this.startMs = performance.now();
    this.lastTickMs = this.startMs;
    this.lastDone = 0;
  }

  // Call this after every meaningful unit of work.
  // For bulk operations, call every N items rather than every item.
  tick(done: number, total: number, label = '') {
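    const now = performance.now();
    const intervalElapsed = (now - this.lastTickMs) / 1000;

    // Throttle, but always let the final tick through so 100% is reported.
    const isFinal = total > 0 && done >= total;
    if (!isFinal && intervalElapsed < this.tickIntervalMs / 1000) return;

    const rate = intervalElapsed > 0 ? (done - this.lastDone) / intervalElapsed : 0;
    // With no measurable rate, the ETA is unknown rather than a guess.
    const remaining = total > 0 && rate > 0 ? (total - done) / rate : null;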
    const pct = total > 0 ? ((done / total) * 100).toFixed(1) : '?';
    const etaStr = remaining != null ? `ETA ${fmtSecs(remaining)}` : 'ETA unknown';

    process.stderr.write(
      `[${new Date().toISOString()}] ${pct}% | ${done}/${total} | ` +
      `${rate.toFixed(1)}/s | ${etaStr}${label ? ' | ' + label : ''}\n`
    );

    this.lastTickMs = now;
    this.lastDone = done;
  }

  // Use this for operations where you don't know total or can't call tick() reliably.
  // Emits a heartbeat so you can tell the process is alive.
  startHeartbeat(intervalMs = 5000, getMessage: () => string) {
    this.timer = setInterval(() => {
      process.stderr.write(`[${new Date().toISOString()}] alive: ${getMessage()}\n`);
    }, intervalMs);
  }

  stopHeartbeat() {
    if (this.timer) {
      clearInterval(this.timer);
      this.timer = null;
    }
  }

  finish() {
    this.stopHeartbeat();
    const elapsed = (performance.now() - this.startMs) / 1000;
    process.stderr.write(`[${new Date().toISOString()}] done in ${fmtSecs(elapsed)}\n`);
  }
}

function fmtSecs(s: number): string {
  if (s < 60) return `${s.toFixed(0)}s`;
  if (s < 3600) return `${Math.floor(s / 60)}m${Math.floor(s % 60)}s`;
  return `${Math.floor(s / 3600)}h${Math.floor((s % 3600) / 60)}m`;
}

Usage

const reporter = new ProgressReporter(5000); // report every 5 seconds at most

for (let i = 0; i < files.length; i++) {
  await processFile(files[i]);
  reporter.tick(i + 1, files.length, files[i]);
}

reporter.finish();

For operations where you truly cannot call tick() in a loop — e.g., waiting on a single long database query — use the heartbeat:

reporter.startHeartbeat(5000, () => `query running, ${db.stats()}`);
const result = await db.query(heavySQL);
reporter.stopHeartbeat();

stderr vs stdout

Always write progress to stderr, not stdout. Stdout is data. Stderr is status. This matters when:

The one exception: interactive TUI applications that own the terminal and render progress in-place using ANSI escape codes. That’s a deliberate choice, not a default.
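
The split can be sketched with injected writers, so the data stream and the status stream stay visibly separate. The `Writer` type and the `runStage` function are hypothetical names. In a real Node.js script the two writers would wrap process.stdout.write and process.stderr.write.

```typescript
type Writer = (chunk: string) => void;

// Hypothetical pipeline stage: results go to `out`, progress goes to `status`.
function runStage(items: string[], out: Writer, status: Writer, every = 2): void {
  items.forEach((item, i) => {
    out(item.toUpperCase() + '\n'); // data: safe to pipe into the next program
    const done = i + 1;
    if (done % every === 0 || done === items.length) {
      status(`${done}/${items.length} processed\n`); // status: never enters the pipe
    }
  });
}

// Capture both streams in memory to show that they do not mix.
const dataLines: string[] = [];
const statusLines: string[] = [];
runStage(['a', 'b', 'c'], (s) => dataLines.push(s), (s) => statusLines.push(s));
```

Because the stage never writes status to `out`, a downstream consumer sees only the transformed items.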


Every serious long-running tool implements progress visibility. Study these before building your own.

rsync --progress

rsync -av --progress source/ dest/

Key elements: bytes transferred, percentage, rate in MB/s, time, transfer count, files remaining. Everything you need to know about the operation at a glance.

rsync -av --info=progress2 source/ dest/

--info=progress2 gives whole-operation progress instead of per-file. Use it when you have many small files and don’t need per-file detail.

git clone

git clone https://github.com/large/repo

Git reports phase transitions (enumerate → count → compress → receive → resolve), percentage within each phase, bytes transferred, and rate. It also writes progress to stderr, leaving stdout clean for scripting.

docker build

docker build -t myimage .

Docker numbers each step, reports cache hits, and shows transfer rates for layer pulls. The step numbers alone are enough to know where you are. This is the “phase reporting” pattern — useful when your operation has distinct phases rather than a uniform loop.
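
A phase reporter along these lines can be very small. This is a sketch only: the `PhaseReporter` class and its `next` method are hypothetical, and the default writer assumes console.error is an acceptable route to stderr.

```typescript
// Hypothetical phase reporter: numbers each phase like a build step.
class PhaseReporter {
  private index = 0;

  constructor(
    private readonly phases: string[],
    private readonly write: (line: string) => void = (line) => console.error(line),
  ) {}

  // Announce the next phase and return the line that was emitted.
  next(detail = ''): string {
    if (this.index >= this.phases.length) {
      throw new Error('no phases left');
    }
    const name = this.phases[this.index];
    this.index += 1;
    const suffix = detail ? ' | ' + detail : '';
    const line = `[${this.index}/${this.phases.length}] ${name}${suffix}`;
    this.write(line);
    return line;
  }
}

// Collect lines in memory instead of printing them.
const phaseLog: string[] = [];
const phasesDemo = new PhaseReporter(['scan', 'hash', 'upload'], (l) => phaseLog.push(l));
phasesDemo.next();             // emits "[1/3] scan"
phasesDemo.next('1200 files'); // emits "[2/3] hash | 1200 files"
```

Declaring the phase list up front is what makes the "2 of 3" part possible. A reporter that only prints phase names cannot say how much is left.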

pv (Pipe Viewer)

pv is a Unix tool that fits into any pipeline and adds progress visibility to any byte stream:

tar cf - /data | pv -s $(du -sb /data | cut -f1) | gzip | aws s3 cp - s3://bucket/backup.tar.gz

pv reports bytes passed, elapsed time, rate, percentage (if you give it -s total_bytes), and ETA. It is a complete progress reporter for any stream-based operation. Reach for it before writing your own byte-counting code.

tqdm (Python)

from tqdm import tqdm

for item in tqdm(items, desc="processing", unit="file"):
    process(item)

tqdm wraps any iterable and adds a progress bar to stderr with elapsed time, rate, and ETA. It handles all the throttling and terminal width calculation. In Python, there is almost never a reason to write a manual progress reporter — just wrap your iterator.

The pattern these share

All of these tools follow the same structure:

  1. Track start time.
  2. Track items done and items total.
  3. Compute rate = (items done in last interval) / (interval seconds).
  4. Emit to stderr at a fixed interval, not on every item.
  5. Show: done, total or fraction, rate, ETA.

Copy this pattern. Do not invent a new one.


A process that stopped emitting output is either done or dead. You need to tell these apart.

The 30-second rule

If a process has not emitted a progress line in 30 seconds, assume it is stuck. Kill it. Investigate. Do not wait longer “just in case.” A well-behaved process emits at least every 10 seconds; 30 seconds is three missed heartbeats.

Implementing a watchdog

class Watchdog {
  private lastSeen: number = Date.now();
  private timer: NodeJS.Timeout;

  constructor(timeoutMs: number, onTimeout: () => void) {
    this.timer = setInterval(() => {
      if (Date.now() - this.lastSeen > timeoutMs) {
        onTimeout();
        this.stop();
      }
    }, 1000);
  }

  // Call this whenever the process emits output or makes progress.
  ping() {
    this.lastSeen = Date.now();
  }

  stop() {
    clearInterval(this.timer);
  }
}

// Usage:
const watchdog = new Watchdog(30_000, () => {
  console.error('Process appears stuck — killing.');
  child.kill('SIGTERM');
});

child.stderr.on('data', () => watchdog.ping());
child.on('close', () => watchdog.stop());

SQLite lock contention

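SQLite is a common source of silent hangs. In the default rollback-journal mode, a writer takes an exclusive lock to commit, and other connections cannot read or write until it is released. Depending on how the driver handles a busy database, a blocked connection either fails with SQLITE_BUSY or stalls while it retries. Symptoms: the process emits progress for a while, then goes silent exactly when another writer acquires the database.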

Fix:

// Set a busy timeout before any query. 5 seconds is reasonable for most cases.
db.pragma('busy_timeout = 5000');
// WAL mode allows readers and one writer to coexist:
db.pragma('journal_mode = WAL');

With busy_timeout set, SQLite will retry the lock for up to N milliseconds before throwing SQLITE_BUSY. Without it, it throws immediately — or on some drivers, waits forever. Always set it.

Timeouts on raw queries

Never issue a long-running query without a timeout:

sqlite3 -cmd ".timeout 10000" database.db "SELECT ..."

timeout 30s sqlite3 database.db "SELECT ..."

For Node.js with better-sqlite3, there is no per-query timeout, but you can run the query in a Worker thread and terminate the thread if it exceeds your budget.


Progress visibility is not enough if killing the process at 90% means starting over. The complement to progress reporting is checkpointing: saving enough state that you can resume from where you left off.

Cursor-based pagination

For any operation that processes rows from a database or records from a file, use a cursor stored in a durable location:

// Read the last checkpoint.
const checkpoint = db.prepare(
  'SELECT last_id FROM scan_checkpoints WHERE job = ?'
).get(jobName) ?? { last_id: 0 };

let cursor = checkpoint.last_id;

while (true) {
  const batch = db.prepare(
    'SELECT id, path FROM files WHERE id > ? ORDER BY id LIMIT 1000'
  ).all(cursor);

  if (batch.length === 0) break;

  for (const row of batch) {
    await processFile(row.path);
  }

  cursor = batch[batch.length - 1].id;

  // Save checkpoint after each batch.
  db.prepare(
    'INSERT OR REPLACE INTO scan_checkpoints (job, last_id) VALUES (?, ?)'
  ).run(jobName, cursor);

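  // Using the id as "done" assumes ids are dense and start at 1;
  // otherwise keep a separate count of processed rows.
  reporter.tick(cursor, totalFiles);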
}

Key properties:

What to checkpoint

For file scans: the last file path processed, or the last inode/ID seen, sorted by a stable order.

For network operations: the last page token, sequence number, or timestamp from the API.

For hashing jobs: the list of files already hashed, stored in the results table itself (if the file appears in results, skip it).
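
The hashing case can be sketched as a filter over the work list. This is an illustration under assumptions: `pendingFiles` is a hypothetical helper, and the set of finished paths would normally be read from the results table rather than built in memory.

```typescript
// Hypothetical resume helper: the results themselves act as the checkpoint.
// `alreadyHashed` would come from the results table; here it is a plain Set.
function pendingFiles(allFiles: string[], alreadyHashed: Set<string>): string[] {
  return allFiles.filter((path) => !alreadyHashed.has(path));
}

const allFiles = ['a.bin', 'b.bin', 'c.bin', 'd.bin'];
const alreadyHashed = new Set(['a.bin', 'c.bin']); // survived the previous run
const todo = pendingFiles(allFiles, alreadyHashed);

// Progress resumes from what is already done rather than from zero.
const resumeDone = allFiles.length - todo.length;
```

Seeding the reporter with `resumeDone` keeps the percentage honest after a restart.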

The checkpoint table is simple:

CREATE TABLE IF NOT EXISTS scan_checkpoints (
  job TEXT PRIMARY KEY,
  last_id INTEGER NOT NULL,
  updated_at TEXT NOT NULL DEFAULT (datetime('now'))
);


The rule above was not handed down from first principles. It was hard-won. This section traces where the problem came from and why it took so long to become standard practice.


The first computers had no interactive interfaces at all. You submitted a job — a deck of punched cards — to an operator. The operator loaded it into the machine. The machine ran. You came back later and picked up your printout.

IBM’s mainframe systems of the 1950s and 1960s were built entirely around this model. The IBM System/360 family (announced in 1964) used Job Control Language (JCL), a formal notation for describing what resources a job needed, what to do with input and output, and what to do on error. JCL is still in use today: it drives core batch processing at many large banks, insurance companies, and government agencies.

In this model, progress visibility was not a concept. You didn’t watch the job run. You found out it succeeded or failed when you read the output deck. The operating system might print a job completion accounting report: CPU seconds consumed, I/O operations, elapsed time. That was as close to “progress” as you got.

The implicit assumption: the operator knows roughly how long things take, and if a job runs significantly longer than expected, you investigate. In practice this meant the operations staff would check a job that had been running for twice its normal duration. The knowledge was in people’s heads, not in the software.


The shift from batch to interactive computing changed the calculus. In the mid-1960s, projects like MIT’s Compatible Time-Sharing System (CTSS) and later Multics allowed multiple users to interact with the computer simultaneously through teletypes — electromechanical terminals that printed output on paper rolls.

Now the user was sitting at the machine. Now silence was uncomfortable. If you typed a command and the teletype didn’t respond, you didn’t know if the machine was thinking, stuck, or had crashed.

The tools of the era could not solve this. Teletypes transmitted at 110 baud, roughly 10 characters per second, so a full 72-column line took about seven seconds to print. Even if a program wanted to report progress, every status line cost real time and paper. Animation was physically impossible. The answer was cultural: programs were expected to print something when they were done and nothing while running. If nothing printed, the user waited.

The VT100 terminal (1978) popularized ANSI escape codes for cursor positioning. Programs could now move the cursor and redraw the same line or region of the screen, which is the foundation of all modern progress bars. But adoption was slow. Most programs continued the teletype convention of print-when-done.


Unix (1969–1971) cemented a design philosophy that would dominate for the next fifty years: composable programs that communicate through pipes. A program should read from stdin, write data to stdout, write errors to stderr. The streams are separate. The program is silent unless it has something to say.

This was the right design for composition. cat file | sort | uniq | wc -l works precisely because each program in the pipeline is silent except for its output. If sort printed progress to stdout, it would corrupt the data flowing to uniq.

But the convention became a trap when applied to long-running operations. cp, mv, dd — all of them run silently. You launch dd to copy a 500GB disk image, and it prints nothing until it finishes. If it takes four hours and you need to restart the machine, you have no idea whether it was at 1% or 99%.

dd was written in 1974. Its silence was deliberate: it was a building block, not a user-facing tool. The progress flag, status=progress, did not ship in GNU dd until coreutils 8.24 in 2015, roughly forty years later. The relevant commit was by Pádraig Brady, who maintained the GNU coreutils for years. The patch was small. The wait was long.


The absence of progress in dd is instructive because it was noticed immediately and complained about for decades before it was fixed.

The original workaround was a Unix signal trick: send SIGUSR1 (or SIGINFO on BSD systems) to a running dd process, and it prints a one-line status report to stderr. This works, but it requires the user to know the PID, type a kill command in another terminal, and repeat as needed. It is not continuous progress; it is manual polling.

dd if=/dev/sda of=/dev/sdb bs=4M &
DD_PID=$!
while kill -0 $DD_PID 2>/dev/null; do
  kill -USR1 $DD_PID
  sleep 5
done

The status=progress flag, when it finally arrived in GNU coreutils 8.24 (released in 2015), made this automatic:

dd if=/dev/sda of=/dev/sdb bs=4M status=progress

The line updates in place about once a second, showing bytes copied, elapsed time, and the average rate since the copy started. This is the minimum viable progress reporter, distilled to a single command-line flag.


The Python library tqdm (from the Arabic word for “progress”, تقدّم, taqaddum) was first released in 2015. It became one of the most widely used Python libraries in the world — not because it solved a hard algorithmic problem, but because it solved a pervasive social problem: nobody wanted to write the same progress bar boilerplate every time.

tqdm’s design insight was to wrap the iterator, not the code inside the loop. You don’t change your processing code; you change how you iterate:

for item in items:
    process(item)

for item in tqdm(items):
    process(item)

This was a genuinely good API decision. The progress reporter is not entangled with the business logic. It can be added as an afterthought. It can be removed without touching the core loop. It composes with nested loops. It redirects to a log file gracefully.
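
The same idea carries over to TypeScript with a generator. This is a sketch of the wrapping technique, not a port of tqdm: `withProgress`, its `report` callback, and the `every` parameter are hypothetical names.

```typescript
// Hypothetical TypeScript analogue of wrapping an iterable:
// the loop body stays untouched and the wrapper does the reporting.
function* withProgress<T>(
  items: Iterable<T>,
  total: number,
  report: (done: number, total: number) => void,
  every = 1,
): Generator<T> {
  let done = 0;
  for (const item of items) {
    yield item; // hand the item to the caller's loop body
    done += 1;  // the body has finished by the time the generator resumes
    if (done % every === 0 || done === total) report(done, total);
  }
}

// The loop body only pushes names; all progress logic lives in the wrapper.
const seen: string[] = [];
const reports: string[] = [];
for (const name of withProgress(['x', 'y', 'z'], 3, (d, t) => reports.push(`${d}/${t}`), 2)) {
  seen.push(name);
}
```

Counting after the `yield` means an item is reported as done only once the caller's loop body has finished with it.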

By 2020, tqdm had over 30 million downloads per month. Its success demonstrates that the demand for progress visibility was there all along — developers just needed a tool that made it easier to add than to skip.


Before tqdm, before status=progress, before even the VT100 terminal, IBM mainframe shops solved the progress visibility problem through operations management rather than software.

A production IBM shop maintained a job run book: a ledger (and later a database) of every scheduled batch job, its expected start time, expected duration, resource consumption, and acceptable variance. If payroll processing was supposed to finish in four hours and it hit five, the operations desk got paged.

The tooling for this evolved into formal job scheduling systems: IBM’s own JES2/JES3, then commercial products like CA-7, JOBTRAC, and TWS (Tivoli Workload Scheduler). These systems tracked job duration historically, built statistical models of expected runtime, and alerted when a job exceeded two or three standard deviations.

This is, in effect, an external watchdog: the job itself is still silent, but infrastructure around it monitors elapsed time and raises an alarm. The pattern survives in modern form as job monitoring in Kubernetes (activeDeadlineSeconds on Jobs), AWS Batch (job timeouts), and Airflow (SLA misses).
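
The statistical check those schedulers performed can be sketched in a few lines. This is an illustration, not a description of any scheduler's actual model: the function names are hypothetical, the history values are made up, and the threshold assumes a plain mean plus k sample standard deviations.

```typescript
// Hypothetical external watchdog check based on historical runtimes.
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Sample standard deviation (n - 1 in the denominator).
function stdDev(xs: number[]): number {
  const m = mean(xs);
  const sumSq = xs.reduce((acc, x) => acc + (x - m) ** 2, 0);
  return Math.sqrt(sumSq / (xs.length - 1));
}

// True when the current run has exceeded mean + k standard deviations.
function isOverdue(historyMinutes: number[], elapsedMinutes: number, k = 3): boolean {
  if (historyMinutes.length < 2) return false; // not enough history to judge
  const threshold = mean(historyMinutes) + k * stdDev(historyMinutes);
  return elapsedMinutes > threshold;
}

// Made-up past durations for a four-hour payroll job, in minutes.
const payrollHistory = [236, 240, 244, 238, 242];
const fiveHourRunOverdue = isOverdue(payrollHistory, 300); // well past the threshold
const slightlyLateOverdue = isOverdue(payrollHistory, 245); // within normal variance
```

The check needs no cooperation from the job itself, which is exactly why it works for silent programs.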

The lesson: if you can’t change the program to emit progress, build the watchdog externally. Both approaches are valid. The program-level reporter is better because it can tell you where the process is, not just whether it’s still running.


The shift to streaming data systems in the 2010s — Kafka, Flink, Spark Streaming — created a new class of long-running processes: ones that run forever. These processes don’t finish; they accumulate lag.

Kafka introduced the concept of consumer lag: the difference between the latest offset on a partition and the offset the consumer has processed. A consumer with zero lag is caught up; a consumer with growing lag is falling behind. This is progress visibility for infinite processes.

The Kafka consumer metrics exposed via JMX became a standard part of operations dashboards. kafka-consumer-groups.sh --describe prints lag per partition:

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
my-consumer     events          0          1234567890      1234600000      32110

This is exactly the progress reporter pattern applied to streaming: rate (implied by lag growth or shrinkage), ETA (if lag is decreasing, time to zero is lag / catch-up rate), and current position.
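
The ETA arithmetic can be sketched from two lag samples. The `LagSample` type and `lagEtaSec` function are hypothetical, and the second sample below is invented for illustration. Only the first lag value comes from the table above.

```typescript
// Hypothetical lag-based ETA from two samples of (time, consumer lag).
interface LagSample {
  atSec: number; // sample time in seconds
  lag: number;   // LOG-END-OFFSET minus CURRENT-OFFSET
}

// Seconds until lag reaches zero, or null if lag is flat or growing.
function lagEtaSec(earlier: LagSample, later: LagSample): number | null {
  const dt = later.atSec - earlier.atSec;
  if (dt <= 0) return null;
  const catchUpRate = (earlier.lag - later.lag) / dt; // records per second gained
  if (catchUpRate <= 0) return null;
  return later.lag / catchUpRate;
}

// Lag from the table above, then a made-up second sample 60 seconds later.
const firstSample: LagSample = { atSec: 0, lag: 1234600000 - 1234567890 };
const secondSample: LagSample = { atSec: 60, lag: 26110 };
const etaSec = lagEtaSec(firstSample, secondSample);
```

A null result is itself a signal: the consumer is not catching up, so there is no finite ETA to report.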

The same pattern appears in PostgreSQL replication lag, Redis replication offset, and Elasticsearch indexing lag. Wherever a long-running process operates on a sequence of records, the gap between “where we are” and “where we need to be” is the natural progress metric.


The evolution toward AI agent systems — programs that autonomously execute long sequences of steps — makes progress visibility more important, not less. An agent running a multi-hour file scan with no output is indistinguishable from a crashed agent. A human reviewing the system has no way to intervene intelligently without knowing where the process is.

The problem compounds when agents spawn sub-agents: a top-level agent waiting on a child operation has no idea if the child is making progress. The child’s silence looks identical to the child’s death.

The principle that started with IBM batch jobs, survived the transition to interactive computing, and was codified in pv and tqdm is now foundational infrastructure for any autonomous system. The implementation details change — ANSI terminals become structured log events, which become metrics pipelines, which become dashboards — but the requirement is unchanged: continuous output from every long-running operation, at a rate that lets any observer determine within 10 seconds whether the process is alive and making progress.

The rule at the top of this article is the end of a seventy-year line.

