Gary Wu

The Capability Primitive


A capability is a contract: defined inputs, defined outputs, a single responsibility, and the ability to register itself with any orchestrator. This is how you turn a 770-line server into seven composable pieces that can run on any machine, anywhere, at any price point.


The Problem

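Video Factory is a Node.js service that runs on a single machine. It does everything: stock footage search, ffmpeg rendering, text-to-speech, audio mixing, asset upload, thumbnails, and transcoding.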

The render-server/server.ts is 770 lines. It knows about ffmpeg binary paths. It knows about S3 credentials. It knows about the TTS endpoint. It knows about the stock footage API key. It is a monolith pretending to be a service.

The failure mode is predictable: the machine that runs it gets reassigned, rebooted, or simply dies. The service is gone. There is no fallback. Every pipeline that depended on it stops.

The deeper problem is coupling. Video Factory is not just coupled to one machine — it is coupled to one identity. The pipeline author hardcodes http://studio-machine:3000 and moves on. That URL is now a single point of failure for every workflow that touches it.

The solution is not redundancy. Adding a second Video Factory instance doubles the operational cost while keeping all the coupling. The solution is decomposition — breaking the monolith into capabilities: small, single-purpose HTTP services that know nothing about each other and register themselves with a central router when they start.


The Capability as Primitive

A capability is the compute equivalent of a Unix command. Like grep or ffmpeg, it takes a defined input, produces a defined output, and does exactly one thing.

The Unix philosophy is “Do one thing and do it well”. Docker’s container model is “one process per container”. Capabilities apply the same principle to HTTP services.

The critical property that distinguishes a capability from a microservice is ignorance. A capability does not know who is calling it, which pipeline it is part of, or which other capabilities exist.

This ignorance is not a limitation. It is what makes composition possible. A capability that knows nothing about Video Factory can be used by Video Factory, by Shorts Factory, by Podcast Factory, and by any future pipeline that needs the same operation.

+------------------------------------------------------+
|                 The Capability Model                 |
|                                                      |
|   Input (JSON) ──► [capability] ──► Output (JSON)    |
|                                                      |
|   The capability knows nothing about                 |
|   what surrounds it. It processes.                   |
|   That is all.                                       |
+------------------------------------------------------+

This is directly analogous to the Autonomous Entity pattern (see garywu/_readme/articles/autonomous-entity-pattern): each entity has clear boundaries, a defined interface, and delegates work downward rather than absorbing it upward.
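One way to picture the model in code is as a typed async function from input to output. The sketch below is illustrative only: the `Capability` type, the toy `wordCount` capability, and both callers are hypothetical names that do not appear in the Video Factory codebase.

```typescript
// A capability modeled as a plain async function from JSON-like input to output.
// `Capability`, `wordCount`, and both callers are hypothetical illustrations.
type Capability<I, O> = (input: I) => Promise<O>;

interface WordCountInput { text: string }
interface WordCountOutput { words: number }

// The capability sees only its input. Nothing here names a caller or a pipeline.
const wordCount: Capability<WordCountInput, WordCountOutput> = async ({ text }) => {
  const trimmed = text.trim();
  return { words: trimmed === "" ? 0 : trimmed.split(/\s+/).length };
};

// Two unrelated callers reuse the same capability without it knowing about either.
async function videoScriptStats(script: string): Promise<string> {
  const { words } = await wordCount({ text: script });
  return `script has ${words} words`;
}

async function podcastLengthMinutes(transcript: string, wordsPerMinute = 150): Promise<number> {
  const { words } = await wordCount({ text: transcript });
  return words / wordsPerMinute;
}
```

Because `wordCount` takes everything it needs as an argument, adding a third caller requires no change to the capability itself.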


The Self-Registering Script

The deployment model for a capability is the bootstrap script. You download it, run it, and it takes care of everything else:

curl -fsSL https://cdn.example.com/caps/ffmpeg-render/install.sh | bash

The script follows a fixed lifecycle:

1. Download        — fetch the capability binary or script
2. Prerequisites   — check for ffmpeg, node, python, GPU drivers
3. Install         — install missing prerequisites
4. Start           — bind to an available port, start HTTP server
5. Register        — POST to API Mom: name, endpoint, spec, cost_model
6. Heartbeat       — POST /heartbeat to API Mom every 60s
7. Deregister      — on SIGTERM, DELETE registration from API Mom

Here is a complete example for an ffmpeg-render capability:

#!/usr/bin/env bash
# ffmpeg-render capability bootstrap
set -euo pipefail

CAPABILITY_NAME="ffmpeg-render"
API_MOM_URL="${API_MOM_URL:-https://api-mom.example.com}"
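# Random port in 8100-8200; it is not checked for availability (shuf is GNU coreutils)
PORT="${PORT:-$(shuf -i 8100-8200 -n 1)}"
# hostname -I is Linux-only; on macOS the address has to be supplied another way
ENDPOINT="http://$(hostname -I | awk '{print $1}'):${PORT}"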

# --- 1. Prerequisites ---
check_prereqs() {
  if ! command -v ffmpeg &>/dev/null; then
    echo "Installing ffmpeg..."
    apt-get install -y ffmpeg 2>/dev/null || brew install ffmpeg 2>/dev/null
  fi
  if ! command -v node &>/dev/null; then
    echo "node required — install from https://nodejs.org" && exit 1
  fi
}

# --- 2. Start HTTP server ---
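start_server() {
  node "$(dirname "$0")/server.js" --port "$PORT" &
  SERVER_PID=$!
  # Wait up to 10 seconds for readiness; give up if the server never answers
  for i in $(seq 1 10); do
    curl -sf "http://localhost:${PORT}/health" &>/dev/null && return 0
    sleep 1
  done
  echo "server did not become healthy on port ${PORT}" >&2
  exit 1
}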

# --- 3. Register with API Mom ---
register() {
  curl -sf -X POST "${API_MOM_URL}/v1/capabilities" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer ${API_MOM_TOKEN}" \
    -d "$(cat <<JSON
{
  "name": "${CAPABILITY_NAME}",
  "version": "1.2.0",
  "endpoint": "${ENDPOINT}",
  "spec_url": "${ENDPOINT}/spec",
  "cost_model": {
    "type": "per_second",
    "rate_usd": 0.0,
    "notes": "local GPU, zero marginal cost"
  },
  "tags": ["video", "ffmpeg", "gpu"],
  "health_url": "${ENDPOINT}/health"
}
JSON
)"
  echo "Registered ${CAPABILITY_NAME} at ${ENDPOINT}"
}

# --- 4. Heartbeat loop ---
heartbeat_loop() {
  while true; do
    sleep 60
    curl -sf -X POST "${API_MOM_URL}/v1/capabilities/${CAPABILITY_NAME}/heartbeat" \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer ${API_MOM_TOKEN}" \
      -d "{\"endpoint\": \"${ENDPOINT}\"}" || true
  done
}

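# --- 5. Deregister on exit ---
deregister() {
  echo "Deregistering ${CAPABILITY_NAME}..."
  curl -sf -X DELETE "${API_MOM_URL}/v1/capabilities/${CAPABILITY_NAME}" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer ${API_MOM_TOKEN}" \
    -d "{\"endpoint\": \"${ENDPOINT}\"}" || true
  # Stop the heartbeat loop and the server so nothing is left orphaned
  [ -n "${HEARTBEAT_PID:-}" ] && kill "$HEARTBEAT_PID" 2>/dev/null || true
  [ -n "${SERVER_PID:-}" ] && kill "$SERVER_PID" 2>/dev/null || true
}

# Clean up on every exit path, including the server dying on its own
trap deregister EXIT
trap 'exit 143' SIGTERM
trap 'exit 130' SIGINT

check_prereqs
start_server
register
heartbeat_loop &
HEARTBEAT_PID=$!
wait "$SERVER_PID"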

The registration payload tells API Mom everything it needs to route requests. The cost_model is the key field — it tells the router whether to prefer this instance over a cloud alternative.
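On the receiving side, the register, heartbeat, and deregister calls imply some kind of registry with an expiry rule. A minimal in-memory sketch follows. The field names mirror the registration payload, but the `CapabilityRegistry` class, its methods, and the rule that an instance is dropped after three missed heartbeats are assumptions made for illustration, not a description of API Mom's actual implementation.

```typescript
// In-memory sketch of a capability registry. The class, its methods, and the
// three-missed-heartbeats rule are assumptions, not API Mom's real behavior.
interface CostModel { type: "per_second" | "per_call" | "free"; rate_usd: number }

interface Registration {
  name: string;
  version: string;
  endpoint: string;
  cost_model: CostModel;
}

const HEARTBEAT_INTERVAL_MS = 60000; // matches the 60s heartbeat in the bootstrap script
const MAX_MISSED_HEARTBEATS = 3;     // assumed tolerance before an instance counts as dead

class CapabilityRegistry {
  // Keyed by "name|endpoint" so several machines can serve the same capability.
  private entries = new Map<string, { reg: Registration; lastSeenMs: number }>();

  private key(name: string, endpoint: string): string {
    return `${name}|${endpoint}`;
  }

  register(reg: Registration, nowMs: number): void {
    this.entries.set(this.key(reg.name, reg.endpoint), { reg, lastSeenMs: nowMs });
  }

  // Returns false when the instance is unknown, signalling that it should re-register.
  heartbeat(name: string, endpoint: string, nowMs: number): boolean {
    const entry = this.entries.get(this.key(name, endpoint));
    if (!entry) return false;
    entry.lastSeenMs = nowMs;
    return true;
  }

  deregister(name: string, endpoint: string): void {
    this.entries.delete(this.key(name, endpoint));
  }

  // Instances of `name` whose last heartbeat is recent enough to route to.
  healthy(name: string, nowMs: number): Registration[] {
    const cutoff = nowMs - HEARTBEAT_INTERVAL_MS * MAX_MISSED_HEARTBEATS;
    return Array.from(this.entries.values())
      .filter((e) => e.reg.name === name && e.lastSeenMs >= cutoff)
      .map((e) => e.reg);
  }
}
```

Passing the clock in as `nowMs` keeps the sketch deterministic; a real service would presumably read the time itself.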


The Decomposition Process

Video Factory’s render-server/server.ts is a useful case study because its boundaries are already visible — they just haven’t been enforced. Every exec('ffmpeg ...') call is a natural capability boundary.

Step 1: Identify the natural boundaries

Read the monolith and mark each logical operation:

render-server/server.ts (770 lines)
  ├── /search-stock       → stock-search capability
  ├── /render-sequence    → ffmpeg-concat capability
  ├── /text-to-speech     → tts-elevenlabs capability
  ├── /mix-audio          → audio-mix capability  (includes loudnorm)
  ├── /upload-asset       → media-store-upload capability
  ├── /thumbnail          → ffmpeg-thumbnail capability
  └── /transcode-720p     → ffmpeg-transcode capability

Each route becomes a capability. The orchestration code — the logic that calls routes in sequence — becomes a Scram-Jet pipeline definition, not a capability.

Step 2: Extract stateless functions

A capability’s handler should behave like a pure function of its request: it holds nothing between calls. All state (input files, output files, intermediate buffers) lives in Media Store (R2 or NFS), not in the capability process.

Before (monolith, stateful):

// render-server/server.ts — maintains in-memory job map
const jobs = new Map<string, JobState>()

app.post('/render-sequence', async (req, res) => {
  const jobId = uuid()
  jobs.set(jobId, { status: 'running', inputPath: req.body.inputPath })

  const result = await runFfmpeg(req.body)
  jobs.get(jobId)!.status = 'done'
  jobs.get(jobId)!.outputPath = result.outputPath

  res.json({ jobId, outputPath: result.outputPath })
})

After (capability, stateless):

// capabilities/ffmpeg-concat/server.ts — ~80 lines total
app.post('/exec', async (req, res) => {
  const start = Date.now() // used for duration_ms in the response
  const { input_clips, output_path, options } = ExecSchema.parse(req.body)

  // Read inputs from Media Store — capability doesn't own these files
  const localClips = await Promise.all(
    input_clips.map(clip => mediaStore.download(clip.media_store_path))
  )

  const outputLocalPath = await ffmpegConcat(localClips, options)

  // Write output to Media Store — caller owns the destination path
  await mediaStore.upload(outputLocalPath, output_path)

  res.json({ output_path, duration_ms: Date.now() - start })
})

Step 3: Add Media Store I/O

Every capability reads from and writes to Media Store. It never touches local disk permanently. This is what makes a capability machine-portable — there is no implicit state on the filesystem that would break if the process moved.
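To make the portability argument concrete, here is a small sketch of a handler that touches nothing but the store it is handed. The `MediaStore` interface below is hypothetical: the handler shown earlier works with local file paths, while this version passes bytes so that it runs without a filesystem. `InMemoryMediaStore`, `concatBytes`, and `execConcat` are likewise illustrative stand-ins.

```typescript
// Hypothetical MediaStore interface. It passes bytes rather than local file
// paths so the sketch can run without a filesystem or a real object store.
interface MediaStore {
  download(path: string): Promise<Uint8Array>;
  upload(data: Uint8Array, path: string): Promise<void>;
}

class InMemoryMediaStore implements MediaStore {
  private objects = new Map<string, Uint8Array>();

  async download(path: string): Promise<Uint8Array> {
    const data = this.objects.get(path);
    if (!data) throw new Error(`not found: ${path}`);
    return data;
  }

  async upload(data: Uint8Array, path: string): Promise<void> {
    this.objects.set(path, data);
  }
}

// Stand-in for a real operation such as an ffmpeg concat: joins inputs byte-wise.
function concatBytes(parts: Uint8Array[]): Uint8Array {
  const out = new Uint8Array(parts.reduce((n, p) => n + p.length, 0));
  let offset = 0;
  for (const p of parts) {
    out.set(p, offset);
    offset += p.length;
  }
  return out;
}

// The handler keeps no state between calls: everything it needs arrives in the
// request, and everything it produces is written to the caller's output path.
async function execConcat(
  store: MediaStore,
  req: { input_paths: string[]; output_path: string },
): Promise<{ output_path: string; bytes: number }> {
  const inputs = await Promise.all(req.input_paths.map((p) => store.download(p)));
  const result = concatBytes(inputs);
  await store.upload(result, req.output_path);
  return { output_path: req.output_path, bytes: result.length };
}
```

Swapping `InMemoryMediaStore` for an R2-backed or NFS-backed implementation would not change `execConcat`, which is the property the step is after.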

Step 4: Add self-registration

Wire in the bootstrap pattern above. The capability should register on startup and deregister cleanly. That is the entire deployment contract.

The result: a capability that has never heard of Video Factory and never will. It processes ffmpeg concat operations. That is all it knows.


The Capability Contract

Every capability must implement exactly three endpoints:

POST /exec     - accept input JSON, return output JSON or error
GET  /health   - return current status and basic capabilities info
GET  /spec     - return full input/output schema (JSON Schema)

// Minimal TypeScript implementation of the contract
import { z } from 'zod'
import express from 'express'

const SpecSchema = z.object({
  name: z.string(),
  version: z.string(),
  description: z.string(),
  input: z.record(z.unknown()),   // JSON Schema object
  output: z.record(z.unknown()),  // JSON Schema object
  cost_model: z.object({
    type: z.enum(['per_second', 'per_call', 'free']),
    rate_usd: z.number(),
  }),
})

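const app = express()
app.use(express.json()) // without this, req.body is undefined

// Every capability: POST /exec
app.post('/exec', async (req, res) => {
  try {
    const input = InputSchema.parse(req.body)
    const output = await run(input) // named `run` to avoid shadowing Node's global `process`
    res.json({ ok: true, output })
  } catch (err) {
    // Invalid input is the caller's error (422); anything else is ours (500)
    const status = err instanceof z.ZodError ? 422 : 500
    res.status(status).json({ ok: false, error: String(err) })
  }
})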

// Every capability: GET /health
app.get('/health', (_req, res) => {
  res.json({
    status: 'ok',
    name: CAPABILITY_NAME,
    version: CAPABILITY_VERSION,
    uptime_ms: Date.now() - START_TIME,
  })
})

// Every capability: GET /spec
app.get('/spec', (_req, res) => {
  res.json(CAPABILITY_SPEC)
})

This contract is intentionally minimal. Capabilities do not implement authentication (API Mom handles that at the routing layer). They do not implement retry logic (Scram-Jet handles that at the pipeline layer). They do not implement logging aggregation (the infrastructure layer handles that). Each capability is responsible for exactly one thing: reliable execution of its defined operation.


Composition: Small Capabilities, Large Results

Individual capabilities compose into pipelines. The pipeline author describes the desired result; the system expands it into a tree of capability calls.

A render-video pipeline in Scram-Jet:

pipeline: render-video
version: "1.0"
steps:
  - id: search
    capability: stock-search
    input:
      query: "{{ job.topic }}"
      count: 10
      resolution: "4k"

  - id: voiceover
    capability: tts-elevenlabs
    input:
      text: "{{ job.script }}"
      voice_id: "{{ job.voice_id }}"
      output_path: "media://{{ job.id }}/voiceover.mp3"

  - id: render
    capability: ffmpeg-concat
    depends_on: [search, voiceover]
    input:
      input_clips: "{{ steps.search.output.clips }}"
      audio_track: "{{ steps.voiceover.output.output_path }}"
      output_path: "media://{{ job.id }}/raw.mp4"

  - id: mix
    capability: audio-mix
    depends_on: [render]
    input:
      video_path: "{{ steps.render.output.output_path }}"
      music_path: "{{ job.background_music_path }}"
      output_path: "media://{{ job.id }}/mixed.mp4"
      loudnorm: true

  - id: upload
    capability: media-store-upload
    depends_on: [mix]
    input:
      source_path: "{{ steps.mix.output.output_path }}"
      destination: "{{ job.upload_destination }}"
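
The `{{ ... }}` expressions in the pipeline above have to be resolved against the job and the outputs of earlier steps before a capability is called. The article does not describe how Scram-Jet does this, so the following is only a sketch of one plausible approach; `lookup`, `resolve`, and the shape of the context object are assumptions.

```typescript
// Resolve "{{ a.b.c }}" references against a context object.
// Hypothetical: Scram-Jet's real template semantics are not described in the article.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

function lookup(context: Json, path: string): Json {
  let current: Json = context;
  for (const segment of path.split(".")) {
    if (current === null || typeof current !== "object" || Array.isArray(current) || !(segment in current)) {
      throw new Error(`unresolved reference: ${path}`);
    }
    current = current[segment];
  }
  return current;
}

function resolve(value: Json, context: Json): Json {
  if (typeof value === "string") {
    // A string that is exactly one reference keeps the referenced value's type
    // (for example an array of clips); mixed strings are interpolated as text.
    const whole = value.match(/^\{\{\s*([\w.]+)\s*\}\}$/);
    if (whole) return lookup(context, whole[1]);
    return value.replace(/\{\{\s*([\w.]+)\s*\}\}/g, (_m, path: string) => String(lookup(context, path)));
  }
  if (Array.isArray(value)) return value.map((v) => resolve(v, context));
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const key of Object.keys(value)) out[key] = resolve(value[key], context);
    return out;
  }
  return value;
}
```

Throwing on an unresolved reference is a deliberate choice in this sketch: a misspelled step id fails before any capability is invoked rather than producing a literal `{{ ... }}` path in Media Store.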

The user submits one job. Scram-Jet resolves the dependency graph, dispatches each step to the appropriate capability instance (chosen by API Mom based on cost and availability), and assembles the result. No individual capability knows it is part of this pipeline. Each one just processes its input and returns its output.

              render-video (pipeline)
                     │
       ┌─────────────┼─────────────────┐
       ▼             ▼                 ▼
  stock-search  tts-elevenlabs    (wait for both)
                                       │
                                       ▼
                                 ffmpeg-concat
                                       │
                                       ▼
                                   audio-mix
                                (includes loudnorm)
                                       │
                                       ▼
                               media-store-upload

This is the Adaptive Controller pattern from garywu/_readme/articles/cloudflare-durable-objects-patterns applied to compute: the orchestrator adapts to what is available; the workers just execute.
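Resolving the dependency graph amounts to ordering steps so that each one runs only after everything in its `depends_on` list has finished. The sketch below groups steps into waves that could be dispatched in parallel. It is a generic topological layering, not Scram-Jet's actual scheduler, and `Step` and `executionWaves` are hypothetical names.

```typescript
// Group pipeline steps into waves: every step in a wave has all of its
// dependencies satisfied by earlier waves, so a wave can run in parallel.
// Hypothetical sketch; the article does not describe Scram-Jet's scheduler.
interface Step { id: string; depends_on?: string[] }

function executionWaves(steps: Step[]): string[][] {
  const ids = new Set(steps.map((s) => s.id));
  for (const s of steps) {
    for (const d of s.depends_on ?? []) {
      if (!ids.has(d)) throw new Error(`${s.id} depends on unknown step ${d}`);
    }
  }

  const done = new Set<string>();
  let remaining = steps.slice();
  const waves: string[][] = [];

  while (remaining.length > 0) {
    const ready = remaining.filter((s) => (s.depends_on ?? []).every((d) => done.has(d)));
    // No step is runnable but some remain: the graph contains a cycle.
    if (ready.length === 0) throw new Error("dependency cycle");
    waves.push(ready.map((s) => s.id));
    ready.forEach((s) => done.add(s.id));
    remaining = remaining.filter((s) => !done.has(s.id));
  }
  return waves;
}

const renderVideo: Step[] = [
  { id: "search" },
  { id: "voiceover" },
  { id: "render", depends_on: ["search", "voiceover"] },
  { id: "mix", depends_on: ["render"] },
  { id: "upload", depends_on: ["mix"] },
];
// executionWaves(renderVideo) → [["search", "voiceover"], ["render"], ["mix"], ["upload"]]
```

The first wave contains both `search` and `voiceover`, which matches the "wait for both" join in the diagram.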


The Economics

The same capability can run in three places with three cost profiles:

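| Location                          | Provider    | Cost          | Latency | Availability    |
|-----------------------------------|-------------|---------------|---------|-----------------|
| Your GPU workstation              | Local       | $0.00/hr      | ~2s     | When powered on |
| Rented GPU (Lambda Labs, Vast.ai) | Cloud VM    | ~$0.50/hr     | ~3s     | On-demand       |
| Shotstack / Creatomate            | Managed API | ~$0.04/render | ~15s    | Always          |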

API Mom maintains the capability registry. When a render-video pipeline needs ffmpeg-concat, API Mom looks at:

  1. What instances are registered and healthy? (checked via heartbeat)
  2. What is each instance’s declared cost model?
  3. What is the current queue depth on each instance?
  4. What is the pipeline’s declared priority and budget?

A batch job that can wait 10 minutes routes to the $0.04 Shotstack API. A rush job that needs a result in 30 seconds routes to the rented GPU. A high-throughput internal job routes to the local workstation at zero cost.
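The approximate prices in the table also imply a break-even point between paying per render and renting by the hour. The arithmetic below is a rough sketch: it assumes the rented GPU is billed for the full hour regardless of utilization and ignores latency and throughput limits. The function names are illustrative.

```typescript
// Break-even between an hourly-billed GPU and a per-render managed API,
// using the approximate prices from the table. Assumes whole-hour billing.
function breakEvenRendersPerHour(hourlyRateUsd: number, perRenderUsd: number): number {
  if (perRenderUsd <= 0) throw new Error("per-render price must be positive");
  return hourlyRateUsd / perRenderUsd;
}

function cheaperOption(
  rendersPerHour: number,
  hourlyRateUsd: number,
  perRenderUsd: number,
): "hourly" | "per_render" {
  // Compare what the hour's renders would cost on the per-render API
  // against the flat hourly rate.
  return rendersPerHour * perRenderUsd > hourlyRateUsd ? "hourly" : "per_render";
}
```

At roughly $0.50/hr against $0.04/render, the crossover is about 12.5 renders per hour: below that the managed API is cheaper, above it the rented GPU is.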

The pipeline author wrote none of this logic. They declared a capability name. API Mom handled the economics.

Pipeline declares:   capability: ffmpeg-concat
                     priority: standard
                     budget_ceiling_usd: 0.10

API Mom resolves:    local GPU registered? → yes, healthy
                     queue depth: 0
                     cost: $0.00
                     → route to local GPU

The pipeline is portable. If the local GPU goes offline, the same pipeline routes to the cloud alternative without any configuration change. The pipeline author’s code does not change. The pipeline YAML does not change. Only the routing table changes — and that is managed centrally, not distributed across every pipeline definition.
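The failover behavior described here falls out of a simple selection rule. The sketch below picks the cheapest healthy instance within the budget ceiling and breaks ties on queue depth. API Mom's real algorithm is not specified in this article, so `Instance`, `route`, and the `estimatedCostUsd` field are assumptions, and pipeline priority is left out for brevity.

```typescript
// Hypothetical routing sketch: cheapest healthy instance within budget,
// ties broken by queue depth. Not API Mom's actual algorithm.
interface Instance {
  endpoint: string;
  healthy: boolean;
  estimatedCostUsd: number; // assumed: expected cost of one call on this instance
  queueDepth: number;
}

function route(instances: Instance[], budgetCeilingUsd: number): Instance | undefined {
  const candidates = instances.filter(
    (i) => i.healthy && i.estimatedCostUsd <= budgetCeilingUsd,
  );
  candidates.sort(
    (a, b) => a.estimatedCostUsd - b.estimatedCostUsd || a.queueDepth - b.queueDepth,
  );
  // undefined means nothing is both healthy and affordable right now.
  return candidates[0];
}

const localGpu: Instance = { endpoint: "http://10.0.0.5:8123", healthy: true, estimatedCostUsd: 0, queueDepth: 0 };
const managedApi: Instance = { endpoint: "https://managed.example.com", healthy: true, estimatedCostUsd: 0.04, queueDepth: 3 };
```

With both instances healthy and a $0.10 ceiling, `route` returns the local GPU. Marking the local GPU unhealthy makes the same call return the managed API, with no change on the pipeline side.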


Practical Example: Video Factory Before and After

Before: One 770-line monolith

render-server/
  server.ts          770 lines
  Dockerfile         12 lines
  package.json       24 dependencies

Everything couples together. The ffmpeg binary path is hardcoded. The S3 bucket name is hardcoded. The TTS API key is loaded once at startup. Every endpoint shares the same Express instance, the same process, the same failure domain.

After: Seven leaf capabilities + one pipeline definition

capabilities/
  stock-search/
    server.ts        85 lines    (HTTP + search API integration)
    install.sh       60 lines    (bootstrap + registration)
    spec.json        40 lines    (input/output schema)

  ffmpeg-concat/
    server.ts        90 lines    (HTTP + ffmpeg execution)
    install.sh       65 lines
    spec.json        45 lines

  tts-elevenlabs/
    server.ts        70 lines    (HTTP + ElevenLabs API)
    install.sh       55 lines
    spec.json        35 lines

  audio-mix/
    server.ts        95 lines    (HTTP + ffmpeg audio + loudnorm)
    install.sh       60 lines
    spec.json        40 lines

  ffmpeg-thumbnail/
    server.ts        60 lines
    install.sh       55 lines
    spec.json        30 lines

  ffmpeg-transcode/
    server.ts        75 lines
    install.sh       60 lines
    spec.json        40 lines

  media-store-upload/
    server.ts        55 lines    (HTTP + S3 upload)
    install.sh       50 lines
    spec.json        25 lines

pipelines/
  render-video.yaml  45 lines    (Scram-Jet pipeline definition)

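Total: ~530 lines across 7 capability servers, replacing 770 lines in one. The line count is similar. The coupling is not. Each capability runs in its own process and its own failure domain, registers itself on startup, and keeps its state in Media Store.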

The pipeline definition is 45 lines of YAML. It replaces the implicit orchestration that was previously woven through the monolith’s route handlers and shared state.


Anti-Patterns

1. Capabilities too granular

Loudnorm is a single ffmpeg filter pass. Making it a standalone capability creates a network round-trip — with Media Store upload and download — for a 2-second CPU operation. Bundle loudnorm with audio-mix. The rule: if a step always precedes or always follows another step with no fan-out between them, they belong in the same capability.

# Wrong — unnecessary network hop
audio-mix → upload → loudnorm-download → loudnorm → upload

# Right — single capability, single Media Store write
audio-mix (includes loudnorm) → upload
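
The "no fan-out between them" part of the rule can be checked mechanically for a single pipeline. The sketch below lists pairs where a step's only dependency has no other dependents. It is a hypothetical helper: it finds candidates only, because whether two steps are always adjacent across every pipeline that uses them is a judgment the code cannot make.

```typescript
// Find adjacent steps with no fan-out or fan-in between them: B's only
// dependency is A, and A's only dependent is B. These are merge candidates
// within one pipeline, not proof that the steps always travel together.
interface PipelineStep { id: string; depends_on?: string[] }

function fusablePairs(steps: PipelineStep[]): [string, string][] {
  // Invert depends_on into a map from each step to the steps that consume it.
  const dependents = new Map<string, string[]>();
  for (const s of steps) {
    for (const d of s.depends_on ?? []) {
      dependents.set(d, [...(dependents.get(d) ?? []), s.id]);
    }
  }

  const pairs: [string, string][] = [];
  for (const s of steps) {
    const deps = s.depends_on ?? [];
    if (deps.length === 1 && (dependents.get(deps[0]) ?? []).length === 1) {
      pairs.push([deps[0], s.id]);
    }
  }
  return pairs;
}
```

A pair flagged here still deserves to stay split if either half is reused on its own by another pipeline.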

2. Orchestration logic inside capabilities

A capability that calls another capability is no longer a capability — it is an orchestrator wearing a capability costume. If ffmpeg-concat internally calls tts-elevenlabs to get its audio, you have recreated the monolith with extra HTTP steps. Orchestration belongs in Scram-Jet pipelines. Capabilities are leaves in the call tree, not intermediate nodes.

3. State inside capabilities

A capability that maintains a job map in memory, writes to a local database, or stores intermediate files on its own disk has introduced hidden state. When the capability process restarts, that state is gone. When the capability moves to a different machine, that state does not follow it. All state, including intermediate files, belongs in Media Store. From the caller’s perspective, the capability should be indistinguishable from a stateless function.

4. Capability-specific authentication

Each capability should not implement its own auth scheme. API Mom is the authentication boundary. When API Mom routes a request to a capability, it has already authenticated the caller. The capability trusts requests that arrive on its local network segment. Adding JWT validation to each capability adds complexity without security — the capability’s port should not be exposed to the public internet in the first place.

5. Hardcoding API Mom’s URL

Capabilities should read API_MOM_URL from the environment. This is not a minor point — it is what lets you run a local API Mom instance for testing and a production instance for deployment, without changing the capability code.
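On the TypeScript side, the same rule might look like the sketch below. The default URL mirrors the one in the bootstrap script; `CapabilityConfig`, `loadConfig`, and the decision to treat a missing token as a startup error are assumptions made for illustration.

```typescript
// Read capability configuration from an environment-like record.
// The default URL mirrors the bootstrap script; all names here are hypothetical.
interface CapabilityConfig { apiMomUrl: string; apiMomToken: string }

function loadConfig(env: Record<string, string | undefined>): CapabilityConfig {
  const apiMomUrl = env.API_MOM_URL ?? "https://api-mom.example.com";
  const apiMomToken = env.API_MOM_TOKEN;

  // Fail at startup rather than at the first registration attempt.
  if (!apiMomToken) throw new Error("API_MOM_TOKEN is required");
  if (!/^https?:\/\//.test(apiMomUrl)) throw new Error(`API_MOM_URL is not an http(s) URL: ${apiMomUrl}`);

  // Strip trailing slashes so paths can be appended without doubling them.
  return { apiMomUrl: apiMomUrl.replace(/\/+$/, ""), apiMomToken };
}
```

In a Node.js service this would typically be called once as `loadConfig(process.env)`; a test can pass `{ API_MOM_URL: "http://localhost:8787", API_MOM_TOKEN: "test" }` to point the capability at a local instance.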

