A capability is a contract: defined inputs, defined outputs, a single responsibility, and the ability to register itself with any orchestrator. This is how you turn a 770-line server into seven composable pieces that can run on any machine, anywhere, at whatever price point a given job calls for.
The Problem
Video Factory is a Node.js service that runs on a single machine. It does everything:
- Searches stock footage libraries via API
- Renders video sequences with ffmpeg
- Synthesizes speech with a TTS provider
- Mixes audio tracks
- Normalizes loudness to broadcast spec (EBU R128)
- Uploads the finished file to S3
Its render-server/server.ts is 770 lines long. It knows about ffmpeg binary paths. It knows about S3 credentials. It knows about the TTS endpoint. It knows about the stock footage API key. It is a monolith pretending to be a service.
The failure mode is predictable: the machine that runs it gets reassigned, rebooted, or simply dies. The service is gone. There is no fallback. Every pipeline that depended on it stops.
The deeper problem is coupling. Video Factory is not just coupled to one machine — it is coupled to one identity. The pipeline author hardcodes http://studio-machine:3000 and moves on. That URL is now a single point of failure for every workflow that touches it.
The solution is not redundancy. Adding a second Video Factory instance doubles the operational cost while keeping all the coupling. The solution is decomposition — breaking the monolith into capabilities: small, single-purpose HTTP services that know nothing about each other and register themselves with a central router when they start.
The Capability as Primitive
A capability is the compute equivalent of a Unix command. Like grep or ffmpeg, it:
- Accepts defined inputs — a JSON payload or file paths, specified in a schema
- Produces defined outputs — a JSON result or file paths, specified in a schema
- Does one thing — it does not orchestrate, does not store state, does not make decisions about what to do next
- Is addressable — any caller with the right URL can invoke it
The Unix philosophy is “Do one thing and do it well”. Docker’s container model is “one process per container”. Capabilities apply the same principle to HTTP services.
The critical property that distinguishes a capability from a microservice is ignorance. A capability does not know:
- Who is calling it
- What called it before
- What will be called after
- What “project” or “pipeline” it is part of
- Where its inputs came from (other than the Media Store path)
This ignorance is not a limitation. It is what makes composition possible. A capability that knows nothing about Video Factory can be used by Video Factory, by Shorts Factory, by Podcast Factory, and by any future pipeline that needs the same operation.
```text
+---------------------------------------------------+
|               The Capability Model                |
|                                                   |
|  Input (JSON) ──► [capability] ──► Output (JSON)  |
|                                                   |
|  The capability knows nothing about               |
|  what surrounds it. It processes.                 |
|  That is all.                                     |
+---------------------------------------------------+
```
This is directly analogous to the Autonomous Entity pattern (see garywu/_readme/articles/autonomous-entity-pattern): each entity has clear boundaries, a defined interface, and delegates work downward rather than absorbing it upward.
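The capability model can be sketched as a type. This is a minimal illustration, not Video Factory code: `Capability`, `wordCount`, `estimateSpeech` and `runScript` are hypothetical names, and the 150 words-per-minute figure is an illustrative constant. Neither capability refers to the other; only the orchestrating function knows the order.

```typescript
// A capability reduced to its contract: a name plus an async function from
// input JSON to output JSON. It holds no reference to callers or to other capabilities.
interface Capability<I, O> {
  name: string
  exec(input: I): Promise<O>
}

// Two toy capabilities (hypothetical, for illustration only).
const wordCount: Capability<{ text: string }, { words: number }> = {
  name: 'word-count',
  async exec({ text }) {
    const trimmed = text.trim()
    return { words: trimmed === '' ? 0 : trimmed.split(/\s+/).length }
  },
}

const estimateSpeech: Capability<{ words: number }, { seconds: number }> = {
  name: 'estimate-speech',
  // Assumes ~150 words per minute of narration (illustrative constant).
  async exec({ words }) {
    return { seconds: Math.round((words / 150) * 60) }
  },
}

// The orchestrator, not either capability, decides the order and wires
// one capability's output into the next one's input.
async function runScript(text: string): Promise<number> {
  const { words } = await wordCount.exec({ text })
  const { seconds } = await estimateSpeech.exec({ words })
  return seconds
}
```

Swapping `estimateSpeech` for a different implementation changes nothing in `wordCount`, which is the composition property described above.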
The Self-Registering Script
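The deployment model for a capability is the bootstrap script. You download it, run it, and it takes care of everything else. The script relies on bash features such as `set -o pipefail`, so pipe it to bash rather than sh:
```bash
curl -fsSL https://cdn.example.com/caps/ffmpeg-render/install.sh | bash
```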
The script follows a fixed lifecycle:
1. Download — fetch the capability binary or script
2. Prerequisites — check for ffmpeg, node, python, GPU drivers
3. Install — install missing prerequisites
4. Start — bind to an available port, start HTTP server
5. Register — POST to API Mom: name, endpoint, spec, cost_model
6. Heartbeat — POST /heartbeat to API Mom every 60s
7. Deregister — on SIGTERM, DELETE registration from API Mom
Here is a complete example for an ffmpeg-render capability:
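```bash
#!/usr/bin/env bash
# ffmpeg-render capability bootstrap (Linux: relies on GNU shuf and hostname -I)
set -euo pipefail

CAPABILITY_NAME="ffmpeg-render"
CAPABILITY_VERSION="1.2.0"
API_MOM_URL="${API_MOM_URL:-https://api-mom.example.com}"
: "${API_MOM_TOKEN:?API_MOM_TOKEN must be set}"  # fail fast instead of tripping set -u mid-run
# Random port in 8100-8200; a collision is possible, which the readiness check catches
PORT="${PORT:-$(shuf -i 8100-8200 -n 1)}"
ENDPOINT="http://$(hostname -I | awk '{print $1}'):${PORT}"
# Under `curl | bash`, $0 is "bash", so downloaded files need an explicit home
INSTALL_DIR="${INSTALL_DIR:-$HOME/.caps/${CAPABILITY_NAME}}"
SERVER_PID=""
HEARTBEAT_PID=""

# --- 1. Download ---
download() {
  mkdir -p "$INSTALL_DIR"
  # Hypothetical artifact URL, laid out like the install.sh URL above
  curl -fsSL "https://cdn.example.com/caps/${CAPABILITY_NAME}/server.js" \
    -o "${INSTALL_DIR}/server.js"
}

# --- 2-3. Prerequisites and install ---
check_prereqs() {
  if ! command -v ffmpeg >/dev/null 2>&1; then
    echo "Installing ffmpeg..."
    # </dev/null keeps the installer from reading the rest of a piped script
    apt-get install -y ffmpeg </dev/null 2>/dev/null \
      || brew install ffmpeg </dev/null 2>/dev/null
  fi
  if ! command -v node >/dev/null 2>&1; then
    echo "node required: install from https://nodejs.org" >&2
    exit 1
  fi
}

# --- 4. Start HTTP server ---
start_server() {
  node "${INSTALL_DIR}/server.js" --port "$PORT" &
  SERVER_PID=$!
  # Wait up to 10s for readiness; give up rather than register a dead endpoint
  for _ in $(seq 1 10); do
    if curl -sf "http://localhost:${PORT}/health" >/dev/null 2>&1; then
      return 0
    fi
    sleep 1
  done
  echo "server did not become healthy on port ${PORT}" >&2
  exit 1
}

# --- 5. Register with API Mom ---
register() {
  curl -sf -X POST "${API_MOM_URL}/v1/capabilities" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer ${API_MOM_TOKEN}" \
    -d "$(cat <<JSON
{
  "name": "${CAPABILITY_NAME}",
  "version": "${CAPABILITY_VERSION}",
  "endpoint": "${ENDPOINT}",
  "spec_url": "${ENDPOINT}/spec",
  "cost_model": {
    "type": "per_second",
    "rate_usd": 0.0,
    "notes": "local GPU, zero marginal cost"
  },
  "tags": ["video", "ffmpeg", "gpu"],
  "health_url": "${ENDPOINT}/health"
}
JSON
)"
  echo "Registered ${CAPABILITY_NAME} at ${ENDPOINT}"
}

# --- 6. Heartbeat loop ---
heartbeat_loop() {
  while true; do
    sleep 60
    curl -sf -X POST "${API_MOM_URL}/v1/capabilities/${CAPABILITY_NAME}/heartbeat" \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer ${API_MOM_TOKEN}" \
      -d "{\"endpoint\": \"${ENDPOINT}\"}" || true
  done
}

# --- 7. Deregister on exit ---
cleanup() {
  echo "Deregistering ${CAPABILITY_NAME}..."
  curl -sf -X DELETE "${API_MOM_URL}/v1/capabilities/${CAPABILITY_NAME}" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer ${API_MOM_TOKEN}" \
    -d "{\"endpoint\": \"${ENDPOINT}\"}" || true
  # Stop the heartbeat too, or it keeps announcing a deregistered endpoint
  if [ -n "$HEARTBEAT_PID" ]; then kill "$HEARTBEAT_PID" 2>/dev/null || true; fi
  if [ -n "$SERVER_PID" ]; then kill "$SERVER_PID" 2>/dev/null || true; fi
}
# EXIT covers normal exit and set -e failures; TERM/INT are turned into an exit
trap cleanup EXIT
trap 'exit 0' TERM INT

download
check_prereqs
start_server
register
heartbeat_loop &
HEARTBEAT_PID=$!
wait "$SERVER_PID"
```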
The registration payload tells API Mom everything it needs to route requests. The cost_model is the key field — it tells the router whether to prefer this instance over a cloud alternative.
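To compare instances, the router has to turn each declared cost_model into an estimate for one call. A minimal sketch, assuming `rate_usd` is priced per unit of the declared type (per second of run time for `per_second`, per invocation for `per_call`); `estimateCostUsd` and the example rates are illustrative, taken from the approximate prices in the economics table later in this article.

```typescript
type CostModel = {
  type: 'per_second' | 'per_call' | 'free'
  rate_usd: number
}

// Estimated USD cost of one call that keeps the instance busy for `seconds`.
function estimateCostUsd(model: CostModel, seconds: number): number {
  switch (model.type) {
    case 'per_second':
      return model.rate_usd * seconds
    case 'per_call':
      return model.rate_usd
    case 'free':
      return 0
  }
}

// Illustrative instances (approximate prices, not real quotes)
const localGpu: CostModel = { type: 'free', rate_usd: 0 }
const rentedGpu: CostModel = { type: 'per_second', rate_usd: 0.5 / 3600 } // ~$0.50/hr
const shotstack: CostModel = { type: 'per_call', rate_usd: 0.04 } // ~$0.04/render
```

Under these assumptions a 36-second render costs about $0.005 on the rented GPU versus $0.04 on the per-render API. The estimate ignores idle time, which an hourly VM still bills for.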
The Decomposition Process
Video Factory’s render-server/server.ts is a useful case study because its boundaries are already visible — they just haven’t been enforced. Every exec('ffmpeg ...') call is a natural capability boundary.
Step 1: Identify the natural boundaries
Read the monolith and mark each logical operation:
```text
render-server/server.ts (770 lines)
├── /search-stock → stock-search capability
├── /render-sequence → ffmpeg-concat capability
├── /text-to-speech → tts-elevenlabs capability
├── /mix-audio → audio-mix capability (includes loudnorm)
├── /upload-asset → media-store-upload capability
├── /thumbnail → ffmpeg-thumbnail capability
└── /transcode-720p → ffmpeg-transcode capability
```
Each route becomes a capability. The orchestration code — the logic that calls routes in sequence — becomes a Scram-Jet pipeline definition, not a capability.
Step 2: Extract stateless functions
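A capability’s handler should behave like a pure function of its request: everything it needs arrives in the request, and nothing it keeps survives the response. All state (input files, output files, intermediate buffers) lives in Media Store (R2 or NFS), not in the capability process.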
Before (monolith, stateful):
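```typescript
// render-server/server.ts: maintains an in-memory job map
import express from 'express'
import { randomUUID } from 'node:crypto'

// JobState and runFfmpeg are defined elsewhere in server.ts
const app = express()
app.use(express.json())
const jobs = new Map<string, JobState>()

app.post('/render-sequence', async (req, res) => {
  const jobId = randomUUID()
  jobs.set(jobId, { status: 'running', inputPath: req.body.inputPath })
  const result = await runFfmpeg(req.body)
  // This state lives in the process: lost on restart, invisible to other machines
  const job = jobs.get(jobId)!
  job.status = 'done'
  job.outputPath = result.outputPath
  res.json({ jobId, outputPath: result.outputPath })
})
```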
After (capability, stateless):
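```typescript
// capabilities/ffmpeg-concat/server.ts: ~90 lines total
// ExecSchema, mediaStore and ffmpegConcat are defined elsewhere in this file
app.post('/exec', async (req, res) => {
  const start = Date.now()
  const parsed = ExecSchema.safeParse(req.body)
  if (!parsed.success) {
    res.status(422).json({ ok: false, error: parsed.error.message })
    return
  }
  const { input_clips, output_path, options } = parsed.data
  try {
    // Read inputs from Media Store: the capability doesn't own these files
    const localClips = await Promise.all(
      input_clips.map(clip => mediaStore.download(clip.media_store_path))
    )
    const outputLocalPath = await ffmpegConcat(localClips, options)
    // Write output to Media Store: the caller owns the destination path
    await mediaStore.upload(outputLocalPath, output_path)
    res.json({ ok: true, output: { output_path, duration_ms: Date.now() - start } })
  } catch (err) {
    res.status(500).json({ ok: false, error: String(err) })
  }
})
```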
Step 3: Add Media Store I/O
Every capability reads its inputs from, and writes its outputs to, Media Store. Local disk is scratch space for a single request and is cleaned up before the response; nothing persists there. This is what makes a capability machine-portable: there is no implicit state on the filesystem that would break if the process moved.
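One way to keep local disk temporary is to give each request its own scratch directory and delete it in a `finally` block. A minimal sketch using Node's `fs/promises`; `withScratchDir` is a hypothetical helper, and in the ffmpeg-concat handler the Media Store downloads and the ffmpeg output would be written inside `dir` before the result is uploaded.

```typescript
import { mkdtemp, rm } from 'node:fs/promises'
import { tmpdir } from 'node:os'
import { join } from 'node:path'

// Runs `work` with a fresh per-request scratch directory and removes the
// directory afterwards, whether `work` resolves or throws. Anything that must
// outlive the request has to be uploaded to Media Store inside `work`.
async function withScratchDir<T>(work: (dir: string) => Promise<T>): Promise<T> {
  const dir = await mkdtemp(join(tmpdir(), 'capability-'))
  try {
    return await work(dir)
  } finally {
    await rm(dir, { recursive: true, force: true })
  }
}
```

Because cleanup sits in `finally`, a failed ffmpeg run leaves nothing behind either, so a restarted or relocated process starts from the same empty state.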
Step 4: Add self-registration
Wire in the bootstrap pattern above. The capability should register on startup and deregister cleanly. That is the entire deployment contract.
The result: a capability that has never heard of Video Factory and never will. It processes ffmpeg concat operations. That is all it knows.
The Capability Contract
Every capability must implement exactly three endpoints:
POST /exec — accept input JSON, return output JSON or error
GET /health — return current status and basic capabilities info
GET /spec — return full input/output schema (JSON Schema)
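```typescript
// Minimal TypeScript implementation of the contract
import express from 'express'
import { z } from 'zod'

const CAPABILITY_NAME = 'example-upcase' // placeholder identity for this sketch
const CAPABILITY_VERSION = '0.1.0'
const START_TIME = Date.now()

const SpecSchema = z.object({
  name: z.string(),
  version: z.string(),
  description: z.string(),
  input: z.record(z.string(), z.unknown()), // JSON Schema object
  output: z.record(z.string(), z.unknown()), // JSON Schema object
  cost_model: z.object({
    type: z.enum(['per_second', 'per_call', 'free']),
    rate_usd: z.number(),
  }),
})

// Placeholder operation; each real capability defines its own schema and handler
const InputSchema = z.object({ text: z.string() })

const CAPABILITY_SPEC = SpecSchema.parse({
  name: CAPABILITY_NAME,
  version: CAPABILITY_VERSION,
  description: 'Uppercases text (placeholder operation)',
  input: { type: 'object', properties: { text: { type: 'string' } }, required: ['text'] },
  output: { type: 'object', properties: { text: { type: 'string' } } },
  cost_model: { type: 'free', rate_usd: 0 },
})

// Named `handle` rather than `process` so it does not shadow Node's global
async function handle(input: z.infer<typeof InputSchema>) {
  return { text: input.text.toUpperCase() }
}

const app = express()
app.use(express.json()) // without this, req.body is undefined

// Every capability: POST /exec (422 for bad input, 500 for a failed run)
app.post('/exec', async (req, res) => {
  const parsed = InputSchema.safeParse(req.body)
  if (!parsed.success) {
    res.status(422).json({ ok: false, error: parsed.error.message })
    return
  }
  try {
    res.json({ ok: true, output: await handle(parsed.data) })
  } catch (err) {
    res.status(500).json({ ok: false, error: String(err) })
  }
})

// Every capability: GET /health
app.get('/health', (_req, res) => {
  res.json({
    status: 'ok',
    name: CAPABILITY_NAME,
    version: CAPABILITY_VERSION,
    uptime_ms: Date.now() - START_TIME,
  })
})

// Every capability: GET /spec
app.get('/spec', (_req, res) => {
  res.json(CAPABILITY_SPEC)
})

app.listen(Number(process.env.PORT ?? 8100))
```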
This contract is intentionally minimal. Capabilities do not implement authentication (API Mom handles that at the routing layer). They do not implement retry logic (Scram-Jet handles that at the pipeline layer). They do not implement logging aggregation (the infrastructure layer handles that). Each capability is responsible for exactly one thing: reliable execution of its defined operation.
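Retry therefore lives in the caller. Scram-Jet's actual retry policy is not described here, so the following is a generic exponential-backoff sketch; `withRetry` and its default values are hypothetical.

```typescript
// Calls `call` up to `attempts` times, waiting baseDelayMs, then 2x, 4x, ...
// between attempts. The capability stays simple; the orchestrator owns retry.
async function withRetry<T>(
  call: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await call()
    } catch (err) {
      lastError = err
      if (attempt < attempts - 1) {
        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt))
      }
    }
  }
  throw lastError
}
```

In practice an orchestrator would retry only failures that can succeed on a second attempt, such as a 5xx response or a network error, and not a 422 from input validation.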
Composition: Small Capabilities, Large Results
Individual capabilities compose into pipelines. The pipeline author describes the desired result; the system expands it into a tree of capability calls.
A render-video pipeline in Scram-Jet:
```yaml
pipeline: render-video
version: "1.0"
steps:
  - id: search
    capability: stock-search
    input:
      query: "{{ job.topic }}"
      count: 10
      resolution: "4k"
  - id: voiceover
    capability: tts-elevenlabs
    input:
      text: "{{ job.script }}"
      voice_id: "{{ job.voice_id }}"
      output_path: "media://{{ job.id }}/voiceover.mp3"
  - id: render
    capability: ffmpeg-concat
    depends_on: [search, voiceover]
    input:
      input_clips: "{{ steps.search.output.clips }}"
      audio_track: "{{ steps.voiceover.output.output_path }}"
      output_path: "media://{{ job.id }}/raw.mp4"
  - id: mix
    capability: audio-mix
    depends_on: [render]
    input:
      video_path: "{{ steps.render.output.output_path }}"
      music_path: "{{ job.background_music_path }}"
      output_path: "media://{{ job.id }}/mixed.mp4"
      loudnorm: true
  - id: upload
    capability: media-store-upload
    depends_on: [mix]
    input:
      source_path: "{{ steps.mix.output.output_path }}"
      destination: "{{ job.upload_destination }}"
```
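The `{{ … }}` references are resolved by the orchestrator before each step is dispatched. Their exact semantics are not specified here, so this is a sketch under one assumption: a value that is exactly one placeholder keeps its original type (so `clips` stays a list), while placeholders inside a longer string are interpolated as text. `resolveTemplate` and `lookup` are hypothetical names.

```typescript
type Ctx = Record<string, unknown>

// Walks a dotted path such as "steps.search.output.clips" through the context.
function lookup(ctx: Ctx, path: string): unknown {
  return path.split('.').reduce<unknown>((node, key) => {
    if (node !== null && typeof node === 'object' && key in node) {
      return (node as Record<string, unknown>)[key]
    }
    throw new Error(`unresolved reference: ${path}`)
  }, ctx)
}

const WHOLE = /^\{\{\s*([\w.]+)\s*\}\}$/
const EMBEDDED = /\{\{\s*([\w.]+)\s*\}\}/g

function resolveTemplate(value: string, ctx: Ctx): unknown {
  const whole = WHOLE.exec(value)
  if (whole) return lookup(ctx, whole[1]) // keep arrays and objects intact
  return value.replace(EMBEDDED, (_match, path: string) => String(lookup(ctx, path)))
}
```

Failing loudly on an unresolved reference turns a typo in the YAML into an error at dispatch time rather than a capability receiving an empty path.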
The user submits one job. Scram-Jet resolves the dependency graph, dispatches each step to the appropriate capability instance (chosen by API Mom based on cost and availability), and assembles the result. No individual capability knows it is part of this pipeline. Each one just processes its input and returns its output.
```text
         render-video (pipeline)
                    │
      ┌─────────────┴─────────────┐
      ▼                           ▼
stock-search               tts-elevenlabs
      │                           │
      └─────────────┬─────────────┘
                    ▼
              ffmpeg-concat
            (waits for both)
                    │
                    ▼
                audio-mix
           (includes loudnorm)
                    │
                    ▼
           media-store-upload
```
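Resolving the dependency graph amounts to a topological sort. A minimal sketch using Kahn's algorithm, grouped into waves so that steps with no dependency between them (stock-search and tts-elevenlabs) can run in parallel; `executionWaves` is a hypothetical name, and Scram-Jet's real scheduler may differ.

```typescript
interface Step {
  id: string
  depends_on?: string[]
}

// Kahn's algorithm in waves: every step in a wave has all of its dependencies
// satisfied by earlier waves, so the steps within one wave can run concurrently.
function executionWaves(steps: Step[]): string[][] {
  const remaining = new Map<string, Set<string>>()
  for (const s of steps) remaining.set(s.id, new Set(s.depends_on ?? []))
  for (const [id, deps] of remaining) {
    for (const dep of deps) {
      if (!remaining.has(dep)) throw new Error(`${id} depends on unknown step ${dep}`)
    }
  }
  const waves: string[][] = []
  while (remaining.size > 0) {
    const ready = [...remaining].filter(([, deps]) => deps.size === 0).map(([id]) => id)
    if (ready.length === 0) throw new Error('dependency cycle')
    for (const id of ready) remaining.delete(id)
    for (const deps of remaining.values()) {
      for (const id of ready) deps.delete(id)
    }
    waves.push(ready)
  }
  return waves
}
```

For render-video this yields `[["search", "voiceover"], ["render"], ["mix"], ["upload"]]`, matching the diagram.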
This is the Adaptive Controller pattern from garywu/_readme/articles/cloudflare-durable-objects-patterns applied to compute: the orchestrator adapts to what is available; the workers just execute.
The Economics
The same capability can run in three places with three cost profiles:
| Location | Provider | Cost | Latency | Availability |
|---|---|---|---|---|
| Your GPU workstation | Local | $0.00/hr | ~2s | When powered on |
| Rented GPU (Lambda Labs, Vast.ai) | Cloud VM | ~$0.50/hr | ~3s | On-demand |
| Shotstack / Creatomate | Managed API | ~$0.04/render | ~15s | Always |
API Mom maintains the capability registry. When a render-video pipeline needs ffmpeg-concat, API Mom looks at:
- What instances are registered and healthy? (checked via heartbeat)
- What is each instance’s declared cost model?
- What is the current queue depth on each instance?
- What is the pipeline’s declared priority and budget?
A batch job that can wait 10 minutes routes to the $0.04 Shotstack API. A rush job that needs a result in 30 seconds routes to the rented GPU. A high-throughput internal job routes to the local workstation at zero cost.
The pipeline author wrote none of this logic. They declared a capability name. API Mom handled the economics.
```text
Pipeline declares:  capability: ffmpeg-concat
                    priority: standard
                    budget_ceiling_usd: 0.10

API Mom resolves:   local GPU registered? → yes, healthy
                    queue depth: 0
                    cost: $0.00
                    → route to local GPU
```
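That resolution can be sketched as a filter-then-sort over registered instances. This is an assumption about API Mom's logic, not its implementation: `pickInstance`, the `Instance` shape, and the three-missed-heartbeats staleness rule are hypothetical, and priority is left out for brevity (a rush job might sort on queue depth before cost).

```typescript
interface Instance {
  endpoint: string
  healthy: boolean // result of the latest health check
  lastHeartbeatMs: number // epoch ms of the last heartbeat
  queueDepth: number
  estimatedCostUsd: number // this call priced under the instance's cost model
}

const HEARTBEAT_INTERVAL_MS = 60_000
// Assumption: an instance that has missed three heartbeats is treated as gone.
const STALE_AFTER_MS = 3 * HEARTBEAT_INTERVAL_MS

function pickInstance(
  instances: Instance[],
  budgetCeilingUsd: number,
  nowMs: number,
): Instance | undefined {
  return instances
    .filter((i) => i.healthy && nowMs - i.lastHeartbeatMs <= STALE_AFTER_MS)
    .filter((i) => i.estimatedCostUsd <= budgetCeilingUsd)
    // Cheapest first; among equal costs, the shortest queue wins.
    .sort((a, b) => a.estimatedCostUsd - b.estimatedCostUsd || a.queueDepth - b.queueDepth)[0]
}
```

When the local GPU stops heartbeating, its entry is filtered out and the same request falls through to the next-cheapest instance within budget, with no change to the pipeline.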
The pipeline is portable. If the local GPU goes offline, the same pipeline routes to the cloud alternative without any configuration change. The pipeline author’s code does not change. The pipeline YAML does not change. Only the routing table changes — and that is managed centrally, not distributed across every pipeline definition.
Practical Example: Video Factory Before and After
Before: One 770-line monolith
```text
render-server/
  server.ts       770 lines
  Dockerfile       12 lines
  package.json     24 dependencies
```
Everything couples together. The ffmpeg binary path is hardcoded. The S3 bucket name is hardcoded. The TTS API key is loaded once at startup. Every endpoint shares the same Express instance, the same process, the same failure domain.
After: Seven leaf capabilities + one pipeline definition
```text
capabilities/
  stock-search/
    server.ts      85 lines  (HTTP + search API integration)
    install.sh     60 lines  (bootstrap + registration)
    spec.json      40 lines  (input/output schema)
  ffmpeg-concat/
    server.ts      90 lines  (HTTP + ffmpeg execution)
    install.sh     65 lines
    spec.json      45 lines
  tts-elevenlabs/
    server.ts      70 lines  (HTTP + ElevenLabs API)
    install.sh     55 lines
    spec.json      35 lines
  audio-mix/
    server.ts      95 lines  (HTTP + ffmpeg audio + loudnorm)
    install.sh     60 lines
    spec.json      40 lines
  ffmpeg-thumbnail/
    server.ts      60 lines
    install.sh     55 lines
    spec.json      30 lines
  ffmpeg-transcode/
    server.ts      75 lines
    install.sh     60 lines
    spec.json      40 lines
  media-store-upload/
    server.ts      55 lines  (HTTP + S3 upload)
    install.sh     50 lines
    spec.json      25 lines
pipelines/
  render-video.yaml  45 lines  (Scram-Jet pipeline definition)
```
Total: about 530 lines across the seven server.ts files (roughly 1,190 counting install.sh and spec.json), replacing 770 lines in one. The total line count did not shrink. The coupling did. Each capability:
- Can be deployed and upgraded independently
- Can run on any machine with the right prerequisites
- Fails in isolation — a TTS outage does not take down the ffmpeg renderer
- Can be replaced by a cloud API by updating the registry, not the code
- Can be tested in isolation with a simple curl
The pipeline definition is 45 lines of YAML. It replaces the implicit orchestration that was previously woven through the monolith’s route handlers and shared state.
Anti-Patterns
1. Capabilities too granular
Loudnorm is a single ffmpeg filter pass. Making it a standalone capability creates a network round-trip, with a Media Store upload and download, for a 2-second CPU operation. Bundle loudnorm with audio-mix. The rule: if a step is cheap compared with the Media Store round-trip it would add, and it always immediately precedes or follows another step with no fan-out between them, the two belong in the same capability.
```text
# Wrong: unnecessary network hop
audio-mix → upload → loudnorm-download → loudnorm → upload

# Right: single capability, single Media Store write
audio-mix (includes loudnorm) → upload
```
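The structural half of that rule, "no fan-out between them", can be checked mechanically on a pipeline definition. A minimal sketch; `bundleCandidates` is a hypothetical helper that finds edges where the downstream step has a single dependency and the upstream step has a single dependent. Whether a flagged pair is worth bundling still depends on how cheap the step is compared with the Media Store round-trip: on render-video it flags render→mix and mix→upload, which the decomposition above deliberately keeps separate because each of those steps is substantial.

```typescript
interface PipelineStep {
  id: string
  depends_on?: string[]
}

// Returns [upstream, downstream] pairs joined by a straight edge: downstream
// depends only on upstream, and upstream feeds nothing else.
function bundleCandidates(steps: PipelineStep[]): Array<[string, string]> {
  const dependents = new Map<string, string[]>()
  for (const s of steps) dependents.set(s.id, [])
  for (const s of steps) {
    for (const dep of s.depends_on ?? []) dependents.get(dep)?.push(s.id)
  }
  const pairs: Array<[string, string]> = []
  for (const b of steps) {
    const deps = b.depends_on ?? []
    if (deps.length !== 1) continue // fan-in: not a straight edge
    const a = deps[0]
    if (dependents.get(a)?.length === 1) pairs.push([a, b.id]) // no fan-out from a
  }
  return pairs
}
```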
2. Orchestration logic inside capabilities
A capability that calls another capability is no longer a capability — it is an orchestrator wearing a capability costume. If ffmpeg-concat internally calls tts-elevenlabs to get its audio, you have recreated the monolith with extra HTTP steps. Orchestration belongs in Scram-Jet pipelines. Capabilities are leaves in the call tree, not intermediate nodes.
3. State inside capabilities
A capability that maintains a job map in memory, writes to a local database, or stores intermediate files on its own disk has introduced hidden state. When the capability process restarts, that state is gone. When the capability moves to a different machine, that state does not follow it. All state, including intermediate files, belongs in Media Store. From the caller’s perspective, the capability should be indistinguishable from a stateless function.
4. Capability-specific authentication
Each capability should not implement its own auth scheme. API Mom is the authentication boundary. When API Mom routes a request to a capability, it has already authenticated the caller. The capability trusts requests that arrive on its local network segment. Adding JWT validation to each capability adds complexity without security — the capability’s port should not be exposed to the public internet in the first place.
5. Hardcoding API Mom’s URL
Capabilities should read API_MOM_URL from the environment. This is not a minor point — it is what lets you run a local API Mom instance for testing and a production instance for deployment, without changing the capability code.
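In a TypeScript capability, that configuration can be read in one place at startup. A minimal sketch; `loadConfig` and `CapabilityConfig` are hypothetical names, and the fallback URL mirrors the default in the bootstrap script. Taking the environment as a parameter, rather than reading `process.env` inside, also makes the function easy to point at a local API Mom in tests.

```typescript
interface CapabilityConfig {
  apiMomUrl: string
  apiMomToken: string
}

// Reads API Mom's location and token from an environment map. The URL has a
// default; the token does not, so a missing token fails at startup.
function loadConfig(env: Record<string, string | undefined>): CapabilityConfig {
  const apiMomUrl = env.API_MOM_URL ?? 'https://api-mom.example.com'
  new URL(apiMomUrl) // throws TypeError if the URL is malformed
  const apiMomToken = env.API_MOM_TOKEN
  if (!apiMomToken) throw new Error('API_MOM_TOKEN must be set')
  // Strip trailing slashes so `${apiMomUrl}/v1/...` never doubles them
  return { apiMomUrl: apiMomUrl.replace(/\/+$/, ''), apiMomToken }
}
```

At startup a capability would call `loadConfig(process.env)` once and pass the result to its registration and heartbeat code.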
References
- garywu/_readme/articles/autonomous-entity-pattern — Capabilities as autonomous entities: clear boundaries, defined interfaces, downward delegation
- garywu/_readme/articles/cloudflare-durable-objects-patterns — The Adaptive Controller pattern: orchestrators adapt to available workers; workers just execute
- garywu/_readme/articles/api-mom-intelligent-router — API Mom as the routing and cost management layer between callers and capabilities
- Unix Philosophy — “Do one thing and do it well” — the original formulation of the capability primitive
- Docker Best Practices: One Process Per Container — Docker’s application of the same principle to containerized services
- EBU R128 Loudness Standard — the broadcast loudness spec that loudnorm targets, and why it belongs in audio-mix rather than a standalone capability