A capability can be a single script or a full pipeline, and the caller cannot tell the difference. This recursive property (pipelines calling pipelines calling pipelines) is an underexplored idea in workflow orchestration, and it changes how you decompose a system into reusable parts.
Most workflow systems draw a hard line between “a task” and “a workflow.” A task is a leaf node — a function, a script, a subprocess. A workflow is a graph of tasks. You call tasks from workflows, but you do not call workflows from workflows. That asymmetry forces you to flatten everything beyond two levels of abstraction, and flattening kills reuse.
Scram-Jet eliminates that asymmetry. A capability is either a leaf (one script, one process) or a composite (a pipeline of other capabilities). The interface is identical in both cases. This means capabilities compose like functions, and composition depth is unlimited.
This article explains why that matters, what it costs, and how to do it right.
The Insight
Here is the simplest version of the idea:
capability("ffmpeg-concat", inputs) → outputs
The caller does not know and does not care whether ffmpeg-concat is:
- A single process that concatenates video files
- A pipeline that pre-processes each clip, normalizes audio, then concatenates
- A pipeline that calls another pipeline that does all of the above
The contract is identical: pass inputs, receive outputs. The implementation is opaque. This is the same principle that makes function composition work in programming — f(g(x)) does not require the caller to know what is inside g.
The consequence is that any composite capability can be treated as a leaf by its callers, which means any caller can itself become a composite, which means composition depth is unbounded.
Most orchestration systems do not have this property. Airflow DAGs call tasks. Temporal workflows call activities. GitHub Actions workflows call steps. In all three, the higher-level construct is a fundamentally different type from the lower-level construct. You cannot call a DAG from a DAG as if it were a task. You cannot call a Temporal workflow from an activity. The type asymmetry is load-bearing — and it is the reason those systems force you to flatten.
Scram-Jet removes the asymmetry. The pipeline scheduler does not need to know whether the thing it is about to invoke is a leaf or a composite. It calls the capability registry, gets an endpoint, posts the input, and waits for the output. The complexity inside that capability is hidden.
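The uniform contract can be sketched in a few lines of TypeScript. Every name here (`Capability`, `leaf`, `composite`) is hypothetical and is not Scram-Jet's API, and the calls are synchronous for brevity where a real invocation would be an HTTP round trip. The point is only that a leaf and a composite share one type, so a composite can be nested anywhere a leaf can.

```typescript
// Hypothetical sketch: one type for both leaves and composites.
type Payload = Record<string, unknown>;
type Capability = (inputs: Payload) => Payload;

// A leaf wraps a single unit of work.
function leaf(work: (inputs: Payload) => Payload): Capability {
  return work;
}

// A composite chains capabilities: each step's output feeds the next step.
// The result has the same type as a leaf, so composites nest freely.
function composite(...steps: Capability[]): Capability {
  return (inputs) => steps.reduce<Payload>((data, step) => step(data), inputs);
}

const normalize = leaf((i) => ({ ...i, normalized: true }));
const concat = leaf((i) => ({ ...i, output_key: "media/concat-result" }));
const tagged = leaf((i) => ({ ...i, tagged: true }));

// A pipeline of leaves.
const concatPipeline = composite(normalize, concat);
// A pipeline that contains a pipeline.
const nested = composite(concatPipeline, tagged);

// The caller invokes a leaf and a nested composite the same way.
const fromLeaf = concat({ clip_keys: ["a", "b"] });
const fromNested = nested({ clip_keys: ["a", "b"] });
console.log(fromLeaf["output_key"], fromNested["output_key"]);
```

Because `composite` returns a `Capability`, swapping `concat` for `concatPipeline` at a call site requires no change to the caller.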
Leaf vs Composite vs Meta-Composite
There are three levels of composition. All three look identical to their callers.
Level 1: Leaf Capability
A leaf is a single script or subprocess. It takes input, does work, produces output.
# capabilities/ffmpeg-concat.yaml
name: ffmpeg-concat
version: 1.2.0
type: leaf-capability
runtime:
  command: python3 scripts/ffmpeg_concat.py
  timeout_seconds: 300
inputs:
  clip_keys:
    type: array
    items: { type: string, description: "Media Store keys for input clips" }
  output_format:
    type: string
    enum: [mp4, webm]
    default: mp4
outputs:
  output_key:
    type: string
    description: "Media Store key for the concatenated result"
cache:
  enabled: true
  ttl_seconds: 86400
One file. One process. No orchestration. The runtime.command is what Scram-Jet executes.
Level 2: Composite Capability
A composite is a pipeline of leaf capabilities (or other composites). It defines steps, data flow, and platform features like batching and parallelism. From the outside, it looks exactly like a leaf.
# capabilities/render-video.yaml
name: render-video
version: 2.1.0
type: composite-capability
inputs:
  script_text: { type: string }
  visual_beats:
    type: array
    items:
      type: object
      properties:
        timestamp_ms: { type: integer }
        query: { type: string }
outputs:
  video_key: { type: string }
  duration_ms: { type: integer }
steps:
  - id: find-visuals
    batch:
      over: "$.inputs.visual_beats"
    invoke:
      capability: stock-search
      version: "^1.0"
      input:
        query: "$.item.query"
        aspect_ratio: "16:9"
        output_key: "$.item.clip_key"
    cache: true
  - id: generate-voiceover
    invoke:
      capability: text-to-speech
      version: "^3.0"
      input:
        text: "$.inputs.script_text"
        voice: "nova"
    output_as: voiceover
  - id: concat-clips
    invoke:
      capability: ffmpeg-concat
      version: "^1.2"
      input:
        clip_keys: "$.steps.find-visuals[*].clip_key"
        output_format: mp4
    cache: true
    output_as: raw_video
  - id: mix-audio
    invoke:
      capability: audio-mix
      version: "^1.0"
      input:
        video_key: "$.steps.raw_video.output_key"
        audio_key: "$.steps.voiceover.audio_key"
        ducking_db: -18
    output_as: mixed_video
  - id: normalize-loudness
    invoke:
      capability: loudnorm
      version: "^1.0"
      input:
        media_key: "$.steps.mixed_video.output_key"
        target_lufs: -14
    output_as: final_video
outputs_map:
  video_key: "$.steps.final_video.output_key"
  duration_ms: "$.steps.final_video.duration_ms"
The caller sees render-video as a single capability. They post a script and a list of visual beats. They receive a finished video key. The five internal steps are invisible to them.
Level 3: Meta-Composite
A meta-composite is a pipeline that includes composites as steps. To Scram-Jet’s scheduler, render-video looks identical to ffmpeg-concat when invoked from a higher level.
# capabilities/produce-short-form.yaml
name: produce-short-form
version: 1.0.0
type: composite-capability
inputs:
  topic: { type: string }
  duration_seconds: { type: integer, default: 60 }
steps:
  - id: write-script
    invoke:
      capability: script-writer
      version: "^2.0"
      input:
        topic: "$.inputs.topic"
        duration_seconds: "$.inputs.duration_seconds"
        format: short-form
    output_as: script
  - id: extract-beats
    invoke:
      capability: beat-extractor
      version: "^1.0"
      input:
        script: "$.steps.script.text"
    output_as: beats
  - id: render
    invoke:
      capability: render-video # <-- this is a composite, called like a leaf
      version: "^2.1"
      input:
        script_text: "$.steps.script.text"
        visual_beats: "$.steps.beats.visual_beats"
    budget_cents: 150
    output_as: video
  - id: upload
    invoke:
      capability: cdn-upload
      version: "^1.0"
      input:
        media_key: "$.steps.video.video_key"
        destination: "shorts/{topic}"
outputs_map:
  cdn_url: "$.steps.upload.url"
  duration_ms: "$.steps.video.duration_ms"
produce-short-form calls render-video exactly as it calls script-writer or cdn-upload. The meta-composite does not know or care that render-video is itself a pipeline of five steps.
Hierarchy Collapse: Flat API, Deep Execution
When a caller invokes produce-short-form, the interaction is one request and one response:
POST /capabilities/produce-short-form/invoke
{ "topic": "sourdough fermentation", "duration_seconds": 60 }
→ 200 OK
{ "cdn_url": "https://cdn.example.com/shorts/sourdough-fermentation.mp4", "duration_ms": 58400 }
Internally, Scram-Jet expands the full capability tree before execution begins:
produce-short-form (meta-composite)
├── script-writer (leaf)
├── beat-extractor (leaf)
├── render-video (composite)
│ ├── stock-search × N (leaf, batched)
│ ├── text-to-speech (leaf)
│ ├── ffmpeg-concat (leaf)
│ ├── audio-mix (leaf)
│ └── loudnorm (leaf)
└── cdn-upload (leaf)
The scheduler resolves the full tree, determines data dependencies, identifies which steps can run in parallel, and builds an execution plan. Then it runs. When it finishes, it collapses the tree and returns the final output to the original caller.
The API surface is flat. The execution graph is deep. The caller does not experience the depth — they experience one request and one response.
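The expansion step can be illustrated with a small recursive resolver. This is a sketch under stated assumptions: the `registry` map, `expand`, and `render` are hypothetical names, and the registry is reduced to "which capabilities does each capability invoke". It shows only how a flat name resolves to a deep tree, not how Scram-Jet plans or schedules execution.

```typescript
// Hypothetical registry: capability name -> names of the capabilities its
// steps invoke. A leaf invokes nothing. Names mirror the tree above.
const registry: Record<string, string[]> = {
  "produce-short-form": ["script-writer", "beat-extractor", "render-video", "cdn-upload"],
  "render-video": ["stock-search", "text-to-speech", "ffmpeg-concat", "audio-mix", "loudnorm"],
  "script-writer": [],
  "beat-extractor": [],
  "cdn-upload": [],
  "stock-search": [],
  "text-to-speech": [],
  "ffmpeg-concat": [],
  "audio-mix": [],
  "loudnorm": [],
};

interface TreeNode {
  name: string;
  children: TreeNode[];
}

// Recursively resolve a capability into its full tree, rejecting cycles.
function expand(name: string, path: string[] = []): TreeNode {
  const steps = registry[name];
  if (steps === undefined) throw new Error(`unknown capability: ${name}`);
  if (path.indexOf(name) !== -1) throw new Error(`cycle at: ${name}`);
  const children = steps.map((step) => expand(step, path.concat(name)));
  return { name, children };
}

// Flatten the tree into indented lines, depth-first.
function render(node: TreeNode, indent = ""): string[] {
  const kind = node.children.length === 0 ? "leaf" : "composite";
  let out = [`${indent}${node.name} (${kind})`];
  for (const child of node.children) {
    out = out.concat(render(child, indent + "  "));
  }
  return out;
}

const treeLines = render(expand("produce-short-form"));
console.log(treeLines.join("\n"));
```

The cycle check matters once composites can reference composites: a capability that transitively invokes itself would otherwise expand forever.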
Observability is the exception. The trace for a produce-short-form invocation shows the full tree, with timing, input/output sizes, cache hits, and cost attribution at every node. This is how you debug a slow pipeline — you look at the trace and find the step that took 12 seconds instead of 2. But the trace is a diagnostic tool, not a user-facing API.
Trace: produce-short-form [req-a7f3]
├── script-writer 340ms $0.0031 cache: miss
├── beat-extractor 18ms $0.0000 cache: miss
├── render-video [composite, expanded]
│ ├── stock-search × 12 4,100ms $0.0240 cache: 9 hits, 3 misses
│ ├── text-to-speech 820ms $0.0018 cache: miss
│ ├── ffmpeg-concat 6,200ms $0.0000 cache: miss
│ ├── audio-mix 1,100ms $0.0000 cache: miss
│ └── loudnorm 430ms $0.0000 cache: miss
└── cdn-upload 290ms $0.0000 cache: miss
Total: 13,298ms $0.0289
Content-Addressed Caching: Bazel’s Insight
Bazel reduced Google’s CI times from 45 minutes to 7 minutes with one idea: cache key = SHA256(inputs). If you have already computed the output for a given set of inputs, return the cached result. Do not re-execute.
The key insight is that the cache key does not include time, machine identity, or any ambient state. It includes only the capability name, the version, and the canonical representation of the inputs. Two builds on different machines, on different days, with identical inputs will both hit the same cache entry.
Applied to Scram-Jet:
cache_key = SHA256(
  capability_name +
  capability_version +
  canonical_json(inputs) # sorted keys, normalized whitespace
)
This means:
- TTS of the same text with the same voice: one API call per capability version, for as long as the cache entry lives. Every subsequent invocation hits the cache.
- FFmpeg concat of the same set of clips: one transcode per capability version. The output MP4 is in Media Store. Every subsequent invocation returns the stored key.
- Stock search for the same query: one external API call per capability version. The result set is cached.
Cache invalidation happens at the version boundary. Bumping text-to-speech from 3.0 to 3.1 produces a different cache key. Old cache entries are not used.
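A minimal sketch of the key computation, assuming Node's built-in `node:crypto` module and inputs that are plain JSON values. The helper names `canonicalJson` and `cacheKey` are hypothetical, and the newline separator between fields is a choice made here to avoid ambiguous concatenation, not something the pseudocode above specifies.

```typescript
import { createHash } from "node:crypto";

// Serialize JSON with object keys sorted at every level, so the order in
// which the caller wrote the keys does not change the hash.
// Assumes plain JSON values (no undefined, functions, or cyclic references).
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) {
    return "[" + value.map((v) => canonicalJson(v)).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((k) => JSON.stringify(k) + ":" + canonicalJson(obj[k]));
    return "{" + body.join(",") + "}";
  }
  return JSON.stringify(value);
}

// The key depends only on name, version, and canonical inputs:
// no timestamps, no machine identity.
function cacheKey(name: string, version: string, inputs: unknown): string {
  return createHash("sha256")
    .update(name + "\n" + version + "\n" + canonicalJson(inputs))
    .digest("hex");
}

const keyA = cacheKey("text-to-speech", "3.0.0", { text: "hi", voice: "nova" });
const keyB = cacheKey("text-to-speech", "3.0.0", { voice: "nova", text: "hi" });
const keyC = cacheKey("text-to-speech", "3.1.0", { text: "hi", voice: "nova" });
console.log(keyA === keyB, keyA === keyC);
```

`keyA` and `keyB` match because key order is normalized away; `keyC` differs because the version is part of the hash, which is the version-boundary invalidation described above.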
The practical impact in a video pipeline: a short-form video that renders from 12 stock clips and a voiceover costs full compute the first time. If the same video is re-rendered (because of a downstream bug, or because a reviewer requested a re-cut that ultimately reverts to the original), every step that received identical inputs returns instantly from cache. A 13-second pipeline becomes a 0.4-second pipeline.
# Cache is opt-in per step, but the mechanism is uniform
- id: generate-voiceover
  invoke:
    capability: text-to-speech
    input:
      text: "$.inputs.script_text"
      voice: nova
  cache: true # SHA256(text-to-speech + version + {text, voice})
  budget_cents: 5
Orthogonal Concerns
The five platform features — batching, parallelism, buffering, caching, and budget control — are orthogonal to the logic inside each capability. They are declared in the pipeline YAML, not implemented in capability code. A capability author does not need to know whether their capability will be batched, cached, or budget-limited. They write a function. The platform handles the rest.
Batching
Fan-out over a list of items, invoking a capability once per item.
- id: find-visuals
  batch:
    over: "$.inputs.visual_beats" # the list to iterate
    concurrency: 8 # max parallel invocations
  invoke:
    capability: stock-search
    input:
      query: "$.item.query" # $.item refers to the current element
      aspect_ratio: "16:9"
For 12 items, the scheduler issues 12 stock-search invocations, running at most 8 at a time because of concurrency: 8 (without the limit, all 12 run in parallel). The results are collected back into an array before the next step runs. The capability code sees one item at a time. The batching logic lives entirely in the scheduler.
Parallelism
Run independent steps concurrently.
- id: parallel-generation
  parallel:
    - invoke:
        capability: stock-search
        input: { query: "$.inputs.scene_query" }
      output_as: visuals
    - invoke:
        capability: text-to-speech
        input: { text: "$.inputs.narration" }
      output_as: audio
# Both steps start simultaneously. Next step waits for both.
stock-search and text-to-speech have no data dependency on each other. Running them in parallel shaves their combined latency to the slower of the two, rather than the sum of both.
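The "slower of the two" arithmetic generalizes to any dependency graph: a step finishes at its own duration plus the latest finish time among its dependencies. The sketch below is illustrative only. The `Step` shape and `finishTimes` are hypothetical, the durations are sample figures, and it assumes unlimited workers and an acyclic graph.

```typescript
interface Step {
  id: string;
  durationMs: number;
  dependsOn: string[];
}

// Earliest finish time of each step when every step starts as soon as all
// of its dependencies have finished. Assumes the graph has no cycles.
function finishTimes(steps: Step[]): Map<string, number> {
  const byId = new Map(steps.map((s): [string, Step] => [s.id, s]));
  const done = new Map<string, number>();
  const visit = (id: string): number => {
    const cached = done.get(id);
    if (cached !== undefined) return cached;
    const step = byId.get(id);
    if (!step) throw new Error(`unknown step: ${id}`);
    let start = 0;
    for (const dep of step.dependsOn) start = Math.max(start, visit(dep));
    const finish = start + step.durationMs;
    done.set(id, finish);
    return finish;
  };
  for (const s of steps) visit(s.id);
  return done;
}

const plan: Step[] = [
  { id: "visuals", durationMs: 4100, dependsOn: [] },
  { id: "audio", durationMs: 820, dependsOn: [] },
  { id: "mix", durationMs: 1100, dependsOn: ["visuals", "audio"] },
];

const times = finishTimes(plan);
const parallelMs = Math.max(...Array.from(times.values()));
const sequentialMs = plan.reduce((sum, s) => sum + s.durationMs, 0);
console.log({ parallelMs, sequentialMs });
```

With these sample durations the parallel plan finishes in 5,200ms against 6,020ms run back to back: the 820ms audio step is hidden entirely behind the 4,100ms visuals step.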
Buffering
Steps pass data through Media Store rather than inline in the pipeline payload. Large binary outputs (video clips, audio files, rendered images) are stored as content-addressed keys. Steps receive keys, not bytes.
This is automatic. When a capability output is a binary blob, Scram-Jet stores it in Media Store and passes the key to the next step. Pipeline payloads stay small. Steps that receive the same media key do not re-download the blob — they access it directly in the local Media Store cache.
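The key-passing idea can be sketched with an in-memory stand-in. `MediaStore` here is a hypothetical class, not the real Media Store API, and the `sha256:` key prefix is an assumption. It shows why identical bytes map to one stored blob and why pipeline payloads only ever carry short keys.

```typescript
import { createHash } from "node:crypto";

// Hypothetical in-memory stand-in for Media Store.
class MediaStore {
  private blobs = new Map<string, Uint8Array>();

  // The key is derived from the bytes, so identical content maps to one entry.
  put(bytes: Uint8Array): string {
    const key = "sha256:" + createHash("sha256").update(bytes).digest("hex");
    if (!this.blobs.has(key)) this.blobs.set(key, bytes);
    return key;
  }

  get(key: string): Uint8Array {
    const bytes = this.blobs.get(key);
    if (!bytes) throw new Error(`no blob for key ${key}`);
    return bytes;
  }

  get size(): number {
    return this.blobs.size;
  }
}

const store = new MediaStore();
const clip = new TextEncoder().encode("fake video bytes");

const firstKey = store.put(clip);
const secondKey = store.put(new TextEncoder().encode("fake video bytes"));

// A step payload carries only the key, never the bytes.
const payload = { video_key: firstKey };
console.log(payload, store.size);
```

Writing the same content twice yields the same key and a single stored blob, which is what lets downstream steps share media without copying it through the pipeline payload.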
Caching
- id: concat-clips
  invoke:
    capability: ffmpeg-concat
    input:
      clip_keys: "$.steps.find-visuals[*].clip_key"
  cache: true # adds cache_key check before invocation
cache: true on any step tells the scheduler to compute the cache key before invoking the capability. If a cache hit exists, the step result is returned immediately and the capability is not invoked. The capability code is unchanged.
Budget Control
- id: render
  invoke:
    capability: render-video
    input:
      script_text: "$.steps.script.text"
      visual_beats: "$.steps.beats.visual_beats"
  budget_cents: 150 # hard limit for this step and all descendants
budget_cents: 150 on a composite step propagates a budget envelope through the entire sub-tree. If cumulative cost across all descendant capabilities exceeds 150 cents, the scheduler aborts with a BUDGET_EXCEEDED error. The capability code does not implement budget tracking — it is a platform concern.
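One way to picture the envelope is as a chain of counters in which every charge is forwarded to the parent. The `BudgetEnvelope` class below is a hypothetical sketch of that idea, not the scheduler's actual accounting. Only the `BUDGET_EXCEEDED` code comes from the description above.

```typescript
// Hypothetical budget envelope. A child envelope forwards every charge to
// its parent, so spending anywhere in a sub-tree counts against all ancestors.
class BudgetEnvelope {
  private spentCents = 0;

  constructor(
    private readonly limitCents: number,
    private readonly parent?: BudgetEnvelope,
  ) {}

  get spent(): number {
    return this.spentCents;
  }

  // Envelope for a descendant step; it can never exceed what is left here.
  child(): BudgetEnvelope {
    return new BudgetEnvelope(this.limitCents - this.spentCents, this);
  }

  charge(cents: number): void {
    if (this.spentCents + cents > this.limitCents) {
      throw Object.assign(new Error("budget exceeded"), { code: "BUDGET_EXCEEDED" });
    }
    // Ancestors are checked before anything is recorded, so a rejected
    // charge leaves every envelope unchanged.
    if (this.parent) this.parent.charge(cents);
    this.spentCents += cents;
  }
}

const renderBudget = new BudgetEnvelope(150); // budget_cents: 150 on the render step
const ttsBudget = renderBudget.child();
const searchBudget = renderBudget.child();

ttsBudget.charge(60);
searchBudget.charge(80);

let errorCode = "";
try {
  searchBudget.charge(20); // 60 + 80 + 20 = 160 > 150
} catch (err) {
  errorCode = (err as { code: string }).code;
}
console.log(errorCode, renderBudget.spent);
```

Neither child exceeds its own limit in this example. The third charge fails because the shared parent envelope is the one that runs out, which is the cumulative behavior the step-level budget implies.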
Deterministic Step IDs
This is the lesson from Temporal and Cloudflare Workflows. It is also the most commonly skipped lesson.
When a workflow fails partway through and is retried, the scheduler needs to know which steps have already completed. It does this by looking up the step ID in the execution log. If the step ID was a random UUID generated at runtime, each retry produces different IDs. The scheduler cannot match the retry’s steps to the original run’s completed steps. Everything re-executes from the start.
Wrong:
// UUID generated at invocation time — different every run
const stepId = crypto.randomUUID()
await scheduler.invoke(stepId, 'text-to-speech', inputs)
Right:
// SHA256 of the step's identity: identical for identical work
const stepId = await sha256(
  `${pipelineId}:${stepName}:${capabilityName}:${JSON.stringify(canonicalInputs)}`
)
await scheduler.invoke(stepId, 'text-to-speech', inputs)
The deterministic step ID serves the same purpose as the cache key: both are hashes of what the work is and what it was given. A step that completed in a previous run (same pipeline, same inputs) matches on lookup and returns the stored result, even if the run is a full retry. This is idempotent execution: retry the whole pipeline, pay only for the steps that genuinely need re-running.
The trace for a retried pipeline should look like this:
Trace: render-video [req-a7f3, retry 2]
├── stock-search × 12 cache: 12 hits (previous run completed)
├── text-to-speech cache: 1 hit (previous run completed)
├── ffmpeg-concat EXECUTED (previous run failed here)
├── audio-mix EXECUTED (new)
└── loudnorm EXECUTED (new)
Without deterministic IDs, all five steps re-execute. With deterministic IDs, only the three steps from the failure point onward execute. The savings compound with pipeline depth.
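The retry behavior can be reproduced in a self-contained sketch. It assumes Node's `node:crypto` for hashing and uses an in-memory `Map` as a stand-in for the execution log. `stepId`, `runStep`, and `attempt` are hypothetical names, and the inputs are passed as pre-canonicalized JSON strings.

```typescript
import { createHash } from "node:crypto";

// Same identity string as the "Right" example, hashed synchronously.
function stepId(pipelineId: string, stepName: string, capabilityName: string, canonicalInputs: string): string {
  return createHash("sha256")
    .update([pipelineId, stepName, capabilityName, canonicalInputs].join(":"))
    .digest("hex");
}

// Hypothetical execution log: step ID -> stored result.
const executionLog = new Map<string, string>();

function runStep(id: string, work: () => string): string {
  const previous = executionLog.get(id);
  if (previous !== undefined) return previous; // completed in an earlier attempt
  const result = work(); // if this throws, nothing is logged
  executionLog.set(id, result);
  return result;
}

const calls: string[] = [];
let failOnce = true;

function attempt(): string {
  const audio = runStep(
    stepId("render-video", "generate-voiceover", "text-to-speech", '{"text":"hi"}'),
    () => { calls.push("tts"); return "audio-key"; },
  );
  const video = runStep(
    stepId("render-video", "concat-clips", "ffmpeg-concat", '{"clip_keys":["c1"]}'),
    () => {
      calls.push("concat");
      if (failOnce) { failOnce = false; throw new Error("transient failure"); }
      return "video-key";
    },
  );
  return audio + "+" + video;
}

let firstAttempt = "ok";
try { attempt(); } catch (err) { firstAttempt = "failed"; }
const secondAttempt = attempt();
console.log(firstAttempt, secondAttempt, calls);
```

The text-to-speech work runs once across both attempts, while the failed concat step runs twice. Replacing `stepId` with a random UUID would make the second attempt miss the log and repeat the text-to-speech call.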
The Full Example: render-video
Putting it together: a complete composite capability, followed by its invocation from a meta-composite.
# capabilities/render-video.yaml
name: render-video
version: 2.1.0
type: composite-capability
description: >
  Takes a script and visual beats, searches stock footage, generates voiceover,
  concatenates clips, mixes audio, and loudness-normalizes the result.
inputs:
  script_text:
    type: string
    description: "Full narration text for the video"
  visual_beats:
    type: array
    description: "Timestamps and search queries for visual cut points"
    items:
      type: object
      required: [timestamp_ms, query]
      properties:
        timestamp_ms: { type: integer }
        query: { type: string }
outputs:
  video_key:
    type: string
    description: "Media Store key for the rendered video"
  duration_ms:
    type: integer
steps:
  - id: find-visuals
    batch:
      over: "$.inputs.visual_beats"
      concurrency: 6
    invoke:
      capability: stock-search
      version: "^1.0"
      input:
        query: "$.item.query"
        aspect_ratio: "16:9"
        min_duration_ms: 3000
    collect_as: clips
    cache: true
  - id: generate-voiceover
    invoke:
      capability: text-to-speech
      version: "^3.0"
      input:
        text: "$.inputs.script_text"
        voice: nova
        output_format: mp3
    cache: true
    output_as: voiceover
  - id: validate-inputs
    invoke:
      capability: media-validator
      version: "^1.0"
      input:
        clip_keys: "$.steps.clips[*].clip_key"
        audio_key: "$.steps.voiceover.audio_key"
        require_min_clips: 3
    # Fails fast if any clip is missing or duration is 0.
    # Next steps do not run if this step fails.
  - id: concat-clips
    invoke:
      capability: ffmpeg-concat
      version: "^1.2"
      input:
        clip_keys: "$.steps.clips[*].clip_key"
        beat_timestamps_ms: "$.inputs.visual_beats[*].timestamp_ms"
        output_format: mp4
    cache: true
    output_as: raw_video
  - id: mix-audio
    invoke:
      capability: audio-mix
      version: "^1.0"
      input:
        video_key: "$.steps.raw_video.output_key"
        audio_key: "$.steps.voiceover.audio_key"
        ducking_db: -18
        fade_in_ms: 200
        fade_out_ms: 500
    output_as: mixed_video
  - id: normalize-loudness
    invoke:
      capability: loudnorm
      version: "^1.0"
      input:
        media_key: "$.steps.mixed_video.output_key"
        target_lufs: -14
        true_peak_dbfs: -1.0
    output_as: final
outputs_map:
  video_key: "$.steps.final.output_key"
  duration_ms: "$.steps.final.duration_ms"
Now the meta-composite that calls it. From produce-short-form’s perspective, render-video is a single step with one input schema and one output schema:
# capabilities/produce-short-form.yaml
name: produce-short-form
version: 1.0.0
type: composite-capability
steps:
  - id: write-script
    invoke:
      capability: script-writer
      version: "^2.0"
      input:
        topic: "$.inputs.topic"
        duration_seconds: "$.inputs.duration_seconds"
        style: conversational
        hook: first-5-seconds
    output_as: script
  - id: extract-beats
    invoke:
      capability: beat-extractor
      version: "^1.0"
      input:
        script: "$.steps.script.text"
        target_beat_count: 12
    output_as: beats
  - id: render
    invoke:
      capability: render-video # composite, called identically to any leaf
      version: "^2.1"
      input:
        script_text: "$.steps.script.text"
        visual_beats: "$.steps.beats.visual_beats"
    budget_cents: 150
    output_as: video
  - id: upload
    invoke:
      capability: cdn-upload
      version: "^1.0"
      input:
        media_key: "$.steps.video.video_key"
        path_template: "shorts/{{ inputs.topic | slugify }}"
        public: true
    output_as: upload_result
outputs_map:
  cdn_url: "$.steps.upload_result.url"
  duration_ms: "$.steps.video.duration_ms"
  cost_cents: "$.meta.cost_cents"
The call from the outside:
const result = await capabilities.invoke('produce-short-form', {
  topic: 'sourdough fermentation',
  duration_seconds: 60,
})
// result.cdn_url → "https://cdn.example.com/shorts/sourdough-fermentation.mp4"
// result.duration_ms → 58400
// result.cost_cents → 2.89
One function call. Three levels of composition inside. The caller never sees the depth.
Depth Limits
Unlimited composition depth sounds good in theory. In practice, each level adds real latency from two sources:
- API Mom routing overhead: each invoke is an HTTP call to the capability registry, which validates the request, resolves the capability version, checks the cache, and routes to an executor. On a warm path this is 5-15ms per step. At high nesting depth with many steps per level, it accumulates.
- Media Store I/O: each step that produces binary output writes to Media Store. Each step that consumes binary output reads from Media Store. Deep pipelines with many binary-producing steps pay this cost at every level.
A practical measurement on a three-level pipeline:
| Depth | Compute time | Platform overhead | Overhead fraction |
|---|---|---|---|
| 1 (leaf) | 6,200ms | 12ms | 0.2% |
| 2 (composite) | 13,300ms | 85ms | 0.6% |
| 3 (meta-composite) | 13,300ms | 210ms | 1.6% |
| 5 (deep) | 13,300ms | 600ms | 4.3% |
At 5 levels, overhead is still under 5%. Beyond that, it grows faster because step counts multiply. The recommended hard limit is 5 levels. If you find yourself needing level 6, the right fix is to flatten the innermost composite into its parent rather than adding another nesting level.
Level 6 smell: produce-series → produce-episode → produce-segment → render-video → mix-media → transcode
Fix: flatten mix-media + transcode into render-video
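A depth limit like this is easy to check before a capability is registered. The sketch below is an assumption about how such a check could look: `Registry`, `depthOf`, and `assertDepth` are hypothetical names, and the two registries model the chain above before and after the fix, with render-video absorbing the work of mix-media and transcode.

```typescript
// Hypothetical registry shape: capability name -> capabilities its steps invoke.
type Registry = Record<string, string[]>;

const MAX_DEPTH = 5;

// A leaf has depth 1; a composite is one level deeper than its deepest step.
// Assumes the registry is acyclic.
function depthOf(capability: string, registry: Registry): number {
  const steps = registry[capability];
  if (steps === undefined) throw new Error(`unknown capability: ${capability}`);
  let deepest = 0;
  for (const step of steps) {
    deepest = Math.max(deepest, depthOf(step, registry));
  }
  return 1 + deepest;
}

function assertDepth(capability: string, registry: Registry): number {
  const depth = depthOf(capability, registry);
  if (depth > MAX_DEPTH) {
    throw new Error(`${capability} nests ${depth} levels deep; limit is ${MAX_DEPTH}`);
  }
  return depth;
}

// The six-level chain from the smell above.
const deep: Registry = {
  "produce-series": ["produce-episode"],
  "produce-episode": ["produce-segment"],
  "produce-segment": ["render-video"],
  "render-video": ["mix-media"],
  "mix-media": ["transcode"],
  "transcode": [],
};

// After the fix: render-video does the mix and transcode work itself.
const flattened: Registry = {
  "produce-series": ["produce-episode"],
  "produce-episode": ["produce-segment"],
  "produce-segment": ["render-video"],
  "render-video": [],
};

console.log(depthOf("produce-series", deep), assertDepth("produce-series", flattened));
```

Running the check at registration time turns the depth limit from a convention into something a reviewer cannot miss.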
Anti-Patterns
Compose for network calls, not for lines of code
Every capability invocation is a network call. It crosses API Mom’s routing layer, hits the registry, and serializes inputs and outputs through Media Store. For three lines of Python that transform a string, that cost is indefensible.
# Wrong: wrapping trivial logic in a capability
- id: slugify-title
  invoke:
    capability: string-slugify # 4ms of actual work, 15ms of overhead
    input: { text: "$.inputs.topic" }
# Right: do trivial transforms inline in the script
# The text-to-speech capability's script handles slug generation itself
A capability should represent work that takes longer than the platform overhead to invoke it (roughly 15ms for a warm path). If the work takes 2ms, inline it.
Do not hide side effects in composites
Each step in a composite should be deterministic: same inputs produce same outputs, with no observable effects beyond the output. Side effects — writing to a database, posting to an API, sending a notification — belong in explicitly named leaf capabilities at the edges of the pipeline, not buried inside intermediate composites.
When a pipeline is retried, Scram-Jet re-executes non-cached steps. A step that posts to a social media API on every execution will post twice. A step that inserts a database row will insert twice. Make side effects explicit, name them clearly, and put idempotency guards in their capability scripts.
# Wrong: side effect inside a step with cache: true
- id: mix-and-post
  invoke:
    capability: audio-mix-and-notify-slack # hides a POST to Slack inside
  cache: true # if retried with cache miss, Slack gets two messages
# Right: side effects at the pipeline edge, never cached
- id: mix-audio
  invoke: { capability: audio-mix }
  cache: true
- id: notify-complete
  invoke:
    capability: slack-notify # explicit, obviously a side effect
    input: { channel: "#renders", message: "Render complete" }
  cache: false # explicit, never cached
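An idempotency guard inside a side-effecting leaf can be as small as a keyed "already delivered" check. The sketch below is hypothetical throughout: `slackNotify` does not call Slack, the `delivered` set stands in for a durable store, and the key format is an assumption. A real guard would also need to record the key atomically with the send.

```typescript
// Hypothetical stand-ins: a durable "already delivered" store and the
// outbound API. Neither talks to a real service.
const delivered = new Set<string>();
const sent: string[] = [];

// Performs the side effect at most once per idempotency key.
function slackNotify(idempotencyKey: string, channel: string, message: string): "sent" | "skipped" {
  if (delivered.has(idempotencyKey)) return "skipped";
  sent.push(`${channel}: ${message}`);
  delivered.add(idempotencyKey);
  return "sent";
}

// Derive the key from the request and step, so a retry of the same
// request reuses it while a new request gets a fresh one.
const notifyKey = "req-a7f3:notify-complete";

const firstRun = slackNotify(notifyKey, "#renders", "Render complete");
const retryRun = slackNotify(notifyKey, "#renders", "Render complete");
const otherRequest = slackNotify("req-b812:notify-complete", "#renders", "Render complete");
console.log(firstRun, retryRun, otherRequest, sent.length);
```

The retry is skipped and the channel receives one message per request, even though the step itself is never cached.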
Validate between steps, not at the end
Validation failures discovered at the final step waste all the compute that preceded them. Validate early and fail fast. The media-validator step in the render-video example runs before ffmpeg-concat. If any clip is missing or zero-duration, the pipeline fails before spending 6 seconds on FFmpeg.
# Wrong: FFmpeg runs for 6 seconds then fails on bad input
- id: concat-clips
  invoke:
    capability: ffmpeg-concat
    input:
      clip_keys: "$.steps.clips[*].clip_key"
# Right: validate before expensive compute
- id: validate-clips
  invoke:
    capability: media-validator
    input:
      clip_keys: "$.steps.clips[*].clip_key"
      require_min_clips: 3
      require_min_duration_ms: 2000
- id: concat-clips
  invoke:
    capability: ffmpeg-concat
    input:
      clip_keys: "$.steps.clips[*].clip_key"
Comparison: Airflow, Temporal, GitHub Actions
These are the three most commonly proposed alternatives when discussing pipeline orchestration. All three are excellent systems. None of them has the recursive property described in this article.
Apache Airflow
Airflow DAGs call tasks. A task is a Python function or a bash command. You cannot call a DAG from a task as if the DAG were a task. You can trigger a DAG from a DAG using TriggerDagRunOperator, but the triggered DAG executes as a separate DAG run in its own context. It is not a first-class step in the parent DAG’s execution graph. Even when the operator is configured to block until the child run finishes (wait_for_completion=True), the child’s output is not returned to the parent: using it in the next step requires external state management (XComs, which are limited to small payloads).
This means Airflow workflows are effectively two levels deep: one DAG, with tasks inside it.
Temporal
Temporal workflows call activities. You can call a child workflow from a parent workflow (executeChild in the TypeScript SDK, ExecuteChildWorkflow in the Go SDK). This is closer to the recursive property: parent workflows can wait for child workflow results. But activities and workflows are distinct types. An activity cannot be swapped for a workflow without changing the call site. The type asymmetry is explicit in the SDKs; in the Go SDK, for example, it is workflow.ExecuteActivity() vs workflow.ExecuteChildWorkflow().
Temporal also does not have a unified capability registry or content-addressed caching. Each activity type is registered separately, versioned separately, and cached (if at all) through custom application code.
GitHub Actions
Actions workflows can be called from other workflows using the workflow_call trigger (reusable workflows). This is the closest to the recursive property in the tools listed. But GitHub Actions does not support content-addressed caching across workflow invocations. Cache keys are manually specified strings, not SHA256 of inputs. The only fan-out primitive is the job matrix. There is no budget control. And the primitive types are still distinct: a job calls steps; a workflow calls jobs; a caller workflow triggers child workflows. The layers are not interchangeable.
The Difference
| | Airflow | Temporal | GitHub Actions | Scram-Jet |
|---|---|---|---|---|
| Recursive composition | No | Partial | Partial | Yes |
| Uniform type (leaf = composite) | No | No | No | Yes |
| Content-addressed caching | No | No | No | Yes |
| Fan-out batching primitive | Operator | Manual | Matrix | Native |
| Budget control | No | No | No | Native |
| Deterministic step IDs | Manual | Native | N/A | Native |
The critical column is “Uniform type.” When leaf and composite are the same type, you get unlimited depth, natural reuse, and the ability to upgrade the implementation of a capability without changing any of its callers. That property is what makes a pipeline architecture feel like a programming language rather than a configuration file.
References
- garywu/_readme: cloudflare-autonomous-pipeline — Autonomous pipeline stages, Cloudflare primitives, priority-based work scheduling
- garywu/_readme: api-mom-intelligent-router — Capability routing, cost-aware dispatch, budget enforcement
- garywu/_readme: media-store-content-addressable — Media Store key model, binary buffering between steps
- Bazel: Remote Caching — Content-addressed cache keys, the foundational insight
- Temporal: Determinism Requirements — Why step IDs must be deterministic for correct replay
- Cloudflare Workflows: Determinism — Cloudflare’s implementation of durable execution
- Kestra: YAML Orchestration — YAML-first pipeline definition, comparison point for step structure