Gary Wu

Composable Pipelines: When a Pipeline IS a Capability


A capability can be a single script or a full pipeline. The caller cannot tell the difference. This recursive property (pipelines calling pipelines calling pipelines) is the most underexplored idea in workflow orchestration, and it removes the two-level ceiling that most systems put on reuse.

Most workflow systems draw a hard line between “a task” and “a workflow.” A task is a leaf node — a function, a script, a subprocess. A workflow is a graph of tasks. You call tasks from workflows, but you do not call workflows from workflows. That asymmetry forces you to flatten everything beyond two levels of abstraction, and flattening kills reuse.

Scram-Jet eliminates that asymmetry. A capability is either a leaf (one script, one process) or a composite (a pipeline of other capabilities). The interface is identical in both cases. This means capabilities compose like functions, and composition depth is unlimited.

This article explains why that matters, what it costs, and how to do it right.



The Insight

Here is the simplest version of the idea:

capability("ffmpeg-concat", inputs) → outputs

The caller does not know and does not care whether ffmpeg-concat is a single script or a full pipeline of other capabilities.

The contract is identical: pass inputs, receive outputs. The implementation is opaque. This is the same principle that makes function composition work in programming — f(g(x)) does not require the caller to know what is inside g.

The consequence is that any composite capability can be treated as a leaf by its callers, which means any caller can itself become a composite, which means composition depth is unbounded.

Most orchestration systems do not have this property. Airflow DAGs call tasks. Temporal workflows call activities. GitHub Actions workflows run jobs, and jobs run steps. In all three, the higher-level construct is a different type from the lower-level construct. You cannot call a DAG from a DAG as if it were a task. You cannot swap a Temporal activity for a workflow without changing the call site. The type asymmetry is load-bearing, and it is the reason those systems push you to flatten.

Scram-Jet removes the asymmetry. The pipeline scheduler does not need to know whether the thing it is about to invoke is a leaf or a composite. It calls the capability registry, gets an endpoint, posts the input, and waits for the output. The complexity inside that capability is hidden.
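A minimal sketch of that uniformity, in Python. The registry, `register`, `invoke`, and `make_composite` are hypothetical names for illustration, not Scram-Jet's API; the point is only that a leaf and a composite sit behind the same call signature.

```python
from typing import Any, Callable, Dict, List

Inputs = Dict[str, Any]
Outputs = Dict[str, Any]
Capability = Callable[[Inputs], Outputs]

# Hypothetical in-memory registry: capability name -> callable.
registry: Dict[str, Capability] = {}


def register(name: str, fn: Capability) -> None:
    registry[name] = fn


def invoke(name: str, inputs: Inputs) -> Outputs:
    # The caller only ever sees this signature, leaf or composite.
    return registry[name](inputs)


def make_composite(step_names: List[str]) -> Capability:
    """Build a capability that chains other capabilities by name."""
    def run(inputs: Inputs) -> Outputs:
        data = inputs
        for step in step_names:
            data = invoke(step, data)  # each step may itself be a composite
        return data
    return run


# Two leaves.
register("upper", lambda d: {"text": d["text"].upper()})
register("exclaim", lambda d: {"text": d["text"] + "!"})
# A composite of leaves, then a composite that contains a composite.
register("shout", make_composite(["upper", "exclaim"]))
register("shout-twice", make_composite(["shout", "exclaim"]))

print(invoke("upper", {"text": "hi"}))        # leaf
print(invoke("shout-twice", {"text": "hi"}))  # two levels deep, same call
```

Swapping `upper` for a composite of the same name would change nothing at any call site, which is the whole argument.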


Leaf vs Composite vs Meta-Composite

There are three levels of composition. All three look identical to their callers.

Level 1: Leaf Capability

A leaf is a single script or subprocess. It takes input, does work, produces output.

# capabilities/ffmpeg-concat.yaml
name: ffmpeg-concat
version: 1.2.0
type: leaf-capability

runtime:
  command: python3 scripts/ffmpeg_concat.py
  timeout_seconds: 300

inputs:
  clip_keys:
    type: array
    items: { type: string, description: "Media Store keys for input clips" }
  output_format:
    type: string
    enum: [mp4, webm]
    default: mp4

outputs:
  output_key:
    type: string
    description: "Media Store key for the concatenated result"

cache:
  enabled: true
  ttl_seconds: 86400

One file. One process. No orchestration. The runtime.command is what Scram-Jet executes.

Level 2: Composite Capability

A composite is a pipeline of leaf capabilities (or other composites). It defines steps, data flow, and platform features like batching and parallelism. From the outside, it looks exactly like a leaf.

# capabilities/render-video.yaml
name: render-video
version: 2.1.0
type: composite-capability

inputs:
  script_text: { type: string }
  visual_beats:
    type: array
    items:
      type: object
      properties:
        timestamp_ms: { type: integer }
        query: { type: string }

outputs:
  video_key: { type: string }
  duration_ms: { type: integer }

steps:
  - id: find-visuals
    batch:
      over: "$.inputs.visual_beats"
      invoke:
        capability: stock-search
        version: "^1.0"
        input:
          query: "$.item.query"
          aspect_ratio: "16:9"
        output_key: "$.item.clip_key"
    cache: true

  - id: generate-voiceover
    invoke:
      capability: text-to-speech
      version: "^3.0"
      input:
        text: "$.inputs.script_text"
        voice: "nova"
    output_as: voiceover

  - id: concat-clips
    invoke:
      capability: ffmpeg-concat
      version: "^1.2"
      input:
        clip_keys: "$.steps.find-visuals[*].clip_key"
        output_format: mp4
    cache: true
    output_as: raw_video

  - id: mix-audio
    invoke:
      capability: audio-mix
      version: "^1.0"
      input:
        video_key: "$.steps.raw_video.output_key"
        audio_key: "$.steps.voiceover.audio_key"
        ducking_db: -18
    output_as: mixed_video

  - id: normalize-loudness
    invoke:
      capability: loudnorm
      version: "^1.0"
      input:
        media_key: "$.steps.mixed_video.output_key"
        target_lufs: -14
    output_as: final_video

outputs_map:
  video_key: "$.steps.final_video.output_key"
  duration_ms: "$.steps.final_video.duration_ms"

The caller sees render-video as a single capability. They post a script and a list of visual beats. They receive a finished video key. The five internal steps are invisible to them.

Level 3: Meta-Composite

A meta-composite is a pipeline that includes composites as steps. To Scram-Jet’s scheduler, render-video looks identical to ffmpeg-concat when invoked from a higher level.

# capabilities/produce-short-form.yaml
name: produce-short-form
version: 1.0.0
type: composite-capability

inputs:
  topic: { type: string }
  duration_seconds: { type: integer, default: 60 }

steps:
  - id: write-script
    invoke:
      capability: script-writer
      version: "^2.0"
      input:
        topic: "$.inputs.topic"
        duration_seconds: "$.inputs.duration_seconds"
        format: short-form
    output_as: script

  - id: extract-beats
    invoke:
      capability: beat-extractor
      version: "^1.0"
      input:
        script: "$.steps.script.text"
    output_as: beats

  - id: render
    invoke:
      capability: render-video        # <-- this is a composite, called like a leaf
      version: "^2.1"
      input:
        script_text: "$.steps.script.text"
        visual_beats: "$.steps.beats.visual_beats"
    budget_cents: 150
    output_as: video

  - id: upload
    invoke:
      capability: cdn-upload
      version: "^1.0"
      input:
        media_key: "$.steps.video.video_key"
        destination: "shorts/{topic}"

outputs_map:
  cdn_url: "$.steps.upload.url"
  duration_ms: "$.steps.video.duration_ms"

produce-short-form calls render-video exactly as it calls script-writer or cdn-upload. The meta-composite does not know or care that render-video is itself a pipeline of five steps.


Hierarchy Collapse: Flat API, Deep Execution

When a caller invokes produce-short-form, the interaction is one request and one response:

POST /capabilities/produce-short-form/invoke
{ "topic": "sourdough fermentation", "duration_seconds": 60 }

→ 200 OK
{ "cdn_url": "https://cdn.example.com/shorts/sourdough-fermentation.mp4", "duration_ms": 58400 }

Internally, Scram-Jet expands the full capability tree before execution begins:

produce-short-form (meta-composite)
├── script-writer (leaf)
├── beat-extractor (leaf)
├── render-video (composite)
│   ├── stock-search × N (leaf, batched)
│   ├── text-to-speech (leaf)
│   ├── ffmpeg-concat (leaf)
│   ├── audio-mix (leaf)
│   └── loudnorm (leaf)
└── cdn-upload (leaf)

The scheduler resolves the full tree, determines data dependencies, identifies which steps can run in parallel, and builds an execution plan. Then it runs. When it finishes, it collapses the tree and returns the final output to the original caller.
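One way to picture the planning step, as a rough Python sketch. It assumes dependencies can be read off the `$.steps.<name>` references in each step's input, and that `<name>` is either the step id or its `output_as` alias; the real scheduler may work differently. The step list is a simplified copy of the render-video example.

```python
import re
from typing import Dict, List, Set

# Simplified render-video steps: id, output alias, and the reference
# strings that appear in each step's input.
steps = [
    {"id": "find-visuals", "alias": "find-visuals",
     "refs": ["$.inputs.visual_beats"]},
    {"id": "generate-voiceover", "alias": "voiceover",
     "refs": ["$.inputs.script_text"]},
    {"id": "concat-clips", "alias": "raw_video",
     "refs": ["$.steps.find-visuals[*].clip_key"]},
    {"id": "mix-audio", "alias": "mixed_video",
     "refs": ["$.steps.raw_video.output_key", "$.steps.voiceover.audio_key"]},
    {"id": "normalize-loudness", "alias": "final_video",
     "refs": ["$.steps.mixed_video.output_key"]},
]

STEP_REF = re.compile(r"\$\.steps\.([A-Za-z0-9_-]+)")


def dependencies(steps: List[dict]) -> Dict[str, Set[str]]:
    """Map each step id to the ids of the steps whose output it reads."""
    alias_to_id = {s["alias"]: s["id"] for s in steps}
    deps: Dict[str, Set[str]] = {}
    for s in steps:
        names = {m for ref in s["refs"] for m in STEP_REF.findall(ref)}
        deps[s["id"]] = {alias_to_id[n] for n in names}
    return deps


def plan(steps: List[dict]) -> List[List[str]]:
    """Group steps into stages; steps in one stage can run in parallel."""
    deps = dependencies(steps)
    done: Set[str] = set()
    stages: List[List[str]] = []
    while len(done) < len(deps):
        ready = sorted(sid for sid, d in deps.items()
                       if sid not in done and d <= done)
        if not ready:
            raise ValueError("cycle in step dependencies")
        stages.append(ready)
        done.update(ready)
    return stages


for number, stage in enumerate(plan(steps), start=1):
    print(number, stage)
```

Under these assumptions find-visuals and generate-voiceover land in the same stage, because neither reads the other's output.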

The API surface is flat. The execution graph is deep. The caller does not experience the depth — they experience one request and one response.

Observability is the exception. The trace for a produce-short-form invocation shows the full tree, with timing, input/output sizes, cache hits, and cost attribution at every node. This is how you debug a slow pipeline — you look at the trace and find the step that took 12 seconds instead of 2. But the trace is a diagnostic tool, not a user-facing API.

Trace: produce-short-form [req-a7f3]
  ├── script-writer          340ms   $0.0031   cache: miss
  ├── beat-extractor         18ms    $0.0000   cache: miss
  ├── render-video           [composite, expanded]
  │   ├── stock-search × 12  4,100ms $0.0240   cache: 9 hits, 3 misses
  │   ├── text-to-speech     820ms   $0.0018   cache: miss
  │   ├── ffmpeg-concat      6,200ms $0.0000   cache: miss
  │   ├── audio-mix          1,100ms $0.0000   cache: miss
  │   └── loudnorm           430ms   $0.0000   cache: miss
  └── cdn-upload             290ms   $0.0000   cache: miss
  Total: 13,298ms  $0.0289

Content-Addressed Caching: Bazel’s Insight

Bazel reduced Google’s CI times from 45 minutes to 7 minutes with one idea: cache key = SHA256(inputs). If you have already computed the output for a given set of inputs, return the cached result. Do not re-execute.

The key insight is that the cache key does not include time, machine identity, or any ambient state. It includes only the capability name, the version, and the canonical representation of the inputs. Two builds on different machines, on different days, with identical inputs will both hit the same cache entry.

Applied to Scram-Jet:

cache_key = SHA256(
  capability_name +
  capability_version +
  canonical_json(inputs)     # sorted keys, normalized whitespace
)

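This means identical inputs to the same capability version always resolve to the same cache entry, no matter when or where the step runs.

Cache invalidation happens at the version boundary. Bumping text-to-speech from 3.0 to 3.1 produces a different cache key. Old cache entries are not used.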
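The key formula can be sketched in a few lines of Python. The NUL separator between fields is an assumption added here so that adjacent fields cannot run together; it is not part of the formula above, and `cache_key` is an illustrative name.

```python
import hashlib
import json


def canonical_json(value) -> str:
    # Sorted keys and no insignificant whitespace, so equal inputs
    # always serialize to the same string.
    return json.dumps(value, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)


def cache_key(name: str, version: str, inputs: dict) -> str:
    # Assumption: a NUL separator keeps ("ab", "c") and ("a", "bc")
    # from producing the same byte string.
    payload = "\x00".join([name, version, canonical_json(inputs)])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


a = cache_key("text-to-speech", "3.0.0", {"text": "hello", "voice": "nova"})
b = cache_key("text-to-speech", "3.0.0", {"voice": "nova", "text": "hello"})
c = cache_key("text-to-speech", "3.1.0", {"text": "hello", "voice": "nova"})

print(a == b)  # True: key order in the inputs does not matter
print(a == c)  # False: a version bump changes the key
```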

The practical impact in a video pipeline: a short-form video that renders from 12 stock clips and a voiceover costs full compute the first time. If the same video is re-rendered (because of a downstream bug, or because a reviewer requested a re-cut that ultimately reverts to the original), every step that received identical inputs returns instantly from cache. A 13-second pipeline becomes a 0.4-second pipeline.

# Cache is opt-in per step, but the mechanism is uniform
- id: generate-voiceover
  invoke:
    capability: text-to-speech
    input:
      text: "$.inputs.script_text"
      voice: nova
  cache: true          # SHA256(text-to-speech + version + {text, voice})
  budget_cents: 5

Orthogonal Concerns

The five platform features — batching, parallelism, buffering, caching, and budget control — are orthogonal to the logic inside each capability. They are declared in the pipeline YAML, not implemented in capability code. A capability author does not need to know whether their capability will be batched, cached, or budget-limited. They write a function. The platform handles the rest.

Batching

Fan-out over a list of items, invoking a capability once per item.

- id: find-visuals
  batch:
    over: "$.inputs.visual_beats"       # the list to iterate
    concurrency: 8                      # max parallel invocations
    invoke:
      capability: stock-search
      input:
        query: "$.item.query"           # $.item refers to the current element
        aspect_ratio: "16:9"

With concurrency: 8, the scheduler runs the 12 stock-search invocations at most 8 at a time; without a concurrency limit, all 12 run in parallel. The results are collected back into an array before the next step runs. The capability code sees one item at a time. The batching logic lives entirely in the scheduler.
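A bounded fan-out like this can be sketched with asyncio. This is an illustration of the technique, not the scheduler's code; `batch` and `fake_stock_search` are hypothetical names, and the sleep stands in for real I/O.

```python
import asyncio
from typing import Any, Awaitable, Callable, List


async def batch(items: List[Any],
                invoke: Callable[[Any], Awaitable[Any]],
                concurrency: int) -> List[Any]:
    """Invoke once per item, at most `concurrency` at a time, keeping order."""
    gate = asyncio.Semaphore(concurrency)

    async def one(item: Any) -> Any:
        async with gate:
            return await invoke(item)

    # gather returns results in the order of the inputs.
    return await asyncio.gather(*(one(item) for item in items))


# Stand-in for stock-search that records how many calls overlap.
in_flight = 0
peak = 0


async def fake_stock_search(beat: dict) -> dict:
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0.01)  # simulate I/O
    in_flight -= 1
    return {"clip_key": f"clip-for-{beat['query']}"}


beats = [{"query": f"q{i}"} for i in range(12)]
results = asyncio.run(batch(beats, fake_stock_search, concurrency=8))
print(len(results), "results, peak overlap:", peak)
```

The capability function sees one item per call and knows nothing about the semaphore, which mirrors the split described above.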

Parallelism

Run independent steps concurrently.

- id: parallel-generation
  parallel:
    - invoke:
        capability: stock-search
        input: { query: "$.inputs.scene_query" }
      output_as: visuals
    - invoke:
        capability: text-to-speech
        input: { text: "$.inputs.narration" }
      output_as: audio
  # Both steps start simultaneously. Next step waits for both.

stock-search and text-to-speech have no data dependency on each other. Running them in parallel shaves their combined latency to the slower of the two, rather than the sum of both.

Buffering

Steps pass data through Media Store rather than inline in the pipeline payload. Large binary outputs (video clips, audio files, rendered images) are stored as content-addressed keys. Steps receive keys, not bytes.

This is automatic. When a capability output is a binary blob, Scram-Jet stores it in Media Store and passes the key to the next step. Pipeline payloads stay small. Steps that receive the same media key do not re-download the blob — they access it directly in the local Media Store cache.
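To make "keys, not bytes" concrete, here is a small in-memory stand-in for a content-addressed store. The class, its methods, and the `sha256:` key prefix are assumptions for illustration; the actual Media Store interface is not shown in this article.

```python
import hashlib
from typing import Dict


class MediaStore:
    """In-memory stand-in for a content-addressed blob store."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, blob: bytes) -> str:
        # The key is derived from the bytes, so equal blobs share one key.
        key = "sha256:" + hashlib.sha256(blob).hexdigest()
        self._blobs.setdefault(key, blob)  # identical bytes are stored once
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def __len__(self) -> int:
        return len(self._blobs)


store = MediaStore()
clip = b"\x00\x01fake-video-bytes"
key_a = store.put(clip)
key_b = store.put(clip)          # same bytes, same key, no second copy
payload = {"output_key": key_a}  # this small dict is what travels between steps
print(payload, len(store))
```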

Caching

- id: concat-clips
  invoke:
    capability: ffmpeg-concat
    input:
      clip_keys: "$.steps.find-visuals[*].clip_key"
  cache: true     # adds cache_key check before invocation

cache: true on any step tells the scheduler to compute the cache key before invoking the capability. If a cache hit exists, the step result is returned immediately and the capability is not invoked. The capability code is unchanged.
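The check-before-invoke behaviour might look like the following sketch. `run_step`, the in-memory `cache` dict, and the call counter are hypothetical; the key is built the same way as the cache key formula earlier, with a NUL separator added as an assumption.

```python
import hashlib
import json
from typing import Any, Callable, Dict

cache: Dict[str, Any] = {}   # hypothetical in-memory result cache
calls = {"count": 0}         # counts real capability invocations


def run_step(name: str, version: str, inputs: dict,
             capability: Callable[[dict], Any], use_cache: bool) -> Any:
    if not use_cache:
        return capability(inputs)
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    raw = f"{name}\x00{version}\x00{canonical}"
    key = hashlib.sha256(raw.encode("utf-8")).hexdigest()
    if key in cache:
        return cache[key]          # hit: the capability is not invoked
    result = capability(inputs)    # miss: invoke, then remember the result
    cache[key] = result
    return result


def fake_concat(inputs: dict) -> dict:
    calls["count"] += 1
    return {"output_key": "concat-of-" + "+".join(inputs["clip_keys"])}


args = {"clip_keys": ["a", "b", "c"]}
first = run_step("ffmpeg-concat", "1.2.0", args, fake_concat, use_cache=True)
second = run_step("ffmpeg-concat", "1.2.0", args, fake_concat, use_cache=True)
print(first == second, calls["count"])
```

`fake_concat` contains no caching logic at all; the wrapper decides whether it runs.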

Budget Control

- id: render
  invoke:
    capability: render-video
    input:
      script_text: "$.steps.script.text"
      visual_beats: "$.steps.beats.visual_beats"
  budget_cents: 150    # hard limit for this step and all descendants

budget_cents: 150 on a composite step propagates a budget envelope through the entire sub-tree. If cumulative cost across all descendant capabilities exceeds 150 cents, the scheduler aborts with a BUDGET_EXCEEDED error. The capability code does not implement budget tracking — it is a platform concern.
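A budget envelope shared across a sub-tree can be sketched as one object passed down the recursion. The `Budget` class and the node layout are assumptions for illustration; the per-step costs are the ones from the trace above, converted to cents.

```python
class BudgetExceeded(Exception):
    pass


class Budget:
    """A spending envelope shared by a step and all of its descendants."""

    def __init__(self, limit_cents: float) -> None:
        self.limit_cents = limit_cents
        self.spent_cents = 0.0

    def charge(self, cents: float) -> None:
        self.spent_cents += cents
        if self.spent_cents > self.limit_cents:
            raise BudgetExceeded(
                f"BUDGET_EXCEEDED: spent {self.spent_cents:.2f} "
                f"of {self.limit_cents:.2f} cents"
            )


def run(node: dict, budget: Budget) -> None:
    # Leaves carry a cost; composites carry children. Both share one envelope.
    if "children" in node:
        for child in node["children"]:
            run(child, budget)
    else:
        budget.charge(node["cost_cents"])


render_video = {"children": [
    {"name": "stock-search x12", "cost_cents": 2.40},
    {"name": "text-to-speech", "cost_cents": 0.18},
    {"name": "ffmpeg-concat", "cost_cents": 0.0},
]}

ok = Budget(150)
run(render_video, ok)
print(round(ok.spent_cents, 2))

try:
    run(render_video, Budget(2))   # a 2-cent envelope is too small
except BudgetExceeded as err:
    print(err)
```

Note that this sketch detects the overrun after a charge is recorded; a real scheduler could also refuse to start a step whose estimate does not fit.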


Deterministic Step IDs

This is the lesson from Temporal and Cloudflare Workflows. It is also the most commonly skipped lesson.

When a workflow fails partway through and is retried, the scheduler needs to know which steps have already completed. It does this by looking up the step ID in the execution log. If the step ID was a random UUID generated at runtime, each retry produces different IDs. The scheduler cannot match the retry’s steps to the original run’s completed steps. Everything re-executes from the start.

Wrong:

// UUID generated at invocation time — different every run
const stepId = crypto.randomUUID()
await scheduler.invoke(stepId, 'text-to-speech', inputs)

Right:

// SHA256 of the step's identity — identical for identical work
const stepId = await sha256(
  `${pipelineId}:${stepName}:${capabilityName}:${JSON.stringify(canonicalInputs)}`
)
await scheduler.invoke(stepId, 'text-to-speech', inputs)

The deterministic step ID works like the cache key: it is derived only from what the step is and what it receives, never from when it runs. A step that completed in a previous run (same pipeline, same inputs) matches on lookup and returns its recorded result, even if the run is a full retry. This is idempotent execution: retry the whole pipeline, pay only for the steps that genuinely need re-running.

The trace for a retried pipeline should look like this:

Trace: render-video [req-a7f3, retry 2]
  ├── stock-search × 12     cache: 12 hits (previous run completed)
  ├── text-to-speech        cache: 1 hit  (previous run completed)
  ├── ffmpeg-concat         EXECUTED      (previous run failed here)
  ├── audio-mix             EXECUTED      (new)
  └── loudnorm              EXECUTED      (new)

Without deterministic IDs, all five steps re-execute. With deterministic IDs, only three do: the step that failed and the two after it. The savings compound with pipeline depth.
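The retry behaviour can be simulated in a few lines. This is a sketch under stated assumptions: `execution_log` is an in-memory dict standing in for the scheduler's durable log, and `run_pipeline` and the step tuples are hypothetical. The ID recipe follows the "Right" snippet above.

```python
import hashlib
import json
from typing import Callable, Dict, List, Tuple

execution_log: Dict[str, dict] = {}   # step_id -> recorded result

Step = Tuple[str, str, dict, Callable[[dict], dict]]


def step_id(pipeline: str, step: str, capability: str, inputs: dict) -> str:
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    raw = f"{pipeline}:{step}:{capability}:{canonical}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def run_pipeline(steps: List[Step]) -> List[str]:
    """Run steps in order and return one status string per step."""
    statuses = []
    for name, capability, inputs, fn in steps:
        sid = step_id("render-video", name, capability, inputs)
        if sid in execution_log:
            statuses.append(f"{name}: hit")
            continue
        execution_log[sid] = fn(inputs)   # raises on failure, nothing recorded
        statuses.append(f"{name}: executed")
    return statuses


attempts = {"concat": 0}


def flaky_concat(inputs: dict) -> dict:
    attempts["concat"] += 1
    if attempts["concat"] == 1:
        raise RuntimeError("ffmpeg crashed")   # fails on the first run only
    return {"output_key": "raw"}


steps: List[Step] = [
    ("generate-voiceover", "text-to-speech", {"text": "hi"},
     lambda i: {"audio_key": "vo"}),
    ("concat-clips", "ffmpeg-concat", {"clip_keys": ["a", "b"]}, flaky_concat),
    ("mix-audio", "audio-mix", {"video_key": "raw", "audio_key": "vo"},
     lambda i: {"output_key": "mixed"}),
]

try:
    run_pipeline(steps)
except RuntimeError:
    pass  # first run fails at concat-clips

retry = run_pipeline(steps)
print(retry)
```

Replace `step_id` with a random UUID and every lookup misses, so the retry starts from the first step again.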


The Full Example: render-video

Putting it together: a complete composite capability, followed by its invocation from a meta-composite.

# capabilities/render-video.yaml
name: render-video
version: 2.1.0
type: composite-capability
description: >
  Takes a script and visual beats, searches stock footage, generates voiceover,
  concatenates clips, mixes audio, and loudness-normalizes the result.

inputs:
  script_text:
    type: string
    description: "Full narration text for the video"
  visual_beats:
    type: array
    description: "Timestamps and search queries for visual cut points"
    items:
      type: object
      required: [timestamp_ms, query]
      properties:
        timestamp_ms: { type: integer }
        query: { type: string }

outputs:
  video_key:
    type: string
    description: "Media Store key for the rendered video"
  duration_ms:
    type: integer

steps:
  - id: find-visuals
    batch:
      over: "$.inputs.visual_beats"
      concurrency: 6
      invoke:
        capability: stock-search
        version: "^1.0"
        input:
          query: "$.item.query"
          aspect_ratio: "16:9"
          min_duration_ms: 3000
      collect_as: clips
    cache: true

  - id: generate-voiceover
    invoke:
      capability: text-to-speech
      version: "^3.0"
      input:
        text: "$.inputs.script_text"
        voice: nova
        output_format: mp3
    cache: true
    output_as: voiceover

  - id: validate-inputs
    invoke:
      capability: media-validator
      version: "^1.0"
      input:
        clip_keys: "$.steps.clips[*].clip_key"
        audio_key: "$.steps.voiceover.audio_key"
        require_min_clips: 3
    # Fails fast if any clip is missing or duration is 0.
    # Next steps do not run if this step fails.

  - id: concat-clips
    invoke:
      capability: ffmpeg-concat
      version: "^1.2"
      input:
        clip_keys: "$.steps.clips[*].clip_key"
        beat_timestamps_ms: "$.inputs.visual_beats[*].timestamp_ms"
        output_format: mp4
    cache: true
    output_as: raw_video

  - id: mix-audio
    invoke:
      capability: audio-mix
      version: "^1.0"
      input:
        video_key: "$.steps.raw_video.output_key"
        audio_key: "$.steps.voiceover.audio_key"
        ducking_db: -18
        fade_in_ms: 200
        fade_out_ms: 500
    output_as: mixed_video

  - id: normalize-loudness
    invoke:
      capability: loudnorm
      version: "^1.0"
      input:
        media_key: "$.steps.mixed_video.output_key"
        target_lufs: -14
        true_peak_dbfs: -1.0
    output_as: final

outputs_map:
  video_key: "$.steps.final.output_key"
  duration_ms: "$.steps.final.duration_ms"

Now the meta-composite that calls it. From produce-short-form’s perspective, render-video is a single step with one input schema and one output schema:

# capabilities/produce-short-form.yaml
name: produce-short-form
version: 1.0.0
type: composite-capability

inputs:
  topic: { type: string }
  duration_seconds: { type: integer, default: 60 }

steps:
  - id: write-script
    invoke:
      capability: script-writer
      version: "^2.0"
      input:
        topic: "$.inputs.topic"
        duration_seconds: "$.inputs.duration_seconds"
        style: conversational
        hook: first-5-seconds
    output_as: script

  - id: extract-beats
    invoke:
      capability: beat-extractor
      version: "^1.0"
      input:
        script: "$.steps.script.text"
        target_beat_count: 12
    output_as: beats

  - id: render
    invoke:
      capability: render-video   # composite, called identically to any leaf
      version: "^2.1"
      input:
        script_text: "$.steps.script.text"
        visual_beats: "$.steps.beats.visual_beats"
    budget_cents: 150
    output_as: video

  - id: upload
    invoke:
      capability: cdn-upload
      version: "^1.0"
      input:
        media_key: "$.steps.video.video_key"
        path_template: "shorts/{{ inputs.topic | slugify }}"
        public: true
    output_as: upload_result

outputs_map:
  cdn_url: "$.steps.upload_result.url"
  duration_ms: "$.steps.video.duration_ms"
  cost_cents: "$.meta.cost_cents"

The call from the outside:

const result = await capabilities.invoke('produce-short-form', {
  topic: 'sourdough fermentation',
  duration_seconds: 60,
})

// result.cdn_url    → "https://cdn.example.com/shorts/sourdough-fermentation.mp4"
// result.duration_ms → 58400
// result.cost_cents  → 2.89

One function call. Three levels of composition inside. The caller never sees the depth.


Depth Limits

Unlimited composition depth sounds good in theory. In practice, each level adds real latency from two sources:

  1. API Mom routing overhead: each invoke is an HTTP call to the capability registry, which validates the request, resolves the capability version, checks the cache, and routes to an executor. On a warm path this is 5-15ms per step. At high nesting depth with many steps per level, it accumulates.

  2. Media Store I/O: each step that produces binary output writes to Media Store. Each step that consumes binary output reads from Media Store. Deep pipelines with many binary-producing steps pay this cost at every level.

A practical measurement on a three-level pipeline:

| Depth | Compute time | Platform overhead | Overhead fraction |
| --- | --- | --- | --- |
| 1 (leaf) | 6,200ms | 12ms | 0.2% |
| 2 (composite) | 13,300ms | 85ms | 0.6% |
| 3 (meta-composite) | 13,300ms | 210ms | 1.6% |
| 5 (deep) | 13,300ms | 600ms | 4.3% |
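The fraction column is consistent with overhead divided by total wall time (compute plus overhead). That reading is an inference from the numbers, not something the measurements state, but it reproduces every row:

```python
rows = [
    ("1 (leaf)", 6200, 12),
    ("2 (composite)", 13300, 85),
    ("3 (meta-composite)", 13300, 210),
    ("5 (deep)", 13300, 600),
]


def overhead_fraction(compute_ms: float, overhead_ms: float) -> float:
    # Share of total wall time spent in the platform rather than in capabilities.
    return overhead_ms / (compute_ms + overhead_ms)


for depth, compute_ms, overhead_ms in rows:
    print(f"{depth:<20} {overhead_fraction(compute_ms, overhead_ms):.1%}")
```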

At 5 levels, overhead is still under 5%. Beyond that, it grows faster because step counts multiply. The recommended hard limit is 5 levels. If you find yourself needing level 6, the right fix is to flatten the innermost composite into its parent rather than adding another nesting level.

Level 6 smell: produce-series → produce-episode → produce-segment → render-video → mix-media → transcode
Fix: flatten mix-media + transcode into render-video

Anti-Patterns

Compose for network calls, not for lines of code

A composite capability is a network call. It crosses API Mom’s routing layer, hits the registry, serializes inputs and outputs through Media Store. For three lines of Python that transform a string, this cost is indefensible.

# Wrong: wrapping trivial logic in a capability
- id: slugify-title
  invoke:
    capability: string-slugify    # 4ms of actual work, 15ms of overhead
    input: { text: "$.inputs.topic" }

# Right: do trivial transforms inline in the script
# The capability that needs the slug (here, cdn-upload) generates it itself

A capability should represent work that takes longer than the platform overhead to invoke it (roughly 15ms for a warm path). If the work takes 2ms, inline it.

Do not hide side effects in composites

Each step in a composite should be deterministic: same inputs produce same outputs, with no observable effects beyond the output. Side effects — writing to a database, posting to an API, sending a notification — belong in explicitly named leaf capabilities at the edges of the pipeline, not buried inside intermediate composites.

When a pipeline is retried, Scram-Jet re-executes non-cached steps. A step that posts to a social media API on every execution will post twice. A step that inserts a database row will insert twice. Make side effects explicit, name them clearly, and put idempotency guards in their capability scripts.

# Wrong: side effect inside a step with cache: true
- id: mix-and-post
  invoke:
    capability: audio-mix-and-notify-slack   # hides a POST to Slack inside
  cache: true   # if retried with cache miss, Slack gets two messages

# Right: side effects at the pipeline edge, never cached
- id: mix-audio
  invoke: { capability: audio-mix }
  cache: true

- id: notify-complete
  invoke:
    capability: slack-notify       # explicit, obviously a side effect
    input: { channel: "#renders", message: "Render complete" }
  cache: false   # explicit, never cached
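An idempotency guard inside a side-effecting capability script might look like this sketch. Everything here is hypothetical: `sent_keys` stands in for a durable store, `outbox` stands in for the real Slack call, and `render_id` is an assumed input that stays stable across retries.

```python
import hashlib
import json
from typing import Dict, List

sent_keys: Dict[str, bool] = {}   # stand-in for a durable idempotency store
outbox: List[dict] = []           # stand-in for the real Slack API


def idempotency_key(render_id: str, channel: str, message: str) -> str:
    raw = json.dumps([render_id, channel, message])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def slack_notify(render_id: str, channel: str, message: str) -> bool:
    """Send at most once per (render_id, channel, message). True if sent."""
    key = idempotency_key(render_id, channel, message)
    if key in sent_keys:
        return False              # a retry: skip the side effect
    outbox.append({"channel": channel, "message": message})
    sent_keys[key] = True
    return True


print(slack_notify("req-a7f3", "#renders", "Render complete"))  # first attempt
print(slack_notify("req-a7f3", "#renders", "Render complete"))  # pipeline retry
print(len(outbox))
```

The guard lives in the leaf that owns the side effect, so the pipeline can retry freely without the composite knowing anything about Slack.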

Validate between steps, not at the end

Validation failures discovered at the final step waste all the compute that preceded them. Validate early and fail fast. The media-validator step in the render-video example runs before ffmpeg-concat. If any clip is missing or zero-duration, the pipeline fails before spending 6 seconds on FFmpeg.

# Wrong — FFmpeg runs for 6 seconds then fails on bad input
- id: concat-clips
  invoke:
    capability: ffmpeg-concat
    input:
      clip_keys: "$.steps.clips[*].clip_key"

# Right — validate before expensive compute
- id: validate-clips
  invoke:
    capability: media-validator
    input:
      clip_keys: "$.steps.clips[*].clip_key"
      require_min_clips: 3
      require_min_duration_ms: 2000

- id: concat-clips
  invoke:
    capability: ffmpeg-concat
    input:
      clip_keys: "$.steps.clips[*].clip_key"
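For a sense of how cheap the check is compared with a failed FFmpeg run, here is a sketch of what a validator could do. The function, the `durations_ms` lookup, and the error type are assumptions; only the parameter names `require_min_clips` and `require_min_duration_ms` come from the YAML above.

```python
from typing import Dict, List


class ValidationError(Exception):
    pass


def validate_clips(clip_keys: List[str], durations_ms: Dict[str, int],
                   require_min_clips: int,
                   require_min_duration_ms: int) -> None:
    """Raise before any expensive step runs if the clip list is unusable."""
    if len(clip_keys) < require_min_clips:
        raise ValidationError(
            f"need {require_min_clips} clips, got {len(clip_keys)}")
    for key in clip_keys:
        duration = durations_ms.get(key)
        if duration is None:
            raise ValidationError(f"missing clip: {key}")
        if duration < require_min_duration_ms:
            raise ValidationError(f"clip too short: {key} ({duration}ms)")


# Hypothetical metadata lookup: clip key -> duration in milliseconds.
durations = {"clip-1": 4200, "clip-2": 3100, "clip-3": 0}

try:
    validate_clips(["clip-1", "clip-2", "clip-3"], durations,
                   require_min_clips=3, require_min_duration_ms=2000)
except ValidationError as err:
    print("failed fast:", err)
```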

Comparison: Airflow, Temporal, GitHub Actions

These are the three most commonly proposed alternatives when discussing pipeline orchestration. All three are excellent systems. None of them has the recursive property described in this article.

Apache Airflow

Airflow DAGs call tasks. A task is a Python function or a bash command. You cannot call a DAG from a task as if the DAG were a task. You can trigger a DAG from a DAG using TriggerDagRunOperator, but the triggered DAG is a separate DAG run with its own context, and by default the operator does not wait for it. Even with wait_for_completion=True, the child is not a first-class step in the parent DAG’s execution graph: the parent cannot take the child DAG’s output and use it in its next step without external state management (XComs, which are limited to small payloads).

This means Airflow workflows are effectively two levels deep: one DAG, with tasks inside it.

Temporal

Temporal workflows call activities. You can also call a child workflow from a parent workflow, which is closer to the recursive property: parent workflows can wait for child workflow results. But activities and workflows are distinct types. An activity cannot be swapped for a workflow without changing the call site. The type asymmetry is explicit in the SDKs; in the Go SDK, for example, it is workflow.ExecuteActivity() vs workflow.ExecuteChildWorkflow().

Temporal also does not have a unified capability registry or content-addressed caching. Each activity type is registered separately, versioned separately, and cached (if at all) through custom application code.

GitHub Actions

Actions workflows can be called from other workflows using the workflow_call trigger. This is the closest to the recursive property in the tools listed. But GitHub Actions does not support content-addressed caching across workflow invocations. Cache keys are manually specified strings, not SHA256 of inputs. Fan-out is available only through the job matrix, not as a per-step batching primitive. There is no budget control. And the primitive types are still distinct: a job runs steps; a workflow runs jobs; a caller workflow calls reusable workflows. The layers are not interchangeable.

The Difference

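| Property | Airflow | Temporal | GitHub Actions | Scram-Jet |
| --- | --- | --- | --- | --- |
| Recursive composition | No | Partial | Partial | Yes |
| Uniform type (leaf = composite) | No | No | No | Yes |
| Content-addressed caching | No | No | No | Yes |
| Fan-out batching primitive | Operator | Manual | Matrix | Native |
| Budget control | No | No | No | Native |
| Deterministic step IDs | Manual | Native | N/A | Native |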

The critical row is “Uniform type.” When leaf and composite are the same type, you get unlimited depth, natural reuse, and the ability to upgrade the implementation of a capability without changing any of its callers. That property is what makes a pipeline architecture feel like a programming language rather than a configuration file.

