Gary Wu

The Media Store Pattern

Every service in a pipeline reinvents file storage — different APIs, different buckets, different metadata schemas. The fix is a single content-addressable store: bytes in, SHA-256 key out. Upload the same file twice and you get the same key with no duplicate storage. The store becomes the implicit buffer between every pipeline stage, the LRU cache for expensive computations, and the deduplication layer for free.


The Problem

Count the file storage surfaces in a typical AI media pipeline:

Each service has its own upload endpoint, its own metadata schema, its own storage conventions. When a pipeline capability needs the output from a previous step, it has to know which service produced it, what bucket it’s in, and how to authenticate against that specific API.

This is not an integration problem — it’s an architecture problem. The services are not wrong. They’re just each solving storage independently, and the cost is paid at the seams.

The fix is to stop treating file storage as a feature each service implements and start treating it as infrastructure every service shares.


One API, One Bucket

Media Store has a single contract: bytes in, key out.

// Upload any bytes — screenshot, MP4, TTS audio, generated image
const key = await mediaStore.put(bytes, {
  contentType: 'video/mp4',
  tags: ['clips', 'wan'],
});

// Download by key: same API regardless of what the bytes represent
// (named `downloaded` so it does not redeclare the `bytes` passed to put above)
const downloaded = await mediaStore.get(key);

The store does not care what the bytes represent. It does not have a “clips” table and a “screenshots” table and an “audio” table. It has bytes, and it has keys.

Every capability in the pipeline talks to the same endpoint. Video Factory puts a clip in Media Store. A later capability reads it by key. Neither capability knows or cares about the other’s internal storage model.

This is the same insight that made S3 dominant: don’t make storage opinionated about content. Bytes are bytes. Make the interface simple and let the content decide what it means.


Content-Addressable Dedup

The key is not a UUID. It is the SHA-256 of the file content.

import { createHash } from 'crypto';

function contentKey(bytes: Uint8Array): string {
  return createHash('sha256').update(bytes).digest('hex');
}

// Same file, same key — always
const key1 = contentKey(ttsAudio); // "a3f2b8..."
const key2 = contentKey(ttsAudio); // "a3f2b8..." — identical

Upload the same file twice: same key, one object in R2, no wasted bytes. This is not an optimization layered on top of storage — it is a consequence of the key derivation. Deduplication is free.
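
To make the "same key, one object" behaviour concrete, here is a minimal in-memory sketch. `InMemoryMediaStore` is a hypothetical stand-in for the R2-backed store, and its methods are synchronous only to keep the example short; the real store's API is async.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical in-memory stand-in for the bucket, for illustration only.
class InMemoryMediaStore {
  private objects = new Map<string, Uint8Array>();
  writes = 0; // counts physical writes so deduplication is visible

  put(bytes: Uint8Array): string {
    const key = createHash('sha256').update(bytes).digest('hex');
    if (!this.objects.has(key)) {
      // Copy so later mutations by the caller cannot change the stored bytes.
      this.objects.set(key, bytes.slice());
      this.writes++;
    }
    return key;
  }

  get(key: string): Uint8Array | undefined {
    return this.objects.get(key);
  }

  count(): number {
    return this.objects.size;
  }
}

const store = new InMemoryMediaStore();
const audio = new TextEncoder().encode('pretend these are TTS audio bytes');

const first = store.put(audio);
const second = store.put(audio); // same bytes again: no new object
const other = store.put(new TextEncoder().encode('different bytes'));

console.log(first === second); // true
console.log(store.count());    // 2
console.log(store.writes);     // 2
```

The second `put` returns the existing key without writing, which is the whole deduplication mechanism: nothing compares files, the key derivation does the work.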

Git, Docker, and IPFS all use this model. npm’s cacache is the canonical JS reference implementation: it stores packages by content hash, checks existence before writing, and returns cached results by key. Docker Hub’s content-addressable layer store achieved roughly 2x compression on their image registry because layers are shared across images by hash. A 2019 analysis of Docker Hub found that about 97% of files across image layers were duplicates, a number that sounds extreme until you consider how many images share a base Ubuntu layer.

Our pipeline’s duplication profile is different. Video clips are mostly unique. But TTS outputs are not — “Hello world” synthesized with voice “christopher” at the same parameters produces the same audio bytes every time. Stock footage reused across multiple projects is identical bytes. Generated images from the same prompt and seed are identical. For these asset classes, content-addressing is the difference between generating once and generating thousands of times.

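The storage check is a single KV lookup before any R2 write. KV is eventually consistent, so a very recent upload can still read as missing; the worst case is a redundant R2 write of identical bytes under the same key, which is harmless: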

async function putIdempotent(
  bytes: Uint8Array,
  meta: FileMeta,
  env: Env,
): Promise<string> {
  const key = contentKey(bytes);

  // Check KV first — sub-millisecond
  const existing = await env.MEDIA_KV.get(`file:${key}`);
  if (existing) return key; // Already stored, no R2 write needed

  // Write to R2 and index
  await env.MEDIA_BUCKET.put(key, bytes, {
    httpMetadata: { contentType: meta.contentType },
    customMetadata: { tags: meta.tags.join(',') },
  });
  await env.MEDIA_KV.put(`file:${key}`, JSON.stringify({
    key,
    size: bytes.byteLength,
    contentType: meta.contentType,
    tags: meta.tags,
    createdAt: Date.now(),
  }));
  await indexInD1(key, meta, env);

  return key;
}

The Hybrid Storage Stack

The naive design puts everything in D1. D1 is a managed SQLite service — it handles SQL queries, pagination, filtering. But D1’s latency on the hot path (file existence checks, metadata lookups) is 5–20ms per query under load. At scale, that adds up.

Research into Cloudflare’s own serverless-registry — their OCI-compatible container registry built on Workers primitives — shows the same pattern we need: KV for fast existence checks, R2 for blob storage, and a SQL layer only when you need it.

The three-layer stack:

┌─────────────────────────────────────────────┐
│  KV  — hot path                             │
│  • file:sha256hex → FileMetaJSON            │
│  • cache:sha256hex → CacheEntryJSON         │
│  Latency: <1ms. No SQL. No joins.           │
├─────────────────────────────────────────────┤
│  D1  — cold path                            │
│  • file_index table (key, size, type, tags) │
│  • Search, filter, pagination, tag queries  │
│  Latency: 5–20ms. SQL. Acceptable for UX.  │
├─────────────────────────────────────────────┤
│  R2  — bytes                                │
│  • Keyed by SHA-256 hex                     │
│  • Zero egress cost (reads are free)        │
│  • No CDN needed — R2 + Cache API handles  │
└─────────────────────────────────────────────┘

KV for existence and metadata on the critical path. D1 for queries that can tolerate latency: “list all MP4 files tagged wan uploaded this week.” R2 for bytes because R2’s zero-egress model makes media delivery structurally cheaper than S3 or GCS at any volume.

The D1 schema is deliberately narrow:

CREATE TABLE file_index (
  key         TEXT PRIMARY KEY,   -- SHA-256 hex
  size        INTEGER NOT NULL,
  content_type TEXT NOT NULL,
  created_at  INTEGER NOT NULL,
  tags        TEXT DEFAULT ''     -- comma-separated, queryable via LIKE
);

CREATE INDEX idx_content_type ON file_index(content_type);
CREATE INDEX idx_created_at   ON file_index(created_at);

Tag queries use WHERE tags LIKE '%wan%'. It is not beautiful, and because it is a substring match, '%wan%' also matches a tag such as swan, but it is adequate for the search volumes a pipeline generates. If tag cardinality explodes, a normalized file_tags table is a single migration.
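
One way to avoid substring false positives without a second table is to wrap the stored tag list in delimiters and match whole tokens. The following is a sketch under stated assumptions: tags are lowercase slugs, and `encodeTags`, `tagLikePattern`, and `hasTag` are hypothetical helpers that are not part of the Media Store code in this article. Adopting it would mean writing the tags column in this encoded form everywhere.

```typescript
// Assumption: tags are lowercase slugs, so they never contain ',', '%' or '_'.
const TAG_RE = /^[a-z0-9-]+$/;

// Encode as ",a,b," so every tag is surrounded by delimiters.
function encodeTags(tags: string[]): string {
  for (const t of tags) {
    if (!TAG_RE.test(t)) throw new Error(`invalid tag: ${t}`);
  }
  const unique = Array.from(new Set(tags)).sort();
  return unique.length ? `,${unique.join(',')},` : '';
}

// Value to bind for: WHERE tags LIKE ?
function tagLikePattern(tag: string): string {
  if (!TAG_RE.test(tag)) throw new Error(`invalid tag: ${tag}`);
  return `%,${tag},%`;
}

// Mirrors what LIKE '%,tag,%' matches, so the idea can be checked without a database.
function hasTag(encoded: string, tag: string): boolean {
  return encoded.includes(`,${tag},`);
}

const encoded = encodeTags(['swan', 'clips']);
console.log(encoded);                      // ",clips,swan,"
console.log(hasTag(encoded, 'wan'));       // false
console.log(hasTag(encoded, 'swan'));      // true
console.log('clips,swan'.includes('wan')); // true: the false positive of the plain comma-joined form
```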


As Pipeline Buffer

In a Scram-Jet pipeline, capabilities are composable units. Each capability reads inputs and writes outputs. The naive approach passes bytes between capabilities — the output of capability A is fed as input bytes to capability B.

The Media Store approach passes keys instead.

// Capability A: generate a video clip
async function generateClip(prompt: string, env: Env): Promise<string> {
  const bytes = await wanVideoGenerate(prompt);
  return mediaStore.put(bytes, { contentType: 'video/mp4', tags: ['wan'] }, env);
  // Returns: "a3f2b8c1..." — a key, not bytes
}

// Capability B: add captions to a clip
async function addCaptions(clipKey: string, env: Env): Promise<string> {
  const bytes = await mediaStore.get(clipKey, env);
  const captioned = await whisperCaption(bytes);
  return mediaStore.put(captioned, { contentType: 'video/mp4', tags: ['captioned'] }, env);
}

// Pipeline: keys flow, not bytes
const clipKey      = await generateClip("sunset timelapse", env);
const captionedKey = await addCaptions(clipKey, env);

The pipeline doesn’t move bytes between steps. It moves 64-character strings. Passing a key through KV takes on the order of a millisecond. Passing the bytes themselves is bounded by request size limits and the 128 MB memory limit of a Worker, plus the CPU time spent on serialization.

More importantly, the pipeline becomes resumable. If capability B fails, the key is still valid. Retry from the key, not from the beginning. The buffer between every stage is already materialized in durable storage.
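
A sketch of what "retry from the key" can look like. `runPipeline`, `Stage`, and the `checkpoints` map are hypothetical names for illustration; in a real deployment the checkpoint map might live in KV, and the keys would be Media Store SHA-256 keys rather than the readable strings used here.

```typescript
type Stage = { name: string; run: (inputKey: string) => Promise<string> };

// Runs stages in order. Each stage's output key is recorded under
// "<stage name>:<input key>", so a retry skips work that is already materialized.
async function runPipeline(
  initialKey: string,
  stages: Stage[],
  checkpoints: Map<string, string>,
): Promise<string> {
  let key = initialKey;
  for (const stage of stages) {
    const checkpointId = `${stage.name}:${key}`;
    const done = checkpoints.get(checkpointId);
    if (done !== undefined) {
      key = done; // output already exists, do not recompute
      continue;
    }
    key = await stage.run(key);
    checkpoints.set(checkpointId, key);
  }
  return key;
}

// Demo: the caption stage fails once; the retry does not re-run generation.
async function demo(): Promise<{ finalKey: string; generateCalls: number }> {
  let generateCalls = 0;
  let captionAttempts = 0;
  const stages: Stage[] = [
    { name: 'generate', run: async (k) => { generateCalls++; return `clip-of-${k}`; } },
    {
      name: 'caption',
      run: async (k) => {
        captionAttempts++;
        if (captionAttempts === 1) throw new Error('transient failure');
        return `captioned-${k}`;
      },
    },
  ];
  const checkpoints = new Map<string, string>();
  const attempt = () => runPipeline('prompt-key', stages, checkpoints);
  const finalKey = await attempt().catch(() => attempt()); // one retry
  return { finalKey, generateCalls };
}

demo().then((r) => console.log(r.finalKey, r.generateCalls));
// prints: captioned-clip-of-prompt-key 1
```

Keying checkpoints by stage name plus input key means a different input never reuses a stale output, which is the same content-addressing idea applied to pipeline state.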


As LRU Cache

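Content-addressing extends naturally to computation caching. The pattern: hash the capability name plus its canonical inputs to get a cache key. Before invoking the capability, check Media Store. If the key exists and hasn’t expired, return the cached result. Despite the name, eviction in the code below is by TTL rather than least-recently-used order: entries simply expire after a fixed lifetime.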

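// Recursively sort object keys and drop undefined values so that logically
// equal inputs serialize to the same string.
function canonicalize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>)
        .filter(([, v]) => v !== undefined)
        .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
        .map(([k, v]) => [k, canonicalize(v)]),
    );
  }
  return value;
}

function cacheKey(capability: string, inputs: Record<string, unknown>): string {
  // A replacer array would also act as an allow-list and drop `inputs` itself,
  // so canonicalize first and stringify without a replacer.
  const canonical = JSON.stringify({ capability, inputs: canonicalize(inputs) });
  return createHash('sha256').update(canonical).digest('hex');
}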

async function cachedCapability<T>(
  capability: string,
  inputs: Record<string, unknown>,
  invoke: () => Promise<T>,
  env: Env,
  ttlMs = 7 * 24 * 60 * 60 * 1000, // 7 days
): Promise<T> {
  const key = cacheKey(capability, inputs);
  const cached = await env.MEDIA_KV.get(`cache:${key}`, 'json') as CacheEntry<T> | null;

  if (cached && Date.now() < cached.expiresAt) {
    return cached.result;
  }

  const result = await invoke();
  await env.MEDIA_KV.put(`cache:${key}`, JSON.stringify({
    result,
    expiresAt: Date.now() + ttlMs,
  }), { expirationTtl: Math.ceil(ttlMs / 1000) });

  return result;
}

// TTS of "Hello world" with voice "christopher": generated once, then served from cache until the TTL expires
const audio = await cachedCapability(
  'tts',
  { text: 'Hello world', voice: 'christopher', speed: 1.0 },
  () => elevenLabsTTS('Hello world', 'christopher'),
  env,
);

Conceptually, SHA-256("tts" + "Hello world" + "christopher" + "1.0") is the same value every time those inputs are the same. The TTS API charges per character, so a cache hit at $0/call versus a cache miss at $0.0003/character adds up fast across a production pipeline. npm’s cacache applies the same idea to package tarballs: the first install of a package version downloads and hashes it; every subsequent install reads from the local cache by hash.

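The inputs object must be canonicalized (sorted keys, no undefined values) before hashing. An object like { voice: "christopher", text: "Hello world" } and { text: "Hello world", voice: "christopher" } must produce the same key. Serializing with recursively sorted keys handles this. Passing an array of key names as the JSON.stringify replacer is not enough, because that array also acts as an allow-list at every nesting level and silently drops any property it does not name.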


R2 Reliability Reality

R2 is not infallible. In March 2025, Cloudflare reported an R2 incident in which 100% of write operations failed for about an hour, along with a share of reads. A pipeline that assumes every write lands, and never verifies or retries, breaks in exactly this kind of incident.

MP4 streaming has a separate class of issues: R2’s range request support is functional but bucket-wide rate limits can cause 429s during high-throughput streaming. A pipeline that reads back large video files for processing will hit these limits before a pipeline that reads small audio files.

Bucket-wide locks during maintenance windows are undocumented but observed. A single large multipart upload in progress can degrade write latency for the entire bucket.

The mitigations:

async function r2PutWithRetry(
  bucket: R2Bucket,
  key: string,
  bytes: ArrayBuffer | ArrayBufferView, // buffered, not a ReadableStream: a consumed stream cannot be replayed on retry
  opts: R2PutOptions,
  maxAttempts = 4,
): Promise<R2Object> {
  const delays = [500, 1500, 4000]; // backoff in ms; the last value is reused for any further attempts

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      const obj = await bucket.put(key, bytes, opts);
      if (!obj) throw new Error('R2 put returned null; treating as a failed write');
      return obj;
    } catch (err) {
      if (attempt === maxAttempts - 1) throw err;
      await new Promise(r => setTimeout(r, delays[Math.min(attempt, delays.length - 1)]));
    }
  }
  throw new Error('unreachable');
}

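The if (!obj) check is defensive. The Workers binding types put() as returning R2Object | null (null is what a conditional put returns when its onlyIf precondition fails), so treating null as an error keeps an unwritten object from being indexed as stored. Always validate the return value.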

For reads, layer the Cache API in front of R2. Cloudflare’s Cache API stores responses at the edge. Once a file is read once, subsequent reads for the same key hit the cache rather than R2:

async function r2GetWithCache(
  request: Request,
  key: string,
  bucket: R2Bucket,
): Promise<Response> {
  const cacheKey = new Request(`https://media-store.internal/${key}`);
  const cached = await caches.default.match(cacheKey);
  if (cached) return cached;

  const obj = await bucket.get(key);
  if (!obj) return new Response('Not Found', { status: 404 });

  const response = new Response(obj.body, {
    headers: {
      'Content-Type': obj.httpMetadata?.contentType ?? 'application/octet-stream',
      'Cache-Control': 'public, max-age=31536000, immutable',
    },
  });

  // Cache immutably — content-addressed keys never change content
  await caches.default.put(cacheKey, response.clone());
  return response;
}

Content-addressed keys are immutable by definition. The bytes at key a3f2b8... are always and forever the bytes that hash to a3f2b8.... Setting Cache-Control: immutable is safe and correct.
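
The same property gives readers a cheap integrity check: recompute the hash of what was read and compare it to the key that was requested. A minimal sketch; `verifyContent` is a hypothetical helper, not part of the endpoint code in this article.

```typescript
import { createHash } from 'node:crypto';

// Returns true only if `bytes` really are the content that `key` names.
function verifyContent(key: string, bytes: Uint8Array): boolean {
  const actual = createHash('sha256').update(bytes).digest('hex');
  return actual === key.toLowerCase();
}

const original = new TextEncoder().encode('hello');
const key = createHash('sha256').update(original).digest('hex');

// Simulate corruption by flipping the bits of the first byte of a copy.
const corrupted = original.slice();
corrupted[0] ^= 0xff;

console.log(verifyContent(key, original));  // true
console.log(verifyContent(key, corrupted)); // false
```

Hashing a large video on every read costs CPU time, so a check like this probably belongs on the write path or in a periodic audit rather than in front of every download.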


API Design

The full endpoint surface in Hono:

import { Hono } from 'hono';
const app = new Hono<{ Bindings: Env }>();

// PUT /v1/files — upload bytes, get key
app.put('/v1/files', async (c) => {
  const bytes = await c.req.arrayBuffer();
  const contentType = c.req.header('Content-Type') ?? 'application/octet-stream';
  const tags = (c.req.header('X-Tags') ?? '').split(',').filter(Boolean);

  const key = await putIdempotent(new Uint8Array(bytes), { contentType, tags }, c.env);
  return c.json({ key }, 201);
});

// GET /v1/files/:key — download bytes
app.get('/v1/files/:key', async (c) => {
  const { key } = c.req.param();
  return r2GetWithCache(c.req.raw, key, c.env.MEDIA_BUCKET);
});

// GET /v1/files/:key/meta — metadata from KV (fast)
app.get('/v1/files/:key/meta', async (c) => {
  const { key } = c.req.param();
  const meta = await c.env.MEDIA_KV.get(`file:${key}`, 'json');
  if (!meta) return c.json({ error: 'Not found' }, 404);
  return c.json(meta);
});

// GET /v1/files — list/search from D1 (slower, SQL)
app.get('/v1/files', async (c) => {
  const { tag, type, limit = '50', cursor } = c.req.query();
  // Clamp the page size so a malformed or huge ?limit= cannot produce "LIMIT NaN" or an unbounded scan.
  // The cap of 200 is an arbitrary choice for this sketch.
  const pageSize = Math.min(Math.max(parseInt(limit, 10) || 50, 1), 200);
  let query = 'SELECT key, size, content_type, created_at, tags FROM file_index WHERE 1=1';
  const params: unknown[] = [];

  if (tag)  { query += ' AND tags LIKE ?';         params.push(`%${tag}%`); }
  if (type) { query += ' AND content_type LIKE ?'; params.push(`${type}%`); }
  if (cursor) { query += ' AND created_at < ?';    params.push(parseInt(cursor, 10)); }

  query += ' ORDER BY created_at DESC LIMIT ?';
  params.push(pageSize);

  const { results } = await c.env.DB.prepare(query).bind(...params).all();
  const nextCursor = results.length === pageSize
    ? String(results[results.length - 1].created_at)
    : null;

  return c.json({ files: results, nextCursor });
});

// POST /v1/files/:key/tag — add tags
app.post('/v1/files/:key/tag', async (c) => {
  const { key } = c.req.param();
  const { tags } = await c.req.json<{ tags: string[] }>();

  const existing = await c.env.MEDIA_KV.get<FileMeta>(`file:${key}`, 'json');
  if (!existing) return c.json({ error: 'Not found' }, 404);

  const merged = [...new Set([...existing.tags, ...tags])];
  await c.env.MEDIA_KV.put(`file:${key}`, JSON.stringify({ ...existing, tags: merged }));
  await c.env.DB.prepare(
    'UPDATE file_index SET tags = ? WHERE key = ?'
  ).bind(merged.join(','), key).run();

  return c.json({ key, tags: merged });
});

// GET /v1/cache/:key — check LRU cache
app.get('/v1/cache/:key', async (c) => {
  const { key } = c.req.param();
  const entry = await c.env.MEDIA_KV.get(`cache:${key}`, 'json');
  if (!entry) return c.json({ hit: false }, 404);
  return c.json({ hit: true, entry });
});

// PUT /v1/cache/:key — write LRU cache entry
app.put('/v1/cache/:key', async (c) => {
  const { key } = c.req.param();
  const body = await c.req.json<{ result: unknown; ttlMs?: number }>();
  const ttlMs = body.ttlMs ?? 7 * 24 * 60 * 60 * 1000;

  await c.env.MEDIA_KV.put(`cache:${key}`, JSON.stringify({
    result: body.result,
    expiresAt: Date.now() + ttlMs,
  }), { expirationTtl: Math.ceil(ttlMs / 1000) });

  return c.json({ key });
});

export default app;

The PUT /v1/files endpoint is idempotent by construction — the same bytes always return the same key, so retrying a failed upload is safe. The GET /v1/files list endpoint uses cursor-based pagination via created_at timestamp rather than OFFSET, which degrades gracefully on large tables where OFFSET scans grow linearly.
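
One caveat with a cursor built from created_at alone is that rows sharing a timestamp at a page boundary can be skipped. A common remedy is a compound cursor over (created_at, key). The sketch below uses an in-memory array in place of D1; `listPage`, `encodeCursor`, and `decodeCursor` are hypothetical names, and the SQL in the comment is an untested illustration of the equivalent query.

```typescript
type Row = { key: string; created_at: number };

// The cursor carries both sort columns so ties on created_at are resolved by key.
function encodeCursor(row: Row): string {
  return `${row.created_at}:${row.key}`;
}

function decodeCursor(cursor: string): Row {
  const i = cursor.indexOf(':');
  return { created_at: Number(cursor.slice(0, i)), key: cursor.slice(i + 1) };
}

// In-memory equivalent of:
//   WHERE created_at < ?1 OR (created_at = ?1 AND key < ?2)
//   ORDER BY created_at DESC, key DESC LIMIT ?3
function listPage(rows: Row[], limit: number, cursor?: string) {
  const after = cursor ? decodeCursor(cursor) : null;
  const sorted = [...rows].sort(
    (a, b) => b.created_at - a.created_at || (a.key < b.key ? 1 : a.key > b.key ? -1 : 0),
  );
  const remaining = after
    ? sorted.filter(
        (r) =>
          r.created_at < after.created_at ||
          (r.created_at === after.created_at && r.key < after.key),
      )
    : sorted;
  const files = remaining.slice(0, limit);
  const nextCursor = files.length === limit ? encodeCursor(files[files.length - 1]) : null;
  return { files, nextCursor };
}

// Three rows share a timestamp; a created_at-only cursor would lose "a" after page one.
const rows: Row[] = [
  { key: 'a', created_at: 1000 },
  { key: 'b', created_at: 1000 },
  { key: 'c', created_at: 1000 },
  { key: 'd', created_at: 900 },
];

const seen: string[] = [];
let cursor: string | undefined;
do {
  const page = listPage(rows, 2, cursor);
  seen.push(...page.files.map((f) => f.key));
  cursor = page.nextCursor ?? undefined;
} while (cursor);

console.log(seen.join(',')); // "c,b,a,d"
```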


What Docker Teaches Us

The DupHunter study of Docker Hub is the most detailed public study of content duplication in a large-scale content store. Key findings from the 2019 dataset of 3.18 million Docker Hub images:

Our pipeline’s duplication profile differs by asset class:

| Asset type            | Expected duplication | Why                                        |
|-----------------------|----------------------|--------------------------------------------|
| Generated video clips | Low (~5%)            | Unique prompts, unique seeds               |
| TTS audio             | High (40–80%)        | Same phrases, same voices, same parameters |
| Generated images      | Medium (20–40%)      | Prompt variation but seeded reruns         |
| Stock footage         | Very high (60–90%)   | Same clips reused across projects          |
| Screenshots           | Very low (<1%)       | Unique content every time                  |

TTS is the highest-leverage target. A pipeline that generates speech for a set of product categories, runs nightly, and synthesizes the same category names in the same voice will regenerate identical audio files unless there is a content-addressed cache layer. The first generation is unavoidable. Every subsequent generation is pure waste.
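
A back-of-the-envelope version of that argument. The per-character price is the figure quoted earlier in this article; the workload (500 category names of about 20 characters, 30 nightly runs) is hypothetical, and `ttsCost` is an illustrative helper. The model assumes a cache entry is regenerated once per TTL window, which matches a TTL that is not refreshed on read.

```typescript
const PRICE_PER_CHAR = 0.0003; // USD per character, as quoted earlier in the article

// cacheTtlDays === null means no cache: every nightly run regenerates every phrase.
function ttsCost(
  phrases: number,
  avgChars: number,
  nightlyRuns: number,
  cacheTtlDays: number | null,
): number {
  const generationsPerPhrase =
    cacheTtlDays === null ? nightlyRuns : Math.ceil(nightlyRuns / cacheTtlDays);
  return phrases * generationsPerPhrase * avgChars * PRICE_PER_CHAR;
}

const uncached = ttsCost(500, 20, 30, null); // 15,000 generations
const weeklyTtl = ttsCost(500, 20, 30, 7);   // 2,500 generations (5 per phrase)

console.log(uncached.toFixed(2));  // "90.00"
console.log(weeklyTtl.toFixed(2)); // "15.00"
```

Under these assumptions a 7-day TTL removes five sixths of the spend, and a longer TTL would approach the one-generation-per-phrase floor.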


What Not to Build

Media Store is dumb storage with smart indexing. Clear scope boundaries prevent it from becoming a platform that tries to own everything.

No image transformation. Resizing, cropping, format conversion — that’s imgix or Cloudflare Images or a capability. Media Store stores the original bytes. Transformation is a read-time concern, not a storage-time concern.

No transcoding. Converting MP4 to WebM, H.264 to AV1 — that’s a capability. The transcoded output gets its own key in Media Store. Media Store does not know or care that two keys are related by transcoding.

No CDN configuration. R2 serves bytes. The Cache API caches at the edge. Content-addressed keys are immutable, so cache headers are trivially set to max-age forever. There is no CDN to configure.

No access control beyond bucket-level auth. Media Store uses a shared API key. Per-file ACLs are a product feature, not a storage feature. If you need per-user access control, build a proxy that validates auth before calling Media Store — don’t pollute the storage layer with auth logic.

No content understanding. Media Store does not know that a file is a face, a contract, or a stock footage clip. Tags are caller-provided. AI annotation is a capability that writes tags back to Media Store via POST /v1/files/:key/tag.

The value of these constraints is operational: a service with a clear scope has a clear failure mode. When Media Store is down, you know exactly which operations are broken and what is unaffected. When Media Store is up, every capability that touches files works. The blast radius is bounded.


Prior Art


Summary

| Decision             | Rationale                                                       |
|----------------------|-----------------------------------------------------------------|
| SHA-256 as key       | Deduplication is free; idempotency is structural                |
| KV on hot path       | Sub-millisecond existence checks; metadata without SQL          |
| D1 on cold path      | SQL needed for search and filter; latency is acceptable         |
| R2 for bytes         | Zero egress cost; immutable keys enable Cache-Control: immutable |
| Keys flow, not bytes | Pipeline stages are decoupled and resumable                     |
| Cache layer on KV    | Capability caching at SHA-256(capability + inputs)              |
| Retry with backoff   | R2 silent write failures are real; validate return values       |
| No transformation    | Scope discipline prevents creep; capabilities do content work   |

The pattern is not novel — Git has been doing it since 2005, Docker since 2013, npm’s cacache since 2016. What is novel is applying it deliberately as a pipeline primitive on a serverless stack, where the KV+D1+R2 split maps cleanly to hot path, cold path, and bytes.

Build the store once. Every capability gets deduplication, caching, and search for free.


Written 2026-03-27. References: npm/cacache, Zhao et al. “DupHunter: Flexible High-Performance Deduplication for Docker Registries” USENIX ATC 2020, cloudflare/serverless-registry, garywu/_readme articles/metadata-economics-personal-storage. Cloudflare R2 incident history from Cloudflare status page. Pricing current as of March 2026.

