Video Production Techniques

Org Status: 🟡 Dormant Cloudflare: N/A Last Audited: 2026-04-28

How to bridge the gap between “AI slideshow” and viral-quality video — using FFmpeg, cloud rendering APIs, and the exact techniques human editors use to hold attention. Every command is real. Every statistic is sourced.

What you’ll learn:

The 6 production levels from amateur to professional, with exact FFmpeg commands for each
4 secrets human editors use that most AI pipelines miss entirely
Complete FFmpeg filter reference: Ken Burns, transitions, color grading, text overlays, audio mixing, speed effects
How to build multi-track compositions that create visual depth
Cloud API comparison: Shotstack vs Creatomate vs Remotion vs Plainly vs JSON2Video
Hook archetypes that capture viewers in the first 3 seconds
Stock resource strategy: what to download, what to API, what to skip
Architecture decisions for rendering at $0.01/video vs $0.30/minute

The Problem: Why AI Video Looks Amateur
The Retention Science
Core Concepts
The 6 Production Levels
The 4 Human Editor Secrets
FFmpeg Effects Reference
Hook Archetypes
Patterns: Production-Quality Compositions
Small Examples
Cloud API Comparison
Stock Resource Strategy
Architecture Decisions
Anti-Patterns
References

Most programmatic video pipelines produce content that viewers scroll past in under a second. The symptoms are obvious:

Static visuals — stock images sit motionless for 5-10 seconds while TTS drones
Hard cuts — jarring transitions between clips with no visual flow
Flat composition — single-layer video with text slapped on top
No audio design — TTS at 100% volume, no music, no sound effects
No pacing — every segment is the same length regardless of content

The result is what the industry calls a “slideshow video” — technically a video file, functionally a PowerPoint with narration. Viewers detect this in under 3 seconds and swipe away.

The gap between slideshow and viral is not talent or expensive software. It is a specific set of techniques that human editors apply instinctively and AI pipelines skip entirely. These techniques are all implementable with FFmpeg filters, and they compound — each one adds 10-30% retention lift, and together they transform amateur content into content that algorithms promote.

What changes if you get this right:

Metric	Slideshow (Level 1)	Production (Level 4+)
3-second retention	30-40%	70-85%
Average watch time	15-25% of duration	45-65% of duration
Algorithmic promotion	Minimal	Active distribution
RPM (finance niche)	$2-5	$9-21
Viewer perception	“AI generated”	“Professional content”
Cost per video (local)	$0.01	$0.01-0.03

The cost difference is negligible. The quality difference is everything. Every technique in this article adds production value without adding meaningful cost when you render locally with FFmpeg.

Before diving into techniques, understand why they work. Video retention is governed by neuroscience, not aesthetics.

The 3-Second Window

71% of viewers decide in the first 3 seconds whether to keep watching. TikTok videos that maintain 70-85% retention in the first 3 seconds receive 2.2x more total views than videos with lower retention. Videos exceeding 85% first-3-second retention achieve viral potential. Content below 60% gets minimal algorithmic promotion.

interface RetentionWindow {
  /** Seconds 0-3: The hook. Visual must change, text must appear, audio must hit. */
  hook: { durationMs: 3000; targetRetention: 0.85; visualChanges: number }; // minimum 2

  /** Seconds 3-10: The promise. Viewer decides if the payoff is worth waiting for. */
  promise: { durationMs: 7000; targetRetention: 0.70; paceSecondsPerVisual: 3 };

  /** Seconds 10-30: The delivery. Content must match or exceed the hook's promise. */
  delivery: { durationMs: 20000; targetRetention: 0.55; paceSecondsPerVisual: 3 };

  /** Seconds 30+: The payoff. Reward the viewer for staying. */
  payoff: { targetRetention: 0.40; includesCTA: boolean };
}

Why Movement Matters

The human visual system is wired to track movement. A static image on screen triggers the brain’s “nothing is happening” response and attention drops. Even subtle movement — a 2% zoom over 5 seconds — keeps the visual cortex engaged.

Key insight: Static visuals are not “neutral” — they are actively harmful to retention. Every frame must move. This is the single highest-impact change you can make to any video pipeline.

Why Multi-Layer Composition Works

Professional video uses depth — multiple visual planes stacked to create a sense of space. Multi-plane B-roll compositions increase retention by 31% compared to single-layer video. The brain interprets layered visuals as “richer” content worth paying attention to.

Why Sound Design Matters

78% of social media video is watched on mute. This means captions are mandatory, not optional. But for the 22% who listen, sound design — background music, sound effects, audio ducking — dramatically increases perceived production quality. The combination of visual captions AND sound design covers both audiences.

Optimal Duration

The highest engagement for short-form video occurs in the 60-90 second range. For long-form YouTube, 8-15 minutes is the sweet spot for finance/education niches, with $9-21 RPM.

Concept 1: The Render Pipeline

Every programmatic video follows the same pipeline, regardless of whether you use FFmpeg locally or a cloud API.

interface RenderPipeline {
  /** Step 1: Script generation — what the video says */
  script: {
    hook: string;           // First 3 seconds
    segments: Segment[];    // Each segment = one visual scene
    cta: string;            // Call to action
  };

  /** Step 2: Asset collection — what the video shows */
  assets: {
    stockVideo: StockClip[];     // B-roll footage from Pexels/Pixabay
    stockImages: StockImage[];   // For Ken Burns treatment
    music: AudioTrack;           // Background music from Pixabay Music
    sfx: SoundEffect[];          // Whoosh, bass drop, etc.
    voiceover: AudioTrack;       // TTS from edge-tts or ElevenLabs
  };

  /** Step 3: Composition — how it all fits together */
  composition: {
    tracks: Track[];             // Layered video tracks (background, subject, text)
    transitions: Transition[];   // Between clips
    colorGrade: ColorProfile;    // Overall mood
    effects: Effect[];           // Grain, vignette, etc.
  };

  /** Step 4: Render — produce the final file */
  output: {
    format: 'mp4';
    resolution: '1080x1920' | '1920x1080';  // Vertical or horizontal
    fps: 30;
    codec: 'libx264' | 'libx265';
    audioBitrate: '192k';
  };
}

Concept 2: The Visual Change Cadence

The “3-second rule” is the most important pacing concept in viral video. The visual on screen must change every 3 seconds. Not every 10 seconds. Not every 5 seconds. Every 3 seconds.

interface VisualCadence {
  /** Maximum seconds any single visual can stay on screen */
  maxVisualDuration: 3;

  /** For a 60-second video, you need at minimum 20 distinct visuals */
  visualsPerMinute: 20;

  /** Each script segment (sentence) should map to 2-3 visuals, not 1 */
  visualsPerSegment: 2 | 3;

  /** Never repeat the same visual in a video */
  allowRepeat: false;

  /** Movement type must vary — never two consecutive clips with same Ken Burns */
  movementVariation: true;
}

Key insight: Divide each script segment duration by 3. That’s how many distinct visuals you need for that segment. A 9-second sentence needs 3 different visuals, each with its own Ken Burns movement, with crossfade transitions between them.

Concept 3: The Audio Stack

Professional video has 3-4 audio layers, not 1.

interface AudioStack {
  /** Layer 1: Voice — the primary content, always at 100% */
  voice: {
    source: 'edge-tts' | 'elevenlabs' | 'recorded';
    volume: 1.0;
    processing: 'normalize' | 'compress';
  };

  /** Layer 2: Music — sets mood, always ducked under voice */
  music: {
    source: 'stock';  // Pixabay Music, Mixkit
    volume: 0.15;     // 15% when voice is active
    ducking: {
      method: 'sidechaincompress';
      threshold: 0.02;
      ratio: 8;
      attackMs: 200;
      releaseMs: 1000;
    };
  };

  /** Layer 3: SFX — punctuate transitions and key moments */
  sfx: {
    onTransition: 'whoosh';        // Every scene change
    onHookReveal: 'bass_drop';     // The first number/stat
    onMoney: 'cash_register';      // Dollar amounts
    onData: 'keyboard_typing';     // Data reveals
    onCTA: 'success_chime';        // End call-to-action
    volume: 0.5;
  };

  /** Layer 4: Ambience (optional) — subtle background texture */
  ambience?: {
    source: 'tension_drone' | 'room_tone';
    volume: 0.05;
  };
}
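
As a rough sketch of how the ducking settings above could become an FFmpeg audio graph, the hypothetical helper below builds a `-filter_complex` string. It assumes input 0 carries the voice and input 1 the music. `sidechaincompress` processes its first input using the second as the key signal, so the music goes first and a split copy of the voice drives the compressor. The helper name and stream labels are illustrative, not part of any library.

```typescript
// Ducking settings, mirroring AudioStack.music.ducking above.
interface DuckingConfig {
  threshold: number;
  ratio: number;
  attackMs: number;
  releaseMs: number;
}

// Hypothetical helper: assumes input 0 = voice, input 1 = music.
function buildDuckingGraph(musicVolume: number, d: DuckingConfig): string {
  return [
    // Split the voice: one copy is heard, the other only keys the compressor.
    '[0:a]asplit=2[voice][key]',
    `[1:a]volume=${musicVolume}[music]`,
    // The music is the compressed signal; the voice copy is the sidechain.
    `[music][key]sidechaincompress=threshold=${d.threshold}:ratio=${d.ratio}` +
      `:attack=${d.attackMs}:release=${d.releaseMs}[ducked]`,
    // Mix the untouched voice back over the ducked music.
    '[voice][ducked]amix=inputs=2:duration=first[aout]',
  ].join(';');
}

const duckingGraph = buildDuckingGraph(0.15, {
  threshold: 0.02,
  ratio: 8,
  attackMs: 200,
  releaseMs: 1000,
});
console.log(duckingGraph);
```

The resulting string can be passed to `-filter_complex` and the `[aout]` label mapped as the output audio.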

Concept 4: The Color Grade

Color grading is the difference between “footage” and “cinema.” A single FFmpeg filter chain transforms generic stock footage into a cohesive visual identity.

interface ColorProfile {
  name: string;
  /** FFmpeg eq filter values */
  brightness: number;    // -1.0 to 1.0
  contrast: number;      // 0.0 to 2.0
  saturation: number;    // 0.0 to 3.0
  /** Additional filters */
  grain: boolean;        // noise filter for texture
  vignette: boolean;     // dark edges for focus
  colorBalance?: {       // Teal/orange, cold, warm
    shadowsRed: number;
    shadowsGreen: number;
    shadowsBlue: number;
    highlightsRed: number;
    highlightsGreen: number;
    highlightsBlue: number;
  };
}

const PROFILES: Record<string, ColorProfile> = {
  darkCinematic: {
    name: 'Dark Cinematic',
    brightness: -0.1,
    contrast: 1.3,
    saturation: 0.7,
    grain: true,
    vignette: true,
  },
  tealOrange: {
    name: 'Teal & Orange (Hollywood)',
    brightness: 0,
    contrast: 1.1,
    saturation: 1.2,
    grain: false,
    vignette: true,
    colorBalance: {
      shadowsRed: 0.1, shadowsGreen: -0.1, shadowsBlue: -0.2,
      highlightsRed: -0.1, highlightsGreen: 0.05, highlightsBlue: 0.15,
    },
  },
  coldClinical: {
    name: 'Cold/Clinical (Tech)',
    brightness: 0,
    contrast: 1.2,
    saturation: 0.6,
    grain: false,
    vignette: false,
    colorBalance: {
      shadowsRed: -0.15, shadowsGreen: 0, shadowsBlue: 0.15,
      highlightsRed: -0.1, highlightsGreen: 0, highlightsBlue: 0.1,
    },
  },
};
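
One way to turn a profile into an FFmpeg filter chain is sketched below. The `toFilterChain` helper is hypothetical, and the interface is redeclared only so the snippet compiles on its own. The grain and vignette strengths are assumptions borrowed from the commands used elsewhere in this article (`noise=alls=15:allf=t+u`, `vignette=PI/4:1.2`).

```typescript
// Redeclared from the ColorProfile interface above so this sketch is self-contained.
interface ColorProfile {
  name: string;
  brightness: number;
  contrast: number;
  saturation: number;
  grain: boolean;
  vignette: boolean;
  colorBalance?: {
    shadowsRed: number; shadowsGreen: number; shadowsBlue: number;
    highlightsRed: number; highlightsGreen: number; highlightsBlue: number;
  };
}

// Hypothetical helper: builds a string suitable for -vf from a profile.
function toFilterChain(p: ColorProfile): string {
  const filters: string[] = [];
  if (p.colorBalance) {
    const c = p.colorBalance;
    filters.push(
      `colorbalance=rs=${c.shadowsRed}:gs=${c.shadowsGreen}:bs=${c.shadowsBlue}` +
        `:rh=${c.highlightsRed}:gh=${c.highlightsGreen}:bh=${c.highlightsBlue}`,
    );
  }
  filters.push(
    `eq=brightness=${p.brightness}:contrast=${p.contrast}:saturation=${p.saturation}`,
  );
  if (p.grain) filters.push('noise=alls=15:allf=t+u'); // assumed grain strength
  if (p.vignette) filters.push('vignette=PI/4:1.2');   // assumed vignette shape
  return filters.join(',');
}

const darkCinematic: ColorProfile = {
  name: 'Dark Cinematic',
  brightness: -0.1,
  contrast: 1.3,
  saturation: 0.7,
  grain: true,
  vignette: true,
};
console.log(toFilterChain(darkCinematic));
```

Keeping the profile as data and generating the chain means every clip in a video receives the same grade, which is what makes mixed stock footage look like one production.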

Concept 5: The Track System

Professional video is composed in tracks, like audio mixing. Each track is a visual layer.

interface TrackSystem {
  /** Track 1 (bottom): Full-screen background — stock footage with Ken Burns */
  background: {
    content: 'stock_video' | 'stock_image_with_zoompan';
    movement: KenBurnsEffect;
    colorGrade: ColorProfile;
  };

  /** Track 2 (middle): Subject matter — the thing you're talking about */
  subject: {
    content: 'screenshot' | 'chart' | 'product_image' | 'person';
    scale: 0.8;          // 80% of frame size
    opacity: 0.9;        // Slightly transparent
    position: 'center';
    shadow: boolean;     // Drop shadow for depth
  };

  /** Track 3 (top): Text and data overlays */
  text: {
    content: 'caption' | 'statistic' | 'title' | 'counter';
    animation: 'slide_up' | 'fade_in' | 'typewriter' | 'scale_pop';
    font: 'Inter-Black' | 'Montserrat-Bold';
    position: 'bottom_third' | 'center' | 'top_bar';
  };
}

Key insight: The 3-track system creates perceived depth. Track 1 moves, Track 2 is semi-transparent with a shadow, Track 3 animates text. The viewer’s brain interprets this as a 3D space, which registers as “professional production” even when every asset is stock footage composited with FFmpeg.

Each level builds on the previous one. The cost column assumes local FFmpeg rendering on consumer hardware.

Level 1 — Basic (The Slideshow)

What it looks like: Static clips + TTS + captions + hard cuts.

What’s wrong: Everything is static. The viewer’s brain sees “nothing is happening” and swipes away. No movement, no transitions, no audio design. This is where most AI video pipelines stop.

ffmpeg -i clip1.mp4 -i clip2.mp4 -i clip3.mp4 -i voiceover.mp3 \
  -filter_complex "[0:v][1:v][2:v]concat=n=3:v=1:a=0[v]" \
  -map "[v]" -map 3:a \
  -c:v libx264 -c:a aac -shortest output.mp4

Attribute	Value
Retention impact	Baseline
Viewer perception	“AI slideshow”
Implementation time	1 hour
Cost per video	$0.01

Level 2 — Movement (Ken Burns + Transitions)

What changes: Every clip moves. Transitions flow. The video feels alive.

This is the single biggest quality jump — going from static to moving. Apply Ken Burns to every visual and crossfade between clips.

ffmpeg -loop 1 -i image1.jpg -vf \
  "zoompan=z='min(zoom+0.002,1.2)':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d=150:s=1080x1920:fps=30" \
  -t 5 -c:v libx264 -pix_fmt yuv420p clip1_kb.mp4

ffmpeg -loop 1 -i image2.jpg -vf \
  "zoompan=z='if(eq(on,1),1.3,max(zoom-0.002,1.0))':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d=150:s=1080x1920:fps=30" \
  -t 5 -c:v libx264 -pix_fmt yuv420p clip2_kb.mp4

ffmpeg -i clip1_kb.mp4 -i clip2_kb.mp4 \
  -filter_complex "xfade=transition=fade:duration=1:offset=4" \
  -c:v libx264 -pix_fmt yuv420p output.mp4

Attribute	Value
Retention impact	+25-35% watch time
Viewer perception	“Looks like a real video”
Implementation time	4 hours
Cost per video	$0.01

Level 3 — Layering (3-Track Composition)

What changes: Visual depth. Background moves, subject floats with shadow, text overlays add data.

ffmpeg -i background.mp4 -i subject.png -i voice.mp3 \
  -filter_complex "
    [0:v]zoompan=z='min(max(zoom,pzoom)+0.001,1.15)':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d=1:s=1080x1920:fps=30,
    eq=brightness=-0.1:contrast=1.3:saturation=0.7,
    noise=alls=15:allf=t+u,
    vignette=PI/4:1.2[bg];
    [1:v]scale=864:-1,format=rgba,colorchannelmixer=aa=0.9[fg];
    [bg][fg]overlay=(W-w)/2:(H-h)/2[comp];
    [comp]drawtext=text='\$3,200 SAVED':fontfile=Inter-Black.ttf:fontsize=100:
    fontcolor=white:borderw=4:bordercolor=black:
    x=(w-text_w)/2:y=h*0.75:enable='between(t,3,6)'[out]
  " \
  -map "[out]" -map 2:a \
  -c:v libx264 -c:a aac -t 10 output.mp4

Attribute	Value
Retention impact	+31% over single-layer
Viewer perception	“Professional production”
Implementation time	8 hours
Cost per video	$0.01-0.02

Level 4 — Sound Design (Music + SFX + Ducking)

What changes: Audio becomes an experience. Background music sets mood, sound effects punctuate moments, voice ducks the music automatically.

ffmpeg -i composed_video.mp4 -i music.mp3 -i whoosh.mp3 -i bass_drop.mp3 \
  -filter_complex "
    [0:a]asplit=2[voice][key];
    [1:a]volume=0.15[bg_music];
    [2:a]adelay=4000|4000,volume=0.5[whoosh];
    [3:a]adelay=1000|1000,volume=0.7[bass];
    [whoosh][bass]amix=inputs=2[sfx];
    [bg_music][sfx]amix=inputs=2[music_sfx];
    [music_sfx][key]sidechaincompress=threshold=0.02:ratio=8:attack=200:release=1000[ducked];
    [voice][ducked]amix=inputs=2:duration=first[final_audio]
  " \
  -map 0:v -map "[final_audio]" \
  -c:v copy -c:a aac output.mp4

Attribute	Value
Retention impact	+15-20% perceived quality
Viewer perception	“This has a production team”
Implementation time	12 hours
Cost per video	$0.01-0.02

Level 5 — Typography (Kinetic Text + Counters)

What changes: Text animates. Numbers count up. Statistics slide in. Kinetic typography improves retention by 25-50%.

ffmpeg -i composed_video.mp4 -vf "
  drawtext=text='\$%{eif\\:min(floor((t-2)*1600)\\,3200)\\:d}':
  fontfile=Inter-Black.ttf:fontsize=120:fontcolor=white:
  borderw=4:bordercolor=black:
  x=(w-text_w)/2:y=h*0.4:
  enable='between(t,2,5)',

  drawtext=text='ANNUAL SAVINGS':
  fontfile=Inter-Bold.ttf:fontsize=48:fontcolor=white@0.8:
  x=(w-text_w)/2:y=h*0.4+130:
  enable='between(t,2.5,5)'
" -c:v libx264 -c:a copy output.mp4

Attribute	Value
Retention impact	+25-50% engagement
Viewer perception	“Motion graphics quality”
Implementation time	16 hours
Cost per video	$0.02-0.03

Level 6 — Professional (Speed Ramping + Match Cuts + Nano-Hooks)

What changes: Pacing becomes aggressive. Speed ramping on dramatic moments. Visual changes every 1.5 seconds on hooks. Multiple clips per sentence with match-cut transitions.

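# Seconds 3-5 play at half speed; later frames are shifted by the 2 extra seconds
# so output timestamps never run backwards. Audio is passed through untouched,
# so it drifts from the video after the ramp and needs separate retiming.
ffmpeg -i clip.mp4 -filter_complex "
  [0:v]setpts='
    if(lt(T,3), PTS,
    if(lt(T,5), PTS+(T-3)/TB,
    PTS+2/TB))'[v];
  [0:a]atempo=1.0[a]
" -map "[v]" -map "[a]" -c:v libx264 output.mp4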

Attribute	Value
Retention impact	Approaches human-edited quality
Viewer perception	“Can’t tell this is AI”
Implementation time	24+ hours
Cost per video	$0.02-0.05

Key insight: 73% of viewers cannot distinguish AI-assisted video from traditionally produced video when Level 4+ techniques are applied. The quality ceiling for programmatic video is far higher than most pipelines reach.

Level Progression Summary

Level	Name	Key Addition	Retention Lift	Cumulative Effect
1	Basic	Clips + TTS	Baseline	“Slideshow”
2	Movement	Ken Burns + crossfades	+25-35%	“Looks like video”
3	Layering	3-track + overlays + grading	+31%	“Professional”
4	Sound	Music + SFX + ducking	+15-20%	“Has a production team”
5	Typography	Kinetic text + counters	+25-50%	“Motion graphics”
6	Professional	Speed ramps + nano-hooks	+10-15%	“Can’t tell it’s AI”

These come from studying million-dollar YouTube channels. They are the techniques that separate amateur from professional, and most AI pipelines implement none of them.

Secret 1: Aggressive Pacing and Silence Removal

Human editors obsessively remove silence. Any gap longer than 0.3 seconds between words gets cut. Word endings overlap with the next word’s beginning for a relentless pace.

The tool: whisper-timestamped — word-level timestamps from OpenAI Whisper with silence detection built in.

interface SilenceRemoval {
  /** Maximum silence duration before trimming */
  maxSilenceMs: 300;

  /** Overlap word boundaries for pace */
  overlapMs: 50;

  /** Preserve intentional pauses (before reveals) */
  preserveDramaticPause: boolean;

  /** whisper-timestamped output gives us word-level timestamps */
  timestampSource: 'whisper-timestamped';
}

// whisper-timestamped output format
interface WhisperWord {
  text: string;
  start: number;  // seconds
  end: number;    // seconds
  confidence: number;
}

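// A gap between two consecutive words
interface Silence {
  start: number;      // seconds
  end: number;        // seconds
  durationMs: number;
}

// Find silences longer than threshold
function findSilences(words: WhisperWord[], thresholdMs: number): Silence[] {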
  const silences: Silence[] = [];
  for (let i = 0; i < words.length - 1; i++) {
    const gap = (words[i + 1].start - words[i].end) * 1000;
    if (gap > thresholdMs) {
      silences.push({
        start: words[i].end,
        end: words[i + 1].start,
        durationMs: gap,
      });
    }
  }
  return silences;
}
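
Once word timestamps are available, the spans to keep and the matching `atrim`/`concat` graph can be generated instead of typed by hand. The sketch below is one possible approach; `keepSpans` and `trimFilter` are hypothetical helpers. It resets each segment's timestamps with `asetpts=PTS-STARTPTS`, because `concat` expects every segment to start at zero.

```typescript
interface TimedWord { start: number; end: number; } // seconds
interface Span { start: number; end: number; }      // seconds

// Merge consecutive words into spans, starting a new span at every long gap.
function keepSpans(words: TimedWord[], maxSilenceSec: number): Span[] {
  const spans: Span[] = [];
  for (const w of words) {
    const last = spans[spans.length - 1];
    if (last && w.start - last.end <= maxSilenceSec) {
      last.end = w.end; // short gap: extend the current span
    } else {
      spans.push({ start: w.start, end: w.end });
    }
  }
  return spans;
}

// Build an audio-only filter graph that keeps the spans and joins them.
// Assumes at least one span; concat with n=0 is not valid.
function trimFilter(spans: Span[]): string {
  const parts = spans.map(
    (s, i) => `[0:a]atrim=start=${s.start}:end=${s.end},asetpts=PTS-STARTPTS[s${i + 1}]`,
  );
  const labels = spans.map((_, i) => `[s${i + 1}]`).join('');
  return `${parts.join(';')};${labels}concat=n=${spans.length}:v=0:a=1[out]`;
}

const sampleWords: TimedWord[] = [
  { start: 0, end: 1 },
  { start: 1.1, end: 2.3 },
  { start: 2.8, end: 4 },
  { start: 4.2, end: 5.1 },
];
const trimGraph = trimFilter(keepSpans(sampleWords, 0.3));
console.log(trimGraph);
```

With a 0.3 second threshold, only the 0.5 second gap between the second and third words is cut in this sample.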

FFmpeg implementation — trim silences:

whisper_timestamped audio.mp3 --model small --language en --output_format json

ffmpeg -i audio.mp3 -filter_complex "
  [0:a]atrim=start=0:end=2.3,asetpts=PTS-STARTPTS[s1];
  [0:a]atrim=start=2.8:end=5.1,asetpts=PTS-STARTPTS[s2];
  [0:a]atrim=start=5.9:end=8.4,asetpts=PTS-STARTPTS[s3];
  [s1][s2][s3]concat=n=3:v=0:a=1[out]
" -map "[out]" trimmed_audio.mp3

Secret 2: Ken Burns Parallax (Movement is Mandatory)

No visual asset is EVER static. Every image gets a zoompan effect. Every video clip gets repositioned. The human editor treats stillness as a bug.

The library — 7 movement variants:

enum KenBurnsMovement {
  ZOOM_IN = 'zoom_in',           // Intimacy, focus
  ZOOM_OUT = 'zoom_out',         // Reveals context, scale
  PAN_LEFT = 'pan_left',         // Follows action
  PAN_RIGHT = 'pan_right',       // Reverse motion
  PAN_DOWN = 'pan_down',         // Gravity, weight, reveal
  PAN_UP = 'pan_up',             // Aspiration, growth
  DIAGONAL_DRIFT = 'diagonal',   // Subtle, cinematic
}

// Rule: Never two consecutive clips with the same movement
function selectMovement(previous: KenBurnsMovement | null): KenBurnsMovement {
  const movements = Object.values(KenBurnsMovement);
  let next: KenBurnsMovement;
  do {
    next = movements[Math.floor(Math.random() * movements.length)];
  } while (next === previous);
  return next;
}

Complete FFmpeg commands for all 7 variants (1080x1920 vertical, 5 seconds, 30fps):

ffmpeg -loop 1 -i image.jpg -vf \
  "zoompan=z='min(zoom+0.002,1.2)':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d=150:s=1080x1920:fps=30" \
  -t 5 -c:v libx264 -pix_fmt yuv420p zoom_in.mp4

ffmpeg -loop 1 -i image.jpg -vf \
  "zoompan=z='if(eq(on,1),1.3,max(zoom-0.002,1.0))':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d=150:s=1080x1920:fps=30" \
  -t 5 -c:v libx264 -pix_fmt yuv420p zoom_out.mp4

ffmpeg -loop 1 -i image.jpg -vf \
  "zoompan=z='1.15':x='if(eq(on,1),0,min(x+2,iw-iw/zoom))':y='ih/2-(ih/zoom/2)':d=150:s=1080x1920:fps=30" \
  -t 5 -c:v libx264 -pix_fmt yuv420p pan_right.mp4

ffmpeg -loop 1 -i image.jpg -vf \
  "zoompan=z='1.15':x='if(eq(on,1),iw-iw/zoom,max(x-2,0))':y='ih/2-(ih/zoom/2)':d=150:s=1080x1920:fps=30" \
  -t 5 -c:v libx264 -pix_fmt yuv420p pan_left.mp4

ffmpeg -loop 1 -i image.jpg -vf \
  "zoompan=z='1.15':x='iw/2-(iw/zoom/2)':y='if(eq(on,1),0,min(y+2,ih-ih/zoom))':d=150:s=1080x1920:fps=30" \
  -t 5 -c:v libx264 -pix_fmt yuv420p pan_down.mp4

ffmpeg -loop 1 -i image.jpg -vf \
  "zoompan=z='1.15':x='iw/2-(iw/zoom/2)':y='if(eq(on,1),ih-ih/zoom,max(y-2,0))':d=150:s=1080x1920:fps=30" \
  -t 5 -c:v libx264 -pix_fmt yuv420p pan_up.mp4

ffmpeg -loop 1 -i image.jpg -vf \
  "zoompan=z='1.2':x='if(eq(on,1),0,min(x+1.5,iw-iw/zoom))':y='if(eq(on,1),0,min(y+1,ih-ih/zoom))':d=150:s=1080x1920:fps=30" \
  -t 5 -c:v libx264 -pix_fmt yuv420p diagonal.mp4
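
The seven commands differ only in their `z`, `x`, and `y` expressions, so a pipeline can keep them in a lookup table and build the filter string on demand. The sketch below copies the expressions from the commands above; the `zoompanFilter` helper and its defaults (30 fps, 1080x1920) are assumptions made for illustration.

```typescript
type Movement =
  | 'zoom_in' | 'zoom_out'
  | 'pan_left' | 'pan_right'
  | 'pan_down' | 'pan_up'
  | 'diagonal';

// Expressions that keep the frame centered while the zoom changes.
const CENTER_X = 'iw/2-(iw/zoom/2)';
const CENTER_Y = 'ih/2-(ih/zoom/2)';

const EXPRESSIONS: Record<Movement, { z: string; x: string; y: string }> = {
  zoom_in: { z: 'min(zoom+0.002,1.2)', x: CENTER_X, y: CENTER_Y },
  zoom_out: { z: 'if(eq(on,1),1.3,max(zoom-0.002,1.0))', x: CENTER_X, y: CENTER_Y },
  pan_right: { z: '1.15', x: 'if(eq(on,1),0,min(x+2,iw-iw/zoom))', y: CENTER_Y },
  pan_left: { z: '1.15', x: 'if(eq(on,1),iw-iw/zoom,max(x-2,0))', y: CENTER_Y },
  pan_down: { z: '1.15', x: CENTER_X, y: 'if(eq(on,1),0,min(y+2,ih-ih/zoom))' },
  pan_up: { z: '1.15', x: CENTER_X, y: 'if(eq(on,1),ih-ih/zoom,max(y-2,0))' },
  diagonal: {
    z: '1.2',
    x: 'if(eq(on,1),0,min(x+1.5,iw-iw/zoom))',
    y: 'if(eq(on,1),0,min(y+1,ih-ih/zoom))',
  },
};

// Hypothetical helper: returns the -vf value for a still image.
function zoompanFilter(m: Movement, seconds: number, fps = 30, size = '1080x1920'): string {
  const { z, x, y } = EXPRESSIONS[m];
  const frames = Math.round(seconds * fps); // d is measured in output frames
  return `zoompan=z='${z}':x='${x}':y='${y}':d=${frames}:s=${size}:fps=${fps}`;
}

console.log(zoompanFilter('zoom_in', 5));
```

Pairing this table with `selectMovement` gives each clip a movement that differs from the one before it.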

Secret 3: B-Roll Layering (The Opacity Trick)

The 3-track composition creates perceived depth that single-layer video cannot match:

Track 3 (top):    Kinetic typography, captions, data overlays
Track 2 (middle): Subject at 80% scale, 90% opacity, drop shadow
Track 1 (bottom): Moving abstract background with color grade

FFmpeg multi-track composition:

# Track 1: background with Ken Burns + color grade + grain + vignette
# Track 2: subject at 80% scale, 90% opacity, composited over Track 1
# Track 3: text overlays (data reveal at timestamp)
# Filtergraphs have no comment syntax, so the notes stay outside the quoted string.
ffmpeg -i abstract_bg.mp4 -i subject.png \
  -filter_complex "
    [0:v]scale=1080:1920:force_original_aspect_ratio=increase,crop=1080:1920,
    zoompan=z='min(max(zoom,pzoom)+0.001,1.1)':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d=1:s=1080x1920:fps=30,
    eq=brightness=-0.1:contrast=1.3:saturation=0.7,
    noise=alls=15:allf=t+u,
    vignette=PI/4:1.2[bg];

    [1:v]scale=864:-1,format=rgba,colorchannelmixer=aa=0.9[subject];

    [bg][subject]overlay=(W-w)/2:(H-h)/2[comp];

    [comp]drawtext=text='\$3,200':fontfile=Inter-Black.ttf:fontsize=120:
    fontcolor=white:borderw=5:bordercolor=black@0.8:
    x=(w-text_w)/2:y=h*0.3:
    enable='between(t,3,7)',
    drawtext=text='ANNUAL TAX SAVINGS':fontfile=Inter-Bold.ttf:fontsize=40:
    fontcolor=white@0.8:x=(w-text_w)/2:y=h*0.3+140:
    enable='between(t,3.5,7)'[final]
  " \
  -map "[final]" -c:v libx264 -pix_fmt yuv420p -t 10 output.mp4

Secret 4: 3-Second Visual Hook Swap

The visual on screen must change every 3 seconds. For each script segment, divide the duration by 3, then fetch that many distinct visuals.

interface VisualSwapStrategy {
  segmentDurationSec: number;
  visualCount: number;          // Math.ceil(segmentDurationSec / 3)
  transitionDuration: 0.5;      // Half-second crossfade between visuals
  searchStrategy: 'varied';     // Each visual uses different search terms

  /** Example: "LLC saves you $3,200 per year on taxes" (9 seconds)
   *  Visual 1 (0-3s): Business office footage — zoom in
   *  Visual 2 (3-6s): Calculator/spreadsheet — pan right
   *  Visual 3 (6-9s): Money/savings imagery — zoom out
   */
}

function planVisuals(segment: ScriptSegment): VisualPlan[] {
  const count = Math.ceil(segment.durationSec / 3);
  const visualDuration = segment.durationSec / count;
  const plans: VisualPlan[] = [];

  // Build the plans in order so each one can see the previous movement
  for (let i = 0; i < count; i++) {
    plans.push({
      searchQuery: segment.visualSearchTerms[i],
      startTime: segment.startTime + i * visualDuration,
      duration: visualDuration,
      kenBurns: selectMovement(i > 0 ? plans[i - 1].kenBurns : null),
      transition: i > 0 ? 'crossfade' : 'none',
    });
  }
  return plans;
}

Complete reference for every visual and audio effect, with exact syntax.

Ken Burns (zoompan filter)

The zoompan filter accepts values for zoom between 1 and 10. Key parameters:

Parameter	Description	Default
`z`	Zoom expression (1.0 = no zoom)	1
`x`	Horizontal pan position	0
`y`	Vertical pan position	0
`d`	Duration in frames (fps * seconds)	90
`s`	Output size	1280x720
`fps`	Output frame rate	25

Zoom speed reference:

Effect	Zoom increment	Duration feel
Barely perceptible	+0.0005/frame	Very subtle, cinematic
Gentle	+0.001/frame	Natural, documentary
Standard	+0.002/frame	Noticeable, engaging
Aggressive	+0.004/frame	Dramatic, attention-grabbing
Fast	+0.008/frame	Action, urgency

Center-zoom formula explained:

x='iw/2-(iw/zoom/2)'  →  Centers horizontally as zoom changes
y='ih/2-(ih/zoom/2)'  →  Centers vertically as zoom changes
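
The zoom increment, the zoom cap, and the clip length are linked by simple arithmetic, which the sketch below makes explicit. It assumes zoom starts at 1.0 and grows by a fixed amount per output frame; both helper names are hypothetical. For example, the `min(zoom+0.002,1.2)` expression used earlier reaches its cap after about 100 frames, so in a 150-frame clip the zoom holds still for the last third.

```typescript
// Per-frame increment needed to reach targetZoom exactly at the end of the clip.
function zoomIncrement(targetZoom: number, seconds: number, fps: number): number {
  return (targetZoom - 1) / (seconds * fps);
}

// Number of frames before a given increment hits the zoom cap.
// The small epsilon guards against floating-point noise in the division.
function framesToReach(targetZoom: number, incrementPerFrame: number): number {
  return Math.ceil((targetZoom - 1) / incrementPerFrame - 1e-9);
}

// Increment for a 1.15x zoom spread over 5 seconds at 30 fps.
console.log(zoomIncrement(1.15, 5, 30).toFixed(4));

// Frames until +0.002/frame reaches the 1.2x cap.
console.log(framesToReach(1.2, 0.002));
```

If the movement should last the whole clip, derive the increment from the duration rather than picking it from the table.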

Transitions (xfade filter)

The xfade filter provides 40+ built-in transitions between two video streams.

Syntax:

ffmpeg -i clip1.mp4 -i clip2.mp4 \
  -filter_complex "xfade=transition=TRANSITION_NAME:duration=SECONDS:offset=SECONDS" \
  output.mp4

Complete transition list with use cases:

Transition	Visual Effect	Best For
`fade`	Crossfade directly between the two clips	Universal, safe default
`dissolve`	Random pixel dissolve	Soft, emotional moments
`wipeleft`	Wipe from right to left	Forward progress
`wiperight`	Wipe from left to right	Flashback, reverse
`wipeup`	Wipe from bottom to top	Aspiration, growth
`wipedown`	Wipe from top to bottom	Gravity, grounding
`slideleft`	Second clip slides in from right	Fast pace, lists
`slideright`	Second clip slides in from left	Fast pace
`slideup`	Second clip slides in from bottom	Reveals
`slidedown`	Second clip slides in from top	Drops, emphasis
`circlecrop`	Circle expanding from center	Focus, spotlight
`rectcrop`	Rectangle expanding from center	Data reveals
`distance`	Pixel distance blend	Abstract, artistic
`fadeblack`	Fade through black	Scene change, time jump
`fadewhite`	Fade through white	Dream, flashback
`radial`	Radial wipe	Clock-like, time-based
`smoothleft`	Smooth left transition	Professional, clean
`smoothright`	Smooth right transition	Professional
`smoothup`	Smooth upward transition	Growth narrative
`smoothdown`	Smooth downward transition	Grounding
`circleopen`	Circle opening out	Spotlight reveal
`circleclose`	Circle closing in	Focus, ending
`vertopen`	Vertical blinds opening	Data, corporate
`vertclose`	Vertical blinds closing	Closing sequence
`horzopen`	Horizontal blinds opening	Reveal
`horzclose`	Horizontal blinds closing	Closing
`diagtl`	Diagonal from top-left	Dynamic, energetic
`diagtr`	Diagonal from top-right	Variety
`diagbl`	Diagonal from bottom-left	Variety
`diagbr`	Diagonal from bottom-right	Variety
`hlslice`	Horizontal left slice	Glitch, tech
`hrslice`	Horizontal right slice	Glitch, tech
`vuslice`	Vertical up slice	Glitch, tech
`vdslice`	Vertical down slice	Glitch, tech
`pixelize`	Pixelation transition	Retro, gaming
`hblur`	Horizontal blur transition	Speed, motion
`fadegrays`	Fade through grayscale	Dramatic
`squeezeh`	Horizontal squeeze	Compression
`squeezev`	Vertical squeeze	Compression

Chaining multiple transitions (3+ clips):

ffmpeg -i clip1.mp4 -i clip2.mp4 -i clip3.mp4 -i clip4.mp4 \
  -filter_complex "
    [0:v][1:v]xfade=transition=fade:duration=0.5:offset=4.5[v01];
    [v01][2:v]xfade=transition=slideleft:duration=0.5:offset=9[v012];
    [v012][3:v]xfade=transition=circlecrop:duration=0.5:offset=13.5[vout]
  " \
  -map "[vout]" output.mp4

Key insight: The offset parameter is cumulative — it’s the timestamp in the OUTPUT where the transition starts, not the input. For a chain, each offset = previous_offset + clip_duration - transition_duration.
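
The offset rule can be checked with a few lines of arithmetic. The sketch below assumes every transition in the chain has the same duration; `xfadeOffsets` is a hypothetical helper, not part of FFmpeg.

```typescript
// Returns one xfade offset per transition, measured on the output timeline.
function xfadeOffsets(clipDurations: number[], transitionSec: number): number[] {
  const offsets: number[] = [];
  // Length of the stream built so far; starts as the first clip alone.
  let outputLength = clipDurations[0] ?? 0;
  for (let i = 1; i < clipDurations.length; i++) {
    // The transition starts transitionSec before the current output ends.
    const offset = outputLength - transitionSec;
    offsets.push(offset);
    // The next clip begins at the offset and overlaps during the transition.
    outputLength = offset + clipDurations[i];
  }
  return offsets;
}

// Four 5-second clips joined by 0.5-second transitions.
console.log(xfadeOffsets([5, 5, 5, 5], 0.5)); // [4.5, 9, 13.5]
```

These are the offsets used in the four-clip chain above. Clips of unequal length work the same way, because each offset depends only on the output length so far.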

Extended transitions with xfade-easing:

The xfade-easing project adds CSS easing functions and ported GLSL transitions to FFmpeg’s xfade filter, expanding from 40 to 200+ transitions.

Color Grading

Dark Cinematic (finance/business):

ffmpeg -i clip.mp4 -vf \
  "eq=brightness=-0.1:contrast=1.3:saturation=0.7,curves=preset=darker" \
  output.mp4

Teal and Orange (Hollywood):

ffmpeg -i clip.mp4 -vf \
  "colorbalance=rs=0.1:gs=-0.1:bs=-0.2:rh=-0.1:gh=0.05:bh=0.15" \
  output.mp4

Cold/Clinical (tech/business):

ffmpeg -i clip.mp4 -vf \
  "colorbalance=rs=-0.15:bs=0.15:rh=-0.1:bh=0.1,eq=contrast=1.2:saturation=0.6" \
  output.mp4

Black and White High Contrast:

ffmpeg -i clip.mp4 -vf \
  "hue=s=0,eq=contrast=1.5:brightness=0.05" \
  output.mp4

Warm Nostalgic (lifestyle):

ffmpeg -i clip.mp4 -vf \
  "colorbalance=rs=0.1:gs=0.05:bs=-0.1:rh=0.05:gh=0.02:bh=-0.05,eq=saturation=1.1:brightness=0.05" \
  output.mp4

Film Grain:

ffmpeg -i clip.mp4 -vf "noise=alls=15:allf=t+u" output.mp4

ffmpeg -i clip.mp4 -vf "noise=alls=30:allf=t+u" output.mp4

ffmpeg -i clip.mp4 -vf "noise=alls=20:allf=t" output.mp4

Vignette:

ffmpeg -i clip.mp4 -vf "vignette=PI/4:1.2" output.mp4

ffmpeg -i clip.mp4 -vf "vignette=PI/6:0.8" output.mp4

ffmpeg -i clip.mp4 -vf "vignette=PI/3:1.5" output.mp4

Complete color grade chain (apply all at once):

ffmpeg -i clip.mp4 -vf \
  "eq=brightness=-0.1:contrast=1.3:saturation=0.7,
   noise=alls=15:allf=t+u,
   vignette=PI/4:1.2" \
  -c:v libx264 -c:a copy output.mp4

Text Overlays (drawtext filter)

The drawtext filter renders text on video frames.

Static text with background box:

ffmpeg -i clip.mp4 -vf \
  "drawtext=text='90%% FAIL':
   fontfile=Inter-Black.ttf:fontsize=100:
   fontcolor=white:
   box=1:boxcolor=red@0.8:boxborderw=20:
   x=(w-text_w)/2:y=h*0.3:
   enable='between(t,1,4)'" \
  output.mp4

Text with border (no box):

ffmpeg -i clip.mp4 -vf \
  "drawtext=text='\$3,200':
   fontfile=Inter-Black.ttf:fontsize=120:
   fontcolor=white:
   borderw=4:bordercolor=black:
   x=(w-text_w)/2:y=(h-text_h)/2:
   enable='between(t,3,6)'" \
  output.mp4

Slide-up from bottom:

ffmpeg -i clip.mp4 -vf \
  "drawtext=text='ANNUAL SAVINGS':
   fontfile=Inter-Black.ttf:fontsize=80:fontcolor=white:
   borderw=3:bordercolor=black:
   x=(w-text_w)/2:
   y='if(between(t,3,3.5), h - (h - h*0.4)*(t-3)/0.5, h*0.4)':
   enable='between(t,3,7)'" \
  output.mp4

Counter animation (counting up to a number):

ffmpeg -i clip.mp4 -vf \
  "drawtext=text='\$%{eif\\:min(floor((t-2)*1600)\\,3200)\\:d}':
   fontfile=Inter-Black.ttf:fontsize=120:fontcolor=white:
   borderw=4:bordercolor=black:
   x=(w-text_w)/2:y=h*0.4:
   enable='between(t,2,5)'" \
  output.mp4

Countdown timer:

ffmpeg -i clip.mp4 -vf \
  "drawtext=text='%{eif\\:10-floor(t)\\:d}':
   fontfile=Inter-Black.ttf:fontsize=200:fontcolor=red:
   borderw=5:bordercolor=white:
   x=(w-text_w)/2:y=(h-text_h)/2:
   enable='between(t,0,10)'" \
  output.mp4

Fade-in text (opacity animation):

ffmpeg -i clip.mp4 -vf \
  "drawtext=text='THE LLC TRAP':
   fontfile=Inter-Black.ttf:fontsize=100:
   fontcolor=white:
   borderw=4:bordercolor=black:
   alpha='if(lt(t,2.5),(t-2)*2,1)':
   x=(w-text_w)/2:y=h*0.3:
   enable='between(t,2,5)'" \
  output.mp4

Multiple text overlays (stacked data):

ffmpeg -i clip.mp4 -vf "
  drawtext=text='LLC COST BREAKDOWN':fontfile=Inter-Black.ttf:fontsize=60:
    fontcolor=white:borderw=3:bordercolor=black:
    x=(w-text_w)/2:y=h*0.15:enable='between(t,1,8)',
  drawtext=text='State Fee\: \$100':fontfile=Inter-Bold.ttf:fontsize=48:
    fontcolor=white:x=(w-text_w)/2:y=h*0.30:enable='between(t,2,8)',
  drawtext=text='Agent\: \$120':fontfile=Inter-Bold.ttf:fontsize=48:
    fontcolor=white:x=(w-text_w)/2:y=h*0.38:enable='between(t,3,8)',
  drawtext=text='EIN\: FREE':fontfile=Inter-Bold.ttf:fontsize=48:
    fontcolor=green:x=(w-text_w)/2:y=h*0.46:enable='between(t,4,8)',
  drawtext=text='TOTAL\: \$220':fontfile=Inter-Black.ttf:fontsize=72:
    fontcolor=yellow:borderw=4:bordercolor=black:
    x=(w-text_w)/2:y=h*0.58:enable='between(t,5,8)'
" output.mp4

Audio Effects

Voice + music mixing (voice at 100%, music at 15%):

ffmpeg -i voice.mp3 -i music.mp3 \
  -filter_complex "[1:a]volume=0.15[bg];[0:a][bg]amix=inputs=2:duration=first" \
  output.mp3

Sidechain compression (duck music when voice plays):

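# sidechaincompress processes its FIRST input (music) using its SECOND (voice) as the trigger.
# Its output is the ducked music only, so the voice is split and mixed back in.
ffmpeg -i voice.mp3 -i music.mp3 \
  -filter_complex \
  "[0:a]asplit=2[voice][key];[1:a]volume=0.3[bg];[bg][key]sidechaincompress=threshold=0.02:ratio=8:attack=200:release=1000[ducked];[voice][ducked]amix=inputs=2:duration=first" \
  output.mp3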

Parameter	Value	Effect
`threshold`	0.02	Sensitivity — lower = more ducking
`ratio`	8	Compression amount — higher = more aggressive duck
`attack`	200ms	How fast music ducks when voice starts
`release`	1000ms	How fast music returns when voice stops
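
To get a feel for what the threshold and ratio rows mean in practice, the sketch below evaluates an idealized hard-knee compressor curve. It is an approximation for intuition only: FFmpeg's filter also applies a soft knee and attack/release smoothing, which are ignored here, and the function names are illustrative.

```typescript
// Idealized hard-knee compressor curve. FFmpeg's sidechaincompress adds a
// soft knee and attack/release smoothing that this sketch does not model.
function toDb(linear: number): number {
  return 20 * Math.log10(linear);
}

// keyLevel: linear amplitude (0..1) of the trigger signal, i.e. the voice.
// threshold: linear amplitude, as in threshold=0.02.
// Returns how many dB the music is turned down.
function gainReductionDb(keyLevel: number, threshold: number, ratio: number): number {
  const overDb = toDb(keyLevel) - toDb(threshold);
  if (overDb <= 0) return 0; // below threshold: no ducking
  return overDb * (1 - 1 / ratio);
}

console.log(toDb(0.02).toFixed(1)); // threshold expressed in dBFS
console.log(gainReductionDb(0.5, 0.02, 8).toFixed(1));
```

With threshold 0.02 and ratio 8, a voice peaking at half scale works out to roughly 24 dB of reduction on the music under this model, which is why the music nearly disappears while someone is speaking.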

Sound effect at specific timestamp:

ffmpeg -i main.mp4 -i whoosh.mp3 \
  -filter_complex "[1:a]adelay=3000|3000,volume=0.5[sfx];[0:a][sfx]amix=inputs=2:duration=first" \
  output.mp4

Multiple SFX at different timestamps:

ffmpeg -i main.mp4 -i bass_drop.mp3 -i whoosh.mp3 -i cash.mp3 \
  -filter_complex "
    [1:a]adelay=500|500,volume=0.7[sfx1];
    [2:a]adelay=4000|4000,volume=0.5[sfx2];
    [3:a]adelay=7000|7000,volume=0.4[sfx3];
    [sfx1][sfx2][sfx3]amix=inputs=3[all_sfx];
    [0:a][all_sfx]amix=inputs=2:duration=first[out]
  " -map 0:v -map "[out]" output.mp4
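
When the cue list comes from a script rather than being typed by hand, the same filtergraph can be generated. A minimal sketch, assuming each sound effect is a separate FFmpeg input and the main audio is input 0; `SfxCue` and `buildSfxMix` are hypothetical names.

```typescript
// Hypothetical cue shape: which FFmpeg input holds the effect, when it
// should start (milliseconds), and how loud it should be.
interface SfxCue {
  inputIndex: number;
  atMs: number;
  volume: number;
}

// Produces the same graph shape as the hand-written command above:
// delay each effect, mix the effects together, then mix with the main audio.
function buildSfxMix(cues: SfxCue[], mainAudio: string = "0:a"): string {
  if (cues.length === 0) throw new Error("at least one cue is required");
  const parts = cues.map(
    (c, i) => `[${c.inputIndex}:a]adelay=${c.atMs}|${c.atMs},volume=${c.volume}[sfx${i + 1}]`
  );
  const labels = cues.map((_, i) => `[sfx${i + 1}]`).join("");
  parts.push(`${labels}amix=inputs=${cues.length}[all_sfx]`);
  parts.push(`[${mainAudio}][all_sfx]amix=inputs=2:duration=first[out]`);
  return parts.join(";");
}

console.log(
  buildSfxMix([
    { inputIndex: 1, atMs: 500, volume: 0.7 },
    { inputIndex: 2, atMs: 4000, volume: 0.5 },
  ])
);
```

The returned string is meant to be passed as the `-filter_complex` argument, with `-map 0:v -map "[out]"` as in the command above.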

Audio normalization:

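# EBU R128 loudness normalization: -16 LUFS integrated, -1.5 dBTP true peak, loudness range 11
ffmpeg -i audio.mp3 -filter:a "loudnorm=I=-16:TP=-1.5:LRA=11" normalized.mp3

# Alternative: dynaudnorm evens out quiet and loud passages without targeting a loudness value
ffmpeg -i audio.mp3 -filter:a "dynaudnorm" normalized.mp3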

Speed Effects

Slow motion (0.5x):

ffmpeg -i clip.mp4 -vf "setpts=2.0*PTS" -af "atempo=0.5" slow.mp4

Fast motion (2x):

ffmpeg -i clip.mp4 -vf "setpts=0.5*PTS" -af "atempo=2.0" fast.mp4
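
The video and audio factors are reciprocals: `setpts` multiplies timestamps by 1/speed, while `atempo` takes the speed itself. `atempo` has a lower limit of 0.5 per instance, and older builds also cap it at 2.0, so larger changes are usually chained. A minimal sketch that derives both filter strings from one speed value; the function names are illustrative.

```typescript
// Splits a speed factor into atempo stages that each stay within 0.5..2.0,
// a range accepted by old and new FFmpeg builds alike.
function atempoChain(speed: number): string {
  if (!(speed > 0)) throw new RangeError("speed must be positive");
  const stages: number[] = [];
  let remaining = speed;
  while (remaining > 2.0) {
    stages.push(2.0);
    remaining /= 2.0;
  }
  while (remaining < 0.5) {
    stages.push(0.5);
    remaining /= 0.5;
  }
  stages.push(remaining);
  return stages.map((s) => `atempo=${s}`).join(",");
}

// Video timestamps scale by the reciprocal of the speed.
function speedFilters(speed: number): { video: string; audio: string } {
  return { video: `setpts=${1 / speed}*PTS`, audio: atempoChain(speed) };
}

console.log(speedFilters(0.25)); // quarter speed needs two atempo stages
console.log(speedFilters(4));
```

The `video` string goes to `-vf` and the `audio` string to `-af`, matching the two commands above.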

Speed ramp (normal to slow to normal):

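# Slows source seconds 3-5 to half speed. Frames after the window are shifted by the
# 2 extra seconds so timestamps never run backwards. Audio is dropped (-an).
ffmpeg -i clip.mp4 -an -vf \
  "setpts='if(lt(T,3),PTS,if(lt(T,5),PTS+(T-3)/TB,PTS+2/TB))'" \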
  speed_ramp.mp4
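
A speed ramp is a piecewise remapping of timestamps, and the mapping has to keep increasing through the slow window and after it, otherwise frames arrive out of order. The sketch below models that remapping so it can be checked before it is translated into a `setpts` expression; the function names and parameters are illustrative.

```typescript
// Maps a source timestamp (seconds) to its output timestamp when the window
// [rampStart, rampEnd) is slowed by `factor` (2 = half speed).
function rampedTime(t: number, rampStart: number, rampEnd: number, factor: number): number {
  if (t < rampStart) return t; // before the window: unchanged
  if (t < rampEnd) return rampStart + (t - rampStart) * factor; // inside: stretched
  const added = (rampEnd - rampStart) * (factor - 1); // extra time the window introduced
  return t + added; // after: shifted, normal speed
}

// True when output timestamps never decrease across the sampled range.
function isMonotonic(map: (t: number) => number, from: number, to: number, step: number): boolean {
  let prev = map(from);
  for (let t = from + step; t <= to; t += step) {
    const cur = map(t);
    if (cur < prev) return false;
    prev = cur;
  }
  return true;
}

const ramp = (t: number) => rampedTime(t, 3, 5, 2);
console.log(ramp(4), ramp(5), ramp(6));
console.log(isMonotonic(ramp, 0, 10, 0.1));
```

Multiplying timestamps only inside the window fails this monotonicity check, because the first frame after the window would land earlier than the last frame inside it.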

Time-lapse (4x speed):

ffmpeg -i clip.mp4 -vf "setpts=0.25*PTS" -an timelapse.mp4

Overlay and Picture-in-Picture

Basic overlay (logo/watermark):

ffmpeg -i main.mp4 -i logo.png \
  -filter_complex "[1:v]scale=100:-1[logo];[0:v][logo]overlay=W-w-20:20" \
  output.mp4

Picture-in-picture with border:

ffmpeg -i main.mp4 -i pip.mp4 \
  -filter_complex "
    [1:v]scale=320:180,pad=iw+8:ih+8:4:4:color=white[pip];
    [0:v][pip]overlay=W-w-20:H-h-20
  " output.mp4

Opacity blending:

ffmpeg -i background.mp4 -i overlay.png \
  -filter_complex "
    [1:v]format=rgba,colorchannelmixer=aa=0.5[fg];
    [0:v][fg]overlay=0:0
  " output.mp4

Drop shadow effect (simulate with offset dark copy):

ffmpeg -i background.mp4 -i subject.png \
  -filter_complex "
    [1:v]scale=800:-1[subj];
    [subj]split[shadow][main];
    [shadow]colorchannelmixer=rr=0:gg=0:bb=0:aa=0.5,
    boxblur=10:10[shadow_blur];
    [0:v][shadow_blur]overlay=(W-w)/2+5:(H-h)/2+5[with_shadow];
    [with_shadow][main]overlay=(W-w)/2:(H-h)/2
  " output.mp4

Hook Archetypes

The first 3 seconds decide whether a viewer stays or swipes. The 6 hook archetypes below are recurring patterns drawn from viral content analysis.

The 6 Archetypes

interface HookArchetype {
  name: string;
  pattern: string;
  psychology: string;
  examples: string[];
  visualTreatment: string;
}

const ARCHETYPES: HookArchetype[] = [
  {
    name: 'Fortuneteller',
    pattern: 'Teases a future outcome the viewer wants',
    psychology: 'Curiosity gap — viewer must watch to see if it applies to them',
    examples: [
      'How to double your savings in 2026',
      'Your LLC is about to cost you $3,200 less',
      'What your accountant won\'t tell you about Q4',
    ],
    visualTreatment: 'Crystal ball / chart trending up / calendar with circled date',
  },
  {
    name: 'Magician',
    pattern: 'Reveals a surprising condensation or transformation',
    psychology: 'Value perception — massive input compressed into digestible output',
    examples: [
      'I condensed 50 finance books into 60 seconds',
      'The entire tax code in one sentence',
      'I made 10 years of investing mistakes so you don\'t have to',
    ],
    visualTreatment: 'Stack of books → single page / time-lapse / before-after split',
  },
  {
    name: 'Contrarian',
    pattern: 'Challenges commonly accepted knowledge',
    psychology: 'Pattern interrupt — brain flags contradictions as important',
    examples: [
      'Stock market experts HATE this one simple rule',
      'Stop saving for retirement (here\'s why)',
      'The LLC advice everyone gives is dead wrong',
    ],
    visualTreatment: 'Red X over conventional wisdom / crossed-out text / head shake',
  },
  {
    name: 'Provocateur',
    pattern: 'Makes a controversial or emotionally charged statement',
    psychology: 'Emotional activation — anger/outrage increases engagement',
    examples: [
      'California is robbing you blind with this tax',
      'Your bank is stealing $400/year and you don\'t know it',
      'The IRS designed this system to keep you poor',
    ],
    visualTreatment: 'Red background / alarm / angry emoji / bold accusatory text',
  },
  {
    name: 'Statistician',
    pattern: 'Opens with a shocking number',
    psychology: 'Concrete specificity signals authority and triggers recall',
    examples: [
      '$1.4 billion evaporated in 48 hours',
      '93% of LLCs overpay by $2,100 per year',
      '1 in 4 Americans can\'t cover a $400 emergency',
    ],
    visualTreatment: 'Large number filling screen / counter animation / data visualization',
  },
  {
    name: 'Questioner',
    pattern: 'Poses an engaging question the viewer wants answered',
    psychology: 'Open loop — the brain seeks closure on unanswered questions',
    examples: [
      'How much do you REALLY need to quit your job?',
      'What would you do with an extra $3,200?',
      'Are you in the 93% who overpay on taxes?',
    ],
    visualTreatment: 'Question mark animation / thinking emoji / person looking puzzled',
  },
];

Hook Visual Treatment Matrix

Archetype	Text Style	Background	Animation	SFX
Fortuneteller	Gold/white, elegant	Dark, moody	Slow zoom in	Mystical tone
Magician	Bold white, large	Book/knowledge imagery	Scale pop	Whoosh
Contrarian	Red/white, aggressive	Crossed-out text	Shake effect	Record scratch
Provocateur	Red bold, ALL CAPS	Dark red gradient	Flash/pulse	Bass drop
Statistician	Yellow/white numbers	Dark with data viz	Counter animation	Cash register
Questioner	White italic	Person thinking	Typewriter	Question sound

Implementing Hooks in FFmpeg

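# The counter climbs from 0 to 1.4 billion over 2 seconds (0.5s to 2.5s), then holds.
# eif prints a plain integer; its optional third argument is a padding width, not a separator.
ffmpeg -i background.mp4 -i bass_drop.mp3 \
  -filter_complex "
    [0:v]eq=brightness=-0.15:contrast=1.4:saturation=0.6,
    vignette=PI/3:1.3[bg];

    [bg]drawtext=text='\$%{eif\\:min(floor((t-0.5)*700000000)\\,1400000000)\\:d}':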
    fontfile=Inter-Black.ttf:fontsize=80:fontcolor=yellow:
    borderw=4:bordercolor=black:
    x=(w-text_w)/2:y=h*0.35:
    enable='between(t,0.5,3)',

    drawtext=text='evaporated in 48 hours':
    fontfile=Inter-Bold.ttf:fontsize=50:fontcolor=white:
    x=(w-text_w)/2:y=h*0.35+100:
    enable='between(t,1.5,3)'[hooked];

    [1:a]adelay=500|500,volume=0.7[sfx];
    [hooked]null[vout]
  " \
  -map "[vout]" -map "[sfx]" \
  -c:v libx264 -c:a aac -t 3 hook.mp4

Patterns: Production-Quality Compositions

Pattern 1: The Data Story (Finance/Business)

A complete video that reveals a financial insight with progressive data disclosure.

When to use: Financial education, tax tips, business analysis, calculator demos.

#!/bin/bash

FONT_BLACK="Inter-Black.ttf"
FONT_BOLD="Inter-Bold.ttf"
FONT_REG="Inter-Regular.ttf"
RESOLUTION="1080x1920"
FPS=30

ffmpeg -loop 1 -i assets/office.jpg -vf \
  "zoompan=z='min(zoom+0.002,1.2)':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d=$((FPS*4)):s=${RESOLUTION}:fps=${FPS}" \
  -t 4 -c:v libx264 -pix_fmt yuv420p /tmp/scene1.mp4

ffmpeg -loop 1 -i assets/calculator.jpg -vf \
  "zoompan=z='1.15':x='if(eq(on,1),0,min(x+2,iw-iw/zoom))':y='ih/2-(ih/zoom/2)':d=$((FPS*4)):s=${RESOLUTION}:fps=${FPS}" \
  -t 4 -c:v libx264 -pix_fmt yuv420p /tmp/scene2.mp4

ffmpeg -loop 1 -i assets/money.jpg -vf \
  "zoompan=z='if(eq(on,1),1.3,max(zoom-0.002,1.0))':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d=$((FPS*4)):s=${RESOLUTION}:fps=${FPS}" \
  -t 4 -c:v libx264 -pix_fmt yuv420p /tmp/scene3.mp4

ffmpeg -loop 1 -i assets/document.jpg -vf \
  "zoompan=z='1.15':x='iw/2-(iw/zoom/2)':y='if(eq(on,1),0,min(y+1.5,ih-ih/zoom))':d=$((FPS*4)):s=${RESOLUTION}:fps=${FPS}" \
  -t 4 -c:v libx264 -pix_fmt yuv420p /tmp/scene4.mp4

ffmpeg -loop 1 -i assets/savings.jpg -vf \
  "zoompan=z='min(zoom+0.003,1.25)':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d=$((FPS*4)):s=${RESOLUTION}:fps=${FPS}" \
  -t 4 -c:v libx264 -pix_fmt yuv420p /tmp/scene5.mp4

ffmpeg -loop 1 -i assets/chart_up.jpg -vf \
  "zoompan=z='1.2':x='if(eq(on,1),0,min(x+1,iw-iw/zoom))':y='if(eq(on,1),0,min(y+0.7,ih-ih/zoom))':d=$((FPS*5)):s=${RESOLUTION}:fps=${FPS}" \
  -t 5 -c:v libx264 -pix_fmt yuv420p /tmp/scene6.mp4

ffmpeg -i /tmp/scene1.mp4 -i /tmp/scene2.mp4 -i /tmp/scene3.mp4 \
  -i /tmp/scene4.mp4 -i /tmp/scene5.mp4 -i /tmp/scene6.mp4 \
  -filter_complex "
    [0:v][1:v]xfade=transition=fade:duration=0.5:offset=3.5[v01];
    [v01][2:v]xfade=transition=slideleft:duration=0.5:offset=7[v012];
    [v012][3:v]xfade=transition=dissolve:duration=0.5:offset=10.5[v0123];
    [v0123][4:v]xfade=transition=smoothup:duration=0.5:offset=14[v01234];
    [v01234][5:v]xfade=transition=circlecrop:duration=0.5:offset=17.5[vchain]
  " -map "[vchain]" -c:v libx264 -pix_fmt yuv420p /tmp/chained.mp4

ffmpeg -i /tmp/chained.mp4 -vf "
  eq=brightness=-0.1:contrast=1.3:saturation=0.7,
  noise=alls=12:allf=t+u,
  vignette=PI/4:1.2,

  drawtext=expansion=none:text='93% of LLCs overpay':fontfile=${FONT_BLACK}:fontsize=80:
    fontcolor=white:borderw=4:bordercolor=black:
    x=(w-text_w)/2:y=h*0.35:enable='between(t,0,3)',

  drawtext=text='Here is what they miss':fontfile=${FONT_BOLD}:fontsize=50:
    fontcolor=white@0.8:x=(w-text_w)/2:y=h*0.35+100:
    enable='between(t,0.5,3)',

  drawtext=text='State Fee':fontfile=${FONT_BOLD}:fontsize=48:
    fontcolor=white:x=100:y=h*0.25:enable='between(t,4,10)',
  drawtext=text='\$100':fontfile=${FONT_BLACK}:fontsize=60:
    fontcolor=green:x=w-300:y=h*0.25:enable='between(t,4,10)',

  drawtext=text='Agent Fee':fontfile=${FONT_BOLD}:fontsize=48:
    fontcolor=white:x=100:y=h*0.33:enable='between(t,5,10)',
  drawtext=text='\$0':fontfile=${FONT_BLACK}:fontsize=60:
    fontcolor=green:x=w-300:y=h*0.33:enable='between(t,5,10)',

  drawtext=text='Tax Savings':fontfile=${FONT_BOLD}:fontsize=48:
    fontcolor=white:x=100:y=h*0.41:enable='between(t,6,10)',
  drawtext=text='-\$3,200':fontfile=${FONT_BLACK}:fontsize=60:
    fontcolor=yellow:x=w-350:y=h*0.41:enable='between(t,6,10)',

  drawtext=text='TOTAL SAVINGS':fontfile=${FONT_BLACK}:fontsize=70:
    fontcolor=yellow:borderw=4:bordercolor=black:
    x=(w-text_w)/2:y=h*0.55:enable='between(t,8,12)',

  drawtext=text='\$%{eif\\:min(floor((t-8)*1600)\\,3200)\\:d}':
    fontfile=${FONT_BLACK}:fontsize=140:fontcolor=white:
    borderw=5:bordercolor=black:
    x=(w-text_w)/2:y=h*0.63:enable='between(t,8,12)',

  drawtext=text='Link in bio':fontfile=${FONT_BOLD}:fontsize=40:
    fontcolor=white@0.7:x=(w-text_w)/2:y=h*0.85:
    enable='between(t,10,14)'
" -c:v libx264 -c:a copy /tmp/graded.mp4

# The voiceover is split: one copy is heard, the other only triggers the ducking.
# sidechaincompress processes its first input (music) using its second (voice).
# normalize=0 (FFmpeg 4.4 or later) stops amix from scaling every input down.
ffmpeg -i /tmp/graded.mp4 -i voiceover.mp3 -i music_dark_cinematic.mp3 \
  -i sfx/bass_drop.mp3 -i sfx/whoosh.mp3 -i sfx/cash_register.mp3 \
  -filter_complex "
    [1:a]asplit=2[voice][key];
    [2:a]volume=0.15,afade=t=in:st=0:d=2,afade=t=out:st=19:d=2[music];
    [music][key]sidechaincompress=threshold=0.02:ratio=8:attack=200:release=1000[ducked];
    [3:a]adelay=500|500,volume=0.7[bass];
    [4:a]adelay=3500|3500,volume=0.4[whoosh1];
    [4:a]adelay=7000|7000,volume=0.4[whoosh2];
    [5:a]adelay=8000|8000,volume=0.5[cash];
    [voice][ducked][bass][whoosh1][whoosh2][cash]amix=inputs=6:duration=first:normalize=0[final_audio]
  " \
  -map 0:v -map "[final_audio]" \
  -c:v copy -c:a aac -b:a 192k -shortest output.mp4

Gotchas:

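The offset in an xfade chain is measured from the start of the chained output, not from the start of the incoming clip; each offset equals the output length so far minus the transition duration
adelay values are in milliseconds, not seconds
sidechaincompress compresses its first input using the level of its second input, so the music goes first and the voice (the trigger) goes second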
enable='between(t,start,end)' uses seconds in the output timeline
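
The cumulative-offset rule is easy to get wrong by hand once clips have different lengths. A minimal sketch of the arithmetic, assuming every transition uses the same duration; `xfadePlan` is a hypothetical helper name.

```typescript
// For each join, the offset is the length of the chained output so far
// minus the transition duration. Also returns the final output length.
function xfadePlan(
  durations: number[],
  transition: number
): { offsets: number[]; total: number } {
  const offsets: number[] = [];
  let chained = 0; // running length of the chained output, in seconds
  durations.forEach((d, i) => {
    if (i === 0) {
      chained = d;
      return;
    }
    const offset = chained - transition;
    offsets.push(offset);
    chained = offset + d; // the incoming clip starts playing at `offset`
  });
  return { offsets, total: chained };
}

// Six scenes of 4, 4, 4, 4, 4 and 5 seconds with 0.5s transitions,
// as in the Data Story script above.
console.log(xfadePlan([4, 4, 4, 4, 4, 5], 0.5));
```

For the Data Story scenes this gives offsets of 3.5, 7, 10.5, 14 and 17.5 seconds and a 22.5-second result, matching the hand-written chain.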

Pattern 2: The Comparison Grid (Product/Tool)

Side-by-side comparison with animated data points.

When to use: Tool comparisons, product reviews, A-vs-B decisions.

#!/bin/bash

# Each clip is scaled to 540x960 and padded into a 540x1920 column, then the two
# columns are stacked side by side. Both clips sit at the top of their column so
# the cost figures land on the black area below. Labels and the VS divider show
# from 0s, the cost comparison from 3s, and the savings callout from 6s.
# Filtergraph strings have no comment syntax, so these notes live outside the string.
ffmpeg -i left_approach.mp4 -i right_approach.mp4 \
  -filter_complex "
    [0:v]scale=540:960,
    eq=brightness=-0.05:contrast=1.2:saturation=0.8,
    pad=540:1920:0:0:black[left];

    [1:v]scale=540:960,
    eq=brightness=-0.05:contrast=1.2:saturation=0.8,
    pad=540:1920:0:0:black[right];

    [left][right]hstack[grid];

    [grid]drawtext=text='TRADITIONAL':fontfile=Inter-Black.ttf:fontsize=40:
      fontcolor=red:x=270-text_w/2:y=50:enable='between(t,0,15)',
    drawtext=text='OPTIMIZED':fontfile=Inter-Black.ttf:fontsize=40:
      fontcolor=green:x=810-text_w/2:y=50:enable='between(t,0,15)',

    drawtext=text='VS':fontfile=Inter-Black.ttf:fontsize=60:
      fontcolor=yellow:box=1:boxcolor=black@0.8:boxborderw=15:
      x=(w-text_w)/2:y=h*0.48:enable='between(t,0,15)',

    drawtext=text='\$4,500/yr':fontfile=Inter-Black.ttf:fontsize=50:
      fontcolor=red:x=270-text_w/2:y=h*0.6:enable='between(t,3,15)',
    drawtext=text='\$1,300/yr':fontfile=Inter-Black.ttf:fontsize=50:
      fontcolor=green:x=810-text_w/2:y=h*0.6:enable='between(t,3,15)',

    drawtext=text='SAVE \$3,200':fontfile=Inter-Black.ttf:fontsize=70:
      fontcolor=yellow:borderw=4:bordercolor=black:
      x=(w-text_w)/2:y=h*0.75:enable='between(t,6,15)'
  " \
  -c:v libx264 -pix_fmt yuv420p output.mp4

Pattern 3: The Explainer (Educational/How-To)

Step-by-step tutorial with numbered progression.

When to use: How-to guides, process explanations, tutorial content.

#!/bin/bash

STEPS=("Open the calculator" "Enter your income" "Select LLC type" "Review deductions" "See your savings")
IMAGES=(calculator.jpg income.jpg llc_type.jpg deductions.jpg savings.jpg)
MOVEMENTS=("zoom_in" "pan_right" "zoom_out" "pan_down" "zoom_in")

for i in "${!STEPS[@]}"; do
  STEP_NUM=$((i + 1))
  STEP_TEXT="${STEPS[$i]}"

  # Determine zoompan expression based on movement type
  case "${MOVEMENTS[$i]}" in
    zoom_in)  ZP="z='min(zoom+0.002,1.2)':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)'" ;;
    zoom_out) ZP="z='if(eq(on,1),1.3,max(zoom-0.002,1.0))':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)'" ;;
    pan_right) ZP="z='1.15':x='if(eq(on,1),0,min(x+2,iw-iw/zoom))':y='ih/2-(ih/zoom/2)'" ;;
    pan_down) ZP="z='1.15':x='iw/2-(iw/zoom/2)':y='if(eq(on,1),0,min(y+1.5,ih-ih/zoom))'" ;;
  esac

  ffmpeg -loop 1 -i "assets/${IMAGES[$i]}" -vf "
    zoompan=${ZP}:d=120:s=1080x1920:fps=30,
    eq=brightness=-0.08:contrast=1.2:saturation=0.8,
    vignette=PI/5:1.0,

    drawtext=text='STEP ${STEP_NUM}':fontfile=Inter-Black.ttf:fontsize=90:
      fontcolor=yellow:borderw=4:bordercolor=black:
      x=(w-text_w)/2:y=h*0.25,

    drawtext=text='${STEP_TEXT}':fontfile=Inter-Bold.ttf:fontsize=50:
      fontcolor=white:x=(w-text_w)/2:y=h*0.25+120

  " -t 4 -c:v libx264 -pix_fmt yuv420p "/tmp/step${STEP_NUM}.mp4"
done

ffmpeg -i /tmp/step1.mp4 -i /tmp/step2.mp4 -i /tmp/step3.mp4 \
  -i /tmp/step4.mp4 -i /tmp/step5.mp4 \
  -filter_complex "
    [0:v][1:v]xfade=transition=slideleft:duration=0.5:offset=3.5[v01];
    [v01][2:v]xfade=transition=slideright:duration=0.5:offset=7[v012];
    [v012][3:v]xfade=transition=slideleft:duration=0.5:offset=10.5[v0123];
    [v0123][4:v]xfade=transition=circlecrop:duration=0.5:offset=14[vout]
  " -map "[vout]" output.mp4

Pattern 4: The Montage (Motivation/Compilation)

Rapid-fire clips with music-driven pacing.

When to use: Motivational content, compilations, brand sizzle reels.

#!/bin/bash

CLIPS=(clip1.mp4 clip2.mp4 clip3.mp4 clip4.mp4 clip5.mp4
       clip6.mp4 clip7.mp4 clip8.mp4 clip9.mp4 clip10.mp4)
TRANSITIONS=(fade slideleft dissolve diagtl smoothup
             circlecrop slidedown fadeblack radial fade)

for i in "${!CLIPS[@]}"; do
  ffmpeg -i "assets/${CLIPS[$i]}" -vf "
    scale=1080:1920:force_original_aspect_ratio=increase,crop=1080:1920,
    zoompan=z='min(max(zoom,pzoom)+0.004,1.3)':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d=1:s=1080x1920:fps=30,
    eq=brightness=-0.15:contrast=1.5:saturation=0.5,
    noise=alls=20:allf=t+u,
    vignette=PI/3:1.4
  " -t 2 -c:v libx264 -pix_fmt yuv420p "/tmp/montage_${i}.mp4"
done

ffmpeg \
  -i /tmp/montage_0.mp4 -i /tmp/montage_1.mp4 -i /tmp/montage_2.mp4 \
  -i /tmp/montage_3.mp4 -i /tmp/montage_4.mp4 -i /tmp/montage_5.mp4 \
  -i /tmp/montage_6.mp4 -i /tmp/montage_7.mp4 -i /tmp/montage_8.mp4 \
  -i /tmp/montage_9.mp4 \
  -filter_complex "
    [0:v][1:v]xfade=transition=${TRANSITIONS[0]}:duration=0.3:offset=1.7[v01];
    [v01][2:v]xfade=transition=${TRANSITIONS[1]}:duration=0.3:offset=3.4[v02];
    [v02][3:v]xfade=transition=${TRANSITIONS[2]}:duration=0.3:offset=5.1[v03];
    [v03][4:v]xfade=transition=${TRANSITIONS[3]}:duration=0.3:offset=6.8[v04];
    [v04][5:v]xfade=transition=${TRANSITIONS[4]}:duration=0.3:offset=8.5[v05];
    [v05][6:v]xfade=transition=${TRANSITIONS[5]}:duration=0.3:offset=10.2[v06];
    [v06][7:v]xfade=transition=${TRANSITIONS[6]}:duration=0.3:offset=11.9[v07];
    [v07][8:v]xfade=transition=${TRANSITIONS[7]}:duration=0.3:offset=13.6[v08];
    [v08][9:v]xfade=transition=${TRANSITIONS[8]}:duration=0.3:offset=15.3[vout]
  " -map "[vout]" montage.mp4

ffmpeg -i montage.mp4 -i music_epic.mp3 \
  -filter_complex "[1:a]volume=0.4,afade=t=in:d=1,afade=t=out:st=16:d=1[m];[m]atrim=0:17[mt]" \
  -map 0:v -map "[mt]" -c:v copy -c:a aac -shortest output.mp4
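
With ten clips the xfade chain is long enough that generating it is less error-prone than typing it. A minimal sketch, assuming all clips have the same length and inputs are numbered in playback order; `buildXfadeChain` and its label scheme are hypothetical.

```typescript
// Builds an xfade chain for equal-length clips. Transition names are cycled
// if there are fewer names than joins. The last output is labeled [vout].
function buildXfadeChain(
  clipCount: number,
  clipSec: number,
  fadeSec: number,
  transitions: string[]
): string {
  if (clipCount < 2) throw new Error("need at least two clips");
  if (transitions.length === 0) throw new Error("need at least one transition name");
  const parts: string[] = [];
  let prev = "[0:v]";
  for (let i = 1; i < clipCount; i++) {
    // Each join advances the timeline by one clip minus one overlap.
    const offset = (i * (clipSec - fadeSec)).toFixed(2);
    const name = transitions[(i - 1) % transitions.length];
    const out = i === clipCount - 1 ? "[vout]" : `[v${i}]`;
    parts.push(
      `${prev}[${i}:v]xfade=transition=${name}:duration=${fadeSec}:offset=${offset}${out}`
    );
    prev = out;
  }
  return parts.join(";");
}

console.log(buildXfadeChain(3, 2, 0.3, ["fade", "slideleft"]));
```

The offsets are rounded to two decimals to avoid floating-point noise such as 5.1000000000000005 leaking into the command.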

Small Examples

Example 1: Quick Ken Burns from a Single Image

ffmpeg -loop 1 -i photo.jpg -vf \
  "zoompan=z='min(zoom+0.0015,1.15)':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d=150:s=1080x1920:fps=30" \
  -t 5 -c:v libx264 -pix_fmt yuv420p output.mp4
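
The per-frame zoom step and the zoom cap together decide how long the movement lasts. A minimal sketch of that relationship, assuming zoom starts at 1.0 and grows by a fixed step each output frame; the function names are illustrative.

```typescript
// Step needed for the zoom to reach `maxZoom` exactly on the last frame.
function zoomStep(maxZoom: number, durationSec: number, fps: number): number {
  return (maxZoom - 1) / (durationSec * fps);
}

// Number of frames before a given step hits the cap. The ratio is rounded to
// 6 decimals first so floating-point noise does not add a spurious frame.
function framesToReach(maxZoom: number, step: number): number {
  const ratio = Number(((maxZoom - 1) / step).toFixed(6));
  return Math.ceil(ratio);
}

console.log(framesToReach(1.15, 0.0015)); // frames until the cap is hit
console.log(zoomStep(1.15, 5, 30).toFixed(4)); // step that spans all 150 frames
```

Under this model the command above reaches its 1.15 cap after about 100 frames, so the last third of the 5-second clip holds still; a step near 0.001 would spread the move across all 150 frames.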

Example 2: Crossfade Between Two Clips

ffmpeg -i clip1.mp4 -i clip2.mp4 \
  -filter_complex "[0:v][1:v]xfade=transition=dissolve:duration=1:offset=4" \
  -c:v libx264 output.mp4

Example 3: Add Background Music with Auto-Ducking

ffmpeg -i video_with_voice.mp4 -i background_music.mp3 \
  -filter_complex "
    [0:a]asplit=2[voice][key];
    [1:a]volume=0.15[music];
    [music][key]sidechaincompress=threshold=0.02:ratio=8:attack=200:release=1000[ducked];
    [voice][ducked]amix=inputs=2:duration=first:normalize=0[audio]
  " \
  -map 0:v -map "[audio]" -c:v copy -c:a aac output.mp4

Example 4: Animated Counter (0 to $10,000)

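# Counts from 0 to 10000 between 1s and 3s, then holds until 4s.
# eif has no thousands-separator option, so the value prints as 10000.
ffmpeg -i background.mp4 -vf \
  "drawtext=text='\$%{eif\\:min(floor((t-1)*5000)\\,10000)\\:d}':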
   fontfile=Inter-Black.ttf:fontsize=140:fontcolor=white:
   borderw=5:bordercolor=black:
   x=(w-text_w)/2:y=(h-text_h)/2:
   enable='between(t,1,4)'" \
  -c:v libx264 -c:a copy output.mp4

Example 5: Film Grain + Vignette + Color Grade in One Pass

ffmpeg -i raw_clip.mp4 -vf \
  "eq=brightness=-0.1:contrast=1.3:saturation=0.7,
   noise=alls=15:allf=t+u,
   vignette=PI/4:1.2" \
  -c:v libx264 -c:a copy cinematic.mp4

Example 6: Text with Background Box (Title Card)

ffmpeg -i clip.mp4 -vf \
  "drawtext=text='THE HIDDEN TAX TRAP':
   fontfile=Inter-Black.ttf:fontsize=70:fontcolor=white:
   box=1:boxcolor=red@0.85:boxborderw=25:
   x=(w-text_w)/2:y=h*0.4:
   enable='between(t,0,4)'" \
  -c:v libx264 -c:a copy output.mp4

Example 7: Sound Effect at Specific Timestamp

ffmpeg -i main_video.mp4 -i sfx/whoosh.mp3 \
  -filter_complex "
    [1:a]adelay=3000|3000,volume=0.5[sfx];
    [0:a][sfx]amix=inputs=2:duration=first[audio]
  " \
  -map 0:v -map "[audio]" -c:v copy -c:a aac output.mp4

Example 8: Speed Ramp (Normal to Slow-Mo)

ffmpeg -i action_clip.mp4 -filter_complex "
  [0:v]setpts='if(lt(T,3),PTS,if(lt(T,5),PTS+(T-3)/TB,PTS+2/TB))'[v]
" -map "[v]" -an -c:v libx264 output.mp4

Example 9: Picture-in-Picture Overlay

ffmpeg -i main.mp4 -i pip_source.mp4 \
  -filter_complex "
    [1:v]scale=280:500[pip];
    [0:v][pip]overlay=W-w-30:H-h-30:enable='between(t,5,15)'[out]
  " \
  -map "[out]" -map 0:a -c:v libx264 -c:a copy output.mp4

Example 10: Teal and Orange Hollywood Grade

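# Shadows are pushed toward teal (less red, more blue); highlights toward orange (more red, less blue)
ffmpeg -i footage.mp4 -vf \
  "colorbalance=rs=-0.1:gs=0.05:bs=0.15:rh=0.1:gh=-0.1:bh=-0.2,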
   eq=contrast=1.1:saturation=1.2,
   vignette=PI/5:1.0" \
  -c:v libx264 -c:a copy hollywood.mp4

Cloud API Comparison

When FFmpeg is not enough, or when you need to scale beyond local rendering, these cloud APIs provide programmatic video creation.

Feature Comparison

Feature	FFmpeg (local)	Shotstack	Creatomate	JSON2Video	Remotion	Plainly
Ken Burns	`zoompan` filter	`effects` property	Keyframe animation	Manual positioning	CSS transforms	After Effects
Multi-track	`filter_complex`	Tracks + clips JSON	Layers in JSON	Single track	React composition	AE tracks
Text animation	`drawtext` (limited)	Basic text	Advanced (cascade, typewriter, bounce)	Basic text	Full React animations	AE keyframes
Transitions	`xfade` (40+)	Built-in set	Keyframe-based	Limited	Custom React	AE transitions
Color grading	`eq`, `colorbalance`	Filters	Adjustments	None	CSS filters	AE effects
Audio mixing	`amix`, `sidechaincompress`	Timeline audio	Audio layers	Basic	Web Audio API	AE audio
Templates	None (scripts)	JSON templates	Visual editor + JSON	None	Code templates	AE templates
Rendering	Local GPU/CPU	Cloud	Cloud	Cloud	Local/Lambda/Cloud	Cloud
Pricing	Free	$0.049/render (SD)	From $20/mo	Credits	$19/mo (Cloud Run)	From $59/mo
Best for	Full control, cost	JSON-driven automation	Template-based brands	Simple slideshows	Complex animations	Premium AE quality
Limitations	No visual editor, steep learning curve	Limited text animation	Monthly minimums	Single track only	React knowledge required	Expensive, AE dependency

Detailed API Profiles

Shotstack

Shotstack renders video from JSON specifications via REST API. The composition model uses timelines, tracks, and clips.

// Shotstack JSON composition structure
interface ShotstackEdit {
  timeline: {
    tracks: Array<{
      clips: Array<{
        asset: {
          type: 'video' | 'image' | 'title' | 'audio' | 'html';
          src?: string;
          text?: string;
          style?: string;
        };
        start: number;     // seconds
        length: number;    // seconds
        transition?: {
          in: 'fade' | 'reveal' | 'wipeLeft' | 'slideLeft';
          out: 'fade' | 'reveal' | 'wipeRight' | 'slideRight';
        };
        effect?: 'zoomIn' | 'zoomOut' | 'slideLeft' | 'slideRight';
        filter?: 'greyscale' | 'boost' | 'contrast' | 'darken';
        opacity?: number;
        position?: 'center' | 'top' | 'bottom';
      }>;
    }>;
    background?: string;
  };
  output: {
    format: 'mp4' | 'gif';
    resolution: 'sd' | 'hd' | '1080';
    fps: number;
  };
}

Pros: Simple JSON model, good documentation, supports multi-track, built-in effects. Cons: Limited text animation, no keyframe control, effects are preset-only.

Pricing: $0.049/render (SD), $0.098/render (HD), $0.196/render (1080p). API docs.
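
As an illustration of the timeline, track, and clip model, the sketch below builds an edit object for a simple image slideshow. Field names and values follow the interface shown above and have not been checked against the live Shotstack API; `buildSlideshowEdit` and the local types are hypothetical.

```typescript
// Local types covering only the fields this sketch uses. They mirror the
// interface above, which may not match the current Shotstack schema exactly.
type ZoomEffect = "zoomIn" | "zoomOut";

interface ImageClip {
  asset: { type: "image"; src: string };
  start: number; // seconds
  length: number; // seconds
  effect: ZoomEffect;
  transition: { in: "fade"; out: "fade" };
}

// Places images back to back on a single track, alternating the zoom
// direction so consecutive clips do not move the same way.
function buildSlideshowEdit(imageUrls: string[], secondsPerClip: number) {
  const clips = imageUrls.map(
    (src, i): ImageClip => ({
      asset: { type: "image", src },
      start: i * secondsPerClip,
      length: secondsPerClip,
      effect: i % 2 === 0 ? "zoomIn" : "zoomOut",
      transition: { in: "fade", out: "fade" },
    })
  );
  return {
    timeline: { background: "#000000", tracks: [{ clips }] },
    output: { format: "mp4", resolution: "1080", fps: 30 },
  };
}

// Placeholder URLs for illustration only.
const edit = buildSlideshowEdit(
  ["https://example.com/office.jpg", "https://example.com/calculator.jpg"],
  4
);
console.log(JSON.stringify(edit, null, 2));
```

The resulting object would be serialized as the JSON body of a render request.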

Creatomate

Creatomate offers both template-based and JSON-from-scratch rendering with advanced text animations.

// Creatomate render request
interface CreatomateRender {
  source: {
    output_format: 'mp4' | 'gif' | 'png';
    width: number;
    height: number;
    duration: number;
    elements: Array<{
      type: 'video' | 'image' | 'text' | 'shape' | 'composition';
      source?: string;
      text?: string;
      x: string;       // Supports expressions and percentages
      y: string;
      width: string;
      height: string;
      animations?: Array<{
        type: 'scale' | 'fade' | 'slide' | 'text-typewriter' | 'text-cascade';
        time: 'start' | 'end' | number;
        duration: number;
        easing?: string;
      }>;
      keyframes?: Array<{
        time: number;
        value: Record<string, unknown>;
      }>;
    }>;
  };
}

Pros: Rich text animation (typewriter, cascade, bounce), keyframe support, visual template editor, good for branded content. Cons: Monthly subscription required, less control than raw FFmpeg for custom effects.

Pricing: Starts at $20/month. Developer docs.

Remotion

Remotion renders video using React components. Each frame is a React render.

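// Remotion video composition
import React from 'react';
import { AbsoluteFill, useCurrentFrame, interpolate, Sequence } from 'remotion';

// AnimatedText and CounterAnimation (used below) are assumed to be project
// components defined elsewhere; they are not part of Remotion.

const KenBurnsImage: React.FC<{ src: string; direction: 'in' | 'out' }> = ({ src, direction }) => {
  const frame = useCurrentFrame();

  // Clamp so the scale holds its end value if the clip runs past frame 150
  const clamp = { extrapolateRight: 'clamp' as const };
  const scale = direction === 'in'
    ? interpolate(frame, [0, 150], [1, 1.2], clamp)
    : interpolate(frame, [0, 150], [1.3, 1.0], clamp);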

  return (
    <AbsoluteFill>
      <img
        src={src}
        style={{
          width: '100%',
          height: '100%',
          objectFit: 'cover',
          transform: `scale(${scale})`,
        }}
      />
    </AbsoluteFill>
  );
};

const DataStoryVideo: React.FC = () => {
  return (
    <>
      <Sequence from={0} durationInFrames={90}>
        <KenBurnsImage src="/office.jpg" direction="in" />
        <AnimatedText text="93% of LLCs overpay" y={0.35} />
      </Sequence>
      <Sequence from={90} durationInFrames={90}>
        <KenBurnsImage src="/calculator.jpg" direction="out" />
        <CounterAnimation target={3200} prefix="$" y={0.4} />
      </Sequence>
    </>
  );
};

Pros: Full React ecosystem, unlimited animation complexity, self-hostable, type-safe, testable. Cons: Requires React knowledge, local rendering needs good hardware, Lambda rendering has cold start overhead.

Pricing: Open source (local render is free). Cloud Run: $19/month. Lambda rendering: pay per invocation. Docs.

JSON2Video

JSON2Video converts JSON documents to video with built-in TTS and HTML element support.

// JSON2Video movie structure
interface JSON2VideoMovie {
  resolution: 'full-hd' | 'hd' | 'sd';
  quality: 'high' | 'medium' | 'low';
  scenes: Array<{
    background: string;
    elements: Array<{
      type: 'text' | 'image' | 'video' | 'audio' | 'html' | 'voice';
      src?: string;
      text?: string;
      start?: number;
      duration?: number;
      position?: 'center' | 'custom';
      x?: number;
      y?: number;
    }>;
  }>;
}

Pros: Simple API, built-in TTS, HTML/CSS element support. Cons: Single-track, limited animation, no multi-layer composition, credit-based pricing is unpredictable.

Pricing: Credit-based system. API docs.

Plainly

Plainly renders Adobe After Effects templates in the cloud, providing the highest visual quality at the highest cost.

Pros: Full After Effects quality, complex animations, professional templates, motion graphics. Cons: Requires After Effects knowledge to create templates, expensive, slower rendering.

Pricing: From $59/month for 20 minutes (about $2.95/minute). 100 minutes at $249/month (about $2.49/minute). Pricing page.

Decision Matrix

If you need…	Use…	Because…
Maximum control, lowest cost	FFmpeg locally	Free, full filter access, $0.01/video
JSON-driven automation at scale	Shotstack	Simple API, predictable per-render pricing
Branded templates with rich text	Creatomate	Best text animations, visual editor
Complex custom animations	Remotion	React-based, unlimited creativity
Simple slideshow with TTS	JSON2Video	Quickest to implement
Premium motion graphics	Plainly	After Effects quality in the cloud
All of the above	FFmpeg + graduate up	Start local, move to cloud when you hit limits

Key insight: Start with FFmpeg. It handles Levels 1-5 with zero cost. Graduate to Shotstack or Creatomate only when you need features FFmpeg cannot provide — primarily complex text animation and template management. If you need React-level animation control, use Remotion. If you need After Effects quality, use Plainly.
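
The decision matrix can be read as an ordered set of checks that falls back to FFmpeg. A minimal sketch of that reading; the `Needs` flags and the order in which they are tested are assumptions made for illustration, not part of any vendor API.

```typescript
type Renderer =
  | "FFmpeg"
  | "Shotstack"
  | "Creatomate"
  | "Remotion"
  | "JSON2Video"
  | "Plainly";

// Hypothetical requirement flags, one per row of the matrix.
interface Needs {
  afterEffectsTemplates?: boolean;
  customReactAnimation?: boolean;
  richTextAnimation?: boolean;
  hostedJsonApi?: boolean;
  builtInTts?: boolean;
}

// Checks the most specialized requirement first; the ordering is a judgment
// call. With no special needs the answer is local FFmpeg.
function chooseRenderer(needs: Needs): Renderer {
  if (needs.afterEffectsTemplates) return "Plainly";
  if (needs.customReactAnimation) return "Remotion";
  if (needs.richTextAnimation) return "Creatomate";
  if (needs.hostedJsonApi) return "Shotstack";
  if (needs.builtInTts) return "JSON2Video";
  return "FFmpeg";
}

console.log(chooseRenderer({}));
console.log(chooseRenderer({ richTextAnimation: true }));
```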

Stock Resource Strategy

Video Footage

Source	License	API	Quality	Volume
Pexels	Free, attribution optional	REST API, 200 req/hr	HD-4K	50K+ videos
Pixabay	Free, no attribution	REST API	HD-4K	30K+ videos
Coverr	Free, no attribution	Manual download	HD	2K+ videos
Mixkit	Free, no attribution	Manual download	HD-4K	5K+ videos

Recommendation: Use Pexels API as primary (best search, largest catalog, API access). Pixabay as secondary. Download Coverr/Mixkit clips locally for common backgrounds (abstract, nature, city, technology).
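
A minimal sketch of preparing a Pexels video search request. The endpoint, the `query`, `per_page`, and `orientation` parameters, and the `Authorization` header reflect my understanding of the public Pexels API and should be checked against its documentation; the function name, type names, and key value are placeholders.

```typescript
type Orientation = "portrait" | "landscape" | "square";

interface PexelsRequest {
  url: string;
  headers: { Authorization: string };
}

// Builds the URL and headers only; sending the request (for example with
// fetch) is left to the caller so rate limits can be handled in one place.
function pexelsVideoSearch(
  query: string,
  apiKey: string,
  perPage: number = 5,
  orientation: Orientation = "portrait"
): PexelsRequest {
  const qs = [
    `query=${encodeURIComponent(query)}`,
    `per_page=${perPage}`,
    `orientation=${orientation}`,
  ].join("&");
  return {
    url: `https://api.pexels.com/videos/search?${qs}`,
    headers: { Authorization: apiKey },
  };
}

// "YOUR_API_KEY" is a placeholder.
console.log(pexelsVideoSearch("dark office", "YOUR_API_KEY", 3).url);
```

Using each segment's `visualSearchTerms` as the query and caching results locally helps stay under the 200 requests per hour listed in the table.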

Music

Source	License	Genres	Download
Pixabay Music	Free, no attribution	All genres	Direct
Mixkit Music	Free, no attribution	All genres	Direct
Freesound	CC (varies)	SFX + ambient	API
Uppbeat	Free tier available	All genres	Direct

Pre-download library by mood (20-30 tracks):

Mood	Use Case	Search Terms
Dark cinematic	Finance, business	“dark cinematic”, “tension”, “corporate dark”
Corporate tech	SaaS, technology	“corporate tech”, “innovation”, “digital”
Motivational epic	Success, achievement	“epic motivation”, “triumph”, “inspirational”
Dramatic tension	Reveals, surprises	“suspense”, “dramatic build”, “anticipation”
Neutral ambient	How-to, tutorials	“ambient”, “background”, “minimal”
Upbeat energy	Lifestyle, marketing	“upbeat pop”, “energetic”, “happy”

Sound Effects

Pre-download library (15-20 effects):

SFX	Use Case	Source
Whoosh (3 variants)	Transitions	Pixabay SFX
Bass drop	Hook reveal	Pixabay SFX
Cash register	Money mentions	Pixabay SFX
Keyboard typing	Data reveals	Freesound
Camera shutter	Screenshot moments	Pixabay SFX
Tension drone	Background suspense	Freesound
Success chime	CTA, completion	Pixabay SFX
Notification ping	Alerts, pop-ups	Pixabay SFX
Record scratch	Contrarian hooks	Pixabay SFX
Glass shatter	Breaking misconceptions	Pixabay SFX

Fonts

Font	Weight	Use Case	Source
Inter Black	900	Headlines, numbers	Google Fonts
Inter Bold	700	Subheads, labels	Google Fonts
Inter Regular	400	Body text, captions	Google Fonts
Montserrat Bold	700	Alternative headline	Google Fonts

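Download the font files and reference them by path. FFmpeg’s drawtext is most predictable with an explicit fontfile pointing at the .ttf file, because looking a font up by name only works in builds compiled with fontconfig.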

Background Textures (download 5-10 looping videos)

Texture	Mood	Search Terms
Dark digital grid	Tech, data	“digital grid loop”
Abstract particles	Universal	“particles dark background”
Bokeh lights	Warm, lifestyle	“bokeh lights loop”
Smoke/fog	Dramatic, moody	“smoke dark background”
Matrix code rain	Tech, hacking	“matrix code loop”
Gradient flow	Modern, clean	“gradient abstract loop”

Architecture Decisions

Where to Run Each Step

Step	Run Where	Why
Script generation	Cloud (Claude SDK, Gemini, Workers AI)	Quality matters — use the best model available
Voice synthesis	Local (edge-tts) or ElevenLabs API	edge-tts is free and fast; ElevenLabs for premium
Music	Stock libraries	Don’t generate — curate from free libraries
Stock footage search	Pexels/Pixabay API	Already in API Mom, free tier is generous
Video rendering	Local FFmpeg	$0.01/video on consumer hardware
Caption generation	Local (whisper-timestamped)	Word-level timestamps, runs on GPU

Cost Comparison

Pipeline	Cost/Video	At 100 videos/month	Quality
Full local (FFmpeg + edge-tts)	$0.01-0.03	$1-3	Level 1-5
Local + ElevenLabs voice	$0.10-0.30	$10-30	Level 1-5, better voice
Shotstack cloud	$0.30-1.00	$30-100	Level 1-3 (limited effects)
Creatomate cloud	$0.50-2.00	$50-200	Level 1-4 (good text)
Remotion Lambda	$0.05-0.15	$5-15	Level 1-6 (any animation)
Plainly (After Effects)	$2.50-3.00	$250-300	Level 6+ (premium)
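
The monthly column is the per-video range multiplied by volume. A small sketch of that arithmetic using figures from the table, handy for re-running the comparison at a different volume; the type and function names are illustrative.

```typescript
interface CostRange {
  low: number; // USD
  high: number; // USD
}

// Rounds to cents so floating-point noise does not show up in the totals.
function monthlyCost(perVideo: CostRange, videosPerMonth: number): CostRange {
  const cents = (x: number) => Math.round(x * 100) / 100;
  return {
    low: cents(perVideo.low * videosPerMonth),
    high: cents(perVideo.high * videosPerMonth),
  };
}

// Per-video ranges copied from the table above.
const pipelines: Record<string, CostRange> = {
  "Full local": { low: 0.01, high: 0.03 },
  "Shotstack cloud": { low: 0.3, high: 1.0 },
  "Plainly": { low: 2.5, high: 3.0 },
};

for (const [name, range] of Object.entries(pipelines)) {
  const m = monthlyCost(range, 100);
  console.log(`${name}: $${m.low}-${m.high} per month`);
}
```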

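Key insight: Local FFmpeg rendering on a consumer GPU (RTX 4070 or similar) costs approximately $0.01/video in electricity. At 100 videos per month, that is about $1, versus $30-100 for Shotstack, $50-200 for Creatomate, and $250-300 for Plainly in the table above: a 30x to 300x cost difference, with equivalent or better quality for Levels 1-5. Remotion Lambda is the closest hosted option at $5-15.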

The Graduation Path

Stage 1: FFmpeg for everything (Levels 1-5)
  ↓ When you need: complex text animation, template management
Stage 2: FFmpeg + Creatomate for text-heavy content
  ↓ When you need: React-level custom animation
Stage 3: Remotion for complex sequences, FFmpeg for simple ones
  ↓ When you need: professional motion graphics with existing AE templates
Stage 4: Plainly for premium content, Remotion/FFmpeg for volume

Render Pipeline Architecture (TypeScript)

interface RenderJob {
  id: string;
  script: VideoScript;
  assets: CollectedAssets;
  profile: ColorProfile;
  output: OutputSpec;
  status: 'queued' | 'rendering' | 'complete' | 'failed';
}

interface VideoScript {
  hook: {
    archetype: HookArchetype;
    text: string;
    durationSec: number;
  };
  segments: Array<{
    narration: string;
    durationSec: number;
    visualSearchTerms: string[];  // For Pexels API queries
    dataOverlays: DataOverlay[];  // Numbers, stats to display
    sfxCues: SFXCue[];           // Sound effects at timestamps
  }>;
  cta: {
    text: string;
    durationSec: number;
  };
  totalDurationSec: number;
}

interface CollectedAssets {
  voiceover: { path: string; wordTimestamps: WhisperWord[] };
  music: { path: string; mood: string; bpm: number };
  clips: Array<{
    path: string;
    source: 'pexels' | 'pixabay' | 'local';
    kenBurns: KenBurnsMovement;
    durationSec: number;
  }>;
  sfx: Map<string, string>;  // name → file path
  fonts: Map<string, string>; // weight → file path
}

interface OutputSpec {
  resolution: '1080x1920' | '1920x1080';
  fps: 30;
  codec: 'libx264';
  audioBitrate: '192k';
  format: 'mp4';
}

// The render function builds an FFmpeg command from the job spec.
// The build*Filter helpers are defined elsewhere; each must return a fully
// labeled filter chain so the chains can be joined with ';'.
function buildFFmpegCommand(job: RenderJob, outputPath = 'output.mp4'): string {
  const inputs: string[] = [];
  const filters: string[] = [];

  // Step 1: Add all video inputs with Ken Burns (inputs 0..N-1).
  // Paths are quoted so that spaces survive the shell.
  job.assets.clips.forEach((clip, i) => {
    inputs.push(`-i "${clip.path}"`);
    filters.push(buildKenBurnsFilter(clip, i));
  });

  // Step 2: Chain transitions
  filters.push(buildTransitionChain(job.assets.clips));

  // Step 3: Apply color grade + grain + vignette
  filters.push(buildColorGradeFilter(job.profile));

  // Step 4: Add text overlays
  job.script.segments.forEach(seg => {
    seg.dataOverlays.forEach(overlay => {
      filters.push(buildTextOverlayFilter(overlay));
    });
  });

  // Step 5: Mix audio (voiceover is input N, music is input N+1)
  inputs.push(`-i "${job.assets.voiceover.path}"`);
  inputs.push(`-i "${job.assets.music.path}"`);
  filters.push(buildAudioMixFilter(job));

  // Step 6: Apply the output spec instead of relying on FFmpeg defaults
  const { fps, codec, audioBitrate } = job.output;
  return `ffmpeg ${inputs.join(' ')} -filter_complex "${filters.join(';')}" ` +
    `-r ${fps} -c:v ${codec} -b:a ${audioBitrate} "${outputPath}"`;
}
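The transition step is the least obvious part of the pipeline, because each xfade `offset` depends on the running length of everything already merged. Below is one possible sketch of such a helper, not the pipeline's actual `buildTransitionChain`. It assumes the Ken Burns step labels its outputs `[v0]`, `[v1]`, and so on, and it writes the final stream to a `[vout]` label. The function name, the `ClipTiming` type, and both label conventions are assumptions made for illustration. The transition names are standard xfade transitions, rotated so that no two consecutive cuts look the same.

```typescript
// Minimal clip shape for this sketch; the real clip type carries more fields.
interface ClipTiming {
  durationSec: number;
}

// A small rotation of built-in xfade transitions.
const TRANSITIONS = ['fade', 'slideleft', 'dissolve', 'circleopen', 'wipeleft', 'smoothright'];

// Builds a chain of xfade filters over inputs labeled [v0]..[vN-1].
// xfade expects both inputs to share resolution, frame rate, and pixel format.
function buildXfadeChain(clips: ClipTiming[], transitionSec = 0.5): string {
  if (clips.length === 0) throw new Error('at least one clip is required');
  if (clips.some(c => c.durationSec <= transitionSec)) {
    throw new Error('every clip must be longer than the transition');
  }
  // A single clip needs no transition; pass it through unchanged.
  if (clips.length === 1) return '[v0]null[vout]';

  const parts: string[] = [];
  let prev = 'v0';
  let elapsed = clips[0].durationSec; // length of the merged stream so far

  for (let i = 1; i < clips.length; i++) {
    // The next clip starts fading in transitionSec before the merged stream ends.
    const offset = elapsed - transitionSec;
    const out = i === clips.length - 1 ? 'vout' : `x${i}`;
    const transition = TRANSITIONS[(i - 1) % TRANSITIONS.length];
    parts.push(
      `[${prev}][v${i}]xfade=transition=${transition}:duration=${transitionSec}` +
        `:offset=${offset.toFixed(3)}[${out}]`
    );
    // After the crossfade, the merged stream ends where the new clip ends.
    elapsed = offset + clips[i].durationSec;
    prev = out;
  }
  return parts.join(';');
}

console.log(buildXfadeChain([{ durationSec: 3 }, { durationSec: 3 }, { durationSec: 3 }]));
```

Each crossfade shortens the total by `transitionSec`, so three 3-second clips with 0.5-second transitions yield an 8-second video rather than 9. Narration timing has to account for that overlap.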

Anti-Patterns

Don’t	Do Instead	Why
Use static images without Ken Burns	Apply zoompan to every image	Static = retention death. The brain disengages when nothing moves.
Hard cut between every clip	Use xfade transitions (0.3-1s)	Hard cuts feel jarring and amateur. Crossfades feel intentional.
Same Ken Burns movement on consecutive clips	Alternate from the 7-movement library	Repetitive movement becomes predictable and boring.
TTS voiceover with no background music	Mix music at 15% volume with sidechain ducking	Music sets mood and fills silence gaps. Ducking keeps voice clear.
Single visual per sentence	2-3 visuals per sentence (3-second rule)	One visual for 9 seconds loses attention. Three visuals maintain pace.
Text without background/border	Always use `borderw` or `box` on drawtext	Text without contrast against video is unreadable on mobile.
Same transition type throughout	Vary transitions (fade, slide, dissolve, circle)	Variety maintains visual interest. Same transition = monotonous.
Skip color grading	Apply a consistent grade to all clips	Raw footage from different sources looks inconsistent. Grade unifies.
Generate music with AI	Curate from free stock libraries	AI music is detectable, sounds generic, and quality varies. Stock is curated.
Use only the `fade` transition	Use 6-8 different transitions per video	FFmpeg has 40+ transitions — use them. Variety = production value.
Render in the cloud for simple compositions	Use local FFmpeg first, cloud only when needed	Cloud = $0.30-3.00/video. Local = $0.01/video. Same quality for Levels 1-5.
Skip silence removal in voiceover	Run whisper-timestamped and trim gaps > 0.3s	Dead air kills pacing. Professional editors obsessively remove silence.
Put all text at the center	Vary position (top bar, bottom third, center)	Same position becomes invisible (banner blindness). Vary to maintain attention.
Use system fonts in drawtext	Download Inter Black/Bold from Google Fonts	System fonts look amateur. Inter is designed for screens and data display.
Skip the hook (start with content)	First 3 seconds must have hook archetype	71% of viewers decide within the first 3 seconds whether to keep watching. No hook = no viewers.
Captions as an afterthought	Build captions into the composition as Track 3	78% of viewers watch on mute. Captions are primary content, not an accessibility add-on.
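The silence-removal row above is the least self-explanatory of these, so here is a minimal sketch of how word timestamps could drive it. It assumes each word arrives as `{ text, start, end }` in seconds, which is an assumption about the transcript shape rather than a guaranteed whisper-timestamped format. The function names are hypothetical. Instead of cutting pauses to zero, the sketch shortens every gap longer than 0.3s to exactly 0.3s by keeping half of that padding on each side.

```typescript
interface Word {
  text: string;
  start: number; // seconds
  end: number;   // seconds
}

interface KeepRange {
  start: number;
  end: number;
}

// Returns the time ranges of the voiceover to keep.
// Gaps longer than maxGapSec are shortened to maxGapSec; shorter gaps are untouched.
function keepRanges(words: Word[], maxGapSec = 0.3): KeepRange[] {
  if (words.length === 0) return [];
  const pad = maxGapSec / 2;
  const ranges: KeepRange[] = [];
  let current: KeepRange = { start: Math.max(0, words[0].start - pad), end: words[0].end };

  for (let i = 1; i < words.length; i++) {
    const gap = words[i].start - words[i - 1].end;
    if (gap > maxGapSec) {
      // Close the current range with trailing padding, then open a new one with leading padding.
      ranges.push({ start: current.start, end: current.end + pad });
      current = { start: words[i].start - pad, end: words[i].end };
    } else {
      current.end = words[i].end;
    }
  }
  ranges.push(current);
  return ranges;
}

// Turns the ranges into an FFmpeg audio filter: aselect keeps the samples inside
// the ranges, and asetpts rebuilds continuous timestamps afterwards.
function buildSilenceTrimFilter(ranges: KeepRange[]): string {
  const expr = ranges
    .map(r => `between(t,${r.start.toFixed(3)},${r.end.toFixed(3)})`)
    .join('+');
  return `aselect='${expr}',asetpts=N/SR/TB`;
}
```

Trimming shifts every later word earlier, so caption timestamps need to be re-timed against the trimmed audio (for example by transcribing again after the trim) before they are burned in.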

References

Official Documentation

FFmpeg Filters Documentation — Complete filter reference including zoompan, xfade, drawtext, eq, colorbalance, noise, vignette, sidechaincompress
FFmpeg Filtering Guide — Official wiki guide to filter graphs and filter_complex
Shotstack API Documentation — REST API reference for JSON-to-video rendering
Creatomate Developer Docs — API reference for template and JSON-based video rendering
Remotion Documentation — React-based programmatic video creation framework
JSON2Video API Reference — JSON-to-video API with built-in TTS
Plainly Product — After Effects cloud rendering API
Pexels API Documentation — Free stock photo and video API
Pixabay API Documentation — Free stock media API
whisper-timestamped (GitHub) — Word-level timestamp extraction for silence detection
WhisperX (GitHub) — Automatic speech recognition with word-level timestamps and diarization

FFmpeg Tutorials and Guides

Ken Burns Effect with FFmpeg — Bannerbear — Detailed zoompan tutorial with examples
Ken Burns Effect Slideshows with FFmpeg — mko.re — Practical slideshow creation guide
Ken Burns Effect Using FFmpeg — hadna.space — Zoom and pan reference
How to Zoom Images and Videos using FFmpeg — Creatomate — Zoom techniques guide
How to Create a Slideshow from Images using FFmpeg — Creatomate — Complete slideshow pipeline
XFade Transitions — OTTVerse — Complete xfade transition guide with examples
xfade-easing (GitHub) — 200+ extended transitions with CSS easing functions
FFmpeg drawtext filter — OTTVerse — Dynamic text overlays, scrolling text, timestamps
FFmpeg drawtext animations — Brayden Blackwell — Exploration of drawtext animation expressions
FFmpeg Audio Manipulation Guide — Audio mixing, ducking, effects
sidechaincompress — FFmpeg Docs — Sidechain compression filter reference
How to Add a Transparent Overlay — Creatomate — Overlay and opacity techniques
FFmpeg Engineering Handbook — Filters — Filter graph fundamentals

Retention Science and Statistics

TikTok First 3 Seconds Hook Retention Rate Statistics — TTS Vibes — 71% decide in first 3 seconds, 85% retention = viral potential
Social Media Attention Span Statistics 2026 — SQ Magazine — 73% can’t distinguish AI-assisted from traditional video
2025 YouTube Audience Retention Benchmarks — Retention Rabbit — Average 23.7% retention, 8-second consideration window
Short Form Video Statistics 2025 — Marketing LTB — 92% completion for under-15-second videos, 60-90 second sweet spot
2025 Social Media Video Statistics — Social Insider — 78% watch on mute, cross-platform video benchmarks
YouTube Audience Retention 2026 Guide — Social Rails — Complete retention optimization guide
User Attention Span Statistics 2026 — Amra and Elma — Digital focus collapse data

Kinetic Typography and Motion Design

Kinetic Typography for Video Engagement 2025 — Influencers Time — 25-50% retention improvement with motion text
Kinetic Typography for Short-Form Video — Influencers Time — Short-form specific motion text techniques
Kinetic Typography: Complete Guide 2026 — IK Agency — Comprehensive motion text design guide
AI-Powered Text Animation Trends 2025 — SuperAGI — AI tools for kinetic typography

Cloud API Comparisons

7 Best Video Editing APIs 2026 — Plainly — Comprehensive API comparison
Best Video Generation APIs Reviewed — Creatomate — Cloud rendering API comparison
Best Stock Image and Video APIs — Shotstack — Stock media API comparison
Top 10 Stock Video APIs — Plainly — Stock footage API roundup

Stock Resources

Pexels — Free stock photos and videos, REST API
Pixabay — Free stock photos, videos, music, and sound effects
Pixabay Music — Free background music, no attribution required
Mixkit — Free stock video, music, and sound effects
Coverr — Free stock video footage
Freesound — Community-sourced sound effects (Creative Commons)
Uppbeat — Free music for creators
Google Fonts — Inter — Primary font for video text overlays
Google Fonts — Montserrat — Alternative headline font

GitHub Repositories

Remotion (GitHub) — React framework for programmatic video
Shotstack JSON Examples (GitHub) — Collection of Shotstack API examples
JSON2Video Node.js SDK (GitHub) — JSON2Video SDK for Node.js
xfade-ffmpeg-script (GitHub) — Bash script for chaining xfade transitions
FFmpeg leveler (GitHub) — Automated leveling and ducking with FFmpeg
cerberus/ffmpeg/zoompan (GitHub) — zoompan filter reference and examples