Gary Wu

Video Production Techniques


Org Status: 🟡 Dormant | Cloudflare: N/A | Last Audited: 2026-04-28


How to bridge the gap between “AI slideshow” and viral-quality video, using FFmpeg, cloud rendering APIs, and the exact techniques human editors use to hold attention. Every command is real. Every statistic is sourced.

What you’ll learn:


  1. The Problem: Why AI Video Looks Amateur
  2. The Retention Science
  3. Core Concepts
  4. The 6 Production Levels
  5. The 4 Human Editor Secrets
  6. FFmpeg Effects Reference
  7. Hook Archetypes
  8. Patterns: Production-Quality Compositions
  9. Small Examples
  10. Cloud API Comparison
  11. Stock Resource Strategy
  12. Architecture Decisions
  13. Anti-Patterns
  14. References

The Problem: Why AI Video Looks Amateur

Most programmatic video pipelines produce content that viewers scroll past in under a second. The symptoms are obvious:

The result is what the industry calls a “slideshow video”: technically a video file, functionally a PowerPoint with narration. Viewers detect this in under 3 seconds and swipe away.

The gap between slideshow and viral is not talent or expensive software. It is a specific set of techniques that human editors apply instinctively and AI pipelines skip entirely. These techniques are all implementable with FFmpeg filters, and they compound: each one adds 10-30% retention lift, and together they transform amateur content into content that algorithms promote.

What changes if you get this right:

| Metric | Slideshow (Level 1) | Production (Level 4+) |
| --- | --- | --- |
| 3-second retention | 30-40% | 70-85% |
| Average watch time | 15-25% of duration | 45-65% of duration |
| Algorithmic promotion | Minimal | Active distribution |
| RPM (finance niche) | $2-5 | $9-21 |
| Viewer perception | “AI generated” | “Professional content” |
| Cost per video (local) | $0.01 | $0.01-0.03 |

The cost difference is negligible. The quality difference is everything. Every technique in this article adds production value without adding meaningful cost when you render locally with FFmpeg.


The Retention Science

Before diving into techniques, understand why they work. Video retention is governed by neuroscience, not aesthetics.

The 3-Second Window

71% of viewers decide in the first 3 seconds whether to keep watching. TikTok videos that maintain 70-85% retention in the first 3 seconds receive 2.2x more total views than videos with lower retention. Videos exceeding 85% first-3-second retention achieve viral potential. Content below 60% gets minimal algorithmic promotion.

interface RetentionWindow {
  /** Seconds 0-3: The hook. Visual must change, text must appear, audio must hit. */
  hook: { durationMs: 3000; targetRetention: 0.85; visualChanges: number }; // minimum 2

  /** Seconds 3-10: The promise. Viewer decides if the payoff is worth waiting for. */
  promise: { durationMs: 7000; targetRetention: 0.70; paceSecondsPerVisual: 3 };

  /** Seconds 10-30: The delivery. Content must match or exceed the hook's promise. */
  delivery: { durationMs: 20000; targetRetention: 0.55; paceSecondsPerVisual: 3 };

  /** Seconds 30+: The payoff. Reward the viewer for staying. */
  payoff: { targetRetention: 0.40; includesCTA: boolean };
}
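
As a rough illustration of how the first-3-second thresholds quoted above could be applied to analytics data, here is a minimal sketch. The function name and tier labels are hypothetical, and the text does not characterize the 60-70% band, so it is labeled "unspecified" here rather than guessed at.

```typescript
type PromotionTier = "viral-potential" | "strong" | "unspecified" | "minimal";

/**
 * Maps a first-3-second retention ratio (0..1) to the tiers described in the text:
 * above 85% viral potential, 70-85% strong, below 60% minimal promotion.
 * The 60-70% band is not described in the text, hence "unspecified".
 */
function classifyHookRetention(retention: number): PromotionTier {
  if (retention < 0 || retention > 1) {
    throw new RangeError("retention must be a ratio between 0 and 1");
  }
  if (retention > 0.85) return "viral-potential";
  if (retention >= 0.7) return "strong";
  if (retention >= 0.6) return "unspecified";
  return "minimal";
}

console.log(classifyHookRetention(0.78));
```

A pipeline could run a check like this on each published video and flag hooks that land in the lowest tier for a re-edit.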

Why Movement Matters

The human visual system is wired to track movement. A static image on screen triggers the brain’s “nothing is happening” response and attention drops. Even subtle movement, such as a 2% zoom over 5 seconds, keeps the visual cortex engaged.

Key insight: Static visuals are not “neutral”; they are actively harmful to retention. Every frame must move. This is the single highest-impact change you can make to any video pipeline.

Why Multi-Layer Composition Works

Professional video uses depth: multiple visual planes stacked to create a sense of space. Multi-plane B-roll compositions increase retention by 31% compared to single-layer video. The brain interprets layered visuals as “richer” content worth paying attention to.

Why Sound Design Matters

78% of social media video is watched on mute. This means captions are mandatory, not optional. But for the 22% who listen, sound design (background music, sound effects, audio ducking) dramatically increases perceived production quality. The combination of visual captions AND sound design covers both audiences.

Optimal Duration

The highest engagement for short-form video occurs in the 60-90 second range. For long-form YouTube, 8-15 minutes is the sweet spot for finance/education niches, with $9-21 RPM.


Core Concepts

Concept 1: The Render Pipeline

Every programmatic video follows the same pipeline, regardless of whether you use FFmpeg locally or a cloud API.

interface RenderPipeline {
  /** Step 1: Script generation - what the video says */
  script: {
    hook: string;           // First 3 seconds
    segments: Segment[];    // Each segment = one visual scene
    cta: string;            // Call to action
  };

  /** Step 2: Asset collection - what the video shows */
  assets: {
    stockVideo: StockClip[];     // B-roll footage from Pexels/Pixabay
    stockImages: StockImage[];   // For Ken Burns treatment
    music: AudioTrack;           // Background music from Pixabay Music
    sfx: SoundEffect[];          // Whoosh, bass drop, etc.
    voiceover: AudioTrack;       // TTS from edge-tts or ElevenLabs
  };

  /** Step 3: Composition - how it all fits together */
  composition: {
    tracks: Track[];             // Layered video tracks (background, subject, text)
    transitions: Transition[];   // Between clips
    colorGrade: ColorProfile;    // Overall mood
    effects: Effect[];           // Grain, vignette, etc.
  };

  /** Step 4: Render - produce the final file */
  output: {
    format: 'mp4';
    resolution: '1080x1920' | '1920x1080';  // Vertical or horizontal
    fps: 30;
    codec: 'libx264' | 'libx265';
    audioBitrate: '192k';
  };
}

Concept 2: The Visual Change Cadence

The “3-second rule” is the most important pacing concept in viral video. The visual on screen must change every 3 seconds. Not every 10 seconds. Not every 5 seconds. Every 3 seconds.

interface VisualCadence {
  /** Maximum seconds any single visual can stay on screen */
  maxVisualDuration: 3;

  /** For a 60-second video, you need at minimum 20 distinct visuals */
  visualsPerMinute: 20;

  /** Each script segment (sentence) should map to 2-3 visuals, not 1 */
  visualsPerSegment: 2 | 3;

  /** Never repeat the same visual in a video */
  allowRepeat: false;

  /** Movement type must vary - never two consecutive clips with same Ken Burns */
  movementVariation: true;
}

Key insight: Divide each script segment duration by 3. That’s how many distinct visuals you need for that segment. A 9-second sentence needs 3 different visuals, each with its own Ken Burns movement, with crossfade transitions between them.

Concept 3: The Audio Stack

Professional video has 3-4 audio layers, not 1.

interface AudioStack {
  /** Layer 1: Voice - the primary content, always at 100% */
  voice: {
    source: 'edge-tts' | 'elevenlabs' | 'recorded';
    volume: 1.0;
    processing: 'normalize' | 'compress';
  };

  /** Layer 2: Music - sets mood, always ducked under voice */
  music: {
    source: 'stock';  // Pixabay Music, Mixkit
    volume: 0.15;     // 15% when voice is active
    ducking: {
      method: 'sidechaincompress';
      threshold: 0.02;
      ratio: 8;
      attackMs: 200;
      releaseMs: 1000;
    };
  };

  /** Layer 3: SFX - punctuate transitions and key moments */
  sfx: {
    onTransition: 'whoosh';        // Every scene change
    onHookReveal: 'bass_drop';     // The first number/stat
    onMoney: 'cash_register';      // Dollar amounts
    onData: 'keyboard_typing';     // Data reveals
    onCTA: 'success_chime';        // End call-to-action
    volume: 0.5;
  };

  /** Layer 4: Ambience (optional) - subtle background texture */
  ambience?: {
    source: 'tension_drone' | 'room_tone';
    volume: 0.05;
  };
}

Concept 4: The Color Grade

Color grading is the difference between “footage” and “cinema.” A single FFmpeg filter chain transforms generic stock footage into a cohesive visual identity.

interface ColorProfile {
  name: string;
  /** FFmpeg eq filter values */
  brightness: number;    // -1.0 to 1.0
  contrast: number;      // 0.0 to 2.0
  saturation: number;    // 0.0 to 3.0
  /** Additional filters */
  grain: boolean;        // noise filter for texture
  vignette: boolean;     // dark edges for focus
  colorBalance?: {       // Teal/orange, cold, warm
    shadowsRed: number;
    shadowsGreen: number;
    shadowsBlue: number;
    highlightsRed: number;
    highlightsGreen: number;
    highlightsBlue: number;
  };
}

const PROFILES: Record<string, ColorProfile> = {
  darkCinematic: {
    name: 'Dark Cinematic',
    brightness: -0.1,
    contrast: 1.3,
    saturation: 0.7,
    grain: true,
    vignette: true,
  },
  tealOrange: {
    name: 'Teal & Orange (Hollywood)',
    brightness: 0,
    contrast: 1.1,
    saturation: 1.2,
    grain: false,
    vignette: true,
    colorBalance: {
      shadowsRed: 0.1, shadowsGreen: -0.1, shadowsBlue: -0.2,
      highlightsRed: -0.1, highlightsGreen: 0.05, highlightsBlue: 0.15,
    },
  },
  coldClinical: {
    name: 'Cold/Clinical (Tech)',
    brightness: 0,
    contrast: 1.2,
    saturation: 0.6,
    grain: false,
    vignette: false,
    colorBalance: {
      shadowsRed: -0.15, shadowsGreen: 0, shadowsBlue: 0.15,
      highlightsRed: -0.1, highlightsGreen: 0, highlightsBlue: 0.1,
    },
  },
};
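
To show how a profile like these could feed the render step, here is a minimal sketch that turns a profile into an FFmpeg filter chain string. The function name `toFilterChain` is hypothetical; the grain and vignette values (`noise=alls=15:allf=t+u`, `vignette=PI/4:1.2`) are the ones used in this article's commands, applied here as fixed defaults, which is an assumption.

```typescript
interface ColorBalance {
  shadowsRed: number; shadowsGreen: number; shadowsBlue: number;
  highlightsRed: number; highlightsGreen: number; highlightsBlue: number;
}

interface ColorProfile {
  name: string;
  brightness: number;
  contrast: number;
  saturation: number;
  grain: boolean;
  vignette: boolean;
  colorBalance?: ColorBalance;
}

/** Builds a comma-separated filter chain: colorbalance (if any), eq, grain, vignette. */
function toFilterChain(p: ColorProfile): string {
  const parts: string[] = [];
  if (p.colorBalance) {
    const c = p.colorBalance;
    parts.push(
      `colorbalance=rs=${c.shadowsRed}:gs=${c.shadowsGreen}:bs=${c.shadowsBlue}` +
        `:rh=${c.highlightsRed}:gh=${c.highlightsGreen}:bh=${c.highlightsBlue}`
    );
  }
  parts.push(`eq=brightness=${p.brightness}:contrast=${p.contrast}:saturation=${p.saturation}`);
  if (p.grain) parts.push("noise=alls=15:allf=t+u"); // assumed fixed grain strength
  if (p.vignette) parts.push("vignette=PI/4:1.2");   // assumed fixed vignette shape
  return parts.join(",");
}

const darkCinematic: ColorProfile = {
  name: "Dark Cinematic",
  brightness: -0.1,
  contrast: 1.3,
  saturation: 0.7,
  grain: true,
  vignette: true,
};

console.log(toFilterChain(darkCinematic));
```

The resulting string can be passed to `-vf` or spliced into a larger `-filter_complex` graph.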

Concept 5: The Track System

Professional video is composed in tracks, like audio mixing. Each track is a visual layer.

interface TrackSystem {
  /** Track 1 (bottom): Full-screen background - stock footage with Ken Burns */
  background: {
    content: 'stock_video' | 'stock_image_with_zoompan';
    movement: KenBurnsEffect;
    colorGrade: ColorProfile;
  };

  /** Track 2 (middle): Subject matter - the thing you're talking about */
  subject: {
    content: 'screenshot' | 'chart' | 'product_image' | 'person';
    scale: 0.8;          // 80% of frame size
    opacity: 0.9;        // Slightly transparent
    position: 'center';
    shadow: boolean;     // Drop shadow for depth
  };

  /** Track 3 (top): Text and data overlays */
  text: {
    content: 'caption' | 'statistic' | 'title' | 'counter';
    animation: 'slide_up' | 'fade_in' | 'typewriter' | 'scale_pop';
    font: 'Inter-Black' | 'Montserrat-Bold';
    position: 'bottom_third' | 'center' | 'top_bar';
  };
}

Key insight: The 3-track system creates perceived depth. Track 1 moves, Track 2 is semi-transparent with a shadow, Track 3 animates text. The viewer’s brain interprets this as a 3D space, which registers as “professional production” even when every asset is stock footage composited with FFmpeg.


The 6 Production Levels

Each level builds on the previous one. The cost column assumes local FFmpeg rendering on consumer hardware.

Level 1: Basic (The Slideshow)

What it looks like: Static clips + TTS + captions + hard cuts.

What’s wrong: Everything is static. The viewer’s brain sees “nothing is happening” and swipes away. No movement, no transitions, no audio design. This is where most AI video pipelines stop.

ffmpeg -i clip1.mp4 -i clip2.mp4 -i clip3.mp4 -i voiceover.mp3 \
  -filter_complex "[0:v][1:v][2:v]concat=n=3:v=1:a=0[v]" \
  -map "[v]" -map 3:a \
  -c:v libx264 -c:a aac -shortest output.mp4

| Attribute | Value |
| --- | --- |
| Retention impact | Baseline |
| Viewer perception | “AI slideshow” |
| Implementation time | 1 hour |
| Cost per video | $0.01 |

Level 2: Movement (Ken Burns + Transitions)

What changes: Every clip moves. Transitions flow. The video feels alive.

This is the single biggest quality jump β€” going from static to moving. Apply Ken Burns to every visual and crossfade between clips.

ffmpeg -loop 1 -i image1.jpg -vf \
  "zoompan=z='min(zoom+0.002,1.2)':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d=150:s=1080x1920:fps=30" \
  -t 5 -c:v libx264 -pix_fmt yuv420p clip1_kb.mp4

ffmpeg -loop 1 -i image2.jpg -vf \
  "zoompan=z='if(eq(on,1),1.3,max(zoom-0.002,1.0))':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d=150:s=1080x1920:fps=30" \
  -t 5 -c:v libx264 -pix_fmt yuv420p clip2_kb.mp4

ffmpeg -i clip1_kb.mp4 -i clip2_kb.mp4 \
  -filter_complex "xfade=transition=fade:duration=1:offset=4" \
  -c:v libx264 -pix_fmt yuv420p output.mp4

| Attribute | Value |
| --- | --- |
| Retention impact | +25-35% watch time |
| Viewer perception | “Looks like a real video” |
| Implementation time | 4 hours |
| Cost per video | $0.01 |

Level 3: Layering (3-Track Composition)

What changes: Visual depth. Background moves, subject floats with shadow, text overlays add data.

# The background is a video, so zoompan uses d=1 (one output frame per input frame)
# and carries the zoom forward with pzoom; a larger d would freeze on the first frame.
# fps=30 assumes a 30 fps source.
ffmpeg -i background.mp4 -i subject.png -i voice.mp3 \
  -filter_complex "
    [0:v]zoompan=z='min(max(zoom,pzoom)+0.001,1.15)':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d=1:s=1080x1920:fps=30,
    eq=brightness=-0.1:contrast=1.3:saturation=0.7,
    noise=alls=15:allf=t+u,
    vignette=PI/4:1.2[bg];
    [1:v]scale=864:-1,format=rgba,colorchannelmixer=aa=0.9[fg];
    [bg][fg]overlay=(W-w)/2:(H-h)/2[comp];
    [comp]drawtext=text='\$3,200 SAVED':fontfile=Inter-Black.ttf:fontsize=100:
    fontcolor=white:borderw=4:bordercolor=black:
    x=(w-text_w)/2:y=h*0.75:enable='between(t,3,6)'[out]
  " \
  -map "[out]" -map 2:a \
  -c:v libx264 -c:a aac -t 10 output.mp4

| Attribute | Value |
| --- | --- |
| Retention impact | +31% over single-layer |
| Viewer perception | “Professional production” |
| Implementation time | 8 hours |
| Cost per video | $0.01-0.02 |

Level 4: Sound Design (Music + SFX + Ducking)

What changes: Audio becomes an experience. Background music sets mood, sound effects punctuate moments, voice ducks the music automatically.

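# sidechaincompress compresses its FIRST input using the SECOND as the key signal,
# and outputs only the compressed first input. The voice is therefore split: one copy
# keys the ducking of music + SFX, the other is mixed back in on top.
ffmpeg -i composed_video.mp4 -i music.mp3 -i whoosh.mp3 -i bass_drop.mp3 \
  -filter_complex "
    [0:a]asplit=2[voice][voice_key];
    [1:a]volume=0.15[bg_music];
    [2:a]adelay=4000|4000,volume=0.5[whoosh];
    [3:a]adelay=1000|1000,volume=0.7[bass];
    [whoosh][bass]amix=inputs=2[sfx];
    [bg_music][sfx]amix=inputs=2[music_sfx];
    [music_sfx][voice_key]sidechaincompress=threshold=0.02:ratio=8:attack=200:release=1000[ducked];
    [voice][ducked]amix=inputs=2:duration=first[final_audio]
  " \
  -map 0:v -map "[final_audio]" \
  -c:v copy -c:a aac output.mp4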

| Attribute | Value |
| --- | --- |
| Retention impact | +15-20% perceived quality |
| Viewer perception | “This has a production team” |
| Implementation time | 12 hours |
| Cost per video | $0.01-0.02 |

Level 5: Typography (Kinetic Text + Counters)

What changes: Text animates. Numbers count up. Statistics slide in. Kinetic typography improves retention by 25-50%.

ffmpeg -i composed_video.mp4 -vf "
  drawtext=text='\$%{eif\\:min(floor((t-2)*1600)\\,3200)\\:d}':
  fontfile=Inter-Black.ttf:fontsize=120:fontcolor=white:
  borderw=4:bordercolor=black:
  x=(w-text_w)/2:y=h*0.4:
  enable='between(t,2,5)',

  drawtext=text='ANNUAL SAVINGS':
  fontfile=Inter-Bold.ttf:fontsize=48:fontcolor=white@0.8:
  x=(w-text_w)/2:y=h*0.4+130:
  enable='between(t,2.5,5)'
" -c:v libx264 -c:a copy output.mp4

| Attribute | Value |
| --- | --- |
| Retention impact | +25-50% engagement |
| Viewer perception | “Motion graphics quality” |
| Implementation time | 16 hours |
| Cost per video | $0.02-0.03 |

Level 6: Professional (Speed Ramping + Match Cuts + Nano-Hooks)

What changes: Pacing becomes aggressive. Speed ramping on dramatic moments. Visual changes every 1.5 seconds on hooks. Multiple clips per sentence with match-cut transitions.

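# Video: normal speed for 0-3 s, half speed for source seconds 3-5, then normal again.
# Each branch continues from where the previous one ended, so timestamps stay monotonic.
# Audio is passed through unchanged and therefore ends 2 s before the video.
ffmpeg -i clip.mp4 -filter_complex "
  [0:v]setpts='
    if(lt(T,3), PTS,
    if(lt(T,5), PTS+(T-3)/TB,
    PTS+2/TB))'[v];
  [0:a]atempo=1.0[a]
" -map "[v]" -map "[a]" -c:v libx264 output.mp4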

| Attribute | Value |
| --- | --- |
| Retention impact | Approaches human-edited quality |
| Viewer perception | “Can’t tell this is AI” |
| Implementation time | 24+ hours |
| Cost per video | $0.02-0.05 |

Key insight: 73% of viewers cannot distinguish AI-assisted video from traditionally produced video when Level 4+ techniques are applied. The quality ceiling for programmatic video is far higher than most pipelines reach.

Level Progression Summary

| Level | Name | Key Addition | Retention Lift | Cumulative Effect |
| --- | --- | --- | --- | --- |
| 1 | Basic | Clips + TTS | Baseline | “Slideshow” |
| 2 | Movement | Ken Burns + crossfades | +25-35% | “Looks like video” |
| 3 | Layering | 3-track + overlays + grading | +31% | “Professional” |
| 4 | Sound | Music + SFX + ducking | +15-20% | “Has a production team” |
| 5 | Typography | Kinetic text + counters | +25-50% | “Motion graphics” |
| 6 | Professional | Speed ramps + nano-hooks | +10-15% | “Can’t tell it’s AI” |

The 4 Human Editor Secrets

These come from studying million-dollar YouTube channels. They are the techniques that separate amateur from professional, and most AI pipelines implement none of them.

Secret 1: Aggressive Pacing and Silence Removal

Human editors obsessively remove silence. Any gap longer than 0.3 seconds between words gets cut. Word endings overlap with the next word’s beginning for a relentless pace.

The tool: whisper-timestamped, which provides word-level timestamps from OpenAI Whisper with silence detection built in.

interface SilenceRemoval {
  /** Maximum silence duration before trimming */
  maxSilenceMs: 300;

  /** Overlap word boundaries for pace */
  overlapMs: 50;

  /** Preserve intentional pauses (before reveals) */
  preserveDramaticPause: boolean;

  /** whisper-timestamped output gives us word-level timestamps */
  timestampSource: 'whisper-timestamped';
}

// whisper-timestamped output format
interface WhisperWord {
  text: string;
  start: number;  // seconds
  end: number;    // seconds
  confidence: number;
}

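// A gap between two consecutive words
interface Silence {
  start: number;      // seconds
  end: number;        // seconds
  durationMs: number;
}

// Find silences longer than threshold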
function findSilences(words: WhisperWord[], thresholdMs: number): Silence[] {
  const silences: Silence[] = [];
  for (let i = 0; i < words.length - 1; i++) {
    const gap = (words[i + 1].start - words[i].end) * 1000;
    if (gap > thresholdMs) {
      silences.push({
        start: words[i].end,
        end: words[i + 1].start,
        durationMs: gap,
      });
    }
  }
  return silences;
}
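
The same word timestamps can also drive the trim itself. The sketch below is one possible way to merge words into keep-ranges and emit an `atrim`/`concat` filtergraph; the helper names (`keepSegments`, `buildTrimFilter`) are hypothetical, and `asetpts=PTS-STARTPTS` is added after each trim so every segment starts at timestamp zero before concatenation.

```typescript
interface WordTiming {
  start: number; // seconds
  end: number;   // seconds
}

interface KeepSegment {
  start: number;
  end: number;
}

/** Merges consecutive words into one segment unless the gap exceeds the threshold. */
function keepSegments(words: WordTiming[], thresholdMs: number): KeepSegment[] {
  const segments: KeepSegment[] = [];
  for (const w of words) {
    const last = segments[segments.length - 1];
    if (last && (w.start - last.end) * 1000 <= thresholdMs) {
      last.end = w.end; // short gap: extend the current segment
    } else {
      segments.push({ start: w.start, end: w.end }); // long gap: start a new segment
    }
  }
  return segments;
}

/** Emits one atrim per kept segment plus a final audio-only concat. */
function buildTrimFilter(segments: KeepSegment[]): string {
  if (segments.length === 0) throw new Error("no segments to keep");
  const trims = segments.map(
    (s, i) => `[0:a]atrim=start=${s.start}:end=${s.end},asetpts=PTS-STARTPTS[s${i + 1}]`
  );
  const labels = segments.map((_, i) => `[s${i + 1}]`).join("");
  return `${trims.join(";")};${labels}concat=n=${segments.length}:v=0:a=1[out]`;
}

const words: WordTiming[] = [
  { start: 0, end: 1 },
  { start: 1.1, end: 2.3 },
  { start: 2.8, end: 5.1 },
];
console.log(buildTrimFilter(keepSegments(words, 300)));
```

The returned string is meant to be passed as the `-filter_complex` argument, with `-map "[out]"` selecting the trimmed audio.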

FFmpeg implementation (trim silences):

whisper_timestamped audio.mp3 --model small --language en --output_format json

ffmpeg -i audio.mp3 -filter_complex "
  [0:a]atrim=start=0:end=2.3,asetpts=PTS-STARTPTS[s1];
  [0:a]atrim=start=2.8:end=5.1,asetpts=PTS-STARTPTS[s2];
  [0:a]atrim=start=5.9:end=8.4,asetpts=PTS-STARTPTS[s3];
  [s1][s2][s3]concat=n=3:v=0:a=1[out]
" -map "[out]" trimmed_audio.mp3

Secret 2: Ken Burns Parallax (Movement is Mandatory)

No visual asset is EVER static. Every image gets a zoompan effect. Every video clip gets repositioned. The human editor treats stillness as a bug.

The library (7 movement variants):

enum KenBurnsMovement {
  ZOOM_IN = 'zoom_in',           // Intimacy, focus
  ZOOM_OUT = 'zoom_out',         // Reveals context, scale
  PAN_LEFT = 'pan_left',         // Follows action
  PAN_RIGHT = 'pan_right',       // Reverse motion
  PAN_DOWN = 'pan_down',         // Gravity, weight, reveal
  PAN_UP = 'pan_up',             // Aspiration, growth
  DIAGONAL_DRIFT = 'diagonal',   // Subtle, cinematic
}

// Rule: Never two consecutive clips with the same movement.
// `previous` is null for the first clip, when any movement is allowed.
function selectMovement(previous: KenBurnsMovement | null): KenBurnsMovement {
  const movements = Object.values(KenBurnsMovement);
  let next: KenBurnsMovement;
  do {
    next = movements[Math.floor(Math.random() * movements.length)];
  } while (next === previous);
  return next;
}

Complete FFmpeg commands for all 7 variants (1080x1920 vertical, 5 seconds, 30fps):

ffmpeg -loop 1 -i image.jpg -vf \
  "zoompan=z='min(zoom+0.002,1.2)':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d=150:s=1080x1920:fps=30" \
  -t 5 -c:v libx264 -pix_fmt yuv420p zoom_in.mp4

ffmpeg -loop 1 -i image.jpg -vf \
  "zoompan=z='if(eq(on,1),1.3,max(zoom-0.002,1.0))':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d=150:s=1080x1920:fps=30" \
  -t 5 -c:v libx264 -pix_fmt yuv420p zoom_out.mp4

ffmpeg -loop 1 -i image.jpg -vf \
  "zoompan=z='1.15':x='if(eq(on,1),0,min(x+2,iw-iw/zoom))':y='ih/2-(ih/zoom/2)':d=150:s=1080x1920:fps=30" \
  -t 5 -c:v libx264 -pix_fmt yuv420p pan_right.mp4

ffmpeg -loop 1 -i image.jpg -vf \
  "zoompan=z='1.15':x='if(eq(on,1),iw-iw/zoom,max(x-2,0))':y='ih/2-(ih/zoom/2)':d=150:s=1080x1920:fps=30" \
  -t 5 -c:v libx264 -pix_fmt yuv420p pan_left.mp4

ffmpeg -loop 1 -i image.jpg -vf \
  "zoompan=z='1.15':x='iw/2-(iw/zoom/2)':y='if(eq(on,1),0,min(y+2,ih-ih/zoom))':d=150:s=1080x1920:fps=30" \
  -t 5 -c:v libx264 -pix_fmt yuv420p pan_down.mp4

ffmpeg -loop 1 -i image.jpg -vf \
  "zoompan=z='1.15':x='iw/2-(iw/zoom/2)':y='if(eq(on,1),ih-ih/zoom,max(y-2,0))':d=150:s=1080x1920:fps=30" \
  -t 5 -c:v libx264 -pix_fmt yuv420p pan_up.mp4

ffmpeg -loop 1 -i image.jpg -vf \
  "zoompan=z='1.2':x='if(eq(on,1),0,min(x+1.5,iw-iw/zoom))':y='if(eq(on,1),0,min(y+1,ih-ih/zoom))':d=150:s=1080x1920:fps=30" \
  -t 5 -c:v libx264 -pix_fmt yuv420p diagonal.mp4
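
Rather than maintaining seven hand-written commands, the filter string can be generated from the movement name. This is a minimal sketch with hypothetical names (`zoompanFilter`, `EXPRESSIONS`); the expressions are copied from the commands above, and the defaults (5 seconds, 30 fps, 1080x1920) match the stated settings.

```typescript
type Movement =
  | "zoom_in" | "zoom_out"
  | "pan_left" | "pan_right"
  | "pan_down" | "pan_up"
  | "diagonal";

// Keeps the crop window centered while the zoom level changes.
const CENTER_X = "iw/2-(iw/zoom/2)";
const CENTER_Y = "ih/2-(ih/zoom/2)";

const EXPRESSIONS: Record<Movement, { z: string; x: string; y: string }> = {
  zoom_in: { z: "min(zoom+0.002,1.2)", x: CENTER_X, y: CENTER_Y },
  zoom_out: { z: "if(eq(on,1),1.3,max(zoom-0.002,1.0))", x: CENTER_X, y: CENTER_Y },
  pan_right: { z: "1.15", x: "if(eq(on,1),0,min(x+2,iw-iw/zoom))", y: CENTER_Y },
  pan_left: { z: "1.15", x: "if(eq(on,1),iw-iw/zoom,max(x-2,0))", y: CENTER_Y },
  pan_down: { z: "1.15", x: CENTER_X, y: "if(eq(on,1),0,min(y+2,ih-ih/zoom))" },
  pan_up: { z: "1.15", x: CENTER_X, y: "if(eq(on,1),ih-ih/zoom,max(y-2,0))" },
  diagonal: {
    z: "1.2",
    x: "if(eq(on,1),0,min(x+1.5,iw-iw/zoom))",
    y: "if(eq(on,1),0,min(y+1,ih-ih/zoom))",
  },
};

/** Builds the zoompan filter for a still image; d is the clip length in frames. */
function zoompanFilter(movement: Movement, seconds = 5, fps = 30, size = "1080x1920"): string {
  const { z, x, y } = EXPRESSIONS[movement];
  const frames = Math.round(seconds * fps);
  return `zoompan=z='${z}':x='${x}':y='${y}':d=${frames}:s=${size}:fps=${fps}`;
}

console.log(zoompanFilter("pan_right", 3));
```

The output is intended as the `-vf` value for a looped still image, as in the commands above.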

Secret 3: B-Roll Layering (The Opacity Trick)

The 3-track composition creates perceived depth that single-layer video cannot match:

Track 3 (top):    Kinetic typography, captions, data overlays
Track 2 (middle): Subject at 80% scale, 90% opacity, drop shadow
Track 1 (bottom): Moving abstract background with color grade

FFmpeg multi-track composition:

# FFmpeg filtergraphs have no comment syntax, so the track notes live outside the string:
#   Track 1 -> [bg]:      scale/crop, Ken Burns, color grade, grain, vignette
#   Track 2 -> [subject]: 80% scale, 90% opacity, overlaid on [bg] as [comp]
#   Track 3 -> [final]:   text overlays (data reveal at 3-7 s)
# The background is a video, so zoompan uses d=1 with pzoom (fps=30 assumes a 30 fps source).
ffmpeg -i abstract_bg.mp4 -i subject.png \
  -filter_complex "
    [0:v]scale=1080:1920:force_original_aspect_ratio=increase,crop=1080:1920,
    zoompan=z='min(max(zoom,pzoom)+0.001,1.1)':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d=1:s=1080x1920:fps=30,
    eq=brightness=-0.1:contrast=1.3:saturation=0.7,
    noise=alls=15:allf=t+u,
    vignette=PI/4:1.2[bg];

    [1:v]scale=864:-1,format=rgba,colorchannelmixer=aa=0.9[subject];

    [bg][subject]overlay=(W-w)/2:(H-h)/2[comp];

    [comp]drawtext=text='\$3,200':fontfile=Inter-Black.ttf:fontsize=120:
    fontcolor=white:borderw=5:bordercolor=black@0.8:
    x=(w-text_w)/2:y=h*0.3:
    enable='between(t,3,7)',
    drawtext=text='ANNUAL TAX SAVINGS':fontfile=Inter-Bold.ttf:fontsize=40:
    fontcolor=white@0.8:x=(w-text_w)/2:y=h*0.3+140:
    enable='between(t,3.5,7)'[final]
  " \
  -map "[final]" -c:v libx264 -pix_fmt yuv420p -t 10 output.mp4

Secret 4: 3-Second Visual Hook Swap

The visual on screen must change every 3 seconds. For each script segment, divide the duration by 3, then fetch that many distinct visuals.

interface VisualSwapStrategy {
  segmentDurationSec: number;
  visualCount: number;          // Math.ceil(segmentDurationSec / 3)
  transitionDuration: 0.5;      // Half-second crossfade between visuals
  searchStrategy: 'varied';     // Each visual uses different search terms

  /** Example: "LLC saves you $3,200 per year on taxes" (9 seconds)
   *  Visual 1 (0-3s): Business office footage - zoom in
   *  Visual 2 (3-6s): Calculator/spreadsheet - pan right
   *  Visual 3 (6-9s): Money/savings imagery - zoom out
   */
}

function planVisuals(segment: ScriptSegment): VisualPlan[] {
  const count = Math.ceil(segment.durationSec / 3);
  const visualDuration = segment.durationSec / count;

  // Build the list with a loop so each plan can read the previous movement;
  // inside an Array.from callback the result array does not exist yet.
  const plans: VisualPlan[] = [];
  for (let i = 0; i < count; i++) {
    plans.push({
      searchQuery: segment.visualSearchTerms[i],
      startTime: segment.startTime + i * visualDuration,
      duration: visualDuration,
      kenBurns: selectMovement(i > 0 ? plans[i - 1].kenBurns : null),
      transition: i > 0 ? 'crossfade' : 'none',
    });
  }
  return plans;
}

FFmpeg Effects Reference

Complete reference for every visual and audio effect, with exact syntax.

Ken Burns (zoompan filter)

The zoompan filter accepts values for zoom between 1 and 10. Key parameters:

| Parameter | Description | Default |
| --- | --- | --- |
| z | Zoom expression (1.0 = no zoom) | 1 |
| x | Horizontal pan position | 0 |
| y | Vertical pan position | 0 |
| d | Duration in frames (fps * seconds) | 90 |
| s | Output size | 1280x720 |
| fps | Output frame rate | 25 |

Zoom speed reference:

| Effect | Zoom increment | Duration feel |
| --- | --- | --- |
| Barely perceptible | +0.0005/frame | Very subtle, cinematic |
| Gentle | +0.001/frame | Natural, documentary |
| Standard | +0.002/frame | Noticeable, engaging |
| Aggressive | +0.004/frame | Dramatic, attention-grabbing |
| Fast | +0.008/frame | Action, urgency |

Center-zoom formula explained:

x='iw/2-(iw/zoom/2)'  →  Centers horizontally as zoom changes
y='ih/2-(ih/zoom/2)'  →  Centers vertically as zoom changes
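
The centering formula follows from the crop window being `iw/zoom` wide: the left edge sits at half the input width minus half the window width. The sketch below works that arithmetic through, along with how many frames a given zoom increment needs to reach its cap. Both function names are hypothetical helpers, not part of FFmpeg.

```typescript
/** Top-left offset (in input pixels) that keeps a zoomed window centered on one axis. */
function centerOffset(inputSize: number, zoom: number): number {
  if (zoom < 1) throw new RangeError("zoom must be at least 1");
  return inputSize / 2 - inputSize / zoom / 2;
}

/** Frames needed for a linear per-frame increment to go from startZoom to targetZoom. */
function framesToReachZoom(targetZoom: number, incrementPerFrame: number, startZoom = 1): number {
  if (incrementPerFrame <= 0) throw new RangeError("increment must be positive");
  // The small epsilon absorbs floating-point noise before rounding up.
  return Math.ceil((targetZoom - startZoom) / incrementPerFrame - 1e-9);
}

// A 1080 px wide image at zoom 1.2 shows a 900 px window, so x is about 90.
console.log(centerOffset(1080, 1.2));

// The "Standard" increment (+0.002/frame) reaches 1.2x after about 100 frames,
// roughly 3.3 seconds at 30 fps.
console.log(framesToReachZoom(1.2, 0.002));
```

This kind of check helps when choosing `d`: if the zoom cap is reached well before the clip ends, the image sits still for the remaining frames.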

Transitions (xfade filter)

The xfade filter provides 40+ built-in transitions between two video streams.

Syntax:

ffmpeg -i clip1.mp4 -i clip2.mp4 \
  -filter_complex "xfade=transition=TRANSITION_NAME:duration=SECONDS:offset=SECONDS" \
  output.mp4

Complete transition list with use cases:

| Transition | Visual Effect | Best For |
| --- | --- | --- |
| fade | Crossfade between the two clips | Universal, safe default |
| wipeleft | Wipe from right to left | Forward progress |
| wiperight | Wipe from left to right | Flashback, reverse |
| wipeup | Wipe from bottom to top | Aspiration, growth |
| wipedown | Wipe from top to bottom | Gravity, grounding |
| slideleft | Second clip slides in from right | Fast pace, lists |
| slideright | Second clip slides in from left | Fast pace |
| slideup | Second clip slides in from bottom | Reveals |
| slidedown | Second clip slides in from top | Drops, emphasis |
| circlecrop | Circle expanding from center | Focus, spotlight |
| rectcrop | Rectangle expanding from center | Data reveals |
| distance | Pixel distance blend | Abstract, artistic |
| fadeblack | Fade through black | Scene change, time jump |
| fadewhite | Fade through white | Dream, flashback |
| radial | Radial wipe | Clock-like, time-based |
| smoothleft | Smooth left transition | Professional, clean |
| smoothright | Smooth right transition | Professional |
| smoothup | Smooth upward transition | Growth narrative |
| smoothdown | Smooth downward transition | Grounding |
| circleopen | Circle opening out | Spotlight reveal |
| circleclose | Circle closing in | Focus, ending |
| vertopen | Vertical blinds opening | Data, corporate |
| vertclose | Vertical blinds closing | Closing sequence |
| horzopen | Horizontal blinds opening | Reveal |
| horzclose | Horizontal blinds closing | Closing |
| diagtl | Diagonal from top-left | Dynamic, energetic |
| diagtr | Diagonal from top-right | Variety |
| diagbl | Diagonal from bottom-left | Variety |
| diagbr | Diagonal from bottom-right | Variety |
| hlslice | Horizontal left slice | Glitch, tech |
| hrslice | Horizontal right slice | Glitch, tech |
| vuslice | Vertical up slice | Glitch, tech |
| vdslice | Vertical down slice | Glitch, tech |
| dissolve | Random pixel dissolve | Soft, emotional moments |
| pixelize | Pixelation transition | Retro, gaming |
| hblur | Horizontal blur transition | Speed, motion |
| fadegrays | Fade through grayscale | Dramatic |
| squeezeh | Horizontal squeeze | Compression |
| squeezev | Vertical squeeze | Compression |
Chaining multiple transitions (3+ clips):

ffmpeg -i clip1.mp4 -i clip2.mp4 -i clip3.mp4 -i clip4.mp4 \
  -filter_complex "
    [0:v][1:v]xfade=transition=fade:duration=0.5:offset=4.5[v01];
    [v01][2:v]xfade=transition=slideleft:duration=0.5:offset=9[v012];
    [v012][3:v]xfade=transition=circlecrop:duration=0.5:offset=13.5[vout]
  " \
  -map "[vout]" output.mp4

Key insight: The offset parameter is cumulative: it’s the timestamp in the OUTPUT where the transition starts, not the input. For a chain, each offset = previous_offset + clip_duration - transition_duration.
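
The offset rule is easy to get wrong by hand once clips have different lengths, so it may be worth computing. The sketch below is one way to do that; `xfadeOffsets` and `buildXfadeChain` are hypothetical helper names, and a single transition duration is assumed for the whole chain.

```typescript
/** Offsets for a chain of xfades: each one is the composed output length so far minus the overlap. */
function xfadeOffsets(clipDurations: number[], transitionDuration: number): number[] {
  const offsets: number[] = [];
  let elapsed = 0;
  for (let i = 0; i < clipDurations.length - 1; i++) {
    elapsed += clipDurations[i] - transitionDuration;
    offsets.push(elapsed);
  }
  return offsets;
}

/** Builds the chained filtergraph; the last step is labeled [vout]. */
function buildXfadeChain(
  clipDurations: number[],
  transitions: string[],
  transitionDuration: number
): string {
  if (transitions.length !== clipDurations.length - 1) {
    throw new Error("need exactly one transition per pair of adjacent clips");
  }
  const offsets = xfadeOffsets(clipDurations, transitionDuration);
  let previous = "[0:v]";
  const steps = offsets.map((offset, i) => {
    const out = i === offsets.length - 1 ? "[vout]" : `[x${i + 1}]`;
    const step =
      `${previous}[${i + 1}:v]xfade=transition=${transitions[i]}` +
      `:duration=${transitionDuration}:offset=${offset}${out}`;
    previous = out;
    return step;
  });
  return steps.join(";");
}

// Four 5-second clips with 0.5-second transitions give offsets 4.5, 9 and 13.5.
console.log(xfadeOffsets([5, 5, 5, 5], 0.5));
console.log(buildXfadeChain([5, 5, 5, 5], ["fade", "slideleft", "circlecrop"], 0.5));
```

The clip durations are inputs here; in a real pipeline they would come from probing each file rather than being assumed.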

Extended transitions with xfade-easing:

The xfade-easing project adds CSS easing functions and ported GLSL transitions to FFmpeg’s xfade filter, expanding from 40 to 200+ transitions.

Color Grading

Dark Cinematic (finance/business):

ffmpeg -i clip.mp4 -vf \
  "eq=brightness=-0.1:contrast=1.3:saturation=0.7,curves=preset=darker" \
  output.mp4

Teal and Orange (Hollywood):

ffmpeg -i clip.mp4 -vf \
  "colorbalance=rs=0.1:gs=-0.1:bs=-0.2:rh=-0.1:gh=0.05:bh=0.15" \
  output.mp4

Cold/Clinical (tech/business):

ffmpeg -i clip.mp4 -vf \
  "colorbalance=rs=-0.15:bs=0.15:rh=-0.1:bh=0.1,eq=contrast=1.2:saturation=0.6" \
  output.mp4

Black and White High Contrast:

ffmpeg -i clip.mp4 -vf \
  "hue=s=0,eq=contrast=1.5:brightness=0.05" \
  output.mp4

Warm Nostalgic (lifestyle):

ffmpeg -i clip.mp4 -vf \
  "colorbalance=rs=0.1:gs=0.05:bs=-0.1:rh=0.05:gh=0.02:bh=-0.05,eq=saturation=1.1:brightness=0.05" \
  output.mp4

Film Grain:

ffmpeg -i clip.mp4 -vf "noise=alls=15:allf=t+u" output.mp4

ffmpeg -i clip.mp4 -vf "noise=alls=30:allf=t+u" output.mp4

ffmpeg -i clip.mp4 -vf "noise=alls=20:allf=t" output.mp4

Vignette:

ffmpeg -i clip.mp4 -vf "vignette=PI/4:1.2" output.mp4

ffmpeg -i clip.mp4 -vf "vignette=PI/6:0.8" output.mp4

ffmpeg -i clip.mp4 -vf "vignette=PI/3:1.5" output.mp4

Complete color grade chain (apply all at once):

ffmpeg -i clip.mp4 -vf \
  "eq=brightness=-0.1:contrast=1.3:saturation=0.7,
   noise=alls=15:allf=t+u,
   vignette=PI/4:1.2" \
  -c:v libx264 -c:a copy output.mp4

Text Overlays (drawtext filter)

The drawtext filter renders text on video frames.

Static text with background box:

ffmpeg -i clip.mp4 -vf \
  "drawtext=text='90% FAIL':expansion=none:
   fontfile=Inter-Black.ttf:fontsize=100:
   fontcolor=white:
   box=1:boxcolor=red@0.8:boxborderw=20:
   x=(w-text_w)/2:y=h*0.3:
   enable='between(t,1,4)'" \
  output.mp4

Text with border (no box):

ffmpeg -i clip.mp4 -vf \
  "drawtext=text='\$3,200':
   fontfile=Inter-Black.ttf:fontsize=120:
   fontcolor=white:
   borderw=4:bordercolor=black:
   x=(w-text_w)/2:y=(h-text_h)/2:
   enable='between(t,3,6)'" \
  output.mp4

Slide-up from bottom:

ffmpeg -i clip.mp4 -vf \
  "drawtext=text='ANNUAL SAVINGS':
   fontfile=Inter-Black.ttf:fontsize=80:fontcolor=white:
   borderw=3:bordercolor=black:
   x=(w-text_w)/2:
   y='if(between(t,3,3.5), h - (h - h*0.4)*(t-3)/0.5, h*0.4)':
   enable='between(t,3,7)'" \
  output.mp4

Counter animation (counting up to a number):

ffmpeg -i clip.mp4 -vf \
  "drawtext=text='\$%{eif\\:min(floor((t-2)*1600)\\,3200)\\:d}':
   fontfile=Inter-Black.ttf:fontsize=120:fontcolor=white:
   borderw=4:bordercolor=black:
   x=(w-text_w)/2:y=h*0.4:
   enable='between(t,2,5)'" \
  output.mp4
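
The counter's rate (1600 per second above) is simply the target divided by the ramp time, so it can be derived instead of hard-coded. A minimal sketch follows, with hypothetical helper names. It produces the bare expression only; the drawtext escaping shown in the command above still has to be applied when embedding it.

```typescript
/** Bare counting expression for use inside drawtext's %{eif:...:d}, before escaping. */
function counterExpression(target: number, startSec: number, rampSec: number): string {
  if (rampSec <= 0) throw new RangeError("rampSec must be positive");
  const perSecond = target / rampSec;
  return `min(floor((t-${startSec})*${perSecond}),${target})`;
}

/** Evaluates the same formula in TypeScript, useful for previewing what a frame will show. */
function counterValueAt(t: number, target: number, startSec: number, rampSec: number): number {
  return Math.min(Math.floor((t - startSec) * (target / rampSec)), target);
}

// Counting to 3200 over 2 seconds, starting at t=2, reproduces the expression used above.
console.log(counterExpression(3200, 2, 2));
console.log(counterValueAt(3, 3200, 2, 2)); // halfway through the ramp
```

Like the FFmpeg expression, `counterValueAt` does not clamp below zero; the `enable` window keeps the text hidden before the start time.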

Countdown timer:

ffmpeg -i clip.mp4 -vf \
  "drawtext=text='%{eif\\:10-floor(t)\\:d}':
   fontfile=Inter-Black.ttf:fontsize=200:fontcolor=red:
   borderw=5:bordercolor=white:
   x=(w-text_w)/2:y=(h-text_h)/2:
   enable='between(t,0,10)'" \
  output.mp4

Fade-in text (opacity animation):

ffmpeg -i clip.mp4 -vf \
  "drawtext=text='THE LLC TRAP':
   fontfile=Inter-Black.ttf:fontsize=100:
   fontcolor=white:
   borderw=4:bordercolor=black:
   alpha='if(lt(t,2.5),(t-2)*2,1)':
   x=(w-text_w)/2:y=h*0.3:
   enable='between(t,2,5)'" \
  output.mp4

Multiple text overlays (stacked data):

ffmpeg -i clip.mp4 -vf "
  drawtext=text='LLC COST BREAKDOWN':fontfile=Inter-Black.ttf:fontsize=60:
    fontcolor=white:borderw=3:bordercolor=black:
    x=(w-text_w)/2:y=h*0.15:enable='between(t,1,8)',
  drawtext=text='State Fee\: \$100':fontfile=Inter-Bold.ttf:fontsize=48:
    fontcolor=white:x=(w-text_w)/2:y=h*0.30:enable='between(t,2,8)',
  drawtext=text='Agent\: \$120':fontfile=Inter-Bold.ttf:fontsize=48:
    fontcolor=white:x=(w-text_w)/2:y=h*0.38:enable='between(t,3,8)',
  drawtext=text='EIN\: FREE':fontfile=Inter-Bold.ttf:fontsize=48:
    fontcolor=green:x=(w-text_w)/2:y=h*0.46:enable='between(t,4,8)',
  drawtext=text='TOTAL\: \$220':fontfile=Inter-Black.ttf:fontsize=72:
    fontcolor=yellow:borderw=4:bordercolor=black:
    x=(w-text_w)/2:y=h*0.58:enable='between(t,5,8)'
" output.mp4

Audio Effects

Voice + music mixing (voice at 100%, music at 15%):

ffmpeg -i voice.mp3 -i music.mp3 \
  -filter_complex "[1:a]volume=0.15[bg];[0:a][bg]amix=inputs=2:duration=first" \
  output.mp3

Sidechain compression (duck music when voice plays):

ffmpeg -i voice.mp3 -i music.mp3 \
  -filter_complex \
  "[0:a]asplit=2[voice][key];[1:a]volume=0.3[bg];[bg][key]sidechaincompress=threshold=0.02:ratio=8:attack=200:release=1000[ducked];[voice][ducked]amix=inputs=2:duration=first" \
  output.mp3

| Parameter | Value | Effect |
| --- | --- | --- |
| threshold | 0.02 | Sensitivity: lower = more ducking |
| ratio | 8 | Compression amount: higher = more aggressive duck |
| attack | 200ms | How fast music ducks when voice starts |
| release | 1000ms | How fast music returns when voice stops |

Sound effect at specific timestamp:

ffmpeg -i main.mp4 -i whoosh.mp3 \
  -filter_complex "[1:a]adelay=3000|3000,volume=0.5[sfx];[0:a][sfx]amix=inputs=2:duration=first" \
  output.mp4
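
When a video has several cues, the same `adelay` plus `amix` pattern can be generated from a cue list rather than written by hand. The sketch below assumes input 0 carries the main audio and each cue refers to a later input by index; `SfxCue` and `buildSfxMix` are hypothetical names.

```typescript
interface SfxCue {
  inputIndex: number; // position of the SFX file among the ffmpeg -i inputs
  atMs: number;       // when the effect should start, in milliseconds
  volume: number;     // linear gain, e.g. 0.5
}

/** Delays each cue, mixes the cues into one bus, then mixes that bus with the main audio. */
function buildSfxMix(cues: SfxCue[]): string {
  if (cues.length === 0) throw new Error("at least one cue is required");
  const parts = cues.map(
    (c, i) => `[${c.inputIndex}:a]adelay=${c.atMs}|${c.atMs},volume=${c.volume}[sfx${i + 1}]`
  );
  let bus = "[sfx1]";
  if (cues.length > 1) {
    const labels = cues.map((_, i) => `[sfx${i + 1}]`).join("");
    parts.push(`${labels}amix=inputs=${cues.length}[all_sfx]`);
    bus = "[all_sfx]";
  }
  // duration=first keeps the output as long as the main audio track.
  parts.push(`[0:a]${bus}amix=inputs=2:duration=first[out]`);
  return parts.join(";");
}

const cues: SfxCue[] = [
  { inputIndex: 1, atMs: 500, volume: 0.7 },
  { inputIndex: 2, atMs: 4000, volume: 0.5 },
  { inputIndex: 3, atMs: 7000, volume: 0.4 },
];
console.log(buildSfxMix(cues));
```

The cue times would typically come from the scene-change and data-reveal timestamps already known to the pipeline.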

Multiple SFX at different timestamps:

ffmpeg -i main.mp4 -i bass_drop.mp3 -i whoosh.mp3 -i cash.mp3 \
  -filter_complex "
    [1:a]adelay=500|500,volume=0.7[sfx1];
    [2:a]adelay=4000|4000,volume=0.5[sfx2];
    [3:a]adelay=7000|7000,volume=0.4[sfx3];
    [sfx1][sfx2][sfx3]amix=inputs=3[all_sfx];
    [0:a][all_sfx]amix=inputs=2:duration=first[out]
  " -map 0:v -map "[out]" output.mp4

Audio normalization:

ffmpeg -i audio.mp3 -filter:a "loudnorm=I=-16:TP=-1.5:LRA=11" normalized.mp3

ffmpeg -i audio.mp3 -filter:a "dynaudnorm" normalized.mp3

Speed Effects

Slow motion (0.5x):

ffmpeg -i clip.mp4 -vf "setpts=2.0*PTS" -af "atempo=0.5" slow.mp4

Fast motion (2x):

ffmpeg -i clip.mp4 -vf "setpts=0.5*PTS" -af "atempo=2.0" fast.mp4

Speed ramp (normal to slow to normal):

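# Seconds 3-5 play at half speed. Later frames are shifted by the 2 extra seconds
# so timestamps keep increasing. Audio is dropped because it is not retimed here.
ffmpeg -i clip.mp4 -vf \
  "setpts='if(lt(T,3),PTS,if(lt(T,5),PTS+(T-3)/TB,PTS+2/TB))'" \
  -an speed_ramp.mp4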

Time-lapse (4x speed):

ffmpeg -i clip.mp4 -vf "setpts=0.25*PTS" -an timelapse.mp4
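
The video and audio factors in these commands are inverses of each other: `setpts` multiplies timestamps by `1/speed`, while `atempo` takes the speed directly. The sketch below derives both from one number. `speedFilters` is a hypothetical helper, and it conservatively keeps each `atempo` stage within 0.5 to 2.0 by chaining stages, which is an assumption chosen for compatibility with older builds.

```typescript
// Build matching video and audio filters for a constant playback speed.
// speed > 1 is faster, speed < 1 is slower.
function speedFilters(speed: number): { video: string; audio: string } {
  if (!(speed > 0)) {
    throw new RangeError("speed must be positive");
  }
  // setpts scales timestamps, so its factor is the inverse of the speed.
  const video = `setpts=${1 / speed}*PTS`;

  // Keep every atempo stage within 0.5..2.0 and chain stages as needed.
  const stages: number[] = [];
  let remaining = speed;
  while (remaining > 2.0) {
    stages.push(2.0);
    remaining /= 2.0;
  }
  while (remaining < 0.5) {
    stages.push(0.5);
    remaining /= 0.5;
  }
  stages.push(remaining);
  const audio = stages.map((s) => `atempo=${s}`).join(",");
  return { video, audio };
}

console.log(speedFilters(0.5)); // video "setpts=2*PTS", audio "atempo=0.5"
console.log(speedFilters(4));   // video "setpts=0.25*PTS", audio "atempo=2,atempo=2"
```

For a time-lapse the audio string can simply be ignored and `-an` used instead.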

Overlay and Picture-in-Picture

Basic overlay (logo/watermark):

ffmpeg -i main.mp4 -i logo.png \
  -filter_complex "[1:v]scale=100:-1[logo];[0:v][logo]overlay=W-w-20:20" \
  output.mp4

Picture-in-picture with border:

ffmpeg -i main.mp4 -i pip.mp4 \
  -filter_complex "
    [1:v]scale=320:180,pad=iw+8:ih+8:4:4:white[pip];
    [0:v][pip]overlay=W-w-20:H-h-20
  " output.mp4

Opacity blending:

ffmpeg -i background.mp4 -i overlay.png \
  -filter_complex "
    [1:v]format=rgba,colorchannelmixer=aa=0.5[fg];
    [0:v][fg]overlay=0:0
  " output.mp4

Drop shadow effect (simulate with offset dark copy):

ffmpeg -i background.mp4 -i subject.png \
  -filter_complex "
    [1:v]scale=800:-1[subj];
    [subj]split[shadow][main];
    [shadow]colorchannelmixer=rr=0:gg=0:bb=0:aa=0.5,
    boxblur=10:10[shadow_blur];
    [0:v][shadow_blur]overlay=(W-w)/2+5:(H-h)/2+5[with_shadow];
    [with_shadow][main]overlay=(W-w)/2:(H-h)/2
  " output.mp4

Hook Archetypes

The first 3 seconds determine everything. These 6 hook archetypes are proven patterns from viral content analysis.

The 6 Archetypes

interface HookArchetype {
  name: string;
  pattern: string;
  psychology: string;
  examples: string[];
  visualTreatment: string;
}

const ARCHETYPES: HookArchetype[] = [
  {
    name: 'Fortuneteller',
    pattern: 'Teases a future outcome the viewer wants',
    psychology: 'Curiosity gap: viewer must watch to see if it applies to them',
    examples: [
      'How to double your savings in 2026',
      'Your LLC is about to cost you $3,200 less',
      'What your accountant won\'t tell you about Q4',
    ],
    visualTreatment: 'Crystal ball / chart trending up / calendar with circled date',
  },
  {
    name: 'Magician',
    pattern: 'Reveals a surprising condensation or transformation',
    psychology: 'Value perception: massive input compressed into digestible output',
    examples: [
      'I condensed 50 finance books into 60 seconds',
      'The entire tax code in one sentence',
      '10 years of investing mistakes so you don\'t have to',
    ],
    visualTreatment: 'Stack of books → single page / time-lapse / before-after split',
  },
  {
    name: 'Contrarian',
    pattern: 'Challenges commonly accepted knowledge',
    psychology: 'Pattern interrupt: brain flags contradictions as important',
    examples: [
      'Stock market experts HATE this one simple rule',
      'Stop saving for retirement (here\'s why)',
      'The LLC advice everyone gives is dead wrong',
    ],
    visualTreatment: 'Red X over conventional wisdom / crossed-out text / head shake',
  },
  {
    name: 'Provocateur',
    pattern: 'Makes a controversial or emotionally charged statement',
    psychology: 'Emotional activation: anger/outrage increases engagement',
    examples: [
      'California is robbing you blind with this tax',
      'Your bank is stealing $400/year and you don\'t know it',
      'The IRS designed this system to keep you poor',
    ],
    visualTreatment: 'Red background / alarm / angry emoji / bold accusatory text',
  },
  {
    name: 'Statistician',
    pattern: 'Opens with a shocking number',
    psychology: 'Concrete specificity signals authority and triggers recall',
    examples: [
      '$1.4 billion evaporated in 48 hours',
      '93% of LLCs overpay by $2,100 per year',
      '1 in 4 Americans can\'t cover a $400 emergency',
    ],
    visualTreatment: 'Large number filling screen / counter animation / data visualization',
  },
  {
    name: 'Questioner',
    pattern: 'Poses an engaging question the viewer wants answered',
    psychology: 'Open loop: the brain seeks closure on unanswered questions',
    examples: [
      'How much do you REALLY need to quit your job?',
      'What would you do with an extra $3,200?',
      'Are you in the 93% who overpay on taxes?',
    ],
    visualTreatment: 'Question mark animation / thinking emoji / person looking puzzled',
  },
];

Hook Visual Treatment Matrix

| Archetype | Text Style | Background | Animation | SFX |
| --- | --- | --- | --- | --- |
| Fortuneteller | Gold/white, elegant | Dark, moody | Slow zoom in | Mystical tone |
| Magician | Bold white, large | Book/knowledge imagery | Scale pop | Whoosh |
| Contrarian | Red/white, aggressive | Crossed-out text | Shake effect | Record scratch |
| Provocateur | Red bold, ALL CAPS | Dark red gradient | Flash/pulse | Bass drop |
| Statistician | Yellow/white numbers | Dark with data viz | Counter animation | Cash register |
| Questioner | White italic | Person thinking | Typewriter | Question sound |

Implementing Hooks in FFmpeg

ffmpeg -i background.mp4 -i bass_drop.mp3 \
  -filter_complex "
    [0:v]eq=brightness=-0.15:contrast=1.4:saturation=0.6,
    vignette=PI/3:1.3[bg];

    [bg]drawtext=text='\$%{eif\\:min(floor((t-0.5)*700000000)\\,1400000000)\\:d}':
    fontfile=Inter-Black.ttf:fontsize=80:fontcolor=yellow:
    borderw=4:bordercolor=black:
    x=(w-text_w)/2:y=h*0.35:
    enable='between(t,0.5,3)',

    drawtext=text='evaporated in 48 hours':
    fontfile=Inter-Bold.ttf:fontsize=50:fontcolor=white:
    x=(w-text_w)/2:y=h*0.35+100:
    enable='between(t,1.5,3)'[hooked];

    [1:a]adelay=500|500,volume=0.7[sfx];
    [hooked]null[vout]
  " \
  -map "[vout]" -map "[sfx]" \
  -c:v libx264 -c:a aac -t 3 hook.mp4
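
A counter of the form `min(floor((t - start) * rate), target)` reaches its target `target / rate` seconds after `start`, so the rate has to be derived from the target and the desired ramp time. For a target of 1,400,000,000 and a 2-second ramp, that is 700,000,000 per second. The sketch below captures this arithmetic. The function names are hypothetical, and the returned expansion string is shown before any shell or filtergraph escaping.

```typescript
// Rate (units per second) needed for the counter to hit `target`
// exactly `rampSeconds` after it starts.
function counterRate(target: number, rampSeconds: number): number {
  if (rampSeconds <= 0) {
    throw new RangeError("rampSeconds must be positive");
  }
  return target / rampSeconds;
}

// Value shown at time t, mirroring min(floor((t - start) * rate), target).
function counterValueAt(t: number, start: number, rate: number, target: number): number {
  if (t < start) {
    return 0;
  }
  return Math.min(Math.floor((t - start) * rate), target);
}

// Assemble the drawtext text expansion (unescaped form).
function counterExpansion(start: number, rate: number, target: number): string {
  return `%{eif:min(floor((t-${start})*${rate}),${target}):d}`;
}

const hookRate = counterRate(1400000000, 2);
console.log(hookRate);                                       // 700000000
console.log(counterValueAt(1.5, 0.5, hookRate, 1400000000)); // 700000000
console.log(counterExpansion(8, 1600, 3200));
```

Checking a few timestamps this way before rendering is cheaper than discovering after a render that the number never reached its target.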

Patterns: Production-Quality Compositions

Pattern 1: The Data Story (Finance/Business)

A complete video that reveals a financial insight with progressive data disclosure.

When to use: Financial education, tax tips, business analysis, calculator demos.

#!/bin/bash

FONT_BLACK="Inter-Black.ttf"
FONT_BOLD="Inter-Bold.ttf"
FONT_REG="Inter-Regular.ttf"
RESOLUTION="1080x1920"
FPS=30

ffmpeg -loop 1 -i assets/office.jpg -vf \
  "zoompan=z='min(zoom+0.002,1.2)':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d=$((FPS*4)):s=${RESOLUTION}:fps=${FPS}" \
  -t 4 -c:v libx264 -pix_fmt yuv420p /tmp/scene1.mp4

ffmpeg -loop 1 -i assets/calculator.jpg -vf \
  "zoompan=z='1.15':x='if(eq(on,1),0,min(x+2,iw-iw/zoom))':y='ih/2-(ih/zoom/2)':d=$((FPS*4)):s=${RESOLUTION}:fps=${FPS}" \
  -t 4 -c:v libx264 -pix_fmt yuv420p /tmp/scene2.mp4

ffmpeg -loop 1 -i assets/money.jpg -vf \
  "zoompan=z='if(eq(on,1),1.3,max(zoom-0.002,1.0))':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d=$((FPS*4)):s=${RESOLUTION}:fps=${FPS}" \
  -t 4 -c:v libx264 -pix_fmt yuv420p /tmp/scene3.mp4

ffmpeg -loop 1 -i assets/document.jpg -vf \
  "zoompan=z='1.15':x='iw/2-(iw/zoom/2)':y='if(eq(on,1),0,min(y+1.5,ih-ih/zoom))':d=$((FPS*4)):s=${RESOLUTION}:fps=${FPS}" \
  -t 4 -c:v libx264 -pix_fmt yuv420p /tmp/scene4.mp4

ffmpeg -loop 1 -i assets/savings.jpg -vf \
  "zoompan=z='min(zoom+0.003,1.25)':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d=$((FPS*4)):s=${RESOLUTION}:fps=${FPS}" \
  -t 4 -c:v libx264 -pix_fmt yuv420p /tmp/scene5.mp4

ffmpeg -loop 1 -i assets/chart_up.jpg -vf \
  "zoompan=z='1.2':x='if(eq(on,1),0,min(x+1,iw-iw/zoom))':y='if(eq(on,1),0,min(y+0.7,ih-ih/zoom))':d=$((FPS*5)):s=${RESOLUTION}:fps=${FPS}" \
  -t 5 -c:v libx264 -pix_fmt yuv420p /tmp/scene6.mp4

ffmpeg -i /tmp/scene1.mp4 -i /tmp/scene2.mp4 -i /tmp/scene3.mp4 \
  -i /tmp/scene4.mp4 -i /tmp/scene5.mp4 -i /tmp/scene6.mp4 \
  -filter_complex "
    [0:v][1:v]xfade=transition=fade:duration=0.5:offset=3.5[v01];
    [v01][2:v]xfade=transition=slideleft:duration=0.5:offset=7[v012];
    [v012][3:v]xfade=transition=dissolve:duration=0.5:offset=10.5[v0123];
    [v0123][4:v]xfade=transition=smoothup:duration=0.5:offset=14[v01234];
    [v01234][5:v]xfade=transition=circlecrop:duration=0.5:offset=17.5[vchain]
  " -map "[vchain]" -c:v libx264 -pix_fmt yuv420p /tmp/chained.mp4

ffmpeg -i /tmp/chained.mp4 -vf "
  eq=brightness=-0.1:contrast=1.3:saturation=0.7,
  noise=alls=12:allf=t+u,
  vignette=PI/4:1.2,

  drawtext=expansion=none:text='93% of LLCs overpay':fontfile=${FONT_BLACK}:fontsize=80:
    fontcolor=white:borderw=4:bordercolor=black:
    x=(w-text_w)/2:y=h*0.35:enable='between(t,0,3)',

  drawtext=text='Here is what they miss':fontfile=${FONT_BOLD}:fontsize=50:
    fontcolor=white@0.8:x=(w-text_w)/2:y=h*0.35+100:
    enable='between(t,0.5,3)',

  drawtext=text='State Fee':fontfile=${FONT_BOLD}:fontsize=48:
    fontcolor=white:x=100:y=h*0.25:enable='between(t,4,10)',
  drawtext=text='\$100':fontfile=${FONT_BLACK}:fontsize=60:
    fontcolor=green:x=w-300:y=h*0.25:enable='between(t,4,10)',

  drawtext=text='Agent Fee':fontfile=${FONT_BOLD}:fontsize=48:
    fontcolor=white:x=100:y=h*0.33:enable='between(t,5,10)',
  drawtext=text='\$0':fontfile=${FONT_BLACK}:fontsize=60:
    fontcolor=green:x=w-300:y=h*0.33:enable='between(t,5,10)',

  drawtext=text='Tax Savings':fontfile=${FONT_BOLD}:fontsize=48:
    fontcolor=white:x=100:y=h*0.41:enable='between(t,6,10)',
  drawtext=text='-\$3,200':fontfile=${FONT_BLACK}:fontsize=60:
    fontcolor=yellow:x=w-350:y=h*0.41:enable='between(t,6,10)',

  drawtext=text='TOTAL SAVINGS':fontfile=${FONT_BLACK}:fontsize=70:
    fontcolor=yellow:borderw=4:bordercolor=black:
    x=(w-text_w)/2:y=h*0.55:enable='between(t,8,12)',

  drawtext=text='\$%{eif\\:min(floor((t-8)*1600)\\,3200)\\:d}':
    fontfile=${FONT_BLACK}:fontsize=140:fontcolor=white:
    borderw=5:bordercolor=black:
    x=(w-text_w)/2:y=h*0.63:enable='between(t,8,12)',

  drawtext=text='Link in bio':fontfile=${FONT_BOLD}:fontsize=40:
    fontcolor=white@0.7:x=(w-text_w)/2:y=h*0.85:
    enable='between(t,10,14)'
" -c:v libx264 -c:a copy /tmp/graded.mp4

ffmpeg -i /tmp/graded.mp4 -i voiceover.mp3 -i music_dark_cinematic.mp3 \
  -i sfx/bass_drop.mp3 -i sfx/whoosh.mp3 -i sfx/cash_register.mp3 \
  -filter_complex "
    [2:a]volume=0.15,afade=t=in:ss=0:d=2,afade=t=out:st=19:d=2[music];
    [3:a]adelay=500|500,volume=0.7[bass];
    [4:a]adelay=3500|3500,volume=0.4[whoosh1];
    [4:a]adelay=7000|7000,volume=0.4[whoosh2];
    [5:a]adelay=8000|8000,volume=0.5[cash];
    [bass][whoosh1][whoosh2][cash]amix=inputs=4[all_sfx];
    [music][all_sfx]amix=inputs=2[bg_audio];
    [1:a]asplit=2[voice][key];
    [bg_audio][key]sidechaincompress=threshold=0.02:ratio=8:attack=200:release=1000[ducked];
    [voice][ducked]amix=inputs=2:duration=first[final_audio]
  " \
  -map 0:v -map "[final_audio]" \
  -c:v copy -c:a aac -b:a 192k -shortest output.mp4

Gotchas:

Pattern 2: The Comparison Grid (Product/Tool)

Side-by-side comparison with animated data points.

When to use: Tool comparisons, product reviews, A-vs-B decisions.

#!/bin/bash

# Layout: scale each source to 540x960, pad each into a 540x1920 column,
# then hstack the two columns into a 1080x1920 frame.
# Overlays: labels and the VS divider from 0s, cost comparison from 3s,
# savings callout from 6s.
# Filtergraphs have no comment syntax, so these notes stay outside the quoted string.
ffmpeg -i left_approach.mp4 -i right_approach.mp4 \
  -filter_complex "
    [0:v]scale=540:960,
    eq=brightness=-0.05:contrast=1.2:saturation=0.8,
    pad=540:1920:0:0:black[left];

    [1:v]scale=540:960,
    eq=brightness=-0.05:contrast=1.2:saturation=0.8,
    pad=540:1920:0:960:black[right];

    [left][right]hstack[grid];

    [grid]drawtext=text='TRADITIONAL':fontfile=Inter-Black.ttf:fontsize=40:
      fontcolor=red:x=270-text_w/2:y=50:enable='between(t,0,15)',
    drawtext=text='OPTIMIZED':fontfile=Inter-Black.ttf:fontsize=40:
      fontcolor=green:x=810-text_w/2:y=50:enable='between(t,0,15)',

    drawtext=text='VS':fontfile=Inter-Black.ttf:fontsize=60:
      fontcolor=yellow:box=1:boxcolor=black@0.8:boxborderw=15:
      x=(w-text_w)/2:y=h*0.48:enable='between(t,0,15)',

    drawtext=text='\$4,500/yr':fontfile=Inter-Black.ttf:fontsize=50:
      fontcolor=red:x=270-text_w/2:y=h*0.6:enable='between(t,3,15)',
    drawtext=text='\$1,300/yr':fontfile=Inter-Black.ttf:fontsize=50:
      fontcolor=green:x=810-text_w/2:y=h*0.6:enable='between(t,3,15)',

    drawtext=text='SAVE \$3,200':fontfile=Inter-Black.ttf:fontsize=70:
      fontcolor=yellow:borderw=4:bordercolor=black:
      x=(w-text_w)/2:y=h*0.75:enable='between(t,6,15)'
  " \
  -c:v libx264 -pix_fmt yuv420p output.mp4

Pattern 3: The Explainer (Educational/How-To)

Step-by-step tutorial with numbered progression.

When to use: How-to guides, process explanations, tutorial content.

#!/bin/bash

STEPS=("Open the calculator" "Enter your income" "Select LLC type" "Review deductions" "See your savings")
IMAGES=(calculator.jpg income.jpg llc_type.jpg deductions.jpg savings.jpg)
MOVEMENTS=("zoom_in" "pan_right" "zoom_out" "pan_down" "zoom_in")

for i in "${!STEPS[@]}"; do
  STEP_NUM=$((i + 1))
  STEP_TEXT="${STEPS[$i]}"

  # Determine zoompan expression based on movement type
  case "${MOVEMENTS[$i]}" in
    zoom_in)  ZP="z='min(zoom+0.002,1.2)':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)'" ;;
    zoom_out) ZP="z='if(eq(on,1),1.3,max(zoom-0.002,1.0))':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)'" ;;
    pan_right) ZP="z='1.15':x='if(eq(on,1),0,min(x+2,iw-iw/zoom))':y='ih/2-(ih/zoom/2)'" ;;
    pan_down) ZP="z='1.15':x='iw/2-(iw/zoom/2)':y='if(eq(on,1),0,min(y+1.5,ih-ih/zoom))'" ;;
  esac

  ffmpeg -loop 1 -i "assets/${IMAGES[$i]}" -vf "
    zoompan=${ZP}:d=120:s=1080x1920:fps=30,
    eq=brightness=-0.08:contrast=1.2:saturation=0.8,
    vignette=PI/5:1.0,

    drawtext=text='STEP ${STEP_NUM}':fontfile=Inter-Black.ttf:fontsize=90:
      fontcolor=yellow:borderw=4:bordercolor=black:
      x=(w-text_w)/2:y=h*0.25,

    drawtext=text='${STEP_TEXT}':fontfile=Inter-Bold.ttf:fontsize=50:
      fontcolor=white:x=(w-text_w)/2:y=h*0.25+120

  " -t 4 -c:v libx264 -pix_fmt yuv420p "/tmp/step${STEP_NUM}.mp4"
done

ffmpeg -i /tmp/step1.mp4 -i /tmp/step2.mp4 -i /tmp/step3.mp4 \
  -i /tmp/step4.mp4 -i /tmp/step5.mp4 \
  -filter_complex "
    [0:v][1:v]xfade=transition=slideleft:duration=0.5:offset=3.5[v01];
    [v01][2:v]xfade=transition=slideright:duration=0.5:offset=7[v012];
    [v012][3:v]xfade=transition=slideleft:duration=0.5:offset=10.5[v0123];
    [v0123][4:v]xfade=transition=circlecrop:duration=0.5:offset=14[vout]
  " -map "[vout]" output.mp4

Pattern 4: The Montage (Motivation/Compilation)

Rapid-fire clips with music-driven pacing.

When to use: Motivational content, compilations, brand sizzle reels.

#!/bin/bash

CLIPS=(clip1.mp4 clip2.mp4 clip3.mp4 clip4.mp4 clip5.mp4
       clip6.mp4 clip7.mp4 clip8.mp4 clip9.mp4 clip10.mp4)
TRANSITIONS=(fade slideleft dissolve diagtl smoothup
             circlecrop slidedown fadeblack radial fade)

for i in "${!CLIPS[@]}"; do
  ffmpeg -i "assets/${CLIPS[$i]}" -vf "
    scale=1080:1920:force_original_aspect_ratio=increase,crop=1080:1920,
    zoompan=z='min(pzoom+0.004,1.3)':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d=1:s=1080x1920:fps=30,
    eq=brightness=-0.15:contrast=1.5:saturation=0.5,
    noise=alls=20:allf=t+u,
    vignette=PI/3:1.4
  " -t 2 -c:v libx264 -pix_fmt yuv420p "/tmp/montage_${i}.mp4"
done

ffmpeg \
  -i /tmp/montage_0.mp4 -i /tmp/montage_1.mp4 -i /tmp/montage_2.mp4 \
  -i /tmp/montage_3.mp4 -i /tmp/montage_4.mp4 -i /tmp/montage_5.mp4 \
  -i /tmp/montage_6.mp4 -i /tmp/montage_7.mp4 -i /tmp/montage_8.mp4 \
  -i /tmp/montage_9.mp4 \
  -filter_complex "
    [0:v][1:v]xfade=transition=${TRANSITIONS[0]}:duration=0.3:offset=1.7[v01];
    [v01][2:v]xfade=transition=${TRANSITIONS[1]}:duration=0.3:offset=3.4[v02];
    [v02][3:v]xfade=transition=${TRANSITIONS[2]}:duration=0.3:offset=5.1[v03];
    [v03][4:v]xfade=transition=${TRANSITIONS[3]}:duration=0.3:offset=6.8[v04];
    [v04][5:v]xfade=transition=${TRANSITIONS[4]}:duration=0.3:offset=8.5[v05];
    [v05][6:v]xfade=transition=${TRANSITIONS[5]}:duration=0.3:offset=10.2[v06];
    [v06][7:v]xfade=transition=${TRANSITIONS[6]}:duration=0.3:offset=11.9[v07];
    [v07][8:v]xfade=transition=${TRANSITIONS[7]}:duration=0.3:offset=13.6[v08];
    [v08][9:v]xfade=transition=${TRANSITIONS[8]}:duration=0.3:offset=15.3[vout]
  " -map "[vout]" montage.mp4

ffmpeg -i montage.mp4 -i music_epic.mp3 \
  -filter_complex "[1:a]volume=0.4,afade=t=in:d=1,afade=t=out:st=16:d=1[m];[m]atrim=0:17[mt]" \
  -map 0:v -map "[mt]" -c:v copy -c:a aac -shortest output.mp4
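
The `offset` values in an xfade chain follow a simple rule: each transition overlaps the tail of the running output with the head of the next clip, so every join adds the clip length minus the transition length. The sketch below computes the offsets and assembles the chain. `xfadeOffsets` and `buildXfadeChain` are hypothetical helpers, and the label scheme (`[v1]`, `[v2]`, ..., `[vout]`) is an arbitrary choice.

```typescript
// Offsets (in seconds) for chaining clips with equal-length xfade transitions.
function xfadeOffsets(clipDurations: number[], transition: number): number[] {
  const offsets: number[] = [];
  let running = 0;
  // The last clip has no outgoing transition, so it is skipped.
  for (const d of clipDurations.slice(0, -1)) {
    running += d - transition;
    offsets.push(Math.round(running * 1000) / 1000); // trim float noise
  }
  return offsets;
}

// Build the filter_complex chain, cycling through the given transition names.
function buildXfadeChain(
  clipDurations: number[],
  transitionNames: string[],
  transition: number
): string {
  if (clipDurations.length < 2 || transitionNames.length === 0) {
    throw new Error("need at least two clips and one transition name");
  }
  const offsets = xfadeOffsets(clipDurations, transition);
  let prev = "[0:v]";
  const parts = offsets.map((offset, i) => {
    const name = transitionNames[i % transitionNames.length];
    const out = i === offsets.length - 1 ? "[vout]" : `[v${i + 1}]`;
    const part = `${prev}[${i + 1}:v]xfade=transition=${name}:duration=${transition}:offset=${offset}${out}`;
    prev = out;
    return part;
  });
  return parts.join(";");
}

console.log(xfadeOffsets([2, 2, 2, 2, 2, 2, 2, 2, 2, 2], 0.3));
// [1.7, 3.4, 5.1, 6.8, 8.5, 10.2, 11.9, 13.6, 15.3]
```

The same function reproduces the 3.5, 7, 10.5, 14, 17.5 sequence for 4-second scenes with 0.5-second transitions, so clip lengths can change without recalculating offsets by hand.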

Small Examples

Example 1: Quick Ken Burns from a Single Image

ffmpeg -loop 1 -i photo.jpg -vf \
  "zoompan=z='min(zoom+0.0015,1.15)':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d=150:s=1080x1920:fps=30" \
  -t 5 -c:v libx264 -pix_fmt yuv420p output.mp4
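
With a per-frame increment and a zoom cap, the zoom stops growing once the cap is reached. At `zoom+0.0015` capped at 1.15, that happens after about 100 frames (roughly 3.3 seconds at 30 fps), leaving the rest of a 5-second clip static. The sketch below shows the arithmetic for picking an increment that spans the whole clip. The helper names are hypothetical, and the model assumes zoom starts at 1.0 and grows by a fixed step per output frame.

```typescript
// Number of frames until a zoom that starts at 1.0 first reaches the cap.
function framesToReachCap(increment: number, maxZoom: number): number {
  // The small epsilon guards against floating-point noise in the division.
  return Math.ceil((maxZoom - 1) / increment - 1e-9);
}

// Per-frame increment that spreads the zoom evenly over the whole clip.
function incrementForDuration(maxZoom: number, fps: number, seconds: number): number {
  return (maxZoom - 1) / (fps * seconds);
}

console.log(framesToReachCap(0.0015, 1.15));                // 100
console.log(incrementForDuration(1.15, 30, 5).toFixed(4));  // "0.0010"
```

Under these assumptions, an increment of about 0.001 keeps the image moving for all 150 frames of a 5-second clip.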

Example 2: Crossfade Between Two Clips

ffmpeg -i clip1.mp4 -i clip2.mp4 \
  -filter_complex "[0:v][1:v]xfade=transition=dissolve:duration=1:offset=4" \
  -c:v libx264 output.mp4

Example 3: Add Background Music with Auto-Ducking

ffmpeg -i video_with_voice.mp4 -i background_music.mp3 \
  -filter_complex "
    [0:a]asplit=2[voice][key];
    [1:a]volume=0.15[music];
    [music][key]sidechaincompress=threshold=0.02:ratio=8:attack=200:release=1000[ducked];
    [voice][ducked]amix=inputs=2:duration=first[audio]
  " \
  -map 0:v -map "[audio]" -c:v copy -c:a aac output.mp4

Example 4: Animated Counter (0 to $10,000)

ffmpeg -i background.mp4 -vf \
  "drawtext=text='\$%{eif\\:min(floor((t-1)*5000)\\,10000)\\:d}':
   fontfile=Inter-Black.ttf:fontsize=140:fontcolor=white:
   borderw=5:bordercolor=black:
   x=(w-text_w)/2:y=(h-text_h)/2:
   enable='between(t,1,4)'" \
  -c:v libx264 -c:a copy output.mp4

Example 5: Film Grain + Vignette + Color Grade in One Pass

ffmpeg -i raw_clip.mp4 -vf \
  "eq=brightness=-0.1:contrast=1.3:saturation=0.7,
   noise=alls=15:allf=t+u,
   vignette=PI/4:1.2" \
  -c:v libx264 -c:a copy cinematic.mp4

Example 6: Text with Background Box (Title Card)

ffmpeg -i clip.mp4 -vf \
  "drawtext=text='THE HIDDEN TAX TRAP':
   fontfile=Inter-Black.ttf:fontsize=70:fontcolor=white:
   box=1:boxcolor=red@0.85:boxborderw=25:
   x=(w-text_w)/2:y=h*0.4:
   enable='between(t,0,4)'" \
  -c:v libx264 -c:a copy output.mp4

Example 7: Sound Effect at Specific Timestamp

ffmpeg -i main_video.mp4 -i sfx/whoosh.mp3 \
  -filter_complex "
    [1:a]adelay=3000|3000,volume=0.5[sfx];
    [0:a][sfx]amix=inputs=2:duration=first[audio]
  " \
  -map 0:v -map "[audio]" -c:v copy -c:a aac output.mp4

Example 8: Speed Ramp (Normal to Slow-Mo)

ffmpeg -i action_clip.mp4 -filter_complex "
  [0:v]setpts='if(lt(T,3),PTS,if(lt(T,5),PTS+(T-3)/TB,PTS+2/TB))'[v]
" -map "[v]" -an -c:v libx264 output.mp4

Example 9: Picture-in-Picture Overlay

ffmpeg -i main.mp4 -i pip_source.mp4 \
  -filter_complex "
    [1:v]scale=280:500[pip];
    [0:v][pip]overlay=W-w-30:H-h-30:enable='between(t,5,15)'[out]
  " \
  -map "[out]" -map 0:a -c:v libx264 -c:a copy output.mp4

Example 10: Teal and Orange Hollywood Grade

ffmpeg -i footage.mp4 -vf \
  "colorbalance=rs=-0.1:gs=0.05:bs=0.15:rh=0.1:gh=-0.1:bh=-0.2,
   eq=contrast=1.1:saturation=1.2,
   vignette=PI/5:1.0" \
  -c:v libx264 -c:a copy hollywood.mp4

Cloud API Comparison

When FFmpeg is not enough, or when you need to scale beyond local rendering, these cloud APIs provide programmatic video creation.

Feature Comparison

| Feature | FFmpeg (local) | Shotstack | Creatomate | JSON2Video | Remotion | Plainly |
| --- | --- | --- | --- | --- | --- | --- |
| Ken Burns | zoompan filter | effects property | Keyframe animation | Manual positioning | CSS transforms | After Effects |
| Multi-track | filter_complex | Tracks + clips JSON | Layers in JSON | Single track | React composition | AE tracks |
| Text animation | drawtext (limited) | Basic text | Advanced (cascade, typewriter, bounce) | Basic text | Full React animations | AE keyframes |
| Transitions | xfade (40+) | Built-in set | Keyframe-based | Limited | Custom React | AE transitions |
| Color grading | eq, colorbalance | Filters | Adjustments | None | CSS filters | AE effects |
| Audio mixing | amix, sidechaincompress | Timeline audio | Audio layers | Basic | Web Audio API | AE audio |
| Templates | None (scripts) | JSON templates | Visual editor + JSON | None | Code templates | AE templates |
| Rendering | Local GPU/CPU | Cloud | Cloud | Cloud | Local/Lambda/Cloud | Cloud |
| Pricing | Free | $0.049/render (SD) | From $20/mo | Credits | $19/mo (Cloud Run) | From $59/mo |
| Best for | Full control, cost | JSON-driven automation | Template-based brands | Simple slideshows | Complex animations | Premium AE quality |
| Limitations | No visual editor, steep learning curve | Limited text animation | Monthly minimums | Single track only | React knowledge required | Expensive, AE dependency |

Detailed API Profiles

Shotstack

Shotstack renders video from JSON specifications via REST API. The composition model uses timelines, tracks, and clips.

// Shotstack JSON composition structure
interface ShotstackEdit {
  timeline: {
    tracks: Array<{
      clips: Array<{
        asset: {
          type: 'video' | 'image' | 'title' | 'audio' | 'html';
          src?: string;
          text?: string;
          style?: string;
        };
        start: number;     // seconds
        length: number;    // seconds
        transition?: {
          in: 'fade' | 'reveal' | 'wipeLeft' | 'slideLeft';
          out: 'fade' | 'reveal' | 'wipeRight' | 'slideRight';
        };
        effect?: 'zoomIn' | 'zoomOut' | 'slideLeft' | 'slideRight';
        filter?: 'greyscale' | 'boost' | 'contrast' | 'darken';
        opacity?: number;
        position?: 'center' | 'top' | 'bottom';
      }>;
    }>;
    background?: string;
  };
  output: {
    format: 'mp4' | 'gif';
    resolution: 'sd' | 'hd' | '1080';
    fps: number;
  };
}

Pros: Simple JSON model, good documentation, supports multi-track, built-in effects. Cons: Limited text animation, no keyframe control, effects are preset-only.

Pricing: $0.049/render (SD), $0.098/render (HD), $0.196/render (1080p). API docs.

Creatomate

Creatomate offers both template-based and JSON-from-scratch rendering with advanced text animations.

// Creatomate render request
interface CreatomateRender {
  source: {
    output_format: 'mp4' | 'gif' | 'png';
    width: number;
    height: number;
    duration: number;
    elements: Array<{
      type: 'video' | 'image' | 'text' | 'shape' | 'composition';
      source?: string;
      text?: string;
      x: string;       // Supports expressions and percentages
      y: string;
      width: string;
      height: string;
      animations?: Array<{
        type: 'scale' | 'fade' | 'slide' | 'text-typewriter' | 'text-cascade';
        time: 'start' | 'end' | number;
        duration: number;
        easing?: string;
      }>;
      keyframes?: Array<{
        time: number;
        value: Record<string, unknown>;
      }>;
    }>;
  };
}

Pros: Rich text animation (typewriter, cascade, bounce), keyframe support, visual template editor, good for branded content. Cons: Monthly subscription required, less control than raw FFmpeg for custom effects.

Pricing: Starts at $20/month. Developer docs.

Remotion

Remotion renders video using React components. Each frame is a React render.

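// Remotion video composition
// AnimatedText and CounterAnimation are assumed to be components defined elsewhere in the project.
import React from 'react';
import { AbsoluteFill, useCurrentFrame, interpolate, Sequence } from 'remotion';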

const KenBurnsImage: React.FC<{ src: string; direction: 'in' | 'out' }> = ({ src, direction }) => {
  const frame = useCurrentFrame();

  const scale = direction === 'in'
    ? interpolate(frame, [0, 150], [1, 1.2])
    : interpolate(frame, [0, 150], [1.3, 1.0]);

  return (
    <AbsoluteFill>
      <img
        src={src}
        style={{
          width: '100%',
          height: '100%',
          objectFit: 'cover',
          transform: `scale(${scale})`,
        }}
      />
    </AbsoluteFill>
  );
};

const DataStoryVideo: React.FC = () => {
  return (
    <>
      <Sequence from={0} durationInFrames={90}>
        <KenBurnsImage src="/office.jpg" direction="in" />
        <AnimatedText text="93% of LLCs overpay" y={0.35} />
      </Sequence>
      <Sequence from={90} durationInFrames={90}>
        <KenBurnsImage src="/calculator.jpg" direction="out" />
        <CounterAnimation target={3200} prefix="$" y={0.4} />
      </Sequence>
    </>
  );
};

Pros: Full React ecosystem, unlimited animation complexity, self-hostable, type-safe, testable. Cons: Requires React knowledge, local rendering needs good hardware, Lambda rendering has cold start overhead.

Pricing: Open source (local render is free). Cloud Run: $19/month. Lambda rendering: pay per invocation. Docs.

JSON2Video

JSON2Video converts JSON documents to video with built-in TTS and HTML element support.

// JSON2Video movie structure
interface JSON2VideoMovie {
  resolution: 'full-hd' | 'hd' | 'sd';
  quality: 'high' | 'medium' | 'low';
  scenes: Array<{
    background: string;
    elements: Array<{
      type: 'text' | 'image' | 'video' | 'audio' | 'html' | 'voice';
      src?: string;
      text?: string;
      start?: number;
      duration?: number;
      position?: 'center' | 'custom';
      x?: number;
      y?: number;
    }>;
  }>;
}

Pros: Simple API, built-in TTS, HTML/CSS element support. Cons: Single-track, limited animation, no multi-layer composition, credit-based pricing is unpredictable.

Pricing: Credit-based system. API docs.

Plainly

Plainly renders Adobe After Effects templates in the cloud, providing the highest visual quality at the highest cost.

Pros: Full After Effects quality, complex animations, professional templates, motion graphics. Cons: Requires After Effects knowledge to create templates, expensive, slower rendering.

Pricing: From $59/month for 20 minutes ($3/minute). 100 minutes at $249/month ($2.50/minute). Pricing page.

Decision Matrix

| If you need… | Use… | Because… |
| --- | --- | --- |
| Maximum control, lowest cost | FFmpeg locally | Free, full filter access, $0.01/video |
| JSON-driven automation at scale | Shotstack | Simple API, predictable per-render pricing |
| Branded templates with rich text | Creatomate | Best text animations, visual editor |
| Complex custom animations | Remotion | React-based, unlimited creativity |
| Simple slideshow with TTS | JSON2Video | Quickest to implement |
| Premium motion graphics | Plainly | After Effects quality in the cloud |
| All of the above | FFmpeg + graduate up | Start local, move to cloud when you hit limits |

Key insight: Start with FFmpeg. It handles Levels 1-5 with zero cost. Graduate to Shotstack or Creatomate only when you need features FFmpeg cannot provide, primarily complex text animation and template management. If you need React-level animation control, use Remotion. If you need After Effects quality, use Plainly.


Stock Resource Strategy

Video Footage

| Source | License | API | Quality | Volume |
| --- | --- | --- | --- | --- |
| Pexels | Free, attribution optional | REST API, 200 req/hr | HD-4K | 50K+ videos |
| Pixabay | Free, no attribution | REST API | HD-4K | 30K+ videos |
| Coverr | Free, no attribution | Manual download | HD | 2K+ videos |
| Mixkit | Free, no attribution | Manual download | HD-4K | 5K+ videos |

Recommendation: Use Pexels API as primary (best search, largest catalog, API access). Pixabay as secondary. Download Coverr/Mixkit clips locally for common backgrounds (abstract, nature, city, technology).
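
As a starting point for the Pexels integration, the sketch below builds a video search request and works out the request spacing implied by the 200 requests/hour limit. The endpoint, query parameters, and `Authorization` header reflect my understanding of the public Pexels API and should be checked against its current documentation. The helper names are hypothetical.

```typescript
interface PexelsVideoQuery {
  query: string;
  perPage?: number;
  orientation?: "landscape" | "portrait" | "square";
}

// Build the search URL. Only the query text needs URL encoding.
function buildPexelsVideoSearchUrl(q: PexelsVideoQuery): string {
  const params: string[] = [`query=${encodeURIComponent(q.query)}`];
  if (q.perPage !== undefined) {
    params.push(`per_page=${q.perPage}`);
  }
  if (q.orientation !== undefined) {
    params.push(`orientation=${q.orientation}`);
  }
  return `https://api.pexels.com/videos/search?${params.join("&")}`;
}

// The API key is sent as the Authorization header value (assumption, see above).
function pexelsHeaders(apiKey: string): Record<string, string> {
  return { Authorization: apiKey };
}

// Minimum spacing between requests to stay under an hourly limit.
function minIntervalMs(requestsPerHour: number): number {
  return Math.ceil(3600000 / requestsPerHour);
}

const searchUrl = buildPexelsVideoSearchUrl({
  query: "dark office",
  perPage: 5,
  orientation: "portrait",
});
console.log(searchUrl);
console.log(minIntervalMs(200)); // 18000
```

The URL and headers can be handed to any HTTP client. Spacing requests 18 seconds apart, or caching results by search term, keeps a batch job inside the stated limit.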

Music

| Source | License | Genres | Download |
| --- | --- | --- | --- |
| Pixabay Music | Free, no attribution | All genres | Direct |
| Mixkit Music | Free, no attribution | All genres | Direct |
| Freesound | CC (varies) | SFX + ambient | API |
| Uppbeat | Free tier available | All genres | Direct |

Pre-download library by mood (20-30 tracks):

| Mood | Use Case | Search Terms |
| --- | --- | --- |
| Dark cinematic | Finance, business | "dark cinematic", "tension", "corporate dark" |
| Corporate tech | SaaS, technology | "corporate tech", "innovation", "digital" |
| Motivational epic | Success, achievement | "epic motivation", "triumph", "inspirational" |
| Dramatic tension | Reveals, surprises | "suspense", "dramatic build", "anticipation" |
| Neutral ambient | How-to, tutorials | "ambient", "background", "minimal" |
| Upbeat energy | Lifestyle, marketing | "upbeat pop", "energetic", "happy" |

Sound Effects

Pre-download library (15-20 effects):

| SFX | Use Case | Source |
| --- | --- | --- |
| Whoosh (3 variants) | Transitions | Pixabay SFX |
| Bass drop | Hook reveal | Pixabay SFX |
| Cash register | Money mentions | Pixabay SFX |
| Keyboard typing | Data reveals | Freesound |
| Camera shutter | Screenshot moments | Pixabay SFX |
| Tension drone | Background suspense | Freesound |
| Success chime | CTA, completion | Pixabay SFX |
| Notification ping | Alerts, pop-ups | Pixabay SFX |
| Record scratch | Contrarian hooks | Pixabay SFX |
| Glass shatter | Breaking misconceptions | Pixabay SFX |

Fonts

| Font | Weight | Use Case | Source |
| --- | --- | --- | --- |
| Inter Black | 900 | Headlines, numbers | Google Fonts |
| Inter Bold | 700 | Subheads, labels | Google Fonts |
| Inter Regular | 400 | Body text, captions | Google Fonts |
| Montserrat Bold | 700 | Alternative headline | Google Fonts |

Download and install locally: FFmpeg's drawtext needs a font it can load, and the examples in this article pass a fontfile path to a local .ttf file.

Background Textures (download 5-10 looping videos)

| Texture | Mood | Search Terms |
| --- | --- | --- |
| Dark digital grid | Tech, data | "digital grid loop" |
| Abstract particles | Universal | "particles dark background" |
| Bokeh lights | Warm, lifestyle | "bokeh lights loop" |
| Smoke/fog | Dramatic, moody | "smoke dark background" |
| Matrix code rain | Tech, hacking | "matrix code loop" |
| Gradient flow | Modern, clean | "gradient abstract loop" |

Architecture Decisions

Where to Run Each Step

| Step | Run Where | Why |
| --- | --- | --- |
| Script generation | Cloud (Claude SDK, Gemini, Workers AI) | Quality matters: use the best model available |
| Voice synthesis | Local (edge-tts) or ElevenLabs API | edge-tts is free and fast; ElevenLabs for premium |
| Music | Stock libraries | Don't generate; curate from free libraries |
| Stock footage search | Pexels/Pixabay API | Already in API Mom, free tier is generous |
| Video rendering | Local FFmpeg | $0.01/video on consumer hardware |
| Caption generation | Local (whisper-timestamped) | Word-level timestamps, runs on GPU |

Cost Comparison

| Pipeline | Cost/Video | At 100 videos/month | Quality |
| --- | --- | --- | --- |
| Full local (FFmpeg + edge-tts) | $0.01-0.03 | $1-3 | Level 1-5 |
| Local + ElevenLabs voice | $0.10-0.30 | $10-30 | Level 1-5, better voice |
| Shotstack cloud | $0.30-1.00 | $30-100 | Level 1-3 (limited effects) |
| Creatomate cloud | $0.50-2.00 | $50-200 | Level 1-4 (good text) |
| Remotion Lambda | $0.05-0.15 | $5-15 | Level 1-6 (any animation) |
| Plainly (After Effects) | $2.50-3.00 | $250-300 | Level 6+ (premium) |

Key insight: Local FFmpeg rendering on a consumer GPU (RTX 4070 or similar) costs approximately $0.01/video in electricity. At 100 videos per month, that is $1 versus $100+ for cloud APIs, a 100x cost difference with equivalent or better quality for Levels 1-5.
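
To see how these per-video ranges scale at other volumes, the arithmetic can be sketched as follows. The figures are copied from the cost comparison in this article, not from live pricing pages, and the type and function names are hypothetical.

```typescript
interface PipelineCost {
  label: string;
  minPerVideo: number; // USD
  maxPerVideo: number; // USD
}

// Per-video ranges as stated in this article's cost comparison.
const PIPELINE_COSTS: PipelineCost[] = [
  { label: "Full local (FFmpeg + edge-tts)", minPerVideo: 0.01, maxPerVideo: 0.03 },
  { label: "Local + ElevenLabs voice", minPerVideo: 0.1, maxPerVideo: 0.3 },
  { label: "Shotstack cloud", minPerVideo: 0.3, maxPerVideo: 1.0 },
  { label: "Creatomate cloud", minPerVideo: 0.5, maxPerVideo: 2.0 },
  { label: "Remotion Lambda", minPerVideo: 0.05, maxPerVideo: 0.15 },
  { label: "Plainly (After Effects)", minPerVideo: 2.5, maxPerVideo: 3.0 },
];

// Round to whole cents to hide floating-point noise.
function roundCents(x: number): number {
  return Math.round(x * 100) / 100;
}

function monthlyRange(p: PipelineCost, videosPerMonth: number): { min: number; max: number } {
  return {
    min: roundCents(p.minPerVideo * videosPerMonth),
    max: roundCents(p.maxPerVideo * videosPerMonth),
  };
}

for (const p of PIPELINE_COSTS) {
  const r = monthlyRange(p, 100);
  console.log(`${p.label}: $${r.min}-$${r.max} per month`);
}
```

Changing the second argument of `monthlyRange` shows how the gap between local and hosted rendering grows linearly with volume.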

The Graduation Path

Stage 1: FFmpeg for everything (Levels 1-5)
  ↓ When you need: complex text animation, template management
Stage 2: FFmpeg + Creatomate for text-heavy content
  ↓ When you need: React-level custom animation
Stage 3: Remotion for complex sequences, FFmpeg for simple ones
  ↓ When you need: professional motion graphics with existing AE templates
Stage 4: Plainly for premium content, Remotion/FFmpeg for volume
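
The graduation path can also be expressed as a small lookup, which is convenient if a pipeline has to pick a renderer per job. This is only a sketch of the rules listed above. The `Need` labels and `recommendStage` are hypothetical names.

```typescript
type Need =
  | "complexTextAnimation"
  | "templateManagement"
  | "reactLevelAnimation"
  | "existingAeTemplates";

interface StageRecommendation {
  stage: 1 | 2 | 3 | 4;
  tools: string;
}

// Return the highest stage triggered by any of the stated needs.
function recommendStage(needs: Need[]): StageRecommendation {
  const has = (n: Need): boolean => needs.indexOf(n) !== -1;
  if (has("existingAeTemplates")) {
    return { stage: 4, tools: "Plainly for premium content, Remotion/FFmpeg for volume" };
  }
  if (has("reactLevelAnimation")) {
    return { stage: 3, tools: "Remotion for complex sequences, FFmpeg for simple ones" };
  }
  if (has("complexTextAnimation") || has("templateManagement")) {
    return { stage: 2, tools: "FFmpeg + Creatomate for text-heavy content" };
  }
  return { stage: 1, tools: "FFmpeg for everything" };
}

console.log(recommendStage([]).stage);                       // 1
console.log(recommendStage(["templateManagement"]).stage);   // 2
```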

Render Pipeline Architecture (TypeScript)

interface RenderJob {
  id: string;
  script: VideoScript;
  assets: CollectedAssets;
  profile: ColorProfile;
  output: OutputSpec;
  status: 'queued' | 'rendering' | 'complete' | 'failed';
}

interface VideoScript {
  hook: {
    archetype: HookArchetype;
    text: string;
    durationSec: number;
  };
  segments: Array<{
    narration: string;
    durationSec: number;
    visualSearchTerms: string[];  // For Pexels API queries
    dataOverlays: DataOverlay[];  // Numbers, stats to display
    sfxCues: SFXCue[];           // Sound effects at timestamps
  }>;
  cta: {
    text: string;
    durationSec: number;
  };
  totalDurationSec: number;
}

interface CollectedAssets {
  voiceover: { path: string; wordTimestamps: WhisperWord[] };
  music: { path: string; mood: string; bpm: number };
  clips: Array<{
    path: string;
    source: 'pexels' | 'pixabay' | 'local';
    kenBurns: KenBurnsMovement;
    durationSec: number;
  }>;
  sfx: Map<string, string>;  // name → file path
  fonts: Map<string, string>; // weight → file path
}

interface OutputSpec {
  resolution: '1080x1920' | '1920x1080';
  fps: 30;
  codec: 'libx264';
  audioBitrate: '192k';
  format: 'mp4';
}

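// The render function builds an FFmpeg command from the job spec.
// The build*Filter helpers are defined elsewhere; each is assumed to return a
// complete filter chain whose pad labels line up with the next step, because
// the chains are joined with ';' inside -filter_complex.
function buildFFmpegCommand(job: RenderJob): string {
  const inputs: string[] = [];
  const filters: string[] = [];

  // Step 1: Add all video inputs with Ken Burns
  job.assets.clips.forEach((clip, i) => {
    inputs.push(`-i "${clip.path}"`); // quote paths so spaces do not split the argument
    filters.push(buildKenBurnsFilter(clip, i));
  });

  // Step 2: Chain transitions
  filters.push(buildTransitionChain(job.assets.clips));

  // Step 3: Apply color grade + grain + vignette
  filters.push(buildColorGradeFilter(job.profile));

  // Step 4: Add text overlays
  job.script.segments.forEach(seg => {
    seg.dataOverlays.forEach(overlay => {
      filters.push(buildTextOverlayFilter(overlay));
    });
  });

  // Step 5: Mix audio (voiceover and music are the last two inputs)
  inputs.push(`-i "${job.assets.voiceover.path}"`);
  inputs.push(`-i "${job.assets.music.path}"`);
  filters.push(buildAudioMixFilter(job));

  // Encode with the settings from the job's OutputSpec instead of FFmpeg defaults
  const { codec, fps, audioBitrate, format } = job.output;
  const encode = `-c:v ${codec} -r ${fps} -b:a ${audioBitrate}`;
  return `ffmpeg ${inputs.join(' ')} -filter_complex "${filters.join(';')}" ${encode} output.${format}`;
}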
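The article does not show the body of `buildTransitionChain`, so the following is only a minimal sketch of what Step 2 could look like. It assumes each clip has already been turned into a stream labeled `[v0]`, `[v1]`, and so on by the Ken Burns step, and that the final video stream should be labeled `[vout]`. The function name, the `ClipTiming` type, and the label scheme are hypothetical. The part worth getting right is the `xfade` offset: each transition starts at the length of the video built so far minus the transition duration.

```typescript
// Hypothetical shape: only the field this sketch needs from a clip.
interface ClipTiming {
  durationSec: number;
}

// Assumption: clip i is already available as a stream labeled [v{i}].
function buildTransitionChainSketch(
  clips: ClipTiming[],
  transitions: string[] = ['fade', 'slideleft', 'dissolve', 'circleopen'],
  transitionSec = 0.5,
): string {
  if (clips.length === 0) throw new Error('at least one clip is required');
  if (clips.some(c => c.durationSec <= transitionSec)) {
    throw new Error('every clip must be longer than the transition');
  }
  // A single clip needs no transition; pass it through unchanged.
  if (clips.length === 1) return '[v0]null[vout]';

  const chains: string[] = [];
  let prevLabel = 'v0';
  let elapsed = clips[0].durationSec; // length of the stream built so far

  for (let i = 1; i < clips.length; i++) {
    // The overlap starts transitionSec before the current stream ends.
    const offset = elapsed - transitionSec;
    const outLabel = i === clips.length - 1 ? 'vout' : `x${i}`;
    // Cycle through the list so consecutive cuts use different transitions.
    const name = transitions[(i - 1) % transitions.length];
    chains.push(
      `[${prevLabel}][v${i}]xfade=transition=${name}:duration=${transitionSec}` +
        `:offset=${offset.toFixed(3)}[${outLabel}]`,
    );
    // Each crossfade overlaps the two streams, so the total grows by
    // the new clip's length minus the transition.
    elapsed = offset + clips[i].durationSec;
    prevLabel = outLabel;
  }
  return chains.join(';');
}
```

For three 3-second clips with the defaults, this yields `[v0][v1]xfade=transition=fade:duration=0.5:offset=2.500[x1];[x1][v2]xfade=transition=slideleft:duration=0.5:offset=5.000[vout]`. Note that `xfade` expects both inputs to share resolution, frame rate, and pixel format, so the preceding per-clip step would need to normalize those.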

| Don't | Do Instead | Why |
| --- | --- | --- |
| Use static images without Ken Burns | Apply `zoompan` to every image | Static = retention death. The brain disengages when nothing moves. |
| Hard cut between every clip | Use `xfade` transitions (0.3-1s) | Hard cuts feel jarring and amateur. Crossfades feel intentional. |
| Same Ken Burns movement on consecutive clips | Alternate from the 7-movement library | Repetitive movement becomes predictable and boring. |
| TTS voiceover with no background music | Mix music at 15% volume with sidechain ducking | Music sets mood and fills silence gaps. Ducking keeps voice clear. |
| Single visual per sentence | 2-3 visuals per sentence (3-second rule) | One visual for 9 seconds loses attention. Three visuals maintain pace. |
| Text without background/border | Always use `borderw` or `box` on `drawtext` | Text without contrast against video is unreadable on mobile. |
| Same transition type throughout | Vary transitions (fade, slide, dissolve, circle) | Variety maintains visual interest. Same transition = monotonous. |
| Skip color grading | Apply a consistent grade to all clips | Raw footage from different sources looks inconsistent. Grade unifies. |
| Generate music with AI | Curate from free stock libraries | AI music is detectable, sounds generic, and quality varies. Stock is curated. |
| Use only the fade transition | Use 6-8 different transitions per video | FFmpeg has 40+ transitions, so use them. Variety = production value. |
| Render in the cloud for simple compositions | Use local FFmpeg first, cloud only when needed | Cloud = $0.30-3.00/video. Local = $0.01/video. Same quality for Levels 1-5. |
| Skip silence removal in voiceover | Run `whisper-timestamped` and trim gaps > 0.3s | Dead air kills pacing. Professional editors obsessively remove silence. |
| Put all text at the center | Vary position (top bar, bottom third, center) | Same position becomes invisible (banner blindness). Vary to maintain attention. |
| Use system fonts in `drawtext` | Download Inter Black/Bold from Google Fonts | System fonts look amateur. Inter is designed for screens and data display. |
| Skip the hook (start with content) | First 3 seconds must have hook archetype | 71% decide in 3 seconds. No hook = no viewers. |
| Captions as an afterthought | Build captions into the composition as Track 3 | 78% watch on mute. Captions are primary content, not accessibility add-on. |
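The silence-removal row can be made concrete with a short sketch. It assumes word timestamps are available as start and end times in seconds, sorted by start time. The `TimedWord` shape, the function names, and the choice to shorten each long pause to exactly the threshold (rather than cutting it entirely) are assumptions for illustration, not the article's implementation. The output is an `aselect` expression that keeps only the chosen ranges, followed by `asetpts=N/SR/TB` to close the gaps left behind.

```typescript
// Hypothetical word shape; the field names are assumptions for this sketch.
interface TimedWord {
  text: string;
  start: number; // seconds
  end: number;   // seconds
}

interface KeepRange {
  start: number;
  end: number;
}

// Merge words separated by short pauses, then pad each range by half the
// threshold so every long pause is shortened to maxGapSec instead of
// being cut to zero. Assumes words are sorted by start time.
function speechRanges(words: TimedWord[], maxGapSec = 0.3): KeepRange[] {
  const merged: KeepRange[] = [];
  for (const w of words) {
    const last = merged[merged.length - 1];
    if (last !== undefined && w.start - last.end <= maxGapSec) {
      last.end = Math.max(last.end, w.end); // short pause: extend the range
    } else {
      merged.push({ start: w.start, end: w.end }); // long pause: new range
    }
  }
  const pad = maxGapSec / 2;
  return merged.map(r => ({ start: Math.max(0, r.start - pad), end: r.end + pad }));
}

// Build an FFmpeg audio filter that keeps only the given ranges.
function buildSilenceTrimFilter(ranges: KeepRange[]): string {
  if (ranges.length === 0) return 'anull'; // nothing to trim: pass audio through
  const expr = ranges
    .map(r => `between(t,${r.start.toFixed(3)},${r.end.toFixed(3)})`)
    .join('+');
  // asetpts regenerates timestamps so the kept ranges play back to back.
  return `aselect='${expr}',asetpts=N/SR/TB`;
}
```

The resulting string could be passed to `ffmpeg -i voiceover.wav -af "<filter>" trimmed.wav`. Word timestamps shift once the audio is trimmed, so captions built from them would need to be recomputed against the trimmed file.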


