Org Status: Dormant | Cloudflare: N/A | Last Audited: 2026-04-28
How to bridge the gap between "AI slideshow" and viral-quality video, using FFmpeg, cloud rendering APIs, and the exact techniques human editors use to hold attention. Every command is real. Every statistic is sourced.
What you'll learn:
- The 6 production levels from amateur to professional, with exact FFmpeg commands for each
- 4 secrets human editors use that most AI pipelines miss entirely
- Complete FFmpeg filter reference: Ken Burns, transitions, color grading, text overlays, audio mixing, speed effects
- How to build multi-track compositions that create visual depth
- Cloud API comparison: Shotstack vs Creatomate vs Remotion vs Plainly vs JSON2Video
- Hook archetypes that capture viewers in the first 3 seconds
- Stock resource strategy: what to download, what to API, what to skip
- Architecture decisions for rendering at $0.01/video vs $0.30/minute
- The Problem: Why AI Video Looks Amateur
- The Retention Science
- Core Concepts
- The 6 Production Levels
- The 4 Human Editor Secrets
- FFmpeg Effects Reference
- Hook Archetypes
- Patterns: Production-Quality Compositions
- Small Examples
- Cloud API Comparison
- Stock Resource Strategy
- Architecture Decisions
- Anti-Patterns
- References
Most programmatic video pipelines produce content that viewers scroll past in under a second. The symptoms are obvious:
- Static visuals: stock images sit motionless for 5-10 seconds while TTS drones
- Hard cuts: jarring transitions between clips with no visual flow
- Flat composition: single-layer video with text slapped on top
- No audio design: TTS at 100% volume, no music, no sound effects
- No pacing: every segment is the same length regardless of content
The result is what the industry calls a "slideshow video": technically a video file, functionally a PowerPoint with narration. Viewers detect this in under 3 seconds and swipe away.
The gap between slideshow and viral is not talent or expensive software. It is a specific set of techniques that human editors apply instinctively and AI pipelines skip entirely. These techniques are all implementable with FFmpeg filters, and they compound: each one adds 10-30% retention lift, and together they transform amateur content into content that algorithms promote.
What changes if you get this right:
| Metric | Slideshow (Level 1) | Production (Level 4+) |
|---|---|---|
| 3-second retention | 30-40% | 70-85% |
| Average watch time | 15-25% of duration | 45-65% of duration |
| Algorithmic promotion | Minimal | Active distribution |
| RPM (finance niche) | $2-5 | $9-21 |
| Viewer perception | "AI generated" | "Professional content" |
| Cost per video (local) | $0.01 | $0.01-0.03 |
The cost difference is negligible. The quality difference is everything. Every technique in this article adds production value without adding meaningful cost when you render locally with FFmpeg.
Before diving into techniques, understand why they work. Video retention is governed by neuroscience, not aesthetics.
The 3-Second Window
71% of viewers decide in the first 3 seconds whether to keep watching. TikTok videos that maintain 70-85% retention in the first 3 seconds receive 2.2x more total views than videos with lower retention. Videos exceeding 85% first-3-second retention achieve viral potential. Content below 60% gets minimal algorithmic promotion.
interface RetentionWindow {
/** Seconds 0-3: The hook. Visual must change, text must appear, audio must hit. */
hook: { durationMs: 3000; targetRetention: 0.85; visualChanges: number }; // minimum 2
/** Seconds 3-10: The promise. Viewer decides if the payoff is worth waiting for. */
promise: { durationMs: 7000; targetRetention: 0.70; paceSecondsPerVisual: 3 };
/** Seconds 10-30: The delivery. Content must match or exceed the hook's promise. */
delivery: { durationMs: 20000; targetRetention: 0.55; paceSecondsPerVisual: 3 };
/** Seconds 30+: The payoff. Reward the viewer for staying. */
payoff: { targetRetention: 0.40; includesCTA: boolean };
}
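The thresholds quoted above can be expressed as a small check. This is a minimal sketch, assuming a hypothetical `classifyHookRetention` helper and tier names invented for illustration (neither belongs to any platform API). The text does not describe the 60-70% band, so the sketch labels it `unspecified` rather than guessing.

```typescript
// Hypothetical tier names; only the numeric thresholds come from the text above.
type HookTier = 'viral_potential' | 'boosted' | 'unspecified' | 'minimal';

/** Classify first-3-second retention (a fraction between 0 and 1). */
function classifyHookRetention(retention: number): HookTier {
  if (retention < 0 || retention > 1) {
    throw new RangeError('retention must be a fraction between 0 and 1');
  }
  if (retention > 0.85) return 'viral_potential'; // above 85%
  if (retention >= 0.70) return 'boosted';        // the 70-85% band
  if (retention >= 0.60) return 'unspecified';    // not covered by the text
  return 'minimal';                               // below 60%
}
```

A pipeline could run a check like this on analytics exports to flag videos whose hook needs rework before spending effort on the rest of the edit.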
Why Movement Matters
The human visual system is wired to track movement. A static image on screen triggers the brain's "nothing is happening" response and attention drops. Even subtle movement, such as a 2% zoom over 5 seconds, keeps the visual cortex engaged.
Key insight: Static visuals are not "neutral"; they are actively harmful to retention. Every frame must move. This is the single highest-impact change you can make to any video pipeline.
Why Multi-Layer Composition Works
Professional video uses depth: multiple visual planes stacked to create a sense of space. Multi-plane B-roll compositions increase retention by 31% compared to single-layer video. The brain interprets layered visuals as "richer" content worth paying attention to.
Why Sound Design Matters
78% of social media video is watched on mute. This means captions are mandatory, not optional. But for the 22% who listen, sound design (background music, sound effects, audio ducking) dramatically increases perceived production quality. The combination of visual captions AND sound design covers both audiences.
Optimal Duration
The highest engagement for short-form video occurs in the 60-90 second range. For long-form YouTube, 8-15 minutes is the sweet spot for finance/education niches, with $9-21 RPM.
Concept 1: The Render Pipeline
Every programmatic video follows the same pipeline, regardless of whether you use FFmpeg locally or a cloud API.
interface RenderPipeline {
/** Step 1: Script generation (what the video says) */
script: {
hook: string; // First 3 seconds
segments: Segment[]; // Each segment = one visual scene
cta: string; // Call to action
};
/** Step 2: Asset collection (what the video shows) */
assets: {
stockVideo: StockClip[]; // B-roll footage from Pexels/Pixabay
stockImages: StockImage[]; // For Ken Burns treatment
music: AudioTrack; // Background music from Pixabay Music
sfx: SoundEffect[]; // Whoosh, bass drop, etc.
voiceover: AudioTrack; // TTS from edge-tts or ElevenLabs
};
/** Step 3: Composition (how it all fits together) */
composition: {
tracks: Track[]; // Layered video tracks (background, subject, text)
transitions: Transition[]; // Between clips
colorGrade: ColorProfile; // Overall mood
effects: Effect[]; // Grain, vignette, etc.
};
/** Step 4: Render (produce the final file) */
output: {
format: 'mp4';
resolution: '1080x1920' | '1920x1080'; // Vertical or horizontal
fps: 30;
codec: 'libx264' | 'libx265';
audioBitrate: '192k';
};
}
Concept 2: The Visual Change Cadence
The "3-second rule" is the most important pacing concept in viral video. The visual on screen must change every 3 seconds. Not every 10 seconds. Not every 5 seconds. Every 3 seconds.
interface VisualCadence {
/** Maximum seconds any single visual can stay on screen */
maxVisualDuration: 3;
/** For a 60-second video, you need at minimum 20 distinct visuals */
visualsPerMinute: 20;
/** Each script segment (sentence) should map to 2-3 visuals, not 1 */
visualsPerSegment: 2 | 3;
/** Never repeat the same visual in a video */
allowRepeat: false;
/** Movement type must vary: never two consecutive clips with same Ken Burns */
movementVariation: true;
}
Key insight: Divide each script segment duration by 3. That's how many distinct visuals you need for that segment. A 9-second sentence needs 3 different visuals, each with its own Ken Burns movement, with crossfade transitions between them.
Concept 3: The Audio Stack
Professional video has 3-4 audio layers, not 1.
interface AudioStack {
/** Layer 1: Voice, the primary content, always at 100% */
voice: {
source: 'edge-tts' | 'elevenlabs' | 'recorded';
volume: 1.0;
processing: 'normalize' | 'compress';
};
/** Layer 2: Music, sets mood, always ducked under voice */
music: {
source: 'stock'; // Pixabay Music, Mixkit
volume: 0.15; // 15% when voice is active
ducking: {
method: 'sidechaincompress';
threshold: 0.02;
ratio: 8;
attackMs: 200;
releaseMs: 1000;
};
};
/** Layer 3: SFX, punctuates transitions and key moments */
sfx: {
onTransition: 'whoosh'; // Every scene change
onHookReveal: 'bass_drop'; // The first number/stat
onMoney: 'cash_register'; // Dollar amounts
onData: 'keyboard_typing'; // Data reveals
onCTA: 'success_chime'; // End call-to-action
volume: 0.5;
};
/** Layer 4: Ambience (optional), subtle background texture */
ambience?: {
source: 'tension_drone' | 'room_tone';
volume: 0.05;
};
}
Concept 4: The Color Grade
Color grading is the difference between "footage" and "cinema." A single FFmpeg filter chain transforms generic stock footage into a cohesive visual identity.
interface ColorProfile {
name: string;
/** FFmpeg eq filter values */
brightness: number; // -1.0 to 1.0
contrast: number; // 0.0 to 2.0
saturation: number; // 0.0 to 3.0
/** Additional filters */
grain: boolean; // noise filter for texture
vignette: boolean; // dark edges for focus
colorBalance?: { // Teal/orange, cold, warm
shadowsRed: number;
shadowsGreen: number;
shadowsBlue: number;
highlightsRed: number;
highlightsGreen: number;
highlightsBlue: number;
};
}
const PROFILES: Record<string, ColorProfile> = {
darkCinematic: {
name: 'Dark Cinematic',
brightness: -0.1,
contrast: 1.3,
saturation: 0.7,
grain: true,
vignette: true,
},
tealOrange: {
name: 'Teal & Orange (Hollywood)',
brightness: 0,
contrast: 1.1,
saturation: 1.2,
grain: false,
vignette: true,
colorBalance: {
shadowsRed: 0.1, shadowsGreen: -0.1, shadowsBlue: -0.2,
highlightsRed: -0.1, highlightsGreen: 0.05, highlightsBlue: 0.15,
},
},
coldClinical: {
name: 'Cold/Clinical (Tech)',
brightness: 0,
contrast: 1.2,
saturation: 0.6,
grain: false,
vignette: false,
colorBalance: {
shadowsRed: -0.15, shadowsGreen: 0, shadowsBlue: 0.15,
highlightsRed: -0.1, highlightsGreen: 0, highlightsBlue: 0.1,
},
},
};
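One way such a profile could be turned into an FFmpeg filter string is sketched below. The `GradeSettings` type and `toFilterChain` function are hypothetical names introduced for illustration; the type mirrors the profile fields above so the block stands alone. The grain and vignette strengths (`noise=alls=15:allf=t+u`, `vignette=PI/4:1.2`) are assumptions borrowed from the dark cinematic chain used elsewhere in this article, not values stored in the profile.

```typescript
// Hypothetical, self-contained mirror of the profile fields shown above.
interface GradeSettings {
  brightness: number;
  contrast: number;
  saturation: number;
  grain: boolean;
  vignette: boolean;
  colorBalance?: {
    shadowsRed: number; shadowsGreen: number; shadowsBlue: number;
    highlightsRed: number; highlightsGreen: number; highlightsBlue: number;
  };
}

/** Build a comma-separated -vf chain: colorbalance (optional), eq, noise, vignette. */
function toFilterChain(p: GradeSettings): string {
  const filters: string[] = [];
  if (p.colorBalance) {
    const c = p.colorBalance;
    // rs/gs/bs adjust shadows, rh/gh/bh adjust highlights
    filters.push(
      `colorbalance=rs=${c.shadowsRed}:gs=${c.shadowsGreen}:bs=${c.shadowsBlue}` +
      `:rh=${c.highlightsRed}:gh=${c.highlightsGreen}:bh=${c.highlightsBlue}`,
    );
  }
  filters.push(`eq=brightness=${p.brightness}:contrast=${p.contrast}:saturation=${p.saturation}`);
  if (p.grain) filters.push('noise=alls=15:allf=t+u'); // assumed fixed grain strength
  if (p.vignette) filters.push('vignette=PI/4:1.2');   // assumed fixed vignette shape
  return filters.join(',');
}

const darkChain = toFilterChain({
  brightness: -0.1, contrast: 1.3, saturation: 0.7, grain: true, vignette: true,
});
console.log(darkChain);
```

Keeping the grade as data and generating the string at render time means every clip in a video receives an identical chain, which is what makes mixed stock footage look like one shoot.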
Concept 5: The Track System
Professional video is composed in tracks, like audio mixing. Each track is a visual layer.
interface TrackSystem {
/** Track 1 (bottom): Full-screen background, stock footage with Ken Burns */
background: {
content: 'stock_video' | 'stock_image_with_zoompan';
movement: KenBurnsEffect;
colorGrade: ColorProfile;
};
/** Track 2 (middle): Subject matter, the thing you're talking about */
subject: {
content: 'screenshot' | 'chart' | 'product_image' | 'person';
scale: 0.8; // 80% of frame size
opacity: 0.9; // Slightly transparent
position: 'center';
shadow: boolean; // Drop shadow for depth
};
/** Track 3 (top): Text and data overlays */
text: {
content: 'caption' | 'statistic' | 'title' | 'counter';
animation: 'slide_up' | 'fade_in' | 'typewriter' | 'scale_pop';
font: 'Inter-Black' | 'Montserrat-Bold';
position: 'bottom_third' | 'center' | 'top_bar';
};
}
Key insight: The 3-track system creates perceived depth. Track 1 moves, Track 2 is semi-transparent with a shadow, Track 3 animates text. The viewer's brain interprets this as a 3D space, which registers as "professional production" even when every asset is stock footage composited with FFmpeg.
Each level builds on the previous one. The cost column assumes local FFmpeg rendering on consumer hardware.
Level 1 β Basic (The Slideshow)
What it looks like: Static clips + TTS + captions + hard cuts.
What's wrong: Everything is static. The viewer's brain sees "nothing is happening" and swipes away. No movement, no transitions, no audio design. This is where most AI video pipelines stop.
ffmpeg -i clip1.mp4 -i clip2.mp4 -i clip3.mp4 -i voiceover.mp3 \
-filter_complex "[0:v][1:v][2:v]concat=n=3:v=1:a=0[v]" \
-map "[v]" -map 3:a \
-c:v libx264 -c:a aac -shortest output.mp4
| Attribute | Value |
|---|---|
| Retention impact | Baseline |
| Viewer perception | "AI slideshow" |
| Implementation time | 1 hour |
| Cost per video | $0.01 |
Level 2 β Movement (Ken Burns + Transitions)
What changes: Every clip moves. Transitions flow. The video feels alive.
This is the single biggest quality jump: going from static to moving. Apply Ken Burns to every visual and crossfade between clips.
ffmpeg -loop 1 -i image1.jpg -vf \
"zoompan=z='min(zoom+0.002,1.2)':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d=150:s=1080x1920:fps=30" \
-t 5 -c:v libx264 -pix_fmt yuv420p clip1_kb.mp4
ffmpeg -loop 1 -i image2.jpg -vf \
"zoompan=z='if(eq(on,1),1.3,max(zoom-0.002,1.0))':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d=150:s=1080x1920:fps=30" \
-t 5 -c:v libx264 -pix_fmt yuv420p clip2_kb.mp4
ffmpeg -i clip1_kb.mp4 -i clip2_kb.mp4 \
-filter_complex "xfade=transition=fade:duration=1:offset=4" \
-c:v libx264 -pix_fmt yuv420p output.mp4
| Attribute | Value |
|---|---|
| Retention impact | +25-35% watch time |
| Viewer perception | "Looks like a real video" |
| Implementation time | 4 hours |
| Cost per video | $0.01 |
Level 3 β Layering (3-Track Composition)
What changes: Visual depth. Background moves, subject floats with shadow, text overlays add data.
ffmpeg -i background.mp4 -i subject.png -i voice.mp3 \
-filter_complex "
[0:v]zoompan=z='min(max(zoom,pzoom)+0.001,1.15)':d=1:x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':s=1080x1920:fps=30,
eq=brightness=-0.1:contrast=1.3:saturation=0.7,
noise=alls=15:allf=t+u,
vignette=PI/4:1.2[bg];
[1:v]scale=864:-1,format=rgba,colorchannelmixer=aa=0.9[fg];
[bg][fg]overlay=(W-w)/2:(H-h)/2[comp];
[comp]drawtext=text='\$3,200 SAVED':fontfile=Inter-Black.ttf:fontsize=100:
fontcolor=white:borderw=4:bordercolor=black:
x=(w-text_w)/2:y=h*0.75:enable='between(t,3,6)'[out]
" \
-map "[out]" -map 2:a \
-c:v libx264 -c:a aac -t 10 output.mp4
| Attribute | Value |
|---|---|
| Retention impact | +31% over single-layer |
| Viewer perception | "Professional production" |
| Implementation time | 8 hours |
| Cost per video | $0.01-0.02 |
Level 4 β Sound Design (Music + SFX + Ducking)
What changes: Audio becomes an experience. Background music sets mood, sound effects punctuate moments, voice ducks the music automatically.
ffmpeg -i composed_video.mp4 -i music.mp3 -i whoosh.mp3 -i bass_drop.mp3 \
-filter_complex "
[0:a]asplit=2[voice][key];
[1:a]volume=0.15[bg_music];
[2:a]adelay=4000|4000,volume=0.5[whoosh];
[3:a]adelay=1000|1000,volume=0.7[bass];
[whoosh][bass]amix=inputs=2[sfx];
[bg_music][sfx]amix=inputs=2[music_sfx];
[music_sfx][key]sidechaincompress=threshold=0.02:ratio=8:attack=200:release=1000[ducked];
[voice][ducked]amix=inputs=2:duration=first[final_audio]
" \
-map 0:v -map "[final_audio]" \
-c:v copy -c:a aac output.mp4
| Attribute | Value |
|---|---|
| Retention impact | +15-20% perceived quality |
| Viewer perception | "This has a production team" |
| Implementation time | 12 hours |
| Cost per video | $0.01-0.02 |
Level 5 β Typography (Kinetic Text + Counters)
What changes: Text animates. Numbers count up. Statistics slide in. Kinetic typography improves retention by 25-50%.
ffmpeg -i composed_video.mp4 -vf "
drawtext=text='\$%{eif\\:min(floor((t-2)*1600)\\,3200)\\:d}':
fontfile=Inter-Black.ttf:fontsize=120:fontcolor=white:
borderw=4:bordercolor=black:
x=(w-text_w)/2:y=h*0.4:
enable='between(t,2,5)',
drawtext=text='ANNUAL SAVINGS':
fontfile=Inter-Bold.ttf:fontsize=48:fontcolor=white@0.8:
x=(w-text_w)/2:y=h*0.4+130:
enable='between(t,2.5,5)'
" -c:v libx264 -c:a copy output.mp4
| Attribute | Value |
|---|---|
| Retention impact | +25-50% engagement |
| Viewer perception | "Motion graphics quality" |
| Implementation time | 16 hours |
| Cost per video | $0.02-0.03 |
Level 6 β Professional (Speed Ramping + Match Cuts + Nano-Hooks)
What changes: Pacing becomes aggressive. Speed ramping on dramatic moments. Visual changes every 1.5 seconds on hooks. Multiple clips per sentence with match-cut transitions.
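# Seconds 3-5 of the source play at half speed; timestamps stay continuous on both sides of the ramp.
# The audio is passed through unchanged, so mute it or re-time it separately to avoid drift.
ffmpeg -i clip.mp4 -filter_complex "
[0:v]setpts='
if(lt(T,3), PTS,
if(lt(T,5), 3/TB+2.0*(PTS-3/TB),
PTS+2/TB))'[v];
[0:a]atempo=1.0[a]
" -map "[v]" -map "[a]" -c:v libx264 output.mp4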
| Attribute | Value |
|---|---|
| Retention impact | Approaches human-edited quality |
| Viewer perception | "Can't tell this is AI" |
| Implementation time | 24+ hours |
| Cost per video | $0.02-0.05 |
Key insight: 73% of viewers cannot distinguish AI-assisted video from traditionally produced video when Level 4+ techniques are applied. The quality ceiling for programmatic video is far higher than most pipelines reach.
Level Progression Summary
| Level | Name | Key Addition | Retention Lift | Cumulative Effect |
|---|---|---|---|---|
| 1 | Basic | Clips + TTS | Baseline | "Slideshow" |
| 2 | Movement | Ken Burns + crossfades | +25-35% | "Looks like video" |
| 3 | Layering | 3-track + overlays + grading | +31% | "Professional" |
| 4 | Sound | Music + SFX + ducking | +15-20% | "Has a production team" |
| 5 | Typography | Kinetic text + counters | +25-50% | "Motion graphics" |
| 6 | Professional | Speed ramps + nano-hooks | +10-15% | "Can't tell it's AI" |
These come from studying million-dollar YouTube channels. They are the techniques that separate amateur from professional, and most AI pipelines implement none of them.
Secret 1: Aggressive Pacing and Silence Removal
Human editors obsessively remove silence. Any gap longer than 0.3 seconds between words gets cut. Word endings overlap with the next word's beginning for a relentless pace.
The tool: whisper-timestamped, which provides word-level timestamps from OpenAI Whisper with silence detection built in.
interface SilenceRemoval {
/** Maximum silence duration before trimming */
maxSilenceMs: 300;
/** Overlap word boundaries for pace */
overlapMs: 50;
/** Preserve intentional pauses (before reveals) */
preserveDramaticPause: boolean;
/** whisper-timestamped output gives us word-level timestamps */
timestampSource: 'whisper-timestamped';
}
// whisper-timestamped output format
interface WhisperWord {
text: string;
start: number; // seconds
end: number; // seconds
confidence: number;
}
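// A gap between two consecutive words
interface Silence {
start: number; // seconds
end: number; // seconds
durationMs: number;
}
// Find silences longer than threshold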
function findSilences(words: WhisperWord[], thresholdMs: number): Silence[] {
const silences: Silence[] = [];
for (let i = 0; i < words.length - 1; i++) {
const gap = (words[i + 1].start - words[i].end) * 1000;
if (gap > thresholdMs) {
silences.push({
start: words[i].end,
end: words[i + 1].start,
durationMs: gap,
});
}
}
return silences;
}
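The same word timestamps can also drive the FFmpeg side directly. The sketch below is one possible approach, not the pipeline's actual code: `keepRanges` and `buildTrimFilter` are hypothetical helpers that invert the silence list into the ranges worth keeping and emit an `atrim`/`concat` filter graph. The `asetpts=PTS-STARTPTS` step after each trim is an addition here, included so that every segment starts at timestamp zero before concatenation.

```typescript
interface TimedWord { start: number; end: number } // seconds
interface KeepRange { start: number; end: number } // seconds

/** Split the word list into speech ranges wherever a gap exceeds thresholdMs. */
function keepRanges(words: TimedWord[], thresholdMs: number): KeepRange[] {
  if (words.length === 0) return [];
  const ranges: KeepRange[] = [];
  let rangeStart = words[0].start;
  for (let i = 0; i < words.length - 1; i++) {
    const gapMs = (words[i + 1].start - words[i].end) * 1000;
    if (gapMs > thresholdMs) {
      ranges.push({ start: rangeStart, end: words[i].end });
      rangeStart = words[i + 1].start;
    }
  }
  ranges.push({ start: rangeStart, end: words[words.length - 1].end });
  return ranges;
}

/** Emit one atrim chain per kept range, then concatenate them into [out]. */
function buildTrimFilter(ranges: KeepRange[]): string {
  if (ranges.length === 0) throw new Error('nothing to keep');
  const chains = ranges.map(
    (r, i) => `[0:a]atrim=start=${r.start}:end=${r.end},asetpts=PTS-STARTPTS[s${i + 1}]`,
  );
  const labels = ranges.map((_, i) => `[s${i + 1}]`).join('');
  return `${chains.join(';')};${labels}concat=n=${ranges.length}:v=0:a=1[out]`;
}

const sampleWords: TimedWord[] = [
  { start: 0, end: 1.0 }, { start: 1.1, end: 2.3 },
  { start: 2.8, end: 5.1 }, { start: 5.9, end: 8.4 },
];
const sampleFilter = buildTrimFilter(keepRanges(sampleWords, 300));
console.log(sampleFilter);
```

The resulting string would be passed as the `-filter_complex` argument with `-map "[out]"`. Preserving dramatic pauses would need an extra allow-list of gaps, which this sketch leaves out.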
FFmpeg implementation (trim silences):
whisper_timestamped audio.mp3 --model small --language en --output_format json
ffmpeg -i audio.mp3 -filter_complex "
[0:a]atrim=start=0:end=2.3,asetpts=PTS-STARTPTS[s1];
[0:a]atrim=start=2.8:end=5.1,asetpts=PTS-STARTPTS[s2];
[0:a]atrim=start=5.9:end=8.4,asetpts=PTS-STARTPTS[s3];
[s1][s2][s3]concat=n=3:v=0:a=1[out]
" -map "[out]" trimmed_audio.mp3
Secret 2: Ken Burns Parallax (Movement is Mandatory)
No visual asset is EVER static. Every image gets a zoompan effect. Every video clip gets repositioned. The human editor treats stillness as a bug.
The library: 7 movement variants.
enum KenBurnsMovement {
ZOOM_IN = 'zoom_in', // Intimacy, focus
ZOOM_OUT = 'zoom_out', // Reveals context, scale
PAN_LEFT = 'pan_left', // Follows action
PAN_RIGHT = 'pan_right', // Reverse motion
PAN_DOWN = 'pan_down', // Gravity, weight, reveal
PAN_UP = 'pan_up', // Aspiration, growth
DIAGONAL_DRIFT = 'diagonal', // Subtle, cinematic
}
// Rule: Never two consecutive clips with the same movement
function selectMovement(previous: KenBurnsMovement | null): KenBurnsMovement {
const movements = Object.values(KenBurnsMovement);
let next: KenBurnsMovement;
do {
next = movements[Math.floor(Math.random() * movements.length)];
} while (next === previous);
return next;
}
Complete FFmpeg commands for all 7 variants (1080x1920 vertical, 5 seconds, 30fps):
ffmpeg -loop 1 -i image.jpg -vf \
"zoompan=z='min(zoom+0.002,1.2)':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d=150:s=1080x1920:fps=30" \
-t 5 -c:v libx264 -pix_fmt yuv420p zoom_in.mp4
ffmpeg -loop 1 -i image.jpg -vf \
"zoompan=z='if(eq(on,1),1.3,max(zoom-0.002,1.0))':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d=150:s=1080x1920:fps=30" \
-t 5 -c:v libx264 -pix_fmt yuv420p zoom_out.mp4
ffmpeg -loop 1 -i image.jpg -vf \
"zoompan=z='1.15':x='if(eq(on,1),0,min(x+2,iw-iw/zoom))':y='ih/2-(ih/zoom/2)':d=150:s=1080x1920:fps=30" \
-t 5 -c:v libx264 -pix_fmt yuv420p pan_right.mp4
ffmpeg -loop 1 -i image.jpg -vf \
"zoompan=z='1.15':x='if(eq(on,1),iw-iw/zoom,max(x-2,0))':y='ih/2-(ih/zoom/2)':d=150:s=1080x1920:fps=30" \
-t 5 -c:v libx264 -pix_fmt yuv420p pan_left.mp4
ffmpeg -loop 1 -i image.jpg -vf \
"zoompan=z='1.15':x='iw/2-(iw/zoom/2)':y='if(eq(on,1),0,min(y+2,ih-ih/zoom))':d=150:s=1080x1920:fps=30" \
-t 5 -c:v libx264 -pix_fmt yuv420p pan_down.mp4
ffmpeg -loop 1 -i image.jpg -vf \
"zoompan=z='1.15':x='iw/2-(iw/zoom/2)':y='if(eq(on,1),ih-ih/zoom,max(y-2,0))':d=150:s=1080x1920:fps=30" \
-t 5 -c:v libx264 -pix_fmt yuv420p pan_up.mp4
ffmpeg -loop 1 -i image.jpg -vf \
"zoompan=z='1.2':x='if(eq(on,1),0,min(x+1.5,iw-iw/zoom))':y='if(eq(on,1),0,min(y+1,ih-ih/zoom))':d=150:s=1080x1920:fps=30" \
-t 5 -c:v libx264 -pix_fmt yuv420p diagonal.mp4
Secret 3: B-Roll Layering (The Opacity Trick)
The 3-track composition creates perceived depth that single-layer video cannot match:
Track 3 (top): Kinetic typography, captions, data overlays
Track 2 (middle): Subject at 80% scale, 90% opacity, drop shadow
Track 1 (bottom): Moving abstract background with color grade
FFmpeg multi-track composition:
# Track 1 ([bg]): background with Ken Burns + color grade + grain + vignette
# Track 2 ([subject]): subject at 80% scale, 90% opacity, composed over Track 1 as [comp]
# Track 3 ([final]): text overlay (data reveal at timestamp)
# FFmpeg filter graphs have no comment syntax, so the notes live outside the quoted string.
ffmpeg -i abstract_bg.mp4 -i subject.png \
-filter_complex "
[0:v]scale=1080:1920:force_original_aspect_ratio=increase,crop=1080:1920,
zoompan=z='min(max(zoom,pzoom)+0.001,1.1)':d=1:x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':s=1080x1920:fps=30,
eq=brightness=-0.1:contrast=1.3:saturation=0.7,
noise=alls=15:allf=t+u,
vignette=PI/4:1.2[bg];
[1:v]scale=864:-1,format=rgba,colorchannelmixer=aa=0.9[subject];
[bg][subject]overlay=(W-w)/2:(H-h)/2[comp];
[comp]drawtext=text='\$3,200':fontfile=Inter-Black.ttf:fontsize=120:
fontcolor=white:borderw=5:bordercolor=black@0.8:
x=(w-text_w)/2:y=h*0.3:
enable='between(t,3,7)',
drawtext=text='ANNUAL TAX SAVINGS':fontfile=Inter-Bold.ttf:fontsize=40:
fontcolor=white@0.8:x=(w-text_w)/2:y=h*0.3+140:
enable='between(t,3.5,7)'[final]
" \
-map "[final]" -c:v libx264 -pix_fmt yuv420p -t 10 output.mp4
Secret 4: 3-Second Visual Hook Swap
The visual on screen must change every 3 seconds. For each script segment, divide the duration by 3, then fetch that many distinct visuals.
interface VisualSwapStrategy {
segmentDurationSec: number;
visualCount: number; // Math.ceil(segmentDurationSec / 3)
transitionDuration: 0.5; // Half-second crossfade between visuals
searchStrategy: 'varied'; // Each visual uses different search terms
/** Example: "LLC saves you $3,200 per year on taxes" (9 seconds)
* Visual 1 (0-3s): Business office footage, zoom in
* Visual 2 (3-6s): Calculator/spreadsheet, pan right
* Visual 3 (6-9s): Money/savings imagery, zoom out
*/
}
function planVisuals(segment: ScriptSegment): VisualPlan[] {
  const count = Math.ceil(segment.durationSec / 3);
  const visualDuration = segment.durationSec / count;
  // Build the list in a loop so each plan can read the previous plan's movement
  const plans: VisualPlan[] = [];
  for (let i = 0; i < count; i++) {
    plans.push({
      searchQuery: segment.visualSearchTerms[i],
      startTime: segment.startTime + i * visualDuration,
      duration: visualDuration,
      kenBurns: selectMovement(i > 0 ? plans[i - 1].kenBurns : null),
      transition: i > 0 ? 'crossfade' : 'none',
    });
  }
  return plans;
}
Complete reference for every visual and audio effect, with exact syntax.
Ken Burns (zoompan filter)
The zoompan filter accepts values for zoom between 1 and 10. Key parameters:
| Parameter | Description | Default |
|---|---|---|
| z | Zoom expression (1.0 = no zoom) | 1 |
| x | Horizontal pan position | 0 |
| y | Vertical pan position | 0 |
| d | Duration in frames (fps * seconds) | 90 |
| s | Output size | 1280x720 |
| fps | Output frame rate | 25 |
Zoom speed reference:
| Effect | Zoom increment | Duration feel |
|---|---|---|
| Barely perceptible | +0.0005/frame | Very subtle, cinematic |
| Gentle | +0.001/frame | Natural, documentary |
| Standard | +0.002/frame | Noticeable, engaging |
| Aggressive | +0.004/frame | Dramatic, attention-grabbing |
| Fast | +0.008/frame | Action, urgency |
Center-zoom formula explained:
x='iw/2-(iw/zoom/2)' centers the crop horizontally as zoom changes
y='ih/2-(ih/zoom/2)' centers the crop vertically as zoom changes
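A quick numeric check makes the centering formula and the zoom-speed table more concrete. The helpers below, `centerOffset` and `zoomIncrement`, are hypothetical names used only for this worked example. They reproduce the arithmetic in plain TypeScript, assuming a linear zoom that starts at 1.0.

```typescript
/** Offset that keeps the visible window centered: size/2 - (size/zoom)/2. */
function centerOffset(inputSize: number, zoom: number): number {
  return inputSize / 2 - inputSize / zoom / 2;
}

/** Per-frame zoom step needed to go from 1.0 to targetZoom in the given time. */
function zoomIncrement(targetZoom: number, seconds: number, fps: number): number {
  return (targetZoom - 1) / (seconds * fps);
}

// At zoom 1.2 on a 1080-wide frame the visible window is 900 px wide,
// so it must start about 90 px from the left edge to stay centered.
const xAtZoom = centerOffset(1080, 1.2);

// Reaching 1.3x over 5 seconds at 30 fps (150 frames) needs about 0.002 per
// frame, which is the "Standard" row in the speed table.
const step = zoomIncrement(1.3, 5, 30);

console.log(xAtZoom.toFixed(2), step.toFixed(4));
```

Working backwards from a target zoom and clip length this way avoids picking an increment that reaches its cap early and leaves the last seconds of the clip static.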
Transitions (xfade filter)
The xfade filter provides 40+ built-in transitions between two video streams.
Syntax:
ffmpeg -i clip1.mp4 -i clip2.mp4 \
-filter_complex "xfade=transition=TRANSITION_NAME:duration=SECONDS:offset=SECONDS" \
output.mp4
Transition list with use cases:
| Transition | Visual Effect | Best For |
|---|---|---|
| fade | Crossfade between the two clips | Universal, safe default |
| wipeleft | Wipe from right to left | Forward progress |
| wiperight | Wipe from left to right | Flashback, reverse |
| wipeup | Wipe from bottom to top | Aspiration, growth |
| wipedown | Wipe from top to bottom | Gravity, grounding |
| slideleft | Second clip slides in from right | Fast pace, lists |
| slideright | Second clip slides in from left | Fast pace |
| slideup | Second clip slides in from bottom | Reveals |
| slidedown | Second clip slides in from top | Drops, emphasis |
| circlecrop | Circle expanding from center | Focus, spotlight |
| rectcrop | Rectangle expanding from center | Data reveals |
| distance | Pixel distance blend | Abstract, artistic |
| fadeblack | Fade through black | Scene change, time jump |
| fadewhite | Fade through white | Dream, flashback |
| radial | Radial wipe | Clock-like, time-based |
| smoothleft | Smooth left transition | Professional, clean |
| smoothright | Smooth right transition | Professional |
| smoothup | Smooth upward transition | Growth narrative |
| smoothdown | Smooth downward transition | Grounding |
| circleopen | Circle opening out | Spotlight reveal |
| circleclose | Circle closing in | Focus, ending |
| vertopen | Vertical blinds opening | Data, corporate |
| vertclose | Vertical blinds closing | Closing sequence |
| horzopen | Horizontal blinds opening | Reveal |
| horzclose | Horizontal blinds closing | Closing |
| diagtl | Diagonal from top-left | Dynamic, energetic |
| diagtr | Diagonal from top-right | Variety |
| diagbl | Diagonal from bottom-left | Variety |
| diagbr | Diagonal from bottom-right | Variety |
| hlslice | Horizontal left slice | Glitch, tech |
| hrslice | Horizontal right slice | Glitch, tech |
| vuslice | Vertical up slice | Glitch, tech |
| vdslice | Vertical down slice | Glitch, tech |
| dissolve | Random pixel dissolve | Soft, emotional |
| pixelize | Pixelation transition | Retro, gaming |
| hblur | Horizontal blur transition | Speed, motion |
| fadegrays | Fade through grayscale | Dramatic |
| squeezeh | Horizontal squeeze | Compression |
| squeezev | Vertical squeeze | Compression |
Chaining multiple transitions (3+ clips):
ffmpeg -i clip1.mp4 -i clip2.mp4 -i clip3.mp4 -i clip4.mp4 \
-filter_complex "
[0:v][1:v]xfade=transition=fade:duration=0.5:offset=4.5[v01];
[v01][2:v]xfade=transition=slideleft:duration=0.5:offset=9[v012];
[v012][3:v]xfade=transition=circlecrop:duration=0.5:offset=13.5[vout]
" \
-map "[vout]" output.mp4
Key insight: The offset parameter is cumulative: it's the timestamp in the OUTPUT where the transition starts, not the input. For a chain, each offset = previous_offset + clip_duration - transition_duration.
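The offset rule is easy to get wrong by hand once clips have different lengths, so it can help to compute it. The sketch below uses a hypothetical `xfadeOffsets` helper (not part of FFmpeg) that tracks the running length of the output and assumes every transition has the same duration.

```typescript
/**
 * Offsets for a chain of xfade filters.
 * clipDurations: length of each input clip in seconds, in order.
 * transitionSec: duration shared by every transition (an assumption of this sketch).
 */
function xfadeOffsets(clipDurations: number[], transitionSec: number): number[] {
  if (clipDurations.length === 0) return [];
  const offsets: number[] = [];
  let outputLength = clipDurations[0]; // length of the output built so far
  for (let i = 1; i < clipDurations.length; i++) {
    // The transition starts transitionSec before the current output ends.
    const offset = outputLength - transitionSec;
    offsets.push(offset);
    // After the transition, the incoming clip extends the output from the offset.
    outputLength = offset + clipDurations[i];
  }
  return offsets;
}

const chainOffsets = xfadeOffsets([5, 5, 5, 5], 0.5);
console.log(chainOffsets); // [4.5, 9, 13.5], matching the four-clip chain above
```

With unequal clips, for example `[4, 6, 3]`, the same function yields `[3.5, 9]`, which is where hand-written offsets usually drift.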
Extended transitions with xfade-easing:
The xfade-easing project adds CSS easing functions and ported GLSL transitions to FFmpeg's xfade filter, expanding from 40 to 200+ transitions.
Color Grading
Dark Cinematic (finance/business):
ffmpeg -i clip.mp4 -vf \
"eq=brightness=-0.1:contrast=1.3:saturation=0.7,curves=preset=darker" \
output.mp4
Teal and Orange (Hollywood):
ffmpeg -i clip.mp4 -vf \
"colorbalance=rs=0.1:gs=-0.1:bs=-0.2:rh=-0.1:gh=0.05:bh=0.15" \
output.mp4
Cold/Clinical (tech/business):
ffmpeg -i clip.mp4 -vf \
"colorbalance=rs=-0.15:bs=0.15:rh=-0.1:bh=0.1,eq=contrast=1.2:saturation=0.6" \
output.mp4
Black and White High Contrast:
ffmpeg -i clip.mp4 -vf \
"hue=s=0,eq=contrast=1.5:brightness=0.05" \
output.mp4
Warm Nostalgic (lifestyle):
ffmpeg -i clip.mp4 -vf \
"colorbalance=rs=0.1:gs=0.05:bs=-0.1:rh=0.05:gh=0.02:bh=-0.05,eq=saturation=1.1:brightness=0.05" \
output.mp4
Film Grain:
ffmpeg -i clip.mp4 -vf "noise=alls=15:allf=t+u" output.mp4
ffmpeg -i clip.mp4 -vf "noise=alls=30:allf=t+u" output.mp4
ffmpeg -i clip.mp4 -vf "noise=alls=20:allf=t" output.mp4
Vignette:
ffmpeg -i clip.mp4 -vf "vignette=PI/4:1.2" output.mp4
ffmpeg -i clip.mp4 -vf "vignette=PI/6:0.8" output.mp4
ffmpeg -i clip.mp4 -vf "vignette=PI/3:1.5" output.mp4
Complete color grade chain (apply all at once):
ffmpeg -i clip.mp4 -vf \
"eq=brightness=-0.1:contrast=1.3:saturation=0.7,
noise=alls=15:allf=t+u,
vignette=PI/4:1.2" \
-c:v libx264 -c:a copy output.mp4
Text Overlays (drawtext filter)
The drawtext filter renders text on video frames.
Static text with background box:
ffmpeg -i clip.mp4 -vf \
"drawtext=text='90%% FAIL':
fontfile=Inter-Black.ttf:fontsize=100:
fontcolor=white:
box=1:boxcolor=red@0.8:boxborderw=20:
x=(w-text_w)/2:y=h*0.3:
enable='between(t,1,4)'" \
output.mp4
Text with border (no box):
ffmpeg -i clip.mp4 -vf \
"drawtext=text='\$3,200':
fontfile=Inter-Black.ttf:fontsize=120:
fontcolor=white:
borderw=4:bordercolor=black:
x=(w-text_w)/2:y=(h-text_h)/2:
enable='between(t,3,6)'" \
output.mp4
Slide-up from bottom:
ffmpeg -i clip.mp4 -vf \
"drawtext=text='ANNUAL SAVINGS':
fontfile=Inter-Black.ttf:fontsize=80:fontcolor=white:
borderw=3:bordercolor=black:
x=(w-text_w)/2:
y='if(between(t,3,3.5), h - (h - h*0.4)*(t-3)/0.5, h*0.4)':
enable='between(t,3,7)'" \
output.mp4
Counter animation (counting up to a number):
ffmpeg -i clip.mp4 -vf \
"drawtext=text='\$%{eif\\:min(floor((t-2)*1600)\\,3200)\\:d}':
fontfile=Inter-Black.ttf:fontsize=120:fontcolor=white:
borderw=4:bordercolor=black:
x=(w-text_w)/2:y=h*0.4:
enable='between(t,2,5)'" \
output.mp4
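The counter expression is easier to reason about outside FFmpeg. The sketch below mirrors `min(floor((t-2)*1600), 3200)` in TypeScript; `counterValue` and `rateFor` are hypothetical helpers for checking the numbers before a render, not part of any FFmpeg API.

```typescript
/** Value shown at time t: counts up from startSec at ratePerSec, capped at target. */
function counterValue(t: number, startSec: number, ratePerSec: number, target: number): number {
  return Math.min(Math.floor((t - startSec) * ratePerSec), target);
}

/** Rate needed to reach target in rampSec seconds. */
function rateFor(target: number, rampSec: number): number {
  return target / rampSec;
}

// $3,200 reached over 2 seconds needs 1600 per second, the multiplier used above.
const rate = rateFor(3200, 2);
for (const t of [2, 2.5, 3, 4, 5]) {
  console.log(t, counterValue(t, 2, rate, 3200)); // 0, 800, 1600, 3200, 3200
}
```

The `enable='between(t,2,5)'` window hides the text before the counter starts, so the negative values the formula would produce for earlier timestamps never appear on screen.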
Countdown timer:
ffmpeg -i clip.mp4 -vf \
"drawtext=text='%{eif\\:10-floor(t)\\:d}':
fontfile=Inter-Black.ttf:fontsize=200:fontcolor=red:
borderw=5:bordercolor=white:
x=(w-text_w)/2:y=(h-text_h)/2:
enable='between(t,0,10)'" \
output.mp4
Fade-in text (opacity animation):
ffmpeg -i clip.mp4 -vf \
"drawtext=text='THE LLC TRAP':
fontfile=Inter-Black.ttf:fontsize=100:
fontcolor=white:borderw=4:bordercolor=black:
alpha='if(lt(t,2.5),(t-2)*2,1)':
x=(w-text_w)/2:y=h*0.3:
enable='between(t,2,5)'" \
output.mp4
Multiple text overlays (stacked data):
ffmpeg -i clip.mp4 -vf "
drawtext=text='LLC COST BREAKDOWN':fontfile=Inter-Black.ttf:fontsize=60:
fontcolor=white:borderw=3:bordercolor=black:
x=(w-text_w)/2:y=h*0.15:enable='between(t,1,8)',
drawtext=text='State Fee\: \$100':fontfile=Inter-Bold.ttf:fontsize=48:
fontcolor=white:x=(w-text_w)/2:y=h*0.30:enable='between(t,2,8)',
drawtext=text='Agent\: \$120':fontfile=Inter-Bold.ttf:fontsize=48:
fontcolor=white:x=(w-text_w)/2:y=h*0.38:enable='between(t,3,8)',
drawtext=text='EIN\: FREE':fontfile=Inter-Bold.ttf:fontsize=48:
fontcolor=green:x=(w-text_w)/2:y=h*0.46:enable='between(t,4,8)',
drawtext=text='TOTAL\: \$220':fontfile=Inter-Black.ttf:fontsize=72:
fontcolor=yellow:borderw=4:bordercolor=black:
x=(w-text_w)/2:y=h*0.58:enable='between(t,5,8)'
" output.mp4
Audio Effects
Voice + music mixing (voice at 100%, music at 15%):
ffmpeg -i voice.mp3 -i music.mp3 \
-filter_complex "[1:a]volume=0.15[bg];[0:a][bg]amix=inputs=2:duration=first" \
output.mp3
Sidechain compression (duck music when voice plays):
ffmpeg -i voice.mp3 -i music.mp3 \
-filter_complex \
"[0:a]asplit=2[voice][key];[1:a]volume=0.3[bg];[bg][key]sidechaincompress=threshold=0.02:ratio=8:attack=200:release=1000[ducked];[voice][ducked]amix=inputs=2:duration=first" \
output.mp3
| Parameter | Value | Effect |
|---|---|---|
| threshold | 0.02 | Sensitivity: lower = more ducking |
| ratio | 8 | Compression amount: higher = more aggressive duck |
| attack | 200ms | How fast music ducks when voice starts |
| release | 1000ms | How fast music returns when voice stops |
Sound effect at specific timestamp:
ffmpeg -i main.mp4 -i whoosh.mp3 \
-filter_complex "[1:a]adelay=3000|3000,volume=0.5[sfx];[0:a][sfx]amix=inputs=2:duration=first" \
output.mp4
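When a video needs several cues, the filter graph can be generated from a list instead of being written by hand. The sketch below assumes a hypothetical `SfxCue` shape and `buildSfxFilter` helper. It relies only on the `adelay`, `volume`, and `amix` filters already used in this section, and assumes input 0 is the main video with audio while each cue names the input index of its sound file.

```typescript
interface SfxCue {
  inputIndex: number; // position of the sound file among the -i inputs (main video is 0)
  atMs: number;       // when the effect should start, in milliseconds
  volume: number;     // linear gain, e.g. 0.5
}

/** Delay and scale each cue, mix the cues together, then mix them under the main audio. */
function buildSfxFilter(cues: SfxCue[]): string {
  if (cues.length === 0) throw new Error('at least one cue is required');
  const chains = cues.map(
    // adelay takes one delay per channel, hence the repeated value for stereo
    (c, i) => `[${c.inputIndex}:a]adelay=${c.atMs}|${c.atMs},volume=${c.volume}[sfx${i + 1}]`,
  );
  const labels = cues.map((_, i) => `[sfx${i + 1}]`).join('');
  return [
    ...chains,
    `${labels}amix=inputs=${cues.length}[all_sfx]`,
    '[0:a][all_sfx]amix=inputs=2:duration=first[out]',
  ].join(';');
}

const sfxFilter = buildSfxFilter([
  { inputIndex: 1, atMs: 500, volume: 0.7 },
  { inputIndex: 2, atMs: 4000, volume: 0.5 },
  { inputIndex: 3, atMs: 7000, volume: 0.4 },
]);
console.log(sfxFilter);
```

The string would be passed to `-filter_complex` together with `-map 0:v -map "[out]"`. Cue times could come from the scene-change timestamps computed when planning visuals.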
Multiple SFX at different timestamps:
ffmpeg -i main.mp4 -i bass_drop.mp3 -i whoosh.mp3 -i cash.mp3 \
-filter_complex "
[1:a]adelay=500|500,volume=0.7[sfx1];
[2:a]adelay=4000|4000,volume=0.5[sfx2];
[3:a]adelay=7000|7000,volume=0.4[sfx3];
[sfx1][sfx2][sfx3]amix=inputs=3[all_sfx];
[0:a][all_sfx]amix=inputs=2:duration=first[out]
" -map 0:v -map "[out]" output.mp4
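The SFX chain follows a regular shape (one `adelay` per cue, one `amix` to combine them, one `amix` against the main audio), so it lends itself to generation from a cue list. This is a minimal sketch under stated assumptions: `SfxCue` and `buildSfxGraph` are hypothetical names, cue times are given in seconds and converted to the milliseconds that `adelay` expects, and at least two cues are supplied.

```typescript
// Hypothetical cue description: which ffmpeg input holds the effect, when it fires, how loud.
interface SfxCue {
  inputIndex: number; // position of the SFX file among the -i inputs
  atSec: number;      // when the effect should start, in seconds
  volume: number;     // linear gain
}

function buildSfxGraph(cues: SfxCue[], mainAudioLabel: string = '0:a'): string {
  const parts = cues.map((cue, i) => {
    const ms = Math.round(cue.atSec * 1000); // adelay takes milliseconds, one value per channel
    return `[${cue.inputIndex}:a]adelay=${ms}|${ms},volume=${cue.volume}[sfx${i + 1}]`;
  });
  const labels = cues.map((_cue, i) => `[sfx${i + 1}]`).join('');
  parts.push(`${labels}amix=inputs=${cues.length}[all_sfx]`);
  parts.push(`[${mainAudioLabel}][all_sfx]amix=inputs=2:duration=first[out]`);
  return parts.join(';');
}

const sfxGraph = buildSfxGraph([
  { inputIndex: 1, atSec: 0.5, volume: 0.7 },
  { inputIndex: 2, atSec: 4, volume: 0.5 },
  { inputIndex: 3, atSec: 7, volume: 0.4 },
]);
```

Keeping cue times in seconds matches the `enable='between(t,...)'` expressions used for text, which makes it easier to line sound effects up with on-screen events.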
Audio normalization:
ffmpeg -i audio.mp3 -filter:a "loudnorm=I=-16:TP=-1.5:LRA=11" normalized.mp3
ffmpeg -i audio.mp3 -filter:a "dynaudnorm" normalized.mp3
Speed Effects
Slow motion (0.5x):
ffmpeg -i clip.mp4 -vf "setpts=2.0*PTS" -af "atempo=0.5" slow.mp4
Fast motion (2x):
ffmpeg -i clip.mp4 -vf "setpts=0.5*PTS" -af "atempo=2.0" fast.mp4
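Speed ramp (normal to slow to normal). The timestamp mapping has to stay continuous and increasing, so frames after the slow section are shifted by the extra 2 seconds; audio is dropped because setpts does not retime it:
ffmpeg -i clip.mp4 -vf \
"setpts='if(lt(T,3),PTS,if(lt(T,5),PTS+(T-3)/TB,PTS+2/TB))'" \
-an speed_ramp.mp4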
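A speed ramp is a piecewise mapping from input time to output time, and it only works if that mapping is continuous and never moves backwards. The sketch below spells out the arithmetic and derives a matching `setpts` expression. `rampedTime` and `buildRampSetpts` are hypothetical helper names, and the expression relies on the `T`, `PTS`, and `TB` variables that the setpts filter documents.

```typescript
// Map an input timestamp to an output timestamp when the window [startSec, endSec)
// plays at `speed` (0.5 = half speed) and everything else plays at normal speed.
function rampedTime(tSec: number, startSec: number, endSec: number, speed: number): number {
  if (tSec < startSec) return tSec;
  if (tSec < endSec) return startSec + (tSec - startSec) / speed;
  // After the window, keep normal speed but carry the accumulated delay forward.
  return startSec + (endSec - startSec) / speed + (tSec - endSec);
}

// Build the equivalent setpts expression.
function buildRampSetpts(startSec: number, endSec: number, speed: number): string {
  const stretch = 1 / speed - 1;                     // extra output time per input second inside the window
  const totalExtra = (endSec - startSec) * stretch;  // delay carried by every frame after the window
  return `setpts='if(lt(T,${startSec}),PTS,` +
    `if(lt(T,${endSec}),PTS+(T-${startSec})*${stretch}/TB,PTS+${totalExtra}/TB))'`;
}

const rampExpr = buildRampSetpts(3, 5, 0.5);
```

With a 3-5 second window at half speed, input second 4 lands at output second 5 and input second 6 lands at output second 8, so the clip ends up 2 seconds longer.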
Time-lapse (4x speed):
ffmpeg -i clip.mp4 -vf "setpts=0.25*PTS" -an timelapse.mp4
Overlay and Picture-in-Picture
Basic overlay (logo/watermark):
ffmpeg -i main.mp4 -i logo.png \
-filter_complex "[1:v]scale=100:-1[logo];[0:v][logo]overlay=W-w-20:20" \
output.mp4
Picture-in-picture with border:
ffmpeg -i main.mp4 -i pip.mp4 \
-filter_complex "
[1:v]scale=320:180,pad=iw+8:ih+8:4:4:color=white[pip];
[0:v][pip]overlay=W-w-20:H-h-20
" output.mp4
Opacity blending:
ffmpeg -i background.mp4 -i overlay.png \
-filter_complex "
[1:v]format=rgba,colorchannelmixer=aa=0.5[fg];
[0:v][fg]overlay=0:0
" output.mp4
Drop shadow effect (simulate with offset dark copy):
ffmpeg -i background.mp4 -i subject.png \
-filter_complex "
[1:v]scale=800:-1[subj];
[subj]split[shadow][main];
[shadow]colorchannelmixer=rr=0:gg=0:bb=0:aa=0.5,
boxblur=10:10[shadow_blur];
[0:v][shadow_blur]overlay=(W-w)/2+5:(H-h)/2+5[with_shadow];
[with_shadow][main]overlay=(W-w)/2:(H-h)/2
" output.mp4
The first 3 seconds determine everything. These 6 hook archetypes are proven patterns from viral content analysis.
The 6 Archetypes
interface HookArchetype {
name: string;
pattern: string;
psychology: string;
examples: string[];
visualTreatment: string;
}
const ARCHETYPES: HookArchetype[] = [
{
name: 'Fortuneteller',
pattern: 'Teases a future outcome the viewer wants',
psychology: 'Curiosity gap: viewer must watch to see if it applies to them',
examples: [
'How to double your savings in 2026',
'Your LLC is about to cost you $3,200 less',
'What your accountant won\'t tell you about Q4',
],
visualTreatment: 'Crystal ball / chart trending up / calendar with circled date',
},
{
name: 'Magician',
pattern: 'Reveals a surprising condensation or transformation',
psychology: 'Value perception: massive input compressed into digestible output',
examples: [
'I condensed 50 finance books into 60 seconds',
'The entire tax code in one sentence',
'10 years of investing mistakes so you don\'t have to',
],
visualTreatment: 'Stack of books → single page / time-lapse / before-after split',
},
{
name: 'Contrarian',
pattern: 'Challenges commonly accepted knowledge',
psychology: 'Pattern interrupt: brain flags contradictions as important',
examples: [
'Stock market experts HATE this one simple rule',
'Stop saving for retirement (here\'s why)',
'The LLC advice everyone gives is dead wrong',
],
visualTreatment: 'Red X over conventional wisdom / crossed-out text / head shake',
},
{
name: 'Provocateur',
pattern: 'Makes a controversial or emotionally charged statement',
psychology: 'Emotional activation: anger/outrage increases engagement',
examples: [
'California is robbing you blind with this tax',
'Your bank is stealing $400/year and you don\'t know it',
'The IRS designed this system to keep you poor',
],
visualTreatment: 'Red background / alarm / angry emoji / bold accusatory text',
},
{
name: 'Statistician',
pattern: 'Opens with a shocking number',
psychology: 'Concrete specificity signals authority and triggers recall',
examples: [
'$1.4 billion evaporated in 48 hours',
'93% of LLCs overpay by $2,100 per year',
'1 in 4 Americans can\'t cover a $400 emergency',
],
visualTreatment: 'Large number filling screen / counter animation / data visualization',
},
{
name: 'Questioner',
pattern: 'Poses an engaging question the viewer wants answered',
psychology: 'Open loop: the brain seeks closure on unanswered questions',
examples: [
'How much do you REALLY need to quit your job?',
'What would you do with an extra $3,200?',
'Are you in the 93% who overpay on taxes?',
],
visualTreatment: 'Question mark animation / thinking emoji / person looking puzzled',
},
];
Hook Visual Treatment Matrix
| Archetype | Text Style | Background | Animation | SFX |
|---|---|---|---|---|
| Fortuneteller | Gold/white, elegant | Dark, moody | Slow zoom in | Mystical tone |
| Magician | Bold white, large | Book/knowledge imagery | Scale pop | Whoosh |
| Contrarian | Red/white, aggressive | Crossed-out text | Shake effect | Record scratch |
| Provocateur | Red bold, ALL CAPS | Dark red gradient | Flash/pulse | Bass drop |
| Statistician | Yellow/white numbers | Dark with data viz | Counter animation | Cash register |
| Questioner | White italic | Person thinking | Typewriter | Question sound |
Implementing Hooks in FFmpeg
ffmpeg -i background.mp4 -i bass_drop.mp3 \
-filter_complex "
[0:v]eq=brightness=-0.15:contrast=1.4:saturation=0.6,
vignette=PI/3:1.3[bg];
[bg]drawtext=text='\$%{eif\\:min(floor((t-0.5)*700000000)\\,1400000000)\\:d}':
fontfile=Inter-Black.ttf:fontsize=80:fontcolor=yellow:
borderw=4:bordercolor=black:
x=(w-text_w)/2:y=h*0.35:
enable='between(t,0.5,3)',
drawtext=text='evaporated in 48 hours':
fontfile=Inter-Bold.ttf:fontsize=50:fontcolor=white:
x=(w-text_w)/2:y=h*0.35+100:
enable='between(t,1.5,3)'[hooked];
[1:a]adelay=500|500,volume=0.7[sfx];
[hooked]null[vout]
" \
-map "[vout]" -map "[sfx]" \
-c:v libx264 -c:a aac -t 3 hook.mp4
Pattern 1: The Data Story (Finance/Business)
A complete video that reveals a financial insight with progressive data disclosure.
When to use: Financial education, tax tips, business analysis, calculator demos.
#!/bin/bash
FONT_BLACK="Inter-Black.ttf"
FONT_BOLD="Inter-Bold.ttf"
FONT_REG="Inter-Regular.ttf"
RESOLUTION="1080x1920"
FPS=30
ffmpeg -loop 1 -i assets/office.jpg -vf \
"zoompan=z='min(zoom+0.002,1.2)':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d=$((FPS*4)):s=${RESOLUTION}:fps=${FPS}" \
-t 4 -c:v libx264 -pix_fmt yuv420p /tmp/scene1.mp4
ffmpeg -loop 1 -i assets/calculator.jpg -vf \
"zoompan=z='1.15':x='if(eq(on,1),0,min(x+2,iw-iw/zoom))':y='ih/2-(ih/zoom/2)':d=$((FPS*4)):s=${RESOLUTION}:fps=${FPS}" \
-t 4 -c:v libx264 -pix_fmt yuv420p /tmp/scene2.mp4
ffmpeg -loop 1 -i assets/money.jpg -vf \
"zoompan=z='if(eq(on,1),1.3,max(zoom-0.002,1.0))':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d=$((FPS*4)):s=${RESOLUTION}:fps=${FPS}" \
-t 4 -c:v libx264 -pix_fmt yuv420p /tmp/scene3.mp4
ffmpeg -loop 1 -i assets/document.jpg -vf \
"zoompan=z='1.15':x='iw/2-(iw/zoom/2)':y='if(eq(on,1),0,min(y+1.5,ih-ih/zoom))':d=$((FPS*4)):s=${RESOLUTION}:fps=${FPS}" \
-t 4 -c:v libx264 -pix_fmt yuv420p /tmp/scene4.mp4
ffmpeg -loop 1 -i assets/savings.jpg -vf \
"zoompan=z='min(zoom+0.003,1.25)':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d=$((FPS*4)):s=${RESOLUTION}:fps=${FPS}" \
-t 4 -c:v libx264 -pix_fmt yuv420p /tmp/scene5.mp4
ffmpeg -loop 1 -i assets/chart_up.jpg -vf \
"zoompan=z='1.2':x='if(eq(on,1),0,min(x+1,iw-iw/zoom))':y='if(eq(on,1),0,min(y+0.7,ih-ih/zoom))':d=$((FPS*5)):s=${RESOLUTION}:fps=${FPS}" \
-t 5 -c:v libx264 -pix_fmt yuv420p /tmp/scene6.mp4
ffmpeg -i /tmp/scene1.mp4 -i /tmp/scene2.mp4 -i /tmp/scene3.mp4 \
-i /tmp/scene4.mp4 -i /tmp/scene5.mp4 -i /tmp/scene6.mp4 \
-filter_complex "
[0:v][1:v]xfade=transition=fade:duration=0.5:offset=3.5[v01];
[v01][2:v]xfade=transition=slideleft:duration=0.5:offset=7[v012];
[v012][3:v]xfade=transition=dissolve:duration=0.5:offset=10.5[v0123];
[v0123][4:v]xfade=transition=smoothup:duration=0.5:offset=14[v01234];
[v01234][5:v]xfade=transition=circlecrop:duration=0.5:offset=17.5[vchain]
" -map "[vchain]" -c:v libx264 -pix_fmt yuv420p /tmp/chained.mp4
ffmpeg -i /tmp/chained.mp4 -vf "
eq=brightness=-0.1:contrast=1.3:saturation=0.7,
noise=alls=12:allf=t+u,
vignette=PI/4:1.2,
drawtext=text='93% of LLCs overpay':expansion=none:fontfile=${FONT_BLACK}:fontsize=80:
fontcolor=white:borderw=4:bordercolor=black:
x=(w-text_w)/2:y=h*0.35:enable='between(t,0,3)',
drawtext=text='Here is what they miss':fontfile=${FONT_BOLD}:fontsize=50:
fontcolor=white@0.8:x=(w-text_w)/2:y=h*0.35+100:
enable='between(t,0.5,3)',
drawtext=text='State Fee':fontfile=${FONT_BOLD}:fontsize=48:
fontcolor=white:x=100:y=h*0.25:enable='between(t,4,10)',
drawtext=text='\$100':fontfile=${FONT_BLACK}:fontsize=60:
fontcolor=green:x=w-300:y=h*0.25:enable='between(t,4,10)',
drawtext=text='Agent Fee':fontfile=${FONT_BOLD}:fontsize=48:
fontcolor=white:x=100:y=h*0.33:enable='between(t,5,10)',
drawtext=text='\$0':fontfile=${FONT_BLACK}:fontsize=60:
fontcolor=green:x=w-300:y=h*0.33:enable='between(t,5,10)',
drawtext=text='Tax Savings':fontfile=${FONT_BOLD}:fontsize=48:
fontcolor=white:x=100:y=h*0.41:enable='between(t,6,10)',
drawtext=text='-\$3,200':fontfile=${FONT_BLACK}:fontsize=60:
fontcolor=yellow:x=w-350:y=h*0.41:enable='between(t,6,10)',
drawtext=text='TOTAL SAVINGS':fontfile=${FONT_BLACK}:fontsize=70:
fontcolor=yellow:borderw=4:bordercolor=black:
x=(w-text_w)/2:y=h*0.55:enable='between(t,8,12)',
drawtext=text='\$%{eif\\:min(floor((t-8)*1600)\\,3200)\\:d}':
fontfile=${FONT_BLACK}:fontsize=140:fontcolor=white:
borderw=5:bordercolor=black:
x=(w-text_w)/2:y=h*0.63:enable='between(t,8,12)',
drawtext=text='Link in bio':fontfile=${FONT_BOLD}:fontsize=40:
fontcolor=white@0.7:x=(w-text_w)/2:y=h*0.85:
enable='between(t,10,14)'
" -c:v libx264 -c:a copy /tmp/graded.mp4
ffmpeg -i /tmp/graded.mp4 -i voiceover.mp3 -i music_dark_cinematic.mp3 \
-i sfx/bass_drop.mp3 -i sfx/whoosh.mp3 -i sfx/cash_register.mp3 \
-filter_complex "
[2:a]volume=0.15,afade=t=in:ss=0:d=2,afade=t=out:st=19:d=2[music];
[3:a]adelay=500|500,volume=0.7[bass];
[4:a]adelay=3500|3500,volume=0.4[whoosh1];
[4:a]adelay=7000|7000,volume=0.4[whoosh2];
[5:a]adelay=8000|8000,volume=0.5[cash];
[bass][whoosh1][whoosh2][cash]amix=inputs=4[all_sfx];
[music][all_sfx]amix=inputs=2[bg_audio];
[1:a]asplit=2[vo][sc];
[bg_audio][sc]sidechaincompress=threshold=0.02:ratio=8:attack=200:release=1000[ducked];
[vo][ducked]amix=inputs=2:duration=first[final_audio]
" \
-map 0:v -map "[final_audio]" \
-c:v copy -c:a aac -b:a 192k -shortest output.mp4
Gotchas:
- The offset in xfade chains is measured from the start of the output timeline, not from the start of the incoming clip
- adelay values are in milliseconds, not seconds
- sidechaincompress takes the stream to be compressed (the music) as its first input and the sidechain signal that triggers the compression (the voice) as its second; its output contains only the ducked music, so mix the voice back in afterwards
- enable='between(t,start,end)' uses seconds in the output timeline
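The xfade offset arithmetic is easy to get wrong by hand once clip lengths differ. Each offset equals the length of the output built so far minus the transition duration, and each join shortens the total by one overlap. A minimal sketch (the helper name `xfadeOffsets` is hypothetical, and it assumes every transition in the chain has the same duration):

```typescript
// Compute the xfade `offset` value for each transition in a chain.
function xfadeOffsets(clipDurationsSec: number[], transitionSec: number): number[] {
  const offsets: number[] = [];
  let outputLength = 0; // length of the chained output so far, in seconds
  clipDurationsSec.forEach((duration, i) => {
    if (i === 0) {
      outputLength = duration;
      return;
    }
    // The transition starts `transitionSec` before the current output ends.
    const offset = outputLength - transitionSec;
    offsets.push(Math.round(offset * 1000) / 1000); // trim floating-point noise
    // The incoming clip starts at `offset`, so the new total is offset + its length.
    outputLength = offset + duration;
  });
  return offsets;
}

// Six scenes of 4, 4, 4, 4, 4 and 5 seconds with 0.5s transitions.
const dataStoryOffsets = xfadeOffsets([4, 4, 4, 4, 4, 5], 0.5);
// → [3.5, 7, 10.5, 14, 17.5]
```

The same function covers the 2-second montage clips with 0.3s transitions, where the offsets advance by 1.7 seconds per clip.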
Pattern 2: The Comparison Grid (Product/Tool)
Side-by-side comparison with animated data points.
When to use: Tool comparisons, product reviews, A-vs-B decisions.
#!/bin/bash
# Filtergraph steps, in order. FFmpeg filtergraphs have no comment syntax,
# so the notes live here instead of inside the -filter_complex string:
#   1. Scale both inputs to half width, grade them, and pad to full height
#   2. Stack horizontally
#   3. Add the two labels and the VS divider
#   4. Show the cost comparison at 3s and the savings callout at 6s
ffmpeg -i left_approach.mp4 -i right_approach.mp4 \
-filter_complex "
[0:v]scale=540:960,
eq=brightness=-0.05:contrast=1.2:saturation=0.8,
pad=540:1920:0:0:black[left];
[1:v]scale=540:960,
eq=brightness=-0.05:contrast=1.2:saturation=0.8,
pad=540:1920:0:960:black[right];
[left][right]hstack[grid];
[grid]drawtext=text='TRADITIONAL':fontfile=Inter-Black.ttf:fontsize=40:
fontcolor=red:x=270-text_w/2:y=50:enable='between(t,0,15)',
drawtext=text='OPTIMIZED':fontfile=Inter-Black.ttf:fontsize=40:
fontcolor=green:x=810-text_w/2:y=50:enable='between(t,0,15)',
drawtext=text='VS':fontfile=Inter-Black.ttf:fontsize=60:
fontcolor=yellow:box=1:boxcolor=black@0.8:boxborderw=15:
x=(w-text_w)/2:y=h*0.48:enable='between(t,0,15)',
drawtext=text='\$4,500/yr':fontfile=Inter-Black.ttf:fontsize=50:
fontcolor=red:x=270-text_w/2:y=h*0.6:enable='between(t,3,15)',
drawtext=text='\$1,300/yr':fontfile=Inter-Black.ttf:fontsize=50:
fontcolor=green:x=810-text_w/2:y=h*0.6:enable='between(t,3,15)',
drawtext=text='SAVE \$3,200':fontfile=Inter-Black.ttf:fontsize=70:
fontcolor=yellow:borderw=4:bordercolor=black:
x=(w-text_w)/2:y=h*0.75:enable='between(t,6,15)'
" \
-c:v libx264 -pix_fmt yuv420p output.mp4
Pattern 3: The Explainer (Educational/How-To)
Step-by-step tutorial with numbered progression.
When to use: How-to guides, process explanations, tutorial content.
#!/bin/bash
STEPS=("Open the calculator" "Enter your income" "Select LLC type" "Review deductions" "See your savings")
IMAGES=(calculator.jpg income.jpg llc_type.jpg deductions.jpg savings.jpg)
MOVEMENTS=("zoom_in" "pan_right" "zoom_out" "pan_down" "zoom_in")
for i in "${!STEPS[@]}"; do
STEP_NUM=$((i + 1))
STEP_TEXT="${STEPS[$i]}"
# Determine zoompan expression based on movement type
case "${MOVEMENTS[$i]}" in
zoom_in) ZP="z='min(zoom+0.002,1.2)':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)'" ;;
zoom_out) ZP="z='if(eq(on,1),1.3,max(zoom-0.002,1.0))':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)'" ;;
pan_right) ZP="z='1.15':x='if(eq(on,1),0,min(x+2,iw-iw/zoom))':y='ih/2-(ih/zoom/2)'" ;;
pan_down) ZP="z='1.15':x='iw/2-(iw/zoom/2)':y='if(eq(on,1),0,min(y+1.5,ih-ih/zoom))'" ;;
esac
ffmpeg -loop 1 -i "assets/${IMAGES[$i]}" -vf "
zoompan=${ZP}:d=120:s=1080x1920:fps=30,
eq=brightness=-0.08:contrast=1.2:saturation=0.8,
vignette=PI/5:1.0,
drawtext=text='STEP ${STEP_NUM}':fontfile=Inter-Black.ttf:fontsize=90:
fontcolor=yellow:borderw=4:bordercolor=black:
x=(w-text_w)/2:y=h*0.25,
drawtext=text='${STEP_TEXT}':fontfile=Inter-Bold.ttf:fontsize=50:
fontcolor=white:x=(w-text_w)/2:y=h*0.25+120
" -t 4 -c:v libx264 -pix_fmt yuv420p "/tmp/step${STEP_NUM}.mp4"
done
ffmpeg -i /tmp/step1.mp4 -i /tmp/step2.mp4 -i /tmp/step3.mp4 \
-i /tmp/step4.mp4 -i /tmp/step5.mp4 \
-filter_complex "
[0:v][1:v]xfade=transition=slideleft:duration=0.5:offset=3.5[v01];
[v01][2:v]xfade=transition=slideright:duration=0.5:offset=7[v012];
[v012][3:v]xfade=transition=slideleft:duration=0.5:offset=10.5[v0123];
[v0123][4:v]xfade=transition=circlecrop:duration=0.5:offset=14[vout]
" -map "[vout]" output.mp4
Pattern 4: The Montage (Motivation/Compilation)
Rapid-fire clips with music-driven pacing.
When to use: Motivational content, compilations, brand sizzle reels.
#!/bin/bash
CLIPS=(clip1.mp4 clip2.mp4 clip3.mp4 clip4.mp4 clip5.mp4
clip6.mp4 clip7.mp4 clip8.mp4 clip9.mp4 clip10.mp4)
TRANSITIONS=(fade slideleft dissolve diagtl smoothup
circlecrop slidedown fadeblack radial fade)
for i in "${!CLIPS[@]}"; do
ffmpeg -i "assets/${CLIPS[$i]}" -vf "
scale=1080:1920:force_original_aspect_ratio=increase,crop=1080:1920,
zoompan=z='min(pzoom+0.004,1.3)':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d=1:s=1080x1920:fps=30,
eq=brightness=-0.15:contrast=1.5:saturation=0.5,
noise=alls=20:allf=t+u,
vignette=PI/3:1.4
" -t 2 -c:v libx264 -pix_fmt yuv420p "/tmp/montage_${i}.mp4"
done
ffmpeg \
-i /tmp/montage_0.mp4 -i /tmp/montage_1.mp4 -i /tmp/montage_2.mp4 \
-i /tmp/montage_3.mp4 -i /tmp/montage_4.mp4 -i /tmp/montage_5.mp4 \
-i /tmp/montage_6.mp4 -i /tmp/montage_7.mp4 -i /tmp/montage_8.mp4 \
-i /tmp/montage_9.mp4 \
-filter_complex "
[0:v][1:v]xfade=transition=${TRANSITIONS[0]}:duration=0.3:offset=1.7[v01];
[v01][2:v]xfade=transition=${TRANSITIONS[1]}:duration=0.3:offset=3.4[v02];
[v02][3:v]xfade=transition=${TRANSITIONS[2]}:duration=0.3:offset=5.1[v03];
[v03][4:v]xfade=transition=${TRANSITIONS[3]}:duration=0.3:offset=6.8[v04];
[v04][5:v]xfade=transition=${TRANSITIONS[4]}:duration=0.3:offset=8.5[v05];
[v05][6:v]xfade=transition=${TRANSITIONS[5]}:duration=0.3:offset=10.2[v06];
[v06][7:v]xfade=transition=${TRANSITIONS[6]}:duration=0.3:offset=11.9[v07];
[v07][8:v]xfade=transition=${TRANSITIONS[7]}:duration=0.3:offset=13.6[v08];
[v08][9:v]xfade=transition=${TRANSITIONS[8]}:duration=0.3:offset=15.3[vout]
" -map "[vout]" montage.mp4
ffmpeg -i montage.mp4 -i music_epic.mp3 \
-filter_complex "[1:a]volume=0.4,afade=t=in:d=1,afade=t=out:st=16:d=1[m];[m]atrim=0:17[mt]" \
-map 0:v -map "[mt]" -c:v copy -c:a aac -shortest output.mp4
Example 1: Quick Ken Burns from a Single Image
ffmpeg -loop 1 -i photo.jpg -vf \
"zoompan=z='min(zoom+0.0015,1.15)':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d=150:s=1080x1920:fps=30" \
-t 5 -c:v libx264 -pix_fmt yuv420p output.mp4
Example 2: Crossfade Between Two Clips
ffmpeg -i clip1.mp4 -i clip2.mp4 \
-filter_complex "[0:v][1:v]xfade=transition=dissolve:duration=1:offset=4" \
-c:v libx264 output.mp4
Example 3: Add Background Music with Auto-Ducking
ffmpeg -i video_with_voice.mp4 -i background_music.mp3 \
-filter_complex "
[0:a]asplit=2[vo][sc];
[1:a]volume=0.15[music];
[music][sc]sidechaincompress=threshold=0.02:ratio=8:attack=200:release=1000[ducked];
[vo][ducked]amix=inputs=2:duration=first[audio]
" \
-map 0:v -map "[audio]" -c:v copy -c:a aac output.mp4
Example 4: Animated Counter (0 to $10,000)
ffmpeg -i background.mp4 -vf \
"drawtext=text='\$%{eif\\:min(floor((t-1)*5000)\\,10000)\\:d}':
fontfile=Inter-Black.ttf:fontsize=140:fontcolor=white:
borderw=5:bordercolor=black:
x=(w-text_w)/2:y=(h-text_h)/2:
enable='between(t,1,4)'" \
-c:v libx264 -c:a copy output.mp4
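The counter's multiplier decides whether the number actually lands on its target before the overlay disappears: the rate has to be at least target divided by the ramp time, and `min()` then holds the final value. The sketch below computes that rate and emits the drawtext text before any shell or filtergraph escaping is applied. `buildCounterText` is a hypothetical helper, and the 32-bit guard reflects my understanding that `eif` with the `d` format prints an integer; treat it as an assumption to verify against your FFmpeg build.

```typescript
// Build the unescaped drawtext `text` value for a counter that climbs from 0 to `target`.
// Escaping of ':' and ',' for the shell and filtergraph layers is left to the caller.
function buildCounterText(target: number, startSec: number, rampSec: number, prefix: string = ''): string {
  if (rampSec <= 0) throw new RangeError('rampSec must be positive');
  // Assumption: eif with the 'd' format is limited to the signed 32-bit integer range.
  if (target > 2147483647) throw new RangeError('target exceeds the assumed 32-bit integer range');
  // Round up so the counter reaches the target by the end of the ramp instead of stopping one short.
  const ratePerSec = Math.ceil(target / rampSec);
  return `${prefix}%{eif:min(floor((t-${startSec})*${ratePerSec}),${target}):d}`;
}

// Counts to 3200 over 2 seconds, starting at t=8.
const savingsCounter = buildCounterText(3200, 8, 2, '$');
```

Leaving a hold period after the ramp (for example a 2-second ramp inside a 3-second `enable` window) gives viewers time to read the final number.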
Example 5: Film Grain + Vignette + Color Grade in One Pass
ffmpeg -i raw_clip.mp4 -vf \
"eq=brightness=-0.1:contrast=1.3:saturation=0.7,
noise=alls=15:allf=t+u,
vignette=PI/4:1.2" \
-c:v libx264 -c:a copy cinematic.mp4
Example 6: Text with Background Box (Title Card)
ffmpeg -i clip.mp4 -vf \
"drawtext=text='THE HIDDEN TAX TRAP':
fontfile=Inter-Black.ttf:fontsize=70:fontcolor=white:
box=1:boxcolor=red@0.85:boxborderw=25:
x=(w-text_w)/2:y=h*0.4:
enable='between(t,0,4)'" \
-c:v libx264 -c:a copy output.mp4
Example 7: Sound Effect at Specific Timestamp
ffmpeg -i main_video.mp4 -i sfx/whoosh.mp3 \
-filter_complex "
[1:a]adelay=3000|3000,volume=0.5[sfx];
[0:a][sfx]amix=inputs=2:duration=first[audio]
" \
-map 0:v -map "[audio]" -c:v copy -c:a aac output.mp4
Example 8: Speed Ramp (Normal to Slow-Mo)
ffmpeg -i action_clip.mp4 -filter_complex "
[0:v]setpts='if(lt(T,3),PTS,if(lt(T,5),PTS+(T-3)/TB,PTS+2/TB))'[v]
" -map "[v]" -an -c:v libx264 output.mp4
Example 9: Picture-in-Picture Overlay
ffmpeg -i main.mp4 -i pip_source.mp4 \
-filter_complex "
[1:v]scale=280:500[pip];
[0:v][pip]overlay=W-w-30:H-h-30:enable='between(t,5,15)'[out]
" \
-map "[out]" -map 0:a -c:v libx264 -c:a copy output.mp4
Example 10: Teal and Orange Hollywood Grade
ffmpeg -i footage.mp4 -vf \
"colorbalance=rs=-0.1:gs=0.05:bs=0.15:rh=0.1:gh=-0.1:bh=-0.2,
eq=contrast=1.1:saturation=1.2,
vignette=PI/5:1.0" \
-c:v libx264 -c:a copy hollywood.mp4
When FFmpeg is not enough, or when you need to scale beyond local rendering, these cloud APIs provide programmatic video creation.
Feature Comparison
| Feature | FFmpeg (local) | Shotstack | Creatomate | JSON2Video | Remotion | Plainly |
|---|---|---|---|---|---|---|
| Ken Burns | zoompan filter | effects property | Keyframe animation | Manual positioning | CSS transforms | After Effects |
| Multi-track | filter_complex | Tracks + clips JSON | Layers in JSON | Single track | React composition | AE tracks |
| Text animation | drawtext (limited) | Basic text | Advanced (cascade, typewriter, bounce) | Basic text | Full React animations | AE keyframes |
| Transitions | xfade (40+) | Built-in set | Keyframe-based | Limited | Custom React | AE transitions |
| Color grading | eq, colorbalance | Filters | Adjustments | None | CSS filters | AE effects |
| Audio mixing | amix, sidechaincompress | Timeline audio | Audio layers | Basic | Web Audio API | AE audio |
| Templates | None (scripts) | JSON templates | Visual editor + JSON | None | Code templates | AE templates |
| Rendering | Local GPU/CPU | Cloud | Cloud | Cloud | Local/Lambda/Cloud | Cloud |
| Pricing | Free | $0.049/render (SD) | From $20/mo | Credits | $19/mo (Cloud Run) | From $59/mo |
| Best for | Full control, cost | JSON-driven automation | Template-based brands | Simple slideshows | Complex animations | Premium AE quality |
| Limitations | No visual editor, steep learning curve | Limited text animation | Monthly minimums | Single track only | React knowledge required | Expensive, AE dependency |
Detailed API Profiles
Shotstack
Shotstack renders video from JSON specifications via REST API. The composition model uses timelines, tracks, and clips.
// Shotstack JSON composition structure
interface ShotstackEdit {
timeline: {
tracks: Array<{
clips: Array<{
asset: {
type: 'video' | 'image' | 'title' | 'audio' | 'html';
src?: string;
text?: string;
style?: string;
};
start: number; // seconds
length: number; // seconds
transition?: {
in: 'fade' | 'reveal' | 'wipeLeft' | 'slideLeft';
out: 'fade' | 'reveal' | 'wipeRight' | 'slideRight';
};
effect?: 'zoomIn' | 'zoomOut' | 'slideLeft' | 'slideRight';
filter?: 'greyscale' | 'boost' | 'contrast' | 'darken';
opacity?: number;
position?: 'center' | 'top' | 'bottom';
}>;
}>;
background?: string;
};
output: {
format: 'mp4' | 'gif';
resolution: 'sd' | 'hd' | '1080';
fps: number;
};
}
Pros: Simple JSON model, good documentation, supports multi-track, built-in effects. Cons: Limited text animation, no keyframe control, effects are preset-only.
Pricing: $0.049/render (SD), $0.098/render (HD), $0.196/render (1080p). API docs.
Creatomate
Creatomate offers both template-based and JSON-from-scratch rendering with advanced text animations.
// Creatomate render request
interface CreatomateRender {
source: {
output_format: 'mp4' | 'gif' | 'png';
width: number;
height: number;
duration: number;
elements: Array<{
type: 'video' | 'image' | 'text' | 'shape' | 'composition';
source?: string;
text?: string;
x: string; // Supports expressions and percentages
y: string;
width: string;
height: string;
animations?: Array<{
type: 'scale' | 'fade' | 'slide' | 'text-typewriter' | 'text-cascade';
time: 'start' | 'end' | number;
duration: number;
easing?: string;
}>;
keyframes?: Array<{
time: number;
value: Record<string, unknown>;
}>;
}>;
};
}
Pros: Rich text animation (typewriter, cascade, bounce), keyframe support, visual template editor, good for branded content. Cons: Monthly subscription required, less control than raw FFmpeg for custom effects.
Pricing: Starts at $20/month. Developer docs.
Remotion
Remotion renders video using React components. Each frame is a React render.
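// Remotion video composition
// AnimatedText and CounterAnimation are assumed to be components defined elsewhere in the project.
import React from 'react';
import { AbsoluteFill, useCurrentFrame, interpolate, Sequence } from 'remotion';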
const KenBurnsImage: React.FC<{ src: string; direction: 'in' | 'out' }> = ({ src, direction }) => {
const frame = useCurrentFrame();
const scale = direction === 'in'
? interpolate(frame, [0, 150], [1, 1.2])
: interpolate(frame, [0, 150], [1.3, 1.0]);
return (
<AbsoluteFill>
<img
src={src}
style={{
width: '100%',
height: '100%',
objectFit: 'cover',
transform: `scale(${scale})`,
}}
/>
</AbsoluteFill>
);
};
const DataStoryVideo: React.FC = () => {
return (
<>
<Sequence from={0} durationInFrames={90}>
<KenBurnsImage src="/office.jpg" direction="in" />
<AnimatedText text="93% of LLCs overpay" y={0.35} />
</Sequence>
<Sequence from={90} durationInFrames={90}>
<KenBurnsImage src="/calculator.jpg" direction="out" />
<CounterAnimation target={3200} prefix="$" y={0.4} />
</Sequence>
</>
);
};
Pros: Full React ecosystem, unlimited animation complexity, self-hostable, type-safe, testable. Cons: Requires React knowledge, local rendering needs good hardware, Lambda rendering has cold start overhead.
Pricing: Open source (local render is free). Cloud Run: $19/month. Lambda rendering: pay per invocation. Docs.
JSON2Video
JSON2Video converts JSON documents to video with built-in TTS and HTML element support.
// JSON2Video movie structure
interface JSON2VideoMovie {
resolution: 'full-hd' | 'hd' | 'sd';
quality: 'high' | 'medium' | 'low';
scenes: Array<{
background: string;
elements: Array<{
type: 'text' | 'image' | 'video' | 'audio' | 'html' | 'voice';
src?: string;
text?: string;
start?: number;
duration?: number;
position?: 'center' | 'custom';
x?: number;
y?: number;
}>;
}>;
}
Pros: Simple API, built-in TTS, HTML/CSS element support. Cons: Single-track, limited animation, no multi-layer composition, credit-based pricing is unpredictable.
Pricing: Credit-based system. API docs.
Plainly
Plainly renders Adobe After Effects templates in the cloud, providing the highest visual quality at the highest cost.
Pros: Full After Effects quality, complex animations, professional templates, motion graphics. Cons: Requires After Effects knowledge to create templates, expensive, slower rendering.
Pricing: From $59/month for 20 minutes ($3/minute). 100 minutes at $249/month ($2.50/minute). Pricing page.
Decision Matrix
| If you need… | Use… | Because… |
|---|---|---|
| Maximum control, lowest cost | FFmpeg locally | Free, full filter access, $0.01/video |
| JSON-driven automation at scale | Shotstack | Simple API, predictable per-render pricing |
| Branded templates with rich text | Creatomate | Best text animations, visual editor |
| Complex custom animations | Remotion | React-based, unlimited creativity |
| Simple slideshow with TTS | JSON2Video | Quickest to implement |
| Premium motion graphics | Plainly | After Effects quality in the cloud |
| All of the above | FFmpeg + graduate up | Start local, move to cloud when you hit limits |
Key insight: Start with FFmpeg. It handles Levels 1-5 with zero cost. Graduate to Shotstack or Creatomate only when you need features FFmpeg cannot provide, primarily complex text animation and template management. If you need React-level animation control, use Remotion. If you need After Effects quality, use Plainly.
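The decision matrix can be encoded as a small routing function so that a pipeline picks a renderer per job instead of per project. This is only a sketch of the matrix above: the `RenderNeeds` flags and the `chooseRenderer` name are hypothetical, and the order of the checks (most specialized need first, FFmpeg as the default) is one reasonable reading of the table, not a rule from any of these vendors.

```typescript
type Renderer = 'ffmpeg' | 'shotstack' | 'creatomate' | 'remotion' | 'json2video' | 'plainly';

// Hypothetical per-job requirements, one flag per row of the decision matrix.
interface RenderNeeds {
  afterEffectsTemplates?: boolean; // premium motion graphics built in After Effects
  customReactAnimation?: boolean;  // complex custom animation
  richTextAnimation?: boolean;     // branded templates with rich text
  hostedJsonApi?: boolean;         // JSON-driven automation at scale
  builtInTts?: boolean;            // simple slideshow with TTS
}

function chooseRenderer(needs: RenderNeeds): Renderer {
  if (needs.afterEffectsTemplates) return 'plainly';
  if (needs.customReactAnimation) return 'remotion';
  if (needs.richTextAnimation) return 'creatomate';
  if (needs.hostedJsonApi) return 'shotstack';
  if (needs.builtInTts) return 'json2video';
  // Default: start local and graduate only when a limit is hit.
  return 'ffmpeg';
}

const defaultRenderer = chooseRenderer({});
const textHeavyRenderer = chooseRenderer({ richTextAnimation: true });
```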
Video Footage
| Source | License | API | Quality | Volume |
|---|---|---|---|---|
| Pexels | Free, attribution optional | REST API, 200 req/hr | HD-4K | 50K+ videos |
| Pixabay | Free, no attribution | REST API | HD-4K | 30K+ videos |
| Coverr | Free, no attribution | Manual download | HD | 2K+ videos |
| Mixkit | Free, no attribution | Manual download | HD-4K | 5K+ videos |
Recommendation: Use Pexels API as primary (best search, largest catalog, API access). Pixabay as secondary. Download Coverr/Mixkit clips locally for common backgrounds (abstract, nature, city, technology).
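For the Pexels-first strategy, the search call reduces to one GET request with the API key in the `Authorization` header. The sketch below only builds the request description so it can be handed to whatever HTTP client the pipeline uses. The endpoint and the `query`, `orientation`, and `per_page` parameters are written from my reading of the Pexels API documentation and should be checked against the current docs; `buildPexelsVideoSearch` and the interfaces are hypothetical names.

```typescript
interface PexelsVideoSearch {
  query: string;
  orientation?: 'landscape' | 'portrait' | 'square';
  perPage?: number;
}

interface HttpRequestSpec {
  url: string;
  headers: Record<string, string>;
}

// Build (but do not send) a Pexels video search request.
function buildPexelsVideoSearch(search: PexelsVideoSearch, apiKey: string): HttpRequestSpec {
  const params: string[] = [`query=${encodeURIComponent(search.query)}`];
  if (search.orientation !== undefined) params.push(`orientation=${search.orientation}`);
  if (search.perPage !== undefined) params.push(`per_page=${search.perPage}`);
  return {
    url: `https://api.pexels.com/videos/search?${params.join('&')}`,
    headers: { Authorization: apiKey },
  };
}

// Portrait clips suit the 1080x1920 output used throughout this guide.
const officeSearch = buildPexelsVideoSearch(
  { query: 'dark office', orientation: 'portrait', perPage: 5 },
  'YOUR_PEXELS_API_KEY',
);
```

Caching results per search term keeps a batch of renders well inside the hourly request limit listed in the table.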
Music
| Source | License | Genres | Download |
|---|---|---|---|
| Pixabay Music | Free, no attribution | All genres | Direct |
| Mixkit Music | Free, no attribution | All genres | Direct |
| Freesound | CC (varies) | SFX + ambient | API |
| Uppbeat | Free tier available | All genres | Direct |
Pre-download library by mood (20-30 tracks):
| Mood | Use Case | Search Terms |
|---|---|---|
| Dark cinematic | Finance, business | "dark cinematic", "tension", "corporate dark" |
| Corporate tech | SaaS, technology | "corporate tech", "innovation", "digital" |
| Motivational epic | Success, achievement | "epic motivation", "triumph", "inspirational" |
| Dramatic tension | Reveals, surprises | "suspense", "dramatic build", "anticipation" |
| Neutral ambient | How-to, tutorials | "ambient", "background", "minimal" |
| Upbeat energy | Lifestyle, marketing | "upbeat pop", "energetic", "happy" |
Sound Effects
Pre-download library (15-20 effects):
| SFX | Use Case | Source |
|---|---|---|
| Whoosh (3 variants) | Transitions | Pixabay SFX |
| Bass drop | Hook reveal | Pixabay SFX |
| Cash register | Money mentions | Pixabay SFX |
| Keyboard typing | Data reveals | Freesound |
| Camera shutter | Screenshot moments | Pixabay SFX |
| Tension drone | Background suspense | Freesound |
| Success chime | CTA, completion | Pixabay SFX |
| Notification ping | Alerts, pop-ups | Pixabay SFX |
| Record scratch | Contrarian hooks | Pixabay SFX |
| Glass shatter | Breaking misconceptions | Pixabay SFX |
Fonts
| Font | Weight | Use Case | Source |
|---|---|---|---|
| Inter Black | 900 | Headlines, numbers | Google Fonts |
| Inter Bold | 700 | Subheads, labels | Google Fonts |
| Inter Regular | 400 | Body text, captions | Google Fonts |
| Montserrat Bold | 700 | Alternative headline | Google Fonts |
Download and install locally: FFmpeg's drawtext needs a fontfile path to a font file such as a .ttf, unless your FFmpeg build includes fontconfig support.
Background Textures (download 5-10 looping videos)
| Texture | Mood | Search Terms |
|---|---|---|
| Dark digital grid | Tech, data | "digital grid loop" |
| Abstract particles | Universal | "particles dark background" |
| Bokeh lights | Warm, lifestyle | "bokeh lights loop" |
| Smoke/fog | Dramatic, moody | "smoke dark background" |
| Matrix code rain | Tech, hacking | "matrix code loop" |
| Gradient flow | Modern, clean | "gradient abstract loop" |
Where to Run Each Step
| Step | Run Where | Why |
|---|---|---|
| Script generation | Cloud (Claude SDK, Gemini, Workers AI) | Quality matters: use the best model available |
| Voice synthesis | Local (edge-tts) or ElevenLabs API | edge-tts is free and fast; ElevenLabs for premium |
| Music | Stock libraries | Don't generate; curate from free libraries |
| Stock footage search | Pexels/Pixabay API | Already in API Mom, free tier is generous |
| Video rendering | Local FFmpeg | $0.01/video on consumer hardware |
| Caption generation | Local (whisper-timestamped) | Word-level timestamps, runs on GPU |
Cost Comparison
| Pipeline | Cost/Video | At 100 videos/month | Quality |
|---|---|---|---|
| Full local (FFmpeg + edge-tts) | $0.01-0.03 | $1-3 | Level 1-5 |
| Local + ElevenLabs voice | $0.10-0.30 | $10-30 | Level 1-5, better voice |
| Shotstack cloud | $0.30-1.00 | $30-100 | Level 1-3 (limited effects) |
| Creatomate cloud | $0.50-2.00 | $50-200 | Level 1-4 (good text) |
| Remotion Lambda | $0.05-0.15 | $5-15 | Level 1-6 (any animation) |
| Plainly (After Effects) | $2.50-3.00 | $250-300 | Level 6+ (premium) |
Key insight: Local FFmpeg rendering on a consumer GPU (RTX 4070 or similar) costs approximately $0.01/video in electricity. At 100 videos per month, that is about $1, versus $30-300 for Shotstack, Creatomate, or Plainly in the table above: a 30-300x cost difference with equivalent or better quality for Levels 1-5.
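The monthly figures in the cost table are per-video cost multiplied by volume, so the comparison can be rerun for any output level. A small sketch using the per-video ranges from the table (the `PipelineCost` shape and function names are hypothetical, and the prices are the ones quoted in this document, not live vendor pricing):

```typescript
interface PipelineCost {
  name: string;
  minPerVideo: number; // USD
  maxPerVideo: number; // USD
}

// Per-video ranges copied from the cost comparison table.
const PIPELINES: PipelineCost[] = [
  { name: 'Full local (FFmpeg + edge-tts)', minPerVideo: 0.01, maxPerVideo: 0.03 },
  { name: 'Local + ElevenLabs voice', minPerVideo: 0.1, maxPerVideo: 0.3 },
  { name: 'Shotstack cloud', minPerVideo: 0.3, maxPerVideo: 1.0 },
  { name: 'Creatomate cloud', minPerVideo: 0.5, maxPerVideo: 2.0 },
  { name: 'Remotion Lambda', minPerVideo: 0.05, maxPerVideo: 0.15 },
  { name: 'Plainly (After Effects)', minPerVideo: 2.5, maxPerVideo: 3.0 },
];

// Round to whole cents so floating-point noise does not leak into the report.
const toCents = (usd: number): number => Math.round(usd * 100) / 100;

function monthlyCost(p: PipelineCost, videosPerMonth: number): { min: number; max: number } {
  return {
    min: toCents(p.minPerVideo * videosPerMonth),
    max: toCents(p.maxPerVideo * videosPerMonth),
  };
}

function monthlyTable(videosPerMonth: number): string[] {
  return PIPELINES.map((p) => {
    const c = monthlyCost(p, videosPerMonth);
    return `${p.name}: $${c.min}-${c.max}`;
  });
}

const at100Videos = monthlyTable(100);
```

At 100 videos per month this reproduces the table's third column; raising the volume shows how quickly the per-render APIs diverge from local rendering.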
The Graduation Path
Stage 1: FFmpeg for everything (Levels 1-5)
↓ When you need: complex text animation, template management
Stage 2: FFmpeg + Creatomate for text-heavy content
↓ When you need: React-level custom animation
Stage 3: Remotion for complex sequences, FFmpeg for simple ones
↓ When you need: professional motion graphics with existing AE templates
Stage 4: Plainly for premium content, Remotion/FFmpeg for volume
Render Pipeline Architecture (TypeScript)
interface RenderJob {
id: string;
script: VideoScript;
assets: CollectedAssets;
profile: ColorProfile;
output: OutputSpec;
status: 'queued' | 'rendering' | 'complete' | 'failed';
}
interface VideoScript {
hook: {
archetype: HookArchetype;
text: string;
durationSec: number;
};
segments: Array<{
narration: string;
durationSec: number;
visualSearchTerms: string[]; // For Pexels API queries
dataOverlays: DataOverlay[]; // Numbers, stats to display
sfxCues: SFXCue[]; // Sound effects at timestamps
}>;
cta: {
text: string;
durationSec: number;
};
totalDurationSec: number;
}
interface CollectedAssets {
voiceover: { path: string; wordTimestamps: WhisperWord[] };
music: { path: string; mood: string; bpm: number };
clips: Array<{
path: string;
source: 'pexels' | 'pixabay' | 'local';
kenBurns: KenBurnsMovement;
durationSec: number;
}>;
sfx: Map<string, string>; // name → file path
fonts: Map<string, string>; // weight → file path
}
interface OutputSpec {
resolution: '1080x1920' | '1920x1080';
fps: 30;
codec: 'libx264';
audioBitrate: '192k';
format: 'mp4';
}
// The render function builds an FFmpeg command from the job spec
function buildFFmpegCommand(job: RenderJob): string {
const inputs: string[] = [];
const filters: string[] = [];
// Step 1: Add all video inputs with Ken Burns
job.assets.clips.forEach((clip, i) => {
inputs.push(`-i ${clip.path}`);
filters.push(buildKenBurnsFilter(clip, i));
});
// Step 2: Chain transitions
filters.push(buildTransitionChain(job.assets.clips));
// Step 3: Apply color grade + grain + vignette
filters.push(buildColorGradeFilter(job.profile));
// Step 4: Add text overlays
job.script.segments.forEach(seg => {
seg.dataOverlays.forEach(overlay => {
filters.push(buildTextOverlayFilter(overlay));
});
});
// Step 5: Mix audio
inputs.push(`-i ${job.assets.voiceover.path}`);
inputs.push(`-i ${job.assets.music.path}`);
filters.push(buildAudioMixFilter(job));
return `ffmpeg ${inputs.join(' ')} -filter_complex "${filters.join(';')}" output.mp4`;
}
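The pipeline above calls helpers such as `buildKenBurnsFilter` without defining them. One possible shape for that helper is sketched below, reusing the zoompan expressions from the Explainer pattern. Everything here is an assumption for illustration: the `KenBurnsMovement` union, the reduced `KenBurnsClip` type, the `kbN` output labels, and the premise that each clip is a still image fed in with `-loop 1`.

```typescript
// Hypothetical movement names, matching the case statement in the Explainer pattern.
type KenBurnsMovement = 'zoom_in' | 'zoom_out' | 'pan_right' | 'pan_down';

// Reduced clip shape: only the fields this helper needs.
interface KenBurnsClip {
  kenBurns: KenBurnsMovement;
  durationSec: number;
}

const CENTER_X = 'iw/2-(iw/zoom/2)';
const CENTER_Y = 'ih/2-(ih/zoom/2)';

const MOVEMENTS: Record<KenBurnsMovement, string> = {
  zoom_in: `z='min(zoom+0.002,1.2)':x='${CENTER_X}':y='${CENTER_Y}'`,
  zoom_out: `z='if(eq(on,1),1.3,max(zoom-0.002,1.0))':x='${CENTER_X}':y='${CENTER_Y}'`,
  pan_right: `z='1.15':x='if(eq(on,1),0,min(x+2,iw-iw/zoom))':y='${CENTER_Y}'`,
  pan_down: `z='1.15':x='${CENTER_X}':y='if(eq(on,1),0,min(y+1.5,ih-ih/zoom))'`,
};

// Emit one labelled zoompan chain for input number `index`.
function buildKenBurnsFilter(
  clip: KenBurnsClip,
  index: number,
  fps: number = 30,
  size: string = '1080x1920',
): string {
  const frames = Math.round(clip.durationSec * fps); // zoompan's d is counted in output frames
  return `[${index}:v]zoompan=${MOVEMENTS[clip.kenBurns]}:d=${frames}:s=${size}:fps=${fps}[kb${index}]`;
}

const firstSceneFilter = buildKenBurnsFilter({ kenBurns: 'zoom_in', durationSec: 4 }, 0);
```

Labelling each output (`[kb0]`, `[kb1]`, ...) is what lets a later transition step reference the moved clips instead of the raw inputs.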
| Don't | Do Instead | Why |
|---|---|---|
| Use static images without Ken Burns | Apply zoompan to every image | Static = retention death. The brain disengages when nothing moves. |
| Hard cut between every clip | Use xfade transitions (0.3-1s) | Hard cuts feel jarring and amateur. Crossfades feel intentional. |
| Same Ken Burns movement on consecutive clips | Alternate from the 7-movement library | Repetitive movement becomes predictable and boring. |
| TTS voiceover with no background music | Mix music at 15% volume with sidechain ducking | Music sets mood and fills silence gaps. Ducking keeps voice clear. |
| Single visual per sentence | 2-3 visuals per sentence (3-second rule) | One visual for 9 seconds loses attention. Three visuals maintain pace. |
| Text without background/border | Always use borderw or box on drawtext | Text without contrast against video is unreadable on mobile. |
| Same transition type throughout | Vary transitions (fade, slide, dissolve, circle) | Variety maintains visual interest. Same transition = monotonous. |
| Skip color grading | Apply a consistent grade to all clips | Raw footage from different sources looks inconsistent. Grade unifies. |
| Generate music with AI | Curate from free stock libraries | AI music is detectable, sounds generic, and quality varies. Stock is curated. |
| Use only the fade transition | Use 6-8 different transitions per video | FFmpeg has 40+ transitions, so use them. Variety = production value. |
| Render in the cloud for simple compositions | Use local FFmpeg first, cloud only when needed | Cloud = $0.30-3.00/video. Local = $0.01/video. Same quality for Levels 1-5. |
| Skip silence removal in voiceover | Run whisper-timestamped and trim gaps > 0.3s | Dead air kills pacing. Professional editors obsessively remove silence. |
| Put all text at the center | Vary position (top bar, bottom third, center) | Same position becomes invisible (banner blindness). Vary to maintain attention. |
| Use system fonts in drawtext | Download Inter Black/Bold from Google Fonts | System fonts look amateur. Inter is designed for screens and data display. |
| Skip the hook (start with content) | Open with a hook archetype in the first 3 seconds | 71% of viewers decide within 3 seconds. No hook = no viewers. |
| Captions as an afterthought | Build captions into the composition as Track 3 | 78% of viewers watch on mute. Captions are primary content, not an accessibility add-on. |
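The silence-removal row in the table above can be sketched in code. This is a minimal illustration, assuming word timestamps arrive as objects with start and end times in seconds. The TimedWord shape and the function names computeKeepRanges and buildAselectFilter are hypothetical, not part of whisper-timestamped or of the article's pipeline. Words separated by gaps of at most 0.3 seconds are merged into one range to keep, and longer gaps are dropped entirely. A real pipeline would probably leave a short pad around each range to avoid clipped consonants.

```typescript
// Assumed shape of a word-level timestamp; real tool output may differ.
interface TimedWord {
  text: string;
  start: number; // seconds
  end: number; // seconds
}

interface KeepRange {
  start: number;
  end: number;
}

// Merge words into speech ranges, splitting wherever the gap exceeds maxGapSec.
function computeKeepRanges(words: TimedWord[], maxGapSec = 0.3): KeepRange[] {
  const ranges: KeepRange[] = [];
  for (const w of words) {
    const last = ranges[ranges.length - 1];
    if (last && w.start - last.end <= maxGapSec) {
      last.end = Math.max(last.end, w.end); // extend the current range
    } else {
      ranges.push({ start: w.start, end: w.end }); // start a new range
    }
  }
  return ranges;
}

// Turn the ranges into an FFmpeg audio filter that keeps only those spans.
// asetpts=N/SR/TB regenerates timestamps so the kept spans play back to back.
function buildAselectFilter(ranges: KeepRange[]): string {
  if (ranges.length === 0) return 'anull'; // nothing detected: pass audio through
  const expr = ranges
    .map(r => `between(t,${r.start.toFixed(3)},${r.end.toFixed(3)})`)
    .join('+');
  return `aselect='${expr}',asetpts=N/SR/TB`;
}
```

The resulting string would be passed to FFmpeg's -af option for the voiceover. Any caption timings derived from the original timestamps would then need shifting by the same amounts.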
References
Official Documentation
- FFmpeg Filters Documentation: Complete filter reference including zoompan, xfade, drawtext, eq, colorbalance, noise, vignette, sidechaincompress
- FFmpeg Filtering Guide: Official wiki guide to filter graphs and filter_complex
- Shotstack API Documentation: REST API reference for JSON-to-video rendering
- Creatomate Developer Docs: API reference for template and JSON-based video rendering
- Remotion Documentation: React-based programmatic video creation framework
- JSON2Video API Reference: JSON-to-video API with built-in TTS
- Plainly Product: After Effects cloud rendering API
- Pexels API Documentation: Free stock photo and video API
- Pixabay API Documentation: Free stock media API
- whisper-timestamped (GitHub): Word-level timestamp extraction for silence detection
- WhisperX (GitHub): Automatic speech recognition with word-level timestamps and diarization
FFmpeg Tutorials and Guides
- Ken Burns Effect with FFmpeg (Bannerbear): Detailed zoompan tutorial with examples
- Ken Burns Effect Slideshows with FFmpeg (mko.re): Practical slideshow creation guide
- Ken Burns Effect Using FFmpeg (hadna.space): Zoom and pan reference
- How to Zoom Images and Videos using FFmpeg (Creatomate): Zoom techniques guide
- How to Create a Slideshow from Images using FFmpeg (Creatomate): Complete slideshow pipeline
- XFade Transitions (OTTVerse): Complete xfade transition guide with examples
- xfade-easing (GitHub): 200+ extended transitions with CSS easing functions
- FFmpeg drawtext filter (OTTVerse): Dynamic text overlays, scrolling text, timestamps
- FFmpeg drawtext animations (Brayden Blackwell): Exploration of drawtext animation expressions
- FFmpeg Audio Manipulation Guide: Audio mixing, ducking, effects
- sidechaincompress (FFmpeg Docs): Sidechain compression filter reference
- How to Add a Transparent Overlay (Creatomate): Overlay and opacity techniques
- FFmpeg Engineering Handbook (Filters): Filter graph fundamentals
Retention Science and Statistics
- TikTok First 3 Seconds Hook Retention Rate Statistics (TTS Vibes): 71% decide in first 3 seconds, 85% retention = viral potential
- Social Media Attention Span Statistics 2026 (SQ Magazine): 73% can't distinguish AI-assisted from traditional video
- 2025 YouTube Audience Retention Benchmarks (Retention Rabbit): Average 23.7% retention, 8-second consideration window
- Short Form Video Statistics 2025 (Marketing LTB): 92% completion for under-15-second videos, 60-90 second sweet spot
- 2025 Social Media Video Statistics (Social Insider): 78% watch on mute, cross-platform video benchmarks
- YouTube Audience Retention 2026 Guide (Social Rails): Complete retention optimization guide
- User Attention Span Statistics 2026 (Amra and Elma): Digital focus collapse data
Kinetic Typography and Motion Design
- Kinetic Typography for Video Engagement 2025 (Influencers Time): 25-50% retention improvement with motion text
- Kinetic Typography for Short-Form Video (Influencers Time): Short-form specific motion text techniques
- Kinetic Typography: Complete Guide 2026 (IK Agency): Comprehensive motion text design guide
- AI-Powered Text Animation Trends 2025 (SuperAGI): AI tools for kinetic typography
Cloud API Comparisons
- 7 Best Video Editing APIs 2026 (Plainly): Comprehensive API comparison
- Best Video Generation APIs Reviewed (Creatomate): Cloud rendering API comparison
- Best Stock Image and Video APIs (Shotstack): Stock media API comparison
- Top 10 Stock Video APIs (Plainly): Stock footage API roundup
Stock Resources
- Pexels: Free stock photos and videos, REST API
- Pixabay: Free stock photos, videos, music, and sound effects
- Pixabay Music: Free background music, no attribution required
- Mixkit: Free stock video, music, and sound effects
- Coverr: Free stock video footage
- Freesound: Community-sourced sound effects (Creative Commons)
- Uppbeat: Free music for creators
- Google Fonts (Inter): Primary font for video text overlays
- Google Fonts (Montserrat): Alternative headline font
GitHub Repositories
- Remotion (GitHub): React framework for programmatic video
- Shotstack JSON Examples (GitHub): Collection of Shotstack API examples
- JSON2Video Node.js SDK (GitHub): JSON2Video SDK for Node.js
- xfade-ffmpeg-script (GitHub): Bash script for chaining xfade transitions
- FFmpeg leveler (GitHub): Automated leveling and ducking with FFmpeg
- cerberus/ffmpeg/zoompan (GitHub): zoompan filter reference and examples