You paste in a script. A few minutes later, a finished vertical video comes out: voiced, captioned, and illustrated with scene-matched visuals, ready to upload to TikTok or YouTube Shorts. No timeline scrubbing, no manual captioning, no sourcing footage. We built this for our own content first, and it changed how we think about short-form production entirely.
Here's the whole pipeline, including the parts that still break.
The Pipeline, End to End
The script is the only input. From there, every step hands off to the next:
- Voice: the script goes to ElevenLabs, which reads it in one of our pre-selected character voices and returns a clean audio track.
- Captions: that audio runs through Whisper, which transcribes it and gives us word-level timestamps. Those timestamps are what let captions track the voice word by word instead of drifting.
- Scene breakdown: the script goes to a language model (we use Groq for speed, Claude when we want better judgment) that splits it into scenes and writes a tailored image prompt for each one.
- Visuals: each scene prompt goes to Gemini, which generates the image for that beat.
- Render: everything (audio, timed captions, generated images) gets composed and rendered by Remotion into the final video.
The magic isn't any single tool. It's that the output of each step is already in the exact shape the next step needs.
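To make "the exact shape the next step needs" concrete, here is a minimal sketch of one handoff: turning word timestamps plus a scene breakdown into frame ranges a renderer can consume. Everything in it is an assumption for illustration: the `Word`, `Scene`, and `TimedScene` types, the `time_scenes` helper, the 30 fps frame rate, and the simplification that the transcript has exactly as many words as the scene texts combined. It is not our production code.

```python
from dataclasses import dataclass

FPS = 30  # assumed frame rate for the final render


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the audio
    end: float


@dataclass
class Scene:
    text: str          # the slice of script this scene covers
    image_prompt: str  # prompt written by the scene-breakdown step


@dataclass
class TimedScene:
    image_prompt: str
    start_frame: int
    duration_frames: int


def time_scenes(scenes, words, fps=FPS):
    """Give each scene the time span of the words it covers.

    Assumes the scenes, in order, cover the transcript word for word.
    """
    timed = []
    cursor = 0
    for scene in scenes:
        count = len(scene.text.split())
        chunk = words[cursor:cursor + count]
        if not chunk:
            break  # transcript ran out; nothing left to time
        cursor += count
        start_frame = round(chunk[0].start * fps)
        # Hold the image until the next scene's first word so there are no gaps.
        end_s = words[cursor].start if cursor < len(words) else words[-1].end
        timed.append(
            TimedScene(scene.image_prompt, start_frame, round(end_s * fps) - start_frame)
        )
    return timed
```

Frames rather than seconds are used here because a frame-based renderer such as Remotion places content by start frame and duration in frames. In practice the transcript and the script will not always match word for word, so a real version would need a more forgiving alignment.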
Why Each Tool Earns Its Place
We didn't pick these for hype. Each one does a job nothing else does as well:
- ElevenLabs gives us voices with actual personality, and the API means we can swap characters without re-recording anything.
- Whisper is the unsung hero. Word-level timing is the difference between captions that feel native and captions that feel bolted on.
- The LLM scene step is where taste lives. A good breakdown turns a wall of text into a sequence with rhythm, and writing a strong image prompt per scene is the difference between coherent visuals and random stock-feeling images.
- Remotion lets us treat video as code. Layouts, caption styles, transitions: all reusable, all version-controlled, all consistent across every video.
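As an illustration of why word-level timing matters, the sketch below groups timed words into short caption chunks that appear and disappear with the voice. It assumes the words arrive as a list of dicts with `word`, `start`, and `end` keys (start and end in seconds); the `group_captions` name and the `max_words` and `max_gap` defaults are hypothetical choices, not tuned values.

```python
def group_captions(words, max_words=3, max_gap=0.4):
    """Group timed words into caption chunks.

    A new chunk starts when the current one is full or when the speaker
    pauses for longer than max_gap seconds.
    """
    captions = []
    current = []

    def flush():
        captions.append({
            "text": " ".join(w["word"].strip() for w in current),
            "start": current[0]["start"],
            "end": current[-1]["end"],
        })

    for w in words:
        paused = bool(current) and (w["start"] - current[-1]["end"] > max_gap)
        if current and (len(current) >= max_words or paused):
            flush()
            current = []
        current.append(w)

    if current:
        flush()
    return captions
```

Breaking on pauses as well as on word count keeps a caption from lingering on screen through a silence, which is one of the things that makes captions feel bolted on.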
Where It Breaks
We're not going to pretend it's flawless.
The weakest link is visual consistency. Gemini generates each scene independently, so a character or product can look subtly different from one beat to the next. We've reduced this with stricter, more descriptive prompts and reference styling, but it's still the thing we watch most closely.
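One way "stricter, more descriptive prompts and reference styling" can look in practice is to prepend the same style and character descriptions to every scene prompt, so each independent generation starts from identical wording. The sketch below is a hypothetical helper (`build_prompts`, `style_guide`, and `characters` are illustrative names), and it only reduces drift; it does not guarantee a consistent look.

```python
def build_prompts(scene_prompts, style_guide, characters):
    """Prefix every scene prompt with a shared style and character sheet.

    characters maps a name to a fixed visual description that is repeated
    verbatim in every prompt.
    """
    header_parts = [f"Style: {style_guide}"]
    for name, description in characters.items():
        header_parts.append(f"{name}: {description}")
    header = " | ".join(header_parts)
    return [f"{header} | Scene: {prompt}" for prompt in scene_prompts]


prompts = build_prompts(
    ["Mia opens her laptop at a cafe", "Mia hits upload and smiles"],
    style_guide="flat vector illustration, warm palette, 9:16 framing",
    characters={"Mia": "woman in her 30s, short black hair, yellow raincoat"},
)
```

Keeping the shared header in one place also means a change to a character's description applies to every scene at once.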
Scene breakdowns also occasionally misjudge pacing: cramming too much into one beat or stretching a thin line across two. That's a prompt-engineering problem more than a model problem, and the breakdowns keep improving as we refine the prompt.
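Pacing problems like these can be flagged before any image is generated. The sketch below is a crude heuristic that assumes word count is a fair proxy for how long a beat runs; the `flag_pacing` name and the `min_words` and `max_words` thresholds are hypothetical and would need tuning against real scripts.

```python
def flag_pacing(scene_texts, min_words=4, max_words=25):
    """Return (scene_index, issue) pairs for scenes that look mis-paced.

    "thin" means too few words to justify its own image;
    "crowded" means too many words for a single beat.
    """
    issues = []
    for index, text in enumerate(scene_texts):
        count = len(text.split())
        if count < min_words:
            issues.append((index, "thin"))
        elif count > max_words:
            issues.append((index, "crowded"))
    return issues
```

A flagged breakdown could be sent back to the language model with the issue attached, or simply surfaced for a person to review.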
And nothing here replaces judgment about what the script should say. The pipeline produces the video. It doesn't decide whether the idea is good.
Why This Matters
What used to be a few hours of an editor's time (voiceover, syncing, sourcing visuals, assembling a timeline) is now a few minutes of compute and a few cents of API cost. That doesn't make editors obsolete. It moves the work up the stack: from assembling video to directing it.
For a studio, that's the real shift. We spend less time on the mechanical middle and more time on the script and the look, the parts that actually decide whether anyone watches.



