Whisper Transcription for Video Editors: Costs, Word Timestamps and Setup
By the Caption Plug team · Published June 12, 2026 · 8 min read
Whisper - OpenAI's open-source speech recognition model - is what powers most modern caption tools, and using it through an API costs video editors almost nothing: $0.006 per audio minute on OpenAI (about 36 cents for an hour of audio), with Groq running the same whisper-large-v3 model faster at comparably tiny prices. Here's what editors actually need to know: word timestamps, the 25 MB limit, privacy, and where it fits in a caption workflow.
Why Whisper took over captioning
Whisper was released by OpenAI in late 2022 and open-sourced, which did two things: set a new accuracy bar for messy real-world audio (accents, music beds, crosstalk), and let every provider host it. The version that matters for captions is whisper-large-v3 with word-level timestamps: each word comes back with its own start and end time, which is precisely what per-word caption animation needs. Sentence-level timing (what basic subtitle tools use) can't drive a highlight that follows the voice.
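To make the difference concrete, here is a minimal sketch of how word timings can drive a highlight. The sample data is made up, and the `word`/`start`/`end` field names are an assumption modeled on the verbose JSON that Whisper APIs typically return when word-level granularity is requested; the helper names are hypothetical.

```python
from bisect import bisect_right

# Hypothetical sample: one entry per word, times in seconds.
words = [
    {"word": "Whisper", "start": 0.00, "end": 0.42},
    {"word": "took", "start": 0.42, "end": 0.61},
    {"word": "over", "start": 0.61, "end": 0.90},
    {"word": "captioning", "start": 1.10, "end": 1.78},
]

def active_word_index(words, t):
    """Return the index of the word being spoken at time t, or None during silence."""
    starts = [w["start"] for w in words]
    i = bisect_right(starts, t) - 1  # last word that started at or before t
    if i >= 0 and t < words[i]["end"]:
        return i
    return None

def render_caption(words, t):
    """Render the line with the active word in [brackets] as a stand-in for a highlight."""
    active = active_word_index(words, t)
    return " ".join(
        f"[{w['word']}]" if i == active else w["word"]
        for i, w in enumerate(words)
    )

print(render_caption(words, 0.5))
```

With only sentence-level timing, `active_word_index` would have nothing to search: the whole line shares a single start and end, so the highlight cannot move word by word.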
What it costs, in editor terms
| Job | Audio length | OpenAI cost (@$0.006/min) |
|---|---|---|
| TikTok/Short | 60 sec | ~$0.006 |
| YouTube video | 12 min | ~$0.07 |
| Podcast episode | 60 min | ~$0.36 |
| A month of daily shorts | 30 × 60 sec | ~$0.18 |
Groq prices its hosted whisper-large-v3 similarly low (fractions of a cent per audio minute, metered per second) and is dramatically faster - often transcribing a clip in a few seconds. Prices drift; check platform.openai.com and console.groq.com for current numbers. The takeaway doesn't drift: transcription is a rounding error next to literally any subscription.
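The table is straight multiplication, so it is easy to rerun for your own workload. A small sketch, assuming the $0.006 per minute rate quoted above still holds; the job names and the `transcription_cost` helper are illustrative only.

```python
OPENAI_USD_PER_MINUTE = 0.006  # rate quoted in this article; check current pricing

def transcription_cost(seconds, usd_per_minute=OPENAI_USD_PER_MINUTE):
    """Cost in USD for a given audio length, billed pro rata per minute."""
    return seconds / 60 * usd_per_minute

# Illustrative jobs matching the table above (lengths in seconds).
jobs = {
    "TikTok/Short": 60,
    "YouTube video": 12 * 60,
    "Podcast episode": 60 * 60,
    "A month of daily shorts": 30 * 60,
}

for name, seconds in jobs.items():
    print(f"{name}: ${transcription_cost(seconds):.3f}")
```

Swap in a different `usd_per_minute` to compare providers once you have their current rates.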
OpenAI vs Groq for caption work
- Groq: runs whisper-large-v3 (the strongest open Whisper), very fast, strong on names and proper nouns. Our default recommendation for captions.
- OpenAI: the reference implementation, rock-solid availability, same API shape. Its whisper-1 model is slightly weaker on unusual names than large-v3.
The 25 MB limit (and the MP3 trick)
Both providers cap each upload at 25 MB. That sounds small until you realize it's an audio limit: a 96 kbps mono MP3 fits roughly 35 minutes in 25 MB. The standard workflow is exporting MP3 audio rather than WAV (a 10-minute WAV is ~100 MB; the same audio as MP3 is ~7 MB) and splitting anything longer than ~30 minutes. Tools with ffmpeg available can compress automatically.
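The size math behind those numbers can be sketched as follows. It assumes a constant-bitrate MP3, treats 25 MB as decimal megabytes, and ignores file headers and tags, so real files will differ slightly; the function names are hypothetical.

```python
import math

UPLOAD_LIMIT_BYTES = 25 * 1000 * 1000  # 25 MB, assumed to mean decimal megabytes

def mp3_bytes(seconds, bitrate_kbps=96):
    """Approximate size of a constant-bitrate MP3 (ignores headers and tags)."""
    return seconds * bitrate_kbps * 1000 / 8

def max_seconds(bitrate_kbps=96, limit_bytes=UPLOAD_LIMIT_BYTES):
    """Longest audio that fits under the upload limit at a given bitrate."""
    return limit_bytes * 8 / (bitrate_kbps * 1000)

def plan_chunks(total_seconds, chunk_seconds=30 * 60):
    """Split a recording into (start, end) ranges no longer than chunk_seconds."""
    count = math.ceil(total_seconds / chunk_seconds)
    return [
        (i * chunk_seconds, min((i + 1) * chunk_seconds, total_seconds))
        for i in range(count)
    ]

print(f"10 min at 96 kbps: {mp3_bytes(600) / 1e6:.1f} MB")
print(f"Fits in 25 MB: {max_seconds() / 60:.1f} min")
print(plan_chunks(75 * 60))
```

When chunk transcripts are stitched back together, each word's timestamps need the chunk's start offset added. A fixed cut can also land in the middle of a word, so cutting at a pause is safer when you can.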
Privacy: what leaves your machine
With a direct API integration, the audio goes from your machine to the provider you chose - nobody else's servers in between. OpenAI and Groq both state API data isn't used for model training by default (check their current policies). Contrast that with upload-your-video caption services, where your full footage sits in someone else's render pipeline. For client work under NDA, audio-only to a no-training API is a much easier conversation.
How this plugs into Premiere
Caption Plug uses exactly this setup: you paste your own OpenAI or Groq key once (the panel walks you through creating one in about two minutes), it exports your timeline audio as MP3 using Premiere's own encoder, sends it to your provider, and turns the word timestamps into animated captions rendered frame-accurately on your timeline. Transcripts are cached, so restyling or toggling the censor never re-uploads anything. Your key, your audio, your machine - the plugin's servers never see either.
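For the curious, transcript caching of this kind can be sketched in a few lines. This is not the plugin's actual code, just one common approach under stated assumptions: the cache key is a SHA-256 hash of the exported audio, transcripts are stored as JSON files, and `fake_transcribe` is a hypothetical stand-in for the real API call.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def cached_transcript(audio_bytes, transcribe, cache_dir):
    """Return the transcript for audio_bytes, calling transcribe() only on a cache miss."""
    key = hashlib.sha256(audio_bytes).hexdigest()  # same audio, same key
    path = Path(cache_dir) / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    transcript = transcribe(audio_bytes)  # the only step that would upload anything
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(transcript), encoding="utf-8")
    return transcript

# Hypothetical stand-in for a provider call; records how often it runs.
calls = []

def fake_transcribe(audio_bytes):
    calls.append(len(audio_bytes))
    return {"words": [{"word": "hello", "start": 0.0, "end": 0.4}]}

with tempfile.TemporaryDirectory() as cache_dir:
    first = cached_transcript(b"fake mp3 bytes", fake_transcribe, cache_dir)
    second = cached_transcript(b"fake mp3 bytes", fake_transcribe, cache_dir)

print(len(calls))
```

Because styling changes do not alter the audio, the hash stays the same and the second lookup is served from disk without another upload.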
Quick answers
How much does Whisper transcription cost per video?
OpenAI's API is $0.006 per audio minute (as of mid-2026): a 60-second short costs well under a cent, a 20-minute episode about $0.12. Groq's whisper-large-v3 is similarly cheap and noticeably faster. Either way, transcription cost is effectively a rounding error.
Is my audio used to train models when I use the API?
OpenAI states API data isn't used for training by default, and Groq makes the same commitment. That's a real difference from pasting content into consumer chat apps - check each provider's current data policy for specifics.
What about the 25 MB upload limit?
Both providers cap uploads at 25 MB, which is roughly 30-40 minutes of 96 kbps MP3 speech. Export audio as MP3 rather than WAV, or split long recordings. Some tools compress automatically with ffmpeg when a file is too big.