How To Generate Captions Using AI
Generate accurate video captions with AI: VisionDraft generate_captions MCP tool, Faster-Whisper transcription, timeline JSON, and burned-in exports.
Captions are no longer optional — accessibility law, silent social feeds, and SEO all demand accurate timed text. AI captioning via speech-to-text has matured; the gap is getting captions into your video pipeline without export-import hell.
VisionDraft closes that gap with generate_captions, an MCP tool agents call after ingest. This guide covers how to generate captions using AI in agent workflows — transcription engine, timeline storage, and burned-in renders.
Why Captions Belong in Infrastructure
A traditional caption workflow runs through five hops:
- Export audio from NLE
- Upload to web caption SaaS
- Download SRT
- Re-import and style
- Re-export video
Each hop is manual or brittle Zapier glue. MCP-native infrastructure collapses the steps into one chain:
upload_asset → generate_captions → render_project (burn_captions: true)
One agent session. One account. Job IDs you can audit.
The generate_captions Tool
From VisionDraft MCP (/docs):
Inputs
- project_id: target project
- asset_id: video or audio with speech
- language: optional, default "en"
Process
- Fetch asset from Supabase storage
- Write temp file for processing
- Faster-Whisper transcribes with segment timestamps
- Save caption record to database
- Apply addCaptionSegments mutation to timeline JSON
Returns — caption metadata, updated timeline, segmentCount
Agents use segment text for summaries, Shorts hooks, or blog drafts — not only display.
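For clients that talk to the server directly rather than through a chat host, a generate_captions call can be sketched as an MCP tools/call request. The JSON-RPC envelope follows the general MCP convention and the argument names come from the inputs listed above; the helper name build_generate_captions_request and the ID values are illustrative assumptions, so check /docs for the authoritative schema.

```python
import json


def build_generate_captions_request(project_id, asset_id, language="en", request_id=1):
    """Build a JSON-RPC 2.0 `tools/call` payload for the generate_captions tool.

    Hypothetical helper: only the argument names (project_id, asset_id,
    language) come from the guide; the real schema lives in /docs.
    """
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "generate_captions",
            "arguments": {
                "project_id": project_id,
                "asset_id": asset_id,
                "language": language,
            },
        },
    }


if __name__ == "__main__":
    # Placeholder IDs for illustration only.
    request = build_generate_captions_request("proj_123", "asset_456", "es")
    print(json.dumps(request, indent=2))
```

Keeping the payload builder separate from the transport makes it easy to log or replay the exact request when a caption job needs auditing.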
Agent Prompt Examples
Basic
After upload, generate_captions for asset {id} in project {pid} language en, then render with burned captions as {export_name}.
Multilingual channel
generate_captions language es, render export_name episode-12-es.
Caption-only check before render
generate_captions and show me the first 5 segment texts without rendering yet.
For host-specific setup, see the Claude workflow guide or the ChatGPT video guide.
Burning Captions Into Video
Soft captions (sidecar SRT) vs hard burn:
| Mode | How | Use case |
|---|---|---|
| Soft | Future sidecar export | Player-controlled |
| Hard burn | render_project(burn_captions: true) | TikTok, Instagram, LinkedIn |
burn_captions defaults to true. The FFmpeg worker reads the timeline caption segments and composites the text during render.
Poll get_render_status, then download_export.
Accuracy Tips
Audio quality — Room noise hurts word error rate. Denoise upstream if possible.
Names and jargon: Have a human review the render, then prompt the agent to record corrections in a glossary for the next episode.
Language parameter: Set it explicitly; auto-detect behavior varies by engine version.
Long files: Caption minutes count against your plan quota (see pricing), so plan accordingly.
Accessibility and Compliance
WCAG-oriented teams need:
- Accurate sync (Whisper segments are timecoded)
- Readable burn styles (platform-safe margins)
- Review for critical communications
Agents accelerate draft captions; humans approve regulated content.
Captions in Automated Pipelines
The standard playbook in the automate content creation guide always includes generate_captions between ingest and render.
The create shorts automatically workflow depends on caption text for hook selection.
MCP Setup
- Create an account at /signup
- Open /mcp and copy the Server URL and API key
- Connect through the Claude MCP or ChatGPT MCP connector
Troubleshooting
| Issue | Action |
|---|---|
| Empty segments | Check audio track exists on asset |
| Wrong language | Re-run with correct language |
| Quota error | Upgrade, or wait for the next billing period |
| Burn not visible | Confirm burn_captions is true on the render call |
VisionDraft vs Caption-Only SaaS
Standalone caption apps stop at SRT. VisionDraft is full MCP-native video infrastructure — captions are one tool in a chain ending in download_export.
VisionDraft does not compete on caption UI alone; it competes on agent-orchestrated production.
Caption Segment Structure
generate_captions produces timed segments stored in timeline JSON — typically start/end seconds plus text. Agents can:
- Summarize full transcript for show notes
- Extract quotable moments for social cards
- Flag segments mentioning competitor names for legal review
Caption data is production metadata, not just display text.
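Treating segments as data can be sketched in a few lines. The CaptionSegment shape below (start, end, text) mirrors the "start/end seconds plus text" description and is an assumption about the timeline JSON, not a documented schema; the function names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class CaptionSegment:
    # Assumed shape: start/end in seconds plus the spoken text.
    start: float
    end: float
    text: str


def full_transcript(segments):
    """Join segment texts in time order into one transcript string."""
    ordered = sorted(segments, key=lambda s: s.start)
    return " ".join(s.text.strip() for s in ordered)


def flag_segments(segments, terms):
    """Return segments whose text mentions any term (case-insensitive substring match)."""
    lowered = [t.lower() for t in terms]
    return [s for s in segments if any(t in s.text.lower() for t in lowered)]
```

The transcript string is the input for show notes or quote extraction, while flag_segments returns the timecoded hits a legal reviewer would want to jump to.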
Re-Transcription After Audio Fixes
Replacing the audio with a noise-reduced version should trigger a new generate_captions run, because the old segments become stale. Version the new export with a -v2 suffix in export_name.
Speaker Diarization (Future)
Multi-speaker labeling may extend the caption tooling in the future. Today, assume a single mixed transcript and plan to add speaker tags manually in post for panel discussions.
SRT/VTT Export Paths
Burned captions suit social; accessibility archives may need sidecar SRT. Monitor the VisionDraft docs for sidecar export tools; until they are available, retain the Whisper output by saving the segment JSON the agent reports.
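Retained segment JSON can be turned into a sidecar file locally. The sketch below writes standard SRT (numbered cues, HH:MM:SS,mmm timestamps); it assumes each segment is a dict with start, end, and text keys, which is an illustrative shape rather than the documented VisionDraft format.

```python
def format_srt_timestamp(seconds):
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments):
    """Render a list of {"start", "end", "text"} dicts as SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_srt_timestamp(seg["start"])
        end = format_srt_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}")
    return "\n\n".join(blocks) + "\n"
```

Write the returned string to a .srt file next to the archived export so the accessibility copy travels with the video.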
Caption QA Sampling
QC teams should sample three random 30-second windows per episode and compare the audio to the caption text. Target a word error rate below 2% before series automation runs fully unattended.
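Word error rate for a sampled window can be computed by hand-transcribing the audio and comparing it to the caption text. The sketch below uses the usual definition (word-level edit distance divided by the number of reference words); it lowercases both sides but does not strip punctuation, so normalize the text first if punctuation differences should not count as errors.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # previous[j] is the edit distance between the ref words seen so far
    # (before the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref)
```

A window passes the target above when word_error_rate(human_text, caption_text) is below 0.02.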
Custom Vocabulary Hints
Product names and CEO names confuse ASR. Maintain a glossary; future Whisper prompt biasing may accept it directly, but today the fix is a post-process find-and-replace on segment text before re-rendering.
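A glossary pass over segment text can be as small as the sketch below. It assumes segments are dicts with a text key and that the glossary maps a misheard spelling to the approved one; both shapes, and the example entries in the usage note, are hypothetical.

```python
import re


def apply_glossary(text, glossary):
    """Replace known ASR misspellings with approved spellings, whole words only."""
    for wrong, right in glossary.items():
        pattern = r"\b" + re.escape(wrong) + r"\b"
        # A function replacement keeps backslashes in `right` literal.
        text = re.sub(pattern, lambda _match, r=right: r, text, flags=re.IGNORECASE)
    return text


def correct_segments(segments, glossary):
    """Return new segment dicts with corrected text; timing is left untouched."""
    return [{**seg, "text": apply_glossary(seg["text"], glossary)} for seg in segments]
```

For example, apply_glossary("welcome to acme soft", {"acme soft": "AcmeSoft"}) returns "welcome to AcmeSoft". Because only text changes, the corrected segments keep their original timestamps.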
Caption Timing Drift
If the audio is replaced after captioning, timing drifts. Always re-run generate_captions after an audio swap; never re-render with stale segments alone.
Offline Caption Review Tools
Export the segment JSON to a review UI (an internal tool or a spreadsheet) before burning. A low-tech CSV review needs only four columns: start, end, text, approved_yes_no.
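The CSV review sheet can be produced with the standard library. This sketch assumes segments are dicts with start, end, and text keys (an illustrative shape) and leaves approved_yes_no blank for the reviewer to fill in.

```python
import csv
import io

REVIEW_COLUMNS = ["start", "end", "text", "approved_yes_no"]


def segments_to_review_csv(segments):
    """Serialize segments to CSV text with an empty approval column for reviewers."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=REVIEW_COLUMNS, lineterminator="\n")
    writer.writeheader()
    for seg in segments:
        writer.writerow({
            "start": seg["start"],
            "end": seg["end"],
            "text": seg["text"],
            "approved_yes_no": "",
        })
    return buffer.getvalue()
```

The csv module quotes any caption text that contains commas, so the sheet opens cleanly in spreadsheet tools.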
Caption Style and Brand
Burned caption font and positioning follow the FFmpeg render config today, which is limited compared with After Effects kinetic type. Set stakeholder expectations: automated captions prioritize accuracy and legibility over bespoke motion typography until advanced style tools ship.
Multilingual Publishing Workflow
For one master video, run generate_captions in parallel for each language (es, fr, de) and issue a separate render_project per language export. The agent then tracks six jobs (three caption runs and three renders); a polling table prevents missed completions.
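A polling table for the multilingual case can be sketched as a plain list of rows, one per caption run and one per render. The row fields, the "pending" status value, and the base-name plus language export naming are assumptions for illustration; only completed and failed are statuses named in this guide.

```python
def build_job_table(base_name, languages):
    """Create one caption row and one render row per language."""
    table = []
    for lang in languages:
        export_name = f"{base_name}-{lang}"
        for step in ("generate_captions", "render_project"):
            table.append({
                "language": lang,
                "step": step,
                "export_name": export_name,
                "status": "pending",  # assumed placeholder status
            })
    return table


def unfinished(table):
    """Rows that still need polling (anything not completed or failed)."""
    return [row for row in table if row["status"] not in ("completed", "failed")]
```

Calling build_job_table("episode-12", ["es", "fr", "de"]) yields six rows; the agent updates each row's status as jobs finish and keeps polling until unfinished returns an empty list.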
Legal Transcript Retention
Caption segment JSON may constitute the official transcript in regulated industries, so back up caption records under the same retention policy as the source video. VisionDraft persistence supports audit; your organization defines the retention period.
Whisper Model Updates
Faster-Whisper model upgrades may shift wording slightly. Re-baseline your QA sampling whenever the platform announces a model bump in its release notes.
Reference Appendix: Implementation Notes
Production teams should treat this guide as a living document tied to VisionDraft's MCP tool surface at /docs. Before any batch automation goes live, run a golden path test on a five-second sample clip: create_project, ingest, generate_captions, render_project, poll get_render_status, and download_export. Archive the resulting job_id and export_id as regression fixtures.
Credential hygiene remains the top security issue. API keys from /mcp belong in host connector settings or secrets managers — never in blog comments, ticket attachments, or Git repositories. Rotate keys when employees leave or when a connector was exposed in a screen share. For agencies, separate keys per client prevent accidental cross-posting of exports between brands.
Quota planning against the pricing tiers avoids mid-campaign surprises. Model monthly demand: number of episodes × (caption minutes + render minutes per episode) + Shorts derivative factor. Upgrade tier before Black Friday or conference season, not after queue saturation. VisionDraft enforces limits server-side; agents surface errors but cannot override billing.
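The demand model can be written down as a small calculation. This sketch treats the Shorts derivative factor as extra minutes added on top, which is an assumption (multiply instead if your Shorts scale with episode length), and it lumps caption and render minutes together even though a plan may meter them separately. The function names and the headroom fraction are illustrative.

```python
def monthly_minutes(episodes, caption_minutes, render_minutes, shorts_minutes=0.0):
    """Estimate monthly metered minutes.

    Assumption: Shorts derivatives are modeled as additive minutes.
    """
    return episodes * (caption_minutes + render_minutes) + shorts_minutes


def fits_plan(demand_minutes, plan_minutes, headroom=0.2):
    """True when demand leaves at least `headroom` (a fraction) of the plan unused."""
    return demand_minutes <= plan_minutes * (1 - headroom)
```

For example, 8 episodes at 30 caption minutes and 30 render minutes each, plus 40 minutes of Shorts, comes to 520 minutes; compare that against the included minutes on the pricing page before the busy season.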
Async discipline separates hobby workflows from production. Every operator must internalize this: render_project returns immediately, and completion requires polling get_render_status until the job is completed or failed. Scripts should use exponential backoff (for example 30s, then 45s, capped at 60s) and alert if p95 latency exceeds the SLA. Do not chain duplicate render calls hoping to "speed up" a stuck job; diagnose the existing job_id first.
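A polling loop with capped backoff might look like the sketch below. The get_status argument is a caller-supplied function (hypothetical) that wraps the get_render_status tool and returns a status string; the growth factor of 1.5 and the one-hour timeout are assumptions chosen to reproduce the 30s, 45s, 60s schedule.

```python
import time


def backoff_delays(initial=30.0, factor=1.5, cap=60.0):
    """Yield poll intervals that grow geometrically until they hit the cap."""
    delay = initial
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def wait_for_render(get_status, job_id, timeout=3600.0, sleep=time.sleep):
    """Poll until the job reports completed or failed, or the timeout would be exceeded.

    `get_status` is a hypothetical wrapper around the get_render_status tool.
    """
    waited = 0.0
    for delay in backoff_delays():
        status = get_status(job_id)
        if status in ("completed", "failed"):
            return status
        if waited + delay > timeout:
            raise TimeoutError(f"render job {job_id} still {status!r} after {waited:.0f}s")
        sleep(delay)
        waited += delay
```

Passing sleep as a parameter lets tests run instantly, and raising on timeout gives the operator a single job_id to diagnose instead of a silent hang.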
Human review gates protect brand and compliance. Automate mechanical captioning and encoding; keep humans on claims, regulated statements, music rights, and talent releases. Download URLs from download_export expire — copy files to your CDN or DAM within the signed URL window (typically one hour).
Cross-host portability is a core benefit of MCP-native infrastructure. The same VisionDraft project namespace works from Claude Desktop, ChatGPT connectors, or headless JSON-RPC clients. If one host has an outage, failover procedures should document an alternate host configuration that points at the identical Server URL with a backup API key.
Observability: log project_id, asset_id, job_id, and export_id for every production run. When stakeholders ask "which export went live Tuesday?", IDs answer definitively unlike chat transcripts. Pair logs with VisionDraft dashboard render history during postmortems.
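One low-effort way to capture those IDs is a JSON line per production run. The sketch below is illustrative: the function name and the logged_at field are assumptions, and the output can go to whatever log sink your team already uses.

```python
import json
from datetime import datetime, timezone


def run_log_line(project_id, asset_id, job_id, export_id):
    """Return one JSON log line carrying the four IDs plus a UTC timestamp."""
    record = {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "project_id": project_id,
        "asset_id": asset_id,
        "job_id": job_id,
        "export_id": export_id,
    }
    return json.dumps(record, sort_keys=True)
```

Because each line is self-contained JSON, answering "which export went live Tuesday?" becomes a filter on logged_at rather than a search through chat history.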
Related reading: what is MCP, complete guide to AI video automation, VisionDraft MCP infrastructure. Next step: create your account and configure /mcp to run the golden path test today.
Extended Checklist for Operators
Use this checklist weekly:
- Verify MCP connector responds to list_projects without 401 errors.
- Confirm render worker queue depth is normal: no growing backlog of queued jobs older than one hour.
- Review caption QA sample (minimum three random 30-second windows per active series).
- Validate export_name naming conventions match current marketing calendar prefixes.
- Check storage usage against plan limits; archive stale exports to cold storage if needed.
- Update prompt playbooks when VisionDraft /docs changelog notes new tools or parameters.
- Reconcile billing tier with trailing 30-day render and caption minute consumption.
- Run failover drill: invoke create_project from backup MCP host configuration.
- Ensure contractors' API keys are revoked within 24 hours of offboarding.
- Document any failed job_id in team runbook with root cause and preventive action.
Operators who skip checklist items six and seven typically discover tool schema drift or quota exhaustion during deadline week — preventable with discipline.
Frequently asked questions
How does VisionDraft generate captions?
The generate_captions MCP tool downloads the asset, runs Faster-Whisper transcription, saves caption records, and inserts timed segments into the project timeline JSON.
Can agents trigger captioning automatically?
Yes. Claude or ChatGPT calls generate_captions after upload_asset or complete_upload, then render_project with burn_captions true for hardcoded subtitles.
What languages are supported?
Pass the language parameter on generate_captions (default en). Whisper-family models support many languages; verify quality for your locale.
Are captions editable?
Segments live in timeline JSON and caption tables. Re-run generate_captions (or use edit tools once they ship), then re-render to update the burned output.
Do captions count against plan limits?
Yes. Caption generation is metered per plan — see pricing for included caption minutes.