Video Editing · 8 min read · June 23, 2026

How To Generate Captions Using AI

Generate accurate video captions with AI: VisionDraft generate_captions MCP tool, Faster-Whisper transcription, timeline JSON, and burned-in exports.

By VisionDraft Team

Captions are no longer optional — accessibility law, silent social feeds, and SEO all demand accurate timed text. AI captioning via speech-to-text has matured; the gap is getting captions into your video pipeline without export-import hell.

VisionDraft closes that gap with generate_captions, an MCP tool agents call after ingest. This guide covers how to generate captions using AI in agent workflows — transcription engine, timeline storage, and burned-in renders.

Why Captions Belong in Infrastructure

Traditional caption workflows look like this:

  1. Export audio from NLE
  2. Upload to web caption SaaS
  3. Download SRT
  4. Re-import and style
  5. Re-export video

Each hop is manual or brittle Zapier glue. MCP-native infrastructure collapses steps:

upload_asset → generate_captions → render_project (burn_captions: true)

One agent session. One account. Job IDs you can audit.
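
The chain above can be sketched as a small orchestration function. This is a minimal sketch under assumptions, not VisionDraft's client: call_tool is a hypothetical callable standing in for whatever MCP client your host provides, the asset is assumed to be uploaded already, and passing project_id and export_name to render_project is an assumption to verify against /docs.

```python
from typing import Any, Callable, Dict, List

# Hypothetical signature: call_tool(tool_name, arguments) -> tool result.
ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def caption_and_render(call_tool: ToolCaller, project_id: str, asset_id: str,
                       export_name: str, language: str = "en") -> Dict[str, Any]:
    """Run generate_captions, then queue a render with burned-in captions."""
    captions = call_tool("generate_captions", {
        "project_id": project_id,
        "asset_id": asset_id,
        "language": language,
    })
    render = call_tool("render_project", {
        "project_id": project_id,      # assumed argument name
        "export_name": export_name,    # assumed argument name
        "burn_captions": True,
    })
    return {"captions": captions, "render": render}


# Stub caller that only records call order; a real one would talk to the MCP server.
calls: List[str] = []


def fake_call_tool(name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    calls.append(name)
    return {"tool": name, "arguments": arguments}


result = caption_and_render(fake_call_tool, "proj_1", "asset_1", "episode-12")
print(calls)
```

Injecting the caller keeps the ordering logic testable without network access and independent of which host runs it.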

The generate_captions Tool

From VisionDraft MCP (/docs):

Inputs

  • project_id — target project
  • asset_id — video or audio with speech
  • language — optional, default "en"

Process

  1. Fetch asset from Supabase storage
  2. Write temp file for processing
  3. Faster-Whisper transcribes with segment timestamps
  4. Save caption record to database
  5. Apply addCaptionSegments mutation to timeline JSON

Returns — caption metadata, updated timeline, segmentCount
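
For headless clients, a generate_captions call can be expressed as a JSON-RPC request. The sketch below builds the standard MCP tools/call envelope with the three inputs listed above. The ID values are placeholders, and transport and authentication details are not shown because they depend on your connector configuration.

```python
import json
from typing import Any, Dict


def build_generate_captions_request(project_id: str, asset_id: str,
                                    language: str = "en",
                                    request_id: int = 1) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 'tools/call' payload for generate_captions."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "generate_captions",
            "arguments": {
                "project_id": project_id,
                "asset_id": asset_id,
                "language": language,  # optional in the tool; defaults to "en"
            },
        },
    }


request = build_generate_captions_request("proj_1", "asset_1")
print(json.dumps(request, indent=2))
```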

Agents use segment text for summaries, Shorts hooks, or blog drafts — not only display.

Agent Prompt Examples

Basic

After upload, generate_captions for asset {id} in project {pid} language en, then render with burned captions as {export_name}.

Multilingual channel

generate_captions language es, render export_name episode-12-es.

Caption-only check before render

generate_captions and show me the first 5 segment texts without rendering yet.

Host-specific walkthroughs: the Claude workflow guide and the ChatGPT video guide.

Burning Captions Into Video

Soft captions (sidecar SRT) vs hard burn:

Mode | How | Use case
Soft | Future sidecar export | Player-controlled
Hard burn | render_project(burn_captions: true) | TikTok, Instagram, LinkedIn

Default burn_captions is true. FFmpeg worker reads timeline caption segments and composites text during render.

Poll get_render_status, then download_export.

Accuracy Tips

Audio quality — Room noise hurts word error rate. Denoise upstream if possible.

Names and jargon — Have a human review the output after render, then re-prompt the agent to record corrections in a glossary for the next episode.

Language parameter — Set correctly; auto-detect paths vary by engine version.

Long files — Caption minutes consume pricing quota; plan accordingly.

Accessibility and Compliance

WCAG-oriented teams need:

  • Accurate sync (Whisper segments are timecoded)
  • Readable burn styles (platform-safe margins)
  • Review for critical communications

Agents accelerate draft captions; humans approve regulated content.

Captions in Automated Pipelines

The standard playbook in the automate content creation guide always includes generate_captions between ingest and render.

The create shorts automatically workflow depends on caption text for hook selection.

MCP Setup

  1. /signup
  2. /mcp — Server URL + API key
  3. Connect Claude MCP or ChatGPT MCP

Troubleshooting

Issue | Action
Empty segments | Check audio track exists on asset
Wrong language | Re-run with correct language
Quota error | Upgrade or wait for the next billing period
Burn not visible | Confirm burn_captions true on render

VisionDraft vs Caption-Only SaaS

Standalone caption apps stop at SRT. VisionDraft is full MCP-native video infrastructure — captions are one tool in a chain ending in download_export.

Not competing on caption UI alone; competing on agent-orchestrated production.

Caption Segment Structure

generate_captions produces timed segments stored in timeline JSON — typically start/end seconds plus text. Agents can:

  • Summarize full transcript for show notes
  • Extract quotable moments for social cards
  • Flag segments mentioning competitor names for legal review

Caption data is production metadata, not just display text.
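
As an illustration of treating segments as metadata, the sketch below joins segments into a transcript and flags segments that mention watched names. The segment keys start, end, and text are assumptions based on the "start/end seconds plus text" description; the actual timeline JSON field names may differ.

```python
from typing import Any, Dict, List

Segment = Dict[str, Any]  # assumed keys: "start", "end", "text"


def full_transcript(segments: List[Segment]) -> str:
    """Join segment texts in order into one transcript string."""
    return " ".join(seg["text"].strip() for seg in segments)


def flag_mentions(segments: List[Segment], watchlist: List[str]) -> List[Segment]:
    """Return segments whose text contains any watched name (case-insensitive)."""
    lowered = [name.lower() for name in watchlist]
    return [seg for seg in segments
            if any(name in seg["text"].lower() for name in lowered)]


segments = [
    {"start": 0.0, "end": 2.5, "text": "Welcome back to the show."},
    {"start": 2.5, "end": 6.0, "text": "Today we compare our tool with Acme Video."},
]

transcript = full_transcript(segments)
flagged = flag_mentions(segments, ["acme video"])
print(transcript)
print(flagged)
```

The flagged list keeps the timestamps, so a reviewer can jump straight to the moment in question.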

Re-Transcription After Audio Fixes

Replacing audio with a noise-reduced version should trigger a new generate_captions run, because the old segments become stale. Version the new exports with a -v2 suffix in export_name.

Speaker Diarization (Future)

Multi-speaker labeling may extend caption tooling. Today, assume a single mixed transcript and plan manual speaker tags in post for panel discussions.

SRT/VTT Export Paths

Burned captions suit social; accessibility archives may need sidecar SRT. Monitor VisionDraft docs for sidecar export tools; until available, retain Whisper output from agent-reported segment JSON.
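
Until a sidecar export tool exists, retained segment JSON can be converted to SRT locally. This is a minimal sketch that assumes each segment carries start and end in seconds plus text; it is not a VisionDraft feature.

```python
from typing import Any, Dict, List


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: List[Dict[str, Any]]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        timing = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        blocks.append(f"{index}\n{timing}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


srt_text = segments_to_srt([
    {"start": 0.0, "end": 2.5, "text": "Hello"},
    {"start": 2.5, "end": 5.0, "text": "Welcome back."},
])
print(srt_text)
```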

Caption QA Sampling

QC teams sample 3 random 30-second windows per episode comparing audio to caption text. Target <2% word error rate before series automation goes fully unattended.
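
The sampling routine can be sketched as two helpers: one picks random review windows, the other computes word error rate as word-level edit distance divided by the reference length. Both are illustrative; the normalization here (lowercasing and whitespace splitting) is a simplifying assumption, and windows may overlap.

```python
import random
from typing import List, Optional, Tuple


def pick_windows(duration_s: float, count: int = 3, window_s: float = 30.0,
                 seed: Optional[int] = None) -> List[Tuple[float, float]]:
    """Pick `count` random (start, end) windows inside the episode."""
    rng = random.Random(seed)
    latest_start = max(duration_s - window_s, 0.0)
    starts = sorted(rng.uniform(0.0, latest_start) for _ in range(count))
    return [(s, min(s + window_s, duration_s)) for s in starts]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


windows = pick_windows(duration_s=600.0, seed=7)
wer = word_error_rate("the quick brown fox", "the quick brown box")
print(windows, wer)
```

A reviewer transcribes each window by hand as the reference, then compares it with the caption text that falls inside the same window.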

Custom Vocabulary Hints

Product names and CEO names confuse ASR. Maintain a glossary. Future Whisper prompt biasing may accept it directly; today, post-process with find-replace on segment text before re-rendering.
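
A glossary pass over segment text might look like the sketch below. It assumes segments are dictionaries with a text key and that the glossary maps a common mis-transcription to the correct spelling; how corrected segments are written back to the timeline is not covered here.

```python
import re
from typing import Any, Dict, List


def apply_glossary(text: str, glossary: Dict[str, str]) -> str:
    """Replace whole-word, case-insensitive matches with the glossary spelling."""
    for wrong, right in glossary.items():
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE)
        # A function replacement avoids backslash surprises in the corrected term.
        text = pattern.sub(lambda _match, fixed=right: fixed, text)
    return text


def fix_segments(segments: List[Dict[str, Any]],
                 glossary: Dict[str, str]) -> List[Dict[str, Any]]:
    """Return corrected copies of the segments; the originals are left untouched."""
    return [{**seg, "text": apply_glossary(seg["text"], glossary)}
            for seg in segments]


glossary = {"vision draft": "VisionDraft", "faster whisper": "Faster-Whisper"}
segments = [{"start": 0.0, "end": 3.0,
             "text": "Welcome to Vision Draft, powered by faster whisper."}]
fixed = fix_segments(segments, glossary)
print(fixed[0]["text"])
```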

Caption Timing Drift

If audio is replaced after captioning, timing drifts. Always re-run generate_captions after an audio swap; never re-render with stale segments alone.

Offline Caption Review Tools

Export segment JSON to review UI (internal or spreadsheet) before burn. Low-tech CSV review columns: start, end, text, approved_yes_no.
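
The CSV review sheet can be produced with the standard library. This sketch assumes segments with start, end, and text keys and leaves the approved_yes_no column blank for the reviewer.

```python
import csv
import io
from typing import Any, Dict, List

REVIEW_COLUMNS = ["start", "end", "text", "approved_yes_no"]


def segments_to_review_csv(segments: List[Dict[str, Any]]) -> str:
    """Serialize segments to CSV text with an empty approval column."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=REVIEW_COLUMNS, lineterminator="\n")
    writer.writeheader()
    for seg in segments:
        writer.writerow({
            "start": seg["start"],
            "end": seg["end"],
            "text": seg["text"],
            "approved_yes_no": "",  # filled in by the human reviewer
        })
    return buffer.getvalue()


review_csv = segments_to_review_csv([
    {"start": 0.0, "end": 2.5, "text": "Hello, world"},
])
print(review_csv)
```

The csv module quotes text that contains commas, so caption punctuation does not break the columns when the file is opened in a spreadsheet.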

Caption Style and Brand

Burned caption font and positioning follow FFmpeg render config today — limited versus After Effects kinetic type. Set stakeholder expectations: automated captions prioritize accuracy and legibility over bespoke motion typography until advanced style tools ship.

Multilingual Publishing Workflow

Same master video, parallel generate_captions calls with language es, fr, and de, then a separate render_project per language export. The agent tracks six jobs (three caption runs and three renders); a polling table prevents missed completions.
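
A polling table for the three-language case could be as simple as a list of rows, one per step per language. This sketch is illustrative: counting each caption run as a tracked job is an interpretation of the "six jobs" figure, and the pending status is a placeholder, while completed and failed mirror the render statuses named elsewhere in this guide.

```python
from typing import Dict, List

TERMINAL_STATUSES = ("completed", "failed")


def plan_language_jobs(base_name: str, languages: List[str]) -> List[Dict[str, str]]:
    """Create one caption row and one render row per language."""
    table = []
    for lang in languages:
        table.append({"language": lang, "step": "generate_captions",
                      "status": "pending"})
        table.append({"language": lang, "step": "render_project",
                      "export_name": f"{base_name}-{lang}", "status": "pending"})
    return table


def unfinished(table: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Rows that still need polling."""
    return [row for row in table if row["status"] not in TERMINAL_STATUSES]


table = plan_language_jobs("episode-12", ["es", "fr", "de"])
table[0]["status"] = "completed"  # e.g. the Spanish caption run finished
print(len(table), len(unfinished(table)))
```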

Caption segment JSON may constitute official transcript for regulated industries — backup caption records with same retention policy as source video. VisionDraft persistence supports audit; your org defines retention years.

Whisper Model Updates

Faster-Whisper model upgrades may shift wording slightly. Re-baseline QA sampling after the platform announces a model bump in release notes.

Reference Appendix: Implementation Notes

Production teams should treat this guide as a living document tied to VisionDraft's MCP tool surface at /docs. Before any batch automation goes live, run a golden path test on a five-second sample clip: create_project, ingest, generate_captions, render_project, poll get_render_status, and download_export. Archive the resulting job_id and export_id as regression fixtures.

Credential hygiene remains the top security issue. API keys from /mcp belong in host connector settings or secrets managers — never in blog comments, ticket attachments, or Git repositories. Rotate keys when employees leave or when a connector was exposed in a screen share. For agencies, separate keys per client prevent accidental cross-posting of exports between brands.

Quota planning on pricing avoids mid-campaign surprises. Model monthly demand: number of episodes × (caption minutes + render minutes per episode) + Shorts derivative factor. Upgrade tier before Black Friday or conference season, not after queue saturation. VisionDraft enforces limits server-side; agents surface errors but cannot override billing.
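
The demand model above is simple arithmetic. The sketch below treats the Shorts derivative factor as extra minutes added on top, which is an assumption; if your plan meters caption and render minutes separately, compute them separately instead.

```python
def monthly_minutes(episodes: int, caption_min_per_episode: float,
                    render_min_per_episode: float,
                    shorts_minutes: float = 0.0) -> float:
    """episodes x (caption + render minutes per episode) + Shorts minutes."""
    per_episode = caption_min_per_episode + render_min_per_episode
    return episodes * per_episode + shorts_minutes


# Example with made-up numbers: 8 episodes, 30 caption and 30 render minutes
# each, plus 40 minutes of Shorts derivatives.
demand = monthly_minutes(8, 30.0, 30.0, shorts_minutes=40.0)
print(demand)
```

With those sample numbers the estimate is 8 × 60 + 40 = 520 minutes for the month.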

Async discipline separates hobby workflows from production. Every operator must internalize: render_project returns immediately; completion requires get_render_status polling until completed or failed. Scripts should use exponential backoff (for example 30s, then 45s, capped at 60s) and alert if p95 latency exceeds SLA. Do not chain duplicate render calls hoping to "speed up" a stuck job; diagnose the existing job_id first.
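
A polling loop with capped backoff might look like the sketch below. The get_status callable is a hypothetical wrapper around get_render_status for one job_id, and the queued and rendering status names in the demo are placeholders; only completed and failed come from this guide.

```python
import time
from itertools import islice
from typing import Callable, Iterator, List


def backoff_delays(initial: float = 30.0, factor: float = 1.5,
                   cap: float = 60.0) -> Iterator[float]:
    """Yield delays that grow by `factor` and never exceed `cap`."""
    delay = initial
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def wait_for_render(get_status: Callable[[], str], max_polls: int = 40,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until the job reports completed or failed, or give up."""
    delays = backoff_delays()
    for _ in range(max_polls):
        status = get_status()
        if status in ("completed", "failed"):
            return status
        sleep(next(delays))
    raise TimeoutError("render did not finish within the polling budget")


first_delays = list(islice(backoff_delays(), 4))

# Demo with canned statuses and a recording sleep, so nothing actually waits.
statuses = iter(["queued", "rendering", "completed"])
slept: List[float] = []
final = wait_for_render(lambda: next(statuses), sleep=slept.append)
print(first_delays, final, slept)
```

Passing sleep as a parameter keeps the loop testable, and the poll budget turns a stuck job into an explicit error tied to one job_id rather than a silent hang.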

Human review gates protect brand and compliance. Automate mechanical captioning and encoding; keep humans on claims, regulated statements, music rights, and talent releases. Download URLs from download_export expire — copy files to your CDN or DAM within the signed URL window (typically one hour).

Cross-host portability is a core benefit of MCP-native infrastructure. The same VisionDraft project namespace works from Claude Desktop, ChatGPT connectors, or headless JSON-RPC clients. If one host has an outage, failover procedures should document alternate host configuration hitting identical Server URL and a backup API key.

Observability: log project_id, asset_id, job_id, and export_id for every production run. When stakeholders ask "which export went live Tuesday?", IDs answer definitively unlike chat transcripts. Pair logs with VisionDraft dashboard render history during postmortems.
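
One way to capture those IDs is a single JSON line per run. This is a minimal sketch using Python's standard logging module; the logger name and the sample ID values are placeholders.

```python
import io
import json
import logging


def log_production_run(logger: logging.Logger, project_id: str, asset_id: str,
                       job_id: str, export_id: str) -> str:
    """Emit one JSON line with the four IDs and return it."""
    line = json.dumps({
        "project_id": project_id,
        "asset_id": asset_id,
        "job_id": job_id,
        "export_id": export_id,
    }, sort_keys=True)
    logger.info(line)
    return line


# Demo: write to an in-memory stream instead of a file or log shipper.
stream = io.StringIO()
logger = logging.getLogger("captions.runs")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(stream))

line = log_production_run(logger, "proj_1", "asset_1", "job_42", "exp_7")
print(stream.getvalue())
```

JSON lines are easy to search by job_id or export_id during a postmortem.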

Related reading: what is MCP, complete guide to AI video automation, VisionDraft MCP infrastructure. Next step: create your account and configure /mcp to run the golden path test today.

Extended Checklist for Operators

Use this checklist weekly:

  1. Verify MCP connector responds to list_projects without 401 errors.
  2. Confirm render worker queue depth is normal — no growing backlog of queued jobs older than one hour.
  3. Review caption QA sample (minimum three random 30-second windows per active series).
  4. Validate export_name naming conventions match current marketing calendar prefixes.
  5. Check storage usage against plan limits; archive stale exports to cold storage if needed.
  6. Update prompt playbooks when VisionDraft /docs changelog notes new tools or parameters.
  7. Reconcile billing tier with trailing 30-day render and caption minute consumption.
  8. Run failover drill: invoke create_project from backup MCP host configuration.
  9. Ensure contractors' API keys are revoked within 24 hours of offboarding.
  10. Document any failed job_id in team runbook with root cause and preventive action.

Operators who skip checklist items six and seven typically discover tool schema drift or quota exhaustion during deadline week — preventable with discipline.


Caption at the speed of agents. Sign up · configure /mcp

Frequently asked questions

How does VisionDraft generate captions?

The generate_captions MCP tool downloads the asset, runs Faster-Whisper transcription, saves caption records, and inserts timed segments into the project timeline JSON.

Can agents trigger captioning automatically?

Yes. Claude or ChatGPT calls generate_captions after upload_asset or complete_upload, then render_project with burn_captions true for hardcoded subtitles.

What languages are supported?

Pass the language parameter on generate_captions (default en). Whisper-family models support many languages; verify quality for your locale.

Are captions editable?

Segments live in timeline JSON and caption tables. Re-run generate_captions or future edit tools; re-render to update burned output.

Do captions count against plan limits?

Yes. Caption generation is metered per plan — see pricing for included caption minutes.
