ElevenLabs Scribe Transcription

Transcribe audio or video files using the Scribe speech-to-text model with automatic language detection, speaker diarization, and word-level timestamps. Ideal for meeting notes, podcast transcription, and subtitle generation.

Overview

ElevenLabs Scribe Transcription turns a public audio or video file into transcript text with detected language, optional speaker diarization, and downloadable word-level timing. Use it for meetings, interviews, podcasts, videos, voice notes, captions, and quote extraction.

Use cases

  • Transcribe an interview or podcast and download word timing for clips or captions.
  • Create a meeting or voice-note transcript with optional speaker grouping.
  • Capture language confidence, transcript text, word count, and timing data for a content-repurposing brief.
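
As a rough illustration of the caption use case, the sketch below groups word timings into SRT caption cues. It assumes each word entry is a dict with text, start, and end keys, with times in seconds. Those key names and the words_to_srt helper are illustrative, not a documented schema, so adjust them to match the timing JSON you actually download.

```python
def format_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words, max_words=8):
    """Build SRT text from word entries, at most max_words words per cue.

    Assumed (hypothetical) entry shape: {"text": str, "start": float, "end": float}.
    """
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start = format_timestamp(chunk[0]["start"])
        end = format_timestamp(chunk[-1]["end"])
        text = " ".join(w["text"] for w in chunk)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    # A blank line separates consecutive cues.
    return "\n".join(cues)
```

Splitting on a fixed word count keeps the sketch short. Real captions usually also break on pauses, punctuation, or speaker changes.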

Input tips

  • Provide a public audio or video URL in audio_url; it must be fetchable without a login.
  • Leave language_code blank for automatic language detection, or set it when the language is known.
  • Enable diarize when you need speaker grouping; use num_speakers when you know the expected speaker count.
  • Use tag_audio_events when laughter, music, applause, or similar events matter.
  • Keep word timestamps on when you need clip, caption, or quote timing.
  • Use seed for reproducible results when comparing runs.
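
The tips above can be sketched as a small helper that assembles the input values before a run. The field names audio_url, language_code, diarize, num_speakers, tag_audio_events, and seed come from this page. The build_inputs function, the dict shape, and the example URL are assumptions for illustration only. The word-timestamp switch is left out because its field name is not given here.

```python
from typing import Optional


def build_inputs(audio_url: str,
                 language_code: Optional[str] = None,
                 diarize: bool = False,
                 num_speakers: Optional[int] = None,
                 tag_audio_events: bool = False,
                 seed: Optional[int] = None) -> dict:
    """Assemble tool inputs, omitting fields that were left blank."""
    if not audio_url.startswith(("http://", "https://")):
        raise ValueError("audio_url must be a public http(s) URL")
    inputs = {
        "audio_url": audio_url,
        "language_code": language_code,  # blank means automatic detection
        "diarize": diarize,
        "num_speakers": num_speakers,
        "tag_audio_events": tag_audio_events,
        "seed": seed,
    }
    # Drop None and empty-string values so blank fields are not sent at all.
    return {k: v for k, v in inputs.items() if v is not None and v != ""}


# Example: a two-speaker interview with a fixed seed for comparable runs.
interview_inputs = build_inputs(
    "https://example.com/interview.mp3",  # placeholder URL
    diarize=True,
    num_speakers=2,
    seed=7,
)
```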

Expected output

The AI Tool returns the detected language code and its confidence, the full transcript text, the total word count, an optional transcription ID, a URL for a downloadable JSON file of word-level timing (with speaker and confidence fields), and cost metadata. The output view lets you copy the transcript, download the timing data, and view speaker-grouped text when diarization data is available.
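
To show how the timing data might be turned into speaker-grouped text outside the output view, here is a minimal sketch. It assumes the downloaded JSON yields a list of word entries with text, start, end, and speaker keys. These key names and the group_by_speaker helper are hypothetical, so check them against the real file.

```python
def group_by_speaker(words):
    """Merge consecutive words from the same speaker into turns.

    Assumed (hypothetical) entry shape:
    {"text": str, "start": float, "end": float, "speaker": str or None}.
    """
    turns = []
    for word in words:
        # Words without a speaker label are collected under "unknown".
        speaker = word.get("speaker") or "unknown"
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker as the previous word: extend the current turn.
            turns[-1]["text"] += " " + word["text"]
            turns[-1]["end"] = word["end"]
        else:
            # Speaker changed (or first word): start a new turn.
            turns.append({
                "speaker": speaker,
                "start": word["start"],
                "end": word["end"],
                "text": word["text"],
            })
    return turns
```

Each turn keeps the start time of its first word and the end time of its last word, which is enough to label a transcript or cut a clip per speaker.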

Caveats

  • Review transcripts before quoting, publishing, or using them as a source of record.
  • Noisy audio, overlapping speakers, accents, music, or poor source quality can reduce accuracy.
  • Speaker labels are generic and may need human review.
  • Timing data is returned as a downloadable JSON file rather than embedded inline.
  • Use Forced Alignment instead when you already have transcript text and need precise timing against audio.
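
One way to focus the human review recommended above is to list the words whose confidence falls below a threshold. This sketch assumes each word entry carries a numeric confidence field between 0 and 1 alongside text and start. The field names, the 0.6 default, and the low_confidence_words helper are all illustrative assumptions, not documented behavior.

```python
def low_confidence_words(words, threshold=0.6):
    """Return (start, text, confidence) for words scored below threshold.

    Assumed (hypothetical) entry shape:
    {"text": str, "start": float, "confidence": float between 0 and 1}.
    Entries without a confidence value are skipped, not flagged.
    """
    flagged = []
    for word in words:
        confidence = word.get("confidence")
        if confidence is not None and confidence < threshold:
            flagged.append((word["start"], word["text"], confidence))
    return flagged
```

A reviewer can then jump to each flagged start time in the source audio before quoting or publishing that passage.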