ElevenLabs Scribe Transcription
Transcribe audio or video files using the Scribe speech-to-text model with automatic language detection, speaker diarization, and word-level timestamps. Ideal for meeting notes, podcast transcription, and subtitle generation.
Overview
ElevenLabs Scribe Transcription turns a public audio or video file into transcript text with detected language, optional speaker diarization, and downloadable word-level timing. Use it for meetings, interviews, podcasts, videos, voice notes, captions, and quote extraction.
Use cases
- Transcribe an interview or podcast and download word timing for clips or captions.
- Create a meeting or voice-note transcript with optional speaker grouping.
- Capture language confidence, transcript text, word count, and timing data for a content-repurposing brief.
Input tips
- Provide a public audio or video URL as audio_url; it must be fetchable without login.
- Leave language_code blank for automatic language detection, or set it when the language is known.
- Enable diarize when you need speaker grouping; use num_speakers when you know the expected speaker count.
- Use tag_audio_events when laughter, music, applause, or similar events matter.
- Keep word timestamps on when you need clip, caption, or quote timing.
- Use seed for reproducible results when comparing runs.
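The tips above can be sketched as a small helper that assembles an input payload. This is only an illustration: the function name `build_scribe_inputs` and the `word_timestamps` field name are hypothetical (the page does not name the word-timestamp toggle), and rejecting `num_speakers` when `diarize` is off is an assumption of this sketch, not documented behavior.

```python
from typing import Any, Dict, Optional
from urllib.parse import urlparse


def build_scribe_inputs(
    audio_url: str,
    language_code: Optional[str] = None,
    diarize: bool = False,
    num_speakers: Optional[int] = None,
    tag_audio_events: bool = False,
    word_timestamps: bool = True,  # hypothetical field name
    seed: Optional[int] = None,
) -> Dict[str, Any]:
    """Assemble an input dict following the tips above (illustrative only)."""
    parsed = urlparse(audio_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("audio_url must be a public http(s) URL")
    # Assumption of this sketch: a speaker count only makes sense with diarization.
    if num_speakers is not None and not diarize:
        raise ValueError("num_speakers requires diarize=True")
    if num_speakers is not None and num_speakers < 1:
        raise ValueError("num_speakers must be at least 1")

    inputs: Dict[str, Any] = {
        "audio_url": audio_url,
        "diarize": diarize,
        "tag_audio_events": tag_audio_events,
        "word_timestamps": word_timestamps,
    }
    # A blank language_code is omitted so automatic detection is used.
    if language_code and language_code.strip():
        inputs["language_code"] = language_code.strip()
    if num_speakers is not None:
        inputs["num_speakers"] = num_speakers
    # A fixed seed is only included when the caller wants comparable runs.
    if seed is not None:
        inputs["seed"] = seed
    return inputs
```

For example, `build_scribe_inputs("https://example.com/interview.mp3", diarize=True, num_speakers=2, seed=7)` produces a payload with no `language_code` key, leaving language detection automatic.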
Expected output
The AI Tool returns the detected language code and confidence, the full transcript text, the total word count, an optional transcription ID, a downloadable JSON URL for word-level timing (with speaker and confidence fields), and cost metadata. The output view supports copying the transcript, downloading the timing data, and showing speaker-grouped text when diarization data is available.
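Speaker-grouped text like that shown in the output view can also be rebuilt from the downloaded timing data. The sketch below assumes each entry in the timing JSON is a dict with `text` and `speaker_id` keys; those key names, and the `group_by_speaker` helper itself, are assumptions, since the page does not publish the JSON schema.

```python
from typing import Any, Dict, List, Tuple


def group_by_speaker(words: List[Dict[str, Any]]) -> List[Tuple[str, str]]:
    """Merge consecutive words from the same speaker into (speaker, text) turns.

    Assumes each word dict has "text" and, when diarization ran, "speaker_id".
    """
    turns: List[Tuple[str, List[str]]] = []
    for word in words:
        text = str(word.get("text", "")).strip()
        if not text:
            continue  # skip empty or whitespace-only tokens
        speaker = word.get("speaker_id") or "unknown"
        if turns and turns[-1][0] == speaker:
            turns[-1][1].append(text)  # same speaker keeps talking
        else:
            turns.append((speaker, [text]))  # speaker changed: start a new turn
    return [(speaker, " ".join(parts)) for speaker, parts in turns]


if __name__ == "__main__":
    sample = [
        {"text": "Hello", "speaker_id": "speaker_0"},
        {"text": "there.", "speaker_id": "speaker_0"},
        {"text": "Hi!", "speaker_id": "speaker_1"},
    ]
    for speaker, text in group_by_speaker(sample):
        print(f"{speaker}: {text}")
```

Words without a speaker label fall into an "unknown" turn, so the helper still works on timing data from runs where diarization was off.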
Caveats
- Review transcripts before quoting, publishing, or using them as a source of record.
- Noisy audio, overlapping speakers, accents, music, or poor source quality can reduce accuracy.
- Speaker labels are generic and may need human review.
- Timing data is returned as a downloadable JSON file rather than embedded inline.
- Use Forced Alignment instead when you already have transcript text and need precise timing against audio.
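Because timing data arrives as a separate JSON file, captions have to be assembled from it after download. A minimal sketch of turning word timing into SRT cues follows, assuming each word dict carries `text`, `start`, and `end` (in seconds); those field names and the fixed words-per-cue rule are assumptions for illustration, not part of the tool.

```python
from typing import Any, Dict, List


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words: List[Dict[str, Any]], max_words: int = 7) -> str:
    """Chunk timed words into numbered SRT cues of at most max_words each.

    Assumes word dicts with "text", "start", and "end" keys (seconds).
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    cues: List[str] = []
    for i in range(0, len(words), max_words):
        chunk = words[i : i + max_words]
        start = srt_timestamp(chunk[0]["start"])  # cue opens with its first word
        end = srt_timestamp(chunk[-1]["end"])     # and closes with its last word
        text = " ".join(str(w["text"]) for w in chunk)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)
```

A fixed word count per cue is the simplest possible rule; splitting on punctuation or pauses would usually read better but needs more logic than fits here.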
Related AI Tools

OpenAI GPT-4o Speech-to-Text
Transcribe audio or video files using GPT-4o transcribe models. Supports prompt guidance for improved accuracy with proper nouns and terminology. Best for single-speaker content or when speaker identification is not needed. Maximum file size 25 MB.

Audio-Text Forced Alignment
Force align an audio file to a text transcript and get precise timing information for each character and word. Ideal for subtitles, lip-sync, karaoke, and audio-text synchronization.

YouTube Transcript
Fetch transcript segments and plain text for a public YouTube video when captions are available. Unavailable-caption and missing-language states are preserved in the output, so an otherwise useful paid run does not fail.