Speech to Text

ElevenLabs Scribe Transcription

Transcribe audio or video files using the Scribe speech-to-text model with automatic language detection, speaker diarization, and word-level timestamps. Ideal for meeting notes, podcast transcription, and subtitle generation.

Overview

ElevenLabs Scribe Transcription turns a public audio or video file into transcript text with detected language, optional speaker diarization, and downloadable word-level timing. Use it for meetings, interviews, podcasts, videos, voice notes, captions, and quote extraction.

Use cases

Transcribe an interview or podcast and download word timing for clips or captions.
Create a meeting or voice-note transcript with optional speaker grouping.
Capture language confidence, transcript text, word count, and timing data for a content-repurposing brief.

Input tips

Provide a public audio or video URL (audio_url) that can be fetched without login.
Leave language_code blank for automatic language detection, or set it when the language is known.
Enable diarize when you need speaker grouping; use num_speakers when you know the expected speaker count.
Use tag_audio_events when laughter, music, applause, or similar events matter.
Keep word timestamps on when you need clip, caption, or quote timing.
Use seed for reproducible results when comparing runs.
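
The tips above can be collected into a single input payload. The following is a minimal sketch, assuming the tool accepts a flat key-value object. The names audio_url, language_code, diarize, num_speakers, tag_audio_events, and seed come from the tips; the build_inputs helper, the word_timestamps key, and the validation rules are hypothetical and may not match the actual input form.

```python
from typing import Any, Optional
from urllib.parse import urlparse


def build_inputs(
    audio_url: str,
    language_code: Optional[str] = None,   # None means automatic detection
    diarize: bool = False,
    num_speakers: Optional[int] = None,
    tag_audio_events: bool = False,
    word_timestamps: bool = True,          # hypothetical key name
    seed: Optional[int] = None,
) -> dict[str, Any]:
    """Assemble a hypothetical input payload from the tips above."""
    parsed = urlparse(audio_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("audio_url must be a public http(s) URL")

    # Assumption: a speaker count is only meaningful with diarization on.
    if num_speakers is not None:
        if not diarize:
            raise ValueError("num_speakers requires diarize=True")
        if num_speakers < 1:
            raise ValueError("num_speakers must be at least 1")

    inputs: dict[str, Any] = {
        "audio_url": audio_url,
        "diarize": diarize,
        "tag_audio_events": tag_audio_events,
        "word_timestamps": word_timestamps,
    }
    # Optional fields are omitted rather than sent as empty values.
    if language_code:
        inputs["language_code"] = language_code
    if num_speakers is not None:
        inputs["num_speakers"] = num_speakers
    if seed is not None:
        inputs["seed"] = seed
    return inputs
```

Leaving language_code out of the payload mirrors the "leave blank" tip, and passing the same seed on two runs is what makes a side-by-side comparison meaningful.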

Expected output

The AI Tool returns the detected language code and confidence, the full transcript text, the total word count, an optional transcription ID, a downloadable JSON URL for word-level timing with speaker and confidence fields, and cost metadata. The output view supports copying the transcript, downloading timing data, and showing speaker-grouped text when diarization data is available.
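
As an illustration of what speaker-grouped text could look like when built from the downloaded timing data, here is a minimal sketch. It assumes the JSON file has already been loaded into a list of word objects with the keys text, start, end, and speaker; those key names and the speaker_turns helper are assumptions for this example, and the real file may be shaped differently.

```python
from itertools import groupby


def speaker_turns(words):
    """Merge consecutive words from the same speaker into one turn."""
    turns = []
    # groupby only merges adjacent items, which is what we want here:
    # a speaker who talks twice produces two separate turns.
    for speaker, group in groupby(words, key=lambda w: w.get("speaker", "unknown")):
        items = list(group)
        turns.append({
            "speaker": speaker,
            "start": items[0]["start"],
            "end": items[-1]["end"],
            "text": " ".join(w["text"] for w in items),
        })
    return turns


# Hypothetical sample data in the assumed shape.
sample = [
    {"text": "Hi", "start": 0.0, "end": 0.2, "speaker": "speaker_0"},
    {"text": "there", "start": 0.2, "end": 0.5, "speaker": "speaker_0"},
    {"text": "Hello", "start": 0.6, "end": 0.9, "speaker": "speaker_1"},
]
for turn in speaker_turns(sample):
    print(f'{turn["speaker"]} [{turn["start"]:.1f}-{turn["end"]:.1f}]: {turn["text"]}')
```

Each turn keeps the start of its first word and the end of its last word, which is usually enough for clip or caption boundaries.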

Caveats

Review transcripts before quoting, publishing, or using them as a source of record.
Noisy audio, overlapping speakers, accents, music, or poor source quality can reduce accuracy.
Speaker labels are generic and may need human review.
Timing data is returned as a downloadable JSON file rather than embedded inline.
Use Forced Alignment instead when you already have transcript text and need precise timing against audio.
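
One way to focus a human review pass is to list the words the model was least sure about. The sketch below assumes each word object in the timing JSON carries text, start, and a confidence value between 0 and 1; the key names, the 0.6 threshold, and the flag_low_confidence helper are all assumptions for illustration, not documented behavior.

```python
def flag_low_confidence(words, threshold=0.6):
    """Return (start, text, confidence) for words that merit a second look."""
    flagged = []
    for w in words:
        conf = w.get("confidence")
        # A missing confidence value is treated as "needs review".
        if conf is None or conf < threshold:
            flagged.append((w["start"], w["text"], conf))
    return flagged
```

A reviewer can then jump to each flagged start time in the source audio instead of re-listening to the whole recording.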

Related AI Tools

OpenAI GPT-4o Speech-to-Text

Audio-Text Forced Alignment

YouTube Transcript