ElevenLabs Dialogue v3

Generate multi-speaker dialogue audio from text inputs, with precise voice segment timing, using the Turbo v3 model. Ideal for podcasts, conversations, and character dialogues.

Overview

ElevenLabs Dialogue v3 generates multi-speaker dialogue audio from a sequence of text turns, each paired with a voice ID. Use it for podcast-style exchanges, character conversations, multi-speaker ad reads, or scripted product conversations that need speaker timing.

Use cases

  • Turn a two- or multi-speaker script into generated conversation audio.
  • Create a podcast intro, customer-style scene, or character dialogue draft for a campaign asset.
  • Generate speaker timing data for editing, captions, or downstream audio/video synchronization.
  • Test voices, language settings, stability, and text normalization before a final recording.

Input tips

  • Add 1-50 dialogue turns; each turn needs text and a voice_id.
  • Use different voice IDs for different speakers, and keep speaker turns clearly separated.
  • Use supported audio tags such as [laughing], [whispering], [pause], or emotion cues only when they fit the script.
  • Set language only when you need the model to enforce a specific ISO language code.
  • Adjust stability to trade expressiveness against consistency: lower values generally give more emotional range, higher values give more consistent delivery.
  • Choose output_format only when a specific handoff format matters; otherwise use the default.
  • Use seed for repeatability, but do not treat it as guaranteed.
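
The input tips above can be sketched as a small validation helper. This is a minimal illustration, not the tool's actual request schema: the field names text, voice_id, output_format, and seed come from the tips, while the top-level "inputs" key, the Turn class, the build_inputs function, and the voice IDs are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Any, Optional

MIN_TURNS, MAX_TURNS = 1, 50  # turn limits stated in the input tips


@dataclass
class Turn:
    """One dialogue turn: the spoken text and the voice that speaks it."""
    text: str
    voice_id: str


def build_inputs(
    turns: list[Turn],
    language: Optional[str] = None,
    stability: Optional[float] = None,
    output_format: Optional[str] = None,
    seed: Optional[int] = None,
) -> dict[str, Any]:
    """Validate the turns and assemble a payload (hypothetical shape)."""
    if not MIN_TURNS <= len(turns) <= MAX_TURNS:
        raise ValueError(f"expected {MIN_TURNS}-{MAX_TURNS} turns, got {len(turns)}")
    for i, turn in enumerate(turns):
        if not turn.text.strip():
            raise ValueError(f"turn {i} has no text")
        if not turn.voice_id.strip():
            raise ValueError(f"turn {i} has no voice_id")

    # Assumption: the turns live under an "inputs" key; check the real schema.
    payload: dict[str, Any] = {
        "inputs": [{"text": t.text, "voice_id": t.voice_id} for t in turns]
    }
    # Optional settings are included only when set, so defaults apply otherwise.
    optional = {
        "language": language,
        "stability": stability,
        "output_format": output_format,
        "seed": seed,
    }
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


# Hypothetical voice IDs; replace with IDs that are valid for your account.
payload = build_inputs(
    [
        Turn("Welcome back to the show. [laughing]", "voice_host"),
        Turn("[whispering] Glad to be here.", "voice_guest"),
    ],
    seed=7,
)
```

Leaving language, stability, and output_format unset keeps them out of the payload entirely, which matches the advice to set them only when a specific value matters.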

Expected output

The AI Tool returns one generated multi-speaker audio file with a downloadable URL, a content type, and an optional file size, along with alignment JSON URLs when available, a voice-segments JSON URL, and cost metadata. The output view renders an audio player and, when speaker data is available, a speaker timeline with downloadable JSON.
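
As a rough sketch of how the voice-segments JSON might feed editing or captioning work, the snippet below turns a list of segments into a readable timeline and a per-voice speaking total. The segment field names (voice_id, start_time_seconds, end_time_seconds) are an assumption about the file's layout, not a documented schema, and the function names and sample data are hypothetical; inspect a real download before relying on them.

```python
from collections import defaultdict


def format_timeline(segments, speaker_names):
    """Return one line per segment, ordered by start time.

    speaker_names maps a voice_id to a display name; unknown IDs are shown as-is.
    """
    ordered = sorted(segments, key=lambda s: s["start_time_seconds"])
    lines = []
    for seg in ordered:
        name = speaker_names.get(seg["voice_id"], seg["voice_id"])
        start, end = seg["start_time_seconds"], seg["end_time_seconds"]
        lines.append(f"{start:.2f}s - {end:.2f}s  {name}")
    return lines


def speaking_time(segments):
    """Sum the duration of all segments for each voice_id."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["voice_id"]] += seg["end_time_seconds"] - seg["start_time_seconds"]
    return dict(totals)


# Sample data in the assumed shape; a real run would load the downloaded JSON
# file with json.load() instead.
segments = [
    {"voice_id": "voice_guest", "start_time_seconds": 1.5, "end_time_seconds": 3.0},
    {"voice_id": "voice_host", "start_time_seconds": 0.0, "end_time_seconds": 1.5},
    {"voice_id": "voice_host", "start_time_seconds": 3.0, "end_time_seconds": 4.0},
]

for line in format_timeline(segments, {"voice_host": "Host"}):
    print(line)
```

Sorting by start time before formatting means the timeline reads in playback order even if the file lists segments grouped by speaker.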

Caveats

  • Voice IDs must be valid and usable for the selected dialogue.
  • Generated voices, timing, emotional tags, and pronunciation need listening review before publishing.
  • Seeded generation is best effort; exact determinism is not guaranteed.
  • Speaker timeline and alignment files are auxiliary JSON downloads, not a full transcript editor.
  • Long, ambiguous, or poorly separated turns can make speaker attribution harder to review.
  • This AI Tool generates audio only; it does not create video or avatars.