Text to Speech

ElevenLabs TTS Multilingual v2

Generate high-quality speech from text with character-level timing using the Multilingual v2 model. Supports style exaggeration and speaker boost for enhanced voice quality.

Overview

ElevenLabs TTS Multilingual v2 converts text into speech audio in the voice identified by a selected voice ID. Use it for multilingual voiceover drafts, ad narration, product demos, podcasts, or explainer scripts when you need style exaggeration, speaker boost, speed control, continuity text, or character-level timing.

Use cases

  • Generate voiceover audio for a demo, ad, podcast segment, or explainer script.
  • Use style exaggeration and speaker boost to test more expressive or more voice-faithful delivery.
  • Generate character-level timing JSON for captions, lip-sync prep, or audio-text synchronization.
  • Use previous_text and next_text to improve continuity across separately generated script sections.
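The continuity use case can be sketched as follows. This is a minimal illustration, assuming each script section is sent as a separate request and that the payload keys match the input names used on this page (text, voice_id, previous_text, next_text). The helper build_section_requests and the placeholder voice ID are hypothetical and not part of the tool.

```python
from typing import Dict, List


def build_section_requests(sections: List[str], voice_id: str) -> List[Dict[str, str]]:
    """Build one payload per script section, passing neighbouring text as context.

    Hypothetical helper: the payload keys mirror the input names on this page,
    but the exact request shape of the tool is an assumption.
    """
    requests = []
    for i, text in enumerate(sections):
        payload = {"text": text, "voice_id": voice_id}
        if i > 0:
            # Text spoken just before this section.
            payload["previous_text"] = sections[i - 1]
        if i < len(sections) - 1:
            # Text that will be spoken right after this section.
            payload["next_text"] = sections[i + 1]
        requests.append(payload)
    return requests


sections = [
    "Welcome to the demo.",
    "Here is the first feature.",
    "Thanks for watching.",
]
payloads = build_section_requests(sections, voice_id="YOUR_VOICE_ID")
```

The first section carries no previous_text and the last carries no next_text, so only context that actually exists in the script is supplied.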

Input tips

  • Keep text under 10,000 characters.
  • Provide a valid ElevenLabs voice_id from available default or custom voices.
  • Set language only when you need Multilingual v2 to enforce a specific language.
  • Use stability, similarity_boost, style, speed, and speaker boost to shape delivery.
  • Choose output_format only when a specific audio handoff format matters; otherwise use the default.
  • Use SSML break tags (up to 3 seconds each), dashes, or ellipses to add pauses and hesitation.
  • Use seed for repeatability, but treat it as best effort.
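Two of the tips above, the 10,000-character limit and the 3-second cap on break tags, are easy to check before submitting. The sketch below is one possible pre-flight check. The function check_text is hypothetical, and the regular expression assumes break tags are written in the form <break time="1.5s" />.

```python
import re
from typing import List

MAX_CHARS = 10_000       # "Keep text under 10,000 characters"
MAX_BREAK_SECONDS = 3.0  # "SSML break tags up to 3 seconds"

# Assumes breaks are written as <break time="1.5s" /> with the time in seconds.
BREAK_TAG = re.compile(r'<break\s+time="(\d+(?:\.\d+)?)s"\s*/>')


def check_text(text: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    if len(text) >= MAX_CHARS:
        problems.append(f"text has {len(text)} characters; keep it under {MAX_CHARS}")
    for match in BREAK_TAG.finditer(text):
        seconds = float(match.group(1))
        if seconds > MAX_BREAK_SECONDS:
            problems.append(f"break of {seconds}s exceeds {MAX_BREAK_SECONDS}s")
    return problems


ok_problems = check_text('Hello <break time="1.5s" /> world')
long_break_problems = check_text('Wait <break time="5s" /> for it')
```

A check like this only catches the limits stated on this page; the tool may still reject input for other reasons.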

Expected output

The AI Tool returns one generated speech audio file with a downloadable URL, content type, optional file size, alignment JSON URLs when available, and cost metadata. The shared ElevenLabs TTS view renders an audio player and download links for character-level timing and normalized timing when present.
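For captions, character-level timing usually needs to be grouped into words. The sketch below shows one way to do that, assuming the downloaded alignment JSON holds three parallel lists named characters, character_start_times_seconds, and character_end_times_seconds. That shape is an assumption and should be checked against the actual file; words_from_alignment is a hypothetical helper.

```python
from typing import Dict, List


def words_from_alignment(alignment: Dict[str, list]) -> List[Dict[str, object]]:
    """Group per-character timings into per-word timings, splitting on whitespace.

    The key names below are an assumed layout for the alignment JSON.
    """
    chars = alignment["characters"]
    starts = alignment["character_start_times_seconds"]
    ends = alignment["character_end_times_seconds"]

    words = []
    current = ""
    word_start = 0.0
    word_end = 0.0
    for ch, start, end in zip(chars, starts, ends):
        if ch.isspace():
            # Whitespace closes the word collected so far, if any.
            if current:
                words.append({"word": current, "start": word_start, "end": word_end})
                current = ""
            continue
        if not current:
            word_start = start  # first character of a new word
        current += ch
        word_end = end
    if current:
        words.append({"word": current, "start": word_start, "end": word_end})
    return words


sample = {
    "characters": ["H", "i", " ", "y", "o", "u"],
    "character_start_times_seconds": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
    "character_end_times_seconds": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
}
word_timings = words_from_alignment(sample)
```

Punctuation stays attached to the neighbouring word in this sketch, which is usually acceptable for caption cues but may need adjusting for lip-sync work.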

Caveats

  • Voice IDs must be valid and permitted for use.
  • Generated speech should be reviewed for pronunciation, tone, pacing, and brand fit.
  • Style, similarity, speaker boost, speed, and latency settings can change delivery and generation time.
  • Seeded generation is best effort; exact determinism is not guaranteed.
  • Timing JSON is auxiliary synchronization data, not a full transcript editor.
  • Some requested formats or language settings may fail if unsupported by the selected model.