Text to Speech

Minimax Speech-02

Generate natural-sounding speech audio from text (up to 10,000 characters) using Minimax Speech-02 models, with selectable voices and emotion control.

Overview

Minimax Speech-02 turns text into speech audio with selectable voices, emotion control, language optimization, pronunciation guidance, and detailed audio settings. Use it to draft voiceovers for ads, product videos, podcast segments, explainers, and other narration from up to 10,000 characters of text.

Use cases

  • Generate a spoken version of a product script before recording a final voiceover.
  • Create narration audio for a social video, demo, tutorial, or podcast segment.
  • Test voice IDs, emotion, speed, pitch, language boost, and audio formats for a production handoff.

Input tips

  • Keep text under 10,000 characters and listen through the result before sharing.
  • Use a built-in MiniMax voice ID or an approved custom cloned voice ID.
  • Choose speech-02-hd for the default quality path or speech-02-turbo when speed matters.
  • Use emotion, speed, volume, and pitch controls to tune delivery.
  • Add pronunciation overrides for names, acronyms, or product terms.
  • Choose mp3 for most previews; wav, pcm, flac, and aac are available when needed.
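
The tips above can be checked before a run. The following is a minimal sketch, not the tool's actual input schema: the function name and the field names (`text`, `voice_id`, `model`, `audio_format`, `pronunciations`) are hypothetical, and it assumes the 10,000-character limit is counted the way Python's `len` counts characters. Only the model names, format names, and the limit come from this page.

```python
MAX_CHARS = 10_000  # character limit stated in the tips above
MODELS = ("speech-02-hd", "speech-02-turbo")
AUDIO_FORMATS = ("mp3", "wav", "pcm", "flac", "aac")


def build_speech_inputs(text, voice_id, model="speech-02-hd",
                        audio_format="mp3", pronunciations=None):
    """Validate the inputs and return them as a plain dict.

    All field names are hypothetical placeholders, not the tool's real schema.
    """
    if not text.strip():
        raise ValueError("text is empty")
    if len(text) > MAX_CHARS:
        raise ValueError(
            f"text has {len(text)} characters; the limit is {MAX_CHARS}"
        )
    if not voice_id:
        raise ValueError("a voice ID is required")
    if model not in MODELS:
        raise ValueError(f"unknown model {model!r}; expected one of {MODELS}")
    if audio_format not in AUDIO_FORMATS:
        raise ValueError(
            f"unknown format {audio_format!r}; expected one of {AUDIO_FORMATS}"
        )
    return {
        "text": text,
        "voice_id": voice_id,
        "model": model,
        "audio_format": audio_format,
        # Mapping of written term -> spoken form, e.g. {"SQL": "sequel"}.
        "pronunciations": dict(pronunciations or {}),
    }
```

Calling `build_speech_inputs("Welcome to the demo.", "my-voice-id")` would return a dict that uses speech-02-hd and mp3 by default. The check cannot tell whether a voice ID actually exists; that is still decided by the tool's own validation.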

Expected output

The AI Tool returns one generated speech audio file with a downloadable URL, content type, file name, file size, duration, sample rate, bitrate, audio format, channel count, word count, billed-character count, status metadata, and cost metadata. The Speech-02 template renders an audio player and key technical details.
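
As an illustration of how the returned metadata might be consumed, the sketch below formats a one-line summary from a result dict. The key names (`file_name`, `file_size`, `duration`, `sample_rate`, `channels`, `audio_format`) and the units (bytes, seconds, Hz) are assumptions made for this example; the page lists the fields but not their exact names or units.

```python
def describe_audio(result):
    """Return a short human-readable summary of a speech result.

    Assumes hypothetical keys: file_size in bytes, duration in seconds,
    sample_rate in Hz.
    """
    size_kib = result["file_size"] / 1024
    return (
        f'{result["file_name"]}: {result["duration"]:.1f} s, '
        f'{result["sample_rate"]} Hz, {result["channels"]} ch, '
        f'{size_kib:.1f} KiB ({result["audio_format"]})'
    )


# Made-up values, for illustration only.
example = {
    "file_name": "demo.mp3",
    "file_size": 2048,
    "duration": 3.5,
    "sample_rate": 32000,
    "channels": 1,
    "audio_format": "mp3",
}
print(describe_audio(example))
```

A summary like this is convenient in a production handoff note, where the receiving editor needs the format and sample rate at a glance.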

Caveats

  • Generated speech should be reviewed for pronunciation, tone, pacing, and brand fit.
  • Invalid or unavailable voice IDs will fail validation.
  • Emotion choices are limited to the supported named emotions; fluent and whisper are not supported.
  • Pronunciation overrides can help with custom terms but are not a substitute for listening review.
  • Longer text and richer settings can increase generation time and cost.
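
The emotion caveat can be caught early with a small guard. This sketch rejects only the two emotions this page names as unsupported; it does not know the full supported list, so a value that passes here can still fail the tool's own validation. The function name is hypothetical.

```python
UNSUPPORTED_EMOTIONS = {"fluent", "whisper"}  # named as unsupported above


def check_emotion(emotion):
    """Return the normalized emotion name, or None if no emotion was given.

    Raises ValueError for the emotions this page lists as unsupported.
    """
    if emotion is None:
        return None
    name = emotion.strip().lower()
    if name in UNSUPPORTED_EMOTIONS:
        raise ValueError(f"emotion {name!r} is not supported by Speech-02")
    return name
```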