InWorld Text-to-Speech

Generate ultra-realistic speech from text using InWorld AI TTS models, with rich, expressive voices and word-level timestamp alignment.

Overview

InWorld Text-to-Speech converts up to 2,000 characters of text into speech audio. You choose an InWorld system voice or a cloned voice, a model tier, and an audio format, and you can adjust speaking rate, temperature, and text normalization. Use it for voiceover drafts, narration, ad reads, and audio that may need timestamp alignment for captions or lip-sync prep.

Use cases

Generate a narration, ad read, product-demo script, or explainer voiceover from text.
Use a valid cloned voice ID when a saved InWorld voice should perform the script.
Create audio with timestamp alignment data for caption, highlighting, or lip-sync preparation.
Compare standard and max model tiers before choosing a voiceover direction.

Input tips

Keep text at or under 2,000 characters.
Provide a valid InWorld voice_id, either a system voice or a cloned voice ID from your organization.
Choose inworld-tts-1 for the standard model or inworld-tts-1-max when the max tier is needed.
Use audio_config to set output encoding, bit rate, sample rate, and speaking rate (0.5 to 1.5).
Adjust temperature only when you want more or less variation in delivery.
Leave text_normalization on when numbers, dates, or abbreviations should be spoken naturally.
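
The tips above can be checked before a run. The following is a minimal sketch, not the tool's actual client: the function name build_tts_inputs and the key names "model" and "speaking_rate" are assumptions made for illustration, while text, voice_id, audio_config, text_normalization, the two model names, the 2,000-character cap, and the 0.5 to 1.5 range come from this page.

```python
from typing import Any

MAX_TEXT_CHARS = 2000                               # cap stated on this page
MODELS = ("inworld-tts-1", "inworld-tts-1-max")     # model tiers named on this page
MIN_RATE, MAX_RATE = 0.5, 1.5                       # speaking-rate range on this page


def build_tts_inputs(
    text: str,
    voice_id: str,
    model: str = "inworld-tts-1",
    speaking_rate: float = 1.0,
    text_normalization: bool = True,
) -> dict[str, Any]:
    """Validate inputs against the documented limits and return a payload dict.

    The "model" and "speaking_rate" key names are hypothetical; check the
    tool's input form for the real field names.
    """
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(f"text is {len(text)} characters; the cap is {MAX_TEXT_CHARS}")
    if not voice_id:
        raise ValueError("voice_id is required (system voice or cloned voice ID)")
    if model not in MODELS:
        raise ValueError(f"model must be one of {MODELS}")
    if not MIN_RATE <= speaking_rate <= MAX_RATE:
        raise ValueError(f"speaking rate must be between {MIN_RATE} and {MAX_RATE}")
    return {
        "text": text,
        "voice_id": voice_id,
        "model": model,                                     # hypothetical key name
        "audio_config": {"speaking_rate": speaking_rate},   # hypothetical nested key
        "text_normalization": text_normalization,
    }
```

Calling build_tts_inputs("Welcome to the demo.", "my-voice-id") returns a dict that uses the standard tier and a speaking rate of 1.0; an over-length script or an out-of-range rate raises ValueError before any run is started.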

Expected output

The AI Tool returns one generated speech audio file with a downloadable URL, optional content type, file name, file size, optional timestamp alignment data, and cost metadata. The output view renders an audio player and shows word count, alignment type, and estimated duration when timestamp data is available.
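
When word-level timestamp data is present, it can be grouped into caption cues. The sketch below assumes a hypothetical alignment shape, a list of dicts with "word", "start", and "end" keys in seconds; the real field names and structure returned by the tool are not documented on this page and may differ.

```python
from typing import Optional


def words_to_cues(words: Optional[list[dict]], max_words: int = 7) -> list[dict]:
    """Group word timestamps into caption cues of at most max_words words.

    Assumes each item looks like {"word": str, "start": float, "end": float}
    (a hypothetical shape). Returns an empty list when alignment is absent.
    """
    if not words:
        return []
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        cues.append({
            "start": group[0]["start"],   # cue starts with its first word
            "end": group[-1]["end"],      # cue ends with its last word
            "text": " ".join(w["word"] for w in group),
        })
    return cues
```

Returning an empty list for missing alignment keeps downstream caption code from failing when the output carries no timestamp data.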

Caveats

Voice IDs must refer to valid InWorld voices that your organization can use.
Generated speech should be reviewed for pronunciation, pacing, tone, and brand fit.
Timestamp alignment data may be absent or partial depending on the returned audio metadata.
This AI Tool returns audio only; it does not create video, avatars, transcripts, or cloned voices.
Very long scripts need to be split into separate runs because text is capped at 2,000 characters.
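
One way to split a long script into separate runs is to pack whole sentences into chunks that stay within the cap. This is a minimal sketch under that assumption; split_script is a hypothetical helper, and the sentence detection is a simple punctuation-based heuristic rather than anything the tool provides.

```python
import re

MAX_TEXT_CHARS = 2000  # cap stated on this page


def split_script(script: str, limit: int = MAX_TEXT_CHARS) -> list[str]:
    """Split a script into chunks of at most `limit` characters.

    Sentences (text ending in ., !, or ?) are kept whole where possible.
    A single sentence longer than the limit is cut at the limit, which
    may fall mid-word.
    """
    sentences = re.split(r"(?<=[.!?])\s+", script.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-cut any sentence that cannot fit in one chunk by itself.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be submitted as its own run with the same voice and settings, and the resulting audio files joined in an editor. Delivery may vary slightly between runs, so review the joins.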

Related AI Tools

InWorld Voice Clone

ElevenLabs TTS v3

Minimax Speech v2.8