Speech to Text

OpenAI GPT-4o Speaker Diarization

Transcribe audio with speaker identification using GPT-4o transcribe diarize. The tool identifies who said what and returns speaker-labeled segments with timing. It supports known speaker references for more accurate labeling. Best for meetings, interviews, and podcasts. Maximum file size: 25 MB.

Overview

OpenAI GPT-4o Speaker Diarization transcribes a public audio or video file into transcript text with speaker-labeled segments and timing. Use it for meetings, interviews, podcasts, panel audio, and video clips when you need to see who said what rather than a single combined transcript.

Use cases

  • Turn a customer interview into speaker-separated notes for quotes, summaries, and follow-up content.
  • Transcribe a podcast or panel clip with speaker labels before creating captions or show notes.
  • Create a meeting transcript that keeps speaker turns separate for review and research.
  • Use known speaker references when you want the AI Tool to label up to four expected speakers.

Input tips

  • Provide a public audio or video URL in audio_url that can be downloaded without login.
  • Keep source files within the 25 MB maximum.
  • Leave language blank for auto-detection, or select a language when it is known.
  • Keep response_format set to diarized_json when you need speaker-labeled segments.
  • Pair known_speaker_names with matching public known_speaker_references when labeling known speakers.
  • Use short, clean speaker reference clips when known speaker labeling matters.
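The tips above can be checked before a run. The following is a minimal sketch, not the AI Tool's own API: `build_inputs` is a hypothetical helper, the field names are taken from this page, and reading "25 MB" as 25 × 1024 × 1024 bytes is an assumption.

```python
from typing import Optional, Sequence

MAX_BYTES = 25 * 1024 * 1024  # assumption: "25 MB" interpreted as 25 MiB
MAX_KNOWN_SPEAKERS = 4        # "up to four expected speakers" per this page


def build_inputs(
    audio_url: str,
    language: Optional[str] = None,
    known_speaker_names: Sequence[str] = (),
    known_speaker_references: Sequence[str] = (),
    size_bytes: Optional[int] = None,
) -> dict:
    """Hypothetical helper: assemble and sanity-check the input fields."""
    if not audio_url.startswith(("http://", "https://")):
        raise ValueError("audio_url must be a public http(s) URL")
    # Only checked when the caller already knows the file size.
    if size_bytes is not None and size_bytes > MAX_BYTES:
        raise ValueError("source file exceeds the 25 MB maximum")
    if len(known_speaker_names) != len(known_speaker_references):
        raise ValueError("speaker names and references must be matching pairs")
    if len(known_speaker_names) > MAX_KNOWN_SPEAKERS:
        raise ValueError("at most four known speakers are supported")

    # diarized_json keeps the speaker-labeled segments in the result.
    inputs = {"audio_url": audio_url, "response_format": "diarized_json"}
    if language:  # leave unset for auto-detection
        inputs["language"] = language
    if known_speaker_names:
        inputs["known_speaker_names"] = list(known_speaker_names)
        inputs["known_speaker_references"] = list(known_speaker_references)
    return inputs
```

Calling `build_inputs("https://example.com/call.mp3", language="en")` returns a dictionary with only `audio_url`, `response_format`, and `language` set; a name without a matching reference clip raises `ValueError` before anything is submitted.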

Expected output

The AI Tool returns:

  • Full transcript text.
  • Language code, detected or specified (optional).
  • Audio duration (optional).
  • Total word count.
  • Detected speaker count (optional).
  • A downloadable segment-data JSON URL with speaker labels and start/end times.
  • Cost metadata.

The output view supports copying the transcript, toggling speaker view, and downloading segment data.
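Once the segment-data JSON is downloaded, consecutive segments from the same speaker can be merged into readable turns. This is a sketch under assumptions: the page does not document the file's schema, so the `speaker`, `start`, `end`, and `text` keys are assumed, the helper names are hypothetical, and the sample data is invented for illustration.

```python
import json
from typing import Dict, List


def merge_turns(segments: List[Dict]) -> List[Dict]:
    """Merge consecutive segments from the same speaker into one turn."""
    turns: List[Dict] = []
    for seg in segments:
        text = seg["text"].strip()
        if turns and turns[-1]["speaker"] == seg["speaker"]:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["end"] = seg["end"]
            turns[-1]["text"] += " " + text
        else:
            turns.append({
                "speaker": seg["speaker"],
                "start": seg["start"],
                "end": seg["end"],
                "text": text,
            })
    return turns


def format_turn(turn: Dict) -> str:
    """Render one turn as '[start-end] speaker: text' (times in seconds)."""
    return f"[{turn['start']:.1f}-{turn['end']:.1f}] {turn['speaker']}: {turn['text']}"


# Invented sample in the assumed schema; real files may use other keys.
sample = json.loads("""
[
  {"speaker": "A", "start": 0.0, "end": 2.5, "text": "Hello there."},
  {"speaker": "A", "start": 2.5, "end": 4.0, "text": " Thanks for joining."},
  {"speaker": "B", "start": 4.2, "end": 6.0, "text": "Glad to be here."}
]
""")

turns = merge_turns(sample)
```

Joining `format_turn(t)` for each turn with newlines gives a compact speaker-separated transcript that is easier to review against the audio before quoting.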

Caveats

  • Speaker labels and transcript text should be reviewed before quoting, publishing, or using as a source of record.
  • Noisy audio, overlapping speakers, accents, music, or poor source quality can reduce accuracy.
  • This diarized mode does not support prompt guidance for terminology.
  • Known speaker names and reference clips must be provided as matching pairs.
  • Segment data is returned as a downloadable JSON file rather than embedded inline.