Diarization
Identify and separate multiple speakers in audio recordings with speaker diarization.
Overview
Speaker diarization answers the question "who spoke when?" It segments audio by speaker, making it invaluable for:
- Meeting transcripts — Attribute statements to participants
- Interviews — Separate interviewer and interviewee
- Podcasts — Identify hosts and guests
- Call recordings — Distinguish callers
Getting Started
Download an ASR Model
Diarization uses ASR models with speaker detection:
izwi pull qwen3-asr-0.6b
Start the Server
izwi serve
Using the Web UI
- Navigate to Diarization in the sidebar
- Upload an audio file with multiple speakers
- Click Analyze
- View the speaker-segmented transcript
Output
The diarization view shows:
- Speaker labels — Speaker 1, Speaker 2, etc.
- Timestamps — When each speaker talks
- Transcript — What each speaker said
Example output:
[00:00 - 00:05] Speaker 1: Welcome to the meeting.
[00:05 - 00:12] Speaker 2: Thanks for having me.
[00:12 - 00:20] Speaker 1: Let's start with the agenda.
Using the API
Endpoint
POST /v1/audio/diarize
Request (multipart/form-data)
| Field | Type | Description |
|---|---|---|
| file | File | Audio file to analyze |
| model | String | Model name |
| num_speakers | Integer | Expected speakers (optional) |
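For Python clients, the fields in the table above can be assembled into a multipart/form-data request with the standard library alone. This is a minimal sketch, not an official client: the helper name `build_diarize_request` is hypothetical, the base URL assumes the default local server address used on this page, and the `application/octet-stream` part type is an assumption because accepted content types are not documented here.

```python
import uuid
import urllib.request

# Assumed local server address; adjust for your setup.
BASE_URL = "http://localhost:8080"


def build_diarize_request(audio_bytes, filename, model, num_speakers=None):
    """Build (but do not send) a multipart POST for /v1/audio/diarize."""
    boundary = uuid.uuid4().hex

    # Plain text fields; num_speakers is only sent when provided.
    fields = {"model": model}
    if num_speakers is not None:
        fields["num_speakers"] = str(num_speakers)

    parts = []
    for name, value in fields.items():
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n"
            ).encode("utf-8")
        )

    # The audio file part (content type is an assumption).
    parts.append(
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode("utf-8")
        + audio_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))

    return urllib.request.Request(
        f"{BASE_URL}/v1/audio/diarize",
        data=b"".join(parts),
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
```

To send the request, pass the returned object to `urllib.request.urlopen` and parse the response body with `json.load`.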
Example (curl)
curl -X POST http://localhost:8080/v1/audio/diarize \
  -F "file=@meeting.wav" \
  -F "model=qwen3-asr-0.6b"
Response
{
"segments": [
{
"speaker": "Speaker 1",
"start": 0.0,
"end": 5.2,
"text": "Welcome to the meeting."
},
{
"speaker": "Speaker 2",
"start": 5.5,
"end": 12.1,
"text": "Thanks for having me."
}
],
"num_speakers": 2,
"duration": 120.5
}
Configuration
Number of Speakers
If you know how many speakers are in the audio, specify it for better accuracy:
# Via API
curl -X POST http://localhost:8080/v1/audio/diarize \
  -F "file=@meeting.wav" \
  -F "num_speakers=3"
Speaker Labels
By default, speakers are labeled "Speaker 1", "Speaker 2", etc. You can rename them in the UI after processing.
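Renaming is described here only for the UI. When working with an API response instead, a similar effect can be had client-side by mapping the default labels to real names. The sketch below assumes segments shaped like the API response (`speaker`, `start`, `end`, `text`); the function name `rename_speakers` and the example names are hypothetical.

```python
def rename_speakers(segments, names):
    """Return new segment dicts with speaker labels replaced per `names`.

    Labels missing from `names` are kept unchanged, and the input
    segments are not modified.
    """
    return [
        {**seg, "speaker": names.get(seg["speaker"], seg["speaker"])}
        for seg in segments
    ]


if __name__ == "__main__":
    segments = [
        {"speaker": "Speaker 1", "start": 0.0, "end": 5.2,
         "text": "Welcome to the meeting."},
        {"speaker": "Speaker 2", "start": 5.5, "end": 12.1,
         "text": "Thanks for having me."},
    ]
    # Hypothetical participant names for illustration.
    for seg in rename_speakers(segments, {"Speaker 1": "Alice"}):
        print(seg["speaker"], "-", seg["text"])
```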
Tips for Best Results
- Quality audio — Clear recordings with minimal background noise
- Distinct voices — Works best when speakers have different voice characteristics
- Minimal overlap — Speakers talking over each other reduces accuracy
- Specify speaker count — If known, helps the algorithm
- Longer segments — Short utterances are harder to attribute
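Because short utterances are harder to attribute, one possible safeguard is to flag very short segments in an API response for manual review. This is a heuristic sketch, not part of the tool: the function name `flag_short_segments` and the 1.0 second default threshold are assumptions to tune for your recordings.

```python
def flag_short_segments(segments, min_duration=1.0):
    """Return the segments whose duration in seconds is below `min_duration`.

    Segments are expected to carry `start` and `end` times in seconds,
    as in the API response.
    """
    return [
        seg for seg in segments
        if (seg["end"] - seg["start"]) < min_duration
    ]
```

The flagged segments can then be checked by ear before the speaker labels are trusted.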
Limitations
- Similar voices — May confuse speakers with very similar voices
- Overlapping speech — Simultaneous talking is challenging
- Background noise — Reduces speaker detection accuracy
- Very short clips — Need enough audio to identify speaker patterns
Use Cases
Meeting Minutes
Upload a meeting recording to get a transcript with speaker attribution:
- Record your meeting
- Upload to Diarization
- Export the speaker-labeled transcript
- Edit speaker names as needed
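When the meeting is processed through the API rather than the UI, the JSON segments can be turned into a text transcript in the same style as the example output. This is a sketch under the assumption that `start` and `end` are seconds, as the sample response suggests; the function names are hypothetical.

```python
def format_timestamp(seconds):
    """Format a time in seconds as MM:SS, truncating fractions."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def format_transcript(segments):
    """Render segments as '[MM:SS - MM:SS] Speaker: text' lines."""
    lines = []
    for seg in segments:
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"[{start} - {end}] {seg['speaker']}: {seg['text']}")
    return "\n".join(lines)


if __name__ == "__main__":
    segments = [
        {"speaker": "Speaker 1", "start": 0.0, "end": 5.2,
         "text": "Welcome to the meeting."},
        {"speaker": "Speaker 2", "start": 5.5, "end": 12.1,
         "text": "Thanks for having me."},
    ]
    print(format_transcript(segments))
    # [00:00 - 00:05] Speaker 1: Welcome to the meeting.
    # [00:05 - 00:12] Speaker 2: Thanks for having me.
```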
Interview Transcription
Well suited to journalistic and research interviews:
- Record the interview
- Process with diarization
- Get clean Q&A format output
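A Q&A layout can be derived from API segments by labeling one speaker as the interviewer and merging consecutive turns. The sketch below assumes a two-person interview and that the interviewer's label (here `"Speaker 1"`) is known; with more speakers, everyone else is grouped under "A". The function name `to_qa` is hypothetical.

```python
def to_qa(segments, interviewer="Speaker 1"):
    """Render segments as alternating 'Q:' / 'A:' lines.

    Consecutive segments with the same role are merged into one line.
    """
    turns = []  # list of (label, [texts]) pairs
    for seg in segments:
        label = "Q" if seg["speaker"] == interviewer else "A"
        if turns and turns[-1][0] == label:
            turns[-1][1].append(seg["text"])
        else:
            turns.append((label, [seg["text"]]))
    return "\n".join(f"{label}: {' '.join(texts)}" for label, texts in turns)
```

Merging by role rather than by raw segment keeps a long answer on one line even when the diarizer splits it at pauses.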
Podcast Production
Identify speakers for editing and show notes:
- Upload raw podcast audio
- See who spoke when
- Use timestamps for editing
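For show notes or edit planning, the segment timestamps from an API response can be summed into per-speaker talk time. This is a minimal sketch assuming `start` and `end` are in seconds; the function name `talk_time` is hypothetical.

```python
from collections import defaultdict


def talk_time(segments):
    """Return total speaking time in seconds per speaker, rounded to 2 places."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return {speaker: round(total, 2) for speaker, total in totals.items()}


if __name__ == "__main__":
    segments = [
        {"speaker": "Speaker 1", "start": 0.0, "end": 5.2,
         "text": "Welcome to the meeting."},
        {"speaker": "Speaker 2", "start": 5.5, "end": 12.1,
         "text": "Thanks for having me."},
    ]
    for speaker, seconds in talk_time(segments).items():
        print(f"{speaker}: {seconds} s")
```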
See Also
- Transcription — Single-speaker transcription
- Voice Mode — Real-time conversations
- CLI Reference — Command documentation