Izwi - Local Audio Inference Engine

Overview

Speaker diarization answers the question "who spoke when?" It splits a recording into segments and labels each one with the speaker who produced it, which is useful for:

Meeting transcripts — Attribute statements to participants
Interviews — Separate interviewer and interviewee
Podcasts — Identify hosts and guests
Call recordings — Distinguish callers

Getting Started

Download an ASR Model

Diarization runs on an ASR model with speaker detection. Download one before starting the server:

izwi pull qwen3-asr-0.6b

Start the Server

izwi serve

Using the Web UI

Navigate to Diarization in the sidebar
Upload an audio file with multiple speakers
Click Analyze
View the speaker-segmented transcript

Output

The diarization view shows:

Speaker labels — Speaker 1, Speaker 2, etc.
Timestamps — When each speaker talks
Transcript — What each speaker said

Example output:

[00:00 - 00:05] Speaker 1: Welcome to the meeting.
[00:05 - 00:12] Speaker 2: Thanks for having me.
[00:12 - 00:20] Speaker 1: Let's start with the agenda.

Using the API

Endpoint

POST /v1/audio/diarize

Request (multipart/form-data)

Field	Type	Description
`file`	File	Audio file to analyze
`model`	String	Model name
`num_speakers`	Integer	Expected speakers (optional)

Example (curl)

curl -X POST http://localhost:8080/v1/audio/diarize \
  -F "file=@meeting.wav" \
  -F "model=qwen3-asr-0.6b"
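The same request can be sent from a script. The following is a minimal sketch using only the Python standard library; the helper names `build_multipart` and `diarize` are hypothetical, and the host, port, endpoint, and field names are taken from the curl example and the request table above. Error handling and timeouts are left out.

```python
import json
import urllib.request
import uuid
from pathlib import Path


def build_multipart(fields, file_field, filename, file_bytes):
    """Encode text fields plus one file as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode("utf-8")
        )
    # File part: headers first, then the raw bytes, then the closing boundary.
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\n'
        'Content-Type: application/octet-stream\r\n\r\n'.encode("utf-8")
    )
    parts.append(file_bytes)
    parts.append(f'\r\n--{boundary}--\r\n'.encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def diarize(path, model="qwen3-asr-0.6b", num_speakers=None,
            base_url="http://localhost:8080"):
    """POST an audio file to /v1/audio/diarize and return the parsed JSON."""
    fields = {"model": model}
    if num_speakers is not None:
        fields["num_speakers"] = str(num_speakers)  # optional per the table
    audio = Path(path)
    body, content_type = build_multipart(
        fields, "file", audio.name, audio.read_bytes()
    )
    request = urllib.request.Request(
        f"{base_url}/v1/audio/diarize",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With `izwi serve` running, `diarize("meeting.wav")` should return the JSON object described under Response.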

Response

{
  "segments": [
    {
      "speaker": "Speaker 1",
      "start": 0.0,
      "end": 5.2,
      "text": "Welcome to the meeting."
    },
    {
      "speaker": "Speaker 2",
      "start": 5.5,
      "end": 12.1,
      "text": "Thanks for having me."
    }
  ],
  "num_speakers": 2,
  "duration": 120.5
}
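The response maps directly onto the transcript layout shown in the Output section. As an illustration, the sketch below turns a parsed response into `[MM:SS - MM:SS] Speaker: text` lines. The function names are hypothetical, and truncating fractional seconds is an assumption about how the UI rounds timestamps.

```python
def format_timestamp(seconds):
    """Render a time offset in seconds as MM:SS, dropping the fraction."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def format_transcript(response):
    """Build one transcript line per segment of a diarization response."""
    lines = []
    for segment in response["segments"]:
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        lines.append(f"[{start} - {end}] {segment['speaker']}: {segment['text']}")
    return "\n".join(lines)


# Sample data copied from the response example above.
sample = {
    "segments": [
        {"speaker": "Speaker 1", "start": 0.0, "end": 5.2,
         "text": "Welcome to the meeting."},
        {"speaker": "Speaker 2", "start": 5.5, "end": 12.1,
         "text": "Thanks for having me."},
    ],
    "num_speakers": 2,
    "duration": 120.5,
}

print(format_transcript(sample))
# [00:00 - 00:05] Speaker 1: Welcome to the meeting.
# [00:05 - 00:12] Speaker 2: Thanks for having me.
```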

Configuration

Number of Speakers

If you know how many speakers are in the audio, specify it for better accuracy:

# Via API
curl -X POST http://localhost:8080/v1/audio/diarize \
  -F "file=@meeting.wav" \
  -F "num_speakers=3"
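This page does not say whether `num_speakers` is treated as a hard constraint or only as a hint, so it can be worth checking the result against the value you requested. A small sketch, assuming the response shape shown under Response (the function names are hypothetical):

```python
def distinct_speakers(response):
    """Return speaker labels in order of first appearance."""
    seen = []
    for segment in response["segments"]:
        if segment["speaker"] not in seen:
            seen.append(segment["speaker"])
    return seen


def check_speaker_count(response, expected):
    """Compare the labels actually present with the requested count.

    Returns (matches, labels_found).
    """
    found = distinct_speakers(response)
    return len(found) == expected, found


# Illustrative data in the documented response shape.
response = {
    "segments": [
        {"speaker": "Speaker 1", "start": 0.0, "end": 5.2, "text": "Hello."},
        {"speaker": "Speaker 2", "start": 5.5, "end": 12.1, "text": "Hi."},
        {"speaker": "Speaker 1", "start": 12.1, "end": 20.0, "text": "Agenda."},
    ],
    "num_speakers": 2,
}

matches, found = check_speaker_count(response, expected=3)
if not matches:
    print(f"Requested 3 speakers, found {len(found)}: {found}")
```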

Speaker Labels

By default, speakers are labeled "Speaker 1", "Speaker 2", etc. You can rename them in the UI after processing.
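Renaming is described here for the UI only. For API responses, one option is to relabel on the client side after the fact. The sketch below is post-processing under that assumption, not a feature of the endpoint; `rename_speakers` is a hypothetical helper.

```python
def rename_speakers(response, names):
    """Return a copy of the response with speaker labels replaced.

    `names` maps default labels to display names; labels that are not
    in the mapping are left unchanged. The input is not modified.
    """
    renamed = dict(response)
    renamed["segments"] = [
        {**segment, "speaker": names.get(segment["speaker"], segment["speaker"])}
        for segment in response["segments"]
    ]
    return renamed


# Illustrative data in the documented response shape.
response = {
    "segments": [
        {"speaker": "Speaker 1", "start": 0.0, "end": 5.2,
         "text": "Welcome to the meeting."},
        {"speaker": "Speaker 2", "start": 5.5, "end": 12.1,
         "text": "Thanks for having me."},
    ],
    "num_speakers": 2,
}

labeled = rename_speakers(response, {"Speaker 1": "Alice"})
print([segment["speaker"] for segment in labeled["segments"]])
# ['Alice', 'Speaker 2']
```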

Tips for Best Results

Quality audio — Clear recordings with minimal background noise
Distinct voices — Works best when speakers have different voice characteristics
Minimal overlap — Speakers talking over each other reduces accuracy
Specify speaker count — If known, helps the algorithm
Longer segments — Short utterances are harder to attribute

Limitations

Similar voices — May confuse speakers with very similar voices
Overlapping speech — Simultaneous talking is challenging
Background noise — Reduces speaker detection accuracy
Very short clips — Need enough audio to identify speaker patterns
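Some of these cases can be flagged for manual review from the API output itself. The sketch below marks segments shorter than a threshold and pairs of segments whose time ranges overlap. The one-second threshold is an arbitrary assumption, and whether the endpoint ever returns overlapping segments is not stated on this page.

```python
def flag_segments(segments, min_duration=1.0):
    """Return (short, overlapping) for a list of diarization segments.

    short: indices of segments shorter than `min_duration` seconds.
    overlapping: (earlier, later) index pairs where a segment starts
    before the previous one, in start-time order, has ended.
    """
    short = [
        i for i, s in enumerate(segments)
        if s["end"] - s["start"] < min_duration
    ]
    # Only neighbours in start-time order are compared, which is enough
    # to spot the usual case of one speaker cutting in on another.
    ordered = sorted(range(len(segments)), key=lambda i: segments[i]["start"])
    overlapping = [
        (prev, curr)
        for prev, curr in zip(ordered, ordered[1:])
        if segments[curr]["start"] < segments[prev]["end"]
    ]
    return short, overlapping


# Illustrative data: the second segment is short and starts early.
segments = [
    {"speaker": "Speaker 1", "start": 0.0, "end": 5.2, "text": "Welcome."},
    {"speaker": "Speaker 2", "start": 5.0, "end": 5.6, "text": "Yes."},
    {"speaker": "Speaker 1", "start": 6.0, "end": 12.0, "text": "Let's begin."},
]

short, overlapping = flag_segments(segments)
print(short, overlapping)
```

Flagged segments are candidates for a second listen rather than confirmed errors.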

Use Cases

Meeting Minutes

Upload a meeting recording to get a transcript with speaker attribution:

Record your meeting
Upload to Diarization
Export the speaker-labeled transcript
Edit speaker names as needed

Interview Transcription

Well suited to journalistic and research interviews:

Record the interview
Process with diarization
Get clean Q&A format output
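If you work from the API rather than the UI, a Q&A layout can be approximated from the segments. The sketch below merges consecutive segments from the same speaker into one turn and prefixes each turn with `Q` or `A`. It assumes you already know which label belongs to the interviewer; `to_qa` and its `interviewer` parameter are hypothetical.

```python
def to_qa(segments, interviewer="Speaker 1"):
    """Merge consecutive same-speaker segments and label turns Q or A."""
    turns = []  # list of (speaker, [texts]) in speaking order
    for segment in segments:
        if turns and turns[-1][0] == segment["speaker"]:
            turns[-1][1].append(segment["text"])
        else:
            turns.append((segment["speaker"], [segment["text"]]))
    lines = []
    for speaker, texts in turns:
        prefix = "Q" if speaker == interviewer else "A"
        lines.append(f"{prefix}: {' '.join(texts)}")
    return "\n".join(lines)


# Illustrative data in the documented segment shape.
segments = [
    {"speaker": "Speaker 1", "start": 0.0, "end": 2.0,
     "text": "How did you start?"},
    {"speaker": "Speaker 2", "start": 2.3, "end": 4.0,
     "text": "By accident."},
    {"speaker": "Speaker 2", "start": 4.2, "end": 7.5,
     "text": "I was studying physics."},
    {"speaker": "Speaker 1", "start": 7.8, "end": 8.5,
     "text": "And then?"},
]

print(to_qa(segments))
# Q: How did you start?
# A: By accident. I was studying physics.
# Q: And then?
```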

Podcast Production

Identify speakers for editing and show notes:

Upload raw podcast audio
See who spoke when
Use timestamps for editing
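For show notes, the segment timestamps can also be summed into talk time per speaker. A minimal sketch, assuming the segment shape shown under Response (`talk_time` is a hypothetical helper):

```python
from collections import defaultdict


def talk_time(segments):
    """Sum segment durations, in seconds, for each speaker label."""
    totals = defaultdict(float)
    for segment in segments:
        totals[segment["speaker"]] += segment["end"] - segment["start"]
    return dict(totals)


# Illustrative data in the documented segment shape.
segments = [
    {"speaker": "Speaker 1", "start": 0.0, "end": 5.2,
     "text": "Welcome to the show."},
    {"speaker": "Speaker 2", "start": 5.5, "end": 12.1,
     "text": "Thanks for having me."},
    {"speaker": "Speaker 1", "start": 12.1, "end": 20.0,
     "text": "Let's get into it."},
]

for speaker, seconds in talk_time(segments).items():
    print(f"{speaker}: {seconds:.1f} s")
```

Dividing each total by the response's `duration` field gives each speaker's share of the episode, with silence accounting for the remainder.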
