Transcription
Convert audio to text with high accuracy using automatic speech recognition (ASR).
Overview
Izwi's transcription feature converts spoken audio into written text. Capabilities include:
- High accuracy — State-of-the-art speech recognition
- Multiple formats — Support for WAV, MP3, M4A, FLAC, and more
- Language detection — Automatic language identification
- Timestamps — Optional word-level timing
- Local processing — Complete privacy, no cloud
Getting Started
Download an ASR Model
izwi pull qwen3-asr-0.6b
Transcribe Audio
izwi transcribe audio.wav
Using the CLI
Basic Usage
izwi transcribe <audio-file>
Options
| Option | Description | Default |
|---|---|---|
| `--model`, `-m` | ASR model to use | `qwen3-asr-0.6b` |
| `--language`, `-l` | Language hint | auto-detect |
| `--format`, `-f` | Output format (`text`, `json`, or `verbose_json`) | `text` |
| `--output`, `-o` | Output file | stdout |
| `--word-timestamps` | Include word timing | off |
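When the CLI is driven from a script, the options above map naturally onto an argument list. The following is a minimal sketch, not part of Izwi: the function name `build_transcribe_command` and its parameter names are hypothetical, and only the flags listed in the table are used.

```python
from typing import List, Optional


def build_transcribe_command(
    audio_path: str,
    model: Optional[str] = None,
    language: Optional[str] = None,
    output_format: Optional[str] = None,
    output_path: Optional[str] = None,
    word_timestamps: bool = False,
) -> List[str]:
    """Assemble an `izwi transcribe` argument list (hypothetical helper)."""
    command = ["izwi", "transcribe", audio_path]
    # A flag is added only when the caller overrides the CLI default.
    if model is not None:
        command += ["--model", model]
    if language is not None:
        command += ["--language", language]
    if output_format is not None:
        command += ["--format", output_format]
    if output_path is not None:
        command += ["--output", output_path]
    if word_timestamps:
        command.append("--word-timestamps")
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)`. Building a list rather than a single string avoids shell quoting problems with file names that contain spaces.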
Examples
Basic transcription:
izwi transcribe meeting.wav
Save to file:
izwi transcribe meeting.wav --output transcript.txt
JSON output with metadata:
izwi transcribe meeting.wav --format json --output transcript.json
With word timestamps:
izwi transcribe meeting.wav --format verbose_json --word-timestamps
Specify language:
izwi transcribe audio.wav --language en
izwi transcribe audio.wav --language es
Using the Web UI
1. Navigate to Transcription in the sidebar
2. Upload an audio file or record directly
3. Select the ASR model
4. Click Transcribe
5. View, copy, or download the transcript
Features
- Drag and drop — Upload files easily
- Record — Transcribe directly from microphone
- Copy — One-click copy to clipboard
- Download — Save as text or JSON
Using the API
Endpoint
POST /v1/audio/transcriptions
Request (multipart/form-data)
| Field | Type | Description |
|---|---|---|
| `file` | File | Audio file to transcribe |
| `model` | String | Model name |
| `language` | String | Language code (optional) |
| `response_format` | String | `text`, `json`, or `verbose_json` |
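The fields above can also be sent from Python using only the standard library. This is a rough sketch under stated assumptions: the helper names `encode_multipart` and `build_transcription_request` are hypothetical, the base URL is whatever address your Izwi server listens on, and `json` as the default `response_format` is a choice made for this sketch rather than a documented server default.

```python
import io
import os
import urllib.request
import uuid
from typing import Dict, Optional, Tuple


def encode_multipart(
    fields: Dict[str, str], filename: str, file_bytes: bytes
) -> Tuple[bytes, str]:
    """Encode text fields plus one `file` part as multipart/form-data."""
    boundary = uuid.uuid4().hex
    body = io.BytesIO()
    for name, value in fields.items():
        body.write(f"--{boundary}\r\n".encode())
        body.write(f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode())
        body.write(f"{value}\r\n".encode())
    # The audio goes in the part named `file`, as in the request table.
    body.write(f"--{boundary}\r\n".encode())
    body.write(
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode()
    )
    body.write(b"Content-Type: application/octet-stream\r\n\r\n")
    body.write(file_bytes)
    body.write(f"\r\n--{boundary}--\r\n".encode())
    return body.getvalue(), f"multipart/form-data; boundary={boundary}"


def build_transcription_request(
    base_url: str,
    audio_path: str,
    model: str,
    language: Optional[str] = None,
    response_format: str = "json",  # sketch default, not a documented one
) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/audio/transcriptions."""
    fields = {"model": model, "response_format": response_format}
    if language is not None:
        fields["language"] = language  # optional field, omitted when unknown
    with open(audio_path, "rb") as handle:
        payload, content_type = encode_multipart(
            fields, os.path.basename(audio_path), handle.read()
        )
    return urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions",
        data=payload,
        headers={"Content-Type": content_type},
        method="POST",
    )
```

To send the request, pass the result to `urllib.request.urlopen(...)` and decode the body with `json.load` when `response_format` is `json` or `verbose_json`. Keeping request construction separate from sending makes the encoding easy to check without a running server.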
Example (curl)
curl -X POST http://localhost:8080/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=qwen3-asr-0.6b" \
  -F "response_format=json"
Response (JSON)
{
  "text": "Hello, this is a transcription test.",
  "language": "en",
  "duration": 3.5
}
Response (verbose_json)
{
  "text": "Hello, this is a transcription test.",
  "language": "en",
  "duration": 3.5,
  "words": [
    {"word": "Hello", "start": 0.0, "end": 0.5},
    {"word": "this", "start": 0.6, "end": 0.8},
    ...
  ]
}
Supported Audio Formats
| Format | Extension | Notes |
|---|---|---|
| WAV | .wav | Best quality, recommended |
| MP3 | .mp3 | Widely compatible |
| M4A | .m4a | Apple format |
| FLAC | .flac | Lossless |
| OGG | .ogg | Open format |
| WebM | .webm | Web recordings |
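A script that processes many files may want to skip anything not listed in the table before calling the transcriber. The sketch below is illustrative only: `LISTED_EXTENSIONS` and `is_listed_format` are hypothetical names, and because the overview mentions "and more", the set is a conservative allowlist rather than a complete list of what Izwi accepts.

```python
from pathlib import Path

# Extensions copied from the table above. Izwi may accept other formats,
# so a False result means "not listed here", not "definitely unsupported".
LISTED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac", ".ogg", ".webm"}


def is_listed_format(audio_path: str) -> bool:
    """Return True when the file extension appears in the formats table."""
    return Path(audio_path).suffix.lower() in LISTED_EXTENSIONS
```

The check looks only at the file name, so it does not verify that the file contents actually match the extension.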
Available Models
| Model | Size | Accuracy | Speed |
|---|---|---|---|
| `qwen3-asr-0.6b` | 1.2 GB | Good | Fast |
| `qwen3-asr-1.7b` | 3.4 GB | Better | Medium |
Use larger models for:
- Noisy audio
- Accented speech
- Technical vocabulary
Output Formats
Text
Plain text transcript:
Hello, this is a transcription test.
JSON
{
  "text": "Hello, this is a transcription test."
}
Verbose JSON
Includes word-level timestamps and metadata:
{
  "text": "Hello, this is a transcription test.",
  "language": "en",
  "duration": 3.5,
  "words": [
    {"word": "Hello", "start": 0.0, "end": 0.5},
    {"word": "this", "start": 0.6, "end": 0.8}
  ]
}
Tips for Best Results
- Use quality audio — Clear recordings transcribe better
- Minimize noise — Background noise reduces accuracy
- Proper format — WAV files work best
- Right model size — Larger models for difficult audio
- Language hints — Specify language if known
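Since WAV is the recommended input, it can help to check a recording's basic properties before transcribing it. The sketch below uses Python's standard `wave` module, which reads uncompressed PCM WAV files only. `WavInfo` and `inspect_wav` are hypothetical names, and this page does not state any sample rate or channel count that Izwi requires, so the code only reports values and does not judge them.

```python
import wave
from dataclasses import dataclass


@dataclass
class WavInfo:
    channels: int
    sample_rate_hz: int
    sample_width_bytes: int
    duration_seconds: float


def inspect_wav(path: str) -> WavInfo:
    """Read header properties from a PCM WAV file without decoding audio."""
    with wave.open(path, "rb") as wav_file:
        frames = wav_file.getnframes()
        rate = wav_file.getframerate()
        return WavInfo(
            channels=wav_file.getnchannels(),
            sample_rate_hz=rate,
            sample_width_bytes=wav_file.getsampwidth(),
            # Duration is the frame count divided by frames per second.
            duration_seconds=frames / rate if rate else 0.0,
        )
```

A very short duration or an unexpected channel count often points to a recording problem that is cheaper to catch here than after a full transcription run.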
See Also
- Diarization — Identify multiple speakers
- Voice Mode — Real-time transcription
- CLI Reference — Full command documentation