Voice Cloning

Overview

Voice cloning creates a custom voice from a reference audio sample. Use it to:

Personalize TTS — Generate speech in a specific voice
Create characters — Unique voices for games or media
Accessibility — Preserve a person's voice
Localization — Maintain voice consistency across languages

Getting Started

Download a Voice Cloning Model

izwi pull qwen3-tts-0.6b-customvoice

Clone a Voice

Prepare a reference audio file (5-30 seconds of clear speech)
Pass that file to a `customvoice` model when you generate speech, using the Web UI, the CLI, or the API

Using the Web UI

Step 1: Upload Reference Audio

Navigate to Voice Cloning in the sidebar
Upload a reference audio file
The audio should be:
- 5-30 seconds long
- Clear speech, minimal background noise
- Single speaker

Step 2: Generate Speech

Enter the text you want to speak
Click Generate
Listen to the output in the cloned voice

Step 3: Save and Reuse

Download generated audio
Save the voice profile for future use

Using the CLI

Generate with Reference Audio

izwi tts "Hello, this is my cloned voice" \
  --model qwen3-tts-0.6b-customvoice \
  --speaker /path/to/reference.wav \
  --output cloned.wav

Options

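| Option | Description |
|--------|-------------|
| `--speaker`, `-s` | Path to the reference audio file |
| `--model`, `-m` | Model to use; must be a `customvoice` model |
| `--output` | Path for the generated audio file |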
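To voice several lines with the same reference clip, the command above can be wrapped in a small script. The following is a minimal sketch, not part of izwi itself: the helper names `build_tts_command` and `synthesize_lines` and the `line_001.wav` naming scheme are hypothetical, and the only flags used are the ones shown in the example above.

```python
import subprocess


def build_tts_command(text, speaker, output, model="qwen3-tts-0.6b-customvoice"):
    """Build the argument list for one `izwi tts` call with a reference voice."""
    return [
        "izwi", "tts", text,
        "--model", model,
        "--speaker", speaker,
        "--output", output,
    ]


def synthesize_lines(lines, speaker, prefix="line"):
    """Run one `izwi tts` call per line and return the output file names.

    Assumes `izwi` is on PATH and the model has already been pulled.
    """
    outputs = []
    for index, text in enumerate(lines, start=1):
        output = f"{prefix}_{index:03d}.wav"  # hypothetical naming scheme
        subprocess.run(build_tts_command(text, speaker, output), check=True)
        outputs.append(output)
    return outputs
```

Passing the arguments as a list avoids shell quoting problems when the text contains quotes or spaces.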

Using the API

Endpoint

POST /v1/audio/speech

Request (multipart/form-data)

| Field | Type | Description |
|-------|------|-------------|
| `model` | String | Model name, for example `qwen3-tts-0.6b-customvoice` |
| `input` | String | Text to synthesize |
| `reference_audio` | File | Reference voice sample |

Example (curl)

curl -X POST http://localhost:8080/v1/audio/speech \
  -F "model=qwen3-tts-0.6b-customvoice" \
  -F "input=Hello, this is my cloned voice" \
  -F "reference_audio=@reference.wav" \
  --output cloned.wav
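The same request can be sent from Python using only the standard library. This is a sketch under stated assumptions: the helper names `build_multipart` and `clone_speech` are hypothetical, the response body is assumed to be the raw audio (as the `--output` flag in the curl example suggests), and the file part is sent as `application/octet-stream` on the assumption that the server does not require a specific content type.

```python
import io
import os
import urllib.request
import uuid

API_URL = "http://localhost:8080/v1/audio/speech"


def build_multipart(fields, file_field, filename, file_bytes):
    """Encode text fields and one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    for name, value in fields.items():
        buf.write(f"--{boundary}\r\n".encode())
        buf.write(f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode())
        buf.write(value.encode("utf-8"))
        buf.write(b"\r\n")
    buf.write(f"--{boundary}\r\n".encode())
    buf.write(
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\n'.encode()
    )
    buf.write(b"Content-Type: application/octet-stream\r\n\r\n")
    buf.write(file_bytes)
    buf.write(b"\r\n")
    buf.write(f"--{boundary}--\r\n".encode())
    return buf.getvalue(), f"multipart/form-data; boundary={boundary}"


def clone_speech(text, reference_path, output_path,
                 model="qwen3-tts-0.6b-customvoice", url=API_URL):
    """POST text plus a reference clip and save the returned audio."""
    with open(reference_path, "rb") as f:
        audio = f.read()
    body, content_type = build_multipart(
        {"model": model, "input": text},
        "reference_audio",
        os.path.basename(reference_path),
        audio,
    )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    # Assumption: a successful response body is the generated audio itself.
    with urllib.request.urlopen(request) as response, open(output_path, "wb") as out:
        out.write(response.read())
```

With a server running locally, `clone_speech("Hello, this is my cloned voice", "reference.wav", "cloned.wav")` would mirror the curl call.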

Reference Audio Guidelines

Ideal Reference Audio

| Aspect | Recommendation |
|--------|----------------|
| Duration | 5-30 seconds |
| Quality | High quality, clear audio |
| Content | Natural speech, varied intonation |
| Background | Minimal noise |
| Speaker | Single speaker only |
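The duration recommendation is easy to check before uploading a clip. The sketch below uses Python's standard `wave` module, so it only handles uncompressed PCM WAV files. The function names are hypothetical, and the 5 and 30 second bounds come from the table above. It cannot judge noise, clarity, or the number of speakers.

```python
import wave

MIN_SECONDS = 5.0
MAX_SECONDS = 30.0


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_valid_reference_length(path):
    """True if the clip falls inside the recommended 5-30 second range."""
    return MIN_SECONDS <= wav_duration_seconds(path) <= MAX_SECONDS
```

A clip that fails this check can be trimmed or re-recorded before it is used as a reference.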

Good Examples

Podcast clips
Interview segments
Voice memos
Audiobook excerpts

Poor Examples

Music with vocals
Multiple speakers
Heavy background noise
Very short clips (<3 seconds)
Whispered or distorted speech

Tips for Best Results

Quality over quantity — A clear 10-second clip beats a noisy 30-second one
Natural speech — Avoid monotone or exaggerated delivery
Match content — Reference emotion should match desired output
Consistent volume — Avoid clips with volume changes
No music — Background music interferes with cloning

Available Models

| Model | Size | Quality |
|-------|------|---------|
| `qwen3-tts-0.6b-customvoice` | 1.2 GB | Good |
| `qwen3-tts-1.7b-customvoice` | 3.4 GB | Better |

The 1.7B model produces more accurate voice clones than the 0.6B model, at the cost of size (3.4 GB versus 1.2 GB).

Ethical Considerations

Voice cloning is a powerful technology. Please use it responsibly:

Get consent — Only clone voices with permission
Don't impersonate — Never use cloned voices to deceive
Respect privacy — Don't clone voices without authorization
Legal compliance — Follow applicable laws and regulations

Limitations

Accent accuracy — May not perfectly capture all accents
Emotional range — Cloned voices may have limited expressiveness
Unique characteristics — Some voice qualities are hard to replicate
Language — Best results in the model's primary language
