Give your OpenClaw agents a truly local voice
OpenClaw has taken the AI world by storm, becoming one of the most popular open-source personal AI assistant frameworks. It runs on your machine, connects to the messaging platforms you already use—WhatsApp, Telegram, Slack, Discord, Signal, iMessage, and more—and can browse the web, control your system, and extend itself with skills and plugins. OpenClaw even has a built-in Talk Mode for voice conversations. But that voice support relies on ElevenLabs, a cloud service. That's where Izwi comes in.
The Local Voice Gap in AI Agents
If you've been using OpenClaw, you know how capable it is. It already supports voice through Talk Mode, but that pipeline sends your audio to ElevenLabs' servers for processing.
Voice isn't just a nice-to-have feature. It's the most natural way humans communicate. When your AI agent can speak and listen, the interaction becomes fluid, intuitive, and remarkably human. But there's a catch: cloud-based voice solutions require sending your audio to remote servers, introducing privacy concerns and latency issues.
Why Local Voice Matters for Agents
When you're building an AI agent that lives on your machine—like OpenClaw—you want it to be truly yours. That means:
- Privacy: Your voice data never leaves your device
- Speed: Sub-50ms latency for real-time conversations
- Reliability: Works offline, no API rate limits
- Control: You own the models and the data
This is exactly why we built Izwi—a local-first audio inference engine that runs entirely on your hardware.
Integrating Izwi with OpenClaw
The integration is surprisingly simple. Since Izwi provides an OpenAI-compatible API, you can drop it right into your existing OpenClaw setup. Here's how.
Step 1: Install and Start Izwi
First, download and install Izwi from GitHub Releases (.dmg for macOS, .deb for Linux, or the Windows installer). Then start the server:
```shell
izwi serve
```
This starts a local HTTP server on localhost:8080 with OpenAI-compatible endpoints.
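Before wiring anything into OpenClaw, it's worth confirming the server is reachable. A minimal sketch in Python (this assumes Izwi exposes the standard OpenAI-compatible `/v1/models` listing endpoint; the host and port match the defaults above):

```python
import json
import urllib.request


def izwi_base_url(host="localhost", port=8080):
    """Build the base URL for Izwi's OpenAI-compatible API."""
    return f"http://{host}:{port}/v1"


def list_models(base_url=None, timeout=5):
    """Return the model ids the server reports, or None if it is unreachable."""
    url = (base_url or izwi_base_url()) + "/models"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            payload = json.load(resp)
        return [m["id"] for m in payload.get("data", [])]
    except OSError:
        # Covers connection refused, timeouts, and HTTP errors.
        return None
```

If `list_models()` returns a list, the server is up, and the models you pull in the next step should appear in it.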
Step 2: Pull the Required Models
Izwi needs models for speech recognition and synthesis. Pull them with:
```shell
izwi pull Qwen3-TTS-12Hz-0.6B-Base
izwi pull Qwen3-ASR-0.6B
```
These compact models run efficiently on most hardware while delivering impressive quality.
Step 3: Configure OpenClaw to Use Local Voice
In your OpenClaw configuration, you can now point audio processing to Izwi. Create a custom skill that leverages Izwi's endpoints:
```python
from openai import OpenAI

# Point the standard OpenAI client at the local Izwi server.
izwi_client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",  # Izwi runs locally and does not check the key
)


def transcribe_audio(audio_path):
    """Transcribe a local audio file with the Qwen3 ASR model."""
    with open(audio_path, "rb") as audio_file:
        transcript = izwi_client.audio.transcriptions.create(
            model="qwen3-asr-0.6b",  # should match the ASR model you pulled
            file=audio_file,
        )
    return transcript.text


def synthesize_speech(text, output_path):
    """Synthesize speech for `text` and write the audio to output_path."""
    response = izwi_client.audio.speech.create(
        model="qwen3-tts-0.6b-base",  # should match the TTS model you pulled
        input=text,
        voice="alloy",
    )
    response.stream_to_file(output_path)
```
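One practical wrinkle: long documents are easier on a local TTS model if you split them into sentence-sized chunks and synthesize each chunk in turn. A small helper sketch (the 400-character budget is an illustrative assumption, not a documented Izwi limit):

```python
import re


def chunk_text(text, max_chars=400):
    """Split text into chunks of at most max_chars, breaking at sentence ends.

    A single sentence longer than the budget is kept whole rather than cut
    mid-sentence.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to the `synthesize_speech` helper above and the resulting audio files played or concatenated in order.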
Step 4: Create Voice Skills for Your Agent
Now you can extend OpenClaw with fully local voice capabilities. Here are some ideas:
- Voice Notes: Record voice messages and have your agent transcribe, summarize, and file them appropriately
- Spoken Notifications: Have your agent announce important events through your speakers using TTS
- Voice Commands: Create a skill that listens for wake words and processes voice commands locally
- Meeting Summaries: Transcribe meeting recordings and generate audio summaries
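The Voice Commands idea above boils down to: transcribe, check for a wake word, and route the rest of the utterance to a skill. A minimal routing sketch (the wake word and skill names are hypothetical placeholders, not OpenClaw built-ins):

```python
WAKE_WORD = "claw"  # hypothetical wake word


def route_command(transcript):
    """Map a transcribed utterance to (skill, argument), or None to ignore."""
    words = transcript.lower().strip().split()
    if not words or words[0] != WAKE_WORD:
        return None  # wake word not detected; ignore the utterance
    command = words[1:]
    if command[:2] == ["take", "note"]:
        return ("voice_notes", " ".join(command[2:]))
    if command[:1] == ["announce"]:
        return ("spoken_notifications", " ".join(command[1:]))
    # Anything else goes to the agent as a free-form request.
    return ("fallback", " ".join(command))
```

In a real skill, the transcript would come from `transcribe_audio` in Step 3, and the returned skill name would dispatch to your agent's handlers.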
Advanced Features
Voice Cloning
Want your agent to have a specific voice? Izwi's voice cloning feature lets you create custom voices from just a few seconds of audio. Perfect for giving your agent a consistent, recognizable personality. You can manage voices through the Izwi desktop app or the API—see the Voice Cloning docs for details.
Speaker Diarization
Processing meeting recordings? Izwi can automatically identify and separate different speakers, making it easy to generate structured transcripts with speaker labels. This is available through both the desktop app and the API.
Voice Design
Create entirely new voices from text descriptions. Describe the voice you want—warm and professional, energetic and friendly—and Izwi will generate it.
Performance Considerations
Running voice AI locally requires some hardware resources, but modern machines handle it well:
| Hardware | Model Size | Latency | Quality |
|---|---|---|---|
| Apple Silicon (M1/M2/M3) | 0.6B | <50ms | Good |
| Apple Silicon (M1/M2/M3) | 1.7B | <100ms | Excellent |
| NVIDIA GPU (8GB+) | 0.6B | <30ms | Good |
| NVIDIA GPU (8GB+) | 1.7B | <60ms | Excellent |
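If you want to pick a model size programmatically, the table above reduces to a small decision rule. An illustrative helper encoding it (the thresholds and latency figures are this post's numbers, not a published Izwi spec):

```python
def recommend_model(prefer_quality=True, gpu_vram_gb=0, apple_silicon=False):
    """Pick a model size and rough latency target from the hardware table."""
    capable = apple_silicon or gpu_vram_gb >= 8
    if not capable:
        return None  # below the hardware the table covers
    if prefer_quality:
        size = "1.7B"
        latency_ms = 60 if gpu_vram_gb >= 8 else 100
    else:
        size = "0.6B"
        latency_ms = 30 if gpu_vram_gb >= 8 else 50
    return {"size": size, "approx_latency_ms": latency_ms}
```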
The Privacy Advantage
Here's what makes local voice processing special: your voice never leaves your machine. Unlike cloud-based solutions that transmit every audio snippet to remote servers for processing, Izwi handles everything locally.
This matters because your voice contains:
- Biometric data that can identify you
- Context about your life, work, and relationships
- Potentially sensitive information you speak without thinking
With Izwi + OpenClaw, you get the best of both worlds: powerful AI agents with voice capabilities, and the peace of mind that comes with true local processing.
Getting Started
Ready to give your OpenClaw agents a truly local voice? Here's what you need:
- Download Izwi from GitHub Releases or the downloads page
- Check our documentation at izwiai.com/docs
- Join the community on GitHub
The combination of OpenClaw's agent capabilities and Izwi's local voice processing opens up entirely new possibilities for private, powerful AI assistants. Your agents can finally speak and listen—all without sending a single byte of audio to the cloud.
Try It Today
Download Izwi for free and start building voice-enabled agents. Join thousands of developers who are building privacy-first AI applications.