Sogni Voice
Sogni Voice is a local speech engine for Apple Silicon Macs. It wraps three open-source TTS models (Kokoro, Pocket, Qwen3-TTS) and parakeet-mlx transcription behind a single REST API — no third-party service calls, no audio leaves the machine. Drop it next to a chat bot, an agent, or an avatar pipeline for low-latency voice in and voice out.
Source on GitHub →

#What it does
- Text-to-speech, three models. Kokoro (32 voices, 4 languages, MLX-accelerated), Pocket TTS (8 English voices, CPU-only, voice cloning from a 5s sample), and Qwen3-TTS (11 languages, emotion control, voice design from text).
- Speech-to-text via parakeet-mlx. Upload audio, get a transcript — optionally with sentence- or word-level timestamps suitable for subtitles.
- Voice cloning. Both Pocket and Qwen3 support cloning from a short reference clip; clones export and import as safetensors ZIPs, never pickle.
- Local-first by default. The server binds to 127.0.0.1 and CORS is restricted to loopback origins until you opt in explicitly. Optional AUTH_API_KEY for networked deployments.
- MLX and MPS acceleration. Built for M1/M2/M3/M4 Macs. Persistent Python daemons keep models warm between requests, so subsequent calls are 2–5x faster than cold starts.
#Install
brew install ffmpeg uv
npm install
./setup.sh
npm run dev
./setup.sh is interactive and can pre-download every enabled model so the first request isn't a 4 GB wait. For production, npm run pm2:start runs the service under PM2.
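After starting the service, a client script may want to wait until it responds before sending work. The following is a minimal sketch, assuming the server listens on http://localhost:3000 (the address used in the curl examples on this page) and that GET /health answers with HTTP 200 once it is up. The `probe` and `wait_for_health` helpers and their parameters are hypothetical and not part of Sogni Voice.

```python
import time
import urllib.error
import urllib.request


def probe(url: str) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_for_health(base_url="http://localhost:3000", attempts=30, delay=1.0, check=probe):
    """Poll GET /health until it succeeds or `attempts` run out."""
    url = base_url.rstrip("/") + "/health"
    for _ in range(attempts):
        if check(url):
            return True
        time.sleep(delay)
    return False
```

Calling `wait_for_health()` before the first request avoids racing the server during startup; the `check` parameter exists only so the polling logic can be exercised without a running server.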
#Endpoints at a glance
| Endpoint | What it does |
|---|---|
| GET /health | Liveness probe |
| POST /transcribe | Audio file in, transcript out; optional sentence/word timestamps |
| POST /tts | Kokoro TTS: text, voice, speed, format (wav/opus/buffer) |
| GET /tts/voices | List Kokoro voices |
| POST /qwen-tts | Qwen3-TTS standard generation |
| POST /qwen-tts/custom-voice | Qwen3 emotion/style control with instruct field |
| POST /qwen-tts/voice-design | Generate a voice from a text description |
| POST /qwen-tts/voices/clone | Create a voice clone from a reference clip |
| POST /pocket-tts | Pocket TTS: CPU-only, low-latency English |
Quick TTS round trip:
curl -X POST http://localhost:3000/tts \
-H "Content-Type: application/json" \
-d '{"text": "Hello world"}' \
--output output.wav
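The same round trip can be sketched from Python using only the standard library. This assumes the JSON keys match the parameter names in the endpoint table (text, voice, speed, format) and that the response body is the raw audio, as the curl command's --output flag suggests. The `build_tts_request` and `synthesize` helpers are hypothetical, and no voice name is filled in because valid names come from GET /tts/voices.

```python
import json
import urllib.request

ALLOWED_FORMATS = {"wav", "opus", "buffer"}  # formats named in the endpoint table


def build_tts_request(text, voice=None, speed=None, fmt=None,
                      base_url="http://localhost:3000"):
    """Build (but do not send) a POST /tts request."""
    payload = {"text": text}
    if voice is not None:
        payload["voice"] = voice  # assumed key name
    if speed is not None:
        payload["speed"] = speed  # assumed key name
    if fmt is not None:
        if fmt not in ALLOWED_FORMATS:
            raise ValueError(f"unsupported format: {fmt}")
        payload["format"] = fmt  # assumed key name
    return urllib.request.Request(
        base_url.rstrip("/") + "/tts",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path="output.wav", **options):
    """Send the request and write the response body to `out_path`."""
    with urllib.request.urlopen(build_tts_request(text, **options)) as resp:
        with open(out_path, "wb") as f:
            f.write(resp.read())
    return out_path
```

With the server running, `synthesize("Hello world")` would mirror the curl command above. Building the request separately from sending it keeps the payload easy to inspect.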
Transcription with sentence timestamps:
curl -X POST http://localhost:3000/transcribe \
-F "[email protected]" \
-F "timestamps=true"
#Workflows
Voice for an agent. Pair Sogni Voice with Creative Agent Skill or any Claude Code / Hermes setup so the agent can speak responses and transcribe spoken prompts without a third-party API.
Voice cloning for avatars. Record 5–10 seconds of clean speech, POST it to /pocket-tts/voices/clone or /qwen-tts/voices/clone, then call /generate with that clone ID for every subsequent line.
Subtitle generation. Run any video's audio through /transcribe with wordTimestamps=true; pipe the JSON into your subtitle generator.
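As one illustration of the "subtitle generator" step, the sketch below renders timed segments as SRT cues. The response schema of /transcribe is not documented on this page, so the input shape here (a list of dicts with `start` and `end` in seconds plus `text`) is purely an assumption; adapt the field names to whatever the JSON actually contains.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Render segments (assumed keys: start, end, text) as numbered SRT cues."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(cues)
```

Word-level timestamps would typically be grouped into short phrases before being passed to `to_srt`, since one cue per word is hard to read.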
Style control with Qwen3. Use /qwen-tts/custom-voice with an instruct such as "Speak with excitement and joy" or "Use a calm, soothing voice" to render the same text in different emotional registers.
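Rendering one line in several registers amounts to sending the same text with different instruct values. The sketch below only builds the requests. The instruct field is named in the endpoint table, while the `text` key, the JSON body format, and the absence of other required fields are assumptions; `build_styled_requests` is a hypothetical helper.

```python
import json
import urllib.request


def build_styled_requests(text, instructs, base_url="http://localhost:3000"):
    """Return one POST /qwen-tts/custom-voice request per instruct string."""
    url = base_url.rstrip("/") + "/qwen-tts/custom-voice"
    requests = []
    for instruct in instructs:
        # "text" is an assumed key; "instruct" is the field named in the table.
        body = json.dumps({"text": text, "instruct": instruct}).encode("utf-8")
        requests.append(
            urllib.request.Request(
                url,
                data=body,
                headers={"Content-Type": "application/json"},
                method="POST",
            )
        )
    return requests
```

Each request can then be passed to `urllib.request.urlopen` and the response saved under a name that records which instruct produced it.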
#Tips for getting started
- Apple Silicon only. Intel Macs and other platforms won't work — MLX requires the unified-memory architecture.
- Enable only what you'll use. Each TTS model downloads gigabytes on first request; flip the *_ENABLED env vars off for models you're not running.
- Keep daemons pre-warmed. Leave PREWARM_*=1 unless memory is tight; cold starts are 5–10x slower than warm requests.
- Lock down before exposing. Setting HOST=0.0.0.0 without also setting AUTH_ENABLED=1 and AUTH_API_KEY is a footgun; the defaults exist for a reason.
- Clones are safetensors, not pickle. Pickle-format clones are intentionally rejected on import.
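The "lock down before exposing" tip can be turned into a small preflight check in a deploy script. This is a sketch under stated assumptions: it uses only the variable names mentioned above (HOST, AUTH_ENABLED, AUTH_API_KEY), treats a missing HOST as the 127.0.0.1 default, and assumes "1" is the enabling value. How the server itself parses these variables is not documented here, and `exposure_problems` is a hypothetical helper.

```python
LOOPBACK_HOSTS = {"127.0.0.1", "localhost", "::1"}


def exposure_problems(env):
    """Return a list of problems for a mapping of environment variables."""
    host = env.get("HOST", "127.0.0.1")  # default bind address stated on this page
    if host in LOOPBACK_HOSTS:
        return []  # loopback only, nothing exposed to the network
    problems = []
    if env.get("AUTH_ENABLED") != "1":
        problems.append("HOST is not loopback but AUTH_ENABLED is not 1")
    if not env.get("AUTH_API_KEY"):
        problems.append("HOST is not loopback but AUTH_API_KEY is empty")
    return problems
```

Passing `os.environ` to `exposure_problems` and refusing to start when the list is non-empty catches the misconfiguration before the port is reachable.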
#See also
- Creative Agent Skill — agent skill that pairs naturally with local TTS/STT.
- Sogni Chat — the conversational creative studio Voice slots underneath.
- Sogni Client SDK — image/video/music side of the same stack.
- Source on GitHub