Sogni Voice

Sogni Voice is a local speech engine for Apple Silicon Macs. It wraps three open-source TTS models (Kokoro, Pocket, Qwen3-TTS) and parakeet-mlx transcription behind a single REST API — no third-party service calls, no audio leaves the machine. Drop it next to a chat bot, an agent, or an avatar pipeline for low-latency voice in and voice out.

#What it does

Text-to-speech, three models. Kokoro (32 voices, 4 languages, MLX-accelerated), Pocket TTS (8 English voices, CPU-only, voice cloning from a 5s sample), and Qwen3-TTS (11 languages, emotion control, voice design from text).
Speech-to-text via parakeet-mlx. Upload audio, get a transcript — optionally with sentence- or word-level timestamps suitable for subtitles.
Voice cloning. Both Pocket and Qwen3 support cloning from a short reference clip; clones export and import as safetensors ZIPs, never pickle.
Local-first by default. The server binds to 127.0.0.1 and CORS is restricted to loopback origins until you opt in explicitly. Optional AUTH_API_KEY for networked deployments.
MLX and MPS acceleration. Built for M1/M2/M3/M4 Macs. Persistent Python daemons keep models warm between requests — subsequent calls are 2–5x faster than cold starts.

#Install

brew install ffmpeg uv
npm install
./setup.sh
npm run dev

./setup.sh is interactive and can pre-download every enabled model so the first request isn't a 4 GB wait. For production, npm run pm2:start runs the service under PM2.
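
After starting the server, a script may want to wait until it is reachable before sending real requests. The sketch below polls `GET /health` and assumes that the service listens on `http://localhost:3000` (the address used in the curl examples) and that a healthy server answers with HTTP 200. The function names and the `probe` parameter are illustrative and not part of Sogni Voice.

```python
import time
import urllib.error
import urllib.request


def _http_probe(url):
    """Return True if the URL answers with HTTP 200, False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(base_url="http://localhost:3000", timeout_s=60.0,
                       interval_s=1.0, probe=None):
    """Poll GET /health until it succeeds or timeout_s elapses.

    `probe` is a hypothetical hook that lets callers swap in their own check.
    """
    probe = probe or _http_probe
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(base_url + "/health"):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_until_healthy()` right after `npm run dev` or `npm run pm2:start` keeps a deployment script from racing the server's startup.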

#Endpoints at a glance

Endpoint	What it does
`GET /health`	Liveness probe
`POST /transcribe`	Audio file in, transcript out; optional sentence/word timestamps
`POST /tts`	Kokoro TTS — `text`, `voice`, `speed`, `format` (wav/opus/buffer)
`GET /tts/voices`	List Kokoro voices
`POST /qwen-tts`	Qwen3-TTS standard generation
`POST /qwen-tts/custom-voice`	Qwen3 emotion/style control with `instruct` field
`POST /qwen-tts/voice-design`	Generate a voice from a text description
`POST /qwen-tts/voices/clone`	Create a voice clone from a reference clip
`POST /pocket-tts`	Pocket TTS — CPU-only, low-latency English

Quick TTS round trip:

curl -X POST http://localhost:3000/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world"}' \
  --output output.wav
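
The same round trip can be sketched from Python using only the standard library. The field names `text`, `voice`, `speed`, and `format` come from the endpoint table; the helper names, the assumption that `speed` is numeric, and the assumption that the response body is the raw audio are illustrative and should be checked against the server.

```python
import json
import urllib.request


def build_tts_request(text, voice=None, speed=None, fmt=None,
                      base_url="http://localhost:3000"):
    """Build (but do not send) a POST /tts request with a JSON body."""
    payload = {"text": text}
    # Optional fields are only included when the caller sets them,
    # so the server's own defaults apply otherwise.
    if voice is not None:
        payload["voice"] = voice
    if speed is not None:
        payload["speed"] = speed
    if fmt is not None:
        payload["format"] = fmt
    return urllib.request.Request(
        base_url + "/tts",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text, path, **options):
    """Send the request and write the response body to `path`."""
    request = build_tts_request(text, **options)
    with urllib.request.urlopen(request) as resp, open(path, "wb") as out:
        out.write(resp.read())
```

With the server running, `synthesize_to_file("Hello world", "output.wav")` mirrors the curl command above.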

Transcription with sentence timestamps:

curl -X POST http://localhost:3000/transcribe \
  -F "[email protected]" \
  -F "timestamps=true"
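
For callers that cannot shell out to curl, a multipart upload can be assembled with the standard library. This is a sketch under assumptions: the `timestamps` field comes from the command above, but the upload field name (`file` here), the default filename, and the `audio/wav` content type are guesses to verify against the server's source.

```python
import uuid
import urllib.request


def encode_multipart(fields, file_field, filename, file_bytes,
                     content_type="audio/wav"):
    """Encode text fields plus one file as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (f"--{boundary}\r\n"
             f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
             f"{value}\r\n").encode("utf-8")
        )
    parts.append(
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="{file_field}"; '
         f'filename="{filename}"\r\n'
         f"Content-Type: {content_type}\r\n\r\n").encode("utf-8")
        + file_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_transcribe_request(audio_bytes, filename="audio.wav", timestamps=True,
                             file_field="file",  # assumed field name
                             base_url="http://localhost:3000"):
    """Build (but do not send) a POST /transcribe request."""
    body, header = encode_multipart(
        {"timestamps": "true" if timestamps else "false"},
        file_field, filename, audio_bytes,
    )
    return urllib.request.Request(
        base_url + "/transcribe", data=body,
        headers={"Content-Type": header}, method="POST",
    )
```

Passing the result to `urllib.request.urlopen` sends the upload; the response format is whatever the server returns for `/transcribe`.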

#Workflows

Voice for an agent. Pair Sogni Voice with Creative Agent Skill or any Claude Code / Hermes setup so the agent can speak responses and transcribe spoken prompts without a third-party API.

Voice cloning for avatars. Record 5–10 seconds of clean speech, POST it to /pocket-tts/voices/clone or /qwen-tts/voices/clone, then pass the returned clone ID to the matching synthesis endpoint (/pocket-tts or /qwen-tts) for every subsequent line.

Subtitle generation. Run any video's audio through /transcribe with wordTimestamps=true; pipe the JSON into your subtitle generator.
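
As an illustration of the last step, the sketch below turns timestamped segments into SRT text. The JSON shape returned by /transcribe is not documented here, so the input is assumed to be a list of objects with `start` and `end` in seconds and a `text` string. Those key names are hypothetical and need to be mapped from the real response.

```python
def format_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Render segments (assumed keys: start, end, text) as an SRT document."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_srt_time(seg['start'])} --> {format_srt_time(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # SRT cues are separated by a blank line.
    return "\n".join(blocks)
```

Sentence-level segments usually make readable cues as they are; word-level timestamps are better grouped into short phrases before rendering.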

Style control with Qwen3. Use /qwen-tts/custom-voice with an instruct such as "Speak with excitement and joy" or "Use a calm, soothing voice" to render the same text in different emotional registers.
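
A minimal sketch of rendering one line in several registers follows. The `instruct` field is named in the endpoint table; the `text` field is assumed by analogy with `/tts`, and the helper names are illustrative only.

```python
import json
import urllib.request


def build_custom_voice_request(text, instruct, base_url="http://localhost:3000"):
    """Build (but do not send) a POST /qwen-tts/custom-voice request."""
    payload = {"text": text, "instruct": instruct}  # "text" is an assumed field
    return urllib.request.Request(
        base_url + "/qwen-tts/custom-voice",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_style_variants(text, instructions):
    """One request per instruction, all carrying the same text."""
    return [build_custom_voice_request(text, instruct) for instruct in instructions]
```

Each request can then be sent with `urllib.request.urlopen` and the audio saved under a name derived from its instruction.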

#Tips for getting started

Apple Silicon only. Intel Macs and other platforms won't work — MLX requires the unified-memory architecture.
Enable only what you'll use. Each TTS model downloads gigabytes on first request; flip the *_ENABLED env vars off for models you're not running.
Keep daemons pre-warmed. Leave PREWARM_*=1 unless memory is tight — cold starts are 5–10x slower than warm requests.
Lock down before exposing. Setting HOST=0.0.0.0 without also setting AUTH_ENABLED=1 and AUTH_API_KEY is a footgun; the loopback-only defaults exist for a reason.
Clones are safetensors, not pickle. Pickle-format clones are intentionally rejected on import.
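
The lock-down tip can be turned into a small preflight check that runs before the service starts. The sketch below uses the variable names from this page (`HOST`, `AUTH_ENABLED`, `AUTH_API_KEY`) and assumes that `HOST` defaults to 127.0.0.1 and that `AUTH_ENABLED` is on only when set to `1`. The function names are hypothetical.

```python
import ipaddress


def is_loopback_host(host):
    """True for 'localhost' or any loopback IP literal such as 127.0.0.1 or ::1."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (e.g. a hostname); treat it as non-loopback.
        return False


def exposure_warnings(env):
    """Return warnings for a non-loopback HOST that lacks authentication."""
    warnings = []
    host = env.get("HOST", "127.0.0.1")  # assumed default bind address
    if not is_loopback_host(host):
        if env.get("AUTH_ENABLED") != "1":
            warnings.append(f"HOST={host} is reachable from the network but AUTH_ENABLED is not 1")
        if not env.get("AUTH_API_KEY"):
            warnings.append(f"HOST={host} is reachable from the network but AUTH_API_KEY is empty")
    return warnings
```

Running `exposure_warnings(dict(os.environ))` in a start script (after `import os`) and refusing to launch when the list is non-empty keeps an unauthenticated server off the network.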

#See also

Creative Agent Skill — agent skill that pairs naturally with local TTS/STT.
Sogni Chat — the conversational creative studio Voice slots underneath.
Sogni Client SDK — image/video/music side of the same stack.
Source on GitHub