Sogni Voice

Sogni Voice is a local speech engine for Apple Silicon Macs. It wraps three open-source TTS models (Kokoro, Pocket, Qwen3-TTS) and parakeet-mlx transcription behind a single REST API — no third-party service calls, no audio leaves the machine. Drop it next to a chat bot, an agent, or an avatar pipeline for low-latency voice in and voice out.

#Source on GitHub →

#What it does

  • Text-to-speech, three models. Kokoro (32 voices, 4 languages, MLX-accelerated), Pocket TTS (8 English voices, CPU-only, voice cloning from a 5s sample), and Qwen3-TTS (11 languages, emotion control, voice design from text).
  • Speech-to-text via parakeet-mlx. Upload audio, get a transcript — optionally with sentence- or word-level timestamps suitable for subtitles.
  • Voice cloning. Both Pocket and Qwen3 support cloning from a short reference clip; clones export and import as safetensors ZIPs, never pickle.
  • Local-first by default. The server binds to 127.0.0.1 and CORS is restricted to loopback origins until you opt in explicitly. Optional AUTH_API_KEY for networked deployments.
  • MLX and MPS acceleration. Built for M1/M2/M3/M4 Macs. Persistent Python daemons keep models warm between requests — subsequent calls are 2–5x faster than cold starts.

#Install

brew install ffmpeg uv
npm install
./setup.sh
npm run dev

./setup.sh is interactive and can pre-download every enabled model so the first request isn't a 4 GB wait. For production, npm run pm2:start runs the service under PM2.

#Endpoints at a glance

| Endpoint | What it does |
| --- | --- |
| GET /health | Liveness probe |
| POST /transcribe | Audio file in, transcript out; optional sentence/word timestamps |
| POST /tts | Kokoro TTS: text, voice, speed, format (wav/opus/buffer) |
| GET /tts/voices | List Kokoro voices |
| POST /qwen-tts | Qwen3-TTS standard generation |
| POST /qwen-tts/custom-voice | Qwen3 emotion/style control with instruct field |
| POST /qwen-tts/voice-design | Generate a voice from a text description |
| POST /qwen-tts/voices/clone | Create a voice clone from a reference clip |
| POST /pocket-tts | Pocket TTS: CPU-only, low-latency English |
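
A script that launches the server may want to wait for the liveness probe before sending work. The following is a minimal sketch, assuming the server listens on localhost:3000 as in the examples on this page and that GET /health answers with a 2xx status once the process is up. The function name wait_for_health and the PROBE and DELAY variables are hypothetical helpers, not part of Sogni Voice.

```shell
# Poll a URL until the probe succeeds, or give up after a number of attempts.
# PROBE and DELAY are hypothetical knobs for dry runs; by default the probe is curl.
wait_for_health() {
  url=$1
  attempts=${2:-30}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors; -s silences progress output.
    if ${PROBE:-curl -fs -o /dev/null} "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "${DELAY:-1}"
  done
  return 1
}

# Usage (assumes the dev server from the Install section is starting up):
#   wait_for_health http://localhost:3000/health 60 && echo "server is up"
```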

Quick TTS round trip:

curl -X POST http://localhost:3000/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world"}' \
  --output output.wav
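
The /tts row in the endpoint table lists voice, speed, and format alongside text. A hedged sketch of a fuller request follows, assuming those are top-level JSON keys with exactly those names. The helper tts_payload is hypothetical, and the voice ID should come from GET /tts/voices rather than be guessed.

```shell
# Build a JSON body for POST /tts.
# Assumption: the keys are named text, voice, speed, and format, as the endpoint table suggests.
# Note: this simple helper does not escape quotes or backslashes inside its arguments.
tts_payload() {
  printf '{"text": "%s", "voice": "%s", "speed": %s, "format": "%s"}' "$1" "$2" "$3" "$4"
}

# Usage: list the voices first, then substitute a real ID for VOICE_ID.
#   curl -s http://localhost:3000/tts/voices
#   curl -X POST http://localhost:3000/tts \
#     -H "Content-Type: application/json" \
#     -d "$(tts_payload "Hello world" "$VOICE_ID" 1.0 wav)" \
#     --output output.wav
```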

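Transcription with sentence timestamps. The form field name file and the path audio.wav are placeholders; substitute the names your build expects:

curl -X POST http://localhost:3000/transcribe \
  -F "file=@audio.wav" \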
  -F "timestamps=true"

#Workflows

Voice for an agent. Pair Sogni Voice with Creative Agent Skill or any Claude Code / Hermes setup so the agent can speak responses and transcribe spoken prompts without a third-party API.

Voice cloning for avatars. Record 5–10 seconds of clean speech, POST it to /pocket-tts/voices/clone or /qwen-tts/voices/clone, then call /generate with that clone ID for every subsequent line.

Subtitle generation. Run any video's audio through /transcribe with wordTimestamps=true; pipe the JSON into your subtitle generator.
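
Subtitle formats such as SRT expect timestamps in HH:MM:SS,mmm form. Assuming the transcript reports times as decimal seconds (an assumption; check the JSON your build returns), a small conversion helper could look like this. The names srt_time and srt_cue are hypothetical.

```shell
# Convert decimal seconds (e.g. 75.5) to an SRT timestamp (00:01:15,500).
srt_time() {
  awk -v t="$1" 'BEGIN {
    ms = int(t * 1000 + 0.5)          # round to whole milliseconds
    h = int(ms / 3600000); ms -= h * 3600000
    m = int(ms / 60000);   ms -= m * 60000
    s = int(ms / 1000);    ms -= s * 1000
    printf "%02d:%02d:%02d,%03d\n", h, m, s, ms
  }'
}

# Print one SRT cue from an index, a start time, an end time, and the cue text.
srt_cue() {
  printf '%s\n%s --> %s\n%s\n\n' "$1" "$(srt_time "$2")" "$(srt_time "$3")" "$4"
}

# Usage: srt_cue 1 0 1.5 "Hello world" >> captions.srt
```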

Style control with Qwen3. POST to /qwen-tts/custom-voice with an instruct value like "Speak with excitement and joy" or "Use a calm, soothing voice" to render the same text in different emotional registers.
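
A rough sketch of rendering one line in both registers. It assumes the request body carries the line in a text key next to instruct, and that the response body is audio as with /tts. The endpoint may require further fields (a speaker, for example) that are not shown here. The helper names and output file names are hypothetical.

```shell
# Build a JSON body for POST /qwen-tts/custom-voice.
# Assumption: the line goes in a "text" key; "instruct" is the field named in the endpoint table.
# The helper does not escape quotes or backslashes inside its arguments.
custom_voice_payload() {
  printf '{"text": "%s", "instruct": "%s"}' "$1" "$2"
}

# Render the same line once per instruct string; output file names are arbitrary.
render_styles() {
  n=0
  for instruct in "Speak with excitement and joy" "Use a calm, soothing voice"; do
    n=$((n + 1))
    curl -s -X POST http://localhost:3000/qwen-tts/custom-voice \
      -H "Content-Type: application/json" \
      -d "$(custom_voice_payload "$1" "$instruct")" \
      --output "style_$n.wav"
  done
}

# Usage: render_styles "The results are in."
```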

#Tips for getting started

  • Apple Silicon only. Intel Macs and other platforms won't work — MLX requires the unified-memory architecture.
  • Enable only what you'll use. Each TTS model downloads gigabytes on first request; flip the *_ENABLED env vars off for models you're not running.
  • Keep daemons pre-warmed. Leave PREWARM_*=1 unless memory is tight — cold starts are 5–10x slower than warm requests.
  • Lock down before exposing. Setting HOST=0.0.0.0 without also setting AUTH_ENABLED=1 + AUTH_API_KEY is a footgun — the defaults exist for a reason.
  • Clones are safetensors, not pickle. Pickle-format clones are intentionally rejected on import.
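
The lock-down tip can be enforced with a small pre-flight check in a launch script. This is a sketch that uses only the HOST, AUTH_ENABLED, and AUTH_API_KEY variables named above and treats an unset HOST as the 127.0.0.1 default. The function name check_exposure is hypothetical and not part of Sogni Voice.

```shell
# Return non-zero when the server would listen on all interfaces without auth.
# Only HOST, AUTH_ENABLED, and AUTH_API_KEY come from the tips above; the function name is hypothetical.
check_exposure() {
  if [ "${HOST:-127.0.0.1}" = "0.0.0.0" ]; then
    if [ "${AUTH_ENABLED:-0}" != "1" ] || [ -z "${AUTH_API_KEY:-}" ]; then
      echo "refusing to start: HOST=0.0.0.0 requires AUTH_ENABLED=1 and AUTH_API_KEY" >&2
      return 1
    fi
  fi
  return 0
}

# Usage: check_exposure && npm run pm2:start
```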

#See also