Voice Cloner
Production voice platform — TTS, conversations, and audiobook production on a single RTX 3080
●The Challenge
Professional voice talent costs $500+/hour, and 95% of books lack audio versions because producing a single audiobook costs $2K-$5K in human narration fees. Existing AI TTS solutions (ElevenLabs, Play.ht) handle single-voice generation but offer nothing for multi-character production workflows — no conversation builder, no manuscript parsing, no chapter management, no pronunciation dictionaries. Content creators need a complete production pipeline: from raw manuscript to distribution-ready M4B with chapter markers and consistent multi-voice narration. The core engineering challenge: running a 1.7B parameter TTS model in bfloat16 on a single RTX 3080 (10GB VRAM) with consistent sub-15-second latency while supporting three distinct production modes (single TTS, multi-speaker conversations, full audiobook chapters) through a unified inference pipeline.
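The 10GB constraint above can be sanity-checked with back-of-envelope arithmetic. Only the weight figure follows directly from the stated parameter count and precision; the overhead breakdown is an illustrative assumption, not a measured profile:

```python
# Rough VRAM budget for a 1.7B-parameter model in bfloat16 on a 10 GB card.
PARAMS = 1.7e9
BYTES_PER_PARAM = 2  # bfloat16 = 16 bits

weights_gib = PARAMS * BYTES_PER_PARAM / 1024**3
print(f"weights alone: {weights_gib:.2f} GiB")  # ~3.17 GiB

# That leaves roughly 10 - 3.2 ≈ 6.8 GiB for activations, attention state,
# the CUDA context, and decoded audio buffers (illustrative split) -- enough
# headroom to serve requests on a single RTX 3080, but tight enough that
# fragmentation over long-running workers becomes a real concern.
```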
●The Approach
Chose Qwen3-TTS 1.7B after benchmarking against XTTS, Bark, and Tortoise — best quality-to-VRAM ratio for zero-shot cloning from 10-30 second reference samples. Built the inference pipeline on FastAPI with a 4-tier Redis priority queue (admin > enterprise > pro > free). The critical architectural decision was making each audiobook chapter a Conversation record internally, reusing the entire existing TTS pipeline, per-line effects engine, takes system, and timeline editor with zero code duplication. The Audiobook Studio layer adds manuscript parsing (DOCX via python-docx, PDF via PyMuPDF, TXT via regex), chapter management, character-to-voice casting that propagates across all chapters, and a pronunciation dictionary that applies regex substitutions before TTS generation. Implemented proactive worker recycling every 500 generations to combat PyTorch VRAM fragmentation.
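The pronunciation dictionary mentioned above amounts to a regex substitution pass over the text before it reaches the TTS model. A minimal sketch — the function name, entry format, and longest-match-first ordering here are illustrative assumptions, not the project's actual API:

```python
import re

def apply_pronunciations(text: str, entries: dict[str, str]) -> str:
    """Replace each dictionary term with its phonetic spelling before TTS.

    `entries` maps a written form to how it should be spoken, e.g.
    {"Hermione": "her-MY-oh-nee"}. Longer terms are substituted first so a
    multi-word entry wins over one of its sub-words.
    """
    for term in sorted(entries, key=len, reverse=True):
        # \b word boundaries keep "cat" from matching inside "catalog";
        # IGNORECASE catches sentence-initial capitalisation.
        pattern = r"\b" + re.escape(term) + r"\b"
        text = re.sub(pattern, entries[term], text, flags=re.IGNORECASE)
    return text

print(apply_pronunciations(
    "Hermione ran the NYC Marathon.",
    {"Hermione": "her-MY-oh-nee", "NYC": "New York City"},
))
# → her-MY-oh-nee ran the New York City Marathon.
```

Applying the substitutions book-wide before generation (rather than per chapter) is what keeps character names consistent across every chapter's audio.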
●The Solution
Voice Cloner runs three production modes through a unified FastAPI backend: (1) Single-voice TTS for quick generation, (2) Multi-speaker Conversations with drag-and-drop line ordering, per-line effects (speed/volume/gap), stage directions, multiple takes per line, ambient audio layers, and a waveform timeline editor, (3) Audiobook Studio that parses manuscripts into chapters, detects dialogue and character names, assigns AI voices to each character, applies book-wide pronunciation dictionaries, and exports as M4B with chapter markers or MP3/WAV zip with LUFS mastering. The frontend is Next.js 15 on Cloudflare Workers with wavesurfer.js visualization. 41+ curated voices plus custom uploads with SNR quality gating. Stripe handles tiered billing, Clerk manages auth, Sentry + Amplitude provide observability. Running at 99.95% uptime with 0.03% error rate on a single server.
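M4B chapter markers are commonly muxed in via ffmpeg's FFMETADATA format; this sketch of the metadata-generation step is an assumption about how the export could work, not the project's actual code:

```python
def ffmetadata_chapters(chapters: list[tuple[str, float]]) -> str:
    """Build an ffmpeg FFMETADATA1 file from (title, duration_seconds) pairs.

    ffmpeg can then mux the chapters into an .m4b with:
      ffmpeg -i book.m4a -i chapters.txt -map_metadata 1 -codec copy book.m4b
    """
    lines = [";FFMETADATA1"]
    start_ms = 0
    for title, duration in chapters:
        end_ms = start_ms + int(duration * 1000)
        lines += [
            "[CHAPTER]",
            "TIMEBASE=1/1000",   # timestamps below are in milliseconds
            f"START={start_ms}",
            f"END={end_ms}",
            f"title={title}",
        ]
        start_ms = end_ms  # chapters are contiguous
    return "\n".join(lines) + "\n"

meta = ffmetadata_chapters([("Chapter 1", 1325.5), ("Chapter 2", 1510.0)])
print(meta)
```

Concatenating per-chapter durations this way means the chapter boundaries stay correct no matter how many takes or regenerated lines each chapter went through.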