AI/ML

Voice Cloner

Production voice platform — TTS, conversations, and audiobook production on a single RTX 3080

12s
P50 Latency
Short text generation on RTX 3080
99.95%
Uptime
Single-server production deployment
0.03%
Error Rate
Across all generation requests
41+
Curated Voices
Presidents, celebrities, plus custom uploads

The Challenge

Professional voice talent costs $500+/hour, and 95% of books lack audio versions because producing a single audiobook costs $2K-$5K in human narration fees. Existing AI TTS solutions (ElevenLabs, Play.ht) handle single-voice generation but offer nothing for multi-character production workflows — no conversation builder, no manuscript parsing, no chapter management, no pronunciation dictionaries. Content creators need a complete production pipeline: from raw manuscript to distribution-ready M4B with chapter markers and consistent multi-voice narration. The core engineering challenge: running a 1.7B parameter TTS model in bfloat16 on a single RTX 3080 (10GB VRAM) with consistent sub-15-second latency while supporting three distinct production modes (single TTS, multi-speaker conversations, full audiobook chapters) through a unified inference pipeline.

The Approach

Chose Qwen3-TTS 1.7B after benchmarking against XTTS, Bark, and Tortoise — best quality-to-VRAM ratio for zero-shot cloning from 10-30 second reference samples. Built the inference pipeline on FastAPI with a 4-tier Redis priority queue (admin > enterprise > pro > free). The critical architectural decision was making each audiobook chapter a Conversation record internally, reusing the entire existing TTS pipeline, per-line effects engine, takes system, and timeline editor with zero code duplication. The Audiobook Studio layer adds manuscript parsing (DOCX via python-docx, PDF via PyMuPDF, TXT via regex), chapter management, character-to-voice casting that propagates across all chapters, and a pronunciation dictionary that applies regex substitutions before TTS generation. Implemented proactive worker recycling every 500 generations to combat PyTorch VRAM fragmentation.
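The pronunciation-dictionary step can be sketched as a plain regex substitution pass run on each line before it reaches the model. This is a minimal illustration, not the production code; the function name and entry format are assumed.

```python
import re

def apply_pronunciations(text: str, entries: dict[str, str]) -> str:
    """Apply book-wide pronunciation substitutions before TTS.

    `entries` maps a written form to a phonetic respelling, e.g.
    {"Hermione": "Her-my-oh-nee"}. Word boundaries prevent partial
    matches; longer keys are applied first so multi-word entries win.
    """
    for written in sorted(entries, key=len, reverse=True):
        pattern = r"\b" + re.escape(written) + r"\b"
        text = re.sub(pattern, entries[written], text)
    return text

# Every occurrence in a chapter gets the same respelling:
fixed = apply_pronunciations(
    "Hermione smiled. 'Hermione!' he called.",
    {"Hermione": "Her-my-oh-nee"},
)
```

Because the substitution happens upstream of generation, the same dictionary applies uniformly across all chapters and all cast voices.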

Tech Decisions

TTS Model
Qwen3-TTS 1.7B (bfloat16)

Best quality-to-VRAM ratio for zero-shot voice cloning on 10GB VRAM. XTTS requires more memory for comparable quality; Bark and Tortoise are slower with inferior zero-shot capabilities. bfloat16 precision halves memory usage with negligible quality impact.

Audiobook Architecture
Chapter-as-Conversation pattern

Each audiobook chapter maps to a Conversation record, reusing the entire TTS pipeline, per-line effects, takes system, and timeline editor with zero code duplication. This avoided building a parallel generation system and gave audiobooks instant access to all conversation features.
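The pattern is easiest to see in the data model. A hedged sketch (the `Line`, `Conversation`, and `casting` names are illustrative, not the actual schema): a chapter is just a Conversation whose lines were produced by the manuscript parser, with voices resolved through the book-wide casting table.

```python
from dataclasses import dataclass, field

@dataclass
class Line:
    speaker: str   # character name ("Narrator", "Alice", ...)
    voice_id: str  # voice assigned via book-wide casting
    text: str

@dataclass
class Conversation:
    id: str
    lines: list[Line] = field(default_factory=list)

def chapter_to_conversation(chapter_id: str,
                            parsed_lines: list[tuple[str, str]],
                            casting: dict[str, str]) -> Conversation:
    """Map a parsed chapter onto a Conversation record.

    `casting` is the book-wide character-to-voice table, so every
    chapter reuses the same voices, and the existing TTS pipeline,
    effects engine, and takes system operate on it unchanged.
    """
    conv = Conversation(id=f"chapter:{chapter_id}")
    for speaker, text in parsed_lines:
        conv.lines.append(Line(speaker, casting[speaker], text))
    return conv
```

Once a chapter is a Conversation, everything downstream (generation, per-line effects, timeline editing) needs no audiobook-specific code path.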

Queue System
Redis 4-Tier Priority Queue

Subscription tiers need differentiated service levels without separate infrastructure. Redis sorted sets with tier-based scoring ensure Enterprise requests process before Pro, Pro before Free — all on the same GPU.
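The scoring idea can be shown in a few lines. With a Redis sorted set, the score below would go into `ZADD` and the worker would drain with `ZPOPMIN`; here a `heapq` stands in for the sorted set so the ordering is visible. The tier ranks and the `1e10` multiplier are assumptions for illustration.

```python
import heapq
import time

# Lower score = served first (ZPOPMIN semantics on a Redis sorted set).
TIER_RANK = {"admin": 0, "enterprise": 1, "pro": 2, "free": 3}

def queue_score(tier: str, enqueued_at: float) -> float:
    """Tier dominates; enqueue time breaks ties (FIFO within a tier)."""
    return TIER_RANK[tier] * 1e10 + enqueued_at

# In-memory stand-in for the sorted set, to show the ordering:
queue: list[tuple[float, str]] = []
now = time.time()
heapq.heappush(queue, (queue_score("free", now), "job-free"))
heapq.heappush(queue, (queue_score("pro", now + 5), "job-pro"))
heapq.heappush(queue, (queue_score("enterprise", now + 9), "job-ent"))

drained = [heapq.heappop(queue)[1] for _ in range(len(queue))]
# Enterprise is served before Pro, Pro before Free, despite arriving later.
```

Because the tier rank is scaled far above any Unix timestamp, a free-tier job can never outrank a paid one, yet jobs within a tier stay first-in, first-out.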

VRAM Management
Worker Recycling (500 generations)

PyTorch's CUDA memory allocator fragments over time, degrading latency from 12s to 45s+ after thousands of generations. Proactive worker recycling every 500 generations resets allocator state. The 3-second restart penalty is invisible to users.
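A minimal sketch of the recycling policy, assuming a supervisor (systemd, a process pool) respawns the worker on exit; the class and threshold names are illustrative and `generate()` is a placeholder for the real model call.

```python
RECYCLE_EVERY = 500  # generations between restarts (assumed threshold)

class RecyclingWorker:
    """Counts generations and recycles at the threshold so the process
    restarts with a fresh CUDA allocator. In production the recycle step
    would flush in-flight work and exit, letting the supervisor respawn
    the worker (~3 s penalty, hidden from users by the queue)."""

    def __init__(self) -> None:
        self.generations = 0

    def generate(self, text: str) -> str:
        audio = f"<audio for {text!r}>"  # placeholder for model inference
        self.generations += 1
        if self.generations >= RECYCLE_EVERY:
            self.recycle()
        return audio

    def recycle(self) -> None:
        # Real worker: sys.exit(0) after draining; here we just reset
        # the counter to stand in for a restart.
        self.generations = 0
```

The key property is that recycling happens proactively on a counter, not reactively on an OOM or latency alarm, so fragmentation never accumulates far enough to be visible.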

The Solution

Voice Cloner runs three production modes through a unified FastAPI backend:
  • Single-voice TTS for quick generation.
  • Multi-speaker Conversations with drag-and-drop line ordering, per-line effects (speed/volume/gap), stage directions, multiple takes per line, ambient audio layers, and a waveform timeline editor.
  • Audiobook Studio, which parses manuscripts into chapters, detects dialogue and character names, assigns AI voices to each character, applies book-wide pronunciation dictionaries, and exports M4B with chapter markers or an MP3/WAV zip with LUFS mastering.

The frontend is Next.js 15 on Cloudflare Workers with wavesurfer.js waveform visualization. 41+ curated voices are available, plus custom uploads gated by an SNR quality check. Stripe handles tiered billing, Clerk manages auth, and Sentry + Amplitude provide observability. The service runs at 99.95% uptime with a 0.03% error rate on a single server.
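The SNR quality gate on voice uploads can be sketched with a crude energy-based estimate: treat the quietest frame as the noise floor and the average frame energy as signal. Real pipelines typically use voice-activity detection, and the threshold below is an assumption, but the gating idea is the same.

```python
import math

MIN_SNR_DB = 20.0  # assumed acceptance threshold for custom voice uploads

def estimate_snr_db(samples: list[float], frame: int = 1024) -> float:
    """Crude SNR proxy over fixed-size frames of a mono signal."""
    frames = [samples[i:i + frame]
              for i in range(0, len(samples) - frame + 1, frame)]
    energies = [sum(s * s for s in f) / frame for f in frames]
    noise = max(min(energies), 1e-12)       # quietest frame = noise floor
    signal = sum(energies) / len(energies)  # average energy = signal
    return 10.0 * math.log10(signal / noise)

def accept_upload(samples: list[float]) -> bool:
    """Reject reference audio too noisy for good zero-shot cloning."""
    return estimate_snr_db(samples) >= MIN_SNR_DB
```

Rejecting noisy references at upload time is cheaper than generating bad audio: zero-shot cloning quality is bounded by the reference sample.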

Key Takeaways

Reusable Insights
  • Reusing existing systems through smart data modeling (chapter-as-conversation) avoids the trap of building parallel pipelines for related features.
  • GPU inference services need proactive VRAM management — memory fragmentation is silent and cumulative, degrading latency until the service appears broken.
  • Zero-shot voice cloning quality is bounded by reference audio quality. Input validation on uploads is the highest-ROI investment for user satisfaction.
  • Multi-tier manuscript parsing (structure > regex > fallback) handles real-world document diversity better than any single detection strategy.
  • Single-server GPU deployments can serve production SaaS workloads at 99.95% uptime with proper queue management and proactive maintenance.

Related Projects

2025 / Developer Tools

TraceForge

Cut vector conversion time from 45 minutes to 8 seconds per asset—a 337x speedup. Design teams were hemorrhaging billable hours manually tracing logos and icons in Illustrator. Built a GPU-accelerated pipeline combining neural upscaling with dual vectorization engines (Potrace + VTracer), plus an SVGO optimization stage that reduces file sizes by 40-60%. Now processing 2,000+ conversions monthly with zero manual intervention.

Python · FastAPI · Potrace · VTracer · CUDA
Case Study
2025 / Developer Tools

Claude Pilot

Recovered 2+ hours daily lost to context-switching between terminal, database clients, and config files. Claude Code power users were drowning in fragmented tooling—no unified view of sessions, memory state, or MCP server health. Architected a native Electron control center with 25 tRPC endpoints managing PostgreSQL, Memgraph, and Qdrant memory systems. 80% test coverage, zero production incidents since launch.

Electron · React · TypeScript · tRPC · Zod
Case Study
2024 / AI/ML

PhotoKeep Pro

Slashed cloud GPU costs by 73% while boosting restoration quality by 4dB over commercial alternatives. A restoration startup was burning $12k/month on fragmented API calls with inconsistent results. Engineered a unified orchestration layer managing 14+ deep learning models (SUPIR, HAT, CodeFormer) with thread-safe VRAM allocation and LRU eviction across 49GB. Now delivering 28.5dB PSNR quality at 99.95% uptime—outperforming Magnific AI and Topaz on blind tests.

Python · FastAPI · PyTorch · React · TypeScript
Case Study
2024 / AI/ML

PenQWEN

Reduced security assessment setup time from 4 hours to 12 minutes with zero hallucinated commands. Pentesting teams were wasting senior hours on boilerplate reconnaissance while generic LLMs generated dangerous garbage. Built a domain-adapted Qwen2.5 model through two-stage LoRA training: cybersecurity corpus adaptation, then agentic fine-tuning for tool calling and OPSEC. 3.6GB adapters trained on 12GB curated security data now automate 60% of routine enumeration tasks.

Python · PyTorch · LoRA · Qwen2.5 · Transformers
Case Study

Have a similar challenge?

I help teams solve complex technical problems. Let's discuss your project.

START_CONVERSATION()