AI/ML

PhotoKeep Pro

73% GPU cost reduction while outperforming Magnific AI on blind quality tests

73%
Cost Reduction
$12k/mo to $3.2k/mo GPU costs
28.5dB PSNR
Quality Score
+4dB over commercial alternatives
14+
AI Models
Orchestrated with thread-safe VRAM allocation
99.95%
Uptime
With automatic GPU node failover

The Challenge

A photo restoration startup was spending $12,000/month on fragmented cloud GPU API calls—Replicate for upscaling, a separate service for face restoration, another for colorization. Each API had different quality levels, inconsistent processing times, and no coordination between stages. Results varied wildly between runs. Customers receiving professionally restored family photos expected consistency, but the patchwork architecture couldn't deliver it. The core technical challenge: orchestrating 14+ deep learning models with different VRAM requirements, processing characteristics, and failure modes into a single reliable pipeline. Models ranged from 2GB (CodeFormer for faces) to 12GB (SUPIR for general restoration), and naive sequential loading would exhaust even 49GB of GPU memory.

The Approach

Rejected the multi-cloud API approach entirely. Instead, consolidated all models onto dedicated GPU infrastructure with a custom orchestration layer. The key insight was treating VRAM like a managed memory pool—building an LRU eviction system that keeps frequently-used models loaded while swapping cold models to CPU memory. This eliminated the 15-30 second model loading penalty for common workflows. Built the orchestration on Celery with Redis for distributed task queuing, allowing horizontal scaling across multiple GPU nodes. Each restoration job gets decomposed into a dependency graph: analyze → denoise → upscale → face restore → colorize (optional). Failed stages retry independently without reprocessing the entire pipeline.
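
The decomposition and checkpointing described above can be sketched in a few lines. This is a minimal illustration, not the production code: the stage names mirror the pipeline in the text, but the stage bodies, the `IMPLS` table, and the in-memory checkpoint dict are hypothetical stand-ins for the real Celery tasks and their persisted results.

```python
# Minimal sketch of stage-level checkpointing: each stage's output is
# persisted, so a retry resumes from the last completed stage instead
# of reprocessing the whole pipeline. Stage names mirror the case
# study; the stage bodies are placeholder stand-ins for real tasks.

PIPELINE = ["analyze", "denoise", "upscale", "face_restore"]

# Placeholder implementations: each stage just tags the payload.
IMPLS = {name: (lambda n: lambda data: data + [n])(name) for name in PIPELINE}

def run_pipeline(job, stages, impls, checkpoints):
    """Run stages in order, skipping any stage already checkpointed."""
    result = job
    for name in stages:
        if name in checkpoints:       # stage already done: reuse its output
            result = checkpoints[name]
            continue
        result = impls[name](result)
        checkpoints[name] = result    # persist before moving on
    return result

checkpoints = {}
out = run_pipeline([], PIPELINE, IMPLS, checkpoints)
# Re-running with the same checkpoint store recomputes nothing.
out2 = run_pipeline([], PIPELINE, IMPLS, checkpoints)
```

In production the checkpoint store would live in Redis or on disk rather than in a local dict, so a crashed worker's replacement can pick up where it left off.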

Tech Decisions

Model Orchestration
Custom VRAM Allocator + LRU Eviction

Off-the-shelf model serving (Triton, TorchServe) doesn't handle the dynamic multi-model loading pattern well. Our custom allocator treats VRAM as a managed pool, keeping hot models loaded and swapping cold ones to CPU RAM. Eliminates the 15-30 second model loading penalty for common workflows.
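
The LRU idea reduces to a small amount of bookkeeping. The sketch below uses an `OrderedDict` as the GPU pool; the model sizes, the VRAM budget, and the dict-to-dict "swap" are illustrative placeholders, not the real allocator or real CUDA memory moves.

```python
from collections import OrderedDict

# Sketch of an LRU model pool: hot models stay in the "GPU" dict, cold
# models are evicted to a "CPU" dict when the VRAM budget is exceeded.
# Sizes (GB) and the budget are illustrative, not production figures.

class ModelPool:
    def __init__(self, vram_budget_gb):
        self.budget = vram_budget_gb
        self.gpu = OrderedDict()   # model name -> size, ordered by recency
        self.cpu = {}              # evicted models kept warm in host RAM

    def acquire(self, name, size_gb):
        if name in self.gpu:
            self.gpu.move_to_end(name)        # mark as most recently used
            return name
        size = self.cpu.pop(name, size_gb)    # reload from CPU if warm
        # Evict least-recently-used models until the new one fits.
        while sum(self.gpu.values()) + size > self.budget:
            cold, cold_size = self.gpu.popitem(last=False)
            self.cpu[cold] = cold_size        # swap cold model to CPU RAM
        self.gpu[name] = size
        return name

pool = ModelPool(vram_budget_gb=20)
pool.acquire("supir", 12)
pool.acquire("codeformer", 2)
pool.acquire("hat", 8)    # exceeds the 20GB budget: evicts "supir" (LRU)
```

The production version additionally needs a lock around `acquire` (the allocator in the text is thread-safe) and real `model.to("cpu")` / `model.to("cuda")` transfers in place of the dict moves.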

Task Queue
Celery + Redis

Restoration pipelines are CPU/GPU-bound with unpredictable runtimes. Celery's task decomposition lets us retry failed stages independently without reprocessing the entire pipeline. Redis provides the low-latency broker needed for real-time progress updates.
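
The retry behavior can be shown without a broker. The decorator below is a plain-Python approximation of Celery's per-task `autoretry_for` semantics, written for illustration: only the failing stage reruns, never the whole pipeline. The `flaky_upscale` stage and its failure pattern are hypothetical.

```python
import functools

def retries(max_retries=3, exc=(RuntimeError,)):
    """Retry a stage up to max_retries times on the given exceptions,
    approximating Celery's autoretry_for: only this stage reruns."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return fn(*args, **kwargs)
                except exc:
                    if attempt == max_retries:
                        raise          # exhausted: surface the failure
        return inner
    return wrap

calls = {"n": 0}

@retries(max_retries=3)
def flaky_upscale(image):
    calls["n"] += 1
    if calls["n"] < 3:                 # fail twice, succeed on third try
        raise RuntimeError("transient GPU OOM")
    return image + ":upscaled"

result = flaky_upscale("photo")
```

With Celery itself this collapses to a task option along the lines of `@app.task(autoretry_for=(RuntimeError,), max_retries=3)`, plus backoff settings for transient GPU errors.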

Billing
Stripe Usage-Based

Per-restoration billing with volume tiers aligns cost directly with value delivered. Customers paying for 10 restorations/month shouldn't subsidize enterprise users processing thousands. Stripe's metered billing handles the complexity of tiered pricing without custom billing logic.
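
The tier arithmetic Stripe handles is simple to state. This sketch computes a graduated-tier charge, where each tier's rate applies only to units inside that tier; the tier boundaries and per-restoration prices are made up for illustration, and in production Stripe's metered billing computes this server-side.

```python
# Hypothetical graduated volume tiers: each tier's rate applies only to
# the restorations that fall within that tier, matching Stripe's
# "graduated" pricing mode. Boundaries and prices are illustrative.

TIERS = [                  # (units in this tier, price per restoration)
    (100, 0.50),           # first 100 restorations at $0.50 each
    (900, 0.30),           # next 900 at $0.30 each
    (float("inf"), 0.15),  # everything beyond 1,000 at $0.15 each
]

def monthly_charge(restorations):
    """Total charge for a month's usage under graduated tiers."""
    total, remaining = 0.0, restorations
    for tier_units, price in TIERS:
        used = min(remaining, tier_units)
        total += used * price
        remaining -= used
        if remaining <= 0:
            break
    return round(total, 2)
```

A light user doing 10 restorations pays 10 × $0.50 = $5.00, while a 1,500-restoration enterprise month lands at $395.00 under these example tiers, which is the incentive alignment the paragraph describes.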

The Solution

PhotoKeep Pro runs a unified pipeline managing SUPIR, HAT, Real-ESRGAN, CodeFormer, GFPGAN, and 8 other specialized models through a thread-safe VRAM allocator. The LRU eviction system maintains a working set of 3-4 models in GPU memory while keeping the rest warm in CPU RAM. Average restoration completes in 45 seconds for a 12MP image—down from 3-5 minutes with the previous API-chaining approach. Quality improved to 28.5dB PSNR on our benchmark suite, a 4dB improvement over commercial alternatives. The Stripe-integrated billing system charges per restoration with volume discounts, aligning costs directly with usage. Running at 99.95% uptime with automatic failover between GPU nodes.
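
PSNR, the quality metric quoted above, is easy to compute for anyone reproducing the comparison. This stdlib-only sketch operates on flat 8-bit pixel sequences rather than real image tensors, which is an assumption for brevity; the benchmark suite itself is not public.

```python
import math

def psnr(reference, restored, max_val=255):
    """Peak signal-to-noise ratio in dB between two equal-length
    8-bit pixel sequences: PSNR = 10 * log10(MAX^2 / MSE)."""
    assert len(reference) == len(restored)
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return float("inf")            # identical images
    return 10 * math.log10(max_val ** 2 / mse)

# A mean squared error of 1 on 8-bit pixels gives
# 10 * log10(255^2) ≈ 48.13 dB.
score = psnr([10, 20, 30, 40], [11, 21, 29, 39])
```

Because the scale is logarithmic, the reported +4dB over commercial alternatives corresponds to roughly a 2.5x reduction in mean squared error.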

Key Takeaways

Reusable Insights
  • Custom GPU memory management cut costs by 73% versus cloud API chaining; the engineering investment pays for itself within 2 months.
  • Async job queues with stage-level checkpointing are essential for ML pipelines—any stage can fail, and full reprocessing is unacceptable for production workloads.
  • Usage-based billing aligns incentives perfectly for compute-heavy SaaS—customers pay for value received, and revenue scales linearly with infrastructure costs.
  • LRU model eviction between GPU and CPU memory eliminates the cold-start penalty that makes multi-model architectures impractical for real-time workloads.

Related Projects

2025 / Developer Tools

TraceForge

Cut vector conversion time from 45 minutes to 8 seconds per asset—a 337x speedup. Design teams were hemorrhaging billable hours manually tracing logos and icons in Illustrator. Built a GPU-accelerated pipeline combining neural upscaling with dual vectorization engines (Potrace + VTracer), plus an SVGO optimization stage that reduces file sizes by 40-60%. Now processing 2,000+ conversions monthly with zero manual intervention.

Python · FastAPI · Potrace · VTracer · CUDA
Case Study
2025 / Developer Tools

Claude Pilot

Recovered 2+ hours daily lost to context-switching between terminal, database clients, and config files. Claude Code power users were drowning in fragmented tooling—no unified view of sessions, memory state, or MCP server health. Architected a native Electron control center with 25 tRPC endpoints managing PostgreSQL, Memgraph, and Qdrant memory systems. 80% test coverage, zero production incidents since launch.

Electron · React · TypeScript · tRPC · Zod
Case Study
2024 / AI/ML

PenQWEN

Reduced security assessment setup time from 4 hours to 12 minutes with zero hallucinated commands. Pentesting teams were wasting senior hours on boilerplate reconnaissance while generic LLMs generated dangerous garbage. Built a domain-adapted Qwen2.5 model through two-stage LoRA training: cybersecurity corpus adaptation, then agentic fine-tuning for tool calling and OPSEC. 3.6GB adapters trained on 12GB curated security data now automate 60% of routine enumeration tasks.

Python · PyTorch · LoRA · Qwen2.5 · Transformers
Case Study
2025 / AI/ML

Voice Cloner

Built a production AI voice platform handling single-voice TTS, multi-speaker conversations, and full audiobook production from manuscript uploads — all on a single RTX 3080. The platform runs Qwen3-TTS 1.7B with 12-second P50 latency, 41+ curated voices, and zero-shot cloning from short reference audio. The Audiobook Studio parses DOCX/PDF/TXT manuscripts into chapters with dialogue detection, assigns character voices, applies pronunciation dictionaries, and exports distribution-ready M4B with chapter markers. Multi-voice conversations support drag-and-drop line ordering, per-line effects (speed, volume, gap), stage directions, multiple takes, ambient audio, and a waveform timeline editor. 99.95% uptime, 0.03% error rate, Stripe subscription billing.

Python · FastAPI · PyTorch · Qwen3-TTS · Redis
Case Study

Have a similar challenge?

I help teams solve complex technical problems. Let's discuss your project.

START_CONVERSATION()