PhotoKeep Pro
73% GPU cost reduction while outperforming Magnific AI on blind quality tests
●The Challenge
A photo restoration startup was spending $12,000/month on fragmented cloud GPU API calls—Replicate for upscaling, a separate service for face restoration, another for colorization. Each API had different quality levels, inconsistent processing times, and no coordination between stages. Results varied wildly between runs. Customers receiving professionally restored family photos expected consistency, but the patchwork architecture couldn't deliver it. The core technical challenge: orchestrating 14+ deep learning models with different VRAM requirements, processing characteristics, and failure modes into a single reliable pipeline. Models ranged from 2GB (CodeFormer for faces) to 12GB (SUPIR for general restoration), and naive sequential loading would exhaust even 49GB of GPU memory.
●The Approach
Rejected the multi-cloud API approach entirely. Instead, consolidated all models onto dedicated GPU infrastructure with a custom orchestration layer. The key insight was treating VRAM like a managed memory pool—building an LRU eviction system that keeps frequently used models loaded while swapping cold models to CPU memory. This eliminated the 15-30 second model loading penalty for common workflows. Built the orchestration on Celery with Redis for distributed task queuing, allowing horizontal scaling across multiple GPU nodes. Each restoration job is decomposed into a dependency graph: analyze → denoise → upscale → face restore → colorize (optional). Failed stages retry independently without reprocessing the entire pipeline.
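The LRU eviction idea can be sketched in a few lines. This is a simplified, single-budget illustration, not the production allocator: class and callback names are hypothetical, sizes are tracked as plain numbers, and the actual GPU↔CPU transfers (e.g. moving model weights between devices) are abstracted into `on_load`/`on_evict` callbacks.

```python
from collections import OrderedDict
from threading import Lock

class VramPool:
    """Minimal LRU pool: keeps hot models resident on the GPU up to a
    byte budget, evicting the least recently used model back to CPU
    memory when a new model would not fit. (Illustrative sketch; the
    real loading/offloading work happens in the callbacks.)"""

    def __init__(self, budget_gb, on_load=None, on_evict=None):
        self.budget = budget_gb
        self.resident = OrderedDict()  # name -> size_gb, in LRU order
        self.lock = Lock()             # allocator is shared across workers
        self.on_load = on_load or (lambda name: None)
        self.on_evict = on_evict or (lambda name: None)

    def acquire(self, name, size_gb):
        """Ensure `name` is resident before a stage runs on it."""
        with self.lock:
            if name in self.resident:
                # Cache hit: just refresh recency, no load penalty.
                self.resident.move_to_end(name)
                return
            # Evict cold models (oldest first) until the new one fits.
            while sum(self.resident.values()) + size_gb > self.budget:
                cold, _ = self.resident.popitem(last=False)
                self.on_evict(cold)
            self.on_load(name)
            self.resident[name] = size_gb
```

With a 16GB budget, acquiring a 12GB model, then a 2GB model, then a 6GB model evicts only the 12GB model (the coldest), leaving the two recent ones resident—the same behavior that keeps a working set of small, frequently used models warm.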
●The Solution
PhotoKeep Pro runs a unified pipeline managing SUPIR, HAT, Real-ESRGAN, CodeFormer, GFPGAN, and 8 other specialized models through a thread-safe VRAM allocator. The LRU eviction system maintains a working set of 3-4 models in GPU memory while keeping the rest warm in CPU RAM. Average restoration completes in 45 seconds for a 12MP image—down from 3-5 minutes with the previous API-chaining approach. Quality improved to 28.5dB PSNR on our benchmark suite, a 4dB improvement over commercial alternatives. The Stripe-integrated billing system charges per restoration with volume discounts, aligning costs directly with usage. Running at 99.95% uptime with automatic failover between GPU nodes.
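The per-stage retry behavior—a failed stage retries on its own without reprocessing earlier stages—runs on Celery in production. A minimal single-process sketch of that behavior, with all names and retry parameters hypothetical:

```python
import time

def run_pipeline(image, stages, max_retries=2, backoff_s=0.0):
    """Run restoration stages in order. A failing stage is retried up
    to `max_retries` times; stages that already succeeded are never
    re-run, because their output is held in `result`. `stages` is a
    list of (name, fn) where fn takes and returns the working image."""
    result = image
    for name, fn in stages:
        for attempt in range(max_retries + 1):
            try:
                result = fn(result)
                break  # stage succeeded; move to the next one
            except Exception:
                if attempt == max_retries:
                    raise  # exhausted retries: surface the failure
                time.sleep(backoff_s)  # brief pause before retrying
    return result
```

In the Celery version each stage is a task in a chain, so a retry re-enqueues only that task; the sketch above captures the same property in-process: a transient failure in, say, the upscale stage never causes the denoise stage to run twice.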