February 19, 2026 · 14 min read · infrastructure

LLM Cost Optimization at Scale

Your AI feature costs $0.03 per query. At 100K queries/day, that's $3,000/day... roughly $90K/month in LLM costs alone. Here's the tiered model strategy, caching architecture, and prompt optimization that reduce costs by 60-80% without degrading output quality.

llm · ai · cost-optimization · saas · caching

TL;DR

LLM costs follow a power law: 80% of queries can be handled by cheap, fast models while 20% need expensive, capable models. The tiered approach... routing simple queries to GPT-4o-mini ($0.15/1M input tokens) and complex queries to Claude Sonnet ($3/1M input tokens)... reduces average cost per query by 60-70%. Add semantic caching (serving identical or near-identical queries from cache) and you eliminate another 20-30% of LLM calls entirely. The three highest-leverage optimizations, in order: (1) model routing based on query complexity, (2) semantic caching with a 0.95+ similarity threshold, (3) prompt compression that strips unnecessary tokens. I've helped 4 SaaS companies reduce their monthly LLM spend from $15-50K to $3-10K using these patterns. The quality difference was measurable only on edge cases.

Part of the AI-Assisted Development Guide ... a comprehensive guide to building AI features that deliver real value.


The Cost Problem at Scale

An AI feature that costs $0.03 per query seems cheap. At 10 queries/day during development, that's $9/month. At product launch with 1,000 daily active users averaging 5 queries each, it's $4,500/month. At scale with 50,000 DAU, it's $225,000/month.

| Scale | Daily Queries | Monthly LLM Cost (GPT-4o) | Monthly LLM Cost (Optimized) |
|---|---|---|---|
| Development | 10 | $4 | $4 |
| Early launch | 5,000 | $2,000 | $400 |
| Growth | 50,000 | $20,000 | $3,500 |
| Scale | 500,000 | $200,000 | $30,000 |

The "optimized" column uses the techniques in this post. The difference between $450K and $60K per month is the difference between an AI feature that's a strategic advantage and one that's a financial liability.


Strategy 1: Tiered Model Routing

The insight: most queries don't need the most capable model. A factual lookup, a simple summary, or a yes/no classification can be handled by a model that costs 20x less.

The Model Tier Map

| Tier | Model | Input Cost | Output Cost | Use Case |
|---|---|---|---|---|
| Tier 1 | GPT-4o-mini | $0.15/1M tokens | $0.60/1M tokens | Classification, extraction, simple Q&A |
| Tier 2 | Claude Haiku 4.5 | $0.80/1M tokens | $4.00/1M tokens | Moderate reasoning, summarization |
| Tier 3 | Claude Sonnet 4.6 | $3.00/1M tokens | $15.00/1M tokens | Complex reasoning, nuanced generation |
| Tier 4 | Claude Opus 4.6 | $15.00/1M tokens | $75.00/1M tokens | Expert-level analysis, critical decisions |

Pricing as of February 2026. LLM costs trend downward... verify current rates before building your cost model.

The Router

```typescript
// Query complexity router
async function routeQuery(query: string, context: string): Promise<ModelConfig> {
  // Step 1: Classify query complexity with the cheapest model
  const classification = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: `Classify the query complexity as one of:
- SIMPLE: factual lookup, yes/no, data extraction
- MODERATE: summarization, comparison, multi-step reasoning
- COMPLEX: analysis, recommendations, nuanced judgment
Respond with ONLY the classification word.`,
      },
      { role: "user", content: query },
    ],
    max_tokens: 10,
    temperature: 0,
  });

  const complexity = classification.choices[0].message.content?.trim();

  switch (complexity) {
    case "SIMPLE":
      return { model: "gpt-4o-mini", maxTokens: 500, temperature: 0.1 };
    case "MODERATE":
      return { model: "claude-haiku-4-5-20251001", maxTokens: 1000, temperature: 0.3 };
    case "COMPLEX":
      return { model: "claude-sonnet-4-6", maxTokens: 2000, temperature: 0.5 };
    default:
      // Fail cheap: unknown classifications go to the lowest tier
      return { model: "gpt-4o-mini", maxTokens: 500, temperature: 0.1 };
  }
}
```

The meta-cost of routing: The classification call itself costs ~$0.0001 per query (GPT-4o-mini with a 10-token response). This is negligible compared to the savings from routing 80% of queries to a cheaper model.

Distribution in Production

Across 4 SaaS products I've worked with, the query complexity distribution is remarkably consistent:

| Complexity | % of Queries | Model | Avg Cost/Query |
|---|---|---|---|
| Simple | 55-65% | GPT-4o-mini | $0.0003 |
| Moderate | 25-30% | Haiku | $0.002 |
| Complex | 10-15% | Sonnet | $0.01 |

Blended average: $0.002/query vs $0.03/query (all queries to Sonnet). That's a 93% cost reduction.
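The blended figure is just a weighted sum over the tiers. Using the midpoint of each bucket above (a 60/27.5/12.5 split, my reading of the table, not measured data):

```typescript
// Weighted-average cost per query across complexity tiers.
// Shares and per-query costs are midpoints from the table above.
const tiers = [
  { share: 0.6, cost: 0.0003 },  // Simple   → GPT-4o-mini
  { share: 0.275, cost: 0.002 }, // Moderate → Haiku
  { share: 0.125, cost: 0.01 },  // Complex  → Sonnet
];

const blended = tiers.reduce((sum, t) => sum + t.share * t.cost, 0);
// blended ≈ $0.00198/query

// Versus sending every query to Sonnet at $0.03/query:
const reduction = 1 - blended / 0.03; // ≈ 0.93, i.e. ~93% cheaper
```

Plug in your own distribution; even a pessimistic split (40% simple, 40% moderate, 20% complex) still lands well under a tenth of the all-Sonnet cost.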


Strategy 2: Semantic Caching

If the same question gets asked 50 times, you don't need to call the LLM 50 times. Semantic caching stores query-response pairs and serves cached responses for semantically similar queries.

```typescript
// Semantic cache using vector similarity
class SemanticCache {
  private similarityThreshold = 0.95;

  async get(query: string): Promise<CacheResult | null> {
    const queryEmbedding = await embed(query);

    const results = await vectorDb.search({
      vector: queryEmbedding,
      topK: 1,
      filter: {
        createdAt: { $gt: Date.now() - 24 * 60 * 60 * 1000 }, // 24h TTL
      },
    });

    if (results.length > 0 && results[0].score >= this.similarityThreshold) {
      return {
        response: results[0].metadata.response,
        similarity: results[0].score,
        cached: true,
      };
    }

    return null;
  }

  async set(query: string, response: string): Promise<void> {
    const queryEmbedding = await embed(query);

    await vectorDb.upsert({
      id: generateId(),
      vector: queryEmbedding,
      metadata: {
        query,
        response,
        createdAt: Date.now(),
      },
    });
  }
}

// Usage in the query pipeline
async function handleQuery(query: string, context: string): Promise<string> {
  // Check cache first
  const cached = await semanticCache.get(query);
  if (cached) {
    logger.info("Cache hit", { similarity: cached.similarity });
    return cached.response;
  }

  // Cache miss ... call LLM
  const response = await callLLM(query, context);

  // Store in cache
  await semanticCache.set(query, response);

  return response;
}
```

Cache Hit Rates in Production

| Product Type | Cache Hit Rate | Cost Reduction |
|---|---|---|
| Customer support bot | 35-50% | 35-50% (many similar questions) |
| Documentation search | 25-40% | 25-40% |
| Data analysis assistant | 10-20% | 10-20% (unique queries) |
| Code assistant | 5-15% | 5-15% (highly unique) |

The 0.95 similarity threshold is conservative. Lower it to 0.92 for higher hit rates with slight quality risk. Higher than 0.97 and you're essentially only matching exact duplicates.
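To make the threshold concrete, here is a minimal cosine-similarity check of the kind a vector store runs internally (the tuning numbers match the discussion above; the vectors in any real system come from an embedding model, not hand-written arrays):

```typescript
// Cosine similarity between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// A query is a cache hit only if similarity clears the threshold.
function isCacheHit(score: number, threshold = 0.95): boolean {
  return score >= threshold;
}
```

Whatever vector database you use will compute this for you; the point is that the threshold is a single tunable number, so A/B testing 0.92 vs 0.95 against a quality metric is cheap.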

Cache invalidation: Set a 24-hour TTL. When your knowledge base updates, invalidate the cache. Stale AI responses are worse than uncached responses.


Strategy 3: Prompt Optimization

The cost of an LLM call is proportional to the number of tokens processed. Every unnecessary word in your prompt costs money at scale.

Before and After

Before (387 tokens):

```
You are a helpful AI assistant for our SaaS product. Your job is to help users with their questions about our platform. You should be friendly, professional, and provide accurate answers based on the context provided below. Please make sure to be thorough in your responses and include relevant details. If you don't know the answer, please say so rather than making something up.

Here is the relevant context from our documentation:

{context}

The user's question is: {query}

Please provide a helpful and accurate response.
```

After (142 tokens):

```
Answer the user's question using ONLY the provided context. If the context doesn't contain the answer, say "I don't have that information."

Context: {context}
Question: {query}
```

The after version uses 63% fewer tokens with identical behavior. At 100K queries/day, the 245 tokens trimmed per query save ~735 million tokens/month in prompt overhead... roughly $110/month at GPT-4o-mini input rates, and over $2,200/month for queries routed to Sonnet.
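Spelling out the arithmetic (token counts from the before/after prompts above, rates from the tier table):

```typescript
// Prompt-overhead savings from trimming the system prompt.
const tokensSavedPerQuery = 387 - 142; // 245 tokens per query
const queriesPerDay = 100_000;

const tokensSavedPerMonth = tokensSavedPerQuery * queriesPerDay * 30;
// = 735,000,000 tokens/month of pure prompt overhead

// At GPT-4o-mini's $0.15/1M input rate:
const dollarsSaved = (tokensSavedPerMonth / 1_000_000) * 0.15;
// ≈ $110/month; at Sonnet's $3/1M input rate the same trim is ~$2,205/month
```

The absolute dollars are modest on the cheapest tier; the win compounds when the trimmed prompt is shared across every tier and every query.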

Context Window Optimization

Don't send the entire context to the LLM. Send only the relevant chunks from retrieval.

```typescript
// Trim context to fit budget
function prepareContext(chunks: RetrievedChunk[], maxTokens: number = 3000): string {
  let totalTokens = 0;
  const selectedChunks: string[] = [];

  for (const chunk of chunks) {
    const chunkTokens = estimateTokens(chunk.content);
    if (totalTokens + chunkTokens > maxTokens) break;
    selectedChunks.push(chunk.content);
    totalTokens += chunkTokens;
  }

  return selectedChunks.join("\n\n---\n\n");
}
```
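The `estimateTokens` helper is left undefined in the snippet. A common rough heuristic for English text, and an assumption here rather than anything exact, is ~4 characters per token:

```typescript
// Rough token estimate: ~4 characters per token for English prose.
// This is a heuristic, good enough for budget-trimming decisions.
// For exact counts, use the model's own tokenizer (e.g. tiktoken
// for OpenAI models) instead.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}
```

Overestimating slightly is the safe direction: you trim a little too aggressively rather than blowing past the context budget.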

Output Token Limits

Set max_tokens to the minimum needed for your use case. A yes/no classifier doesn't need 2,000 output tokens. A 50-word summary doesn't need 1,000.

| Use Case | max_tokens | Rationale |
|---|---|---|
| Classification | 10-50 | Single word or short label |
| Data extraction | 100-300 | Structured output, known format |
| Short answer | 200-500 | 1-3 sentence response |
| Detailed answer | 500-1500 | Paragraph-level response |
| Long-form generation | 1500-4000 | Only when necessary |

Strategy 4: Batching and Streaming

Batch Processing

For non-real-time use cases (email generation, report creation, bulk classification), batch requests to take advantage of lower-cost batch APIs.

```typescript
// OpenAI batch API ... 50% cost reduction
async function batchProcess(requests: BatchRequest[]): Promise<BatchResult[]> {
  // Create JSONL file with all requests
  const jsonl = requests
    .map((req, i) =>
      JSON.stringify({
        custom_id: `req-${i}`,
        method: "POST",
        url: "/v1/chat/completions",
        body: {
          model: "gpt-4o-mini",
          messages: req.messages,
          max_tokens: req.maxTokens,
        },
      })
    )
    .join("\n");

  // Upload and create batch
  const file = await openai.files.create({
    file: new Blob([jsonl]),
    purpose: "batch",
  });

  const batch = await openai.batches.create({
    input_file_id: file.id,
    endpoint: "/v1/chat/completions",
    completion_window: "24h",
  });

  // Poll for completion (batches complete within 24h, usually much faster)
  return await waitForBatch(batch.id);
}
```

OpenAI's batch API is 50% cheaper than real-time calls. For any workload that can tolerate minutes-to-hours latency, use batch processing.
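The trade-off is easy to quantify. A sketch, assuming the 50% batch discount above (the request counts and token sizes are illustrative):

```typescript
// Compare real-time vs batch cost for a bulk job. Assumes the 50%
// batch discount quoted above; rate is dollars per 1M tokens.
function bulkJobCost(
  requests: number,
  tokensPerRequest: number,
  ratePerMillion: number,
  batch: boolean
): number {
  const discount = batch ? 0.5 : 1.0;
  return (requests * tokensPerRequest * ratePerMillion * discount) / 1_000_000;
}

// 100K classification requests at ~500 input tokens each,
// at GPT-4o-mini's $0.15/1M input rate:
const realtime = bulkJobCost(100_000, 500, 0.15, false); // $7.50
const batched = bulkJobCost(100_000, 500, 0.15, true);   // $3.75
```

On small jobs the absolute savings are trivial; the discount matters for recurring bulk workloads (nightly report generation, re-classifying a growing corpus) where the same 50% applies every run.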


Monitoring LLM Costs

Track cost per query, per feature, and per customer to understand where your spend goes.

```typescript
// Cost tracking middleware
async function trackLLMCost(
  model: string,
  inputTokens: number,
  outputTokens: number,
  feature: string,
  tenantId: string
) {
  const pricing: Record<string, { input: number; output: number }> = {
    "gpt-4o-mini": { input: 0.15, output: 0.6 },
    "claude-haiku-4-5-20251001": { input: 0.8, output: 4.0 },
    "claude-sonnet-4-6": { input: 3.0, output: 15.0 },
    // Verify pricing at https://docs.anthropic.com/en/docs/about-claude/models
  };

  const rates = pricing[model];
  const cost = (inputTokens * rates.input + outputTokens * rates.output) / 1_000_000;

  await db.query(
    `
    INSERT INTO llm_usage (model, input_tokens, output_tokens, cost, feature, tenant_id, created_at)
    VALUES ($1, $2, $3, $4, $5, $6, NOW())
    `,
    [model, inputTokens, outputTokens, cost, feature, tenantId]
  );
}
```

The dashboard query that saves you money:

```sql
-- Cost per feature per day
SELECT
  feature,
  DATE(created_at) AS day,
  SUM(cost) AS total_cost,
  COUNT(*) AS queries,
  AVG(cost) AS avg_cost_per_query
FROM llm_usage
WHERE created_at > NOW() - INTERVAL '30 days'
GROUP BY feature, DATE(created_at)
ORDER BY total_cost DESC;
```

This reveals which features are consuming the most LLM budget... and where optimization has the highest ROI.


When to Apply This

  • Your monthly LLM spend exceeds $1,000 and growing
  • You're seeing 100+ queries/day on any AI feature
  • LLM costs are becoming a meaningful line item in your cloud bill
  • You need to maintain margins as AI feature usage grows with your customer base

When NOT to Apply This

  • You're in development or early beta with under 100 queries/day... optimize for quality first
  • The AI feature is a premium add-on where customers pay for the compute... pass costs through
  • Your total LLM spend is under $100/month... the engineering time to optimize costs more than the savings

Need to get your AI costs under control without sacrificing quality? I help SaaS teams design cost-efficient AI architectures that scale with their business.


Continue Reading

This post is part of the AI-Assisted Development Guide ... covering AI integration patterns, RAG architecture, and building features users want.


