TL;DR
LLM costs follow a power law: 80% of queries can be handled by cheap, fast models while 20% need expensive, capable models. The tiered approach... route simple queries to GPT-4o-mini ($0.15/1M input tokens) and complex queries to Claude Sonnet ($3/1M input tokens)... reduces average cost per query by 60-70%. Add semantic caching (serving identical or near-identical queries from cache) and you eliminate another 20-30% of LLM calls entirely. The three highest-leverage optimizations in order: (1) model routing based on query complexity, (2) semantic caching with 0.95+ similarity threshold, (3) prompt compression that strips unnecessary tokens. I've helped 4 SaaS companies reduce their monthly LLM spend from $15-50K to $3-10K using these patterns. The quality difference was measurable only on edge cases.
Part of the AI-Assisted Development Guide ... a comprehensive guide to building AI features that deliver real value.
The Cost Problem at Scale
An AI feature that costs $0.03 per query seems cheap. At 10 queries/day during development, that's $9/month. At product launch with 1,000 daily active users averaging 5 queries each, it's $4,500/month. At scale with 50,000 DAU, it's $225,000/month.
| Scale | Daily Queries | Monthly LLM Cost (GPT-4o) | Monthly LLM Cost (Optimized) |
|---|---|---|---|
| Development | 10 | $9 | $9 |
| Early launch | 5,000 | $4,500 | $900 |
| Growth | 50,000 | $45,000 | $7,500 |
| Scale | 500,000 | $450,000 | $60,000 |
The "optimized" column uses the techniques in this post. The difference between $450K and $60K per month is the difference between an AI feature that's a strategic advantage and one that's a financial liability.
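The scaling math is worth making explicit: daily query volume, a 30-day month, and a per-query cost are all you need to project spend. A minimal sketch:

```typescript
// Project monthly LLM spend from daily query volume. A 30-day month is
// assumed for simplicity; costPerQuery is your blended per-query cost.
function monthlyLLMCost(dailyQueries: number, costPerQuery: number): number {
  return dailyQueries * 30 * costPerQuery;
}

// Early launch: 5,000 queries/day at $0.03/query
console.log(monthlyLLMCost(5_000, 0.03)); // 4500
```

Running this projection with your own numbers before launch is the cheapest cost optimization there is.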
Strategy 1: Tiered Model Routing
The insight: most queries don't need the most capable model. A factual lookup, a simple summary, or a yes/no classification can be handled by a model that costs 20x less.
The Model Tier Map
| Tier | Model | Input Cost | Output Cost | Use Case |
|---|---|---|---|---|
| Tier 1 | GPT-4o-mini | $0.15/1M tokens | $0.60/1M tokens | Classification, extraction, simple Q&A |
| Tier 2 | Claude Haiku 4.5 | $0.80/1M tokens | $4.00/1M tokens | Moderate reasoning, summarization |
| Tier 3 | Claude Sonnet 4.6 | $3.00/1M tokens | $15.00/1M tokens | Complex reasoning, nuanced generation |
| Tier 4 | Claude Opus 4.6 | $15.00/1M tokens | $75.00/1M tokens | Expert-level analysis, critical decisions |
Pricing as of March 2026. LLM costs trend downward... verify current rates before building your cost model.
The Router
```typescript
import OpenAI from "openai";

const openai = new OpenAI();

interface ModelConfig {
  model: string;
  maxTokens: number;
  temperature: number;
}

// Query complexity router: classify with the cheapest model, then
// dispatch to the cheapest tier that can handle the query.
async function routeQuery(query: string): Promise<ModelConfig> {
  // Step 1: Classify query complexity with the cheapest model
  const classification = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: `Classify the query complexity as one of:
- SIMPLE: factual lookup, yes/no, data extraction
- MODERATE: summarization, comparison, multi-step reasoning
- COMPLEX: analysis, recommendations, nuanced judgment
Respond with ONLY the classification word.`,
      },
      { role: "user", content: query },
    ],
    max_tokens: 10,
    temperature: 0,
  });

  // Step 2: Map the label to a model tier; unknown labels fall back
  // to the cheapest tier.
  const complexity = classification.choices[0].message.content?.trim();
  switch (complexity) {
    case "SIMPLE":
      return { model: "gpt-4o-mini", maxTokens: 500, temperature: 0.1 };
    case "MODERATE":
      return { model: "claude-haiku-4-5-20251001", maxTokens: 1000, temperature: 0.3 };
    case "COMPLEX":
      return { model: "claude-sonnet-4-6", maxTokens: 2000, temperature: 0.5 };
    default:
      return { model: "gpt-4o-mini", maxTokens: 500, temperature: 0.1 };
  }
}
```
The meta-cost of routing: The classification call itself costs ~$0.0001 per query (GPT-4o-mini with a 10-token response). This is negligible compared to the savings from routing 80% of queries to a cheaper model.
Distribution in Production
Across 4 SaaS products I've worked with, the query complexity distribution is remarkably consistent:
| Complexity | % of Queries | Model | Avg Cost/Query |
|---|---|---|---|
| Simple | 55-65% | GPT-4o-mini | $0.0003 |
| Moderate | 25-30% | Haiku | $0.002 |
| Complex | 10-15% | Sonnet | $0.01 |
Blended average: $0.002/query vs $0.03/query (all queries to Sonnet). That's a 93% cost reduction.
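The blended average is just a weighted sum; a quick sanity check using midpoints of the distribution ranges above (the exact shares are assumptions for illustration):

```typescript
// Blended cost per query: share-weighted average across tiers
const tiers = [
  { share: 0.6, costPerQuery: 0.0003 },  // Simple → GPT-4o-mini
  { share: 0.275, costPerQuery: 0.002 }, // Moderate → Haiku
  { share: 0.125, costPerQuery: 0.01 },  // Complex → Sonnet
];

const blended = tiers.reduce((sum, t) => sum + t.share * t.costPerQuery, 0);
console.log(blended.toFixed(4)); // "0.0020"
```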
Strategy 2: Semantic Caching
If the same question gets asked 50 times, you don't need to call the LLM 50 times. Semantic caching stores query-response pairs and serves cached responses for semantically similar queries.
```typescript
// Semantic cache using vector similarity. `embed` and `vectorDb` are
// stand-ins for your embedding call and vector store client
// (Pinecone, pgvector, etc.).
interface CacheResult {
  response: string;
  similarity: number;
  cached: boolean;
}

class SemanticCache {
  private similarityThreshold = 0.95;

  async get(query: string): Promise<CacheResult | null> {
    const queryEmbedding = await embed(query);
    const results = await vectorDb.search({
      vector: queryEmbedding,
      topK: 1,
      filter: {
        createdAt: { $gt: Date.now() - 24 * 60 * 60 * 1000 }, // 24h TTL
      },
    });

    if (results.length > 0 && results[0].score >= this.similarityThreshold) {
      return {
        response: results[0].metadata.response,
        similarity: results[0].score,
        cached: true,
      };
    }
    return null;
  }

  async set(query: string, response: string): Promise<void> {
    const queryEmbedding = await embed(query);
    await vectorDb.upsert({
      id: generateId(),
      vector: queryEmbedding,
      metadata: { query, response, createdAt: Date.now() },
    });
  }
}

// Usage in the query pipeline
async function handleQuery(query: string, context: string): Promise<string> {
  // Check cache first
  const cached = await semanticCache.get(query);
  if (cached) {
    logger.info("Cache hit", { similarity: cached.similarity });
    return cached.response;
  }

  // Cache miss: call the LLM, then store the result for next time
  const response = await callLLM(query, context);
  await semanticCache.set(query, response);
  return response;
}
```
Cache Hit Rates in Production
| Product Type | Cache Hit Rate | Cost Reduction |
|---|---|---|
| Customer support bot | 35-50% | 35-50% (many similar questions) |
| Documentation search | 25-40% | 25-40% |
| Data analysis assistant | 10-20% | 10-20% (unique queries) |
| Code assistant | 5-15% | 5-15% (highly unique) |
The 0.95 similarity threshold is conservative. Lower it to 0.92 for higher hit rates with slight quality risk. Higher than 0.97 and you're essentially only matching exact duplicates.
Cache invalidation: Set a 24-hour TTL. When your knowledge base updates, invalidate the cache. Stale AI responses are worse than uncached responses.
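The freshness rule is easy to isolate and unit-test on its own. A minimal sketch, where `knowledgeBaseUpdatedAt` is a hypothetical hook for invalidating everything written before the last knowledge base update:

```typescript
const CACHE_TTL_MS = 24 * 60 * 60 * 1000; // 24h TTL, as above

// A cached entry is servable only if it is younger than the TTL and was
// written after the most recent knowledge base update.
function isFresh(
  createdAt: number,
  now: number,
  knowledgeBaseUpdatedAt: number = 0
): boolean {
  return now - createdAt < CACHE_TTL_MS && createdAt > knowledgeBaseUpdatedAt;
}
```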
Strategy 3: Prompt Optimization
The cost of an LLM call is proportional to the number of tokens processed. Every unnecessary word in your prompt costs money at scale.
Before and After
Before (387 tokens):

```text
You are a helpful AI assistant for our SaaS product. Your job is to help users
with their questions about our platform. You should be friendly, professional,
and provide accurate answers based on the context provided below. Please make
sure to be thorough in your responses and include relevant details. If you
don't know the answer, please say so rather than making something up.

Here is the relevant context from our documentation:
{context}

The user's question is:
{query}

Please provide a helpful and accurate response.
```

After (142 tokens):

```text
Answer the user's question using ONLY the provided context. If the context
doesn't contain the answer, say "I don't have that information."

Context:
{context}

Question: {query}
```

The after version uses 63% fewer tokens with identical behavior. At 100K queries/day, the 245-token saving per query works out to ~735 million tokens/month of prompt overhead... roughly $110/month at GPT-4o-mini input rates, or about $2,200/month if those queries run on Sonnet.
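The arithmetic behind that estimate, made explicit (rates per the pricing table above):

```typescript
// Prompt-overhead savings from trimming the template (387 → 142 tokens)
const savedTokensPerQuery = 387 - 142;                          // 245
const savedTokensPerMonth = savedTokensPerQuery * 100_000 * 30; // 735M

// Dollar savings at a given input rate (USD per 1M tokens)
function dollarSavings(tokens: number, ratePerMillion: number): number {
  return (tokens / 1_000_000) * ratePerMillion;
}

console.log(dollarSavings(savedTokensPerMonth, 0.15)); // 110.25 (GPT-4o-mini)
console.log(dollarSavings(savedTokensPerMonth, 3.0));  // 2205 (Sonnet)
```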
Context Window Optimization
Don't send the entire context to the LLM. Send only the relevant chunks from retrieval.
```typescript
interface RetrievedChunk {
  content: string;
  score: number;
}

// Rough heuristic: ~4 characters per token for English text. Swap in a
// real tokenizer (e.g. tiktoken) when accuracy matters.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Trim retrieved chunks (assumed sorted by relevance) to fit a token budget
function prepareContext(chunks: RetrievedChunk[], maxTokens: number = 3000): string {
  let totalTokens = 0;
  const selectedChunks: string[] = [];

  for (const chunk of chunks) {
    const chunkTokens = estimateTokens(chunk.content);
    if (totalTokens + chunkTokens > maxTokens) break;
    selectedChunks.push(chunk.content);
    totalTokens += chunkTokens;
  }

  return selectedChunks.join("\n\n---\n\n");
}
```
Output Token Limits
Set max_tokens to the minimum needed for your use case. A yes/no classifier doesn't need 2,000 output tokens. A 50-word summary doesn't need 1,000.
| Use Case | max_tokens | Rationale |
|---|---|---|
| Classification | 10-50 | Single word or short label |
| Data extraction | 100-300 | Structured output, known format |
| Short answer | 200-500 | 1-3 sentence response |
| Detailed answer | 500-1500 | Paragraph-level response |
| Long-form generation | 1500-4000 | Only when necessary |
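The table maps naturally to a lookup with a safe default; a sketch (the use-case labels are hypothetical):

```typescript
// Output budgets from the table above, keyed by hypothetical use-case labels
const OUTPUT_BUDGETS: Record<string, number> = {
  classification: 50,
  extraction: 300,
  short_answer: 500,
  detailed_answer: 1500,
  long_form: 4000,
};

// Unknown use cases fall back to the short-answer budget
function maxTokensFor(useCase: string): number {
  return OUTPUT_BUDGETS[useCase] ?? 500;
}
```

Centralizing budgets like this also makes them easy to audit when a feature's output costs spike.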
Strategy 4: Batching and Streaming
Batch Processing
For non-real-time use cases (email generation, report creation, bulk classification), batch requests to take advantage of lower-cost batch APIs.
```typescript
import OpenAI, { toFile } from "openai";

const openai = new OpenAI();

interface BatchRequest {
  messages: { role: "system" | "user" | "assistant"; content: string }[];
  maxTokens: number;
}

// OpenAI batch API: 50% cheaper than real-time calls for workloads
// that can tolerate up to 24h of latency.
async function batchProcess(requests: BatchRequest[]): Promise<BatchResult[]> {
  // Create a JSONL payload: one request per line, each with a custom_id
  // so results can be matched back to inputs
  const jsonl = requests
    .map((req, i) =>
      JSON.stringify({
        custom_id: `req-${i}`,
        method: "POST",
        url: "/v1/chat/completions",
        body: {
          model: "gpt-4o-mini",
          messages: req.messages,
          max_tokens: req.maxTokens,
        },
      })
    )
    .join("\n");

  // Upload the JSONL file, then create the batch job
  const file = await openai.files.create({
    file: await toFile(Buffer.from(jsonl), "batch.jsonl"),
    purpose: "batch",
  });
  const batch = await openai.batches.create({
    input_file_id: file.id,
    endpoint: "/v1/chat/completions",
    completion_window: "24h",
  });

  // Poll for completion (batches complete within 24h, usually much faster).
  // `waitForBatch` and `BatchResult` are left to the reader.
  return await waitForBatch(batch.id);
}
```
OpenAI's batch API is 50% cheaper than real-time calls. For any workload that can tolerate minutes-to-hours latency, use batch processing.
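The batch discount is simple to quantify; a sketch under the 50% assumption above:

```typescript
// Dollar savings from moving a workload to the batch API, assuming the
// 50% discount described above applies to the whole workload
function batchSavings(queries: number, realtimeCostPerQuery: number): number {
  const realtime = queries * realtimeCostPerQuery;
  return realtime - realtime * 0.5;
}

console.log(batchSavings(100_000, 0.002)); // 100
```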
Monitoring LLM Costs
Track cost per query, per feature, and per customer to understand where your spend goes.
```typescript
// Cost tracking middleware. Rates are USD per 1M tokens; `db` is your
// application's database client.
const PRICING: Record<string, { input: number; output: number }> = {
  "gpt-4o-mini": { input: 0.15, output: 0.6 },
  "claude-haiku-4-5-20251001": { input: 0.8, output: 4.0 },
  "claude-sonnet-4-6": { input: 3.0, output: 15.0 },
  // Verify pricing at https://docs.anthropic.com/en/docs/about-claude/models
};

async function trackLLMCost(
  model: string,
  inputTokens: number,
  outputTokens: number,
  feature: string,
  tenantId: string
) {
  const rates = PRICING[model];
  if (!rates) {
    throw new Error(`No pricing configured for model: ${model}`);
  }
  const cost = (inputTokens * rates.input + outputTokens * rates.output) / 1_000_000;

  await db.query(
    `INSERT INTO llm_usage (model, input_tokens, output_tokens, cost, feature, tenant_id, created_at)
     VALUES ($1, $2, $3, $4, $5, $6, NOW())`,
    [model, inputTokens, outputTokens, cost, feature, tenantId]
  );
}
```
The dashboard query that saves you money:
```sql
-- Cost per feature per day
SELECT
  feature,
  DATE(created_at) AS day,
  SUM(cost) AS total_cost,
  COUNT(*) AS queries,
  AVG(cost) AS avg_cost_per_query
FROM llm_usage
WHERE created_at > NOW() - INTERVAL '30 days'
GROUP BY feature, DATE(created_at)
ORDER BY total_cost DESC;
```
This reveals which features are consuming the most LLM budget... and where optimization has the highest ROI.
When to Apply This
- Your monthly LLM spend exceeds $1,000 and growing
- You're seeing 100+ queries/day on any AI feature
- LLM costs are becoming a meaningful line item in your cloud bill
- You need to maintain margins as AI feature usage grows with your customer base
When NOT to Apply This
- You're in development or early beta with under 100 queries/day... optimize for quality first
- The AI feature is a premium add-on where customers pay for the compute... pass costs through
- Your total LLM spend is under $100/month... the engineering time to optimize costs more than the savings
Need to get your AI costs under control without sacrificing quality? I help SaaS teams design cost-efficient AI architectures that scale with their business.
- Technical Advisor for Startups ... AI cost and architecture strategy
- Next.js Development for SaaS ... AI features with built-in cost controls
- Technical Due Diligence ... AI infrastructure cost assessment
Continue Reading
This post is part of the AI-Assisted Development Guide ... covering AI integration patterns, RAG architecture, and building features users want.
More in This Series
- AI Cost Optimization ... Foundational cost management patterns
- RAG Architecture for SaaS Products ... Optimizing retrieval costs
- LLM Integration Architecture ... Patterns for efficient LLM integration
- Building AI Features Users Want ... Investing AI budget where it matters
Related Guides
- Vector Databases: When to Build vs Buy ... Infrastructure cost decisions
- Caching Strategies That Actually Work ... Caching patterns that apply to LLM responses
