TL;DR
LLM costs follow a power law: 80% of queries can be handled by cheap, fast models while 20% need expensive, capable models. The tiered approach... route simple queries to GPT-4o-mini ($0.15/1M input tokens) and complex queries to Claude Sonnet ($3/1M input tokens)... reduces average cost per query by 60-70%. Add semantic caching (serving identical or near-identical queries from cache) and you eliminate another 20-30% of LLM calls entirely. The three highest-leverage optimizations in order: (1) model routing based on query complexity, (2) semantic caching with 0.95+ similarity threshold, (3) prompt compression that strips unnecessary tokens. I've helped 4 SaaS companies reduce their monthly LLM spend from $15-50K to $3-10K using these patterns. The quality difference was measurable only on edge cases.
Part of the AI-Assisted Development Guide ... a comprehensive guide to building AI features that deliver real value.
The Cost Problem at Scale
An AI feature that costs $0.03 per query seems cheap. At 10 queries/day during development, that's $9/month. At product launch with 1,000 daily active users averaging 5 queries each, it's $4,500/month. At scale with 50,000 DAU, it's $225,000/month.
| Scale | Daily Queries | Monthly LLM Cost (GPT-4o) | Monthly LLM Cost (Optimized) |
|---|---|---|---|
| Development | 10 | $9 | $9 |
| Early launch | 5,000 | $4,500 | $900 |
| Growth | 50,000 | $45,000 | $7,500 |
| Scale | 500,000 | $450,000 | $60,000 |
The "optimized" column uses the techniques in this post. The difference between $450K and $60K per month is the difference between an AI feature that's a strategic advantage and one that's a financial liability.
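The scaling math is worth making explicit: daily query volume, a 30-day month, and a per-query cost are all you need to project spend. A minimal sketch:

```typescript
// Project monthly LLM spend from daily query volume. A 30-day month is
// assumed for simplicity; costPerQuery is your blended per-query cost.
function monthlyLLMCost(dailyQueries: number, costPerQuery: number): number {
  return dailyQueries * 30 * costPerQuery;
}

// Early launch: 5,000 queries/day at $0.03/query
console.log(monthlyLLMCost(5_000, 0.03)); // 4500
```

Running this projection with your own numbers before launch is the cheapest cost optimization there is.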
Strategy 1: Tiered Model Routing
The insight: most queries don't need the most capable model. A factual lookup, a simple summary, or a yes/no classification can be handled by a model that costs 20x less.
The Model Tier Map
| Tier | Model | Input Cost | Output Cost | Use Case |
|---|---|---|---|---|
| Tier 1 | GPT-4o-mini | $0.15/1M tokens | $0.60/1M tokens | Classification, extraction, simple Q&A |
| Tier 2 | Claude Haiku 4.5 | $0.80/1M tokens | $4.00/1M tokens | Moderate reasoning, summarization |
| Tier 3 | Claude Sonnet 4.6 | $3.00/1M tokens | $15.00/1M tokens | Complex reasoning, nuanced generation |
| Tier 4 | Claude Opus 4.6 | $15.00/1M tokens | $75.00/1M tokens | Expert-level analysis, critical decisions |
Pricing as of March 2026. LLM costs trend downward... verify current rates before building your cost model.
The Router
```typescript
import OpenAI from "openai";

const openai = new OpenAI();

interface ModelConfig {
  model: string;
  maxTokens: number;
  temperature: number;
}

// Query complexity router: classify with the cheapest model, then
// dispatch to the cheapest tier that can handle the query.
async function routeQuery(query: string): Promise<ModelConfig> {
  // Step 1: Classify query complexity with the cheapest model
  const classification = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: `Classify the query complexity as one of:
- SIMPLE: factual lookup, yes/no, data extraction
- MODERATE: summarization, comparison, multi-step reasoning
- COMPLEX: analysis, recommendations, nuanced judgment
Respond with ONLY the classification word.`,
      },
      { role: "user", content: query },
    ],
    max_tokens: 10,
    temperature: 0,
  });

  // Step 2: Map the label to a model tier; unknown labels fall back
  // to the cheapest tier.
  const complexity = classification.choices[0].message.content?.trim();
  switch (complexity) {
    case "SIMPLE":
      return { model: "gpt-4o-mini", maxTokens: 500, temperature: 0.1 };
    case "MODERATE":
      return { model: "claude-haiku-4-5-20251001", maxTokens: 1000, temperature: 0.3 };
    case "COMPLEX":
      return { model: "claude-sonnet-4-6", maxTokens: 2000, temperature: 0.5 };
    default:
      return { model: "gpt-4o-mini", maxTokens: 500, temperature: 0.1 };
  }
}
```
The meta-cost of routing: The classification call itself costs ~$0.0001 per query (GPT-4o-mini with a 10-token response). This is negligible compared to the savings from routing 80% of queries to a cheaper model.
Distribution in Production
Across 4 SaaS products I've worked with, the query complexity distribution is remarkably consistent:
| Complexity | % of Queries | Model | Avg Cost/Query |
|---|---|---|---|
| Simple | 55-65% | GPT-4o-mini | $0.0003 |
| Moderate | 25-30% | Haiku | $0.002 |
| Complex | 10-15% | Sonnet | $0.01 |
Blended average: $0.002/query vs $0.03/query (all queries to Sonnet). That's a 93% cost reduction.
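The blended average is just a weighted sum; a quick sanity check using midpoints of the distribution ranges above (the exact shares are assumptions for illustration):

```typescript
// Blended cost per query: share-weighted average across tiers
const tiers = [
  { share: 0.6, costPerQuery: 0.0003 },  // Simple → GPT-4o-mini
  { share: 0.275, costPerQuery: 0.002 }, // Moderate → Haiku
  { share: 0.125, costPerQuery: 0.01 },  // Complex → Sonnet
];

const blended = tiers.reduce((sum, t) => sum + t.share * t.costPerQuery, 0);
console.log(blended.toFixed(4)); // "0.0020"
```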
Strategy 2: Semantic Caching
If the same question gets asked 50 times, you don't need to call the LLM 50 times. Semantic caching stores query-response pairs and serves cached responses for semantically similar queries.
```typescript
// Semantic cache using vector similarity. `embed` and `vectorDb` are
// stand-ins for your embedding call and vector store client
// (Pinecone, pgvector, etc.).
interface CacheResult {
  response: string;
  similarity: number;
  cached: boolean;
}

class SemanticCache {
  private similarityThreshold = 0.95;

  async get(query: string): Promise<CacheResult | null> {
    const queryEmbedding = await embed(query);
    const results = await vectorDb.search({
      vector: queryEmbedding,
      topK: 1,
      filter: {
        createdAt: { $gt: Date.now() - 24 * 60 * 60 * 1000 }, // 24h TTL
      },
    });

    if (results.length > 0 && results[0].score >= this.similarityThreshold) {
      return {
        response: results[0].metadata.response,
        similarity: results[0].score,
        cached: true,
      };
    }
    return null;
  }

  async set(query: string, response: string): Promise<void> {
    const queryEmbedding = await embed(query);
    await vectorDb.upsert({
      id: generateId(),
      vector: queryEmbedding,
      metadata: { query, response, createdAt: Date.now() },
    });
  }
}

// Usage in the query pipeline
async function handleQuery(query: string, context: string): Promise<string> {
  // Check cache first
  const cached = await semanticCache.get(query);
  if (cached) {
    logger.info("Cache hit", { similarity: cached.similarity });
    return cached.response;
  }

  // Cache miss: call the LLM, then store the result for next time
  const response = await callLLM(query, context);
  await semanticCache.set(query, response);
  return response;
}
```
Cache Hit Rates in Production
| Product Type | Cache Hit Rate | Cost Reduction |
|---|---|---|
| Customer support bot | 35-50% | 35-50% (many similar questions) |
| Documentation search | 25-40% | 25-40% |
| Data analysis assistant | 10-20% | 10-20% (unique queries) |
| Code assistant | 5-15% | 5-15% (highly unique) |
The 0.95 similarity threshold is conservative. Lower it to 0.92 for higher hit rates with slight quality risk. Higher than 0.97 and you're essentially only matching exact duplicates.
Cache invalidation: Set a 24-hour TTL. When your knowledge base updates, invalidate the cache. Stale AI responses are worse than uncached responses.
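The freshness rule is easy to isolate and unit-test on its own. A minimal sketch, where `knowledgeBaseUpdatedAt` is a hypothetical hook for invalidating everything written before the last knowledge base update:

```typescript
const CACHE_TTL_MS = 24 * 60 * 60 * 1000; // 24h TTL, as above

// A cached entry is servable only if it is younger than the TTL and was
// written after the most recent knowledge base update.
function isFresh(
  createdAt: number,
  now: number,
  knowledgeBaseUpdatedAt: number = 0
): boolean {
  return now - createdAt < CACHE_TTL_MS && createdAt > knowledgeBaseUpdatedAt;
}
```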
Strategy 3: Prompt Optimization
The cost of an LLM call is proportional to the number of tokens processed. Every unnecessary word in your prompt costs money at scale.
Before and After
Before (387 tokens):

```text
You are a helpful AI assistant for our SaaS product. Your job is to help users
with their questions about our platform. You should be friendly, professional,
and provide accurate answers based on the context provided below. Please make
sure to be thorough in your responses and include relevant details. If you
don't know the answer, please say so rather than making something up.

Here is the relevant context from our documentation:
{context}

The user's question is:
{query}

Please provide a helpful and accurate response.
```

After (142 tokens):

```text
Answer the user's question using ONLY the provided context. If the context
doesn't contain the answer, say "I don't have that information."

Context:
{context}

Question: {query}
```

The after version uses 63% fewer tokens with identical behavior. At 100K queries/day, the 245-token saving per query works out to ~735 million tokens/month of prompt overhead... roughly $110/month at GPT-4o-mini input rates, or about $2,200/month if those queries run on Sonnet.
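The arithmetic behind that estimate, made explicit (rates per the pricing table above):

```typescript
// Prompt-overhead savings from trimming the template (387 → 142 tokens)
const savedTokensPerQuery = 387 - 142;                          // 245
const savedTokensPerMonth = savedTokensPerQuery * 100_000 * 30; // 735M

// Dollar savings at a given input rate (USD per 1M tokens)
function dollarSavings(tokens: number, ratePerMillion: number): number {
  return (tokens / 1_000_000) * ratePerMillion;
}

console.log(dollarSavings(savedTokensPerMonth, 0.15)); // 110.25 (GPT-4o-mini)
console.log(dollarSavings(savedTokensPerMonth, 3.0));  // 2205 (Sonnet)
```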
Context Window Optimization
Don't send the entire context to the LLM. Send only the relevant chunks from retrieval.
```typescript
interface RetrievedChunk {
  content: string;
  score: number;
}

// Rough heuristic: ~4 characters per token for English text. Swap in a
// real tokenizer (e.g. tiktoken) when accuracy matters.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Trim retrieved chunks (assumed sorted by relevance) to fit a token budget
function prepareContext(chunks: RetrievedChunk[], maxTokens: number = 3000): string {
  let totalTokens = 0;
  const selectedChunks: string[] = [];

  for (const chunk of chunks) {
    const chunkTokens = estimateTokens(chunk.content);
    if (totalTokens + chunkTokens > maxTokens) break;
    selectedChunks.push(chunk.content);
    totalTokens += chunkTokens;
  }

  return selectedChunks.join("\n\n---\n\n");
}
```
Output Token Limits
Set max_tokens to the minimum needed for your use case. A yes/no classifier doesn't need 2,000 output tokens. A 50-word summary doesn't need 1,000.
| Use Case | max_tokens | Rationale |
|---|---|---|
| Classification | 10-50 | Single word or short label |
| Data extraction | 100-300 | Structured output, known format |
| Short answer | 200-500 | 1-3 sentence response |
| Detailed answer | 500-1500 | Paragraph-level response |
| Long-form generation | 1500-4000 | Only when necessary |
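The table maps naturally to a lookup with a safe default; a sketch (the use-case labels are hypothetical):

```typescript
// Output budgets from the table above, keyed by hypothetical use-case labels
const OUTPUT_BUDGETS: Record<string, number> = {
  classification: 50,
  extraction: 300,
  short_answer: 500,
  detailed_answer: 1500,
  long_form: 4000,
};

// Unknown use cases fall back to the short-answer budget
function maxTokensFor(useCase: string): number {
  return OUTPUT_BUDGETS[useCase] ?? 500;
}
```

Centralizing budgets like this also makes them easy to audit when a feature's output costs spike.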
Strategy 4: Batching and Streaming
Batch Processing
For non-real-time use cases (email generation, report creation, bulk classification), batch requests to take advantage of lower-cost batch APIs.
```typescript
import OpenAI, { toFile } from "openai";

const openai = new OpenAI();

interface BatchRequest {
  messages: { role: "system" | "user" | "assistant"; content: string }[];
  maxTokens: number;
}

// OpenAI batch API: 50% cheaper than real-time calls for workloads
// that can tolerate up to 24h of latency.
async function batchProcess(requests: BatchRequest[]): Promise<BatchResult[]> {
  // Create a JSONL payload: one request per line, each with a custom_id
  // so results can be matched back to inputs
  const jsonl = requests
    .map((req, i) =>
      JSON.stringify({
        custom_id: `req-${i}`,
        method: "POST",
        url: "/v1/chat/completions",
        body: {
          model: "gpt-4o-mini",
          messages: req.messages,
          max_tokens: req.maxTokens,
        },
      })
    )
    .join("\n");

  // Upload the JSONL file, then create the batch job
  const file = await openai.files.create({
    file: await toFile(Buffer.from(jsonl), "batch.jsonl"),
    purpose: "batch",
  });
  const batch = await openai.batches.create({
    input_file_id: file.id,
    endpoint: "/v1/chat/completions",
    completion_window: "24h",
  });

  // Poll for completion (batches complete within 24h, usually much faster).
  // `waitForBatch` and `BatchResult` are left to the reader.
  return await waitForBatch(batch.id);
}
```
OpenAI's batch API is 50% cheaper than real-time calls. For any workload that can tolerate minutes-to-hours latency, use batch processing.
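The batch discount is simple to quantify; a sketch under the 50% assumption above:

```typescript
// Dollar savings from moving a workload to the batch API, assuming the
// 50% discount described above applies to the whole workload
function batchSavings(queries: number, realtimeCostPerQuery: number): number {
  const realtime = queries * realtimeCostPerQuery;
  return realtime - realtime * 0.5;
}

console.log(batchSavings(100_000, 0.002)); // 100
```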
Monitoring LLM Costs
Track cost per query, per feature, and per customer to understand where your spend goes.
```typescript
// Cost tracking middleware. Rates are USD per 1M tokens; `db` is your
// application's database client.
const PRICING: Record<string, { input: number; output: number }> = {
  "gpt-4o-mini": { input: 0.15, output: 0.6 },
  "claude-haiku-4-5-20251001": { input: 0.8, output: 4.0 },
  "claude-sonnet-4-6": { input: 3.0, output: 15.0 },
  // Verify pricing at https://docs.anthropic.com/en/docs/about-claude/models
};

async function trackLLMCost(
  model: string,
  inputTokens: number,
  outputTokens: number,
  feature: string,
  tenantId: string
) {
  const rates = PRICING[model];
  if (!rates) {
    throw new Error(`No pricing configured for model: ${model}`);
  }
  const cost = (inputTokens * rates.input + outputTokens * rates.output) / 1_000_000;

  await db.query(
    `INSERT INTO llm_usage (model, input_tokens, output_tokens, cost, feature, tenant_id, created_at)
     VALUES ($1, $2, $3, $4, $5, $6, NOW())`,
    [model, inputTokens, outputTokens, cost, feature, tenantId]
  );
}
```
The dashboard query that saves you money:
```sql
-- Cost per feature per day
SELECT
  feature,
  DATE(created_at) AS day,
  SUM(cost) AS total_cost,
  COUNT(*) AS queries,
  AVG(cost) AS avg_cost_per_query
FROM llm_usage
WHERE created_at > NOW() - INTERVAL '30 days'
GROUP BY feature, DATE(created_at)
ORDER BY total_cost DESC;
```
This reveals which features are consuming the most LLM budget... and where optimization has the highest ROI.
When to Apply This
- Your monthly LLM spend exceeds $1,000 and growing
- You're seeing 100+ queries/day on any AI feature
- LLM costs are becoming a meaningful line item in your cloud bill
- You need to maintain margins as AI feature usage grows with your customer base
When NOT to Apply This
- You're in development or early beta with under 100 queries/day... optimize for quality first
- The AI feature is a premium add-on where customers pay for the compute... pass costs through
- Your total LLM spend is under $100/month... the engineering time to optimize costs more than the savings
Need to get your AI costs under control without sacrificing quality? I help SaaS teams design cost-efficient AI architectures that scale with their business.
- Technical Advisor for Startups ... AI cost and architecture strategy
- Next.js Development for SaaS ... AI features with built-in cost controls
- Technical Due Diligence ... AI infrastructure cost assessment
Continue Reading
This post is part of the AI-Assisted Development Guide ... covering AI integration patterns, RAG architecture, and building features users want.
More in This Series
- AI Cost Optimization ... Foundational cost management patterns
- RAG Architecture for SaaS Products ... Optimizing retrieval costs
- LLM Integration Architecture ... Patterns for efficient LLM integration
- Building AI Features Users Want ... Investing AI budget where it matters
Related Guides
- Vector Databases: When to Build vs Buy ... Infrastructure cost decisions
- Caching Strategies That Actually Work ... Caching patterns that apply to LLM responses
