
The Architect's Brief — Issue #6

RAG Architecture for Your SaaS

Subject: Your RAG pipeline is retrieval-broken, not AI-broken

Hey there,

A Series B SaaS asked me to review their AI-powered help center. Demo was impressive ... contextual answers, natural language queries, looked polished. Then I ran 50 real user questions through it. 19 returned irrelevant or wrong answers. Their retrieval accuracy was 62%.

The LLM was fine. The retrieval was broken.


This Week's Decision

The Situation: You've shipped a RAG-powered feature ... help center, document search, knowledge base Q&A. It works in demos, but users complain about irrelevant answers. You haven't measured retrieval accuracy because you're not sure how.

The Insight: Most RAG implementations fail at retrieval, not generation. Naive chunking (split every 512 tokens) with cosine similarity search gives you retrieval accuracy in the low 60s ... exactly what I measured above. That means roughly 4 in 10 queries feed the LLM wrong context, and the LLM confidently generates wrong answers from that wrong context.

Three changes push accuracy from 62% to 85-92%:

1. Semantic chunking instead of fixed-size. Split on topic boundaries, not token counts. A paragraph about billing should stay together, not get split mid-sentence into a chunk about authentication.
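A minimal sketch of the boundary-detection idea, assuming you already have one embedding per sentence from whatever embedding model you use (the threshold of 0.75 is illustrative, not a recommendation ... tune it on your own docs):

```python
import numpy as np

def semantic_chunks(sentences, embeddings, threshold=0.75):
    """Group consecutive sentences into chunks, starting a new chunk
    wherever adjacent sentences fall below a similarity threshold."""
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        a, b = embeddings[i - 1], embeddings[i]
        sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        if sim < threshold:  # topic boundary: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```

The point is that chunk edges land on topic shifts, so a billing paragraph never bleeds into an authentication chunk.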

2. Hybrid search: vector + keyword. Pure vector search misses exact matches. When a user searches "error code E-4012," cosine similarity on embeddings might return content about error handling in general. BM25 keyword search catches exact terms. Combine both with reciprocal rank fusion.

# Hybrid search with reciprocal rank fusion (RRF)
def hybrid_search(query, k=10):
    vector_results = vector_store.search(embed(query), k=k)
    keyword_results = bm25_index.search(query, k=k)
    # Each list contributes 1 / (60 + rank) per document; the
    # constant 60 keeps either list's top ranks from dominating
    scores = {}
    for rank, doc in enumerate(vector_results):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (60 + rank)
    for rank, doc in enumerate(keyword_results):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (60 + rank)
    return sorted(scores.items(), key=lambda x: -x[1])[:k]

3. Reranking with a cross-encoder. Bi-encoder embeddings are fast but approximate. After retrieving 20-50 candidates, rerank with a cross-encoder (like Cohere Rerank or a local model) that scores query-document pairs directly. Adds 50-100ms latency, gains 10-15% accuracy.
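Structurally, reranking is just "score each query-document pair, sort, keep the top n." Here's a sketch where score_pair is a stand-in for whichever cross-encoder you pick ... Cohere's Rerank API or a local sentence-transformers cross-encoder are the usual choices, but the wiring is the same either way:

```python
def rerank(query, candidates, score_pair, top_n=5):
    """Rerank retrieval candidates with a cross-encoder.

    score_pair(query, doc) -> float is the cross-encoder: unlike a
    bi-encoder, it sees the query and document together, so it can
    judge relevance directly instead of comparing embeddings."""
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]
```

Retrieve broadly (20-50 candidates), then let the slower-but-sharper model pick the handful that actually go into the prompt.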

Two more optimizations that matter for production SaaS:

  • Semantic caching. Embed incoming queries and serve a cached response when a new query is close enough to one you've already answered ... exact-match hashing alone misses rephrasings. I've seen roughly 40% hit rates on typical help center traffic. Cuts LLM costs and latency simultaneously.
  • Metadata filtering for multi-tenant. Always filter by tenant ID before similarity search. Without it, Tenant A's confidential docs can leak into Tenant B's results through vector proximity.
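The caching bullet, sketched. This assumes you already have a query embedder; the cache just compares the new query's embedding against stored ones and returns the cached answer above a similarity cutoff (0.95 here is illustrative ... too low and users get someone else's question answered):

```python
import numpy as np

class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (query_embedding, response)

    def get(self, query_emb):
        """Return a cached response if any stored query is close enough."""
        for emb, response in self.entries:
            sim = np.dot(emb, query_emb) / (
                np.linalg.norm(emb) * np.linalg.norm(query_emb))
            if sim >= self.threshold:
                return response
        return None  # cache miss: call the LLM, then put()

    def put(self, query_emb, response):
        self.entries.append((np.asarray(query_emb, dtype=float), response))
```

A linear scan is fine for a few thousand entries; past that, store the cache entries in your vector store like everything else. And in a multi-tenant app, the cache needs the same tenant-ID scoping as retrieval.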

When to Apply This:

  • SaaS adding AI-powered search or Q&A over product documentation
  • Multi-tenant applications where data isolation is non-negotiable
  • Any RAG system where you haven't measured retrieval accuracy (you should)
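"Measure retrieval accuracy" can start as simply as recall@k over a labeled query set: take 50 real user questions, hand-label which doc should answer each, and check how often it lands in the top k. A sketch ... search_fn is a placeholder for your own retrieval pipeline returning ranked doc IDs:

```python
def recall_at_k(labeled_queries, search_fn, k=5):
    """Fraction of queries whose known-relevant doc appears in the
    top-k results. labeled_queries: list of (query, relevant_doc_id);
    search_fn(query) returns a ranked list of doc ids."""
    hits = 0
    for query, relevant_id in labeled_queries:
        if relevant_id in search_fn(query)[:k]:
            hits += 1
    return hits / len(labeled_queries)
```

Run it before and after each of the three changes above so you know which one actually moved the number.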

Worth Your Time

  1. Anthropic: Contextual Retrieval ... Anthropic's approach to adding context to chunks before embedding. Prepending a short description of where each chunk fits in the document cuts retrieval failures by up to 49% when combined with BM25. Simple technique, significant impact.

  2. Pinecone: Chunking Strategies ... Comprehensive comparison of fixed-size, recursive, semantic, and document-aware chunking. The benchmarks on retrieval quality by strategy are worth bookmarking.

  3. LangChain: RAG Evaluation ... If you can't measure retrieval accuracy, you can't improve it. RAGAS framework gives you faithfulness, answer relevancy, and context precision scores. Set this up before optimizing anything.


Tool of the Week

Cohere Rerank ... Drop-in reranking API that scores query-document relevance. Takes your existing retrieval results and reorders them by actual relevance. 50-100ms latency, measurable accuracy improvement. The API-based approach means you can add it without changing your embedding pipeline. Start with their free tier to benchmark against your current retrieval quality.


That's it for this week.

Hit reply if you're building RAG and want a second opinion on your chunking strategy. I've reviewed a dozen of these pipelines ... the failure patterns are predictable. I read every response.

– Alex

P.S. For the complete guide to building production AI features in SaaS ... from model selection to cost optimization: AI-Assisted Development Guide.

Get insights like this weekly

Join The Architect's Brief — one actionable insight every Tuesday.