The Anatomy of a High-Precision SaaS

TL;DR

Building a B2B SaaS to 100,000 users isn't about choosing the "best" technology... it's about choosing the right constraints at the right time. Start on Vercel, implement Row-Level Security from day one, use tRPC for internal APIs, and plan your escape to AWS when bandwidth costs exceed $500/month. The architecture that gets you to 10k users is not the architecture that gets you to 100k.

Part of the SaaS Architecture Decision Framework ... a comprehensive guide to architecture decisions from MVP to scale.

The Problem Nobody Talks About

Most SaaS architecture guides fall into two categories: the "hello world" tutorial that stops at 100 users, or the Netflix-scale distributed systems talk that's irrelevant until you've raised $50M.

The gap between them... the 10k to 100k user phase... is where most B2B products either break or bleed money. I've watched three startups hit this wall. One burned $40k in a single month on Vercel bandwidth overages. Another spent six months rewriting their database layer because schema-per-tenant doesn't scale past 300 customers. The third is still running, but their P99 latency is 4 seconds and their customers are leaving.

This is the guide I wish I'd had. It's opinionated, specific, and focused on the decisions that actually matter.

The Stack

Before we dive in, here's what we're building toward:


┌─────────────────────────────────────────────────────────────┐
│  PRESENTATION                                               │
│  Next.js 16 (App Router) → Vercel/Cloudflare → CDN Edge    │
├─────────────────────────────────────────────────────────────┤
│  API LAYER                                                  │
│  tRPC (internal) + REST (public) → Type-safe end-to-end    │
├─────────────────────────────────────────────────────────────┤
│  DATA                                                       │
│  PostgreSQL + RLS → Supavisor/PgBouncer → Connection Pool  │
├─────────────────────────────────────────────────────────────┤
│  INFRASTRUCTURE                                             │
│  Phase 1: Vercel → Phase 2: Hybrid → Phase 3: AWS ECS      │
└─────────────────────────────────────────────────────────────┘

This isn't the only valid architecture. But it's the one I've seen work repeatedly for B2B products in the $10k-$100k MRR range.

Part 1: Infrastructure Economics

The Vercel Question

Every Next.js project starts with the same question: Vercel or self-host?

Here's the honest answer: Vercel until it hurts.

The math is straightforward. Vercel Pro costs $20/month per team member. For a 5-person startup, that's $100/month plus usage. An experienced SRE costs $150,000-$200,000 annually. If your infrastructure bill stays under $2,000/month, you cannot justify the hiring cost of someone to manage AWS.

But Vercel has cliffs. Hard ones.

The Bandwidth Cliff

Vercel includes 1TB of bandwidth on Pro. Overage costs $0.15 per GB... that's $150 per additional TB. For a document-heavy B2B app (think: PDF generation, file attachments, rich dashboards), you can hit 2-3TB monthly at 50k users.
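To make the overage math concrete, here is a minimal sketch that uses only the figures quoted above (1TB included, $0.15 per GB, decimal terabytes). The function and constant names are illustrative and not part of any Vercel SDK, and real pricing may differ by plan.

```typescript
// Figures come from the paragraph above; treat them as assumptions, not a price sheet.
const INCLUDED_GB = 1000; // 1 TB included on Pro (decimal TB)
const OVERAGE_USD_PER_GB = 0.15;

// Estimated bandwidth overage charge in USD for one billing cycle.
function bandwidthOverageUsd(usedGb: number): number {
  const overageGb = Math.max(0, usedGb - INCLUDED_GB);
  // Round to cents to avoid floating-point noise.
  return Math.round(overageGb * OVERAGE_USD_PER_GB * 100) / 100;
}

const atTwoTb = bandwidthOverageUsd(2000); // 150
const atThreeTb = bandwidthOverageUsd(3000); // 300
```

At the 2-3TB range mentioned above, that is $150-$300 per month in overage alone, before any function or image-optimization usage.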

I worked with a team whose Vercel bill jumped from $400 to $2,100 in a single billing cycle. The cause? A marketing campaign drove traffic, and their PDF export feature served 800GB of files in three weeks. No warning. No throttling. Just a bill.

The Compliance Cliff

Some enterprise customers require static IP addresses for firewall allowlisting. Vercel doesn't offer this. If a $200k/year contract depends on IP allowlisting, you're migrating to AWS whether you're ready or not.

The Latency Cliff

Serverless functions have cold starts. Vercel has improved this dramatically with Fluid Compute... cold starts are now approximately 100ms in optimal conditions. But "optimal" means predictable traffic patterns. A burst of 500 concurrent users at 9am Monday (common in B2B) can still trigger cold starts across your function fleet.

For applications where P99 latency must stay under 200ms consistently, serverless is the wrong model. You need always-on containers.

The Migration Trajectory

Here's the pattern I've observed across a dozen B2B products:

Phase	MAU	Infrastructure	Monthly Cost
0-10k	0-10,000	Vercel Pro	$100-$500
10k-50k	10,000-50,000	Vercel + Optimization	$500-$2,000
50k-100k	50,000-100,000	AWS ECS/Fargate	$300-$800 + labor

The counterintuitive insight: AWS is often cheaper at scale, but more expensive at the start. A minimal high-availability setup on AWS (NAT Gateway, ALB, monitoring) runs $150/month before you deploy any code. On Vercel, that's $0.

The optimal path is not "AWS from day one." It's "Vercel until the economics flip, then migrate with intention."

The Cloudflare Alternative

There's a third option that deserves mention: Cloudflare Workers with Pages.

Cloudflare runs on V8 isolates rather than containers. The practical difference: cold starts of 40-150ms versus Vercel's 100-500ms. For latency-sensitive applications, this matters.

Cloudflare also charges based on CPU time, not wall-clock duration. If your function spends 90% of its time waiting for database responses (common in B2B), you're billed for 10% of what you'd pay on Vercel or Lambda.

The trade-off is compatibility. Cloudflare Workers aren't Node.js... they're a subset. Many npm packages don't work. Native modules are out. If your codebase relies heavily on the Node.js ecosystem, the migration cost may exceed the savings.

I've deployed Next.js to Cloudflare via OpenNext for three projects. Two went smoothly. One required rewriting a PDF generation pipeline because the library used native bindings. Know your dependencies before committing.

When to Migrate

Migrate to AWS when any of these are true:

Bandwidth exceeds 1.5TB/month ... You're paying $75+ in overages, trending up
Compliance requires static IPs ... Enterprise sales are blocked
Cold starts violate SLAs ... Your contracts specify latency guarantees
Background jobs exceed 60 seconds ... Vercel function timeouts are hard limits
You need WebSockets at scale ... Serverless can't maintain persistent connections
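The five triggers above can be written down as a checklist function, which is handy for a quarterly infrastructure review. This is a sketch: the `MigrationSignals` shape and its field names are hypothetical, and the thresholds are simply the ones listed above.

```typescript
// Hypothetical shape for the signals discussed above; field names are illustrative.
interface MigrationSignals {
  monthlyBandwidthTb: number;
  needsStaticIps: boolean;
  coldStartsViolateSla: boolean;
  longestJobSeconds: number;
  needsWebSocketsAtScale: boolean;
}

// Returns every trigger that currently applies; an empty list means "stay put".
function migrationTriggers(s: MigrationSignals): string[] {
  const reasons: string[] = [];
  if (s.monthlyBandwidthTb > 1.5) reasons.push("bandwidth over 1.5TB/month");
  if (s.needsStaticIps) reasons.push("compliance requires static IPs");
  if (s.coldStartsViolateSla) reasons.push("cold starts violate SLAs");
  if (s.longestJobSeconds > 60) reasons.push("background jobs exceed 60 seconds");
  if (s.needsWebSocketsAtScale) reasons.push("WebSockets at scale");
  return reasons;
}

const review = migrationTriggers({
  monthlyBandwidthTb: 1.8,
  needsStaticIps: false,
  coldStartsViolateSla: false,
  longestJobSeconds: 45,
  needsWebSocketsAtScale: false,
});
// review contains one entry: the bandwidth trigger
```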

Don't migrate because:

"AWS is cheaper" (it's not, until it is)
"We might need it someday" (you won't, until you do)
"Real companies use AWS" (real companies ship products)

The Migration Itself

When you do migrate, here's what the path looks like:

Week 1-2: Infrastructure Setup

VPC with public/private subnets across 2+ availability zones
NAT Gateway for outbound traffic from private subnets
Application Load Balancer with SSL termination
ECS cluster with Fargate capacity provider
ECR repository for Docker images

Week 3: Application Configuration

Standalone Next.js Docker build
Environment variable management (AWS Secrets Manager or Parameter Store)
Health check endpoints for ALB target groups
Log aggregation to CloudWatch
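For the health check item in the list above, a minimal sketch of an App Router Route Handler follows. The file path is an assumption, and the check is deliberately shallow; whether to also probe the database is a judgment call this sketch does not make for you.

```typescript
// app/api/health/route.ts (path is an assumption; any route the ALB can reach works)
// A shallow check: it only confirms the Node.js process can serve requests.
// It does not touch the database, so one slow query cannot mark every task unhealthy.
export function GET(): Response {
  const body = JSON.stringify({ status: "ok" });
  return new Response(body, {
    status: 200,
    headers: {
      "content-type": "application/json",
      "cache-control": "no-store", // health responses should never be cached
    },
  });
}
```

Point the ALB target group's health check path at this route and expect a 200 status code.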

Week 4: Cutover

Deploy to ECS behind a staging domain
Load test to verify performance
DNS cutover with low TTL
Monitor for 48 hours before celebrating

Total engineering time: 80-120 hours for a senior developer familiar with AWS. If that's not you, budget for consulting or expect 2x the timeline.

The teams that struggle are the ones who try to migrate while also shipping features. Treat migration as a dedicated project with its own timeline.

Part 2: The Database Layer

The database is the one component you can't easily swap. Choose wrong here and you're facing a 6-month rewrite.

Why PostgreSQL

I've evaluated MongoDB, PlanetScale (MySQL), CockroachDB, and various NewSQL options for B2B SaaS. I keep coming back to PostgreSQL for three reasons:

Row-Level Security ... Native, battle-tested multi-tenancy at the database level
JSONB ... Document flexibility without abandoning relational integrity
Ecosystem ... Supabase, Neon, and every managed provider supports it

MongoDB is fine for prototypes. But the moment you need ACID transactions across multiple collections, or you're implementing audit trails for SOC 2 compliance, you'll wish you'd started with Postgres.

Multi-Tenancy: The Decision That Haunts You

There are three models for multi-tenant databases:

Model 1: Database-per-Tenant


tenant_acme.database → tenant_acme's data
tenant_globex.database → tenant_globex's data

Maximum isolation. Impossible to scale. I've never seen this work past 50 tenants.

Model 2: Schema-per-Tenant


tenant_acme.users → tenant_acme's users
tenant_globex.users → tenant_globex's users

Sounds elegant. Breaks at approximately 200-300 tenants. Here's why:

Running a migration (adding a column) requires iterating every schema. At 500 tenants with a 2-second migration per schema, your deployment takes nearly 17 minutes. At 5,000 tenants, it takes nearly 3 hours. During this time, your application is in a mixed state... some schemas have the new column, some don't. You either accept downtime or build complex migration orchestration.

I watched a team spend four months building "migration sharding" to work around this. They should have used the right model from the start.
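The migration arithmetic is worth running with your own numbers. A small sketch, assuming migrations run serially and take the same time per schema (the function name is illustrative):

```typescript
// Total wall-clock time for a serial schema-per-tenant migration, in seconds.
function serialMigrationSeconds(tenantCount: number, secondsPerSchema: number): number {
  return tenantCount * secondsPerSchema;
}

const minutesAt500 = serialMigrationSeconds(500, 2) / 60; // about 16.7 minutes
const hoursAt5000 = serialMigrationSeconds(5000, 2) / 3600; // about 2.8 hours
```

Parallelizing the loop shortens the window but does not remove the mixed state, since some schemas are still migrated before others.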

Model 3: Shared Schema with Row-Level Security


CREATE TABLE users (
  id UUID PRIMARY KEY,
  tenant_id UUID NOT NULL,
  email TEXT NOT NULL,
  -- ... other columns
);

-- The magic: database-enforced isolation
ALTER TABLE users ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON users
  USING (tenant_id = current_setting('app.current_tenant')::UUID);

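One table. One migration. Infinite tenants. The database itself enforces that Tenant A never sees Tenant B's data... even if your application code has a bug. One caveat: PostgreSQL exempts table owners and roles with BYPASSRLS from policies, so the application must connect as a separate, unprivileged role (or the table needs FORCE ROW LEVEL SECURITY).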

This is the only model I recommend for B2B SaaS in 2026.
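The policy reads `app.current_tenant`, so something has to set it on every request. One way to do that, sketched below, is PostgreSQL's `set_config` with its third argument (`is_local`) set to true, which scopes the value to the current transaction. `withTenant` and the `Queryable` interface are hypothetical names; `Queryable` stands in for whatever database client you use, as long as it exposes a `query(text, params)` method.

```typescript
// Minimal stand-in for a database client; an assumption, not a real library type.
interface Queryable {
  query(text: string, params?: unknown[]): Promise<unknown>;
}

// Runs fn inside a transaction with app.current_tenant set for that transaction only.
async function withTenant<T>(
  client: Queryable,
  tenantId: string,
  fn: (client: Queryable) => Promise<T>,
): Promise<T> {
  await client.query("BEGIN");
  try {
    // is_local = true: the setting is discarded at COMMIT or ROLLBACK,
    // so it cannot leak to the next request that reuses this connection.
    await client.query("SELECT set_config('app.current_tenant', $1, true)", [tenantId]);
    const result = await fn(client);
    await client.query("COMMIT");
    return result;
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  }
}
```

Transaction scoping matters once a pooler is in the picture: a session-level setting could otherwise survive on a pooled connection and be seen by a different tenant's request.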

RLS Performance: The Truth

The common objection to RLS: "Doesn't checking a policy on every row kill performance?"

The answer is nuanced.

Naive RLS is slow. If you write:


SELECT * FROM widgets;

And rely entirely on RLS to filter, the query planner may choose a sequential scan. On a table with 10 million rows across 1,000 tenants, this is catastrophic.

Explicit filtering with RLS is fast. If you write:


SELECT * FROM widgets WHERE tenant_id = 'abc-123';

The query planner uses your index on tenant_id, and RLS acts as a safety net... not the primary filter. Benchmarks show this approach performs within 5% of queries without RLS, while providing defense-in-depth against data leaks.

The pattern: Always include tenant_id in your WHERE clause. Use RLS as the backstop, not the filter.

The Index That Matters

For multi-tenant B2B, one index pattern dominates:


CREATE INDEX idx_widgets_tenant_created
ON widgets (tenant_id, created_at DESC);

This composite index serves the most common B2B query: "Show me the latest items for my organization." The database jumps directly to the tenant's data block in the B-tree, ignoring millions of rows from other tenants.

Without this index, every "recent activity" query scans the entire table. With it, the same query touches only the tenant's rows.
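On the application side, the matching query can be built once and reused. The helper below is a sketch with hypothetical names; it returns a parameterized query whose tenant_id equality and created_at ordering line up with the composite index above, and it caps the page size.

```typescript
interface ParameterizedQuery {
  text: string;
  values: unknown[];
}

// Builds the "latest items for my organization" query.
// tenant_id equality plus ORDER BY created_at DESC matches the (tenant_id, created_at DESC) index.
function latestWidgetsQuery(tenantId: string, limit = 20): ParameterizedQuery {
  // Keep the page size between 1 and 100 for any finite input.
  const safeLimit = Math.min(Math.max(Math.trunc(limit), 1), 100);
  return {
    text: "SELECT * FROM widgets WHERE tenant_id = $1 ORDER BY created_at DESC LIMIT $2",
    values: [tenantId, safeLimit],
  };
}

const firstPage = latestWidgetsQuery("abc-123");
// firstPage.values is ["abc-123", 20]
```

The explicit tenant filter also keeps RLS in its backstop role rather than making it the only filter.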

Connection Pooling: Non-Negotiable

Serverless functions are stateless. Each invocation can potentially open a new database connection. A traffic spike that spawns 500 concurrent function instances will attempt 500 database connections.

PostgreSQL typically limits connections to 100-500. Without a pooler, your application crashes under load.

PgBouncer is the traditional solution... lightweight, battle-tested, but single-threaded. It becomes a bottleneck on high-throughput systems.

Supavisor is the modern alternative. Built in Elixir, it's been benchmarked handling 1 million concurrent connections while maintaining 20,000 queries per second. If you're using Supabase, you get Supavisor automatically. If not, deploy PgBouncer and plan to upgrade.

The rule: If you're running serverless at any scale, connection pooling is not optional.
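The failure mode is simple multiplication. A rough sketch, with illustrative function names and a pooler size of 20 chosen purely as an example:

```typescript
// Worst-case connections if every concurrent function instance opens its own.
function peakConnections(instances: number, connectionsPerInstance: number): number {
  return instances * connectionsPerInstance;
}

// True when the worst case exceeds what the database will accept.
function exceedsLimit(
  instances: number,
  connectionsPerInstance: number,
  maxConnections: number,
): boolean {
  return peakConnections(instances, connectionsPerInstance) > maxConnections;
}

const directSpike = exceedsLimit(500, 1, 100); // true: 500 direct connections against a limit of 100
const viaPooler = exceedsLimit(1, 20, 100); // false: one pooler holding 20 server connections
```

With a pooler in front, the 500 function instances connect to the pooler, and only the pooler's fixed set of server connections counts against the database limit.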

Database Provider Selection

A quick note on managed PostgreSQL providers, since this question comes up constantly:

Supabase ... Best developer experience. RLS is a first-class citizen. Supavisor included. The dashboard is excellent for debugging. Downside: you're tied to their ecosystem (auth, storage, realtime). If you just want Postgres, you're paying for features you won't use.

Neon ... Serverless Postgres with branching. The killer feature is database branches for preview deployments... each PR gets its own database copy. Downside: relatively new, and the serverless scaling can cause latency spikes during cold starts.

AWS RDS ... The enterprise standard. Rock-solid reliability. Downside: no built-in connection pooling (you'll deploy PgBouncer yourself), and the console is a maze.

PlanetScale ... MySQL, not Postgres. I mention it because teams ask. Their branching workflow is excellent, but you lose RLS, JSONB, and the PostgreSQL ecosystem. For B2B SaaS with multi-tenancy requirements, I don't recommend it.

My default recommendation: Supabase for 0-50k users, then evaluate whether to migrate to RDS based on specific requirements (compliance, existing AWS infrastructure, cost optimization at scale).

Part 3: The API Layer

tRPC for Internal, REST for External

The API architecture question has a clear answer for B2B SaaS:

Internal dashboard → tRPC
Public/partner API → REST with OpenAPI

tRPC provides end-to-end type safety in a TypeScript monorepo. You define a procedure on the server:


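// server/routers/widgets.ts
import { z } from "zod";
// Assumed location of your tRPC helpers; adjust the path to your project
import { protectedProcedure, router } from "../trpc";

export const widgetRouter = router({
	list: protectedProcedure
		// Validate input; callers that omit limit get 10
		.input(z.object({ limit: z.number().default(10) }))
		.query(async ({ ctx, input }) => {
			// Explicit tenant filter; RLS remains the backstop
			return ctx.db.widget.findMany({
				where: { tenantId: ctx.tenant.id },
				take: input.limit,
			});
		}),
});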

And the client immediately knows the return type:


// No API documentation needed. No type generation.
// The types flow from server to client automatically.
const { data } = trpc.widget.list.useQuery({ limit: 20 });
// data is fully typed: Widget[]

This eliminates an entire category of bugs... mismatched request/response types, outdated API documentation, runtime type errors from backend changes.
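The type flow can be illustrated without tRPC itself. The sketch below is not tRPC's implementation; it only shows the underlying TypeScript idea, that a client-side type can be derived from a server function's signature instead of being written by hand. `Widget` and `listWidgets` are hypothetical names.

```typescript
interface Widget {
  id: string;
  tenantId: string;
  name: string;
}

// "Server" side: an ordinary function with a declared return type.
function listWidgets(tenantId: string, limit: number): Widget[] {
  const all: Widget[] = [
    { id: "w1", tenantId: "acme", name: "Gear" },
    { id: "w2", tenantId: "globex", name: "Sprocket" },
    { id: "w3", tenantId: "acme", name: "Cog" },
  ];
  return all.filter((w) => w.tenantId === tenantId).slice(0, limit);
}

// "Client" side: the output type is derived from the server function.
// If listWidgets changes its return type, this type changes with it.
type ListWidgetsOutput = ReturnType<typeof listWidgets>;

const widgets: ListWidgetsOutput = listWidgets("acme", 10);
const names = widgets.map((w) => w.name); // w is typed as Widget
```

Rename `name` on the server and the last line stops compiling, which is the class of bug the paragraph above describes.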

But tRPC tightly couples client and server. You can't give a tRPC endpoint to a customer and say "integrate with this." They'd need your type definitions, your TypeScript setup, your entire build system.

For public APIs, REST with OpenAPI specification remains the standard. Enterprise customers expect Swagger documentation, not a TypeScript monorepo.

The hybrid approach: tRPC powers your dashboard (80% of traffic), REST handles integrations (20% of traffic, 80% of documentation effort).

Bundle Size Considerations

Your choice of data-fetching library affects Time to Interactive:

Library	Size (min+gzip)	Use Case
SWR	~5.5 kB	Simple REST fetching
tRPC Client	~5-11 kB	Type-safe internal APIs
TanStack Query	~13 kB	Complex caching/mutations
Apollo Client	~20-40 kB	GraphQL with normalization

For B2B dashboards where users are on corporate networks, these differences matter less than in consumer apps. But if your bundle is already 500kB, adding Apollo's 40kB is an 8% increase. That adds up.
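The same arithmetic works for any library in the table. A one-function sketch (the name is illustrative):

```typescript
// Percentage growth of the bundle when a library of addedKb is introduced.
function bundleIncreasePercent(currentKb: number, addedKb: number): number {
  return (addedKb * 100) / currentKb;
}

const apolloIncrease = bundleIncreasePercent(500, 40); // 8
const swrIncrease = bundleIncreasePercent(500, 5.5); // about 1.1
```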

I default to tRPC + TanStack Query. The bundle cost (roughly 18-24kB combined, going by the table above) is justified by the developer experience gains.

Part 4: The Next.js Application Layer

App Router in 2026

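Next.js 16 (released October 2025) builds on the App Router, where React Server Components are the default. That has been the case since the App Router stabilized in Next.js 13.4, so this is no longer experimental... it's the standard.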

The trade-off is explicit: smaller client bundles in exchange for potentially slower Time to First Byte (TTFB).

Server Components render on the server. The client receives HTML, not JavaScript. For a complex dashboard component that previously shipped 50kB of JavaScript, the client now receives 0kB... just the rendered output.

But that rendering happens on every request (unless cached). If your Server Component fetches data from three APIs sequentially, TTFB increases by the sum of those latencies.

The pattern that works:


// Parallel data fetching with Promise.all
async function Dashboard() {
	// These fetch in parallel, not sequentially
	const [metrics, activity, alerts] = await Promise.all([
		getMetrics(),
		getRecentActivity(),
		getActiveAlerts(),
	]);

	return (
		<div>
			<MetricsPanel data={metrics} />
			<ActivityFeed data={activity} />
			<AlertsBanner data={alerts} />
		</div>
	);
}
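The difference between the two fetching strategies comes down to a sum versus a maximum. A sketch with hypothetical latencies:

```typescript
// Time spent waiting on data when requests run one after another.
function sequentialWaitMs(latenciesMs: number[]): number {
  return latenciesMs.reduce((sum, ms) => sum + ms, 0);
}

// Time spent waiting when all requests start together, as with Promise.all.
function parallelWaitMs(latenciesMs: number[]): number {
  return latenciesMs.length === 0 ? 0 : Math.max(...latenciesMs);
}

const apiLatenciesMs = [120, 80, 200]; // hypothetical: metrics, activity, alerts
const sequential = sequentialWaitMs(apiLatenciesMs); // 400
const parallel = parallelWaitMs(apiLatenciesMs); // 200
```

With Promise.all the slowest call sets the floor, so the next optimization target is whichever fetch is slowest.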

Caching: The Breaking Change

Next.js 15 changed caching from opt-out to opt-in. This was the right call.

Previous versions cached aggressively by default. I've debugged countless issues where B2B dashboards showed stale data... inventory counts that didn't update after orders, user lists that missed recent additions. The fix was always "add cache: 'no-store'" but developers had to know to add it.

In Next.js 15+, nothing is cached unless you explicitly enable it:


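// URL is a placeholder; server-side fetch needs an absolute URL
// This fetches fresh data on every request
const fresh = await fetch("https://api.example.com/widgets");

// This caches the response for 60 seconds
const cached = await fetch("https://api.example.com/widgets", {
	next: { revalidate: 60 },
});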

For B2B SaaS where data freshness is critical, this default is correct. Opt-in caching forces you to think about what should be cached, rather than discovering stale data in production.
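To make the idea of a revalidation window concrete, the sketch below implements a generic time-based cache with an injectable clock. It is not how Next.js implements `revalidate`, and Next.js differs in details; it only illustrates that a value is reused until its window expires. All names are hypothetical.

```typescript
// Wraps a loader so its result is reused for ttlMs milliseconds.
// `now` is injectable so the behavior can be checked without real waiting.
function cacheFor<T>(load: () => T, ttlMs: number, now: () => number = Date.now): () => T {
  let value: T | undefined;
  let expiresAt = -Infinity; // forces a load on the first call
  return () => {
    if (now() >= expiresAt) {
      value = load();
      expiresAt = now() + ttlMs;
    }
    return value as T;
  };
}

let clockMs = 0;
let loadCount = 0;
const getWidgetCount = cacheFor(() => ++loadCount, 60000, () => clockMs);

getWidgetCount(); // first call loads
clockMs = 59999;
getWidgetCount(); // inside the window, served from cache
clockMs = 60000;
getWidgetCount(); // window expired, loads again
```

The question to ask per data source is how long a stale answer is acceptable: 60 seconds is fine for navigation or feature flags and usually not for inventory counts.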

Docker Optimization

If you're self-hosting Next.js (on AWS ECS or similar), Docker configuration matters.

A naive Docker build produces a 2GB+ image:


# DON'T DO THIS
FROM node:20
COPY . .
RUN npm install
RUN npm run build
CMD ["npm", "start"]

This copies your entire node_modules and build cache into the image. Deployment takes minutes. Container startup is slow. Storage costs add up.

The fix is Next.js standalone output:


// next.config.js
module.exports = {
	output: "standalone",
};


# Production Dockerfile
FROM node:20-alpine AS base

FROM base AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

FROM base AS runner
WORKDIR /app
ENV NODE_ENV=production

# Copy only what's needed
COPY --from=builder /app/.next/standalone ./
COPY --from=builder /app/.next/static ./.next/static
COPY --from=builder /app/public ./public

EXPOSE 3000
CMD ["node", "server.js"]

This produces a ~200MB image... 90% smaller. Deployments are faster. Cold starts are faster. Storage is cheaper.

One more critical setting:


ENV NODE_OPTIONS="--max-old-space-size=768"

Set this to ~75% of your container's memory allocation (768 assumes a 1GB container). Without it, the Node.js heap can grow past what the container allows, at which point the orchestrator kills the container, causing restarts under load.
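If you run several task sizes, the value is easy to derive rather than hard-code. A sketch, where the 75% fraction is the rule of thumb from above and the function name is illustrative:

```typescript
// Suggests a --max-old-space-size value (in MB) as a fraction of container memory.
function maxOldSpaceSizeMb(containerMemoryMb: number, fraction = 0.75): number {
  return Math.floor(containerMemoryMb * fraction);
}

const forOneGb = maxOldSpaceSizeMb(1024); // 768
const forTwoGb = maxOldSpaceSizeMb(2048); // 1536
const nodeOptions = `--max-old-space-size=${forOneGb}`;
```

The remaining 25% leaves headroom for memory outside the V8 old-space heap, such as buffers and native allocations.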

Part 5: The Phased Approach

Architecture isn't a destination. It's a series of decisions made at the right time.

Phase 1: Zero to 10k MAU

Goal: Ship features. Validate product-market fit.

Infrastructure: Vercel Pro
Database: Managed PostgreSQL (Supabase or Neon) with RLS enabled from day one
API: tRPC for everything
Monitoring: Vercel Analytics + basic error tracking

Don't optimize. Don't worry about AWS. The "Vercel tax" is cheaper than hiring DevOps. Your only job is to find customers.

Phase 2: 10k to 50k MAU

Goal: Optimize without over-engineering.

Enable Next.js standalone output (prepare for containerization)
Add composite indexes for common query patterns
Implement caching for static content (navigation, feature flags)
Add OpenTelemetry tracing to identify bottlenecks
Build your public REST API for enterprise integrations

Watch your Vercel bill. If bandwidth approaches 1TB, start planning migration.

Phase 3: 50k to 100k+ MAU

Goal: Take control of infrastructure economics.

Migrate to AWS ECS/Fargate if bandwidth exceeds 1.5TB or compliance requires it
Deploy Supavisor or PgBouncer for connection pooling
Implement proper CI/CD with preview environments
Consider read replicas if database CPU exceeds 70%

This phase requires DevOps capability... either hire or contract. The infrastructure complexity now justifies the labor cost.

What This Doesn't Cover

This guide intentionally omits:

Real-time features (WebSockets, presence) ... These require dedicated infrastructure (PartyKit, dedicated Node.js servers)
Background jobs over 60 seconds ... Use AWS Step Functions, Temporal, or Bull queues on dedicated workers
AI/ML workloads ... GPU infrastructure is a different discipline
Mobile applications ... The API layer changes significantly for mobile-first products

Each of these deserves its own deep-dive. This guide is about the core architecture that handles 80% of B2B SaaS requirements.

Common Mistakes I've Seen

Before closing, here are the patterns that consistently cause pain:

Mistake 1: Premature Kubernetes

I've watched a 4-person startup spend three months setting up a Kubernetes cluster "for scale." They had 200 users. They should have shipped features on Vercel and worried about K8s at 50k users... if ever. ECS Fargate handles 100k users without the operational complexity of Kubernetes.

Unless you have dedicated platform engineers, Kubernetes is a distraction until you're well past the scale this guide covers.

Mistake 2: GraphQL for Internal APIs

GraphQL solves a real problem: mobile apps with bandwidth constraints need to request exactly the data they need. But for a web-only B2B dashboard where the frontend and backend are in the same repo? GraphQL adds schema maintenance, code generation, and a 40kB client library for no benefit.

tRPC gives you the same type safety with none of the ceremony. Save GraphQL for when you have mobile apps or third-party developers who need query flexibility.

Mistake 3: Ignoring Multi-Tenancy Until Later

"We'll add proper tenant isolation when we have more customers."

No, you won't. You'll have a hundred customers with intertwined data, a codebase full of WHERE company_id = clauses that sometimes get forgotten, and a month of terror when you discover a data leak during a SOC 2 audit.

Implement RLS on day one. It's five lines of SQL per table. The cost of doing it later is measured in months, not hours.

Mistake 4: Optimizing Before Measuring

I once reviewed a codebase where the team had implemented Redis caching, read replicas, and a CDN... for an app with 300 users and a P95 latency of 180ms. They'd spent two months on infrastructure that provided no measurable benefit.

Before optimizing anything, add basic observability: request latency percentiles, database query times, error rates. Optimize the slowest thing. Then measure again. Repeat until the metrics are acceptable.
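Latency percentiles are cheap to compute from raw samples. The sketch below uses the nearest-rank method, one of several common percentile definitions; the sample data is made up.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b); // numeric sort, input left untouched
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1];
}

// Hypothetical request latencies in milliseconds, with one slow outlier.
const requestLatenciesMs = [120, 95, 180, 110, 4000, 130, 105, 90, 150, 100];
const p50 = percentile(requestLatenciesMs, 50); // 110
const p95 = percentile(requestLatenciesMs, 95); // 4000
```

The median looks healthy while the tail does not, which is why the percentiles, not the average, should decide what gets optimized first.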

The teams that ship fast are the ones that resist premature optimization.

The Takeaway

Building to 100k users is not about choosing perfect technology. It's about:

Starting simple ... Vercel + Supabase gets you to 10k users for roughly $100-$500/month
Making irreversible decisions carefully ... Database schema is hard to change; hosting is not
Migrating with intention ... Move to AWS when the economics demand it, not when your ego does
Using RLS from day one ... You cannot retrofit multi-tenant security
Measuring before optimizing ... Intuition about performance is usually wrong

The companies that reach 100k users aren't the ones with the most sophisticated architecture. They're the ones that shipped fast, paid attention to the cliffs, and evolved their infrastructure alongside their business.

Build the machine that builds the machine. Start with something that works. Make it better when "better" matters.

Continue Reading

This post is part of the SaaS Architecture Decision Framework ... covering multi-tenancy, deployment models, database scaling, and cost optimization from MVP to $1M ARR.
