TL;DR
Event-driven architecture solves real problems... decoupling services, handling bursty workloads, enabling independent team deployment. But I've watched teams adopt it 18 months too early and spend $50-100K/year on infrastructure they didn't need. The decision threshold isn't about ARR... it's about pain: cascading failures from synchronous calls, database locks from concurrent writes, and cross-team deployment bottlenecks. Start with the outbox pattern and a single event flow before migrating your entire architecture. Kafka handles 500K-1M messages/second but requires 8+ cores and 64-128GB RAM per broker. NATS JetStream delivers sub-millisecond latency at a fraction of the operational cost. Choose based on your actual throughput requirements, not your aspirational ones.
Part of the SaaS Architecture Decision Framework ... a comprehensive guide to architecture decisions from MVP to scale.
The Timing Problem
Every SaaS architecture conversation about events follows the same trajectory. At 5 engineers and $500K ARR, someone reads about how LinkedIn processes 1 trillion messages per day through Kafka and decides the team needs event-driven architecture. Six months later, they have a Kafka cluster nobody fully understands, an event schema that breaks every sprint, and debugging sessions that take 3x longer than the synchronous version.
I've advised companies on both sides of this mistake. The ones who adopted EDA too early spent 20-25% of their engineering headcount managing infrastructure instead of shipping features. The ones who waited too long suffered cascading failures that took down production for hours because a single downstream service was slow.
The right answer is neither "always events" nor "never events." It's knowing when the pain of synchronous communication exceeds the complexity cost of asynchronous messaging.
When Synchronous Breaks
Before discussing patterns and technologies, understand what actually forces the transition.
Cascading Failures
Your order service calls the payment service, which calls the fraud service, which calls the risk scoring service. Each call adds latency and a failure point. When the risk scoring service has a bad deploy and responds in 30 seconds instead of 300ms, every upstream service backs up. Connection pools saturate. Your entire checkout flow goes down because one service is slow.
This isn't theoretical. I've seen this exact pattern take down production at three different B2B SaaS companies.
Database Lock Contention
Multiple services writing to the same database... or even the same tables... create lock contention under load. An inventory update blocks an order creation. A reporting query locks a table that the billing service needs. At 100 concurrent users, this is invisible. At 1,000, your p99 latency spikes from 200ms to 5 seconds.
Team Coupling
This is the one nobody talks about in architecture diagrams. When three teams share synchronous APIs, every deployment requires coordination. Team A can't ship their payment refactoring because Team B's invoice service depends on the exact response format. Release cycles slow from weekly to monthly. Your best engineers spend more time in cross-team sync meetings than writing code.
Core Patterns You Actually Need
EDA isn't one pattern... it's a family of patterns solving different problems. Most teams need two or three, not all of them.
The Outbox Pattern: Start Here
The outbox pattern solves the fundamental consistency problem: how do you update your database and publish an event atomically?
-- Single transaction: business logic + event publishing
BEGIN;
UPDATE orders SET status = 'confirmed' WHERE id = $1;
INSERT INTO outbox (event_type, payload, created_at)
VALUES ('order.confirmed', $2, NOW());
COMMIT;
A separate relay process polls the outbox table and publishes events to your broker. The database transaction guarantees both the state change and the event record succeed or fail together.
This gives you at-least-once delivery. Your consumers must be idempotent... processing the same event twice should produce the same result. This sounds like a constraint, but it forces better design. Every event handler that can safely retry is an event handler that recovers from failures automatically.
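One way to make a consumer idempotent is a processed-event ledger: record each event's ID alongside the state change and skip anything already seen. A minimal sketch, assuming events carry a unique `eventId`... the in-memory `Set` stands in for a database table with a unique constraint, checked in the same transaction as the state change:

```typescript
// Idempotent-consumer sketch. The dedupe set is in-memory for illustration;
// in production it would be a table with a unique constraint on eventId.
type Event = { eventId: string; type: string; payload: unknown };

const processedIds = new Set<string>();
let confirmedOrders = 0; // stand-in for real application state

function handleOrderConfirmed(event: Event): void {
  if (processedIds.has(event.eventId)) return; // duplicate delivery: no-op
  confirmedOrders += 1;                        // apply the state change once
  processedIds.add(event.eventId);             // record the event as processed
}

// At-least-once delivery means the same event may arrive twice:
const evt: Event = { eventId: "evt-1", type: "order.confirmed", payload: {} };
handleOrderConfirmed(evt);
handleOrderConfirmed(evt); // retry: safely ignored, confirmedOrders stays 1
```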
Two implementation approaches:
| Approach | Mechanism | Latency | Complexity |
|---|---|---|---|
| Polling | Scheduled job queries outbox table | 100ms-5s | Low... any team can implement |
| CDC (Change Data Capture) | PostgreSQL logical replication streams WAL changes | 10-50ms | Medium... requires WAL configuration |
For most SaaS applications under 10,000 events per minute, polling every 500ms is sufficient and dramatically simpler.
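The polling approach amounts to a loop: select unpublished rows, publish them, mark them published. A sketch with in-memory stand-ins for the outbox table and broker... a real relay would query PostgreSQL (ideally with `FOR UPDATE SKIP LOCKED`) and publish to Kafka or NATS; all names here are illustrative:

```typescript
// Polling outbox relay sketch. The outbox "table" and broker are in-memory
// stand-ins for illustration only.
type OutboxRow = {
  id: number;
  eventType: string;
  payload: string;
  publishedAt?: Date;
};

const outboxTable: OutboxRow[] = [
  { id: 1, eventType: "order.confirmed", payload: '{"orderId":"123"}' },
];
const broker: string[] = []; // records published event types

async function relayOnce(batchSize = 100): Promise<number> {
  // Select unpublished rows in insertion order to preserve event ordering.
  const batch = outboxTable
    .filter((row) => !row.publishedAt)
    .slice(0, batchSize);
  for (const row of batch) {
    broker.push(row.eventType);   // stand-in for broker.publish(...)
    row.publishedAt = new Date(); // mark only after a successful publish
  }
  return batch.length;
}

// In production this runs on a timer, e.g. setInterval(() => relayOnce(), 500).
```

Marking rows published only after the publish succeeds is what preserves at-least-once delivery: a crash between publish and mark simply causes a redelivery, which idempotent consumers absorb.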
Choreography vs. Orchestration
Two fundamentally different approaches to coordinating multi-service workflows.
Choreography (decentralized): Each service reacts to events independently. No central coordinator. The order service publishes order.created, the payment service consumes it and publishes payment.processed, the inventory service consumes that and publishes inventory.reserved.
Orchestration (centralized): A coordinator service manages the workflow. The order orchestrator explicitly calls the payment service, waits for confirmation, then calls the inventory service.
The decision matrix isn't about preference... it's about the workflow characteristics:
| Factor | Choreography | Orchestration |
|---|---|---|
| Services involved | 2-3 | 4+ |
| Failure handling | Compensating events | Centralized rollback |
| Visibility | Distributed tracing required | Orchestrator holds full state |
| Team autonomy | High... teams own their reactions | Lower... orchestrator team is bottleneck |
| Debugging | Harder... follow event chain | Easier... single coordinator log |
Most SaaS applications benefit from a hybrid: choreography for simple, fire-and-forget flows (notifications, analytics, audit logging) and orchestration for critical business transactions (order fulfillment, payment processing, provisioning).
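The choreographed chain above can be sketched with a minimal in-memory event bus: each service subscribes to the event it consumes and knows nothing about upstream producers. The bus and service wiring here are illustrative, not a broker client:

```typescript
// Minimal in-memory event bus illustrating the choreographed flow:
// order.created -> payment.processed -> inventory.reserved.
type Handler = (payload: Record<string, unknown>) => void;

const handlers = new Map<string, Handler[]>();
const log: string[] = []; // records every published event, in order

function subscribe(eventType: string, handler: Handler): void {
  handlers.set(eventType, [...(handlers.get(eventType) ?? []), handler]);
}

function publish(eventType: string, payload: Record<string, unknown>): void {
  log.push(eventType);
  for (const h of handlers.get(eventType) ?? []) h(payload);
}

// Each "service" only declares which event it reacts to:
subscribe("order.created", (p) => publish("payment.processed", p));      // payment service
subscribe("payment.processed", (p) => publish("inventory.reserved", p)); // inventory service

publish("order.created", { orderId: "123" });
// log is now ["order.created", "payment.processed", "inventory.reserved"]
```

Notice that no service names another service... that is the autonomy win, and also why you need distributed tracing to see the full chain.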
The Saga Pattern: Distributed Transactions
Sagas manage multi-step business transactions across services where you need rollback capability. An order saga might:
1. Reserve inventory
2. Process payment
3. Generate invoice
4. Send confirmation
If step 3 fails, the saga executes compensation: refund the payment, release the inventory.
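The forward/compensation pairing can be sketched as an orchestrator that runs steps in order and, on failure, executes the compensations of the completed steps in reverse. This is a simplification... a real saga persists its progress so it can resume or roll back after a coordinator crash:

```typescript
// Saga orchestrator sketch: run steps in order; on failure, compensate
// completed steps newest-first. All step definitions are illustrative.
type SagaStep = {
  name: string;
  action: () => void;     // forward step; throws on failure
  compensate: () => void; // reverse step
};

function runSaga(steps: SagaStep[]): { completed: string[]; compensated: string[] } {
  const completed: string[] = [];
  const compensated: string[] = [];
  for (const step of steps) {
    try {
      step.action();
      completed.push(step.name);
    } catch {
      // Roll back everything that already succeeded, in reverse order.
      for (const name of [...completed].reverse()) {
        steps.find((s) => s.name === name)!.compensate();
        compensated.push(name);
      }
      break;
    }
  }
  return { completed, compensated };
}

// Invoice generation fails; payment and inventory are compensated in reverse:
const result = runSaga([
  { name: "reserveInventory", action: () => {}, compensate: () => {} },
  { name: "processPayment",   action: () => {}, compensate: () => {} },
  { name: "generateInvoice",  action: () => { throw new Error("invoice failed"); }, compensate: () => {} },
]);
// result.compensated is ["processPayment", "reserveInventory"]
```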
Where sagas go wrong:
- Undefined compensation actions. Every forward step needs a reverse step. "Release inventory" isn't always the reverse of "reserve inventory" if another order claimed the freed units in the meantime.
- Isolation violations. Two concurrent sagas modifying the same data can cause lost updates. Saga A reads inventory count, Saga B reads the same count, both decrement... you've oversold.
- Orchestrator as single point of failure. If your saga coordinator goes down mid-transaction, you need recovery logic to resume or roll back incomplete sagas.
The honest recommendation: don't implement sagas until you have a specific multi-service transaction that's causing data inconsistency problems. Sagas are the most complex EDA pattern and the one most often adopted prematurely.
Technology Selection
I'll compare the three technologies I recommend to SaaS teams, along with the managed alternative that makes sense for AWS-native architectures.
Performance Comparison (2025 Benchmarks)
| Technology | Throughput | Latency | Min. Resources | Best For |
|---|---|---|---|---|
| Apache Kafka | 500K-1M+ msg/sec | 10-50ms | 8 cores, 64GB RAM, 10GbE NIC | High-volume streaming, event sourcing, analytics pipelines |
| NATS JetStream | 200K-400K msg/sec | Sub-ms (memory), 1-5ms (disk) | 2 cores, 4GB RAM | Low-latency microservices, IoT, edge deployments |
| RabbitMQ | 50K-100K msg/sec | 5-20ms | 4 cores, 8GB RAM | Task queues, request-reply patterns, moderate throughput |
| AWS EventBridge | Scales automatically | 50-200ms | None (serverless) | AWS-native routing, low-volume event-driven workflows |
These numbers are from 2025 VPS benchmarks on standardized 4 vCPU, 8GB RAM, NVMe configurations. Production numbers vary based on message size, replication factor, and network topology.
Cost Analysis
Cost is where the technology decision gets real.
Apache Kafka (Self-Hosted)
- Software: $0 (open source)
- Infrastructure: 3-node cluster minimum. On AWS, three m7g.xlarge instances run ~$1,200/month
- Operations: Requires distributed systems expertise. Budget 10-20% of one senior engineer's time for cluster management
- Hidden cost: ZooKeeper management (or KRaft migration), partition rebalancing, schema registry operations
Apache Kafka (AWS MSK)
- Brokers: $0.15-0.20/hour per m7g.large (~$330-440/month per broker)
- Storage: $0.10/GB-month (EBS)
- 3-broker minimum: ~$1,000-1,300/month before storage and data transfer
- Eliminates operational overhead but less flexible than self-hosted
Confluent Cloud
- Compute: CKU-based pricing (varies by throughput)
- Networking: $0.025/GB (US regions)
- Connectors: $0.20/task/hour
- Generally 40-60% more expensive than MSK but includes Schema Registry, ksqlDB, and managed connectors
NATS JetStream (Self-Hosted)
- Software: $0 (open source)
- Infrastructure: Runs on significantly smaller instances. Three t3.medium instances handle most SaaS workloads (~$120/month)
- Operations: Minimal. No ZooKeeper, no partition management, no JVM tuning
- Cloud IOPS: Dramatically lower than Kafka... significant cost difference on AWS/GCP
AWS EventBridge
- ~$1.00 per million events published
- Example: 5M events/month = ~$5/month total
- Zero operational overhead
- Limitation: AWS-only, not suitable for high-throughput streaming
The Decision Framework
Choose Kafka when:
- You process more than 100K messages/second sustained
- You need event replay and long-term event storage
- You're building analytics pipelines or event sourcing systems
- You have (or will hire) engineers with Kafka operational experience
Choose NATS JetStream when:
- Latency matters more than maximum throughput
- Your team is under 20 engineers and operational simplicity is critical
- You're running at the edge or in resource-constrained environments
- Your sustained throughput is under 100K messages/second
Choose EventBridge when:
- You're fully committed to AWS
- Event volume is under 1M events/day
- You want zero infrastructure management
- You're routing events between AWS services, not building streaming pipelines
Choose RabbitMQ when:
- You need request-reply messaging patterns
- Your workload is task distribution (job queues, background processing)
- You're already running RabbitMQ and your throughput needs are under 50K/second
Event Schema Design
Schema evolution is where most EDA implementations accumulate technical debt. Get this wrong and every consumer breaks when you add a field.
Rules That Prevent Breaking Changes
1. Always add, never remove or rename fields.
// Version 1
interface OrderCreated {
orderId: string;
customerId: string;
total: number;
}
// Version 2 ... backward compatible
interface OrderCreated {
orderId: string;
customerId: string;
total: number;
currency?: string; // New field, optional
lineItems?: LineItem[]; // New field, optional
}
New fields must have default values or be optional. Old consumers ignore fields they don't recognize. New consumers handle missing optional fields.
2. Version your events explicitly.
interface EventEnvelope {
eventType: string;
version: number;
timestamp: string;
correlationId: string;
payload: unknown;
}
When a breaking change is unavoidable (rare, but it happens), publish both versions simultaneously during the migration period. Old consumers read v1, new consumers read v2. Decommission v1 after all consumers migrate.
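During such a migration window, the `version` field in the envelope drives consumer-side dispatch. A minimal sketch... the handler wiring is illustrative:

```typescript
// Version-dispatch sketch: during a migration window, v1 and v2 of the same
// event type coexist, and each handler only sees its own version.
interface EventEnvelope {
  eventType: string;
  version: number;
  timestamp: string;
  correlationId: string;
  payload: unknown;
}

const handled: string[] = []; // records which handler processed each event

const versionHandlers: Record<number, (e: EventEnvelope) => void> = {
  1: (e) => handled.push(`v1:${e.eventType}`), // legacy consumer path
  2: (e) => handled.push(`v2:${e.eventType}`), // migrated consumer path
};

function dispatch(e: EventEnvelope): void {
  const handler = versionHandlers[e.version];
  if (!handler) throw new Error(`unsupported version ${e.version} for ${e.eventType}`);
  handler(e); // unknown versions fail loudly instead of being silently dropped
}

dispatch({ eventType: "order.created", version: 2, timestamp: "", correlationId: "c1", payload: {} });
// handled is ["v2:order.created"]
```

Once no v1 events have been dispatched for a full retention window, you can delete the v1 handler and stop publishing v1.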
3. Events should be autonomous.
Every event must contain all information the consumer needs. Don't publish { orderId: "123" } and force consumers to call back to the order service for details. That re-introduces the synchronous coupling you were trying to eliminate.
Eventual Consistency: The Part Nobody Warns You About
Eventual consistency is the tax you pay for event-driven architecture. Teams transitioning from ACID databases consistently underestimate the business impact, and the gap only shows up after go-live.
The Stale Read Problem
User creates an order. Your API returns 201 Created. User immediately refreshes the orders page. The order isn't there... the read model hasn't caught up to the write.
This doesn't manifest in development because your event bus and read model are on the same machine. In production, event processing lag of 200ms-2 seconds is normal. Under load, it can spike to 10+ seconds.
Mitigation strategies:
- Read-your-writes consistency. After a write, read from the primary (not the eventual read model) for that specific user's subsequent requests. Adds complexity but eliminates the most jarring UX issue.
- Optimistic UI updates. The frontend assumes the write succeeded and shows the result immediately, reconciling when the read model catches up. Works for most CRUD-style operations.
- Polling with backoff. After a write, the client polls the read model with exponential backoff until the change appears. Simple to implement, slightly worse UX.
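Of the three, polling with backoff is the simplest to sketch. `fetchOrder` is a hypothetical read-model query; delays double on each miss:

```typescript
// Polling-with-backoff sketch: after a write, poll the read model until the
// change appears or attempts are exhausted, doubling the delay each miss.
async function waitForReadModel(
  fetchOrder: (id: string) => Promise<object | null>,
  orderId: string,
  maxAttempts = 5,
  baseDelayMs = 100,
): Promise<object | null> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const order = await fetchOrder(orderId);
    if (order !== null) return order; // read model has caught up
    // Exponential backoff: 100ms, 200ms, 400ms, ...
    await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
  }
  return null; // surface a "still processing" state to the UI
}
```

Cap `maxAttempts` so a stalled consumer degrades to an explicit "still processing" state instead of a hung request.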
Processing Lag Between Services
Two services consuming the same event stream process at different rates. The billing service is 30 seconds behind the notification service. A customer gets a "payment confirmed" email before the billing dashboard shows the payment. Your support team gets confused calls.
There is no silver bullet for this. You manage it with:
- SLA-based monitoring on consumer lag
- Correlation IDs that let you trace an event across every service
- Circuit breakers that prevent a slow consumer from falling infinitely behind
Migration Strategy: The Strangler Fig Approach
Don't migrate your entire architecture to events simultaneously. The strangler fig pattern works: wrap existing synchronous calls with event publishing, migrate consumers one at a time.
Step 1: Single Event Flow
Pick one low-risk event. User signup is ideal... it's well-understood, low-volume, and failures are non-critical.
// Before: synchronous call chain
async function createUser(data: UserData) {
const user = await db.users.create(data);
await emailService.sendWelcome(user); // Sync call
await analyticsService.track(user); // Sync call
return user;
}
// After: record the event atomically, let consumers handle side effects
async function createUser(data: UserData) {
  return db.transaction(async (tx) => {
    const user = await tx.users.create(data);
    // Outbox insert commits (or rolls back) with the user row
    await outbox.publish(tx, "user.created", {
      userId: user.id,
      email: user.email,
      plan: user.plan,
    });
    return user;
  });
}
The email and analytics services become event consumers. The user creation endpoint returns faster. If the email service is down, the event waits in the queue instead of failing the signup.
Step 2: Monitor and Validate
Run the event-driven path alongside the synchronous path for two weeks. Compare:
- Event delivery success rate (target: 99.9%+)
- End-to-end latency (event published to consumer processed)
- Consumer error rates
- Message ordering correctness
Step 3: Incremental Migration
Once the first flow is stable, migrate the next most-critical workflow. For most SaaS applications, the priority order is:
1. Notifications (email, push, in-app) ... lowest risk, highest decoupling benefit
2. Analytics and audit logging ... fire-and-forget, no business logic dependency
3. Billing events ... higher stakes, but well-bounded domain
4. Core business workflows ... only after the team has 3-6 months of event-driven operational experience
The Adoption Checklist
Before committing to event-driven architecture, verify these conditions:
You should adopt EDA when:
- Synchronous calls between services cause cascading failures monthly
- Database lock contention is measurable and increasing
- Three or more teams coordinate deployments weekly
- You need independent scaling of read and write workloads
- Audit trail requirements exceed what application-level logging provides
You should not adopt EDA when:
- Your entire team fits in one standup (under 8 engineers)
- Synchronous API calls handle your current and 12-month projected load
- You're pre-product-market-fit and pivoting quarterly
- Nobody on the team has operated a message broker in production
The honest rule of thumb: If you're asking "should we adopt EDA?", you probably don't need it yet. The teams that need it know they need it because they're already suffering from the problems it solves.
Building a SaaS that's hitting the synchronous wall? I help teams plan and execute the migration to event-driven architecture without the 18-month learning curve.
- Technical Advisor for Startups ... Architecture decisions from MVP to scale
- Next.js Development for SaaS ... Production-grade Node.js systems
- Technical Due Diligence ... Pre-investment architecture assessment
Continue Reading
This post is part of the SaaS Architecture Decision Framework ... covering database design, multi-region deployment, API strategy, and infrastructure patterns.
More in This Series
- Database Query Optimization for Scale ... From N+1 to Optimal
- Multi-Region SaaS Architecture ... Global replication and data residency
- Build vs. Buy: The SaaS Engineering Decision ... When to build custom vs. adopt SaaS tools
- SaaS Reliability Monitoring ... Observability that catches issues before customers do
Related Guides
- Boring Technology Wins ... Why mature tech stacks outperform cutting-edge ones
- Anatomy of a High-Precision SaaS ... Building systems that handle mission-critical data
- Performance Engineering Playbook ... From TTFB to TTI optimization
