## TL;DR
Three pillars: logs for debugging, metrics for dashboards, traces for request flows. Golden signals: latency (P50/P95/P99), traffic (requests/sec), errors (rate and types), saturation (CPU/memory/connections). Alert on symptoms (user impact), not causes (high CPU). Build runbooks before you need them. Seed stage: Sentry + Checkly + PagerDuty. Series A: add Grafana Cloud. Series B+: evaluate Datadog or self-host with OpenTelemetry Collector.
Part of the SaaS Architecture Decision Framework ... a comprehensive guide to architecture decisions from MVP to scale.
## The Uptime Math
Let's establish what we're actually talking about:
| SLA | Annual Downtime | Monthly Downtime | Daily Downtime |
|---|---|---|---|
| 99% | 87.6 hours | 7.3 hours | 14.4 minutes |
| 99.9% | 8.76 hours | 43.8 minutes | 1.44 minutes |
| 99.95% | 4.38 hours | 21.9 minutes | 43 seconds |
| 99.99% | 52.6 minutes | 4.38 minutes | 8.6 seconds |
| 99.999% | 5.26 minutes | 26.3 seconds | 0.86 seconds |
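The figures above fall out of simple arithmetic... allowed downtime is just the unavailable fraction times the period. A quick sketch (TypeScript, to match the rest of this post; the function name is mine):

```typescript
// Allowed downtime for a given SLA: the fraction of time the system is
// permitted to be unavailable, (1 - sla/100), times the period length.
function allowedDowntimeSeconds(slaPercent: number, periodHours: number): number {
  return (1 - slaPercent / 100) * periodHours * 3600;
}

// Annual budget at three nines: (1 - 0.999) * 8760h = 8.76 hours
const annualSeconds = allowedDowntimeSeconds(99.9, 365 * 24);
console.log((annualSeconds / 3600).toFixed(2)); // "8.76"
```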
Most B2B SaaS products target 99.9% as the contractual SLA... which sounds impressive until you realize it's nearly 9 hours of downtime per year. I've built observability stacks for systems running at 99.95% actual uptime, and the difference between 99.9% and 99.95% isn't just 0.05%... it's a fundamentally different operational posture.
The infrastructure that gets you to 99.9% will not get you to 99.99%. This guide focuses on the practical middle ground: building observability that catches problems before users notice them, without creating an alert storm that burns out your on-call engineers.
## The Three Pillars of Observability
Everyone talks about logs, metrics, and traces. Few explain when to use each.
### Logs: The Forensic Record
Logs answer "what happened?" They're discrete events with timestamps and context. Use logs when:
- Debugging specific errors after the fact
- Understanding the sequence of events in a request
- Auditing user actions for compliance
- Investigating security incidents
Logs are expensive at scale. A busy API generating 1KB per request at 1,000 req/s produces 86GB daily. Structure your logs from day one... JSON, not plaintext.
```typescript
// Bad: unstructured logging
console.log(`User ${userId} created order ${orderId}`);

// Good: structured logging
logger.info({
  event: "order.created",
  userId,
  orderId,
  amount: order.total,
  items: order.items.length,
  latencyMs: Date.now() - startTime,
});
```
### Metrics: The Vital Signs
Metrics answer "how is the system performing?" They're numeric values aggregated over time. Use metrics for:
- Dashboards and real-time visibility
- Alerting on thresholds
- Capacity planning
- SLA tracking
Metrics are cheap to store and query. A 1-minute resolution metric costs roughly 1,440 data points per day... trivial compared to logs.
### Traces: The Request Journey
Traces answer "why was this request slow?" They track a single request across services. Use traces when:
- Debugging distributed system latency
- Understanding service dependencies
- Identifying which service caused a timeout
- Profiling request paths for optimization
Traces are expensive but essential for microservices. Sample them... 100% trace collection is neither necessary nor economical.
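To make the sampling idea concrete, here's a hash-based head sampler... a sketch with an illustrative hash, not a production implementation. OpenTelemetry SDKs ship a ratio-based sampler built on the same principle, which is what you'd actually use:

```typescript
// Head-based sampling sketch: hash the trace ID into [0, 1) and keep the
// trace if it falls under the sample rate. Hashing (rather than random)
// means every service that sees the same trace ID makes the same keep/drop
// decision, so sampled traces stay complete end to end.
function shouldSampleTrace(traceId: string, sampleRate: number): boolean {
  let hash = 0;
  for (const ch of traceId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple 32-bit rolling hash
  }
  return hash / 2 ** 32 < sampleRate;
}

// At 10% sampling, roughly 1 in 10 trace IDs passes, and a given ID
// always gets the same answer across services.
shouldSampleTrace("4bf92f3577b34da6a3ce929d0e0e4736", 0.1);
```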
## The Four Golden Signals
Google's SRE book introduced the golden signals. I've seen teams implement all four correctly exactly twice in my career. Most monitor the wrong things.
### 1. Latency
Not just average latency... percentiles matter.
| Percentile | Meaning | Alert Threshold (typical API) |
|---|---|---|
| P50 | Median experience | Monitor, don't alert |
| P95 | 1 in 20 users affected | Warning at 200ms |
| P99 | 1 in 100 users affected | Alert at 500ms |
| P99.9 | 1 in 1,000 users affected | Alert at 2,000ms |
The P99 is where problems hide. An API with 50ms P50 and 3,000ms P99 has a 3-second tail that's ruining 1% of user experiences.
```typescript
// Instrument latency with histogram buckets
import { Histogram } from "prom-client";

const httpRequestDuration = new Histogram({
  name: "http_request_duration_seconds",
  help: "Duration of HTTP requests in seconds",
  labelNames: ["method", "route", "status_code"],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});

// In your middleware
const end = httpRequestDuration.startTimer();
await next();
end({ method: req.method, route: req.route.path, status_code: res.statusCode });
```
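If buckets feel abstract, here's the naive version... a percentile computed directly from raw samples. Illustrative only (storing raw samples doesn't scale, which is why histograms exist), but it shows why the tail hides from averages:

```typescript
// Naive nearest-rank percentile: sort the samples and index into them.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  return sorted[Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length))];
}

// 990 requests at 50ms plus 10 at 3000ms: the median is untouched,
// but P99 surfaces the 3-second tail that 1% of users are hitting.
const latencies = [...Array(990).fill(50), ...Array(10).fill(3000)];
percentile(latencies, 50); // 50
percentile(latencies, 99); // 3000
```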
### 2. Traffic
Requests per second, broken down by endpoint and status.
This is your baseline for everything else. A 50% traffic drop at 2am on a Tuesday is normal. A 50% traffic drop at 2pm on a weekday is an outage.
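One way to encode that intuition is to alert on the drop relative to a time-matched baseline (same hour, prior week) rather than an absolute floor. A sketch... the function name and threshold are mine, not from any monitoring library:

```typescript
// Baseline-relative traffic check: compare current throughput to the same
// hour from a prior week instead of a fixed request-rate floor.
function isTrafficAnomalous(
  currentRps: number,
  baselineRps: number,
  dropThreshold = 0.5,
): boolean {
  if (baselineRps === 0) return false; // no baseline, no signal
  return (baselineRps - currentRps) / baselineRps >= dropThreshold;
}

isTrafficAnomalous(60, 1000); // true: 94% below the weekday-afternoon baseline
isTrafficAnomalous(60, 80);   // false: quiet hour measured against a quiet baseline
```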
### 3. Errors
Error rate as a percentage of total traffic. Types of errors matter:
- 4xx (Client errors): Usually not your fault, but track spikes
- 5xx (Server errors): Your fault. Alert at 1% error rate
- Timeouts: Often worse than crashes... they tie up resources
```yaml
# Prometheus alerting rule
groups:
  - name: api-errors
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
          > 0.01
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1% for 2 minutes"
          description: "{{ $value | humanizePercentage }} of requests returning 5xx"
```
### 4. Saturation
How full is your system? This is about capacity headroom:
- CPU utilization: Alert at 80%, page at 90%
- Memory utilization: Alert at 85%, page at 95%
- Connection pool usage: Alert at 70%, page at 85%
- Disk I/O: Alert when queue depth exceeds core count
Saturation alerts are predictive. High CPU usage at 2am gives you time to scale before the 9am traffic spike hits.
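The threshold list above reduces to a small lookup. A sketch... the numbers mirror the list, and they're starting points to tune per workload, not universal constants:

```typescript
type Severity = "ok" | "warn" | "page";

// Map a utilization reading (0..1) to a severity using per-resource thresholds.
const saturationThresholds: Record<string, { warn: number; page: number }> = {
  cpu: { warn: 0.8, page: 0.9 },
  memory: { warn: 0.85, page: 0.95 },
  connection_pool: { warn: 0.7, page: 0.85 },
};

function saturationSeverity(resource: string, utilization: number): Severity {
  const t = saturationThresholds[resource];
  if (!t) return "ok"; // unknown resource: don't page on it
  if (utilization >= t.page) return "page";
  if (utilization >= t.warn) return "warn";
  return "ok";
}

saturationSeverity("connection_pool", 0.75); // "warn"
saturationSeverity("cpu", 0.92);             // "page"
```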
## OpenTelemetry Instrumentation
OpenTelemetry is the standard. Vendor-specific SDKs lock you in. I've helped three companies migrate away from proprietary instrumentation... it's painful and expensive.
### Node.js/Next.js Setup

```typescript
// instrumentation.ts (Next.js 14+)
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { OTLPMetricExporter } from "@opentelemetry/exporter-metrics-otlp-http";
import { PeriodicExportingMetricReader } from "@opentelemetry/sdk-metrics";
import { Resource } from "@opentelemetry/resources";
import { SemanticResourceAttributes } from "@opentelemetry/semantic-conventions";

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: "my-saas-api",
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.GIT_SHA || "unknown",
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV,
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT + "/v1/traces",
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT + "/v1/metrics",
    }),
    exportIntervalMillis: 15000, // 15-second resolution
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      "@opentelemetry/instrumentation-fs": { enabled: false }, // too noisy
      "@opentelemetry/instrumentation-http": {
        ignoreIncomingPaths: ["/health", "/ready", "/metrics"],
      },
    }),
  ],
});

sdk.start();

process.on("SIGTERM", () => {
  sdk
    .shutdown()
    .then(() => console.log("Tracing terminated"))
    .catch((error) => console.error("Error terminating tracing", error))
    .finally(() => process.exit(0));
});
```
### Custom Business Metrics

Auto-instrumentation catches HTTP and database calls. Business metrics require manual instrumentation:

```typescript
// metrics/business.ts
import { metrics } from "@opentelemetry/api";

const meter = metrics.getMeter("business-metrics");

// Counter for events
const ordersCreated = meter.createCounter("orders_created_total", {
  description: "Total number of orders created",
});

// Histogram for values
const orderValue = meter.createHistogram("order_value_dollars", {
  description: "Value of orders in dollars",
  unit: "dollars",
});

// Gauge for current state
const activeSubscriptions = meter.createObservableGauge("active_subscriptions", {
  description: "Number of active subscriptions",
});

// Usage
export function recordOrderCreated(order: Order) {
  ordersCreated.add(1, {
    plan: order.plan,
    source: order.source,
  });
  orderValue.record(order.total, {
    plan: order.plan,
    currency: order.currency,
  });
}

// For gauges, register a callback
activeSubscriptions.addCallback(async (result) => {
  const count = await db.subscription.count({ where: { status: "active" } });
  result.observe(count, { tier: "all" });
});
```
## Alert Design: Avoiding Alert Fatigue
I've audited on-call setups where engineers received 200+ alerts weekly. Actual incidents: 3. That's a 98.5% false positive rate. Those engineers were burned out within 6 months.
### Symptom-Based vs. Cause-Based Alerts
- **Cause-based (wrong):** "CPU above 80%"
- **Symptom-based (right):** "P99 latency above 500ms"
High CPU might mean your service is successfully handling a traffic spike. High latency means users are suffering. Alert on user impact, investigate causes.
### Alert Hierarchy

```yaml
# Three severity levels are enough
groups:
  - name: api-availability
    rules:
      # Critical: page immediately
      - alert: APIDown
        expr: probe_success{job="api-blackbox"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "API is unreachable"
          runbook_url: "https://runbooks.internal/api-down"
      # Warning: notify Slack, review during business hours
      - alert: HighLatency
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 500ms"
      # Info: dashboard only, no notification
      - alert: ElevatedErrorRate
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
          > 0.005
        for: 5m
        labels:
          severity: info
        annotations:
          summary: "Error rate slightly elevated (above 0.5%)"
```
### The 5-Minute Rule
Don't alert on instantaneous spikes. Require conditions to persist:
- Critical alerts: 1-2 minutes (real outages are sustained)
- Warning alerts: 5 minutes (filter transient noise)
- Info alerts: 10 minutes (trend indicators only)
A 30-second CPU spike during deployment isn't an incident. A 5-minute CPU spike during steady state might be.
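For anyone hand-rolling checks outside Prometheus, the persistence rule reduces to a small state machine... the equivalent of a `for:` clause. A sketch (class and method names are mine):

```typescript
// The alert fires only after the condition has held continuously for the
// hold duration. A single healthy sample resets the clock, which is exactly
// how a 30-second deployment spike gets filtered out.
class PendingAlert {
  private breachedSince: number | null = null;
  constructor(private holdMs: number) {}

  check(conditionMet: boolean, nowMs: number): boolean {
    if (!conditionMet) {
      this.breachedSince = null; // spike didn't persist: reset
      return false;
    }
    this.breachedSince ??= nowMs;
    return nowMs - this.breachedSince >= this.holdMs;
  }
}

const latencyAlert = new PendingAlert(5 * 60_000); // warning tier: 5 minutes
latencyAlert.check(true, 0);        // false: breach just started
latencyAlert.check(false, 60_000);  // false: recovered, timer resets
latencyAlert.check(true, 120_000);  // false: breach restarts from scratch
latencyAlert.check(true, 420_000);  // true: 5 continuous minutes of breach
```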
## On-Call Setup
### Rotation Design
For teams of 4-8 engineers, a weekly rotation works. For larger teams, use sub-teams with 2-week rotations.
```yaml
# PagerDuty schedule example
schedules:
  primary:
    type: weekly
    rotation_virtual_start: "2026-01-06T09:00:00-05:00" # Monday 9am EST
    users:
      - alice@company.com
      - bob@company.com
      - carol@company.com
      - dave@company.com
  secondary:
    type: weekly
    rotation_virtual_start: "2026-01-13T09:00:00-05:00" # Offset by one week
    users:
      - alice@company.com
      - bob@company.com
      - carol@company.com
      - dave@company.com
escalation_policy:
  - targets:
      - schedule: primary
    escalation_timeout: 10 # minutes
  - targets:
      - schedule: secondary
    escalation_timeout: 15
  - targets:
      - user: engineering-manager@company.com
```
### Escalation Policies
- Primary on-call: 10 minutes to acknowledge
- Secondary on-call: 15 minutes (if primary doesn't respond)
- Engineering manager: 20 minutes (if both fail)
- CTO: 30 minutes (nuclear option)
If an alert escalates to the CTO, something is broken beyond the incident itself.
## Runbook Template
Every alert should link to a runbook. Here's the template I use:
````markdown
# Runbook: High Error Rate

## Alert Condition

Error rate > 1% for 2+ minutes.

## Impact

Users experiencing failed requests. API returning 5xx errors.

## Immediate Actions

1. Check deployment status: did we just deploy?

   ```bash
   kubectl rollout history deployment/api
   ```

2. Check database connectivity:

   ```bash
   curl -s https://api.internal/health | jq .database
   ```

3. Check downstream dependencies:

   ```bash
   curl -s https://api.internal/health | jq .dependencies
   ```

## Diagnostic Commands

```bash
# Error distribution by endpoint
kubectl logs -l app=api --since=5m | grep "level\":\"error" | jq .path | sort | uniq -c

# Recent deployments
kubectl get pods -o wide | grep api

# Database connection pool
kubectl exec -it deployment/api -- node -e "require('./db').pool.totalCount"
```

## Mitigation Steps

If a recent deployment caused it:

```bash
kubectl rollout undo deployment/api
```

If database connection issues:

- Check connection pool exhaustion
- Restart pods if pool is corrupted
- Scale up if load is legitimate

If downstream dependency failure:

- Enable circuit breaker
- Return cached/degraded response
- Notify dependency team

## Escalation Criteria

- Error rate > 5% for 5 minutes: escalate to secondary
- Complete outage: escalate to engineering manager
- Data corruption suspected: escalate to CTO immediately

## Post-Incident

- Create incident ticket
- Update this runbook if steps were missing
- Schedule blameless postmortem
````
---
## Incident Response Framework
I've been through enough 2am incidents to know that having a framework matters. Panic is the enemy of resolution.
### The DTMRR Framework
- **Detect:** Alert fires, human acknowledges
- **Triage:** Assess severity, determine blast radius
- **Mitigate:** Stop the bleeding, restore service
- **Resolve:** Fix root cause properly
- **Review:** Blameless postmortem, improve systems
### Detection (Minutes 0-5)
1. Acknowledge the alert
2. Join the incident channel (Slack: `#incident-YYYYMMDD-NNN`)
3. Announce you're investigating
4. Open relevant dashboards
```
@here Incident declared. High error rate on API.
Status: Investigating
Impact: Estimated 5% of requests failing
Incident Commander: @alice
```
### Triage (Minutes 5-15)
Ask three questions:
1. **What's broken?** (Specific symptom)
2. **Who's affected?** (Blast radius)
3. **When did it start?** (Correlate with changes)
Severity levels:
| Level | Definition | Response Time |
| ----- | ------------------------------------ | ------------- |
| SEV1 | Complete outage, all users affected | All hands |
| SEV2 | Major feature down, many users | On-call + 1 |
| SEV3 | Minor feature impacted, some users | On-call only |
| SEV4 | Degraded but functional, few users notice | Next business day |
### Mitigation (Minutes 15-60)
This is not debugging. This is restoring service.
**Common mitigation actions:**
- Rollback recent deployment
- Scale up infrastructure
- Enable circuit breakers
- Failover to secondary region
- Disable problematic feature flag
- Restart affected services
The goal is mean time to restore (MTTR), not mean time to root cause.
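Of the mitigation actions above, the circuit breaker is the one teams most often wish they had mid-incident. A minimal sketch... the state machine is simplified to closed/open with a probe after cooldown, and production libraries (opossum in Node, for instance) add half-open tracking and metrics:

```typescript
// After N consecutive failures, stop calling the dependency and fail fast
// (or serve a degraded response) until a cooldown elapses.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  constructor(private maxFailures: number, private cooldownMs: number) {}

  canCall(nowMs: number): boolean {
    if (this.openedAt === null) return true;
    if (nowMs - this.openedAt >= this.cooldownMs) {
      this.openedAt = null; // cooldown over: let a probe request through
      this.failures = 0;
      return true;
    }
    return false; // open: fail fast instead of tying up resources
  }

  recordSuccess(): void {
    this.failures = 0;
  }

  recordFailure(nowMs: number): void {
    this.failures += 1;
    if (this.failures >= this.maxFailures) this.openedAt = nowMs;
  }
}

const breaker = new CircuitBreaker(3, 30_000);
breaker.recordFailure(0);
breaker.recordFailure(100);
breaker.recordFailure(200); // third consecutive failure: circuit opens
breaker.canCall(5_000);     // false: failing fast during cooldown
breaker.canCall(31_000);    // true: cooldown elapsed, probe allowed
```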
### Resolution (Hours to Days)
After service is restored, fix the underlying issue properly. This might involve:
- Code fix with proper testing
- Infrastructure changes
- Dependency updates
- Architecture improvements
### Review (Within 48 Hours)
Blameless postmortem template:
```markdown
# Incident Postmortem: [Date] [Brief Title]
## Summary
- Duration: [Start time] to [End time] (X hours, Y minutes)
- Severity: SEV[N]
- Impact: [User-facing impact]
- Root Cause: [One sentence]
## Timeline
- HH:MM Alert fired
- HH:MM Engineer acknowledged
- HH:MM Root cause identified
- HH:MM Mitigation applied
- HH:MM Service restored
## Root Cause Analysis
[Detailed technical explanation]
## What Went Well
- [Thing that worked]
- [Thing that worked]
## What Went Poorly
- [Thing that failed or was missing]
- [Thing that failed or was missing]
## Action Items
| Action | Owner | Due Date |
| ------ | ----- | -------- |
| Add missing metric | @alice | 2026-02-01 |
| Update runbook | @bob | 2026-02-03 |
| Implement circuit breaker | @carol | 2026-02-15 |
## Lessons Learned
[One paragraph summary]
```
## Tool Stack by Company Stage
Observability tooling should match your scale and budget. I've seen seed-stage startups blow $3,000/month on Datadog for a system that could run on $200/month of tooling.
### Seed Stage ($0-50K MRR)
Stack:
- Error Tracking: Sentry ($26/month)
- Uptime Monitoring: Checkly or Better Uptime ($20/month)
- Alerting: PagerDuty free tier (up to 5 users)
- Logs: Console logs → CloudWatch/Vercel (included)
- APM: None yet... premature optimization
Total: ~$50/month
Focus on knowing when you're down and capturing errors. Deep observability can wait.
### Series A ($50K-500K MRR)
Stack:
- Metrics + Logs + Traces: Grafana Cloud ($50-200/month)
- Error Tracking: Sentry Team ($80/month)
- Uptime + API Monitoring: Checkly ($40/month)
- Alerting: PagerDuty Professional ($25/user/month)
- Status Page: Instatus or Statuspage ($29/month)
Total: ~$300-500/month
At this stage, you need dashboards, historical data, and proper on-call rotation.
### Series B+ ($500K+ MRR)
**Option A: Managed (Datadog)**
- Full Datadog suite: $1,500-5,000/month
- Pros: Everything integrated, great UX
- Cons: Expensive, vendor lock-in
**Option B: Self-Managed (OpenTelemetry + OSS)**
- OpenTelemetry Collector → Grafana Mimir/Tempo/Loki
- Hosted on Kubernetes
- Pros: No vendor lock-in, unlimited scale
- Cons: Requires dedicated SRE time
My recommendation: Datadog until your monthly bill exceeds the cost of an SRE (roughly $15K/month in tooling = one headcount). Then evaluate self-hosting.
## The Cold Start Problem
Monitoring systems have their own reliability requirements. I've seen companies discover their alerting was down... after a 4-hour outage.
### Monitor Your Monitoring
- Heartbeat checks: Send a test alert every hour
- Alert delivery verification: Use a separate channel (SMS backup)
- Dashboard access: Can you access Grafana during an AWS outage?
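The heartbeat check reduces to a staleness test... a dead man's switch, run by a system outside your main stack (different provider, ideally a different cloud). A sketch with an illustrative grace factor:

```typescript
// Dead man's switch: the monitoring stack emits a heartbeat on a fixed
// interval; a separate system alerts when heartbeats stop arriving.
// Silence from your alerting pipeline is itself an alert.
function heartbeatIsStale(
  lastHeartbeatMs: number,
  nowMs: number,
  intervalMs: number,
  graceFactor = 2, // tolerate one missed beat before paging
): boolean {
  return nowMs - lastHeartbeatMs > intervalMs * graceFactor;
}

const hourMs = 3_600_000;
heartbeatIsStale(0, hourMs, hourMs);     // false: one interval late is within grace
heartbeatIsStale(0, 3 * hourMs, hourMs); // true: two full intervals missed, page someone
```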
### Multi-Provider Strategy
For critical alerting, use two providers:
- Primary: PagerDuty
- Backup: Opsgenie or plain SMS via Twilio
If PagerDuty is down during your incident, you need a fallback.
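The fallback logic itself is simple; the discipline is wiring and testing a second delivery path. A sketch where the `Pager` functions are stand-ins for real integrations (PagerDuty API, Twilio SMS), not actual SDK calls:

```typescript
// Try each delivery provider in order; a throw (or timeout) on the primary
// falls through to the backup. Only give up when every channel has failed.
type Pager = (message: string) => Promise<void>;

async function pageWithFallback(message: string, providers: Pager[]): Promise<string> {
  const errors: string[] = [];
  for (const [i, send] of providers.entries()) {
    try {
      await send(message);
      return `delivered via provider ${i}`;
    } catch (err) {
      errors.push(String(err)); // record and keep trying down the chain
    }
  }
  throw new Error(`all pager providers failed: ${errors.join("; ")}`);
}

// Stand-ins: the primary is unreachable, the SMS backup succeeds.
const primary: Pager = async () => { throw new Error("PagerDuty unreachable"); };
const backup: Pager = async () => { /* deliver via SMS */ };
pageWithFallback("SEV1: API down", [primary, backup]); // resolves via provider 1
```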
## What Actually Matters
I've built observability for systems with 99.95% uptime over 12 months. The difference between mediocre and excellent reliability isn't tooling... it's discipline:
- Alert on symptoms, not causes... High CPU isn't an incident; high latency is
- Every alert needs a runbook... If you can't write remediation steps, you can't alert on it
- Page only for user impact... Everything else is a Slack notification
- Measure what matters to customers... Latency, errors, availability
- Practice incident response... Game days reveal process gaps before real incidents
The goal isn't zero downtime. It's predictable, managed downtime with fast recovery.
For deeper dives into related topics, see my posts on the Lambda tax and cold starts, architecting SaaS for scale, and hunting Node.js memory leaks.
## Next Steps
If you're building a SaaS and want to get observability right from the start... or you're struggling with alert fatigue and unclear incident response... I work with startups as a technical advisor. The cost of getting this wrong is measured in customer churn and engineer burnout. The cost of getting it right is a few hours of architectural guidance.
## Continue Reading
This post is part of the SaaS Architecture Decision Framework ... covering multi-tenancy, deployment models, database scaling, and cost optimization from MVP to $1M ARR.
### More in This Series
- The $500K Architecture Mistake ... Why microservices aren't the answer
- Multi-Tenancy with Prisma & RLS ... Database isolation patterns
- Zero to 10K MRR SaaS Playbook ... Early-stage architecture
- Boring Technology Wins ... Technology selection philosophy
Ready to make better architecture decisions? Work with me on your SaaS architecture.
