TL;DR
The median SaaS company monitors uptime and average response time. Both are nearly useless for catching real performance problems. Averages hide the pain: your p50 is 200ms but your p99 is 4 seconds, and the 1% of requests hitting that 4-second response often come from your highest-paying enterprise customers on complex dashboards. The monitoring stack that actually works has four parts: percentile-based latency tracking (p50, p95, p99) per endpoint, real user monitoring (RUM) with device and network segmentation, error budget burn rate alerting (not static thresholds), and distributed tracing that follows a request across all services. I helped one team cut their MTTR from 45 minutes to 8 minutes by switching from average-based monitoring to percentile-based alerting with error budgets.
Part of the Performance Engineering Playbook, a comprehensive guide to building systems that stay fast under real-world load.
Why Averages Lie
A SaaS application serving 10,000 requests per minute with a 200ms average response time sounds healthy. But the distribution matters more than the average.
| Percentile | Response Time | Requests per Minute Slower Than This | Experience |
|---|---|---|---|
| p50 | 120ms | 5,000 | Fast |
| p75 | 250ms | 2,500 | Acceptable |
| p90 | 800ms | 1,000 | Noticeable |
| p95 | 2,500ms | 500 | Frustrating |
| p99 | 8,000ms | 100 | Broken |
That "200ms average" hides 100 requests per minute that take 8 seconds or longer. Those 100 requests aren't random: they correlate with specific user segments such as enterprise customers with large datasets, users on mobile networks, or requests that hit unoptimized query paths.
The average tells you the system is fine. The percentiles tell you 1% of your users are having a terrible experience, and those users are often the ones paying the most.
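To make the gap between the mean and the tail concrete, here is a minimal sketch that computes both from one sample. The sample data is hypothetical (98 fast requests and 2 very slow ones), and `percentile` uses the simple nearest-rank method, which is only one of several percentile definitions.

```typescript
// Nearest-rank percentile over an ascending-sorted array (p between 0 and 100).
function percentile(sortedMs: number[], p: number): number {
  const rank = Math.ceil((p * sortedMs.length) / 100);
  return sortedMs[Math.max(rank, 1) - 1];
}

function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Hypothetical sample: 98 requests at 120ms and 2 requests at 8,000ms.
const latenciesMs: number[] = [...new Array<number>(98).fill(120), 8000, 8000];
latenciesMs.sort((a, b) => a - b);

const summary = {
  mean: mean(latenciesMs),
  p50: percentile(latenciesMs, 50),
  p99: percentile(latenciesMs, 99),
};
console.log(summary);
```

In this sample the mean lands around 278ms, which looks tolerable on a dashboard, while the p99 is 8,000ms.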
The Four Pillars of Performance Monitoring
Pillar 1: Percentile-Based Latency
Track p50, p95, and p99 for every endpoint. Alert on p99 because that's where problems surface first.
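// Histogram-based latency tracking
import type { NextFunction, Request, Response } from "express";
import { Histogram } from "prom-client";

const httpDuration = new Histogram({
  name: "http_request_duration_seconds",
  help: "Duration of HTTP requests in seconds",
  labelNames: ["method", "route", "status_code"],
  // Bucket upper bounds in seconds; keep one at your SLO threshold (1s here)
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});

// Middleware
function metricsMiddleware(req: Request, res: Response, next: NextFunction) {
  const end = httpDuration.startTimer({ method: req.method });
  res.on("finish", () => {
    // req.route is only set once Express has matched a route, so read it here
    // rather than when the middleware first runs. A fixed fallback label
    // (instead of the raw req.path) keeps label cardinality bounded.
    end({
      route: req.route?.path ?? "unmatched",
      status_code: res.statusCode,
    });
  });
  next();
}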
Why histograms, not summaries: Histograms are aggregatable across instances. If you have 10 application servers, you can calculate the global p99 by summing their bucket counts and estimating the quantile from the combined histogram. You can't do that with pre-computed summaries: each instance computes its own p99, and the average of 10 p99s is not the global p99.
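A small sketch of that difference, using two hypothetical instances with made-up bucket counts. `estimateQuantile` interpolates linearly inside the target bucket, which is similar in spirit to what Prometheus's `histogram_quantile` does; it is an illustration, not a reimplementation of it.

```typescript
// Cumulative histogram: counts[i] is the number of observations <= bounds[i].
// The last bound is +Infinity, so the last count is the total.
interface CumulativeHistogram {
  bounds: number[];
  counts: number[];
}

// Histograms with identical bounds merge by adding bucket counts.
function mergeHistograms(hs: CumulativeHistogram[]): CumulativeHistogram {
  const bounds = hs[0].bounds;
  const counts = bounds.map((_, i) => hs.reduce((sum, h) => sum + h.counts[i], 0));
  return { bounds, counts };
}

// Estimate a quantile (0 < q <= 1) by linear interpolation within its bucket.
function estimateQuantile(q: number, h: CumulativeHistogram): number {
  const total = h.counts[h.counts.length - 1];
  if (total === 0) return NaN;
  const rank = q * total;
  const i = h.counts.findIndex((c) => c >= rank);
  // In the +Inf bucket there is no upper bound to interpolate toward.
  if (h.bounds[i] === Infinity) return h.bounds[i - 1];
  const lowerBound = i === 0 ? 0 : h.bounds[i - 1];
  const lowerCount = i === 0 ? 0 : h.counts[i - 1];
  const inBucket = h.counts[i] - lowerCount;
  return lowerBound + (h.bounds[i] - lowerBound) * ((rank - lowerCount) / inBucket);
}

const bounds = [0.1, 0.5, 1, 5, Infinity];
// Hypothetical instances: one healthy, one struggling (1,000 requests each).
const fastInstance: CumulativeHistogram = { bounds, counts: [990, 1000, 1000, 1000, 1000] };
const slowInstance: CumulativeHistogram = { bounds, counts: [500, 800, 900, 1000, 1000] };

const averagedP99 =
  (estimateQuantile(0.99, fastInstance) + estimateQuantile(0.99, slowInstance)) / 2;
const globalP99 = estimateQuantile(0.99, mergeHistograms([fastInstance, slowInstance]));

console.log({ averagedP99, globalP99 });
```

With these numbers, averaging the two per-instance p99s gives roughly 2.35s, while the merged histogram puts the global p99 at roughly 4.2s. The averaged figure understates the tail by almost half.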
Pillar 2: Real User Monitoring (RUM)
Synthetic monitoring tells you what the experience could be. RUM tells you what the experience is.
// Client-side RUM collection
import { onLCP, onINP, onCLS, onTTFB } from "web-vitals";

interface RUMPayload {
  metric: string;
  value: number;
  page: string;
  connection: string;
  device: string;
  viewport: string;
  timestamp: number;
}

function collectRUM(metric: { name: string; value: number }) {
  const payload: RUMPayload = {
    metric: metric.name,
    value: metric.value,
    page: window.location.pathname,
    // Network Information API; not available in every browser
    connection: (navigator as any).connection?.effectiveType || "unknown",
    device: /Mobile|Android/.test(navigator.userAgent) ? "mobile" : "desktop",
    viewport: `${window.innerWidth}x${window.innerHeight}`,
    timestamp: Date.now(),
  };
  // Send with the Beacon API, which survives page navigation.
  // Each metric goes out as its own request; batch them if volume matters.
  navigator.sendBeacon("/api/rum", JSON.stringify(payload));
}

onLCP(collectRUM);
onINP(collectRUM);
onCLS(collectRUM);
onTTFB(collectRUM);
The power of RUM is segmentation. When p99 INP spikes to 800ms, you can filter by connection type and discover the problem only affects users on 3G networks. Or filter by page and find that /dashboard/analytics has 3x worse LCP than every other page.
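On the server side, segmentation is a group-by followed by a percentile. The sketch below assumes the collected payloads have already been stored and loaded into memory. `RumSample`, `percentileByConnection`, and the sample values are all hypothetical, and a real pipeline would more likely do this in the analytics store than in application code.

```typescript
interface RumSample {
  metric: string;
  value: number;
  connection: string;
}

// Nearest-rank percentile over an ascending-sorted array (p between 0 and 100).
function nearestRank(sorted: number[], p: number): number {
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Group samples of one metric by connection type and compute a percentile per group.
function percentileByConnection(
  samples: RumSample[],
  metric: string,
  p: number
): Map<string, number> {
  const groups = new Map<string, number[]>();
  for (const s of samples) {
    if (s.metric !== metric) continue;
    const values = groups.get(s.connection) ?? [];
    values.push(s.value);
    groups.set(s.connection, values);
  }
  const result = new Map<string, number>();
  groups.forEach((values, connection) => {
    values.sort((a, b) => a - b);
    result.set(connection, nearestRank(values, p));
  });
  return result;
}

// Hypothetical INP samples in milliseconds, plus one LCP sample that is ignored.
const rumSamples: RumSample[] = [
  { metric: "INP", value: 80, connection: "4g" },
  { metric: "INP", value: 90, connection: "4g" },
  { metric: "INP", value: 100, connection: "4g" },
  { metric: "INP", value: 120, connection: "4g" },
  { metric: "INP", value: 400, connection: "3g" },
  { metric: "INP", value: 700, connection: "3g" },
  { metric: "INP", value: 800, connection: "3g" },
  { metric: "INP", value: 900, connection: "3g" },
  { metric: "LCP", value: 2400, connection: "3g" },
];

const inpP75 = percentileByConnection(rumSamples, "INP", 75);
console.log(inpP75.get("4g"), inpP75.get("3g"));
```

In this sample the p75 INP is 100ms on 4g and 800ms on 3g, which is exactly the kind of split a single global percentile hides.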
Pillar 3: Error Budget Burn Rate
Static thresholds ("alert when p99 > 1 second") generate too many alerts. A temporary spike during a deploy is different from a sustained degradation.
Error budgets flip the model: define how much unreliability is acceptable over a window, then alert when you're consuming that budget too fast.
# SLO definition: 99.9% of requests complete within 1 second
# Error budget: 0.1% of requests (the equivalent of 43.2 minutes in a 30-day month)
# Prometheus alerting rules
groups:
  - name: slo-burn-rate
    rules:
      - alert: HighBurnRate
        # Burn rate = observed slow-request ratio / budgeted ratio (0.001)
        expr: |
          (
            1 - (
              sum(rate(http_request_duration_seconds_bucket{le="1"}[5m]))
              /
              sum(rate(http_request_duration_seconds_count[5m]))
            )
          ) / 0.001 > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "SLO burn rate exceeds budget"
          description: "Error budget being consumed at {{ $value | humanize }}x the sustainable rate"
      - alert: CriticalBurnRate
        expr: |
          (
            1 - (
              sum(rate(http_request_duration_seconds_bucket{le="1"}[1m]))
              /
              sum(rate(http_request_duration_seconds_count[1m]))
            )
          ) / 0.001 > 10
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "SLO burn rate critical - 10x budget consumption"
The magic of burn rate alerting: it automatically adjusts for traffic patterns. A 2-second spike during a 1,000 RPS period is significant. The same spike during a 10 RPS period at 3 AM is noise. Burn rate alerting distinguishes between the two.
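The arithmetic behind a burn rate is short enough to write out. This is a minimal sketch assuming a 30-day SLO window and a request-based SLO; the function names and sample counts are hypothetical.

```typescript
// Burn rate: how many times faster than sustainable the error budget is being spent.
// 1 means the budget lasts exactly the SLO window; 10 means it lasts a tenth of it.
function burnRate(goodRequests: number, totalRequests: number, sloTarget: number): number {
  if (totalRequests === 0) return 0; // no traffic, no budget spent
  const badRatio = 1 - goodRequests / totalRequests;
  return badRatio / (1 - sloTarget);
}

// Hours until the whole window's budget is gone at a constant burn rate.
function hoursToExhaustion(rate: number, windowDays: number = 30): number {
  return rate > 0 ? (windowDays * 24) / rate : Infinity;
}

// Time-equivalent of the budget over the window, in minutes.
function budgetMinutes(sloTarget: number, windowDays: number = 30): number {
  return windowDays * 24 * 60 * (1 - sloTarget);
}

// Hypothetical window: 9,900 of 10,000 requests finished within the threshold.
const sampleBurnRate = burnRate(9900, 10000, 0.999);
const sampleHoursLeft = hoursToExhaustion(sampleBurnRate);
const monthlyBudgetMinutes = budgetMinutes(0.999);

console.log({ sampleBurnRate, sampleHoursLeft, monthlyBudgetMinutes });
```

Here 1% of requests missed a 99.9% target, so the burn rate is about 10 and a full 30-day budget would be gone in about 72 hours. The 43.2-minute figure falls out of the same constants.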
Pillar 4: Distributed Tracing
When a request touches 3+ services, you need to trace the full journey to identify bottlenecks.
// OpenTelemetry instrumentation
import { trace, SpanStatusCode } from "@opentelemetry/api";

// `db`, `paymentService`, and the `Order` type come from the application.
const tracer = trace.getTracer("order-service");

async function processOrder(orderId: string): Promise<Order> {
  return tracer.startActiveSpan("processOrder", async (span) => {
    try {
      span.setAttribute("order.id", orderId);
      // Each sub-operation gets its own span
      const order = await tracer.startActiveSpan("fetchOrder", async (childSpan) => {
        try {
          const result = await db.query("SELECT * FROM orders WHERE id = $1", [orderId]);
          childSpan.setAttribute("db.rows_returned", result.rowCount ?? 0);
          return result.rows[0];
        } finally {
          childSpan.end(); // End the span even if the query throws
        }
      });
      const payment = await tracer.startActiveSpan("processPayment", async (childSpan) => {
        try {
          childSpan.setAttribute("payment.amount", order.total);
          return await paymentService.charge(order.total, order.customerId);
        } finally {
          childSpan.end(); // End the span even if the charge fails
        }
      });
      span.setStatus({ code: SpanStatusCode.OK });
      return { ...order, paymentId: payment.id };
    } catch (error) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: error instanceof Error ? error.message : "Unknown error",
      });
      throw error;
    } finally {
      span.end();
    }
  });
}
A single trace shows that the processOrder operation took 1,200ms: 800ms in processPayment (external API) and 350ms in fetchOrder (missing index). Without tracing, you'd see "order endpoint is slow" with no idea which component to optimize.
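Reading a trace this way boils down to two questions: which child span is slowest, and how much of the parent's time is not covered by any child. A minimal sketch using the numbers above, where `SpanSummary` is a hypothetical flattened view of exported span data and the children are assumed to run one after another:

```typescript
interface SpanSummary {
  name: string;
  durationMs: number;
}

// The child span that contributes the most latency.
function slowestSpan(spans: SpanSummary[]): SpanSummary {
  return spans.reduce((worst, s) => (s.durationMs > worst.durationMs ? s : worst));
}

// Parent time not covered by any child span.
// Only meaningful when the children run sequentially and do not overlap.
function unattributedMs(parentMs: number, children: SpanSummary[]): number {
  return parentMs - children.reduce((sum, s) => sum + s.durationMs, 0);
}

const childSpans: SpanSummary[] = [
  { name: "fetchOrder", durationMs: 350 },
  { name: "processPayment", durationMs: 800 },
];

const bottleneck = slowestSpan(childSpans);
const overheadMs = unattributedMs(1200, childSpans);

console.log(bottleneck.name, overheadMs);
```

For this trace the bottleneck is processPayment, and only 50ms of the 1,200ms is spent outside the two child spans, so optimizing processOrder's own code would not help.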
Alerting That Doesn't Cause Alert Fatigue
The average SaaS team has 50-200 active alerts. The on-call engineer ignores 90% of them because they're noise. When a real problem fires, it's lost in the noise.
The Three-Tier Alert Model
| Tier | Response | SLA | Example |
|---|---|---|---|
| P1: Page | Wake someone up | 5 min response | Error budget burned 50% in 1 hour |
| P2: Ticket | Next business day | 24 hours | p99 latency above SLO for 30 minutes |
| P3: Dashboard | Weekly review | None | Cache hit rate dropped below 90% |
The rule: If an alert fires and nobody needs to take action, it shouldn't be an alert. Move it to a dashboard.
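The P1 example in the table can be restated as a burn-rate threshold, which is the form an alerting rule needs. The sketch below assumes a 30-day SLO window; the function name is hypothetical.

```typescript
// Burn rate that spends `fractionBurned` of the whole budget within `alertWindowHours`.
function burnRateForBudgetFraction(
  fractionBurned: number,
  alertWindowHours: number,
  sloWindowDays: number = 30
): number {
  const sloWindowHours = sloWindowDays * 24;
  return fractionBurned * (sloWindowHours / alertWindowHours);
}

// "Error budget burned 50% in 1 hour" over a 30-day window
const p1BurnRate = burnRateForBudgetFraction(0.5, 1);
console.log(p1BurnRate);
```

Burning half of a 30-day budget in one hour corresponds to a burn rate of 360, which means that page only fires during a severe outage. Teams that want earlier pages usually pick a smaller budget fraction for the same window.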
Alert Deduplication
# Alertmanager configuration
route:
  group_by: ["alertname", "service"]
  group_wait: 30s      # Wait for related alerts to arrive before the first notification
  group_interval: 5m   # Wait 5 minutes before notifying about new alerts in an existing group
  repeat_interval: 4h  # Re-send every 4 hours if unresolved
inhibit_rules:
  # If the database is down, suppress warning-level alerts in the same environment
  - source_match:
      alertname: "DatabaseDown"
    target_match:
      severity: "warning"
    equal: ["environment"]
Inhibition rules are critical for preventing alert storms. When the database goes down, every service that depends on it fires an alert. Without inhibition, the on-call engineer gets 50 notifications instead of 1.
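To see what the rule above actually suppresses, here is a simplified model of inhibition matching. It is an illustration of the matching logic only, not Alertmanager's implementation, and the alert names other than DatabaseDown are hypothetical.

```typescript
interface FiringAlert {
  labels: Record<string, string>;
}

interface InhibitRule {
  sourceMatch: Record<string, string>;
  targetMatch: Record<string, string>;
  equal: string[];
}

// True when the alert carries every label/value pair in the matcher.
function matchesLabels(alert: FiringAlert, matcher: Record<string, string>): boolean {
  return Object.keys(matcher).every((key) => alert.labels[key] === matcher[key]);
}

// Return the alerts that would still notify after applying one inhibit rule.
function applyInhibition(alerts: FiringAlert[], rule: InhibitRule): FiringAlert[] {
  const sources = alerts.filter((a) => matchesLabels(a, rule.sourceMatch));
  return alerts.filter((alert) => {
    if (!matchesLabels(alert, rule.targetMatch)) return true;
    // Suppressed only if some other firing source agrees on every `equal` label.
    const inhibited = sources.some(
      (src) => src !== alert && rule.equal.every((l) => src.labels[l] === alert.labels[l])
    );
    return !inhibited;
  });
}

const databaseRule: InhibitRule = {
  sourceMatch: { alertname: "DatabaseDown" },
  targetMatch: { severity: "warning" },
  equal: ["environment"],
};

const firing: FiringAlert[] = [
  { labels: { alertname: "DatabaseDown", severity: "critical", environment: "prod" } },
  { labels: { alertname: "ApiLatencyHigh", severity: "warning", environment: "prod" } },
  { labels: { alertname: "QueueLagHigh", severity: "warning", environment: "prod" } },
  { labels: { alertname: "ApiLatencyHigh", severity: "warning", environment: "staging" } },
];

const stillNotifying = applyInhibition(firing, databaseRule);
console.log(stillNotifying.map((a) => `${a.labels.alertname} (${a.labels.environment})`));
```

The two prod warnings are suppressed. The staging warning still notifies because its environment label differs from the source alert's, and a critical downstream alert would also get through because the rule only targets warning severity.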
The Performance Dashboard
Build one dashboard that answers these questions without clicking into sub-pages:
- Is the system healthy right now? (Traffic light SLO status)
- How much error budget remains? (Burn rate gauge)
- Where is latency worst? (Top 5 endpoints by p99)
- What changed recently? (Deploy markers on latency graphs)
- Who is affected? (RUM segmentation by device/network)
# Grafana panels for the performance dashboard
# Panel 1: Current SLO status (gauge)
sum(rate(http_request_duration_seconds_bucket{le="1"}[1h]))
/
sum(rate(http_request_duration_seconds_count[1h]))
# Panel 2: Error budget remaining (stat)
1 - (
  (
    sum(increase(http_request_duration_seconds_count[30d]))
    - sum(increase(http_request_duration_seconds_bucket{le="1"}[30d]))
  )
  /
  (sum(increase(http_request_duration_seconds_count[30d])) * 0.001)
)
# Panel 3: p99 latency by endpoint (time series)
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route)
)
# Panel 4: Request rate with deploy markers (time series + annotations)
sum(rate(http_request_duration_seconds_count[1m])) by (route)
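Panel 2's expression is easier to sanity-check with plain numbers. This sketch mirrors the same formula (one minus slow requests divided by the number of slow requests the SLO allows); the function name and request counts are hypothetical.

```typescript
// Fraction of the error budget still unspent: 1 = untouched, 0 = exhausted, < 0 = overspent.
function errorBudgetRemaining(
  totalRequests: number,
  goodRequests: number,
  sloTarget: number
): number {
  if (totalRequests === 0) return 1; // nothing served, nothing spent
  const allowedBad = totalRequests * (1 - sloTarget);
  const actualBad = totalRequests - goodRequests;
  return 1 - actualBad / allowedBad;
}

// Hypothetical 30-day totals against a 99.9% target
const budgetLeft = errorBudgetRemaining(1000000, 999400, 0.999);
const budgetOverspent = errorBudgetRemaining(1000000, 998000, 0.999);

console.log({ budgetLeft, budgetOverspent });
```

With 600 slow requests out of an allowance of 1,000, about 40% of the budget remains. With 2,000 slow requests the value is about -1, meaning the budget was spent twice over, so the stat panel should be configured to display negative values rather than clamp at zero.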
Reducing MTTR from 45 Minutes to 8 Minutes
Mean Time to Resolution (MTTR) is the metric that matters most for operational excellence. A company I advised had an MTTR of 45 minutes. After implementing the monitoring stack above, MTTR dropped to 8 minutes. Here's what changed:
Before: Alert fires → Engineer logs into Grafana → Checks 5 dashboards → Reads 3 log streams → Identifies the problem → Deploys fix → Verifies resolution.
After: Alert fires with context (which endpoint, which percentile, which user segment) → Engineer checks the trace link in the alert → Identifies the slow span → Deploys fix → SLO gauge confirms resolution.
The difference: context in the alert and traces that follow the request. The engineer doesn't waste 30 minutes on investigation because the alert tells them where to look.
When to Apply This
- Your SaaS serves 1,000+ requests per minute and needs SLOs
- On-call engineers are overwhelmed by alert noise
- Customers report performance issues that your monitoring doesn't catch
- You're running 3+ services and need to trace requests across them
When NOT to Apply This
- You're a small team with a monolith: basic uptime monitoring and application logs are sufficient
- Your application has fewer than 100 concurrent users
- You're pre-revenue and optimizing for shipping speed, not reliability
