March 3, 2026 · 14 min read · infrastructure

Real-Time Performance Monitoring at Scale

Your monitoring dashboard says everything is fine. Your customers say the app is slow. Here's why the gap exists and how to build monitoring that catches performance degradation before users notice.

monitoring · observability · performance · saas · metrics

TL;DR

The median SaaS company monitors uptime and average response time. Both are nearly useless for catching real performance problems. Averages hide the pain... your p50 is 200ms but your p99 is 4 seconds, and the 1% of users hitting that 4-second response are your highest-paying enterprise customers on complex dashboards. The monitoring stack that actually works: percentile-based latency tracking (p50, p95, p99) per endpoint, real user monitoring (RUM) with device and network segmentation, error budget burn rate alerting (not static thresholds), and distributed tracing that follows a request across all services. I helped one team cut their MTTR from 45 minutes to under 8 minutes by switching from average-based monitoring to percentile-based alerting with error budgets.

Part of the Performance Engineering Playbook ... a comprehensive guide to building systems that stay fast under real-world load.


Why Averages Lie

A SaaS application serving 10,000 requests per minute with a 200ms average response time sounds healthy. But the distribution matters more than the average.

| Percentile | Response Time | Requests Affected (of 10,000/min) |
|------------|---------------|-----------------------------------|
| p50        | 120ms         | 5,000 requests ... fast           |
| p75        | 250ms         | 2,500 requests ... acceptable     |
| p90        | 800ms         | 1,000 requests ... noticeable     |
| p95        | 2,500ms       | 500 requests ... frustrating      |
| p99        | 8,000ms       | 100 requests ... broken           |

That "200ms average" hides 100 requests per minute that take 8 seconds. Those 100 requests aren't random... they correlate with specific user segments: enterprise customers with large datasets, users on mobile networks, or requests that hit unoptimized query paths.

The average tells you the system is fine. The percentiles tell you 1% of your users are having a terrible experience... and those users are often the ones paying the most.
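To make the gap concrete, here's a minimal sketch with a hypothetical latency sample shaped like the table above, showing how the average stays comfortable while the tail is broken:

```typescript
// Hypothetical latency sample (ms): mostly fast requests, small slow tail
const latencies: number[] = [
  ...Array(96).fill(120),  // 96% of requests are fast
  2500, 2500, 8000, 8000,  // the tail that averages hide
];

function average(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Nearest-rank percentile on a sorted copy
function percentile(xs: number[], p: number): number {
  const sorted = [...xs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

console.log(average(latencies).toFixed(0)); // "325" ... looks healthy
console.log(percentile(latencies, 50));     // 120 ... the median is fine
console.log(percentile(latencies, 99));     // 8000 ... the tail is broken
```

The average lands at 325ms, which most dashboards would paint green, while the p99 is a 25x worse number describing real users.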


The Four Pillars of Performance Monitoring

Pillar 1: Percentile-Based Latency

Track p50, p95, and p99 for every endpoint. Alert on p99 because that's where problems surface first.

```typescript
// Histogram-based latency tracking
import { Histogram } from "prom-client";
import type { Request, Response, NextFunction } from "express";

const httpDuration = new Histogram({
  name: "http_request_duration_seconds",
  help: "Duration of HTTP requests in seconds",
  labelNames: ["method", "route", "status_code"],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});

// Middleware: start a timer per request, record when the response finishes
function metricsMiddleware(req: Request, res: Response, next: NextFunction) {
  const end = httpDuration.startTimer({
    method: req.method,
    route: req.route?.path || req.path,
  });
  res.on("finish", () => {
    end({ status_code: res.statusCode });
  });
  next();
}
```

Why histograms, not summaries: Histograms are aggregatable across instances. If you have 10 application servers, you can calculate the global p99 from histograms. You can't do that with pre-computed summaries... each instance computes its own p99, and the average of 10 p99s is not the global p99.
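A minimal sketch of why merging works, using two hypothetical instances with made-up cumulative bucket counts. Summing buckets is valid because the bucket boundaries are identical; the quantile estimate here is a simplified version of what Prometheus's histogram_quantile computes:

```typescript
// Cumulative bucket counts keyed by "le" upper bound (seconds),
// as Prometheus exports them. Counts are hypothetical.
type Buckets = Record<number, number>;

const instanceA: Buckets = { 0.1: 800, 0.5: 950, 1: 990, 5: 1000 };
const instanceB: Buckets = { 0.1: 400, 0.5: 700, 1: 900, 5: 1000 };

// Summing cumulative counts per bucket is valid because both instances
// share the same boundaries; two pre-computed p99s could not be merged.
function merge(a: Buckets, b: Buckets): Buckets {
  const out: Buckets = {};
  for (const le of Object.keys(a).map(Number)) out[le] = a[le] + b[le];
  return out;
}

// Find the bucket upper bound the quantile falls into (no interpolation,
// unlike histogram_quantile, to keep the sketch short)
function quantileUpperBound(buckets: Buckets, q: number): number {
  const bounds = Object.keys(buckets).map(Number).sort((a, b) => a - b);
  const total = buckets[bounds[bounds.length - 1]]; // last bucket counts everything
  const target = q * total;
  for (const le of bounds) if (buckets[le] >= target) return le;
  return bounds[bounds.length - 1];
}

const merged = merge(instanceA, instanceB);
console.log(quantileUpperBound(merged, 0.99)); // 5 ... global p99 is in the 5s bucket
```

Note that instance A alone looks fast (p99 inside the 1s bucket) while the merged view shows the global p99 landing in the 5s bucket ... exactly the information an average of per-instance p99s would destroy.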

Pillar 2: Real User Monitoring (RUM)

Synthetic monitoring tells you what the experience could be. RUM tells you what the experience is.

```typescript
// Client-side RUM collection
import { onLCP, onINP, onCLS, onTTFB } from "web-vitals";

interface RUMPayload {
  metric: string;
  value: number;
  page: string;
  connection: string;
  device: string;
  viewport: string;
  timestamp: number;
}

function collectRUM(metric: { name: string; value: number }) {
  const payload: RUMPayload = {
    metric: metric.name,
    value: metric.value,
    page: window.location.pathname,
    connection: (navigator as any).connection?.effectiveType || "unknown",
    device: /Mobile|Android/.test(navigator.userAgent) ? "mobile" : "desktop",
    viewport: `${window.innerWidth}x${window.innerHeight}`,
    timestamp: Date.now(),
  };

  // Send with the Beacon API (survives page navigation)
  navigator.sendBeacon("/api/rum", JSON.stringify(payload));
}

onLCP(collectRUM);
onINP(collectRUM);
onCLS(collectRUM);
onTTFB(collectRUM);
```

The power of RUM is segmentation. When p99 INP spikes to 800ms, you can filter by connection type and discover the problem only affects users on 3G networks. Or filter by page and find that /dashboard/analytics has 3x worse LCP than every other page.
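A sketch of that segmentation step on the server side, assuming a hypothetical store of RUM events shaped like the payloads above (using the segment maximum instead of a true percentile to keep it short):

```typescript
// Hypothetical RUM events as the /api/rum endpoint might store them
interface RUMEvent {
  metric: string;
  value: number;     // ms for INP
  connection: string;
}

const events: RUMEvent[] = [
  { metric: "INP", value: 120, connection: "4g" },
  { metric: "INP", value: 140, connection: "4g" },
  { metric: "INP", value: 750, connection: "3g" },
  { metric: "INP", value: 820, connection: "3g" },
];

// Group samples by connection type and report the worst segment
function worstSegment(evts: RUMEvent[], metric: string): [string, number] {
  const bySegment = new Map<string, number[]>();
  for (const e of evts.filter((e) => e.metric === metric)) {
    const xs = bySegment.get(e.connection) ?? [];
    xs.push(e.value);
    bySegment.set(e.connection, xs);
  }
  let worst: [string, number] = ["", 0];
  for (const [seg, xs] of bySegment) {
    const peak = Math.max(...xs); // simplification: max instead of p99
    if (peak > worst[1]) worst = [seg, peak];
  }
  return worst;
}

console.log(worstSegment(events, "INP")); // ["3g", 820] ... the 3G segment stands out
```

The same grouping works for any dimension you collect ... device class, page, viewport ... which is why the payload above carries all of them.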

Pillar 3: Error Budget Burn Rate

Static thresholds ("alert when p99 > 1 second") generate too many alerts. A temporary spike during a deploy is different from a sustained degradation.

Error budgets flip the model: define how much unreliability is acceptable over a window, then alert when you're consuming that budget too fast.

```yaml
# SLO definition: 99.9% of requests under 1 second
# Error budget: 0.1% = 43.2 minutes of budget per 30-day month

# Prometheus alerting rules
groups:
  - name: slo-burn-rate
    rules:
      - alert: HighBurnRate
        expr: |
          (
            sum(rate(http_request_duration_seconds_bucket{le="1"}[5m]))
            /
            sum(rate(http_request_duration_seconds_count[5m]))
          ) < 0.999
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "SLO burn rate exceeds budget"
          description: "Only {{ $value | humanize }} of requests completed under 1s (SLO: 99.9%)"
      - alert: CriticalBurnRate
        expr: |
          (
            sum(rate(http_request_duration_seconds_bucket{le="1"}[1m]))
            /
            sum(rate(http_request_duration_seconds_count[1m]))
          ) < 0.99
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "SLO burn rate critical - 10x budget consumption"
```

The magic of burn rate alerting: it automatically adjusts for traffic patterns. A 2-second spike during a 1,000 RPS period is significant. The same spike during a 10 RPS period at 3 AM is noise. Burn rate alerting distinguishes between the two.
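The arithmetic behind burn rate is simple enough to sketch. The numbers below are illustrative: 1% of requests over the 1-second SLO, against a 99.9% SLO on a 30-day window:

```typescript
// Burn rate: how fast the error budget is being consumed, relative to
// the rate that would exactly exhaust it over the SLO window.
function burnRate(observedErrorRate: number, slo: number): number {
  const budget = 1 - slo; // e.g. 0.001 for a 99.9% SLO
  return observedErrorRate / budget;
}

// Hours until a 30-day budget is gone at the current burn rate
function hoursToExhaustion(rate: number, windowDays = 30): number {
  return (windowDays * 24) / rate;
}

const rate = burnRate(0.01, 0.999); // 1% of requests over the 1s SLO
console.log(rate);                   // ~10x ... paging territory
console.log(hoursToExhaustion(rate)); // ~72 hours until the month's budget is gone
```

A burn rate of 1x means you will land exactly on your SLO at the end of the window; 10x means the entire month's budget disappears in three days, which is why sustained 10x burn is a wake-someone-up alert.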

Pillar 4: Distributed Tracing

When a request touches 3+ services, you need to trace the full journey to identify bottlenecks.

```typescript
// OpenTelemetry instrumentation
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("order-service");

async function processOrder(orderId: string): Promise<Order> {
  return tracer.startActiveSpan("processOrder", async (span) => {
    try {
      span.setAttribute("order.id", orderId);

      // Each sub-operation gets its own span
      const order = await tracer.startActiveSpan("fetchOrder", async (childSpan) => {
        const result = await db.query("SELECT * FROM orders WHERE id = $1", [orderId]);
        childSpan.setAttribute("db.rows_returned", result.rowCount);
        childSpan.end();
        return result.rows[0];
      });

      const payment = await tracer.startActiveSpan("processPayment", async (childSpan) => {
        const result = await paymentService.charge(order.total, order.customerId);
        childSpan.setAttribute("payment.amount", order.total);
        childSpan.end();
        return result;
      });

      span.setStatus({ code: SpanStatusCode.OK });
      return { ...order, paymentId: payment.id };
    } catch (error) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: error instanceof Error ? error.message : "Unknown error",
      });
      throw error;
    } finally {
      span.end();
    }
  });
}
```

A single trace shows that the processOrder operation took 1,200ms: 800ms in processPayment (external API call) and 350ms in fetchOrder (missing index). Without tracing, you'd see "order endpoint is slow" with no idea which component to optimize.


Alerting That Doesn't Cause Alert Fatigue

The average SaaS team has 50-200 active alerts. The on-call engineer ignores 90% of them because they're noise. When a real problem fires, it's lost in the noise.

The Three-Tier Alert Model

| Tier          | Response          | SLA            | Example                               |
|---------------|-------------------|----------------|---------------------------------------|
| P1: Page      | Wake someone up   | 5 min response | Error budget burned 50% in 1 hour     |
| P2: Ticket    | Next business day | 24 hours       | p99 latency above SLO for 30 minutes  |
| P3: Dashboard | Weekly review     | None           | Cache hit rate dropped below 90%      |

The rule: If an alert fires and nobody needs to take action, it shouldn't be an alert. Move it to a dashboard.

Alert Deduplication

```yaml
# Alertmanager configuration
route:
  group_by: ["alertname", "service"]
  group_wait: 30s      # Wait for related alerts to arrive
  group_interval: 5m   # Don't re-fire for 5 minutes
  repeat_interval: 4h  # Re-fire every 4 hours if unresolved

inhibit_rules:
  # If the database is down, suppress all downstream service alerts
  - source_match:
      alertname: "DatabaseDown"
    target_match:
      severity: "warning"
    equal: ["environment"]
```

Inhibition rules are critical for preventing alert storms. When the database goes down, every service that depends on it fires an alert. Without inhibition, the on-call engineer gets 50 notifications instead of 1.
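The inhibition logic itself can be sketched in a few lines. This is a hypothetical simplification of what Alertmanager does with the rule above, not its actual implementation; the alert names and labels are made up:

```typescript
interface Alert {
  name: string;
  severity: string;
  labels: Record<string, string>;
}

// Hypothetical alert storm: the database is down, and dependent
// services in the same environment are firing warnings.
const alerts: Alert[] = [
  { name: "DatabaseDown", severity: "critical", labels: { environment: "prod" } },
  { name: "OrderServiceSlow", severity: "warning", labels: { environment: "prod" } },
  { name: "AuthServiceSlow", severity: "warning", labels: { environment: "prod" } },
  { name: "CacheHitRateLow", severity: "warning", labels: { environment: "staging" } },
];

// Drop warning-level alerts that share an environment with an
// active DatabaseDown alert (mirrors the inhibit_rule above)
function applyInhibition(active: Alert[]): Alert[] {
  const sources = active.filter((a) => a.name === "DatabaseDown");
  return active.filter((a) => {
    if (a.name === "DatabaseDown") return true; // source alerts always fire
    return !sources.some(
      (s) => a.severity === "warning" && s.labels.environment === a.labels.environment,
    );
  });
}

console.log(applyInhibition(alerts).map((a) => a.name));
// ["DatabaseDown", "CacheHitRateLow"] ... one page, not four
```

The staging warning survives because the `equal: ["environment"]` clause only suppresses alerts in the same environment as the root cause.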


The Performance Dashboard

Build one dashboard that answers these questions without clicking into sub-pages:

  1. Is the system healthy right now? (Traffic light SLO status)
  2. How much error budget remains? (Burn rate gauge)
  3. Where is latency worst? (Top 5 endpoints by p99)
  4. What changed recently? (Deploy markers on latency graphs)
  5. Who is affected? (RUM segmentation by device/network)

```promql
# Grafana panels for the performance dashboard

# Panel 1: Current SLO status (gauge)
sum(rate(http_request_duration_seconds_bucket{le="1"}[1h]))
/
sum(rate(http_request_duration_seconds_count[1h]))

# Panel 2: Error budget remaining (stat)
1 - (
  (
    sum(increase(http_request_duration_seconds_count[30d]))
    - sum(increase(http_request_duration_seconds_bucket{le="1"}[30d]))
  )
  / (sum(increase(http_request_duration_seconds_count[30d])) * 0.001)
)

# Panel 3: p99 latency by endpoint (time series)
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route)
)

# Panel 4: Request rate with deploy markers (time series + annotations)
sum(rate(http_request_duration_seconds_count[1m])) by (route)
```

Reducing MTTR from 45 Minutes to 8 Minutes

Mean Time to Resolution (MTTR) is the metric that matters most for operational excellence. A company I advised had an MTTR of 45 minutes. After implementing the monitoring stack above, MTTR dropped to 8 minutes. Here's what changed:

Before: Alert fires → Engineer logs into Grafana → Checks 5 dashboards → Reads 3 log streams → Identifies the problem → Deploys fix → Verifies resolution.

After: Alert fires with context (which endpoint, which percentile, which user segment) → Engineer checks the trace link in the alert → Identifies the slow span → Deploys fix → SLO gauge confirms resolution.

The difference: context in the alert and traces that follow the request. The engineer doesn't waste 30 minutes on investigation... the alert tells them where to look.


When to Apply This

  • Your SaaS serves 1,000+ requests per minute and needs SLOs
  • On-call engineers are overwhelmed by alert noise
  • Customers report performance issues that your monitoring doesn't catch
  • You're running 3+ services and need to trace requests across them

When NOT to Apply This

  • You're a small team with a monolith... basic uptime monitoring and application logs are sufficient
  • Your application has fewer than 100 concurrent users
  • You're pre-revenue and optimizing for shipping speed, not reliability

Ready to build monitoring that catches problems before your customers do? I help SaaS teams design observability stacks that reduce MTTR and improve operational confidence.


Continue Reading

This post is part of the Performance Engineering Playbook ... covering latency optimization, caching strategies, and zero-downtime operations.

More in This Series

Get insights like this weekly

Join The Architect's Brief — one actionable insight every Tuesday.

Need help with performance?

Let's talk strategy