TL;DR
Most engineering documentation is written once and never read. That's not because engineers don't value docs; it's because the docs don't match how engineers actually look for information. Engineers don't read documentation top to bottom like a textbook. They search for answers to specific questions at specific moments: "How do I deploy?" "What does this error mean?" "Why was this architecture decision made?" Four documentation types actually get used: decision records (ADRs), runbooks, onboarding guides, and API contracts. Everything else (comprehensive architecture overviews, process documents, team wikis) has a half-life of about 6 weeks, after which it's outdated and actively misleading. Write less documentation, but keep what you write accurate.
Part of the Engineering Leadership: Founder to CTO ... a comprehensive guide to scaling engineering teams and practices.
Why Documentation Fails
I've audited engineering documentation at 12 companies in the past 2 years. These numbers are anecdotal (drawn from my advisory work, not a formal study), but the pattern is consistent:
- 40-60% of docs were outdated (didn't match current codebase or processes)
- 70%+ of engineers reported finding incorrect information in docs at least monthly
- 90%+ of "comprehensive" architecture documents hadn't been updated in 6+ months
The problem isn't motivation. Every team I've worked with wants good documentation. The problem is that most documentation strategies optimize for comprehensiveness instead of accuracy... and comprehensive docs that are wrong are worse than no docs at all.
Wrong documentation costs more than missing documentation. When an engineer follows outdated instructions and breaks something, they lose trust in all documentation. From that point forward, they ask a colleague instead of reading the docs... which defeats the entire purpose.
The Four Types That Work
Type 1: Architecture Decision Records (ADRs)
ADRs capture the why behind technical decisions. They're the single most valuable documentation type because they answer a question that source code never answers: "Why did we choose this approach over the alternatives?"
# ADR-007: Use PostgreSQL Row-Level Security for Tenant Isolation
## Status: Accepted (2026-01-15)
## Context
Our SaaS application needs to isolate tenant data. We evaluated three approaches:
1. Separate databases per tenant
2. Shared database with application-level filtering
3. Shared database with PostgreSQL Row-Level Security (RLS)
## Decision
We chose PostgreSQL RLS (option 3).
## Rationale
- **Separate databases** (option 1) provide strong isolation but make schema migrations
O(N) where N = number of tenants. At 500+ tenants, migrations take 30+ minutes.
- **Application filtering** (option 2) requires every query to include a WHERE clause.
A single missed filter exposes cross-tenant data. This has caused data breaches at
companies I've advised.
- **RLS** (option 3) enforces isolation at the database layer. A missing WHERE clause
returns only the current tenant's rows, and a missing tenant context returns zero rows
instead of another tenant's data. Fail-safe by default.
## Trade-offs
- RLS adds 2-5% query overhead (measured on our workload)
- Requires SET app.tenant_id on every connection (handled via Prisma middleware)
- Debugging is harder... queries return empty results instead of errors when tenant context is wrong
## Consequences
- All database queries automatically filtered by tenant
- New engineers can't accidentally write cross-tenant queries
- Monitoring needed for the Prisma middleware setting tenant context
Why ADRs work: They're written once, at the moment of decision, when context is fresh. They don't need updating because the decision itself doesn't change... even if the implementation evolves.
Maintenance cost: Near zero. ADRs are immutable records. If a decision is reversed, write a new ADR that supersedes the old one.
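The supersession rule can be made mechanical. Here is a minimal sketch, assuming each ADR is indexed with a hypothetical `supersededBy` field; the `Adr` shape, the `currentDecision` helper, and ADR-003 are illustrative names, not part of any ADR standard.

```typescript
// Hypothetical ADR index entry; field names are assumptions for illustration.
type AdrStatus = "proposed" | "accepted" | "superseded";

interface Adr {
  id: number;
  title: string;
  status: AdrStatus;
  supersededBy?: number; // id of the ADR that replaced this one
}

// Follow the supersededBy chain to the decision currently in force.
function currentDecision(adrs: Adr[], id: number): Adr {
  const byId = new Map<number, Adr>();
  for (const adr of adrs) byId.set(adr.id, adr);

  const start = byId.get(id);
  if (!start) throw new Error(`ADR-${id} not found`);

  let adr: Adr = start;
  const seen = new Set<number>();
  while (adr.supersededBy !== undefined) {
    // Guard against a malformed index where ADRs supersede each other.
    if (seen.has(adr.id)) throw new Error(`Supersession cycle at ADR-${adr.id}`);
    seen.add(adr.id);
    const next = byId.get(adr.supersededBy);
    if (!next) throw new Error(`ADR-${adr.supersededBy} not found`);
    adr = next;
  }
  return adr;
}

// ADR-003 is an invented predecessor, used only to show the chain.
const adrs: Adr[] = [
  { id: 3, title: "Application-level tenant filtering", status: "superseded", supersededBy: 7 },
  { id: 7, title: "Use PostgreSQL Row-Level Security for Tenant Isolation", status: "accepted" },
];

const active = currentDecision(adrs, 3);
console.log(`ADR-${active.id}: ${active.title}`);
```

The old record stays untouched, so a reader who lands on ADR-003 still sees the original reasoning and is pointed to the decision that replaced it.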
Type 2: Runbooks
Runbooks are step-by-step procedures for operational tasks. They answer: "How do I do this specific thing right now?"
# Runbook: Deploy Hotfix to Production
## When to Use
Production issue requiring an immediate fix that can't wait for the normal release cycle.
## Prerequisites
- [ ] GitHub access to main branch
- [ ] Cloudflare dashboard access (for rollback if needed)
## Steps
1. Create hotfix branch from main:
```bash
git checkout main && git pull && git checkout -b hotfix/description
```
2. Apply fix, commit, push: `git add <files> && git commit -m "fix: description" && git push -u origin hotfix/description`
3. Create PR targeting main. Add `hotfix` label.
4. Get one approval (skip normal two-reviewer requirement for hotfixes).
5. Merge to main. GitHub Actions deploys automatically.
6. Verify: `curl -s https://alexmayhew.dev/api/health | jq`
7. If deployment fails, rollback via Cloudflare Dashboard: Workers & Pages > alexmayhew-dev > Deployments > Rollback
## Escalation
If steps above don't resolve the issue, contact [on-call IC] via PagerDuty.
Why runbooks work: They're designed for the exact moment when an engineer needs them... under stress, with limited context. Every step is explicit. No assumed knowledge.
Maintenance strategy: Every time a runbook is used and a step is wrong, fix the runbook immediately. Treat runbook inaccuracy as a bug with the same priority as a code bug.
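If runbook inaccuracy is a bug, part of it can be caught like one. Here is a minimal sketch of a structural check, assuming runbooks follow the section layout of the hotfix example above; treating those four sections as required is my assumption, and `missingSections` is a hypothetical helper.

```typescript
// Section names mirror the example runbook; treating them as required
// is an assumption for this sketch, not an established standard.
const REQUIRED_SECTIONS = ["When to Use", "Prerequisites", "Steps", "Escalation"];

// Return the required "## " headings that are absent from a runbook.
function missingSections(markdown: string): string[] {
  const headings = new Set(
    markdown
      .split("\n")
      .filter((line) => line.startsWith("## "))
      .map((line) => line.slice(3).trim()),
  );
  return REQUIRED_SECTIONS.filter((section) => !headings.has(section));
}

// A draft runbook with two sections left out.
const draft = [
  "# Runbook: Deploy Hotfix to Production",
  "## When to Use",
  "Production issue requiring an immediate fix.",
  "## Steps",
  "1. Create hotfix branch from main",
].join("\n");

// Reports that Prerequisites and Escalation are missing.
console.log(missingSections(draft));
```

A check like this only verifies shape, not correctness. Whether step 3 still works is something only the engineer running the runbook finds out.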
Type 3: Onboarding Guides
New engineer onboarding documentation has the highest ROI of any documentation type. Every hour invested in onboarding docs saves that hour multiplied by every future engineer who joins.
The effective onboarding guide isn't a wiki page... it's a checklist with day-by-day tasks:
```markdown
# Engineering Onboarding: Week 1
## Day 1: Environment Setup
### Morning
- [ ] Clone the monorepo: `git clone git@github.com:org/repo.git`
- [ ] Install dependencies: `npm install` (requires Node 20+)
- [ ] Set up local database: `docker compose up -d`
- [ ] Run the dev server: `npm run dev` ... verify http://localhost:3001 loads
- [ ] Run tests: `npm test` ... all should pass
### Afternoon
- [ ] Read ADR-001 through ADR-010 (architecture context)
- [ ] Read the incident postmortem archive (last 5 postmortems)
- [ ] Set up your IDE: TypeScript strict mode, ESLint plugin, Prettier
## Day 2: First Contribution
- [ ] Pick a "good first issue" from the backlog
- [ ] Open a PR. Your onboarding buddy will review within 2 hours.
- [ ] Submit your first PR by end of day (any size is fine)
## Day 3-5: Domain Context
- [ ] Shadow the on-call engineer for 2 hours
- [ ] Read the runbook for the service area you're joining
- [ ] Meet your team lead for architecture walkthrough (30 min)
- [ ] Complete second PR with a slightly larger scope
```
Why onboarding guides work: They remove decision-making from the new hire's first week. The checklist tells them exactly what to do, in what order, with expected outcomes at each step.
Maintenance strategy: Every new engineer adds one improvement to the onboarding guide as their second-week task. The guide improves with every hire.
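Because the guide is a task list rather than prose, progress through it is machine-readable. Here is a rough sketch, assuming the new hire ticks boxes in their own copy of the checklist using GitHub-style `- [x]` markers; `checklistProgress` is a hypothetical helper.

```typescript
interface ChecklistProgress {
  done: number;
  total: number;
}

// Count GitHub-style task list items: "- [ ]" is open, "- [x]" is done.
function checklistProgress(markdown: string): ChecklistProgress {
  let done = 0;
  let total = 0;
  for (const line of markdown.split("\n")) {
    const match = /^\s*- \[( |x|X)\] /.exec(line);
    if (!match) continue; // headings and prose are ignored
    total += 1;
    if (match[1] !== " ") done += 1;
  }
  return { done, total };
}

// A new hire's copy of Day 1, halfway through.
const day1 = [
  "## Day 1: Environment Setup",
  "- [x] Clone the monorepo",
  "- [x] Install dependencies",
  "- [ ] Set up local database",
  "- [ ] Run the dev server",
].join("\n");

const progress = checklistProgress(day1);
console.log(`${progress.done}/${progress.total} tasks complete`);
```

An onboarding buddy can use a count like this to spot where a new hire is stuck, which is usually also where the guide needs fixing.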
Type 4: API Contracts
API documentation that's generated from code stays accurate because it changes when the code changes.
// OpenAPI spec generated from code (e.g., tRPC, Zod, or decorators)
// The contract IS the code ... there's no separate document to maintain
import { z } from "zod";

const createOrderSchema = z.object({
  customerId: z.string().uuid().describe("The customer placing the order"),
  items: z
    .array(
      z.object({
        productId: z.string().uuid(),
        quantity: z.number().int().positive(),
      })
    )
    .min(1)
    .describe("Order line items (at least one required)"),
  currency: z.enum(["USD", "EUR", "GBP"]).default("USD"),
});
Why API contracts work: They're enforced by the compiler or runtime. If the contract says the field is required and the code doesn't send it, either the build breaks (when types are inferred from the schema) or the request is rejected at validation time.
Maintenance cost: Zero. The documentation is the code.
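To show the mechanism without any library, here is a dependency-free sketch of the same contract as a TypeScript type guard. Zod does this for you; the hand-rolled `isCreateOrder` below is only an illustration, it skips the UUID format check and the `currency` default, and its names are hypothetical.

```typescript
// Mirrors the fields of createOrderSchema, minus UUID validation and defaults.
interface OrderItem {
  productId: string;
  quantity: number;
}

interface CreateOrder {
  customerId: string;
  items: OrderItem[];
  currency: "USD" | "EUR" | "GBP";
}

const CURRENCIES = ["USD", "EUR", "GBP"];

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

function isOrderItem(value: unknown): value is OrderItem {
  if (!isRecord(value)) return false;
  const { productId, quantity } = value;
  return (
    typeof productId === "string" &&
    typeof quantity === "number" &&
    Number.isInteger(quantity) &&
    quantity > 0
  );
}

// Runtime check that doubles as a compile-time type guard.
function isCreateOrder(value: unknown): value is CreateOrder {
  if (!isRecord(value)) return false;
  const { customerId, items, currency } = value;
  return (
    typeof customerId === "string" &&
    Array.isArray(items) &&
    items.length >= 1 &&
    items.every(isOrderItem) &&
    typeof currency === "string" &&
    CURRENCIES.indexOf(currency) !== -1
  );
}

const payload: unknown = {
  customerId: "c-1",
  items: [{ productId: "p-1", quantity: 2 }],
  currency: "EUR",
};

if (isCreateOrder(payload)) {
  // Inside this branch the compiler knows payload is a CreateOrder.
  console.log(`Order with ${payload.items.length} item(s) in ${payload.currency}`);
}
```

The runtime half rejects bad payloads at the boundary. The compile-time half means code that builds a `CreateOrder` without `customerId` fails type checking before it ships.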
What to Stop Documenting
These documentation types consistently fail:
| Type | Why It Fails | Alternative |
|---|---|---|
| Comprehensive architecture overview | Outdated within 2 months | ADRs for decisions, code for current state |
| Process documents | Never read, frequently wrong | Automated checks (CI) that enforce process |
| Meeting notes | Written for the writer, not the reader | Decision records (ADRs) for outcomes |
| "How it works" deep-dives | Outdated when code changes | Well-named code + inline comments |
| Style guides | Replaced by automated formatters | Prettier/ESLint/gofmt configs |
The honest truth: if documentation can be replaced by automation, replace it. A linter rule that enforces a naming convention is more reliable than a style guide that describes it.
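As a rough illustration of a convention living in code instead of a style guide, here is a sketch of a file-naming check. The kebab-case rule and the `namingViolations` helper are hypothetical; in practice you would more likely reach for an existing ESLint rule than write your own.

```typescript
// The kebab-case convention is a hypothetical example rule.
const KEBAB_CASE = /^[a-z0-9]+(-[a-z0-9]+)*$/;

// Return the file paths whose base name (without extension) breaks the rule.
function namingViolations(paths: string[]): string[] {
  return paths.filter((path) => {
    const fileName = path.split("/").pop() ?? "";
    const baseName = fileName.replace(/\.[^.]+$/, ""); // strip the last extension
    return !KEBAB_CASE.test(baseName);
  });
}

// Example: files touched by a pull request.
const changed = ["src/order-service.ts", "src/OrderUtils.ts", "docs/adr-007.md"];
const violations = namingViolations(changed);

// In CI, a non-empty list would fail the build.
console.log(violations);
```

The style guide version of this rule depends on someone remembering it during review. The check version fails the PR every time.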
Documentation Decay and Prevention
Documentation decays at a predictable rate. The half-life depends on the type:
| Documentation Type | Half-Life | Decay Cause |
|---|---|---|
| API contracts (code-generated) | Infinite | Changes when code changes |
| ADRs | 2-5 years | Decisions rarely change |
| Runbooks | 3-6 months | Tooling and processes evolve |
| Onboarding guides | 2-4 months | Dependencies and setup change |
| Architecture overviews | 4-8 weeks | Code changes faster than docs |
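If you read "half-life" literally, the table implies exponential decay. Here is a minimal sketch under that assumption; the model is idealized, `fractionAccurate` is a hypothetical helper, and the half-lives plugged in are taken from the ranges above.

```typescript
// Fraction of a document still accurate after `ageWeeks`, assuming
// exponential decay with the given half-life (an idealized model).
function fractionAccurate(ageWeeks: number, halfLifeWeeks: number): number {
  if (halfLifeWeeks <= 0) throw new RangeError("halfLifeWeeks must be positive");
  return Math.pow(0.5, ageWeeks / halfLifeWeeks);
}

// Architecture overview: 6 weeks is the midpoint of the 4-8 week range.
const overviewAfterQuarter = fractionAccurate(12, 6); // 0.25

// ADR: 104 weeks is the low end of the 2-5 year range.
const adrAfterQuarter = fractionAccurate(12, 104); // about 0.92

console.log(overviewAfterQuarter, adrAfterQuarter);
```

After one quarter, the model leaves a quarter of the architecture overview trustworthy while the ADR is still above 90%. The gap matters more than the exact numbers.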
Prevention Strategies
1. Docs-as-code. Store documentation in the same repository as the code. PRs that change code should include documentation updates. Code review catches doc drift.
2. Automated freshness checks.
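# GitHub Action: flag stale docs
name: Doc Freshness Check
on:
  schedule:
    - cron: "0 9 * * 1" # Every Monday
jobs:
  check-freshness:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # full history, needed for per-file commit dates
      - name: Find stale docs
        run: |
          # File mtimes reflect checkout time in CI, so use the last commit date instead
          cutoff=$(date -d "90 days ago" +%s)
          find docs/ -name "*.md" | while read -r file; do
            last=$(git log -1 --format=%ct -- "$file")
            if [ -n "$last" ] && [ "$last" -lt "$cutoff" ]; then
              echo "::warning file=$file::This document hasn't been updated in 90+ days. Verify accuracy or archive."
            fi
          done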
3. Owner assignment. Every document has an owner. When the owner leaves the team, ownership transfers. Unowned docs get archived.
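The ownership rule is easy to audit mechanically. Here is a minimal sketch, assuming each doc records its owner in some metadata (front matter, a CODEOWNERS-style file, or similar); the `DocMeta` shape, the `auditOwnership` helper, and the names in the sample data are all hypothetical.

```typescript
// Hypothetical metadata, e.g. parsed from front matter in each doc.
interface DocMeta {
  path: string;
  owner?: string;
}

interface OwnershipAudit {
  needsTransfer: string[]; // owner has left the team
  archive: string[]; // no owner recorded at all
}

// Split docs into those needing a new owner and those to archive.
function auditOwnership(docs: DocMeta[], team: string[]): OwnershipAudit {
  const members = new Set(team);
  const audit: OwnershipAudit = { needsTransfer: [], archive: [] };
  for (const doc of docs) {
    if (doc.owner === undefined) {
      audit.archive.push(doc.path);
    } else if (!members.has(doc.owner)) {
      audit.needsTransfer.push(doc.path);
    }
  }
  return audit;
}

const docs: DocMeta[] = [
  { path: "docs/runbooks/deploy-hotfix.md", owner: "dana" },
  { path: "docs/onboarding/week-1.md", owner: "lee" }, // lee has left
  { path: "docs/architecture-overview.md" }, // never had an owner
];
const team = ["dana", "sam"];

const audit = auditOwnership(docs, team);
console.log(audit);
```

Run on the same schedule as the freshness check, this turns "ownership transfers" from an intention into a recurring to-do.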
When to Apply This
- Your team is growing past 5 engineers and knowledge transfer is becoming a bottleneck
- New engineers take more than 2 weeks to become productive
- Engineers frequently encounter outdated documentation
- On-call engineers spend significant time figuring out procedures during incidents
When NOT to Apply This
- Solo developer or two-person team... verbal communication is faster
- Prototype or throwaway project... documentation outliving the code is waste
- First week of a startup... ship first, document later
