Runbook Writer Skill

Produces operational runbooks for services, incident types, and deployment procedures — structured so an on-call engineer who's never touched the system can follow them under pressure.

Required Inputs

Ask for these if not provided:

What the runbook is for (e.g. deploying the payment service, responding to a database failover, rotating API keys)
Runbook type (Deployment / Incident Response / Maintenance / Disaster Recovery)
System/service name and what it does (brief description)
Audience (new on-call engineers / experienced SREs / DevOps team)
Tech stack (where relevant — e.g. Kubernetes, AWS RDS, Node.js)
Monitoring tools (e.g. Grafana, Datadog, CloudWatch, Splunk — used to name specific dashboards and alert links in the steps)
Key environment details (e.g. Kubernetes cluster name, AWS account/region, relevant namespaces or resource names — paste what's relevant for exact commands)

Output Format

Runbook: [Runbook Title] Service: [Service Name] Type: [Deployment / Incident Response / Maintenance / DR] Last Updated: [Insert today's date in YYYY-MM-DD format] Owner: [Team or person] Severity: [P1 / P2 / P3 — if incident-type]

Overview

What this runbook covers: [1–2 sentences on the scenario this runbook handles]

When to use this runbook:

[Specific trigger condition 1 — e.g. PagerDuty alert: high-error-rate-payment-service]
[Specific trigger condition 2 — e.g. Deploy needed after PR merged to main]

Estimated time to complete: [X minutes / X–Y minutes depending on outcome]

Impact if not completed correctly: [e.g. Payment processing degraded / Data loss risk / Users locked out]

Prerequisites

Access required:

[System/tool access — e.g. AWS Console: production-account]
[Credential — e.g. vault read secret/payment-service]
[VPN / bastion access if needed]

Tools required:

[Tool name and version — e.g. kubectl v1.28+]
[CLI or dashboard name]

Before you start:

[Prerequisite check — e.g. Verify current deployment is healthy in Grafana]
[Prerequisite action — e.g. Announce in #ops-live that you're starting]

Procedure

Number every step. Use exact commands. Do not paraphrase tool names or flags.

Step 1: [Action name] [What you're doing and why — one sentence]

# Exact command
[command here]

Expected output: [what should appear if this worked] If this fails: [Exact error message to look for] → [What to do, or see Troubleshooting]

Step 2: [Action name] [Same structure as Step 1]

Step 3: Verify Always include a verification step after the main procedure:

[verification command]

Expected state: [What a healthy system looks like after this runbook completes]

Rollback

How to undo this procedure if something went wrong:

Step R1: [Rollback action]

[rollback command]

Verify rollback: [command to confirm rollback succeeded]

Troubleshooting

Symptom	Likely Cause	Resolution
[Error message or observable symptom]	[Why this happens]	[Exact fix or next step]
[Another symptom]	[Cause]	[Resolution]

Escalation

If this runbook does not resolve the issue:

Condition	Who to Contact	How
[e.g. DB unavailable after 10 min]	[DBA on-call]	[PagerDuty policy: `db-oncall`]
[e.g. Payment provider unresponsive]	[Vendor contact]	[Contact in 1Password: `vendor-escalation`]

Always update the incident timeline in [tool] before escalating.

Post-Procedure Checklist

After completing the runbook:

Announce completion in #ops-live with outcome
Update the incident ticket / deploy log
Verify alerts have resolved in monitoring dashboard
If this revealed a gap in this runbook — update it now (link to edit process)

Quality Checks

Every step has an exact command (no "run the deploy script")
Expected output is specified for each step so engineer knows if it worked
Failure path is explicit for each step (not "if it fails, investigate")
Rollback procedure is complete and independently testable
Escalation table has no cells containing only "[Team name]" — every row must either have a real contact or be explicitly flagged as [FILL IN: on-call rotation link]
Rollback section contains at least one concrete command (not left as "[rollback command]" placeholder)
Runbook can be followed by someone who has never touched this system

Usage Examples

"Write a runbook for [service] deployment"
"Create an incident response runbook for [alert type]"
"I need a runbook for [procedure]"
"Document the operational procedure for [X]"
"Write an ops playbook for [scenario]"

Anti-Patterns

Do not write steps as vague actions like "run the deploy script" — every step must include the exact command
Do not leave the rollback section as a placeholder — a runbook without a tested rollback procedure is incomplete and dangerous
Do not omit expected output for each step — without it, the on-call engineer cannot tell if the step succeeded
Do not write escalation contacts as "[Team name]" — every escalation row must have a real contact or an explicit flag to fill in
Do not assume the reader knows the system — write for someone who has never touched it before

Runbook Writer

Runbook Writer Skill

Required Inputs

Output Format

Overview

Prerequisites

Procedure

Rollback

Troubleshooting

Escalation

Post-Procedure Checklist

Quality Checks

Usage Examples

Anti-Patterns

Bundled with this artifact

More on the bench

Sprint Velocity Analysis

Scrum Master

Confluence Expert