Runbook Writer

Write an operational runbook for a service, incident type, or deployment procedure. Use when asked to write a runbook, create an ops guide, document an operational procedure, or prepare an incident response playbook. Produces a runbook with overview, prerequisites, step-by-step procedures, rollback steps, troubleshooting table, and escalation paths.

Published by @Mohit Aggarwal

Runbook Writer Skill

Produces operational runbooks for services, incident types, and deployment procedures — structured so an on-call engineer who's never touched the system can follow them under pressure.

Required Inputs

Ask for these if not provided:

  • What the runbook is for (e.g. deploying the payment service, responding to a database failover, rotating API keys)
  • Runbook type (Deployment / Incident Response / Maintenance / Disaster Recovery)
  • System/service name and what it does (brief description)
  • Audience (new on-call engineers / experienced SREs / DevOps team)
  • Tech stack (where relevant — e.g. Kubernetes, AWS RDS, Node.js)
  • Monitoring tools (e.g. Grafana, Datadog, CloudWatch, Splunk — used to name specific dashboards and alert links in the steps)
  • Key environment details (e.g. Kubernetes cluster name, AWS account/region, relevant namespaces or resource names — paste what's relevant for exact commands)
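
A minimal sketch of the "ask if not provided" rule, assuming the inputs are collected into a small record before drafting starts. The `RunbookInputs` class, its field names, and `missing_inputs` are hypothetical names chosen for illustration; they are not part of the skill itself.

```python
from dataclasses import dataclass, fields


@dataclass
class RunbookInputs:
    """Hypothetical container for the required inputs listed above."""
    purpose: str = ""
    runbook_type: str = ""
    service: str = ""
    audience: str = ""
    tech_stack: str = ""        # the list marks this "where relevant"
    monitoring_tools: str = ""
    environment: str = ""


def missing_inputs(inputs: RunbookInputs) -> list[str]:
    """Return the names of inputs that are empty or whitespace-only."""
    return [f.name for f in fields(inputs) if not getattr(inputs, f.name).strip()]


if __name__ == "__main__":
    draft = RunbookInputs(
        purpose="Deploy the payment service",
        runbook_type="Deployment",
        service="payment-service",
    )
    # Anything listed here should be asked for before drafting the runbook.
    print(missing_inputs(draft))
```

Whether an empty tech stack should block drafting is a judgment call, since the list only asks for it where relevant.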

Output Format


Runbook: [Runbook Title]
Service: [Service Name]
Type: [Deployment / Incident Response / Maintenance / DR]
Last Updated: [Insert today's date in YYYY-MM-DD format]
Owner: [Team or person]
Severity: [P1 / P2 / P3, for incident-type runbooks only]


Overview

What this runbook covers: [1–2 sentences on the scenario this runbook handles]

When to use this runbook:

  • [Specific trigger condition 1 — e.g. PagerDuty alert: high-error-rate-payment-service]
  • [Specific trigger condition 2 — e.g. Deploy needed after PR merged to main]

Estimated time to complete: [X minutes / X–Y minutes depending on outcome]

Impact if not completed correctly: [e.g. Payment processing degraded / Data loss risk / Users locked out]


Prerequisites

Access required:

  • [System/tool access — e.g. AWS Console: production-account]
  • [Credential — e.g. vault read secret/payment-service]
  • [VPN / bastion access if needed]

Tools required:

  • [Tool name and version — e.g. kubectl v1.28+]
  • [CLI or dashboard name]

Before you start:

  • [Prerequisite check — e.g. Verify current deployment is healthy in Grafana]
  • [Prerequisite action — e.g. Announce in #ops-live that you're starting]

Procedure

Number every step. Use exact commands. Do not paraphrase tool names or flags.

Step 1: [Action name]
[What you're doing and why, in one sentence]

# Exact command
[command here]

Expected output: [what should appear if this worked]
If this fails: [Exact error message to look for] → [What to do, or see Troubleshooting]

Step 2: [Action name]
[Same structure as Step 1]

Step 3: Verify
Always include a verification step after the main procedure:

[verification command]

Expected state: [What a healthy system looks like after this runbook completes]
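
The per-step structure above (action, reason, exact command, expected output, failure path) can be treated as a small data model, which makes it hard to forget a field. The sketch below is one possible way to do that; the `Step` class and `render_steps` function are hypothetical, and the service and namespace names in the example are placeholders.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """Hypothetical model of one procedure step; every field is required."""
    action: str
    why: str
    command: str
    expected_output: str
    on_failure: str


def render_steps(steps: list[Step]) -> str:
    """Render steps as numbered plain-text blocks in the layout shown above."""
    blocks = []
    for number, step in enumerate(steps, start=1):
        blocks.append(
            f"Step {number}: {step.action}\n"
            f"{step.why}\n\n"
            f"{step.command}\n\n"
            f"Expected output: {step.expected_output}\n"
            f"If this fails: {step.on_failure}"
        )
    return "\n\n".join(blocks)


if __name__ == "__main__":
    check = Step(
        action="Confirm the rollout finished",
        why="The deploy is not done until every replica is updated.",
        # Placeholder resource names; substitute your own.
        command="kubectl rollout status deployment/payment-service -n payments",
        expected_output="exit code 0 and a 'successfully rolled out' message",
        on_failure="see Troubleshooting",
    )
    print(render_steps([check]))
```

Because the dataclass has no defaults, constructing a step without an expected output or a failure path raises a `TypeError` instead of silently producing an incomplete runbook.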


Rollback

How to undo this procedure if something went wrong:

Step R1: [Rollback action]

[rollback command]

Verify rollback: [command to confirm rollback succeeded]


Troubleshooting

| Symptom | Likely Cause | Resolution |
|---|---|---|
| [Error message or observable symptom] | [Why this happens] | [Exact fix or next step] |
| [Another symptom] | [Cause] | [Resolution] |

Escalation

If this runbook does not resolve the issue:

| Condition | Who to Contact | How |
|---|---|---|
| [e.g. DB unavailable after 10 min] | [DBA on-call] | [PagerDuty policy: db-oncall] |
| [e.g. Payment provider unresponsive] | [Vendor contact] | [Contact in 1Password: vendor-escalation] |

Always update the incident timeline in [tool] before escalating.


Post-Procedure Checklist

After completing the runbook:

  • Announce completion in #ops-live with outcome
  • Update the incident ticket / deploy log
  • Verify alerts have resolved in monitoring dashboard
  • If the procedure revealed a gap in this runbook, update it now (link to edit process)

Quality Checks

  • Every step has an exact command (no "run the deploy script")
  • Expected output is specified for each step so the engineer knows whether it worked
  • Failure path is explicit for each step (not "if it fails, investigate")
  • Rollback procedure is complete and independently testable
  • Escalation table has no cells containing only "[Team name]" — every row must either have a real contact or be explicitly flagged as [FILL IN: on-call rotation link]
  • Rollback section contains at least one concrete command (not left as "[rollback command]" placeholder)
  • Runbook can be followed by someone who has never touched this system
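
Several of the checks above are mechanical enough to automate on a finished draft. The following is a rough sketch, not a complete validator: `lint_runbook` and the `VAGUE` phrase list are assumptions made for illustration, and the step check relies on the "Step N:" and "Expected output:" / "Expected state:" labels from the output format.

```python
import re

# Placeholder strings that must not survive into a finished runbook.
FORBIDDEN = ("[rollback command]", "[Team name]", "[command here]")
# Assumed list of vague phrases; extend it for your own team.
VAGUE = ("run the deploy script", "if it fails, investigate")


def lint_runbook(text: str) -> list[str]:
    """Return human-readable problems found in a runbook draft."""
    problems = []
    lowered = text.lower()
    for placeholder in FORBIDDEN:
        if placeholder.lower() in lowered:
            problems.append(f"unfilled placeholder: {placeholder}")
    for phrase in VAGUE:
        if phrase in lowered:
            problems.append(f"vague instruction: {phrase!r}")
    # Each numbered step should be followed by an expected output or state.
    steps = len(re.findall(r"^Step \d+:", text, flags=re.MULTILINE))
    expected = len(re.findall(r"^Expected (?:output|state):", text, flags=re.MULTILINE))
    if expected < steps:
        problems.append(f"{steps} steps but only {expected} expected-output lines")
    return problems


if __name__ == "__main__":
    draft = "Step 1: Deploy\nrun the deploy script\n\nRollback\n[rollback command]\n"
    for problem in lint_runbook(draft):
        print(problem)
```

A check like this cannot judge whether a command is correct or whether a newcomer could follow the runbook, so it complements the manual review rather than replacing it.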

Usage Examples

  • "Write a runbook for [service] deployment"
  • "Create an incident response runbook for [alert type]"
  • "I need a runbook for [procedure]"
  • "Document the operational procedure for [X]"
  • "Write an ops playbook for [scenario]"

Anti-Patterns

  • Do not write steps as vague actions like "run the deploy script" — every step must include the exact command
  • Do not leave the rollback section as a placeholder — a runbook without a tested rollback procedure is incomplete and dangerous
  • Do not omit expected output for each step — without it, the on-call engineer cannot tell if the step succeeded
  • Do not write escalation contacts as "[Team name]" — every escalation row must have a real contact or an explicit flag to fill in
  • Do not assume the reader knows the system — write for someone who has never touched it before
