Incident Postmortem Skill

This skill produces a complete, blameless incident postmortem document in an industry-standard format. The output keeps blameless framing throughout (system gaps over individual failures) and drives toward specific, closeable action items rather than vague process commitments.

Required Inputs

Ask the user for these if not provided:

Incident title / ID
Severity (P1 / P2 / P3 or SEV1 / SEV2 / SEV3)
Date and duration of the incident
What happened (rough notes are fine — the skill will structure them)
Services or systems affected
Customer impact (how many users, what was degraded)
How it was detected
How it was resolved
Initial thoughts on root cause
Action items already identified (optional)
Responders (who was on-call or responded — names or roles; used for the timeline, not for blame)
Customer or external communications sent (optional — any status page updates, emails, or support messages with timestamps)

Output Format

Incident Postmortem: [Incident Title]

Incident ID: [ID]
Severity: [P1/P2/P3]
Date: [Date]
Duration: [Start time → Resolution time, total duration]
Status: [Resolved / Monitoring / Ongoing]
Author: [Leave blank for user to fill]
Last updated: [Date]

Executive Summary

[3–5 sentences. Describe what happened, who was affected, and what was done to resolve it. Written for a non-technical stakeholder. No jargon. No blame.]

Impact

Dimension	Details
Users affected	[Number or percentage]
Services degraded	[List affected services]
Business impact	[Revenue, SLA breach, support tickets, etc. if known]
Duration	[Total time from incident start to full resolution]

Timeline

List events in chronological order. Each entry: [HH:MM UTC] — [What happened. Who did what. What changed.]

Rules for timeline entries:

Use passive or system-focused language — avoid "X made a mistake"
Include: first symptom, detection, escalation, hypotheses tested, fix applied, confirmation of resolution
Note time between key events (e.g. "22 minutes between detection and escalation")
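
The interval notes asked for in the last rule can be computed mechanically from the timeline entries. The following is a minimal sketch in Python, assuming each entry is available as an ("HH:MM", label) pair in UTC and that consecutive entries are less than 24 hours apart. The function name `gaps_between_events` and the sample entries are hypothetical and only serve as illustration.

```python
from datetime import datetime, timedelta

def gaps_between_events(entries):
    """Return (from_label, to_label, minutes) for each consecutive pair.

    entries: list of ("HH:MM", label) tuples in chronological order (UTC).
    Assumption: consecutive entries are less than 24 hours apart, so a
    clock time earlier than the previous entry means midnight was crossed.
    """
    gaps = []
    for (t1, label1), (t2, label2) in zip(entries, entries[1:]):
        start = datetime.strptime(t1, "%H:%M")
        end = datetime.strptime(t2, "%H:%M")
        if end < start:
            end += timedelta(days=1)  # entry falls on the next calendar day
        minutes = int((end - start).total_seconds() // 60)
        gaps.append((label1, label2, minutes))
    return gaps

# Made-up sample timeline, not taken from a real incident.
timeline = [
    ("23:40", "first symptom"),
    ("23:52", "detection"),
    ("00:14", "escalation"),
]

for earlier, later, minutes in gaps_between_events(timeline):
    print(f"{minutes} minutes between {earlier} and {later}")
```

Storing full timestamps instead of clock times would remove the midnight assumption; the sketch uses HH:MM only because that is the entry format given above.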

Root Cause

Primary root cause: [One clear sentence. Technical but plain. "A misconfigured deployment config caused..."]

Contributing factors:

[Factor 1 — e.g. lack of canary deployment meant change hit 100% of traffic immediately]
[Factor 2 — e.g. alert threshold was set too high to catch the initial degradation]
[Factor 3 — add as many as are relevant]

Why did our existing safeguards not prevent this? [Honest paragraph explaining why monitoring, tests, or processes didn't catch this earlier. This is where blameless analysis matters most — focus on system gaps, not individual failures.]

Detection

How was it first detected? [Customer report / automated alert / internal monitoring / manual observation]
Time from incident start to detection: [X minutes]
Should we have detected this faster? [Yes / No — and why]

Resolution

What fixed it? [Clear description of the actual fix, one paragraph]
Why did this work? [Brief technical explanation]
Was there a temporary mitigation before full resolution? [Yes / No, describe if yes]

Action Items

#	Action	Owner	Due Date	Priority
1	[Specific, testable action]	[Team or person]	[Date]	P1/P2/P3

Rules for action items:

Each action must be specific enough to close as "done" or "not done" — no vague items like "improve monitoring"
Distinguish between: Prevent recurrence (fix the root cause), Improve detection (catch it faster next time), Improve response (resolve it faster next time)
Assign a real owner (a named person or a specific team), not a placeholder such as "team" or "TBD", wherever avoidable
Flag P1 actions as items that block the incident from being marked fully closed
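
The rules above lend themselves to a simple automated check. Below is a minimal sketch in Python, assuming action items are held as small records; the `ActionItem` class, the `problems` and `blocks_closure` helpers, and the sample items are all hypothetical. The vague-phrase list is illustrative and would need extending in practice.

```python
from dataclasses import dataclass
from typing import Optional

# Phrases this skill names as too vague; illustrative, not exhaustive.
VAGUE_PHRASES = ("improve monitoring", "increase resilience", "better testing")
# Owner values treated as placeholders rather than real owners (assumed set).
PLACEHOLDER_OWNERS = {"", "team", "tbd"}

@dataclass
class ActionItem:
    action: str
    owner: str
    due_date: Optional[str]  # e.g. "2024-07-01"; None if not yet set
    priority: str            # "P1", "P2", or "P3"
    done: bool = False

def problems(item):
    """Return a list of rule violations for one action item."""
    found = []
    lowered = item.action.lower()
    for phrase in VAGUE_PHRASES:
        if phrase in lowered:
            found.append(f'vague phrase: "{phrase}"')
    if item.owner.strip().lower() in PLACEHOLDER_OWNERS:
        found.append("owner is missing or a placeholder")
    if not item.due_date:
        found.append("due date is missing")
    return found

def blocks_closure(items):
    """Open P1 items keep the incident from being marked fully closed."""
    return [i for i in items if i.priority == "P1" and not i.done]

# Made-up sample items for illustration.
items = [
    ActionItem("Improve monitoring", "TBD", None, "P1"),
    ActionItem(
        "Add a canary stage that holds config deploys at 5% of traffic for 10 minutes",
        "Platform team", "2024-07-01", "P1",
    ),
]

for item in items:
    print(item.action, "->", problems(item) or "ok")
print("Blocking closure:", [i.action for i in blocks_closure(items)])
```

A phrase match is only a prompt for a human to rewrite the item; a check like this cannot confirm that an action is actually specific and testable.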

What Went Well

[3–5 honest observations about the response, such as fast collaboration, runbooks that worked, effective escalation, or clear communication. This section builds team confidence and reinforces good habits.]

Lessons Learned

[3–5 key insights from this incident that are worth sharing beyond this team. Write these as transferable lessons — e.g. "Our runbook for database failover didn't account for read-replica lag. All runbooks involving database failover should be reviewed."]

Communication Log

[Optional — list external communications sent: status page updates, customer emails, support responses. Include timestamps.]

Quality Checks

Timeline has no blame-focused language
Root cause is specific (not "human error")
Root cause answers "why did this happen?" not just "what happened?" — it names a system or process gap, not a symptom
Contributing factors explain the systemic gaps
Every action item has an owner and due date
"What went well" section is genuine, not token
No action item contains vague language like "improve monitoring", "increase resilience", or "better testing" — each must name a specific change
Executive summary is readable by non-technical leadership
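
Some of these checks can be partly automated as a first pass before human review. The following is a rough sketch in Python, assuming the draft is available as a list of text lines. Only "made a mistake" and "human error" come from this skill's own wording; the other patterns, the function name `flag_blame_language`, and the sample draft are assumptions for illustration.

```python
import re

# "made a mistake" and "human error" are named in this skill; the remaining
# patterns are assumed examples of blame-focused wording.
BLAME_PATTERNS = [
    r"\bmade a mistake\b",
    r"\bhuman error\b",
    r"\bforgot to\b",
    r"\bshould have known\b",
]

def flag_blame_language(lines):
    """Return (line_number, line) pairs that match any blame pattern."""
    flagged = []
    for number, line in enumerate(lines, start=1):
        if any(re.search(p, line, re.IGNORECASE) for p in BLAME_PATTERNS):
            flagged.append((number, line))
    return flagged

# Made-up draft lines for illustration.
draft = [
    "14:02 UTC - Deploy of the new config reached 100% of traffic.",
    "14:05 UTC - The on-call engineer forgot to check the dashboard.",
    "Primary root cause: human error during deployment.",
]

for number, line in flag_blame_language(draft):
    print(f"line {number}: {line}")
```

Pattern matching will miss subtle blame and may flag harmless sentences, so a flagged line is a cue to reread and reframe around the system gap, not a verdict.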

Anti-Patterns

Do not assign blame to individuals — postmortems must focus on system and process failures
Do not write action items with vague language like "improve monitoring" — each must name a specific, ownable change
Do not skip the contributing factors — root cause alone misses the systemic issues that enable incidents
Do not omit the detection timeline — how long it took to detect matters as much as how long it took to resolve
Do not treat the postmortem as closed until all action items have named owners and due dates

Usage Examples

"Write a postmortem for the [incident name] outage"
"Help me write a P1 incident report"
"Generate an RCA document for [service] going down on [date]"
"Draft a blameless postmortem from these notes: [paste notes]"
