CI/CD Playbook

Write a CI/CD pipeline playbook for a service or team. Use when asked to document a CI/CD pipeline, write a deployment process, define release gates, document build and test stages, or create a deployment guide. Produces a structured playbook covering pipeline stages, environment definitions, deployment gates, rollback procedures, and on-call responsibilities.

Published by @Mohit Aggarwal

CI/CD Playbook Skill

Produce a complete, actionable CI/CD playbook for a service or team — covering everything a new engineer needs to understand, contribute to, and operate the pipeline safely.

A good playbook is not a diagram. It is a document that answers: what runs, when, why, who owns it, and what to do when it breaks.

Required Inputs

Ask for these if not already provided:

  • Service name and brief description
  • Tech stack — language, framework, containerisation (Docker, etc.)
  • Source control — GitHub / GitLab / Bitbucket, branching strategy
  • CI platform — GitHub Actions / CircleCI / Jenkins / BuildKite / other
  • CD platform / deployment target — Kubernetes, ECS, Lambda, Heroku, VMs, etc.
  • Environments — e.g. dev, staging, production (and any canary / feature environments)
  • Deployment frequency — how often does the team ship?
  • Any existing gates — manual approvals, smoke tests, feature flags
  • On-call setup — who's responsible during deploys?

Output Format


CI/CD Playbook: [Service Name]

Service: [Name] | Team: [Team name]
Last updated: [Date] | Owner: [Name / role]
Pipeline platform: [CI tool] → [CD tool / platform]


Overview

[2–3 sentences describing what this service does and why the CI/CD pipeline is structured the way it is. Include the deployment target and how frequently the team ships.]

Deployment frequency: [Multiple times per day / Daily / Weekly / On-demand]
Average pipeline duration: [X minutes]
Rollback time (p95): [X minutes]


Pipeline Stages

[Branch push]
    │
    ▼
[1. Build & Lint] ──fail──▶ ❌ Block PR
    │
    ▼
[2. Unit Tests] ──fail──▶ ❌ Block PR
    │
    ▼
[3. Integration Tests] ──fail──▶ ❌ Block PR
    │
    ▼
[4. Security Scan] ──fail──▶ ⚠️ [Block / Warn — specify]
    │
    ▼
[5. Build Artefact / Container Image]
    │
    ▼
[6. Deploy to Staging] ──fail──▶ ❌ Block promotion
    │
    ▼
[7. Smoke Tests (Staging)]
    │
    ▼
[8. Manual Approval Gate] ──(if required)
    │
    ▼
[9. Deploy to Production] ──fail──▶ 🔁 Auto-rollback (if configured)
    │
    ▼
[10. Post-deploy checks]

Stage Definitions

Stage 1 — Build & Lint

What runs: [Build command] + [Linter, e.g. ESLint, golangci-lint, flake8]
Trigger: Every commit to any branch
Blocking: Yes. The PR cannot be merged if this fails
Typical duration: [X minutes]
Owner if it fails: PR author

Common failure causes:

  • [e.g. Missing dependency: run npm install locally before pushing]
  • [e.g. Lint rule violation: run npm run lint -- --fix to auto-fix most issues]

Stage 2 — Unit Tests

What runs: [Test command, e.g. npm test, go test ./..., pytest]
Coverage gate: [X]% minimum; the pipeline fails below this threshold
Trigger: Every commit
Blocking: Yes
Typical duration: [X minutes]

Coverage report: [Where to find it — e.g. uploaded to Codecov, available in CI artifacts]
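The coverage gate can be pictured as a single comparison at the end of the test stage. The following is a minimal sketch, not the way any particular CI tool implements it. The function name `coverage_gate` and its parameters are hypothetical, and the line counts are assumed to come from whatever coverage report the test command produces.

```python
def coverage_gate(covered_lines: int, total_lines: int, minimum_percent: float) -> bool:
    """Return True when measured line coverage meets the configured minimum.

    All names here are illustrative; real pipelines usually read these
    numbers from the coverage tool's report.
    """
    if total_lines <= 0:
        # An empty report is treated as a failure so it cannot pass silently.
        return False
    percent = 100.0 * covered_lines / total_lines
    return percent >= minimum_percent
```

In a pipeline, a False result would translate into a non-zero exit code so that the stage fails.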


Stage 3 — Integration Tests

What runs: [Test suite description, e.g. "API integration tests against a test database using Docker Compose"]
Environment: [Ephemeral test environment / shared test DB / etc.]
Trigger: Every commit to main and to feature branches targeting main
Blocking: Yes
Typical duration: [X minutes]

If slow: [e.g. "Integration tests can be skipped locally with SKIP_INTEGRATION=true — never skip in CI"]


Stage 4 — Security Scan

Tools: [e.g. Snyk, Trivy, OWASP Dependency Check, Semgrep]
What it checks: [Dependency vulnerabilities / SAST / secrets detection: list what applies]
Blocking on: Critical and High severity findings
Non-blocking on: Medium and Low (flagged, not blocking)
Trigger: Every commit to main

How to handle a flagged vulnerability:

  1. Check if a fix is available — upgrade the dependency
  2. If no fix available, open a security ticket and add a suppression with justification
  3. Never suppress without a ticket and owner
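The blocking policy and the suppression rule above can be sketched as one gate function. This is an illustration under stated assumptions, not the output format of any scanner: the `Finding` shape, its field names, and the decision to skip validly suppressed findings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Severities that fail the pipeline, per the policy in this stage.
BLOCKING_SEVERITIES = {"critical", "high"}


@dataclass
class Finding:
    identifier: str                           # hypothetical ID reported by the scanner
    severity: str                             # "critical", "high", "medium", or "low"
    suppression_ticket: Optional[str] = None  # security ticket justifying a suppression
    suppression_owner: Optional[str] = None   # named owner of the suppression


def is_validly_suppressed(finding: Finding) -> bool:
    """A suppression only counts when it has both a ticket and an owner."""
    return bool(finding.suppression_ticket) and bool(finding.suppression_owner)


def evaluate_scan(findings: List[Finding]) -> Dict[str, object]:
    """Split findings into blocking and warning lists and decide pass/fail."""
    blocking: List[str] = []
    warnings: List[str] = []
    for finding in findings:
        if is_validly_suppressed(finding):
            continue  # assumption: accountable suppressions are not re-reported
        if finding.severity.lower() in BLOCKING_SEVERITIES:
            blocking.append(finding.identifier)
        else:
            warnings.append(finding.identifier)
    return {"passed": not blocking, "blocking": blocking, "warnings": warnings}
```

A finding suppressed with a ticket but no owner still blocks, which mirrors the rule that a suppression needs both.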

Stage 5 — Build Artefact

What is produced: [Docker image / binary / zip: be specific]
Registry: [ECR / GCR / Docker Hub / Artifactory, plus URL]
Tagging convention: [service-name]:[git-sha] (also tagged :latest on main)
Trigger: Commits to main only (not feature branches)
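As an illustration of the tagging convention, the sketch below builds the list of tags for one build. The function name `image_tags`, the `branch` parameter, and the check that the SHA is 7 to 40 lowercase hex characters are assumptions made for the example, not part of the convention itself.

```python
import re
from typing import List


def image_tags(service_name: str, git_sha: str, branch: str) -> List[str]:
    """Return the image tags for a build: always the SHA tag, plus :latest on main."""
    # Assumption: the SHA is an abbreviated or full lowercase hex commit hash.
    if not re.fullmatch(r"[0-9a-f]{7,40}", git_sha):
        raise ValueError(f"not a git SHA: {git_sha!r}")
    tags = [f"{service_name}:{git_sha}"]
    if branch == "main":
        tags.append(f"{service_name}:latest")
    return tags
```

Deploying by the SHA tag rather than :latest keeps every deployment traceable to one commit, which also makes the rollback step of picking a previous tag unambiguous.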


Stage 6 — Deploy to Staging

Deployment method: [e.g. Helm upgrade / kubectl apply / ecs deploy / Terraform apply]
Staging URL: [URL]
Trigger: Automatic on successful artefact build from main
Who can deploy to staging: Any engineer (automatic)

Environment variables: Managed in [Vault / AWS SSM / GitHub Secrets / etc.]
Staging is not production: [Any differences in config, scale, or data; state them here]


Stage 7 — Smoke Tests (Staging)

What runs: [Description, e.g. "10 critical path tests covering login, core API endpoints, and payment flow"]
Tool: [e.g. Playwright / Postman / custom script]
Pass criteria: All smoke tests pass within [X seconds] timeout
Blocking: Yes. The production deploy will not proceed if smoke tests fail

Smoke test suite location: [Link to test files or folder]


Stage 8 — Manual Approval Gate

Required for: [Production deploys / deploys affecting >X% of traffic / deploys to specific regions]
Who can approve: [e.g. Any engineer on the team / Lead engineer / On-call engineer]
Approval timeout: [e.g. 24 hours; auto-cancelled if no approval]
How to approve: [GitHub Actions approve step / Slack command / other, with link]

When to withhold approval:

  • Active incident in production
  • Deploy is outside the deployment window (see below)
  • On-call engineer has not been notified
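The three withholding conditions can be expressed as a pre-approval checklist that returns the reasons an approver should hold off. This is only a sketch: the function name and its boolean inputs are hypothetical, and how each input is determined (incident tracker, calendar, chat message) is left to the team.

```python
from typing import List


def reasons_to_withhold(
    active_incident: bool,
    inside_deploy_window: bool,
    on_call_notified: bool,
) -> List[str]:
    """Return every reason to withhold approval; an empty list means approve."""
    reasons: List[str] = []
    if active_incident:
        reasons.append("Active incident in production")
    if not inside_deploy_window:
        reasons.append("Deploy is outside the deployment window")
    if not on_call_notified:
        reasons.append("On-call engineer has not been notified")
    return reasons
```

Returning all reasons at once, rather than stopping at the first, lets the requester fix everything in one pass.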

Stage 9 — Deploy to Production

Deployment method: [Same as staging or different; specify]
Deployment window: [e.g. Monday–Thursday 09:00–16:00 UTC; no deploys on Fridays or before bank holidays]
Canary / progressive rollout: [Yes: X% initial traffic, full rollout after Y minutes / No: full deploy]
Deployment notifications: [Slack channel, e.g. #deployments]

Who is on-call during deploy: Deploying engineer is responsible until post-deploy checks pass.
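A deployment window is easiest to enforce when it is checked by code rather than by memory. The sketch below uses the example window from this stage (Monday to Thursday, 09:00 to 16:00 UTC) as an assumption; replace the constants with the team's real window. It does not handle bank holidays, which would need a separate calendar, and the end of the window is treated as exclusive.

```python
from datetime import datetime, time, timezone

# Assumed window, taken from the example above; adjust to the real policy.
WINDOW_DAYS = {0, 1, 2, 3}   # Monday=0 through Thursday=3
WINDOW_START = time(9, 0)    # 09:00 UTC, inclusive
WINDOW_END = time(16, 0)     # 16:00 UTC, exclusive


def in_deployment_window(moment: datetime) -> bool:
    """Return True if the timezone-aware moment falls inside the deploy window."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    utc = moment.astimezone(timezone.utc)
    return utc.weekday() in WINDOW_DAYS and WINDOW_START <= utc.time() < WINDOW_END
```

Requiring a timezone-aware datetime avoids the common mistake of comparing a CI runner's local time against a window defined in UTC.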


Stage 10 — Post-Deploy Checks

Automated checks (run for [X minutes] after deploy):

  • Error rate: <[X]% (baseline: [Y]%)
  • P99 latency: <[X]ms (baseline: [Y]ms)
  • [Key business metric]: within [X]% of baseline

Where to watch: [Datadog / Grafana / CloudWatch dashboard — link]

If a check fails: See Rollback Procedure below.
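The automated checks above amount to comparing a handful of observed metrics against thresholds. The following sketch shows one way to express that; the class names, field names, and check labels are hypothetical, and fetching the metrics from a monitoring system is out of scope here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Thresholds:
    max_error_rate_percent: float          # error rate must stay below this
    max_p99_latency_ms: float              # P99 latency must stay below this
    max_metric_deviation_percent: float    # allowed drift from baseline


@dataclass
class Observation:
    error_rate_percent: float
    p99_latency_ms: float
    business_metric: float
    business_metric_baseline: float


def failed_checks(obs: Observation, limits: Thresholds) -> List[str]:
    """Return the names of the post-deploy checks that failed."""
    if obs.business_metric_baseline == 0:
        raise ValueError("baseline must be non-zero to compute a deviation")
    failures: List[str] = []
    if obs.error_rate_percent >= limits.max_error_rate_percent:
        failures.append("error_rate")
    if obs.p99_latency_ms >= limits.max_p99_latency_ms:
        failures.append("p99_latency")
    deviation = (
        100.0
        * abs(obs.business_metric - obs.business_metric_baseline)
        / abs(obs.business_metric_baseline)
    )
    if deviation > limits.max_metric_deviation_percent:
        failures.append("business_metric")
    return failures
```

A non-empty result during the observation window is the signal to start the rollback procedure.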


Environments

| Environment | Purpose | Deploy trigger | URL | Data |
|---|---|---|---|---|
| Dev | Local development | Manual | localhost | Seeded test data |
| Staging | Pre-production validation | Automatic (main) | [URL] | Anonymised prod copy |
| Production | Live traffic | Manual approval | [URL] | Live data |

Branching Strategy

Model: [Trunk-based / GitFlow / GitHub Flow — describe briefly]

| Branch | Purpose | Who merges | Deploy target |
|---|---|---|---|
| main | Production-ready code | PR + review | Staging → Production |
| feature/* | Feature development | Author | None (CI only) |
| hotfix/* | Critical production fixes | Lead engineer | Can bypass staging gate with approval |

Hotfix process: [Describe when and how to use a hotfix branch — what level of incident justifies bypassing the standard process]


Rollback Procedure

Automated rollback: [Yes — triggered if post-deploy error rate exceeds [X]% / No — manual only]

Manual rollback steps:

# 1. Identify the last known good image tag
[command to list recent deployments]

# 2. Deploy the previous version
[deployment command with previous tag]

# 3. Confirm rollback is live
[smoke test command or health check URL]

# 4. Notify the team
[Slack command or template]

Rollback decision authority: Any on-call engineer can initiate a rollback without waiting for approval.

After a rollback:

  1. Create a post-deploy incident report (see [incident-postmortem skill])
  2. Do not re-deploy the same commit without fixing the root cause
  3. Notify [stakeholder / support team] of the rollback and expected fix timeline

Secrets and Configuration Management

Secret store: [Vault / AWS SSM / GitHub Secrets / Doppler: specify]

How to add a new secret:

  1. [Step 1]
  2. [Step 2]

Who has access: [Role or team]
Rotation policy: [How often secrets are rotated and who owns it]

Never commit secrets to source control, even in .env files. The pipeline includes secret scanning (Stage 4), which will flag this.


Common Failures and Fixes

| Failure | Likely cause | Fix |
|---|---|---|
| Build fails with "module not found" | Dependency not installed | Run [install command] and commit the lock file |
| Integration tests time out | Test DB not seeded / external service down | Check [service] status; re-run pipeline |
| Smoke tests fail after staging deploy | Environment variable missing | Check [config location]; compare staging and prod env vars |
| Production deploy stuck at approval | Approver not notified | Tag @[on-call handle] in #deployments |
| Post-deploy error rate spike | Bad deploy / upstream dependency | Check [dashboard]; initiate rollback if the spike persists for more than 5 minutes |

On-Call Responsibilities During Deploy

  • The deploying engineer is responsible for monitoring post-deploy checks for [X minutes] after a production deploy
  • If you cannot monitor after deploying, hand off explicitly to another engineer in #deployments
  • Outside business hours, deploy hotfixes only, and always page the on-call engineer before deploying

Anti-Patterns

  • Do not describe a rollback procedure that has never been tested — a theoretical rollback is not a rollback plan; test it in staging before production
  • Do not allow deploys on Fridays or before holidays without an explicit on-call engineer who will monitor through the weekend
  • Do not commit secrets to source control even in non-production branches — secret scanning in the pipeline catches this, but prevention is the standard
  • Do not skip post-deploy monitoring after a production deploy — the deploying engineer must watch error rates and latency for the specified observation window
  • Do not suppress a security scan finding without a linked ticket and a named owner — suppressions without accountability accumulate into unmanaged risk

Quality Checks

  • Every stage has a clear owner when it fails
  • Rollback procedure is tested — not theoretical
  • Secrets management section names the actual tool used (not "use secrets management")
  • Deployment window is specific — not "during business hours"
  • Post-deploy check thresholds are calibrated to actual baseline metrics
