Monitoring Setup Guide

Write a monitoring setup guide for a service — defining what to measure, how to alert on it, and how to build the observability stack covering the four golden signals, business metrics, log strategy, distributed tracing, alerting rules, dashboard layout, and observability debt. Use when asked to set up monitoring for a service, define alerting strategy, write an observability plan, create a dashboard specification, or document logging standards for a team. Produces a metric definitions table, alert rules specification, dashboard layout wireframe, log schema, tracing setup checklist, and monitoring gap analysis.

Published by @Mohit Aggarwal, from mohitagw15856/pm-claude-skills

Monitoring Setup Guide Skill

Produce a complete monitoring setup guide for a service — defining exactly what to measure, how to structure logs, how to configure alerts with actionable thresholds, and how to build dashboards that answer real operational questions. A good monitoring guide eliminates "we don't know what's happening in production" as a root cause category, and gives on-call engineers a single source of truth for what healthy looks like.

Required Inputs

Ask for these if not already provided:

  • Service name and description — what the service does and its role in the system
  • Tech stack — language, framework, and infrastructure (e.g. Go/gRPC on Kubernetes, Python/FastAPI on ECS)
  • Current monitoring tooling — Datadog, Prometheus + Grafana, CloudWatch, New Relic, Honeycomb, or none yet
  • Key user journeys — the 2–4 most important things a user or consumer does with the service (these drive what to alert on)
  • Existing alerts — paste any existing alert configurations or describe what's currently monitored

Output Format


Monitoring Setup Guide: [Service Name]

Team: [Team name] | Tech lead: [Name]
Stack: [Language/Framework] on [Infrastructure]
Monitoring platform: [Datadog / Prometheus+Grafana / CloudWatch / etc.]
Date: [Date] | Review cycle: Quarterly


1. Monitoring Philosophy

Good monitoring answers three questions:

  1. Is the service healthy right now? (alerting)
  2. Was it healthy in the past, and is it trending worse? (dashboards + SLO tracking)
  3. Why did something fail? (logs + traces)

This guide defines the answers for [Service Name]. Every alert must be actionable — if an on-call engineer cannot take a specific action in response to the alert, the alert should not exist.

Key user journeys monitored:

  • Journey 1: [e.g. "User submits a payment — POST /charges, receives confirmation"]
  • Journey 2: [e.g. "User views transaction history — GET /transactions"]
  • Journey 3: [e.g. "Subscription renewal job runs — background worker processes billing events"]

2. The Four Golden Signals

Apply the four golden signals specifically to [Service Name]:

Latency

Latency measures how long requests take to complete. Track it separately for successful and failed requests: errors that fail fast pull the aggregate down and can mask slow successful requests, while errors that fail slowly inflate it.

| Metric | Description | Source | Dimensions |
| --- | --- | --- | --- |
| [service].request.duration_ms | End-to-end request latency | Application instrumentation | endpoint, method, status_code |
| [service].db.query_duration_ms | Database query latency | ORM / query instrumentation | query_name, table |
| [service].external.request_duration_ms | Outbound call latency to dependencies | HTTP client instrumentation | target_service, endpoint |
| [service].queue.processing_duration_ms | Time to process one message (if applicable) | Consumer instrumentation | queue_name, message_type |
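
To illustrate why aggregate latency can mislead, the sketch below splits request durations by outcome before summarising them. The function name `latency_by_outcome` and the `(duration_ms, status_code)` tuple shape are assumptions made for this example; they are not part of any monitoring library.

```python
from statistics import median


def latency_by_outcome(requests):
    """Return the median duration for successful and failed requests.

    `requests` is an iterable of (duration_ms, status_code) pairs.
    A request counts as failed when its status code is 5xx.
    """
    groups = {"success": [], "failure": []}
    for duration_ms, status_code in requests:
        outcome = "failure" if status_code >= 500 else "success"
        groups[outcome].append(duration_ms)
    # None signals "no samples" rather than a misleading zero.
    return {name: (median(values) if values else None)
            for name, values in groups.items()}


# Synthetic data: three normal requests and three errors that fail fast.
sample = [(40, 200), (50, 200), (60, 200), (5, 503), (5, 503), (5, 500)]
split = latency_by_outcome(sample)
aggregate = median(duration for duration, _ in sample)
```

With this synthetic data the aggregate median is 22.5 ms, while successful requests actually see a 50 ms median, because the fast 5xx responses drag the combined figure down.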

Latency SLO targets:

| Endpoint / operation | p50 target | p95 target | p99 target |
| --- | --- | --- | --- |
| GET /api/v1/[resource] | < [50] ms | < [200] ms | < [500] ms |
| POST /api/v1/[resource] | < [100] ms | < [400] ms | < [1000] ms |
| GET /health | < [10] ms | < [20] ms | < [50] ms |
| [Background job name] | < [5] sec | < [15] sec | < [60] sec |
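
As a rough illustration of how observed latency could be checked against targets like these, the following sketch computes nearest-rank percentiles from raw samples. The helper names (`percentile`, `check_latency_slo`) and the synthetic durations are assumptions; in practice the monitoring platform usually derives percentiles from histogram buckets instead.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_latency_slo(samples_ms, targets_ms):
    """Compare observed percentiles with their targets.

    `targets_ms` maps a percentile (e.g. 99) to its target in milliseconds.
    Returns {percentile: (observed_ms, target_ms, within_target)}.
    """
    report = {}
    for pct, target in targets_ms.items():
        observed = percentile(samples_ms, pct)
        # Targets are written as "< N ms", so equality counts as a miss.
        report[pct] = (observed, target, observed < target)
    return report


durations = list(range(1, 101))  # synthetic data: 1 ms .. 100 ms
report = check_latency_slo(durations, {50: 50, 95: 200, 99: 500})
```

Because the targets are strict upper bounds, the p50 of exactly 50 ms in this synthetic data set is reported as outside its target, while p95 and p99 pass.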

Traffic

Traffic measures demand on the system. Use it to detect unexpected spikes and traffic drops (which can indicate upstream failures), and to plan capacity.

| Metric | Description | Source |
| --- | --- | --- |
| [service].request.count | Requests per second | Application / load balancer |
| [service].request.count_by_endpoint | RPS broken down by endpoint | Application |
| [service].queue.messages_consumed_per_second | Consumer throughput | Queue consumer |
| [service].queue.depth | Messages waiting in queue | Queue metrics |

Traffic baselines (update after observing production for 2+ weeks):

| Time period | Expected RPS | Low-traffic floor | Spike ceiling |
| --- | --- | --- | --- |
| Peak (weekday business hours) | [N] RPS | [N × 0.5] RPS | [N × 5] RPS |
| Off-peak (nights/weekends) | [N × 0.2] RPS | [N × 0.05] RPS | [N] RPS |
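
A baseline like this can be turned into a simple check. The sketch below is one possible shape, with `classify_traffic` and its parameter names chosen for illustration; the default ratios mirror the peak-hours row (floor = N × 0.5, ceiling = N × 5).

```python
def classify_traffic(current_rps, expected_rps, floor_ratio=0.5, ceiling_ratio=5.0):
    """Classify current traffic as "drop", "spike", or "normal" against a baseline."""
    if expected_rps <= 0:
        raise ValueError("expected_rps must be positive")
    if current_rps < expected_rps * floor_ratio:
        return "drop"
    if current_rps > expected_rps * ceiling_ratio:
        return "spike"
    return "normal"
```

For the off-peak row, the same function could be called with `expected_rps` set to N × 0.2 and the ratios 0.25 and 5.0, which reproduce the N × 0.05 floor and N ceiling.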

Errors

Errors measure the fraction of requests that fail. Distinguish between client errors (4xx — caller is doing something wrong) and server errors (5xx — the service is broken).

| Metric | Description | Alert on? |
| --- | --- | --- |
| [service].request.error_rate | 5xx errors / total requests | Yes (see alert rules) |
| [service].request.client_error_rate | 4xx errors / total requests | Threshold alert (a sudden spike may indicate API misuse) |
| [service].dependency.error_rate | Errors calling downstream dependencies | Yes (dependency health signal) |
| [service].queue.dlq_depth | Messages in dead-letter queue | Yes (indicates processing failures) |
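
The distinction between server and client error rates can be sketched as follows. `error_rates` is a hypothetical helper that works on a plain list of status codes; a real service would compute the same ratios from counters in its metrics platform.

```python
from collections import Counter


def error_rates(status_codes):
    """Return (server_error_rate, client_error_rate) as fractions of all requests."""
    codes = list(status_codes)
    if not codes:
        return 0.0, 0.0
    # Group by status class: 2 for 2xx, 4 for 4xx, 5 for 5xx, and so on.
    classes = Counter(code // 100 for code in codes)
    total = len(codes)
    return classes[5] / total, classes[4] / total


server_rate, client_rate = error_rates([200] * 97 + [404, 404, 500])
```

Keeping the two rates separate means a burst of 4xx responses from a misbehaving caller does not trip the 5xx alert that pages on-call.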

Saturation

Saturation measures how "full" the service is: how close its most constrained resources are to maximum capacity.

| Resource | Metric | Alert threshold | Source |
| --- | --- | --- | --- |
| CPU | [service].cpu.utilisation_pct | >80% sustained 5 min | Container / VM metrics |
| Memory | [service].memory.utilisation_pct | >85% sustained 5 min | Container / VM metrics |
| DB connections | [service].db.connection_pool.utilisation_pct | >75% | Application / DB metrics |
| Thread pool / goroutines | [service].runtime.goroutine_count / thread_count | >N (establish baseline) | Runtime metrics |
| Disk (if applicable) | [service].disk.utilisation_pct | >75% | Infrastructure |
| Queue depth (if applicable) | [service].queue.depth | >[backlog threshold] | Queue metrics |
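
The "sustained 5 min" wording matters: a single reading above the threshold should not fire. A minimal sketch of that logic, assuming one sample per minute and a hypothetical helper named `is_sustained`:

```python
def is_sustained(samples, threshold, min_samples):
    """True when the most recent `min_samples` readings all exceed `threshold`."""
    if min_samples <= 0:
        raise ValueError("min_samples must be positive")
    if len(samples) < min_samples:
        return False  # not enough history to call it sustained
    return all(value > threshold for value in samples[-min_samples:])


# One CPU reading per minute, oldest first (synthetic values).
cpu_pct = [62, 85, 79, 83, 84, 86, 88, 91]
```

Here `is_sustained(cpu_pct, 80, 5)` is true because the last five readings are all above 80%, whereas the earlier isolated 85% reading on its own would not have triggered anything.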

3. Business Metrics

Beyond the golden signals, track metrics that measure whether the service is delivering business value. These matter for SLO reporting and product dashboards.

| Metric | Description | Source | Alert? |
| --- | --- | --- | --- |
| [service].[primary_action].success_rate | [e.g. "Payment success rate"] | Application | Yes, if drops >5% vs 1h average |
| [service].[primary_action].count | [e.g. "Payments processed per minute"] | Application | Yes, on sudden drop (traffic anomaly) |
| [service].[resource].created_per_hour | [e.g. "New accounts created"] | Application / DB | No (informational) |
| [service].cache.hit_rate | Fraction of requests served from cache | Cache instrumentation | Yes, if drops below [60]% |
| [service].job.[name].success_rate | [Background job success rate] | Job framework | Yes, if drops below [99]% |
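
The "drops >5% vs 1h average" rule can be read in more than one way. The sketch below assumes it means five percentage points below the trailing average; `success_rate_dropped` and its parameters are illustrative names, and a relative drop would need a different comparison.

```python
def success_rate_dropped(current_rate, trailing_rates, max_drop=0.05):
    """True when `current_rate` is more than `max_drop` below the trailing average.

    Rates are fractions between 0 and 1; `max_drop` is in the same units,
    so 0.05 means five percentage points (an assumption of this sketch).
    """
    if not trailing_rates:
        return False  # no baseline yet, so nothing to compare against
    baseline = sum(trailing_rates) / len(trailing_rates)
    return (baseline - current_rate) > max_drop


# Success rates sampled over the previous hour (synthetic values).
last_hour = [0.99, 0.98, 0.99, 1.0]
```

Comparing against a trailing average rather than a fixed number lets the alert adapt to services whose normal success rate is below 100%.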

4. Log Strategy

Structured Logging Schema

All logs must be structured JSON. Do not emit unstructured text logs in production. Every log line must include the mandatory fields.

Mandatory fields (every log line):

{
  "timestamp": "2024-01-15T10:23:45.123Z",
  "level": "info",
  "service": "[service-name]",
  "version": "[git-sha-short]",
  "trace_id": "[32-hex-char trace ID from request context]",
  "span_id": "[16-hex-char span ID]",
  "request_id": "[uuid-per-request]",
  "message": "[human readable description]"
}

Request log (emit for every HTTP request; mandatory fields not shown in this example, such as version and span_id, still apply):

{
  "timestamp": "...",
  "level": "info",
  "service": "[service-name]",
  "event": "http_request",
  "method": "POST",
  "path": "/api/v1/[resource]",
  "status_code": 201,
  "duration_ms": 45,
  "user_id": "[uuid — DO NOT log PII directly]",
  "request_id": "[uuid]",
  "trace_id": "[32-hex-char trace ID]"
}

Error log (emit for every error with context; mandatory fields not shown in this example still apply):

{
  "timestamp": "...",
  "level": "error",
  "service": "[service-name]",
  "event": "error",
  "error_code": "[application-error-code]",
  "error_message": "[description — no sensitive data]",
  "stack_trace": "[stack trace]",
  "request_id": "[uuid]",
  "trace_id": "[32-hex-char trace ID]",
  "context": {
    "[key]": "[relevant context without PII]"
  }
}
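
One way to guarantee the mandatory fields on every line is to centralise them in a log formatter. The sketch below uses only the Python standard library; the class name `JsonFormatter`, the logger name, and the values `"payments"` and `"abc1234"` are assumptions for illustration, and most teams would use an existing structured-logging library instead.

```python
import io
import json
import logging
from datetime import datetime, timezone

# Python names this level "WARNING"; the schema in this guide uses "warn".
_LEVEL_NAMES = {"WARNING": "warn"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object containing the mandatory fields."""

    def __init__(self, service, version):
        super().__init__()
        self.service = service
        self.version = version

    def format(self, record):
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": timestamp.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": _LEVEL_NAMES.get(record.levelname, record.levelname.lower()),
            "service": self.service,
            "version": self.version,
            # Correlation IDs are passed per call via `extra=`; None if absent.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "request_id": getattr(record, "request_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)


# Usage: write to an in-memory stream so the output can be inspected.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter(service="payments", version="abc1234"))
logger = logging.getLogger("monitoring_guide_example")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

logger.warning(
    "retry succeeded",
    extra={
        "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
        "span_id": "00f067aa0ba902b7",
        "request_id": "req-1",
    },
)
logged = json.loads(stream.getvalue())
```

Putting the schema in the formatter means individual call sites cannot forget a mandatory field; they only supply the message and the per-request identifiers.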

Log Levels — When to Use Each

| Level | Use when | Example |
| --- | --- | --- |
| error | Something failed that requires attention; this should page on-call eventually | Database query failed, external API returned 5xx, required config missing |
| warn | Something unexpected happened but service is still functioning | Retry succeeded after failure, cache miss on expected hit, rate limit approaching |
| info | Significant business events and request lifecycle | Request received, payment processed, user authenticated, job started/completed |
| debug | Detailed diagnostic information; off in production by default | Query parameters, intermediate computation results, cache key lookups |

What NOT to Log

Never log:

  • Passwords, tokens, API keys, or secrets (even hashed)
  • Full credit card numbers or PAN data
  • Social security numbers or government IDs
  • Full names + dates of birth + contact info in the same log line (PII aggregation)
  • Request/response bodies in full (use field-level extraction instead)
  • Health check requests (too noisy — exclude GET /health from access logs)
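
A key-based redaction step can act as a backstop for this list. The sketch below is an assumption-laden example: the `SENSITIVE_KEYS` names and the `redact` helper are hypothetical, and matching on key names will not catch sensitive values stored under unexpected keys, so it complements rather than replaces keeping such data out of log calls.

```python
SENSITIVE_KEYS = {"password", "token", "api_key", "secret", "card_number", "ssn"}


def redact(fields):
    """Return a copy of `fields` with sensitive values replaced, recursing into nested dicts."""
    cleaned = {}
    for key, value in fields.items():
        if key.lower() in SENSITIVE_KEYS:
            cleaned[key] = "[REDACTED]"
        elif isinstance(value, dict):
            cleaned[key] = redact(value)
        else:
            cleaned[key] = value
    return cleaned


event = {
    "event": "login_failed",
    "user_id": "u-123",
    "password": "hunter2",
    "context": {"api_key": "abc", "attempt": 3},
}
safe_event = redact(event)
```

The input dictionary is left untouched, so the caller can still use the original values for the request itself.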

5. Distributed Tracing Setup

Distributed tracing is mandatory for any service that calls other services. It enables root-cause analysis across service boundaries.

Instrumentation Checklist

[ ] Tracing library installed:
    - Go: go.opentelemetry.io/otel
    - Python: opentelemetry-sdk, opentelemetry-instrumentation
    - Node: @opentelemetry/sdk-node
    - Java: opentelemetry-java-instrumentation

[ ] Tracer initialized at service startup with service name and version

[ ] Trace context propagated via W3C Trace Context headers:
    traceparent: 00-[trace-id]-[span-id]-01
    tracestate: [optional vendor-specific]

[ ] Automatic instrumentation enabled for:
    [ ] Inbound HTTP/gRPC requests (creates root span)
    [ ] Outbound HTTP/gRPC calls (creates child spans)
    [ ] Database queries (creates child spans with sanitized query)
    [ ] Cache operations (Redis, Memcached)
    [ ] Message queue produce/consume

[ ] Custom spans added for:
    [ ] Key business operations ([e.g. payment processing, user lookup])
    [ ] Background jobs (each job execution = root span)
    [ ] Third-party API calls with custom attributes

[ ] Span attributes to capture on all spans:
    - user.id (if authenticated — no PII)
    - deployment.environment (production/staging)
    - service.version (git SHA)
    - [service-specific key attributes]

[ ] Trace exporter configured to: [Datadog / Jaeger / Tempo / OTLP endpoint]

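[ ] Sampling rate configured:
    - Production: [1–10]% of requests (adjust based on volume and cost)
    - Always sample: errors, slow requests (>p99 threshold), and 100% of [critical endpoint]
      (keeping errors and slow requests requires tail-based sampling, because a head-based sampler decides before the outcome is known)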
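
When verifying propagation by hand, it can help to parse the `traceparent` header directly. The sketch below handles only version `00` of the W3C Trace Context format; `parse_traceparent` and `child_traceparent` are hypothetical helper names, and in a real service the OpenTelemetry propagators do this work automatically.

```python
import re

# version 00, 32-hex trace ID, 16-hex parent (span) ID, 2-hex flags
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header):
    """Parse a version-00 `traceparent` header; return None if it is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # The specification treats all-zero IDs as invalid.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),  # lowest bit is the sampled flag
    }


def child_traceparent(parent, new_span_id):
    """Build the header for an outbound call: same trace ID, new span ID."""
    flags = "01" if parent["sampled"] else "00"
    return f"00-{parent['trace_id']}-{new_span_id}-{flags}"


incoming = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
outgoing = child_traceparent(incoming, "b7ad6b7169203331")
```

A propagation check then reduces to confirming that the trace ID seen by the downstream service equals the one in `incoming`, while the span ID differs.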

Trace Instrumentation Examples

# Python: OpenTelemetry example
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("[service-name]")


class PaymentError(Exception):
    """Placeholder for the service's own payment exception type."""


def process_payment(payment_data):
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("payment.amount_cents", payment_data["amount"])
        span.set_attribute("payment.currency", payment_data["currency"])
        # Never: span.set_attribute("payment.card_number", ...)
        try:
            # _do_process stands in for the service's real payment logic
            result = _do_process(payment_data)
            span.set_status(Status(StatusCode.OK))
            return result
        except PaymentError as e:
            # Mark the span as failed and attach the exception details
            span.set_status(Status(StatusCode.ERROR, str(e)))
            span.record_exception(e)
            raise

6. Alert Rules Specification

Every alert must have: a name, a condition, a threshold, a severity, and a clear on-call action. Alerts without a clear action should not exist.

Alert Definitions

| Alert name | Condition | Threshold | Severity | On-call action |
| --- | --- | --- | --- | --- |
| [Service]HighErrorRate | 5xx error rate, 5-min rolling window | >1% for 2 consecutive windows | P1 | Check recent deploys; inspect error logs; see runbook [link] |
| [Service]CriticalErrorRate | 5xx error rate, 2-min rolling window | >5% | P1 (immediate) | Same as above; page immediately, do not wait |
| [Service]HighP99Latency | p99 latency on key endpoints | >2× SLO target for 3 min | P2 | Check DB latency, cache hit rate, and downstream dependencies |
| [Service]LatencySLOBreach | p99 latency | >SLO target for 5 consecutive minutes | P1 | SLO burn: page on-call, escalate if not resolved in 20 min |
| [Service]HighCPU | CPU utilisation | >80% sustained for 5 min | P2 | Check for traffic spike; scale up if needed; check for runaway processes |
| [Service]HighMemory | Memory utilisation | >85% sustained for 5 min | P2 | Check for memory leak (especially after deploys); restart pod if OOM imminent |
| [Service]DBConnectionPoolHigh | DB connection pool utilisation | >75% | P2 | Check for long-running queries; consider scaling service or increasing pool size |
| [Service]DLQDepthHigh | Dead-letter queue depth | >10 messages | P2 | Inspect DLQ messages for error pattern; fix bug and replay if safe |
| [Service]TrafficDropAnomaly | RPS, compared to same hour yesterday | >50% drop sustained 5 min | P1 | Upstream may be down; check caller health; check load balancer |
| [Service]PrimaryActionSuccessRateDrop | [Business metric success rate] | <[95]% over 10 min | P1 | [Service-specific action, e.g. "Check payment provider status"] |
| [Service]DownstreamDependencyErrors | Error rate calling [dependency] | >5% over 5 min | P2 | Check [dependency] status page; enable fallback if available |
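
The rule that every alert needs a name, condition, threshold, severity, and concrete action can be enforced mechanically. The sketch below is a hypothetical lint, assuming alert definitions are available as dictionaries; the field names and the list of vague actions are illustrative choices, not a standard.

```python
REQUIRED_FIELDS = ("name", "condition", "threshold", "severity", "action")
VAGUE_ACTIONS = {"investigate", "look into it", "check"}


def lint_alert(alert):
    """Return a list of problems with one alert definition (empty list means it passes)."""
    problems = [
        f"missing {field}"
        for field in REQUIRED_FIELDS
        if not str(alert.get(field, "")).strip()
    ]
    # An action that is only a generic verb gives on-call nothing to start from.
    action = str(alert.get("action", "")).strip().lower().rstrip(".")
    if action in VAGUE_ACTIONS:
        problems.append("action does not say what to check first")
    return problems


good = {
    "name": "PaymentsHighErrorRate",
    "condition": "5xx error rate, 5-min rolling window",
    "threshold": ">1% for 2 consecutive windows",
    "severity": "P1",
    "action": "Check recent deploys; inspect error logs",
}
vague = dict(good, action="Investigate.")
incomplete = {k: v for k, v in good.items() if k != "severity"}
```

Running a check like this in CI against the alert definitions is one way to keep the table above honest as alerts are added over time.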

Alert Configuration Examples

# Prometheus / Grafana alerting rules (adapt for your platform)
groups:
  - name: [service-name]-alerts
    rules:

      - alert: [Service]HighErrorRate
        expr: |
          (
            sum(rate([service]_http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate([service]_http_requests_total[5m]))
          ) > 0.01
        for: 2m
        labels:
          severity: critical
          team: [team-name]
        annotations:
          summary: "High error rate on [Service Name]"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 1%)"
          runbook_url: "[runbook link]"

      - alert: [Service]HighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum(rate([service]_http_request_duration_seconds_bucket[5m])) by (le, endpoint)
          ) > [0.5]
        for: 3m
        labels:
          severity: warning
          team: [team-name]
        annotations:
          summary: "p99 latency elevated on [Service Name]"
          description: "p99 latency on {{ $labels.endpoint }} is {{ $value | humanizeDuration }}"
          runbook_url: "[runbook link]"

# Datadog monitor configuration (Python SDK; the same monitor can also be managed in Terraform)
import os

from datadog import api, initialize

# Read credentials from the environment instead of hardcoding them
# (DD_API_KEY / DD_APP_KEY are conventional names; adjust to your setup).
initialize(api_key=os.environ["DD_API_KEY"], app_key=os.environ["DD_APP_KEY"])

api.Monitor.create(
    type="metric alert",
    # Plain string, not an f-string: the braces are Datadog tag scopes, not Python placeholders.
    query="sum(last_5m):sum:[service].http.errors{service:[service-name]} / sum:[service].http.requests{service:[service-name]} > 0.01",
    name="[Service] High Error Rate",
    message="Error rate exceeded 1%. @pagerduty-[service-oncall]\n\nRunbook: [link]",
    tags=["service:[service-name]", "team:[team-name]"],
    options={
        "thresholds": {"critical": 0.01, "warning": 0.005},
        "notify_no_data": False,
        "evaluation_delay": 60,
    },
)

7. Dashboard Layout Specification

The primary service dashboard must answer "is the service healthy right now?" at a glance. Use this layout:

┌─────────────────────────────────────────────────────────────────────┐
│  [SERVICE NAME] — Service Health Dashboard           [Time range ▼] │
├───────────────┬───────────────┬───────────────┬─────────────────────┤
│  Error rate   │  p99 Latency  │  RPS (current)│ SLO budget remaining│
│  [BIG NUMBER] │  [BIG NUMBER] │  [BIG NUMBER] │ [BIG NUMBER / days] │
│  vs SLO: 0.1% │  vs SLO: 500ms│  vs avg: [N]  │ [Error budget gauge]│
├───────────────┴───────────────┴───────────────┴─────────────────────┤
│                   Error rate over time (24h)                        │
│  [Time series: 5xx rate line, SLO threshold line]                   │
├─────────────────────────────────┬───────────────────────────────────┤
│  Latency percentiles over time  │  Request throughput over time     │
│  [Lines: p50, p95, p99, p999]   │  [Bars: RPS by endpoint]          │
│  [SLO threshold horizontal line]│                                   │
├─────────────────────────────────┴───────────────────────────────────┤
│  Latency heatmap (all requests — shows distribution shape)          │
├─────────────────────────────────┬───────────────────────────────────┤
│  CPU utilisation over time      │  Memory utilisation over time     │
│  [All instances/pods — lines]   │  [All instances/pods — lines]     │
│  [Alert threshold: 80%]         │  [Alert threshold: 85%]           │
├─────────────────────────────────┼───────────────────────────────────┤
│  DB: connection pool utilisation│  DB: query latency (p99 per query)│
├─────────────────────────────────┼───────────────────────────────────┤
│  [Business metric 1 over time]  │  [Business metric 2 over time]    │
│  e.g. Payment success rate      │  e.g. Orders created/min          │
└─────────────────────────────────┴───────────────────────────────────┘
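
The "SLO budget remaining" panel depends on an error-budget calculation. A minimal request-based sketch follows; `error_budget_remaining` is a hypothetical helper, and the 99.9% target is only an example chosen to match a 0.1% error-rate SLO.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent for the window (negative means overspent).

    `slo_target` is the success-rate objective, e.g. 0.999 for 99.9%.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    if total_requests == 0:
        return 1.0  # no traffic, so none of the budget has been spent
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


# 1,000,000 requests at a 99.9% target allow about 1,000 failures.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
```

With 250 failures against roughly 1,000 allowed, about three quarters of the budget is left; a value below zero means the SLO for the window has already been missed.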

Second dashboard — Dependency Health:

┌─────────────────────────────────────────────────────────────────────┐
│  [SERVICE NAME] — Dependency Health                                 │
├─────────────────────────────────────────────────────────────────────┤
│  For each dependency: error rate | latency | current status         │
│  [Database]    [N]% errors | [N]ms p99 | ● Healthy / ⚠ Degraded    │
│  [Redis]       [N]% errors | [N]ms p99 | ● Healthy                 │
│  [External API][N]% errors | [N]ms p99 | ● Healthy                 │
├─────────────────────────────────────────────────────────────────────┤
│  Outbound call latency over time (one line per dependency)          │
├─────────────────────────────────────────────────────────────────────┤
│  Circuit breaker / fallback state (if implemented)                  │
└─────────────────────────────────────────────────────────────────────┘

8. Observability Debt Analysis

An honest assessment of what is missing today and how urgently each gap should be closed:

| Gap | Impact | Priority | Effort | Owner | Target date |
| --- | --- | --- | --- | --- | --- |
| [e.g. No distributed tracing: can't see cross-service latency] | High: blind to dependency issues | P1 | [2 days] | [Name] | [Date] |
| [e.g. No business metric alerts: only infra alerts] | High: silent business failures | P1 | [1 day] | [Name] | [Date] |
| [e.g. Logs are unstructured text: not searchable] | Medium: slow incident investigation | P2 | [3 days] | [Name] | [Date] |
| [e.g. No dead-letter queue monitoring] | Medium: failed messages go unnoticed | P2 | [4 hours] | [Name] | [Date] |
| [e.g. Alert thresholds not calibrated to production baseline] | Medium: alert fatigue or missed alerts | P2 | [1 day] | [Name] | [Date] |
| [e.g. No latency heatmap: outliers invisible in averages] | Low: harder to spot tail latency issues | P3 | [2 hours] | [Name] | [Date] |

Total observability debt: [N] items | Estimated effort: [N days]


Quality Checks

  • Every alert has a named on-call action — no alert says "investigate" without specifying what to investigate first
  • Alert thresholds are calibrated against production baselines, not set to default values from a template
  • Structured logging is implemented — no unstructured text log lines in production
  • PII is explicitly excluded from logs — a named engineer has verified this
  • Distributed tracing is propagating trace IDs across all service boundaries (verify with a test request)
  • The primary dashboard answers "is the service healthy?" in under 10 seconds — no hunting for the right panel
  • Business metrics are tracked alongside infrastructure metrics, not just the four golden signals
  • Observability debt items have owners and dates — not just "would be nice to have"

Anti-Patterns

  • Do not create alerts without a specific on-call action — an alert that just says "investigate" trains engineers to ignore it
  • Do not set alert thresholds from a template without calibrating against production baselines — uncalibrated thresholds cause either alert fatigue or missed incidents
  • Do not log PII, tokens, or secrets — a logging standard is incomplete without an explicit list of what must never be logged
  • Do not measure only the four golden signals without adding at least one business metric alert — infrastructure health can be green while the business-critical path is silently failing
  • Do not deploy distributed tracing without verifying that trace IDs propagate across all service boundaries — partial tracing is worse than no tracing because it produces misleading incomplete traces
