# Incident Runbook Templates

Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.

## When to Use This Skill

- Creating incident response procedures
- Building service-specific runbooks
- Establishing escalation paths
- Documenting recovery procedures
- Responding to active incidents
- Onboarding on-call engineers

## Core Concepts

### 1. Incident Severity Levels

| Severity | Impact | Response Time | Example |
|----------|--------|---------------|---------|
| SEV1 | Complete outage, data loss | 15 min | Production down |
| SEV2 | Major degradation | 30 min | Critical feature broken |
| SEV3 | Minor impact | 2 hours | Non-critical bug |
| SEV4 | Minimal impact | Next business day | Cosmetic issue |

### 2. Runbook Structure

1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix
    

## Runbook Templates

### Template 1: Service Outage Runbook

````markdown
# [Service Name] Outage Runbook

## Overview

**Service**: Payment Processing Service
**Owner**: Platform Team
**Slack**: #payments-incidents
**PagerDuty**: payments-oncall

## Impact Assessment

- [ ] Which customers are affected?
- [ ] What percentage of traffic is impacted?
- [ ] Are there financial implications?
- [ ] What's the blast radius?

## Detection

### Alerts

- `payment_error_rate > 5%` (PagerDuty)
- `payment_latency_p99 > 2s` (Slack)
- `payment_success_rate < 95%` (PagerDuty)

### Dashboards

- [Payment Service Dashboard](https://grafana/d/payments)
- [Error Tracking](https://sentry.io/payments)
- [Dependency Status](https://status.stripe.com)

## Initial Triage (First 5 Minutes)

### 1. Assess Scope

```bash
# Check service health
kubectl get pods -n payments -l app=payment-service

# Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Check error rates
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
```
    

### 2. Quick Health Checks

- Can you reach the service? `curl -I https://api.company.com/payments/health`
- Database connectivity? Check connection pool metrics
- External dependencies? Check Stripe, bank API status
- Recent changes? Check deploy history
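
The same checks as commands (a sketch; the hostnames, ports, and URLs are the placeholders already used elsewhere in this runbook):

```bash
# Reachability: expect HTTP 200 from the health endpoint
curl -I https://api.company.com/payments/health

# Database connectivity: connection pool metrics exposed by the service
kubectl exec -n payments deploy/payment-service -- \
  curl -s localhost:8080/metrics | grep db_pool

# External dependencies: Stripe status page
curl -sI https://status.stripe.com | head -1

# Recent changes: deploy history
kubectl rollout history deployment/payment-service -n payments
```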

### 3. Initial Classification

| Symptom | Likely Cause | Go To Section |
|---------|--------------|---------------|
| All requests failing | Service down | Section 4.1 |
| High latency | Database/dependency | Section 4.2 |
| Partial failures | Code bug | Section 4.3 |
| Spike in errors | Traffic surge | Section 4.4 |

## Mitigation Procedures

### 4.1 Service Completely Down

```bash
# Step 1: Check pod status
kubectl get pods -n payments

# Step 2: If pods are crash-looping, check logs
kubectl logs -n payments -l app=payment-service --tail=100

# Step 3: Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Step 4: ROLLBACK if recent deploy is suspect
kubectl rollout undo deployment/payment-service -n payments

# Step 5: Scale up if resource constrained
kubectl scale deployment/payment-service -n payments --replicas=10

# Step 6: Verify recovery
kubectl rollout status deployment/payment-service -n payments
```
    

### 4.2 High Latency

```bash
# Step 1: Check database connections
kubectl exec -n payments deploy/payment-service -- \
  curl localhost:8080/metrics | grep db_pool

# Step 2: Check slow queries (if DB issue)
psql -h $DB_HOST -U $DB_USER -c "
  SELECT pid, now() - query_start AS duration, query
  FROM pg_stat_activity
  WHERE state = 'active' AND now() - query_start > interval '5 seconds'
  ORDER BY duration DESC;"

# Step 3: Kill long-running queries if needed (use a pid from Step 2)
psql -h $DB_HOST -U $DB_USER -c "SELECT pg_terminate_backend(<pid>);"

# Step 4: Check external dependency latency
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health

# Step 5: Enable circuit breaker if dependency is slow
kubectl set env deployment/payment-service \
  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments
```
    

### 4.3 Partial Failures (Specific Errors)

```bash
# Step 1: Identify error pattern
kubectl logs -n payments -l app=payment-service --tail=500 | \
  grep -i error | sort | uniq -c | sort -rn | head -20

# Step 2: Check error tracking
# Go to Sentry: https://sentry.io/payments

# Step 3: If specific endpoint, enable feature flag to disable
curl -X POST https://api.company.com/internal/feature-flags \
  -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'

# Step 4: If data issue, check recent data changes
psql -h $DB_HOST -c "
  SELECT * FROM audit_log
  WHERE table_name = 'payment_methods'
  AND created_at > now() - interval '1 hour';"
```
    

### 4.4 Traffic Surge

```bash
# Step 1: Check current request rate
kubectl top pods -n payments

# Step 2: Scale horizontally
kubectl scale deployment/payment-service -n payments --replicas=20

# Step 3: Enable rate limiting
kubectl set env deployment/payment-service \
  RATE_LIMIT_ENABLED=true \
  RATE_LIMIT_RPS=1000 -n payments

# Step 4: If attack, block suspicious IPs
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-suspicious
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  ingress:
  - from:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 192.168.1.0/24  # Suspicious range
EOF
```
    

## Verification Steps

```bash
# Verify service is healthy
curl -s https://api.company.com/payments/health | jq

# Verify error rate is back to normal
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))" | jq '.data.result[0].value[1]'

# Verify latency is acceptable
curl -s "http://prometheus:9090/api/v1/query?query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))" | jq

# Smoke test critical flows
./scripts/smoke-test-payments.sh
```
    

## Rollback Procedures

```bash
# Rollback Kubernetes deployment
kubectl rollout undo deployment/payment-service -n payments

# Rollback database migration (if applicable)
./scripts/db-rollback.sh $MIGRATION_VERSION

# Rollback feature flag
curl -X POST https://api.company.com/internal/feature-flags \
  -d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'
```
    

## Escalation Matrix

| Condition | Escalate To | Contact |
|-----------|-------------|---------|
| > 15 min unresolved SEV1 | Engineering Manager | @manager (Slack) |
| Data breach suspected | Security Team | #security-incidents |
| Financial impact > $10k | Finance + Legal | @finance-oncall |
| Customer communication needed | Support Lead | @support-lead |
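
If the escalation target needs to be paged rather than pinged in Slack, a page can be triggered programmatically (a sketch assuming a PagerDuty Events API v2 routing key for the relevant escalation policy; the key value is a placeholder):

```bash
# Page the escalation target directly (routing key below is a placeholder)
curl -s -X POST https://events.pagerduty.com/v2/enqueue \
  -H "Content-Type: application/json" \
  -d '{
    "routing_key": "YOUR_ESCALATION_ROUTING_KEY",
    "event_action": "trigger",
    "payload": {
      "summary": "SEV1 unresolved > 15 min: payment-service outage",
      "source": "payment-service",
      "severity": "critical"
    }
  }'
```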

## Communication Templates

### Initial Notification (Internal)

```
🚨 INCIDENT: Payment Service Degradation

Severity: SEV2
Status: Investigating
Impact: ~20% of payment requests failing
Start Time: [TIME]
Incident Commander: [NAME]

Current Actions:
- Investigating root cause
- Scaling up service
- Monitoring dashboards

Updates in #payments-incidents
```
    

### Status Update

```
📊 UPDATE: Payment Service Incident

Status: Mitigating
Impact: Reduced to ~5% failure rate
Duration: 25 minutes

Actions Taken:
- Rolled back deployment v2.3.4 → v2.3.3
- Scaled service from 5 → 10 replicas

Next Steps:
- Continuing to monitor
- Root cause analysis in progress

ETA to Resolution: ~15 minutes
```
    

### Resolution Notification

```
✅ RESOLVED: Payment Service Incident

Duration: 45 minutes
Impact: ~5,000 affected transactions
Root Cause: Memory leak in v2.3.4

Resolution:
- Rolled back to v2.3.3
- Transactions auto-retried successfully

Follow-up:
- Postmortem scheduled for [DATE]
- Bug fix in progress
```
    
    
````

### Template 2: Database Incident Runbook

````markdown
# Database Incident Runbook

## Quick Reference

| Issue | Command |
|-------|---------|
| Check connections | `SELECT count(*) FROM pg_stat_activity;` |
| Kill query | `SELECT pg_terminate_backend(pid);` |
| Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |
| Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |

## Connection Pool Exhaustion

```sql
-- Check current connections
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;

-- Identify long-running connections
SELECT pid, usename, datname, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;

-- Terminate idle connections
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND query_start < now() - interval '10 minutes';
```
    

## Replication Lag

```sql
-- Check lag on replica
SELECT
  CASE
    WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
    ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
  END AS lag_seconds;

-- If lag > 60s, consider:
-- 1. Check network between primary/replica
-- 2. Check replica disk I/O
-- 3. Consider failover if unrecoverable
```
    

## Disk Space Critical

```bash
# Check disk usage
df -h /var/lib/postgresql/data

# Find large tables
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;"

# VACUUM to reclaim space
# (note: VACUUM FULL rewrites the table and needs free disk roughly the size of
# the table; if the disk is nearly full, prefer plain VACUUM or free space first)
psql -c "VACUUM FULL large_table;"

# If emergency, delete old data or expand disk
```

````
    
    
## Best Practices

### Do's
- **Keep runbooks updated** - Review after every incident
- **Test runbooks regularly** - Game days, chaos engineering
- **Include rollback steps** - Always have an escape hatch
- **Document assumptions** - What must be true for steps to work
- **Link to dashboards** - Quick access during stress

### Don'ts
- **Don't assume knowledge** - Write for 3 AM brain
- **Don't skip verification** - Confirm each step worked
- **Don't forget communication** - Keep stakeholders informed
- **Don't work alone** - Escalate early
- **Don't skip postmortems** - Learn from every incident

## Troubleshooting

### Runbook steps work in staging but fail during a real incident

Steps often assume preconditions that are true in a healthy environment but not during an outage. For each command in your runbook, add a prerequisite check and a "what to do if this command fails" note:

```bash
# Step: Check pod status
kubectl get pods -n payments

# Prerequisites: kubectl configured, kubeconfig points to correct cluster
# If this fails: run `aws eks update-kubeconfig --name prod-cluster --region us-east-1`
# Expected output: pods in Running state
```
    

### On-call engineer panics and skips steps out of order

Add a numbered checklist at the top of the runbook that mirrors the section numbers, so responders can track progress under stress without reading the full document:

```markdown
## Quick Checklist
- [ ] 1. Declare incident severity and open war room
- [ ] 2. Check service health (Section 4.1)
- [ ] 3. Check recent deployments (Section 4.1)
- [ ] 4. Roll back if deploy is suspect (Section 4.1)
- [ ] 5. Post initial notification to #payments-incidents
- [ ] 6. Escalate if > 15 min unresolved
```
    

### Runbook is outdated — commands reference old cluster names or endpoints

Runbooks rot because they're updated manually. Include a "Last Verified" date and owner at the top, and add a CI check that validates that all curl endpoints and kubectl context names are still valid (a sketch of such a check follows the metadata block):

```markdown
## Runbook Metadata
| Field | Value |
|---|---|
| Last verified | 2024-11-15 |
| Owner | @platform-team |
| Review cadence | After every SEV1/SEV2 |
```
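
A minimal version of that CI check might look like the sketch below; the `runbooks/*.md` glob, the URL regex, and the assumption that the CI runner has `kubectl` configured and network access to the linked dashboards are all placeholders to adapt.

```bash
#!/usr/bin/env bash
# Sketch of a CI check: flag dead URLs and unknown kubectl contexts referenced in runbooks.
# Assumes runbooks live under runbooks/*.md; adjust the glob for your repo layout.
set -uo pipefail
failures=0

# Every http(s) URL mentioned in a runbook should still respond.
for url in $(grep -hoE 'https?://[^ )`">]+' runbooks/*.md | sort -u); do
  curl -sfIL --max-time 10 "$url" > /dev/null || { echo "DEAD LINK: $url"; failures=1; }
done

# Every kubectl context named in a runbook should still exist on the runner.
for ctx in $(grep -hoE 'kubectl config use-context [A-Za-z0-9_.-]+' runbooks/*.md | awk '{print $4}' | sort -u); do
  kubectl config get-contexts "$ctx" > /dev/null 2>&1 || { echo "UNKNOWN CONTEXT: $ctx"; failures=1; }
done

exit "$failures"
```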
    

### Stakeholder communication is delayed while engineers are heads-down

Assign a dedicated incident communicator role (separate from the incident commander) whose only job is to post status updates. Add a standing agenda in the communication template:

```
Update every 15 minutes (even if no new information):
- Current status (Investigating / Mitigating / Monitoring)
- Impact (what is broken, who is affected, % of traffic)
- What we are doing right now
- Next update in: 15 minutes
```
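
To make that cadence easy to keep, the communicator can post the update from a small helper script. This is a sketch: the script name is hypothetical and `SLACK_WEBHOOK_URL` is assumed to be a standard Slack incoming-webhook URL exported in the environment.

```bash
#!/usr/bin/env bash
# Sketch: post a templated status update to the incident channel via a Slack incoming webhook.
# SLACK_WEBHOOK_URL is a placeholder env var; values must not contain double quotes (no JSON escaping here).
# Usage: ./post-incident-update.sh "Mitigating" "~5% of payment requests failing" "Rolling back v2.3.4"
set -euo pipefail

STATUS="$1"; IMPACT="$2"; ACTION="$3"

curl -sf -X POST "$SLACK_WEBHOOK_URL" \
  -H "Content-Type: application/json" \
  -d "{\"text\":\"📊 UPDATE: Payment Service Incident\\nStatus: ${STATUS}\\nImpact: ${IMPACT}\\nWhat we are doing right now: ${ACTION}\\nNext update in: 15 minutes\"}"
```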
    

### Database runbook commands cause additional downtime when run incorrectly

Add explicit warnings before destructive SQL commands and require a dry-run output check before executing:

```sql
-- WARNING: This terminates active connections. Verify count first.
-- DRY RUN (check count before terminating):
SELECT count(*) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '10 minutes';

-- EXECUTE only after verifying count is reasonable (< 50):
SELECT pg_terminate_backend(pid) FROM pg_stat_activity
WHERE state = 'idle' AND query_start < now() - interval '10 minutes';
```
    

## Related Skills

- `postmortem-writing` - After resolving an incident, use postmortem templates to capture root cause and preventive actions
- `on-call-handoff-patterns` - Structure shift handoffs so the incoming responder has full context on active incidents