Smithery Logo
MCPsSkillsDocsPricing
Login
Smithery Logo

Give agents more agency

Resources

DocumentationPrivacy PolicySystem Status

Company

PricingAboutBlog

Connect

© 2026 Smithery. All rights reserved.

    sickn33

    incident-runbook-templates

    sickn33/incident-runbook-templates
    DevOps
    8,021
    1 installs

    About

    SKILL.md

    Install

    • Telegram
      Telegram
    • Slack
      Slack
    • Claude Code
      Claude Code
    • Codex
      Codex
    • OpenClaw
      OpenClaw
    • Cursor
      Cursor
    • Amp
      Amp
    • GitHub Copilot
      GitHub Copilot
    • Gemini CLI
      Gemini CLI
    • Kilo Code
      Kilo Code
    • Junie
      Junie
    • Replit
      Replit
    • Windsurf
      Windsurf
    • Cline
      Cline
    • Continue
      Continue
    • OpenCode
      OpenCode
    • OpenHands
      OpenHands
    • Roo Code
      Roo Code
    • Augment
      Augment
    • Goose
      Goose
    • Trae
      Trae
    • Zencoder
      Zencoder
    • Antigravity
      Antigravity
    • Download skill
    ├─
    ├─
    └─

    About

    Create structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions...

    SKILL.md

    Incident Runbook Templates

    Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.

    Do not use this skill when

    • The task is unrelated to incident runbook templates
    • You need a different domain or tool outside this scope

    Instructions

    • Clarify goals, constraints, and required inputs.
    • Apply relevant best practices and validate outcomes.
    • Provide actionable steps and verification.
    • If detailed examples are required, open resources/implementation-playbook.md.

    Use this skill when

    • Creating incident response procedures
    • Building service-specific runbooks
    • Establishing escalation paths
    • Documenting recovery procedures
    • Responding to active incidents
    • Onboarding on-call engineers

    Core Concepts

    1. Incident Severity Levels

    Severity Impact Response Time Example
    SEV1 Complete outage, data loss 15 min Production down
    SEV2 Major degradation 30 min Critical feature broken
    SEV3 Minor impact 2 hours Non-critical bug
    SEV4 Minimal impact Next business day Cosmetic issue

    2. Runbook Structure

    1. Overview & Impact
    2. Detection & Alerts
    3. Initial Triage
    4. Mitigation Steps
    5. Root Cause Investigation
    6. Resolution Procedures
    7. Verification & Rollback
    8. Communication Templates
    9. Escalation Matrix
    

    Runbook Templates

    Template 1: Service Outage Runbook

    # [Service Name] Outage Runbook
    
    ## Overview
    **Service**: Payment Processing Service
    **Owner**: Platform Team
    **Slack**: #payments-incidents
    **PagerDuty**: payments-oncall
    
    ## Impact Assessment
    - [ ] Which customers are affected?
    - [ ] What percentage of traffic is impacted?
    - [ ] Are there financial implications?
    - [ ] What's the blast radius?
    
    ## Detection
    ### Alerts
    - `payment_error_rate > 5%` (PagerDuty)
    - `payment_latency_p99 > 2s` (Slack)
    - `payment_success_rate < 95%` (PagerDuty)
    
    ### Dashboards
    - [Payment Service Dashboard](https://grafana/d/payments)
    - [Error Tracking](https://sentry.io/payments)
    - [Dependency Status](https://status.stripe.com)
    
    ## Initial Triage (First 5 Minutes)
    
    ### 1. Assess Scope
    ```bash
    # Check service health
    kubectl get pods -n payments -l app=payment-service
    
    # Check recent deployments
    kubectl rollout history deployment/payment-service -n payments
    
    # Check error rates
    curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
    

    2. Quick Health Checks

    • Can you reach the service? curl -I https://api.company.com/payments/health
    • Database connectivity? Check connection pool metrics
    • External dependencies? Check Stripe, bank API status
    • Recent changes? Check deploy history

    3. Initial Classification

    Symptom Likely Cause Go To Section
    All requests failing Service down Section 4.1
    High latency Database/dependency Section 4.2
    Partial failures Code bug Section 4.3
    Spike in errors Traffic surge Section 4.4

    Mitigation Procedures

    4.1 Service Completely Down

    # Step 1: Check pod status
    kubectl get pods -n payments
    
    # Step 2: If pods are crash-looping, check logs
    kubectl logs -n payments -l app=payment-service --tail=100
    
    # Step 3: Check recent deployments
    kubectl rollout history deployment/payment-service -n payments
    
    # Step 4: ROLLBACK if recent deploy is suspect
    kubectl rollout undo deployment/payment-service -n payments
    
    # Step 5: Scale up if resource constrained
    kubectl scale deployment/payment-service -n payments --replicas=10
    
    # Step 6: Verify recovery
    kubectl rollout status deployment/payment-service -n payments
    

    4.2 High Latency

    # Step 1: Check database connections
    kubectl exec -n payments deploy/payment-service -- \
      curl localhost:8080/metrics | grep db_pool
    
    # Step 2: Check slow queries (if DB issue)
    psql -h $DB_HOST -U $DB_USER -c "
      SELECT pid, now() - query_start AS duration, query
      FROM pg_stat_activity
      WHERE state = 'active' AND duration > interval '5 seconds'
      ORDER BY duration DESC;"
    
    # Step 3: Kill long-running queries if needed
    psql -h $DB_HOST -U $DB_USER -c "SELECT pg_terminate_backend(pid);"
    
    # Step 4: Check external dependency latency
    curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health
    
    # Step 5: Enable circuit breaker if dependency is slow
    kubectl set env deployment/payment-service \
      STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments
    

    4.3 Partial Failures (Specific Errors)

    # Step 1: Identify error pattern
    kubectl logs -n payments -l app=payment-service --tail=500 | \
      grep -i error | sort | uniq -c | sort -rn | head -20
    
    # Step 2: Check error tracking
    # Go to Sentry: https://sentry.io/payments
    
    # Step 3: If specific endpoint, enable feature flag to disable
    curl -X POST https://api.company.com/internal/feature-flags \
      -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'
    
    # Step 4: If data issue, check recent data changes
    psql -h $DB_HOST -c "
      SELECT * FROM audit_log
      WHERE table_name = 'payment_methods'
      AND created_at > now() - interval '1 hour';"
    

    4.4 Traffic Surge

    # Step 1: Check current request rate
    kubectl top pods -n payments
    
    # Step 2: Scale horizontally
    kubectl scale deployment/payment-service -n payments --replicas=20
    
    # Step 3: Enable rate limiting
    kubectl set env deployment/payment-service \
      RATE_LIMIT_ENABLED=true \
      RATE_LIMIT_RPS=1000 -n payments
    
    # Step 4: If attack, block suspicious IPs
    kubectl apply -f - <<EOF
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: block-suspicious
      namespace: payments
    spec:
      podSelector:
        matchLabels:
          app: payment-service
      ingress:
      - from:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
            - 192.168.1.0/24  # Suspicious range
    EOF
    

    Verification Steps

    # Verify service is healthy
    curl -s https://api.company.com/payments/health | jq
    
    # Verify error rate is back to normal
    curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))" | jq '.data.result[0].value[1]'
    
    # Verify latency is acceptable
    curl -s "http://prometheus:9090/api/v1/query?query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))" | jq
    
    # Smoke test critical flows
    ./scripts/smoke-test-payments.sh
    

    Rollback Procedures

    # Rollback Kubernetes deployment
    kubectl rollout undo deployment/payment-service -n payments
    
    # Rollback database migration (if applicable)
    ./scripts/db-rollback.sh $MIGRATION_VERSION
    
    # Rollback feature flag
    curl -X POST https://api.company.com/internal/feature-flags \
      -d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'
    

    Escalation Matrix

    Condition Escalate To Contact
    > 15 min unresolved SEV1 Engineering Manager @manager (Slack)
    Data breach suspected Security Team #security-incidents
    Financial impact > $10k Finance + Legal @finance-oncall
    Customer communication needed Support Lead @support-lead

    Communication Templates

    Initial Notification (Internal)

    🚨 INCIDENT: Payment Service Degradation
    
    Severity: SEV2
    Status: Investigating
    Impact: ~20% of payment requests failing
    Start Time: [TIME]
    Incident Commander: [NAME]
    
    Current Actions:
    - Investigating root cause
    - Scaling up service
    - Monitoring dashboards
    
    Updates in #payments-incidents
    

    Status Update

    📊 UPDATE: Payment Service Incident
    
    Status: Mitigating
    Impact: Reduced to ~5% failure rate
    Duration: 25 minutes
    
    Actions Taken:
    - Rolled back deployment v2.3.4 → v2.3.3
    - Scaled service from 5 → 10 replicas
    
    Next Steps:
    - Continuing to monitor
    - Root cause analysis in progress
    
    ETA to Resolution: ~15 minutes
    

    Resolution Notification

    ✅ RESOLVED: Payment Service Incident
    
    Duration: 45 minutes
    Impact: ~5,000 affected transactions
    Root Cause: Memory leak in v2.3.4
    
    Resolution:
    - Rolled back to v2.3.3
    - Transactions auto-retried successfully
    
    Follow-up:
    - Postmortem scheduled for [DATE]
    - Bug fix in progress
    
    
    ### Template 2: Database Incident Runbook
    
    ```markdown
    # Database Incident Runbook
    
    ## Quick Reference
    | Issue | Command |
    |-------|---------|
    | Check connections | `SELECT count(*) FROM pg_stat_activity;` |
    | Kill query | `SELECT pg_terminate_backend(pid);` |
    | Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |
    | Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |
    
    ## Connection Pool Exhaustion
    ```sql
    -- Check current connections
    SELECT datname, usename, state, count(*)
    FROM pg_stat_activity
    GROUP BY datname, usename, state
    ORDER BY count(*) DESC;
    
    -- Identify long-running connections
    SELECT pid, usename, datname, state, query_start, query
    FROM pg_stat_activity
    WHERE state != 'idle'
    ORDER BY query_start;
    
    -- Terminate idle connections
    SELECT pg_terminate_backend(pid)
    FROM pg_stat_activity
    WHERE state = 'idle'
    AND query_start < now() - interval '10 minutes';
    

    Replication Lag

    -- Check lag on replica
    SELECT
      CASE
        WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
        ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
      END AS lag_seconds;
    
    -- If lag > 60s, consider:
    -- 1. Check network between primary/replica
    -- 2. Check replica disk I/O
    -- 3. Consider failover if unrecoverable
    

    Disk Space Critical

    # Check disk usage
    df -h /var/lib/postgresql/data
    
    # Find large tables
    psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
    FROM pg_catalog.pg_statio_user_tables
    ORDER BY pg_total_relation_size(relid) DESC
    LIMIT 10;"
    
    # VACUUM to reclaim space
    psql -c "VACUUM FULL large_table;"
    
    # If emergency, delete old data or expand disk
    
    
    ## Best Practices
    
    ### Do's
    - **Keep runbooks updated** - Review after every incident
    - **Test runbooks regularly** - Game days, chaos engineering
    - **Include rollback steps** - Always have an escape hatch
    - **Document assumptions** - What must be true for steps to work
    - **Link to dashboards** - Quick access during stress
    
    ### Don'ts
    - **Don't assume knowledge** - Write for 3 AM brain
    - **Don't skip verification** - Confirm each step worked
    - **Don't forget communication** - Keep stakeholders informed
    - **Don't work alone** - Escalate early
    - **Don't skip postmortems** - Learn from every incident
    
    ## Resources
    
    - [Google SRE Book - Incident Management](https://sre.google/sre-book/managing-incidents/)
    - [PagerDuty Incident Response](https://response.pagerduty.com/)
    - [Atlassian Incident Management](https://www.atlassian.com/incident-management)
    
    ## Limitations
    - Use this skill only when the task clearly matches the scope described above.
    - Do not treat the output as a substitute for environment-specific validation, testing, or expert review.
    - Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.
    
    Recommended Servers
    Airtable
    Airtable
    ScrapeGraph AI Integration Server
    ScrapeGraph AI Integration Server
    Repository
    sickn33/antigravity-awesome-skills
    Files