    sickn33

    slo-implementation

    sickn33/slo-implementation
    DevOps
    8,021

    SLO Implementation

    Framework for defining and implementing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.

    Do not use this skill when

    • The task is unrelated to SLO implementation
    • You need a different domain or tool outside this scope

    Instructions

    • Clarify goals, constraints, and required inputs.
    • Apply relevant best practices and validate outcomes.
    • Provide actionable steps and verification.
    • If detailed examples are required, open resources/implementation-playbook.md.

    Purpose

    Implement measurable reliability targets using SLIs, SLOs, and error budgets to balance reliability with innovation velocity.

    Use this skill when

    • Defining service reliability targets
    • Measuring user-perceived reliability
    • Implementing error budgets
    • Creating SLO-based alerts
    • Tracking reliability goals

    SLI/SLO/SLA Hierarchy

    SLA (Service Level Agreement)
      ↓ Contract with customers
    SLO (Service Level Objective)
      ↓ Internal reliability target
    SLI (Service Level Indicator)
      ↓ Actual measurement
    

    Defining SLIs

    Common SLI Types

    1. Availability SLI

    # Successful requests / Total requests
    sum(rate(http_requests_total{status!~"5.."}[28d]))
    /
    sum(rate(http_requests_total[28d]))
    

    2. Latency SLI

    # Requests below latency threshold / Total requests
    sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
    /
    sum(rate(http_request_duration_seconds_count[28d]))
    

    3. Durability SLI

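    # Successful writes / Total writes over the SLO window
    # (increase() keeps the ratio windowed and tolerates counter resets)
    sum(increase(storage_writes_successful_total[28d]))
    /
    sum(increase(storage_writes_total[28d]))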
    

    Reference: See references/slo-definitions.md
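
Each SLI above is a ratio of good events to total events. A minimal Python sketch of that ratio follows; the function name `ratio_sli`, the sample counts, and the choice to treat a window with no traffic as fully compliant are assumptions made for illustration, not part of the skill.

```python
def ratio_sli(good: float, total: float) -> float:
    """Return good/total as a fraction in [0, 1].

    Assumption: a window with no events counts as fully compliant (1.0),
    which avoids a division by zero. Other conventions are possible.
    """
    if total <= 0:
        return 1.0
    return good / total


# Hypothetical event counts over one 28-day window.
availability = ratio_sli(good=999_500, total=1_000_000)  # non-5xx / all requests
latency = ratio_sli(good=992_000, total=1_000_000)       # requests <= 0.5s / all

print(f"availability SLI: {availability:.2%}")
print(f"latency SLI:      {latency:.2%}")
```

The same function covers availability, latency, and durability, because only the definition of a "good" event changes.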

    Setting SLO Targets

    Availability SLO Examples

    SLO %     Downtime/Month (30 days)    Downtime/Year
    99%       7.2 hours                   3.65 days
    99.9%     43.2 minutes                8.76 hours
    99.95%    21.6 minutes                4.38 hours
    99.99%    4.32 minutes                52.56 minutes
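
The downtime figures in this table follow from multiplying the period by (1 - SLO). A short sketch of that arithmetic, assuming a 30-day month and a 365-day year; the helper name `allowed_downtime` is hypothetical.

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, period: timedelta) -> timedelta:
    """Downtime permitted in `period` by an availability SLO given in percent."""
    error_budget = 1 - slo_percent / 100
    return period * error_budget


month = timedelta(days=30)   # assumption: a month is 30 days
year = timedelta(days=365)   # assumption: a year is 365 days

for slo in (99.0, 99.9, 99.95, 99.99):
    per_month = allowed_downtime(slo, month).total_seconds() / 60
    per_year = allowed_downtime(slo, year).total_seconds() / 3600
    print(f"{slo}%: {per_month:.2f} min/month, {per_year:.2f} h/year")
```

A 28-day window, as used in the queries elsewhere in this skill, gives slightly smaller budgets than the 30-day figures in the table.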

    Choose Appropriate SLOs

    Consider:

    • User expectations
    • Business requirements
    • Current performance
    • Cost of reliability
    • Competitor benchmarks

    Example SLOs:

    slos:
      - name: api_availability
        target: 99.9
        window: 28d
        sli: |
          sum(rate(http_requests_total{status!~"5.."}[28d]))
          /
          sum(rate(http_requests_total[28d]))
    
      - name: api_latency_p99  # 99% of requests complete within 0.5s
        target: 99
        window: 28d
        sli: |
          sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
          /
          sum(rate(http_request_duration_seconds_count[28d]))
    

    Error Budget Calculation

    Error Budget Formula

    Error Budget = 1 - SLO Target
    

    Example:

    • SLO: 99.9% availability
    • Error Budget: 0.1% = 43.2 minutes/month
    • Current Error: 0.05% = 21.6 minutes/month
    • Remaining Budget: 50%
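
The example above can be reproduced with a few lines of Python. This is a sketch only; the function name `error_budget_remaining` is an assumption, and the inputs are fractions rather than percentages.

```python
def error_budget_remaining(slo_target: float, sli: float) -> float:
    """Fraction of the error budget still unspent.

    1.0 means nothing spent, 0.0 means exhausted, negative means overspent.
    Both arguments are fractions, e.g. 0.999 for 99.9%.
    """
    budget = 1 - slo_target
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return (sli - slo_target) / budget


# SLO 99.9%, measured error 0.05% (SLI 99.95%): half the budget is left.
remaining = error_budget_remaining(slo_target=0.999, sli=0.9995)
print(f"remaining budget: {remaining:.0%}")
```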

    Error Budget Policy

    error_budget_policy:
      - remaining_budget: 100%
        action: Normal development velocity
      - remaining_budget: 50%
        action: Consider postponing risky changes
      - remaining_budget: 10%
        action: Freeze non-critical changes
      - remaining_budget: 0%
        action: Feature freeze, focus on reliability
    

    Reference: See references/error-budget.md
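
One way to apply the policy above in tooling is a threshold lookup. The sketch below assumes each action applies once the remaining budget is at or below its threshold, and that anything above 50% counts as normal velocity; the policy does not spell out these boundaries, so treat them as an interpretation. The names `POLICY` and `policy_action` are hypothetical.

```python
# (threshold in percent, action), checked from most to least severe.
POLICY = [
    (0.0, "Feature freeze, focus on reliability"),
    (10.0, "Freeze non-critical changes"),
    (50.0, "Consider postponing risky changes"),
]
DEFAULT_ACTION = "Normal development velocity"


def policy_action(remaining_percent: float) -> str:
    """Return the first action whose threshold the remaining budget has reached."""
    for threshold, action in POLICY:
        if remaining_percent <= threshold:
            return action
    return DEFAULT_ACTION


for remaining in (65.0, 50.0, 5.0, -3.0):
    print(f"{remaining:>5}% remaining -> {policy_action(remaining)}")
```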

    SLO Implementation

    Prometheus Recording Rules

    # SLI Recording Rules
    groups:
      - name: sli_rules
        interval: 30s
        rules:
          # Availability SLI
          - record: sli:http_availability:ratio
            expr: |
              sum(rate(http_requests_total{status!~"5.."}[28d]))
              /
              sum(rate(http_requests_total[28d]))
    
          # Latency SLI (requests < 500ms)
          - record: sli:http_latency:ratio
            expr: |
              sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
              /
              sum(rate(http_request_duration_seconds_count[28d]))
    
      - name: slo_rules
        interval: 5m
        rules:
          # SLO compliance (1 = meeting SLO, 0 = violating)
          - record: slo:http_availability:compliance
            expr: sli:http_availability:ratio >= bool 0.999
    
          - record: slo:http_latency:compliance
            expr: sli:http_latency:ratio >= bool 0.99
    
          # Error budget remaining (percentage)
          - record: slo:http_availability:error_budget_remaining
            expr: |
              (sli:http_availability:ratio - 0.999) / (1 - 0.999) * 100
    
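          # Error budget burn rate (1 = spending the budget exactly on schedule).
          # One rule per window, because the alerting rules reference
          # the 5m, 30m, 1h, and 6h burn rates.
          - record: slo:http_availability:burn_rate_5m
            expr: |
              (1 - (
                sum(rate(http_requests_total{status!~"5.."}[5m]))
                /
                sum(rate(http_requests_total[5m]))
              )) / (1 - 0.999)

          - record: slo:http_availability:burn_rate_30m
            expr: |
              (1 - (
                sum(rate(http_requests_total{status!~"5.."}[30m]))
                /
                sum(rate(http_requests_total[30m]))
              )) / (1 - 0.999)

          - record: slo:http_availability:burn_rate_1h
            expr: |
              (1 - (
                sum(rate(http_requests_total{status!~"5.."}[1h]))
                /
                sum(rate(http_requests_total[1h]))
              )) / (1 - 0.999)

          - record: slo:http_availability:burn_rate_6h
            expr: |
              (1 - (
                sum(rate(http_requests_total{status!~"5.."}[6h]))
                /
                sum(rate(http_requests_total[6h]))
              )) / (1 - 0.999)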
    

    SLO Alerting Rules

    groups:
      - name: slo_alerts
        interval: 1m
        rules:
          # Fast burn: 14.4x rate, 1 hour window
          # Consumes 2% of a 30-day error budget in 1 hour
          - alert: SLOErrorBudgetBurnFast
            expr: |
              slo:http_availability:burn_rate_1h > 14.4
              and
              slo:http_availability:burn_rate_5m > 14.4
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "Fast error budget burn detected"
              description: "Error budget burning at {{ $value }}x rate"
    
          # Slow burn: 6x rate, 6 hour window
          # Consumes 5% of a 30-day error budget in 6 hours
          - alert: SLOErrorBudgetBurnSlow
            expr: |
              slo:http_availability:burn_rate_6h > 6
              and
              slo:http_availability:burn_rate_30m > 6
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "Slow error budget burn detected"
              description: "Error budget burning at {{ $value }}x rate"
    
          # Error budget exhausted
          - alert: SLOErrorBudgetExhausted
            expr: slo:http_availability:error_budget_remaining < 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "SLO error budget exhausted"
              description: "Error budget remaining: {{ $value }}%"
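
The 14.4x and 6x thresholds in these rules relate a burn rate to the share of the error budget it consumes over a window. The sketch below works through that arithmetic, assuming a 30-day budget period (720 hours); the function names are hypothetical.

```python
def budget_consumed(burn_rate: float, window_hours: float, period_hours: float) -> float:
    """Fraction of the period's error budget spent if burn_rate holds for window_hours."""
    return burn_rate * window_hours / period_hours


def burn_rate_threshold(budget_fraction: float, window_hours: float, period_hours: float) -> float:
    """Burn rate that spends budget_fraction of the budget within window_hours."""
    return budget_fraction * period_hours / window_hours


PERIOD_30D = 30 * 24  # assumption: 30-day budget period
PERIOD_28D = 28 * 24  # the window used by the recording rules

print(f"fast burn, 30d: {budget_consumed(14.4, 1, PERIOD_30D):.1%}")
print(f"slow burn, 30d: {budget_consumed(6, 6, PERIOD_30D):.1%}")
print(f"fast burn, 28d: {budget_consumed(14.4, 1, PERIOD_28D):.1%}")
print(f"2% in 1h on a 28d period needs {burn_rate_threshold(0.02, 1, PERIOD_28D):.2f}x")
```

With a 28-day period the same thresholds consume slightly more than 2% and 5%, so the thresholds may need adjusting if the budget period is not 30 days.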
    

    SLO Dashboard

    Grafana Dashboard Structure:

    ┌────────────────────────────────────┐
    │ SLO Compliance (Current)           │
    │ ✓ 99.95% (Target: 99.9%)          │
    ├────────────────────────────────────┤
    │ Error Budget Remaining: 65%        │
    │ ████████░░ 65%                     │
    ├────────────────────────────────────┤
    │ SLI Trend (28 days)                │
    │ [Time series graph]                │
    ├────────────────────────────────────┤
    │ Burn Rate Analysis                 │
    │ [Burn rate by time window]         │
    └────────────────────────────────────┘
    

    Example Queries:

    # Current SLO compliance
    sli:http_availability:ratio * 100
    
    # Error budget remaining
    slo:http_availability:error_budget_remaining
    
    # Days until error budget exhausted, assuming the average burn rate
    # over the 28d window continues
    (slo:http_availability:error_budget_remaining / 100)
    *
    28
    /
    ((1 - sli:http_availability:ratio) / (1 - 0.999))
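
The days-until-exhausted query divides the remaining share of the budget, scaled to the 28-day window, by the burn rate. A small sketch of the same calculation, with a hypothetical function name and made-up inputs:

```python
def days_until_exhausted(remaining_fraction: float, burn_rate: float,
                         window_days: float = 28.0) -> float:
    """Days until the budget runs out if burn_rate stays constant.

    remaining_fraction: unspent share of the budget (0.5 = 50%).
    burn_rate: 1.0 spends exactly one full budget per window.
    """
    if burn_rate <= 0:
        return float("inf")  # nothing is being spent
    return max(remaining_fraction, 0.0) * window_days / burn_rate


# Half the budget left, burning at twice the sustainable rate.
print(days_until_exhausted(remaining_fraction=0.5, burn_rate=2.0))
```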
    

    Multi-Window Burn Rate Alerts

    # Combination of short and long windows reduces false positives
    rules:
      - alert: SLOBurnRateHigh
        expr: |
          (
            slo:http_availability:burn_rate_1h > 14.4
            and
            slo:http_availability:burn_rate_5m > 14.4
          )
          or
          (
            slo:http_availability:burn_rate_6h > 6
            and
            slo:http_availability:burn_rate_30m > 6
          )
        labels:
          severity: critical
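
The combined expression fires only when both the long and the short window of a pair exceed the threshold. The sketch below mirrors that logic outside Prometheus, for example in a unit test of alert thresholds; the function name and the dictionary keys are assumptions.

```python
from typing import Mapping


def should_alert(burn_rates: Mapping[str, float]) -> bool:
    """Evaluate the multi-window condition on burn rates keyed by window."""
    # The long window shows the burn is significant; the short window shows
    # it is still happening, so the alert clears soon after recovery.
    fast = burn_rates["1h"] > 14.4 and burn_rates["5m"] > 14.4
    slow = burn_rates["6h"] > 6 and burn_rates["30m"] > 6
    return fast or slow


# A brief spike: only the 5m window is elevated, so no alert.
spike = {"5m": 20.0, "30m": 4.0, "1h": 2.0, "6h": 0.5}
# A sustained outage: both fast-burn windows exceed 14.4x.
outage = {"5m": 20.0, "30m": 18.0, "1h": 16.0, "6h": 3.0}

print(should_alert(spike), should_alert(outage))
```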
    

    SLO Review Process

    Weekly Review

    • Current SLO compliance
    • Error budget status
    • Trend analysis
    • Incident impact

    Monthly Review

    • SLO achievement
    • Error budget usage
    • Incident postmortems
    • SLO adjustments

    Quarterly Review

    • SLO relevance
    • Target adjustments
    • Process improvements
    • Tooling enhancements

    Best Practices

    1. Start with user-facing services
    2. Use multiple SLIs (availability, latency, etc.)
    3. Set achievable SLOs (don't aim for 100%)
    4. Implement multi-window alerts to reduce noise
    5. Track error budget consistently
    6. Review SLOs regularly
    7. Document SLO decisions
    8. Align with business goals
    9. Automate SLO reporting
    10. Use SLOs for prioritization

    Reference Files

    • assets/slo-template.md - SLO definition template
    • references/slo-definitions.md - SLO definition patterns
    • references/error-budget.md - Error budget calculations

    Related Skills

    • prometheus-configuration - For metric collection
    • grafana-dashboards - For SLO visualization

    Limitations

    • Use this skill only when the task clearly matches the scope described above.
    • Do not treat the output as a substitute for environment-specific validation, testing, or expert review.
    • Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.
    Repository
    sickn33/antigravity-awesome-skills