    yzavyas/operations
    DevOps
    About

    This skill should be used when the user asks to "review for production", "check production readiness", "evaluate resilience", "assess observability", "review ops", "run chaos experiments", or...

    SKILL.md

    Operations

    Production readiness evaluation focused on resilience, observability, and incident response.

    Resilience

    Failure Modes

    • What can fail? List all external dependencies
    • Blast radius: If X fails, what else breaks?
    • Graceful degradation: Does a partial failure stay partial, or does it turn into a total outage?

    Patterns

    Pattern          Purpose                    Check
    Timeouts         Prevent hung connections   Every external call has one?
    Circuit Breaker  Stop cascading failures    On critical paths?
    Bulkhead         Isolate failures           Separate thread pools?
    Retry            Handle transient failures  With backoff? Bounded?
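
As an illustration of the circuit breaker and retry rows, here is a minimal sketch in plain Python. The names `CircuitBreaker`, `CircuitOpenError`, and `retry`, and all default thresholds, are hypothetical and not part of this skill. The wrapped function is assumed to enforce its own timeout (for example through its HTTP client).

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    """Hypothetical minimal breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result


def retry(fn, attempts=3, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Retry fn with capped exponential backoff and full jitter; bounded by `attempts`."""
    for attempt in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise  # do not hammer a dependency whose circuit is open
        except Exception:
            if attempt == attempts - 1:
                raise  # bounded: give up after the last attempt
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

A caller might combine the two as `retry(lambda: breaker.call(fetch_profile))`, where `fetch_profile` stands in for any external call. Wrapping the breaker inside the retry means an open circuit ends the retry loop immediately.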

    Observability

    The RED Method

    Metric    What                  Why
    Rate      Requests per second   Traffic understanding
    Errors    Failed requests       Problem detection
    Duration  Latency distribution  Performance tracking
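
To make the three RED metrics concrete, the sketch below records requests for one fixed window and derives rate, error ratio, and latency percentiles. `RedWindow` is a hypothetical in-memory helper for illustration only. A real service would more likely export counters and histograms through its metrics library.

```python
import math
from dataclasses import dataclass, field


@dataclass
class RedWindow:
    """Hypothetical in-memory RED recorder for one fixed time window."""
    window_seconds: float
    durations: list = field(default_factory=list)
    errors: int = 0

    def observe(self, duration_seconds, ok=True):
        """Record one finished request and whether it succeeded."""
        self.durations.append(duration_seconds)
        if not ok:
            self.errors += 1

    def rate(self):
        """Rate: requests per second over the window."""
        return len(self.durations) / self.window_seconds

    def error_ratio(self):
        """Errors: fraction of requests that failed."""
        return self.errors / len(self.durations) if self.durations else 0.0

    def percentile(self, p):
        """Duration: nearest-rank percentile, with p in (0, 100]."""
        if not self.durations:
            return None
        ordered = sorted(self.durations)
        rank = math.ceil(p / 100 * len(ordered))
        return ordered[max(rank, 1) - 1]
```

Reporting percentiles such as p50 and p99 rather than a mean keeps the "latency distribution" visible, since a mean hides slow outliers.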

    Logging

    • Structured (JSON, not free text)
    • Correlation IDs across services
    • Appropriate levels (not everything is ERROR)
    • PII redaction
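
The logging points above can be sketched with the standard `logging` module. This is an illustrative setup, not the skill's prescribed one: `JsonFormatter`, `build_logger`, and the `correlation_id` context variable are hypothetical names, and the redaction rule assumes e-mail addresses are the only PII to scrub.

```python
import contextvars
import json
import logging
import re
import uuid

# Correlation ID for the current request; set once at the service edge.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

# Assumption: e-mail addresses are the only PII scrubbed in this sketch.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object with a correlation ID."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": EMAIL_RE.sub("[REDACTED]", record.getMessage()),
            "correlation_id": correlation_id.get(),
        }
        return json.dumps(payload)


def build_logger(name, stream=None):
    """Return a logger that writes JSON lines to `stream` (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    correlation_id.set(str(uuid.uuid4()))
    log = build_logger("checkout")
    log.info("payment accepted for %s", "alice@example.com")
    log.warning("retrying upstream call")  # WARNING, not ERROR: nothing has failed yet
```

Passing the same correlation ID to downstream services (for example in a request header) is what lets log lines from different services be joined later.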

    Tracing

    • Distributed tracing enabled?
    • Spans for all external calls?
    • Context propagation working?
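
One way to check "context propagation" is to confirm that every outgoing call carries the caller's trace ID. The sketch below hand-rolls the W3C Trace Context `traceparent` header format (`00-<trace-id>-<span-id>-<flags>`) purely for illustration. The function names are hypothetical, and in practice a tracing SDK would normally handle this.

```python
import re
import secrets

# version 00, 32-hex trace ID, 16-hex span ID, 2-hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace with a random trace ID and span ID."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def child_traceparent(incoming):
    """Keep the incoming trace ID and mint a new span ID for the outgoing call."""
    match = TRACEPARENT_RE.match(incoming or "")
    if match is None:
        return new_traceparent()  # missing or malformed header: start a new trace
    trace_id, _parent_span, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def outgoing_headers(incoming_headers):
    """Headers to attach to a downstream request (lower-cased keys assumed)."""
    return {"traceparent": child_traceparent(incoming_headers.get("traceparent"))}
```

If a downstream service logs a trace ID that differs from the one at the edge, propagation is broken somewhere between the two.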

    Capacity

    • Scaling: Horizontal preferred, auto-scaling configured?
    • Limits: Memory, CPU, connections all bounded?
    • Backpressure: What happens at 2x load? 10x?
    • Rate Limiting: Per-tenant/client quotas?
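
As a sketch of per-tenant quotas, a token bucket per tenant bounds both the sustained rate and the burst size. `TokenBucket` and `TenantLimiter` are hypothetical names, and the numbers in the usage note are illustrative.

```python
import time
from collections import defaultdict


class TokenBucket:
    """Allow up to `capacity` requests in a burst, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # Refill for the elapsed time, never exceeding the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should shed load, e.g. respond with HTTP 429


class TenantLimiter:
    """One independent bucket per tenant, created on first use."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.buckets = defaultdict(lambda: TokenBucket(rate, capacity, clock))

    def allow(self, tenant_id):
        return self.buckets[tenant_id].allow()
```

For example, `TenantLimiter(rate=5, capacity=10)` would let each tenant burst to 10 requests and sustain 5 per second. The bucket map here grows with the number of tenants, so a long-running service would also need to evict idle entries to keep memory bounded.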

    Security Posture

    • Secrets: In vault, not env vars or code
    • Network: Least privilege, mTLS where possible
    • Dependencies: Vulnerability scanning in CI
    • Access: Audit logging for sensitive operations

    Incident Readiness

    • Runbooks: Documented recovery procedures
    • On-call: Rotation defined, escalation clear
    • Rollback: One-click, tested regularly
    • Communication: Status page, stakeholder notification

    Checklist

    □ All external calls have timeouts
    □ Circuit breakers on critical paths
    □ Structured logging with correlation IDs
    □ RED metrics exposed
    □ Alerts are actionable (not noisy)
    □ Auto-scaling configured with limits
    □ Graceful shutdown implemented
    □ Health checks (liveness + readiness)
    □ Secrets in vault
    □ Runbook exists
    □ Rollback tested
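
The "graceful shutdown" and "health checks" items interact: on SIGTERM a service typically fails readiness first, then drains in-flight requests before exiting. The sketch below shows that ordering under stated assumptions. `HealthState`, `install_sigterm_handler`, and the status tuples are hypothetical, and wiring them to real HTTP endpoints is left out.

```python
import signal
import threading


class HealthState:
    """Hypothetical holder for liveness and readiness, plus in-flight tracking."""

    def __init__(self):
        self.alive = True    # liveness: the process is not wedged
        self.ready = False   # readiness: safe to receive traffic
        self.in_flight = 0
        self._lock = threading.Lock()
        self._drained = threading.Event()
        self._drained.set()  # nothing in flight yet

    def liveness(self):
        return (200, "alive") if self.alive else (500, "dead")

    def readiness(self):
        return (200, "ready") if self.ready else (503, "not ready")

    def request_started(self):
        with self._lock:
            self.in_flight += 1
            self._drained.clear()

    def request_finished(self):
        with self._lock:
            self.in_flight -= 1
            if self.in_flight == 0:
                self._drained.set()

    def drain(self, timeout=10.0):
        """Stop advertising readiness, then wait for in-flight requests.

        Returns True if everything finished before the timeout.
        """
        self.ready = False
        return self._drained.wait(timeout)


def install_sigterm_handler(state, stop_event):
    """Call from the main thread; the main loop waits on stop_event, then drains."""
    def handler(signum, frame):
        state.ready = False  # fail readiness first so traffic is routed away
        stop_event.set()
    signal.signal(signal.SIGTERM, handler)
```

Keeping liveness green while readiness fails is deliberate: the orchestrator should stop sending traffic during the drain, not restart the process.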
    

    Guild Members for Operations

    Primary: Taleb (resilience), Erlang (capacity), Vector (security)
    Secondary: Lamport (distributed failure), Ixian (metrics/validation)

    Additional Resources

    • references/chaos-patterns.md — Chaos engineering patterns and failure injection

    Repository: yzavyas/claude-1337