Smithery Logo
MCPsSkillsDocsPricing
Login
NewFlame, an assistant that learns and improves. Available onTelegramSlack
    ancoleman

    planning-disaster-recovery

    ancoleman/planning-disaster-recovery
    Planning
    154

    About

    SKILL.md

    Install

    • Telegram
      Telegram
    • Slack
      Slack
    • Claude Code
      Claude Code
    • Codex
      Codex
    • OpenClaw
      OpenClaw
    • Cursor
      Cursor
    • Amp
      Amp
    • GitHub Copilot
      GitHub Copilot
    • Gemini CLI
      Gemini CLI
    • Kilo Code
      Kilo Code
    • Junie
      Junie
    • Replit
      Replit
    • Windsurf
      Windsurf
    • Cline
      Cline
    • Continue
      Continue
    • OpenCode
      OpenCode
    • OpenHands
      OpenHands
    • Roo Code
      Roo Code
    • Augment
      Augment
    • Goose
      Goose
    • Trae
      Trae
    • Zencoder
      Zencoder
    • Antigravity
      Antigravity
    • Download skill
    ├─
    ├─
    └─
    Smithery Logo

    Give agents more agency

    Resources

    DocumentationPrivacy PolicySystem Status

    Company

    PricingAboutBlog

    Connect

    © 2026 Smithery. All rights reserved.

    About

    Design and implement disaster recovery strategies with RTO/RPO planning, database backups, Kubernetes DR, cross-region replication, and chaos engineering testing...

    SKILL.md

    Disaster Recovery

    Purpose

    Provide comprehensive guidance for designing disaster recovery (DR) strategies, implementing backup systems, and validating recovery procedures across databases, Kubernetes clusters, and cloud infrastructure. Enable teams to define RTO/RPO objectives, select appropriate backup tools, configure automated failover, and test DR capabilities through chaos engineering.

    When to Use This Skill

    Invoke this skill when:

    • Defining recovery time objectives (RTO) and recovery point objectives (RPO)
    • Implementing database backups with point-in-time recovery (PITR)
    • Setting up Kubernetes cluster backup and restore workflows
    • Configuring cross-region replication for high availability
    • Testing disaster recovery procedures through chaos experiments
    • Meeting compliance requirements (GDPR, SOC 2, HIPAA)
    • Automating backup monitoring and alerting
    • Designing multi-cloud disaster recovery architectures

    Core Concepts

    RTO and RPO Fundamentals

    Recovery Time Objective (RTO): Maximum acceptable downtime after a disaster before business impact becomes unacceptable.

    Recovery Point Objective (RPO): Maximum acceptable data loss measured in time. Defines how far back in time recovery must reach.

    Criticality Tiers:

    • Tier 0 (Mission-Critical): RTO < 1 hour, RPO < 5 minutes
    • Tier 1 (Production): RTO 1-4 hours, RPO 15-60 minutes
    • Tier 2 (Important): RTO 4-24 hours, RPO 1-6 hours
    • Tier 3 (Standard): RTO > 24 hours, RPO > 6 hours

    3-2-1 Backup Rule

    Maintain 3 copies of data on 2 different media types with 1 copy offsite.

    Example implementation:

    • Primary: Production database
    • Secondary: Local backup storage
    • Tertiary: Cloud backup (S3/GCS/Azure)

    Backup Types

    Full Backup: Complete copy of all data. Slowest to create, fastest to restore.

    Incremental Backup: Only changes since last backup. Fastest to create, requires full + all incrementals to restore.

    Differential Backup: Changes since last full backup. Balance between storage and restore speed.

    Continuous Backup: Real-time or near-real-time backup via WAL/binlog archiving. Lowest RPO.

    Quick Decision Framework

    Step 1: Map RTO/RPO to Strategy

    RTO < 1 hour, RPO < 5 min
    → Active-Active replication, continuous archiving, automated failover
    → Tools: Aurora Global DB, GCS Multi-Region, pgBackRest PITR
    → Cost: Highest
    
    RTO 1-4 hours, RPO 15-60 min
    → Warm standby, incremental backups, automated failover
    → Tools: pgBackRest, WAL-G, RDS Multi-AZ
    → Cost: High
    
    RTO 4-24 hours, RPO 1-6 hours
    → Daily full + incremental, cross-region backup
    → Tools: pgBackRest, Velero, Restic
    → Cost: Medium
    
    RTO > 24 hours, RPO > 6 hours
    → Weekly full + daily incremental, single region
    → Tools: pg_dump, mysqldump, S3 versioning
    → Cost: Low
    

    Step 2: Select Backup Tools by Use Case

    Use Case Primary Tool Alternative Key Feature
    PostgreSQL production pgBackRest WAL-G PITR, compression, multi-repo
    MySQL production Percona XtraBackup WAL-G Hot backups, incremental
    MongoDB Atlas Backup mongodump Continuous backup, PITR
    Kubernetes cluster Velero ArgoCD + Git PV snapshots, scheduling
    File/object backup Restic Duplicity Encryption, deduplication
    Cross-region replication Aurora Global DB RDS Read Replica Active-Active capable

    Database Backup Patterns

    PostgreSQL with pgBackRest

    Use Case: Production PostgreSQL with < 5 minute RPO

    Quick Start: See examples/postgresql/pgbackrest-config/

    Configure continuous WAL archiving with full/differential/incremental backups to S3/GCS/Azure. Schedule weekly full, daily differential backups. Enable PITR with pgbackrest --stanza=main --delta restore.

    Detailed Guide: references/database-backups.md#postgresql

    MySQL with Percona XtraBackup

    Use Case: MySQL production requiring hot backups

    Quick Start: See examples/mysql/xtrabackup/

    Perform full (xtrabackup --backup --parallel=4) and incremental backups with binary log archiving for PITR. Restore requires decompress, prepare, apply incrementals, and copy-back steps.

    Detailed Guide: references/database-backups.md#mysql

    MongoDB Backup

    Quick Start: Use mongodump --gzip --numParallelCollections=4 for logical backups or MongoDB Atlas for continuous backup with PITR.

    Detailed Guide: references/database-backups.md#mongodb

    Kubernetes Disaster Recovery

    Velero for Cluster Backups

    Quick Start: velero install --provider aws --bucket my-backups

    Configure scheduled backups (daily full, hourly production namespace) with PV snapshots. Restore with velero restore create --from-backup <name>. Support selective restore (namespace mappings, storage class remapping).

    Examples: examples/kubernetes/velero/ Detailed Guide: references/kubernetes-dr.md

    etcd Backup

    Quick Start: ETCDCTL_API=3 etcdctl snapshot save /backups/etcd/snapshot.db

    Create periodic etcd snapshots for control plane recovery. Restore requires cluster recreation with snapshot data.

    Examples: examples/kubernetes/etcd/

    Cloud-Specific DR Patterns

    AWS

    Key Services:

    • RDS: Automated backups (30-day retention), PITR, Multi-AZ
    • Aurora Global DB: Cross-region active-passive with automatic failover
    • S3 CRR: Cross-region replication with 15-min SLA (Replication Time Control)

    Examples: examples/cloud/aws/ Detailed Guide: references/cloud-dr-patterns.md#aws

    GCP

    Key Services:

    • Cloud SQL: PITR with 7-day transaction logs, 30-day retention
    • GCS Multi-Regional: Automatic replication across 100+ mile separation
    • Regional HA: Synchronous replication within region

    Detailed Guide: references/cloud-dr-patterns.md#gcp

    Azure

    Key Services:

    • Azure Backup: VM backups with flexible retention (daily/weekly/monthly/yearly)
    • Azure Site Recovery: Cross-region VM replication with 4-hour app-consistent snapshots
    • Geo-Redundant Storage: Automatic replication to secondary region

    Detailed Guide: references/cloud-dr-patterns.md#azure

    Cross-Region Replication Patterns

    Pattern RTO RPO Cost Use Case
    Active-Active < 1 min < 1 min High Both regions serve traffic
    Active-Passive 15-60 min 5-15 min Medium Standby for failover
    Pilot Light 10-30 min 5-15 min Low Minimal secondary infra
    Warm Standby 5-15 min 5-15 min Med-High Scaled-down secondary

    Implementation Examples:

    • PostgreSQL streaming replication (Active-Passive)
    • Aurora Global Database (Active-Active)
    • ASG scale-up automation (Pilot Light)

    Detailed Guide: references/cross-region-replication.md

    Testing Disaster Recovery

    Chaos Engineering

    Purpose: Validate DR procedures through controlled failure injection.

    Test Scenarios:

    • Database failover (stop primary, measure promotion time)
    • Region failure (block network, trigger DNS failover)
    • Kubernetes recovery (delete namespace, restore from Velero)

    Tools: Chaos Mesh, Gremlin, Litmus, Toxiproxy

    Examples: examples/chaos/db-failover-test.sh, examples/chaos/region-failure-test.sh Detailed Guide: references/chaos-engineering.md

    Automated DR Drills

    Run Monthly Tests:

    ./scripts/dr-drill.sh --environment staging --test-type full
    ./scripts/test-restore.sh --backup latest --target staging-db
    

    Compliance and Retention

    Regulation Retention Requirements
    GDPR 1-7 years EU data residency, right to erasure
    SOC 2 1 year+ Secure deletion, access controls
    HIPAA 6 years Encryption, PHI protection
    PCI DSS 3mo-1yr Secure deletion, quarterly reviews

    Implement with S3/GCS lifecycle policies: 30d→Standard-IA, 90d→Glacier, 365d→Deep Archive

    Immutable backups: Use S3 Object Lock or Azure Immutable Blob Storage for ransomware protection.

    Detailed Guide: references/compliance-retention.md

    Monitoring and Alerting

    Key Metrics: Backup success rate, duration, time since last backup, RPO breach, storage utilization

    Prometheus Alerts: VeleroBackupFailed, VeleroBackupTooOld, BackupSizeTrend

    Validation Scripts:

    ./scripts/validate-backup.sh --backup latest --verify-integrity
    ./scripts/check-retention.sh --report-violations
    ./scripts/generate-dr-report.sh --format pdf
    

    Automation and Runbooks

    Automate Backup Schedules: Cron for pgBackRest (weekly full, daily differential), Velero schedules (K8s)

    DR Runbook Steps: Detect failure → Verify secondary → Promote → Update DNS → Notify → Document

    Detailed Guide: references/runbook-automation.md

    Integration with Other Skills

    Related Skills

    Prerequisites:

    • infrastructure-as-code: Provision backup infrastructure, DR regions
    • kubernetes-operations: K8s cluster setup for Velero
    • secret-management: Backup encryption keys, credentials

    Parallel Skills:

    • databases-postgresql: PostgreSQL configuration and operations
    • databases-mysql: MySQL configuration and operations
    • observability: Backup monitoring, alerting
    • security-hardening: Secure backup storage, access control

    Consumer Skills:

    • incident-management: Invoke DR procedures during incidents
    • compliance-frameworks: Meet regulatory requirements

    Skill Chaining Example

    infrastructure-as-code → secret-management → disaster-recovery → observability
           ↓                        ↓                   ↓                ↓
      Create S3 buckets      Store encryption     Configure backups   Monitor jobs
      Provision databases    keys in Vault        Set up replication  Alert failures
      Setup VPCs             Manage credentials   Test DR drills      Track metrics
    

    Best Practices

    Do

    ✓ Test restores regularly (monthly for critical systems) ✓ Automate backup monitoring and alerting ✓ Encrypt backups at rest and in transit ✓ Implement 3-2-1 backup rule ✓ Define and measure RTO/RPO ✓ Run chaos experiments to validate DR ✓ Document recovery procedures ✓ Store backups in different regions ✓ Use immutable backups for ransomware protection ✓ Automate DR testing in CI/CD

    Don't

    ✗ Assume backups work without testing ✗ Store all backups in single region ✗ Skip retention policy definition ✗ Forget to encrypt sensitive data ✗ Rely solely on cloud provider backups ✗ Ignore backup monitoring ✗ Perform backups only from primary database under high load ✗ Store encryption keys with backups

    Reference Documentation

    • RTO/RPO Planning: references/rto-rpo-planning.md
    • Database Backups: references/database-backups.md
    • Kubernetes DR: references/kubernetes-dr.md
    • Cloud DR Patterns: references/cloud-dr-patterns.md
    • Cross-Region Replication: references/cross-region-replication.md
    • Chaos Engineering: references/chaos-engineering.md
    • Compliance Requirements: references/compliance-retention.md
    • Runbook Automation: references/runbook-automation.md

    Examples

    • Runbooks: examples/runbooks/database-failover.md, examples/runbooks/region-failover.md
    • PostgreSQL: examples/postgresql/pgbackrest-config/, examples/postgresql/walg-config/
    • MySQL: examples/mysql/xtrabackup/, examples/mysql/walg/
    • Kubernetes: examples/kubernetes/velero/, examples/kubernetes/etcd/
    • Cloud: examples/cloud/aws/, examples/cloud/gcp/, examples/cloud/azure/
    • Chaos: examples/chaos/db-failover-test.sh, examples/chaos/region-failure-test.sh

    Scripts

    • scripts/validate-backup.sh: Verify backup integrity
    • scripts/test-restore.sh: Automated restore testing
    • scripts/dr-drill.sh: Run full DR drill
    • scripts/check-retention.sh: Verify retention policies
    • scripts/generate-dr-report.sh: Compliance reporting
    Recommended Servers
    GroundRoute: Web Search for AI Agents across 6 Engines ($10 free credit)
    GroundRoute: Web Search for AI Agents across 6 Engines ($10 free credit)
    ThinAir Data
    ThinAir Data
    PlanetScale
    PlanetScale
    Repository
    ancoleman/ai-design-components
    Files