Operations
Production readiness evaluation focused on resilience, observability, and incident response.
Resilience
Failure Modes
- What can fail? List all external dependencies
- Blast radius: If X fails, what else breaks?
- Graceful degradation: Partial failure ≠ total failure?
Patterns
| Pattern |
Purpose |
Check |
| Timeouts |
Prevent hung connections |
Every external call has one? |
| Circuit Breaker |
Stop cascading failures |
On critical paths? |
| Bulkhead |
Isolate failures |
Separate thread pools? |
| Retry |
Handle transient failures |
With backoff? Bounded? |
Observability
The RED Method
| Metric |
What |
Why |
| Rate |
Requests per second |
Traffic understanding |
| Errors |
Failed requests |
Problem detection |
| Duration |
Latency distribution |
Performance tracking |
Logging
- Structured (JSON, not free text)
- Correlation IDs across services
- Appropriate levels (not everything is ERROR)
- PII redaction
Tracing
- Distributed tracing enabled?
- Spans for all external calls?
- Context propagation working?
Capacity
- Scaling: Horizontal preferred, auto-scaling configured?
- Limits: Memory, CPU, connections all bounded?
- Backpressure: What happens at 2x load? 10x?
- Rate Limiting: Per-tenant/client quotas?
Security Posture
- Secrets: In vault, not env vars or code
- Network: Least privilege, mTLS where possible
- Dependencies: Vulnerability scanning in CI
- Access: Audit logging for sensitive operations
Incident Readiness
- Runbooks: Documented recovery procedures
- On-call: Rotation defined, escalation clear
- Rollback: One-click, tested regularly
- Communication: Status page, stakeholder notification
Checklist
□ All external calls have timeouts
□ Circuit breakers on critical paths
□ Structured logging with correlation IDs
□ RED metrics exposed
□ Alerts are actionable (not noisy)
□ Auto-scaling configured with limits
□ Graceful shutdown implemented
□ Health checks (liveness + readiness)
□ Secrets in vault
□ Runbook exists
□ Rollback tested
Guild Members for Operations
Primary: Taleb (resilience), Erlang (capacity), Vector (security)
Secondary: Lamport (distributed failure), Ixian (metrics/validation)
Additional Resources
references/chaos-patterns.md — Chaos engineering patterns and failure injection