Structured logging, metrics, distributed tracing, and alerting strategies
Comprehensive frameworks for implementing observability including structured logging, metrics, distributed tracing, and alerting.
┌─────────────────┬─────────────────┬─────────────────┐
│ LOGS │ METRICS │ TRACES │
├─────────────────┼─────────────────┼─────────────────┤
│ What happened │ How is system │ How do requests │
│ at specific │ performing │ flow through │
│ point in time │ over time │ services │
└─────────────────┴─────────────────┴─────────────────┘
| Level | Use Case |
|---|---|
| ERROR | Unhandled exceptions, failed operations |
| WARN | Deprecated API, retry attempts |
| INFO | Business events, successful operations |
| DEBUG | Development troubleshooting |
// Good: Structured with context
logger.info('User action completed', {
action: 'purchase',
userId: user.id,
orderId: order.id,
duration_ms: 150
});
// Bad: String interpolation
logger.info(`User ${user.id} completed purchase`);
See
templates/structured-logging.tsfor Winston setup and request middleware
Essential metrics for any service:
// HTTP request latency
buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5]
// Database query latency
buckets: [0.001, 0.01, 0.05, 0.1, 0.5, 1]
See
templates/prometheus-metrics.tsfor full metrics configuration
Auto-instrument common libraries:
tracer.startActiveSpan('processOrder', async (span) => {
span.setAttribute('order.id', orderId);
// ... work
span.end();
});
See
templates/opentelemetry-tracing.tsfor full setup
| Level | Response Time | Examples |
|---|---|---|
| Critical (P1) | < 15 min | Service down, data loss |
| High (P2) | < 1 hour | Major feature broken |
| Medium (P3) | < 4 hours | Increased error rate |
| Low (P4) | Next day | Warnings |
| Alert | Condition | Severity |
|---|---|---|
| ServiceDown | up == 0 for 1m |
Critical |
| HighErrorRate | 5xx > 5% for 5m | Critical |
| HighLatency | p95 > 2s for 5m | High |
| LowCacheHitRate | < 70% for 10m | Medium |
See
templates/alerting-rules.ymlfor Prometheus alerting rules
| Probe | Purpose | Endpoint |
|---|---|---|
| Liveness | Is app running? | /health |
| Readiness | Ready for traffic? | /ready |
| Startup | Finished starting? | /startup |
{
"status": "healthy|degraded|unhealthy",
"checks": {
"database": { "status": "pass", "latency_ms": 5 },
"redis": { "status": "pass", "latency_ms": 2 }
},
"version": "1.0.0",
"uptime": 3600
}
See
templates/health-checks.tsfor implementation
Use Opus 4.5 extended thinking for:
| Template | Purpose |
|---|---|
structured-logging.ts |
Winston logger with request middleware |
prometheus-metrics.ts |
HTTP, DB, cache metrics with middleware |
opentelemetry-tracing.ts |
Distributed tracing setup |
alerting-rules.yml |
Prometheus alerting rules |
health-checks.ts |
Liveness, readiness, startup probes |