Master monitoring and observability for distributed systems
Metrics - Numerical measurements over time
Logs - Timestamped event records
{"level":"error","msg":"connection failed","user_id":123}
2025-01-15 ERROR: Connection timeout
Traces - Request flow through distributed systems
Latency - How long requests take
Traffic - How many requests (RPS, QPS)
Errors - Rate of failed requests
Saturation - How "full" your service is (CPU, memory, disk, network)
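To make the four signals concrete, here is a minimal sketch that derives them from one window of raw request data. The `Request` type, the `golden_signals` function, the choice of p95 for latency, and the rule that a status of 500 or above counts as a failure are all assumptions for illustration, not part of any particular monitoring tool.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # how long the request took
    status: int         # HTTP status code


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = math.ceil(pct * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def golden_signals(requests, window_s, cpu_utilization):
    """Summarize one time window of requests into the four golden signals."""
    durations = sorted(r.duration_ms for r in requests)
    total = len(requests)
    # Assumption: only server-side errors (5xx) count as failed requests.
    failed = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": percentile(durations, 95) if durations else None,
        "traffic_rps": total / window_s,
        "error_rate": failed / total if total else 0.0,
        # Saturation is passed in, since it comes from the host, not the requests.
        "saturation": cpu_utilization,
    }
```

In practice a metrics backend computes these from counters and histograms; the sketch only shows which raw quantity feeds which signal.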
# Prometheus - Query metrics
curl 'http://localhost:9090/api/v1/query?query=up'
# Check alerting rules
promtool check rules alert-rules.yml
# Grafana - Create API key (API keys are deprecated in recent Grafana releases in favor of service account tokens)
curl -X POST http://admin:admin@localhost:3000/api/auth/keys \
-H "Content-Type: application/json" \
-d '{"name":"deploy-key","role":"Admin"}'
# Elasticsearch - Check cluster health
curl -X GET "localhost:9200/_cluster/health?pretty"
# Jaeger - Query traces
curl "http://localhost:16686/api/traces?service=frontend&limit=10"
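The Prometheus query above can also be issued from a script. The following is a rough sketch using only the Python standard library; the helper names (`build_query_url`, `parse_instant_vector`, `query`) are hypothetical, and the parser assumes the query returns an instant vector, as `up` does.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # same address as the curl command


def build_query_url(expr, base_url=PROMETHEUS_URL):
    """Build the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def parse_instant_vector(payload):
    """Map each series' sorted label pairs to its sample value."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    samples = {}
    for series in payload["data"]["result"]:
        labels = tuple(sorted(series["metric"].items()))
        _timestamp, value = series["value"]  # value arrives as a string
        samples[labels] = float(value)
    return samples


def query(expr):
    """Run an instant query against a live Prometheus server."""
    with urllib.request.urlopen(build_query_url(expr), timeout=5) as resp:
        return parse_instant_vector(json.load(resp))
```

Calling `query("up")` needs a reachable Prometheus server; the two helper functions can be exercised without one.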
📚 Full Examples: See REFERENCE.md for complete code samples, detailed configurations, and production-ready implementations.
Implementation Guide
See REFERENCE.md for complete implementation.
Go Example:
See REFERENCE.md for complete implementation.
Python Example:
See REFERENCE.md for complete implementation.
Good - Structured JSON:
See REFERENCE.md for complete implementation.
Bad - Unstructured:
[ERROR] 2025-01-15 10:30:45 - User 12345 got error: Database connection failed (timeout 5s) from db-primary.internal, retried 3 times
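As a rough illustration of the structured alternative, the unstructured line above could be emitted as one JSON object per line using Python's standard `logging` module. The `JsonFormatter` class, the `make_logger` helper, and the convention of passing fields under an `extra={"fields": ...}` key are assumptions made for this sketch, not an established API.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
        }
        # Keys passed via extra={"fields": {...}} end up on the record object.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def make_logger(name, stream=None):
    """Create a logger that writes JSON lines to `stream` (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = make_logger("api-service")
    log.error(
        "Database connection failed",
        extra={"fields": {"user_id": 12345, "host": "db-primary.internal",
                          "timeout_s": 5, "retries": 3}},
    )
```

Each detail that was buried in the free-text message becomes a separate key, so a log backend can filter on `user_id` or `host` without regex parsing.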
See REFERENCE.md for complete implementation.
Go Example:
See REFERENCE.md for complete implementation.
Python Example:
See REFERENCE.md for complete implementation.
Best Practices:
SLI = Good Events / Total Events
Availability SLI = Successful Requests / Total Requests
Latency SLI = Requests < 100ms / Total Requests
Throughput SLI = Requests Processed / Expected Requests
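The ratios above translate directly into code. This is a minimal sketch; the function names and the decision to treat a window with no traffic as fully compliant are assumptions, while the 100 ms threshold comes from the latency formula.

```python
def sli(good_events, total_events):
    """SLI = good events / total events, as a fraction between 0 and 1."""
    if total_events == 0:
        return 1.0  # assumption: no traffic counts as meeting the objective
    return good_events / total_events


def availability_sli(status_codes):
    """Fraction of requests that did not fail with a server error (5xx)."""
    ok = sum(1 for code in status_codes if code < 500)
    return sli(ok, len(status_codes))


def latency_sli(durations_ms, threshold_ms=100):
    """Fraction of requests strictly faster than the threshold."""
    fast = sum(1 for d in durations_ms if d < threshold_ms)
    return sli(fast, len(durations_ms))
```

For example, `latency_sli([20, 80, 100, 250])` counts only the first two requests as good, because a request that takes exactly 100 ms does not satisfy `< 100ms`.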
See REFERENCE.md for complete implementation.
Error Budget Policy:
See REFERENCE.md for complete implementation.
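An error budget policy ties the remaining budget to an agreed action. The sketch below shows the arithmetic under stated assumptions: the function names, the 25% threshold, and the action strings are hypothetical placeholders, and a real policy would use whatever thresholds the team has agreed on.

```python
def error_budget_remaining(slo_target, good_events, total_events):
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_failures = (1 - slo_target) * total_events
    actual_failures = total_events - good_events
    if allowed_failures == 0:
        # A 100% target (or no traffic) leaves no room for any failure.
        return 1.0 if actual_failures == 0 else 0.0
    return 1 - actual_failures / allowed_failures


def policy_action(remaining):
    """Map remaining budget to an action (hypothetical thresholds)."""
    if remaining <= 0:
        return "freeze feature releases"
    if remaining < 0.25:
        return "prioritize reliability work"
    return "ship normally"
```

With a 99.9% target over 1,000,000 requests, 1,000 failures are allowed; 250 observed failures spend a quarter of the budget and leave about 75% of it.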
```bash
kubectl get pods -n production
kubectl logs -n production -l app=api-service --tail=100
```
2. **Check dependencies**
- Database: http://grafana/d/database
- Cache: http://grafana/d/redis
- External APIs: http://grafana/d/external
3. **Check recent changes**
```bash
git log --since="1 hour ago" --pretty=format:"%h %an %s"
```
*See [REFERENCE.md](./REFERENCE.md#example-25) for complete implementation.*
### 8. Cost Optimization
#### Cardinality Management
**High cardinality problem**:
*See [REFERENCE.md](./REFERENCE.md#example-26) for complete implementation.*
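The reason high-cardinality labels are costly is that every distinct combination of label values is stored as its own time series, so the worst case grows as the product of the per-label value counts. The sketch below illustrates that arithmetic; the function names and the example labels are assumptions chosen for illustration.

```python
from math import prod


def worst_case_series(label_values):
    """Upper bound on series for one metric: product of per-label value counts."""
    return prod(len(values) for values in label_values.values())


def observed_series(samples):
    """Number of distinct label combinations actually seen."""
    return len({tuple(sorted(labels.items())) for labels in samples})


if __name__ == "__main__":
    bounded = {"method": ["GET", "POST"], "status": ["200", "404", "500"]}
    print(worst_case_series(bounded))  # 2 * 3 combinations

    # Adding an unbounded label such as a user ID multiplies the total.
    with_user = {**bounded, "user_id": [str(i) for i in range(1000)]}
    print(worst_case_series(with_user))  # 2 * 3 * 1000 combinations
```

A label with a small fixed set of values adds little; a label whose values grow with users, requests, or sessions multiplies every other label's count.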
**Cardinality analysis**:
```promql
# Find metrics with highest cardinality
topk(10, count by (__name__)({__name__=~".+"}))
# Count unique label combinations
count({__name__="http_requests_total"})
```
See REFERENCE.md for complete implementation.
See examples/monitoring-observability/ for complete working examples.
This skill integrates with:
Problem: Not testing edge cases and error conditions leads to production bugs
Solution: Implement comprehensive test coverage, including edge cases and error conditions.
Prevention: Enforce minimum code coverage (80%+) in CI/CD pipeline
Problem: Hardcoding values makes applications inflexible and environment-dependent
Solution: Read values from environment variables or a configuration management system instead of hardcoding them.
Prevention: Use tools like dotenv, config validators, and secret scanners
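A minimal sketch of the environment-variable approach, with validation at startup so that a misconfigured deployment fails fast. The variable names `PROMETHEUS_URL` and `SCRAPE_TIMEOUT_S`, the `Settings` type, and the default timeout of 5 seconds are hypothetical and chosen only for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    prometheus_url: str
    scrape_timeout_s: float


def load_settings(env=os.environ):
    """Read and validate settings from a mapping of environment variables."""
    try:
        url = env["PROMETHEUS_URL"]  # required, no hardcoded fallback
    except KeyError:
        raise RuntimeError("PROMETHEUS_URL must be set") from None
    # Optional value with a documented default.
    timeout = float(env.get("SCRAPE_TIMEOUT_S", "5"))
    if timeout <= 0:
        raise ValueError("SCRAPE_TIMEOUT_S must be positive")
    return Settings(prometheus_url=url, scrape_timeout_s=timeout)
```

Passing the mapping as a parameter keeps the function testable: tests can supply a plain dictionary instead of mutating the real process environment.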
Problem: Security vulnerabilities from not following established security patterns
Solution: Follow established security guidelines and patterns.
Prevention: Use security linters, SAST tools, and regular dependency updates
Best Practices: