monitoring-observability

williamzujkowski/monitoring-observability

DevOps

1 installs

About

Master monitoring and observability for distributed systems

SKILL.md

Monitoring & Observability

Level 1: Quick Reference

Three Pillars of Observability

Metrics - Numerical measurements over time

Counter (only increases): request_total, errors_total
Gauge (can go up/down): cpu_usage, memory_bytes
Histogram (distribution): request_duration_seconds
Summary (quantiles): response_time_summary
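
To make the difference between the types concrete, here is a minimal pure-Python sketch of counter, gauge, and histogram semantics. The class names are hypothetical and this is not the Prometheus client library; it only illustrates that a counter never decreases, a gauge is set freely, and a histogram counts observations into cumulative "less than or equal" buckets. A summary would instead track quantiles on the client side and is left out here.

```python
import bisect


class Counter:
    """Monotonic value: it can only be incremented."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """Point-in-time value that may rise or fall."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Counts observations into buckets defined by upper bounds."""

    def __init__(self, bounds: list[float]) -> None:
        self.bounds = sorted(bounds)
        # One slot per bound plus a final +Inf bucket.
        self.counts = [0] * (len(self.bounds) + 1)
        self.total = 0.0

    def observe(self, value: float) -> None:
        # Index of the first bound >= value; larger values land in +Inf.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value

    def cumulative(self) -> list[int]:
        """Bucket counts where each bucket includes all smaller buckets."""
        running, out = 0, []
        for count in self.counts:
            running += count
            out.append(running)
        return out


# Hypothetical usage mirroring the metric names above.
requests_total = Counter()
memory_bytes = Gauge()
request_duration_seconds = Histogram([0.1, 0.5, 1.0])

for seconds in (0.05, 0.3, 0.3, 2.0):
    requests_total.inc()
    request_duration_seconds.observe(seconds)
memory_bytes.set(512 * 1024 * 1024)
```

The cumulative layout is what lets a backend estimate quantiles later from bucket counts alone, at the cost of choosing bucket bounds up front.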

Logs - Timestamped event records

Structured (JSON): {"level":"error","msg":"connection failed","user_id":123}
Unstructured (text): 2025-01-15 ERROR: Connection timeout
Log levels: DEBUG, INFO, WARN, ERROR, FATAL
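
As a rough illustration of the structured form, the sketch below uses only Python's standard `logging` and `json` modules to emit one JSON object per log line. The `JsonFormatter` class and the `"example"` logger name are assumptions made for this example, not part of the skill's reference implementation.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        if hasattr(record, "user_id"):
            payload["user_id"] = record.user_id
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("example")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# DEBUG is below the configured level and is dropped; ERROR is emitted.
logger.debug("cache miss", extra={"user_id": 123})
logger.error("connection failed", extra={"user_id": 123})
```

Setting the logger level is also where a log-level strategy takes effect: anything below the threshold never reaches the handler.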

Traces - Request flow through distributed systems

Span: Single operation (HTTP request, DB query)
Trace: Collection of spans showing full request path
Context propagation: Trace ID passed between services
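
Context propagation can be sketched with the W3C Trace Context `traceparent` header, which carries a version, a trace ID, the caller's span ID, and flags. The helper names below are hypothetical, and a real service would normally rely on an OpenTelemetry SDK rather than hand-rolling this; the point is only to show that the trace ID stays constant while each hop mints a new span ID.

```python
import re
import secrets

# version "00" - 32 hex trace ID - 16 hex span ID - 2 hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Start a new trace with random IDs and the 'sampled' flag set."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(incoming: str) -> str:
    """Keep the caller's trace ID but create a new span ID for this hop."""
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        # Missing or malformed header: begin a fresh trace instead of failing.
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


# Service A starts the trace; service B continues it on an outgoing call.
header_from_a = new_traceparent()
header_from_b = child_traceparent(header_from_a)
```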

Golden Signals (Google SRE)

Latency    - How long requests take
Traffic    - How many requests (RPS, QPS)
Errors     - Rate of failed requests
Saturation - How "full" your service is (CPU, memory, disk, network)
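
Three of the four signals can be derived directly from request records, as in the hedged sketch below. The `Request` type, the choice of the 95th percentile, and treating status codes of 500 and above as failures are all assumptions for illustration; saturation is omitted because it comes from resource gauges rather than from requests.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_seconds: float
    status: int


def golden_signals(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Summarize latency, traffic, and errors for one observation window."""
    if not requests or window_seconds <= 0:
        raise ValueError("need at least one request and a positive window")
    durations = sorted(r.duration_seconds for r in requests)
    # Nearest-rank 95th percentile: the value at position ceil(0.95 * n).
    p95 = durations[math.ceil(0.95 * len(durations)) - 1]
    failed = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_seconds": p95,
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": failed / len(requests),
    }


# Ten requests observed over a five-second window, one of them failing.
sample = [Request(i / 10, 200) for i in range(1, 10)] + [Request(1.0, 503)]
signals = golden_signals(sample, window_seconds=5.0)
```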

Essential Checklist

SLIs defined: Key user-facing metrics (availability, latency)
SLOs set: Service Level Objectives (99.9% availability)
Error budgets: 0.1% downtime ≈ 43 minutes per 30-day month
Alerting configured: On-call rotation, escalation policies
Dashboards created: Service overview, system health
Log aggregation: Centralized logging with retention policies
Distributed tracing: Request path visualization
Runbooks written: Step-by-step incident response guides
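
The error budget figure in the checklist follows from simple arithmetic: a 30-day window has 43,200 minutes, and 0.1% of that is 43.2 minutes. A small sketch, with a hypothetical function name and an assumed 30-day default window:

```python
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full downtime an availability SLO permits over the window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


for slo in (0.99, 0.999, 0.9999):
    minutes = allowed_downtime_minutes(slo)
    print(f"{slo:.2%} availability -> {minutes:.1f} minutes per 30 days")
```

Each additional "nine" shrinks the budget by a factor of ten, which is why the SLO target should be chosen from user impact rather than aspiration.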

Quick Commands

# Prometheus - Query metrics
curl 'http://localhost:9090/api/v1/query?query=up'

# Check alerting rules
promtool check rules alert-rules.yml

# Grafana - Create API key (uses the default admin:admin login; replace it outside local testing)
curl -X POST http://admin:admin@localhost:3000/api/auth/keys \
  -H "Content-Type: application/json" \
  -d '{"name":"deploy-key","role":"Admin"}'

# Elasticsearch - Check cluster health
curl -X GET "localhost:9200/_cluster/health?pretty"

# Jaeger - Query traces
curl "http://localhost:16686/api/traces?service=frontend&limit=10"
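
The `query=up` example works unescaped, but most PromQL expressions contain braces, quotes, and brackets that must be URL-encoded before they go into the query string. A minimal sketch using only the standard library; the function name and the sample expression are assumptions for illustration:

```python
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus instant-query URL with the expression URL-encoded."""
    query_string = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


# Hypothetical expression: per-second rate of HTTP 500 responses over 5 minutes.
url = instant_query_url(
    "http://localhost:9090",
    'sum(rate(http_requests_total{code="500"}[5m]))',
)
print(url)
```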

Level 2:

📚 Full Examples: See REFERENCE.md for complete code samples, detailed configurations, and production-ready implementations.

Implementation Guide

1. Metrics with Prometheus

Architecture Overview

See REFERENCE.md for complete implementation.

Prometheus Configuration

See REFERENCE.md for complete implementation.

Instrumenting Applications

Go Example:

See REFERENCE.md for complete implementation.

Python Example:

See REFERENCE.md for complete implementation.

PromQL Query Examples

See REFERENCE.md for complete implementation.

Recording Rules

See REFERENCE.md for complete implementation.

2. Logging with ELK/Loki

Structured Logging Best Practices

Good - Structured JSON:

See REFERENCE.md for complete implementation.

Bad - Unstructured:

[ERROR] 2025-01-15 10:30:45 - User 12345 got error: Database connection failed (timeout 5s) from db-primary.internal, retried 3 times
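
For contrast, the same event can be carried as named fields, so a consumer filters on values instead of pattern-matching free text. The field names below (`timeout_seconds`, `db_host`, `retries`) are assumptions chosen for this sketch, not the schema used in REFERENCE.md.

```python
import json

# The unstructured line above, expressed as one JSON object.
event = {
    "timestamp": "2025-01-15T10:30:45",
    "level": "error",
    "msg": "Database connection failed",
    "user_id": 12345,
    "timeout_seconds": 5,
    "db_host": "db-primary.internal",
    "retries": 3,
}
line = json.dumps(event)

# A consumer parses the line once and queries fields directly.
parsed = json.loads(line)
exhausted_retries = parsed["level"] == "error" and parsed["retries"] >= 3
```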

Log Levels Strategy

See REFERENCE.md for complete implementation.

Loki Configuration (Lightweight Alternative to ELK)

See REFERENCE.md for complete implementation.

Promtail (Log Shipper for Loki)

See REFERENCE.md for complete implementation.

LogQL Query Examples

See REFERENCE.md for complete implementation.

3. Distributed Tracing with OpenTelemetry

OpenTelemetry Architecture

See REFERENCE.md for complete implementation.

Instrumenting with OpenTelemetry

Go Example:

See REFERENCE.md for complete implementation.

Python Example:

See REFERENCE.md for complete implementation.

4. Grafana Dashboards

Dashboard JSON Structure

See REFERENCE.md for complete implementation.

Template Variables

See REFERENCE.md for complete implementation.

5. Alerting Strategies

Alert Rules

See REFERENCE.md for complete implementation.

Alertmanager Configuration

See REFERENCE.md for complete implementation.

Alert Fatigue Prevention

Best Practices:

Actionable alerts only: Every alert should require human action
Meaningful thresholds: Based on actual user impact, not arbitrary numbers
Proper severity levels: Critical = wake someone up, Warning = investigate during business hours
Group related alerts: Don't send 100 alerts for the same issue
Runbooks required: Every alert must link to troubleshooting steps
Review regularly: Delete alerts that never fire or are always ignored
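
The grouping practice can be sketched as bucketing alerts by a chosen set of labels, so that many firing instances produce one notification. This is a simplified stand-in written for illustration, not Alertmanager's implementation; the function name and label names are hypothetical.

```python
from collections import defaultdict

Alert = dict[str, str]


def group_alerts(alerts: list[Alert], group_by: tuple[str, ...]) -> dict[tuple[str, ...], list[Alert]]:
    """Bucket alerts that share the same values for the group_by labels."""
    groups: dict[tuple[str, ...], list[Alert]] = defaultdict(list)
    for alert in alerts:
        # Missing labels fall back to an empty string so they still group.
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# 100 instances of one service all fire the same alert.
alerts = [
    {"alertname": "HighErrorRate", "service": "api", "instance": f"api-{i}"}
    for i in range(100)
]
grouped = group_alerts(alerts, ("alertname", "service"))
print(len(grouped))  # one notification group instead of 100 pages
```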

6. SLIs, SLOs, and Error Budgets

Service Level Indicators (SLIs)

SLI = Good Events / Total Events

Availability SLI = Successful Requests / Total Requests
Latency SLI = Requests < 100ms / Total Requests
Throughput SLI = Requests Processed / Expected Requests
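
These ratios translate directly into code. The sketch below computes availability and latency SLIs from sample data; the sample values, the 100 ms threshold taken from the formula above, and the convention that an empty window counts as fully good are assumptions for illustration.

```python
def ratio_sli(good_events: int, total_events: int) -> float:
    """SLI = good events / total events."""
    if total_events == 0:
        # Assumption: no traffic means nothing failed, so report 1.0.
        return 1.0
    return good_events / total_events


# Ten sample requests: one server error, three slower than 100 ms.
durations_ms = [40, 85, 120, 60, 300, 95, 70, 99, 101, 50]
statuses = [200, 200, 200, 500, 200, 200, 200, 200, 200, 200]

availability_sli = ratio_sli(sum(s < 500 for s in statuses), len(statuses))
latency_sli = ratio_sli(sum(d < 100 for d in durations_ms), len(durations_ms))

print(f"availability={availability_sli:.1%} latency={latency_sli:.1%}")
```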

Service Level Objectives (SLOs)

See REFERENCE.md for complete implementation.

Error Budget Calculation

See REFERENCE.md for complete implementation.

Error Budget Policy:

See REFERENCE.md for complete implementation.

7. Incident Response

Runbook Template

See REFERENCE.md for complete implementation.

```bash
kubectl get pods -n production
kubectl logs -n production -l app=api-service --tail=100
```

2. Check dependencies

Database: http://grafana/d/database
Cache: http://grafana/d/redis
External APIs: http://grafana/d/external

3. Check recent changes

```bash
git log --since="1 hour ago" --pretty=format:"%h %an %s"
```

See REFERENCE.md for complete implementation.



8. Cost Optimization

Cardinality Management

High cardinality problem:

See REFERENCE.md for complete implementation.

Cardinality analysis:

```promql
# Find metrics with highest cardinality
topk(10, count by (__name__)({__name__=~".+"}))

# Count unique label combinations
count({__name__="http_requests_total"})
```
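
The reason a single label can dominate cost is multiplicative: the worst-case number of time series for a metric is the product of the distinct values of each label. A small sketch with hypothetical label counts:

```python
from math import prod


def max_series(label_value_counts: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())


# Bounded labels keep the series count manageable.
bounded = max_series({"method": 5, "status": 6, "endpoint": 20})

# Adding an unbounded label such as a user ID multiplies every combination.
unbounded = max_series({"method": 5, "status": 6, "endpoint": 20, "user_id": 100_000})

print(bounded, unbounded)
```

With these assumed counts the bounded metric tops out at 600 series, while the `user_id` label raises the ceiling to 60 million, which is why identifiers like that usually belong in logs or traces instead of metric labels.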

Retention Policies

See REFERENCE.md for complete implementation.

Sampling Strategies

See REFERENCE.md for complete implementation.

Examples

Basic Usage

See REFERENCE.md for complete implementation.

Advanced Usage

// TODO: Add advanced example for monitoring-observability
// This example shows production-ready patterns

Integration Example

// TODO: Add integration example showing how monitoring-observability
// works with other systems and services

See examples/monitoring-observability/ for complete working examples.

Integration Points

This skill integrates with:

Upstream Dependencies

Tools: Common development tools and frameworks
Prerequisites: Basic understanding of general concepts

Downstream Consumers

Applications: Production systems requiring monitoring-observability functionality
CI/CD Pipelines: Automated testing and deployment workflows
Monitoring Systems: Observability and logging platforms

Related Skills

See other skills in this category

Common Integration Patterns

Development Workflow: How this skill fits into daily development
Production Deployment: Integration with production systems
Monitoring & Alerting: Observability integration points

Common Pitfalls

Pitfall 1: Insufficient Testing

Problem: Not testing edge cases and error conditions leads to production bugs

Solution: Implement comprehensive test coverage including:

Happy path scenarios
Error handling and edge cases
Integration points with external systems

Prevention: Enforce minimum code coverage (80%+) in CI/CD pipeline

Pitfall 2: Hardcoded Configuration

Problem: Hardcoding values makes applications inflexible and environment-dependent

Solution: Use environment variables and configuration management:

Separate config from code
Use environment-specific configuration files
Never commit secrets to version control

Prevention: Use tools like dotenv, config validators, and secret scanners

Pitfall 3: Ignoring Security Best Practices

Problem: Security vulnerabilities from not following established security patterns

Solution: Follow security guidelines:

Input validation and sanitization
Proper authentication and authorization
Encrypted data transmission (TLS/SSL)
Regular security audits and updates

Prevention: Use security linters, SAST tools, and regular dependency updates

Best Practices:

Follow established patterns and conventions for monitoring-observability
Keep dependencies up to date and scan for vulnerabilities
Write comprehensive documentation and inline comments
Use linting and formatting tools consistently
Implement proper error handling and logging
Regular code reviews and pair programming
Monitor production metrics and set up alerts

Level 3: Deep Dive Resources

Official Documentation

Books

"Site Reliability Engineering" - Google SRE team
"The Site Reliability Workbook" - Practical SRE examples
"Distributed Tracing in Practice" - Austin Parker et al.
"Observability Engineering" - Charity Majors, Liz Fong-Jones

Advanced Topics

Multi-cluster monitoring with Thanos
Long-term metrics storage
Custom Prometheus exporters
Advanced PromQL and LogQL
Continuous profiling with Pyroscope
Real User Monitoring (RUM)
Synthetic monitoring
AIOps and anomaly detection

Community
