Smithery Logo
MCPsSkillsDocsPricing
Login
NewFlame, an assistant that learns and improves. Available onTelegramSlack
    sickn33

    service-mesh-observability

    sickn33/service-mesh-observability
    DevOps
    8,021

    About

    SKILL.md

    Install

    • Telegram
      Telegram
    • Slack
      Slack
    • Claude Code
      Claude Code
    • Codex
      Codex
    • OpenClaw
      OpenClaw
    • Cursor
      Cursor
    • Amp
      Amp
    • GitHub Copilot
      GitHub Copilot
    • Gemini CLI
      Gemini CLI
    • Kilo Code
      Kilo Code
    • Junie
      Junie
    • Replit
      Replit
    • Windsurf
      Windsurf
    • Cline
      Cline
    • Continue
      Continue
    • OpenCode
      OpenCode
    • OpenHands
      OpenHands
    • Roo Code
      Roo Code
    • Augment
      Augment
    • Goose
      Goose
    • Trae
      Trae
    • Zencoder
      Zencoder
    • Antigravity
      Antigravity
    • Download skill
    ├─
    ├─
    └─
    Smithery Logo

    Give agents more agency

    Resources

    DocumentationPrivacy PolicySystem Status

    Company

    PricingAboutBlog

    Connect

    © 2026 Smithery. All rights reserved.

    About

    Implement comprehensive observability for service meshes including distributed tracing, metrics, and visualization...

    SKILL.md

    Service Mesh Observability

    Complete guide to observability patterns for Istio, Linkerd, and service mesh deployments.

    Do not use this skill when

    • The task is unrelated to service mesh observability
    • You need a different domain or tool outside this scope

    Instructions

    • Clarify goals, constraints, and required inputs.
    • Apply relevant best practices and validate outcomes.
    • Provide actionable steps and verification.
    • If detailed examples are required, open resources/implementation-playbook.md.

    Use this skill when

    • Setting up distributed tracing across services
    • Implementing service mesh metrics and dashboards
    • Debugging latency and error issues
    • Defining SLOs for service communication
    • Visualizing service dependencies
    • Troubleshooting mesh connectivity

    Core Concepts

    1. Three Pillars of Observability

    ┌─────────────────────────────────────────────────────┐
    │                  Observability                       │
    ├─────────────────┬─────────────────┬─────────────────┤
    │     Metrics     │     Traces      │      Logs       │
    │                 │                 │                 │
    │ • Request rate  │ • Span context  │ • Access logs   │
    │ • Error rate    │ • Latency       │ • Error details │
    │ • Latency P50   │ • Dependencies  │ • Debug info    │
    │ • Saturation    │ • Bottlenecks   │ • Audit trail   │
    └─────────────────┴─────────────────┴─────────────────┘
    

    2. Golden Signals for Mesh

    Signal Description Alert Threshold
    Latency Request duration P50, P99 P99 > 500ms
    Traffic Requests per second Anomaly detection
    Errors 5xx error rate > 1%
    Saturation Resource utilization > 80%

    Templates

    Template 1: Istio with Prometheus & Grafana

    # Install Prometheus
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: prometheus
      namespace: istio-system
    data:
      prometheus.yml: |
        global:
          scrape_interval: 15s
        scrape_configs:
          - job_name: 'istio-mesh'
            kubernetes_sd_configs:
              - role: endpoints
                namespaces:
                  names:
                    - istio-system
            relabel_configs:
              - source_labels: [__meta_kubernetes_service_name]
                action: keep
                regex: istio-telemetry
    ---
    # ServiceMonitor for Prometheus Operator
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: istio-mesh
      namespace: istio-system
    spec:
      selector:
        matchLabels:
          app: istiod
      endpoints:
        - port: http-monitoring
          interval: 15s
    

    Template 2: Key Istio Metrics Queries

    # Request rate by service
    sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)
    
    # Error rate (5xx)
    sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m]))
      / sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100
    
    # P99 latency
    histogram_quantile(0.99,
      sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
      by (le, destination_service_name))
    
    # TCP connections
    sum(istio_tcp_connections_opened_total{reporter="destination"}) by (destination_service_name)
    
    # Request size
    histogram_quantile(0.99,
      sum(rate(istio_request_bytes_bucket{reporter="destination"}[5m]))
      by (le, destination_service_name))
    

    Template 3: Jaeger Distributed Tracing

    # Jaeger installation for Istio
    apiVersion: install.istio.io/v1alpha1
    kind: IstioOperator
    spec:
      meshConfig:
        enableTracing: true
        defaultConfig:
          tracing:
            sampling: 100.0  # 100% in dev, lower in prod
            zipkin:
              address: jaeger-collector.istio-system:9411
    ---
    # Jaeger deployment
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: jaeger
      namespace: istio-system
    spec:
      selector:
        matchLabels:
          app: jaeger
      template:
        metadata:
          labels:
            app: jaeger
        spec:
          containers:
            - name: jaeger
              image: jaegertracing/all-in-one:1.50
              ports:
                - containerPort: 5775   # UDP
                - containerPort: 6831   # Thrift
                - containerPort: 6832   # Thrift
                - containerPort: 5778   # Config
                - containerPort: 16686  # UI
                - containerPort: 14268  # HTTP
                - containerPort: 14250  # gRPC
                - containerPort: 9411   # Zipkin
              env:
                - name: COLLECTOR_ZIPKIN_HOST_PORT
                  value: ":9411"
    

    Template 4: Linkerd Viz Dashboard

    # Install Linkerd viz extension
    linkerd viz install | kubectl apply -f -
    
    # Access dashboard
    linkerd viz dashboard
    
    # CLI commands for observability
    # Top requests
    linkerd viz top deploy/my-app
    
    # Per-route metrics
    linkerd viz routes deploy/my-app --to deploy/backend
    
    # Live traffic inspection
    linkerd viz tap deploy/my-app --to deploy/backend
    
    # Service edges (dependencies)
    linkerd viz edges deployment -n my-namespace
    

    Template 5: Grafana Dashboard JSON

    {
      "dashboard": {
        "title": "Service Mesh Overview",
        "panels": [
          {
            "title": "Request Rate",
            "type": "graph",
            "targets": [
              {
                "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (destination_service_name)",
                "legendFormat": "{{destination_service_name}}"
              }
            ]
          },
          {
            "title": "Error Rate",
            "type": "gauge",
            "targets": [
              {
                "expr": "sum(rate(istio_requests_total{response_code=~\"5..\"}[5m])) / sum(rate(istio_requests_total[5m])) * 100"
              }
            ],
            "fieldConfig": {
              "defaults": {
                "thresholds": {
                  "steps": [
                    {"value": 0, "color": "green"},
                    {"value": 1, "color": "yellow"},
                    {"value": 5, "color": "red"}
                  ]
                }
              }
            }
          },
          {
            "title": "P99 Latency",
            "type": "graph",
            "targets": [
              {
                "expr": "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter=\"destination\"}[5m])) by (le, destination_service_name))",
                "legendFormat": "{{destination_service_name}}"
              }
            ]
          },
          {
            "title": "Service Topology",
            "type": "nodeGraph",
            "targets": [
              {
                "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (source_workload, destination_service_name)"
              }
            ]
          }
        ]
      }
    }
    

    Template 6: Kiali Service Mesh Visualization

    # Kiali installation
    apiVersion: kiali.io/v1alpha1
    kind: Kiali
    metadata:
      name: kiali
      namespace: istio-system
    spec:
      auth:
        strategy: anonymous  # or openid, token
      deployment:
        accessible_namespaces:
          - "**"
      external_services:
        prometheus:
          url: http://prometheus.istio-system:9090
        tracing:
          url: http://jaeger-query.istio-system:16686
        grafana:
          url: http://grafana.istio-system:3000
    

    Template 7: OpenTelemetry Integration

    # OpenTelemetry Collector for mesh
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: otel-collector-config
    data:
      config.yaml: |
        receivers:
          otlp:
            protocols:
              grpc:
                endpoint: 0.0.0.0:4317
              http:
                endpoint: 0.0.0.0:4318
          zipkin:
            endpoint: 0.0.0.0:9411
    
        processors:
          batch:
            timeout: 10s
    
        exporters:
          jaeger:
            endpoint: jaeger-collector:14250
            tls:
              insecure: true
          prometheus:
            endpoint: 0.0.0.0:8889
    
        service:
          pipelines:
            traces:
              receivers: [otlp, zipkin]
              processors: [batch]
              exporters: [jaeger]
            metrics:
              receivers: [otlp]
              processors: [batch]
              exporters: [prometheus]
    ---
    # Istio Telemetry v2 with OTel
    apiVersion: telemetry.istio.io/v1alpha1
    kind: Telemetry
    metadata:
      name: mesh-default
      namespace: istio-system
    spec:
      tracing:
        - providers:
            - name: otel
          randomSamplingPercentage: 10
    

    Alerting Rules

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: mesh-alerts
      namespace: istio-system
    spec:
      groups:
        - name: mesh.rules
          rules:
            - alert: HighErrorRate
              expr: |
                sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service_name)
                / sum(rate(istio_requests_total[5m])) by (destination_service_name) > 0.05
              for: 5m
              labels:
                severity: critical
              annotations:
                summary: "High error rate for {{ $labels.destination_service_name }}"
    
            - alert: HighLatency
              expr: |
                histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[5m]))
                by (le, destination_service_name)) > 1000
              for: 5m
              labels:
                severity: warning
              annotations:
                summary: "High P99 latency for {{ $labels.destination_service_name }}"
    
            - alert: MeshCertExpiring
              expr: |
                (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
              labels:
                severity: warning
              annotations:
                summary: "Mesh certificate expiring in less than 7 days"
    

    Best Practices

    Do's

    • Sample appropriately - 100% in dev, 1-10% in prod
    • Use trace context - Propagate headers consistently
    • Set up alerts - For golden signals
    • Correlate metrics/traces - Use exemplars
    • Retain strategically - Hot/cold storage tiers

    Don'ts

    • Don't over-sample - Storage costs add up
    • Don't ignore cardinality - Limit label values
    • Don't skip dashboards - Visualize dependencies
    • Don't forget costs - Monitor observability costs

    Resources

    • Istio Observability
    • Linkerd Observability
    • OpenTelemetry
    • Kiali

    Limitations

    • Use this skill only when the task clearly matches the scope described above.
    • Do not treat the output as a substitute for environment-specific validation, testing, or expert review.
    • Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.
    Recommended Servers
    Thoughtbox
    Thoughtbox
    EasyWeek
    EasyWeek
    Browserbase
    Browserbase
    Repository
    sickn33/antigravity-awesome-skills
    Files