Root Cause Analysis with Kopai
A guide to debugging production issues using telemetry data (traces, logs, metrics) via the Kopai CLI.
Prerequisites
Ensure you have access to the Kopai app backend.
Make sure the services are set up to send their OpenTelemetry data to Kopai.
See the otel-instrumentation skill for setup.
RCA Workflow Summary
- Find error traces
- Get full trace context
- Correlate logs with trace
- Check related metrics
- Identify root cause
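The steps above can be sketched on data that has already been exported as JSON. This is a minimal illustration, not Kopai's implementation: the records are made up, and the field names ParentSpanId, SpanId, SpanName, StatusCode, and Body (as well as the "Error" status value) are assumptions about the export shape. Step 4 (metrics) is omitted because it depends on which metrics the services emit.

```python
# Hypothetical records; field names and values are assumptions, not Kopai output.
SPANS = [
    {"TraceId": "t1", "SpanId": "a", "ParentSpanId": "", "SpanName": "GET /checkout", "StatusCode": "Error"},
    {"TraceId": "t1", "SpanId": "b", "ParentSpanId": "a", "SpanName": "charge-card", "StatusCode": "Error"},
    {"TraceId": "t2", "SpanId": "c", "ParentSpanId": "", "SpanName": "GET /health", "StatusCode": "Ok"},
]
LOGS = [
    {"TraceId": "t1", "SpanId": "b", "Body": "card declined: upstream timeout"},
    {"TraceId": "t2", "SpanId": "c", "Body": "ok"},
]


def find_error_trace_ids(spans):
    """Step 1: collect the TraceIds that contain at least one errored span."""
    return sorted({s["TraceId"] for s in spans if s["StatusCode"] == "Error"})


def trace_context(spans, trace_id):
    """Step 2: return every span of one trace, not only the failing ones."""
    return [s for s in spans if s["TraceId"] == trace_id]


def correlated_logs(logs, trace_id):
    """Step 3: keep the log records that share the trace's TraceId."""
    return [entry for entry in logs if entry["TraceId"] == trace_id]


def deepest_error_span(spans):
    """Step 5 heuristic: an errored span with no errored child is a likely origin."""
    errored = [s for s in spans if s["StatusCode"] == "Error"]
    parents_of_errors = {s["ParentSpanId"] for s in errored}
    leaves = [s for s in errored if s["SpanId"] not in parents_of_errors]
    return leaves[0] if leaves else None
```

The last function encodes only one heuristic (errors usually propagate from child to parent); the correlated logs for that span are what would confirm or refute it.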
Rules
1. Workflow (CRITICAL)
workflow-find-errors - Find Error Traces
workflow-get-context - Get Full Trace Context
workflow-correlate-logs - Correlate Logs with Trace
workflow-check-metrics - Check Related Metrics
2. Patterns (HIGH)
pattern-http-errors - HTTP Error Debugging
pattern-slow-requests - Slow Request Analysis
pattern-distributed - Distributed Failure Tracing
pattern-log-driven - Log-Driven Investigation
Read rules/<rule-name>.md for details.
Tips
- Always use --json for programmatic analysis
- Pipe to jq for filtering/aggregation
- Start with errors, then trace backwards
- Check span Duration to find bottlenecks
- Correlate by TraceId across traces, logs, and metrics
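As a rough sketch of the Duration tip, the snippet below ranks spans from a JSON array by Duration. The sample payload is invented for illustration, and the SpanName field and the nanosecond unit are assumptions; only TraceId and Duration are named in this guide.

```python
import json

# Hypothetical JSON payload; real output may use different fields or units.
SAMPLE = json.dumps([
    {"TraceId": "t1", "SpanName": "GET /orders", "Duration": 950_000_000},
    {"TraceId": "t1", "SpanName": "db.query", "Duration": 900_000_000},
    {"TraceId": "t1", "SpanName": "render", "Duration": 20_000_000},
])


def slowest_spans(raw_json, top=3):
    """Parse a JSON array of spans and return the `top` spans by Duration."""
    spans = json.loads(raw_json)
    return sorted(spans, key=lambda s: s["Duration"], reverse=True)[:top]


def share_of_trace(raw_json, span_name):
    """Fraction of the longest span's Duration taken by the named span."""
    spans = json.loads(raw_json)
    longest = max(s["Duration"] for s in spans)
    named = next(s["Duration"] for s in spans if s["SpanName"] == span_name)
    return named / longest
```

In this sample the root span is the longest, so the more useful signal is that db.query accounts for most of its Duration, which points at the query and not at the handler itself.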