Partition-first log analysis methodology. Use for log searches, error analysis, pattern finding across Datadog, CloudWatch, or Kubernetes logs.
NEVER start by reading raw log samples.
Logs can be overwhelming. The partition-first approach prevents drowning in raw output, anchoring on the first few unrepresentative samples, and drawing conclusions before the overall error distribution is known.
Before ANY log search, understand the landscape:
CloudWatch Insights:

```
# How many errors?
filter @message like /ERROR/
| stats count(*) as total

# Error rate over time
filter @message like /ERROR/
| stats count(*) as errors by bin(5m)

# What types of errors?
filter @message like /ERROR/
| parse @message /(?<error_type>[\w.]+Exception)/
| stats count(*) as count by error_type
| sort count desc
```
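These partition queries can also be driven programmatically. A minimal sketch using boto3's CloudWatch Logs client; the log group name is a placeholder:

```python
import time

import boto3

logs = boto3.client("logs")

def run_insights_query(log_group: str, query: str, minutes: int = 60) -> list:
    """Start a Logs Insights query and poll until it finishes."""
    now = int(time.time())
    started = logs.start_query(
        logGroupName=log_group,
        startTime=now - minutes * 60,
        endTime=now,
        queryString=query,
    )
    while True:
        result = logs.get_query_results(queryId=started["queryId"])
        if result["status"] not in ("Scheduled", "Running"):
            return result["results"]
        time.sleep(1)

# Partition first: error types by count, no raw samples yet
rows = run_insights_query(
    "/aws/lambda/my-service",  # placeholder log group
    "filter @message like /ERROR/"
    " | parse @message /(?<error_type>[\\w.]+Exception)/"
    " | stats count(*) as count by error_type | sort count desc",
)
```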
Datadog:

```
# Error distribution by service
service:* status:error | stats count by service

# Error types
service:myapp status:error | stats count by @error.kind
```
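The same partition can be computed outside the UI with Datadog's v2 log aggregation endpoint. A minimal sketch, assuming API and application keys are in environment variables:

```python
import os

import requests

# Count error logs per service over the last hour (v2 aggregate API)
resp = requests.post(
    "https://api.datadoghq.com/api/v2/logs/analytics/aggregate",
    headers={
        "DD-API-KEY": os.environ["DD_API_KEY"],
        "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    },
    json={
        "filter": {"query": "status:error", "from": "now-1h", "to": "now"},
        "compute": [{"aggregation": "count"}],
        "group_by": [{"facet": "service"}],
    },
    timeout=30,
)
resp.raise_for_status()
for bucket in resp.json()["data"]["buckets"]:
    print(bucket["by"]["service"], bucket["computes"]["c0"])
```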
Questions to answer:
- What is the total error volume, and what share of all logs does it represent?
- Is the error rate spiking, climbing steadily, or holding flat?
- Which services and error types account for most of the errors?
Look for correlations (a detection sketch follows this list):
- Temporal patterns: a sudden step usually points to a deployment or config change; a gradual climb suggests resource exhaustion or data growth.
- Service patterns: a single affected service suggests a local bug; many services failing together point to a shared dependency or infrastructure.
- Error patterns: one dominant error type indicates a specific failure; many types appearing at once often signal a cascade.
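To make the temporal check concrete, a small sketch that bins error timestamps into 5-minute buckets and flags the first bucket well above the median; the 3x threshold is an arbitrary assumption:

```python
from collections import Counter
from datetime import datetime, timezone
from statistics import median

def first_spike(error_timestamps: list[datetime],
                bin_seconds: int = 300, factor: float = 3.0):
    """Return the start of the first bucket whose count exceeds factor * median."""
    bins = Counter(int(ts.timestamp()) // bin_seconds for ts in error_timestamps)
    if not bins:
        return None
    baseline = max(median(bins.values()), 1)
    for bucket in sorted(bins):
        if bins[bucket] > factor * baseline:
            return datetime.fromtimestamp(bucket * bin_seconds, tz=timezone.utc)
    return None
```

A spike found near a deployment time suggests a change-driven failure; no single spike but a rising trend suggests exhaustion.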
Only NOW read actual log samples:
- Sample from anomalies: pull logs from the exact time buckets where the rate spiked, not from random windows.
- Sample by error type: read a few examples of each top error type rather than many copies of the most common one (see the sketch after this list).
- Sample around events: read the minutes immediately before and after a deployment, restart, or traffic shift.
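A minimal sketch of the per-type sampling, assuming records have already been parsed into dicts; the `error_type` field name is an assumption:

```python
from collections import defaultdict

def sample_by_error_type(records: list[dict], per_type: int = 3) -> list[dict]:
    """Keep at most per_type examples of each error type for manual reading."""
    seen: dict[str, int] = defaultdict(int)
    samples = []
    for record in records:
        kind = record.get("error_type", "unknown")  # assumed field name
        if seen[kind] < per_type:
            samples.append(record)
            seen[kind] += 1
    return samples
```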
Connect logs to system changes:
```
# Use git_log to find recent deployments
git_log --since="2 hours ago"

# Use get_deployment_history for K8s
get_deployment_history deployment=api-server

# Compare log patterns before/after changes
```
CloudWatch Insights best practices:
```
# Always include a time filter
filter @timestamp > ago(1h)

# Use parse for structured extraction
parse @message /status=(?<status>\d+)/

# Aggregate before displaying
stats count(*) as count by status | sort count desc | limit 10
```
Common queries:
```
# Latency distribution
filter @type = "REPORT"
| stats avg(@duration) as avg,
        pct(@duration, 95) as p95,
        pct(@duration, 99) as p99

# Error messages with context
filter @message like /ERROR/
| fields @timestamp, @message
| sort @timestamp desc
| limit 20
```
Datadog query syntax:
```
# Filter by service and status
service:api-gateway status:error

# Field queries
@http.status_code:>=500

# Wildcard
@error.message:*timeout*

# Time comparison (notation; see the sketch below)
service:api (now-1h TO now) vs (now-25h TO now-24h)
```
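The "vs" line above is notation rather than query-bar syntax; programmatically it is just the same count run over two shifted windows. A sketch reusing the aggregate endpoint, assuming an empty group-by returns a single bucket:

```python
import os

import requests

def error_count(window_from: str, window_to: str) -> int:
    """Count service:api error logs in one window via the v2 aggregate API."""
    resp = requests.post(
        "https://api.datadoghq.com/api/v2/logs/analytics/aggregate",
        headers={
            "DD-API-KEY": os.environ["DD_API_KEY"],
            "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
        },
        json={
            "filter": {"query": "service:api status:error",
                       "from": window_from, "to": window_to},
            "compute": [{"aggregation": "count"}],
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]["buckets"][0]["computes"]["c0"]

current = error_count("now-1h", "now")
baseline = error_count("now-25h", "now-24h")
print(f"now={current} same-hour-yesterday={baseline} ratio={current / max(baseline, 1):.2f}")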
Use get_pod_logs wisely (a client sketch follows this list):
- Set `tail_lines` (default: 100) to cap output on noisy pods
- Run `get_pod_events` first for crashes/restarts
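Where those tools are unavailable, the same events-first, bounded-tail discipline works with the official Kubernetes Python client; the pod name and namespace are placeholders:

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a cluster
v1 = client.CoreV1Api()

# Events first: crash loops and OOM kills show up here before logs help
events = v1.list_namespaced_event(
    namespace="default",
    field_selector="involvedObject.name=api-server-abc123",  # placeholder pod
)
for event in events.items:
    print(event.reason, event.message)

# Then a bounded tail instead of the full stream
tail = v1.read_namespaced_pod_log(
    name="api-server-abc123",  # placeholder pod
    namespace="default",
    tail_lines=100,
)
print(tail)
```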
## Log Analysis Report

### Statistics
- Time window: [start] to [end]
- Total log volume: X events
- Error count: Y events (Z%)
- Error rate trend: [increasing/stable/decreasing]
### Top Error Types
1. [ErrorType1]: N occurrences - [description]
2. [ErrorType2]: M occurrences - [description]
### Temporal Pattern
- Errors started at: [timestamp]
- Correlation: [deployment X / traffic spike / external event]
### Sample Errors
[Quote 2-3 representative error messages]
### Root Cause Hypothesis
[Based on patterns, what's the likely cause?]