    DNYoussef/when-profiling-performance-use-performance-profiler
    Data & Analytics

    About

    Comprehensive performance profiling, bottleneck detection, and optimization system

    SKILL.md

    Performance Profiler Skill

    Overview

    When profiling performance, use performance-profiler to measure, analyze, and optimize application performance across CPU, memory, I/O, and network dimensions.

    MECE Breakdown

    Mutually Exclusive Components:

    1. Baseline Phase: Establish current performance metrics
    2. Detection Phase: Identify bottlenecks and hot paths
    3. Analysis Phase: Root cause analysis and impact assessment
    4. Optimization Phase: Generate and prioritize recommendations
    5. Implementation Phase: Apply optimizations with agent assistance
    6. Validation Phase: Benchmark improvements and verify gains

    Collectively Exhaustive Coverage:

    • CPU Profiling: Function execution time, hot paths, call graphs
    • Memory Profiling: Heap usage, allocations, leaks, garbage collection
    • I/O Profiling: File system, database, network latency
    • Network Profiling: Request timing, bandwidth, connection pooling
    • Concurrency: Thread utilization, lock contention, async operations
    • Algorithm Analysis: Time complexity, space complexity
    • Cache Analysis: Hit rates, cache misses, invalidation patterns
    • Database: Query performance, N+1 problems, index usage

    Features

    Core Capabilities:

    • Multi-dimensional performance profiling (CPU, memory, I/O, network)
    • Automated bottleneck detection with prioritization
    • Real-time profiling and historical analysis
    • Flame graph generation for visual analysis
    • Memory leak detection and heap snapshots
    • Database query optimization
    • Algorithmic complexity analysis
    • A/B comparison of before/after optimizations
    • Production-safe profiling with minimal overhead
    • Integration with APM tools (New Relic, DataDog, etc.)

    Profiling Modes:

    • Quick Scan: 30-second lightweight profiling
    • Standard: 5-minute comprehensive analysis
    • Deep: 30-minute detailed investigation
    • Continuous: Long-running production monitoring
    • Stress Test: Load-based profiling under high traffic

    Usage

    Slash Command:

    /profile [path] [--mode quick|standard|deep] [--target cpu|memory|io|network|all]
    

    Subagent Invocation:

    Task("Performance Profiler", "Profile ./app with deep CPU and memory analysis", "performance-analyzer")
    

    MCP Tool:

    mcp__performance-profiler__analyze({
      project_path: "./app",
      profiling_mode: "standard",
      targets: ["cpu", "memory", "io"],
      generate_optimizations: true
    })
    

    Architecture

    Phase 1: Baseline Measurement

    1. Establish current performance metrics
    2. Define performance budgets
    3. Set up monitoring infrastructure
    4. Capture baseline snapshots

    Phase 2: Bottleneck Detection

    1. CPU profiling (sampling or instrumentation)
    2. Memory profiling (heap analysis)
    3. I/O profiling (syscall tracing)
    4. Network profiling (packet analysis)
    5. Database profiling (query logs)

    Phase 3: Root Cause Analysis

    1. Correlate metrics across dimensions
    2. Identify causal relationships
    3. Calculate performance impact
    4. Prioritize issues by severity

    Phase 4: Optimization Generation

    1. Algorithmic improvements
    2. Caching strategies
    3. Parallelization opportunities
    4. Database query optimization
    5. Memory optimization
    6. Network optimization

    Phase 5: Implementation

    1. Generate optimized code with coder agent
    2. Apply database optimizations
    3. Configure caching layers
    4. Implement parallelization

    Phase 6: Validation

    1. Run benchmark suite
    2. Compare before/after metrics
    3. Verify no regressions
    4. Generate performance report
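    The before/after comparison in this phase can be as simple as timing both variants of a function. A minimal Python sketch (the `before`/`after` functions are illustrative stand-ins for the unoptimized and optimized code paths):

```python
import timeit

def before(n=500):
    # Quadratic: repeated membership tests scan a growing list.
    seen = []
    for i in range(n):
        if i not in seen:
            seen.append(i)
    return len(seen)

def after(n=500):
    # Linear: set membership is O(1) on average.
    seen = set()
    for i in range(n):
        seen.add(i)
    return len(seen)

# Run each variant several times and keep the best (least noisy) timing.
t_before = min(timeit.repeat(before, number=20, repeat=3))
t_after = min(timeit.repeat(after, number=20, repeat=3))

# Verify no regression in behavior before reporting the speedup.
assert before() == after()
speedup = t_before / t_after
```

    Taking the minimum of several repeats, rather than the mean, reduces noise from other processes on the machine.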

    Output Formats

    Performance Report:

    {
      "project": "my-app",
      "profiling_mode": "standard",
      "duration_seconds": 300,
      "baseline": {
        "requests_per_second": 1247,
        "avg_response_time_ms": 123,
        "p95_response_time_ms": 456,
        "p99_response_time_ms": 789,
        "cpu_usage_percent": 67,
        "memory_usage_mb": 512,
        "error_rate_percent": 0.1
      },
      "bottlenecks": [
        {
          "type": "cpu",
          "severity": "high",
          "function": "processData",
          "time_percent": 34.5,
          "calls": 123456,
          "avg_time_ms": 2.3,
          "recommendation": "Optimize algorithm complexity from O(n²) to O(n log n)"
        }
      ],
      "optimizations": [...],
      "estimated_improvement": {
        "throughput_increase": "3.2x",
        "latency_reduction": "68%",
        "memory_reduction": "45%"
      }
    }
    

    Flame Graph:

    Interactive SVG flame graph showing call stack with time proportions

    Heap Snapshot:

    Memory allocation breakdown with retention paths

    Optimization Report:

    Prioritized list of actionable improvements with code examples

    Examples

    Example 1: Quick CPU Profiling

    /profile ./my-app --mode quick --target cpu
    

    Example 2: Deep Memory Analysis

    /profile ./my-app --mode deep --target memory --detect-leaks
    

    Example 3: Full Stack Optimization

    /profile ./my-app --mode standard --target all --optimize --benchmark
    

    Example 4: Database Query Optimization

    /profile ./my-app --mode standard --target io --database --explain-queries
    

    Integration with Claude-Flow

    Coordination Pattern:

    // Step 1: Initialize profiling swarm
    mcp__claude-flow__swarm_init({ topology: "star", maxAgents: 5 })
    
    // Step 2: Spawn specialized agents
    [Parallel Execution]:
      Task("CPU Profiler", "Profile CPU usage and identify hot paths in ./app", "performance-analyzer")
      Task("Memory Profiler", "Analyze heap usage and detect memory leaks", "performance-analyzer")
      Task("I/O Profiler", "Profile file system and database operations", "performance-analyzer")
      Task("Network Profiler", "Analyze network requests and identify slow endpoints", "performance-analyzer")
      Task("Optimizer", "Generate optimization recommendations based on profiling data", "optimizer")
    
    // Step 3: Implementation agent applies optimizations
    [Sequential Execution]:
      Task("Coder", "Implement recommended optimizations from profiling analysis", "coder")
      Task("Benchmarker", "Run benchmark suite and validate improvements", "performance-benchmarker")
    

    Configuration

    Default Settings:

    {
      "profiling": {
        "sampling_rate_hz": 99,
        "stack_depth": 128,
        "include_native_code": false,
        "track_allocations": true
      },
      "thresholds": {
        "cpu_hot_path_percent": 10,
        "memory_leak_growth_mb": 10,
        "slow_query_ms": 100,
        "slow_request_ms": 1000
      },
      "optimization": {
        "auto_apply": false,
        "require_approval": true,
        "run_tests_before": true,
        "run_benchmarks_after": true
      },
      "output": {
        "flame_graph": true,
        "heap_snapshot": true,
        "call_tree": true,
        "recommendations": true
      }
    }
    

    Profiling Techniques

    CPU Profiling:

    • Sampling: Periodic stack sampling (low overhead)
    • Instrumentation: Function entry/exit hooks (accurate but higher overhead)
    • Tracing: Event-based profiling
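    As one concrete example of the instrumentation approach, Python's built-in `cProfile` is a deterministic profiler that hooks every function entry and exit. A minimal sketch with hypothetical workload functions:

```python
import cProfile
import io
import pstats

def hot_path():
    # Deliberately expensive inner loop: the bottleneck to find.
    total = 0
    for i in range(200_000):
        total += i * i
    return total

def cold_path():
    return sum(range(100))

def workload():
    cold_path()
    return hot_path()

profiler = cProfile.Profile()
profiler.enable()
result = workload()
profiler.disable()

# Rank functions by cumulative time; the hot path rises to the top.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
```

    For production workloads, a sampling profiler (e.g. py-spy for Python) is usually preferred because instrumentation overhead can distort the very timings being measured.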

    Memory Profiling:

    • Heap Snapshots: Point-in-time memory state
    • Allocation Tracking: Record all allocations
    • Leak Detection: Compare snapshots over time
    • GC Analysis: Garbage collection patterns
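    Snapshot-based leak detection is directly supported by Python's stdlib `tracemalloc`: take a snapshot, run the workload, take another, and diff them. A minimal sketch with a simulated leaky request handler (`handle_request` and the module-level `leak` list are illustrative):

```python
import tracemalloc

leak = []  # simulated leak: module-level list that keeps growing

def handle_request():
    leak.append(bytearray(10_000))  # ~10 KB retained per "request"

tracemalloc.start()
first = tracemalloc.take_snapshot()

for _ in range(100):
    handle_request()

second = tracemalloc.take_snapshot()

# Diff the snapshots: growing allocation sites float to the top.
top = second.compare_to(first, "lineno")[0]
growth_kb = top.size_diff / 1024  # ~1000 KB for 100 leaked requests
```

    The same compare-two-snapshots technique applies in other runtimes (e.g. Chrome DevTools heap snapshots for Node.js).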

    I/O Profiling:

    • Syscall Tracing: Track system calls (strace, DTrace)
    • File System: Monitor read/write operations
    • Database: Query logging and EXPLAIN ANALYZE
    • Network: Packet capture and request timing

    Concurrency Profiling:

    • Thread Analysis: CPU utilization per thread
    • Lock Contention: Identify blocking operations
    • Async Operations: Promise/callback timing
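    Lock contention can be measured directly by timing how long each thread waits to acquire a lock. A minimal Python sketch in which four threads serialize on a single lock (the 0.05 s sleep stands in for work done inside the critical section):

```python
import threading
import time

lock = threading.Lock()
wait_times = []

def worker():
    t0 = time.perf_counter()
    with lock:
        # Time spent blocked before entering the critical section.
        wait_times.append(time.perf_counter() - t0)
        time.sleep(0.05)  # simulated work while holding the lock

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# With 4 threads serialized on one lock, the last waits ~0.15 s.
total_wait = sum(wait_times)
```

    Large wait times relative to hold times indicate the lock is a bottleneck; the usual fixes are finer-grained locking or lock-free data structures.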

    Performance Optimization Strategies

    Algorithmic:

    • Reduce time complexity (O(n²) → O(n log n))
    • Use appropriate data structures
    • Eliminate unnecessary work
    • Memoization and dynamic programming
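    Memoization is often the cheapest of these wins. The classic illustration is naive recursive Fibonacci (exponential time) versus the same function with a cache (linear number of distinct calls), sketched here with Python's `functools.lru_cache`:

```python
from functools import lru_cache

def fib_naive(n):
    # Exponential time: recomputes the same subproblems repeatedly.
    if n < 2:
        return n
    return fib_naive(n - 1) + fib_naive(n - 2)

@lru_cache(maxsize=None)
def fib_memo(n):
    # Each distinct argument is computed once, then served from cache.
    if n < 2:
        return n
    return fib_memo(n - 1) + fib_memo(n - 2)

result = fib_memo(60)  # returns instantly; the naive version would not
```

    The same idea generalizes: whenever a pure function is called repeatedly with the same arguments, caching trades memory for time.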

    Caching:

    • In-memory caching (Redis, Memcached)
    • CDN for static assets
    • HTTP caching headers
    • Query result caching

    Parallelization:

    • Multi-threading
    • Worker pools
    • Async I/O
    • Batching operations
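    A worker pool pays off most for blocking I/O, where threads spend their time waiting rather than computing. A minimal Python sketch using `concurrent.futures` (the `fetch` function simulates a blocking network call with a sleep):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(i):
    time.sleep(0.05)  # stand-in for a blocking I/O call
    return i * 2

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fetch, range(16)))
elapsed = time.perf_counter() - t0
# 16 tasks across 8 workers finish in ~2 batches (~0.1 s)
# instead of ~0.8 s sequentially.
```

    For CPU-bound work in Python, a process pool (`ProcessPoolExecutor`) is the analogous tool, since threads contend on the GIL.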

    Database:

    • Add missing indexes
    • Optimize queries
    • Reduce N+1 queries
    • Connection pooling
    • Read replicas
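    The N+1 pattern and its fix can be shown end to end with an in-memory SQLite database (schema and data here are hypothetical):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'ada'), (2, 'bob');
    INSERT INTO posts VALUES (1, 1, 'p1'), (2, 1, 'p2'), (3, 2, 'p3');
""")

# N+1 pattern: one query for authors, then one query per author.
n_plus_1 = []
for author_id, name in db.execute("SELECT id, name FROM authors"):
    for (title,) in db.execute(
        "SELECT title FROM posts WHERE author_id = ?", (author_id,)
    ):
        n_plus_1.append((name, title))

# Single JOIN: the same rows in one round trip.
joined = list(db.execute("""
    SELECT a.name, p.title
    FROM authors a JOIN posts p ON p.author_id = a.id
    ORDER BY p.id
"""))
```

    With N authors the first approach issues N+1 queries; the JOIN issues one, which matters most when each query carries network latency to a remote database.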

    Memory:

    • Object pooling
    • Reduce allocations
    • Stream processing
    • Compression

    Network:

    • Connection keep-alive
    • HTTP/2 or HTTP/3
    • Compression
    • Request batching
    • Rate limiting

    Performance Budgets

    Frontend:

    • Time to First Byte (TTFB): < 200ms
    • First Contentful Paint (FCP): < 1.8s
    • Largest Contentful Paint (LCP): < 2.5s
    • Time to Interactive (TTI): < 3.8s
    • Total Blocking Time (TBT): < 200ms
    • Cumulative Layout Shift (CLS): < 0.1

    Backend:

    • API Response Time (p50): < 100ms
    • API Response Time (p95): < 500ms
    • API Response Time (p99): < 1000ms
    • Throughput: > 1000 req/s
    • Error Rate: < 0.1%
    • CPU Usage: < 70%
    • Memory Usage: < 80%

    Database:

    • Query Time (p50): < 10ms
    • Query Time (p95): < 50ms
    • Query Time (p99): < 100ms
    • Connection Pool Utilization: < 80%
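    Budgets like these are checked against percentiles of raw latency samples, not averages. A minimal Python sketch using the stdlib `statistics` module (the long-tailed sample distribution is synthetic):

```python
import statistics

# Simulated request latencies in ms: mostly fast, with a slow tail.
latencies = [20] * 900 + [80] * 90 + [400] * 10  # 1000 samples

cuts = statistics.quantiles(latencies, n=100)  # 99 percentile cut points
p50 = statistics.median(latencies)
p95 = cuts[94]  # 95th percentile
p99 = cuts[98]  # 99th percentile
avg = statistics.fmean(latencies)
# avg (~29 ms) hides the tail; p99 (~400 ms) exposes it.
```

    This is why the Best Practices below insist on p95/p99: one slow request in a hundred is invisible in the mean but dominates the worst user experiences.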

    Best Practices

    1. Profile production workloads when possible
    2. Use production-like data volumes
    3. Profile under realistic load
    4. Measure multiple times for consistency
    5. Focus on p95/p99, not just averages
    6. Optimize bottlenecks in order of impact
    7. Always benchmark before and after
    8. Monitor for regressions in CI/CD
    9. Set up continuous profiling
    10. Track performance over time

    Troubleshooting

    Issue: High CPU usage but no obvious hot path

    Solution: Time may be spread thinly across many small, fast function calls; increase the sampling rate or switch to instrumentation-based profiling to attribute it accurately

    Issue: Memory grows continuously

    Solution: Run heap snapshot comparison to identify leak sources

    Issue: Slow database queries

    Solution: Use EXPLAIN ANALYZE, check for missing indexes, analyze query plans

    Issue: High latency but low CPU

    Solution: Profile I/O operations and check for blocking synchronous calls; the process is likely waiting rather than computing

    See Also

    • PROCESS.md - Detailed step-by-step profiling workflow
    • README.md - Quick start guide
    • subagent-performance-profiler.md - Agent implementation details
    • slash-command-profile.sh - Command-line interface
    • mcp-performance-profiler.json - MCP tool schema