Smithery Logo
MCPsSkillsDocsPricing
Login
Smithery Logo

Give agents more agency

Resources

DocumentationPrivacy PolicySystem Status

Company

PricingAboutBlog

Connect

© 2026 Smithery. All rights reserved.

    vasilyu1983

    ai-llm-inference

    vasilyu1983/ai-llm-inference
    Productivity
    29

    About

    SKILL.md

    Install

    • Telegram
      Telegram
    • Slack
      Slack
    • Claude Code
      Claude Code
    • Codex
      Codex
    • OpenClaw
      OpenClaw
    • Cursor
      Cursor
    • Amp
      Amp
    • GitHub Copilot
      GitHub Copilot
    • Gemini CLI
      Gemini CLI
    • Kilo Code
      Kilo Code
    • Junie
      Junie
    • Replit
      Replit
    • Windsurf
      Windsurf
    • Cline
      Cline
    • Continue
      Continue
    • OpenCode
      OpenCode
    • OpenHands
      OpenHands
    • Roo Code
      Roo Code
    • Augment
      Augment
    • Goose
      Goose
    • Trae
      Trae
    • Zencoder
      Zencoder
    • Antigravity
      Antigravity
    • Download skill
    ├─
    ├─
    └─

    About

    Operational patterns for LLM inference: latency budgeting, tail-latency control, caching, batching/scheduling, quantization/compression, parallelism, and reliable serving at scale...

    SKILL.md

    LLMOps - Inference & Optimization - Production Skill Hub

    Modern Best Practices (January 2026):

    • Treat inference as a systems problem: SLOs, tail latency, retries, overload, and cache strategy.
    • Use continuous batching / smart scheduling when serving many concurrent requests (Orca scheduling: https://www.usenix.org/conference/osdi22/presentation/yu).
    • Use KV-cache aware serving (PagedAttention/vLLM: https://arxiv.org/abs/2309.06180) and efficient attention kernels (FlashAttention: https://arxiv.org/abs/2205.14135).
    • Use speculative decoding when latency is critical and draft-model quality is acceptable (speculative decoding: https://arxiv.org/abs/2302.01318).
    • Quantize only with measured quality impact and rollback plan (quantization must be validated on your eval set).

    This skill provides production-ready operational patterns for optimizing LLM inference performance, cost, and reliability. It centralizes decision rules, optimization strategies, configuration templates, and operational checklists for inference workloads.

    No theory. No narrative. Only what Codex can execute.


    When to Use This Skill

    Codex should activate this skill whenever the user asks for:

    • Optimizing LLM inference latency or throughput
    • Choosing quantization strategies (FP8/FP4/INT8/INT4)
    • Configuring vLLM, TensorRT-LLM, or DeepSpeed inference
    • Scaling LLM inference across GPUs (tensor/pipeline parallelism)
    • Building high-throughput LLM APIs
    • Improving context window performance (KV cache optimization)
    • Using speculative decoding for faster generation
    • Reducing cost per token
    • Profiling and benchmarking inference workloads
    • Planning infrastructure capacity
    • CPU/edge deployment patterns
    • High availability and resilience patterns

    Scope Boundaries (Use These Skills for Depth)

    • Prompting, tuning, datasets -> ai-llm
    • RAG pipeline construction -> ai-rag
    • Deployment, APIs, monitoring -> ai-mlops
    • Safety, governance -> ai-mlops
    • Performance monitoring -> qa-observability
    • Infrastructure operations -> ops-devops-platform

    Quick Reference

    Task Tool/Framework Command/Pattern When to Use
    Latency budget SLO + load model TTFT/ITL + P95/P99 under load Any production endpoint
    Tail-latency control Scheduling + timeouts Admission control + queue caps + backpressure Prevent p99 explosions
    Throughput Batching + KV-cache aware serving Continuous batching + KV paging High concurrency serving
    Cost control Model tiering + caching Cache (prefix/response) + quotas Reduce spend and overload risk
    Long context Prefill optimization Chunked prefill + prompt compression Long inputs and RAG-heavy apps
    Parallelism TP/PP/DP Choose by model size and interconnect Models that do not fit one device
    Reliability Resilience patterns Timeouts + circuit breakers + idempotency Avoid cascading failures

    Decision Tree: Inference Optimization Strategy

    Need to optimize LLM inference: [Optimization Path]
        │
        ├─ High throughput (>10k tok/s) OR P99 variance > 3x P50?
        │   └─ YES -> Disaggregated inference (prefill/decode separation)
        │            See references/disaggregated-inference.md
        │
        ├─ Primary constraint: Throughput?
        │   ├─ Many concurrent users? -> batching + KV-cache aware serving + admission control
        │   ├─ Chat/agents with KV reuse? -> SGLang (RadixAttention)
        │   └─ Mostly batch/offline? -> batch inference jobs + large batches + spot capacity
        │
        ├─ Primary constraint: Cost?
        │   ├─ Can accept lower quality tier? -> model tiering (small/medium/large router)
        │   └─ Must keep quality? -> caching + prompt/context reduction before quantization
        │
        ├─ Primary constraint: Latency?
        │   ├─ Draft model acceptable? -> speculative decoding
        │   └─ Long context? -> prefill optimizations + FlashAttention-3 + context budgets
        │
        ├─ Large model (>70B)?
        │   ├─ Multiple GPUs? -> Tensor parallelism (NVLink required)
        │   └─ Deep model? -> Pipeline parallelism (minimize bubbles)
        │
        ├─ Hardware selection?
        │   ├─ Memory-bound? -> more HBM, higher bandwidth
        │   ├─ Latency-bound? -> faster clocks + kernel support
        │   └─ Multi-node? -> prioritize interconnect (NVLink/RDMA) and topology
        │
        │   Notes: treat GPU/SKU advice as time-sensitive; verify with vendor docs and your own benchmarks.
        │   See references/gpu-optimization-checklists.md and references/infrastructure-tuning.md
        │
        └─ Edge deployment?
            └─ CPU + quantization -> llama.cpp/GGUF for constrained resources
    

    Intake Checklist (REQUIRED)

    Before recommending changes, collect (or infer) these inputs:

    • Model + variant (size, context length, precision/quantization, tokenizer)
    • Traffic shape (prompt/output length distributions, concurrency, QPS, streaming vs non-streaming)
    • SLOs and budgets (TTFT/ITL/total latency targets, error budget, cost per request)
    • Serving stack (engine/version, batching/scheduling settings, caching, parallelism, autoscaling)
    • Hardware and topology (GPU type/count, VRAM, NVLink/RDMA, CPU/RAM, storage, cluster/runtime)
    • Constraints (quality floor, safety requirements, rollout/rollback constraints)

    Core Concepts & Practices

    Core Concepts (Vendor-Agnostic)

    • Latency components: queueing + prefill + decode; optimize the largest contributor first.
    • Tail latency: p99 is dominated by queuing and long prompts; fix with admission control and context budgets.
    • Retries: retries can multiply load; bound retries and use hedged requests only with strict budgets.
    • Caching: prefix caching helps repeated system/tool scaffolds; response caching helps repeated questions (requires invalidation).
    • Security & privacy: prompts/outputs can contain sensitive data; scrub logs, enforce auth/tenancy, and rate-limit abuse (OWASP LLM Top 10: https://owasp.org/www-project-top-10-for-large-language-model-applications/).

    Implementation Practices (Tooling Examples)

    • Measure under load: benchmark TTFT/ITL and p95/p99 with realistic concurrency and prompt lengths.
    • Separate environments: dev/stage/prod model configs; promote only after passing the inference review checklist.
    • Export telemetry: request-level tokens, TTFT/ITL, queue depth, GPU memory headroom, and error classes (OpenTelemetry GenAI semantic conventions: https://opentelemetry.io/docs/specs/semconv/gen-ai/).

    Do / Avoid

    Do

    • Do enforce max_input_tokens and max_output_tokens at the API boundary.
    • Do cap concurrency and queue depth; return overload errors quickly.
    • Do validate quality after any quantization or kernel change.

    Avoid

    • Avoid unbounded retries (amplifies outages).
    • Avoid unbounded context windows (OOM + latency spikes).
    • Avoid benchmarking on single requests; always test with realistic concurrency.

    Accuracy Protocol (REQUIRED)

    • Treat performance ratios (for example, "2x faster") as hypotheses unless a source is cited and the workload is comparable.
    • Do not recommend hardware/SKU changes without stating assumptions (model size, context length, concurrency, interconnect).
    • Prefer a measured baseline + checklist-driven rollout over "best practice" claims.

    Resources (Detailed Operational Guides)

    For comprehensive guides on specific topics, see:

    Infrastructure & Serving

    • Disaggregated Inference - Prefill/decode separation (2025+ standard)
    • Infrastructure Tuning - OS, container, Kubernetes optimization for GPU workloads
    • Serving Architectures - Production serving stack patterns (vLLM, SGLang, TensorRT-LLM, NVIDIA Dynamo)
    • Resilience & HA Patterns - Multi-region, failover, traffic management

    Performance Optimization

    • Quantization Patterns - FP8/FP4/INT8/INT4 decision trees (FP8 first, INT8 not on Blackwell)
    • KV Cache Optimization - PagedAttention, FlashAttention-3, FlashInfer, RadixAttention
    • Parallelism Patterns - Tensor/pipeline/expert parallelism strategies
    • Optimization Strategies - Throughput, cost, memory optimization
    • Batching & Scheduling - Continuous batching and throughput patterns

    Deployment & Operations

    • Edge & CPU Optimization - llama.cpp, GGUF, mobile/browser deployment
    • GPU Optimization Checklists - Hardware-specific tuning
    • Speculative Decoding Guide - Advanced generation acceleration
    • Profiling & Capacity Planning - Benchmarking, SLOs, replica sizing

    Cost & Routing

    • Cost Optimization Patterns - Token budgets, caching economics, model tiering, cost-per-outcome tracking
    • Multi-Model Routing - Router architectures, quality-cost tradeoffs, cascading strategies, A/B routing
    • Streaming Patterns - SSE/WebSocket serving, token-by-token delivery, backpressure, client integration

    Templates

    Inference Configs

    Production-ready configuration templates for leading inference engines:

    • vLLM Configuration - Continuous batching, PagedAttention setup
    • TensorRT-LLM Configuration - NVIDIA kernel optimizations
    • DeepSpeed Inference - PyTorch-friendly inference

    Quantization & Compression

    Model compression templates for reducing memory and cost:

    • GPTQ Quantization - GPU post-training quantization
    • AWQ Quantization - Activation-aware weight quantization
    • GGUF Format - CPU/edge optimized formats

    Serving Pipelines

    High-throughput serving architectures:

    • LLM API Server - FastAPI + vLLM production setup
    • High-Throughput Setup - Multi-replica scaling patterns

    Caching & Batching

    Performance optimization templates:

    • Prefix Caching - KV cache reuse strategies
    • Batching Configuration - Continuous batching tuning

    Benchmarking

    Performance measurement and validation:

    • Latency & Throughput Testing - Load testing framework

    Checklists

    • Inference Performance Review Checklist - Baseline, bottlenecks, rollout readiness

    Navigation

    Resources

    • references/disaggregated-inference.md
    • references/serving-architectures.md
    • references/profiling-and-capacity-planning.md
    • references/gpu-optimization-checklists.md
    • references/speculative-decoding-guide.md
    • references/resilience-ha-patterns.md
    • references/optimization-strategies.md
    • references/kv-cache-optimization.md
    • references/batching-and-scheduling.md
    • references/quantization-patterns.md
    • references/parallelism-patterns.md
    • references/edge-cpu-optimization.md
    • references/infrastructure-tuning.md
    • references/cost-optimization-patterns.md
    • references/multi-model-routing.md
    • references/streaming-patterns.md

    Templates

    • assets/serving/template-llm-api.md
    • assets/serving/template-high-throughput-setup.md
    • assets/inference/template-vllm-config.md
    • assets/inference/template-tensorrtllm-config.md
    • assets/inference/template-deepspeed-inference.md
    • assets/quantization/template-awq.md
    • assets/quantization/template-gptq.md
    • assets/quantization/template-gguf.md
    • assets/batching/template-batching-config.md
    • assets/caching/template-prefix-caching.md
    • assets/benchmarking/template-latency-throughput-test.md
    • assets/checklists/inference-review-checklist.md

    Data

    • data/sources.json - Curated external references

    Trend Awareness Protocol

    IMPORTANT: When users ask recommendation questions about LLM inference, you MUST use WebSearch to check current trends before answering.

    Trigger Conditions

    • "What's the best inference engine for [use case]?"
    • "What should I use for [serving/quantization/batching]?"
    • "What's the latest in LLM inference optimization?"
    • "Current best practices for [vLLM/TensorRT/quantization]?"
    • "Is [inference tool] still relevant in 2026?"
    • "[vLLM] vs [TensorRT-LLM] vs [SGLang]?"
    • "Best quantization method for [model size]?"
    • "What GPU should I use for inference?"

    Required Searches

    1. Search: "LLM inference optimization best practices 2026"
    2. Search: "[vLLM/TensorRT-LLM/SGLang] comparison 2026"
    3. Search: "LLM quantization trends January 2026"
    4. Search: "LLM serving new releases 2026"

    What to Report

    After searching, provide:

    • Current landscape: What serving engines are popular NOW (not 6 months ago)
    • Emerging trends: New inference optimizations gaining traction
    • Deprecated/declining: Techniques or tools losing relevance
    • Recommendation: Based on fresh data, not just static knowledge

    Example Topics (verify with fresh search)

    • Inference engines (vLLM 0.7+, TensorRT-LLM, SGLang, llama.cpp)
    • Quantization methods (FP8, AWQ, GPTQ, GGUF, bitsandbytes)
    • Attention kernels (FlashAttention-3, FlashInfer, xFormers)
    • Speculative decoding advances
    • KV cache optimization techniques
    • New GPU architectures (H200, Blackwell) and their optimizations

    Related Skills

    This skill focuses on inference-time performance. For related workflows:

    • See "Scope Boundaries" above.

    External Resources

    See data/sources.json for:

    • Serving frameworks (vLLM, TensorRT-LLM, DeepSpeed-MII)
    • Quantization libraries (GPTQ, AWQ, bitsandbytes, LLM Compressor)
    • FlashAttention, FlashInfer, xFormers
    • GPU hardware guides and optimization docs
    • Benchmarking frameworks and tools

    Use this skill whenever the user needs LLM inference performance, cost reduction, or serving architecture guidance.

    Fact-Checking

    • Use web search/web fetch to verify current external facts, versions, pricing, deadlines, regulations, or platform behavior before final answers.
    • Prefer primary sources; report source links and dates for volatile information.
    • If web access is unavailable, state the limitation and mark guidance as unverified.
    Recommended Servers
    Local Model Suitability MCP
    Local Model Suitability MCP
    Parallel Tasks
    Parallel Tasks
    Parallel Web Search
    Parallel Web Search
    Repository
    vasilyu1983/ai-agents-public
    Files