
    neversight/prompt-caching

    About

    Prompt caching for Claude API to reduce latency by up to 85% and costs by up to 90%. Activate for cache_control, ephemeral caching, cache breakpoints, and performance optimization.

    SKILL.md

    Prompt Caching Skill

    Leverage Anthropic's prompt caching to dramatically reduce latency and costs for repeated prompts.

    When to Use This Skill

    • RAG systems with large static documents
    • Multi-turn conversations with long instructions
    • Code analysis with large codebase context
    • Batch processing with shared prefixes
    • Document analysis and summarization

    Core Concepts

    Cache Control Placement

    import anthropic
    
    client = anthropic.Anthropic()
    
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": "You are a helpful assistant with access to a large knowledge base...",
                "cache_control": {"type": "ephemeral"}  # Cache this content
            }
        ],
        messages=[{"role": "user", "content": "What is...?"}]
    )
    

    Cache Hierarchy

    Cache breakpoints are checked in this order:

    1. Tools - Tool definitions cached first
    2. System - System prompts cached second
    3. Messages - Conversation history cached last
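
    The hierarchy matters because a breakpoint also covers everything that precedes it in this order. A minimal sketch with a breakpoint at each level (the get_weather tool is hypothetical, included only to show placement):

    import anthropic
    
    client = anthropic.Anthropic()
    
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        tools=[{
            # Hypothetical tool, shown only to illustrate placement
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "input_schema": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
            "cache_control": {"type": "ephemeral"},  # 1. Tools cached first
        }],
        system=[{
            "type": "text",
            "text": "Long, stable instructions...",
            "cache_control": {"type": "ephemeral"},  # 2. System cached second
        }],
        messages=[{
            "role": "user",
            "content": [{
                "type": "text",
                "text": "What's the weather in Paris?",
                "cache_control": {"type": "ephemeral"},  # 3. Messages cached last
            }],
        }],
    )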

    TTL Options

    TTL                  Write Cost  Read Cost  Use Case
    5 minutes (default)  1.25x base  0.1x base  Interactive sessions
    1 hour               2.0x base   0.1x base  Batch processing, stable docs
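
    Using these multipliers, a quick back-of-the-envelope check shows how many cache reads it takes for caching to beat re-sending the prefix uncached (a sketch; real savings also depend on hit rate and prefix size):

    def breakeven_reads(write_multiplier, read_multiplier=0.1):
        # Extra cost of caching: (write_multiplier - 1.0) x base, paid once.
        # Savings per cached read: (1.0 - read_multiplier) x base.
        return (write_multiplier - 1.0) / (1.0 - read_multiplier)
    
    print(breakeven_reads(1.25))  # 5-minute TTL: ~0.28 -> pays off on the first reuse
    print(breakeven_reads(2.0))   # 1-hour TTL:  ~1.11 -> pays off from the second reuse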

    Cache Requirements

    • Minimum tokens: 1024-4096 (varies by model)
    • Maximum breakpoints: 4 per request
    • Supported models: Claude Opus 4.5, Sonnet 4.5, Haiku 4.5
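
    Because undersized content is processed normally rather than cached, it can be worth checking the prefix length up front. A sketch assuming the SDK's messages.count_tokens endpoint (the 1024 threshold is the lowest published minimum; check the limit for your model):

    import anthropic
    
    client = anthropic.Anthropic()
    
    MIN_CACHEABLE_TOKENS = 1024  # lowest published minimum; varies by model
    prefix_text = "..."          # the static content you intend to cache
    
    # Dry run: counts tokens without creating a message or writing a cache
    count = client.messages.count_tokens(
        model="claude-sonnet-4-20250514",
        system=[{"type": "text", "text": prefix_text}],
        messages=[{"role": "user", "content": "placeholder"}],
    )
    if count.input_tokens < MIN_CACHEABLE_TOKENS:
        print("Prefix too short: cache_control would be silently ignored")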

    Implementation Patterns

    Pattern 1: Single Breakpoint (Recommended)

    # Best for: Document analysis, Q&A with static context
    system = [
        {
            "type": "text",
            "text": large_document_content,
            "cache_control": {"type": "ephemeral"}  # Single breakpoint at end
        }
    ]
    

    Pattern 2: Multi-Turn Conversation

    # Cache grows with conversation
    messages = [
        {"role": "user", "content": "First question"},
        {"role": "assistant", "content": "First answer"},
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Follow-up question",
                    # cache_control belongs on a content block, not the
                    # message itself; this caches the entire conversation
                    "cache_control": {"type": "ephemeral"}
                }
            ]
        }
    ]
    

    Pattern 3: RAG with Multiple Breakpoints

    system = [
        {
            "type": "text",
            "text": "Tool definitions and instructions",
            "cache_control": {"type": "ephemeral"}  # Breakpoint 1: Tools
        },
        {
            "type": "text",
            "text": retrieved_documents,
            "cache_control": {"type": "ephemeral"}  # Breakpoint 2: Documents
        }
    ]
    

    Pattern 4: Batch Processing with 1-Hour TTL

    # Warm the cache before batch
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=100,
        system=[{
            "type": "text",
            "text": shared_context,
            "cache_control": {"type": "ephemeral", "ttl": "1h"}
        }],
        messages=[{"role": "user", "content": "Initialize cache"}]
    )
    
    # Now run batch - all requests hit the cache
    for item in batch_items:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            system=[{
                "type": "text",
                "text": shared_context,
                "cache_control": {"type": "ephemeral", "ttl": "1h"}
            }],
            messages=[{"role": "user", "content": item}]
        )
    

    Performance Monitoring

    Check Cache Usage

    response = client.messages.create(...)
    
    # Monitor these fields
    cache_write = response.usage.cache_creation_input_tokens  # New cache written
    cache_read = response.usage.cache_read_input_tokens       # Cache hit!
    uncached = response.usage.input_tokens                    # Tokens after the last breakpoint (never cached)
    
    print(f"Cache hit rate: {cache_read / (cache_read + cache_write + uncached) * 100:.1f}%")
    

    Cost Calculation

    def calculate_cost(usage, model="claude-sonnet-4-20250514"):
        # Example rates (check current pricing)
        base_input_rate = 0.003  # per 1K tokens
    
        write_cost = (usage.cache_creation_input_tokens / 1000) * base_input_rate * 1.25
        read_cost = (usage.cache_read_input_tokens / 1000) * base_input_rate * 0.1
        uncached_cost = (usage.input_tokens / 1000) * base_input_rate
    
        return write_cost + read_cost + uncached_cost
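
    To see what a request actually saved, the same usage fields give a no-cache baseline. A sketch building on calculate_cost above (response is any Messages API response with a cached prefix):

    def uncached_baseline(usage, base_input_rate=0.003):
        # What the same input would cost if nothing were cached
        total = (usage.cache_creation_input_tokens
                 + usage.cache_read_input_tokens
                 + usage.input_tokens)
        return (total / 1000) * base_input_rate
    
    savings = uncached_baseline(response.usage) - calculate_cost(response.usage)
    print(f"Saved ${savings:.4f} on this request")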
    

    Cache Invalidation

    Changes that invalidate the cache:

    Change                           Impact
    Tool definitions                 Entire cache invalidated
    System prompt                    System and message caches invalidated
    Any content before a breakpoint  That breakpoint and all later ones invalidated
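
    A change anywhere in the cached prefix shows up in the usage fields: a repeat request that should hit the cache but reports a write means the prefix drifted. A small diagnostic sketch:

    def cache_status(usage):
        # Classify a response's cache behavior from its usage fields
        if usage.cache_read_input_tokens > 0:
            return "hit"
        if usage.cache_creation_input_tokens > 0:
            return "write (first request, or a prefix change invalidated the cache)"
        return "uncached (no breakpoint, or prefix below the minimum token count)"
    
    # A repeat of an identical request should report "hit"; a "write" on a
    # repeat means something before the breakpoint changed.
    print(cache_status(response.usage))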

    Best Practices

    DO:

    • Place breakpoint at END of static content
    • Keep tools/instructions stable across requests
    • Use 1-hour TTL for batch processing
    • Monitor cache_read_input_tokens for savings

    DON'T:

    • Place breakpoint in middle of dynamic content
    • Change tool definitions frequently
    • Expect cache to work with <1024 tokens
    • Ignore the 20-block lookback limit (cache hits are only checked about 20 content blocks back from each breakpoint)

    Integration with Extended Thinking

    # Cache + Extended Thinking
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=16000,
        thinking={"type": "enabled", "budget_tokens": 10000},
        system=[{
            "type": "text",
            "text": large_context,
            "cache_control": {"type": "ephemeral"}
        }],
        messages=[{"role": "user", "content": "Analyze this..."}]
    )
    

    See Also

    • [[llm-integration]] - Claude API basics
    • [[extended-thinking]] - Deep reasoning
    • [[batch-processing]] - Bulk processing
    Repository: neversight/skills_feed