
    neversight/prompt-caching

    About

    Prompt caching for Claude API to reduce latency by up to 85% and costs by up to 90%. Activate for cache_control, ephemeral caching, cache breakpoints, and performance optimization.

    SKILL.md

    Prompt Caching Skill

    Leverage Anthropic's prompt caching to dramatically reduce latency and costs for repeated prompts.

    When to Use This Skill

    • RAG systems with large static documents
    • Multi-turn conversations with long instructions
    • Code analysis with large codebase context
    • Batch processing with shared prefixes
    • Document analysis and summarization

    Core Concepts

    Cache Control Placement

    import anthropic
    
    client = anthropic.Anthropic()
    
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": "You are a helpful assistant with access to a large knowledge base...",
                "cache_control": {"type": "ephemeral"}  # Cache this content
            }
        ],
        messages=[{"role": "user", "content": "What is...?"}]
    )
    

    Cache Hierarchy

    Cache breakpoints are checked in this order:

    1. Tools - Tool definitions cached first
    2. System - System prompts cached second
    3. Messages - Conversation history cached last
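
    The hierarchy matters because a breakpoint also covers everything that precedes it in this order. A minimal sketch with a breakpoint at each level (the get_weather tool is hypothetical, included only to show placement):

    import anthropic
    
    client = anthropic.Anthropic()
    
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        tools=[{
            # Hypothetical tool, shown only to illustrate placement
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "input_schema": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
            "cache_control": {"type": "ephemeral"},  # 1. Tools cached first
        }],
        system=[{
            "type": "text",
            "text": "Long, stable instructions...",
            "cache_control": {"type": "ephemeral"},  # 2. System cached second
        }],
        messages=[{
            "role": "user",
            "content": [{
                "type": "text",
                "text": "What's the weather in Paris?",
                "cache_control": {"type": "ephemeral"},  # 3. Messages cached last
            }],
        }],
    )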

    TTL Options

    TTL                  Write Cost  Read Cost  Use Case
    5 minutes (default)  1.25x base  0.1x base  Interactive sessions
    1 hour               2.0x base   0.1x base  Batch processing, stable docs
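
    Using these multipliers, a quick back-of-the-envelope check shows how many cache reads it takes for caching to beat re-sending the prefix uncached (a sketch; real savings also depend on hit rate and prefix size):

    def breakeven_reads(write_multiplier, read_multiplier=0.1):
        # Extra cost of caching: (write_multiplier - 1.0) x base, paid once.
        # Savings per cached read: (1.0 - read_multiplier) x base.
        return (write_multiplier - 1.0) / (1.0 - read_multiplier)
    
    print(breakeven_reads(1.25))  # 5-minute TTL: ~0.28 -> pays off on the first reuse
    print(breakeven_reads(2.0))   # 1-hour TTL:  ~1.11 -> pays off from the second reuse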

    Cache Requirements

    • Minimum tokens: 1024-4096 (varies by model)
    • Maximum breakpoints: 4 per request
    • Supported models: Claude Opus 4.5, Sonnet 4.5, Haiku 4.5
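
    Because undersized content is processed normally rather than cached, it can be worth checking the prefix length up front. A sketch assuming the SDK's messages.count_tokens endpoint (the 1024 threshold is the lowest published minimum; check the limit for your model):

    import anthropic
    
    client = anthropic.Anthropic()
    
    MIN_CACHEABLE_TOKENS = 1024  # lowest published minimum; varies by model
    prefix_text = "..."          # the static content you intend to cache
    
    # Dry run: counts tokens without creating a message or writing a cache
    count = client.messages.count_tokens(
        model="claude-sonnet-4-20250514",
        system=[{"type": "text", "text": prefix_text}],
        messages=[{"role": "user", "content": "placeholder"}],
    )
    if count.input_tokens < MIN_CACHEABLE_TOKENS:
        print("Prefix too short: cache_control would be silently ignored")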

    Implementation Patterns

    Pattern 1: Single Breakpoint (Recommended)

    # Best for: Document analysis, Q&A with static context
    system = [
        {
            "type": "text",
            "text": large_document_content,
            "cache_control": {"type": "ephemeral"}  # Single breakpoint at end
        }
    ]
    

    Pattern 2: Multi-Turn Conversation

    # Cache grows with conversation
    messages = [
        {"role": "user", "content": "First question"},
        {"role": "assistant", "content": "First answer"},
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Follow-up question",
                    # cache_control belongs on a content block, not the
                    # message itself; this caches the entire conversation
                    "cache_control": {"type": "ephemeral"}
                }
            ]
        }
    ]
    

    Pattern 3: RAG with Multiple Breakpoints

    system = [
        {
            "type": "text",
            "text": "Tool definitions and instructions",
            "cache_control": {"type": "ephemeral"}  # Breakpoint 1: Tools
        },
        {
            "type": "text",
            "text": retrieved_documents,
            "cache_control": {"type": "ephemeral"}  # Breakpoint 2: Documents
        }
    ]
    

    Pattern 4: Batch Processing with 1-Hour TTL

    # Warm the cache before batch
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=100,
        system=[{
            "type": "text",
            "text": shared_context,
            "cache_control": {"type": "ephemeral", "ttl": "1h"}
        }],
        messages=[{"role": "user", "content": "Initialize cache"}]
    )
    
    # Now run batch - all requests hit the cache
    for item in batch_items:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            system=[{
                "type": "text",
                "text": shared_context,
                "cache_control": {"type": "ephemeral", "ttl": "1h"}
            }],
            messages=[{"role": "user", "content": item}]
        )
    

    Performance Monitoring

    Check Cache Usage

    response = client.messages.create(...)
    
    # Monitor these fields
    cache_write = response.usage.cache_creation_input_tokens  # New cache written
    cache_read = response.usage.cache_read_input_tokens       # Cache hit!
    uncached = response.usage.input_tokens                    # Tokens after the last breakpoint (never cached)
    
    print(f"Cache hit rate: {cache_read / (cache_read + cache_write + uncached) * 100:.1f}%")
    

    Cost Calculation

    def calculate_cost(usage, model="claude-sonnet-4-20250514"):
        # Example rates (check current pricing)
        base_input_rate = 0.003  # per 1K tokens
    
        write_cost = (usage.cache_creation_input_tokens / 1000) * base_input_rate * 1.25
        read_cost = (usage.cache_read_input_tokens / 1000) * base_input_rate * 0.1
        uncached_cost = (usage.input_tokens / 1000) * base_input_rate
    
        return write_cost + read_cost + uncached_cost
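
    To see what a request actually saved, the same usage fields give a no-cache baseline. A sketch building on calculate_cost above (response is any Messages API response with a cached prefix):

    def uncached_baseline(usage, base_input_rate=0.003):
        # What the same input would cost if nothing were cached
        total = (usage.cache_creation_input_tokens
                 + usage.cache_read_input_tokens
                 + usage.input_tokens)
        return (total / 1000) * base_input_rate
    
    savings = uncached_baseline(response.usage) - calculate_cost(response.usage)
    print(f"Saved ${savings:.4f} on this request")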
    

    Cache Invalidation

    Changes that invalidate the cache:

    Change                           Impact
    Tool definitions                 Entire cache invalidated
    System prompt                    System and message caches invalidated
    Any content before a breakpoint  That breakpoint and all later ones invalidated
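
    A change anywhere in the cached prefix shows up in the usage fields: a repeat request that should hit the cache but reports a write means the prefix drifted. A small diagnostic sketch:

    def cache_status(usage):
        # Classify a response's cache behavior from its usage fields
        if usage.cache_read_input_tokens > 0:
            return "hit"
        if usage.cache_creation_input_tokens > 0:
            return "write (first request, or a prefix change invalidated the cache)"
        return "uncached (no breakpoint, or prefix below the minimum token count)"
    
    # A repeat of an identical request should report "hit"; a "write" on a
    # repeat means something before the breakpoint changed.
    print(cache_status(response.usage))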

    Best Practices

    DO:

    • Place breakpoint at END of static content
    • Keep tools/instructions stable across requests
    • Use 1-hour TTL for batch processing
    • Monitor cache_read_input_tokens for savings

    DON'T:

    • Place breakpoint in middle of dynamic content
    • Change tool definitions frequently
    • Expect cache to work with <1024 tokens
    • Ignore the 20-block lookback limit (cache hits are only checked about 20 content blocks back from each breakpoint)

    Integration with Extended Thinking

    # Cache + Extended Thinking
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=16000,
        thinking={"type": "enabled", "budget_tokens": 10000},
        system=[{
            "type": "text",
            "text": large_context,
            "cache_control": {"type": "ephemeral"}
        }],
        messages=[{"role": "user", "content": "Analyze this..."}]
    )
    

    See Also

    • [[llm-integration]] - Claude API basics
    • [[extended-thinking]] - Deep reasoning
    • [[batch-processing]] - Bulk processing
    Repository: neversight/skills_feed