    Embedding Optimization

    Optimize embedding generation for cost, performance, and quality in RAG and semantic search systems.

    When to Use This Skill

    Trigger this skill when:

    • Building RAG (Retrieval-Augmented Generation) systems
    • Implementing semantic search or similarity detection
    • Optimizing embedding API costs (reductions of 70-90% are achievable)
    • Improving document retrieval quality through better chunking
    • Processing large document corpora (thousands to millions of documents)
    • Selecting between API-based and local embedding models

    Model Selection Framework

    Choose the optimal embedding model based on requirements:

    Quick Recommendations:

    • Startup/MVP: all-MiniLM-L6-v2 (local, 384 dims, zero API costs)
    • Production: text-embedding-3-small (API, 1,536 dims, balanced quality/cost)
    • High Quality: text-embedding-3-large (API, 3,072 dims, premium)
    • Multilingual: multilingual-e5-base (local, 768 dims) or Cohere embed-multilingual-v3.0

    For detailed decision frameworks including cost comparisons, quality benchmarks, and data privacy considerations, see references/model-selection-guide.md.

    Model Comparison Summary:

    | Model                   | Type  | Dimensions | Cost per 1M tokens | Best For                     |
    |-------------------------|-------|------------|--------------------|------------------------------|
    | all-MiniLM-L6-v2        | Local | 384        | $0 (compute only)  | High volume, tight budgets   |
    | BGE-base-en-v1.5        | Local | 768        | $0 (compute only)  | Quality + cost balance       |
    | text-embedding-3-small  | API   | 1,536      | $0.02              | General purpose production   |
    | text-embedding-3-large  | API   | 3,072      | $0.13              | Premium quality requirements |
    | embed-multilingual-v3.0 | API   | 1,024      | $0.10              | 100+ language support        |
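
    To make the table concrete, here is a minimal sketch of the local-vs-API choice. It assumes the sentence-transformers and openai packages are installed (and OPENAI_API_KEY set for the API path); embed_texts is an illustrative helper, not part of this skill's scripts:

    from openai import OpenAI
    from sentence_transformers import SentenceTransformer

    def embed_texts(texts: list[str], use_api: bool = False) -> list[list[float]]:
        if use_api:
            # Production path: text-embedding-3-small, 1,536 dims, $0.02 per 1M tokens
            client = OpenAI()  # reads OPENAI_API_KEY from the environment
            resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
            return [d.embedding for d in resp.data]
        # Startup/MVP path: all-MiniLM-L6-v2, 384 dims, zero API cost
        model = SentenceTransformer("all-MiniLM-L6-v2")
        return model.encode(texts).tolist()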

    Chunking Strategies

    Select chunking strategy based on content type and use case:

    Content Type → Strategy Mapping:

    • Documentation: Recursive (heading-aware), 800 chars, 100 overlap
    • Code: Recursive (function-level), 1,000 chars, 100 overlap
    • Q&A/FAQ: Fixed-size, 500 chars, 50 overlap (precise retrieval)
    • Legal/Technical: Semantic (large), 1,500 chars, 200 overlap (context preservation)
    • Blog Posts: Semantic (paragraph), 1,000 chars, 100 overlap
    • Academic Papers: Recursive (section-aware), 1,200 chars, 150 overlap

    For detailed chunking patterns, decision trees, and implementation guidance, see references/chunking-strategies.md.

    Quick Start with CLI:

    python scripts/chunk_document.py \
      --input document.txt \
      --content-type markdown \
      --chunk-size 800 \
      --overlap 100 \
      --output chunks.jsonl
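
    For orientation, a minimal fixed-size chunker with overlap, sketching the --chunk-size and --overlap parameters above; the bundled scripts/chunk_document.py implements the full strategy set:

    def chunk_fixed(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
        # Slide a chunk_size-character window forward by (chunk_size - overlap),
        # so consecutive chunks share `overlap` characters of context.
        if not text:
            return []
        step = chunk_size - overlap
        return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]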
    

    Caching Implementation

    Achieve 80-90% cost reduction through content-addressable caching.

    Caching Architecture by Query Volume:

    • <10K queries/month: In-memory cache (Python dict or lru_cache; see the sketch after this list)
    • 10K-100K queries/month: Redis (fast, TTL-based expiration)
    • 100K-1M queries/month: Redis (hot) + PostgreSQL (warm)
    • >1M queries/month: Multi-tier (Redis + PostgreSQL + S3)
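
    A minimal sketch of the in-memory tier, assuming the official openai Python client; the helper names are illustrative. The same SHA-256 content key carries over unchanged as a Redis key in the larger tiers:

    import hashlib

    from openai import OpenAI

    MODEL = "text-embedding-3-small"
    client = OpenAI()
    _cache: dict[str, list[float]] = {}

    def _key(text: str) -> str:
        # Content-addressable key: identical (model, text) pairs map to one entry
        return hashlib.sha256(f"{MODEL}:{text}".encode()).hexdigest()

    def embed(text: str) -> list[float]:
        k = _key(text)
        if k not in _cache:  # cache miss: pay for the API call once, reuse afterwards
            resp = client.embeddings.create(model=MODEL, input=text)
            _cache[k] = resp.data[0].embedding
        return _cache[k]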

    Production Caching with Redis:

    # Embed documents with caching enabled
    python scripts/cached_embedder.py \
      --model text-embedding-3-small \
      --input documents.jsonl \
      --output embeddings.npy \
      --cache-backend redis \
      --cache-ttl 2592000  # 30 days
    

    Caching ROI Example:

    • 50,000 document chunks, ~20% duplicate content
    • Without caching: $0.50 API cost
    • With caching at a 60% hit rate (duplicates plus repeat embeddings across pipeline re-runs), only the 40% of misses are billed: $0.20 API cost
    • Savings: 60% ($0.30)

    Dimensionality Trade-offs

    Balance storage, search speed, and quality:

    | Dimensions | Storage (1M vectors) | Search Speed (p95) | Quality   | Use Case               |
    |------------|----------------------|--------------------|-----------|------------------------|
    | 384        | 1.5 GB               | 10 ms              | Good      | Large-scale search     |
    | 768        | 3 GB                 | 15 ms              | High      | General purpose RAG    |
    | 1,536      | 6 GB                 | 25 ms              | Very high | High-quality retrieval |
    | 3,072      | 12 GB                | 40 ms              | Highest   | Premium applications   |

    Key Insight: For most RAG applications, 768 dimensions (BGE-base-en-v1.5 local or equivalent) provides the best quality/cost/speed balance.
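
    A related lever: models trained with Matryoshka-style representation learning, including OpenAI's text-embedding-3 family, tolerate truncation to fewer leading dimensions. A sketch, assuming embeddings should stay unit-normalized for cosine similarity:

    import numpy as np

    def truncate_embedding(emb: np.ndarray, dims: int = 768) -> np.ndarray:
        # Keep the leading `dims` components, then re-normalize so cosine
        # similarity remains meaningful at the reduced dimensionality.
        v = emb[:dims]
        return v / np.linalg.norm(v)

    With the OpenAI API, the same effect is available server-side via the dimensions request parameter.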

    Batch Processing Optimization

    Maximize throughput for large-scale ingestion:

    OpenAI API:

    • Batch up to 2,048 inputs per request
    • Implement rate limiting (tier-dependent: 500-5,000 RPM)
    • Use parallel requests with backoff on rate limits

    Local Models (sentence-transformers):

    • GPU acceleration (CUDA, MPS for Apple Silicon)
    • Batch size tuning (32-128 based on GPU memory)
    • Multi-GPU support for maximum throughput

    Expected Throughput:

    • OpenAI API: 1,000-5,000 texts/minute (rate limit dependent)
    • Local GPU (RTX 3090): 5,000-10,000 texts/minute
    • Local CPU: 100-500 texts/minute
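
    A minimal local batching sketch, assuming sentence-transformers and PyTorch are installed; batch_size is the knob to tune against available GPU memory:

    import torch
    from sentence_transformers import SentenceTransformer

    # Pick the best available accelerator: CUDA GPU, Apple Silicon MPS, or CPU
    device = "cuda" if torch.cuda.is_available() else \
             ("mps" if torch.backends.mps.is_available() else "cpu")
    model = SentenceTransformer("all-MiniLM-L6-v2", device=device)

    texts = [f"document {i}" for i in range(10_000)]  # stand-in corpus
    embeddings = model.encode(
        texts,
        batch_size=64,               # tune within the 32-128 range noted above
        normalize_embeddings=True,   # unit vectors simplify cosine search
        show_progress_bar=True,
    )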

    Performance Monitoring

    Track key metrics for optimization:

    Critical Metrics:

    • Latency: Embedding generation time (p50, p95, p99)
    • Throughput: Embeddings per second/minute
    • Cost: API usage tracking (USD per 1K/1M tokens)
    • Cache Efficiency: Hit rate percentage

    For detailed monitoring setup, metric collection patterns, and dashboarding, see references/performance-monitoring.md.

    Monitor with Wrapper:

    from scripts.performance_monitor import MonitoredEmbedder
    
    monitored = MonitoredEmbedder(
        embedder=your_embedder,
        cost_per_1k_tokens=0.00002  # text-embedding-3-small: $0.02 per 1M tokens
    )
    
    embeddings = monitored.embed_batch(texts)
    metrics = monitored.get_metrics()
    print(f"Cache hit rate: {metrics['cache_hit_rate_pct']}%")
    print(f"Total cost: ${metrics['total_cost_usd']}")
    

    Working Examples

    See examples/ directory for complete implementations:

    Python Examples:

    • examples/openai_cached.py - OpenAI embeddings with Redis caching
    • examples/local_embedder.py - sentence-transformers local embedding
    • examples/smart_chunker.py - Content-aware recursive chunking
    • examples/performance_monitor.py - Pipeline performance tracking
    • examples/batch_processor.py - Large-scale document processing

    All examples include:

    • Complete, runnable code
    • Dependency installation instructions
    • Error handling and retry logic
    • Configuration options

    Integration Points

    Downstream (This skill provides to):

    • Vector Databases: Embeddings flow to Pinecone, Weaviate, Qdrant, pgvector
    • RAG Systems: Optimized embeddings for retrieval pipelines
    • Semantic Search: Query and document embeddings for similarity search

    Upstream (This skill uses from):

    • Document Processing: Chunk documents before embedding
    • Data Ingestion: Process documents from various sources

    Related Skills:

    • For RAG architecture, see building-ai-chat skill
    • For vector database operations, see databases-vector skill
    • For data ingestion pipelines, see ingesting-data skill

    Common Patterns

    Pattern 1: RAG Pipeline

    Document → Chunk → Embed → Store (vector DB) → Retrieve
    

    Pattern 2: Semantic Search

    Query → Embed → Search (vector DB) → Rank → Display
    

    Pattern 3: Multi-Stage Retrieval (Cost Optimization)

    Query → Cheap Embedding (384d) → Initial Search →
    Expensive Embedding (1,536d) → Rerank Top-K → Return
    

    Cost Savings: 70% reduction vs. single-stage with expensive embeddings
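
    A sketch of the two-stage pattern in plain NumPy; cheap_embed and rich_embed are hypothetical helpers returning unit-normalized vectors, and in practice the stage-1 corpus embeddings live pre-indexed in a vector database:

    import numpy as np

    def multi_stage_search(query: str, docs: list[str], cheap_embed, rich_embed,
                           k_initial: int = 100, k_final: int = 10) -> list[str]:
        # Stage 1: cheap 384-d embeddings narrow the full corpus to candidates
        q = cheap_embed(query)
        corpus = np.array([cheap_embed(d) for d in docs])  # precomputed in practice
        candidates = np.argsort(corpus @ q)[::-1][:k_initial]
        # Stage 2: expensive 1,536-d embeddings rerank only the candidates
        q_rich = rich_embed(query)
        reranked = np.array([rich_embed(docs[i]) for i in candidates])
        order = np.argsort(reranked @ q_rich)[::-1][:k_final]
        return [docs[candidates[i]] for i in order]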

    Quick Reference Checklist

    Model Selection:

    • Identified data privacy requirements (local vs. API)
    • Calculated expected query volume
    • Determined quality requirements (good/high/highest)
    • Checked multilingual support needs

    Chunking:

    • Analyzed content type (code, docs, legal, etc.)
    • Selected appropriate chunk size (500-1,500 chars)
    • Set overlap to prevent context loss (50-200 chars)
    • Validated chunks preserve semantic boundaries

    Caching:

    • Implemented content-addressable hashing
    • Selected cache backend (Redis, PostgreSQL)
    • Set TTL based on content volatility
    • Monitored cache hit rate (target: >60%)

    Performance:

    • Tracking latency (embedding generation time)
    • Measuring throughput (embeddings/sec)
    • Monitoring costs (USD spent on API calls)
    • Optimizing batch sizes for maximum efficiency