Optimizing vector embeddings for RAG systems through model selection, chunking strategies, caching, and performance tuning...
Optimize embedding generation for cost, performance, and quality in RAG and semantic search systems.
Trigger this skill when:
Choose the optimal embedding model based on requirements:
Quick Recommendations:
- all-MiniLM-L6-v2 (local, 384 dims, zero API costs)
- text-embedding-3-small (API, 1,536 dims, balanced quality/cost)
- text-embedding-3-large (API, 3,072 dims, premium)
- multilingual-e5-base (local, 768 dims) or Cohere embed-multilingual-v3.0

For detailed decision frameworks including cost comparisons, quality benchmarks, and data privacy considerations, see references/model-selection-guide.md.
Model Comparison Summary:
| Model | Type | Dimensions | Cost per 1M tokens | Best For |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | Local | 384 | $0 (compute only) | High volume, tight budgets |
| BGE-base-en-v1.5 | Local | 768 | $0 (compute only) | Quality + cost balance |
| text-embedding-3-small | API | 1,536 | $0.02 | General purpose production |
| text-embedding-3-large | API | 3,072 | $0.13 | Premium quality requirements |
| embed-multilingual-v3.0 | API | 1,024 | $0.10 | 100+ language support |
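For orientation, a minimal sketch using the two default picks from the table (assumes the sentence-transformers and openai packages are installed; the texts are placeholders):

```python
from sentence_transformers import SentenceTransformer
from openai import OpenAI

texts = ["What is a vector embedding?", "How do I chunk documents?"]

# Local default: all-MiniLM-L6-v2 (384 dims, no API cost)
local_model = SentenceTransformer("all-MiniLM-L6-v2")
local_vecs = local_model.encode(texts, normalize_embeddings=True)
print(local_vecs.shape)  # (2, 384)

# API default: text-embedding-3-small (1,536 dims)
client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
api_vecs = [d.embedding for d in resp.data]
print(len(api_vecs[0]))  # 1536
```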
Select chunking strategy based on content type and use case:
Content Type → Strategy Mapping:
For detailed chunking patterns, decision trees, and implementation guidance, see references/chunking-strategies.md.
Quick Start with CLI:
```bash
python scripts/chunk_document.py \
  --input document.txt \
  --content-type markdown \
  --chunk-size 800 \
  --overlap 100 \
  --output chunks.jsonl
```
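For orientation, here is a condensed sketch of what a recursive, content-aware chunker does. scripts/chunk_document.py is the full implementation; this sketch measures size in characters rather than tokens and omits overlap for brevity:

```python
def recursive_chunk(text: str, chunk_size: int = 800,
                    separators=("\n\n", ". ", " ")) -> list[str]:
    """Split on the coarsest boundary that fits; recurse on oversized pieces."""
    if len(text) <= chunk_size:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) == 1:
            continue  # separator not present; try a finer one
        chunks, current = [], ""
        for i, part in enumerate(parts):
            piece = part + (sep if i < len(parts) - 1 else "")
            if current and len(current) + len(piece) > chunk_size:
                chunks.append(current.strip())
                current = ""
            current += piece
        if current.strip():
            chunks.append(current.strip())
        # Any chunk still too large falls through to finer separators.
        return [c for chunk in chunks
                for c in recursive_chunk(chunk, chunk_size, separators)]
    # No separator present at all: hard-split by size.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```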
Achieve 80-90% cost reduction through content-addressable caching.
Caching Architecture by Query Volume:
- Low volume: in-process memoization (functools.lru_cache)
- Production / high volume: shared Redis cache (below)

Production Caching with Redis:
```bash
# Embed documents with caching enabled
python scripts/cached_embedder.py \
  --model text-embedding-3-small \
  --input documents.jsonl \
  --output embeddings.npy \
  --cache-backend redis \
  --cache-ttl 2592000  # 30 days
```
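Under the hood, content-addressable caching keys each vector by a hash of model + text, so identical inputs are never re-embedded. A minimal sketch with redis-py (the cached_embed name and key scheme are illustrative, not the script's exact implementation):

```python
import hashlib
import json

import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)
TTL_SECONDS = 30 * 24 * 3600  # 30 days, matching --cache-ttl above

def cached_embed(text: str, model: str, embed_fn) -> list[float]:
    # Content-addressable key: same model + same text -> same cache entry.
    key = "emb:" + hashlib.sha256(f"{model}:{text}".encode()).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)
    vector = embed_fn(text)  # cache miss: pay for the embedding once
    r.setex(key, TTL_SECONDS, json.dumps(vector))
    return vector
```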
Caching ROI Example:
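Illustrative arithmetic (an assumed workload, not a benchmark): embedding 100M tokens per month with text-embedding-3-small costs $2.00 at $0.02 per 1M tokens. If 85% of that volume is repeated content (re-ingested documents, duplicate queries), a content-addressable cache leaves only 15M billable tokens, about $0.30 per month, an 85% saving, in line with the 80-90% figure above. Redis storage for the cached vectors is typically pennies by comparison.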
Balance storage, search speed, and quality:
| Dimensions | Storage (1M vectors) | Search Speed (p95) | Quality | Use Case |
|---|---|---|---|---|
| 384 | 1.5 GB | 10ms | Good | Large-scale search |
| 768 | 3 GB | 15ms | High | General purpose RAG |
| 1,536 | 6 GB | 25ms | Very High | High-quality retrieval |
| 3,072 | 12 GB | 40ms | Highest | Premium applications |
Key Insight: For most RAG applications, 768 dimensions (BGE-base-en-v1.5 local or equivalent) provides the best quality/cost/speed balance.
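One practical lever when you want a premium model but smaller vectors: OpenAI's text-embedding-3 models accept a dimensions request parameter that returns natively shortened embeddings. A minimal sketch:

```python
from openai import OpenAI

client = OpenAI()
resp = client.embeddings.create(
    model="text-embedding-3-large",
    input="vector search at scale",
    dimensions=768,  # 3,072 native dims, shortened to the table's sweet spot
)
print(len(resp.data[0].embedding))  # 768
```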
Maximize throughput for large-scale ingestion:
OpenAI API:
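A minimal batching sketch for the API path. The endpoint accepts a list of inputs per request; the 512-item batch size here is illustrative, not a documented limit:

```python
from openai import OpenAI

client = OpenAI()

def embed_in_batches(texts, model="text-embedding-3-small", batch_size=512):
    # One request per batch instead of one per text: far fewer round trips.
    vectors = []
    for i in range(0, len(texts), batch_size):
        resp = client.embeddings.create(model=model, input=texts[i:i + batch_size])
        vectors.extend(d.embedding for d in resp.data)
    return vectors
```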
Local Models (sentence-transformers):
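And the local path, where batch_size and device are the main throughput levers (the model name matches the table above; the values are illustrative):

```python
from sentence_transformers import SentenceTransformer

texts = ["chunk one ...", "chunk two ..."]  # your document chunks

model = SentenceTransformer("BAAI/bge-base-en-v1.5", device="cuda")  # or "cpu"
vectors = model.encode(
    texts,
    batch_size=256,             # raise until GPU memory becomes the limit
    normalize_embeddings=True,  # makes cosine similarity a plain dot product
    show_progress_bar=True,
)
```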
Expected Throughput:
Track key metrics for optimization:
Critical Metrics:
For detailed monitoring setup, metric collection patterns, and dashboarding, see references/performance-monitoring.md.
Monitor with Wrapper:
```python
from scripts.performance_monitor import MonitoredEmbedder

monitored = MonitoredEmbedder(
    embedder=your_embedder,
    cost_per_1k_tokens=0.00002,  # OpenAI pricing
)

embeddings = monitored.embed_batch(texts)
metrics = monitored.get_metrics()
print(f"Cache hit rate: {metrics['cache_hit_rate_pct']}%")
print(f"Total cost: ${metrics['total_cost_usd']}")
```
See examples/ directory for complete implementations:
Python Examples:
- examples/openai_cached.py - OpenAI embeddings with Redis caching
- examples/local_embedder.py - sentence-transformers local embedding
- examples/smart_chunker.py - Content-aware recursive chunking
- examples/performance_monitor.py - Pipeline performance tracking
- examples/batch_processor.py - Large-scale document processing

All examples include:
Upstream (This skill provides to):
Downstream (This skill uses from):
Related Skills:
- building-ai-chat skill
- databases-vector skill
- ingesting-data skill

Pattern 1: RAG Pipeline
Document → Chunk → Embed → Store (vector DB) → Retrieve
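A condensed sketch of this pipeline as one function. chunk_fn and embed_fn are any chunker/embedder (for instance the sketches above), and store is a placeholder for your vector-DB client; the upsert signature is hypothetical:

```python
def ingest_document(text: str, chunk_fn, embed_fn, store) -> int:
    # Document -> Chunk -> Embed -> Store, one chunk at a time.
    chunks = chunk_fn(text)
    for i, chunk in enumerate(chunks):
        store.upsert(id=f"chunk-{i}", vector=embed_fn(chunk),
                     metadata={"text": chunk})
    return len(chunks)
```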
Pattern 2: Semantic Search
Query → Embed → Search (vector DB) → Rank → Display
Pattern 3: Multi-Stage Retrieval (Cost Optimization)
Query → Cheap Embedding (384d) → Initial Search →
Expensive Embedding (1,536d) → Rerank Top-K → Return
Cost Savings: 70% reduction vs. single-stage with expensive embeddings
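A sketch of the two-stage flow. coarse_index.search and rich_vectors are hypothetical stand-ins for an ANN index over the cheap embeddings and an id-to-vector lookup for the expensive ones:

```python
import numpy as np

def two_stage_search(query, cheap_embed, rich_embed,
                     coarse_index, rich_vectors, k=10):
    # Stage 1: cheap 384-dim query over the full corpus returns ~100 candidates.
    candidate_ids = coarse_index.search(cheap_embed(query), top_k=100)
    # Stage 2: rerank only those candidates with the expensive 1,536-dim embedding.
    q = np.asarray(rich_embed(query))
    scored = sorted(candidate_ids,
                    key=lambda i: -float(np.dot(q, rich_vectors[i])))
    return scored[:k]  # dot product == cosine similarity for normalized vectors
```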
Model Selection: default to text-embedding-3-small (API) or BGE-base-en-v1.5 (local); pay for text-embedding-3-large only when retrieval quality measurably demands it.
Chunking: match the strategy to the content type; 800-token chunks with 100-token overlap are a sound starting point.
Caching: enable content-addressable caching from the start; with repeated content, it yields the 80-90% cost reduction described above.
Performance: batch all embedding calls, monitor cache hit rate and cost per query, and prefer 768 dimensions unless measurements justify more.