    About

    Vector database implementation for AI/ML applications, semantic search, and RAG systems. Use when building chatbots, search engines, recommendation systems, or similarity-based retrieval.

    SKILL.md

    Vector Databases for AI Applications

    When to Use This Skill

    Use this skill when implementing:

    • RAG (Retrieval-Augmented Generation) systems for AI chatbots
    • Semantic search capabilities (meaning-based, not just keyword)
    • Recommendation systems based on similarity
    • Multi-modal AI (unified search across text, images, audio)
    • Document similarity and deduplication
    • Question answering over private knowledge bases

    Quick Decision Framework

    1. Vector Database Selection

    START: Choosing a Vector Database
    
    EXISTING INFRASTRUCTURE?
    ├─ Using PostgreSQL already?
    │  └─ pgvector (<10M vectors, tight budget)
    │      See: references/pgvector.md
    │
    └─ No existing vector database?
       │
       ├─ OPERATIONAL PREFERENCE?
       │  │
       │  ├─ Zero-ops managed only
       │  │  └─ Pinecone (fully managed, excellent DX)
       │  │      See: references/pinecone.md
       │  │
       │  └─ Flexible (self-hosted or managed)
       │     │
       │     ├─ SCALE: <100M vectors + complex filtering ⭐
       │     │  └─ Qdrant (RECOMMENDED)
       │     │      • Best metadata filtering
       │     │      • Built-in hybrid search (BM25 + Vector)
       │     │      • Self-host: Docker/K8s
       │     │      • Managed: Qdrant Cloud
       │     │      See: references/qdrant.md
       │     │
       │     ├─ SCALE: >100M vectors + GPU acceleration
       │     │  └─ Milvus / Zilliz Cloud
       │     │      See: references/milvus.md
       │     │
       │     ├─ Embedded / No server
       │     │  └─ LanceDB (serverless, edge deployment)
       │     │
       │     └─ Local prototyping
       │        └─ Chroma (simple API, in-memory)
    

    2. Embedding Model Selection

    REQUIREMENTS?
    
    ├─ Best quality (cost no object)
    │  └─ Voyage AI voyage-3 (1024d)
    │      • 9.74% better than OpenAI on MTEB
    │      • ~$0.12/1M tokens
    │      See: references/embedding-strategies.md
    │
    ├─ Enterprise reliability
    │  └─ OpenAI text-embedding-3-large (3072d)
    │      • Industry standard
    │      • ~$0.13/1M tokens
       │      • Matryoshka shortening: reduce to 256/512/1024d via the dimensions parameter
    │
    ├─ Cost-optimized
    │  └─ OpenAI text-embedding-3-small (1536d)
    │      • ~$0.02/1M tokens (6x cheaper)
    │      • 90-95% of large model performance
    │
    ├─ Multilingual (100+ languages)
    │  └─ Cohere embed-v3 (1024d)
    │      • ~$0.10/1M tokens
    │
    └─ Self-hosted / Privacy-critical
       ├─ English: nomic-embed-text-v1.5 (768d, Apache 2.0)
       ├─ Multilingual: BAAI/bge-m3 (1024d, MIT)
       └─ Long docs: jina-embeddings-v2 (768d, 8K context)
    

    Core Concepts

    Document Chunking Strategy

    Recommended defaults for most RAG systems:

    • Chunk size: 512 tokens (not characters)
    • Overlap: 50 tokens (10% overlap)

    Why these numbers?

    • 512 tokens balances context vs. precision
      • Too small (128-256): Fragments concepts, loses context
      • Too large (1024-2048): Dilutes relevance, wastes LLM tokens
    • 50 token overlap ensures sentences aren't split mid-context

    See references/chunking-patterns.md for advanced strategies by content type.
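
    As a rough illustration of the 512-token / 50-token defaults, the sketch below splits text into overlapping token windows. It assumes tiktoken's cl100k_base encoding; swap in whatever tokenizer matches your embedding model.

    import tiktoken

    def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
        """Split text into overlapping token windows (sizes are tokens, not characters)."""
        enc = tiktoken.get_encoding("cl100k_base")
        tokens = enc.encode(text)
        step = chunk_size - overlap
        return [
            enc.decode(tokens[start:start + chunk_size])
            for start in range(0, len(tokens), step)
        ]

    A semantic or recursive splitter (see the reference above) usually beats this fixed-window approach on structured documents, but the size and overlap defaults are the same.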

    Hybrid Search (Vector + Keyword)

    Hybrid Search = Vector Similarity + BM25 Keyword Matching

    User Query: "OAuth refresh token implementation"
               │
        ┌──────┴──────┐
        │             │
    Vector Search   Keyword Search
    (Semantic)      (BM25)
        │             │
    Top 20 docs   Top 20 docs
        │             │
        └──────┬──────┘
               │
       Reciprocal Rank Fusion
       (Merge + Re-rank)
               │
        Final Top 5 Results
    

    Why hybrid matters:

    • Vector captures semantic meaning ("OAuth refresh" ≈ "token renewal")
    • Keyword ensures exact matches ("refresh_token" literal)
    • Combined provides best retrieval quality

    See references/hybrid-search.md for implementation details.
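
    The fusion step in the diagram can be as simple as Reciprocal Rank Fusion over the two candidate lists. A minimal sketch, assuming each search already returns an ordered list of document IDs (k=60 is the conventional smoothing constant):

    from collections import defaultdict

    def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60, top_n: int = 5) -> list[str]:
        """Merge ranked ID lists from vector and keyword search into one ranking."""
        scores: dict[str, float] = defaultdict(float)
        for results in result_lists:
            for rank, doc_id in enumerate(results, start=1):
                scores[doc_id] += 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)[:top_n]

    # final_ids = reciprocal_rank_fusion([vector_ids, bm25_ids])

    Recent Qdrant versions can also perform this fusion server-side via the Query API, so client-side RRF is mainly needed when merging results from separate systems.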

    Getting Started

    Python + Qdrant Example

    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, VectorParams, PointStruct
    
    # 1. Initialize client
    client = QdrantClient("localhost", port=6333)
    
    # 2. Create collection
    client.create_collection(
        collection_name="documents",
        vectors_config=VectorParams(size=1024, distance=Distance.COSINE)
    )
    
    # 3. Insert documents with embeddings
    points = [
        PointStruct(
            id=idx,
            vector=embedding,  # From OpenAI/Voyage/etc
            payload={
                "text": chunk_text,
                "source": "docs/api.md",
                "section": "Authentication"
            }
        )
        for idx, (embedding, chunk_text) in enumerate(chunks)
    ]
    client.upsert(collection_name="documents", points=points)
    
    # 4. Search with metadata filtering
    results = client.search(
        collection_name="documents",
        query_vector=query_embedding,
        limit=5,
        query_filter={
            "must": [
                {"key": "section", "match": {"value": "Authentication"}}
            ]
        }
    )
    

    For complete examples, see examples/qdrant-python/.

    TypeScript + Qdrant Example

    import { QdrantClient } from '@qdrant/js-client-rest';
    
    const client = new QdrantClient({ url: 'http://localhost:6333' });
    
    // Create collection
    await client.createCollection('documents', {
      vectors: { size: 1024, distance: 'Cosine' }
    });
    
    // Insert documents
    await client.upsert('documents', {
      points: chunks.map((chunk, idx) => ({
        id: idx,
        vector: chunk.embedding,
        payload: {
          text: chunk.text,
          source: chunk.source
        }
      }))
    });
    
    // Search
    const results = await client.search('documents', {
      vector: queryEmbedding,
      limit: 5,
      filter: {
        must: [
          { key: 'source', match: { value: 'docs/api.md' } }
        ]
      }
    });
    

    For complete examples, see examples/typescript-rag/.

    RAG Pipeline Architecture

    Complete Pipeline Components

    1. INGESTION
       ├─ Document Loading (PDF, web, code, Office)
       ├─ Text Extraction & Cleaning
       ├─ Chunking (semantic, recursive, code-aware)
       └─ Embedding Generation (batch, rate-limited)
    
    2. INDEXING
       ├─ Vector Store Insertion (batch upsert)
       ├─ Index Configuration (HNSW, distance metric)
       └─ Keyword Index (BM25 for hybrid search)
    
    3. RETRIEVAL (Query Time)
       ├─ Query Processing (expansion, embedding)
       ├─ Hybrid Search (vector + keyword)
       ├─ Filtering & Post-Processing (metadata, MMR)
       └─ Re-Ranking (cross-encoder, LLM-based)
    
    4. GENERATION
       ├─ Context Construction (format chunks, citations)
       ├─ Prompt Engineering (system + context + query)
       ├─ LLM Inference (streaming, temperature tuning)
       └─ Response Post-Processing (citations, validation)
    
    5. EVALUATION (Production Critical)
       ├─ Retrieval Metrics (precision, recall, relevancy)
       ├─ Generation Metrics (faithfulness, correctness)
       └─ System Metrics (latency, cost, satisfaction)
    
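
    A compressed view of stages 3-4: retrieved chunks are formatted into a cited context block and handed to the LLM. A minimal sketch, assuming an OpenAI-compatible chat client and search hits shaped like the Qdrant example above (payloads with text and source); the model name and prompt wording are placeholders, not recommendations.

    from openai import OpenAI

    llm = OpenAI()

    def answer(query: str, hits) -> str:
        """Build a numbered, cited context block from retrieved chunks, then generate."""
        context = "\n\n".join(
            f"[{i + 1}] ({hit.payload['source']}) {hit.payload['text']}"
            for i, hit in enumerate(hits)
        )
        response = llm.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model
            temperature=0.1,
            messages=[
                {"role": "system", "content": "Answer only from the provided context. Cite sources as [n]."},
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
            ],
        )
        return response.choices[0].message.content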

    Essential Metadata for Production RAG

    Critical for filtering and relevance:

    metadata = {
        # SOURCE TRACKING
        "source": "docs/api-reference.md",
        "source_type": "documentation",  # code, docs, logs, chat
        "last_updated": "2025-12-01T12:00:00Z",
    
        # HIERARCHICAL CONTEXT
        "section": "Authentication",
        "subsection": "OAuth 2.1",
        "heading_hierarchy": ["API Reference", "Authentication", "OAuth 2.1"],
    
        # CONTENT CLASSIFICATION
        "content_type": "code_example",  # prose, code, table, list
        "programming_language": "python",
    
        # FILTERING DIMENSIONS
        "product_version": "v2.0",
        "audience": "enterprise",  # free, pro, enterprise
    
        # RETRIEVAL HINTS
        "chunk_index": 3,
        "total_chunks": 12,
        "has_code": True
    }
    

    Why metadata matters:

    • Enables filtering BEFORE vector search (reduces search space)
    • Improves relevance through targeted retrieval
    • Supports multi-tenant systems (filter by user/org)
    • Enables versioned documentation (filter by product version)

    Evaluation with RAGAS

    Use scripts/evaluate_rag.py for automated evaluation:

    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import (
        faithfulness,       # Answer grounded in context
        answer_relevancy,   # Answer addresses query
        context_recall,     # Retrieved docs cover ground truth
        context_precision   # Retrieved docs are relevant
    )
    
    # Test dataset
    test_data = {
        "question": ["How do I refresh OAuth tokens?"],
        "answer": ["Use /token with refresh_token grant..."],
        "contexts": [["OAuth refresh documentation..."]],
        "ground_truth": ["POST to /token with grant_type=refresh_token"]
    }
    
    # Evaluate (ragas expects a Hugging Face Dataset, not a plain dict)
    results = evaluate(Dataset.from_dict(test_data), metrics=[
        faithfulness,
        answer_relevancy,
        context_recall,
        context_precision
    ])
    
    # Production targets:
    # faithfulness: >0.90 (minimal hallucination)
    # answer_relevancy: >0.85 (addresses user query)
    # context_recall: >0.80 (sufficient context retrieved)
    # context_precision: >0.75 (minimal noise)
    

    Performance Optimization

    Embedding Generation

    • Batch processing: 100-500 chunks per batch
    • Caching: Cache embeddings by content hash
    • Rate limiting: Respect API provider limits (exponential backoff)
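
    A minimal sketch combining all three points, assuming the OpenAI embeddings API; the cache is an in-memory dict keyed by content hash, which you would swap for Redis or a database table in production.

    import hashlib
    import time

    from openai import OpenAI

    openai_client = OpenAI()
    _cache: dict[str, list[float]] = {}

    def _key(text: str) -> str:
        return hashlib.sha256(text.encode()).hexdigest()

    def embed_batch(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
        """Embed texts with content-hash caching and simple exponential backoff."""
        missing = [t for t in texts if _key(t) not in _cache]
        for i in range(0, len(missing), 100):            # batches of ~100 chunks
            batch = missing[i:i + 100]
            for attempt in range(5):
                try:
                    resp = openai_client.embeddings.create(model=model, input=batch)
                    for text, item in zip(batch, resp.data):
                        _cache[_key(text)] = item.embedding
                    break
                except Exception:                        # rate limit / transient error
                    time.sleep(2 ** attempt)
        return [_cache[_key(t)] for t in texts]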

    Vector Search

    • Index type: HNSW (Hierarchical Navigable Small World) for most cases
    • Distance metric: Cosine for normalized embeddings
    • Pre-filtering: Apply metadata filters before vector search
    • Result diversity: Use MMR (Maximal Marginal Relevance) to reduce redundancy
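
    These choices map directly to collection configuration. A Qdrant sketch with commonly used HNSW starting values (m and ef_construct are tunable, not prescriptions):

    from qdrant_client.models import Distance, HnswConfigDiff, VectorParams

    client.create_collection(
        collection_name="documents",
        vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
        hnsw_config=HnswConfigDiff(
            m=16,              # graph connectivity: higher = better recall, more memory
            ef_construct=100,  # build-time beam width: higher = better index, slower build
        ),
    )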

    Cost Optimization

    • Embedding model: Consider text-embedding-3-small for budget constraints
    • Dimension reduction: Use Matryoshka shortening (3072d → 1024d)
    • Caching: Implement semantic caching for repeated queries
    • Batch operations: Group insertions/updates for efficiency
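
    On the dimension-reduction point: OpenAI's text-embedding-3 models accept a dimensions parameter that returns Matryoshka-shortened vectors directly, so storage and search cost drop without a separate reduction step.

    from openai import OpenAI

    openai_client = OpenAI()

    # Request 1024-d vectors instead of the default 3072-d
    resp = openai_client.embeddings.create(
        model="text-embedding-3-large",
        input=["OAuth refresh token implementation"],
        dimensions=1024,
    )
    vector = resp.data[0].embedding  # len(vector) == 1024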

    Common Workflows

    1. Building a RAG Chatbot

    • Vector database: Qdrant (self-hosted or cloud)
    • Embeddings: OpenAI text-embedding-3-large
    • Chunking: 512 tokens, 50 overlap, semantic splitter
    • Search: Hybrid (vector + BM25)
    • Integration: Frontend with ai-chat skill

    See examples/qdrant-python/ for complete implementation.

    2. Semantic Search Engine

    • Vector database: Qdrant or Pinecone
    • Embeddings: Voyage AI voyage-3 (best quality)
    • Chunking: Content-type specific (see chunking-patterns.md)
    • Search: Hybrid with re-ranking
    • Filtering: Pre-filter by metadata (date, category, etc.)
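
    The re-ranking step above is typically a cross-encoder pass over the fused candidates. A sketch using sentence-transformers; the model name is one common public choice, not a requirement:

    from sentence_transformers import CrossEncoder

    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    def rerank(query: str, docs: list[str], top_n: int = 5) -> list[str]:
        """Score (query, doc) pairs with a cross-encoder and keep the best."""
        scores = reranker.predict([(query, doc) for doc in docs])
        ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
        return [doc for doc, _ in ranked[:top_n]]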

    3. Code Search

    • Vector database: Qdrant
    • Embeddings: OpenAI text-embedding-3-large
    • Chunking: AST-based (function/class boundaries)
    • Metadata: Language, file path, imports
    • Search: Hybrid with language filtering

    See examples/qdrant-python/ for code-specific implementation.
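
    AST-based chunking keeps each function or class intact instead of cutting at arbitrary token boundaries. A minimal Python-only sketch using the standard library; multi-language systems typically use tree-sitter instead:

    import ast

    def chunk_python_source(source: str) -> list[dict]:
        """Split Python source into one chunk per top-level function or class."""
        tree = ast.parse(source)
        return [
            {
                "text": ast.get_source_segment(source, node),
                "symbol": node.name,
                "programming_language": "python",
            }
            for node in tree.body
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
        ]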

    Integration with Other Skills

    Frontend Skills

    • ai-chat: Vector DB powers RAG pipeline behind chat interface
    • search-filter: Replace keyword search with semantic search
    • data-viz: Visualize embedding spaces, similarity scores

    Backend Skills

    • databases-relational: Hybrid approach using pgvector extension
    • api-patterns: Expose semantic search via REST/GraphQL
    • observability: Monitor embedding quality and retrieval metrics

    Multi-Language Support

    Python (Primary)

    • Client: qdrant-client
    • Framework: LangChain, LlamaIndex
    • See: examples/qdrant-python/

    Rust

    • Client: qdrant-client (1,549 code snippets in Context7)
    • Framework: Raw Rust for performance-critical systems
    • See: examples/rust-axum-vector/

    TypeScript

    • Client: @qdrant/js-client-rest
    • Framework: LangChain.js, integration with Next.js
    • See: examples/typescript-rag/

    Go

    • Client: go-client (github.com/qdrant/go-client)
    • Use case: High-performance microservices

    Troubleshooting

    Poor Retrieval Quality

    1. Check chunking strategy (too large/small?)
    2. Verify metadata filtering (too restrictive?)
    3. Try hybrid search instead of vector-only
    4. Implement re-ranking stage
    5. Evaluate with RAGAS metrics

    Slow Performance

    1. Use HNSW index (not Flat)
    2. Pre-filter with metadata before vector search
    3. Reduce vector dimensions (Matryoshka shortening)
    4. Batch operations (insertions, searches)
    5. Consider GPU acceleration (Milvus)

    High Costs

    1. Switch to text-embedding-3-small
    2. Implement semantic caching
    3. Reduce chunk overlap
    4. Use self-hosted embeddings (nomic, bge-m3)
    5. Batch embedding generation

    Qdrant Context7 Documentation

    Primary resource: /llmstxt/qdrant_tech_llms-full_txt

    • Trust score: High
    • Code snippets: 10,154
    • Quality score: 83.1

    Access via Context7:

    resolve-library-id({ libraryName: "Qdrant" })
    get-library-docs({
      context7CompatibleLibraryID: "/llmstxt/qdrant_tech_llms-full_txt",
      topic: "hybrid search collections python",
      mode: "code"
    })
    

    Additional Resources

    Reference Documentation

    • references/qdrant.md - Comprehensive Qdrant guide
    • references/pgvector.md - PostgreSQL pgvector extension
    • references/milvus.md - Milvus/Zilliz for billion-scale
    • references/embedding-strategies.md - Embedding model comparison
    • references/chunking-patterns.md - Advanced chunking techniques

    Code Examples

    • examples/qdrant-python/ - FastAPI + Qdrant RAG pipeline
    • examples/pgvector-prisma/ - PostgreSQL + Prisma integration
    • examples/typescript-rag/ - TypeScript RAG with Hono

    Automation Scripts

    • scripts/generate_embeddings.py - Batch embedding generation
    • scripts/benchmark_similarity.py - Performance benchmarking
    • scripts/evaluate_rag.py - RAGAS-based evaluation

    Next Steps:

    1. Choose vector database based on scale and infrastructure
    2. Select embedding model based on quality vs. cost trade-off
    3. Implement chunking strategy for the content type
    4. Set up hybrid search for production quality
    5. Evaluate with RAGAS metrics
    6. Optimize for performance and cost