    RAG Implementation

    Master Retrieval-Augmented Generation (RAG) to build LLM applications that provide accurate, grounded responses using external knowledge sources.

    When to Use This Skill

    • Building Q&A systems over proprietary documents
    • Creating chatbots with current, factual information
    • Implementing semantic search with natural language queries
    • Reducing hallucinations with grounded responses
    • Enabling LLMs to access domain-specific knowledge
    • Building documentation assistants
    • Creating research tools with source citation

    Core Components

    1. Vector Databases

    Purpose: Store and retrieve document embeddings efficiently

    Options:

    • Pinecone: Managed, scalable, serverless
    • Weaviate: Open-source, hybrid search, GraphQL
    • Milvus: High performance, on-premise
    • Chroma: Lightweight, easy to use, local development
    • Qdrant: Fast, filtered search, Rust-based
    • pgvector: PostgreSQL extension, SQL integration

    2. Embeddings

    Purpose: Convert text to numerical vectors for similarity search

    Models (2026):

    Model                     Dimensions   Best For
    voyage-3-large            1024         Claude apps (Anthropic recommended)
    voyage-code-3             1024         Code search
    text-embedding-3-large    3072         OpenAI apps, high accuracy
    text-embedding-3-small    1536         OpenAI apps, cost-effective
    bge-large-en-v1.5         1024         Open source, local deployment
    multilingual-e5-large     1024         Multi-language support
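
    Whichever model you pick, the LangChain embedding interface is the same. A minimal sketch (assuming a Voyage API key is set in the environment) that embeds a query and confirms the dimension from the table:

    from langchain_voyageai import VoyageAIEmbeddings

    embeddings = VoyageAIEmbeddings(model="voyage-3-large")

    # embed_query returns a plain list[float]; its length matches the Dimensions column
    vector = embeddings.embed_query("How do I configure hybrid search?")
    print(len(vector))  # 1024 for voyage-3-large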

    3. Retrieval Strategies

    Approaches:

    • Dense Retrieval: Semantic similarity via embeddings
    • Sparse Retrieval: Keyword matching (BM25, TF-IDF)
    • Hybrid Search: Combine dense + sparse with weighted fusion
    • Multi-Query: Generate multiple query variations
    • HyDE: Generate hypothetical documents for better retrieval

    4. Reranking

    Purpose: Improve retrieval quality by reordering results

    Methods:

    • Cross-Encoders: BERT-based reranking (ms-marco-MiniLM)
    • Cohere Rerank: API-based reranking
    • Maximal Marginal Relevance (MMR): Diversity + relevance
    • LLM-based: Use LLM to score relevance

    Quick Start with LangGraph

    from langgraph.graph import StateGraph, START, END
    from langchain_anthropic import ChatAnthropic
    from langchain_voyageai import VoyageAIEmbeddings
    from langchain_pinecone import PineconeVectorStore
    from langchain_core.documents import Document
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_text_splitters import RecursiveCharacterTextSplitter
    from typing import TypedDict
    
    class RAGState(TypedDict):
        question: str
        context: list[Document]
        answer: str
    
    # Initialize components
    llm = ChatAnthropic(model="claude-sonnet-4-6")
    embeddings = VoyageAIEmbeddings(model="voyage-3-large")
    vectorstore = PineconeVectorStore(index_name="docs", embedding=embeddings)
    retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
    
    # RAG prompt
    rag_prompt = ChatPromptTemplate.from_template(
        """Answer based on the context below. If you cannot answer, say so.
    
        Context:
        {context}
    
        Question: {question}
    
        Answer:"""
    )
    
    async def retrieve(state: RAGState) -> RAGState:
        """Retrieve relevant documents."""
        docs = await retriever.ainvoke(state["question"])
        return {"context": docs}
    
    async def generate(state: RAGState) -> RAGState:
        """Generate answer from context."""
        context_text = "\n\n".join(doc.page_content for doc in state["context"])
        messages = rag_prompt.format_messages(
            context=context_text,
            question=state["question"]
        )
        response = await llm.ainvoke(messages)
        return {"answer": response.content}
    
    # Build RAG graph
    builder = StateGraph(RAGState)
    builder.add_node("retrieve", retrieve)
    builder.add_node("generate", generate)
    builder.add_edge(START, "retrieve")
    builder.add_edge("retrieve", "generate")
    builder.add_edge("generate", END)
    
    rag_chain = builder.compile()
    
    # Use
    result = await rag_chain.ainvoke({"question": "What are the main features?"})
    print(result["answer"])
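
    The graph above assumes the "docs" index is already populated. A minimal ingestion sketch, where raw_docs stands in for the output of whatever document loader you use:

    # Split into overlapping chunks, then embed and upsert into the vector store
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = splitter.split_documents(raw_docs)  # raw_docs: list[Document] (placeholder)
    await vectorstore.aadd_documents(chunks)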
    

    Advanced RAG Patterns

    Pattern 1: Hybrid Search with RRF

    from langchain_community.retrievers import BM25Retriever
    from langchain.retrievers import EnsembleRetriever
    
    # Sparse retriever (BM25 for keyword matching)
    bm25_retriever = BM25Retriever.from_documents(documents)
    bm25_retriever.k = 10
    
    # Dense retriever (embeddings for semantic search)
    dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
    
    # Combine with Reciprocal Rank Fusion weights
    ensemble_retriever = EnsembleRetriever(
        retrievers=[bm25_retriever, dense_retriever],
        weights=[0.3, 0.7]  # 30% keyword, 70% semantic
    )
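
    The ensemble exposes the standard retriever interface, so it drops into the RAG graph unchanged; treat the weights as a starting point to tune against your own evaluation set:

    results = await ensemble_retriever.ainvoke("What are the deployment requirements?")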
    

    Pattern 2: Multi-Query Retrieval

    from langchain.retrievers.multi_query import MultiQueryRetriever
    
    # Generate multiple query perspectives for better recall
    multi_query_retriever = MultiQueryRetriever.from_llm(
        retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
        llm=llm
    )
    
    # Single query → multiple variations → combined results
    results = await multi_query_retriever.ainvoke("What is the main topic?")
    

    Pattern 3: Contextual Compression

    from langchain.retrievers import ContextualCompressionRetriever
    from langchain.retrievers.document_compressors import LLMChainExtractor
    
    # Compressor extracts only relevant portions
    compressor = LLMChainExtractor.from_llm(llm)
    
    compression_retriever = ContextualCompressionRetriever(
        base_compressor=compressor,
        base_retriever=vectorstore.as_retriever(search_kwargs={"k": 10})
    )
    
    # Returns only relevant parts of documents
    compressed_docs = await compression_retriever.ainvoke("specific query")
    

    Pattern 4: Parent Document Retriever

    from langchain.retrievers import ParentDocumentRetriever
    from langchain.storage import InMemoryStore
    from langchain_text_splitters import RecursiveCharacterTextSplitter
    
    # Small chunks for precise retrieval, large chunks for context
    child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)
    parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
    
    # Store for parent documents
    docstore = InMemoryStore()
    
    parent_retriever = ParentDocumentRetriever(
        vectorstore=vectorstore,
        docstore=docstore,
        child_splitter=child_splitter,
        parent_splitter=parent_splitter
    )
    
    # Add documents (splits children, stores parents)
    await parent_retriever.aadd_documents(documents)
    
    # Retrieval returns parent documents with full context
    results = await parent_retriever.ainvoke("query")
    

    Pattern 5: HyDE (Hypothetical Document Embeddings)

    from langchain_core.prompts import ChatPromptTemplate
    
    class HyDEState(TypedDict):
        question: str
        hypothetical_doc: str
        context: list[Document]
        answer: str
    
    hyde_prompt = ChatPromptTemplate.from_template(
        """Write a detailed passage that would answer this question:
    
        Question: {question}
    
        Passage:"""
    )
    
    async def generate_hypothetical(state: HyDEState) -> HyDEState:
        """Generate hypothetical document for better retrieval."""
        messages = hyde_prompt.format_messages(question=state["question"])
        response = await llm.ainvoke(messages)
        return {"hypothetical_doc": response.content}
    
    async def retrieve_with_hyde(state: HyDEState) -> HyDEState:
        """Retrieve using hypothetical document."""
        # Use hypothetical doc for retrieval instead of original query
        docs = await retriever.ainvoke(state["hypothetical_doc"])
        return {"context": docs}
    
    # Build HyDE RAG graph
    builder = StateGraph(HyDEState)
    builder.add_node("hypothetical", generate_hypothetical)
    builder.add_node("retrieve", retrieve_with_hyde)
    builder.add_node("generate", generate)
    builder.add_edge(START, "hypothetical")
    builder.add_edge("hypothetical", "retrieve")
    builder.add_edge("retrieve", "generate")
    builder.add_edge("generate", END)
    
    hyde_rag = builder.compile()
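
    Usage mirrors the basic graph; note that the generate node is reused from the Quick Start example:

    result = await hyde_rag.ainvoke({"question": "How does semantic search work?"})
    print(result["answer"])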
    

    Document Chunking Strategies

    Recursive Character Text Splitter

    from langchain_text_splitters import RecursiveCharacterTextSplitter
    
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len,
        separators=["\n\n", "\n", ". ", " ", ""]  # Try in order
    )
    
    chunks = splitter.split_documents(documents)
    

    Token-Based Splitting

    from langchain_text_splitters import TokenTextSplitter
    
    splitter = TokenTextSplitter(
        chunk_size=512,
        chunk_overlap=50,
        encoding_name="cl100k_base"  # OpenAI tiktoken encoding
    )
    

    Semantic Chunking

    from langchain_experimental.text_splitter import SemanticChunker
    
    splitter = SemanticChunker(
        embeddings=embeddings,
        breakpoint_threshold_type="percentile",
        breakpoint_threshold_amount=95
    )
    

    Markdown Header Splitter

    from langchain_text_splitters import MarkdownHeaderTextSplitter
    
    headers_to_split_on = [
        ("#", "Header 1"),
        ("##", "Header 2"),
        ("###", "Header 3"),
    ]
    
    splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=headers_to_split_on,
        strip_headers=False
    )
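
    Unlike the other splitters, MarkdownHeaderTextSplitter operates on a raw markdown string and records the matched headers as metadata (markdown_text is a placeholder here):

    md_chunks = splitter.split_text(markdown_text)
    print(md_chunks[0].metadata)  # e.g. {"Header 1": "...", "Header 2": "..."}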
    

    Vector Store Configurations

    Pinecone (Serverless)

    import os

    from pinecone import Pinecone, ServerlessSpec
    from langchain_pinecone import PineconeVectorStore

    # Initialize Pinecone client
    pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
    
    # Create index if needed
    if "my-index" not in pc.list_indexes().names():
        pc.create_index(
            name="my-index",
            dimension=1024,  # voyage-3-large dimensions
            metric="cosine",
            spec=ServerlessSpec(cloud="aws", region="us-east-1")
        )
    
    # Create vector store
    index = pc.Index("my-index")
    vectorstore = PineconeVectorStore(index=index, embedding=embeddings)
    

    Weaviate

    import weaviate
    from langchain_weaviate import WeaviateVectorStore
    
    client = weaviate.connect_to_local()  # or connect_to_weaviate_cloud()
    
    vectorstore = WeaviateVectorStore(
        client=client,
        index_name="Documents",
        text_key="content",
        embedding=embeddings
    )
    

    Chroma (Local Development)

    from langchain_chroma import Chroma
    
    vectorstore = Chroma(
        collection_name="my_collection",
        embedding_function=embeddings,
        persist_directory="./chroma_db"
    )
    

    pgvector (PostgreSQL)

    from langchain_postgres.vectorstores import PGVector
    
    connection_string = "postgresql+psycopg://user:pass@localhost:5432/vectordb"
    
    vectorstore = PGVector(
        embeddings=embeddings,
        collection_name="documents",
        connection=connection_string,
    )
    

    Retrieval Optimization

    1. Metadata Filtering

    from datetime import datetime

    from langchain_core.documents import Document

    # Add metadata during indexing
    docs_with_metadata = []
    for doc in documents:
        doc.metadata.update({
            "source": doc.metadata.get("source", "unknown"),
            "category": determine_category(doc.page_content),  # determine_category: your own classifier
            "date": datetime.now().isoformat()
        })
        docs_with_metadata.append(doc)
    
    # Filter during retrieval
    results = await vectorstore.asimilarity_search(
        "query",
        filter={"category": "technical"},
        k=5
    )
    

    2. Maximal Marginal Relevance (MMR)

    # Balance relevance with diversity
    results = await vectorstore.amax_marginal_relevance_search(
        "query",
        k=5,
        fetch_k=20,  # Fetch 20, return top 5 diverse
        lambda_mult=0.5  # 0=max diversity, 1=max relevance
    )
    

    3. Reranking with Cross-Encoder

    from sentence_transformers import CrossEncoder
    
    reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
    
    async def retrieve_and_rerank(query: str, k: int = 5) -> list[Document]:
        # Get initial results
        candidates = await vectorstore.asimilarity_search(query, k=20)
    
        # Rerank
        pairs = [[query, doc.page_content] for doc in candidates]
        scores = reranker.predict(pairs)
    
        # Sort by score and take top k
        ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
        return [doc for doc, score in ranked[:k]]
    

    4. Cohere Rerank

    from langchain.retrievers import ContextualCompressionRetriever
    from langchain_cohere import CohereRerank
    
    reranker = CohereRerank(model="rerank-english-v3.0", top_n=5)
    
    # Wrap retriever with reranking
    reranked_retriever = ContextualCompressionRetriever(
        base_compressor=reranker,
        base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20})
    )
    

    Prompt Engineering for RAG

    Contextual Prompt with Citations

    rag_prompt = ChatPromptTemplate.from_template(
        """Answer the question based on the context below. Include citations using [1], [2], etc.
    
        If you cannot answer based on the context, say "I don't have enough information."
    
        Context:
        {context}
    
        Question: {question}
    
        Instructions:
        1. Use only information from the context
        2. Cite sources with [1], [2] format
        3. If uncertain, express uncertainty
    
        Answer (with citations):"""
    )
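
    The prompt asks for [1], [2] citations, so the context string must be numbered to match. A minimal formatting helper (plain Python, no assumed APIs):

    def format_context_with_citations(docs: list[Document]) -> str:
        """Prefix each chunk with [n] so the model can cite it."""
        return "\n\n".join(
            f"[{i}] {doc.page_content}" for i, doc in enumerate(docs, start=1)
        )

    # docs and question come from your retrieval step
    messages = rag_prompt.format_messages(
        context=format_context_with_citations(docs),
        question=question,
    )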
    

    Structured Output for RAG

    from pydantic import BaseModel, Field
    
    class RAGResponse(BaseModel):
        answer: str = Field(description="The answer based on context")
        confidence: float = Field(description="Confidence score 0-1")
        sources: list[str] = Field(description="Source document IDs used")
        reasoning: str = Field(description="Brief reasoning for the answer")
    
    # Use with structured output
    structured_llm = llm.with_structured_output(RAGResponse)
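
    A usage sketch: the structured runnable returns the parsed Pydantic object directly rather than a raw message:

    response = await structured_llm.ainvoke(
        rag_prompt.format_messages(context=context_text, question=question)
    )
    print(response.answer, response.confidence)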
    

    Evaluation Metrics

    from typing import TypedDict
    
    class RAGEvalMetrics(TypedDict):
        retrieval_precision: float  # Relevant docs / retrieved docs
        retrieval_recall: float     # Retrieved relevant / total relevant
        answer_relevance: float     # Answer addresses question
        faithfulness: float         # Answer grounded in context
        context_relevance: float    # Context relevant to question
    
    async def evaluate_rag_system(
        rag_chain,
        test_cases: list[dict]
    ) -> RAGEvalMetrics:
        """Evaluate RAG system on test cases."""
        metrics = {k: [] for k in RAGEvalMetrics.__annotations__}
    
        for test in test_cases:
            result = await rag_chain.ainvoke({"question": test["question"]})
    
            # Retrieval metrics
            retrieved_ids = {doc.metadata["id"] for doc in result["context"]}  # assumes an "id" metadata field
            relevant_ids = set(test["relevant_doc_ids"])

            overlap = len(retrieved_ids & relevant_ids)
            precision = overlap / len(retrieved_ids) if retrieved_ids else 0.0
            recall = overlap / len(relevant_ids) if relevant_ids else 0.0
    
            metrics["retrieval_precision"].append(precision)
            metrics["retrieval_recall"].append(recall)
    
            # Use LLM-as-judge for quality metrics
            quality = await evaluate_answer_quality(
                question=test["question"],
                answer=result["answer"],
                context=result["context"],
                expected=test.get("expected_answer")
            )
            metrics["answer_relevance"].append(quality["relevance"])
            metrics["faithfulness"].append(quality["faithfulness"])
            metrics["context_relevance"].append(quality["context_relevance"])
    
        return {k: sum(v) / len(v) for k, v in metrics.items()}
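
    The evaluate_answer_quality helper above is left undefined; a minimal LLM-as-judge sketch (the judge prompt and 0-1 scales here are assumptions, not a standard):

    from pydantic import BaseModel, Field

    class QualityScores(BaseModel):
        relevance: float = Field(description="Answer addresses the question, 0-1")
        faithfulness: float = Field(description="Answer grounded in the context, 0-1")
        context_relevance: float = Field(description="Context relevant to the question, 0-1")

    judge_prompt = ChatPromptTemplate.from_template(
        """Score the answer on relevance, faithfulness, and context relevance (0-1).

        Question: {question}
        Context: {context}
        Answer: {answer}
        Expected answer (may be empty): {expected}"""
    )

    judge_llm = llm.with_structured_output(QualityScores)

    async def evaluate_answer_quality(question, answer, context, expected=None) -> dict:
        """LLM-as-judge: score one question/answer/context triple."""
        context_text = "\n\n".join(doc.page_content for doc in context)
        messages = judge_prompt.format_messages(
            question=question, answer=answer,
            context=context_text, expected=expected or ""
        )
        scores = await judge_llm.ainvoke(messages)
        return scores.model_dump()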
    