Production-ready patterns for building LLM applications, inspired by Dify and industry best practices.
Use this skill when:
- Building RAG pipelines (chunking, embeddings, vector search, grounded generation)
- Designing agent architectures (ReAct, function calling, plan-and-execute, multi-agent teams)
- Managing prompts (templates, versioning/A-B testing, chaining)
- Operating LLM apps in production (metrics, tracing, evaluation, caching, retries, fallbacks)
RAG (Retrieval-Augmented Generation) grounds LLM responses in your data.
```mermaid
graph LR
    A[Ingest Documents] --> B[Retrieve Context]
    B --> C[Generate Response]
    A --> D[Chunking/Embedding]
    B --> E[Vector Search]
    C --> F[LLM + Context]
```
Strategies include Fixed-size, Semantic, Recursive, and Document-aware splitting. 👉 View Code Example: Chunking Strategies
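A minimal, dependency-free sketch of two of these strategies (function names like `chunk_fixed` are illustrative, not from any library):

```python
def chunk_fixed(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlap."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks


def chunk_recursive(text: str, size: int = 200,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ")) -> list[str]:
    """Recursively split on the coarsest separator until chunks fit the size."""
    if len(text) <= size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    out = []
    for part in (p for p in text.split(sep) if p):
        out.extend(chunk_recursive(part, size, rest) if len(part) > size else [part])
    return out
```

Fixed-size is the cheapest but can cut sentences mid-thought; the overlap preserves context across chunk boundaries. Recursive splitting respects document structure (paragraphs before sentences) at slightly higher cost.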
Selecting the right Vector DB (Pinecone, Weaviate, Chroma, Pgvector) and Embedding Model. 👉 View Code Example: Vector DB & Embeddings
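The retrieval core of any of these stores reduces to embed-then-rank. A toy in-memory version (the bag-of-words `embed` is a stand-in for a real embedding model; a production system would call Pinecone, Weaviate, Chroma, or Pgvector instead of brute-force search):

```python
import math
from collections import Counter

def embed(text: str) -> dict[str, float]:
    """Toy bag-of-words 'embedding' -- stand-in for a real embedding model."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(v * v for v in counts.values())) or 1.0
    return {w: v / norm for w, v in counts.items()}

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    return sum(a[w] * b.get(w, 0.0) for w in a)

class VectorStore:
    """In-memory store with brute-force cosine similarity search."""
    def __init__(self) -> None:
        self.docs: list[tuple[str, dict[str, float]]] = []

    def add(self, text: str) -> None:
        self.docs.append((text, embed(text)))

    def search(self, query: str, k: int = 3) -> list[str]:
        q = embed(query)
        ranked = sorted(self.docs, key=lambda d: cosine(q, d[1]), reverse=True)
        return [text for text, _ in ranked[:k]]
```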
Prompting the LLM with retrieved context and handling citations. 👉 View Code Example: RAG Generation
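A sketch of the prompt-assembly and citation-parsing halves of that step (the prompt wording and `[n]` citation convention are one common choice, not a standard):

```python
import re

def build_rag_prompt(question: str, passages: list[str]) -> str:
    """Assemble a grounded prompt with numbered, citable context passages."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the context below. Cite sources as [n].\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

def extract_citations(answer: str) -> list[int]:
    """Pull [n] citation markers out of a model answer for display or validation."""
    return sorted({int(m) for m in re.findall(r"\[(\d+)\]", answer)})
```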
The agent interleaves thought, action, and observation steps to solve reasoning tasks. 👉 View Code Example: ReAct Agent
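The loop can be sketched in a few lines; here `llm` is any callable that continues the transcript, and the `Action: name[arg]` / `Final Answer:` markers follow the common ReAct convention:

```python
import re

def react_loop(question, llm, tools, max_steps=5):
    """Interleave Thought/Action/Observation until the model emits a final answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)            # model continues the transcript
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[1].strip()
        match = re.search(r"Action: (\w+)\[(.*?)\]", step)
        if match:
            name, arg = match.groups()
            obs = tools[name](arg)        # run the tool, feed the result back
            transcript += f"Observation: {obs}\n"
    return None                           # gave up after max_steps
```

The `max_steps` cap matters in production: without it, a model that never emits a final answer loops (and bills) forever.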
Using structured tool definitions (JSON schema) natively supported by LLMs (OpenAI, Anthropic). 👉 View Code Example: Function Calling
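A sketch of the two application-side pieces: a JSON-schema tool definition (shaped like the OpenAI/Anthropic tool format) and a dispatcher for the tool call the model emits back:

```python
import json

# Tool definition sent to the model alongside the conversation.
WEATHER_TOOL = {
    "name": "get_weather",
    "description": "Get current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def dispatch(tool_call: dict, registry: dict) -> str:
    """Execute a model-emitted call like {'name': ..., 'arguments': '{"city": "Paris"}'}."""
    fn = registry[tool_call["name"]]
    args = json.loads(tool_call["arguments"])  # providers return arguments as JSON text
    return fn(**args)
```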
Separating planning (high-level steps) from execution (doing the work) to handle complex, long-horizon tasks. 👉 View Code Example: Plan-and-Execute
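The separation is just two callables with a shared results list; in practice both `planner` and `executor` would be LLM calls:

```python
def plan_and_execute(goal, planner, executor):
    """Planner produces high-level steps; executor completes each, seeing prior results."""
    steps = planner(goal)                        # e.g. ["gather facts", "write summary"]
    results = []
    for step in steps:
        results.append(executor(step, results))  # executor can read earlier outputs
    return results[-1] if results else None
```

The payoff is that the planner reasons once over the whole horizon while each execution call stays small and focused, which tends to beat one giant prompt on long tasks.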
Specialized agents (Researcher, Writer, Critic) working together with a coordinator. 👉 View Code Example: Multi-Agent Team
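The coordination loop in skeleton form (agents and coordinator are stubbed as plain callables here; in a real team each would be an LLM with its own role prompt):

```python
def run_team(task, agents, coordinator, max_rounds=5):
    """Coordinator picks the next agent each round until it returns None."""
    history = [("user", task)]
    for _ in range(max_rounds):
        role = coordinator(history)        # decide who speaks next
        if role is None:
            break                          # coordinator declares the task done
        reply = agents[role](history)
        history.append((role, reply))
    return history
```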
Managing dynamic prompts with validation and few-shot examples. 👉 View Code Example: Prompt Templates
Tracking prompt versions, running A/B tests, and recording outcomes. 👉 View Code Example: Prompt Registry
Sequencing multiple prompts where the output of one becomes the input of the next (e.g., Research -> Analyze -> Summarize). 👉 View Code Example: Prompt Chaining
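The chain itself is a fold over prompt templates; here each template receives the previous step's output as `{input}` (a convention chosen for this sketch):

```python
def run_chain(llm, prompts: list[str], initial_input: str) -> str:
    """Pipe each step's output into the next prompt's {input} slot."""
    current = initial_input
    for prompt in prompts:
        current = llm(prompt.format(input=current))
    return current
```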
Key metrics include Latency (p50/p99), Quality (satisfaction, hallucination), Cost, and Reliability. 👉 View Code Example: Metrics Dictionary
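A small collector showing how those metrics fit together, using a nearest-rank percentile for p50/p99 (the class and field names are illustrative):

```python
import math

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile, p in [0, 100]."""
    ranked = sorted(values)
    idx = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[idx]

class LLMMetrics:
    """Collect latency, cost, and reliability signals per request."""

    def __init__(self) -> None:
        self.latencies: list[float] = []
        self.cost_usd = 0.0
        self.errors = 0
        self.total = 0

    def record(self, latency_s: float, cost_usd: float, ok: bool = True) -> None:
        self.total += 1
        self.latencies.append(latency_s)
        self.cost_usd += cost_usd
        self.errors += 0 if ok else 1

    def summary(self) -> dict:
        return {
            "p50_s": percentile(self.latencies, 50),
            "p99_s": percentile(self.latencies, 99),
            "cost_usd": round(self.cost_usd, 4),
            "error_rate": self.errors / self.total,
        }
```

Tracking p99 alongside p50 matters for LLM calls specifically: a handful of slow generations can dominate user experience while leaving the median untouched.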
Structured logging of requests/responses and distributed tracing (OpenTelemetry) to visualize chains. 👉 View Code Example: Logging & Tracing
Systematically scoring responses for Relevance, Coherence, Groundedness, and Accuracy. 👉 View Code Example: Custom Evaluator
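A deliberately crude lexical evaluator to show the scoring shape; production systems typically replace these overlap heuristics with an LLM judge scoring the same dimensions:

```python
def evaluate_response(question: str, answer: str, context: str) -> dict[str, float]:
    """Score relevance (answer vs. question) and groundedness (answer vs. context)."""
    q_words = set(question.lower().split())
    a_words = set(answer.lower().split())
    c_words = set(context.lower().split())
    relevance = len(q_words & a_words) / len(q_words) if q_words else 0.0
    groundedness = len(a_words & c_words) / len(a_words) if a_words else 0.0
    return {"relevance": round(relevance, 2), "groundedness": round(groundedness, 2)}
```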
Semantic or exact caching (Redis) to reduce costs and latency for repeated queries. 👉 View Code Example: LLM Cache
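A sketch of the exact-match variant with a TTL, using a plain dict where Redis would sit in production (semantic caching additionally matches near-duplicate prompts via embeddings):

```python
import hashlib
import time

class ExactCache:
    """Exact-match LLM cache keyed on (model, prompt) with a TTL."""

    def __init__(self, ttl_s: float = 3600):
        self.ttl_s = ttl_s
        self.store: dict[str, tuple[float, str]] = {}
        self.hits = self.misses = 0

    def _key(self, model: str, prompt: str) -> str:
        return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

    def get_or_call(self, model: str, prompt: str, llm) -> str:
        key = self._key(model, prompt)
        entry = self.store.get(key)
        if entry and time.time() - entry[0] < self.ttl_s:
            self.hits += 1
            return entry[1]               # cached response: zero cost, ~zero latency
        self.misses += 1
        response = llm(prompt)
        self.store[key] = (time.time(), response)
        return response
```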
Handling API limits and transient errors with exponential backoff strategies. 👉 View Code Example: Rate Limiter & Retry
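A minimal backoff wrapper (the injectable `sleep` is just for testability; jitter is standard practice to avoid synchronized retries across clients):

```python
import random
import time

def call_with_retry(fn, max_attempts=5, base_delay=0.5, max_delay=30.0,
                    sleep=time.sleep):
    """Retry fn() with capped exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                      # out of attempts: surface the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids thundering herd
```

In a real client you would narrow the `except` to rate-limit and transient error types so that, e.g., an invalid-request error fails immediately instead of being retried.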
Automatically switching to cheaper/faster or more capable models when the primary model fails. 👉 View Code Example: Model Fallback
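The fallback chain in skeleton form; the ordered `models` list encodes the policy (capable-first for quality fallback, cheap-first for cost routing):

```python
def call_with_fallback(prompt, models, clients):
    """Try models in order; return (model, response) from the first success."""
    last_err = None
    for model in models:
        try:
            return model, clients[model](prompt)
        except Exception as exc:
            last_err = exc                 # remember why, try the next model
    raise RuntimeError(f"all models failed: {last_err}")
```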
| Pattern | Use When | Complexity | Cost |
|---|---|---|---|
| Simple RAG | FAQ, docs search | Low | Low |
| Hybrid RAG | Mixed queries | Medium | Medium |
| ReAct Agent | Multi-step tasks | Medium | Medium |
| Function Calling | Structured tools | Low | Low |
| Plan-Execute | Complex tasks | High | High |
| Multi-Agent | Research tasks | Very High | Very High |