NVIDIA NeMo is a framework for building and training conversational AI models.
Comprehensive assistance with NVIDIA NeMo development, the enterprise AI platform for building, customizing, and deploying generative AI agents at scale.
This skill should be triggered when:
Core NeMo Components:
NVIDIA Nemotron Models:
Use Cases:
### Basic Embedding Generation

```python
import os

import requests

# API key for the hosted endpoint (assumes NVIDIA_API_KEY is set in the environment)
api_key = os.environ["NVIDIA_API_KEY"]

# NeMo Retriever Embedding NIM
url = "https://integrate.api.nvidia.com/v1/embeddings"
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {api_key}"
}
payload = {
    "input": ["What is retrieval-augmented generation?"],
    "model": "nvidia/nv-embedqa-e5-v5",
    "input_type": "query"
}

response = requests.post(url, json=payload, headers=headers)
embeddings = response.json()["data"][0]["embedding"]
```
### Reranking Results

```python
# NeMo Retriever Reranking NIM
url = "https://integrate.api.nvidia.com/v1/ranking"
payload = {
    "model": "nvidia/nv-rerankqa-mistral-4b-v3",
    "query": {"text": "What is machine learning?"},
    "passages": [
        {"text": "Machine learning is a subset of AI..."},
        {"text": "Python is a programming language..."},
        {"text": "ML models learn patterns from data..."}
    ]
}

response = requests.post(url, json=payload, headers=headers)
ranked_results = response.json()["rankings"]
```
### Submit Fine-Tuning Job

```python
# Fine-tune with LoRA
payload = {
    "name": "custom-model-lora",
    "model": "meta/llama-3.1-8b-instruct",
    "method": "lora",
    "dataset": "s3://my-bucket/training-data.jsonl",
    "hyperparameters": {
        "learning_rate": 1e-4,
        "batch_size": 8,
        "epochs": 3,
        "lora_rank": 8
    }
}

response = requests.post(
    "http://nemo-customizer:8000/v1/customization/jobs",
    json=payload
)
job_id = response.json()["id"]
```
### Check Job Status

```python
# Monitor customization progress
status_response = requests.get(
    f"http://nemo-customizer:8000/v1/customization/jobs/{job_id}"
)
job = status_response.json()
print(f"Status: {job['status']}")
print(f"Progress: {job['progress']}%")
```
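For long-running jobs, a small polling loop is usually more convenient than one-off status checks. The sketch below is an illustration, not part of the Customizer API: the `fetch_status` callable (which would wrap the `requests.get` call above) and the terminal status names `"completed"`/`"failed"` are assumptions.

```python
import time

def wait_for_job(fetch_status, interval=1.0, timeout=600.0):
    """Poll fetch_status() until the job reaches a terminal state.

    fetch_status is any callable returning a dict such as
    {"status": "running", "progress": 40}; "completed" and "failed"
    are assumed to be the terminal states.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = fetch_status()
        if job["status"] in ("completed", "failed"):
            return job
        time.sleep(interval)
    raise TimeoutError("customization job did not finish in time")

# Usage with a stubbed status sequence (stands in for the HTTP call):
states = iter([
    {"status": "running", "progress": 50},
    {"status": "completed", "progress": 100},
])
result = wait_for_job(lambda: next(states), interval=0.0)
print(result["status"])  # completed
```

In production the lambda would wrap the status request shown above; injecting the fetcher as a callable keeps the loop testable without a live service.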
### Initialize Guardrails

```python
from nemoguardrails import RailsConfig, LLMRails

# Load configuration
config = RailsConfig.from_path("./config")
rails = LLMRails(config)

# Apply guardrails
response = rails.generate(
    messages=[{"role": "user", "content": "Tell me about..."}]
)
```
### YAML Configuration for Topic Control

```yaml
# config.yml
models:
  - type: main
    engine: nvidia_ai_endpoints
    model: meta/llama-3.1-70b-instruct

rails:
  input:
    flows:
      - check jailbreak
      - check topic relevance
  output:
    flows:
      - check hallucination
      - check safety
```
### Custom Rail Definition

```colang
# Custom topic control
define user ask about competitors
  "Tell me about competing products"
  "What do you think of [competitor]"

define bot refuse competitors
  "I can only discuss our own products and services."

define flow
  user ask about competitors
  bot refuse competitors
  stop
```
### Text Processing Pipeline

```python
from nemo_curator import ScoreFilter, DedupFilter
from nemo_curator.datasets import DocumentDataset

# Load dataset
dataset = DocumentDataset.read_json("data.jsonl")

# Quality filtering
quality_filter = ScoreFilter(
    score_field="quality_score",
    score_threshold=0.7
)
dataset = quality_filter(dataset)

# Deduplication
dedup_filter = DedupFilter()
dataset = dedup_filter(dataset)

# Save processed data
dataset.to_json("processed_data.jsonl")
```
### Synthetic Data Generation

```python
from nemo_curator.synthetic import PromptTemplate, generate_data

# Define prompt template
template = PromptTemplate(
    system="You are a helpful assistant.",
    user_template="Generate a question about {topic}"
)

# Generate synthetic data
synthetic_data = generate_data(
    template=template,
    topics=["machine learning", "data science"],
    model="nvidia/llama-3.1-nemotron-70b-instruct",
    num_samples=100
)
```
### Academic Benchmark Evaluation

```python
from nemo_evaluator import Evaluator

evaluator = Evaluator()

# Run MMLU benchmark
results = evaluator.evaluate(
    model="meta/llama-3.1-8b-instruct",
    tasks=["mmlu"],
    batch_size=8
)
print(f"MMLU Score: {results['mmlu']['acc']}")
```
### RAG Pipeline Evaluation

```python
# Evaluate RAG with custom metrics
rag_results = evaluator.evaluate_rag(
    model="custom-rag-pipeline",
    metrics=["faithfulness", "answer_relevance", "context_precision"],
    dataset="custom_qa_dataset.jsonl"
)
```
### Define Agent with Tools

```yaml
# agent_config.yaml
agents:
  - name: customer_support_agent
    model: nvidia/llama-3.1-nemotron-70b-instruct
    tools:
      - web_search
      - knowledge_base_query
      - ticket_creation
    max_iterations: 5
```
### Tool Registration

```python
from nemo_agent_toolkit import Agent, Tool

# Define custom tool
@Tool(
    name="database_query",
    description="Query customer database for information"
)
def query_database(customer_id: str) -> dict:
    # Tool implementation
    return {"name": "John Doe", "status": "Premium"}

# Create agent
agent = Agent.from_config("agent_config.yaml")
agent.register_tool(query_database)

# Run agent
response = agent.run("What is the status of customer ID 12345?")
```
### Deploy Custom Model as NIM

```bash
# Pull NIM container
docker pull nvcr.io/nim/meta/llama-3.1-8b-instruct:latest

# Run NIM with custom LoRA
docker run -d \
  --gpus all \
  -p 8000:8000 \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e PEFT_MODEL_PATH=/models/custom-lora \
  -v ./models:/models \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
```
### Query NIM Endpoint

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-used"
)

response = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",
    messages=[
        {"role": "user", "content": "Explain quantum computing"}
    ],
    temperature=0.7,
    max_tokens=500
)
```
NeMo Suite Components:
NVIDIA Nemotron Models:
1. Document Extraction (NeMo Retriever)
2. Embedding (Nemotron Embedding Models)
3. Vector Storage (cuVS)
4. Reranking (Nemotron Reranking Models)
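The four retrieval stages above can be sketched end to end. This is a toy illustration only: NumPy cosine similarity stands in for cuVS and the hosted Nemotron models, and the `embed` function, vocabulary, and passages are all assumptions, not the NeMo Retriever API.

```python
import numpy as np

# Toy stand-in for an embedding model (a real pipeline would call the
# NeMo Retriever Embedding NIM shown earlier).
VOCAB = {"machine": 0, "learning": 1, "python": 2, "data": 3}

def embed(text: str) -> np.ndarray:
    vec = np.zeros(len(VOCAB))
    for word in text.lower().split():
        if word in VOCAB:
            vec[VOCAB[word]] += 1.0
    return vec

passages = [
    "Machine learning is a subset of AI",
    "Python is a programming language",
    "ML models learn patterns from data",
]

# Stages 2-3: embed the passages and index them (cuVS would hold these on GPU)
index = np.stack([embed(p) for p in passages])

# Stage 4: score by cosine similarity against the query and rank
query = embed("What is machine learning?")
norms = np.linalg.norm(index, axis=1) * np.linalg.norm(query) + 1e-9
scores = index @ query / norms
ranking = np.argsort(scores)[::-1]
print(passages[ranking[0]])  # the ML passage scores highest
```

In production the reranking stage would be a separate cross-encoder call (as in the reranking NIM example above) rather than reusing the embedding scores.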
- LoRA (Low-Rank Adaptation)
- SFT (Supervised Fine-Tuning)
- DPO (Direct Preference Optimization)
- GRPO (Group Relative Policy Optimization)
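LoRA, the first method above, freezes the base weight W and learns a low-rank update, giving an effective weight of W + (α/r)·B·A with B ∈ R^(d×r) and A ∈ R^(r×k). A minimal NumPy sketch of that arithmetic (the dimensions, scaling, and initialization here are illustrative, not tied to any NeMo API):

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, r = 64, 64, 8           # layer dims and LoRA rank (rank 8, as in the payload above)
alpha = 16.0                  # LoRA scaling factor (illustrative value)

W = rng.normal(size=(d, k))          # frozen base weight
A = rng.normal(size=(r, k)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-initialized

def adapted(W, A, B):
    # Effective weight: base plus scaled low-rank update
    return W + (alpha / r) * (B @ A)

# With B zero-initialized, the adapter starts as a no-op
assert np.allclose(adapted(W, A, B), W)

# Only r * (d + k) parameters are trained instead of d * k
print(r * (d + k), "trainable vs", d * k, "full")  # 1024 trainable vs 4096 full
```

This is why LoRA checkpoints are small enough to mount into a NIM container at serve time, as in the deployment example below.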
Rail Types:
Orchestration:
NeMo Customizer API Documentation
NeMo Retriever Models & Pipeline
RAG Implementation & Best Practices
Comprehensive NeMo Ecosystem
Start Here:
- retriever.md for RAG fundamentals

First Project Ideas:
Focus Areas:
- rag.md for advanced retrieval patterns
- api.md to fine-tune with LoRA/SFT
- other.md

Common Workflows:
Enterprise Patterns:
Performance Optimization:
Scaling Strategies:
Quick Lookups:

- api.md
- rag.md
- other.md (Nemotron section)
- other.md (Guardrails section)

Deep Dives:
- retriever.md + rag.md
- api.md + other.md (Customizer)
- other.md (Agent Toolkit)
- other.md (Curator)

Data Curation (Curator) → Model Training/Fine-tuning (Customizer) → Evaluation (Evaluator) → Deployment (NIM) → Monitoring (Agent Toolkit) → Safety (Guardrails)
Documents → Extraction (NeMo Retriever) → Embedding (Nemotron RAG) → Vector DB (cuVS) → Reranking (Nemotron RAG) → LLM (Nemotron + NIM) → Guardrails (NeMo Guardrails) → Response
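The request flow above can be expressed as a simple composition of stages. The sketch below uses stub callables whose names mirror the diagram; the signatures and return shapes are assumptions for illustration, and in a real deployment each callable would wrap the NIM/Retriever requests shown earlier.

```python
def run_rag_pipeline(query, extract, retrieve, rerank, generate, guard):
    """Compose the RAG request flow: each argument is one stage."""
    docs = extract(query)                # Extraction (NeMo Retriever)
    candidates = retrieve(query, docs)   # Embedding + Vector DB
    context = rerank(query, candidates)  # Reranking
    draft = generate(query, context)     # LLM
    return guard(draft)                  # Guardrails

# Usage with trivial stubs in place of the real services:
answer = run_rag_pipeline(
    "What is ML?",
    extract=lambda q: ["doc1", "doc2"],
    retrieve=lambda q, docs: docs,
    rerank=lambda q, cands: cands[:1],
    generate=lambda q, ctx: f"Answer from {ctx[0]}",
    guard=lambda text: text,
)
print(answer)  # Answer from doc1
```

Keeping each stage behind a callable makes it easy to swap a local stub for a hosted NIM endpoint without changing the pipeline shape.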
User Interactions → Data Collection → Curation (Curator) → Fine-tuning (Customizer) → Evaluation (Evaluator) → Deployment (NIM) → Loop Back
| Use Case | Recommended Model | Deployment |
|---|---|---|
| Edge AI, IoT | Nemotron Nano 8B | Single device |
| Chatbots, agents | Nemotron Super 70B | Single GPU |
| Enterprise RAG | Nemotron Ultra 405B | Data center |
| Document intelligence | Nemotron Nano VL | GPU workstation |
| Embedding | NV-Embed-v2 | NIM microservice |
| Reranking | NV-RerankQA | NIM microservice |
NeMo Retriever:
Nemotron Models:
To refresh this skill with updated documentation: