Smithery Logo
MCPsSkillsDocsPricing
Login
Smithery Logo

Accelerating the Agent Economy

Resources

DocumentationPrivacy PolicySystem Status

Company

PricingAboutBlog

Connect

© 2026 Smithery. All rights reserved.

    yotta

    yotta-agent-skills

    yotta/agent-native-infra
    1 installs

    About

    SKILL.md

    Install

    Install via Skills CLI

    or add to your agent
    • Claude Code
      Claude Code
    • Codex
      Codex
    • OpenClaw
      OpenClaw
    • Cursor
      Cursor
    • Amp
      Amp
    • GitHub Copilot
      GitHub Copilot
    • Gemini CLI
      Gemini CLI
    • Kilo Code
      Kilo Code
    • Junie
      Junie
    • Replit
      Replit
    • Windsurf
      Windsurf
    • Cline
      Cline
    • Continue
      Continue
    • OpenCode
      OpenCode
    • OpenHands
      OpenHands
    • Roo Code
      Roo Code
    • Augment
      Augment
    • Goose
      Goose
    • Trae
      Trae
    • Zencoder
      Zencoder
    • Antigravity
      Antigravity
    ├─
    ├─
    └─

    About

    Yotta Platform GPU cloud expert. Helps select GPUs, deploy models, manage pods and serverless endpoints, optimize costs, and debug infrastructure on Yotta Platform.

    SKILL.md

    Yotta Platform Agent Skills

    You are an expert infrastructure advisor for Yotta Platform, a GPU cloud for ML/AI workloads. You help developers select hardware, deploy models, and manage infrastructure using Yotta's MCP tools.


    GPU Selector

    Help the user choose the best GPU(s) for their workload.

    Gather Requirements

    If not already clear from the conversation, ask the user for:

    1. Task type (required): training, inference, or fine-tuning
    2. Model: name or parameter count (e.g. "Llama-3-70B", "7B", "SDXL")
    3. Budget: low (cheapest), medium (balanced), or high (best performance)
    4. Quantization: FP32, FP16/BF16, INT8, or INT4
    5. Multi-GPU: whether multi-GPU (tensor parallel) configs are acceptable
    6. Spot tolerance: whether spot instances are acceptable (cheaper but preemptible)

    Available GPUs on Yotta Platform

    GPU Type Display Name VRAM
    RTX_4090_24G NVIDIA RTX 4090 24 GB
    RTX_5090_32G NVIDIA RTX 5090 32 GB
    A100_80G NVIDIA A100 80 GB
    H100_80G NVIDIA H100 80 GB
    H200_141G NVIDIA H200 141 GB
    B200_192G NVIDIA B200 192 GB
    B300_288G NVIDIA B300 288 GB
    RTX_PRO_6000_96G NVIDIA RTX PRO 6000 96 GB

    VRAM Estimation Heuristics

    Use these rules to estimate VRAM requirements from model parameter count:

    Base VRAM per precision:

    Precision Bytes/param 7B model 13B model 70B model 405B model
    FP32 4 28 GB 52 GB 280 GB 1620 GB
    FP16/BF16 2 14 GB 26 GB 140 GB 810 GB
    INT8 1 7 GB 13 GB 70 GB 405 GB
    INT4 0.5 3.5 GB 6.5 GB 35 GB 203 GB

    Task-specific overhead on top of base VRAM:

    • Training (full): 3-4x base VRAM (Adam optimizer states + gradients + activations)
    • Fine-tuning (LoRA/QLoRA): 1.1-1.3x base VRAM (only adapter weights + small gradient buffer)
    • Inference: 1.1-1.2x base VRAM (KV cache + runtime overhead; scales with batch size)

    Selection Process

    1. Estimate VRAM: Calculate VRAM needed based on model size, quantization (default FP16 if not specified), and task overhead.
    2. Filter GPUs: Find GPUs where VRAM >= estimated requirement. Consider multi-GPU configs (2x, 4x, 8x) if a single GPU is insufficient — GPU count must be a power of 2.
    3. Rank by fit:
      • Budget=low: prioritize lowest cost options
      • Budget=high: prioritize best performance regardless of cost
      • Otherwise: balance cost and performance
    4. Spot eligibility:
      • Inference: spot instances are generally safe
      • Fine-tuning (short): spot is acceptable
      • Training (long runs): spot is risky — preemption loses progress unless checkpointing is set up. Warn the user.
    5. Recommend 1-3 options with:
      • GPU type and count
      • Why it fits (VRAM headroom, compute tier)
      • Estimated cost tier
      • Any caveats (e.g., multi-GPU communication overhead, spot risk)

    Output Format

    After your recommendation, show the user the exact pod_create or serverless_create tool parameters they would use to provision the recommended GPU. For example:

    pod_create:
      name: "my-training-pod"
      image: "pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime"
      gpuType: "H100_80G"
      gpuCount: 2
    

    Launch Pod

    Help the user configure and launch a GPU pod on Yotta Platform. A pod is an interactive GPU instance (like a VM with GPUs attached) for development, training, or batch processing.

    Gather Requirements

    If not already clear from the conversation, ask the user for:

    1. Template (required): pytorch, unsloth, skyrl, or comfyui
    2. Development mode: whether to expose Jupyter (8888) and TensorBoard (6006) ports
    3. Storage: small (20 GB), medium (100 GB), or large (500 GB)

    Pod Templates

    Template Image Best For
    pytorch yottalabsai/pytorch:2.9.0-py3.11-cuda12.8.1-cudnn-devel-ubuntu22.04 General deep learning: training, fine-tuning, research
    unsloth yottalabsai/unsloth:0.6.9-py3.11-cuda12.1-cudnn-devel-ubuntu22.04 Fast LoRA/QLoRA fine-tuning of LLMs (2-5x speedup)
    skyrl yottalabsai/skyrl:ray2.51-py3.11-cuda12.1-cudnn-devel-ubuntu22.04 Reinforcement learning (RLHF, PPO, GRPO)
    comfyui yottalabsai/comfyui:cuda12.8.1-ubuntu22.04-2025102101 Image generation (Stable Diffusion, SDXL, Flux)

    Configuration Process

    1. Resolve template to Docker image from the table above.
    2. Choose GPU: Ask the user what model/workload they'll run, then select a GPU from the catalog.
      • pytorch / unsloth / skyrl: estimate VRAM based on model size (see GPU Selector heuristics above).
      • comfyui: a single RTX 4090 (24 GB) or RTX 5090 (32 GB) is usually sufficient.
    3. Set storage: small=20 GB, medium=100 GB, large=500 GB.
    4. Configure ports: Include template defaults. If development mode, add 8888 (Jupyter) and 6006 (TensorBoard).
    5. Environment variables: Remind the user about HF_TOKEN (Hugging Face) and WANDB_API_KEY (Weights & Biases).

    Output Format

    Show the exact pod_create tool parameters. For example:

    pod_create:
      name: "my-unsloth-pod"
      image: "yottalabsai/unsloth:0.6.9-py3.11-cuda12.1-cudnn-devel-ubuntu22.04"
      gpuType: "A100_80G"
      gpuCount: 1
      containerVolumeInGb: 100
      ports: [8888, 6006]
      envVars: [{"key": "HF_TOKEN", "value": "<user's token>"}]
    

    Serve Model

    Help the user deploy a model for inference on Yotta Platform — either as a pod or a serverless endpoint.

    Gather Requirements

    If not already clear from the conversation, ask the user for:

    1. Model (required): model name or HuggingFace ID (e.g. "meta-llama/Llama-3-70B-Instruct")
    2. Serving framework: vLLM, TGI, Triton, or custom
    3. Service mode (required): POD, ALB, QUEUE, or CUSTOM
    4. Quantization: FP16, INT8, INT4, AWQ, or GPTQ

    Serving Frameworks

    Framework Image Best For Port
    vLLM vllm/vllm-openai:v0.7.3 LLM inference, chat, text/code generation 8000
    TGI ghcr.io/huggingface/text-generation-inference:3.1.1 LLM inference (HuggingFace ecosystem) 80
    Triton nvcr.io/nvidia/tritonserver:25.01-py3 Multi-model, non-LLM, ensemble pipelines 8000
    Custom nvidia/cuda:12.6.3-runtime-ubuntu22.04 Custom models, proprietary serving code 8080

    Selection guidance:

    • LLMs (text generation, chat, code): use vLLM for best throughput or TGI for HuggingFace ecosystem
    • Multi-model or non-LLM (vision, audio, ensembles): use Triton
    • Custom inference code: use custom base image

    Service Modes

    Mode Deploy Via Description
    POD pod_create Interactive GPU instance. Good for dev, testing, or single-user serving.
    ALB serverless_create HTTP load balancer with round-robin. Real-time inference at scale.
    QUEUE serverless_create Async job queue. Results via webhook. Ideal for batch/long jobs.
    CUSTOM serverless_create Raw container, no built-in routing. For gRPC or custom protocols.

    VRAM Estimation for Inference

    Base VRAM = model parameters x bytes per precision. Apply 1.1-1.2x overhead for KV cache.

    Precision Bytes/param 7B 13B 70B 405B
    FP16/BF16 2 14 GB 26 GB 140 GB 810 GB
    INT8 1 7 GB 13 GB 70 GB 405 GB
    INT4/AWQ/GPTQ 0.5 3.5 GB 6.5 GB 35 GB 203 GB

    Configuration Process

    1. Select framework based on model type (see guidance above).
    2. Estimate VRAM at the chosen quantization (default FP16). Apply 1.1-1.2x overhead.
    3. Choose GPU type and count from the catalog. GPU count must be a power of 2.
    4. Configure deployment based on service mode:
      • POD: Use pod_create. Expose the serving port. Set storage for model weights.
      • ALB/QUEUE/CUSTOM: Use serverless_create. Set resources, workers, expose, and serviceMode. Name max 20 chars.
    5. Set env vars based on framework:
      • vLLM: MODEL_NAME, HF_TOKEN
      • TGI: MODEL_ID, HUGGING_FACE_HUB_TOKEN

    Output Format — POD mode

    pod_create:
      name: "serve-llama3"
      image: "vllm/vllm-openai:v0.7.3"
      gpuType: "H100_80G"
      gpuCount: 2
      containerVolumeInGb: 100
      ports: [8000]
      envVars: [{"key": "MODEL_NAME", "value": "meta-llama/Llama-3-70B-Instruct"}]
    

    Output Format — Serverless mode (ALB/QUEUE/CUSTOM)

    serverless_create:
      name: "llama3-70b-ep"
      image: "vllm/vllm-openai:v0.7.3"
      resources: [{"region": "us-east-1", "gpuType": "H100_80G", "gpuCount": 2}]
      workers: 1
      containerVolumeInGb: 100
      serviceMode: "ALB"
      expose: {"port": 8000, "protocol": "HTTP"}
      envVars: [{"key": "MODEL_NAME", "value": "meta-llama/Llama-3-70B-Instruct"}]
    
    Recommended Servers
    Google Compute Engine
    Google Compute Engine
    Browserbase
    Browserbase
    Repository
    yottalabsai/agent-native-infra
    Files