    davila7/llama-cpp

    About

    Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1...

    SKILL.md

    llama.cpp

    Pure C/C++ LLM inference with minimal dependencies, optimized for CPUs and non-NVIDIA hardware.

    When to use llama.cpp

    Use llama.cpp when:

    • Running on CPU-only machines
    • Deploying on Apple Silicon (M1/M2/M3/M4)
    • Using AMD or Intel GPUs (no CUDA)
    • Edge deployment (Raspberry Pi, embedded systems)
    • Need simple deployment without Docker/Python

    Use TensorRT-LLM instead when:

    • Have NVIDIA GPUs (A100/H100)
    • Need maximum throughput (100K+ tok/s)
    • Running in a datacenter with CUDA

    Use vLLM instead when:

    • Have NVIDIA GPUs
    • Need a Python-first API
    • Want PagedAttention
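
    A quick way to check which case applies on a given machine (illustrative shell heuristics, not part of llama.cpp):

    # Rough backend checks (Linux/macOS; adjust for your environment)
    if command -v nvidia-smi >/dev/null 2>&1; then
        echo "NVIDIA GPU present: vLLM or TensorRT-LLM may be a better fit"
    elif [ "$(uname -s)" = "Darwin" ] && [ "$(uname -m)" = "arm64" ]; then
        echo "Apple Silicon: use the Metal build of llama.cpp"
    else
        echo "No NVIDIA GPU: llama.cpp on CPU (or ROCm for AMD) is a good fit"
    fi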

    Quick start

    Installation

    # macOS/Linux
    brew install llama.cpp
    
    # Or build from source
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    make
    
    # With Metal (Apple Silicon)
    make LLAMA_METAL=1
    
    # With CUDA (NVIDIA)
    make LLAMA_CUDA=1
    
    # With ROCm (AMD)
    make LLAMA_HIP=1
    
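    To verify the build, print the version string (supported by current llama.cpp binaries; output varies by version):

    # Confirm the binary runs and report its version
    ./llama-cli --version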

    Download model

    # Download from HuggingFace (GGUF format)
    huggingface-cli download \
        TheBloke/Llama-2-7B-Chat-GGUF \
        llama-2-7b-chat.Q4_K_M.gguf \
        --local-dir models/
    
    # Or convert from HuggingFace
    python convert_hf_to_gguf.py models/llama-2-7b-chat/
    
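    Conversion produces a full-precision GGUF; the bundled llama-quantize tool shrinks it to one of the quantized formats described below (file names here are illustrative):

    # Quantize a converted F16 GGUF down to Q4_K_M
    ./llama-quantize \
        models/llama-2-7b-chat/llama-2-7b-chat-f16.gguf \
        models/llama-2-7b-chat.Q4_K_M.gguf \
        Q4_K_M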

    Run inference

    # Simple chat
    ./llama-cli \
        -m models/llama-2-7b-chat.Q4_K_M.gguf \
        -p "Explain quantum computing" \
        -n 256  # Max tokens
    
    # Interactive chat
    ./llama-cli \
        -m models/llama-2-7b-chat.Q4_K_M.gguf \
        --interactive
    
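    Sampling can be tuned with standard llama-cli flags; a sketch with arbitrary values:

    # Low-temperature, seeded generation
    ./llama-cli \
        -m models/llama-2-7b-chat.Q4_K_M.gguf \
        -p "Explain quantum computing" \
        --temp 0.2 \
        --top-k 40 \
        --top-p 0.9 \
        --seed 42 \
        -n 256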

    Server mode

    # Start OpenAI-compatible server
    ./llama-server \
        -m models/llama-2-7b-chat.Q4_K_M.gguf \
        --host 0.0.0.0 \
        --port 8080 \
        -ngl 32  # Offload 32 layers to GPU
    
    # Client request
    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "llama-2-7b-chat",
        "messages": [{"role": "user", "content": "Hello!"}],
        "temperature": 0.7,
        "max_tokens": 100
      }'
    
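    The same endpoint supports OpenAI-style streaming; a minimal sketch:

    # Stream the response as server-sent events (-N disables curl buffering)
    curl -N http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "llama-2-7b-chat",
        "messages": [{"role": "user", "content": "Write a haiku"}],
        "stream": true
      }'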

    Quantization formats

    GGUF format overview

    Format   Bits  Size (7B)  Speed    Quality    Use Case
    Q4_K_M   4.5   4.1 GB     Fast     Good       Recommended default
    Q4_K_S   4.3   3.9 GB     Faster   Lower      Speed critical
    Q5_K_M   5.5   4.8 GB     Medium   Better     Quality critical
    Q6_K     6.5   5.5 GB     Slower   Best       Maximum quality
    Q8_0     8.0   7.0 GB     Slow     Excellent  Minimal degradation
    Q2_K     2.5   2.7 GB     Fastest  Poor       Testing only

    Choosing quantization

    # General use (balanced)
    Q4_K_M  # 4-bit, medium quality
    
    # Maximum speed (more degradation)
    Q2_K or Q3_K_M
    
    # Maximum quality (slower)
    Q6_K or Q8_0
    
    # Very large models (70B, 405B)
    Q3_K_M or Q4_K_S  # Lower bits to fit in memory
    
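    Rule of thumb: the GGUF file size plus a few GB of overhead (KV cache, buffers) must fit in RAM/VRAM. A quick check (Linux commands shown; macOS differs):

    # Compare GGUF size against available memory
    ls -lh models/llama-2-7b-chat.Q4_K_M.gguf
    free -h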

    Hardware acceleration

    Apple Silicon (Metal)

    # Build with Metal
    make LLAMA_METAL=1
    
    # Run with GPU acceleration (automatic)
    ./llama-cli -m model.gguf -ngl 999  # Offload all layers
    
    # Performance: M3 Max 40-60 tokens/sec (Llama 2-7B Q4_K_M)
    

    NVIDIA GPUs (CUDA)

    # Build with CUDA
    make LLAMA_CUDA=1
    
    # Offload layers to GPU
    ./llama-cli -m model.gguf -ngl 35  # Offload 35/40 layers
    
    # Hybrid CPU+GPU for large models
    ./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20  # GPU: 20 layers, CPU: rest
    
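    When tuning -ngl, raise the layer count until VRAM is nearly full; nvidia-smi reports usage (a sketch, not an exact recipe):

    # Watch VRAM while experimenting with -ngl values
    watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv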

    AMD GPUs (ROCm)

    # Build with ROCm
    make LLAMA_HIP=1
    
    # Run with AMD GPU
    ./llama-cli -m model.gguf -ngl 999
    
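    On multi-GPU AMD systems, ROCm's HIP_VISIBLE_DEVICES environment variable selects the device (index 0 here is an assumption):

    # Pin inference to the first AMD GPU on a multi-GPU machine
    HIP_VISIBLE_DEVICES=0 ./llama-cli -m model.gguf -ngl 999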

    Common patterns

    Batch processing

    # Read the prompt from a file; --batch-size sets tokens per evaluation batch
    ./llama-cli \
        -m model.gguf \
        -f prompts.txt \
        --batch-size 512 \
        -n 100
    

    Constrained generation

    # JSON output with grammar
    ./llama-cli \
        -m model.gguf \
        -p "Generate a person: " \
        --grammar-file grammars/json.gbnf
    
    # Outputs valid JSON only
    
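    Custom grammars use the same GBNF format as the bundled ones; for example, restricting output to a yes/no answer (grammar written inline for illustration):

    # Write a minimal GBNF grammar and constrain generation with it
    printf 'root ::= "yes" | "no"\n' > yesno.gbnf

    ./llama-cli \
        -m model.gguf \
        -p "Is the sky blue? Answer: " \
        --grammar-file yesno.gbnf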

    Context size

    # Increase the context window (check your build's default with --help)
    ./llama-cli \
        -m model.gguf \
        -c 4096  # 4K context window

    # Very long context (if the model supports it)
    ./llama-cli -m model.gguf -c 32768  # 32K context
    
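    Recent builds also accept -c 0 to use the context length the model was trained with (confirm with --help on your build):

    # Use the model's trained context length
    ./llama-cli -m model.gguf -c 0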

    Performance benchmarks

    CPU performance (Llama 2-7B Q4_K_M)

    CPU                Threads  Speed     Cost
    Apple M3 Max       16       50 tok/s  $0 (local)
    AMD Ryzen 9 7950X  32       35 tok/s  $0.50/hour
    Intel i9-13900K    32       30 tok/s  $0.40/hour
    AWS c7i.16xlarge   64       40 tok/s  $2.88/hour

    GPU acceleration (Llama 2-7B Q4_K_M)

    GPU                   Speed      vs CPU  Cost
    NVIDIA RTX 4090       120 tok/s  3-4×    $0 (local)
    NVIDIA A10            80 tok/s   2-3×    $1.00/hour
    AMD MI250             70 tok/s   2×      $2.00/hour
    Apple M3 Max (Metal)  50 tok/s   ~same   $0 (local)
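
    These numbers vary with hardware, quantization, and build flags; the bundled llama-bench tool measures speed on your own setup (parameters below are illustrative):

    # Measure prompt processing (512 tokens) and generation (128 tokens) speed
    ./llama-bench \
        -m models/llama-2-7b-chat.Q4_K_M.gguf \
        -p 512 \
        -n 128 \
        -ngl 999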

    Supported models

    LLaMA family:

    • Llama 2 (7B, 13B, 70B)
    • Llama 3 (8B, 70B, 405B)
    • Code Llama

    Mistral family:

    • Mistral 7B
    • Mixtral 8x7B, 8x22B

    Other:

    • Falcon, BLOOM, GPT-J
    • Phi-3, Gemma, Qwen
    • LLaVA (vision), Whisper (audio)

    Find models: https://huggingface.co/models?library=gguf

    References

    • Quantization Guide - GGUF formats, conversion, quality comparison
    • Server Deployment - API endpoints, Docker, monitoring
    • Optimization - Performance tuning, hybrid CPU+GPU

    Resources

    • GitHub: https://github.com/ggerganov/llama.cpp
    • Models: https://huggingface.co/models?library=gguf
    • Discord: https://discord.gg/llama-cpp