    davila7

    gguf-quantization

    davila7/gguf-quantization
    AI & ML
    19,892

    About

    GGUF format and llama.cpp quantization for efficient CPU/GPU inference...

    SKILL.md

    GGUF - Quantization Format for llama.cpp

    GGUF (GPT-Generated Unified Format) is the standard model file format for llama.cpp. It stores weights and metadata in a single file and supports a range of quantization levels, enabling efficient inference on CPUs, Apple Silicon, and GPUs.
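To make the "single file with metadata" idea concrete, here is a minimal sketch that reads the fixed-size header at the start of a GGUF file. It assumes the little-endian layout used by GGUF version 2 and later (4-byte magic `GGUF`, 32-bit version, 64-bit tensor count, 64-bit metadata key/value count). The function names are illustrative and not part of llama.cpp.

```python
import struct

GGUF_MAGIC = b"GGUF"
# Assumed layout (GGUF v2+): magic, version, tensor count, metadata KV count.
HEADER = struct.Struct("<4sIQQ")


def parse_gguf_header(raw):
    """Parse the first bytes of a GGUF file into (version, n_tensors, n_kv)."""
    if len(raw) < HEADER.size:
        raise ValueError("too short to be a GGUF file")
    magic, version, n_tensors, n_kv = HEADER.unpack(raw[:HEADER.size])
    if magic != GGUF_MAGIC:
        raise ValueError(f"bad magic {magic!r}, expected {GGUF_MAGIC!r}")
    return version, n_tensors, n_kv


def read_gguf_header(path):
    """Read and parse the header of the GGUF file at `path`."""
    with open(path, "rb") as f:
        return parse_gguf_header(f.read(HEADER.size))
```

Calling `read_gguf_header("model-q4_k_m.gguf")` on a real file is a quick sanity check that a download or conversion produced a GGUF file rather than, say, an HTML error page.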

    When to use GGUF

    Use GGUF when:

    • Deploying on consumer hardware (laptops, desktops)
    • Running on Apple Silicon (M1/M2/M3) with Metal acceleration
    • Running CPU-only inference with no GPU available
    • Choosing a quantization level anywhere from Q2_K to Q8_0
    • Using local AI tools (LM Studio, Ollama, text-generation-webui)

    Key advantages:

    • Universal hardware: CPU, Apple Silicon, NVIDIA, AMD support
    • No Python runtime: Pure C/C++ inference
    • Flexible quantization: 2-8 bit with various methods (K-quants)
    • Ecosystem support: LM Studio, Ollama, koboldcpp, and more
    • imatrix: Importance matrix for better low-bit quality

    Use alternatives instead:

    • AWQ/GPTQ: Maximum accuracy with calibration on NVIDIA GPUs
    • HQQ: Fast calibration-free quantization for HuggingFace
    • bitsandbytes: Simple integration with transformers library
    • TensorRT-LLM: Production NVIDIA deployment with maximum speed

    Quick start

    Installation

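    # Clone llama.cpp
    git clone https://github.com/ggml-org/llama.cpp
    cd llama.cpp
    
    # Build (CPU). Current llama.cpp builds with CMake; the old Makefile build was removed.
    cmake -B build
    cmake --build build --config Release
    
    # Build with CUDA (NVIDIA)
    cmake -B build -DGGML_CUDA=ON
    cmake --build build --config Release
    
    # Build with Metal (Apple Silicon): Metal is enabled by default on macOS
    cmake -B build
    cmake --build build --config Release
    
    # Binaries are written to build/bin/. The examples below use ./ for brevity,
    # so add build/bin to PATH or adjust the prefix.
    
    # Install Python bindings (optional)
    pip install llama-cpp-python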
    

    Convert model to GGUF

    # Install requirements
    pip install -r requirements.txt
    
    # Convert HuggingFace model to GGUF (FP16)
    python convert_hf_to_gguf.py ./path/to/model --outfile model-f16.gguf
    
    # Or specify output type
    python convert_hf_to_gguf.py ./path/to/model \
        --outfile model-f16.gguf \
        --outtype f16
    

    Quantize model

    # Basic quantization to Q4_K_M
    ./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
    
    # Quantize with importance matrix (better quality)
    ./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
    ./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
    

    Run inference

    # CLI inference
    ./llama-cli -m model-q4_k_m.gguf -p "Hello, how are you?"
    
    # Interactive mode
    ./llama-cli -m model-q4_k_m.gguf --interactive
    
    # With GPU offload
    ./llama-cli -m model-q4_k_m.gguf -ngl 35 -p "Hello!"
    

    Quantization types

    K-quant methods (recommended)

    Type     Bits   Size (7B)   Quality     Use case
    Q2_K     2.5    ~2.8 GB     Low         Extreme compression
    Q3_K_S   3.0    ~3.0 GB     Low-Med     Memory constrained
    Q3_K_M   3.3    ~3.3 GB     Medium      Balance
    Q4_K_S   4.0    ~3.8 GB     Med-High    Good balance
    Q4_K_M   4.5    ~4.1 GB     High        Recommended default
    Q5_K_S   5.0    ~4.6 GB     High        Quality focused
    Q5_K_M   5.5    ~4.8 GB     Very High   High quality
    Q6_K     6.0    ~5.5 GB     Excellent   Near-original
    Q8_0     8.0    ~7.2 GB     Best        Maximum quality
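The 7B sizes in the table can be scaled to other model sizes for a rough fit check. The sketch below assumes file size grows linearly with parameter count and that about 1 GB of extra memory is needed for context and buffers. Both are simplifying assumptions, and the function names are illustrative.

```python
# Approximate 7B file sizes in GB, copied from the table above (largest first).
SIZE_7B_GB = {
    "Q8_0": 7.2, "Q6_K": 5.5, "Q5_K_M": 4.8, "Q5_K_S": 4.6, "Q4_K_M": 4.1,
    "Q4_K_S": 3.8, "Q3_K_M": 3.3, "Q3_K_S": 3.0, "Q2_K": 2.8,
}


def estimate_size_gb(quant, params_billion):
    """Scale the 7B file size linearly to another parameter count."""
    return SIZE_7B_GB[quant] * params_billion / 7.0


def largest_quant_that_fits(params_billion, memory_gb, overhead_gb=1.0):
    """Return the highest-quality type whose estimated size fits, or None."""
    for quant in SIZE_7B_GB:  # dicts keep insertion order: largest first
        if estimate_size_gb(quant, params_billion) + overhead_gb <= memory_gb:
            return quant
    return None
```

For example, `largest_quant_that_fits(7, 6.0)` picks `Q5_K_M` under these assumptions. Treat the result as a starting point; real memory use also depends on context length and backend.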

    Legacy methods

    Type   Description
    Q4_0   4-bit, per-block scale only
    Q4_1   4-bit, per-block scale plus minimum
    Q5_0   5-bit, per-block scale only
    Q5_1   5-bit, per-block scale plus minimum

    Recommendation: Use K-quant methods (Q4_K_M, Q5_K_M) for best quality/size ratio.

    Conversion workflows

    Workflow 1: HuggingFace to GGUF

    # 1. Download model
    huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b
    
    # 2. Convert to GGUF (FP16)
    python convert_hf_to_gguf.py ./llama-3.1-8b \
        --outfile llama-3.1-8b-f16.gguf \
        --outtype f16
    
    # 3. Quantize
    ./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M
    
    # 4. Test
    ./llama-cli -m llama-3.1-8b-q4_k_m.gguf -p "Hello!" -n 50
    

    Workflow 2: With importance matrix (better quality)

    # 1. Convert to GGUF
    python convert_hf_to_gguf.py ./model --outfile model-f16.gguf
    
    # 2. Create calibration text (diverse samples)
    # These lines are placeholders: a real file needs far more text, covering
    # the domains and languages the model will see.
    cat > calibration.txt << 'EOF'
    The quick brown fox jumps over the lazy dog.
    Machine learning is a subset of artificial intelligence.
    Python is a popular programming language.
    EOF
    
    # 3. Generate importance matrix
    # -c sets the tokens per chunk; --chunk would instead skip that many chunks
    ./llama-imatrix -m model-f16.gguf \
        -f calibration.txt \
        -c 512 \
        -o model.imatrix \
        -ngl 35  # GPU layers if available
    
    # 4. Quantize with imatrix
    ./llama-quantize --imatrix model.imatrix \
        model-f16.gguf \
        model-q4_k_m.gguf \
        Q4_K_M
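The convert, imatrix, and quantize steps above can also be driven from Python. This is a minimal sketch that only assembles the same command lines shown in this workflow. `build_pipeline` and `run_pipeline` are hypothetical helper names, and the binary paths assume you run from the directory that contains them.

```python
import shlex
import subprocess


def build_pipeline(model_dir, name, quant="Q4_K_M", calibration=None):
    """Return the argv lists for convert -> (imatrix) -> quantize."""
    f16 = f"{name}-f16.gguf"
    out = f"{name}-{quant.lower()}.gguf"
    steps = [["python", "convert_hf_to_gguf.py", model_dir,
              "--outfile", f16, "--outtype", "f16"]]
    quantize = ["./llama-quantize"]
    if calibration is not None:
        imatrix = f"{name}.imatrix"
        steps.append(["./llama-imatrix", "-m", f16, "-f", calibration, "-o", imatrix])
        quantize += ["--imatrix", imatrix]
    steps.append(quantize + [f16, out, quant])
    return steps


def run_pipeline(steps, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in steps:
        print("$", shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)  # stop at the first failing step
```

Running `run_pipeline(build_pipeline("./model", "model", calibration="calibration.txt"))` prints the three commands without executing them, which makes it easy to review paths before a long conversion.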
    

    Workflow 3: Multiple quantizations

    #!/bin/bash
    set -euo pipefail
    MODEL="llama-3.1-8b-f16.gguf"
    IMATRIX="llama-3.1-8b.imatrix"
    
    # Generate imatrix once
    ./llama-imatrix -m "$MODEL" -f wiki.txt -o "$IMATRIX" -ngl 35
    
    # Create multiple quantizations
    for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
        # Lowercase with tr: ${QUANT,,} needs bash 4+, which stock macOS lacks
        SUFFIX=$(printf '%s' "$QUANT" | tr '[:upper:]' '[:lower:]')
        OUTPUT="llama-3.1-8b-${SUFFIX}.gguf"
        ./llama-quantize --imatrix "$IMATRIX" "$MODEL" "$OUTPUT" "$QUANT"
        echo "Created: $OUTPUT ($(du -h "$OUTPUT" | cut -f1))"
    done
    

    Python usage

    llama-cpp-python

    from llama_cpp import Llama
    
    # Load model
    llm = Llama(
        model_path="./model-q4_k_m.gguf",
        n_ctx=4096,          # Context window
        n_gpu_layers=35,     # GPU offload (0 for CPU only)
        n_threads=8          # CPU threads
    )
    
    # Generate
    output = llm(
        "What is machine learning?",
        max_tokens=256,
        temperature=0.7,
        stop=["</s>", "\n\n"]
    )
    print(output["choices"][0]["text"])
    

    Chat completion

    from llama_cpp import Llama
    
    llm = Llama(
        model_path="./model-q4_k_m.gguf",
        n_ctx=4096,
        n_gpu_layers=35,
        chat_format="llama-3"  # Or "chatml", "mistral", etc.
    )
    
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Python?"}
    ]
    
    response = llm.create_chat_completion(
        messages=messages,
        max_tokens=256,
        temperature=0.7
    )
    print(response["choices"][0]["message"]["content"])
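For a multi-turn conversation, the message list has to carry earlier turns forward. The sketch below wraps `create_chat_completion` in a small helper that appends both the user message and the reply to a shared history. `chat_turn` is a hypothetical name, and `llm` is assumed to be a `Llama` instance like the one above.

```python
def chat_turn(llm, history, user_text, **kwargs):
    """Send one user message, record the reply in `history`, and return it.

    `history` is the same list-of-dicts format passed as `messages` above;
    extra keyword arguments (max_tokens, temperature, ...) are forwarded.
    """
    history.append({"role": "user", "content": user_text})
    response = llm.create_chat_completion(messages=history, **kwargs)
    reply = response["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    return reply
```

Usage would look like `history = [{"role": "system", "content": "You are a helpful assistant."}]` followed by repeated `chat_turn(llm, history, "...", max_tokens=256)` calls. The history grows every turn, so long chats eventually need trimming to stay within `n_ctx`.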
    

    Streaming

    from llama_cpp import Llama
    
    llm = Llama(model_path="./model-q4_k_m.gguf", n_gpu_layers=35)
    
    # Stream tokens
    for chunk in llm(
        "Explain quantum computing:",
        max_tokens=256,
        stream=True
    ):
        print(chunk["choices"][0]["text"], end="", flush=True)
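When streaming, it is often useful to keep the full text as well as printing it. A minimal sketch, assuming each chunk has the `chunk["choices"][0]["text"]` shape used above (`stream_to_text` is an illustrative name):

```python
def stream_to_text(chunks, echo=True):
    """Consume a stream of completion chunks and return the joined text."""
    parts = []
    for chunk in chunks:
        piece = chunk["choices"][0]["text"]
        if echo:
            print(piece, end="", flush=True)  # show tokens as they arrive
        parts.append(piece)
    return "".join(parts)
```

With the model above this would be called as `text = stream_to_text(llm("Explain quantum computing:", max_tokens=256, stream=True))`.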
    

    Server mode

    Start OpenAI-compatible server

    # Start server
    ./llama-server -m model-q4_k_m.gguf \
        --host 0.0.0.0 \
        --port 8080 \
        -ngl 35 \
        -c 4096
    
    # Or with Python bindings (needs the server extra: pip install 'llama-cpp-python[server]')
    python -m llama_cpp.server \
        --model model-q4_k_m.gguf \
        --n_gpu_layers 35 \
        --host 0.0.0.0 \
        --port 8080
    

    Use with OpenAI client

    from openai import OpenAI
    
    client = OpenAI(
        base_url="http://localhost:8080/v1",
        api_key="not-needed"
    )
    
    response = client.chat.completions.create(
        model="local-model",
        messages=[{"role": "user", "content": "Hello!"}],
        max_tokens=256
    )
    print(response.choices[0].message.content)
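The same endpoint can be called without the `openai` package. The sketch below builds the HTTP request with the standard library only. It assumes the server started above is listening on `localhost:8080` and exposes `/v1/chat/completions`. The helper names are illustrative.

```python
import json
import urllib.request


def build_chat_request(base_url, messages, max_tokens=256, model="local-model"):
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    payload = {"model": model, "messages": messages, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

A call would look like `send_chat_request(build_chat_request("http://localhost:8080/v1", [{"role": "user", "content": "Hello!"}]))`. Keeping request construction separate from sending makes the payload easy to inspect or log.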
    

    Hardware optimization

    Apple Silicon (Metal)

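    # Build with Metal (enabled by default in CMake builds on macOS; binaries land in build/bin/)
    cmake -B build && cmake --build build --config Release
    
    # Run with Metal acceleration
    ./llama-cli -m model.gguf -ngl 99 -p "Hello"
    
    # Python with Metal
    from llama_cpp import Llama
    
    llm = Llama(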
        model_path="model.gguf",
        n_gpu_layers=99,     # Offload all layers
        n_threads=1          # Metal handles parallelism
    )
    

    NVIDIA CUDA

    # Build with CUDA (binaries land in build/bin/)
    cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release
    
    # Run with CUDA
    ./llama-cli -m model.gguf -ngl 35 -p "Hello"
    
    # Specify GPU
    CUDA_VISIBLE_DEVICES=0 ./llama-cli -m model.gguf -ngl 35
    

    CPU optimization

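    # Build for the host CPU (AVX2/AVX512 are used when the CPU supports them)
    cmake -B build && cmake --build build --config Release
    
    # Run with optimal threads
    ./llama-cli -m model.gguf -t 8 -p "Hello"
    
    # Python CPU config
    from llama_cpp import Llama
    
    llm = Llama(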
        model_path="model.gguf",
        n_gpu_layers=0,      # CPU only
        n_threads=8,         # Match physical cores
        n_batch=512          # Batch size for prompt processing
    )
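Rather than hard-coding `n_threads=8`, the thread count can be derived from the machine. This sketch prefers the physical core count from the optional third-party `psutil` package and otherwise falls back to half the logical count, which assumes two hardware threads per core. That fallback is a guess and will be wrong on CPUs without SMT.

```python
import os


def fallback_threads(logical):
    """Guess physical cores as half the logical count (assumes 2-way SMT)."""
    return max(1, logical // 2)


def pick_n_threads(physical=None, logical=None):
    """Return a thread count that matches physical cores where possible."""
    if physical is None:
        try:
            import psutil  # optional dependency; not required
            physical = psutil.cpu_count(logical=False)
        except ImportError:
            physical = None
    if physical:
        return physical
    return fallback_threads(logical or os.cpu_count() or 1)
```

The result can be passed straight through as `Llama(..., n_threads=pick_n_threads())`.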
    

    Integration with tools

    Ollama

    # Create Modelfile
    cat > Modelfile << 'EOF'
    FROM ./model-q4_k_m.gguf
    TEMPLATE """{{ .System }}
    {{ .Prompt }}"""
    PARAMETER temperature 0.7
    PARAMETER num_ctx 4096
    EOF
    
    # Create Ollama model
    ollama create mymodel -f Modelfile
    
    # Run
    ollama run mymodel "Hello!"
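If several quantizations need to be registered with Ollama, the Modelfile can be generated rather than typed by hand. A small sketch that reproduces the Modelfile above with the path and parameters as arguments (`render_modelfile` is a hypothetical helper, not an Ollama API):

```python
def render_modelfile(gguf_path, temperature=0.7, num_ctx=4096):
    """Return Modelfile text matching the template shown above."""
    return (
        f"FROM {gguf_path}\n"
        # Plain (non-f) strings below, so the double braces stay literal.
        'TEMPLATE """{{ .System }}\n'
        '{{ .Prompt }}"""\n'
        f"PARAMETER temperature {temperature}\n"
        f"PARAMETER num_ctx {num_ctx}\n"
    )
```

Writing the result to disk, for example with `open("Modelfile", "w").write(render_modelfile("./model-q4_k_m.gguf"))`, gives the same file the heredoc produces.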
    

    LM Studio

    1. Place the GGUF file in LM Studio's models directory (~/.cache/lm-studio/models/ in older releases, ~/.lmstudio/models/ in newer ones)
    2. Open LM Studio and select the model
    3. Configure context length and GPU offload
    4. Start inference

    text-generation-webui

    # Place in models folder
    cp model-q4_k_m.gguf text-generation-webui/models/
    
    # Start with llama.cpp loader
    python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35
    

    Best practices

    1. Use K-quants: Q4_K_M offers best quality/size balance
    2. Use imatrix: Always use importance matrix for Q4 and below
    3. GPU offload: Offload as many layers as VRAM allows
    4. Context length: Start with 4096, increase if needed
    5. Thread count: Match physical CPU cores, not logical
    6. Batch size: Increase n_batch for faster prompt processing

    Common issues

    Model loads slowly:

    # mmap is enabled by default; check that --no-mmap is not being passed
    ./llama-cli -m model.gguf
    

    Out of memory:

    # Reduce GPU layers
    ./llama-cli -m model.gguf -ngl 20  # Reduce from 35
    
    # Or use smaller quantization
    ./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M
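A starting value for `-ngl` can be estimated instead of found by trial and error. The sketch below assumes all layers are roughly the same size and reserves a fixed amount of VRAM for the KV cache and scratch buffers. Both are rough assumptions, and the function name and the 1.5 GB default are illustrative.

```python
def estimate_gpu_layers(model_size_gb, n_layers, free_vram_gb, reserve_gb=1.5):
    """Estimate how many layers fit in VRAM, leaving `reserve_gb` free."""
    per_layer_gb = model_size_gb / n_layers  # assumes equally sized layers
    usable_gb = free_vram_gb - reserve_gb
    if usable_gb <= 0:
        return 0
    return min(n_layers, int(usable_gb / per_layer_gb))
```

For a 4.1 GB Q4_K_M file with 32 layers and 4 GB of free VRAM this suggests 19 layers. If loading still fails, lower the value further or raise `reserve_gb` for long contexts.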
    

    Poor quality at low bits:

    # Always use imatrix for Q4 and below
    ./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
    ./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
    

    References

    • Advanced Usage - Batching, speculative decoding, custom builds
    • Troubleshooting - Common issues, debugging, benchmarks

    Resources

    • Repository: https://github.com/ggml-org/llama.cpp
    • Python Bindings: https://github.com/abetlen/llama-cpp-python
    • Pre-quantized Models: https://huggingface.co/TheBloke
    • GGUF Converter: https://huggingface.co/spaces/ggml-org/gguf-my-repo
    • License: MIT
    Repository
    davila7/claude-code-templates