    GPTQ (Generative Pre-trained Transformer Quantization)

    Post-training quantization method that compresses LLMs to 4-bit with minimal accuracy loss using group-wise quantization.

    When to use GPTQ

    Use GPTQ when:

    • Need to fit large models (70B+) on limited GPU memory
    • Want 4× memory reduction with <2% accuracy loss
    • Deploying on consumer GPUs (RTX 4090, 3090)
    • Need faster inference (3-4× speedup vs FP16)

    Use AWQ instead when:

    • Need slightly better accuracy (<1% loss)
    • Have newer GPUs (Ampere, Ada)
    • Want Marlin kernel support (2× faster on some GPUs)

    Use bitsandbytes instead when:

    • Need simple integration with transformers
    • Want 8-bit quantization (less compression, better quality)
    • Don't need pre-quantized model files
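
    For contrast, a minimal sketch of the bitsandbytes path (quantization happens at load time, so no pre-quantized checkpoint is needed); the model name here is just an example:

    # 8-bit loading via transformers + bitsandbytes; quantization happens at load time
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_name = "meta-llama/Llama-2-7b-chat-hf"  # example model; any causal LM works
    bnb_config = BitsAndBytesConfig(load_in_8bit=True)

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)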

    Quick start

    Installation

    # Install AutoGPTQ
    pip install auto-gptq
    
    # With Triton (Linux only, faster)
    pip install auto-gptq[triton]
    
    # With CUDA extensions (faster)
    pip install auto-gptq --no-build-isolation
    
    # Full installation
    pip install auto-gptq transformers accelerate
    
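    A quick sanity check before loading or quantizing models (assumes a CUDA build of PyTorch):

    # Sanity check: auto-gptq imports and a CUDA device is visible
    import torch
    from auto_gptq import AutoGPTQForCausalLM  # fails here if auto-gptq is not installed

    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("Device:", torch.cuda.get_device_name(0))
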

    Load pre-quantized model

    from transformers import AutoTokenizer
    from auto_gptq import AutoGPTQForCausalLM
    
    # Load quantized model from HuggingFace
    model_name = "TheBloke/Llama-2-7B-Chat-GPTQ"
    
    model = AutoGPTQForCausalLM.from_quantized(
        model_name,
        device="cuda:0",
        use_triton=False  # Set True on Linux for speed
    )
    
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    # Generate
    prompt = "Explain quantum computing"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
    outputs = model.generate(**inputs, max_new_tokens=200)
    print(tokenizer.decode(outputs[0]))
    

    Quantize your own model

    from transformers import AutoTokenizer
    from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
    from datasets import load_dataset
    
    # Load model
    model_name = "meta-llama/Llama-2-7b-chat-hf"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    # Quantization config
    quantize_config = BaseQuantizeConfig(
        bits=4,              # 4-bit quantization
        group_size=128,      # Group size (recommended: 128)
        desc_act=False,      # Activation order (False for CUDA kernel)
        damp_percent=0.01    # Dampening factor
    )
    
    # Load model for quantization
    model = AutoGPTQForCausalLM.from_pretrained(
        model_name,
        quantize_config=quantize_config
    )
    
    # Prepare calibration data (each example needs input_ids and attention_mask)
    dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)
    calibration_data = [
        tokenizer(example["text"], truncation=True, max_length=512)
        for example in dataset.take(128)
    ]
    
    # Quantize
    model.quantize(calibration_data)
    
    # Save quantized model
    model.save_quantized("llama-2-7b-gptq")
    tokenizer.save_pretrained("llama-2-7b-gptq")
    
    # Push to HuggingFace
    model.push_to_hub("username/llama-2-7b-gptq")
    

    Group-wise quantization

    How GPTQ works:

    1. Group weights: Divide each weight matrix into groups (typically 128 elements)
    2. Quantize per-group: Each group has its own scale/zero-point
    3. Minimize error: Uses Hessian information to minimize quantization error
    4. Result: 4-bit weights with near-FP16 accuracy

    Group size trade-off:

    Group Size          Model Size      Accuracy   Speed       Recommendation
    -1 (no grouping)    Smallest        Lowest     Fastest     Not recommended
    32                  Largest         Best       Slowest     High accuracy needed
    128                 Medium          Good       Fast        Recommended default
    256                 Smaller         Lower      Faster      Size/speed critical
    1024                Near smallest   Low        Very fast   Rarely useful

    Example:

    Weight matrix: [1024, 4096] = 4.2M elements
    
    Group size = 128:
    - Groups: 4.2M / 128 = 32,768 groups
    - Each group: its own scale and zero-point
    - Result: Better granularity → better accuracy
    
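    The group-wise idea can be sketched with plain round-to-nearest quantization. This illustration shows only the per-group scale/zero-point math; GPTQ additionally uses Hessian information to correct the rounding error, which is not shown here:

    # Illustrative round-to-nearest group-wise 4-bit quantization (not GPTQ itself;
    # assumes in_features is divisible by group_size)
    import torch

    def quantize_groupwise(weight: torch.Tensor, group_size: int = 128, bits: int = 4):
        qmax = 2 ** bits - 1
        out_features, in_features = weight.shape
        w = weight.reshape(out_features, in_features // group_size, group_size)

        w_min = w.amin(dim=-1, keepdim=True)
        w_max = w.amax(dim=-1, keepdim=True)
        scale = (w_max - w_min).clamp(min=1e-8) / qmax   # one FP scale per group
        zero = (-w_min / scale).round()                  # one zero-point per group

        q = (w / scale + zero).round().clamp(0, qmax)    # 4-bit integer codes
        dequant = (q - zero) * scale                     # what the kernel reconstructs at runtime
        return q.reshape_as(weight), dequant.reshape_as(weight), scale, zero

    w = torch.randn(1024, 4096)
    q, w_hat, scale, zero = quantize_groupwise(w)
    print("Mean absolute rounding error:", (w - w_hat).abs().mean().item())
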

    Quantization configurations

    Standard 4-bit (recommended)

    from auto_gptq import BaseQuantizeConfig
    
    config = BaseQuantizeConfig(
        bits=4,              # 4-bit quantization
        group_size=128,      # Standard group size
        desc_act=False,      # Faster CUDA kernel
        damp_percent=0.01    # Dampening factor
    )
    

    Performance:

    • Memory: 4× reduction (70B model: 140GB → 35GB)
    • Accuracy: ~1.5% perplexity increase
    • Speed: 3-4× faster than FP16

    Higher compression (3-bit)

    config = BaseQuantizeConfig(
        bits=3,              # 3-bit (more compression)
        group_size=128,      # Keep standard group size
        desc_act=True,       # Better accuracy (slower)
        damp_percent=0.01
    )
    

    Trade-off:

    • Memory: 5× reduction
    • Accuracy: ~3% perplexity increase
    • Speed: 5× faster (but less accurate)

    Maximum accuracy (4-bit with small groups)

    config = BaseQuantizeConfig(
        bits=4,
        group_size=32,       # Smaller groups (better accuracy)
        desc_act=True,       # Activation reordering
        damp_percent=0.005   # Lower dampening
    )
    

    Trade-off:

    • Memory: 3.5× reduction (slightly larger)
    • Accuracy: ~0.8% perplexity increase (best)
    • Speed: 2-3× faster (kernel overhead)

    Kernel backends

    ExLlamaV2 (default, fastest)

    # ExLlamaV2 kernels are selected automatically for 4-bit GPTQ models when available
    model = AutoGPTQForCausalLM.from_quantized(
        model_name,
        device="cuda:0",
        disable_exllamav2=False  # keep ExLlamaV2 enabled (the default)
    )
    

    Performance: 1.5-2× faster than Triton

    Marlin (Ampere+ GPUs)

    # Marlin needs a 4-bit checkpoint quantized with desc_act=False (symmetric quantization)
    config = BaseQuantizeConfig(
        bits=4,
        group_size=128,
        desc_act=False  # Required for Marlin
    )
    
    model.quantize(calibration_data)
    
    # Load (or convert) a GPTQ checkpoint with the Marlin kernel
    model = AutoGPTQForCausalLM.from_quantized(
        model_name,
        device="cuda:0",
        use_marlin=True  # ~2× faster on A100/H100
    )
    

    Requirements:

    • NVIDIA Ampere or newer (A100, H100, RTX 40xx)
    • Compute capability ≥ 8.0
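
    A quick eligibility check (assumes a CUDA device is visible):

    # Marlin requires compute capability >= 8.0 (Ampere or newer)
    import torch

    major, minor = torch.cuda.get_device_capability(0)
    supported = (major, minor) >= (8, 0)
    print(f"Compute capability {major}.{minor}:", "Marlin supported" if supported else "Marlin not supported")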

    Triton (Linux only)

    model = AutoGPTQForCausalLM.from_quantized(
        model_name,
        device="cuda:0",
        use_triton=True  # Linux only
    )
    

    Performance: 1.2-1.5× faster than CUDA backend

    Integration with transformers

    Direct transformers usage

    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    # Transformers auto-detects GPTQ checkpoints (requires optimum and auto-gptq installed)
    model = AutoModelForCausalLM.from_pretrained(
        "TheBloke/Llama-2-13B-Chat-GPTQ",
        device_map="auto",
        trust_remote_code=False
    )
    
    tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-Chat-GPTQ")
    
    # Use like any transformers model
    inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=100)
    
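    Because the loaded object is a regular transformers model, it also works with the pipeline API; a small sketch reusing the model and tokenizer above:

    # Reuse the already-loaded GPTQ model in a text-generation pipeline
    from transformers import pipeline

    generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
    print(generator("Explain quantization in one sentence:", max_new_tokens=60)[0]["generated_text"])
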

    QLoRA fine-tuning (GPTQ + LoRA)

    from transformers import AutoModelForCausalLM
    from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model
    
    # Load GPTQ model
    model = AutoModelForCausalLM.from_pretrained(
        "TheBloke/Llama-2-7B-GPTQ",
        device_map="auto"
    )
    
    # Prepare for LoRA training
    model = prepare_model_for_kbit_training(model)
    
    # LoRA config
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM"
    )
    
    # Add LoRA adapters
    model = get_peft_model(model, lora_config)
    
    # Fine-tune (memory efficient!)
    # 70B model trainable on single A100 80GB
    
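    To complete the picture, a minimal training-loop sketch; the dataset, tokenizer setup, and hyperparameters below are illustrative placeholders, not part of this skill:

    # Illustrative fine-tuning loop on top of the GPTQ + LoRA model above
    from datasets import load_dataset
    from transformers import (AutoTokenizer, Trainer, TrainingArguments,
                              DataCollatorForLanguageModeling)

    tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-GPTQ")
    tokenizer.pad_token = tokenizer.eos_token

    # Tiny public dataset, purely for illustration
    dataset = load_dataset("Abirate/english_quotes", split="train[:500]")
    tokenized = dataset.map(
        lambda batch: tokenizer(batch["quote"], truncation=True, max_length=256),
        batched=True,
        remove_columns=dataset.column_names,
    )

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="llama2-gptq-lora",
            per_device_train_batch_size=4,
            gradient_accumulation_steps=4,
            num_train_epochs=1,
            learning_rate=2e-4,
            fp16=True,
            logging_steps=10,
        ),
        train_dataset=tokenized,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()
    model.save_pretrained("llama2-gptq-lora")  # saves only the small LoRA adapter
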

    Performance benchmarks

    Memory reduction

    Model            FP16     GPTQ 4-bit   Reduction
    Llama 2-7B       14 GB    3.5 GB       4×
    Llama 2-13B      26 GB    6.5 GB       4×
    Llama 2-70B      140 GB   35 GB        4×
    Llama 3.1-405B   810 GB   203 GB       4×

    Enables:

    • 70B on single A100 80GB (vs 2× A100 needed for FP16)
    • 405B on 3× A100 80GB (vs 11× A100 needed for FP16)
    • 13B on RTX 4090 24GB (vs OOM with FP16)
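
    The 4× figure is simple arithmetic; a back-of-the-envelope estimate (assuming one FP16 scale plus a packed 4-bit zero-point per group of 128, and ignoring embeddings and buffers):

    # Rough GPTQ size estimate vs FP16 (weights only)
    def gptq_size_gb(n_params: float, bits: int = 4, group_size: int = 128) -> float:
        weight_bytes = n_params * bits / 8
        overhead_bytes = (n_params / group_size) * (2 + 0.5)  # FP16 scale + 4-bit zero per group
        return (weight_bytes + overhead_bytes) / 1e9

    for name, n in [("2-7B", 7e9), ("2-13B", 13e9), ("2-70B", 70e9), ("3.1-405B", 405e9)]:
        fp16_gb = n * 2 / 1e9
        print(f"Llama {name}: FP16 ≈ {fp16_gb:.0f} GB, GPTQ 4-bit ≈ {gptq_size_gb(n):.1f} GB")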

    Inference speed (Llama 2-7B, A100)

    Precision              Tokens/sec   vs FP16
    FP16                   25 tok/s     1×
    GPTQ 4-bit (CUDA)      85 tok/s     3.4×
    GPTQ 4-bit (ExLlama)   105 tok/s    4.2×
    GPTQ 4-bit (Marlin)    120 tok/s    4.8×

    Accuracy (perplexity on WikiText-2)

    Model         FP16   GPTQ 4-bit (g=128)   Degradation
    Llama 2-7B    5.47   5.55                 +1.5%
    Llama 2-13B   4.88   4.95                 +1.4%
    Llama 2-70B   3.32   3.38                 +1.8%

    Excellent quality preservation: less than 2% perplexity degradation.
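
    For reference, a minimal sketch of how such perplexity numbers are typically measured (non-overlapping 2048-token windows; assumes a GPTQ model and tokenizer already loaded via transformers as above):

    # Approximate perplexity on the WikiText-2 test set
    import torch
    from datasets import load_dataset

    text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

    window = 2048
    losses = []
    for start in range(0, input_ids.size(1), window):
        chunk = input_ids[:, start:start + window]
        if chunk.size(1) < 2:  # need at least one next-token prediction
            break
        with torch.no_grad():
            # labels=chunk makes the model return the mean next-token negative log-likelihood
            losses.append(model(chunk, labels=chunk).loss)

    print("Perplexity:", torch.exp(torch.stack(losses).mean()).item())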

    Common patterns

    Multi-GPU deployment

    # Automatic device mapping
    model = AutoGPTQForCausalLM.from_quantized(
        "TheBloke/Llama-2-70B-GPTQ",
        device_map="auto",  # Automatically split across GPUs
        max_memory={0: "40GB", 1: "40GB"}  # Limit per GPU
    )
    
    # Manual device mapping (one entry per module; layer ranges are not supported)
    device_map = {
        "model.embed_tokens": 0,
        **{f"model.layers.{i}": 0 for i in range(40)},      # first 40 layers on GPU 0
        **{f"model.layers.{i}": 1 for i in range(40, 80)},  # last 40 layers on GPU 1
        "model.norm": 1,
        "lm_head": 1
    }
    
    model = AutoGPTQForCausalLM.from_quantized(
        model_name,
        device_map=device_map
    )
    

    CPU offloading

    # Offload overflow layers to CPU (for models larger than total GPU memory)
    model = AutoGPTQForCausalLM.from_quantized(
        "your-org/very-large-model-GPTQ",  # placeholder repo id for a very large GPTQ checkpoint
        device_map="auto",
        max_memory={
            0: "80GB",  # GPU 0
            1: "80GB",  # GPU 1
            2: "80GB",  # GPU 2
            "cpu": "200GB"  # Offload overflow to CPU
        }
    )
    

    Batch inference

    # Process multiple prompts efficiently
    # Llama tokenizers have no pad token by default; use EOS and left-padding for generation
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "left"
    
    prompts = [
        "Explain AI",
        "Explain ML",
        "Explain DL"
    ]
    
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
    
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        pad_token_id=tokenizer.eos_token_id
    )
    
    for i, output in enumerate(outputs):
        print(f"Prompt {i}: {tokenizer.decode(output, skip_special_tokens=True)}")
    

    Finding pre-quantized models

    TheBloke on HuggingFace:

    • https://huggingface.co/TheBloke
    • 1000+ models in GPTQ format
    • Multiple group sizes (32, 128)
    • Both CUDA and Marlin formats

    Search:

    # Find GPTQ models on HuggingFace
    https://huggingface.co/models?library=gptq
    
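    The same search can be done programmatically with huggingface_hub; a small sketch (filter values are illustrative):

    # List GPTQ checkpoints on the Hub programmatically
    from huggingface_hub import HfApi

    api = HfApi()
    for m in api.list_models(search="GPTQ", author="TheBloke", sort="downloads", limit=10):
        print(m.id)
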

    Download:

    from auto_gptq import AutoGPTQForCausalLM
    
    # Automatically downloads from HuggingFace
    model = AutoGPTQForCausalLM.from_quantized(
        "TheBloke/Llama-2-70B-Chat-GPTQ",
        device="cuda:0"
    )
    

    Supported models

    • LLaMA family: Llama 2, Llama 3, Code Llama
    • Mistral: Mistral 7B, Mixtral 8x7B, 8x22B
    • Qwen: Qwen, Qwen2, QwQ
    • DeepSeek: V2, V3
    • Phi: Phi-2, Phi-3
    • Yi, Falcon, BLOOM, OPT
    • 100+ models on HuggingFace

    References

    • Calibration Guide - Dataset selection, quantization process, quality optimization
    • Integration Guide - Transformers, PEFT, vLLM, TensorRT-LLM
    • Troubleshooting - Common issues, performance optimization

    Resources

    • GitHub: https://github.com/AutoGPTQ/AutoGPTQ
    • Paper: GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (arXiv:2210.17323)
    • Models: https://huggingface.co/models?library=gptq
    • Discord: https://discord.gg/autogptq