    davila7/awq-quantization
    AI & ML
    AWQ (Activation-aware Weight Quantization)

    4-bit quantization that preserves salient weights based on activation patterns, achieving roughly 2.5-3x faster inference than FP16 with minimal accuracy loss.

    When to use AWQ

    Use AWQ when:

    • Need 4-bit quantization with <5% accuracy loss
    • Deploying instruction-tuned or chat models (AWQ generalizes better)
    • Want ~2.5-3x inference speedup over FP16
    • Using vLLM for production serving
    • Have Ampere+ GPUs (A100, H100, RTX 40xx) for Marlin kernel support

    Use GPTQ instead when:

    • Need maximum ecosystem compatibility (more tools support GPTQ)
    • Working with ExLlamaV2 backend specifically
    • Have older GPUs without Marlin support

    Use bitsandbytes instead when:

    • Need zero calibration overhead (quantize on-the-fly)
    • Want to fine-tune with QLoRA
    • Prefer simpler integration
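
The guidance above can be condensed into a small decision helper. This is only a sketch: `suggest_quantization_method` and its parameters are hypothetical names, not part of any library, and the rules simply restate the three lists.

```python
def suggest_quantization_method(
    needs_qlora=False,
    wants_no_calibration=False,
    uses_exllamav2=False,
    compute_capability=(8, 0),
):
    """Return "AWQ", "GPTQ", or "bitsandbytes" following the lists above."""
    # bitsandbytes: on-the-fly quantization and QLoRA fine-tuning
    if needs_qlora or wants_no_calibration:
        return "bitsandbytes"
    # GPTQ: ExLlamaV2 backend, or GPUs older than Ampere (no Marlin kernels)
    if uses_exllamav2 or tuple(compute_capability) < (8, 0):
        return "GPTQ"
    # Otherwise AWQ is the suggested default for 4-bit inference
    return "AWQ"


print(suggest_quantization_method())                           # AWQ
print(suggest_quantization_method(compute_capability=(7, 5)))  # GPTQ
print(suggest_quantization_method(needs_qlora=True))           # bitsandbytes
```

Real deployments usually weigh more factors (serving stack, model family, accuracy targets), so treat the output as a starting point.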

    Quick start

    Installation

    # Default (Triton kernels)
    pip install autoawq
    
    # With optimized CUDA kernels + Flash Attention
    # (quotes keep shells such as zsh from expanding the brackets)
    pip install "autoawq[kernels]"
    
    # Intel CPU/XPU optimization
    pip install "autoawq[cpu]"
    

    Requirements: Python 3.8+, CUDA 11.8+, Compute Capability 7.5+
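
A quick way to check these minimums before installing is sketched below. `meets_requirements` is a hypothetical helper; the only library calls used are `torch.version.cuda`, `torch.cuda.is_available()`, and `torch.cuda.get_device_capability()`, and the PyTorch part is skipped if PyTorch or a CUDA device is missing.

```python
import sys

MIN_PYTHON = (3, 8)
MIN_CUDA = (11, 8)
MIN_CAPABILITY = (7, 5)


def meets_requirements(python, cuda, capability):
    """Compare (major, minor) tuples against the minimums listed above."""
    return (
        tuple(python) >= MIN_PYTHON
        and tuple(cuda) >= MIN_CUDA
        and tuple(capability) >= MIN_CAPABILITY
    )


try:
    import torch
except ImportError:  # PyTorch not installed: only the pure check is available
    torch = None

if torch is not None and torch.cuda.is_available() and torch.version.cuda:
    cuda = tuple(int(part) for part in torch.version.cuda.split(".")[:2])
    capability = torch.cuda.get_device_capability()  # e.g. (8, 9)
    print("Requirements met:", meets_requirements(sys.version_info[:2], cuda, capability))
```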

    Load pre-quantized model

    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer
    
    model_name = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"
    
    model = AutoAWQForCausalLM.from_quantized(
        model_name,
        fuse_layers=True  # Enable fused attention for speed
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    # Generate
    inputs = tokenizer("Explain quantum computing", return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=200)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
    

    Quantize your own model

    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer
    
    model_path = "mistralai/Mistral-7B-Instruct-v0.2"
    
    # Load model and tokenizer
    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    
    # Quantization config
    quant_config = {
        "zero_point": True,      # Use zero-point quantization
        "q_group_size": 128,     # Group size (128 recommended)
        "w_bit": 4,              # 4-bit weights
        "version": "GEMM"        # GEMM for batch, GEMV for single-token
    }
    
    # Quantize (uses pileval dataset by default)
    model.quantize(tokenizer, quant_config=quant_config)
    
    # Save
    model.save_quantized("mistral-7b-awq")
    tokenizer.save_pretrained("mistral-7b-awq")
    

    Timing: ~10-15 min for 7B, ~1 hour for 70B models.

    AWQ vs GPTQ vs bitsandbytes

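    | Feature          | AWQ                     | GPTQ           | bitsandbytes     |
    |------------------|-------------------------|----------------|------------------|
    | Speedup (4-bit)  | ~2.5-3x                 | ~2x            | ~1.5x            |
    | Accuracy loss    | <5%                     | ~5-10%         | ~5-15%           |
    | Calibration      | Minimal (128-1K tokens) | More extensive | None             |
    | Overfitting risk | Low                     | Higher         | N/A              |
    | Best for         | Production inference    | GPU inference  | Easy integration |
    | vLLM support     | Native                  | Yes            | Limited          |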

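    Key insight: AWQ observes that not all weights are equally important. It protects the ~1% of salient weight channels, identified from activation magnitudes rather than from the weights themselves, by scaling them up before quantization. This reduces quantization error while keeping every weight in 4-bit, so there is no mixed-precision overhead.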
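
The idea can be illustrated with a toy, pure-Python sketch. It is not AutoAWQ's implementation: the helper names, the tiny random layer, and the choice of one quantization group per output row are all assumptions made for illustration. Input channels are scaled by `s_j = mean|x_j| ** alpha`, the scaled weights are quantized to 4 bits, and `alpha` is grid-searched to minimize the output error. `alpha = 0` reproduces plain round-to-nearest.

```python
import random


def quantize_group(values, n_bits=4):
    """Round-to-nearest asymmetric quantization of one group, returned de-quantized."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (2 ** n_bits - 1) or 1.0  # constant group: any step works
    return [lo + round((v - lo) / step) * step for v in values]


def quantize_with_scales(weights, scales, n_bits=4):
    """Scale input channels, quantize each output row, then undo the scaling."""
    result = []
    for row in weights:
        deq = quantize_group([w * s for w, s in zip(row, scales)], n_bits)
        result.append([q / s for q, s in zip(deq, scales)])
    return result


def output_mse(weights, q_weights, acts):
    """Mean squared error between W @ x and W_q @ x over all calibration inputs."""
    total, count = 0.0, 0
    for x in acts:
        for row, q_row in zip(weights, q_weights):
            diff = sum((w - q) * xi for w, q, xi in zip(row, q_row, x))
            total += diff * diff
            count += 1
    return total / count


def search_scales(weights, acts, n_grid=20, n_bits=4):
    """Grid-search alpha in [0, 1] for per-channel scales s_j = mean|x_j| ** alpha."""
    n_in = len(weights[0])
    mean_abs = [sum(abs(x[j]) for x in acts) / len(acts) for j in range(n_in)]
    best = None
    for i in range(n_grid + 1):
        alpha = i / n_grid
        scales = [max(m, 1e-4) ** alpha for m in mean_abs]
        err = output_mse(weights, quantize_with_scales(weights, scales, n_bits), acts)
        if best is None or err < best[0]:
            best = (err, alpha, scales)
    return best


# Toy layer: 8 outputs, 16 inputs; channels 3 and 11 carry 10x larger activations.
random.seed(0)
n_out, n_in, n_samples = 8, 16, 32
salient = {3, 11}
W = [[random.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
X = [
    [random.gauss(0, 1) * (10.0 if j in salient else 1.0) for j in range(n_in)]
    for _ in range(n_samples)
]

rtn_err = output_mse(W, quantize_with_scales(W, [1.0] * n_in), X)
best_err, best_alpha, best_scales = search_scales(W, X)
print(f"round-to-nearest MSE: {rtn_err:.4f}")
print(f"activation-aware MSE: {best_err:.4f} (alpha={best_alpha:.2f})")
```

Because `alpha = 0` is part of the grid, the searched result is never worse than plain round-to-nearest on the calibration inputs. How much it improves depends on how strongly the activations are concentrated in a few channels.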

    Kernel backends

    GEMM (default, batch inference)

    quant_config = {
        "zero_point": True,
        "q_group_size": 128,
        "w_bit": 4,
        "version": "GEMM"  # Best for batch sizes > 1
    }
    

    GEMV (single-token generation)

    quant_config = {
        "zero_point": True,
        "q_group_size": 128,
        "w_bit": 4,
        "version": "GEMV"  # ~20% faster than GEMM at batch_size=1
    }
    

    Limitation: GEMV supports only batch size 1 and performs poorly with long contexts.
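
Since the kernel version is fixed at quantization time, it can help to derive the config from the batch size you expect to serve. A minimal sketch, assuming a hypothetical helper `make_quant_config` (not an AutoAWQ function) that returns the same dictionary keys used above:

```python
def make_quant_config(expected_batch_size, w_bit=4, q_group_size=128):
    """Pick GEMV for single-sequence decoding and GEMM for batched inference."""
    if expected_batch_size < 1:
        raise ValueError("expected_batch_size must be at least 1")
    version = "GEMV" if expected_batch_size == 1 else "GEMM"
    return {
        "zero_point": True,
        "q_group_size": q_group_size,
        "w_bit": w_bit,
        "version": version,
    }


print(make_quant_config(1)["version"])  # GEMV
print(make_quant_config(8)["version"])  # GEMM
```

If the expected workload uses long prompts, GEMM is likely the safer choice even at batch size 1, given the GEMV limitation noted above.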

    Marlin (Ampere+ GPUs)

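    from transformers import AwqConfig, AutoModelForCausalLM
    
    # Caution: the AwqConfig versions documented by Transformers include
    # "gemm", "gemv", and "exllama". Confirm that your installed version
    # accepts "marlin" before relying on this snippet.
    config = AwqConfig(
        bits=4,
        version="marlin"  # 2x faster on A100/H100
    )
    
    model = AutoModelForCausalLM.from_pretrained(
        "TheBloke/Mistral-7B-AWQ",
        quantization_config=config
    )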
    

    Requirements: Compute Capability 8.0+ (A100, H100, RTX 40xx)

    ExLlamaV2 (AMD compatible)

    from transformers import AwqConfig
    
    config = AwqConfig(
        bits=4,
        version="exllama"  # Faster prefill, AMD GPU support
    )
    

    HuggingFace Transformers integration

    Direct loading

    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    model = AutoModelForCausalLM.from_pretrained(
        "TheBloke/zephyr-7B-alpha-AWQ",
        device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained("TheBloke/zephyr-7B-alpha-AWQ")
    

    Fused modules (recommended)

    from transformers import AwqConfig, AutoModelForCausalLM
    
    config = AwqConfig(
        bits=4,
        fuse_max_seq_len=512,  # Max sequence length for fusing
        do_fuse=True           # Enable fused attention/MLP
    )
    
    model = AutoModelForCausalLM.from_pretrained(
        "TheBloke/Mistral-7B-OpenOrca-AWQ",
        quantization_config=config
    )
    

    Note: Fused modules cannot be combined with other optimizations such as FlashAttention-2.

    vLLM integration

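    from vllm import LLM, SamplingParams
    
    # vLLM reads the quantization method from the model config;
    # quantization="awq" makes the choice explicit
    llm = LLM(
        model="TheBloke/Llama-2-7B-AWQ",
        quantization="awq",
        dtype="half"
    )
    
    sampling = SamplingParams(temperature=0.7, max_tokens=200)
    outputs = llm.generate(["Explain AI"], sampling)
    
    # Each result holds one or more completions; print the first
    for output in outputs:
        print(output.outputs[0].text)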
    

    Performance benchmarks

    Memory reduction

    | Model       | FP16   | AWQ 4-bit | Reduction |
    |-------------|--------|-----------|-----------|
    | Mistral 7B  | 14 GB  | 5.5 GB    | 2.5x      |
    | Llama 2-13B | 26 GB  | 10 GB     | 2.6x      |
    | Llama 2-70B | 140 GB | 35 GB     | 4x        |
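
The FP16 column follows directly from parameter count times two bytes. A rough estimate for the quantized weights is sketched below, under stated assumptions: each group of 128 weights stores one FP16 scale and one 4-bit zero point, and parameter counts are rounded (7B, 13B, 70B). Both function names are hypothetical.

```python
def awq_bits_per_weight(w_bit=4, q_group_size=128, scale_bits=16, zero_bits=4):
    """Effective bits per weight, including per-group scale and zero point."""
    return w_bit + (scale_bits + zero_bits) / q_group_size


def weight_memory_gb(n_params, bits_per_weight):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


for name, n_params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    fp16 = weight_memory_gb(n_params, 16)
    awq = weight_memory_gb(n_params, awq_bits_per_weight())
    print(f"{name}: FP16 {fp16:.1f} GB, AWQ 4-bit weights ~{awq:.1f} GB")
```

This estimate is a lower bound. Measured figures such as those in the table are typically higher, likely because embeddings and the output head stay in FP16 and the runtime also allocates activations and KV cache.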

    Inference speed (RTX 4090)

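    | Model            | Prefill (tok/s) | Decode (tok/s) | Memory   |
    |------------------|-----------------|----------------|----------|
    | Mistral 7B GEMM  | 3,897           | 114            | 5.55 GB  |
    | TinyLlama 1B GEMV | 5,179          | 431            | 2.10 GB  |
    | Llama 2-13B GEMM | 2,279           | 74             | 10.28 GB |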

    Accuracy (perplexity)

    | Model      | FP16 | AWQ 4-bit | Degradation |
    |------------|------|-----------|-------------|
    | Llama 3 8B | 8.20 | 8.48      | +3.4%       |
    | Mistral 7B | 5.25 | 5.42      | +3.2%       |
    | Qwen2 72B  | 4.85 | 4.95      | +2.1%       |
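
The degradation column is the relative increase in perplexity, and perplexity itself is the exponential of the mean per-token negative log-likelihood. The sketch below recomputes the column from the table values; the function names are illustrative only.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def degradation_pct(fp16_ppl, awq_ppl):
    """Relative perplexity increase of the quantized model, in percent."""
    return (awq_ppl - fp16_ppl) / fp16_ppl * 100.0


rows = [("Llama 3 8B", 8.20, 8.48), ("Mistral 7B", 5.25, 5.42), ("Qwen2 72B", 4.85, 4.95)]
for name, fp16, awq in rows:
    print(f"{name}: +{degradation_pct(fp16, awq):.1f}%")
# Llama 3 8B: +3.4%
# Mistral 7B: +3.2%
# Qwen2 72B: +2.1%
```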

    Custom calibration data

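    from datasets import load_dataset
    
    # Use a custom dataset for domain-specific models.
    # Passing a list of strings avoids depending on how a dataset name is resolved
    # (wikitext, for example, needs an explicit config name).
    def load_wikitext():
        data = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
        # Keep only non-empty passages with more than 20 words
        return [t for t in data["text"] if t.strip() and len(t.split(" ")) > 20]
    
    model.quantize(
        tokenizer,
        quant_config=quant_config,
        calib_data=load_wikitext(),  # List of strings
        max_calib_samples=256,       # More samples usually improve accuracy
        max_calib_seq_len=512        # Maximum sequence length per sample
    )
    
    # Or provide your own samples (placeholders shown; supply many real passages)
    calib_samples = [
        "Your domain-specific text here...",
        "More examples from your use case...",
    ]
    model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_samples)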
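
Raw domain text usually needs light cleaning before it is passed as `calib_data`. A minimal sketch, assuming a hypothetical helper `prepare_calib_samples` and arbitrary thresholds (20 words, 256 samples) that you would tune for your data:

```python
def prepare_calib_samples(texts, min_words=20, max_samples=256):
    """Normalize whitespace, drop short or duplicate passages, cap the count."""
    seen = set()
    samples = []
    for text in texts:
        cleaned = " ".join(text.split())  # collapse runs of whitespace
        if len(cleaned.split(" ")) < min_words or cleaned in seen:
            continue
        seen.add(cleaned)
        samples.append(cleaned)
        if len(samples) == max_samples:
            break
    return samples


raw = ["too short", "token " * 25, "token " * 25, "   "]
print(len(prepare_calib_samples(raw)))  # 1
```

The resulting list of strings can then be passed as `calib_data`.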
    

    Multi-GPU deployment

    from awq import AutoAWQForCausalLM
    
    model = AutoAWQForCausalLM.from_quantized(
        "TheBloke/Llama-2-70B-AWQ",
        device_map="auto",  # Auto-split across GPUs
        max_memory={0: "40GB", 1: "40GB"}  # Per-GPU memory cap
    )
    

    Supported models

    35+ architectures including:

    • Llama family: Llama 2/3, Code Llama, Mistral, Mixtral
    • Qwen: Qwen, Qwen2, Qwen2.5-VL
    • Others: Falcon, MPT, Phi, Yi, DeepSeek, Gemma
    • Multimodal: LLaVA, LLaVA-Next, Qwen2-VL

    Common issues

    CUDA OOM during quantization:

    # Reduce the number of calibration samples
    model.quantize(tokenizer, quant_config=quant_config, max_calib_samples=64)
    

    Slow inference:

    # Enable fused layers
    model = AutoAWQForCausalLM.from_quantized(model_name, fuse_layers=True)
    

    AMD GPU support:

    # Use ExLlama backend
    from transformers import AwqConfig
    
    config = AwqConfig(bits=4, version="exllama")
    

    Deprecation notice

    AutoAWQ is officially deprecated. For new projects, consider:

    • vLLM llm-compressor: https://github.com/vllm-project/llm-compressor
    • MLX-LM: For Mac devices with Apple Silicon

    Existing quantized models remain usable.

    References

    • Paper: AWQ: Activation-aware Weight Quantization (arXiv:2306.00978) - MLSys 2024 Best Paper
    • GitHub: https://github.com/casper-hansen/AutoAWQ
    • MIT Han Lab: https://github.com/mit-han-lab/llm-awq
    • Models: https://huggingface.co/models?library=awq