    HQQ - Half-Quadratic Quantization

    Fast, calibration-free weight quantization supporting 8/4/3/2/1-bit precision with multiple optimized backends.

    When to use HQQ

    Use HQQ when:

    • Quantizing models without calibration data (no dataset needed)
    • Quantizing quickly (minutes rather than hours for GPTQ/AWQ)
    • Deploying with vLLM or HuggingFace Transformers
    • Fine-tuning quantized models with LoRA/PEFT
    • Experimenting with extreme quantization (2-bit, 1-bit)

    Key advantages:

    • No calibration: Quantize any model instantly without sample data
    • Multiple backends: PyTorch, ATEN, TorchAO, Marlin, BitBlas for optimized inference
    • Flexible precision: 8/4/3/2/1-bit with configurable group sizes (rough memory math below)
    • Framework integration: Native HuggingFace and vLLM support
    • PEFT compatible: Fine-tune quantized models with LoRA
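
    To make the flexible-precision point concrete, here is a back-of-the-envelope sketch of weight memory at different bit widths. It assumes the per-group scale and zero-point are stored in fp16 (about 32 extra bits per group); HQQ's actual metadata handling can differ, so treat the numbers as rough estimates.

    def approx_weight_gb(n_params, nbits, group_size):
        # bits per weight = packed weight bits + per-group metadata, amortized
        bits_per_weight = nbits + 32 / group_size
        return n_params * bits_per_weight / 8 / 1e9

    # An ~8B-parameter model is ~16 GB in fp16; quantized it is roughly:
    for nbits, g in [(8, 128), (4, 64), (2, 16)]:
        print(f"{nbits}-bit, group_size={g}: ~{approx_weight_gb(8e9, nbits, g):.1f} GB")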

    Consider alternatives instead:

    • AWQ: calibration-based accuracy for production serving
    • GPTQ: maximum accuracy when calibration data is available
    • bitsandbytes: simple 8-bit/4-bit loading without custom backends
    • llama.cpp/GGUF: CPU inference and Apple Silicon deployment

    Quick start

    Installation

    pip install hqq

    # Optional backend packages, installed separately
    pip install torchao     # TorchAO int4 backend
    pip install bitblas     # BitBlas backend
    pip install gemlite     # GemLite backend
    

    Basic quantization

    from hqq.core.quantize import BaseQuantizeConfig, HQQLinear
    import torch
    import torch.nn as nn

    # Configure quantization
    config = BaseQuantizeConfig(
        nbits=4,           # 4-bit quantization
        group_size=64,     # Weights per quantization group
        axis=1             # Axis along which grouping is performed (0 or 1)
    )

    # Wrap and quantize a linear layer
    linear = nn.Linear(4096, 4096)
    hqq_linear = HQQLinear(linear, config)

    # Use normally; the input must match the layer's compute dtype and device
    # (float16 on CUDA by default)
    input_tensor = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
    output = hqq_linear(input_tensor)
    

    Quantize full model with HuggingFace

    from transformers import AutoModelForCausalLM, HqqConfig
    
    # Configure HQQ
    quantization_config = HqqConfig(
        nbits=4,
        group_size=64,
        axis=1
    )
    
    # Load and quantize
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.1-8B",
        quantization_config=quantization_config,
        device_map="auto"
    )
    
    # Model is quantized and ready to use
    

    Core concepts

    Quantization configuration

    HQQ uses BaseQuantizeConfig to define quantization parameters:

    from hqq.core.quantize import BaseQuantizeConfig
    
    # Standard 4-bit config
    config_4bit = BaseQuantizeConfig(
        nbits=4,           # Bits per weight (8/4/3/2/1 supported)
        group_size=64,     # Weights per quantization group
        axis=1             # Axis along which grouping is performed (0 or 1)
    )
    
    # Aggressive 2-bit config
    config_2bit = BaseQuantizeConfig(
        nbits=2,
        group_size=16,     # Smaller groups for low-bit
        axis=1
    )
    
    # Mixed precision per layer type
    layer_configs = {
        "self_attn.q_proj": BaseQuantizeConfig(nbits=4, group_size=64),
        "self_attn.k_proj": BaseQuantizeConfig(nbits=4, group_size=64),
        "self_attn.v_proj": BaseQuantizeConfig(nbits=4, group_size=64),
        "mlp.gate_proj": BaseQuantizeConfig(nbits=2, group_size=32),
        "mlp.up_proj": BaseQuantizeConfig(nbits=2, group_size=32),
        "mlp.down_proj": BaseQuantizeConfig(nbits=4, group_size=64),
    }
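
    The per-layer dict above only defines the configs. A minimal sketch of applying it with hqq's model-level helper (AutoHQQHFModel.quantize_model, assuming a Llama-style model loaded in fp16 with transformers) might look like this:

    import torch
    from transformers import AutoModelForCausalLM
    from hqq.models.hf.base import AutoHQQHFModel

    # Load the full-precision model, then quantize each linear layer according
    # to the matching entry in layer_configs
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.1-8B", torch_dtype=torch.float16
    )
    AutoHQQHFModel.quantize_model(
        model,
        quant_config=layer_configs,   # per-layer configs defined above
        compute_dtype=torch.float16,
        device="cuda",
    )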
    

    HQQLinear layer

    The core quantized layer that replaces nn.Linear:

    from hqq.core.quantize import BaseQuantizeConfig, HQQLinear
    import torch

    # Create a quantized layer
    config = BaseQuantizeConfig(nbits=4, group_size=64)
    linear = torch.nn.Linear(4096, 4096)
    hqq_layer = HQQLinear(linear, config)

    # Access quantized weights and metadata
    W_q = hqq_layer.W_q              # Packed quantized weights
    scale = hqq_layer.meta["scale"]  # Scale factors (stored in the meta dict)
    zero = hqq_layer.meta["zero"]    # Zero points (stored in the meta dict)

    # Dequantize for inspection
    W_dequant = hqq_layer.dequantize()
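
    As a usage example (not part of the original skill), one way to sanity-check quantization error is to compare the dequantized weights against a saved copy of the originals:

    # Illustrative sketch: mean absolute quantization error on a fresh layer.
    # The original weights are cloned first because HQQLinear may free the
    # source layer's weights after quantization.
    ref = torch.nn.Linear(1024, 1024)
    W_ref = ref.weight.data.clone()
    q_layer = HQQLinear(ref, BaseQuantizeConfig(nbits=4, group_size=64))
    err = (q_layer.dequantize().cpu().float() - W_ref).abs().mean()
    print(f"Mean abs quantization error: {err:.5f}")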
    

    Backends

    HQQ supports multiple inference backends for different hardware:

    from hqq.core.quantize import HQQLinear, HQQBackend

    # Native backends, selected globally via the HQQBackend enum
    HQQLinear.set_backend(HQQBackend.PYTORCH)          # Pure PyTorch (default)
    HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE)  # torch.compile optimized
    HQQLinear.set_backend(HQQBackend.ATEN)             # Custom CUDA/ATEN kernels

    # External kernels (torchao_int4, gemlite, bitblas, marlin) are enabled by
    # patching an already-quantized model
    from hqq.utils.patching import prepare_for_inference
    prepare_for_inference(model, backend="torchao_int4")  # model: quantized model from above
    

    Backend selection guide:

    Backend          Best for               Requirements
    pytorch          Compatibility          Any GPU
    pytorch_compile  Moderate speedup       torch>=2.0
    aten             Good balance           CUDA GPU
    torchao_int4     4-bit inference        torchao installed
    marlin           Maximum 4-bit speed    Ampere+ GPU
    bitblas          Flexible bit-widths    bitblas installed
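
    A small illustrative sketch of choosing a backend at runtime based on GPU capability, assuming an already-quantized transformers model named model (the compute-capability >= 8.0 check for Ampere-class kernels is an assumption, not an HQQ requirement):

    import torch
    from hqq.core.quantize import HQQLinear, HQQBackend

    if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8:
        # Ampere or newer: try an external 4-bit kernel via patching
        from hqq.utils.patching import prepare_for_inference
        prepare_for_inference(model, backend="torchao_int4")
    else:
        # Fall back to the compiled PyTorch backend
        HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE)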

    HuggingFace integration

    Load pre-quantized models

    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    # Load HQQ-quantized model from Hub
    model = AutoModelForCausalLM.from_pretrained(
        "mobiuslabsgmbh/Llama-3.1-8B-HQQ-4bit",
        device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
    
    # Use normally
    inputs = tokenizer("Hello, world!", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=50)
    

    Quantize and save

    from transformers import AutoModelForCausalLM, HqqConfig
    
    # Quantize
    config = HqqConfig(nbits=4, group_size=64)
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.1-8B",
        quantization_config=config,
        device_map="auto"
    )
    
    # Save quantized model
    model.save_pretrained("./llama-8b-hqq-4bit")
    
    # Push to Hub
    model.push_to_hub("my-org/Llama-3.1-8B-HQQ-4bit")
    

    Mixed precision quantization

    from transformers import HqqConfig

    # Different precision per layer type: attention projections at 4-bit,
    # MLP projections at 2-bit for extra memory savings
    q4 = {"nbits": 4, "group_size": 64}
    q2 = {"nbits": 2, "group_size": 32}

    config = HqqConfig(
        dynamic_config={
            "self_attn.q_proj": q4,
            "self_attn.k_proj": q4,
            "self_attn.v_proj": q4,
            "self_attn.o_proj": q4,
            "mlp.gate_proj": q2,
            "mlp.up_proj": q2,
            "mlp.down_proj": q2,
        }
    )
    

    vLLM integration

    Serve HQQ models with vLLM

    from vllm import LLM, SamplingParams
    
    # Load HQQ-quantized model
    llm = LLM(
        model="mobiuslabsgmbh/Llama-3.1-8B-HQQ-4bit",
        quantization="hqq",
        dtype="float16"
    )
    
    # Generate
    sampling_params = SamplingParams(temperature=0.7, max_tokens=100)
    outputs = llm.generate(["What is machine learning?"], sampling_params)
    

    vLLM with custom HQQ config

    from vllm import LLM
    
    llm = LLM(
        model="meta-llama/Llama-3.1-8B",
        quantization="hqq",
        quantization_config={
            "nbits": 4,
            "group_size": 64
        }
    )
    

    PEFT/LoRA fine-tuning

    Fine-tune quantized models

    from transformers import AutoModelForCausalLM, HqqConfig
    from peft import LoraConfig, get_peft_model
    
    # Load quantized model
    quant_config = HqqConfig(nbits=4, group_size=64)
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.1-8B",
        quantization_config=quant_config,
        device_map="auto"
    )
    
    # Apply LoRA
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM"
    )
    
    model = get_peft_model(model, lora_config)
    
    # Train normally with Trainer or custom loop
    

    QLoRA-style training

    from transformers import TrainingArguments, Trainer
    
    training_args = TrainingArguments(
        output_dir="./hqq-lora-output",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        num_train_epochs=3,
        fp16=True,
        logging_steps=10,
        save_strategy="epoch"
    )
    
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        data_collator=data_collator
    )
    
    trainer.train()
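
    The snippet above assumes train_dataset and data_collator already exist. A minimal sketch of building them follows; the wikitext dataset and 512-token truncation are illustrative choices, not part of HQQ:

    from datasets import load_dataset
    from transformers import AutoTokenizer, DataCollatorForLanguageModeling

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
    tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers have no pad token

    # Tokenize a small slice of a text corpus for causal LM fine-tuning
    raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
    train_dataset = raw.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True,
        remove_columns=raw.column_names,
    )
    data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)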
    

    Quantization workflows

    Workflow 1: Quick model compression

    from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig
    
    # 1. Configure quantization
    config = HqqConfig(nbits=4, group_size=64)
    
    # 2. Load and quantize (no calibration needed!)
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.1-8B",
        quantization_config=config,
        device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
    
    # 3. Verify quality
    prompt = "The capital of France is"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0]))
    
    # 4. Save
    model.save_pretrained("./llama-8b-hqq")
    tokenizer.save_pretrained("./llama-8b-hqq")
    

    Workflow 2: Optimize for inference speed

    from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

    # 1. Quantize
    config = HqqConfig(nbits=4, group_size=64)
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.1-8B",
        quantization_config=config,
        device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

    # 2. Enable a fast backend (external 4-bit kernels are applied by patching)
    from hqq.utils.patching import prepare_for_inference
    prepare_for_inference(model, backend="torchao_int4")  # or "marlin" on Ampere+

    # 3. Compile for additional speedup
    import torch
    model = torch.compile(model)

    # 4. Benchmark
    import time
    inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
    start = time.time()
    for _ in range(10):
        model.generate(**inputs, max_new_tokens=100)
    print(f"Avg time: {(time.time() - start) / 10:.2f}s")
    

    Best practices

    1. Start with 4-bit: Best quality/size tradeoff for most models
    2. Use group_size=64: Good balance; smaller for extreme quantization
    3. Choose backend wisely: Marlin for 4-bit Ampere+, TorchAO for flexibility
    4. Verify quality: Always test generation quality after quantization (see the quick check below)
    5. Mixed precision: Keep attention at higher precision, compress MLP more
    6. PEFT training: Use LoRA r=16-32 for good fine-tuning results
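
    A minimal quality-check sketch, assuming model and tokenizer from the workflows above; the sample text and the use of perplexity as a proxy are illustrative choices, not part of HQQ itself:

    import torch

    def quick_ppl(model, tokenizer, text):
        # Perplexity of the quantized model on a short text; compare against
        # the full-precision baseline to spot large regressions
        enc = tokenizer(text, return_tensors="pt").to(model.device)
        with torch.no_grad():
            loss = model(**enc, labels=enc["input_ids"]).loss
        return torch.exp(loss).item()

    sample = "The quick brown fox jumps over the lazy dog. " * 20
    print(f"Perplexity after quantization: {quick_ppl(model, tokenizer, sample):.2f}")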

    Common issues

    Out of memory during quantization:

    # Quantize as the checkpoint is loaded, filling devices sequentially
    from transformers import AutoModelForCausalLM, HqqConfig

    config = HqqConfig(nbits=4, group_size=64)
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.1-8B",
        quantization_config=config,
        device_map="sequential"  # fill GPUs in order instead of balancing
    )
    

    Slow inference:

    # Switch to an optimized backend (external kernels need the matching package)
    from hqq.utils.patching import prepare_for_inference
    prepare_for_inference(model, backend="torchao_int4")  # e.g. "marlin" on Ampere+ GPUs

    # Or compile
    import torch
    model = torch.compile(model, mode="reduce-overhead")
    

    Poor quality at 2-bit:

    # Use smaller group size
    config = BaseQuantizeConfig(
        nbits=2,
        group_size=16,  # Smaller groups help at low bits
        axis=1
    )
    

    References

    • Advanced Usage - Custom backends, mixed precision, optimization
    • Troubleshooting - Common issues, debugging, benchmarks

    Resources

    • Repository: https://github.com/mobiusml/hqq
    • Paper: Half-Quadratic Quantization
    • HuggingFace Models: https://huggingface.co/mobiuslabsgmbh
    • Version: 0.2.0+
    • License: Apache 2.0