Smithery Logo
MCPsSkillsDocsPricing
Login
Smithery Logo

Accelerating the Agent Economy

Resources

DocumentationPrivacy PolicySystem Status

Company

PricingAboutBlog

Connect

© 2026 Smithery. All rights reserved.

    itsmostafa

    qlora

    itsmostafa/qlora
    AI & ML
    11

    About

    SKILL.md

    Install

    Install via Skills CLI

    or add to your agent
    • Claude Code
      Claude Code
    • Codex
      Codex
    • OpenClaw
      OpenClaw
    • Cursor
      Cursor
    • Amp
      Amp
    • GitHub Copilot
      GitHub Copilot
    • Gemini CLI
      Gemini CLI
    • Kilo Code
      Kilo Code
    • Junie
      Junie
    • Replit
      Replit
    • Windsurf
      Windsurf
    • Cline
      Cline
    • Continue
      Continue
    • OpenCode
      OpenCode
    • OpenHands
      OpenHands
    • Roo Code
      Roo Code
    • Augment
      Augment
    • Goose
      Goose
    • Trae
      Trae
    • Zencoder
      Zencoder
    • Antigravity
      Antigravity
    ├─
    ├─
    └─

    About

    Memory-efficient fine-tuning with 4-bit quantization and LoRA adapters. Use when fine-tuning large models (7B+) on consumer GPUs, when VRAM is limited, or when standard LoRA still exceeds memory...

    SKILL.md

    QLoRA: Quantized Low-Rank Adaptation

    QLoRA enables fine-tuning of large language models on consumer GPUs by combining 4-bit quantization with LoRA adapters. A 65B model can be fine-tuned on a single 48GB GPU while matching 16-bit fine-tuning performance.

    Prerequisites: This skill assumes familiarity with LoRA. See the lora skill for LoRA fundamentals (LoraConfig, target_modules, training patterns).

    Table of Contents

    • Core Innovations
    • BitsAndBytesConfig Deep Dive
    • Memory Requirements
    • Complete Training Example
    • Inference and Merging
    • Troubleshooting
    • Best Practices
    • References

    Core Innovations

    QLoRA introduces three techniques that reduce memory usage without sacrificing performance:

    4-bit NormalFloat (NF4)

    NF4 is an information-theoretically optimal quantization data type for normally distributed weights. Neural network weights are typically normally distributed, making NF4 more efficient than standard 4-bit floats.

    Storage: 4-bit NF4 (quantized weights)
    Compute: 16-bit BF16 (dequantized for forward/backward pass)
    

    The key insight: weights are stored in 4-bit but dequantized to bf16 for computation. Only the frozen base model is quantized; LoRA adapters remain in full precision.

    NF4 vs FP4:

    Quantization Description Use Case
    nf4 Normalized Float 4-bit, optimal for normal distributions Default, recommended
    fp4 Standard 4-bit float Legacy, rarely needed

    Double Quantization

    Standard quantization requires storing scaling constants (typically fp32) for each quantization block. Double quantization quantizes these constants too:

    First quantization:  weights → 4-bit + fp32 scaling constants
    Double quantization: scaling constants → 8-bit + fp32 second-level constants
    

    This saves approximately 0.37 bits per parameter—significant for billion-parameter models:

    • 7B model: ~325 MB savings
    • 70B model: ~3.2 GB savings

    Paged Optimizers

    During training, gradient checkpointing can cause memory spikes when processing long sequences. Paged optimizers use NVIDIA unified memory to automatically transfer optimizer states between GPU and CPU:

    Normal training: OOM on memory spike
    Paged optimizers: GPU ↔ CPU transfer handles spikes gracefully
    

    This is handled automatically by bitsandbytes when using 4-bit training.

    BitsAndBytesConfig Deep Dive

    All Parameters Explained

    from transformers import BitsAndBytesConfig
    import torch
    
    bnb_config = BitsAndBytesConfig(
        # Core 4-bit settings
        load_in_4bit=True,              # Enable 4-bit quantization
        bnb_4bit_quant_type="nf4",      # "nf4" (recommended) or "fp4"
    
        # Double quantization
        bnb_4bit_use_double_quant=True, # Quantize the quantization constants
    
        # Compute precision
        bnb_4bit_compute_dtype=torch.bfloat16,  # Dequantize to this dtype for compute
    
        # Optional: specific storage type (usually auto-detected)
        bnb_4bit_quant_storage=torch.uint8,     # Storage dtype for quantized weights
    )
    

    Compute Dtype Selection

    Dtype Hardware Notes
    torch.bfloat16 Ampere+ (RTX 30xx, A100) Recommended, faster
    torch.float16 Older GPUs (V100, RTX 20xx) Use if bf16 not supported
    torch.float32 Any Slower, only for debugging

    Check bf16 support:

    import torch
    print(torch.cuda.is_bf16_supported())  # True on Ampere+
    

    Comparison: Quantization Options

    # Recommended: NF4 + double quant + bf16
    optimal_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    
    # Fallback when bf16 is unsupported
    fp16_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.float16,  # use when bf16 is unsupported or slower
    )
    
    # 8-bit alternative (less compression, sometimes more stable)
    eight_bit_config = BitsAndBytesConfig(
        load_in_8bit=True,
    )
    

    Memory Requirements

    Model Size Full Fine-tuning LoRA (16-bit) QLoRA (4-bit)
    7B ~60 GB ~16 GB ~6 GB
    13B ~104 GB ~28 GB ~10 GB
    34B ~272 GB ~75 GB ~20 GB
    70B ~560 GB ~160 GB ~48 GB

    Notes:

    • QLoRA memory includes model + optimizer states + activations
    • Actual usage varies with batch size, sequence length, and gradient checkpointing
    • Add ~20% buffer for safe operation

    GPU Recommendations

    GPU VRAM Max Model Size (QLoRA)
    8 GB 7B (tight)
    16 GB 7-13B
    24 GB 13-34B
    48 GB 34-70B
    80 GB 70B+ comfortably

    Complete Training Example

    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        BitsAndBytesConfig,
        TrainingArguments,
    )
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    from trl import SFTTrainer, SFTConfig
    from datasets import load_dataset
    import torch
    
    # 1. Quantization config
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    
    # 2. Load quantized model
    model_name = "meta-llama/Llama-3.1-8B"
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
        dtype="auto",
        attn_implementation="flash_attention_2",  # Optional: faster attention
    )
    
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"
    
    # 3. Prepare for k-bit training (critical step!)
    model = prepare_model_for_kbit_training(model)
    
    # 4. LoRA config (see lora skill for parameter details)
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )
    
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()
    
    # 5. Dataset
    dataset = load_dataset("tatsu-lab/alpaca", split="train[:1000]")
    
    def format_example(example):
        if example["input"]:
            return {"text": f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"}
        return {"text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"}
    
    dataset = dataset.map(format_example)
    
    # 6. Training
    sft_config = SFTConfig(
        output_dir="./qlora-output",
        max_length=512,
        dataset_text_field="text",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        bf16=True,
        logging_steps=10,
        save_steps=100,
        gradient_checkpointing=True,
        gradient_checkpointing_kwargs={"use_reentrant": False},
        optim="paged_adamw_8bit",  # Paged optimizer for memory efficiency
    )
    
    trainer = SFTTrainer(
        model=model,
        args=sft_config,
        train_dataset=dataset,
        processing_class=tokenizer,
    )
    
    trainer.train()
    
    # 7. Save adapter
    model.save_pretrained("./qlora-adapter")
    tokenizer.save_pretrained("./qlora-adapter")
    

    Inference and Merging

    Inference with Quantized Model

    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import PeftModel
    import torch
    
    model_name = "meta-llama/Llama-3.1-8B"
    
    # Load quantized base model
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    
    base_model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
        dtype="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    # Load adapter
    model = PeftModel.from_pretrained(base_model, "./qlora-adapter")
    model.eval()
    
    # Generate
    inputs = tokenizer("### Instruction:\nExplain quantum computing.\n\n### Response:\n", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
    

    Merging to Full Precision

    To merge QLoRA adapters into a full-precision model (for deployment without bitsandbytes):

    from transformers import AutoModelForCausalLM
    from peft import PeftModel
    import torch
    
    # Load base model in full precision (on CPU to avoid OOM)
    base_model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.1-8B",
        dtype=torch.bfloat16,
        device_map="cpu",
    )
    
    # Load adapter
    model = PeftModel.from_pretrained(base_model, "./qlora-adapter")
    
    # Merge and unload
    merged_model = model.merge_and_unload()
    
    # Save merged model
    merged_model.save_pretrained("./merged-model")
    

    Note: Merging requires enough RAM to hold the full-precision model. For 70B models, this means ~140GB RAM.

    Troubleshooting

    CUDA Version Issues

    # Check CUDA version
    nvcc --version
    python -c "import torch; print(torch.version.cuda)"
    
    # bitsandbytes requires CUDA 11.7+
    # If version mismatch, reinstall:
    pip uninstall bitsandbytes
    pip install bitsandbytes --upgrade
    

    "cannot find libcudart" or Missing Library Errors

    # Find CUDA installation
    find /usr -name "libcudart*" 2>/dev/null
    
    # Set environment variable
    export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
    
    # Or for conda:
    export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
    

    Slow Training

    Common cause: compute dtype mismatch

    # Check if model is using expected dtype
    for name, param in model.named_parameters():
        if param.requires_grad:
            print(f"{name}: {param.dtype}")
            break  # All LoRA params should match
    
    # Ensure bf16 is used in training args if BitsAndBytesConfig uses bf16
    # Mismatch causes constant dtype conversions
    

    Out of Memory

    # 1. Enable gradient checkpointing
    model.gradient_checkpointing_enable()
    
    # 2. Reduce batch size, increase accumulation
    per_device_train_batch_size = 1
    gradient_accumulation_steps = 16
    
    # 3. Use paged optimizer
    optim = "paged_adamw_8bit"
    
    # 4. Reduce sequence length
    max_length = 256
    
    # 5. Target fewer modules
    target_modules = ["q_proj", "v_proj"]  # Minimal set
    

    Model Loads But Training Fails

    # Ensure prepare_model_for_kbit_training is called
    from peft import prepare_model_for_kbit_training
    model = prepare_model_for_kbit_training(model)  # Don't skip this!
    
    # Enable input gradients if needed
    model.enable_input_require_grads()
    

    Best Practices

    1. Always use prepare_model_for_kbit_training: This enables gradient computation through the frozen quantized layers

    2. Match compute dtype with training precision: If bnb_4bit_compute_dtype=torch.bfloat16, use bf16=True in training args

    3. Use paged optimizers for large models: optim="paged_adamw_8bit" or "paged_adamw_32bit" handles memory spikes

    4. Start with NF4 + double quantization: This is the recommended default; only change if debugging

    5. Gradient checkpointing is essential: Always enable for QLoRA training to fit larger batch sizes

    6. Test inference before long training runs: Load the model and generate a few tokens to catch configuration issues early

    7. Monitor GPU memory: Use nvidia-smi or torch.cuda.memory_summary() to track actual usage

    8. Consider 8-bit for unstable training: If 4-bit training shows instability, try load_in_8bit=True as a middle ground

    9. Use current TRL config fields: Put max_length and dataset_text_field in SFTConfig, and pass the tokenizer or processor as processing_class.

    References

    • Transformers bitsandbytes quantization
    • TRL SFTTrainer
    • PEFT LoRA developer guide
    Recommended Servers
    Memory Tool
    Memory Tool
    Parallel Web Search
    Parallel Web Search
    Maximum Sats
    Repository
    itsmostafa/llm-engineering-skills
    Files