
    PEFT (Parameter-Efficient Fine-Tuning)

    Fine-tune LLMs by training <1% of parameters using LoRA, QLoRA, and 25+ adapter methods.

    When to use PEFT

    Use PEFT/LoRA when:

    • Fine-tuning 7B-70B models on a single GPU (consumer RTX 4090 up to a datacenter A100)
    • Need to train <1% of parameters (megabyte-scale adapters vs a 14 GB+ full model)
    • Want fast iteration with multiple task-specific adapters
    • Deploying multiple fine-tuned variants from one base model

    Use QLoRA (PEFT + quantization) when:

    • Fine-tuning 30B-class models on a single 24 GB GPU (70B-class needs ~48 GB even in 4-bit)
    • Memory is the primary constraint
    • Can accept ~5% quality trade-off vs full fine-tuning

    Use full fine-tuning instead when:

    • Training small models (<1B parameters)
    • Need maximum quality and have compute budget
    • Significant domain shift requires updating all weights

    Quick start

    Installation

    # Basic installation
    pip install peft
    
    # With quantization support (recommended)
    pip install peft bitsandbytes
    
    # Full stack
    pip install peft transformers accelerate bitsandbytes datasets
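    
    To confirm the install picked up a recent release (the "all-linear" shortcut used later needs PEFT 0.8.0+):
    
    python -c "import peft; print(peft.__version__)"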
    

    LoRA fine-tuning (standard)

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
    from peft import get_peft_model, LoraConfig, TaskType
    from datasets import load_dataset
    
    # Load base model
    model_name = "meta-llama/Llama-3.1-8B"
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    
    # LoRA configuration
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=16,                          # Rank (8-64, higher = more capacity)
        lora_alpha=32,                 # Scaling factor (typically 2*r)
        lora_dropout=0.05,             # Dropout for regularization
        target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # Attention layers
        bias="none"                    # Don't train biases
    )
    
    # Apply LoRA
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()
    # Output: trainable params: 13,631,488 || all params: 8,043,307,008 || trainable%: 0.17%
    
    # Prepare dataset
    dataset = load_dataset("databricks/databricks-dolly-15k", split="train")
    
    def tokenize(example):
        text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}"
        return tokenizer(text, truncation=True, max_length=512, padding="max_length")
    
    tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)
    
    # Training
    training_args = TrainingArguments(
        output_dir="./lora-llama",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=10,
        save_strategy="epoch"
    )
    
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized,
        # tokenizer output is Python lists, so build tensors with torch.tensor
        data_collator=lambda data: {
            "input_ids": torch.tensor([f["input_ids"] for f in data]),
            "attention_mask": torch.tensor([f["attention_mask"] for f in data]),
            "labels": torch.tensor([f["input_ids"] for f in data]),  # causal LM: labels mirror inputs
        }
    )
    
    trainer.train()
    
    # Save adapter only (tens of MB vs ~16 GB for the full model)
    model.save_pretrained("./lora-llama-adapter")
    

    QLoRA fine-tuning (memory-efficient)

    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import get_peft_model, LoraConfig, prepare_model_for_kbit_training
    
    # 4-bit quantization config
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",           # NormalFloat4 (best for LLMs)
        bnb_4bit_compute_dtype="bfloat16",   # Compute in bf16
        bnb_4bit_use_double_quant=True       # Nested quantization
    )
    
    # Load quantized model
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.1-70B",
        quantization_config=bnb_config,
        device_map="auto"
    )
    
    # Prepare for k-bit training (casts norms to fp32, enables gradient checkpointing)
    model = prepare_model_for_kbit_training(model)
    
    # LoRA config for QLoRA
    lora_config = LoraConfig(
        r=64,                              # Higher rank for 70B
        lora_alpha=128,
        lora_dropout=0.1,
        target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
        bias="none",
        task_type="CAUSAL_LM"
    )
    
    model = get_peft_model(model, lora_config)
    # 4-bit weights for 70B are ~35 GB, so training fits on a single 48 GB GPU
    

    LoRA parameter selection

    Rank (r) - capacity vs efficiency

    Rank   Trainable Params   Memory    Quality   Use Case
    4      ~3M                Minimal   Lower     Simple tasks, prototyping
    8      ~7M                Low       Good      Recommended starting point
    16     ~14M               Medium    Better    General fine-tuning
    32     ~27M               High      High      Complex tasks
    64     ~54M               Highest   Highest   Domain adaptation, 70B models
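    
    To sanity-check these counts on your own setup, loop over ranks and print PEFT's trainable-parameter summary (a minimal sketch; facebook/opt-125m is just a small stand-in, so absolute counts differ from the table):
    
    from transformers import AutoModelForCausalLM
    from peft import get_peft_model, LoraConfig
    
    for r in (4, 8, 16, 32, 64):
        base = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # fresh copy per rank
        peft_model = get_peft_model(base, LoraConfig(r=r, lora_alpha=2 * r,
                                                     target_modules=["q_proj", "v_proj"]))
        peft_model.print_trainable_parameters()  # trainable params grow linearly with r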

    Alpha (lora_alpha) - scaling factor

    # Rule of thumb: alpha = 2 * rank
    LoraConfig(r=16, lora_alpha=32)  # Standard
    LoraConfig(r=16, lora_alpha=16)  # Conservative (lower learning rate effect)
    LoraConfig(r=16, lora_alpha=64)  # Aggressive (higher learning rate effect)
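    
    Alpha behaves like a learning-rate multiplier because LoRA scales its update by alpha / r: the effective weight is W + (alpha / r) * (B @ A). A toy illustration with plain tensors (this mirrors the math, not PEFT's internals):
    
    import torch
    
    d, r, alpha = 64, 16, 32
    W = torch.randn(d, d)          # frozen base weight
    A = torch.randn(r, d) * 0.01   # LoRA down-projection (trainable)
    B = torch.zeros(d, r)          # LoRA up-projection (initialized to zero)
    
    # Effective weight at inference: base plus scaled low-rank update
    W_eff = W + (alpha / r) * (B @ A)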
    

    Target modules by architecture

    # Llama / Mistral / Qwen
    target_modules = ["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
    
    # GPT-2 / GPT-Neo
    target_modules = ["c_attn", "c_proj", "c_fc"]
    
    # Falcon
    target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]
    
    # BLOOM
    target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]
    
    # Auto-detect all linear layers
    target_modules = "all-linear"  # PEFT 0.6.0+
    

    Loading and merging adapters

    Load trained adapter

    from peft import PeftModel, AutoPeftModelForCausalLM
    from transformers import AutoModelForCausalLM
    
    # Option 1: Load with PeftModel
    base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
    model = PeftModel.from_pretrained(base_model, "./lora-llama-adapter")
    
    # Option 2: Load directly (recommended)
    model = AutoPeftModelForCausalLM.from_pretrained(
        "./lora-llama-adapter",
        device_map="auto"
    )
    

    Merge adapter into base model

    # Merge for deployment (no adapter overhead)
    merged_model = model.merge_and_unload()
    
    # Save merged model
    merged_model.save_pretrained("./llama-merged")
    tokenizer.save_pretrained("./llama-merged")
    
    # Push to Hub
    merged_model.push_to_hub("username/llama-finetuned")
    

    Multi-adapter serving

    from peft import AutoPeftModelForCausalLM
    
    # Load base with first adapter (registered under the name "default")
    model = AutoPeftModelForCausalLM.from_pretrained("./adapter-task1")
    
    # Load additional adapters
    model.load_adapter("./adapter-task2", adapter_name="task2")
    model.load_adapter("./adapter-task3", adapter_name="task3")
    
    # Switch between adapters at runtime
    model.set_adapter("default")  # Use the first adapter
    output1 = model.generate(**inputs)
    
    model.set_adapter("task2")    # Switch to task2
    output2 = model.generate(**inputs)
    
    # Disable adapters (use base model)
    with model.disable_adapter():
        base_output = model.generate(**inputs)
    

    PEFT methods comparison

    Method          Trainable %   Memory     Speed     Best For
    LoRA            0.1-1%        Low        Fast      General fine-tuning
    QLoRA           0.1-1%        Very Low   Medium    Memory-constrained
    AdaLoRA         0.1-1%        Low        Medium    Automatic rank selection
    IA3             0.01%         Minimal    Fastest   Few-shot adaptation
    Prefix Tuning   0.1%          Low        Medium    Generation control
    Prompt Tuning   0.001%        Minimal    Fast      Simple task adaptation
    P-Tuning v2     0.1%          Low        Medium    NLU tasks

    IA3 (minimal parameters)

    from peft import IA3Config, get_peft_model
    
    ia3_config = IA3Config(
        target_modules=["q_proj", "v_proj", "k_proj", "down_proj"],
        feedforward_modules=["down_proj"]
    )
    model = get_peft_model(model, ia3_config)
    # Trains only 0.01% of parameters!
    

    Prefix Tuning

    from peft import PrefixTuningConfig, get_peft_model
    
    prefix_config = PrefixTuningConfig(
        task_type="CAUSAL_LM",
        num_virtual_tokens=20,      # Prepended tokens
        prefix_projection=True       # Use MLP projection
    )
    model = get_peft_model(model, prefix_config)
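    
    Prompt Tuning from the comparison table is lighter still; a minimal config sketch (the init text is illustrative, and text initialization is optional):
    
    from peft import PromptTuningConfig, PromptTuningInit, get_peft_model
    
    prompt_config = PromptTuningConfig(
        task_type="CAUSAL_LM",
        num_virtual_tokens=8,
        prompt_tuning_init=PromptTuningInit.TEXT,           # initialize from real token embeddings
        prompt_tuning_init_text="Classify the sentiment:",  # illustrative init string
        tokenizer_name_or_path="meta-llama/Llama-3.1-8B"
    )
    model = get_peft_model(model, prompt_config)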
    

    Integration patterns

    With TRL (SFTTrainer)

    from trl import SFTTrainer, SFTConfig
    from peft import LoraConfig
    
    lora_config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear")
    
    trainer = SFTTrainer(
        model=model,
        args=SFTConfig(output_dir="./output", max_seq_length=512),
        train_dataset=dataset,
        peft_config=lora_config,  # Pass LoRA config directly
    )
    trainer.train()
    

    With Axolotl (YAML config)

    # axolotl config.yaml
    adapter: lora
    lora_r: 16
    lora_alpha: 32
    lora_dropout: 0.05
    lora_target_modules:
      - q_proj
      - v_proj
      - k_proj
      - o_proj
    lora_target_linear: true  # Target all linear layers
    

    With vLLM (inference)

    from vllm import LLM
    from vllm.lora.request import LoRARequest
    
    # Load base model with LoRA support
    llm = LLM(model="meta-llama/Llama-3.1-8B", enable_lora=True)
    
    # Serve with adapter
    outputs = llm.generate(
        prompts,
        lora_request=LoRARequest("adapter1", 1, "./lora-adapter")
    )
    

    Performance benchmarks

    Memory usage (Llama 3.1 8B)

    Method             GPU Memory   Trainable Params
    Full fine-tuning   60+ GB       8B (100%)
    LoRA r=16          18 GB        14M (0.17%)
    QLoRA r=16         6 GB         14M (0.17%)
    IA3                16 GB        800K (0.01%)
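    
    Your numbers will vary with sequence length, batch size, and optimizer state; to measure peak usage for your own run (a small sketch, reusing the trainer from the LoRA example above):
    
    import torch
    
    torch.cuda.reset_peak_memory_stats()
    trainer.train()
    print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")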

    Training speed (A100 80GB)

    Method    Tokens/sec   vs Full FT
    Full FT   2,500        1x
    LoRA      3,200        1.3x
    QLoRA     2,100        0.84x

    Quality (MMLU benchmark)

    Model         Full FT   LoRA   QLoRA
    Llama 2-7B    45.3      44.8   44.1
    Llama 2-13B   54.8      54.2   53.5
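    
    To run this kind of comparison yourself, EleutherAI's lm-evaluation-harness can load PEFT adapters directly (a sketch; the peft= argument assumes a recent harness version):
    
    # pip install lm-eval
    lm_eval --model hf \
        --model_args pretrained=meta-llama/Llama-2-7b-hf,peft=./lora-llama-adapter \
        --tasks mmlu \
        --batch_size 8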

    Common issues

    CUDA OOM during training

    # Solution 1: Enable gradient checkpointing
    model.gradient_checkpointing_enable()
    
    # Solution 2: Reduce batch size + increase accumulation
    TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16
    )
    
    # Solution 3: Use QLoRA
    from transformers import BitsAndBytesConfig
    bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
    

    Adapter not applying

    # Verify adapter is active
    print(model.active_adapters)  # Should show adapter name
    
    # Check trainable parameters
    model.print_trainable_parameters()
    
    # Ensure model in training mode
    model.train()
    

    Quality degradation

    # Increase rank
    LoraConfig(r=32, lora_alpha=64)
    
    # Target more modules
    target_modules = "all-linear"
    
    # Use more training data and epochs
    TrainingArguments(num_train_epochs=5)
    
    # Lower learning rate
    TrainingArguments(learning_rate=1e-4)
    

    Best practices

    1. Start with r=8-16, increase if quality insufficient
    2. Use alpha = 2 * rank as starting point
    3. Target attention + MLP layers for best quality/efficiency
    4. Enable gradient checkpointing for memory savings
    5. Save adapters frequently (small files, easy rollback)
    6. Evaluate on held-out data before merging (see the sketch after this list)
    7. Use QLoRA to fit 30B+ models on limited GPU memory
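    
    A minimal version of practice 6: compare adapter and base outputs side by side before committing to a merge (the prompt and paths are placeholders):
    
    from transformers import AutoTokenizer
    from peft import AutoPeftModelForCausalLM
    
    model = AutoPeftModelForCausalLM.from_pretrained("./lora-llama-adapter", device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
    
    prompt = "### Instruction:\nSummarize LoRA in one sentence.\n\n### Response:\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    tuned = model.generate(**inputs, max_new_tokens=64)
    with model.disable_adapter():              # same weights, adapter bypassed
        base = model.generate(**inputs, max_new_tokens=64)
    
    print("tuned:", tokenizer.decode(tuned[0], skip_special_tokens=True))
    print("base: ", tokenizer.decode(base[0], skip_special_tokens=True))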

    References

    • Advanced Usage - DoRA, LoftQ, rank stabilization, custom modules
    • Troubleshooting - Common errors, debugging, optimization

    Resources

    • GitHub: https://github.com/huggingface/peft
    • Docs: https://huggingface.co/docs/peft
    • LoRA Paper: arXiv:2106.09685
    • QLoRA Paper: arXiv:2305.14314
    • Models: https://huggingface.co/models?library=peft