
    Hugging Face Transformers Best Practices

    Comprehensive guide to using the Hugging Face Transformers library including model loading, tokenization, fine-tuning workflows, pipeline usage, custom datasets, and deployment optimization.


    Quick Reference

    When to use this skill:

    • Loading and using pre-trained transformers (BERT, GPT, T5, LLaMA, etc.)
    • Fine-tuning models on custom data
    • Implementing NLP tasks (classification, QA, generation, etc.)
    • Optimizing inference (quantization, ONNX, etc.)
    • Debugging tokenization issues
    • Using Hugging Face pipelines
    • Deploying transformers to production

    Models covered:

    • Encoders: BERT, RoBERTa, DeBERTa, ALBERT
    • Decoders: GPT-2, GPT-Neo, LLaMA, Mistral
    • Encoder-Decoders: T5, BART, Flan-T5
    • Vision: ViT, CLIP, Stable Diffusion
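
    Each family maps to a different Auto class; a quick orientation sketch (the checkpoints below are just representative picks from the list above):

    from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM
    
    # Encoder-only: embeddings / classification heads sit on top of these
    encoder = AutoModel.from_pretrained("roberta-base")
    
    # Decoder-only: autoregressive text generation
    decoder = AutoModelForCausalLM.from_pretrained("gpt2")
    
    # Encoder-decoder: translation, summarization, other seq2seq tasks
    seq2seq = AutoModelForSeq2SeqLM.from_pretrained("t5-small")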

    Part 1: Model Loading Patterns

    Pattern 1: Basic Model Loading

    from transformers import AutoModel, AutoTokenizer
    
    # Load model and tokenizer
    model_name = "bert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    
    # For specific tasks
    from transformers import AutoModelForSequenceClassification
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=3  # For 3-class classification
    )
    

    Pattern 2: Loading with Specific Configuration

    from transformers import AutoConfig, AutoModel
    
    # Modify configuration
    config = AutoConfig.from_pretrained("bert-base-uncased")
    config.hidden_dropout_prob = 0.2  # Custom dropout
    config.attention_probs_dropout_prob = 0.2
    
    # Load model with custom config
    model = AutoModel.from_pretrained("bert-base-uncased", config=config)
    
    # Or create model from scratch with config
    model = AutoModel.from_config(config)
    

    Pattern 3: Loading Quantized Models (Memory Efficient)

    from transformers import AutoModel, BitsAndBytesConfig
    import torch
    
    # 8-bit quantization (50% memory reduction)
    quantization_config = BitsAndBytesConfig(load_in_8bit=True)
    
    model = AutoModel.from_pretrained(
        "meta-llama/Llama-2-7b-hf",
        quantization_config=quantization_config,
        device_map="auto"  # Automatic device placement
    )
    
    # 4-bit quantization (75% memory reduction)
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True
    )
    
    model = AutoModel.from_pretrained(
        "meta-llama/Llama-2-13b-hf",
        quantization_config=quantization_config,
        device_map="auto"
    )
    

    Pattern 4: Loading from Local Path

    # Save model locally
    model.save_pretrained("./my-model")
    tokenizer.save_pretrained("./my-model")
    
    # Load from local path
    model = AutoModel.from_pretrained("./my-model")
    tokenizer = AutoTokenizer.from_pretrained("./my-model")
    

    Part 2: Tokenization Best Practices

    Critical Tokenization Patterns

    from transformers import AutoTokenizer
    
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    
    # ✅ CORRECT: All required arguments
    tokens = tokenizer(
        text,
        padding=True,  # Pad to longest in batch
        truncation=True,  # Truncate to max_length
        max_length=512,  # Maximum sequence length
        return_tensors="pt"  # Return PyTorch tensors
    )
    
    # Access components
    input_ids = tokens['input_ids']  # Token IDs
    attention_mask = tokens['attention_mask']  # Padding mask
    token_type_ids = tokens.get('token_type_ids')  # Segment IDs (BERT)
    
    # ❌ WRONG: Missing critical arguments
    tokens = tokenizer(text)  # No padding, truncation, or tensor format!
    

    Batch Tokenization

    # Tokenize multiple texts efficiently
    texts = ["First text", "Second text", "Third text"]
    
    tokens = tokenizer(
        texts,
        padding=True,  # Pad all to longest in batch
        truncation=True,
        max_length=128,
        return_tensors="pt"
    )
    
    # Result shape: [batch_size, max_length]
    print(tokens['input_ids'].shape)  # torch.Size([3, max_len_in_batch])
    

    Special Token Handling

    # Add special tokens
    tokenizer.add_special_tokens({
        'additional_special_tokens': ['[CUSTOM]', '[MARKER]']
    })
    
    # Resize model embeddings to match
    model.resize_token_embeddings(len(tokenizer))
    
    # Encode with special tokens preserved
    text = "Hello [CUSTOM] world"
    tokens = tokenizer(text, add_special_tokens=True)
    
    # Decode
    decoded = tokenizer.decode(tokens['input_ids'][0], skip_special_tokens=False)
    

    Tokenization for Different Tasks

    # Text classification (single sequence)
    tokens = tokenizer(
        "This movie was great!",
        padding="max_length",
        truncation=True,
        max_length=128,
        return_tensors="pt"
    )
    
    # Question answering (pair of sequences)
    question = "What is the capital of France?"
    context = "France is a country in Europe. Paris is its capital."
    
    tokens = tokenizer(
        question,
        context,
        padding="max_length",
        truncation="only_second",  # Only truncate context
        max_length=384,
        return_tensors="pt"
    )
    
    # Text generation (decoder-only models)
    prompt = "Once upon a time"
    tokens = tokenizer(prompt, return_tensors="pt")
    # No padding needed for generation input
    

    Part 3: Fine-Tuning Workflows

    Pattern 1: Simple Fine-Tuning with Trainer

    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        Trainer,
        TrainingArguments
    )
    from datasets import load_dataset
    
    # 1. Load dataset
    dataset = load_dataset("glue", "mrpc")
    
    # 2. Load model
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased",
        num_labels=2
    )
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    
    # 3. Tokenize dataset
    def tokenize_function(examples):
        return tokenizer(
            examples["sentence1"],
            examples["sentence2"],
            padding="max_length",
            truncation=True,
            max_length=128
        )
    
    tokenized_datasets = dataset.map(tokenize_function, batched=True)
    
    # 4. Define training arguments
    training_args = TrainingArguments(
        output_dir="./results",
        evaluation_strategy="epoch",
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=3,
        weight_decay=0.01,
        logging_dir="./logs",
        logging_steps=100,
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="accuracy",
    )
    
    # 5. Define metrics
    import evaluate  # load_metric was removed from recent datasets versions; use the evaluate library
    import numpy as np
    
    metric = evaluate.load("accuracy")
    
    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)
        return metric.compute(predictions=predictions, references=labels)
    
    # 6. Create Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["validation"],
        compute_metrics=compute_metrics,
    )
    
    # 7. Train
    trainer.train()
    
    # 8. Save
    trainer.save_model("./fine-tuned-model")
    

    Pattern 2: LoRA Fine-Tuning (Parameter-Efficient)

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model, TaskType
    
    # Load base model
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",
        load_in_8bit=True,  # 8-bit for memory efficiency
        device_map="auto"
    )
    
    # Configure LoRA
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=8,  # LoRA rank
        lora_alpha=32,  # LoRA alpha
        lora_dropout=0.1,
        target_modules=["q_proj", "v_proj"],  # Which layers to adapt
    )
    
    # Apply LoRA
    model = get_peft_model(model, lora_config)
    
    # Check trainable parameters
    model.print_trainable_parameters()
    # Output: trainable params: 4.2M || all params: 6.7B || trainable%: 0.062%
    
    # Train with Trainer (same as before)
    # Only LoRA parameters are updated!
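
    After training, only the adapter weights need to be stored and shipped; a minimal sketch for saving and reloading them (the ./lora-adapter directory is an illustrative choice, and merging is optional):

    from transformers import AutoModelForCausalLM
    from peft import PeftModel
    
    # Save just the LoRA adapter (a few MB, not the full 7B checkpoint)
    model.save_pretrained("./lora-adapter")
    
    # Later: load the base model, then attach the trained adapter
    base_model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",
        device_map="auto"
    )
    model = PeftModel.from_pretrained(base_model, "./lora-adapter")
    
    # Optionally fold the adapter into the base weights for adapter-free inference
    model = model.merge_and_unload()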
    

    Pattern 3: Custom Training Loop

    import torch
    from torch.optim import AdamW  # AdamW was removed from transformers; use torch.optim
    from torch.utils.data import DataLoader
    from transformers import get_scheduler
    
    # Prepare dataloaders (drop raw text columns, rename label -> labels, return tensors)
    tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
    tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
    tokenized_datasets.set_format("torch")
    
    train_dataloader = DataLoader(tokenized_datasets["train"], batch_size=16, shuffle=True)
    eval_dataloader = DataLoader(tokenized_datasets["validation"], batch_size=16)
    
    # Optimizer
    optimizer = AdamW(model.parameters(), lr=2e-5)
    
    # Learning rate scheduler
    num_epochs = 3
    num_training_steps = num_epochs * len(train_dataloader)
    lr_scheduler = get_scheduler(
        "linear",
        optimizer=optimizer,
        num_warmup_steps=500,
        num_training_steps=num_training_steps
    )
    
    # Training loop
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    
    for epoch in range(num_epochs):
        model.train()
        for batch in train_dataloader:
            batch = {k: v.to(device) for k, v in batch.items()}
    
            outputs = model(**batch)
            loss = outputs.loss
            loss.backward()
    
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
    
        # Evaluation
        model.eval()
        correct, total = 0, 0
        for batch in eval_dataloader:
            batch = {k: v.to(device) for k, v in batch.items()}
            with torch.no_grad():
                outputs = model(**batch)
            # Accumulate simple accuracy from the logits
            predictions = outputs.logits.argmax(dim=-1)
            correct += (predictions == batch["labels"]).sum().item()
            total += batch["labels"].size(0)
        print(f"epoch {epoch}: accuracy = {correct / total:.4f}")
    

    Part 4: Pipeline Usage (High-Level API)

    Text Classification Pipeline

    from transformers import pipeline
    
    # Load pipeline
    classifier = pipeline(
        "text-classification",
        model="distilbert-base-uncased-finetuned-sst-2-english"
    )
    
    # Single prediction
    result = classifier("I love this product!")
    # [{'label': 'POSITIVE', 'score': 0.9998}]
    
    # Batch prediction
    results = classifier([
        "Great service!",
        "Terrible experience",
        "Average quality"
    ])
    

    Question Answering Pipeline

    qa_pipeline = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")
    
    result = qa_pipeline(
        question="What is the capital of France?",
        context="France is a country in Europe. Its capital is Paris, a beautiful city."
    )
    # {'score': 0.98, 'start': 49, 'end': 54, 'answer': 'Paris'}
    

    Text Generation Pipeline

    generator = pipeline("text-generation", model="gpt2")
    
    outputs = generator(
        "Once upon a time",
        max_length=50,
        num_return_sequences=3,
        temperature=0.7,
        top_k=50,
        top_p=0.95,
        do_sample=True
    )
    
    for output in outputs:
        print(output['generated_text'])
    

    Zero-Shot Classification Pipeline

    classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
    
    result = classifier(
        "This is a course about Python programming.",
        candidate_labels=["education", "technology", "business", "sports"]
    )
    # {'sequence': '...', 'labels': ['education', 'technology', ...], 'scores': [0.85, 0.12, ...]}
    

    Part 5: Inference Optimization

    Optimization 1: Batch Processing

    # ❌ SLOW: Process one at a time
    for text in texts:
        output = model(**tokenizer(text, return_tensors="pt"))
    
    # ✅ FAST: Process in batches
    batch_size = 32
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        inputs = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
        outputs = model(**inputs)
    

    Optimization 2: Mixed Precision (AMP)

    from torch.cuda.amp import autocast, GradScaler
    
    scaler = GradScaler()
    
    for batch in dataloader:
        optimizer.zero_grad()
    
        # Forward pass in mixed precision
        with autocast():
            outputs = model(**batch)
            loss = outputs.loss
    
        # Backward pass with scaled gradients
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    

    Optimization 3: ONNX Export

    from transformers import AutoTokenizer
    from optimum.onnxruntime import ORTModelForSequenceClassification
    
    # Export to ONNX (export=True converts the PyTorch checkpoint on the fly)
    ort_model = ORTModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", export=True
    )
    ort_model.save_pretrained("./onnx-model")
    
    # Later: load the exported ONNX model directly (faster inference)
    ort_model = ORTModelForSequenceClassification.from_pretrained("./onnx-model")
    
    # Inference (2-3x faster)
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    inputs = tokenizer("Hello world", return_tensors="pt")
    outputs = ort_model(**inputs)
    

    Optimization 4: Dynamic Quantization

    import torch
    
    # Quantize model to int8
    quantized_model = torch.quantization.quantize_dynamic(
        model,
        {torch.nn.Linear},  # Quantize Linear layers
        dtype=torch.qint8
    )
    
    # 4x smaller model, 2-3x faster inference on CPU
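
    The size and speed numbers above are easy to spot-check; a rough measurement sketch (the checkpoint_size_mb helper and the 100-iteration timing loop are illustrative assumptions, not library APIs):

    import os
    import time
    import torch
    
    def checkpoint_size_mb(m, path="/tmp/_size_check.pt"):
        # Approximate on-disk size of the model weights
        torch.save(m.state_dict(), path)
        size_mb = os.path.getsize(path) / 1e6
        os.remove(path)
        return size_mb
    
    print(f"fp32 model: {checkpoint_size_mb(model):.1f} MB")
    print(f"int8 model: {checkpoint_size_mb(quantized_model):.1f} MB")
    
    # Compare CPU latency on a single input
    inputs = tokenizer("Hello world", return_tensors="pt")
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(100):
            quantized_model(**inputs)
    print(f"int8 latency: {(time.perf_counter() - start) / 100 * 1000:.1f} ms per call")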
    

    Part 6: Common Issues & Solutions

    Issue 1: CUDA Out of Memory

    Problem: RuntimeError: CUDA out of memory

    Solutions:

    # Solution 1: Reduce batch size
    training_args = TrainingArguments(
        per_device_train_batch_size=8,  # Was 32
        gradient_accumulation_steps=4,  # Effective batch = 8*4 = 32
    )
    
    # Solution 2: Use gradient checkpointing
    model.gradient_checkpointing_enable()
    
    # Solution 3: Use 8-bit model
    from transformers import BitsAndBytesConfig
    quantization_config = BitsAndBytesConfig(load_in_8bit=True)
    model = AutoModel.from_pretrained("model-name", quantization_config=quantization_config)
    
    # Solution 4: Clear cache
    import torch
    torch.cuda.empty_cache()
    

    Issue 2: Slow Tokenization

    Problem: Tokenization is bottleneck

    Solutions:

    # Solution 1: Use fast tokenizers
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
    
    # Solution 2: Tokenize dataset once, cache it
    tokenized_dataset = dataset.map(
        tokenize_function,
        batched=True,
        num_proc=4,  # Parallel processing
        remove_columns=dataset.column_names,
        load_from_cache_file=True  # Cache results
    )
    
    # Solution 3: Tokenize a list of texts in one call instead of looping
    # (batched=True / batch_size are dataset.map() arguments, not tokenizer arguments)
    tokens = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors="pt"
    )
    

    Issue 3: Inconsistent Results

    Problem: Model outputs different results for same input

    Solution:

    # Set seeds for reproducibility
    import random
    import numpy as np
    import torch
    
    def set_seed(seed=42):
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    
    set_seed(42)
    
    # Disable dropout during inference
    model.eval()
    
    # Use deterministic generation (greedy decoding)
    outputs = model.generate(
        inputs,
        do_sample=False
    )
    
    # For reproducible sampling, call set_seed(42) before generate();
    # generate() does not accept a seed argument
    

    Issue 4: Attention Mask Errors

    Problem: IndexError: index out of range in self

    Solution:

    # ✅ ALWAYS provide attention mask
    tokens = tokenizer(
        text,
        padding=True,
        truncation=True,
        return_tensors="pt",
        return_attention_mask=True  # Explicit (usually default)
    )
    
    # Use it in model forward
    outputs = model(
        input_ids=tokens['input_ids'],
        attention_mask=tokens['attention_mask']  # Don't forget this!
    )
    
    # For custom padding
    attention_mask = (input_ids != tokenizer.pad_token_id).long()
    

    Part 7: Model-Specific Patterns

    GPT Models (Decoder-Only)

    from transformers import GPT2LMHeadModel, GPT2Tokenizer
    
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    
    # Set pad token (GPT doesn't have one by default)
    tokenizer.pad_token = tokenizer.eos_token
    
    # Generation
    input_text = "The future of AI is"
    inputs = tokenizer(input_text, return_tensors="pt")
    
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,          # Sampling: temperature/top_p only apply when True
        temperature=0.8,
        top_p=0.9,
        no_repeat_ngram_size=2   # Prevent repetition
        # For deterministic output, use num_beams=5, early_stopping=True instead of sampling
    )
    
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
    

    T5 Models (Encoder-Decoder)

    from transformers import T5ForConditionalGeneration, T5Tokenizer
    
    model = T5ForConditionalGeneration.from_pretrained("t5-small")
    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    
    # T5 expects task prefix
    input_text = "translate English to German: How are you?"
    inputs = tokenizer(input_text, return_tensors="pt")
    
    outputs = model.generate(
        **inputs,
        max_length=50
    )
    
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
    # "Wie geht es dir?"
    

    BERT Models (Encoder-Only)

    import torch
    from transformers import BertForMaskedLM, BertTokenizer
    
    model = BertForMaskedLM.from_pretrained("bert-base-uncased")
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    
    # Masked language modeling
    text = "Paris is the [MASK] of France."
    inputs = tokenizer(text, return_tensors="pt")
    
    # Get predictions for [MASK]
    outputs = model(**inputs)
    mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
    mask_token_logits = outputs.logits[0, mask_token_index, :]
    
    # Top 5 predictions
    top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
    for token in top_5_tokens:
        print(tokenizer.decode([token]))
    # capital, city, center, heart, ...
    

    Part 8: Production Deployment

    FastAPI Serving Pattern

    from fastapi import FastAPI
    from transformers import pipeline
    from pydantic import BaseModel
    import uvicorn
    
    app = FastAPI()
    
    # Load model once at startup
    classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
    
    class TextInput(BaseModel):
        text: str
    
    @app.post("/classify")
    async def classify_text(input: TextInput):
        result = classifier(input.text)[0]
        return {
            "label": result['label'],
            "confidence": result['score']
        }
    
    if __name__ == "__main__":
        uvicorn.run(app, host="0.0.0.0", port=8000)
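
    For a quick smoke test of the endpoint, a small client script works; a sketch assuming the server above is running locally on port 8000:

    import requests
    
    response = requests.post(
        "http://localhost:8000/classify",
        json={"text": "The checkout flow was fast and painless."},
        timeout=10,
    )
    response.raise_for_status()
    print(response.json())  # e.g. {"label": "POSITIVE", "confidence": 0.99}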
    

    Batch Inference Optimization

    import asyncio
    from typing import List
    
    class BatchPredictor:
        def __init__(self, model, tokenizer, max_batch_size=32):
            self.model = model
            self.tokenizer = tokenizer
            self.max_batch_size = max_batch_size
            self.queue = []
            self.lock = asyncio.Lock()
    
        async def predict(self, text: str):
            async with self.lock:
                future = asyncio.Future()
                self.queue.append((text, future))
    
                if len(self.queue) >= self.max_batch_size:
                    await self._process_batch()
                # NOTE: a production version also needs a periodic flush so requests
                # that never fill a batch are not stuck waiting (see sketch below)
    
            return await future
    
        async def _process_batch(self):
            if not self.queue:
                return
    
            texts, futures = zip(*self.queue)
            self.queue = []
    
            # Process batch
            inputs = self.tokenizer(list(texts), padding=True, truncation=True, return_tensors="pt")
            outputs = self.model(**inputs)
            results = outputs.logits.argmax(dim=-1).tolist()
    
            # Return results
            for future, result in zip(futures, results):
                future.set_result(result)
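
    As written, a request that arrives during low traffic may never see its batch fill up, so its future is never resolved; a common fix is a periodic flush. A minimal sketch of an extra method on BatchPredictor (the 50 ms interval and the start_flush_loop name are illustrative assumptions):

        async def start_flush_loop(self, interval: float = 0.05):
            # Flush whatever is queued every `interval` seconds so stragglers
            # don't wait forever for a full batch
            while True:
                await asyncio.sleep(interval)
                async with self.lock:
                    await self._process_batch()
    
    # At startup, e.g. in a FastAPI startup event:
    # asyncio.create_task(predictor.start_flush_loop())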
    

    Workflow

    Source: Hugging Face Course (https://huggingface.co/course) & Production Guide (https://huggingface.co/docs/transformers/performance)

    Stage 1: Model Selection

    • Task: Pick the architecture that fits the task (encoders for classification, decoders for generation).
    • License: Check whether the model permits commercial use (Apache 2.0 vs. Llama Community License).
    • Size: Balance parameter count against performance (7B is often enough).

    Stage 2: Optimization Pipeline

    • Quantization: Use 4-bit / 8-bit quantization (BitsAndBytes) for inference.
    • Batching: Process inputs in batches rather than one at a time (better GPU utilization).
    • Format: Convert to ONNX or TensorRT for production.

    Stage 3: Deployment

    • Cache: Don't bake model weights and the tokenizer into the Docker image; mount them from a volume.
    • Token Limits: Define a strategy for inputs that exceed the context window, e.g. chunking (see the sketch after the decision trees).

    Checkpoints

    | Stage | Verification |
    |-------|--------------|
    | 1 | Does the model fit in GPU memory (no OOM errors)? |
    | 2 | Is inference latency below the target? |
    | 3 | Do the tokenizer and model use the same vocabulary? |

    Quick Decision Trees

    "Which model should I use?"

    Task type?
      Classification → BERT, RoBERTa, DeBERTa
      Generation → GPT-2, GPT-Neo, LLaMA
      Translation/Summarization → T5, BART, mT5
      Question Answering → BERT, DeBERTa, RoBERTa
    
    Performance vs Speed?
      Best performance → Large models (355M+ params)
      Balanced → Base models (110M params)
      Fast inference → Distilled models (66M params)
    

    "How should I fine-tune?"

    Have full dataset control?
      YES → Full fine-tuning or LoRA
      NO → Few-shot prompting
    
    Dataset size?
      Large (>10K examples) → Full fine-tuning
      Medium (1K-10K) → LoRA or full fine-tuning
      Small (<1K) → LoRA or prompt engineering
    
    Compute available?
      Limited → LoRA (4-bit quantized)
      Moderate → LoRA (8-bit)
      High → Full fine-tuning
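
    The Token Limits item in the workflow above calls for a chunking strategy; a minimal sliding-window sketch using the tokenizer's overflow support (the 512-token window and 64-token stride are illustrative assumptions):

    from transformers import AutoTokenizer
    
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    long_text = " ".join(["transformers"] * 2000)  # stand-in for a document longer than the context window
    
    # Split one long document into overlapping, model-sized chunks
    chunks = tokenizer(
        long_text,
        max_length=512,
        stride=64,                        # tokens of overlap between consecutive chunks
        truncation=True,
        return_overflowing_tokens=True,   # return every chunk, not just the first
        padding=True,
        return_tensors="pt"
    )
    print(chunks["input_ids"].shape)  # [num_chunks, chunk_len]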
    

    Resources

    • Hugging Face Docs: https://huggingface.co/docs/transformers/
    • Model Hub: https://huggingface.co/models
    • PEFT (LoRA): https://huggingface.co/docs/peft/
    • Optimum: https://huggingface.co/docs/optimum/
    • Datasets: https://huggingface.co/docs/datasets/

    Skill version: 1.0.0 · Last updated: 2025-10-25 · Maintained by: Applied Artificial Intelligence
