
    BLIP-2: Vision-Language Pre-training

    Comprehensive guide to using Salesforce's BLIP-2 for vision-language tasks with frozen image encoders and large language models.

    When to use BLIP-2

    Use BLIP-2 when you:

    • Need high-quality image captioning with natural descriptions
    • Are building visual question answering (VQA) systems
    • Require zero-shot image-text understanding without task-specific training
    • Want to leverage LLM reasoning for visual tasks
    • Are building multimodal conversational AI
    • Need image-text retrieval or matching

    Key features:

    • Q-Former architecture: Lightweight query transformer bridges vision and language
    • Frozen backbone efficiency: No need to fine-tune large vision/language models
    • Multiple LLM backends: OPT (2.7B, 6.7B) and FlanT5 (XL, XXL)
    • Zero-shot capabilities: Strong performance without task-specific training
    • Efficient training: Only trains Q-Former (~188M parameters)
    • State-of-the-art results: Outperforms much larger models such as Flamingo on zero-shot VQA benchmarks

    Use alternatives instead:

    • LLaVA: For instruction-following multimodal chat
    • InstructBLIP: For improved instruction-following (BLIP-2 successor)
    • GPT-4V/Claude 3: For production multimodal chat (proprietary)
    • CLIP: For simple image-text similarity without generation
    • Flamingo: For few-shot visual learning

    Quick start

    Installation

    # HuggingFace Transformers (recommended)
    pip install transformers accelerate torch Pillow
    
    # Or LAVIS library (Salesforce official)
    pip install salesforce-lavis
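
    A quick sanity check before downloading multi-gigabyte checkpoints (a minimal sketch; it only verifies the imports and GPU visibility):

    # Verify the environment before pulling model weights
    import torch
    import transformers

    print(f"transformers {transformers.__version__}, torch {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        print(f"GPU: {torch.cuda.get_device_name(0)}")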
    

    Basic image captioning

    import torch
    from PIL import Image
    from transformers import Blip2Processor, Blip2ForConditionalGeneration
    
    # Load model and processor
    processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
    model = Blip2ForConditionalGeneration.from_pretrained(
        "Salesforce/blip2-opt-2.7b",
        torch_dtype=torch.float16,
        device_map="auto"
    )
    
    # Load image
    image = Image.open("photo.jpg").convert("RGB")
    
    # Generate caption
    inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
    generated_ids = model.generate(**inputs, max_new_tokens=50)
    caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print(caption)
    

    Visual question answering

    # Ask a question about the image (BLIP-2 responds best to a "Question: ... Answer:" prompt)
    question = "Question: What color is the car in this image? Answer:"
    
    inputs = processor(images=image, text=question, return_tensors="pt").to("cuda", torch.float16)
    generated_ids = model.generate(**inputs, max_new_tokens=50)
    answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print(answer)
    

    Using LAVIS library

    import torch
    from lavis.models import load_model_and_preprocess
    from PIL import Image
    
    # Load model
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model, vis_processors, txt_processors = load_model_and_preprocess(
        name="blip2_opt",
        model_type="pretrain_opt2.7b",
        is_eval=True,
        device=device
    )
    
    # Process image
    image = Image.open("photo.jpg").convert("RGB")
    image = vis_processors["eval"](image).unsqueeze(0).to(device)
    
    # Caption (generate returns a list of strings)
    caption = model.generate({"image": image})
    print(caption[0])

    # VQA (the OPT backend expects a "Question: ... Answer:" prompt)
    question = txt_processors["eval"]("Question: what is in this image? Answer:")
    answer = model.generate({"image": image, "prompt": question})
    print(answer[0])
    

    Core concepts

    Architecture overview

    BLIP-2 architecture (data flow):

        Image
          │
          ▼
        Frozen Image Encoder (ViT-G/14 from EVA-CLIP)
          │  image features
          ▼
        Q-Former (trainable)
          • Learned queries (32 queries × 768 dim)
          • Cross-attention with image features
          • Self-attention layers (Transformer)
          │  query outputs → linear projection
          ▼
        Frozen LLM (OPT or FlanT5)
          │
          ▼
        Generated text
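
    These pieces map directly onto attributes of the HuggingFace model, so you can inspect them once a checkpoint is loaded (a small sketch, assuming the model from the quick start above):

    # Inspect the components of a loaded Blip2ForConditionalGeneration
    print(model.query_tokens.shape)              # learned queries, e.g. [1, 32, 768]
    print(type(model.vision_model).__name__)     # frozen ViT image encoder
    print(type(model.qformer).__name__)          # Q-Former
    print(type(model.language_model).__name__)   # frozen OPT / FlanT5 language model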
    

    Model variants

    | Model             | LLM Backend | Size  | Use Case                |
    |-------------------|-------------|-------|-------------------------|
    | blip2-opt-2.7b    | OPT-2.7B    | ~4GB  | General captioning, VQA |
    | blip2-opt-6.7b    | OPT-6.7B    | ~8GB  | Better reasoning        |
    | blip2-flan-t5-xl  | FlanT5-XL   | ~5GB  | Instruction following   |
    | blip2-flan-t5-xxl | FlanT5-XXL  | ~13GB | Best quality            |

    Q-Former components

    | Component         | Description                        | Parameters |
    |-------------------|------------------------------------|------------|
    | Learned queries   | Fixed set of learnable embeddings  | 32 × 768   |
    | Image transformer | Cross-attention to vision features | ~108M      |
    | Text transformer  | Self-attention for text            | ~108M      |
    | Linear projection | Maps to LLM dimension              | Varies     |
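
    To verify the trainable footprint yourself, count the parameters on the Q-Former path (queries, Q-Former, projection) versus the frozen backbones; this is a rough check against the ~188M figure above, assuming the HuggingFace model loaded in the quick start:

    # Parameters on the trainable Q-Former path vs. the frozen backbones
    def n_params(module):
        return sum(p.numel() for p in module.parameters())

    qformer_path = n_params(model.qformer) + model.query_tokens.numel() + n_params(model.language_projection)
    frozen = n_params(model.vision_model) + n_params(model.language_model)
    print(f"Q-Former path: {qformer_path / 1e6:.0f}M parameters")
    print(f"Frozen backbones: {frozen / 1e6:.0f}M parameters")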

    Advanced usage

    Batch processing

    from PIL import Image
    import torch

    # Uses the processor and model loaded in the quick start above
    # Load multiple images
    images = [Image.open(f"image_{i}.jpg").convert("RGB") for i in range(4)]
    questions = [
        "What is shown in this image?",
        "Describe the scene.",
        "What colors are prominent?",
        "Is there a person in this image?"
    ]
    
    # Process batch
    inputs = processor(
        images=images,
        text=questions,
        return_tensors="pt",
        padding=True
    ).to("cuda", torch.float16)
    
    # Generate
    generated_ids = model.generate(**inputs, max_new_tokens=50)
    answers = processor.batch_decode(generated_ids, skip_special_tokens=True)
    
    for q, a in zip(questions, answers):
        print(f"Q: {q}\nA: {a}\n")
    

    Controlling generation

    # Control generation parameters
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=100,
        min_length=20,
        num_beams=5,              # Beam search
        no_repeat_ngram_size=2,   # Avoid repetition
        top_p=0.9,                # Nucleus sampling
        temperature=0.7,          # Creativity
        do_sample=True,           # Enable sampling
    )
    
    # For deterministic output
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=50,
        num_beams=5,
        do_sample=False,
    )
    

    Memory optimization

    # 8-bit quantization
    from transformers import BitsAndBytesConfig
    
    quantization_config = BitsAndBytesConfig(load_in_8bit=True)
    
    model = Blip2ForConditionalGeneration.from_pretrained(
        "Salesforce/blip2-opt-6.7b",
        quantization_config=quantization_config,
        device_map="auto"
    )
    
    # 4-bit quantization (more aggressive)
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16
    )
    
    model = Blip2ForConditionalGeneration.from_pretrained(
        "Salesforce/blip2-flan-t5-xxl",
        quantization_config=quantization_config,
        device_map="auto"
    )
    

    Image-text matching

    # Using LAVIS for ITM (Image-Text Matching)
    from lavis.models import load_model_and_preprocess
    from PIL import Image

    model, vis_processors, txt_processors = load_model_and_preprocess(
        name="blip2_image_text_matching",
        model_type="pretrain",
        is_eval=True,
        device=device
    )

    raw_image = Image.open("photo.jpg").convert("RGB")
    image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
    text = txt_processors["eval"]("a dog sitting on grass")
    
    # Get matching score
    itm_output = model({"image": image, "text_input": text}, match_head="itm")
    itm_scores = torch.nn.functional.softmax(itm_output, dim=1)
    print(f"Match probability: {itm_scores[:, 1].item():.3f}")
    

    Feature extraction

    # Extract image features with Q-Former
    from lavis.models import load_model_and_preprocess
    from PIL import Image

    model, vis_processors, _ = load_model_and_preprocess(
        name="blip2_feature_extractor",
        model_type="pretrain",
        is_eval=True,
        device=device
    )

    raw_image = Image.open("photo.jpg").convert("RGB")
    image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
    
    # Get features
    features = model.extract_features({"image": image}, mode="image")
    image_embeds = features.image_embeds  # Shape: [1, 32, 768]
    image_features = features.image_embeds_proj  # Projected for matching
    

    Common workflows

    Workflow 1: Image captioning pipeline

    import torch
    from PIL import Image
    from transformers import Blip2Processor, Blip2ForConditionalGeneration
    
    class ImageCaptioner:
        def __init__(self, model_name="Salesforce/blip2-opt-2.7b"):
            self.processor = Blip2Processor.from_pretrained(model_name)
            self.model = Blip2ForConditionalGeneration.from_pretrained(
                model_name,
                torch_dtype=torch.float16,
                device_map="auto"
            )
    
        def caption(self, image_path: str, prompt: str = None) -> str:
            image = Image.open(image_path).convert("RGB")
    
            if prompt:
                inputs = self.processor(images=image, text=prompt, return_tensors="pt")
            else:
                inputs = self.processor(images=image, return_tensors="pt")
    
            inputs = inputs.to("cuda", torch.float16)
    
            generated_ids = self.model.generate(
                **inputs,
                max_new_tokens=50,
                num_beams=5
            )
    
            return self.processor.decode(generated_ids[0], skip_special_tokens=True)
    
        def caption_batch(self, image_paths: list, prompt: str = None) -> list:
            images = [Image.open(p).convert("RGB") for p in image_paths]
    
            if prompt:
                inputs = self.processor(
                    images=images,
                    text=[prompt] * len(images),
                    return_tensors="pt",
                    padding=True
                )
            else:
                inputs = self.processor(images=images, return_tensors="pt", padding=True)
    
            inputs = inputs.to("cuda", torch.float16)
    
            generated_ids = self.model.generate(**inputs, max_new_tokens=50)
            return self.processor.batch_decode(generated_ids, skip_special_tokens=True)
    
    # Usage
    captioner = ImageCaptioner()
    
    # Single image
    caption = captioner.caption("photo.jpg")
    print(f"Caption: {caption}")
    
    # With prompt for style
    caption = captioner.caption("photo.jpg", "a detailed description of")
    print(f"Detailed: {caption}")
    
    # Batch processing
    captions = captioner.caption_batch(["img1.jpg", "img2.jpg", "img3.jpg"])
    for i, cap in enumerate(captions):
        print(f"Image {i+1}: {cap}")
    

    Workflow 2: Visual Q&A system

    class VisualQA:
        def __init__(self, model_name="Salesforce/blip2-flan-t5-xl"):
            self.processor = Blip2Processor.from_pretrained(model_name)
            self.model = Blip2ForConditionalGeneration.from_pretrained(
                model_name,
                torch_dtype=torch.float16,
                device_map="auto"
            )
            self.current_image = None
            self.current_inputs = None
    
        def set_image(self, image_path: str):
            """Load image for multiple questions."""
            self.current_image = Image.open(image_path).convert("RGB")
    
        def ask(self, question: str) -> str:
            """Ask a question about the current image."""
            if self.current_image is None:
                raise ValueError("No image set. Call set_image() first.")
    
            # Format question for FlanT5
            prompt = f"Question: {question} Answer:"
    
            inputs = self.processor(
                images=self.current_image,
                text=prompt,
                return_tensors="pt"
            ).to("cuda", torch.float16)
    
            generated_ids = self.model.generate(
                **inputs,
                max_new_tokens=50,
                num_beams=5
            )
    
            return self.processor.decode(generated_ids[0], skip_special_tokens=True)
    
        def ask_multiple(self, questions: list) -> dict:
            """Ask multiple questions about current image."""
            return {q: self.ask(q) for q in questions}
    
    # Usage
    vqa = VisualQA()
    vqa.set_image("scene.jpg")
    
    # Ask questions
    print(vqa.ask("What objects are in this image?"))
    print(vqa.ask("What is the weather like?"))
    print(vqa.ask("How many people are there?"))
    
    # Batch questions
    results = vqa.ask_multiple([
        "What is the main subject?",
        "What colors are dominant?",
        "Is this indoors or outdoors?"
    ])
    

    Workflow 3: Image search/retrieval

    import torch
    import numpy as np
    from PIL import Image
    from lavis.models import load_model_and_preprocess
    
    class ImageSearchEngine:
        def __init__(self):
            self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
            self.model, self.vis_processors, self.txt_processors = load_model_and_preprocess(
                name="blip2_feature_extractor",
                model_type="pretrain",
                is_eval=True,
                device=self.device
            )
            self.image_features = []
            self.image_paths = []
    
        def index_images(self, image_paths: list):
            """Build index from images."""
            self.image_paths = image_paths
    
            for path in image_paths:
                image = Image.open(path).convert("RGB")
                image = self.vis_processors["eval"](image).unsqueeze(0).to(self.device)
    
                with torch.no_grad():
                    features = self.model.extract_features({"image": image}, mode="image")
                    # Use projected features for matching
                    self.image_features.append(
                        features.image_embeds_proj.mean(dim=1).cpu().numpy()
                    )
    
            self.image_features = np.vstack(self.image_features)
    
        def search(self, query: str, top_k: int = 5) -> list:
            """Search images by text query."""
            # Get text features
            text = self.txt_processors["eval"](query)
            text_input = {"text_input": [text]}
    
            with torch.no_grad():
                text_features = self.model.extract_features(text_input, mode="text")
                text_embeds = text_features.text_embeds_proj[:, 0].cpu().numpy()
    
            # Compute similarities
            similarities = np.dot(self.image_features, text_embeds.T).squeeze()
            top_indices = np.argsort(similarities)[::-1][:top_k]
    
            return [(self.image_paths[i], similarities[i]) for i in top_indices]
    
    # Usage
    engine = ImageSearchEngine()
    engine.index_images(["img1.jpg", "img2.jpg", "img3.jpg", ...])
    
    # Search
    results = engine.search("a sunset over the ocean", top_k=5)
    for path, score in results:
        print(f"{path}: {score:.3f}")
    

    Output format

    Generation output

    # Direct generation returns token IDs
    generated_ids = model.generate(**inputs, max_new_tokens=50)
    # Shape: [batch_size, sequence_length]
    
    # Decode to text
    text = processor.batch_decode(generated_ids, skip_special_tokens=True)
    # Returns: list of strings
    

    Feature extraction output

    # Q-Former outputs (LAVIS feature extractor)
    features = model.extract_features({"image": image}, mode="image")

    features.image_embeds          # [B, 32, 768] - Q-Former query outputs
    features.image_embeds_proj     # [B, 32, 256] - projected for matching

    # Text features come from a separate call with mode="text"
    text_features = model.extract_features({"text_input": ["a dog sitting on grass"]}, mode="text")

    text_features.text_embeds       # [B, seq_len, 768] - text features
    text_features.text_embeds_proj  # [B, seq_len, 256] - projected; use [:, 0] (CLS) for matching
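
    These projected tensors are what BLIP-2's image-text contrastive (ITC) scoring compares: the score is typically the maximum dot product between the text CLS embedding and the 32 projected query embeddings. A minimal sketch using the features and text_features from the block above (the max-over-queries pooling follows the paper's convention, not a dedicated library call):

    import torch

    # ITC-style similarity: best-matching query vs. the text CLS embedding
    # image_embeds_proj: [B, 32, 256]; text_embeds_proj[:, 0]: [B, 256]
    with torch.no_grad():
        sim_per_query = features.image_embeds_proj @ text_features.text_embeds_proj[:, 0, :].t()
        score = sim_per_query.max()
    print(f"ITC similarity: {score.item():.3f}")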
    

    Performance optimization

    GPU memory requirements

    | Model             | FP16 VRAM | INT8 VRAM | INT4 VRAM |
    |-------------------|-----------|-----------|-----------|
    | blip2-opt-2.7b    | ~8GB      | ~5GB      | ~3GB      |
    | blip2-opt-6.7b    | ~16GB     | ~9GB      | ~5GB      |
    | blip2-flan-t5-xl  | ~10GB     | ~6GB      | ~4GB      |
    | blip2-flan-t5-xxl | ~26GB     | ~14GB     | ~8GB      |
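
    The numbers above are approximate; you can measure the actual footprint of whatever checkpoint you loaded (a small sketch using transformers' built-in helper, assuming a model loaded as in the memory optimization section):

    import torch

    # Report the memory used by the loaded weights and the peak GPU usage
    print(f"Model weights: {model.get_memory_footprint() / 1e9:.1f} GB")

    if torch.cuda.is_available():
        # Peak allocation observed so far (run after a model.generate call)
        print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")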

    Speed optimization

    # Use Flash Attention if available
    model = Blip2ForConditionalGeneration.from_pretrained(
        "Salesforce/blip2-opt-2.7b",
        torch_dtype=torch.float16,
        attn_implementation="flash_attention_2",  # Requires flash-attn
        device_map="auto"
    )
    
    # Compile model (PyTorch 2.0+)
    model = torch.compile(model)
    
    # Input size: the processor resizes images to 224x224, the resolution the
    # frozen ViT was trained at, so there is little to gain from resizing inputs yourself
    processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
    

    Common issues

    | Issue           | Solution                                   |
    |-----------------|--------------------------------------------|
    | CUDA OOM        | Use INT8/INT4 quantization, smaller model  |
    | Slow generation | Use greedy decoding, reduce max_new_tokens |
    | Poor captions   | Try FlanT5 variant, use prompts            |
    | Hallucinations  | Lower temperature, use beam search         |
    | Wrong answers   | Rephrase question, provide context         |
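
    For the CUDA OOM case specifically, one common pattern is to retry the load with 8-bit weights when fp16 does not fit. A sketch reusing the BitsAndBytesConfig shown earlier (torch.cuda.OutOfMemoryError requires a recent PyTorch, and with device_map="auto" accelerate may offload to CPU instead of raising):

    import torch
    from transformers import Blip2ForConditionalGeneration, BitsAndBytesConfig

    model_name = "Salesforce/blip2-opt-6.7b"
    try:
        # Try fp16 first
        model = Blip2ForConditionalGeneration.from_pretrained(
            model_name, torch_dtype=torch.float16, device_map="auto"
        )
    except torch.cuda.OutOfMemoryError:
        # Fall back to 8-bit quantized weights
        model = Blip2ForConditionalGeneration.from_pretrained(
            model_name,
            quantization_config=BitsAndBytesConfig(load_in_8bit=True),
            device_map="auto",
        )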

    References

    • Advanced Usage - Fine-tuning, integration, deployment
    • Troubleshooting - Common issues and solutions

    Resources

    • Paper: https://arxiv.org/abs/2301.12597
    • GitHub (LAVIS): https://github.com/salesforce/LAVIS
    • HuggingFace: https://huggingface.co/Salesforce/blip2-opt-2.7b
    • Demo: https://huggingface.co/spaces/Salesforce/BLIP2
    • InstructBLIP: https://arxiv.org/abs/2305.06500 (successor)