Smithery Logo
MCPsSkillsDocsPricing
Login
Smithery Logo

Accelerating the Agent Economy

Resources

DocumentationPrivacy PolicySystem Status

Company

PricingAboutBlog

Connect

© 2026 Smithery. All rights reserved.

    einverne

    gemini-vision

    einverne/gemini-vision
    AI & ML
    117

    About

    SKILL.md

    Install

    Install via Skills CLI

    or add to your agent
    • Claude Code
      Claude Code
    • Codex
      Codex
    • OpenClaw
      OpenClaw
    • Cursor
      Cursor
    • Amp
      Amp
    • GitHub Copilot
      GitHub Copilot
    • Gemini CLI
      Gemini CLI
    • Kilo Code
      Kilo Code
    • Junie
      Junie
    • Replit
      Replit
    • Windsurf
      Windsurf
    • Cline
      Cline
    • Continue
      Continue
    • OpenCode
      OpenCode
    • OpenHands
      OpenHands
    • Roo Code
      Roo Code
    • Augment
      Augment
    • Goose
      Goose
    • Trae
      Trae
    • Zencoder
      Zencoder
    • Antigravity
      Antigravity
    ├─
    ├─
    └─

    About

    Guide for implementing Google Gemini API image understanding - analyze images with captioning, classification, visual QA, object detection, segmentation, and multi-image comparison...

    SKILL.md

    Gemini Vision API Skill

    This skill enables Claude to use Google's Gemini API for advanced image understanding tasks including captioning, classification, visual question answering, object detection, segmentation, and multi-image analysis.

    Quick Start

    Prerequisites

    1. Get API Key: Obtain from Google AI Studio
    2. Install SDK: pip install google-genai (Python 3.9+)

    API Key Configuration

    The skill checks for GEMINI_API_KEY in this order:

    1. Process environment variable (recommended)

      export GEMINI_API_KEY="your-api-key"
      
    2. Skill directory: .claude/skills/gemini-vision/.env

      GEMINI_API_KEY=your-api-key
      
    3. Project directory: .env or .gemini_api_key in project root

    Security: Never commit API keys to version control. Add .env to .gitignore.

    Core Capabilities

    Image Analysis

    • Captioning: Generate descriptive text for images
    • Classification: Categorize and identify image content
    • Visual QA: Answer questions about image content
    • Multi-image: Compare and analyze up to 3,600 images

    Advanced Features (Model-Specific)

    • Object Detection: Identify and locate objects with bounding boxes (Gemini 2.0+)
    • Segmentation: Create pixel-level masks for objects (Gemini 2.5+)
    • Document Understanding: Process PDFs with vision (up to 1,000 pages)

    Supported Formats

    • Images: PNG, JPEG, WEBP, HEIC, HEIF
    • Documents: PDF (up to 1,000 pages)
    • Size Limits:
      • Inline: 20MB max total request size
      • File API: For larger files
      • Max images: 3,600 per request

    Available Models

    • gemini-2.5-pro: Most capable, segmentation + detection
    • gemini-2.5-flash: Fast, efficient, segmentation + detection
    • gemini-2.5-flash-lite: Lightweight, segmentation + detection
    • gemini-2.0-flash: Object detection support
    • gemini-1.5-pro/flash: Previous generation

    Usage Examples

    Basic Image Analysis

    # Analyze a local image
    python scripts/analyze-image.py path/to/image.jpg "What's in this image?"
    
    # Analyze from URL
    python scripts/analyze-image.py https://example.com/image.jpg "Describe this"
    
    # Specify model
    python scripts/analyze-image.py image.jpg "Caption this" --model gemini-2.5-pro
    

    Object Detection (2.0+)

    python scripts/analyze-image.py image.jpg "Detect all objects" --model gemini-2.0-flash
    

    Multi-Image Comparison

    python scripts/analyze-image.py img1.jpg img2.jpg "What's different between these?"
    

    File Upload (for large files or reuse)

    # Upload file
    python scripts/upload-file.py path/to/large-image.jpg
    
    # Use uploaded file
    python scripts/analyze-image.py file://file-id "Caption this"
    

    File Management

    # List uploaded files
    python scripts/manage-files.py list
    
    # Get file info
    python scripts/manage-files.py get file-id
    
    # Delete file
    python scripts/manage-files.py delete file-id
    

    Token Costs

    Images consume tokens based on size:

    • Small (≤384px both dimensions): 258 tokens
    • Large: Tiled into 768×768 chunks, 258 tokens each

    Token Formula:

    crop_unit = floor(min(width, height) / 1.5)
    tiles = (width / crop_unit) × (height / crop_unit)
    total_tokens = tiles × 258
    

    Example: 960×540 image = 6 tiles = 1,548 tokens

    Rate Limits

    Limits vary by tier (Free, Tier 1, 2, 3):

    • Measured in RPM (requests/min), TPM (tokens/min), RPD (requests/day)
    • Applied per project, not per API key
    • RPD resets at midnight Pacific

    Best Practices

    Image Quality

    • Use clear, non-blurry images
    • Verify correct image rotation
    • Consider token costs when sizing

    Prompting

    • Be specific in instructions
    • Place text after image for single-image prompts
    • Use few-shot examples for better accuracy
    • Specify output format (JSON, markdown, etc.)

    File Management

    • Use File API for files >20MB
    • Use File API for repeated usage (saves tokens)
    • Files auto-delete after 48 hours
    • Clean up manually when done

    Security

    • Never expose API keys in code
    • Use environment variables
    • Add API key restrictions in Google Cloud Console
    • Monitor usage regularly
    • Rotate keys periodically

    Error Handling

    Common errors:

    • 401: Invalid API key
    • 429: Rate limit exceeded
    • 400: Invalid request (check file size, format)
    • 403: Permission denied (check API key restrictions)

    Additional Resources

    See the references/ directory for:

    • api-reference.md: Detailed API methods and endpoints
    • examples.md: Comprehensive code examples
    • best-practices.md: Advanced tips and optimization strategies

    Implementation Guide

    When implementing Gemini vision features:

    1. Check API key availability using the 3-step lookup
    2. Choose appropriate model based on requirements:
      • Need segmentation? Use 2.5+ models
      • Need detection? Use 2.0+ models
      • Need speed? Use Flash variants
      • Need quality? Use Pro variants
    3. Validate inputs:
      • Check file format (PNG, JPEG, WEBP, HEIC, HEIF, PDF)
      • Verify file size (<20MB for inline, >20MB use File API)
      • Count images (max 3,600)
    4. Handle responses appropriately:
      • Parse structured output if requested
      • Extract bounding boxes for object detection
      • Process segmentation masks if applicable
    5. Manage files efficiently:
      • Upload large files via File API
      • Reuse uploaded files when possible
      • Clean up after use

    Scripts Overview

    All scripts support the 3-step API key lookup:

    • analyze-image.py: Main script for image analysis, supports inline and File API
    • upload-file.py: Upload files to Gemini File API
    • manage-files.py: List, get metadata, and delete uploaded files

    Run any script with --help for detailed usage instructions.


    Official Documentation: https://ai.google.dev/gemini-api/docs/image-understanding

    Recommended Servers
    Gemini
    Gemini
    Nanobanana
    Nanobanana
    Repository
    einverne/dotfiles
    Files