Smithery Logo
MCPsSkillsDocsPricing
Login
Smithery Logo

Accelerating the Agent Economy

Resources

DocumentationPrivacy PolicySystem Status

Company

PricingAboutBlog

Connect

© 2026 Smithery. All rights reserved.

    ma1orek

    agentic-vision

    ma1orek/agentic-vision
    Design
    1

    About

    SKILL.md

    Install

    Install via Skills CLI

    or add to your agent
    • Claude Code
      Claude Code
    • Codex
      Codex
    • OpenClaw
      OpenClaw
    • Cursor
      Cursor
    • Amp
      Amp
    • GitHub Copilot
      GitHub Copilot
    • Gemini CLI
      Gemini CLI
    • Kilo Code
      Kilo Code
    • Junie
      Junie
    • Replit
      Replit
    • Windsurf
      Windsurf
    • Cline
      Cline
    • Continue
      Continue
    • OpenCode
      OpenCode
    • OpenHands
      OpenHands
    • Roo Code
      Roo Code
    • Augment
      Augment
    • Goose
      Goose
    • Trae
      Trae
    • Zencoder
      Zencoder
    • Antigravity
      Antigravity
    ├─
    ├─
    └─

    About

    Gemini 3 Flash Agentic Vision - The Sandwich Architecture for pixel-perfect UI generation. Phase 1: SURVEYOR measures layout BEFORE generation (grids, spacing, colors). Phase 2: QA TESTER verifies...

    SKILL.md

    Agentic Vision - The Sandwich Architecture

    Version: 1.0.0 Last Updated: 2026-01-30


    What is Agentic Vision?

    Agentic Vision in Gemini 3 Flash converts image understanding from a static act into an agentic process. It combines visual reasoning with Code Execution.

    Think → Act → Observe loop:
    1. THINK: Analyze image, formulate plan
    2. ACT: Generate and execute Python code (crop, measure, annotate)
    3. OBSERVE: Process results, refine understanding
    

    Key capability: Instead of "guessing" padding is p-4, it MEASURES and returns 24px.


    The Sandwich Architecture

                      REPLAY "SANDWICH" ARCHITECTURE
    ┌───────────────────────────────────────────────────────────────────┐
    │                                                                   │
    │  ┌──────────┐                                                     │
    │  │  Video   │──────────────────────────────┐                      │
    │  │  Input   │                              │                      │
    │  └────┬─────┘                              │                      │
    │       │                                    ▼                      │
    │       │                       ┌─────────────────────────┐         │
    │       │                       │  PHASE 1: THE SURVEYOR  │         │
    │       │                       │ (Agentic Vision Flash)  │         │
    │       │                       ├─────────────────────────┤         │
    │       │                       │ 1. Measure Grids (px)   │         │
    │       │                       │ 2. Extract Colors (hex) │         │
    │       │                       │ 3. Map Layout (JSON)    │ ◄─── KEY
    │       │                       └────────────┬────────────┘         │
    │       │                                    │                      │
    │       ▼                                    ▼                      │
    │  ┌──────────────┐             ┌─────────────────────────┐         │
    │  │ Gemini 3 Pro │◄────────────│  Architecture Specs     │         │
    │  │ (Code Gen)   │             │   (Hard Data JSON)      │         │
    │  └──────┬───────┘             └─────────────────────────┘         │
    │         │                                                         │
    │         ▼                                                         │
    │  ┌──────────────┐    ┌──────────────────────────────────┐         │
    │  │ Render View  │───▶│      PHASE 2: THE QA TESTER      │         │
    │  └──────────────┘    │     (Agentic Vision Flash)       │         │
    │                      ├──────────────────────────────────┤         │
    │                      │ 1. Compare Original vs Render    │         │
    │                      │ 2. "Spot the difference" (SSIM)  │         │
    │                      │ 3. Auto-fix suggestions          │         │
    │                      └─────────────────┬────────────────┘         │
    │                                        │                          │
    │                                        ▼                          │
    │                              ┌──────────────────┐                 │
    │                              │ FINAL PIXEL-PERFECT │              │
    │                              │      COMPONENT      │              │
    │                              └──────────────────┘                 │
    │                                                                   │
    └───────────────────────────────────────────────────────────────────┘
    

    Phase 1: THE SURVEYOR

    Measures layout BEFORE code generation.

    API Endpoint

    POST /api/survey/measure
    {
      imageBase64: string,      // Base64 encoded frame
      mimeType?: string,        // default: 'image/png'
      useParallel?: boolean,    // default: true (faster)
      includePromptFormat?: boolean  // Include formatted prompt for generator
    }
    

    Response

    {
      success: true,
      measurements: {
        imageDimensions: { width: 1920, height: 1080 },
        grid: { columns: 12, gap: "24px" },
        spacing: {
          sidebarWidth: "256px",
          navHeight: "64px",
          cardPadding: "24px",
          sectionGap: "48px",
          containerPadding: "32px"
        },
        colors: {
          background: "#0f172a",
          surface: "#1e293b",
          primary: "#6366f1",
          text: "#ffffff",
          textMuted: "#94a3b8",
          border: "#334155"
        },
        typography: {
          h1: "48px",
          h2: "32px",
          body: "16px",
          small: "14px"
        },
        components: [
          { type: "sidebar", bbox: {...}, confidence: 0.95 }
        ],
        confidence: 0.91
      },
      promptFormat: "... formatted for code generator ..."
    }
    

    Code Usage

    import { runParallelSurveyor, formatSurveyorDataForPrompt } from '@/lib/agentic-vision';
    
    // 1. Run Surveyor on video frame
    const { measurements } = await runParallelSurveyor(frameBase64, 'image/png');
    
    // 2. Inject into code generator prompt
    const prompt = `
    ${SYSTEM_PROMPT}
    
    ${formatSurveyorDataForPrompt(measurements)}
    
    Generate code based on the video above.
    `;
    
    // 3. Generator uses EXACT values: p-[24px] not p-4
    

    Phase 2: THE QA TESTER

    Verifies generated UI AFTER render.

    API Endpoint

    POST /api/verify/diff
    {
      originalImageBase64: string,    // Original frame from video
      generatedImageBase64: string,   // Screenshot of generated code
      mimeType?: string,              // default: 'image/png'
      quickCheck?: boolean,           // Only SSIM, skip full analysis
      includeReport?: boolean         // Include formatted text report
    }
    

    Response

    {
      success: true,
      verification: {
        ssimScore: 0.94,
        overallAccuracy: "94%",
        verdict: "needs_fixes",  // "pass" | "needs_fixes" | "major_issues"
        issues: [
          {
            type: "spacing",
            severity: "medium",
            location: "card padding",
            description: "Card padding is 16px, should be 24px",
            expected: "24px",
            actual: "16px"
          }
        ],
        autoFixSuggestions: [
          {
            selector: ".card",
            property: "padding",
            suggestedValue: "24px",
            confidence: 0.85
          }
        ]
      },
      report: "✅ QA VERIFICATION REPORT..."
    }
    

    Verdict Rules

    Verdict Condition
    pass SSIM >= 0.95 AND no high severity issues
    needs_fixes SSIM >= 0.85 AND <= 3 high severity issues
    major_issues SSIM < 0.85 OR > 3 high severity issues

    Enabling Code Execution

    Agentic Vision requires codeExecution tool in Gemini API:

    import { GoogleGenAI } from '@google/genai';
    
    const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
    
    const response = await ai.models.generateContent({
      model: 'gemini-3-flash',
      contents: [
        { text: prompt },
        { inlineData: { data: imageBase64, mimeType: 'image/png' } }
      ],
      config: {
        tools: [{ codeExecution: {} }]  // <-- CRITICAL
      }
    });
    
    // Response contains:
    // - executableCode: { code: "Python code..." }
    // - codeExecutionResult: { outcome: "OUTCOME_OK", output: "JSON result" }
    

    Available Python Libraries in Sandbox

    # Data Science
    import numpy as np
    import pandas as pd
    from scipy import ndimage
    from sklearn import preprocessing
    
    # Image Processing
    from PIL import Image
    from skimage import filters, measure, transform
    from skimage.metrics import structural_similarity as ssim
    
    # Visualization
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    # Utilities
    import io
    import json
    

    Technical Considerations

    1. Coordinate Normalization

    Gemini may rescale images internally. Always request BOTH:

    • Normalized coordinates (0.0-1.0)
    • Image dimensions for backend rescaling
    def normalize_bbox(x, y, w, h, img_width, img_height):
        return {
            "x": x / img_width,
            "y": y / img_height,
            "width": w / img_width,
            "height": h / img_height
        }
    

    2. Parallel Execution for Speed

    Run color sampling and spacing measurement in parallel:

    const [colors, spacing] = await Promise.all([
      surveyColors(frame),      // Fast
      surveySpacing(frame)      // Heavier CV
    ]);
    // Time reduced by ~50%
    

    3. SSIM with scikit-image

    Use industry-standard SSIM calculation:

    from skimage.metrics import structural_similarity as ssim
    
    score, diff_image = ssim(img1, img2, full=True)
    # score: 0.0 (different) to 1.0 (identical)
    # diff_image: per-pixel difference map
    

    Integration with Replay Pipeline

    Before (Without Surveyor)

    Video → Gemini Pro "guesses" → p-4 or p-6? → 3-5 iterations
    

    After (With Sandwich Architecture)

    Video → Surveyor MEASURES → padding: 24px → Generator EXECUTES → 1-2 iterations
    

    Result: First generation is 80% better!


    File Structure

    lib/agentic-vision/
    ├── index.ts          # Main exports
    ├── types.ts          # TypeScript interfaces
    ├── prompts.ts        # Surveyor & QA prompts
    ├── surveyor.ts       # Phase 1 implementation
    └── qa-tester.ts      # Phase 2 implementation
    
    app/api/
    ├── survey/measure/route.ts    # Surveyor endpoint
    └── verify/diff/route.ts       # QA Tester endpoint
    

    Quick Start

    // Full pipeline with Agentic Vision
    
    // 1. PHASE 1: Measure before generation
    const surveyResult = await fetch('/api/survey/measure', {
      method: 'POST',
      body: JSON.stringify({ 
        imageBase64: videoFrame,
        includePromptFormat: true 
      })
    });
    const { measurements, promptFormat } = await surveyResult.json();
    
    // 2. Generate code with HARD DATA
    const codeResult = await generateWithConstraints(video, promptFormat);
    
    // 3. Render and screenshot
    const screenshot = await renderAndCapture(codeResult.code);
    
    // 4. PHASE 2: Verify
    const qaResult = await fetch('/api/verify/diff', {
      method: 'POST',
      body: JSON.stringify({
        originalImageBase64: videoFrame,
        generatedImageBase64: screenshot
      })
    });
    const { verification } = await qaResult.json();
    
    // 5. Check result
    if (verification.verdict === 'pass') {
      console.log('✅ Pixel-perfect!');
    } else {
      console.log('⚠️ Apply fixes:', verification.autoFixSuggestions);
    }
    

    References

    • Google Blog: Agentic Vision in Gemini 3 Flash
    • Gemini API Code Execution Docs
    • scikit-image SSIM
    Recommended Servers
    Nanobanana
    Nanobanana
    Browser tool
    Browser tool
    Exa Search
    Exa Search
    Repository
    ma1orek/replay
    Files