
    patchy631/hugging-face-datasets


    About

    Create and manage datasets on Hugging Face Hub. Supports initializing repos, defining configs/system prompts, streaming row updates, and SQL-based dataset querying/transformation...

    SKILL.md

    Overview

    This skill provides tools to manage datasets on the Hugging Face Hub with a focus on creation, configuration, content management, and SQL-based data manipulation. It is designed to complement the existing Hugging Face MCP server by providing dataset editing and querying capabilities.

    Integration with HF MCP Server

    • Use HF MCP Server for: Dataset discovery, search, and metadata retrieval
    • Use This Skill for: Dataset creation, content editing, SQL queries, data transformation, and structured data formatting

    Version

    2.1.0

    Dependencies

    • huggingface_hub
    • duckdb (for SQL queries)
    • datasets (for pushing query results to Hub)
    • json (built-in)
    • time (built-in)

    Core Capabilities

    1. Dataset Lifecycle Management

    • Initialize: Create new dataset repositories with proper structure
    • Configure: Store detailed configuration including system prompts and metadata
    • Stream Updates: Add rows efficiently without downloading entire datasets

    2. SQL-Based Dataset Querying (NEW)

    Query any Hugging Face dataset using DuckDB SQL via scripts/sql_manager.py:

    • Direct Queries: Run SQL on datasets using the hf:// protocol
    • Schema Discovery: Describe dataset structure and column types
    • Data Sampling: Get random samples for exploration
    • Aggregations: Count, histogram, unique values analysis
    • Transformations: Filter, join, reshape data with SQL
    • Export & Push: Save results locally or push to new Hub repos

    3. Multi-Format Dataset Support

    Supports diverse dataset types through a template system:

    • Chat/Conversational: Chat templating, multi-turn dialogues, tool usage examples
    • Text Classification: Sentiment analysis, intent detection, topic classification
    • Question-Answering: Reading comprehension, factual QA, knowledge bases
    • Text Completion: Language modeling, code completion, creative writing
    • Tabular Data: Structured data for regression/classification tasks
    • Custom Formats: Flexible schema definition for specialized needs

    4. Quality Assurance Features

    • JSON Validation: Ensures data integrity during uploads
    • Batch Processing: Efficient handling of large datasets
    • Error Recovery: Graceful handling of upload failures and conflicts
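
    For illustration, the JSON validation step boils down to parsing the payload and checking each row for the template's required keys. A minimal sketch, assuming a hypothetical helper (the actual checks live in scripts/dataset_manager.py):

    import json

    # Hypothetical sketch of the validation pass; the real implementation
    # lives in scripts/dataset_manager.py.
    def validate_rows(rows_json, required_keys):
        rows = json.loads(rows_json)  # raises json.JSONDecodeError with details
        if not isinstance(rows, list):
            raise ValueError("expected a JSON array of rows")
        for i, row in enumerate(rows):
            missing = set(required_keys) - set(row)
            if missing:
                raise ValueError(f"row {i} is missing keys: {sorted(missing)}")
        return rows

    validate_rows('[{"question": "What is AI?", "answer": "..."}]',
                  {"question", "answer"})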

    Usage Instructions

    The skill includes two Python scripts:

    • scripts/dataset_manager.py - Dataset creation and management
    • scripts/sql_manager.py - SQL-based dataset querying and transformation

    Prerequisites

    • huggingface_hub library: uv add huggingface_hub
    • duckdb library (for SQL): uv add duckdb
    • datasets library (for pushing): uv add datasets
    • HF_TOKEN environment variable must be set with a Write-access token
    • Activate virtual environment: source .venv/bin/activate
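
    Before running anything, you can sanity-check the token with a short huggingface_hub snippet (illustrative; whoami simply fails if the token is missing or invalid):

    import os
    from huggingface_hub import HfApi

    token = os.environ.get("HF_TOKEN")
    assert token, "HF_TOKEN is not set"
    print(HfApi(token=token).whoami()["name"])  # raises if the token is invalid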

    SQL Dataset Querying (sql_manager.py)

    Query, transform, and push Hugging Face datasets using DuckDB SQL. The hf:// protocol provides direct access to any public dataset (or to private datasets when HF_TOKEN is set).

    Quick Start

    # Query a dataset
    python scripts/sql_manager.py query \
      --dataset "cais/mmlu" \
      --sql "SELECT * FROM data WHERE subject='nutrition' LIMIT 10"
    
    # Get dataset schema
    python scripts/sql_manager.py describe --dataset "cais/mmlu"
    
    # Sample random rows
    python scripts/sql_manager.py sample --dataset "cais/mmlu" --n 5
    
    # Count rows with filter
    python scripts/sql_manager.py count --dataset "cais/mmlu" --where "subject='nutrition'"
    

    SQL Query Syntax

    Use data as the table name in your SQL; it is replaced with the actual hf:// path:

    -- Basic select
    SELECT * FROM data LIMIT 10
    
    -- Filtering
    SELECT * FROM data WHERE subject='nutrition'
    
    -- Aggregations
    SELECT subject, COUNT(*) as cnt FROM data GROUP BY subject ORDER BY cnt DESC
    
    -- Column selection and transformation (DuckDB list indexing is 1-based)
    SELECT question, choices[answer + 1] AS correct_answer FROM data
    
    -- Regex matching
    SELECT * FROM data WHERE regexp_matches(question, 'nutrition|diet')
    
    -- String functions
    SELECT regexp_replace(question, '\n', '') AS cleaned FROM data
    
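    Under the hood this is a simple placeholder substitution. A deliberately naive sketch of the idea (an assumption about the script's internals; the real logic, including config/split resolution and quoting, lives in scripts/sql_manager.py):

    # Naive illustration only: expand the data placeholder into an hf:// path.
    path = "hf://datasets/cais/mmlu@~parquet/default/train/*.parquet"
    sql = "SELECT * FROM data LIMIT 10".replace(" data ", f" '{path}' ")
    print(sql)  # SELECT * FROM 'hf://datasets/...' LIMIT 10
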

    Common Operations

    1. Explore Dataset Structure

    # Get schema
    python scripts/sql_manager.py describe --dataset "cais/mmlu"
    
    # Get unique values in column
    python scripts/sql_manager.py unique --dataset "cais/mmlu" --column "subject"
    
    # Get value distribution
    python scripts/sql_manager.py histogram --dataset "cais/mmlu" --column "subject" --bins 20
    

    2. Filter and Transform

    # Complex filtering with SQL
    python scripts/sql_manager.py query \
      --dataset "cais/mmlu" \
      --sql "SELECT subject, COUNT(*) as cnt FROM data GROUP BY subject HAVING cnt > 100"
    
    # Using transform command
    python scripts/sql_manager.py transform \
      --dataset "cais/mmlu" \
      --select "subject, COUNT(*) as cnt" \
      --group-by "subject" \
      --order-by "cnt DESC" \
      --limit 10
    

    3. Create Subsets and Push to Hub

    # Query and push to new dataset
    python scripts/sql_manager.py query \
      --dataset "cais/mmlu" \
      --sql "SELECT * FROM data WHERE subject='nutrition'" \
      --push-to "username/mmlu-nutrition-subset" \
      --private
    
    # Transform and push
    python scripts/sql_manager.py transform \
      --dataset "ibm/duorc" \
      --config "ParaphraseRC" \
      --select "question, answers" \
      --where "LENGTH(question) > 50" \
      --push-to "username/duorc-long-questions"
    

    4. Export to Local Files

    # Export to Parquet
    python scripts/sql_manager.py export \
      --dataset "cais/mmlu" \
      --sql "SELECT * FROM data WHERE subject='nutrition'" \
      --output "nutrition.parquet" \
      --format parquet
    
    # Export to JSONL
    python scripts/sql_manager.py export \
      --dataset "cais/mmlu" \
      --sql "SELECT * FROM data LIMIT 100" \
      --output "sample.jsonl" \
      --format jsonl
    

    5. Working with Dataset Configs/Splits

    # Specify config (subset)
    python scripts/sql_manager.py query \
      --dataset "ibm/duorc" \
      --config "ParaphraseRC" \
      --sql "SELECT * FROM data LIMIT 5"
    
    # Specify split
    python scripts/sql_manager.py query \
      --dataset "cais/mmlu" \
      --split "test" \
      --sql "SELECT COUNT(*) FROM data"
    
    # Query all splits
    python scripts/sql_manager.py query \
      --dataset "cais/mmlu" \
      --split "*" \
      --sql "SELECT * FROM data LIMIT 10"
    

    6. Raw SQL with Full Paths

    For complex queries or joining datasets:

    python scripts/sql_manager.py raw --sql "
      SELECT a.*, b.* 
      FROM 'hf://datasets/dataset1@~parquet/default/train/*.parquet' a
      JOIN 'hf://datasets/dataset2@~parquet/default/train/*.parquet' b
      ON a.id = b.id
      LIMIT 100
    "
    

    Python API Usage

    from sql_manager import HFDatasetSQL
    
    sql = HFDatasetSQL()
    
    # Query
    results = sql.query("cais/mmlu", "SELECT * FROM data WHERE subject='nutrition' LIMIT 10")
    
    # Get schema
    schema = sql.describe("cais/mmlu")
    
    # Sample
    samples = sql.sample("cais/mmlu", n=5, seed=42)
    
    # Count
    count = sql.count("cais/mmlu", where="subject='nutrition'")
    
    # Histogram
    dist = sql.histogram("cais/mmlu", "subject")
    
    # Filter and transform
    results = sql.filter_and_transform(
        "cais/mmlu",
        select="subject, COUNT(*) as cnt",
        group_by="subject",
        order_by="cnt DESC",
        limit=10
    )
    
    # Push to Hub
    url = sql.push_to_hub(
        "cais/mmlu",
        "username/nutrition-subset",
        sql="SELECT * FROM data WHERE subject='nutrition'",
        private=True
    )
    
    # Export locally
    sql.export_to_parquet("cais/mmlu", "output.parquet", sql="SELECT * FROM data LIMIT 100")
    
    sql.close()
    

    HF Path Format

    DuckDB uses the hf:// protocol to access datasets:

    hf://datasets/{dataset_id}@{revision}/{config}/{split}/*.parquet
    

    Examples:

    • hf://datasets/cais/mmlu@~parquet/default/train/*.parquet
    • hf://datasets/ibm/duorc@~parquet/ParaphraseRC/test/*.parquet

    The @~parquet revision provides auto-converted Parquet files for any dataset format.
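
    These paths can also be queried from DuckDB directly, bypassing the wrapper script. A minimal sketch, assuming a recent DuckDB build where hf:// support ships with the httpfs extension:

    import duckdb  # uv add duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs; LOAD httpfs;")  # provides the hf:// filesystem
    row = con.execute(
        "SELECT COUNT(*) FROM 'hf://datasets/cais/mmlu@~parquet/default/test/*.parquet'"
    ).fetchone()
    print(row[0])
    con.close()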

    Useful DuckDB SQL Functions

    -- String functions
    LENGTH(column)                    -- String length
    regexp_replace(col, '\n', '')     -- Regex replace
    regexp_matches(col, 'pattern')    -- Regex match
    LOWER(col), UPPER(col)           -- Case conversion
    
    -- Array functions  
    choices[1]                        -- Array indexing (1-based in DuckDB)
    array_length(choices)             -- Array length
    unnest(choices)                   -- Expand array to rows
    
    -- Aggregations
    COUNT(*), SUM(col), AVG(col)
    GROUP BY col HAVING condition
    
    -- Sampling
    USING SAMPLE 10                   -- Random sample
    USING SAMPLE 10 (RESERVOIR, 42)   -- Reproducible sample
    
    -- Window functions
    ROW_NUMBER() OVER (PARTITION BY col ORDER BY col2)
    

    Dataset Creation (dataset_manager.py)

    Recommended Workflow

    1. Discovery (Use HF MCP Server):

    # Use HF MCP tools to find existing datasets
    search_datasets("conversational AI training")
    get_dataset_details("username/dataset-name")
    

    2. Creation (Use This Skill):

    # Initialize new dataset
    python scripts/dataset_manager.py init --repo_id "your-username/dataset-name" [--private]
    
    # Configure with detailed system prompt
    python scripts/dataset_manager.py config --repo_id "your-username/dataset-name" --system_prompt "$(cat system_prompt.txt)"
    

    3. Content Management (Use This Skill):

    # Quick setup with any template
    python scripts/dataset_manager.py quick_setup \
      --repo_id "your-username/dataset-name" \
      --template classification
    
    # Add data with template validation
    python scripts/dataset_manager.py add_rows \
      --repo_id "your-username/dataset-name" \
      --template qa \
      --rows_json "$(cat your_qa_data.json)"
    

    Template-Based Data Structures

    1. Chat Template (--template chat)

    {
      "messages": [
        {"role": "user", "content": "Natural user request"},
        {"role": "assistant", "content": "Response with tool usage"},
        {"role": "tool", "content": "Tool response", "tool_call_id": "call_123"}
      ],
      "scenario": "Description of use case",
      "complexity": "simple|intermediate|advanced"
    }
    

    2. Classification Template (--template classification)

    {
      "text": "Input text to be classified",
      "label": "classification_label",
      "confidence": 0.95,
      "metadata": {"domain": "technology", "language": "en"}
    }
    

    3. QA Template (--template qa)

    {
      "question": "What is the question being asked?",
      "answer": "The complete answer",
      "context": "Additional context if needed",
      "answer_type": "factual|explanatory|opinion",
      "difficulty": "easy|medium|hard"
    }
    

    4. Completion Template (--template completion)

    {
      "prompt": "The beginning text or context",
      "completion": "The expected continuation",
      "domain": "code|creative|technical|conversational",
      "style": "description of writing style"
    }
    

    5. Tabular Template (--template tabular)

    {
      "columns": [
        {"name": "feature1", "type": "numeric", "description": "First feature"},
        {"name": "target", "type": "categorical", "description": "Target variable"}
      ],
      "data": [
        {"feature1": 123, "target": "class_a"},
        {"feature1": 456, "target": "class_b"}
      ]
    }
    

    Advanced System Prompt Template

    For high-quality training data generation:

    You are an AI assistant expert at using MCP tools effectively.
    
    ## MCP SERVER DEFINITIONS
    [Define available servers and tools]
    
    ## TRAINING EXAMPLE STRUCTURE
    [Specify exact JSON schema for chat templating]
    
    ## QUALITY GUIDELINES
    [Detail requirements for realistic scenarios, progressive complexity, proper tool usage]
    
    ## EXAMPLE CATEGORIES
    [List development workflows, debugging scenarios, data management tasks]
    

    Example Categories & Templates

    The skill includes diverse training examples beyond just MCP usage:

    Available Example Sets:

    • training_examples.json - MCP tool usage examples (debugging, project setup, database analysis)
    • diverse_training_examples.json - Broader scenarios including:
      • Educational Chat - Explaining programming concepts, tutorials
      • Git Workflows - Feature branches, version control guidance
      • Code Analysis - Performance optimization, architecture review
      • Content Generation - Professional writing, creative brainstorming
      • Codebase Navigation - Legacy code exploration, systematic analysis
      • Conversational Support - Problem-solving, technical discussions

    Using Different Example Sets:

    # Add MCP-focused examples
    python scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
      --rows_json "$(cat examples/training_examples.json)"
    
    # Add diverse conversational examples
    python scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
      --rows_json "$(cat examples/diverse_training_examples.json)"
    
    # Mix both for comprehensive training data
    python scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
      --rows_json "$(jq -s '.[0] + .[1]' examples/training_examples.json examples/diverse_training_examples.json)"
    

    Commands Reference

    List Available Templates:

    python scripts/dataset_manager.py list_templates
    

    Quick Setup (Recommended):

    python scripts/dataset_manager.py quick_setup --repo_id "your-username/dataset-name" --template classification
    

    Manual Setup:

    # Initialize repository
    python scripts/dataset_manager.py init --repo_id "your-username/dataset-name" [--private]
    
    # Configure with system prompt
    python scripts/dataset_manager.py config --repo_id "your-username/dataset-name" --system_prompt "Your prompt here"
    
    # Add data with validation
    python scripts/dataset_manager.py add_rows \
      --repo_id "your-username/dataset-name" \
      --template qa \
      --rows_json '[{"question": "What is AI?", "answer": "Artificial Intelligence..."}]'
    

    View Dataset Statistics:

    python scripts/dataset_manager.py stats --repo_id "your-username/dataset-name"
    

    Error Handling

    • Repository exists: Script will notify and continue with configuration
    • Invalid JSON: Clear error message with parsing details
    • Network issues: Automatic retry for transient failures
    • Token permissions: Validation before operations begin
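
    The retry behavior can be approximated with a standard exponential-backoff loop around huggingface_hub uploads. A hedged sketch (the function name and backoff policy here are illustrative, not the script's exact implementation):

    import time
    from huggingface_hub import HfApi
    from huggingface_hub.utils import HfHubHTTPError

    def upload_with_retry(api, repo_id, path, retries=3):
        # Illustrative retry loop for transient failures; the actual policy
        # is internal to scripts/dataset_manager.py.
        for attempt in range(retries):
            try:
                return api.upload_file(path_or_fileobj=path, path_in_repo=path,
                                       repo_id=repo_id, repo_type="dataset")
            except HfHubHTTPError:
                if attempt == retries - 1:
                    raise
                time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, ...

    # usage: upload_with_retry(HfApi(), "username/dataset-name", "data/train.jsonl")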

    Combined Workflow Examples

    Example 1: Create Training Subset from Existing Dataset

    # 1. Explore the source dataset
    python scripts/sql_manager.py describe --dataset "cais/mmlu"
    python scripts/sql_manager.py histogram --dataset "cais/mmlu" --column "subject"
    
    # 2. Query and create subset
    python scripts/sql_manager.py query \
      --dataset "cais/mmlu" \
      --sql "SELECT * FROM data WHERE subject IN ('nutrition', 'anatomy', 'clinical_knowledge')" \
      --push-to "username/mmlu-medical-subset" \
      --private
    

    Example 2: Transform and Reshape Data

    # Transform MMLU to QA format with correct answers extracted
    python scripts/sql_manager.py query \
      --dataset "cais/mmlu" \
      --sql "SELECT question, choices[answer + 1] as correct_answer, subject FROM data" \
      --push-to "username/mmlu-qa-format"
    

    Example 3: Merge Multiple Dataset Splits

    # Export multiple splits and combine
    python scripts/sql_manager.py export \
      --dataset "cais/mmlu" \
      --split "*" \
      --output "mmlu_all.parquet"
    

    Example 4: Quality Filtering

    # Filter for high-quality examples
    python scripts/sql_manager.py query \
      --dataset "squad" \
      --sql "SELECT * FROM data WHERE LENGTH(context) > 500 AND LENGTH(question) > 20" \
      --push-to "username/squad-filtered"
    

    Example 5: Create Custom Training Dataset

    # 1. Query source data
    python scripts/sql_manager.py export \
      --dataset "cais/mmlu" \
      --sql "SELECT question, subject FROM data WHERE subject='nutrition'" \
      --output "nutrition_source.jsonl" \
      --format jsonl
    
    # 2. Process with your pipeline (add answers, format, etc.)
    
    # 3. Push processed data
    python scripts/dataset_manager.py init --repo_id "username/nutrition-training"
    python scripts/dataset_manager.py add_rows \
      --repo_id "username/nutrition-training" \
      --template qa \
      --rows_json "$(cat processed_data.json)"
    