Smithery Logo
MCPsSkillsDocsPricing
Login
Smithery Logo

Give agents more agency

Resources

DocumentationPrivacy PolicySystem Status

Company

PricingAboutBlog

Connect

© 2026 Smithery. All rights reserved.

    terrylica

    clickhouse-architect

    terrylica/clickhouse-architect
    Data & Analytics
    9

    About

    SKILL.md

    Install

    Install via Skills CLI

    or add to your agent
    • Claude Code
      Claude Code
    • Codex
      Codex
    • OpenClaw
      OpenClaw
    • Cursor
      Cursor
    • Amp
      Amp
    • GitHub Copilot
      GitHub Copilot
    • Gemini CLI
      Gemini CLI
    • Kilo Code
      Kilo Code
    • Junie
      Junie
    • Replit
      Replit
    • Windsurf
      Windsurf
    • Cline
      Cline
    • Continue
      Continue
    • OpenCode
      OpenCode
    • OpenHands
      OpenHands
    • Roo Code
      Roo Code
    • Augment
      Augment
    • Goose
      Goose
    • Trae
      Trae
    • Zencoder
      Zencoder
    • Antigravity
      Antigravity
    ├─
    ├─
    └─

    About

    ClickHouse schema design and optimization. TRIGGERS - ClickHouse schema, compression codecs, MergeTree, ORDER BY tuning, partition key.

    SKILL.md

    ClickHouse Architect

    Prescriptive schema design, compression selection, and performance optimization for ClickHouse (v24.4+). Covers both ClickHouse Cloud (SharedMergeTree) and self-hosted (ReplicatedMergeTree) deployments.

    Self-Evolving Skill: This skill improves through use. If instructions are wrong, parameters drifted, or a workaround was needed — fix this file immediately, don't defer. Only update for real, reproducible issues.

    When to Use This Skill

    Use this skill when:

    • Designing ClickHouse table schemas with ORDER BY key selection
    • Selecting compression codecs for column types
    • Configuring partition keys for data lifecycle management
    • Adding performance accelerators (projections, indexes, dictionaries)
    • Auditing and optimizing existing ClickHouse schemas

    Core Methodology

    Schema Design Workflow

    Follow this sequence when designing or reviewing ClickHouse schemas:

    1. Define ORDER BY key (3-5 columns, lowest cardinality first)
    2. Select compression codecs per column type
    3. Configure PARTITION BY for data lifecycle management
    4. Add performance accelerators (projections, indexes)
    5. Validate with audit queries (see scripts/)
    6. Document with COMMENT statements — ClickHouse table and column COMMENTs are the single source of truth (SSoT) for what each column means, how it was computed, and what constraints apply. No external doc, skill, or wiki supersedes the COMMENT. See references/schema-documentation.md

    ORDER BY Key Selection

    The ORDER BY clause is the most critical decision in ClickHouse schema design.

    Rules:

    • Limit to 3-5 columns maximum (each additional column has diminishing returns)
    • Place lowest cardinality columns first (e.g., tenant_id before timestamp)
    • Include all columns used in WHERE clauses for range queries
    • PRIMARY KEY must be a prefix of ORDER BY (or omit to use full ORDER BY)

    Example:

    -- Correct: Low cardinality first, 4 columns
    CREATE TABLE trades (
        exchange LowCardinality(String),
        symbol LowCardinality(String),
        timestamp DateTime64(3),
        trade_id UInt64,
        price Float64,
        quantity Float64
    ) ENGINE = MergeTree()
    ORDER BY (exchange, symbol, timestamp, trade_id);
    
    -- Wrong: High cardinality first (10x slower queries)
    ORDER BY (trade_id, timestamp, symbol, exchange);
    

    Compression Codec Quick Reference

    Column Type Default Codec Read-Heavy Alternative Example
    DateTime/DateTime64 CODEC(DoubleDelta, ZSTD) CODEC(DoubleDelta, LZ4) timestamp DateTime64(3) CODEC(DoubleDelta, ZSTD)
    Float prices/gauges CODEC(Gorilla, ZSTD) CODEC(Gorilla, LZ4) price Float64 CODEC(Gorilla, ZSTD)
    Integer counters CODEC(T64, ZSTD) — count UInt64 CODEC(T64, ZSTD)
    Slowly changing integers CODEC(Delta, ZSTD) CODEC(Delta, LZ4) version UInt32 CODEC(Delta, ZSTD)
    String (low cardinality) LowCardinality(String) — status LowCardinality(String)
    General data CODEC(ZSTD(3)) CODEC(LZ4) Default compression level 3

    When to use LZ4 over ZSTD: LZ4 provides 1.76x faster decompression. Use LZ4 for read-heavy workloads with monotonic sequences (timestamps, counters). Use ZSTD (default) when compression ratio matters or data patterns are unknown.

    Note on codec combinations:

    Delta/DoubleDelta + Gorilla combinations are blocked by default (allow_suspicious_codecs) because Gorilla already performs implicit delta compression internally—combining them is redundant, not dangerous. A historical corruption bug (PR #45615, Jan 2023) was fixed, but the blocking remains as a best practice guardrail.

    Use each codec family independently for its intended data type:

    -- Correct usage
    price Float64 CODEC(Gorilla, ZSTD)              -- Floats: use Gorilla
    timestamp DateTime64 CODEC(DoubleDelta, ZSTD)   -- Timestamps: use DoubleDelta
    timestamp DateTime64 CODEC(DoubleDelta, LZ4)    -- Read-heavy: use LZ4
    

    PARTITION BY Guidelines

    PARTITION BY is for data lifecycle management, NOT query optimization.

    Rules:

    • Partition by time units (month, week) for TTL and data management
    • Keep partition count under 1000 total across all tables
    • Each partition should contain 1-300 parts maximum
    • Never partition by high-cardinality columns

    Example:

    -- Correct: Monthly partitions for TTL management
    PARTITION BY toYYYYMM(timestamp)
    
    -- Wrong: Daily partitions (too many parts)
    PARTITION BY toYYYYMMDD(timestamp)
    
    -- Wrong: High-cardinality partition key
    PARTITION BY user_id
    

    Anti-Patterns Checklist (v24.4+)

    Pattern Severity Modern Status Fix
    Too many parts (>300/partition) Critical Still critical Reduce partition granularity
    Small batch inserts (<1000) Critical Still critical Batch to 10k-100k rows
    High-cardinality first ORDER BY Critical Still critical Reorder: lowest cardinality first
    No memory limits High Still critical Set max_memory_usage
    Denormalization overuse High Still critical Use dictionaries + materialized views
    Large JOINs Medium 180x improved Still avoid for ultra-low-latency
    Mutations (UPDATE/DELETE) Medium 1700x improved Use lightweight UPDATEs (v24.4+); see DELETE Strategy Guide below

    DELETE Strategy Guide (v13.49.0+ Best Practices)

    Choose the right DELETE strategy based on scope. Ranked fastest to slowest:

    Strategy Syntax Speed Use When
    DROP PARTITION ALTER TABLE t DROP PARTITION (key1, key2, keyN) Instant (metadata-only) Purge entire partition ranges (months, corrupt data, test data)
    DELETE IN PARTITION ALTER TABLE t DELETE IN PARTITION (...) WHERE condition Fast (scans 1 partition) Targeted row removal within a known partition
    ALTER TABLE DELETE ALTER TABLE t DELETE WHERE condition Slow (scans all parts) Fallback when partition is unknown
    DELETE FROM (lightweight) DELETE FROM t WHERE condition Variable ANTI-PATTERN for write pipelines — see warning below

    Anti-pattern: Lightweight DELETE FROM before INSERT

    DELETE FROM sets _row_exists=0 masks instead of physically removing rows. These ghost rows:

    • Persist until ClickHouse background merge (unpredictable timing)
    • Show up in queries without FINAL as phantom data
    • Cause false anomalies in monitoring/integrity checks
    • Were the root cause of months of phantom Stathera gaps in production (opendeviationbar #269)

    Use DELETE FROM only for: ad-hoc data correction where ghost rows don't matter (analytics cleanup, dev/test). Never use in write pipelines where INSERT follows DELETE.

    All DELETE mutations should use: SETTINGS mutations_sync = 1 to block until completion (prevents INSERT-DELETE race conditions).

    Partition-aware DELETE tip: If your partition key includes the columns you're filtering on (e.g., PARTITION BY (symbol, threshold, toYYYYMM(timestamp))), use DELETE IN PARTITION to scope the scan to a single partition instead of scanning all parts.

    Table Engine Selection

    Deployment Engine Use Case
    ClickHouse Cloud SharedMergeTree Default for cloud deployments
    Self-hosted cluster ReplicatedMergeTree Multi-node with replication
    Self-hosted single MergeTree Single-node development/testing

    Cloud (SharedMergeTree):

    CREATE TABLE trades (...)
    ENGINE = SharedMergeTree('/clickhouse/tables/{shard}/trades', '{replica}')
    ORDER BY (exchange, symbol, timestamp);
    

    Self-hosted (ReplicatedMergeTree):

    CREATE TABLE trades (...)
    ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/trades', '{replica}')
    ORDER BY (exchange, symbol, timestamp);
    

    Skill Delegation Guide

    This skill is the hub for ClickHouse-related tasks. When the user's needs extend beyond schema design, invoke the related skills below.

    Delegation Decision Matrix

    User Need Invoke Skill Trigger Phrases
    Create database users, manage permissions devops-tools:clickhouse-cloud-management "create user", "GRANT", "permissions", "credentials"
    Configure DBeaver, generate connection JSON devops-tools:clickhouse-pydantic-config "DBeaver", "client config", "connection setup"
    Validate schema contracts against live database quality-tools:schema-e2e-validation "validate schema", "Earthly E2E", "schema contract"

    Typical Workflow Sequence

    1. Schema Design (THIS SKILL) → Design ORDER BY, compression, partitioning
    2. User Setup → clickhouse-cloud-management (if cloud credentials needed)
    3. Client Config → clickhouse-pydantic-config (generate DBeaver JSON)
    4. Validation → schema-e2e-validation (CI/CD schema contracts)

    Example: Full Stack Request

    User: "I need to design a trades table for ClickHouse Cloud and set up DBeaver to query it."

    Expected behavior:

    1. Use THIS skill for schema design
    2. Invoke clickhouse-cloud-management for creating database user
    3. Invoke clickhouse-pydantic-config for DBeaver configuration

    Performance Accelerators

    Projections

    Create alternative sort orders that ClickHouse automatically selects:

    ALTER TABLE trades ADD PROJECTION trades_by_symbol (
        SELECT * ORDER BY symbol, timestamp
    );
    ALTER TABLE trades MATERIALIZE PROJECTION trades_by_symbol;
    

    Materialized Views

    Pre-compute aggregations for dashboard queries:

    CREATE MATERIALIZED VIEW trades_hourly_mv
    ENGINE = SummingMergeTree()
    ORDER BY (exchange, symbol, hour)
    AS SELECT
        exchange,
        symbol,
        toStartOfHour(timestamp) AS hour,
        sum(quantity) AS total_volume,
        count() AS trade_count
    FROM trades
    GROUP BY exchange, symbol, hour;
    

    Dictionaries

    Replace JOINs with O(1) dictionary lookups for large-scale star schemas:

    When to use dictionaries (v24.4+):

    • Fact tables with 100M+ rows joining dimension tables
    • Dimension tables 1k-500k rows with monotonic keys
    • LEFT ANY JOIN semantics required

    When JOINs are sufficient (v24.4+):

    • Dimension tables <500 rows (JOIN overhead negligible)
    • v24.4+ predicate pushdown provides 8-180x improvements
    • Complex JOIN types (FULL, RIGHT, multi-condition)

    Benchmark context: 6.6x speedup measured on Star Schema Benchmark (1.4B rows).

    CREATE DICTIONARY symbol_info (
        symbol String,
        name String,
        sector String
    )
    PRIMARY KEY symbol
    SOURCE(CLICKHOUSE(TABLE 'symbols'))
    LAYOUT(FLAT())  -- Best for <500k entries with monotonic keys
    LIFETIME(3600);
    
    -- Use in queries (O(1) lookup)
    SELECT
        symbol,
        dictGet('symbol_info', 'name', symbol) AS symbol_name
    FROM trades;
    

    Scripts

    Execute comprehensive schema audit:

    clickhouse-client --multiquery < scripts/schema-audit.sql
    

    The audit script checks:

    • Part count per partition (threshold: 300)
    • Compression ratios by column
    • Query performance patterns
    • Replication lag (if applicable)
    • Memory usage patterns

    Additional Resources

    Reference Files

    Reference Content
    references/schema-design-workflow.md Complete workflow with examples
    references/compression-codec-selection.md Decision tree + benchmarks
    references/anti-patterns-and-fixes.md 13 deadly sins + v24.4+ status
    references/audit-and-diagnostics.md Query interpretation guide
    references/idiomatic-architecture.md Parameterized views, dictionaries, dedup
    references/schema-documentation.md COMMENT patterns + naming for AI understanding
    references/cache-schema-evolution.md Cache invalidation + schema evolution patterns

    External Documentation

    • ClickHouse Best Practices
    • Altinity Knowledge Base
    • ClickHouse Blog

    Python Driver Policy

    Use clickhouse-connect (official) for all Python integrations.

    # ✅ RECOMMENDED: clickhouse-connect (official, HTTP)
    import clickhouse_connect
    
    client = clickhouse_connect.get_client(
        host='localhost',
        port=8123,  # HTTP port
        username='default',
        password=''
    )
    result = client.query("SELECT * FROM trades LIMIT 1000")
    df = client.query_df("SELECT * FROM trades")  # Pandas integration
    

    Why NOT clickhouse-driver

    Factor clickhouse-connect clickhouse-driver
    Maintainer ClickHouse Inc. Solo developer
    Weekly commits Yes (active) Sparse (months)
    Open issues 41 (addressed) 76 (accumulating)
    Downloads/week 2.7M 1.5M
    Bus factor risk Low (company) High (1 person)

    Do NOT use clickhouse-driver despite its ~26% speed advantage for large exports. The maintenance risk outweighs performance gains:

    • Single maintainer (mymarilyn) with no succession plan
    • Issues accumulating without response
    • Risk of abandonment breaks production code

    Exception: Only consider clickhouse-driver if you have extreme performance requirements (exporting millions of rows) AND accept the maintenance risk.

    ClickHouse COMMENT = Single Source of Truth

    Every ClickHouse table and column MUST have a COMMENT that fully documents its meaning, computation method, and constraints. The COMMENT is the SSoT — no external document, skill, or wiki supersedes it.

    Why

    • ClickHouse COMMENTs have no length limit, support newlines, URLs, and unicode
    • Zero performance impact on queries (pure metadata, never in data path)
    • Visible via DESCRIBE table, SHOW CREATE TABLE, system.columns
    • Survives schema migrations (preserved through ALTER operations)

    What to Include in COMMENTs

    • Column purpose in plain English
    • Computation formula (if derived/computed)
    • Unit (seconds, milliseconds, bps, ratio)
    • Valid range or enum values
    • Anti-patterns (what NOT to do with this column)
    • GitHub issue link for provenance
    • Source script that populates the column

    Example

    ALTER TABLE t COMMENT COLUMN session_label
    'STRICT session label. 8 values: sydney_only, tokyo_only, ...
    Only set when ENTIRE bar (open→close) falls within one session.
    cross_session = bar spans boundary. Use WHERE is_pure_session=1.
    GitHub: https://github.com/org/repo/issues/54
    Source: scripts/populate-sessions/populate_v3.py';
    

    Anti-Pattern

    NEVER create a ClickHouse column without a COMMENT. A column without documentation is a column that will be misused.

    Related Skills

    Skill Purpose
    devops-tools:clickhouse-cloud-management User/permission management
    devops-tools:clickhouse-pydantic-config DBeaver connection generation
    quality-tools:schema-e2e-validation YAML schema contracts
    quality-tools:multi-agent-e2e-validation Database migration validation

    Troubleshooting

    Issue Cause Solution
    Too many parts Over-partitioned Reduce partition granularity (monthly not daily)
    Slow queries Wrong ORDER BY order Put lowest cardinality columns first
    High memory usage No memory limits set Configure max_memory_usage setting
    Codec error on Delta+Gorilla Suspicious codec combination Use each codec family independently
    Projection not used Optimizer chose different plan Check EXPLAIN to verify projection selection
    Dictionary stale Lifetime expired Increase LIFETIME or trigger refresh
    Replication lag Part merges falling behind Check merge_tree settings, add resources
    INSERT too slow Small batch sizes Batch to 10k-100k rows per INSERT

    Post-Execution Reflection

    After this skill completes, check before closing:

    1. Did the command succeed? — If not, fix the instruction or error table that caused the failure.
    2. Did parameters or output change? — If the underlying tool's interface drifted, update Usage examples and Parameters table to match.
    3. Was a workaround needed? — If you had to improvise (different flags, extra steps), update this SKILL.md so the next invocation doesn't need the same workaround.

    Only update if the issue is real and reproducible — not speculative.

    Recommended Servers
    ThinAir Data
    ThinAir Data
    Thoughtbox
    Thoughtbox
    Browser tool
    Browser tool
    Repository
    terrylica/cc-skills
    Files