clickhouse-architect

terrylica/clickhouse-architect

Data & Analytics

About

SKILL.md

clickhouse-architect

terrylica/clickhouse-architect

Data & Analytics

About

ClickHouse schema design and optimization. TRIGGERS - ClickHouse schema, compression codecs, MergeTree, ORDER BY tuning, partition key.

SKILL.md

ClickHouse Architect

Prescriptive schema design, compression selection, and performance optimization for ClickHouse (v24.4+). Covers both ClickHouse Cloud (SharedMergeTree) and self-hosted (ReplicatedMergeTree) deployments.

Self-Evolving Skill: This skill improves through use. If instructions are wrong, parameters drifted, or a workaround was needed — fix this file immediately, don't defer. Only update for real, reproducible issues.

When to Use This Skill

Use this skill when:

Designing ClickHouse table schemas with ORDER BY key selection
Selecting compression codecs for column types
Configuring partition keys for data lifecycle management
Adding performance accelerators (projections, indexes, dictionaries)
Auditing and optimizing existing ClickHouse schemas

Core Methodology

Schema Design Workflow

Follow this sequence when designing or reviewing ClickHouse schemas:

Define ORDER BY key (3-5 columns, lowest cardinality first)
Select compression codecs per column type
Configure PARTITION BY for data lifecycle management
Add performance accelerators (projections, indexes)
Validate with audit queries (see scripts/)
Document with COMMENT statements — ClickHouse table and column COMMENTs are the single source of truth (SSoT) for what each column means, how it was computed, and what constraints apply. No external doc, skill, or wiki supersedes the COMMENT. See references/schema-documentation.md

ORDER BY Key Selection

The ORDER BY clause is the most critical decision in ClickHouse schema design.

Rules:

Limit to 3-5 columns maximum (each additional column has diminishing returns)
Place lowest cardinality columns first (e.g., tenant_id before timestamp)
Include all columns used in WHERE clauses for range queries
PRIMARY KEY must be a prefix of ORDER BY (or omit to use full ORDER BY)

Example:

-- Correct: Low cardinality first, 4 columns
CREATE TABLE trades (
    exchange LowCardinality(String),
    symbol LowCardinality(String),
    timestamp DateTime64(3),
    trade_id UInt64,
    price Float64,
    quantity Float64
) ENGINE = MergeTree()
ORDER BY (exchange, symbol, timestamp, trade_id);

-- Wrong: High cardinality first (10x slower queries)
ORDER BY (trade_id, timestamp, symbol, exchange);

Compression Codec Quick Reference

Column Type	Default Codec	Read-Heavy Alternative	Example
DateTime/DateTime64	`CODEC(DoubleDelta, ZSTD)`	`CODEC(DoubleDelta, LZ4)`	`timestamp DateTime64(3) CODEC(DoubleDelta, ZSTD)`
Float prices/gauges	`CODEC(Gorilla, ZSTD)`	`CODEC(Gorilla, LZ4)`	`price Float64 CODEC(Gorilla, ZSTD)`
Integer counters	`CODEC(T64, ZSTD)`	—	`count UInt64 CODEC(T64, ZSTD)`
Slowly changing integers	`CODEC(Delta, ZSTD)`	`CODEC(Delta, LZ4)`	`version UInt32 CODEC(Delta, ZSTD)`
String (low cardinality)	`LowCardinality(String)`	—	`status LowCardinality(String)`
General data	`CODEC(ZSTD(3))`	`CODEC(LZ4)`	Default compression level 3

When to use LZ4 over ZSTD: LZ4 provides 1.76x faster decompression. Use LZ4 for read-heavy workloads with monotonic sequences (timestamps, counters). Use ZSTD (default) when compression ratio matters or data patterns are unknown.

Note on codec combinations:

Delta/DoubleDelta + Gorilla combinations are blocked by default (allow_suspicious_codecs) because Gorilla already performs implicit delta compression internally—combining them is redundant, not dangerous. A historical corruption bug (PR #45615, Jan 2023) was fixed, but the blocking remains as a best practice guardrail.

Use each codec family independently for its intended data type:

-- Correct usage
price Float64 CODEC(Gorilla, ZSTD)              -- Floats: use Gorilla
timestamp DateTime64 CODEC(DoubleDelta, ZSTD)   -- Timestamps: use DoubleDelta
timestamp DateTime64 CODEC(DoubleDelta, LZ4)    -- Read-heavy: use LZ4

PARTITION BY Guidelines

PARTITION BY is for data lifecycle management, NOT query optimization.

Rules:

Partition by time units (month, week) for TTL and data management
Keep partition count under 1000 total across all tables
Each partition should contain 1-300 parts maximum
Never partition by high-cardinality columns

Example:

-- Correct: Monthly partitions for TTL management
PARTITION BY toYYYYMM(timestamp)

-- Wrong: Daily partitions (too many parts)
PARTITION BY toYYYYMMDD(timestamp)

-- Wrong: High-cardinality partition key
PARTITION BY user_id

Anti-Patterns Checklist (v24.4+)

Pattern	Severity	Modern Status	Fix
Too many parts (>300/partition)	Critical	Still critical	Reduce partition granularity
Small batch inserts (<1000)	Critical	Still critical	Batch to 10k-100k rows
High-cardinality first ORDER BY	Critical	Still critical	Reorder: lowest cardinality first
No memory limits	High	Still critical	Set `max_memory_usage`
Denormalization overuse	High	Still critical	Use dictionaries + materialized views
Large JOINs	Medium	180x improved	Still avoid for ultra-low-latency
Mutations (UPDATE/DELETE)	Medium	1700x improved	Use lightweight UPDATEs (v24.4+); see DELETE Strategy Guide below

DELETE Strategy Guide (v13.49.0+ Best Practices)

Choose the right DELETE strategy based on scope. Ranked fastest to slowest:

Strategy	Syntax	Speed	Use When
`DROP PARTITION`	`ALTER TABLE t DROP PARTITION (key1, key2, keyN)`	Instant (metadata-only)	Purge entire partition ranges (months, corrupt data, test data)
`DELETE IN PARTITION`	`ALTER TABLE t DELETE IN PARTITION (...) WHERE condition`	Fast (scans 1 partition)	Targeted row removal within a known partition
`ALTER TABLE DELETE`	`ALTER TABLE t DELETE WHERE condition`	Slow (scans all parts)	Fallback when partition is unknown
`DELETE FROM` (lightweight)	`DELETE FROM t WHERE condition`	Variable	ANTI-PATTERN for write pipelines — see warning below

Anti-pattern: Lightweight DELETE FROM before INSERT

DELETE FROM sets _row_exists=0 masks instead of physically removing rows. These ghost rows:

Persist until ClickHouse background merge (unpredictable timing)
Show up in queries without FINAL as phantom data
Cause false anomalies in monitoring/integrity checks
Were the root cause of months of phantom Stathera gaps in production (opendeviationbar #269)

Use DELETE FROM only for: ad-hoc data correction where ghost rows don't matter (analytics cleanup, dev/test). Never use in write pipelines where INSERT follows DELETE.

All DELETE mutations should use: SETTINGS mutations_sync = 1 to block until completion (prevents INSERT-DELETE race conditions).

Partition-aware DELETE tip: If your partition key includes the columns you're filtering on (e.g., PARTITION BY (symbol, threshold, toYYYYMM(timestamp))), use DELETE IN PARTITION to scope the scan to a single partition instead of scanning all parts.

Table Engine Selection

Deployment	Engine	Use Case
ClickHouse Cloud	`SharedMergeTree`	Default for cloud deployments
Self-hosted cluster	`ReplicatedMergeTree`	Multi-node with replication
Self-hosted single	`MergeTree`	Single-node development/testing

Cloud (SharedMergeTree):

CREATE TABLE trades (...)
ENGINE = SharedMergeTree('/clickhouse/tables/{shard}/trades', '{replica}')
ORDER BY (exchange, symbol, timestamp);

Self-hosted (ReplicatedMergeTree):

CREATE TABLE trades (...)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/trades', '{replica}')
ORDER BY (exchange, symbol, timestamp);

Skill Delegation Guide

This skill is the hub for ClickHouse-related tasks. When the user's needs extend beyond schema design, invoke the related skills below.

Delegation Decision Matrix

User Need	Invoke Skill	Trigger Phrases
Create database users, manage permissions	`devops-tools:clickhouse-cloud-management`	"create user", "GRANT", "permissions", "credentials"
Configure DBeaver, generate connection JSON	`devops-tools:clickhouse-pydantic-config`	"DBeaver", "client config", "connection setup"
Validate schema contracts against live database	`quality-tools:schema-e2e-validation`	"validate schema", "Earthly E2E", "schema contract"

Typical Workflow Sequence

Schema Design (THIS SKILL) → Design ORDER BY, compression, partitioning
User Setup → clickhouse-cloud-management (if cloud credentials needed)
Client Config → clickhouse-pydantic-config (generate DBeaver JSON)
Validation → schema-e2e-validation (CI/CD schema contracts)

Example: Full Stack Request

User: "I need to design a trades table for ClickHouse Cloud and set up DBeaver to query it."

Expected behavior:

Use THIS skill for schema design
Invoke clickhouse-cloud-management for creating database user
Invoke clickhouse-pydantic-config for DBeaver configuration

Performance Accelerators

Projections

Create alternative sort orders that ClickHouse automatically selects:

ALTER TABLE trades ADD PROJECTION trades_by_symbol (
    SELECT * ORDER BY symbol, timestamp
);
ALTER TABLE trades MATERIALIZE PROJECTION trades_by_symbol;

Materialized Views

Pre-compute aggregations for dashboard queries:

CREATE MATERIALIZED VIEW trades_hourly_mv
ENGINE = SummingMergeTree()
ORDER BY (exchange, symbol, hour)
AS SELECT
    exchange,
    symbol,
    toStartOfHour(timestamp) AS hour,
    sum(quantity) AS total_volume,
    count() AS trade_count
FROM trades
GROUP BY exchange, symbol, hour;

Dictionaries

Replace JOINs with O(1) dictionary lookups for large-scale star schemas:

When to use dictionaries (v24.4+):

Fact tables with 100M+ rows joining dimension tables
Dimension tables 1k-500k rows with monotonic keys
LEFT ANY JOIN semantics required

When JOINs are sufficient (v24.4+):

Dimension tables <500 rows (JOIN overhead negligible)
v24.4+ predicate pushdown provides 8-180x improvements
Complex JOIN types (FULL, RIGHT, multi-condition)

Benchmark context: 6.6x speedup measured on Star Schema Benchmark (1.4B rows).

CREATE DICTIONARY symbol_info (
    symbol String,
    name String,
    sector String
)
PRIMARY KEY symbol
SOURCE(CLICKHOUSE(TABLE 'symbols'))
LAYOUT(FLAT())  -- Best for <500k entries with monotonic keys
LIFETIME(3600);

-- Use in queries (O(1) lookup)
SELECT
    symbol,
    dictGet('symbol_info', 'name', symbol) AS symbol_name
FROM trades;

Scripts

Execute comprehensive schema audit:

clickhouse-client --multiquery < scripts/schema-audit.sql

The audit script checks:

Part count per partition (threshold: 300)
Compression ratios by column
Query performance patterns
Replication lag (if applicable)
Memory usage patterns

Additional Resources

Reference Files

Reference	Content
`references/schema-design-workflow.md`	Complete workflow with examples
`references/compression-codec-selection.md`	Decision tree + benchmarks
`references/anti-patterns-and-fixes.md`	13 deadly sins + v24.4+ status
`references/audit-and-diagnostics.md`	Query interpretation guide
`references/idiomatic-architecture.md`	Parameterized views, dictionaries, dedup
`references/schema-documentation.md`	COMMENT patterns + naming for AI understanding
`references/cache-schema-evolution.md`	Cache invalidation + schema evolution patterns

External Documentation

Python Driver Policy

Use clickhouse-connect (official) for all Python integrations.

# ✅ RECOMMENDED: clickhouse-connect (official, HTTP)
import clickhouse_connect

client = clickhouse_connect.get_client(
    host='localhost',
    port=8123,  # HTTP port
    username='default',
    password=''
)
result = client.query("SELECT * FROM trades LIMIT 1000")
df = client.query_df("SELECT * FROM trades")  # Pandas integration

Why NOT `clickhouse-driver`

Factor	clickhouse-connect	clickhouse-driver
Maintainer	ClickHouse Inc.	Solo developer
Weekly commits	Yes (active)	Sparse (months)
Open issues	41 (addressed)	76 (accumulating)
Downloads/week	2.7M	1.5M
Bus factor risk	Low (company)	High (1 person)

Do NOT use clickhouse-driver despite its ~26% speed advantage for large exports. The maintenance risk outweighs performance gains:

Single maintainer (mymarilyn) with no succession plan
Issues accumulating without response
Risk of abandonment breaks production code

Exception: Only consider clickhouse-driver if you have extreme performance requirements (exporting millions of rows) AND accept the maintenance risk.

ClickHouse COMMENT = Single Source of Truth

Every ClickHouse table and column MUST have a COMMENT that fully documents its meaning, computation method, and constraints. The COMMENT is the SSoT — no external document, skill, or wiki supersedes it.

Why

ClickHouse COMMENTs have no length limit, support newlines, URLs, and unicode
Zero performance impact on queries (pure metadata, never in data path)
Visible via DESCRIBE table, SHOW CREATE TABLE, system.columns
Survives schema migrations (preserved through ALTER operations)

What to Include in COMMENTs

Column purpose in plain English
Computation formula (if derived/computed)
Unit (seconds, milliseconds, bps, ratio)
Valid range or enum values
Anti-patterns (what NOT to do with this column)
GitHub issue link for provenance
Source script that populates the column

Example

ALTER TABLE t COMMENT COLUMN session_label
'STRICT session label. 8 values: sydney_only, tokyo_only, ...
Only set when ENTIRE bar (open→close) falls within one session.
cross_session = bar spans boundary. Use WHERE is_pure_session=1.
GitHub: https://github.com/org/repo/issues/54
Source: scripts/populate-sessions/populate_v3.py';

Anti-Pattern

NEVER create a ClickHouse column without a COMMENT. A column without documentation is a column that will be misused.

Related Skills

Skill	Purpose
`devops-tools:clickhouse-cloud-management`	User/permission management
`devops-tools:clickhouse-pydantic-config`	DBeaver connection generation
`quality-tools:schema-e2e-validation`	YAML schema contracts
`quality-tools:multi-agent-e2e-validation`	Database migration validation

Troubleshooting

Issue	Cause	Solution
Too many parts	Over-partitioned	Reduce partition granularity (monthly not daily)
Slow queries	Wrong ORDER BY order	Put lowest cardinality columns first
High memory usage	No memory limits set	Configure max_memory_usage setting
Codec error on Delta+Gorilla	Suspicious codec combination	Use each codec family independently
Projection not used	Optimizer chose different plan	Check EXPLAIN to verify projection selection
Dictionary stale	Lifetime expired	Increase LIFETIME or trigger refresh
Replication lag	Part merges falling behind	Check merge_tree settings, add resources
INSERT too slow	Small batch sizes	Batch to 10k-100k rows per INSERT

Post-Execution Reflection

After this skill completes, check before closing:

Did the command succeed? — If not, fix the instruction or error table that caused the failure.
Did parameters or output change? — If the underlying tool's interface drifted, update Usage examples and Parameters table to match.
Was a workaround needed? — If you had to improvise (different flags, extra steps), update this SKILL.md so the next invocation doesn't need the same workaround.

Only update if the issue is real and reproducible — not speculative.

About

SKILL.md

About

ClickHouse schema design and optimization. TRIGGERS - ClickHouse schema, compression codecs, MergeTree, ORDER BY tuning, partition key.

SKILL.md

ClickHouse Architect

Self-Evolving Skill: This skill improves through use. If instructions are wrong, parameters drifted, or a workaround was needed — fix this file immediately, don't defer. Only update for real, reproducible issues.

When to Use This Skill

Use this skill when:

Designing ClickHouse table schemas with ORDER BY key selection
Selecting compression codecs for column types
Configuring partition keys for data lifecycle management
Adding performance accelerators (projections, indexes, dictionaries)
Auditing and optimizing existing ClickHouse schemas

Core Methodology

Schema Design Workflow

Follow this sequence when designing or reviewing ClickHouse schemas:

Define ORDER BY key (3-5 columns, lowest cardinality first)
Select compression codecs per column type
Configure PARTITION BY for data lifecycle management
Add performance accelerators (projections, indexes)
Validate with audit queries (see scripts/)
Document with COMMENT statements — ClickHouse table and column COMMENTs are the single source of truth (SSoT) for what each column means, how it was computed, and what constraints apply. No external doc, skill, or wiki supersedes the COMMENT. See references/schema-documentation.md

ORDER BY Key Selection

The ORDER BY clause is the most critical decision in ClickHouse schema design.

Rules:

Limit to 3-5 columns maximum (each additional column has diminishing returns)
Place lowest cardinality columns first (e.g., tenant_id before timestamp)
Include all columns used in WHERE clauses for range queries
PRIMARY KEY must be a prefix of ORDER BY (or omit to use full ORDER BY)

Example:

-- Correct: Low cardinality first, 4 columns
CREATE TABLE trades (
    exchange LowCardinality(String),
    symbol LowCardinality(String),
    timestamp DateTime64(3),
    trade_id UInt64,
    price Float64,
    quantity Float64
) ENGINE = MergeTree()
ORDER BY (exchange, symbol, timestamp, trade_id);

-- Wrong: High cardinality first (10x slower queries)
ORDER BY (trade_id, timestamp, symbol, exchange);

Compression Codec Quick Reference

Column Type	Default Codec	Read-Heavy Alternative	Example
DateTime/DateTime64	`CODEC(DoubleDelta, ZSTD)`	`CODEC(DoubleDelta, LZ4)`	`timestamp DateTime64(3) CODEC(DoubleDelta, ZSTD)`
Float prices/gauges	`CODEC(Gorilla, ZSTD)`	`CODEC(Gorilla, LZ4)`	`price Float64 CODEC(Gorilla, ZSTD)`
Integer counters	`CODEC(T64, ZSTD)`	—	`count UInt64 CODEC(T64, ZSTD)`
Slowly changing integers	`CODEC(Delta, ZSTD)`	`CODEC(Delta, LZ4)`	`version UInt32 CODEC(Delta, ZSTD)`
String (low cardinality)	`LowCardinality(String)`	—	`status LowCardinality(String)`
General data	`CODEC(ZSTD(3))`	`CODEC(LZ4)`	Default compression level 3

Note on codec combinations:

Use each codec family independently for its intended data type:

-- Correct usage
price Float64 CODEC(Gorilla, ZSTD)              -- Floats: use Gorilla
timestamp DateTime64 CODEC(DoubleDelta, ZSTD)   -- Timestamps: use DoubleDelta
timestamp DateTime64 CODEC(DoubleDelta, LZ4)    -- Read-heavy: use LZ4

PARTITION BY Guidelines

PARTITION BY is for data lifecycle management, NOT query optimization.

Rules:

Partition by time units (month, week) for TTL and data management
Keep partition count under 1000 total across all tables
Each partition should contain 1-300 parts maximum
Never partition by high-cardinality columns

Example:

-- Correct: Monthly partitions for TTL management
PARTITION BY toYYYYMM(timestamp)

-- Wrong: Daily partitions (too many parts)
PARTITION BY toYYYYMMDD(timestamp)

-- Wrong: High-cardinality partition key
PARTITION BY user_id

Anti-Patterns Checklist (v24.4+)

Pattern	Severity	Modern Status	Fix
Too many parts (>300/partition)	Critical	Still critical	Reduce partition granularity
Small batch inserts (<1000)	Critical	Still critical	Batch to 10k-100k rows
High-cardinality first ORDER BY	Critical	Still critical	Reorder: lowest cardinality first
No memory limits	High	Still critical	Set `max_memory_usage`
Denormalization overuse	High	Still critical	Use dictionaries + materialized views
Large JOINs	Medium	180x improved	Still avoid for ultra-low-latency
Mutations (UPDATE/DELETE)	Medium	1700x improved	Use lightweight UPDATEs (v24.4+); see DELETE Strategy Guide below

DELETE Strategy Guide (v13.49.0+ Best Practices)

Choose the right DELETE strategy based on scope. Ranked fastest to slowest:

Strategy	Syntax	Speed	Use When
`DROP PARTITION`	`ALTER TABLE t DROP PARTITION (key1, key2, keyN)`	Instant (metadata-only)	Purge entire partition ranges (months, corrupt data, test data)
`DELETE IN PARTITION`	`ALTER TABLE t DELETE IN PARTITION (...) WHERE condition`	Fast (scans 1 partition)	Targeted row removal within a known partition
`ALTER TABLE DELETE`	`ALTER TABLE t DELETE WHERE condition`	Slow (scans all parts)	Fallback when partition is unknown
`DELETE FROM` (lightweight)	`DELETE FROM t WHERE condition`	Variable	ANTI-PATTERN for write pipelines — see warning below

Anti-pattern: Lightweight DELETE FROM before INSERT

DELETE FROM sets _row_exists=0 masks instead of physically removing rows. These ghost rows:

Persist until ClickHouse background merge (unpredictable timing)
Show up in queries without FINAL as phantom data
Cause false anomalies in monitoring/integrity checks
Were the root cause of months of phantom Stathera gaps in production (opendeviationbar #269)

Use DELETE FROM only for: ad-hoc data correction where ghost rows don't matter (analytics cleanup, dev/test). Never use in write pipelines where INSERT follows DELETE.

All DELETE mutations should use: SETTINGS mutations_sync = 1 to block until completion (prevents INSERT-DELETE race conditions).

Table Engine Selection

Deployment	Engine	Use Case
ClickHouse Cloud	`SharedMergeTree`	Default for cloud deployments
Self-hosted cluster	`ReplicatedMergeTree`	Multi-node with replication
Self-hosted single	`MergeTree`	Single-node development/testing

Cloud (SharedMergeTree):

CREATE TABLE trades (...)
ENGINE = SharedMergeTree('/clickhouse/tables/{shard}/trades', '{replica}')
ORDER BY (exchange, symbol, timestamp);

Self-hosted (ReplicatedMergeTree):

CREATE TABLE trades (...)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/trades', '{replica}')
ORDER BY (exchange, symbol, timestamp);

Skill Delegation Guide

This skill is the hub for ClickHouse-related tasks. When the user's needs extend beyond schema design, invoke the related skills below.

Delegation Decision Matrix

User Need	Invoke Skill	Trigger Phrases
Create database users, manage permissions	`devops-tools:clickhouse-cloud-management`	"create user", "GRANT", "permissions", "credentials"
Configure DBeaver, generate connection JSON	`devops-tools:clickhouse-pydantic-config`	"DBeaver", "client config", "connection setup"
Validate schema contracts against live database	`quality-tools:schema-e2e-validation`	"validate schema", "Earthly E2E", "schema contract"

Typical Workflow Sequence

Schema Design (THIS SKILL) → Design ORDER BY, compression, partitioning
User Setup → clickhouse-cloud-management (if cloud credentials needed)
Client Config → clickhouse-pydantic-config (generate DBeaver JSON)
Validation → schema-e2e-validation (CI/CD schema contracts)

Example: Full Stack Request

User: "I need to design a trades table for ClickHouse Cloud and set up DBeaver to query it."

Expected behavior:

Use THIS skill for schema design
Invoke clickhouse-cloud-management for creating database user
Invoke clickhouse-pydantic-config for DBeaver configuration

Performance Accelerators

Projections

Create alternative sort orders that ClickHouse automatically selects:

ALTER TABLE trades ADD PROJECTION trades_by_symbol (
    SELECT * ORDER BY symbol, timestamp
);
ALTER TABLE trades MATERIALIZE PROJECTION trades_by_symbol;

Materialized Views

Pre-compute aggregations for dashboard queries:

CREATE MATERIALIZED VIEW trades_hourly_mv
ENGINE = SummingMergeTree()
ORDER BY (exchange, symbol, hour)
AS SELECT
    exchange,
    symbol,
    toStartOfHour(timestamp) AS hour,
    sum(quantity) AS total_volume,
    count() AS trade_count
FROM trades
GROUP BY exchange, symbol, hour;

Dictionaries

Replace JOINs with O(1) dictionary lookups for large-scale star schemas:

When to use dictionaries (v24.4+):

Fact tables with 100M+ rows joining dimension tables
Dimension tables 1k-500k rows with monotonic keys
LEFT ANY JOIN semantics required

When JOINs are sufficient (v24.4+):

Dimension tables <500 rows (JOIN overhead negligible)
v24.4+ predicate pushdown provides 8-180x improvements
Complex JOIN types (FULL, RIGHT, multi-condition)

Benchmark context: 6.6x speedup measured on Star Schema Benchmark (1.4B rows).

CREATE DICTIONARY symbol_info (
    symbol String,
    name String,
    sector String
)
PRIMARY KEY symbol
SOURCE(CLICKHOUSE(TABLE 'symbols'))
LAYOUT(FLAT())  -- Best for <500k entries with monotonic keys
LIFETIME(3600);

-- Use in queries (O(1) lookup)
SELECT
    symbol,
    dictGet('symbol_info', 'name', symbol) AS symbol_name
FROM trades;

Scripts

Execute comprehensive schema audit:

clickhouse-client --multiquery < scripts/schema-audit.sql

The audit script checks:

Part count per partition (threshold: 300)
Compression ratios by column
Query performance patterns
Replication lag (if applicable)
Memory usage patterns

Additional Resources

Reference Files

Reference	Content
`references/schema-design-workflow.md`	Complete workflow with examples
`references/compression-codec-selection.md`	Decision tree + benchmarks
`references/anti-patterns-and-fixes.md`	13 deadly sins + v24.4+ status
`references/audit-and-diagnostics.md`	Query interpretation guide
`references/idiomatic-architecture.md`	Parameterized views, dictionaries, dedup
`references/schema-documentation.md`	COMMENT patterns + naming for AI understanding
`references/cache-schema-evolution.md`	Cache invalidation + schema evolution patterns

External Documentation

Python Driver Policy

Use clickhouse-connect (official) for all Python integrations.

# ✅ RECOMMENDED: clickhouse-connect (official, HTTP)
import clickhouse_connect

client = clickhouse_connect.get_client(
    host='localhost',
    port=8123,  # HTTP port
    username='default',
    password=''
)
result = client.query("SELECT * FROM trades LIMIT 1000")
df = client.query_df("SELECT * FROM trades")  # Pandas integration

Why NOT `clickhouse-driver`

Factor	clickhouse-connect	clickhouse-driver
Maintainer	ClickHouse Inc.	Solo developer
Weekly commits	Yes (active)	Sparse (months)
Open issues	41 (addressed)	76 (accumulating)
Downloads/week	2.7M	1.5M
Bus factor risk	Low (company)	High (1 person)

Do NOT use clickhouse-driver despite its ~26% speed advantage for large exports. The maintenance risk outweighs performance gains:

Single maintainer (mymarilyn) with no succession plan
Issues accumulating without response
Risk of abandonment breaks production code

Exception: Only consider clickhouse-driver if you have extreme performance requirements (exporting millions of rows) AND accept the maintenance risk.

ClickHouse COMMENT = Single Source of Truth

Why

ClickHouse COMMENTs have no length limit, support newlines, URLs, and unicode
Zero performance impact on queries (pure metadata, never in data path)
Visible via DESCRIBE table, SHOW CREATE TABLE, system.columns
Survives schema migrations (preserved through ALTER operations)

What to Include in COMMENTs

Column purpose in plain English
Computation formula (if derived/computed)
Unit (seconds, milliseconds, bps, ratio)
Valid range or enum values
Anti-patterns (what NOT to do with this column)
GitHub issue link for provenance
Source script that populates the column

Example

ALTER TABLE t COMMENT COLUMN session_label
'STRICT session label. 8 values: sydney_only, tokyo_only, ...
Only set when ENTIRE bar (open→close) falls within one session.
cross_session = bar spans boundary. Use WHERE is_pure_session=1.
GitHub: https://github.com/org/repo/issues/54
Source: scripts/populate-sessions/populate_v3.py';

Anti-Pattern

NEVER create a ClickHouse column without a COMMENT. A column without documentation is a column that will be misused.

Related Skills

Skill	Purpose
`devops-tools:clickhouse-cloud-management`	User/permission management
`devops-tools:clickhouse-pydantic-config`	DBeaver connection generation
`quality-tools:schema-e2e-validation`	YAML schema contracts
`quality-tools:multi-agent-e2e-validation`	Database migration validation

Troubleshooting

Issue	Cause	Solution
Too many parts	Over-partitioned	Reduce partition granularity (monthly not daily)
Slow queries	Wrong ORDER BY order	Put lowest cardinality columns first
High memory usage	No memory limits set	Configure max_memory_usage setting
Codec error on Delta+Gorilla	Suspicious codec combination	Use each codec family independently
Projection not used	Optimizer chose different plan	Check EXPLAIN to verify projection selection
Dictionary stale	Lifetime expired	Increase LIFETIME or trigger refresh
Replication lag	Part merges falling behind	Check merge_tree settings, add resources
INSERT too slow	Small batch sizes	Batch to 10k-100k rows per INSERT

Post-Execution Reflection

After this skill completes, check before closing:

Did the command succeed? — If not, fix the instruction or error table that caused the failure.
Did parameters or output change? — If the underlying tool's interface drifted, update Usage examples and Parameters table to match.
Was a workaround needed? — If you had to improvise (different flags, extra steps), update this SKILL.md so the next invocation doesn't need the same workaround.

Only update if the issue is real and reproducible — not speculative.

clickhouse-architect

About

SKILL.md

clickhouse-architect

About

SKILL.md

ClickHouse Architect

When to Use This Skill

Core Methodology

Schema Design Workflow

ORDER BY Key Selection

Compression Codec Quick Reference

PARTITION BY Guidelines

Anti-Patterns Checklist (v24.4+)

DELETE Strategy Guide (v13.49.0+ Best Practices)

Table Engine Selection

Skill Delegation Guide

Delegation Decision Matrix

Typical Workflow Sequence

Example: Full Stack Request

Performance Accelerators

Projections

Materialized Views

Dictionaries

Scripts

Additional Resources

Reference Files

External Documentation

Python Driver Policy

Why NOT clickhouse-driver

ClickHouse COMMENT = Single Source of Truth

Why

What to Include in COMMENTs

Example

Anti-Pattern

Related Skills

Troubleshooting

Post-Execution Reflection

About

SKILL.md

About

SKILL.md

ClickHouse Architect

When to Use This Skill

Core Methodology

Schema Design Workflow

ORDER BY Key Selection

Compression Codec Quick Reference

PARTITION BY Guidelines

Anti-Patterns Checklist (v24.4+)

DELETE Strategy Guide (v13.49.0+ Best Practices)

Table Engine Selection

Skill Delegation Guide

Delegation Decision Matrix

Typical Workflow Sequence

Example: Full Stack Request

Performance Accelerators

Projections

Materialized Views

Dictionaries

Scripts

Additional Resources

Reference Files

External Documentation

Python Driver Policy

Why NOT clickhouse-driver

ClickHouse COMMENT = Single Source of Truth

Why

What to Include in COMMENTs

Example

Anti-Pattern

Related Skills

Troubleshooting

Post-Execution Reflection

Why NOT `clickhouse-driver`

Why NOT `clickhouse-driver`