bigquery-etl-core

mozilla/bigquery-etl-core

Coding

About

SKILL.md

bigquery-etl-core

mozilla/bigquery-etl-core

Coding

About

The core skill for working within the bigquery-etl repository. Use this skill when understanding project structure, conventions, and common patterns...

SKILL.md

BigQuery ETL Core

Composable: Foundation skill that works with model-requirements, query-writer, metadata-manager, sql-test-generator, and bigconfig-generator skills When to use: Understanding project structure, conventions, common patterns, and finding schema descriptions for construction

Project Overview

The bigquery-etl project manages BigQuery table definitions, queries, and associated metadata for Mozilla. Similar to dbt, the repository maintains query definitions with associated metadata and schemas.

Each table/query typically consists of three files:

query.sql OR query.py - The query definition (SQL or Python)
metadata.yaml - Metadata about scheduling, ownership, and dependencies (see metadata-manager skill)
schema.yaml - BigQuery schema definition with field types and descriptions (see metadata-manager skill)

Note: Most tables use query.sql (~95%). Use query.py for API calls, multi-project queries, or complex Python operations. See query-writer skill for details.

🚨 REQUIRED READING - Start Here

When starting work in bigquery-etl, READ these foundational references:

Naming Conventions: READ references/naming_conventions.md
- Table naming patterns
- Dataset organization
- Version suffix conventions
Dataset Organization: READ references/dataset_naming_conventions.md
- Common dataset suffixes (_derived, _stable, _live)
- When to use each dataset type
- Dataset naming rules
Schema Resources: READ references/discovery_resources.md
- Schema description sources (Glean Dictionary, ProbeInfo API, DataHub)
- Priority order for schema lookup during construction
- Common mozfun UDFs
Privacy Guidelines: READ references/privacy_guidelines.md
- Data handling requirements
- PII considerations
- Workgroup access patterns

Directory Structure

sql/{project}/{dataset}/{table_name}/
├── query.sql OR query.py
├── metadata.yaml
└── schema.yaml

See assets/directory_structure_example.txt for detailed examples.

Key principles:

Always flat: sql/{project}/{dataset}/{table_name}/
Never use subdirectories within table directories
Table names always include version suffix (_v1, _v2, etc.)

Schema & Description Resources for Construction

Finding Schema Descriptions

Priority order for schema lookup during construction:

Local files first: Check sql/*/schema.yaml and metadata.yaml files
- Most reliable and up-to-date source
- Contains field descriptions written by table owners
Glean Dictionary: For _live and _stable tables
- URL: https://dictionary.telemetry.mozilla.org/
- Contains metric descriptions from Glean schema definitions
- Use WebFetch with targeted prompts to extract specific field descriptions
ProbeInfo API: For Glean metric metadata
- Endpoints: https://probeinfo.telemetry.mozilla.org/glean/{product}/metrics
- Provides metric definitions and descriptions programmatically
- Use for validating metric references in queries
DataHub MCP: Only as last resort
- MUST READ references/datahub_best_practices.md BEFORE any DataHub queries
- Use for schema lookup when not available in local files or Glean Dictionary
- Extract ONLY necessary fields (column names, types, descriptions)
- Use for downstream impact analysis when modifying tables

See references/discovery_resources.md for:

Detailed guidance on each schema source
ProbeInfo API endpoints and usage patterns
Glean Dictionary URL patterns for different products
DataHub MCP best practices for construction
Common mozfun UDFs
Key documentation links

Naming Conventions

Table Names:

Use snake_case with version suffix: clients_daily_event_v1
Common suffixes: _daily, _hourly, _aggregates, _summary

Field Names:

Use snake_case: submission_date, client_id, n_total_events
Prefix counts with n_: n_events, n_sessions
Standard Mozilla fields: submission_date, client_id, sample_id, normalized_channel, normalized_country_code, app_version

See references/naming_conventions.md for:

Complete naming patterns and conventions
Reserved/common patterns to avoid
BigQuery project naming conventions

Dataset Organization

See references/dataset_naming_conventions.md for:

Dataset naming patterns by suffix (_derived, _external, etc.)
Common dataset prefixes by product/source
Table versioning patterns
Incremental vs full refresh query patterns

Privacy & Data Handling

Mozilla follows strict data privacy policies:

No PII in derived tables
Use client-level identifiers (client_id) not individual identifiers
Respect data retention policies (~2 years for client-level data)
Label client-level tables with table_type: client_level in metadata.yaml

See references/privacy_guidelines.md for:

Key principles from Mozilla's data platform
Geo IP lookup and user agent parsing policies
Best practices for data handling
Deletion request support
Sample ID usage for sampling

BigQuery & Mozilla Conventions

Partitioning & Clustering

Most tables use day partitioning on submission_date
Clustering improves query performance for filtered/joined fields
See metadata-manager skill for detailed partitioning and clustering configuration

Common UDFs (mozfun)

Browse available functions: https://mozilla.github.io/bigquery-etl/mozfun/

Common functions:

mozfun.map.get_key() - Extract values from key-value maps
mozfun.norm.truncate_version() - Normalize version strings
mozfun.stats.mode_last() - Statistical mode calculation

UDF source code in sql/mozfun/ directory.

Glean Overview

Glean is Mozilla's product analytics & telemetry solution, providing consistent measurement across all Mozilla products.

Key concepts:

Metric types: Counter, boolean, string, event, etc.
Pings: Collections of metrics (e.g., baseline, events, metrics)
Applications: Products using Glean (Fenix, Focus, Firefox iOS, etc.)

Common Glean datasets in BigQuery:

Pattern: {app_id}.{ping_name} (e.g., org_mozilla_fenix.baseline)
All have auto-generated schemas based on metric definitions

See references/glean_overview.md for:

What is Glean and how it differs from Firefox Desktop Telemetry
Glean SDK and metric type details
Common Glean datasets in BigQuery
When to use Glean Dictionary

bigquery-etl CLI Commands

See references/bqetl_cli_commands.md for:

Key bqetl CLI commands for query creation, validation, schema updates
How to find the right DAG for scheduling
Backfill creation commands

Best Practices

General principles:

Always include field descriptions in schema.yaml (see metadata-manager skill)
Add header comments explaining query purpose (see query-writer skill)
Reference bug/ticket numbers for context
Document any data exclusions or filtering logic

See assets/query_structure_example.sql for standard query structure.

Version migration:

Create new _v2 table when making breaking schema changes
Keep _v1 running during migration period
Update views to point to new version
Coordinate with downstream consumers before deprecating old version

For detailed best practices, see:

Query writing: query-writer skill
Metadata configuration: metadata-manager skill
Performance optimization: https://docs.telemetry.mozilla.org/cookbooks/bigquery/optimization.html
Recommended practices: https://mozilla.github.io/bigquery-etl/reference/recommended_practices/

Integration with Other Skills

bigquery-etl-core serves as the foundation skill that other skills build upon:

Works with model-requirements

Provides naming conventions for new tables and datasets
Supplies common field naming patterns for requirements gathering
Offers privacy guidelines for data model planning

Works with query-writer

Provides project structure and naming conventions
Supplies common patterns and mozfun UDF references
Offers schema description lookup guidance for construction

Works with metadata-manager

Provides DAG naming patterns and scheduling conventions
Supplies partitioning and clustering best practices
Offers ownership and labeling patterns

Works with sql-test-generator

Provides test structure and fixture naming conventions
Supplies common table patterns for test creation
Offers query parameter conventions

Works with bigconfig-generator

Provides table naming conventions for Bigeye monitoring configuration
Supplies dataset organization patterns
Offers field naming standards for data quality checks

This skill is always available and does not need to be explicitly invoked - it provides foundational knowledge that other skills reference.

Reference Examples

Real query examples in the repository:

Simple query: sql/moz-fx-data-shared-prod/mozilla_vpn_derived/users_v1/query.sql
Aggregation with GROUP BY: sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_event_v1/query.sql
Complex query with CTEs: sql/moz-fx-data-shared-prod/telemetry_derived/event_events_v1/query.sql
Python ETL (INFORMATION_SCHEMA): sql/moz-fx-data-shared-prod/monitoring_derived/bigquery_table_storage_v1/query.py
Python ETL (External API): sql/moz-fx-data-shared-prod/bigeye_derived/user_service_v1/query.py

For more examples, explore the sql/moz-fx-data-shared-prod/ directory.

Bundled Resources

References

references/discovery_resources.md - Schema description sources (Glean Dictionary, ProbeInfo API, DataHub MCP), priority order for construction, documentation links
references/naming_conventions.md - Complete naming patterns for tables, fields, and projects
references/dataset_naming_conventions.md - Dataset organization and versioning patterns
references/privacy_guidelines.md - Mozilla data privacy policies and best practices
references/glean_overview.md - Glean SDK concepts and BigQuery dataset structures
references/bqetl_cli_commands.md - Key CLI commands and DAG discovery

DataHub Usage (CRITICAL for Token Efficiency)

BEFORE using any DataHub MCP tools (mcp__datahub-cloud__*), you MUST:

READ references/datahub_best_practices.md - Comprehensive token optimization strategies
Follow priority order: local files → documentation → DataHub (only as last resort)
Use search-first patterns and extract minimal fields from responses

Assets

assets/query_structure_example.sql - Standard query.sql structure with common patterns
assets/directory_structure_example.txt - File organization examples

About

SKILL.md

About

The core skill for working within the bigquery-etl repository. Use this skill when understanding project structure, conventions, and common patterns...

SKILL.md

BigQuery ETL Core

Project Overview

Each table/query typically consists of three files:

query.sql OR query.py - The query definition (SQL or Python)
metadata.yaml - Metadata about scheduling, ownership, and dependencies (see metadata-manager skill)
schema.yaml - BigQuery schema definition with field types and descriptions (see metadata-manager skill)

Note: Most tables use query.sql (~95%). Use query.py for API calls, multi-project queries, or complex Python operations. See query-writer skill for details.

🚨 REQUIRED READING - Start Here

When starting work in bigquery-etl, READ these foundational references:

Naming Conventions: READ references/naming_conventions.md
- Table naming patterns
- Dataset organization
- Version suffix conventions
Dataset Organization: READ references/dataset_naming_conventions.md
- Common dataset suffixes (_derived, _stable, _live)
- When to use each dataset type
- Dataset naming rules
Schema Resources: READ references/discovery_resources.md
- Schema description sources (Glean Dictionary, ProbeInfo API, DataHub)
- Priority order for schema lookup during construction
- Common mozfun UDFs
Privacy Guidelines: READ references/privacy_guidelines.md
- Data handling requirements
- PII considerations
- Workgroup access patterns

Directory Structure

sql/{project}/{dataset}/{table_name}/
├── query.sql OR query.py
├── metadata.yaml
└── schema.yaml

See assets/directory_structure_example.txt for detailed examples.

Key principles:

Always flat: sql/{project}/{dataset}/{table_name}/
Never use subdirectories within table directories
Table names always include version suffix (_v1, _v2, etc.)

Schema & Description Resources for Construction

Finding Schema Descriptions

Priority order for schema lookup during construction:

Local files first: Check sql/*/schema.yaml and metadata.yaml files
- Most reliable and up-to-date source
- Contains field descriptions written by table owners
Glean Dictionary: For _live and _stable tables
- URL: https://dictionary.telemetry.mozilla.org/
- Contains metric descriptions from Glean schema definitions
- Use WebFetch with targeted prompts to extract specific field descriptions
ProbeInfo API: For Glean metric metadata
- Endpoints: https://probeinfo.telemetry.mozilla.org/glean/{product}/metrics
- Provides metric definitions and descriptions programmatically
- Use for validating metric references in queries
DataHub MCP: Only as last resort
- MUST READ references/datahub_best_practices.md BEFORE any DataHub queries
- Use for schema lookup when not available in local files or Glean Dictionary
- Extract ONLY necessary fields (column names, types, descriptions)
- Use for downstream impact analysis when modifying tables

See references/discovery_resources.md for:

Detailed guidance on each schema source
ProbeInfo API endpoints and usage patterns
Glean Dictionary URL patterns for different products
DataHub MCP best practices for construction
Common mozfun UDFs
Key documentation links

Naming Conventions

Table Names:

Use snake_case with version suffix: clients_daily_event_v1
Common suffixes: _daily, _hourly, _aggregates, _summary

Field Names:

Use snake_case: submission_date, client_id, n_total_events
Prefix counts with n_: n_events, n_sessions
Standard Mozilla fields: submission_date, client_id, sample_id, normalized_channel, normalized_country_code, app_version

See references/naming_conventions.md for:

Complete naming patterns and conventions
Reserved/common patterns to avoid
BigQuery project naming conventions

Dataset Organization

See references/dataset_naming_conventions.md for:

Dataset naming patterns by suffix (_derived, _external, etc.)
Common dataset prefixes by product/source
Table versioning patterns
Incremental vs full refresh query patterns

Privacy & Data Handling

Mozilla follows strict data privacy policies:

No PII in derived tables
Use client-level identifiers (client_id) not individual identifiers
Respect data retention policies (~2 years for client-level data)
Label client-level tables with table_type: client_level in metadata.yaml

See references/privacy_guidelines.md for:

Key principles from Mozilla's data platform
Geo IP lookup and user agent parsing policies
Best practices for data handling
Deletion request support
Sample ID usage for sampling

BigQuery & Mozilla Conventions

Partitioning & Clustering

Most tables use day partitioning on submission_date
Clustering improves query performance for filtered/joined fields
See metadata-manager skill for detailed partitioning and clustering configuration

Common UDFs (mozfun)

Browse available functions: https://mozilla.github.io/bigquery-etl/mozfun/

Common functions:

mozfun.map.get_key() - Extract values from key-value maps
mozfun.norm.truncate_version() - Normalize version strings
mozfun.stats.mode_last() - Statistical mode calculation

UDF source code in sql/mozfun/ directory.

Glean Overview

Glean is Mozilla's product analytics & telemetry solution, providing consistent measurement across all Mozilla products.

Key concepts:

Metric types: Counter, boolean, string, event, etc.
Pings: Collections of metrics (e.g., baseline, events, metrics)
Applications: Products using Glean (Fenix, Focus, Firefox iOS, etc.)

Common Glean datasets in BigQuery:

Pattern: {app_id}.{ping_name} (e.g., org_mozilla_fenix.baseline)
All have auto-generated schemas based on metric definitions

See references/glean_overview.md for:

What is Glean and how it differs from Firefox Desktop Telemetry
Glean SDK and metric type details
Common Glean datasets in BigQuery
When to use Glean Dictionary

bigquery-etl CLI Commands

See references/bqetl_cli_commands.md for:

Key bqetl CLI commands for query creation, validation, schema updates
How to find the right DAG for scheduling
Backfill creation commands

Best Practices

General principles:

Always include field descriptions in schema.yaml (see metadata-manager skill)
Add header comments explaining query purpose (see query-writer skill)
Reference bug/ticket numbers for context
Document any data exclusions or filtering logic

See assets/query_structure_example.sql for standard query structure.

Version migration:

Create new _v2 table when making breaking schema changes
Keep _v1 running during migration period
Update views to point to new version
Coordinate with downstream consumers before deprecating old version

For detailed best practices, see:

Query writing: query-writer skill
Metadata configuration: metadata-manager skill
Performance optimization: https://docs.telemetry.mozilla.org/cookbooks/bigquery/optimization.html
Recommended practices: https://mozilla.github.io/bigquery-etl/reference/recommended_practices/

Integration with Other Skills

bigquery-etl-core serves as the foundation skill that other skills build upon:

Works with model-requirements

Provides naming conventions for new tables and datasets
Supplies common field naming patterns for requirements gathering
Offers privacy guidelines for data model planning

Works with query-writer

Provides project structure and naming conventions
Supplies common patterns and mozfun UDF references
Offers schema description lookup guidance for construction

Works with metadata-manager

Provides DAG naming patterns and scheduling conventions
Supplies partitioning and clustering best practices
Offers ownership and labeling patterns

Works with sql-test-generator

Provides test structure and fixture naming conventions
Supplies common table patterns for test creation
Offers query parameter conventions

Works with bigconfig-generator

Provides table naming conventions for Bigeye monitoring configuration
Supplies dataset organization patterns
Offers field naming standards for data quality checks

This skill is always available and does not need to be explicitly invoked - it provides foundational knowledge that other skills reference.

Reference Examples

Real query examples in the repository:

Simple query: sql/moz-fx-data-shared-prod/mozilla_vpn_derived/users_v1/query.sql
Aggregation with GROUP BY: sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_event_v1/query.sql
Complex query with CTEs: sql/moz-fx-data-shared-prod/telemetry_derived/event_events_v1/query.sql
Python ETL (INFORMATION_SCHEMA): sql/moz-fx-data-shared-prod/monitoring_derived/bigquery_table_storage_v1/query.py
Python ETL (External API): sql/moz-fx-data-shared-prod/bigeye_derived/user_service_v1/query.py

For more examples, explore the sql/moz-fx-data-shared-prod/ directory.

Bundled Resources

References

references/discovery_resources.md - Schema description sources (Glean Dictionary, ProbeInfo API, DataHub MCP), priority order for construction, documentation links
references/naming_conventions.md - Complete naming patterns for tables, fields, and projects
references/dataset_naming_conventions.md - Dataset organization and versioning patterns
references/privacy_guidelines.md - Mozilla data privacy policies and best practices
references/glean_overview.md - Glean SDK concepts and BigQuery dataset structures
references/bqetl_cli_commands.md - Key CLI commands and DAG discovery

DataHub Usage (CRITICAL for Token Efficiency)

BEFORE using any DataHub MCP tools (mcp__datahub-cloud__*), you MUST:

READ references/datahub_best_practices.md - Comprehensive token optimization strategies
Follow priority order: local files → documentation → DataHub (only as last resort)
Use search-first patterns and extract minimal fields from responses

Assets

assets/query_structure_example.sql - Standard query.sql structure with common patterns
assets/directory_structure_example.txt - File organization examples