Smithery Logo
MCPsSkillsDocsPricing
Login
Smithery Logo

Accelerating the Agent Economy

Resources

DocumentationPrivacy PolicySystem Status

Company

PricingAboutBlog

Connect

© 2026 Smithery. All rights reserved.

    mozilla

    bigquery-etl-core

    mozilla/bigquery-etl-core
    Coding
    1

    About

    SKILL.md

    Install

    Install via Skills CLI

    or add to your agent
    • Claude Code
      Claude Code
    • Codex
      Codex
    • OpenClaw
      OpenClaw
    • Cursor
      Cursor
    • Amp
      Amp
    • GitHub Copilot
      GitHub Copilot
    • Gemini CLI
      Gemini CLI
    • Kilo Code
      Kilo Code
    • Junie
      Junie
    • Replit
      Replit
    • Windsurf
      Windsurf
    • Cline
      Cline
    • Continue
      Continue
    • OpenCode
      OpenCode
    • OpenHands
      OpenHands
    • Roo Code
      Roo Code
    • Augment
      Augment
    • Goose
      Goose
    • Trae
      Trae
    • Zencoder
      Zencoder
    • Antigravity
      Antigravity
    ├─
    ├─
    └─

    About

    The core skill for working within the bigquery-etl repository. Use this skill when understanding project structure, conventions, and common patterns...

    SKILL.md

    BigQuery ETL Core

    Composable: Foundation skill that works with model-requirements, query-writer, metadata-manager, sql-test-generator, and bigconfig-generator skills When to use: Understanding project structure, conventions, common patterns, and finding schema descriptions for construction

    Project Overview

    The bigquery-etl project manages BigQuery table definitions, queries, and associated metadata for Mozilla. Similar to dbt, the repository maintains query definitions with associated metadata and schemas.

    Each table/query typically consists of three files:

    • query.sql OR query.py - The query definition (SQL or Python)
    • metadata.yaml - Metadata about scheduling, ownership, and dependencies (see metadata-manager skill)
    • schema.yaml - BigQuery schema definition with field types and descriptions (see metadata-manager skill)

    Note: Most tables use query.sql (~95%). Use query.py for API calls, multi-project queries, or complex Python operations. See query-writer skill for details.

    🚨 REQUIRED READING - Start Here

    When starting work in bigquery-etl, READ these foundational references:

    1. Naming Conventions: READ references/naming_conventions.md

      • Table naming patterns
      • Dataset organization
      • Version suffix conventions
    2. Dataset Organization: READ references/dataset_naming_conventions.md

      • Common dataset suffixes (_derived, _stable, _live)
      • When to use each dataset type
      • Dataset naming rules
    3. Schema Resources: READ references/discovery_resources.md

      • Schema description sources (Glean Dictionary, ProbeInfo API, DataHub)
      • Priority order for schema lookup during construction
      • Common mozfun UDFs
    4. Privacy Guidelines: READ references/privacy_guidelines.md

      • Data handling requirements
      • PII considerations
      • Workgroup access patterns

    Directory Structure

    sql/{project}/{dataset}/{table_name}/
    ├── query.sql OR query.py
    ├── metadata.yaml
    └── schema.yaml
    

    See assets/directory_structure_example.txt for detailed examples.

    Key principles:

    • Always flat: sql/{project}/{dataset}/{table_name}/
    • Never use subdirectories within table directories
    • Table names always include version suffix (_v1, _v2, etc.)

    Schema & Description Resources for Construction

    Finding Schema Descriptions

    Priority order for schema lookup during construction:

    1. Local files first: Check sql/*/schema.yaml and metadata.yaml files

      • Most reliable and up-to-date source
      • Contains field descriptions written by table owners
    2. Glean Dictionary: For _live and _stable tables

      • URL: https://dictionary.telemetry.mozilla.org/
      • Contains metric descriptions from Glean schema definitions
      • Use WebFetch with targeted prompts to extract specific field descriptions
    3. ProbeInfo API: For Glean metric metadata

      • Endpoints: https://probeinfo.telemetry.mozilla.org/glean/{product}/metrics
      • Provides metric definitions and descriptions programmatically
      • Use for validating metric references in queries
    4. DataHub MCP: Only as last resort

      • MUST READ references/datahub_best_practices.md BEFORE any DataHub queries
      • Use for schema lookup when not available in local files or Glean Dictionary
      • Extract ONLY necessary fields (column names, types, descriptions)
      • Use for downstream impact analysis when modifying tables

    See references/discovery_resources.md for:

    • Detailed guidance on each schema source
    • ProbeInfo API endpoints and usage patterns
    • Glean Dictionary URL patterns for different products
    • DataHub MCP best practices for construction
    • Common mozfun UDFs
    • Key documentation links

    Naming Conventions

    Table Names:

    • Use snake_case with version suffix: clients_daily_event_v1
    • Common suffixes: _daily, _hourly, _aggregates, _summary

    Field Names:

    • Use snake_case: submission_date, client_id, n_total_events
    • Prefix counts with n_: n_events, n_sessions
    • Standard Mozilla fields: submission_date, client_id, sample_id, normalized_channel, normalized_country_code, app_version

    See references/naming_conventions.md for:

    • Complete naming patterns and conventions
    • Reserved/common patterns to avoid
    • BigQuery project naming conventions

    Dataset Organization

    See references/dataset_naming_conventions.md for:

    • Dataset naming patterns by suffix (_derived, _external, etc.)
    • Common dataset prefixes by product/source
    • Table versioning patterns
    • Incremental vs full refresh query patterns

    Privacy & Data Handling

    Mozilla follows strict data privacy policies:

    • No PII in derived tables
    • Use client-level identifiers (client_id) not individual identifiers
    • Respect data retention policies (~2 years for client-level data)
    • Label client-level tables with table_type: client_level in metadata.yaml

    See references/privacy_guidelines.md for:

    • Key principles from Mozilla's data platform
    • Geo IP lookup and user agent parsing policies
    • Best practices for data handling
    • Deletion request support
    • Sample ID usage for sampling

    BigQuery & Mozilla Conventions

    Partitioning & Clustering

    • Most tables use day partitioning on submission_date
    • Clustering improves query performance for filtered/joined fields
    • See metadata-manager skill for detailed partitioning and clustering configuration

    Common UDFs (mozfun)

    Browse available functions: https://mozilla.github.io/bigquery-etl/mozfun/

    Common functions:

    • mozfun.map.get_key() - Extract values from key-value maps
    • mozfun.norm.truncate_version() - Normalize version strings
    • mozfun.stats.mode_last() - Statistical mode calculation

    UDF source code in sql/mozfun/ directory.

    Glean Overview

    Glean is Mozilla's product analytics & telemetry solution, providing consistent measurement across all Mozilla products.

    Key concepts:

    • Metric types: Counter, boolean, string, event, etc.
    • Pings: Collections of metrics (e.g., baseline, events, metrics)
    • Applications: Products using Glean (Fenix, Focus, Firefox iOS, etc.)

    Common Glean datasets in BigQuery:

    • Pattern: {app_id}.{ping_name} (e.g., org_mozilla_fenix.baseline)
    • All have auto-generated schemas based on metric definitions

    See references/glean_overview.md for:

    • What is Glean and how it differs from Firefox Desktop Telemetry
    • Glean SDK and metric type details
    • Common Glean datasets in BigQuery
    • When to use Glean Dictionary

    bigquery-etl CLI Commands

    See references/bqetl_cli_commands.md for:

    • Key bqetl CLI commands for query creation, validation, schema updates
    • How to find the right DAG for scheduling
    • Backfill creation commands

    Best Practices

    General principles:

    • Always include field descriptions in schema.yaml (see metadata-manager skill)
    • Add header comments explaining query purpose (see query-writer skill)
    • Reference bug/ticket numbers for context
    • Document any data exclusions or filtering logic

    See assets/query_structure_example.sql for standard query structure.

    Version migration:

    • Create new _v2 table when making breaking schema changes
    • Keep _v1 running during migration period
    • Update views to point to new version
    • Coordinate with downstream consumers before deprecating old version

    For detailed best practices, see:

    • Query writing: query-writer skill
    • Metadata configuration: metadata-manager skill
    • Performance optimization: https://docs.telemetry.mozilla.org/cookbooks/bigquery/optimization.html
    • Recommended practices: https://mozilla.github.io/bigquery-etl/reference/recommended_practices/

    Integration with Other Skills

    bigquery-etl-core serves as the foundation skill that other skills build upon:

    Works with model-requirements

    • Provides naming conventions for new tables and datasets
    • Supplies common field naming patterns for requirements gathering
    • Offers privacy guidelines for data model planning

    Works with query-writer

    • Provides project structure and naming conventions
    • Supplies common patterns and mozfun UDF references
    • Offers schema description lookup guidance for construction

    Works with metadata-manager

    • Provides DAG naming patterns and scheduling conventions
    • Supplies partitioning and clustering best practices
    • Offers ownership and labeling patterns

    Works with sql-test-generator

    • Provides test structure and fixture naming conventions
    • Supplies common table patterns for test creation
    • Offers query parameter conventions

    Works with bigconfig-generator

    • Provides table naming conventions for Bigeye monitoring configuration
    • Supplies dataset organization patterns
    • Offers field naming standards for data quality checks

    This skill is always available and does not need to be explicitly invoked - it provides foundational knowledge that other skills reference.

    Reference Examples

    Real query examples in the repository:

    • Simple query: sql/moz-fx-data-shared-prod/mozilla_vpn_derived/users_v1/query.sql
    • Aggregation with GROUP BY: sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_event_v1/query.sql
    • Complex query with CTEs: sql/moz-fx-data-shared-prod/telemetry_derived/event_events_v1/query.sql
    • Python ETL (INFORMATION_SCHEMA): sql/moz-fx-data-shared-prod/monitoring_derived/bigquery_table_storage_v1/query.py
    • Python ETL (External API): sql/moz-fx-data-shared-prod/bigeye_derived/user_service_v1/query.py

    For more examples, explore the sql/moz-fx-data-shared-prod/ directory.

    Bundled Resources

    References

    • references/discovery_resources.md - Schema description sources (Glean Dictionary, ProbeInfo API, DataHub MCP), priority order for construction, documentation links
    • references/naming_conventions.md - Complete naming patterns for tables, fields, and projects
    • references/dataset_naming_conventions.md - Dataset organization and versioning patterns
    • references/privacy_guidelines.md - Mozilla data privacy policies and best practices
    • references/glean_overview.md - Glean SDK concepts and BigQuery dataset structures
    • references/bqetl_cli_commands.md - Key CLI commands and DAG discovery

    DataHub Usage (CRITICAL for Token Efficiency)

    BEFORE using any DataHub MCP tools (mcp__datahub-cloud__*), you MUST:

    • READ references/datahub_best_practices.md - Comprehensive token optimization strategies
    • Follow priority order: local files → documentation → DataHub (only as last resort)
    • Use search-first patterns and extract minimal fields from responses

    Assets

    • assets/query_structure_example.sql - Standard query.sql structure with common patterns
    • assets/directory_structure_example.txt - File organization examples
    Recommended Servers
    Google BigQuery
    Google BigQuery
    Vercel Grep
    Vercel Grep
    Bitbucket
    Bitbucket
    Repository
    mozilla/bigquery-etl-skills
    Files