    lanej/bigquery (Data & Analytics)

    About

    Use the bigquery CLI (instead of bq) for all Google BigQuery and GCP data warehouse operations, including SQL query execution, data ingestion (streaming insert, bulk load, JSONL/CSV/Parquet), data...

    SKILL.md

    BigQuery CLI Skill

    You are a BigQuery specialist using the bigquery CLI tool. This skill provides comprehensive guidance for working with Google BigQuery through a unified TypeScript/Bun CLI with query execution, cost awareness, and MCP server integration.

    Core Capabilities

    The bigquery CLI provides:

    1. Authentication: Check status and gcloud integration
    2. Query Execution: Run SQL queries with automatic cost estimation and confirmation prompts
    3. Dry Run: Estimate query costs ($0 to run)
    4. Dataset Operations: List, create, update, delete, describe datasets
    5. Table Operations: List, describe, insert, load, extract, create, update, delete, copy tables
    6. Job Management: List, get, and cancel BigQuery jobs
    7. MCP Server: Built-in stdio and HTTP modes for AI integration (read-only by default, --enable-writes flag)
    8. Streaming Support: Native --stream flag for large result sets (outputs JSONL)
    9. Output Formats: JSON (default), JSONL (streaming), and text (human-readable)

    Authentication

    Check Authentication Status

    # Check if authenticated and verify required scopes
    bigquery auth check
    
    # Output shows:
    # - Authentication status
    # - Active account
    # - Token expiration
    # - BigQuery scopes availability
    

    Authentication Methods

    Uses Google Cloud SDK (gcloud) authentication:

    1. Application Default Credentials (ADC)
    2. OAuth access tokens
    3. Service account keys (via GOOGLE_APPLICATION_CREDENTIALS)

    No separate BigQuery authentication is required; the tool uses your existing gcloud credentials.

    Best Practice: Always run bigquery auth check first to verify authentication before operations.

    Output Format Defaults

    Different commands have different default output formats:

    Command           Default Format   Notes
    query             JSON             Machine-readable for pipelines, includes metadata
    datasets list     text             Human-readable
    tables list       text             Human-readable
    tables describe   text             Human-readable
    jobs list         text             Human-readable

    All commands support --format json or --format text to override defaults.
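
    For example, overriding the defaults above in both directions:

    # Text-default command, forced to JSON for scripting
    bigquery tables list my-project.my-dataset --format json

    # JSON-default command, forced to text for a quick look
    bigquery query --format text "SELECT 1"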

    Available Formats:

    • JSON: Structured output with metadata (rows, bytesProcessed, cost, cacheHit)
    • Text: Human-readable formatted output
    • JSONL: Newline-delimited JSON (one object per line), ideal for streaming and pipelines

    Query Operations

    Running Queries

    # Basic query execution (automatic dry-run + cost confirmation)
    bigquery query "SELECT * FROM dataset.table LIMIT 10"
    
    # Skip cost confirmation for automation
    bigquery query --yes "SELECT COUNT(*) FROM dataset.table"
    
    # JSON output (default, includes full metadata)
    bigquery query "SELECT * FROM dataset.table LIMIT 5"
    
    # JSONL output (one JSON object per line, ideal for pipelines)
    bigquery query "SELECT * FROM dataset.table" --jsonl
    
    # Stream large results (implies JSONL format)
    bigquery query "SELECT * FROM large_table" --stream > results.jsonl
    
    # Text format (human-readable)
    bigquery query --format text "SELECT * FROM dataset.table LIMIT 5"
    
    # Query from file
    bigquery query --file query.sql
    
    # Query from stdin
    cat query.sql | bigquery query --stdin --yes
    
    # Custom project
    bigquery query --project my-other-project "SELECT 1"
    

    Cost Awareness: The query command automatically:

    1. Runs a dry-run to estimate cost before execution (costs $0)
    2. Displays bytes to be processed and estimated cost
    3. Prompts for confirmation if cost exceeds threshold (default: 1 GB)
    4. Skips confirmation for queries below threshold or when --yes is used
    5. Includes cost metadata in response (bytesProcessed, estimatedCostUSD, cacheHit)

    Query Options

    # Skip cost confirmation (REQUIRED for automation/scripts)
    bigquery query --yes "SELECT * FROM dataset.table"
    
    # Set custom cost threshold (default: 1 GB)
    bigquery query --cost-threshold 5 "SELECT * FROM large_table"
    
    # Environment variable for persistent threshold
    export BIGQUERY_COST_THRESHOLD_GB=5.0
    
    # Limit rows returned (does NOT reduce cost/bytes scanned)
    bigquery query "SELECT * FROM table" --max-results 100
    
    # Stream large results (outputs JSONL)
    bigquery query "SELECT * FROM large_table" --stream > results.jsonl
    

    Query Flags:

    Flag                    Description
    --yes / -y              Skip cost confirmation prompt (for automation)
    --format <format>       Output format: json (default), text, jsonl
    --jsonl                 Shorthand for --format jsonl
    --stream                Stream results as they arrive (implies JSONL)
    --cost-threshold <gb>   Cost confirmation threshold in GB (default: 1)
    --max-results <n>       Max rows to return (does NOT reduce cost)
    --file <path>           Read query from file
    --stdin                 Read query from stdin
    -p, --project <id>      GCP project ID

    Important Notes:

    • --max-results only limits returned rows, NOT bytes scanned (no cost reduction)
    • To reduce scanned data, select only the columns you need and filter on partitioned/clustered columns; LIMIT alone does not reduce bytes scanned in BigQuery
    • Use --yes for automation (not --force, which doesn't exist)
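
    A quick way to see the effect of column pruning (table and column names are illustrative):

    # Scans every column
    bigquery dry-run "SELECT * FROM my-project.dataset.events"

    # Scans only the selected columns, so fewer bytes
    bigquery dry-run "SELECT user_id, timestamp FROM my-project.dataset.events"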

    Query Output Formats

    # JSON output (default, machine-readable)
    bigquery query "SELECT * FROM dataset.table"
    
    # Returns:
    # {
    #   "rows": [...],
    #   "totalRows": 123,
    #   "bytesProcessed": 1048576,
    #   "bytesProcessedGB": 0.001,
    #   "estimatedCostUSD": 0.00000625,
    #   "cacheHit": false
    # }
    
    # JSONL output (one JSON object per line)
    bigquery query "SELECT * FROM dataset.table" --jsonl
    
    # Returns:
    # {"col1":"value1","col2":123}
    # {"col1":"value2","col2":456}
    
    # Text output (human-readable table)
    bigquery query --format text "SELECT * FROM dataset.table"
    

    Dry Run (Cost Estimation)

    # Estimate cost without executing
    bigquery dry-run "SELECT * FROM large_dataset.table WHERE date >= '2025-01-01'"
    
    # Returns (JSON):
    # {
    #   "bytesProcessed": "1073741824",
    #   "bytesProcessedGB": 1.0,
    #   "bytesProcessedMB": 1024.0,
    #   "estimatedCostUSD": 0.00625
    # }
    
    # Text format
    bigquery dry-run "SELECT * FROM large_table" --format text
    
    # Returns:
    # Dry-run Results:
    # ──────────────────────────────────────────────────
    # Bytes to process: 1.00 GB
    # Estimated cost:   $0.0063
    

    Use dry-run to:

    • Estimate costs before running expensive queries
    • Validate query syntax
    • Check partition pruning effectiveness
    • Test queries in CI/CD pipelines
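
    As a sketch of the CI/CD case, a guard that parses the dry-run JSON shown above and fails the build when the estimate exceeds a budget (the 10 GB budget and the query.sql path are illustrative):

    #!/usr/bin/env bash
    set -euo pipefail
    # Ask BigQuery for the estimated scan size without running the query
    GB=$(bigquery dry-run "$(cat query.sql)" | jq -r '.bytesProcessedGB')
    # Fail the pipeline if the estimate exceeds the budget
    if awk -v gb="$GB" 'BEGIN { exit !(gb > 10) }'; then
      echo "Estimated scan of ${GB} GB exceeds the 10 GB budget" >&2
      exit 1
    fi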

    Cost Formula: (bytesProcessed / 1TB) * $6.25
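
    For example, a query scanning 2 TB is estimated at 2 * $6.25:

    echo '2 * 6.25' | bc   # 12.50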

    Dataset Operations

    Listing Datasets

    # List datasets in current project (text format, default)
    bigquery datasets list
    
    # List datasets in another project
    bigquery datasets list --project other-project
    
    # JSON output
    bigquery datasets list --format json
    
    # Example output shows:
    # - Dataset ID
    # - Location
    # - Creation time
    # - Labels (if any)
    

    Describing Datasets

    # Show dataset metadata
    bigquery datasets describe project.dataset
    
    # JSON output
    bigquery datasets describe project.dataset --format json
    

    Creating Datasets

    # Create dataset in default location
    bigquery datasets create my-project.new_dataset
    
    # Create with description and location
    bigquery datasets create my-project.new_dataset \
      --description "Analytics data warehouse" \
      --location US
    
    # Create with default table expiration (30 days)
    bigquery datasets create my-project.temp_dataset \
      --default-ttl 30 \
      --location US
    
    # Create with labels
    bigquery datasets create my-project.new_dataset \
      --labels "env=prod,team=analytics"
    

    Updating Datasets

    # Update description
    bigquery datasets update my-project.existing_dataset \
      --description "Updated description"
    
    # Update default table expiration
    bigquery datasets update my-project.existing_dataset \
      --default-ttl 30
    
    # Add/update labels
    bigquery datasets update my-project.existing_dataset \
      --labels env=staging \
      --labels team=data
    

    Deleting Datasets

    # Delete empty dataset (prompts for confirmation)
    bigquery datasets delete my-project.old_dataset
    
    # Delete non-empty dataset (includes all tables)
    bigquery datasets delete my-project.old_dataset --force
    
    # Skip confirmation (for automation)
    bigquery datasets delete my-project.old_dataset --force --yes
    

    Table Operations

    Listing Tables

    # List tables in a dataset (text format, first 10)
    bigquery tables list my-project.my-dataset
    
    # JSON output
    bigquery tables list my-project.my-dataset --format json
    
    # With pagination
    bigquery tables list my-project.my-dataset --max-results 20 --page-token <token>
    

    Describing Table Schema

    # Show table schema and metadata (text format)
    bigquery tables describe my-project.my-dataset.my-table
    
    # JSON output
    bigquery tables describe my-project.my-dataset.my-table --format json
    
    # Output includes:
    # - Column names and types
    # - Nullability (NULLABLE, REQUIRED, REPEATED)
    # - Mode information
    # - Table metadata (row count, size, location)
    

    Creating Tables

    # Create table with JSON schema
    bigquery tables create my-project.dataset.users \
      --schema '[{"name":"id","type":"STRING"},{"name":"email","type":"STRING"}]' \
      --description "User data"
    

    Updating Tables

    # Update description
    bigquery tables update my-project.dataset.table --description "Updated description"
    
    # Set expiration (30 days in seconds)
    bigquery tables update my-project.dataset.table --expiration 2592000
    
    # Add labels
    bigquery tables update my-project.dataset.table \
      --labels owner=team-data \
      --labels environment=production
    
    # Require partition filter for queries
    bigquery tables update my-project.dataset.table --require-partition-filter true
    

    Copying Tables

    # Copy table within same project
    bigquery tables copy my-project.source.table my-project.dest.table_copy
    
    # Copy to different dataset
    bigquery tables copy my-project.source.table my-project.archive.table_backup
    
    # Overwrite destination if exists
    bigquery tables copy my-project.source.table my-project.dest.existing_table \
      --write-disposition WRITE_TRUNCATE
    

    Deleting Tables

    # Delete table
    bigquery tables delete my-project.dataset.old_table
    

    Inserting Rows (Small Datasets)

    Best for <1000 rows. Uses streaming insert API for immediate availability.

    JSONL (Newline-Delimited JSON) Format

    From JSONL File:

    # Create sample JSONL file
    cat > users.jsonl <<EOF
    {"id": "1", "name": "Alice Johnson", "email": "alice@example.com", "age": 30}
    {"id": "2", "name": "Bob Smith", "email": "bob@example.com", "age": 25}
    {"id": "3", "name": "Charlie Brown", "email": "charlie@example.com", "age": 35}
    EOF
    
    # Insert from JSONL file
    bigquery tables insert my-project.dataset.users users.jsonl --format jsonl
    

    From JSONL Stream (stdin):

    # Stream from command output
    echo '{"id": "1", "name": "Alice", "email": "alice@example.com"}' | \
      bigquery tables insert my-project.dataset.users - --format jsonl
    
    # Stream from heredoc
    cat << EOF | bigquery tables insert my-project.dataset.users - --format jsonl
    {"id": "1", "name": "Alice", "email": "alice@example.com", "age": 30}
    {"id": "2", "name": "Bob", "email": "bob@example.com", "age": 25}
    {"id": "3", "name": "Charlie", "email": "charlie@example.com", "age": 35}
    EOF
    
    # Stream from application output
    my-etl-tool --output jsonl | bigquery tables insert my-project.dataset.events -
    
    # Stream from jq transformation
    cat raw_data.json | jq -c '.records[]' | \
      bigquery tables insert my-project.dataset.processed -
    

    JSONL Format Requirements:

    • Each line is a separate JSON object
    • Empty lines are automatically skipped
    • No commas between objects
    • Ideal for streaming and large datasets
    • Format: {"field1":"value1","field2":"value2"}\n

    Additional Insert Options

    # Skip invalid rows instead of failing
    bigquery tables insert my-project.dataset.users users.jsonl --skip-invalid-rows
    
    # Ignore unknown fields in data
    bigquery tables insert my-project.dataset.users users.jsonl --ignore-unknown-values
    
    # Combine options for production pipelines
    cat production_data.jsonl | \
      bigquery tables insert my-project.dataset.production - \
        --format jsonl \
        --skip-invalid-rows \
        --ignore-unknown-values
    

    Insert Options:

    • --format <FORMAT>: Data format (json or jsonl)
    • --skip-invalid-rows: Skip invalid rows instead of failing
    • --ignore-unknown-values: Ignore unknown fields in data

    Loading Data (Large Datasets)

    Best for >10MB files or >1000 rows. Uses BigQuery load jobs.

    # Load from Cloud Storage URI (RECOMMENDED)
    bigquery tables load my-project.dataset.users \
      gs://my-bucket/data.csv --format csv
    
    # Load with schema auto-detection
    bigquery tables load my-project.dataset.new_table \
      gs://my-bucket/data.csv --format csv --autodetect
    
    # Load with replace write disposition (truncates table first)
    bigquery tables load my-project.dataset.users \
      gs://my-bucket/data.csv --format csv --write-disposition WRITE_TRUNCATE
    
    # Load JSON file
    bigquery tables load my-project.dataset.events \
      gs://my-bucket/events.json --format json
    
    # Supported formats: csv, json, jsonl, avro, parquet
    bigquery tables load my-project.dataset.table \
      gs://my-bucket/data.parquet --format parquet
    
    # CSV with skip leading rows
    bigquery tables load my-project.dataset.table \
      gs://my-bucket/data.csv --format csv --skip-leading-rows 1
    
    # Append to existing table (default)
    bigquery tables load my-project.dataset.table \
      gs://my-bucket/data.json --format json --write-disposition WRITE_APPEND
    

    Load Options:

    • --format <FORMAT>: Source format (csv, json, jsonl, parquet, avro)
    • --autodetect: Auto-detect schema from source
    • --skip-leading-rows <N>: Skip N leading rows (CSV)
    • --write-disposition <MODE>: WRITE_TRUNCATE (replace), WRITE_APPEND (default), WRITE_EMPTY
    • --create-disposition <MODE>: CREATE_IF_NEEDED (default), CREATE_NEVER

    When to Use:

    • Large datasets (>1000 rows or >10MB)
    • Data already in Cloud Storage
    • Bulk data migrations

    When NOT to Use:

    • Small datasets (<1000 rows) → Use tables insert instead

    Extracting Data

    Export table data to Cloud Storage in various formats:

    # Extract table to Cloud Storage as CSV
    bigquery tables extract my-project.dataset.users \
      gs://my-bucket/exports/users.csv --format csv
    
    # Extract as JSON (newline-delimited)
    bigquery tables extract my-project.dataset.events \
      gs://my-bucket/exports/events-*.json --format json
    
    # Extract with compression
    bigquery tables extract my-project.dataset.large_table \
      gs://my-bucket/exports/data-*.csv.gz --format csv --compression GZIP
    
    # Extract as Parquet
    bigquery tables extract my-project.dataset.analytics \
      gs://my-bucket/exports/analytics.parquet --format parquet
    
    # CSV with header
    bigquery tables extract my-project.dataset.data \
      gs://my-bucket/data.csv --format csv --print-header
    

    Supported Formats: CSV, JSON (newline-delimited), Parquet, Avro
    Compression: NONE (default), GZIP

    Job Management

    BigQuery jobs are asynchronous operations for queries, loads, exports, and copies.

    Listing Jobs

    # List recent jobs
    bigquery jobs list
    
    # List with pagination
    bigquery jobs list --max-results 20 --page-token <token>
    
    # Filter by state
    bigquery jobs list --state-filter running
    bigquery jobs list --state-filter done
    
    # Show jobs from all users
    bigquery jobs list --all-users
    
    # JSON output
    bigquery jobs list --format json
    

    Showing Job Details

    # Show job details
    bigquery jobs get job_abc123xyz
    
    # JSON output
    bigquery jobs get job_abc123xyz --format json
    

    Canceling Jobs

    # Cancel a running job
    bigquery jobs cancel job_abc123xyz
    

    MCP Server Integration

    The BigQuery MCP server provides AI integration via Model Context Protocol.

    Starting MCP Server

    STDIO Mode (for local clients):

    # Start MCP server in stdio mode (read-only)
    bigquery mcp stdio
    
    # Start with write operations enabled
    bigquery mcp stdio --enable-writes
    
    # Server will:
    # - Accept MCP protocol messages on stdin
    # - Send responses on stdout
    # - Expose BigQuery tools to MCP clients
    

    HTTP Mode (for network clients):

    # Start HTTP MCP server on default port 8080
    bigquery mcp http
    
    # Specify custom port
    bigquery mcp http --port 3000
    
    # With OAuth and email domain restriction
    bigquery mcp http \
      --google-client-id <id> \
      --google-client-secret <secret> \
      --domain example.com
    
    # Enable write operations
    bigquery mcp http --enable-writes
    
    # Server provides:
    # - HTTP endpoint for MCP protocol
    # - JSON-RPC over HTTP
    # - Remote access to BigQuery tools
    

    MCP Configuration

    Configure in Claude Code or other MCP-enabled applications:

    STDIO Mode (.claude/mcp.json or similar):

    {
      "mcpServers": {
        "bigquery": {
          "command": "bigquery",
          "args": ["mcp", "stdio"],
          "env": {
            "GOOGLE_CLOUD_PROJECT": "my-project"
          }
        }
      }
    }
    

    STDIO Mode with Writes Enabled:

    {
      "mcpServers": {
        "bigquery": {
          "command": "bigquery",
          "args": ["mcp", "stdio", "--enable-writes"],
          "env": {
            "GOOGLE_CLOUD_PROJECT": "my-project"
          }
        }
      }
    }
    

    HTTP Mode:

    {
      "mcpServers": {
        "bigquery": {
          "url": "http://localhost:8080",
          "transport": "http"
        }
      }
    }
    

    Common Workflows

    Workflow 1: Exploratory Data Analysis

    # 1. Verify authentication
    bigquery auth check
    
    # 2. List available datasets
    bigquery datasets list
    
    # 3. List tables in dataset
    bigquery tables list my-project.analytics
    
    # 4. Check table schema
    bigquery tables describe my-project.analytics.events
    
    # 5. Preview data (text format for readability)
    bigquery query --format text \
      "SELECT * FROM my-project.analytics.events LIMIT 10"
    
    # 6. Get row count
    bigquery query "SELECT COUNT(*) as total FROM my-project.analytics.events"
    
    # 7. Check data distribution
    bigquery query --format text "
      SELECT
        DATE(timestamp) as date,
        COUNT(*) as events
      FROM my-project.analytics.events
      GROUP BY date
      ORDER BY date DESC
      LIMIT 30
    "
    

    Workflow 2: Cost-Aware Query Development

    # 1. Dry run to estimate cost
    bigquery dry-run "
      SELECT *
      FROM my-project.large_dataset.table
      WHERE date >= '2025-01-01'
    "
    
    # 2. If cost is acceptable, run query
    bigquery query "
      SELECT *
      FROM my-project.large_dataset.table
      WHERE date >= '2025-01-01'
    "
    
    # 3. For automation, skip confirmation
    bigquery query --yes "
      SELECT *
      FROM my-project.large_dataset.table
      WHERE date >= '2025-01-01'
    " > results.json
    

    Workflow 3: Data Loading Pipeline

    # 1. Load initial data
    bigquery tables load my-project.dataset.events \
      gs://bucket/events-2025-01-01.csv \
      --format csv \
      --write-disposition WRITE_TRUNCATE
    
    # 2. Append incremental data
    bigquery tables load my-project.dataset.events \
      gs://bucket/events-2025-01-02.csv \
      --format csv \
      --write-disposition WRITE_APPEND
    
    # 3. Verify data loaded
    bigquery query "
      SELECT
        DATE(timestamp) as date,
        COUNT(*) as count
      FROM my-project.dataset.events
      GROUP BY date
      ORDER BY date
    "
    

    Workflow 4: Real-Time Data Insertion

    # 1. Stream JSONL from application
    my-app --output jsonl | bigquery tables insert my-project.dataset.events -
    
    # 2. Stream with transformation and error handling
    cat raw_events.json | jq -c '.events[]' | \
      bigquery tables insert my-project.dataset.events - \
        --skip-invalid-rows \
        --ignore-unknown-values
    

    Workflow 5: Streaming Large Results

    # 1. Stream query results to JSONL file
    bigquery query "SELECT * FROM my-project.dataset.large_table" \
      --stream > results.jsonl
    
    # 2. Process with DuckDB
    duckdb -c "SELECT user_id, COUNT(*) FROM read_json_auto('results.jsonl') GROUP BY user_id"
    
    # 3. Or process line-by-line
    cat results.jsonl | while IFS= read -r line; do
      echo "$line" | jq '.field'
    done
    

    Append-Only Raw Table Pattern

    Use this architecture when a table needs to track evolving state over time without destructive updates.

    Structure:

    • table_raw — append-only write target; never UPDATE or DELETE rows in-place
    • table (VIEW) — the read surface; deduplicates raw via ROW_NUMBER() OVER (PARTITION BY <natural_key> ORDER BY loaded_at DESC) = 1
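
    A minimal sketch of this structure (table, column, and key names are illustrative; adjust the validity guard to your own domain):

    # Append-only raw table with the required loaded_at column
    bigquery query --yes "
      CREATE TABLE \`my-project.dataset.users_raw\` (
        user_id   STRING    NOT NULL,
        email     STRING,
        loaded_at TIMESTAMP NOT NULL
      )
    "

    # Dedup view: the newest row per natural key wins
    bigquery query --yes "
      CREATE VIEW \`my-project.dataset.users\` AS
      SELECT * EXCEPT (rn)
      FROM (
        SELECT
          *,
          ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY loaded_at DESC) AS rn
        FROM \`my-project.dataset.users_raw\`
        WHERE STARTS_WITH(user_id, 'usr_')  -- domain-validity guard
      )
      WHERE rn = 1
    "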

    Write rules:

    • All INSERT operations target table_raw only; never INSERT into the view (BigQuery rejects DML on views)
    • Include a loaded_at TIMESTAMP REQUIRED column on the raw table; always set it to CURRENT_TIMESTAMP() at write time
    • To "update" a record: INSERT a new row with corrected values and a fresh loaded_at; the view surfaces the newest row automatically (see the example after this list)
    • Validate domain constraints before writing (e.g., expected ID prefixes, enum values, FK existence); skip invalid rows with a warning rather than failing the batch
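
    For instance, correcting a user's email per the rules above (names are illustrative; streaming inserts must supply loaded_at explicitly, here as the current UTC time):

    echo '{"user_id":"usr_1","email":"corrected@example.com","loaded_at":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"}' | \
      bigquery tables insert my-project.dataset.users_raw - --format jsonl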

    Read rules:

    • Always query the view, never the raw table directly
    • Apply domain-validity guards in the view's inner WHERE clause to prevent corrupt rows from surfacing even if they reach the raw table

    Integrity checks:

    • Verify periodically: row count on view == COUNT(DISTINCT <natural_key>) on raw (filtered to valid rows only)
    • Within-version duplicates (same (<natural_key>, version_tag) appearing more than once in raw) indicate a write path ran more than once; clean with MERGE … WHEN NOT MATCHED BY SOURCE THEN DELETE using a MAX(loaded_at) keeper subquery
    • Foreign key validation: LEFT JOIN the raw table to the referenced source table; rows where the join produces NULL are orphans and should be investigated or deleted
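
    Sketches of these three checks against the illustrative users_raw table (the version_tag column and the accounts reference table are hypothetical):

    # 1. View row count should equal distinct valid natural keys in raw
    bigquery query --format text "
      SELECT
        (SELECT COUNT(*) FROM \`my-project.dataset.users\`) AS view_rows,
        (SELECT COUNT(DISTINCT user_id)
         FROM \`my-project.dataset.users_raw\`
         WHERE STARTS_WITH(user_id, 'usr_')) AS raw_keys
    "

    # 2. Remove within-version duplicates, keeping the newest row
    #    per (user_id, version_tag)
    bigquery query "
      MERGE \`my-project.dataset.users_raw\` t
      USING (
        SELECT user_id, version_tag, MAX(loaded_at) AS keep_loaded_at
        FROM \`my-project.dataset.users_raw\`
        GROUP BY user_id, version_tag
      ) k
      ON t.user_id = k.user_id
        AND t.version_tag = k.version_tag
        AND t.loaded_at = k.keep_loaded_at
      WHEN NOT MATCHED BY SOURCE THEN DELETE
    "

    # 3. Orphaned foreign keys: raw rows whose account_id has no match
    bigquery query --format text "
      SELECT r.user_id, r.account_id
      FROM \`my-project.dataset.users_raw\` r
      LEFT JOIN \`my-project.dataset.accounts\` a ON r.account_id = a.id
      WHERE a.id IS NULL
    "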

    Best Practices

    Query Development

    1. Always dry-run first: Use bigquery dry-run to estimate costs
    2. Validate before running: Check syntax and cost before execution
    3. Use text format for exploration: --format text for human-readable tables
    4. Use JSON for automation: --format json for machine processing
    5. Use JSONL for streaming: --jsonl or --stream for large results
    6. Skip confirmations in scripts: Use --yes flag for automation

    Cost Management

    1. Dry run expensive queries: Always estimate with bigquery dry-run
    2. Monitor bytes processed: Check query cost estimates before running
    3. Use partition pruning: Filter on partitioned columns in WHERE clauses
    4. Limit result sets: Use LIMIT to keep exploratory output manageable (it does not reduce bytes scanned)
    5. Select only needed columns: SELECT col1, col2 not SELECT *
    6. Set cost thresholds: Use BIGQUERY_COST_THRESHOLD_GB environment variable

    Table Architecture

    1. Raw + view split: Use an append-only *_raw table for writes and a dedup VIEW for reads; never write to the view, never read from raw directly
    2. loaded_at as dedup key: The loaded_at TIMESTAMP REQUIRED column determines which row wins in the view; always write CURRENT_TIMESTAMP() at INSERT time
    3. Updates via INSERT: To correct a row, INSERT a new version with corrected values and a fresh loaded_at; do not UPDATE the raw table in-place
    4. Validate before INSERT: Check domain constraints (ID format, enum values, FK existence) before writing; skip bad rows with a logged warning rather than allowing them to corrupt the raw table
    5. Guard the view: Add a WHERE clause to the view's inner query enforcing domain invariants as a last-resort filter against rows that bypass write-time validation
    6. Never target a view for DML: BigQuery rejects INSERT/UPDATE/DELETE on views; always use the underlying raw table name in write code paths

    Authentication

    1. Check auth first: Run bigquery auth check before operations
    2. Use service accounts: For automation and CI/CD
    3. Verify scopes: Ensure all required BigQuery scopes are granted
    4. Re-authenticate when needed: If check fails

    Data Loading

    1. Choose the right method:
      • Use insert for <1000 rows (streaming insert API, immediate availability)
      • Use load for >10MB files or >1000 rows (load jobs with Cloud Storage)
    2. Use JSONL for streaming: Newline-delimited JSON is ideal for streaming pipelines
    3. Stream from stdin: Use - as file argument to pipe data from applications
    4. Handle bad records: Use --skip-invalid-rows for messy data
    5. Choose write disposition: WRITE_TRUNCATE for full refresh, WRITE_APPEND for incremental
    6. Use appropriate formats: CSV for simple data, JSON/JSONL for complex, Parquet for large datasets

    Output Format Selection

    • JSON (--format json): Default, full metadata, parsing with jq
    • Text (--format text): Human-readable, terminal inspection
    • JSONL (--jsonl or --stream): Streaming, pipelines, DuckDB ingestion

    Examples:

    # Parse with jq
    bigquery query "SELECT * FROM table" | jq '.rows[] | select(.amount > 100)'
    
    # Stream to file
    bigquery query "SELECT * FROM large_table" --stream > data.jsonl
    
    # Load into DuckDB
    bigquery query "SELECT * FROM table" --jsonl | \
      duckdb -c "SELECT * FROM read_json_auto('/dev/stdin')"
    

    MCP Server

    1. Use stdio for local: Prefer stdio mode for local MCP clients
    2. Use HTTP for remote: Use HTTP mode for networked deployments
    3. Read-only by default: Only enable writes when needed (--enable-writes)
    4. Secure HTTP endpoints: Put HTTP server behind authentication/firewall
    5. Monitor server logs: Check for errors and performance issues

    Configuration

    Environment Variables

    # Set default project
    export GOOGLE_CLOUD_PROJECT=my-project
    
    # Set cost threshold (in GB, default: 1)
    export BIGQUERY_COST_THRESHOLD_GB=5.0
    
    # Set credentials (for service accounts)
    export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
    
    # Add to ~/.zshrc or ~/.bashrc for persistence
    echo 'export GOOGLE_CLOUD_PROJECT=my-project' >> ~/.zshrc
    echo 'export BIGQUERY_COST_THRESHOLD_GB=5.0' >> ~/.zshrc
    

    Authentication Methods

    User Credentials (interactive):

    gcloud auth application-default login
    bigquery auth check
    

    Service Account (automation):

    export GOOGLE_APPLICATION_CREDENTIALS=/path/to/sa-key.json
    bigquery auth check
    

    Troubleshooting

    Issue: "Not authenticated" or "Permission denied"

    Solution: Check authentication and scopes

    # Check current auth status
    bigquery auth check
    
    # Re-authenticate if needed
    gcloud auth application-default login
    
    # Verify gcloud is set to correct project
    gcloud config get-value project
    
    # Set project if needed
    gcloud config set project my-project
    

    Issue: "Table not found"

    Solution: Use fully qualified table names

    # Wrong - missing project/dataset
    bigquery query "SELECT * FROM table"
    
    # Correct - fully qualified
    bigquery query "SELECT * FROM my-project.my-dataset.my-table"
    
    # Or use backticks for reserved words
    bigquery query "SELECT * FROM \`my-project.my-dataset.my-table\`"
    

    Issue: "Query too expensive"

    Solution: Check cost with dry-run and optimize

    # Check estimated cost
    bigquery dry-run "SELECT * FROM large_table WHERE date >= '2025-01-01'"
    
    # Optimize with partition filters
    bigquery dry-run "
      SELECT * FROM large_table
      WHERE _PARTITIONDATE = '2025-01-15'
    "
    

    Quick Reference

    # Authentication
    bigquery auth check                          # Check auth status
    
    # Queries (default: JSON output)
    bigquery query "SELECT ..."                  # Execute query (JSON)
    bigquery query --yes "SELECT ..."            # Skip confirmation
    bigquery query --format text "SELECT ..."    # Human-readable table
    bigquery query --jsonl "SELECT ..."          # JSONL output
    bigquery query --stream "SELECT ..."         # Stream large results
    bigquery dry-run "SELECT ..."                # Estimate cost
    
    # Datasets
    bigquery datasets list                       # List datasets
    bigquery datasets describe PROJECT.DATASET   # Describe dataset
    bigquery datasets create PROJECT.DATASET     # Create dataset
    bigquery datasets update PROJECT.DATASET --description "..." # Update dataset
    bigquery datasets delete PROJECT.DATASET     # Delete dataset
    
    # Tables - Read Operations
    bigquery tables list PROJECT.DATASET         # List tables
    bigquery tables describe TABLE               # Show schema
    
    # Tables - Write Operations
    bigquery tables create TABLE --schema "..."  # Create table
    bigquery tables insert TABLE file.jsonl --format jsonl  # Insert from file
    cat data.jsonl | bigquery tables insert TABLE -         # Stream insert
    bigquery tables load TABLE gs://bucket/file.csv         # Bulk load
    bigquery tables copy SOURCE DEST             # Copy table
    bigquery tables delete TABLE                 # Delete table
    
    # Tables - Extract
    bigquery tables extract TABLE gs://bucket/output.csv    # Export to GCS
    
    # Jobs
    bigquery jobs list                           # List jobs
    bigquery jobs get JOB_ID                     # Job details
    bigquery jobs cancel JOB_ID                  # Cancel job
    
    # MCP Server
    bigquery mcp stdio                           # MCP server (stdio, read-only)
    bigquery mcp stdio --enable-writes           # MCP server (stdio, with writes)
    bigquery mcp http --port 3000                # MCP server (HTTP)
    

    Summary

    Primary commands:

    • bigquery auth check - Authentication management
    • bigquery query - Execute SQL with automatic cost awareness
    • bigquery dry-run - Estimate query costs ($0 to run)
    • bigquery datasets {list,describe,create,update,delete} - Dataset operations
    • bigquery tables {list,describe,insert,load,extract,create,update,delete,copy} - Table operations
    • bigquery jobs {list,get,cancel} - Job management
    • bigquery mcp {stdio,http} - MCP server modes

    Key features:

    • Cost-aware query execution with automatic dry-run and confirmation prompts
    • Configurable cost thresholds (BIGQUERY_COST_THRESHOLD_GB)
    • Streaming insert API for real-time data (<1000 rows)
    • Bulk load from Cloud Storage (>10MB files)
    • Native streaming support for large results (--stream flag outputs JSONL)
    • Built-in MCP server for AI integration (stdio and HTTP modes)
    • Three output formats: JSON (default, full metadata), JSONL (streaming), text (human-readable)
    • Instant startup with TypeScript/Bun (~10ms vs ~500ms for Python bq)