    vanman2024/google-cloud-configs
    About

    Google Cloud Platform configuration templates for BigQuery ML and Vertex AI training with authentication setup, GPU/TPU configs, and cost estimation tools...

    SKILL.md

    Use when:

    • Setting up BigQuery ML for SQL-based machine learning
    • Configuring Vertex AI custom training jobs
    • Setting up GCP authentication for ML workflows
    • Selecting appropriate GPU/TPU configurations
    • Estimating costs for GCP ML training
    • Deploying models to Vertex AI endpoints
    • Configuring distributed training on GCP
    • Optimizing cost vs performance for cloud ML

    Platform Overview

    BigQuery ML

    What it is: SQL-based machine learning directly in BigQuery

    Best for:

    • Quick ML prototypes using existing data warehouse data
    • Classification, regression, forecasting on structured data
    • Users familiar with SQL but not Python/ML frameworks
    • Large-scale batch predictions

    Available Models:

    • Linear/Logistic Regression
    • XGBoost (BOOSTED_TREE)
    • Deep Neural Networks (DNN)
    • AutoML Tables
    • Imported models (TensorFlow directly; PyTorch typically via ONNX export)

    Pricing:

    • Based on data processed (same as BigQuery queries)
    • $5 per TB processed for analysis
    • AutoML: $19.32/hour for training
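
The rates above are enough for a back-of-the-envelope estimate. The sketch below is a hypothetical helper (the function name and parameters are not part of this skill) that uses only the two rates listed in this section; actual billing may differ.

```python
# Rates copied from the pricing list above; treat them as illustrative.
ANALYSIS_USD_PER_TB = 5.00
AUTOML_USD_PER_HOUR = 19.32


def estimate_bqml_cost(tb_processed, automl_hours=0.0):
    """Return an estimated BigQuery ML cost in USD (hypothetical helper)."""
    if tb_processed < 0 or automl_hours < 0:
        raise ValueError("inputs must be non-negative")
    cost = tb_processed * ANALYSIS_USD_PER_TB + automl_hours * AUTOML_USD_PER_HOUR
    return round(cost, 2)


if __name__ == "__main__":
    # 2.5 TB scanned by a training query, no AutoML.
    print(estimate_bqml_cost(2.5))
    # 0.5 TB scanned plus 2 hours of AutoML training.
    print(estimate_bqml_cost(0.5, automl_hours=2))
```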

    Vertex AI Training

    What it is: Fully managed ML training platform

    Best for:

    • Custom PyTorch/TensorFlow training
    • Large-scale distributed training
    • GPU/TPU-accelerated workloads
    • Production ML pipelines

    Available Compute:

    • CPUs: n1-standard, n1-highmem, n1-highcpu
    • GPUs: NVIDIA T4, P4, V100, P100, A100, L4
    • TPUs: v2, v3, v4, v5e (8 cores to 512 cores)

    Pricing:

    • CPU: $0.05-0.30/hour depending on machine type
    • GPU T4: $0.35/hour
    • GPU A100: $3.67/hour (40GB) or $4.95/hour (80GB)
    • TPU v3: $8.00/hour (8 cores)
    • TPU v4: $11.00/hour (8 cores)

    GPU/TPU Selection Guide

    GPU Selection (Vertex AI)

    T4 (16GB VRAM):

    • Use case: Inference, light training, small models
    • Cost: $0.35/hour
    • Good for: BERT-base, small CNNs, inference serving

    V100 (16GB VRAM):

    • Use case: Mid-size training, mixed precision training
    • Cost: $2.48/hour
    • Good for: ResNet training, medium transformers

    A100 (40GB/80GB VRAM):

    • Use case: Large model training, distributed training
    • Cost: $3.67/hour (40GB), $4.95/hour (80GB)
    • Good for: GPT-style models, large vision models, multi-GPU training

    L4 (24GB VRAM):

    • Use case: Modern alternative to T4, better performance
    • Cost: $0.66/hour
    • Good for: Mid-size models, efficient inference
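
One way to apply the guide above is to filter by memory first and then by price. The following is a minimal sketch under that assumption: the `Gpu` class and `cheapest_gpu` function are hypothetical names, the figures are copied from the list above, and the selection ignores throughput, which often matters more than price per hour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    name: str
    vram_gb: int
    usd_per_hour: float


# Figures copied from the selection guide above (A100 split into its two variants).
GPUS = [
    Gpu("T4", 16, 0.35),
    Gpu("L4", 24, 0.66),
    Gpu("V100", 16, 2.48),
    Gpu("A100-40GB", 40, 3.67),
    Gpu("A100-80GB", 80, 4.95),
]


def cheapest_gpu(min_vram_gb):
    """Pick the lowest-cost GPU with at least `min_vram_gb` of memory."""
    candidates = [g for g in GPUS if g.vram_gb >= min_vram_gb]
    if not candidates:
        raise ValueError(f"no single GPU offers {min_vram_gb} GB of VRAM")
    return min(candidates, key=lambda g: g.usd_per_hour)


if __name__ == "__main__":
    for need in (16, 20, 48):
        gpu = cheapest_gpu(need)
        print(f"{need} GB needed: {gpu.name} at ${gpu.usd_per_hour}/hour")
```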

    TPU Selection (Vertex AI)

    TPU v2 (8 cores):

    • Use case: TensorFlow/JAX training, matrix operations
    • Cost: $4.50/hour
    • Memory: 8GB per core (64GB total)
    • Good for: Legacy TensorFlow models

    TPU v3 (8 cores):

    • Use case: Standard TPU training
    • Cost: $8.00/hour
    • Memory: 16GB per core (128GB total)
    • Good for: BERT, T5, image classification

    TPU v4 (8 cores):

    • Use case: Highest-performance option listed here
    • Cost: $11.00/hour
    • Memory: 32GB HBM per chip, shared by its two cores (128GB total)
    • Good for: Large language models, cutting-edge research

    TPU v5e (8 cores):

    • Use case: Cost-optimized TPU
    • Cost: $2.50/hour
    • Good for: Development, training at scale on budget

    Multi-node TPU Pods:

    • v3-32: 32 cores, $32/hour
    • v3-128: 128 cores, $128/hour
    • v4-128: 128 cores, $176/hour
    • Use for: Massive distributed training (GPT-3 scale)
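
A quick way to compare the TPU options above is to normalize each listed price to a per-core hourly rate. The sketch below does only that arithmetic; the dictionary and function names are hypothetical, and the numbers are the ones quoted in this section. With these figures, the listed pod prices scale linearly with core count.

```python
# (cores, USD per hour) copied from the TPU lists above.
TPU_SLICES = {
    "v2-8": (8, 4.50),
    "v3-8": (8, 8.00),
    "v3-32": (32, 32.00),
    "v3-128": (128, 128.00),
    "v4-8": (8, 11.00),
    "v4-128": (128, 176.00),
    "v5e-8": (8, 2.50),
}


def usd_per_core_hour(slice_name):
    """Return the listed hourly price divided by the listed core count."""
    cores, usd_per_hour = TPU_SLICES[slice_name]
    return usd_per_hour / cores


if __name__ == "__main__":
    for name in TPU_SLICES:
        print(f"{name}: ${usd_per_core_hour(name):.4f} per core-hour")
```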

    Usage

    Set Up BigQuery ML Environment

    bash scripts/setup-bigquery-ml.sh
    

    Prompts for:

    • GCP Project ID
    • BigQuery dataset name
    • Service account credentials
    • Default model type preference

    Creates:

    • bigquery_config.json - Project configuration
    • .bigqueryrc - CLI configuration
    • Example training SQL in examples/

    Set Up Vertex AI Training Environment

    bash scripts/setup-vertex-ai.sh
    

    Prompts for:

    • GCP Project ID
    • Region (us-central1, europe-west4, etc.)
    • Service account credentials
    • Default machine type
    • GPU/TPU preference

    Creates:

    • vertex_config.yaml - Training job configuration
    • vertex_requirements.txt - Python dependencies
    • Training script template

    Configure GCP Authentication

    bash scripts/configure-auth.sh
    

    Prompts for:

    • Authentication method (service account, user account, workload identity)
    • Service account key path (if applicable)
    • IAM roles needed

    Creates:

    • .gcp_auth_config - Authentication configuration

    Also:

    • Sets the GOOGLE_APPLICATION_CREDENTIALS environment variable
    • Validates permissions

    Required IAM Roles:

    • BigQuery ML: roles/bigquery.dataEditor, roles/bigquery.jobUser
    • Vertex AI: roles/aiplatform.user, roles/storage.objectAdmin
    • Both: roles/serviceusage.serviceUsageConsumer
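
If the roles above need to be granted by hand, the commands can be generated rather than typed. This is a sketch, not what `configure-auth.sh` does: the function name, project ID, and service account email are placeholders, and it assumes project-level bindings via `gcloud projects add-iam-policy-binding`.

```python
import shlex

# Role lists copied from the section above.
REQUIRED_ROLES = {
    "bigquery-ml": ["roles/bigquery.dataEditor", "roles/bigquery.jobUser"],
    "vertex-ai": ["roles/aiplatform.user", "roles/storage.objectAdmin"],
}
SHARED_ROLES = ["roles/serviceusage.serviceUsageConsumer"]


def grant_commands(project_id, sa_email, platform):
    """Return one gcloud command string per role needed for `platform`."""
    roles = REQUIRED_ROLES[platform] + SHARED_ROLES
    return [
        shlex.join([
            "gcloud", "projects", "add-iam-policy-binding", project_id,
            f"--member=serviceAccount:{sa_email}",
            f"--role={role}",
        ])
        for role in roles
    ]


if __name__ == "__main__":
    # Placeholder project and service account; substitute your own.
    for cmd in grant_commands(
        "my-project", "trainer@my-project.iam.gserviceaccount.com", "vertex-ai"
    ):
        print(cmd)
```

Printing the commands instead of running them leaves room to review each binding before it is applied.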

    Estimate GCP Training Costs

    bash scripts/estimate-gcp-cost.sh
    

    Interactive prompts:

    • Platform: BigQuery ML or Vertex AI
    • If BigQuery ML: Data size to process
    • If Vertex AI:
      • Machine type (CPU/GPU/TPU)
      • Number of machines
      • Training duration estimate
      • Storage requirements

    Output:

    • Estimated compute cost
    • Storage cost
    • Data transfer cost (if applicable)
    • Total estimated cost
    • Cost comparison with other GCP options
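
The Vertex AI branch of such an estimate reduces to rate × machines × hours plus storage. The sketch below illustrates that arithmetic only and is not the contents of `estimate-gcp-cost.sh`. The names are hypothetical, the accelerator rates are copied from the pricing list earlier in this document, the storage rate is a placeholder to be replaced with a real quote, and host VM and data transfer costs are left out.

```python
from dataclasses import dataclass

# Hourly accelerator rates copied from the pricing list earlier in this document.
ACCELERATOR_USD_PER_HOUR = {
    "T4": 0.35,
    "L4": 0.66,
    "V100": 2.48,
    "A100-40GB": 3.67,
    "A100-80GB": 4.95,
    "TPU-v3-8": 8.00,
    "TPU-v4-8": 11.00,
}


@dataclass
class Estimate:
    compute_usd: float
    storage_usd: float

    @property
    def total_usd(self):
        return round(self.compute_usd + self.storage_usd, 2)


def estimate_vertex_cost(accelerator, machines, hours, storage_gb=0.0,
                         storage_usd_per_gb_month=0.02):
    """Estimate cost; storage_usd_per_gb_month is a placeholder, not a quoted price."""
    compute = ACCELERATOR_USD_PER_HOUR[accelerator] * machines * hours
    storage = storage_gb * storage_usd_per_gb_month
    return Estimate(round(compute, 2), round(storage, 2))


if __name__ == "__main__":
    est = estimate_vertex_cost("A100-40GB", machines=4, hours=10, storage_gb=500)
    print(f"compute ${est.compute_usd}, storage ${est.storage_usd}, total ${est.total_usd}")
```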

    Templates

    BigQuery ML Training Template (templates/bigquery_ml_training.sql)

    SQL template for creating and training models:

    • Model creation syntax
    • Feature engineering examples
    • Training options (L1/L2 reg, learning rate, etc.)
    • Evaluation queries
    • Prediction queries

    Supported model types:

    • LINEAR_REG, LOGISTIC_REG
    • BOOSTED_TREE_CLASSIFIER, BOOSTED_TREE_REGRESSOR
    • DNN_CLASSIFIER, DNN_REGRESSOR
    • AUTOML_CLASSIFIER, AUTOML_REGRESSOR
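
To make the model-creation syntax concrete, here is a sketch that assembles a `CREATE MODEL` statement for one of the model types above. It is not the contents of the template file: the helper name, dataset, table, and column names are all hypothetical, and the statement uses only the `model_type` and `input_label_cols` options. Identifiers are interpolated directly, so pass trusted values only.

```python
SUPPORTED_MODEL_TYPES = {
    "LINEAR_REG", "LOGISTIC_REG",
    "BOOSTED_TREE_CLASSIFIER", "BOOSTED_TREE_REGRESSOR",
    "DNN_CLASSIFIER", "DNN_REGRESSOR",
    "AUTOML_CLASSIFIER", "AUTOML_REGRESSOR",
}


def create_model_sql(model, model_type, label, source_table, features):
    """Build a BigQuery ML CREATE MODEL statement from trusted identifiers."""
    if model_type not in SUPPORTED_MODEL_TYPES:
        raise ValueError(f"unsupported model type: {model_type}")
    columns = ",\n  ".join([*features, label])
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS (\n"
        f"  model_type = '{model_type}',\n"
        f"  input_label_cols = ['{label}']\n"
        f")\n"
        f"AS\n"
        f"SELECT\n  {columns}\n"
        f"FROM `{source_table}`\n"
        f"WHERE {label} IS NOT NULL"
    )


if __name__ == "__main__":
    # Hypothetical dataset, table, and column names.
    print(create_model_sql(
        model="my_dataset.trip_model",
        model_type="BOOSTED_TREE_REGRESSOR",
        label="trip_duration",
        source_table="my_dataset.trips",
        features=["pickup_hour", "trip_distance"],
    ))
```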

    Vertex AI Training Job Template (templates/vertex_training_job.py)

    Python template for custom training:

    • Training loop structure
    • Distributed training setup (PyTorch DDP)
    • Checkpointing and model saving
    • Metrics logging to Vertex AI
    • Hyperparameter tuning integration

    Includes:

    • Single GPU training
    • Multi-GPU training (DataParallel, DistributedDataParallel)
    • TPU training with PyTorch/XLA
    • Cloud Storage integration

    GPU Configuration Template (templates/vertex_gpu_config.yaml)

    YAML configuration for GPU training jobs:

    • Machine type selection
    • GPU type and count
    • Disk configuration
    • Network configuration
    • Environment variables

    Presets included:

    • Single T4 (budget)
    • Single A100 (standard)
    • 4x A100 (distributed)
    • 8x A100 (large-scale)
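
The presets above could map onto Vertex AI worker pool specs roughly as follows. This is a sketch: the preset keys are hypothetical, and the machine types and accelerator enum names are assumptions about what such presets would use (they are standard Vertex AI values, but the template file itself is not shown here).

```python
# preset name -> (machine type, accelerator type, accelerator count)
# Preset names are hypothetical; the values are assumptions for illustration.
PRESETS = {
    "t4-budget": ("n1-standard-8", "NVIDIA_TESLA_T4", 1),
    "a100-standard": ("a2-highgpu-1g", "NVIDIA_TESLA_A100", 1),
    "a100-x4": ("a2-highgpu-4g", "NVIDIA_TESLA_A100", 4),
    "a100-x8": ("a2-highgpu-8g", "NVIDIA_TESLA_A100", 8),
}


def worker_pool_spec(preset, image_uri, replica_count=1):
    """Return a worker pool spec dict in the shape used for custom jobs."""
    machine_type, accelerator_type, accelerator_count = PRESETS[preset]
    return {
        "machine_spec": {
            "machine_type": machine_type,
            "accelerator_type": accelerator_type,
            "accelerator_count": accelerator_count,
        },
        "replica_count": replica_count,
        "container_spec": {"image_uri": image_uri},
    }


if __name__ == "__main__":
    import json
    # Placeholder image URI; substitute your own training container.
    spec = worker_pool_spec("a100-x4", "us-docker.pkg.dev/my-project/train/image:latest")
    print(json.dumps(spec, indent=2))
```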

    TPU Configuration Template (templates/vertex_tpu_config.yaml)

    YAML configuration for TPU training jobs:

    • TPU type and topology
    • TPU version selection
    • JAX/TensorFlow runtime
    • XLA compilation flags

    Presets included:

    • v3-8 (single TPU)
    • v4-32 (TPU pod slice)
    • v5e-8 (cost-optimized)

    GCP Authentication Template (templates/gcp_auth.json)

    Service account configuration template:

    • Project ID
    • Service account email
    • Key file path
    • Required scopes
    • IAM role assignments

    Security notes:

    • Uses placeholders only (never real keys)
    • Documents how to create service accounts
    • Includes .gitignore protection

    Examples

    BigQuery ML Regression Example (examples/bigquery-regression-example.sql)

    Complete example:

    • Dataset: NYC taxi trip data
    • Task: Predict trip duration
    • Model: BOOSTED_TREE_REGRESSOR
    • Includes feature engineering, training, evaluation

    Demonstrates:

    • CREATE MODEL syntax
    • TRANSFORM clause for feature engineering
    • Model evaluation
    • Batch predictions

    Vertex AI PyTorch Training Example (examples/vertex-pytorch-training.py)

    Complete training script:

    • Dataset: IMDB sentiment analysis
    • Model: DistilBERT fine-tuning
    • Training: Single GPU
    • Logging: Vertex AI experiments

    Demonstrates:

    • Loading data from GCS
    • Training loop with mixed precision
    • Checkpointing to GCS
    • Metrics logging
    • Model export to Vertex AI

    Vertex AI Distributed Training Example (examples/vertex-distributed-training.py)

    Multi-GPU training example:

    • Dataset: ImageNet subset
    • Model: ResNet-50
    • Training: 4x A100 with DDP
    • Scaling: Linear scaling rule

    Demonstrates:

    • PyTorch DistributedDataParallel
    • Gradient accumulation
    • Learning rate scaling
    • Synchronized batch norm
    • Multi-node coordination
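
The linear scaling rule and gradient accumulation mentioned above come down to simple arithmetic on the effective batch size. The helpers below are a framework-free sketch with hypothetical names; they are not taken from the example script. A real run would typically pair the scaled rate with a warmup schedule.

```python
import math


def effective_batch_size(per_gpu_batch, num_gpus, accum_steps=1):
    """Samples contributing to one optimizer step across all GPUs."""
    return per_gpu_batch * num_gpus * accum_steps


def scaled_learning_rate(base_lr, base_batch, per_gpu_batch, num_gpus, accum_steps=1):
    """Linear scaling rule: the learning rate grows in proportion to batch size."""
    batch = effective_batch_size(per_gpu_batch, num_gpus, accum_steps)
    return base_lr * batch / base_batch


def accumulation_steps(target_batch, per_gpu_batch, num_gpus):
    """Micro-batches to accumulate so the effective batch reaches the target."""
    per_step = per_gpu_batch * num_gpus
    return max(1, math.ceil(target_batch / per_step))


if __name__ == "__main__":
    # Reference recipe: lr 0.1 at batch 256. Run on 4 GPUs with 64 samples each.
    print(scaled_learning_rate(0.1, 256, per_gpu_batch=64, num_gpus=4))
    # Steps needed to reach an effective batch of 1024 on the same hardware.
    print(accumulation_steps(1024, per_gpu_batch=64, num_gpus=4))
```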

    Hugging Face Fine-tuning on Vertex AI (examples/vertex-huggingface-finetuning.py)

    Production fine-tuning template:

    • Dataset: Custom text classification
    • Model: BERT/RoBERTa/DeBERTa
    • Training: Hugging Face Trainer API
    • Deployment: Vertex AI endpoint

    Demonstrates:

    • Hugging Face Trainer integration
    • Hyperparameter tuning with Vertex AI
    • Model versioning
    • Endpoint deployment
    • Online predictions

    Cost Optimization Tips

    BigQuery ML

    Reduce data processed:

    • Use partitioned tables
    • Filter data in WHERE clause before training
    • Use table sampling for experimentation
    • Cache intermediate results

    Use appropriate model types:

    • Start with LINEAR_REG/LOGISTIC_REG (cheapest)
    • Use BOOSTED_TREE for better accuracy at moderate cost
    • Reserve AutoML for when simpler models fail

    Optimize queries:

    • Avoid SELECT * (specify columns)
    • Use clustering on filter columns
    • Materialize views for repeated training

    Vertex AI

    Machine type selection:

    • Start with CPU for prototyping
    • Use T4 for small models (cheapest GPU)
    • Use A100 only for large models that need it
    • Consider TPU v5e for TensorFlow/JAX (very cost-effective)

    Training optimization:

    • Use preemptible instances (60-70% cheaper, can be interrupted)
    • Enable automatic checkpoint/resume for preemptible
    • Use mixed precision training (FP16/BF16) for faster training
    • Profile to eliminate CPU bottlenecks
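
Whether preemptible capacity pays off depends on how much work is repeated after interruptions. The sketch below models that trade-off with a hypothetical function: the discount defaults to the midpoint of the 60-70% range given above, and `rework_fraction` is an assumed overhead for recomputing from the last checkpoint, not a measured value.

```python
def preemptible_cost(on_demand_usd_per_hour, hours, discount=0.65, rework_fraction=0.10):
    """Estimated cost on preemptible capacity, including time lost to restarts."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    effective_hours = hours * (1.0 + rework_fraction)
    return on_demand_usd_per_hour * (1.0 - discount) * effective_hours


if __name__ == "__main__":
    rate, hours = 3.67, 10  # A100 40GB rate from this document, 10-hour job
    on_demand = rate * hours
    preemptible = preemptible_cost(rate, hours)
    print(f"on-demand: ${on_demand:.2f}")
    print(f"preemptible (10% rework assumed): ${preemptible:.2f}")
```

Frequent checkpoints keep `rework_fraction` small, which is why checkpoint/resume is listed alongside preemptible instances.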

    Storage optimization:

    • Store datasets in Cloud Storage (cheaper than persistent disk)
    • Use Filestore only if needed for POSIX filesystem
    • Clean up old model artifacts
    • Use lifecycle policies to archive old data

    Multi-GPU efficiency:

    • Ensure near-linear scaling before adding more GPUs
    • Profile inter-GPU communication
    • Use gradient accumulation instead of larger batch sizes
    • Consider 2x GPUs instead of 1x larger GPU (often same cost, better availability)
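
"Near-linear scaling" can be checked with a single ratio: measured multi-GPU throughput divided by the ideal of N times the single-GPU throughput. The sketch below uses hypothetical names, and the 0.85 threshold is an arbitrary example rather than a recommendation from this skill.

```python
def scaling_efficiency(single_gpu_throughput, multi_gpu_throughput, num_gpus):
    """Fraction of ideal linear speedup achieved (1.0 means perfectly linear)."""
    if single_gpu_throughput <= 0 or num_gpus <= 0:
        raise ValueError("throughput and GPU count must be positive")
    return multi_gpu_throughput / (single_gpu_throughput * num_gpus)


def worth_scaling_further(efficiency, threshold=0.85):
    """Example gate: only add GPUs while efficiency stays above the threshold."""
    return efficiency >= threshold


if __name__ == "__main__":
    # Example measurements in samples/second (made up for illustration).
    eff = scaling_efficiency(1000, 3600, num_gpus=4)
    print(f"efficiency: {eff:.2f}, scale further: {worth_scaling_further(eff)}")
```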

    Integration with ML Training Plugin

    This skill integrates with other ml-training components:

    • training-patterns: Provides GCP configs for generated training scripts
    • cost-calculator: Uses GCP pricing data for budget planning
    • monitoring-dashboard: Integrates with Vertex AI TensorBoard
    • validation-scripts: Validates GCP credentials and permissions
    • integration-helpers: Deploys trained models to Vertex AI endpoints

    Common Workflows

    Workflow 1: Quick BigQuery ML Prototype

    1. Run bash scripts/setup-bigquery-ml.sh
    2. Copy templates/bigquery_ml_training.sql to your project
    3. Modify SQL for your dataset and features
    4. Run training query in BigQuery console
    5. Evaluate with built-in ML.EVALUATE()
    6. Export predictions with ML.PREDICT()

    Time: 30 minutes setup + training time

    Cost: $5 per TB of data processed
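
Steps 5 and 6 of this workflow boil down to two short queries. The sketch below builds them as strings; the helper names and the model and table identifiers are hypothetical, and the identifier check is a simple guard for illustration rather than a complete validator.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z0-9_.\-]+")


def _check(identifier):
    """Reject identifiers containing characters outside a conservative set."""
    if not _IDENTIFIER.fullmatch(identifier):
        raise ValueError(f"unexpected characters in identifier: {identifier!r}")
    return identifier


def evaluate_sql(model):
    """Query that returns evaluation metrics for a trained model."""
    return f"SELECT * FROM ML.EVALUATE(MODEL `{_check(model)}`)"


def predict_sql(model, input_table):
    """Query that scores every row of `input_table` with the model."""
    return (
        f"SELECT * FROM ML.PREDICT(MODEL `{_check(model)}`, "
        f"TABLE `{_check(input_table)}`)"
    )


if __name__ == "__main__":
    # Hypothetical model and table names.
    print(evaluate_sql("my_dataset.trip_model"))
    print(predict_sql("my_dataset.trip_model", "my_dataset.new_trips"))
```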

    Workflow 2: Custom PyTorch Training on Vertex AI

    1. Run bash scripts/configure-auth.sh
    2. Run bash scripts/setup-vertex-ai.sh
    3. Copy templates/vertex_training_job.py
    4. Customize training loop for your model
    5. Copy templates/vertex_gpu_config.yaml
    6. Submit job: gcloud ai custom-jobs create ...
    7. Monitor in Vertex AI console

    Time: 1 hour setup + training time

    Cost: Depends on GPU/TPU selection
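
For step 6, one possible shape of the submit command is sketched below using the `--region`, `--display-name`, and `--worker-pool-spec` flags of `gcloud ai custom-jobs create`. The helper name, job name, and image URI are placeholders, and the skill's own templates may submit jobs differently (for example through a config file).

```python
import shlex


def custom_job_command(display_name, region, machine_type, image_uri,
                       accelerator_type=None, accelerator_count=0, replica_count=1):
    """Build a gcloud command string for a single-pool custom training job."""
    pool = [f"machine-type={machine_type}", f"replica-count={replica_count}"]
    if accelerator_type and accelerator_count > 0:
        pool.append(f"accelerator-type={accelerator_type}")
        pool.append(f"accelerator-count={accelerator_count}")
    pool.append(f"container-image-uri={image_uri}")
    return shlex.join([
        "gcloud", "ai", "custom-jobs", "create",
        f"--region={region}",
        f"--display-name={display_name}",
        "--worker-pool-spec=" + ",".join(pool),
    ])


if __name__ == "__main__":
    # Placeholder job name and image URI; substitute your own.
    print(custom_job_command(
        display_name="distilbert-imdb",
        region="us-central1",
        machine_type="n1-standard-8",
        image_uri="us-docker.pkg.dev/my-project/train/image:latest",
        accelerator_type="NVIDIA_TESLA_T4",
        accelerator_count=1,
    ))
```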

    Workflow 3: Large-Scale Distributed Training

    1. Set up Vertex AI (see Workflow 2)
    2. Copy examples/vertex-distributed-training.py
    3. Adapt for your model architecture
    4. Test locally with 1 GPU
    5. Test with 2 GPUs to verify scaling
    6. Scale to 4-8 GPUs for full training
    7. Use preemptible instances with checkpointing

    Time: 2-4 hours setup + training time

    Cost: $15-60/hour depending on GPU count

    Troubleshooting

    BigQuery ML Issues

    "Insufficient permissions":

    • Verify roles/bigquery.dataEditor and roles/bigquery.jobUser
    • Check dataset-level permissions
    • Ensure billing is enabled

    "Model training failed":

    • Check for NULL values in features
    • Verify data types match model expectations
    • Review feature engineering TRANSFORM clause
    • Check for sufficient training data

    Vertex AI Issues

    "Service account lacks permissions":

    • Verify roles/aiplatform.user
    • Add roles/storage.objectAdmin for GCS access
    • Check project-level IAM policies

    "GPU/TPU quota exceeded":

    • Request quota increase in GCP console
    • Use different region with availability
    • Start with smaller GPU/TPU configuration
    • Use preemptible instances (separate quota)

    "Training job crashes":

    • Check for CUDA OOM (reduce batch size)
    • Verify dependencies in requirements.txt
    • Review logs in Cloud Logging
    • Test locally before submitting to Vertex

    Security Best Practices

    Credentials Management

    DO:

    • ✅ Use service accounts with minimal permissions
    • ✅ Store credentials in Secret Manager
    • ✅ Use Workload Identity for GKE deployments
    • ✅ Rotate service account keys regularly
    • ✅ Add *.json key files to .gitignore

    DON'T:

    • ❌ Hardcode credentials in code
    • ❌ Commit service account keys to git
    • ❌ Use overly permissive roles (e.g., Owner)
    • ❌ Share service account keys across projects
    • ❌ Use personal credentials for production
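
A lightweight check before committing can catch key files that slipped past `.gitignore`. The sketch below is a hypothetical helper, not part of this skill: it walks a directory and flags JSON files that carry the `"type": "service_account"` and `private_key` fields found in downloaded service account keys. It does not replace a dedicated secret scanner.

```python
import json
import sys
from pathlib import Path


def find_service_account_keys(root):
    """Return paths of JSON files under `root` that look like service account keys."""
    hits = []
    for path in sorted(Path(root).rglob("*.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # unreadable or not valid JSON: skip it
        if (
            isinstance(data, dict)
            and data.get("type") == "service_account"
            and "private_key" in data
        ):
            hits.append(path)
    return hits


if __name__ == "__main__":
    found = find_service_account_keys(sys.argv[1] if len(sys.argv) > 1 else ".")
    for path in found:
        print(f"possible service account key: {path}")
    sys.exit(1 if found else 0)
```

The non-zero exit code makes the script usable as a pre-commit hook or CI step.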

    IAM Best Practices

    • Use separate service accounts for training vs serving
    • Grant roles at the resource level rather than the project level when possible
    • Use Workload Identity Federation instead of keys when possible
    • Enable Cloud Audit Logs for ML API usage
    • Review IAM permissions quarterly

    Performance Benchmarks

    BigQuery ML vs Vertex AI

    BigQuery ML:

    • Best for: Structured data, SQL users, quick prototypes
    • Training time: Minutes to hours (depends on data size)
    • Scalability: Automatic (serverless)
    • Cost: $5/TB processed

    Vertex AI Custom Training:

    • Best for: Deep learning, custom architectures, GPU/TPU workloads
    • Training time: Hours to days (configurable hardware)
    • Scalability: Manual (choose machine type)
    • Cost: $0.35-20/hour depending on hardware

    Rule of thumb:

    • Use BigQuery ML for tabular data with < 100M rows
    • Use Vertex AI for images, text, audio, or custom models
    • Use Vertex AI for models requiring GPU/TPU acceleration
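
The rule of thumb above can be written as a small decision function. This is a sketch with hypothetical names; the original rule does not say what to do with tabular data at or above 100M rows, so that case is reported as undecided rather than guessed.

```python
ROW_LIMIT = 100_000_000  # threshold from the rule of thumb above


def recommend_platform(data_kind, rows=0, needs_accelerator=False):
    """Apply the rule of thumb; `data_kind` is e.g. 'tabular', 'image', 'text', 'audio'."""
    if needs_accelerator or data_kind != "tabular":
        return "Vertex AI"
    if rows < ROW_LIMIT:
        return "BigQuery ML"
    return "undecided"  # not covered by the rule of thumb


if __name__ == "__main__":
    print(recommend_platform("tabular", rows=5_000_000))
    print(recommend_platform("image"))
    print(recommend_platform("tabular", rows=2_000_000, needs_accelerator=True))
```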

    Additional Resources

    • GCP ML Documentation: https://cloud.google.com/vertex-ai/docs
    • BigQuery ML Reference: https://cloud.google.com/bigquery-ml/docs
    • Pricing Calculator: https://cloud.google.com/products/calculator
    • TPU Best Practices: https://cloud.google.com/tpu/docs/best-practices
    • Vertex AI Samples: https://github.com/GoogleCloudPlatform/vertex-ai-samples
    Repository
    vanman2024/ai-dev-marketplace