google-cloud-configs

vanman2024/google-cloud-configs

AI & ML

About

Google Cloud Platform configuration templates for BigQuery ML and Vertex AI training with authentication setup, GPU/TPU configs, and cost estimation tools...

SKILL.md

Use when:

Setting up BigQuery ML for SQL-based machine learning
Configuring Vertex AI custom training jobs
Setting up GCP authentication for ML workflows
Selecting appropriate GPU/TPU configurations
Estimating costs for GCP ML training
Deploying models to Vertex AI endpoints
Configuring distributed training on GCP
Optimizing cost vs performance for cloud ML

Platform Overview

BigQuery ML

What it is: SQL-based machine learning directly in BigQuery

Best for:

Quick ML prototypes using existing data warehouse data
Classification, regression, forecasting on structured data
Users familiar with SQL but not Python/ML frameworks
Large-scale batch predictions

Available Models:

Linear/Logistic Regression
XGBoost (BOOSTED_TREE)
Deep Neural Networks (DNN)
AutoML Tables
Imported models (TensorFlow directly; PyTorch after export to ONNX)

Pricing:

Based on data processed (same as BigQuery queries)
$5 per TB processed for analysis
AutoML: $19.32/hour for training
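
As a rough illustration of the per-TB figure above, the following sketch converts bytes processed into an estimated charge. The function name and the decimal-terabyte convention are assumptions made for illustration; the $5 rate is simply the one quoted in this list, and actual billing may differ.

```python
# Rate quoted in the pricing list above; check current pricing before relying on it.
PRICE_PER_TB_USD = 5.00
BYTES_PER_TB = 10**12  # assumption: decimal terabytes


def estimate_bigquery_ml_cost(bytes_processed: int) -> float:
    """Return the estimated USD cost of a query that scans bytes_processed bytes."""
    if bytes_processed < 0:
        raise ValueError("bytes_processed must be non-negative")
    return round(bytes_processed / BYTES_PER_TB * PRICE_PER_TB_USD, 2)


if __name__ == "__main__":
    # A training query that scans 250 GB of table data.
    print(estimate_bigquery_ml_cost(250 * 10**9))
```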

Vertex AI Training

What it is: Fully managed ML training platform

Best for:

Custom PyTorch/TensorFlow training
Large-scale distributed training
GPU/TPU-accelerated workloads
Production ML pipelines

Available Compute:

CPUs: n1-standard, n1-highmem, n1-highcpu
GPUs: NVIDIA T4, P4, V100, P100, A100, L4
TPUs: v2, v3, v4, v5e (8 cores to 512 cores)

Pricing:

CPU: $0.05-0.30/hour depending on machine type
GPU T4: $0.35/hour
GPU A100: $3.67/hour (40GB) or $4.95/hour (80GB)
TPU v3: $8.00/hour (8 cores)
TPU v4: $11.00/hour (8 cores)

GPU/TPU Selection Guide

GPU Selection (Vertex AI)

T4 (16GB VRAM):

Use case: Inference, light training, small models
Cost: $0.35/hour
Good for: BERT-base, small CNNs, inference serving

V100 (16GB VRAM):

Use case: Mid-size training, mixed precision training
Cost: $2.48/hour
Good for: ResNet training, medium transformers

A100 (40GB/80GB VRAM):

Use case: Large model training, distributed training
Cost: $3.67/hour (40GB), $4.95/hour (80GB)
Good for: GPT-style models, large vision models, multi-GPU training

L4 (24GB VRAM):

Use case: Modern alternative to T4, better performance
Cost: $0.66/hour
Good for: Mid-size models, efficient inference
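
The guide above can be read as a simple lookup: pick the cheapest GPU whose memory fits the model. The sketch below encodes only the VRAM sizes and hourly prices listed in this section; the function name is hypothetical, and the selection deliberately ignores throughput, so treat it as a starting point rather than a recommendation engine.

```python
from typing import Optional

# (name, VRAM in GB, USD per hour) as listed in the guide above.
GPUS = [
    ("T4", 16, 0.35),
    ("L4", 24, 0.66),
    ("V100", 16, 2.48),
    ("A100-40GB", 40, 3.67),
    ("A100-80GB", 80, 4.95),
]


def cheapest_gpu(min_vram_gb: int) -> Optional[str]:
    """Return the cheapest listed GPU with at least min_vram_gb of memory."""
    candidates = [(price, name) for name, vram, price in GPUS if vram >= min_vram_gb]
    return min(candidates)[1] if candidates else None


if __name__ == "__main__":
    for need in (16, 20, 40, 100):
        print(need, cheapest_gpu(need))
```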

TPU Selection (Vertex AI)

TPU v2 (8 cores):

Use case: TensorFlow/JAX training, matrix operations
Cost: $4.50/hour
Memory: 8GB per core (64GB total)
Good for: Legacy TensorFlow models

TPU v3 (8 cores):

Use case: Standard TPU training
Cost: $8.00/hour
Memory: 16GB per core (128GB total)
Good for: BERT, T5, image classification

TPU v4 (8 cores):

Use case: Latest generation, best performance
Cost: $11.00/hour
Memory: 32GB per core (256GB total)
Good for: Large language models, cutting-edge research

TPU v5e (8 cores):

Use case: Cost-optimized TPU
Cost: $2.50/hour
Good for: Development, training at scale on budget

Multi-node TPU Pods:

v3-32: 32 cores, $32/hour
v3-128: 128 cores, $128/hour
v4-128: 128 cores, $176/hour
Use for: Massive distributed training (GPT-3 scale)

Usage

Set Up BigQuery ML Environment

bash scripts/setup-bigquery-ml.sh

Prompts for:

GCP Project ID
BigQuery dataset name
Service account credentials
Default model type preference

Creates:

bigquery_config.json - Project configuration
.bigqueryrc - CLI configuration
Example training SQL in examples/

Set Up Vertex AI Training Environment

bash scripts/setup-vertex-ai.sh

Prompts for:

GCP Project ID
Region (us-central1, europe-west4, etc.)
Service account credentials
Default machine type
GPU/TPU preference

Creates:

vertex_config.yaml - Training job configuration
vertex_requirements.txt - Python dependencies
Training script template

Configure GCP Authentication

bash scripts/configure-auth.sh

Prompts for:

Authentication method (service account, user account, workload identity)
Service account key path (if applicable)
IAM roles needed

Creates:

.gcp_auth_config - Authentication configuration
Sets GOOGLE_APPLICATION_CREDENTIALS environment variable
Validates permissions

Required IAM Roles:

BigQuery ML: roles/bigquery.dataEditor, roles/bigquery.jobUser
Vertex AI: roles/aiplatform.user, roles/storage.objectAdmin
Both: roles/serviceusage.serviceUsageConsumer
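
One way to grant the roles listed above is with `gcloud projects add-iam-policy-binding`. The sketch below only builds the command strings so they can be reviewed before running; the helper name, project ID, and service account email are hypothetical placeholders, and whether `scripts/configure-auth.sh` does something similar is not stated in this document.

```python
import shlex

# Roles copied from the "Required IAM Roles" list above.
ROLES = {
    "bigquery-ml": ["roles/bigquery.dataEditor", "roles/bigquery.jobUser"],
    "vertex-ai": ["roles/aiplatform.user", "roles/storage.objectAdmin"],
    "both": ["roles/serviceusage.serviceUsageConsumer"],
}


def iam_binding_commands(project_id: str, service_account_email: str, platform: str) -> list:
    """Return one gcloud command per role needed for the given platform."""
    roles = ROLES[platform] + ROLES["both"]
    member = f"serviceAccount:{service_account_email}"
    return [
        shlex.join([
            "gcloud", "projects", "add-iam-policy-binding", project_id,
            f"--member={member}", f"--role={role}",
        ])
        for role in roles
    ]


if __name__ == "__main__":
    # Placeholder identifiers; substitute your own project and service account.
    for cmd in iam_binding_commands(
        "my-project", "trainer@my-project.iam.gserviceaccount.com", "vertex-ai"
    ):
        print(cmd)
```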

Estimate GCP Training Costs

bash scripts/estimate-gcp-cost.sh

Interactive prompts:

Platform: BigQuery ML or Vertex AI
If BigQuery ML: Data size to process
If Vertex AI:
- Machine type (CPU/GPU/TPU)
- Number of machines
- Training duration estimate
- Storage requirements

Output:

Estimated compute cost
Storage cost
Data transfer cost (if applicable)
Total estimated cost
Cost comparison with other GCP options
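
The arithmetic behind such an estimate for Vertex AI is straightforward: hourly rate times machine count times duration, plus storage and transfer. The sketch below is not the script's implementation; the names are hypothetical, the rates are the ones quoted earlier in this document, and it assumes storage and transfer costs are supplied by the caller and that host machine charges are ignored.

```python
from dataclasses import dataclass

# Hourly rates quoted earlier in this document (USD per GPU or 8-core TPU).
HOURLY_RATE_USD = {
    "T4": 0.35, "L4": 0.66, "V100": 2.48,
    "A100-40GB": 3.67, "A100-80GB": 4.95,
    "TPU-v3-8": 8.00, "TPU-v4-8": 11.00,
}


@dataclass
class CostEstimate:
    compute: float
    storage: float
    transfer: float

    @property
    def total(self) -> float:
        return round(self.compute + self.storage + self.transfer, 2)


def estimate_vertex_cost(accelerator: str, machines: int, hours: float,
                         storage_usd: float = 0.0, transfer_usd: float = 0.0) -> CostEstimate:
    """Estimate cost as rate x machines x hours, plus caller-supplied extras."""
    compute = HOURLY_RATE_USD[accelerator] * machines * hours
    return CostEstimate(round(compute, 2), storage_usd, transfer_usd)


if __name__ == "__main__":
    estimate = estimate_vertex_cost("A100-40GB", machines=4, hours=10, storage_usd=5.0)
    print(estimate, estimate.total)
```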

Templates

BigQuery ML Training Template (`templates/bigquery_ml_training.sql`)

SQL template for creating and training models:

Model creation syntax
Feature engineering examples
Training options (L1/L2 reg, learning rate, etc.)
Evaluation queries
Prediction queries

Supported model types:

LINEAR_REG, LOGISTIC_REG
BOOSTED_TREE_CLASSIFIER, BOOSTED_TREE_REGRESSOR
DNN_CLASSIFIER, DNN_REGRESSOR
AUTOML_CLASSIFIER, AUTOML_REGRESSOR
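
To give a sense of the statement such a template produces, the sketch below renders a basic `CREATE MODEL` query for one of the supported model types. It is an illustration, not the contents of the template file: the function name, dataset, table, and column names are hypothetical, and identifiers are interpolated without escaping, so only trusted values should be passed in.

```python
SUPPORTED_MODEL_TYPES = {
    "LINEAR_REG", "LOGISTIC_REG",
    "BOOSTED_TREE_CLASSIFIER", "BOOSTED_TREE_REGRESSOR",
    "DNN_CLASSIFIER", "DNN_REGRESSOR",
    "AUTOML_CLASSIFIER", "AUTOML_REGRESSOR",
}


def create_model_sql(model_name: str, model_type: str, label_col: str,
                     feature_cols: list, source_table: str) -> str:
    """Render a BigQuery ML CREATE MODEL statement for a supported model type."""
    if model_type not in SUPPORTED_MODEL_TYPES:
        raise ValueError(f"unsupported model type: {model_type}")
    columns = ", ".join(feature_cols + [label_col])  # explicit columns, no SELECT *
    return (
        f"CREATE OR REPLACE MODEL `{model_name}`\n"
        f"OPTIONS(model_type='{model_type}', input_label_cols=['{label_col}']) AS\n"
        f"SELECT {columns}\n"
        f"FROM `{source_table}`"
    )


if __name__ == "__main__":
    print(create_model_sql(
        "my_dataset.trip_model", "LINEAR_REG", "duration",
        ["distance", "pickup_hour"], "my_dataset.trips",
    ))
```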

Vertex AI Training Job Template (`templates/vertex_training_job.py`)

Python template for custom training:

Training loop structure
Distributed training setup (PyTorch DDP)
Checkpointing and model saving
Metrics logging to Vertex AI
Hyperparameter tuning integration

Includes:

Single GPU training
Multi-GPU training (DataParallel, DistributedDataParallel)
TPU training with PyTorch/XLA
Cloud Storage integration

GPU Configuration Template (`templates/vertex_gpu_config.yaml`)

YAML configuration for GPU training jobs:

Machine type selection
GPU type and count
Disk configuration
Network configuration
Environment variables

Presets included:

Single T4 (budget)
Single A100 (standard)
4x A100 (distributed)
8x A100 (large-scale)
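
The presets above map naturally onto a Vertex AI worker pool specification. The sketch below shows one plausible mapping expressed as a Python dictionary and printed as JSON. The machine and accelerator type pairings are assumptions chosen for illustration (the actual template may use different ones), the field names follow the CustomJob `workerPoolSpecs` layout as I understand it, and `IMAGE_URI` is a placeholder.

```python
import json

# Assumed preset -> (machine type, accelerator type, accelerator count).
PRESETS = {
    "single-t4": ("n1-standard-8", "NVIDIA_TESLA_T4", 1),
    "single-a100": ("a2-highgpu-1g", "NVIDIA_TESLA_A100", 1),
    "4x-a100": ("a2-highgpu-4g", "NVIDIA_TESLA_A100", 4),
    "8x-a100": ("a2-highgpu-8g", "NVIDIA_TESLA_A100", 8),
}


def worker_pool_spec(preset: str, image_uri: str) -> dict:
    """Build a single-replica worker pool spec for the named preset."""
    machine_type, accelerator_type, count = PRESETS[preset]
    return {
        "machineSpec": {
            "machineType": machine_type,
            "acceleratorType": accelerator_type,
            "acceleratorCount": count,
        },
        "replicaCount": 1,
        "containerSpec": {"imageUri": image_uri},
    }


if __name__ == "__main__":
    print(json.dumps({"workerPoolSpecs": [worker_pool_spec("4x-a100", "IMAGE_URI")]}, indent=2))
```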

TPU Configuration Template (`templates/vertex_tpu_config.yaml`)

YAML configuration for TPU training jobs:

TPU type and topology
TPU version selection
JAX/TensorFlow runtime
XLA compilation flags

Presets included:

v3-8 (single TPU)
v4-32 (TPU pod slice)
v5e-8 (cost-optimized)

GCP Authentication Template (`templates/gcp_auth.json`)

Service account configuration template:

Project ID
Service account email
Key file path
Required scopes
IAM role assignments

Security notes:

Uses placeholders only (never real keys)
Documents how to create service accounts
Includes .gitignore protection
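
A simple guard can help enforce the placeholders-only rule before a file is committed. The sketch below flags JSON text that appears to carry a real private key. It assumes the key is stored in a `private_key` field containing a PEM header, which is how service account key files are typically laid out; the function name is hypothetical and the check is a heuristic, not a substitute for secret scanning.

```python
import json


def looks_like_real_key(text: str) -> bool:
    """Return True if text is a JSON object whose private_key holds PEM material."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    key = data.get("private_key", "")
    return isinstance(key, str) and "BEGIN PRIVATE KEY" in key


if __name__ == "__main__":
    template = '{"project_id": "YOUR_PROJECT_ID", "private_key": "YOUR_PRIVATE_KEY"}'
    print(looks_like_real_key(template))  # the placeholder template is not flagged
```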

Examples

BigQuery ML Regression Example (`examples/bigquery-regression-example.sql`)

Complete example:

Dataset: NYC taxi trip data
Task: Predict trip duration
Model: BOOSTED_TREE_REGRESSOR
Includes feature engineering, training, evaluation

Demonstrates:

CREATE MODEL syntax
TRANSFORM clause for feature engineering
Model evaluation
Batch predictions

Vertex AI PyTorch Training Example (`examples/vertex-pytorch-training.py`)

Complete training script:

Dataset: IMDB sentiment analysis
Model: DistilBERT fine-tuning
Training: Single GPU
Logging: Vertex AI experiments

Demonstrates:

Loading data from GCS
Training loop with mixed precision
Checkpointing to GCS
Metrics logging
Model export to Vertex AI
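
For checkpointing to GCS, a training script needs a stable place to write. The sketch below derives checkpoint URIs from the `AIP_CHECKPOINT_DIR` environment variable, which Vertex AI custom training sets when a base output directory is configured, and falls back to a local directory otherwise. The helper names, file naming scheme, and bucket name are hypothetical, and the example script itself may handle this differently.

```python
import os


def checkpoint_uri(step: int, default_dir: str = "checkpoints") -> str:
    """Return the path for a checkpoint at the given training step."""
    base = os.environ.get("AIP_CHECKPOINT_DIR", default_dir).rstrip("/")
    return f"{base}/step-{step:07d}.pt"


def split_gcs_uri(uri: str) -> tuple:
    """Split gs://bucket/path/to/blob into (bucket, blob path)."""
    if not uri.startswith("gs://"):
        raise ValueError(f"not a GCS URI: {uri}")
    bucket, _, blob = uri[len("gs://"):].partition("/")
    return bucket, blob


if __name__ == "__main__":
    # Placeholder bucket, set here only to demonstrate the path handling.
    os.environ["AIP_CHECKPOINT_DIR"] = "gs://my-bucket/run1/checkpoints/"
    uri = checkpoint_uri(42)
    print(uri, split_gcs_uri(uri))
```

The bucket and blob pair is what a Cloud Storage client would need in order to upload the saved file.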

Vertex AI Distributed Training Example (`examples/vertex-distributed-training.py`)

Multi-GPU training example:

Dataset: ImageNet subset
Model: ResNet-50
Training: 4x A100 with DDP
Scaling: Linear scaling rule

Demonstrates:

PyTorch DistributedDataParallel
Gradient accumulation
Learning rate scaling
Synchronized batch norm
Multi-node coordination
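
The linear scaling rule and gradient accumulation interact through the effective batch size. As a minimal sketch, assuming a reference learning rate tuned at a known batch size, the learning rate is scaled in proportion to the effective batch (per-GPU batch times GPU count times accumulation steps). The function name and the numbers in the example are illustrative, not taken from the example script.

```python
def scaled_learning_rate(base_lr: float, base_batch_size: int, per_gpu_batch_size: int,
                         num_gpus: int, accumulation_steps: int = 1) -> float:
    """Scale the learning rate linearly with the effective batch size."""
    effective_batch = per_gpu_batch_size * num_gpus * accumulation_steps
    return base_lr * effective_batch / base_batch_size


if __name__ == "__main__":
    # Reference: lr 0.1 at batch 256. Target: 4 GPUs x 64 per GPU x 2 accumulation steps.
    print(scaled_learning_rate(0.1, 256, per_gpu_batch_size=64, num_gpus=4, accumulation_steps=2))
```

In practice the rule is usually paired with a warmup period, since large scaled learning rates can be unstable in the first few epochs.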

Hugging Face Fine-tuning on Vertex AI (`examples/vertex-huggingface-finetuning.py`)

Production fine-tuning template:

Dataset: Custom text classification
Model: BERT/RoBERTa/DeBERTa
Training: Hugging Face Trainer API
Deployment: Vertex AI endpoint

Demonstrates:

Hugging Face Trainer integration
Hyperparameter tuning with Vertex AI
Model versioning
Endpoint deployment
Online predictions

Cost Optimization Tips

BigQuery ML

Reduce data processed:

Use partitioned tables
Filter data in WHERE clause before training
Use table sampling for experimentation
Cache intermediate results

Use appropriate model types:

Start with LINEAR_REG/LOGISTIC_REG (cheapest)
Use BOOSTED_TREE for better accuracy at moderate cost
Reserve AutoML for when simpler models fail

Optimize queries:

Avoid SELECT * (specify columns)
Use clustering on filter columns
Materialize views for repeated training

Vertex AI

Machine type selection:

Start with CPU for prototyping
Use T4 for small models (cheapest GPU)
Use A100 only for large models that need it
Consider TPU v5e for TensorFlow/JAX (very cost-effective)

Training optimization:

Use preemptible instances (60-70% cheaper, can be interrupted)
Enable automatic checkpoint/resume for preemptible jobs
Use mixed precision training (FP16/BF16) for faster training
Profile to eliminate CPU bottlenecks
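
Whether preemptible capacity pays off depends on how much work is repeated after interruptions. The sketch below compares the two under stated assumptions: the default discount is the midpoint of the 60-70% range quoted above, and `rework_fraction` is a hypothetical parameter for the share of training time lost and redone after preemptions.

```python
def on_demand_cost(rate_usd: float, hours: float) -> float:
    """Cost of running uninterrupted at the on-demand hourly rate."""
    return round(rate_usd * hours, 2)


def preemptible_cost(rate_usd: float, hours: float,
                     discount: float = 0.65, rework_fraction: float = 0.10) -> float:
    """Cost at the discounted rate, inflated by time redone after interruptions."""
    return round(rate_usd * (1 - discount) * hours * (1 + rework_fraction), 2)


if __name__ == "__main__":
    rate, hours = 4.0, 100
    print(on_demand_cost(rate, hours))
    print(preemptible_cost(rate, hours, discount=0.6, rework_fraction=0.25))
```

Good checkpointing keeps `rework_fraction` small, which is why the two tips above go together.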

Storage optimization:

Store datasets in Cloud Storage (cheaper than persistent disk)
Use Filestore only if needed for POSIX filesystem
Clean up old model artifacts
Use lifecycle policies to archive old data

Multi-GPU efficiency:

Ensure near-linear scaling before adding more GPUs
Profile inter-GPU communication
Use gradient accumulation instead of larger batch sizes
Consider two smaller GPUs instead of one larger GPU (often similar cost, better availability)
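
"Near-linear scaling" can be checked with a single ratio: measured multi-GPU throughput divided by the ideal of N times the single-GPU throughput. The sketch below computes that ratio; the function name and the throughput numbers are illustrative only.

```python
def scaling_efficiency(throughput_1gpu: float, throughput_ngpu: float, num_gpus: int) -> float:
    """Return measured throughput as a fraction of ideal linear scaling."""
    if throughput_1gpu <= 0 or num_gpus <= 0:
        raise ValueError("throughput_1gpu and num_gpus must be positive")
    return throughput_ngpu / (throughput_1gpu * num_gpus)


if __name__ == "__main__":
    # 1,000 samples/s on one GPU, 3,600 samples/s on four GPUs.
    print(scaling_efficiency(1000, 3600, 4))
```

An efficiency well below 1.0 suggests communication or input-pipeline bottlenecks worth profiling before adding more GPUs.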

Integration with ML Training Plugin

This skill integrates with other ml-training components:

training-patterns: Provides GCP configs for generated training scripts
cost-calculator: Uses GCP pricing data for budget planning
monitoring-dashboard: Integrates with Vertex AI TensorBoard
validation-scripts: Validates GCP credentials and permissions
integration-helpers: Deploys trained models to Vertex AI endpoints

Common Workflows

Workflow 1: Quick BigQuery ML Prototype

Run bash scripts/setup-bigquery-ml.sh
Copy templates/bigquery_ml_training.sql to your project
Modify SQL for your dataset and features
Run training query in BigQuery console
Evaluate with built-in ML.EVALUATE()
Export predictions with ML.PREDICT()

Time: 30 minutes setup + training time

Cost: $5 per TB of data processed
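
The evaluation and prediction steps of this workflow each come down to one query. The sketch below renders them as strings using the `TABLE` form of `ML.EVALUATE` and `ML.PREDICT`; the helper names and the model and table names are hypothetical, and identifiers are interpolated without escaping, so only trusted values should be passed in.

```python
def evaluate_sql(model_name: str, eval_table: str) -> str:
    """Render a query that evaluates a trained model on a held-out table."""
    return f"SELECT * FROM ML.EVALUATE(MODEL `{model_name}`, TABLE `{eval_table}`)"


def predict_sql(model_name: str, input_table: str) -> str:
    """Render a query that produces batch predictions for an input table."""
    return f"SELECT * FROM ML.PREDICT(MODEL `{model_name}`, TABLE `{input_table}`)"


if __name__ == "__main__":
    print(evaluate_sql("my_dataset.trip_model", "my_dataset.trips_eval"))
    print(predict_sql("my_dataset.trip_model", "my_dataset.trips_new"))
```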

Workflow 2: Custom PyTorch Training on Vertex AI

Run bash scripts/configure-auth.sh
Run bash scripts/setup-vertex-ai.sh
Copy templates/vertex_training_job.py
Customize training loop for your model
Copy templates/vertex_gpu_config.yaml
Submit job: gcloud ai custom-jobs create ...
Monitor in Vertex AI console

Time: 1 hour setup + training time

Cost: Depends on GPU/TPU selection
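
The job submission step in this workflow is elided, so the following is only one plausible shape for it, built as a string for review rather than executed. It uses the `--region`, `--display-name`, and `--config` flags of `gcloud ai custom-jobs create`; the helper name, region, job name, and the assumption that the copied YAML file can be passed directly to `--config` are all illustrative.

```python
import shlex


def custom_job_command(region: str, display_name: str, config_path: str) -> str:
    """Build a gcloud command that submits a custom training job from a config file."""
    return shlex.join([
        "gcloud", "ai", "custom-jobs", "create",
        f"--region={region}",
        f"--display-name={display_name}",
        f"--config={config_path}",
    ])


if __name__ == "__main__":
    # Placeholder values; adjust to your project and configuration file.
    print(custom_job_command("us-central1", "distilbert-imdb", "vertex_gpu_config.yaml"))
```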

Workflow 3: Large-Scale Distributed Training

Set up Vertex AI (see Workflow 2)
Copy examples/vertex-distributed-training.py
Adapt for your model architecture
Test locally with 1 GPU
Test with 2 GPUs to verify scaling
Scale to 4-8 GPUs for full training
Use preemptible instances with checkpointing

Time: 2-4 hours setup + training time

Cost: $15-60/hour depending on GPU count

Troubleshooting

BigQuery ML Issues

"Insufficient permissions":

Verify roles/bigquery.dataEditor and roles/bigquery.jobUser
Check dataset-level permissions
Ensure billing is enabled

"Model training failed":

Check for NULL values in features
Verify data types match model expectations
Review feature engineering TRANSFORM clause
Check for sufficient training data

Vertex AI Issues

"Service account lacks permissions":

Verify roles/aiplatform.user
Add roles/storage.objectAdmin for GCS access
Check project-level IAM policies

"GPU/TPU quota exceeded":

Request quota increase in GCP console
Use different region with availability
Start with smaller GPU/TPU configuration
Use preemptible instances (separate quota)

"Training job crashes":

Check for CUDA OOM (reduce batch size)
Verify dependencies in requirements.txt
Review logs in Cloud Logging
Test locally before submitting to Vertex

Security Best Practices

Credentials Management

DO:

✅ Use service accounts with minimal permissions
✅ Store credentials in Secret Manager
✅ Use Workload Identity for GKE deployments
✅ Rotate service account keys regularly
✅ Add .gitignore for *.json key files

DON'T:

❌ Hardcode credentials in code
❌ Commit service account keys to git
❌ Use overly permissive roles (e.g., Owner)
❌ Share service account keys across projects
❌ Use personal credentials for production

IAM Best Practices

Use separate service accounts for training vs serving
Grant roles at the resource level rather than the project level when possible
Use Workload Identity Federation instead of keys when possible
Enable Cloud Audit Logs for ML API usage
Review IAM permissions quarterly

Performance Benchmarks

BigQuery ML vs Vertex AI

BigQuery ML:

Best for: Structured data, SQL users, quick prototypes
Training time: Minutes to hours (depends on data size)
Scalability: Automatic (serverless)
Cost: $5/TB processed

Vertex AI Custom Training:

Best for: Deep learning, custom architectures, GPU/TPU workloads
Training time: Hours to days (configurable hardware)
Scalability: Manual (choose machine type)
Cost: $0.35-20/hour depending on hardware

Rule of thumb:

Use BigQuery ML for tabular data with fewer than 100M rows
Use Vertex AI for images, text, audio, or custom models
Use Vertex AI for models requiring GPU/TPU acceleration

Additional Resources

GCP ML Documentation: https://cloud.google.com/vertex-ai/docs
BigQuery ML Reference: https://cloud.google.com/bigquery-ml/docs
Pricing Calculator: https://cloud.google.com/products/calculator
TPU Best Practices: https://cloud.google.com/tpu/docs/best-practices
Vertex AI Samples: https://github.com/GoogleCloudPlatform/vertex-ai-samples
