Verify a CVlization training pipeline example is properly structured, can build, trains successfully, and logs appropriate metrics...
Systematically verify that a CVlization training example is complete, properly structured, and functional.
Shared GPU Environment: This machine may be used by multiple users simultaneously. Before running GPU-intensive training:
nvidia-smi
Check that the example directory contains all required files:
# Navigate to example directory
cd examples/<capability>/<task>/<framework>/
# Expected structure:
# .
# ├── example.yaml # Required: CVL metadata
# ├── Dockerfile # Required: Container definition
# ├── build.sh # Required: Build script
# ├── train.sh # Required: Training script
# ├── train.py # Required: Training code
# ├── README.md # Recommended: Documentation
# ├── requirements.txt # Optional: Python dependencies
# ├── data/ # Optional: Data directory
# └── outputs/ # Created at runtime
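The structure check can be scripted. The following is a minimal sketch, assuming the five files marked "Required" above are the complete required set; the function name `missing_required_files` is hypothetical and not part of CVL.

```bash
# Sketch: print the name of each required file that is missing from an
# example directory. The file list mirrors the "Required" entries above.
missing_required_files() {
  dir="${1:-.}"   # example directory, defaults to the current directory
  for f in example.yaml Dockerfile build.sh train.sh train.py; do
    # Print only the files that do not exist
    [ -f "$dir/$f" ] || echo "$f"
  done
}
```

Run `missing_required_files .` from inside the example directory; empty output means every required file is present.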
Key files to check:
example.yaml - Must have: name, capability, stability, presets (build, train)
Dockerfile - Should copy necessary files and install dependencies
build.sh - Must set SCRIPT_DIR and call docker build
train.sh - Must mount volumes correctly, pass environment variables, and forward CUDA_VISIBLE_DEVICES to the container (use ${CUDA_VISIBLE_DEVICES:+--env "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"} so it's only set when the host has it set)
.gitignore - Must cover all runtime artifacts (weights, checkpoints, results). Note: CVL_OUTPUTS maps to the workspace root (the example dir itself), NOT outputs/, so files like *.pt, result.txt etc. will land in the example directory directly. Verify with git status after a test run.
# Option 1: Build using script directly
./build.sh
# Option 2: Build using CVL CLI (recommended)
cvl run <example-name> build
# Verify image was created
docker images | grep <example-name>
# Expected: Image appears with recent timestamp
What to check:
cvl info <example-name> shows correct metadata
Start training and monitor for proper initialization:
# Option 1: Run training using script directly
./train.sh
# Option 2: Run training using CVL CLI (recommended)
cvl run <example-name> train
# With custom parameters (if supported)
BATCH_SIZE=2 NUM_EPOCHS=1 ./train.sh
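train.sh is expected to forward CUDA_VISIBLE_DEVICES only when the host sets it, using the `${VAR:+word}` parameter expansion. The sketch below shows how that expansion behaves by counting the arguments it produces; `count_args` and `gpu_env_arg_count` are hypothetical helpers for illustration, not part of any train.sh.

```bash
# Sketch: show what the conditional expansion contributes to a command line.
count_args() { echo "$#"; }

gpu_env_arg_count() {
  # When CUDA_VISIBLE_DEVICES is set and non-empty, the expansion yields two
  # words: --env and CUDA_VISIBLE_DEVICES=<value>. Otherwise it yields nothing,
  # so docker run receives no --env flag for it.
  count_args ${CUDA_VISIBLE_DEVICES:+--env "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"}
}
```

With `CUDA_VISIBLE_DEVICES=0,1` set, `gpu_env_arg_count` prints 2; with the variable unset or empty, it prints 0. The expansion is left unquoted on purpose so that it disappears entirely when the variable is unset.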
Immediate checks (first 30-60 seconds):
GPU is being used (check with nvidia-smi)
Monitor metrics appropriate to the task type:
train/loss (should decrease over time)
# For LLM/generative models
tail -f logs/train.log | grep -i "loss\|iter\|step"
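Whether the loss is decreasing can also be checked after the fact. This is a rough sketch, assuming log lines of the form `loss: <number>` or `loss=<number>` with plain decimal values; the real log format depends on the example, and `loss_trend` is a hypothetical helper.

```bash
# Sketch: compare the first and last loss value found in a log file.
# Assumes "loss: 1.23" or "loss=1.23"; scientific notation is not handled,
# and several loss fields on one line are all treated as one sequence.
loss_trend() {
  grep -oiE 'loss[=: ]+[0-9]+(\.[0-9]+)?' "$1" \
    | grep -oE '[0-9]+(\.[0-9]+)?$' \
    | awk 'NR == 1 { first = $1 + 0 }
           { last = $1 + 0 }
           END {
             if (NR < 2) print "insufficient-data"
             else if (last < first) print "decreasing"
             else print "not-decreasing"
           }'
}
```

Example: `loss_trend logs/train.log`. A noisy run can end on a higher value than it started even when training is healthy, so treat the result as a quick signal rather than a verdict.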
train/loss, train/accuracy, val/accuracy
# Watch accuracy metrics
tail -f lightning_logs/version_0/metrics.csv
# or for WandB
tail -f logs/train.log | grep -i "accuracy\|acc"
train/loss, val/map (mean Average Precision), val/map_50
loss_classifier, loss_box_reg, loss_objectness, loss_rpn_box_reg
# Monitor detection metrics
tail -f logs/train.log | grep -i "map\|loss_classifier\|loss_box"
train/loss, val/iou (Intersection over Union), val/pixel_accuracy
# Monitor segmentation metrics
tail -f lightning_logs/version_0/metrics.csv | grep -i "iou\|pixel"
train/loss, eval/loss, task-specific metrics
# Check if adapters are being saved
ls -la outputs/*/lora_adapters/
# Should contain: adapter_config.json, adapter_model.safetensors
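The adapter check above can be turned into a pass/fail test. A minimal sketch, assuming adapters are written to `outputs/*/lora_adapters/` as in the listing above; `check_lora_adapters` is a hypothetical name.

```bash
# Sketch: report whether each lora_adapters directory holds both expected files.
# Returns 0 if at least one complete adapter directory was found, 1 otherwise.
check_lora_adapters() {
  out="${1:-outputs}"   # outputs directory, defaults to ./outputs
  found=1
  for d in "$out"/*/lora_adapters; do
    [ -d "$d" ] || continue   # skip the unexpanded glob when nothing matches
    if [ -f "$d/adapter_config.json" ] && [ -f "$d/adapter_model.safetensors" ]; then
      echo "ok: $d"
      found=0
    else
      echo "incomplete: $d"
    fi
  done
  return "$found"
}
```

Run `check_lora_adapters outputs` once the first checkpoint should have been written.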
GPU VRAM Usage Monitoring (REQUIRED):
Before, during, and after training, actively monitor GPU VRAM usage:
# In another terminal, watch GPU memory in real-time
watch -n 1 nvidia-smi
# Or get detailed memory breakdown
nvidia-smi --query-gpu=index,name,memory.used,memory.total,memory.free,utilization.gpu --format=csv,noheader,nounits
# Sample current VRAM usage (reported in MiB); repeat during training and keep the maximum to get the peak
nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | awk '{print $1 " MiB"}'
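A single nvidia-smi query reports usage at one instant, so finding the peak means sampling repeatedly and keeping the maximum. The sketch below does that with two hypothetical helpers, `sample_vram` and `max_value`; the sample count and interval are arbitrary defaults.

```bash
# Sketch: track peak VRAM by polling nvidia-smi and keeping the largest value.
max_value() {
  # Read one number per line on stdin and print the largest
  awk 'BEGIN { m = 0 } { v = $1 + 0; if (v > m) m = v } END { print m }'
}

sample_vram() {
  # $1 = number of samples, $2 = seconds between samples
  n="${1:-30}"; interval="${2:-1}"; i=0
  while [ "$i" -lt "$n" ]; do
    # Prints one line per GPU, value in MiB
    nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits
    sleep "$interval"
    i=$((i + 1))
  done
}
```

Run `sample_vram 60 1 | max_value` in a second terminal while training is active. On a multi-GPU machine the result is the maximum across all GPUs, and on a shared machine it includes memory used by other users' processes.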
Expected metrics:
What to record for verification metadata:
Troubleshooting:
CUDA out of memory: reduce BATCH_SIZE, MAX_SEQ_LEN, or model size
Docker Container Health:
# List running containers
docker ps
# Check logs for errors
docker logs <container-name-or-id>
# Verify mounts
docker inspect <container-id> | grep -A 10 Mounts
# Should see: workspace, cvlization_repo, huggingface cache
Output Directory:
# Check outputs are being written
ls -la outputs/ logs/ lightning_logs/
# Expected: Checkpoints, logs, or saved models appearing
# For WandB integration
ls -la wandb/
# Expected: run-<timestamp>-<id> directories
Verify that datasets and pretrained weights are cached properly:
# Check CVlization dataset cache
ls -la ~/.cache/cvlization/data/
# Expected: Dataset archives and extracted folders
# Examples: coco_panoptic_tiny/, stanford_background/, etc.
# Check framework-specific caches
ls -la ~/.cache/torch/hub/checkpoints/ # PyTorch pretrained weights
ls -la ~/.cache/huggingface/ # HuggingFace models
# Verify no repeated downloads on second run
# First run: Should see "Downloading..." messages
./train.sh 2>&1 | tee first_run.log
# Clean workspace data (but keep cache)
rm -rf ./data/
# Second run: Should NOT download again, uses cache
./train.sh 2>&1 | tee second_run.log
# Verify no download messages in second run
grep -i "download" second_run.log
# Expected: Minimal or no download activity (weights already cached)
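The comparison between the two runs can be automated. A minimal sketch, assuming download activity appears in the logs as lines containing the word "download" (the same assumption the grep above makes); `count_downloads` and `compare_runs` are hypothetical helpers.

```bash
# Sketch: compare download activity between two training logs.
count_downloads() {
  # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true"
  grep -ci "download" "$1" || true
}

compare_runs() {
  first=$(count_downloads "$1")
  second=$(count_downloads "$2")
  if [ "$second" -eq 0 ]; then
    echo "cache reused: no download lines in second run"
  elif [ "$second" -lt "$first" ]; then
    echo "partial: $second download lines in second run (first run: $first)"
  else
    echo "cache not reused: $second download lines in second run (first run: $first)"
  fi
}
```

Usage: `compare_runs first_run.log second_run.log`. Progress bars that repeat the word "download" inflate the counts, so read the "partial" result against the log itself.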
What to verify:
Datasets download to ~/.cache/cvlization/data/ (not ./data/)
Pretrained weights are cached in framework-specific locations (PyTorch: ~/.cache/torch/, HuggingFace: ~/.cache/huggingface/)
data_dir parameter passed to dataset builders
For fast verification (useful during development):
# Run 1 epoch with limited data
MAX_TRAIN_SAMPLES=10 NUM_EPOCHS=1 ./train.sh
# Expected runtime: 1-5 minutes
# Verify: Completes without errors, metrics logged
After successful verification, update the example.yaml with verification metadata:
First, check GPU info:
# Get GPU model and VRAM
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
Format:
verification:
  last_verified: 2025-10-25
  last_verification_note: "Verified build, training initialization, lazy downloading, and metrics logging on [GPU_MODEL] ([VRAM]GB VRAM)"
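The entry can be generated instead of typed by hand. The sketch below prints a block in the format above from a GPU model, a VRAM size, and a date; `verification_block` is a hypothetical helper, and the two-space indentation assumes the fields nest under `verification:`.

```bash
# Sketch: print a verification entry for example.yaml.
# $1 = GPU model, $2 = VRAM in GB, $3 = date (YYYY-MM-DD, defaults to today)
verification_block() {
  vdate="${3:-$(date +%Y-%m-%d)}"
  printf 'verification:\n'
  printf '  last_verified: %s\n' "$vdate"
  printf '  last_verification_note: "Verified build, training initialization, lazy downloading, and metrics logging on %s (%sGB VRAM)"\n' "$1" "$2"
}
```

Usage: `verification_block "<gpu model>" <vram-gb>`, then paste the output into example.yaml and adjust the note so it lists only what was actually verified.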
What to include in the note:
VRAM usage, as reported by nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits
Example complete entry:
name: pose-estimation-mmpose
docker: mmpose
capability: perception/pose_estimation
# ... other fields ...
verification:
  last_verified: 2025-10-25
  last_verification_note: "Verified build, CVL CLI integration, and lazy downloading to ~/.cache/cvlization/data/. Training not fully verified due to GPU memory constraints (CUDA OOM on shared GPU)."
When to update:
# Issue: Dockerfile can't find files
# Fix: Check COPY paths are relative to Dockerfile location
# Issue: Dependency conflicts
# Fix: Check requirements.txt versions, update base image
# Issue: Large build context
# Fix: Add .dockerignore file
# Issue: CUDA out of memory
# Fix: Reduce BATCH_SIZE, MAX_SEQ_LEN, or image size
# Issue: Dataset not found
# Fix: Check data/ directory exists, run data preparation script
# Issue: Permission denied on outputs
# Fix: Ensure output directories are created before docker run
# Issue: Loss is NaN
# Fix: Reduce learning rate, check data normalization, verify labels
# Issue: No metrics logged
# Fix: Check training script has logging configured (wandb/tensorboard)
# Issue: Loss not decreasing
# Fix: Verify learning rate, check data quality, increase epochs
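The "Loss is NaN" case is easy to miss in a long log, so a quick scan helps. A minimal sketch, assuming the log writes the value directly after a `loss` token (for example `loss: nan`); `has_nan_loss` is a hypothetical helper.

```bash
# Sketch: succeed (exit 0) when the log contains a NaN loss value.
# Assumes the "loss: nan" or "loss=nan" form, matched case-insensitively.
has_nan_loss() {
  grep -qiE 'loss[=: ]+nan' "$1"
}
```

Usage: `has_nan_loss logs/train.log && echo "NaN loss found"`.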
cd examples/perception/object_detection/torchvision
./build.sh
./train.sh
# Monitor: train/loss, val/map, val/map_50
# Success: mAP > 0.3 after a few epochs
cd examples/perception/segmentation/semantic_torchvision
./build.sh
./train.sh
# Monitor: train/loss, val/iou, val/pixel_accuracy
# Success: IoU > 0.5, pixel_accuracy > 80%
cd examples/generative/llm/nanogpt
./build.sh
./train.sh
# Monitor: train/loss, val/loss, iter time
# Success: Loss decreasing from ~4.0 to <2.0
cd examples/perception/doc_ai/granite_docling_finetune
./build.sh
MAX_TRAIN_SAMPLES=20 NUM_EPOCHS=1 ./train.sh
# Monitor: train/loss, eval/loss
# Success: Both losses decrease, adapters saved to outputs/
These examples integrate with the CVL command system:
# List all available examples
cvl list
# Get example info
cvl info granite_docling_finetune
# Run example directly (uses example.yaml presets)
cvl run granite_docling_finetune build
cvl run granite_docling_finetune train
A training pipeline passes verification when:
Build succeeds (both ./build.sh and cvl run <name> build)
Training starts and runs (both ./train.sh and cvl run <name> train)
Datasets download to ~/.cache/cvlization/data/ (NOT to local ./data/), pretrained weights cached to framework-specific locations (~/.cache/torch/, ~/.cache/huggingface/)
.gitignore covers all runtime artifacts (weights land in example dir root via CVL_OUTPUTS, not outputs/)
cvl info <name> shows correct metadata, build and train presets work
example.yaml has a verification field containing last_verified date and last_verification_note
Check these files for debugging:
train.py - Core training logic
Dockerfile - Environment setup
requirements.txt - Python dependencies
example.yaml - CVL metadata and presets
README.md - Usage instructions
Use MAX_TRAIN_SAMPLES=<small_number> for fast validation
Monitor the GPU with nvidia-smi in a separate terminal
Check docker logs <container> if training hangs
For WandB logging, set the WANDB_API_KEY environment variable
Use WANDB_MODE=offline ./train.sh --track ... to verify wandb hooks fire correctly without needing a real API key. Check for wandb: Synced N W&B file(s), M media file(s) in the output to confirm images and scalars were logged.
To remove stale cache files from inside a container: docker run --rm --mount "type=bind,src=${HOME}/.cache/cvlization,dst=/data" <image> rm -rf /data/path/to/stale/files
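For the .gitignore criterion, git can list the files a test run left behind that are neither tracked nor ignored. A minimal sketch, assuming the example directory is inside a git repository; `untracked_artifacts` is a hypothetical helper.

```bash
# Sketch: list files in an example directory that .gitignore does not cover.
# $1 = example directory (defaults to the current directory)
untracked_artifacts() {
  # --others: untracked files; --exclude-standard: honor .gitignore rules
  git -C "${1:-.}" ls-files --others --exclude-standard
}
```

Run `untracked_artifacts .` after a test run. Empty output means every runtime artifact is covered; any weights or result files listed need a matching .gitignore pattern.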