    davila7/distributed-llm-pretraining-torchtitan
    About

    Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP)...

    SKILL.md

    TorchTitan - PyTorch Native Distributed LLM Pretraining

    Quick start

    TorchTitan is PyTorch's official platform for large-scale LLM pretraining with composable 4D parallelism (FSDP2, TP, PP, CP), achieving 65%+ speedups over baselines on H100 GPUs.

    Installation:

    # From PyPI (stable)
    pip install torchtitan
    
    # From source (latest features, requires PyTorch nightly);
    # the run_train.sh and scripts/ commands below are run from this checkout
    git clone https://github.com/pytorch/torchtitan
    cd torchtitan
    pip install -r requirements.txt
    

    Download tokenizer:

    # Get HF token from https://huggingface.co/settings/tokens
    python scripts/download_hf_assets.py --repo_id meta-llama/Llama-3.1-8B --assets tokenizer --hf_token=...
    

    Start training on 8 GPUs:

    CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh
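
    run_train.sh defaults to 8 GPUs; upstream torchtitan's script also reads an NGPU environment variable, so a smaller smoke test with the bundled debug config might look like this (both NGPU and the debug config path are assumptions about the checkout in use):

    # Sketch: assumes run_train.sh honors NGPU and that debug_model.toml exists in this checkout
    NGPU=2 CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" ./run_train.sh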
    

    Common workflows

    Workflow 1: Pretrain Llama 3.1 8B on single node

    Copy this checklist:

    Single Node Pretraining:
    - [ ] Step 1: Download tokenizer
    - [ ] Step 2: Configure training
    - [ ] Step 3: Launch training
    - [ ] Step 4: Monitor and checkpoint
    

    Step 1: Download tokenizer

    python scripts/download_hf_assets.py \
      --repo_id meta-llama/Llama-3.1-8B \
      --assets tokenizer \
      --hf_token=YOUR_HF_TOKEN
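
    If the assets land under ./assets/hf/ (an assumption here, chosen to match the hf_assets_path used in Step 2), a quick check before training:

    # Hypothetical path: must match hf_assets_path in the TOML config below
    ls ./assets/hf/Llama-3.1-8B/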
    

    Step 2: Configure training

    Edit or create a TOML config file:

    # llama3_8b_custom.toml
    [job]
    dump_folder = "./outputs"
    description = "Llama 3.1 8B training"
    
    [model]
    name = "llama3"
    flavor = "8B"
    hf_assets_path = "./assets/hf/Llama-3.1-8B"
    
    [optimizer]
    name = "AdamW"
    lr = 3e-4
    
    [lr_scheduler]
    warmup_steps = 200
    
    [training]
    local_batch_size = 2
    seq_len = 8192
    max_norm = 1.0
    steps = 1000
    dataset = "c4"
    
    [parallelism]
    data_parallel_shard_degree = -1  # Use all GPUs for FSDP
    
    [activation_checkpoint]
    mode = "selective"
    selective_ac_option = "op"
    
    [checkpoint]
    enable = true
    folder = "checkpoint"
    interval = 500
    

    Step 3: Launch training

    # 8 GPUs on single node
    CONFIG_FILE="./llama3_8b_custom.toml" ./run_train.sh
    
    # Or explicitly with torchrun
    torchrun --nproc_per_node=8 \
      -m torchtitan.train \
      --job.config_file ./llama3_8b_custom.toml
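
    Any TOML key can also be overridden on the command line using the same section.key spelling (the Float8 and 4D-parallelism workflows below rely on this); for example, a hypothetical short run with a lower learning rate:

    # Illustrative values; the flags mirror the [training] and [optimizer] sections of the config
    CONFIG_FILE="./llama3_8b_custom.toml" ./run_train.sh \
      --training.steps 100 \
      --optimizer.lr 1e-4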
    

    Step 4: Monitor and checkpoint

    TensorBoard logs are saved to ./outputs/tb/:

    tensorboard --logdir ./outputs/tb
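
    When training runs on a remote machine, standard TensorBoard flags expose the dashboard on a chosen port:

    # --bind_all listens on all interfaces so the dashboard is reachable from your workstation
    tensorboard --logdir ./outputs/tb --port 6006 --bind_all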
    

    Workflow 2: Multi-node training with SLURM

    Multi-Node Training:
    - [ ] Step 1: Configure parallelism for scale
    - [ ] Step 2: Set up SLURM script
    - [ ] Step 3: Submit job
    - [ ] Step 4: Resume from checkpoint
    

    Step 1: Configure parallelism for scale

    For a 70B model on 256 GPUs (32 nodes):

    [parallelism]
    data_parallel_shard_degree = 32  # FSDP across 32 ranks
    tensor_parallel_degree = 8        # TP within node
    pipeline_parallel_degree = 1      # No PP for 70B
    context_parallel_degree = 1       # Increase for long sequences
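
    As a sanity check, the product of the parallelism degrees must equal the total GPU count:

    # 32 (FSDP) x 8 (TP) x 1 (PP) x 1 (CP) = 256 GPUs = 32 nodes x 8 GPUs/node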
    

    Step 2: Set up SLURM script

    #!/bin/bash
    #SBATCH --job-name=llama70b
    #SBATCH --nodes=32
    #SBATCH --ntasks-per-node=1   # one torchrun launcher per node; torchrun spawns the GPU workers
    #SBATCH --gpus-per-node=8
    
    # Rendezvous endpoint: first node in the allocation
    MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
    MASTER_PORT=29500
    
    srun torchrun \
      --nnodes=32 \
      --nproc_per_node=8 \
      --rdzv_backend=c10d \
      --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
      -m torchtitan.train \
      --job.config_file ./llama3_70b.toml
    

    Step 3: Submit job

    sbatch multinode_trainer.slurm
    

    Step 4: Resume from checkpoint

    With [checkpoint] enable = true, training automatically resumes from the latest checkpoint found in the configured folder.
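
    A minimal sketch of what auto-resume relies on, assuming the dump_folder and checkpoint settings from Workflow 1 (step directories use the step-N naming shown under Common issues below):

    ls ./outputs/checkpoint/
    # step-500  step-1000   <- the latest step directory is loaded on the next launch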

    Workflow 3: Enable Float8 training for H100s

    Float8 provides 30-50% speedup on H100 GPUs.

    Float8 Training:
    - [ ] Step 1: Install torchao
    - [ ] Step 2: Configure Float8
    - [ ] Step 3: Launch with compile
    

    Step 1: Install torchao

    USE_CPP=0 pip install git+https://github.com/pytorch/ao.git
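
    A quick way to confirm the build was picked up:

    python -c "import torchao; print(torchao.__version__)"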
    

    Step 2: Configure Float8

    Add to your TOML config:

    [model]
    converters = ["quantize.linear.float8"]
    
    [quantize.linear.float8]
    enable_fsdp_float8_all_gather = true
    precompute_float8_dynamic_scale_for_fsdp = true
    filter_fqns = ["output"]  # Exclude output layer
    
    [compile]
    enable = true
    components = ["model", "loss"]
    

    Step 3: Launch with compile

    CONFIG_FILE="./llama3_8b.toml" ./run_train.sh \
      --model.converters="quantize.linear.float8" \
      --quantize.linear.float8.enable_fsdp_float8_all_gather \
      --compile.enable
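
    Float8 kernels require Hopper-class hardware; a quick capability check (H100 reports compute capability (9, 0)):

    python -c "import torch; print(torch.cuda.get_device_capability())"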
    

    Workflow 4: 4D parallelism for 405B models

    4D Parallelism (FSDP + TP + PP + CP):
    - [ ] Step 1: Create seed checkpoint
    - [ ] Step 2: Configure 4D parallelism
    - [ ] Step 3: Launch on 512 GPUs
    

    Step 1: Create seed checkpoint

    Required for consistent initialization across PP stages:

    NGPU=1 CONFIG_FILE=./llama3_405b.toml ./run_train.sh \
      --checkpoint.enable \
      --checkpoint.create_seed_checkpoint \
      --parallelism.data_parallel_shard_degree 1 \
      --parallelism.tensor_parallel_degree 1 \
      --parallelism.pipeline_parallel_degree 1
    

    Step 2: Configure 4D parallelism

    [parallelism]
    data_parallel_shard_degree = 8   # FSDP
    tensor_parallel_degree = 8       # TP within node
    pipeline_parallel_degree = 8     # PP across nodes
    context_parallel_degree = 1      # CP for long sequences
    
    [training]
    local_batch_size = 32
    seq_len = 8192
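
    Assuming local_batch_size counts samples per data-parallel rank (as in the single-node workflow), the tokens consumed per optimizer step under this layout work out to:

    # 32 (local batch) x 8,192 (seq_len) x 8 (FSDP ranks) = 2,097,152 tokens per step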
    

    Step 3: Launch on 512 GPUs

    # 64 nodes x 8 GPUs = 512 GPUs
    srun torchrun --nnodes=64 --nproc_per_node=8 \
      -m torchtitan.train \
      --job.config_file ./llama3_405b.toml
    

    When to use vs alternatives

    Use TorchTitan when:

    • Pretraining LLMs from scratch (8B to 405B+)
    • Need PyTorch-native solution without third-party dependencies
    • Require composable 4D parallelism (FSDP2, TP, PP, CP)
    • Training on H100s with Float8 support
    • Want interoperable checkpoints with torchtune/HuggingFace

    Use alternatives instead:

    • Megatron-LM: Maximum performance for NVIDIA-only deployments
    • DeepSpeed: Broader ZeRO optimization ecosystem, inference support
    • Axolotl/TRL: Fine-tuning rather than pretraining
    • LitGPT: Educational, smaller-scale training

    Common issues

    Issue: Out of memory on large models

    Enable activation checkpointing and reduce batch size:

    [activation_checkpoint]
    mode = "full"  # Instead of "selective"
    
    [training]
    local_batch_size = 1
    

    Or use gradient accumulation:

    [training]
    local_batch_size = 1
    global_batch_size = 32  # Accumulates gradients
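
    On the 8-GPU single-node setup, and assuming torchtitan derives the accumulation count as global_batch_size / (local_batch_size x data-parallel degree), this gives:

    # 32 / (1 x 8) = 4 micro-batches accumulated per optimizer step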
    

    Issue: TP causes high memory with async collectives

    Set environment variable:

    export TORCH_NCCL_AVOID_RECORD_STREAMS=1
    

    Issue: Float8 training not faster

    Float8 only benefits large GEMMs. Filter small layers:

    [quantize.linear.float8]
    filter_fqns = ["attention.wk", "attention.wv", "output", "auto_filter_small_kn"]
    

    Issue: Checkpoint loading fails after parallelism change

    Distributed Checkpoint (DCP) can reshard a checkpoint across parallelism layouts; one route is to convert the sharded checkpoint to a single file and re-import it:

    # Convert sharded checkpoint to single file
    python -m torch.distributed.checkpoint.format_utils \
      dcp_to_torch checkpoint/step-1000 checkpoint.pt
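
    The same utility accepts a torch_to_dcp mode for the reverse direction, e.g. to turn a single-file checkpoint back into a DCP folder:

    python -m torch.distributed.checkpoint.format_utils \
      torch_to_dcp checkpoint.pt checkpoint/step-1000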
    

    Issue: Inconsistent model initialization with pipeline parallelism

    Create seed checkpoint first (see Workflow 4, Step 1).

    Supported models

    Model         Sizes                    Status
    Llama 3.1     8B, 70B, 405B            Production
    Llama 4       Various                  Experimental
    DeepSeek V3   16B, 236B, 671B (MoE)    Experimental
    GPT-OSS       20B, 120B (MoE)          Experimental
    Qwen 3        Various                  Experimental
    Flux          Diffusion                Experimental

    Performance benchmarks (H100)

    Model        GPUs   Parallelism        TPS/GPU   Notes
    Llama 8B     8      FSDP               5,762     Baseline
    Llama 8B     8      FSDP+compile+FP8   8,532     +48% over baseline
    Llama 70B    256    FSDP+TP+AsyncTP    876       2D parallel
    Llama 405B   512    FSDP+TP+PP         128       3D parallel

    Advanced topics

    FSDP2 configuration: See references/fsdp.md for detailed FSDP2 vs FSDP1 comparison and ZeRO equivalents.

    Float8 training: See references/float8.md for tensorwise vs rowwise scaling recipes.

    Checkpointing: See references/checkpoint.md for HuggingFace conversion and async checkpointing.

    Adding custom models: See references/custom-models.md for TrainSpec protocol.

    Resources

    • GitHub: https://github.com/pytorch/torchtitan
    • Paper: https://arxiv.org/abs/2410.06511
    • ICLR 2025: https://iclr.cc/virtual/2025/poster/29620
    • PyTorch Forum: https://discuss.pytorch.org/c/distributed/torchtitan/44