
    davila7

    verl-rl-training

    davila7/verl-rl-training
    AI & ML
    19,892


    About

    Provides guidance for training LLMs with reinforcement learning using verl (Volcano Engine RL)...

    SKILL.md

    verl: Volcano Engine Reinforcement Learning for LLMs

    verl is a flexible, efficient, and production-ready RL training library for large language models, developed by ByteDance's Seed team. It implements the HybridFlow framework (EuroSys 2025) and powers models such as Doubao-1.5-pro, which reaches O1-level performance on math benchmarks.

    When to Use verl

    Choose verl when you need:

    • Production-ready RL training at scale (tested up to 671B parameters)
    • Flexibility to swap training backends (FSDP ↔ Megatron-LM) and rollout engines (vLLM ↔ SGLang)
    • Support for multiple RL algorithms (PPO, GRPO, RLOO, REINFORCE++, DAPO)
    • Multi-turn rollout with tool calling for agentic workflows
    • Vision-language model RL training

    Consider alternatives when:

    • You need Megatron-native training → use slime or miles
    • You want PyTorch-native abstractions with Monarch → use torchforge
    • You only need simple SFT/DPO → use TRL or Axolotl

    Key Features

    • Training backends: FSDP, FSDP2, Megatron-LM
    • Rollout engines: vLLM, SGLang, HuggingFace Transformers
    • Algorithms: PPO, GRPO, DAPO, RLOO, ReMax, REINFORCE++, SPIN, SPPO
    • Models: Qwen-3, Llama-3.1, DeepSeek, Gemma-2 (0.5B to 671B)
    • Advanced: LoRA RL, sequence parallelism, expert parallelism, multi-turn tools

    Installation

    # Option 1: pip install
    pip install "verl[vllm]"  # or "verl[sglang]" for the SGLang backend (quotes keep zsh from globbing the brackets)
    
    # Option 2: Docker (recommended for production)
    docker pull verlai/verl:vllm011.latest
    
    # Option 3: From source
    git clone https://github.com/volcengine/verl.git
    cd verl && pip install -e ".[vllm,math]"
    

    Quick Start: GRPO Training

    python3 -m verl.trainer.main_ppo \
        algorithm.adv_estimator=grpo \
        data.train_files=~/data/gsm8k/train.parquet \
        actor_rollout_ref.model.path=Qwen/Qwen2.5-7B \
        actor_rollout_ref.rollout.n=8 \
        actor_rollout_ref.actor.use_kl_loss=True \
        trainer.n_gpus_per_node=8
    

    Core Architecture

    verl uses a HybridFlow programming model separating control flow from computation:

    ┌─────────────────────────────────────────────────────────┐
    │ Single-Process Controller (Ray)                         │
    │ - Orchestrates: rollout → reward → train → sync         │
    └─────────────────────┬───────────────────────────────────┘
                          │
    ┌─────────────────────▼───────────────────────────────────┐
    │ Multi-Process Workers                                   │
    │ ├── ActorRolloutRefWorker (policy + generation)         │
    │ ├── CriticWorker (value estimation, PPO only)           │
    │ └── RewardManager (model-based or rule-based rewards)   │
    └─────────────────────────────────────────────────────────┘
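
    The division of labor in the diagram can be illustrated with a toy loop. This is a minimal plain-Python sketch, not verl's API: ToyWorker, toy_reward, and run_controller are hypothetical names, and the "training" step is only a counter. It shows one controller owning the control flow (rollout → reward → train → sync) while a worker object owns the computation.

```python
import random


class ToyWorker:
    """Hypothetical stand-in for a worker group (not a verl class)."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.policy_version = 0   # version of the weights being trained
        self.rollout_version = 0  # version of the weights used for generation

    def generate(self, prompts, n):
        # Rollout: produce n candidate responses per prompt.
        return [[f"{p} -> {self.rng.randint(0, 9)}" for _ in range(n)] for p in prompts]

    def update(self, responses, rewards):
        # Training: stand-in for a gradient step on the policy.
        self.policy_version += 1

    def sync_weights(self):
        # Sync: make the rollout engine use the freshly trained weights.
        self.rollout_version = self.policy_version


def toy_reward(response):
    # Rule-based reward: 1.0 if the sampled digit is even, else 0.0.
    return 1.0 if int(response[-1]) % 2 == 0 else 0.0


def run_controller(worker, prompts, steps, n=4):
    """Single-process control flow: rollout -> reward -> train -> sync."""
    history = []
    for _ in range(steps):
        responses = worker.generate(prompts, n)
        rewards = [[toy_reward(r) for r in group] for group in responses]
        worker.update(responses, rewards)
        worker.sync_weights()
        flat = [x for group in rewards for x in group]
        history.append(sum(flat) / len(flat))  # mean reward for this step
    return history


worker = ToyWorker()
history = run_controller(worker, ["q1", "q2"], steps=3)
```

    In the real system the worker calls are dispatched to distributed processes; the sketch keeps everything in one process so the ordering of the four phases is easy to see.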
    

    Workflow 1: Math Reasoning with GRPO

    Use this workflow for training reasoning models on math tasks like GSM8K or MATH.

    Prerequisites Checklist

    • GPU cluster with 8+ GPUs (H100 recommended)
    • Dataset in parquet format with prompt and reward_model columns
    • Base model from HuggingFace Hub

    Step 1: Prepare Dataset

    import pandas as pd
    
    data = [
        {
            "prompt": [{"role": "user", "content": "What is 15 + 27?"}],
            "reward_model": {"ground_truth": "42"}
        },
        # ... more examples
    ]
    df = pd.DataFrame(data)
    df.to_parquet("train.parquet")
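
    Before writing the parquet file, it can help to check that every row has the shape shown above. The following is a minimal sketch using only the standard library; validate_row is a hypothetical helper, and the rules it enforces are inferred from the example row rather than taken from verl's loader.

```python
def validate_row(row):
    """Return a list of problems found in one dataset row (empty if none)."""
    problems = []

    prompt = row.get("prompt")
    if not isinstance(prompt, list) or not prompt:
        problems.append("prompt must be a non-empty list of chat messages")
    else:
        for msg in prompt:
            # Each chat message is expected to carry a role and its content.
            if not isinstance(msg, dict) or not {"role", "content"} <= set(msg):
                problems.append("each message needs 'role' and 'content'")
                break

    reward_model = row.get("reward_model")
    if not isinstance(reward_model, dict) or "ground_truth" not in reward_model:
        problems.append("reward_model must be a dict with 'ground_truth'")

    return problems


good = {
    "prompt": [{"role": "user", "content": "What is 15 + 27?"}],
    "reward_model": {"ground_truth": "42"},
}
bad = {"prompt": "What is 15 + 27?", "reward_model": {}}

good_problems = validate_row(good)  # no problems
bad_problems = validate_row(bad)    # prompt is a string, ground_truth missing
```

    Running the check over the whole list before calling to_parquet surfaces malformed rows early, when they are cheap to fix.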
    

    Step 2: Define Reward Function

    # reward_function.py
    import re
    
    def compute_reward(responses, ground_truths):
        """Return 1.0 for each response whose \\boxed{...} answer equals the ground truth, else 0.0."""
        rewards = []
        for response, gt in zip(responses, ground_truths):
            # Extract the first \boxed{...} answer (braces escaped so they are clearly literal)
            match = re.search(r'\\boxed\{([^}]+)\}', response)
            if match and match.group(1).strip() == gt.strip():
                rewards.append(1.0)
            else:
                rewards.append(0.0)
        return rewards
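
    A quick local check of the reward function is worthwhile before launching a run. The sketch below repeats the function so it is self-contained and calls it on three hand-written responses; the sample strings are illustrative only.

```python
import re


def compute_reward(responses, ground_truths):
    """Exact-match reward on the first \\boxed{...} answer in each response."""
    rewards = []
    for response, gt in zip(responses, ground_truths):
        match = re.search(r'\\boxed\{([^}]+)\}', response)
        if match and match.group(1).strip() == gt.strip():
            rewards.append(1.0)
        else:
            rewards.append(0.0)
    return rewards


responses = [
    r"15 + 27 = 42, so the answer is \boxed{42}",  # correct and boxed
    r"The answer is \boxed{41}",                   # boxed but wrong
    "The answer is 42",                            # correct but not boxed
]
rewards = compute_reward(responses, ["42", "42", "42"])
print(rewards)  # [1.0, 0.0, 0.0]
```

    Note that the pattern stops at the first closing brace, so a nested answer such as \boxed{\frac{1}{2}} is captured only up to "\frac{1" and scores 0.0. Tasks with nested LaTeX answers need a brace-aware extractor.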
    

    Step 3: Create Training Config

    # config/grpo_math.yaml
    algorithm:
      adv_estimator: grpo
      gamma: 1.0
      lam: 1.0
    
    data:
      train_files: /path/to/train.parquet
      val_files: /path/to/val.parquet
      train_batch_size: 256
      max_prompt_length: 512
      max_response_length: 2048
    
    actor_rollout_ref:
      model:
        path: Qwen/Qwen2.5-7B-Instruct
      actor:
        use_kl_loss: true
        kl_loss_coef: 0.001
        ppo_mini_batch_size: 64
      rollout:
        name: vllm
        n: 8  # samples per prompt
        temperature: 0.7
        top_p: 0.95
    
    trainer:
      total_epochs: 3
      n_gpus_per_node: 8
      save_freq: 100
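
    The batch settings in this config interact, and a little arithmetic makes the per-step workload concrete. The sketch below assumes that ppo_mini_batch_size is counted in prompts (so it divides train_batch_size); that unit is an assumption to confirm against the verl documentation for the installed version.

```python
# Values copied from config/grpo_math.yaml above.
train_batch_size = 256      # prompts per training step
rollout_n = 8               # responses sampled per prompt
ppo_mini_batch_size = 64    # prompts per optimizer update (assumed unit)
max_prompt_length = 512     # tokens
max_response_length = 2048  # tokens

# Every prompt is expanded into rollout_n sampled responses.
responses_per_step = train_batch_size * rollout_n

# Each step's batch is split into mini-batches, one optimizer update each.
updates_per_step = train_batch_size // ppo_mini_batch_size

# Upper bound on sequence length and on tokens processed per step.
max_tokens_per_sequence = max_prompt_length + max_response_length
max_tokens_per_step = responses_per_step * max_tokens_per_sequence

print(responses_per_step, updates_per_step, max_tokens_per_step)
```

    Under these assumptions each step generates 2,048 responses, performs 4 policy updates, and handles at most about 5.2 million tokens, which is a useful starting point when sizing GPU memory.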
    

    Step 4: Launch Training

    python3 -m verl.trainer.main_ppo \
        --config-path config \
        --config-name grpo_math \
        trainer.experiment_name=grpo_math_qwen7b
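
    The nested YAML keys and the dotted command-line overrides used throughout this guide are two spellings of the same settings. A minimal sketch of the mapping follows; to_overrides is a hypothetical helper, not part of verl or Hydra, and it handles only nested dictionaries with scalar leaves.

```python
def to_overrides(config, prefix=""):
    """Flatten a nested dict into dotted key=value override strings."""
    overrides = []
    for key, value in config.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into sub-sections, extending the dotted path.
            overrides.extend(to_overrides(value, dotted))
        else:
            overrides.append(f"{dotted}={value}")
    return overrides


config = {
    "algorithm": {"adv_estimator": "grpo"},
    "actor_rollout_ref": {
        "rollout": {"n": 8},
        "actor": {"use_kl_loss": True},
    },
    "trainer": {"n_gpus_per_node": 8},
}

overrides = to_overrides(config)
for item in overrides:
    print(item)
```

    The printed lines match the style of the Quick Start command, for example actor_rollout_ref.rollout.n=8, so a value can be set in the YAML file and overridden per run on the command line.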
    

    Step 5: Monitor and Validate

    • Check WandB/TensorBoard for loss curves
    • Verify reward is increasing over steps
    • Run evaluation on held-out test set

    Workflow 2: PPO with Critic Model

    Use this workflow when you need value-based advantage estimation (GAE).

    Key Differences from GRPO

    • Requires separate critic model
    • Uses Generalized Advantage Estimation (GAE)
    • Better for tasks with dense rewards

    Configuration

    algorithm:
      adv_estimator: gae  # Use GAE instead of GRPO
      gamma: 0.99
      lam: 0.95
    
    critic:
      model:
        path: Qwen/Qwen2.5-7B-Instruct  # Can be same or different from actor
      ppo_mini_batch_size: 64
    
    actor_rollout_ref:
      actor:
        use_kl_loss: true
        kl_loss_coef: 0.02
        clip_ratio: 0.2  # PPO clipping
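
    The gamma and lam values above are the two knobs of Generalized Advantage Estimation. The following is a minimal pure-Python sketch of the standard GAE recursion for a single trajectory, not verl's implementation; compute_gae is a hypothetical name, and values is assumed to contain one extra bootstrap entry at the end.

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Standard GAE for one trajectory.

    rewards: per-step rewards, length T.
    values:  critic estimates, length T + 1 (last entry is the bootstrap value,
             0.0 for a finished episode).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    # Regression targets for the critic.
    returns = [a + v for a, v in zip(advantages, values[:-1])]
    return advantages, returns


# Sparse reward at the final step, untrained critic predicting 0 everywhere.
advantages, returns = compute_gae([0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0])
```

    With gamma = lam = 1.0, as in the GRPO config earlier, the same recursion reduces to the undiscounted sum of future rewards minus the value estimate.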
    

    Launch with Critic

    python3 -m verl.trainer.main_ppo \
        algorithm.adv_estimator=gae \
        critic.model.path=Qwen/Qwen2.5-7B-Instruct \
        trainer.n_gpus_per_node=8
    

    Workflow 3: Large-Scale Training with Megatron

    Use this workflow for models >70B parameters or when you need expert parallelism.

    Prerequisites

    • Install Megatron-LM bridge: pip install mbridge
    • Convert model to Megatron format
    • Multi-node cluster with NVLink/InfiniBand

    Configuration for 70B+ Models

    actor_rollout_ref:
      model:
        path: /path/to/megatron/checkpoint
        backend: megatron
      actor:
        strategy: megatron
        tensor_model_parallel_size: 8
        pipeline_model_parallel_size: 2
      rollout:
        name: vllm
        tensor_parallel_size: 8
    

    Launch Multi-Node

    # On head node
    ray start --head --port=6379
    
    # On worker nodes
    ray start --address='head_ip:6379'
    
    # Launch training
    python3 -m verl.trainer.main_ppo \
        trainer.nnodes=4 \
        trainer.n_gpus_per_node=8
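
    The parallelism sizes in the configuration and the node counts in the launch command have to be consistent. The sketch below works through the arithmetic for tensor and pipeline parallelism only; parallel_layout is a hypothetical helper, and it deliberately ignores other dimensions such as expert or context parallelism.

```python
def parallel_layout(nnodes, gpus_per_node, tensor_parallel, pipeline_parallel):
    """Derive the data-parallel size from the cluster size and model-parallel sizes."""
    world_size = nnodes * gpus_per_node
    gpus_per_replica = tensor_parallel * pipeline_parallel
    if world_size % gpus_per_replica != 0:
        raise ValueError(
            f"{world_size} GPUs cannot be split into replicas of {gpus_per_replica}"
        )
    return {
        "world_size": world_size,
        "gpus_per_replica": gpus_per_replica,
        "data_parallel_size": world_size // gpus_per_replica,
    }


# 4 nodes x 8 GPUs with tensor parallel 8 and pipeline parallel 2, as above.
layout = parallel_layout(nnodes=4, gpus_per_node=8, tensor_parallel=8, pipeline_parallel=2)
print(layout)
```

    Here one model replica spans 16 GPUs, so the 32-GPU cluster holds 2 data-parallel replicas. Keeping tensor parallelism at 8 keeps that dimension inside a single 8-GPU node, where interconnect bandwidth is highest.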
    

    Configuration Reference

    Algorithm Selection

    Algorithm     adv_estimator         Use Case
    GRPO          grpo                  Critic-free, math/reasoning
    PPO/GAE       gae                   Dense rewards, value estimation
    REINFORCE++   reinforce_plus_plus   Variance reduction
    RLOO          rloo                  Leave-one-out baseline
    ReMax         remax                 Greedy-decoding reward baseline
    OPO           opo                   On-policy RL with optimal reward baseline
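
    The critic-free estimators differ mainly in the baseline subtracted from each reward. The sketch below contrasts two of them for one prompt's group of sampled responses, using the textbook formulas rather than verl's code; the function names are hypothetical, and the use of the population standard deviation is an assumption (implementations differ on this detail).

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """GRPO-style: normalize each reward by the group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def rloo_advantages(rewards):
    """RLOO-style: baseline is the mean reward of the *other* samples."""
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


# Eight samples for one prompt (rollout.n = 8), two of them correct.
group = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
grpo = grpo_advantages(group)
rloo = rloo_advantages(group)
```

    Both estimators give positive advantages to the two correct samples and negative ones to the rest. If every sample in a group receives the same reward, both return all zeros and the prompt contributes no gradient.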

    Key Parameters

    # Rollout parameters
    actor_rollout_ref.rollout.n: 8              # Samples per prompt
    actor_rollout_ref.rollout.temperature: 0.7  # Sampling temperature
    actor_rollout_ref.rollout.top_p: 0.95       # Nucleus sampling
    
    # Training parameters
    actor_rollout_ref.actor.lr: 1e-6            # Learning rate
    actor_rollout_ref.actor.ppo_mini_batch_size: 64
    actor_rollout_ref.actor.clip_ratio: 0.2     # PPO clip range
    
    # KL control
    actor_rollout_ref.actor.use_kl_loss: true
    actor_rollout_ref.actor.kl_loss_coef: 0.001
    algorithm.kl_ctrl.target_kl: 0.1            # For adaptive KL control
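
    To make the role of clip_ratio concrete, here is a minimal sketch of the standard PPO clipped surrogate loss for a single token. It follows the textbook formula, not verl's source; clipped_policy_loss is a hypothetical name, and the log-probabilities are illustrative inputs.

```python
import math


def clipped_policy_loss(logp_new, logp_old, advantage, clip_ratio=0.2):
    """PPO clipped surrogate loss for one token (lower is better)."""
    # Probability ratio between the updated policy and the rollout policy.
    ratio = math.exp(logp_new - logp_old)
    # Restrict the ratio to [1 - clip_ratio, 1 + clip_ratio].
    clipped = max(min(ratio, 1.0 + clip_ratio), 1.0 - clip_ratio)
    # Take the more pessimistic of the two objectives, then negate for a loss.
    return -min(ratio * advantage, clipped * advantage)


logp_old = math.log(0.2)
logp_new = math.log(0.3)  # ratio = 1.5, outside the clip range

loss_positive = clipped_policy_loss(logp_new, logp_old, advantage=1.0)
loss_negative = clipped_policy_loss(logp_new, logp_old, advantage=-1.0)
```

    With a positive advantage the gain is capped at a ratio of 1.2, so the policy is not rewarded for moving further in one update. With a negative advantage the unclipped term is kept, so a harmful move is penalized in full.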
    

    Common Issues and Solutions

    Issue: OOM During Rollout

    Symptoms: CUDA out of memory during generation phase

    Solutions:

    # Reduce the log-prob micro-batch size
    actor_rollout_ref.rollout.log_prob_micro_batch_size: 4
    
    # Enable gradient checkpointing (trades recomputation for lower activation memory in the training phase)
    actor_rollout_ref.model.enable_gradient_checkpointing: true
    
    # Use FSDP2 with CPU offloading
    actor_rollout_ref.actor.strategy: fsdp2
    actor_rollout_ref.actor.fsdp_config.offload_policy: true
    

    Issue: Training Instability

    Symptoms: Loss spikes, reward collapse

    Solutions:

    # Reduce learning rate
    actor_rollout_ref.actor.lr: 5e-7
    
    # Increase KL penalty
    actor_rollout_ref.actor.kl_loss_coef: 0.01
    
    # Enable gradient clipping
    actor_rollout_ref.actor.max_grad_norm: 1.0
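
    Gradient clipping limits how far a single bad batch can move the weights. The sketch below shows clipping by global norm on a flat list of gradient values in plain Python; clip_by_global_norm is a hypothetical helper that mirrors the usual formula, and real training code applies the same scaling to tensors.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale gradients so their global L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Scale is 1.0 when the norm is already within bounds.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


# A spike: norm 5.0 is scaled down to roughly 1.0, direction unchanged.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)

# A normal step: norm 0.5 is left untouched.
untouched, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
```

    Logging the pre-clip norm alongside the loss makes it easier to tell whether instability comes from occasional spikes or from a learning rate that is too high overall.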
    

    Issue: Slow Weight Sync

    Symptoms: Long pauses between rollout and training

    Solutions:

    # Use FSDP2 for faster resharding
    actor_rollout_ref.actor.strategy: fsdp2
    
    # Enable async weight transfer
    trainer.async_weight_update: true
    

    Issue: vLLM Version Mismatch

    Symptoms: Import errors or generation failures

    Solution: Use compatible versions:

    pip install "vllm>=0.8.5,<=0.12.0"  # quotes stop the shell from treating > and < as redirections
    # Avoid vLLM 0.7.x (known bugs)
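
    The version range above can also be checked programmatically before a job starts. This is a simplified sketch that compares only the first three numeric components; parse_version and vllm_version_supported are hypothetical helpers, and a full comparison would use a proper version parser such as the packaging library.

```python
def parse_version(text):
    """Parse the leading 'major.minor.patch' numbers of a version string."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break  # stop at suffixes such as 'rc1'
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def vllm_version_supported(version, low="0.8.5", high="0.12.0"):
    """True if version lies in the inclusive range [low, high]."""
    return parse_version(low) <= parse_version(version) <= parse_version(high)


checks = {v: vllm_version_supported(v) for v in ["0.7.3", "0.8.5", "0.11.0", "0.12.1"]}
print(checks)
```

    The installed version string can be obtained with importlib.metadata.version("vllm") from the standard library and passed to vllm_version_supported.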
    

    Advanced Topics

    Multi-Turn Tool Calling

    See references/multi-turn.md for agentic workflows with tool use.

    Vision-Language Models

    actor_rollout_ref:
      model:
        path: Qwen/Qwen2.5-VL-7B-Instruct
      rollout:
        name: vllm
        enable_vision: true
    

    LoRA Training

    actor_rollout_ref:
      actor:
        lora:
          enabled: true
          r: 16
          alpha: 32
          target_modules: ["q_proj", "v_proj"]
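
    The r and alpha settings determine both the number of trainable parameters and the scale of the adapter update. The arithmetic below is a sketch under stated assumptions: the hidden size of 4096 and the square projection shapes are hypothetical, and models with grouped-query attention have a smaller v_proj output dimension.

```python
def lora_added_params(d_in, d_out, r):
    """Trainable parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


r, alpha = 16, 32
hidden = 4096  # hypothetical hidden size, for illustration only

# The adapter output is multiplied by alpha / r before being added to the base output.
scaling = alpha / r

per_module = lora_added_params(hidden, hidden, r)
per_layer = per_module * 2  # q_proj and v_proj, assuming both are hidden x hidden
full_module = hidden * hidden

print(scaling, per_module, per_layer, per_module / full_module)
```

    Under these assumptions each adapted projection trains 131,072 parameters instead of roughly 16.8 million, less than 1% of the full matrix, which is where the memory savings of LoRA RL come from.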
    

    Resources

    • Documentation: https://verl.readthedocs.io/
    • Paper: https://arxiv.org/abs/2409.19256
    • GitHub: https://github.com/volcengine/verl
    • Recipes: https://github.com/verl-project/verl-recipe (DAPO, GSPO, etc.)
    • Community: Slack at verl-project
    Repository
    davila7/claude-code-templates