MRLM is a comprehensive, production-ready library for training Large Language Models using multi-agent reinforcement learning. Built with a clean server-client architecture, full distributed training support, and state-of-the-art RL algorithms.
- 4 RL Algorithms: PPO, GRPO, DPO, and SFT for diverse training scenarios
- 4 Task Environments: Code execution, math reasoning, multi-agent debate, and tool use
- Distributed Training: Full FSDP and DDP support for multi-GPU/multi-node training
- Server-Client Architecture: gRPC-based distributed environment hosting
- YAML Configuration: Declarative configuration system for reproducible experiments
- Professional CLI: Complete command-line interface for training, serving, and evaluation
- Production Ready: Type hints, comprehensive docs, and extensive examples
```bash
# From PyPI (coming soon)
pip install mrlm

# From source
git clone https://github.com/open-tinker/MRLM.git
cd MRLM
pip install -e .
```

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from mrlm.core import LLMEnvironment, EnvironmentMode
from mrlm.environments.math import MathReasoningEnvironment, MathProblemGenerator
from mrlm.algorithms.ppo import PPOTrainer
from mrlm.config import ExperimentConfig, TrainingConfig, PPOConfig
# Load model
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
# Create environments
policy_env = LLMEnvironment(model, tokenizer, mode=EnvironmentMode.SERVER)
eval_envs = [MathReasoningEnvironment(MathProblemGenerator()) for _ in range(4)]
# Configure and train
config = ExperimentConfig(
training=TrainingConfig(algorithm="ppo", num_epochs=50),
ppo=PPOConfig(clip_range=0.2, gamma=0.99),
)
trainer = PPOTrainer(policy_env, eval_envs, config)
trainer.train(num_iterations=50)
```

```bash
# Train from config
mrlm train --config configs/code_generation_ppo.yaml
# Start environment server
mrlm serve --environments code,math,debate --port 50051
# Evaluate trained model
mrlm eval --model outputs/ppo/final --environment math --num-episodes 20
# Collect trajectories for SFT
mrlm collect --model Qwen/Qwen2.5-1.5B --environment code -n 100 -o data/trajectories.json
# Show system info
mrlm info
```

MRLM implements 4 state-of-the-art algorithms for LLM training:
| Algorithm | Type | Best For | Key Features |
|---|---|---|---|
| PPO | On-policy RL | General RL training | Clipped surrogate, GAE, stable updates |
| GRPO | Group-based RL | Variance reduction | Group-normalized rewards, efficient sampling |
| DPO | Offline preferences | Human alignment | No reward model, direct preference optimization |
| SFT | Supervised | Pre-training, world model | Behavioral cloning, next-state prediction |
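For orientation, the update rules behind the first two algorithms look roughly like the following standard formulations in PyTorch. This is a textbook-style sketch, not MRLM's internal implementation, and the tensor shapes are illustrative.

```python
import torch

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_range=0.2):
    # PPO clipped surrogate: bound how far the policy ratio can move per update.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    return -torch.min(unclipped, clipped).mean()

def grpo_advantages(group_rewards, eps=1e-8):
    # GRPO-style advantage: normalize rewards across a group of completions
    # sampled for the same prompt, avoiding a learned value baseline.
    mean = group_rewards.mean(dim=-1, keepdim=True)
    std = group_rewards.std(dim=-1, keepdim=True)
    return (group_rewards - mean) / (std + eps)
```

In MRLM itself, each algorithm is exposed through a trainer class with the same constructor pattern: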
```python
# PPO - Most versatile, general-purpose
ppo_trainer = PPOTrainer(policy_env, eval_envs, config)
# GRPO - Better for high-variance tasks
grpo_trainer = GRPOTrainer(policy_env, eval_envs, config)
# DPO - Train on preference pairs
dpo_trainer = DPOTrainer(policy_env, preference_dataset, config)
# SFT - Pre-train on trajectories
sft_trainer = SFTTrainer(policy_env, eval_envs, config)
```

All algorithms support:
- ✅ Distributed training (FSDP/DDP)
- ✅ Mixed precision (fp16/bf16)
- ✅ Gradient accumulation
- ✅ YAML configuration
- ✅ Checkpointing and resumption
Train LLMs to write and debug code with test-based rewards.
```python
from mrlm.environments.code import CodeExecutionEnvironment, CodeProblemGenerator

generator = CodeProblemGenerator()
env = CodeExecutionEnvironment(generator)
```

Features:
- Python code execution with sandboxing
- Test case validation
- Syntax and functionality scoring
- Support for HumanEval-style problems
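To make the reward loop concrete, here is a minimal hand-rolled interaction sketch. It assumes the built-in environments follow the `reset()`/`step()` interface shown in the custom-environments section below, and that `Message` takes `role`/`content` fields; those field names are guesses, so check `mrlm.core.types` for the exact signature.

```python
from mrlm.core.types import Message
from mrlm.environments.code import CodeExecutionEnvironment, CodeProblemGenerator

env = CodeExecutionEnvironment(CodeProblemGenerator())

# Reset to get the coding task, then submit a candidate solution.
observation = env.reset()
print(observation.messages[-1])  # the problem statement

candidate = Message(role="assistant", content="def add(a, b):\n    return a + b")  # assumed fields
observation, reward = env.step(candidate)
print(reward)  # test-based score for the submitted code
```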
Solve mathematical problems with automatic answer verification.
```python
from mrlm.environments.math import MathReasoningEnvironment, MathProblemGenerator

generator = MathProblemGenerator(difficulty_range=(1, 3))
env = MathReasoningEnvironment(generator)
```

Problem Types:
- Arithmetic (addition, subtraction, multiplication, division)
- Algebra (equations, inequalities)
- Word problems
- Multi-step reasoning
Train agents to engage in structured debates.
```python
from mrlm.environments.debate import DebateEnvironment, RuleBasedJudge

judge = RuleBasedJudge()
env = DebateEnvironment(judge=judge)
```

Features:
- PRO/CON position assignment
- Argument quality evaluation
- Evidence and reasoning scoring
- Consistency metrics
Learn to use external tools for complex tasks.
```python
from mrlm.environments.tools import ToolUseEnvironment
from mrlm.environments.tools.builtin_tools import create_default_tool_registry

registry = create_default_tool_registry()  # Calculator, web search, Python REPL, filesystem
env = ToolUseEnvironment(registry)
```

Built-in Tools:
- Calculator (math operations)
- Web Search (knowledge retrieval)
- Python REPL (code execution)
- File System (read/write)
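The registry can presumably be extended with project-specific tools. The sketch below is hypothetical: this README does not show the registry or tool interface, so the `register()` call and its `name`/`description`/`fn` parameters are assumptions meant to illustrate the idea rather than MRLM's actual API.

```python
from mrlm.environments.tools import ToolUseEnvironment
from mrlm.environments.tools.builtin_tools import create_default_tool_registry

registry = create_default_tool_registry()

def unit_convert(value: float, from_unit: str, to_unit: str) -> float:
    """Toy domain-specific tool: convert between a few length units."""
    to_meters = {"m": 1.0, "km": 1000.0, "mi": 1609.34}
    return value * to_meters[from_unit] / to_meters[to_unit]

# Hypothetical registration call; the real registry API may differ.
registry.register(name="unit_convert", description="Convert length units", fn=unit_convert)

env = ToolUseEnvironment(registry)
```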
Create custom environments by extending SimulatedEnvironment:
```python
from mrlm.core import SimulatedEnvironment
from mrlm.core.types import Message, Observation, Reward

class MyEnvironment(SimulatedEnvironment):
    def reset(self) -> Observation:
        # Initialize task
        return Observation(messages=[Message(...)])

    def step(self, action: Message) -> tuple[Observation, Reward]:
        # Process action, compute reward
        return next_observation, reward
```
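A custom environment defined this way should slot in wherever the built-in ones are used, for example as an evaluation environment in the quickstart's PPO setup. A sketch, reusing the objects created earlier:

```python
# Reusing `policy_env` and `config` from the quickstart above.
eval_envs = [MyEnvironment() for _ in range(4)]
trainer = PPOTrainer(policy_env, eval_envs, config)
trainer.train(num_iterations=10)
```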
MRLM uses a clean server-client architecture for distributed training:

```
┌─────────────────┐     gRPC      ┌─────────────────┐
│  Policy Model   │ ◄───────────► │  Environment 1  │
│   (Training)    │               │    (Client)     │
│   SERVER Mode   │     gRPC      ├─────────────────┤
│                 │ ◄───────────► │  Environment 2  │
└─────────────────┘               │    (Client)     │
                                  └─────────────────┘
```
Benefits:
- Scale environments independently
- Distribute compute across machines
- Hot-swap environments without restarting training
- Clear separation of concerns
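In practice this means the training process can point its evaluation environments at a remote `mrlm serve` instance. The YAML sketch below reuses the `env_type`/`mode: client` keys from the configuration section later in this README; the `host`/`port` keys are assumptions, since the README does not spell out how a client locates the server.

```yaml
# On the environment machine:
#   mrlm serve --environments code,math --port 50051
eval_envs:
  - env_type: code
    mode: client
    host: env-box.internal   # assumed key
    port: 50051              # assumed key
```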
Full support for large-scale training:
```bash
# Single node, 4 GPUs with FSDP
torchrun --nproc_per_node=4 train.py --strategy fsdp

# Multi-node, 2 nodes × 4 GPUs
# Node 0:
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=0 \
    --master_addr=<node0_ip> --master_port=12355 train.py

# Node 1:
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=1 \
    --master_addr=<node0_ip> --master_port=12355 train.py
```

Strategies:
- DDP (Distributed Data Parallel): Best for models < 10B parameters
- FSDP (Fully Sharded Data Parallel): For very large models (10B+)
- Mixed Precision: fp16/bf16 for up to ~2× speedup
- Gradient Accumulation: Simulate larger batches
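For background, a `torchrun`-launched `train.py` typically performs the standard `torch.distributed` setup before wrapping the model in DDP or FSDP. MRLM's trainers presumably handle this internally, so the sketch below is generic PyTorch rather than MRLM-specific code.

```python
import os
import torch
import torch.distributed as dist

def setup_distributed() -> int:
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for each spawned process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return local_rank
```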
Use YAML configs for reproducible experiments:
```yaml
# configs/my_experiment.yaml
experiment_name: code_generation_ppo

training:
  algorithm: ppo
  num_epochs: 100
  batch_size: 16
  learning_rate: 5.0e-6

ppo:
  clip_range: 0.2
  gamma: 0.99
  gae_lambda: 0.95

model:
  model_name_or_path: "Qwen/Qwen2.5-1.5B-Instruct"
  torch_dtype: "float16"

eval_envs:
  - env_type: code
    mode: client
    max_turns: 3

distributed:
  enabled: true
  strategy: "fsdp"
```

Train with:
```bash
mrlm train --config configs/my_experiment.yaml
```
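The same settings can also be assembled programmatically. The sketch below parses the YAML by hand and feeds it into the config dataclasses used in the quickstart; MRLM may ship its own loader for this, so treat the manual construction as an assumption.

```python
import yaml
from mrlm.config import ExperimentConfig, TrainingConfig, PPOConfig

with open("configs/my_experiment.yaml") as f:
    raw = yaml.safe_load(f)

# Assumes the config classes accept the same keys as the YAML sections.
config = ExperimentConfig(
    training=TrainingConfig(**raw["training"]),
    ppo=PPOConfig(**raw["ppo"]),
)
```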
The examples/ directory contains 11+ comprehensive examples:

- `quickstart/simple_ppo.py` - Minimal PPO example
- `train_code_ppo.py` - Code generation with PPO
- `train_math_ppo.py` - Math reasoning with PPO
- `train_grpo_math.py` - GRPO training
- `train_dpo_preferences.py` - DPO from preferences
- `train_sft_world_model.py` - SFT for world model
- `train_tool_use_ppo.py` - Tool use training
- `train_distributed_ppo.py` - Multi-GPU distributed training
- `train_multi_environment.py` - Train on code, math, and tools simultaneously
- `train_hybrid_sft_ppo.py` - Hybrid pipeline (SFT pre-train → PPO fine-tune)
- `train_from_config.py` - Config-based training
See `examples/README.md` for a detailed guide.
- `mrlm train --config CONFIG [--output DIR] [--resume CHECKPOINT]` - Train a model from a YAML configuration. Supports all algorithms (PPO, GRPO, DPO, SFT).
- `mrlm serve --environments ENV1,ENV2,... [--port PORT] [--host HOST]` - Start a gRPC server hosting multiple environments for distributed training.
- `mrlm eval --model MODEL --environment ENV [--num-episodes N] [--output FILE]` - Evaluate a trained model on an environment and save the results.
- `mrlm collect --model MODEL --environment ENV --num-episodes N --output FILE [--filter-reward THRESHOLD]` - Collect trajectory data for SFT training.
- `mrlm info` - Display system information, available algorithms, and environments.
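Putting these together, a typical workflow might look like the following; the paths, config name, and episode counts are placeholders.

```bash
# 1. Collect demonstrations from a base model for SFT bootstrapping
mrlm collect --model Qwen/Qwen2.5-1.5B --environment math -n 200 -o data/math_trajectories.json

# 2. Train from a YAML config
mrlm train --config configs/my_experiment.yaml --output outputs/my_run

# 3. Evaluate the resulting checkpoint
mrlm eval --model outputs/my_run --environment math --num-episodes 20 --output results/math_eval.json
```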
For ready-made starting points, these example scripts cover the most common scenarios:

- `examples/quickstart/simple_ppo.py` - A minimal, self-contained example showing PPO training
- `examples/train_multi_environment.py` - Train a generalist model on multiple task types
- `examples/train_hybrid_sft_ppo.py` - Best-practice two-stage training: SFT → PPO
- `examples/train_distributed_ppo.py` - Scale to multiple GPUs with FSDP or DDP

Comprehensive documentation is available:
- Architecture Guide - System design and components
- Examples README - Complete guide to all examples
- Installation Guide - Detailed installation instructions
- Contributing Guide - Development guidelines
- API Reference - Auto-generated API documentation
```bash
# Development setup
git clone https://github.com/open-tinker/MRLM.git
cd MRLM
pip install -e ".[dev]"
pre-commit install

# Testing
pytest                   # Run all tests
pytest --cov=mrlm        # With coverage
pytest -m "not slow"     # Skip slow tests

# Code quality
black src/ tests/        # Format code
ruff check src/ tests/   # Lint
mypy src/                # Type checking
```

Typical use cases for MRLM include:

- Training models to write, debug, and explain code with test-based rewards
- Improving mathematical problem-solving with structured reasoning rewards
- Teaching models to use external tools and plan multi-step solutions
- Training conversational agents through multi-agent debate and evaluation
- Bootstrapping RL training with supervised fine-tuning on demonstrations
- Aligning models with human preferences using DPO (no reward model needed)
Compared to other RL libraries for LLMs:
| Feature | MRLM | TRL | VeRL | Others |
|---|---|---|---|---|
| Algorithms | PPO, GRPO, DPO, SFT | PPO, DPO | PPO, GRPO | Varies |
| Distributed | FSDP, DDP, Multi-node | Limited | FSDP, DDP | Varies |
| Server-Client | ✅ gRPC | ❌ | ❌ | ❌ |
| Environments | 4 built-in + custom | Limited | Custom only | Varies |
| CLI Tool | ✅ Full-featured | ❌ | ❌ | ❌ |
| YAML Config | ✅ | ❌ | ✅ | ❌ |
| Production Ready | ✅ | Partial | ✅ | Varies |
MRLM is unique in providing:
- Complete environment suite out-of-the-box
- Professional CLI for production workflows
- Full server-client architecture for distributed environments
- 4 algorithms in one unified framework
- Production-ready with extensive docs and examples
We welcome contributions! See CONTRIBUTING.md for guidelines.
Areas for contribution:
- New environments (text summarization, translation, etc.)
- Additional RL algorithms (SAC, TD3, etc.)
- Performance optimizations
- Documentation improvements
- Bug reports and feature requests
Performance on standard benchmarks (coming soon):
| Task | Algorithm | Score | Training Time |
|---|---|---|---|
| HumanEval | PPO | TBD | TBD |
| GSM8K | GRPO | TBD | TBD |
| MT-Bench | DPO | TBD | TBD |
If you use MRLM in your research, please cite:
```bibtex
@software{mrlm2024,
  title={MRLM: Multi-Agent Reinforcement Learning for LLMs},
  author={MRLM Contributors},
  year={2024},
  url={https://github.com/open-tinker/MRLM},
  version={0.1.0}
}
```

Apache License 2.0 - see LICENSE for details.
MRLM is inspired by and builds upon ideas from:
- VeRL - Flexible RL framework for LLMs
- TRL - Transformer Reinforcement Learning
- OpenAI Gym - RL environment interface
- Ray RLlib - Distributed RL
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: mrlm-dev@example.com
- [x] Core RL algorithms (PPO, GRPO, DPO, SFT)
- [x] Built-in environments (Code, Math, Debate, Tools)
- [x] Distributed training (FSDP, DDP)
- [x] CLI tool and YAML configs
- [x] Comprehensive test suite
- [ ] PyPI release
- [ ] Benchmark results
- [ ] Web UI for monitoring
- [ ] More environments (summarization, translation, etc.)
- [ ] More algorithms (SAC, A2C, etc.)
- [ ] Integration with popular LLM frameworks
Made with ❤️ by the MRLM team
⭐ Star us on GitHub if you find MRLM useful!