
[feat] Add RLHF rollout integration support (verl)#549

Open
sijyang wants to merge 17 commits into main from sijyang/verl_dev

Conversation

@sijyang sijyang commented Apr 13, 2026

Overview

This PR enables ATOM to serve as a rollout backend for verl, a distributed RLHF training framework. In RLHF training, the system alternates between two phases:

  1. Training phase — the trainer updates model weights via gradient descent (handled by verl)
  2. Rollout phase — the inference engine generates responses using the latest weights (handled by ATOM)

This requires ATOM to support a lifecycle that traditional serving doesn't need: receiving weight updates from an external trainer, dynamically releasing and reclaiming GPU memory between phases, and coordinating these operations across multiple DP and TP ranks.

The integration is designed as a plugin layer (atom/rollout/) that extends ATOM's existing engine without modifying its core inference path. All changes to existing files are purely incremental additions (new methods, new fields, new message types); no existing behavior is altered.

Architecture

```
verl trainer (PyTorch DDP)
        │
ATOMHttpServer (verl side, per-node)
        │ ZMQ RPC
AsyncLLMEngine (atom/rollout/async_engine.py)
 ├── sleep()        → release KV cache, free GPU memory for training
 ├── wake_up()      → reallocate KV cache, ready for generation
 ├── load_weights() → receive updated weights via CUDA IPC
 └── generate()     → standard ATOM inference with logprobs
```
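To make the phase alternation concrete, here is a minimal sketch of one RLHF iteration from the trainer's point of view. The method names (`sleep`, `wake_up`, `load_weights`, `generate`) follow the API listed above, but the engine object and `rlhf_step` helper are stand-ins for illustration, not ATOM's real `AsyncLLMEngine`.

```python
class FakeRolloutEngine:
    """Minimal stand-in that records the lifecycle calls it receives."""
    def __init__(self):
        self.calls = []
    def wake_up(self):
        self.calls.append("wake_up")
    def load_weights(self, weights):
        self.calls.append("load_weights")
    def generate(self, prompts):
        self.calls.append("generate")
        return [p + " <response>" for p in prompts]
    def sleep(self):
        self.calls.append("sleep")

def rlhf_step(engine, new_weights, prompts):
    # Rollout phase: reclaim GPU memory, sync weights, generate responses.
    engine.wake_up()
    engine.load_weights(new_weights)
    rollouts = engine.generate(prompts)
    # Release GPU memory back to the trainer before the next training phase.
    engine.sleep()
    return rollouts

engine = FakeRolloutEngine()
out = rlhf_step(engine, {"w": 0}, ["hello"])
```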

Weight Synchronization

Weight transfer uses CUDA IPC (weight_sync.py → weight_updater.py) for zero-copy GPU-to-GPU transfer via cudaIpcGetMemHandle/cudaIpcOpenMemHandle. Weights are packed into a GPU buffer, and IPC handles are sent to ModelRunner subprocesses. On multi-GPU setups (DP>1), per-GPU buffers ensure same-device IPC.

Weights are accumulated into fixed-size buckets and flushed incrementally, keeping peak memory overhead bounded regardless of model size.
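The bucketing logic above can be sketched as follows. This is a simplified model, not the real `weight_sync.py`: tensors are represented as `(name, nbytes)` pairs and `yield` stands in for the copy-to-GPU-buffer-and-send-IPC-handle step.

```python
BUCKET_BYTES = 64  # tiny for illustration; real buckets would be MBs

def send_in_buckets(params, bucket_bytes=BUCKET_BYTES):
    """Group parameters into buckets no larger than bucket_bytes and yield
    one bucket at a time, so peak staging memory stays bounded regardless
    of total model size."""
    bucket, used = [], 0
    for name, nbytes in params:
        if bucket and used + nbytes > bucket_bytes:
            yield bucket  # flush: pack into GPU buffer, send IPC handle
            bucket, used = [], 0
        bucket.append(name)
        used += nbytes
    if bucket:
        yield bucket  # flush the final, partially filled bucket

params = [("a", 40), ("b", 40), ("c", 10), ("d", 60)]
buckets = list(send_in_buckets(params))
# a(40)+b(40) > 64 → flush [a]; b+c = 50 fits, +d(60) > 64 → flush [b, c]; then [d]
```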

Weight Update Pipeline

weight_updater.py handles the ModelRunner side of weight loading:

  • Maps incoming parameter names to ATOM's internal weight names (handling TP sharding, column/row parallel splits)
  • Supports packed weights (e.g., QKV fused) by slicing incoming tensors to correct offsets
  • Handles FP8 requantization — when the model uses FP8, incoming FP16/BF16 weights are quantized in-place with updated scales
  • Clears KV cache after weight update to prevent stale cache from previous weights
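The packed-weights case can be illustrated with a small sketch. The offsets and function names below are hypothetical (plain lists stand in for tensors); the real `weight_updater.py` additionally handles TP sharding and quantization.

```python
def qkv_offsets(q_dim, k_dim, v_dim):
    """Row offsets of q/k/v inside a fused [q|k|v] weight."""
    return {"q_proj": (0, q_dim),
            "k_proj": (q_dim, q_dim + k_dim),
            "v_proj": (q_dim + k_dim, q_dim + k_dim + v_dim)}

def load_into_fused(fused, incoming_name, incoming, offsets):
    """Copy an incoming q/k/v weight into its slice of the fused buffer."""
    start, end = offsets[incoming_name]
    assert end - start == len(incoming), "shape mismatch after name mapping"
    fused[start:end] = incoming

offsets = qkv_offsets(4, 2, 2)
fused = [0.0] * 8
load_into_fused(fused, "k_proj", [1.0, 1.0], offsets)
# k_proj lands at rows 4..6 of the fused buffer
```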

GPU Memory Lifecycle

memory_manager.py manages the sleep/wake cycle:

  • Sleep: deallocate KV cache blocks → torch.cuda.empty_cache() → memory returned to PyTorch/ROCm for trainer
  • Wake: empty_cache() → recalculate available blocks → reallocate KV cache → ready for inference
  • Each DP rank manages its own KV cache independently
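The sleep/wake bookkeeping can be sketched as a tiny state machine (names hypothetical). Real code frees and reallocates KV-cache tensors and calls `torch.cuda.empty_cache()`; here block counts stand in for GPU memory.

```python
class KvCacheLifecycle:
    """Per-DP-rank sleep/wake bookkeeping, with block counts as a proxy
    for GPU memory."""
    def __init__(self, total_blocks):
        self.total_blocks = total_blocks
        self.allocated = 0
        self.asleep = True

    def wake(self, reserved_for_trainer=0):
        # empty_cache() → recompute what fits → reallocate KV cache
        self.allocated = self.total_blocks - reserved_for_trainer
        self.asleep = False

    def sleep(self):
        # deallocate KV blocks → empty_cache() → memory back to the trainer
        self.allocated = 0
        self.asleep = True

cache = KvCacheLifecycle(total_blocks=100)
cache.wake(reserved_for_trainer=20)  # 80 blocks available for inference
cache.sleep()                        # all memory returned to the trainer
```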

DP Isolation

model_runner_ext.py (RLHFModelRunner) extends ATOM's ModelRunner for DP-isolated execution. Each DP rank's ModelRunners form an independent NCCL world scoped to TP only, with correct physical-to-logical device mapping and NCCL binding patches for ROCm multi-GPU setups.
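The rank grouping implied by "an independent NCCL world scoped to TP only" can be sketched as below. This assumes a DP-major rank layout (`rank = dp * tp_size + tp`), which is an illustrative convention, not necessarily ATOM's actual mapping.

```python
def tp_groups(dp_size, tp_size):
    """Return one global-rank list per DP replica; each list would back
    an independent NCCL communicator scoped to TP only, so collectives
    in one DP replica never block another."""
    return [[dp * tp_size + tp for tp in range(tp_size)]
            for dp in range(dp_size)]

groups = tp_groups(dp_size=2, tp_size=4)
# → [[0, 1, 2, 3], [4, 5, 6, 7]]
```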

Changes

New files (atom/rollout/)

| File | Purpose |
| --- | --- |
| `__init__.py` | Package exports |
| `async_engine.py` | `AsyncLLMEngine` wrapper (sleep/wake/load_weights API) |
| `engine_utility.py` | Utility command handlers (update_weights, release/resume_memory) |
| `memory_manager.py` | GPU memory lifecycle (KV cache alloc/release, weight discard/resume) |
| `model_runner_ext.py` | `RLHFModelRunner` with DP isolation, NCCL device binding patch |
| `weight_sync.py` | Weight transfer via CUDA IPC (per-GPU buffers) |
| `weight_updater.py` | Weight update logic (packed weights, FP8 requantize, TP sharding) |

Incremental changes to existing files

  • engine_core.py: utility queue, sleep mode, UTILITY_RESPONSE message type, DP sleep state sync
  • engine_core_mgr.py: utility_response_queue, broadcast_utility_command, broadcast_utility_command_sync
  • llm_engine.py: request_ids and logprobs support in add_request/generate/postprocess
  • async_proc.py: TP-rank barrier for safe weight update buffer reuse
  • scheduler.py: logprobs tracking in ScheduledBatch/ScheduledBatchOutput
  • sequence.py: request_id, return_logprobs, logprobs fields
  • sampling_params.py: logprobs parameter
  • config.py: runner_qualname, compilation_config dict→object conversion

@sijyang sijyang closed this Apr 13, 2026
@sijyang sijyang changed the title Sijyang/verl dev Add RLHF rollout integration support (verl) Apr 13, 2026
@sijyang sijyang reopened this Apr 13, 2026
@sijyang sijyang changed the title Add RLHF rollout integration support (verl) [feat]: Add RLHF rollout integration support (verl) Apr 13, 2026
@sijyang sijyang changed the title [feat]: Add RLHF rollout integration support (verl) [feat] Add RLHF rollout integration support (verl) Apr 13, 2026
@sijyang sijyang force-pushed the sijyang/verl_dev branch 3 times, most recently from 9210385 to 4b81ced (April 29, 2026 03:14)

```python
config.num_kvcache_blocks = num_blocks
if not config.enforce_eager:
    # Start profiler before cudagraph capture only if mark-trace is enabled.
```
Contributor:
Recover it; we need to support it.

Author:

Recovered.

```diff
 self._finalizer,
 config.tensor_parallel_size,
-"atom.model_engine.model_runner.ModelRunner",
+config.runner_qualname,
```
Contributor:
do we have to change it?

Author:

Yes, this is needed for veRL rollout. AsyncLLMEngine uses RLHFModelRunner instead of the default ModelRunner.

sijyang added 16 commits April 30, 2026 15:14
…ion parameters and comments across multiple files
@sijyang sijyang force-pushed the sijyang/verl_dev branch 2 times, most recently from 7b97e86 to 71a493c (April 30, 2026 07:26)
