Is your feature request related to a problem? Please describe.
Nemotron folks (i.e., @pjin-nvidia and @bxyu-nvidia ) would like to run GRPO experiments on Qwen 3 32B with 128k context length.
So far, they have been able to launch the following configs:
For this task, there are no restrictions on:
- DTensor path vs. Megatron Core path; anything that works is fine.
- Runtime. Ideally the experiments would finish overnight, but as of now the Nemotron folks are not blocked by how long they take.
The current understanding is that the main technical challenge is GPU global memory footprint: activations dominate memory usage, and their size scales linearly with sequence length (see the back-of-envelope sketch below).
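As a rough illustration of why 128k is hard, here is a minimal sketch of the activation footprint for a single sequence. The model dimensions are assumptions based on the public Qwen3-32B config, and the ~34 bytes per token per hidden unit constant is a rule-of-thumb estimate for bf16 transformer activations without recomputation (in the spirit of Korthikanti et al., "Reducing Activation Recomputation in Large Transformer Models"), not a measurement of this setup:

```python
# Back-of-envelope activation memory for one 128k-token sequence.
# All constants below are assumptions, not measured values.
SEQ_LEN = 128 * 1024              # 128k context length
HIDDEN = 5120                     # assumed hidden size for Qwen3-32B
NUM_LAYERS = 64                   # assumed layer count for Qwen3-32B
BYTES_PER_TOKEN_PER_HIDDEN = 34   # rough bf16 estimate, flash attention, no checkpointing

per_layer_gb = SEQ_LEN * HIDDEN * BYTES_PER_TOKEN_PER_HIDDEN / 1e9
total_tb = per_layer_gb * NUM_LAYERS / 1e3
print(f"~{per_layer_gb:.1f} GB/layer, ~{total_tb:.2f} TB across all layers")
# -> roughly 23 GB per layer and ~1.5 TB for the full stack, vs. 80 GB of
#    HBM per H100. This is why context/sequence parallelism and/or
#    activation recomputation seem unavoidable at this sequence length.
```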
Describe the solution you'd like
Being able to run GRPO on Qwen 3 32B with 128k context length on a reasonable number of nodes.
- The initial target is 8 8xH100 nodes.
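For context on whether 8 nodes is plausible, a quick budget check under standard mixed-precision assumptions (bf16 weights and gradients, fp32 master weights plus two fp32 Adam moments, i.e. ~16 bytes/parameter; these are rules of thumb, not measurements):

```python
# Parameter-state budget vs. aggregate HBM for the proposed 8x 8xH100 setup.
PARAMS = 32e9                         # 32B parameters
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4   # bf16 weights + grads, fp32 master + Adam m/v
states_tb = PARAMS * BYTES_PER_PARAM / 1e12      # ~0.51 TB
cluster_hbm_tb = 8 * 8 * 80 / 1e3                # 8 nodes x 8 H100-80GB = 5.12 TB
per_gpu_state_gb = PARAMS * BYTES_PER_PARAM / 1e9 / 64
print(f"parameter state ~{states_tb:.2f} TB vs. cluster HBM ~{cluster_hbm_tb:.2f} TB "
      f"(~{per_gpu_state_gb:.0f} GB/GPU if fully sharded)")
```

If fully sharded, parameter state is only ~8 GB per GPU, leaving most of the 80 GB per GPU for activations. This is consistent with the framing above that activations, not parameter state, are the bottleneck.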
Describe alternatives you've considered
Additional context