Is your feature request related to a problem? Please describe.
Nemotron folks (i.e., @pjin-nvidia and @bxyu-nvidia ) would like to run GRPO experiments on Qwen 3 32B with 128k context length.
So far, they have been able to launch the following configs:
For this task, there are no restrictions on:
- DTensor path vs. Megatron Core path; anything that works is fine.
- Runtime. Ideally the experiments would finish overnight, but as of now the Nemotron folks are not blocked by how long they take.
The current understanding is that the main technical challenge is GPU global memory footprint: activations dominate memory usage, and their size scales linearly with sequence length (see the back-of-envelope sketch below).
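As a rough illustration of why 128k is hard, here is a minimal sketch of the activation footprint for a single sequence. The model dimensions are assumptions based on the public Qwen3-32B config, and the ~34 bytes per token per hidden unit constant is a rule-of-thumb estimate for bf16 transformer activations without recomputation (in the spirit of Korthikanti et al., "Reducing Activation Recomputation in Large Transformer Models"), not a measurement of this setup:

```python
# Back-of-envelope activation memory for one 128k-token sequence.
# All constants below are assumptions, not measured values.
SEQ_LEN = 128 * 1024              # 128k context length
HIDDEN = 5120                     # assumed hidden size for Qwen3-32B
NUM_LAYERS = 64                   # assumed layer count for Qwen3-32B
BYTES_PER_TOKEN_PER_HIDDEN = 34   # rough bf16 estimate, flash attention, no checkpointing

per_layer_gb = SEQ_LEN * HIDDEN * BYTES_PER_TOKEN_PER_HIDDEN / 1e9
total_tb = per_layer_gb * NUM_LAYERS / 1e3
print(f"~{per_layer_gb:.1f} GB/layer, ~{total_tb:.2f} TB across all layers")
# -> roughly 23 GB per layer and ~1.5 TB for the full stack, vs. 80 GB of
#    HBM per H100. This is why context/sequence parallelism and/or
#    activation recomputation seem unavoidable at this sequence length.
```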
Describe the solution you'd like
Being able to run GRPO on Qwen 3 32B with 128k context length on a reasonable number of nodes.
- The initial target is 8 8xH100 nodes.
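For context on whether 8 nodes is plausible, a quick budget check under standard mixed-precision assumptions (bf16 weights and gradients, fp32 master weights plus two fp32 Adam moments, i.e. ~16 bytes/parameter; these are rules of thumb, not measurements):

```python
# Parameter-state budget vs. aggregate HBM for the proposed 8x 8xH100 setup.
PARAMS = 32e9                         # 32B parameters
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4   # bf16 weights + grads, fp32 master + Adam m/v
states_tb = PARAMS * BYTES_PER_PARAM / 1e12      # ~0.51 TB
cluster_hbm_tb = 8 * 8 * 80 / 1e3                # 8 nodes x 8 H100-80GB = 5.12 TB
per_gpu_state_gb = PARAMS * BYTES_PER_PARAM / 1e9 / 64
print(f"parameter state ~{states_tb:.2f} TB vs. cluster HBM ~{cluster_hbm_tb:.2f} TB "
      f"(~{per_gpu_state_gb:.0f} GB/GPU if fully sharded)")
```

If fully sharded, parameter state is only ~8 GB per GPU, leaving most of the 80 GB per GPU for activations. This is consistent with the framing above that activations, not parameter state, are the bottleneck.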
Describe alternatives you've considered
Additional context