fix: memory optimizations for Nemotron12B 12k seqlen DPO training (#926)
Conversation
Signed-off-by: Yubo Gao <yubog@nvidia.com>
terrykong
left a comment
Thanks for improving the performance, @ybgao-nvidia!
Wait ... where is it disabled by default?
wangshangsam
left a comment
Some small nits, but otherwise LGTM.
Co-authored-by: Shang Wang <samshang.wang@mail.utoronto.ca> Signed-off-by: Yubo Gao <yubog@nvidia.com>
wangshangsam
left a comment
Thanks @ybgao-nvidia ! @pjin-nvidia @bxyu-nvidia FYI
Corresponding fix in Automodel: NVIDIA-NeMo/Automodel#391
…IDIA-NeMo#926) Signed-off-by: Yubo Gao <yubog@nvidia.com> Co-authored-by: Shang Wang <samshang.wang@mail.utoronto.ca> Co-authored-by: Terry Kong <terrycurtiskong@gmail.com> Signed-off-by: Qidong Su <qidongs@nvidia.com>
What does this PR do?
Memory optimizations
This PR applies memory optimizations that allow single-node (8xH100) training of the Nemotron 12B model with a sequence length of 12288.
We need the following optimizations to make the 12K context work:
The additional checkpointed layers provide a significant decrease in peak memory usage with minimal performance impact. However, enabling a smaller `max_split_size` in the allocator does increase the step time slightly. The collated performance results are below: (66.56), (73.24) [remaining values from the results table].
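For context, `max_split_size_mb` is one of several options passed through the `PYTORCH_CUDA_ALLOC_CONF` environment variable, whose value is a comma-separated list of `option:value` pairs. The parser below is a minimal illustrative sketch of that format, not PyTorch's actual implementation:

```python
def parse_alloc_conf(conf: str) -> dict:
    """Parse a PYTORCH_CUDA_ALLOC_CONF-style string into a dict.

    Example input: "max_split_size_mb:64,expandable_segments:True"
    """
    options = {}
    for item in conf.split(","):
        if not item:
            continue  # tolerate empty segments such as a trailing comma
        key, _, value = item.partition(":")
        options[key.strip()] = value.strip()
    return options

print(parse_alloc_conf("max_split_size_mb:64"))
# {'max_split_size_mb': '64'}
```

This is only meant to show the option string's shape; the real allocator additionally validates option names and value types.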
Removal of `configure_expandable_segments`
Furthermore, the current implementation of `configure_expandable_segments` does not actually perform its intended function. It reads `torch.cuda.get_device_properties(0).major`, which initializes `torch`, including the memory allocator; the subsequent assignment to the environment variable therefore does not affect the allocator. Instead, the `torch.cuda.memory._set_allocator_settings` function should be used. However, setting expandable segments has minimal effect on peak memory usage while causing a large performance overhead (from 20s to 80s per training iteration).
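The pitfall can be illustrated without CUDA: any setting that is snapshotted once at initialization ignores later changes to the environment. The sketch below uses a hypothetical `Allocator` class standing in for PyTorch's caching allocator; the names are illustrative only:

```python
import os

# Start from a clean environment so the demo is deterministic.
os.environ.pop("PYTORCH_CUDA_ALLOC_CONF", None)

class Allocator:
    """Stand-in for the CUDA caching allocator: it reads its
    configuration from the environment exactly once, at init time."""
    def __init__(self):
        self.conf = os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "default")

# Querying device properties initializes torch's allocator; here the
# analogous step is simply constructing the stand-in object.
allocator = Allocator()

# Assigning the env var *afterwards* does not reach the live allocator.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:64"
print(allocator.conf)  # still "default"
```

This is why the env var must be set before `torch` touches the GPU, or the setting must be applied through an in-process API (in real PyTorch, `torch.cuda.memory._set_allocator_settings`) rather than the environment.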
We have deleted the function and the related invocations and tests to keep the runtime behaviour consistent. Should the need arise to set expandable segments, the user should instead do so via `env_vars` in the recipe configuration.
Minor fixes for config schema
Some tweaks were made so that config validation passes:
Make the `tensorboard` field of the logger optional.
Issues
This PR resolves #848.
Usage
It is recommended to run DPO training with `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:64` to reduce allocator fragmentation.
Before your PR is "Ready for review"
Pre checks:
Additional Information