
Use default value of initial_scale_power if FP16 scaling params not provided#4986

Closed
ShukantPal wants to merge 4 commits into deepspeedai:master from ShukantPal:shukant/fix-initial-scale

Conversation

@ShukantPal
Contributor

@ShukantPal ShukantPal commented Jan 21, 2024

The dynamic_loss_scale_args is None if any scaling param is not specified in the config: https://github.com/microsoft/DeepSpeed/blob/9d2660d2a3fac767972f01ac96858b2605ffc0e4/deepspeed/runtime/config.py#L215

In that case, it seems like DeepSpeed is using 2^32 as the initial_scale instead of the 2^16 as specified in the docs here: https://www.deepspeed.ai/docs/config-json/#fp16-training-options

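The fallback described above can be sketched as follows. This is a minimal illustration, not DeepSpeed's actual implementation: the `resolve_initial_scale` helper is hypothetical, while `initial_scale_power` and its documented default of 16 come from the DeepSpeed fp16 config docs.

```python
# Illustrative sketch of the proposed behavior: when the user omits FP16
# scaling params, dynamic_loss_scale_args is None, and the loss scaler
# should fall back to the documented default initial_scale_power of 16
# (an initial scale of 2**16), rather than 2**32.

DEFAULT_INITIAL_SCALE_POWER = 16  # documented default in the fp16 config docs


def resolve_initial_scale(dynamic_loss_scale_args):
    """Return the initial loss scale, defaulting to 2**16 when no args are given."""
    if dynamic_loss_scale_args is None:
        power = DEFAULT_INITIAL_SCALE_POWER
    else:
        power = dynamic_loss_scale_args.get(
            "initial_scale_power", DEFAULT_INITIAL_SCALE_POWER
        )
    return 2 ** power


print(resolve_initial_scale(None))                         # 65536, i.e. 2**16
print(resolve_initial_scale({"initial_scale_power": 8}))   # 256
```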
@loadams loadams requested a review from tohtana as a code owner January 22, 2025 17:28
@loadams loadams removed the request for review from mrwyattii January 22, 2025 17:33
@loadams loadams self-assigned this Jan 22, 2025
@loadams
Collaborator

loadams commented Jan 29, 2025

@ShukantPal - I know this PR is old, but I see the following error now on this PR:

            raise TimeoutError
        if self._success:
            return self._value
        else:
>           raise self._value
E           Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.

@ShukantPal
Contributor Author

Hi @loadams, I no longer have the bandwidth to support this PR (switched jobs :)). Feel free to close if this change is no longer applicable.

@loadams
Collaborator

loadams commented Jan 29, 2025

> Hi @loadams, I no longer have the bandwidth to support this PR (switched jobs :)). Feel free to close if this change is no longer applicable.

No problem @ShukantPal - thanks for the update. I'll close this PR and open an issue to track the bug and make the needed fixes.

@loadams loadams closed this Jan 29, 2025
@tjruwase
Contributor

@loadams, I will tackle this in #6976

@loadams
Collaborator

loadams commented Jan 29, 2025

> @loadams, I will tackle this in #6976

Thanks, you beat me to it!

tjruwase added a commit that referenced this pull request Jan 29, 2025
tjruwase added a commit that referenced this pull request Feb 6, 2025
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
