
Use default value of initial_scale_power if FP16 scaling params not provided#4986

Closed
ShukantPal wants to merge 4 commits into deepspeedai:master from ShukantPal:shukant/fix-initial-scale

Conversation

@ShukantPal
Contributor

@ShukantPal ShukantPal commented Jan 21, 2024

The dynamic_loss_scale_args is None if any scaling param is not specified in the config: https://github.com/microsoft/DeepSpeed/blob/9d2660d2a3fac767972f01ac96858b2605ffc0e4/deepspeed/runtime/config.py#L215

In that case, it seems like DeepSpeed is using 2^32 as the initial_scale instead of the 2^16 as specified in the docs here: https://www.deepspeed.ai/docs/config-json/#fp16-training-options

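The fallback described above can be sketched as follows. This is a minimal illustration, not DeepSpeed's actual implementation: the `resolve_initial_scale` helper is hypothetical, while `initial_scale_power` and its documented default of 16 come from the DeepSpeed fp16 config docs.

```python
# Illustrative sketch of the proposed behavior: when the user omits FP16
# scaling params, dynamic_loss_scale_args is None, and the loss scaler
# should fall back to the documented default initial_scale_power of 16
# (an initial scale of 2**16), rather than 2**32.

DEFAULT_INITIAL_SCALE_POWER = 16  # documented default in the fp16 config docs


def resolve_initial_scale(dynamic_loss_scale_args):
    """Return the initial loss scale, defaulting to 2**16 when no args are given."""
    if dynamic_loss_scale_args is None:
        power = DEFAULT_INITIAL_SCALE_POWER
    else:
        power = dynamic_loss_scale_args.get(
            "initial_scale_power", DEFAULT_INITIAL_SCALE_POWER
        )
    return 2 ** power


print(resolve_initial_scale(None))                         # 65536, i.e. 2**16
print(resolve_initial_scale({"initial_scale_power": 8}))   # 256
```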
@loadams loadams requested a review from tohtana as a code owner January 22, 2025 17:28
@loadams loadams removed the request for review from mrwyattii January 22, 2025 17:33
@loadams loadams self-assigned this Jan 22, 2025
@loadams
Collaborator

loadams commented Jan 29, 2025

@ShukantPal - I know this PR is old, but I see the following error now on this PR:

            raise TimeoutError
        if self._success:
            return self._value
        else:
>           raise self._value
E           Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.

@ShukantPal
Contributor Author

Hi @loadams, I no longer have the bandwidth to support this PR (switched jobs :)). Feel free to close if this change is no longer applicable.

@loadams
Collaborator

loadams commented Jan 29, 2025

> Hi @loadams, I no longer have the bandwidth to support this PR (switched jobs :)). Feel free to close if this change is no longer applicable.

No problem @ShukantPal - thanks for the update. I'll close this PR and open an issue to track the bug and make the needed fixes.

@loadams loadams closed this Jan 29, 2025
@tjruwase
Contributor

@loadams, I will tackle this in #6976

@loadams
Collaborator

loadams commented Jan 29, 2025

> @loadams, I will tackle this in #6976

Thanks, you beat me to it!

tjruwase added a commit that referenced this pull request Jan 29, 2025
tjruwase added a commit that referenced this pull request Feb 6, 2025
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
