Describe the bug
When gradients are accumulated across micro batches, they should be averaged rather than summed, so the gradient scale is independent of the global batch size. The current implementation produces a gradient whose scale is proportional to the number of micro batches (gbs / mbs). This is not a big issue for Adam, but it is still a potential issue for other optimizers and for gradient monitoring.
https://github.com/NVIDIA/reinforcer/blob/main/nemo_reinforcer/models/policy/hf_policy.py#L297-L303
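To illustrate the fix, here is a minimal sketch (not the repo's actual code; `train_step`, its arguments, and the micro-batch layout are hypothetical) of accumulation that averages instead of summing. Dividing each micro-batch loss by the number of micro batches before `backward()` makes the accumulated gradient a mean, so its magnitude no longer grows with gbs / mbs:

```python
import torch

def train_step(model, optimizer, micro_batches, loss_fn):
    """Hypothetical helper: gradient accumulation that averages across
    micro batches instead of summing. Scaling each loss by
    1 / num_micro before backward() makes the accumulated gradient
    the mean over the global batch, independent of gbs / mbs."""
    optimizer.zero_grad()
    num_micro = len(micro_batches)
    total_loss = 0.0
    for inputs, targets in micro_batches:
        loss = loss_fn(model(inputs), targets)
        # Scale BEFORE backward so grads average rather than sum.
        (loss / num_micro).backward()
        total_loss += loss.item()
    optimizer.step()
    return total_loss / num_micro
```

With equal-sized micro batches, the accumulated gradient matches the gradient of the mean loss over the full global batch, which is the property the summed version loses.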