ZeRO stage 1 converges differently from stages 0 and 2

When I train a transformer architecture (encoder + decoder) with ZeRO stage 1, the loss curve differs from the ones I get with stage 0 and stage 2.
The stage 0 and stage 2 runs, on the other hand, look very similar to each other.

![transformer-zero](https://user-images.githubusercontent.com/22177353/107832041-47847980-6d44-11eb-8226-f27f13608dba.png)

Run IDs in the plot: 23 = stage 0, 24 = stage 1, 25 = stage 2.
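
For anyone trying to reproduce this, here is a minimal sketch of the kind of comparison described above. It assumes the three runs share one DeepSpeed-style config and differ only in `zero_optimization.stage`. The batch size and fp16 setting are placeholder values, not the ones used in the runs plotted here, and `make_config` / `differing_keys` are hypothetical helpers for illustration only.

```python
import copy
import json

# Hypothetical base config: placeholder values, NOT the settings from the runs above.
BASE_CONFIG = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 0},
}


def make_config(stage):
    """Return a copy of BASE_CONFIG with only the ZeRO stage changed."""
    if stage not in (0, 1, 2):
        raise ValueError(f"this comparison only covers stages 0, 1, 2; got {stage}")
    cfg = copy.deepcopy(BASE_CONFIG)
    cfg["zero_optimization"]["stage"] = stage
    return cfg


def differing_keys(a, b, prefix=""):
    """List dotted key paths whose values differ between two nested dicts."""
    diffs = []
    for key in sorted(set(a) | set(b)):
        path = f"{prefix}{key}"
        va, vb = a.get(key), b.get(key)
        if isinstance(va, dict) and isinstance(vb, dict):
            diffs.extend(differing_keys(va, vb, path + "."))
        elif va != vb:
            diffs.append(path)
    return diffs


# One config per run; everything except the stage is held fixed.
configs = {stage: make_config(stage) for stage in (0, 1, 2)}

if __name__ == "__main__":
    for stage, cfg in configs.items():
        print(f"stage {stage}:", json.dumps(cfg, sort_keys=True))
    # Sanity check that the stage is the only thing that changed between runs.
    print(differing_keys(configs[0], configs[1]))
```

Checking that the stage is the only differing key (along with using the same seed and data order, which this sketch does not cover) helps rule out an accidental config difference as the cause of the diverging curve.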