When training with ZeRO-2 offloading, GPU memory usage is at 95% but GPU utilization is only 20%. Without offloading, I get an OOM error. I am training on 8 H100s.
How can I increase my GPU utilization?
Also, if I have to train on 5 nodes (each with 8 H100s), what is the best configuration? Can I use DeepSpeed ZeRO-3, or something like DeepSpeed ZeRO++?