### Describe the feature

Can gradient accumulation be used in the pretraining of llama2-70b?
If so, how can it be enabled?