[FEATURE]: Can gradient accumulation be used in the pretraining of llama2-70b?

### Describe the feature

Can gradient accumulation be used in the pretraining of llama2-70b?
If so, how can it be enabled?