Description
Hi, I've been running the demos in the Megatron-LM-v1.1.5-ZeRO3 folder and found some API breakage in Megatron-LM-v1.1.5-ZeRO3/megatron/training.py.
Breakage 1
line 327: see_memory_usage(f'before forward {model.global_steps}', force=True)
line 333: see_memory_usage(f'before backward {model.global_steps}', force=True)
line 340: see_memory_usage(f'before optimizer {model.global_steps}', force=True)
While running pretrain_bert.py, an error was raised saying that "model" has no attribute "global_steps":
AttributeError: 'DistributedDataParallel' object has no attribute 'global_steps'
Therefore, I had to comment out these three lines.
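A less destructive workaround than commenting out the lines is to read the attribute defensively, so the memory log still prints when the wrapper lacks `global_steps`. This is a minimal sketch; `see_memory_usage` below is a stand-in for the real DeepSpeed helper, and `WrappedModel` mimics a `DistributedDataParallel` wrapper that does not expose `global_steps`.

```python
def see_memory_usage(message, force=False):
    # stand-in for deepspeed's see_memory_usage; just echoes the message
    print(message)

class WrappedModel:
    # mimics a DistributedDataParallel wrapper with no `global_steps` attribute
    pass

model = WrappedModel()

# getattr with a default avoids the AttributeError entirely
step = getattr(model, "global_steps", "n/a")
see_memory_usage(f"before forward {step}", force=True)
```

With this pattern the three logging lines keep working whether or not the model object carries a step counter.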
Breakage 2
line 330: loss, loss_reduced = forward_step_func(data_iterator, model, args.curriculum_learning)
While running this line, a TypeError was raised because forward_step() accepts only two parameters:
TypeError: forward_step() takes 2 positional arguments but 3 were given
I checked the source code of pretrain_bert.py and found:
def forward_step(data_iterator, model):
So I removed "args.curriculum_learning" from the call, and it works.
I guess an upgrade of Megatron-LM or DeepSpeed caused the API breakage. Please fix, thanks a lot!
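An alternative to editing training.py is to make forward_step in pretrain_bert.py tolerate the extra argument, so both the old two-argument call and the newer three-argument call succeed. A hedged sketch, assuming the third argument is a boolean flag; the parameter name and the function body here are illustrative stand-ins, not the real Megatron-LM code.

```python
def forward_step(data_iterator, model, curriculum_learning=False):
    # original two-argument behaviour is unchanged when the flag is omitted
    if curriculum_learning:
        # a real implementation would adapt the batch here
        pass
    # placeholder return values standing in for the real loss tensors
    return "loss", {"lm loss": "loss"}

# both call styles now succeed without a TypeError
forward_step(None, None)
forward_step(None, None, True)
```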
Setup
The same as in README.md:
python pretrain_bert.py \
$BERT_ARGS \
$OUTPUT_ARGS \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH