
API breakage in Megatron-LM-v1.1.5-ZeRO3 demo #130

@mark14wu

Description


Hi, I've been running the demos in the Megatron-LM-v1.1.5-ZeRO3 folder and found some API breakage in /Megatron-LM-v1.1.5-ZeRO3/megatron/training.py.

Breakage 1

line 327: see_memory_usage(f'before forward {model.global_steps}', force=True)
line 333: see_memory_usage(f'before backward {model.global_steps}', force=True)
line 340: see_memory_usage(f'before optimizer {model.global_steps}', force=True)

While running pretrain_bert.py, an error was raised saying that "model" does not have the attribute "global_steps":

AttributeError: 'DistributedDataParallel' object has no attribute 'global_steps'

Therefore, I had to comment out these three lines.
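In case it helps, here is a minimal sketch of a more defensive workaround instead of plain commenting, assuming the attribute exists on the DeepSpeed engine but not on the DistributedDataParallel wrapper; `iteration` is my placeholder for the training loop's own step counter, not necessarily the variable name used in training.py:

```python
# Sketch only: fall back to a loop-local counter when the model wrapper does
# not expose `global_steps` (plain DistributedDataParallel does not have it).
from deepspeed.runtime.utils import see_memory_usage  # assumed import, as used by training.py

def memory_step(model, iteration):
    # DeepSpeed engines carry `global_steps`; otherwise use the loop counter.
    return getattr(model, 'global_steps', iteration)

# Inside the training loop, replacing the three calls around lines 327-340:
# see_memory_usage(f'before forward {memory_step(model, iteration)}', force=True)
```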

Breakage 2

line 330: loss, loss_reduced = forward_step_func(data_iterator, model, args.curriculum_learning)

Running this line raised an error saying that forward_step() accepts only two positional arguments:

TypeError: forward_step() takes 2 positional arguments but 3 were given

I checked the source code in pretrain_bert.py and found:

def forward_step(data_iterator, model):

So I removed "args.curriculum_learning", and it works.
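For reference, a minimal sketch of how the two signatures could be reconciled instead of deleting the argument outright; `curriculum_learning_enabled` is just a placeholder name of mine, not the repository's API:

```python
# Sketch only: make the extra argument optional in pretrain_bert.py so both
# the old two-argument call and the new three-argument call keep working.
def forward_step(data_iterator, model, curriculum_learning_enabled=False):
    ...

# Alternatively, in training.py, pass the flag only when it is actually set:
# if getattr(args, 'curriculum_learning', False):
#     loss, loss_reduced = forward_step_func(data_iterator, model,
#                                             args.curriculum_learning)
# else:
#     loss, loss_reduced = forward_step_func(data_iterator, model)
```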

I guess an upgrade of Megatron-LM or DeepSpeed caused this API breakage. Please fix it, thanks a lot!

Setup

The same as in README.md:

python pretrain_bert.py \
       $BERT_ARGS \
       $OUTPUT_ARGS \
       --save $CHECKPOINT_PATH \
       --load $CHECKPOINT_PATH \
       --data-path $DATA_PATH
