[BUG]: Fail to load huggingface pretraining when use shardinit #2770

@sega-hsj

Description

🐛 Describe the bug

    import torch
    from transformers import BloomForCausalLM
    from colossalai.tensor import ProcessGroup, ShardSpec
    from colossalai.utils import get_current_device
    # Note: the ColoInitContext import path may differ across Colossal-AI versions
    from colossalai.utils.model.colo_init_context import ColoInitContext

    world_size = torch.distributed.get_world_size()
    shard_pg = ProcessGroup(tp_degree=world_size) if args.shardinit else None
    default_dist_spec = ShardSpec([-1], [world_size]) if args.shardinit else None

    with ColoInitContext(device=get_current_device(),
                         dtype=torch.half,
                         default_dist_spec=default_dist_spec,
                         default_pg=shard_pg):
        model = BloomForCausalLM.from_pretrained(args.model_name_or_path)

When shardinit is enabled, the model parameters are first sharded across multiple GPUs at initialization, and the full-size Hugging Face pretrained checkpoint is loaded afterwards, so the checkpoint shapes no longer match the sharded parameters:

RuntimeError: Error(s) in loading state_dict for BloomForCausalLM:
        size mismatch for transformer.word_embeddings.weight: copying a param with shape torch.Size([46145, 4096]) from checkpoint, the shape in current model is torch.Size([46145, 512]).
        size mismatch for transformer.word_embeddings_layernorm.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for transformer.word_embeddings_layernorm.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for transformer.h.0.input_layernorm.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([512]).
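To make the mismatch concrete, here is a small sketch (not from the issue itself) of the shape arithmetic: `ShardSpec([-1], [world_size])` splits each parameter's last dimension evenly across the tensor-parallel ranks, so the locally initialized shard is `world_size` times narrower than the full checkpoint tensor. The reported 4096 → 512 shrink implies 8 ranks in this run, which is an inference from the error message, not something the issue states.

```python
# Full word_embeddings.weight shape in the Hugging Face checkpoint,
# taken from the error message above.
checkpoint_shape = (46145, 4096)

# Inferred tensor-parallel degree: the last dimension shrank 4096 -> 512.
world_size = 4096 // 512

# With ShardSpec([-1], [world_size]), each rank holds an even slice of
# the last dimension; this is the "current model" shape in the error.
local_shape = (checkpoint_shape[0], checkpoint_shape[1] // world_size)

print(world_size, local_shape)
```

`load_state_dict` then tries to copy the full `(46145, 4096)` checkpoint tensor into a `(46145, 512)` local shard and raises the size-mismatch error.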

I would like to know how to successfully load Hugging Face pretrained weights when using shardinit; it seems necessary when we want to fine-tune a very large model.

Environment

No response

Labels

bug (Something isn't working)
