🐛 Describe the bug
import torch
from transformers import BloomForCausalLM
from colossalai.tensor import ProcessGroup, ShardSpec
from colossalai.utils import get_current_device
from colossalai.utils.model.colo_init_context import ColoInitContext

world_size = torch.distributed.get_world_size()
shard_pg = ProcessGroup(tp_degree=world_size) if args.shardinit else None
default_dist_spec = ShardSpec([-1], [world_size]) if args.shardinit else None
with ColoInitContext(device=get_current_device(),
                     dtype=torch.half,
                     default_dist_spec=default_dist_spec,
                     default_pg=shard_pg):
    model = BloomForCausalLM.from_pretrained(args.model_name_or_path)
When using shardinit, the model is first sharded across multiple GPUs, and only then is the Hugging Face pretrained checkpoint loaded, so a checkpoint shape mismatch occurs:
RuntimeError: Error(s) in loading state_dict for BloomForCausalLM:
    size mismatch for transformer.word_embeddings.weight: copying a param with shape torch.Size([46145, 4096]) from checkpoint, the shape in current model is torch.Size([46145, 512]).
    size mismatch for transformer.word_embeddings_layernorm.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([512]).
    size mismatch for transformer.word_embeddings_layernorm.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([512]).
    size mismatch for transformer.h.0.input_layernorm.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([512]).
I would like to know how to load the Hugging Face pretrained weights successfully when using shardinit; it seems necessary when fine-tuning a very large model.
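For reference, 4096 / 512 = 8, so the default ShardSpec([-1], [world_size]) is slicing the last dimension of every parameter across what appear to be 8 ranks, which matches the traceback above exactly. Below is a minimal sketch of the kind of workaround I have in mind, assuming every parameter really is sharded along its last dimension and that load_state_dict will copy plain tensors into the sharded parameters; shard_state_dict_last_dim is an illustrative helper, not a ColossalAI API, and materializing the full checkpoint on CPU first is expensive for very large models, which is the point of this question.

import torch
import torch.distributed as dist
from transformers import BloomForCausalLM

def shard_state_dict_last_dim(full_state_dict, rank, world_size):
    # Keep only this rank's slice of each tensor along dim=-1,
    # matching a default ShardSpec([-1], [world_size]) layout.
    return {
        name: torch.chunk(tensor, world_size, dim=-1)[rank].contiguous()
        for name, tensor in full_state_dict.items()
    }

rank = dist.get_rank()
# Full pretrained weights materialized once on CPU, then sliced so each
# rank copies only its own shard into the already-sharded model above.
full_sd = BloomForCausalLM.from_pretrained(args.model_name_or_path).state_dict()
model.load_state_dict(shard_state_dict_last_dim(full_sd, rank, world_size))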
Environment
No response