Fix random token-generation issue + MP-checkpoint loading/saving #2132
Conversation
…ed into ds-inference/bloom-fix
Just fixed it, please give it a try.
Looks like I need to add a new key to my checkpoint.json? Is it mandatory? What value should I put in it for the huggingface checkpoint file list? EDIT: I looked at the code and set it to …
After getting past that error, my new checkpoints json file looks like this:

```json
{
  "type": "BLOOM-176B",
  "base_dir": "/home/ubuntu/.cache/deepspeed/bigscience/bloom",
  "checkpoints": [
    "BLOOM-176B-non-tp.pt", "BLOOM-176B-non-tp.pt", "BLOOM-176B-non-tp.pt", "BLOOM-176B-non-tp.pt",
    "BLOOM-176B-non-tp.pt", "BLOOM-176B-non-tp.pt", "BLOOM-176B-non-tp.pt", "BLOOM-176B-non-tp.pt",
    "BLOOM-176B-tp_00.pt", "BLOOM-176B-tp_01.pt", "BLOOM-176B-tp_02.pt", "BLOOM-176B-tp_03.pt",
    "BLOOM-176B-tp_04.pt", "BLOOM-176B-tp_05.pt", "BLOOM-176B-tp_06.pt", "BLOOM-176B-tp_07.pt"
  ],
  "version": 1.0,
  "parallelization": "tp",
  "mp_size": 8
}
```

EDIT: I found the …
```python
v in dict(replaced_module.state_dict()).items()
if transformer_name not in k
}),
non_tp_ckpt_name)
```
`f'{save_mp_checkpoint_path}/{non_tp_ckpt_name}'`
that's true, it's not saved correctly. I am gonna fix it now
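(For context, the fix presumably amounts to joining the save directory into the path, along the lines of this sketch; `replaced_module`, `transformer_name`, `save_mp_checkpoint_path` and `non_tp_ckpt_name` are the variables from the diff above, not defined here.)

```python
import torch
from collections import OrderedDict

# Sketch only -- the four names below come from the surrounding DeepSpeed
# code in this diff. Save the non-tensor-parallel parameters under the
# requested directory instead of the bare file name.
non_tp_state = OrderedDict(
    (k, v) for k, v in dict(replaced_module.state_dict()).items()
    if transformer_name not in k
)
torch.save(non_tp_state, f'{save_mp_checkpoint_path}/{non_tp_ckpt_name}')
```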
@zcrypt0, it also generates a config file under the same path that you can use to run inference with
EDIT: I just noticed the change in the non-tp file size; I will give it a try soon
@RezaYazdaniAminabadi Just tested and it works without a hitch, nice! 👍
…ed into ds-inference/bloom-fix
Still getting this error @RezaYazdaniAminabadi running with batch size = 1.

Ran this with CUDA 11.6 and DeepSpeed on the master branch.
@RezaYazdaniAminabadi how much time is cached TP model loading supposed to take? `self.model` is loaded using HF AutoModel as in bloom-ds-inference.py

nvm
@RezaYazdaniAminabadi I am seeing this again after updating to the master branch and saving without providing a checkpoint json

Just want to double-check, did your install include this commit in master? #2237
@jeffra yes, I am on the latest commit
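(One quick way to verify which commit is actually installed, in case it helps: `ds_report` prints the install info, or from Python:)

```python
import deepspeed

# For source builds the version string carries the short git hash,
# e.g. "0.7.3+a1b2c3d", so it can be matched against the commit above.
print(deepspeed.__version__)
```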
I use this code, run with:

```shell
deepspeed --num_gpus 8 scripts/bloom-inference-server/cache_ds_checkpoints.py --model_name bigscience/bloom --dtype fp16 --save_mp_checkpoint_path ../DS_cache
```

```python
import argparse
import os

import deepspeed
import torch
from transformers import AutoConfig, AutoModelForCausalLM


def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()

    group = parser.add_argument_group(title="launch config")
    group.add_argument("--local_rank", required=False,
                       type=int, help="used by dist launchers")
    group.add_argument("--save_mp_checkpoint_path", required=True,
                       type=str, help="MP checkpoints path for DS inference")

    group = parser.add_argument_group(title="model")
    group.add_argument("--model_name", type=str,
                       required=True, help="model to use")
    group.add_argument("--dtype", type=str, required=True,
                       choices=["bf16", "fp16"], help="dtype for model")

    args = parser.parse_args()

    if (args.dtype == "bf16"):
        args.dtype = torch.bfloat16
    elif (args.dtype == "fp16"):
        args.dtype = torch.float16

    return args


def main() -> None:
    args = get_args()

    if (args.local_rank == 0):
        print("Loading model...")

    world_size = int(os.getenv("WORLD_SIZE", "1"))

    # Load model on the meta device so no real weights are materialized yet
    with deepspeed.OnDevice(dtype=args.dtype, device="meta"):
        model = AutoModelForCausalLM.from_config(
            AutoConfig.from_pretrained(args.model_name),
            torch_dtype=torch.bfloat16
        )
    model = model.eval()

    if (args.dtype == torch.float16):
        # kernel injection + save_mp_checkpoint_path should write the
        # TP-partitioned checkpoints (and a config json) to disk
        model = deepspeed.init_inference(
            model,
            mp_size=world_size,
            dtype=args.dtype,
            replace_with_kernel_inject=True,
            save_mp_checkpoint_path=args.save_mp_checkpoint_path
        )
    elif (args.dtype == torch.bfloat16):
        raise NotImplementedError("bfloat16 is not yet supported")

    print("Model loaded")


if (__name__ == "__main__"):
    main()
```
@jeffra This issue is blocking bigscience-workshop/Megatron-DeepSpeed#328 |
@jeffra ^^ |
@mayank31398 Hi, I am also facing this issue. Has it been solved yet?
@pai4451 not yet |

This PR fixes the token-generation issue caused by different random seeds across the MP ranks. It also adds the ability to save/load MP-partitioned checkpoints to speed up checkpoint loading for inference.
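(As a rough illustration of the symptom, a sketch rather than this PR's code: when sampling, each tensor-parallel rank draws from its own RNG stream, so ranks can pick different tokens for the same step and their sharded activations silently diverge; forcing a common seed on every rank keeps the draws identical.)

```python
import torch

# Sketch of the workaround idea only; the PR fixes this inside DeepSpeed.
# With sampling (do_sample=True), torch.multinomial consumes RNG state, so
# every model-parallel rank must start from the same seed or the ranks will
# generate different tokens for the same step.
SEED = 42  # arbitrary; it just has to match across ranks
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
```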
cc: @stas00 @jeffra