[XPU] ep+tp all2all #4836
Conversation
Thanks for your contribution!
fastdeploy/config.py
Outdated
    # ep+tp split strategy
    # 0: qkv_linear + attn + out_linear + allreduce
    # 1: allgather + qkv_linear + attn + all2all + out_linear
    self.ep_tp_split_mode = int(os.getenv("EP_TP_SPLIT_MODE", 0))
Please move the new environment variable into envs.py, give it an FD_ prefix, and add a comment explaining it.
I'd also suggest making it a string, with the value being either all_reduce or all_to_all.
done
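For reference, the two suggestions above would turn the knob into something like the entry below in fastdeploy/envs.py. This is only a sketch: the variable name FD_EP_TP_SPLIT_MODE and the bare os.getenv pattern are assumptions, not the code that was actually merged.

```python
import os

# ep+tp split strategy for the attention block (assumed FD_-prefixed name):
#   "all_reduce":  qkv_linear + attn + out_linear + allreduce
#   "all_to_all":  allgather + qkv_linear + attn + all2all + out_linear
FD_EP_TP_SPLIT_MODE = os.getenv("FD_EP_TP_SPLIT_MODE", "all_reduce")
```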
        block_tables=share_inputs["block_tables"],
        caches=share_inputs["caches"],
    )
    xpu_forward_meta.token_num = token_num
Any field added to forward_meta also has to be added to the data class. For this particular line, I'd recommend not adding it at all: xpu_forward_meta.ids_remove_padding.shape[0] already gives you token_num.
done
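The change the reviewer describes amounts to deriving the token count on the fly instead of storing it on the meta object. A minimal toy sketch, assuming ids_remove_padding is a tensor whose first dimension is the padding-removed token count:

```python
import paddle

# Toy stand-in: the padding-removed ids already carry the token count in
# their first dimension, so no separate token_num field is needed.
ids_remove_padding = paddle.arange(7)      # 7 tokens after padding removal
token_num = ids_remove_padding.shape[0]    # == 7
```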
    no_tp_action_keys = copy.deepcopy(num_local_ffn_keys)
    if fd_config.parallel_config.ep_tp_split_mode == 1:
        for i in range(fd_config.model_config.moe_layer_start_index, fd_config.model_config.num_hidden_layers):
            k = f"ernie.layers.{i}.self_attn.o_proj.weight"
            if k in weight_list:
                no_tp_action_keys.append(k)
    tp_actions = cls._get_tensor_parallel_mappings(fd_config.model_config.pretrained_config)
    new_actions = {k: v for k, v in tp_actions.items() if k not in num_local_ffn_keys}
    new_actions = {k: v for k, v in tp_actions.items() if k not in no_tp_action_keys}
Leave a TODO here: the V1 loader logic never reaches this path, so it still needs to be adapted for the V1 loader.
This will be handled together when we adapt the V1 loader.
    out = norm_out[0].astype(x_dtype)
    residual_out = norm_out[1].astype(residual_input_dtype) if residual_input is not None else None

    if self.split_x:
        residual_out = self.split(residual_out)
    if self.allgather_out:
        out = self.allgather(out, forward_meta.token_num)

    if residual_input is None:
        return out
    else:
        return norm_out[0].astype(x_dtype)
    return out, residual_out
Could this be moved into the linear layer instead? Putting it here feels odd; it's not reasonable for the norm layer to be aware of tp/ep.
residual_out has to be split here, and that can only be done inside the norm.
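To make the trade-off concrete, here is a rough sketch of what the split_x branch is assumed to do: each tp rank keeps only its slice of the token dimension, so the residual returned alongside the normalized output has to be sliced at the same point, which is why it ends up inside the norm. The function below is illustrative only and is not the PR's implementation.

```python
import paddle

def split_by_tp_rank(x, tp_rank, tp_size):
    # Keep only this rank's contiguous chunk of tokens (assumes the token
    # count is already padded to a multiple of tp_size).
    tokens_per_rank = x.shape[0] // tp_size
    return x[tp_rank * tokens_per_rank:(tp_rank + 1) * tokens_per_rank, :]

# Example: 8 padded tokens split across 4 tp ranks -> 2 tokens per rank.
hidden = paddle.rand([8, 16])
chunk = split_by_tp_rank(hidden, tp_rank=1, tp_size=4)
assert chunk.shape == [2, 16]
```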
    if self.split_token:
        self.num_heads = fd_config.model_config.num_attention_heads
    else:
        self.num_heads = fd_config.model_config.num_attention_heads // self.nranks
Can this num_heads part be removed? Does RowParallelLinear even need this variable? If self.num_heads * self.head_dim is needed later, wouldn't self.hidden_size do?
done
    if token_num_pad > token_num:
        x_new = paddle.zeros([token_num_pad, x.shape[1]], x.dtype)
        x_new[:token_num, :] = x
        x = x_new
This block shouldn't need the if check; otherwise cudagraph can't capture it.
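What the reviewer seems to be asking for is the unconditional form below, so the padding is a fixed sequence of ops that graph capture can record. This is a sketch under that assumption, not the merged fix.

```python
import paddle

def pad_tokens(x, token_num_pad):
    # Always allocate the padded buffer and copy the valid rows; no
    # data-dependent Python branch, which keeps the op sequence stable
    # under cudagraph capture.
    token_num = x.shape[0]
    x_new = paddle.zeros([token_num_pad, x.shape[1]], x.dtype)
    x_new[:token_num, :] = x
    return x_new

# Example: pad 5 valid tokens up to a bucket of 8.
x = paddle.rand([5, 16])
x = pad_tokens(x, token_num_pad=8)
assert x.shape == [8, 16]
```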
    reduce_results: bool = True,
    skip_quant: bool = False,
    weight_dtype="",
    layer_id: int = -1,
For this new parameter, I'd suggest passing it explicitly in every model.py; a default of -1 doesn't make sense.
The current approach needs the layer itself to determine where it sits in the network.
All out_linear layers use RowParallelLinear, and determining the position by passing layer_id is rather hacky.
For example, mlp and shared_expert also use RowParallelLinear, so if layer_id is passed to all of them, extra information is still needed to tell whether the current layer is an out_linear.
We probably need to look into a more precise way to describe the layer's position in the model, e.g. passing in a name string so the layer knows where it is.
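One way to read that last point, sketched with a toy class: identify the layer by a name string rather than a bare layer_id, so a RowParallelLinear-style layer can tell an attention out_linear apart from an mlp or shared_expert projection. All names below are illustrative, not the PR's API.

```python
class ToyRowParallelLinear:
    """Toy stand-in, not FastDeploy's RowParallelLinear."""

    def __init__(self, prefix: str = ""):
        self.prefix = prefix
        # The position in the network is derived from the name string, so no
        # extra layer_id argument with a -1 default is needed.
        self.is_attn_out_linear = prefix.endswith("self_attn.o_proj")

attn_out = ToyRowParallelLinear(prefix="ernie.layers.3.self_attn.o_proj")
mlp_down = ToyRowParallelLinear(prefix="ernie.layers.3.mlp.down_proj")
assert attn_out.is_attn_out_linear and not mlp_down.is_attn_out_linear
```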
    token_num_per_rank = out.shape[0]
    multi_outs = paddle.zeros([token_num_per_rank * self.tp_size, out.shape[1]], dtype=out.dtype)
    paddle.distributed.all_gather(multi_outs, out, self.tp_group)
    out = multi_outs if token_num is None else multi_outs[:token_num, :]
This will also prevent cudagraph from capturing; just default to multi_outs[:token_num, :]?
done
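The resolved version presumably always slices back to token_num, mirroring the snippet above without the None branch. A hedged sketch under that assumption (the helper name and argument list are not from the PR; the all_gather call mirrors the diff above):

```python
import paddle
import paddle.distributed

def allgather_tokens(out, tp_size, tp_group, token_num):
    # Gather each rank's rows into one fixed-size buffer, then always slice
    # back to the real token count so there is no data-dependent branch to
    # break cudagraph capture.
    token_num_per_rank = out.shape[0]
    multi_outs = paddle.zeros([token_num_per_rank * tp_size, out.shape[1]], dtype=out.dtype)
    paddle.distributed.all_gather(multi_outs, out, tp_group)
    return multi_outs[:token_num, :]
```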
Force-pushed from 482f9e9 to 64f8aa6, then from 64f8aa6 to 4c0be8c.
hong19860320 left a comment
LGTM
Please add an example launch script to the PR description.
    if self.split_x:
        residual_out = self.split(residual_out)
    if self.allgather_out:
        out = self.allgather(out, forward_meta.ids_remove_padding.shape[0])
Are these two branches covered by any unit tests right now?
There are XPU unit tests for this. GPU currently doesn't reach this path; the GPU folks will need to adapt and verify it themselves.
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
Tag list: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
Run pre-commit before commit.
For the release branch, make sure the PR has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.