[ChatGlm] Adds support for the ChatGLM model #27883
ArthurZucker wants to merge 35 commits into main from
Conversation
Co-authored-by: Xunkai <xunkai55@gmail.com>
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Hi~ o( ̄▽ ̄)ブ Could you tell me what your plans are? May I know when this PR can be merged? I have PRs that depend on this feature.
@younesbelkada and I just came back from holidays; we are hoping for the end of the week, maybe later!
```python
"ChatGlmModel is using ChatGlmSdpaAttention, but `torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to the manual attention implementation, "
'but specifying the manual implementation will be required from Transformers version v5.0.0 onwards. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
```

should be split into two lines
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
cc @ArthurZucker, could you give a first pass? 🙏
ArthurZucker
left a comment
A few nits here and there, ready otherwise! cc @younesbelkada
```
@@ -0,0 +1,55 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
```

Headers need an update
```
Tips:

- TODO conversion tips if needed
```
```python
@@ -0,0 +1,60 @@
# Copyright 2023 EleutherAI and The HuggingFace Inc. team. All rights reserved.
rot_shape = cos.shape

# In the original ChatGLM repository the query and key states are manually
# reshaped into a shape `batch_size, num_q_heads, seq_len, head_dim // 2, 2` changing the order
```

don't you think that we can do this when converting the checkpoints?

+1, we can move this complexity into the conversion script and simplify the implementation here.
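A conversion-time permutation along the lines suggested above could look like the following sketch. The helper name `permute_rotary_weight` and the shapes are hypothetical, not part of the PR: it reorders the rows of a query/key projection weight so that the interleaved rotary layout (`[x0, x1, x2, x3, ...]` pairs) becomes the half-split layout (`[x0, x2, ..., x1, x3, ...]`) used by most half-rotation RoPE implementations, removing the need to reshape at inference time.

```python
import torch


def permute_rotary_weight(w: torch.Tensor, num_heads: int, head_dim: int) -> torch.Tensor:
    # Hypothetical conversion-script helper (not from the PR).
    # w has shape (num_heads * head_dim, in_features), with each head's rows
    # laid out as interleaved rotary pairs: [x0, x1, x2, x3, ...].
    # After the transpose, each head's rows become [x0, x2, ..., x1, x3, ...],
    # i.e. the "split halves" layout.
    in_features = w.shape[-1]
    w = w.view(num_heads, head_dim // 2, 2, in_features)
    w = w.transpose(1, 2)
    return w.reshape(num_heads * head_dim, in_features)


# Tiny worked example: one head, head_dim=4, rows numbered 0..3.
w = torch.arange(4, dtype=torch.float32).reshape(4, 1)
permuted = permute_rotary_weight(w, num_heads=1, head_dim=4)
# rows [0, 1, 2, 3] -> [0, 2, 1, 3]
```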
```python
if not self.multi_query_attention:
    qkv_size = 3 * self.hidden_size
    self.num_key_value_heads = 1
else:
    self.num_key_value_heads = config.multi_query_group_num
    qkv_size = self.projection_size + 2 * self.hidden_size_per_attention_head * self.num_key_value_heads
```

we should just have a general implementation! supporting GQA means you can have MQA, MHA and GQA

An alternative in my mind: we can probably also split the QKV matrix in the conversion script, so that the inference code here does not need to handle the division of the QKV matrix. Separate Q, K and V matrices are also friendlier to some specific platforms.
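The suggested conversion-script split could be sketched as below. The sizes are illustrative placeholders (in a real script they would come from the checkpoint's config); the point is that one `torch.split` along the output dimension separates the fused QKV weight into Q, K and V for any of MHA, GQA or MQA.

```python
import torch

# Illustrative sizes only -- a real conversion script would read
# hidden_size, num_attention_heads and multi_query_group_num from the config.
hidden_size, num_heads, num_kv_heads = 4096, 32, 2
head_dim = hidden_size // num_heads

# Fused QKV weight: query rows first, then key rows, then value rows.
fused_qkv = torch.randn(num_heads * head_dim + 2 * num_kv_heads * head_dim, hidden_size)

q, k, v = torch.split(
    fused_qkv,
    [num_heads * head_dim, num_kv_heads * head_dim, num_kv_heads * head_dim],
    dim=0,
)
# q: (4096, 4096), k and v: (256, 4096) with these placeholder sizes.
```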
I might have time to work on the addition next week, with a new cool feature that will help ease the integration! 🤗

Hey Arthur! How's everything going?
GLM has already released the GLM-4 model (slightly different from ChatGLM-3 in the tokenizer, but otherwise basically the same). Could we resume the integration effort? Thanks!
```python
else:
    raise ValueError(f"Unknown RoPE scaling type {scaling_type}")


def _split_heads(self, fused_qkv: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
```

we don't need the if/else
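A single code path that covers MHA (`num_kv_heads == num_heads`), GQA (`1 < num_kv_heads < num_heads`) and MQA (`num_kv_heads == 1`) might look like the sketch below; it is a free function with assumed names, not the PR's actual implementation.

```python
from typing import Tuple

import torch


def split_heads(
    fused_qkv: torch.Tensor, num_heads: int, num_kv_heads: int, head_dim: int
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    # One code path for MHA, GQA and MQA: the only thing that varies is
    # num_kv_heads, so no if/else on the attention variant is needed.
    batch, seq_len, _ = fused_qkv.shape
    q_size = num_heads * head_dim
    kv_size = num_kv_heads * head_dim
    q, k, v = fused_qkv.split([q_size, kv_size, kv_size], dim=-1)
    q = q.view(batch, seq_len, num_heads, head_dim)
    k = k.view(batch, seq_len, num_kv_heads, head_dim)
    v = v.view(batch, seq_len, num_kv_heads, head_dim)
    return q, k, v


# Example: 8 query heads sharing 2 key/value heads (GQA).
fused = torch.randn(2, 5, (8 + 2 + 2) * 16)
q, k, v = split_heads(fused, num_heads=8, num_kv_heads=2, head_dim=16)
```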
I think #33823 should fix this!
What does this PR do?
Drafts support for the ChatGLM model.