add chatglm3-6b and glm-4-9b-chat model support #6999
mnlife wants to merge 5 commits into ggml-org:master
Conversation
Is there any way to support glm-4? #7778
Under development.
Not sure if this is a model or an implementation issue, but computing the imatrix of this model fails around chunk 21. Edit: Looks like running it on CPU instead of CUDA gets it past chunk 21.
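For anyone trying to reproduce the workaround above, a minimal sketch (depending on the build, the binary may be named imatrix or llama-imatrix; the model path and calibration file here are placeholders, not taken from this thread):

```sh
# Compute the importance matrix entirely on the CPU to get past the failure
# seen around chunk 21 with CUDA. -ngl 0 keeps every layer off the GPU.
./llama-imatrix -m glm-4-9b-chat.Q5_K_S.gguf -f calibration-data.txt -o imatrix.dat -ngl 0
```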
Will the vision model of glm-4 also be considered?
Under development.
Hello, I compiled your branch and ran inference on an NVIDIA GPU with glm-4-9b-chat.Q5_K_S.gguf. It can answer short prompts such as "hello", "who are you", or "write a poem", but once the prompt gets longer the reply turns into garbled text, for example: "Translate the following Chinese into English: 生活和天气一样,有晴,有阴,偶尔还会下点雨,自然规律,生活不简单尽量简单过。" (roughly: "Life is like the weather: sometimes sunny, sometimes overcast, with the occasional rain; that is nature's way, and since life is not simple, try to live it simply.") Here is the execution log:

.\build\bin\Release\llama-cli.exe -m D:\models\glm-4-9b-chat.Q5_K_S.gguf -p "[gMASK]<|user|>hi<|assistant|>" -t 16 --keep -1 -c 1024 -b 1024 -n -1 -s 123 -ngl 18 --color -i

system_info: n_threads = 16 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

== Running in interactive mode. ==

hi
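Possibly related: the prompt in the log above omits the <sop> token and the newlines that the upstream glm-4-9b-chat chat template places after each role tag. A hedged sketch of the fuller format, using Unix-style quoting (taken from the THUDM tokenizer template; not verified to be the cause of the garbling):

```sh
# Upstream glm-4-9b-chat format: a [gMASK]<sop> prefix, then each role tag
# followed by a newline before its content. Whether the missing <sop> and
# newlines explain the garbled long replies is unverified.
./llama-cli -m glm-4-9b-chat.Q5_K_S.gguf -ngl 18 -c 1024 -s 123 --color -i \
  -p '[gMASK]<sop><|user|>
hi<|assistant|>'
```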
I have already solved the incorrect-answers issue based on this PR. Here is the PR.
Someone has taken up this work now; see PR #8031.
Why is this still pending?
This pull request adds support for the chatglm3-6b and glm-4-9b-chat models. Fixes #7778.
Some things I'm not sure about:
When I add my chat template to examples/server/public/prompt-formats.js, run llama-server, open http://localhost:8080/ in the browser, and switch the prompt style, the assistant always starts a new line before speaking.

The inference results are incorrect with the CUDA version (a CPU-vs-CUDA comparison is sketched below).
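To isolate whether CUDA is at fault, one quick check is to compare a CPU-only run against a fully offloaded run. A sketch, assuming the llama-cli example from this repo (seed, token count, and prompt are arbitrary choices):

```sh
# Same seed, same prompt, greedy sampling (--temp 0); only the offload differs.
# -ngl 0 runs fully on the CPU, -ngl 99 offloads all layers to CUDA.
# Diverging outputs would point at the CUDA path.
./llama-cli -m glm-4-9b-chat.Q5_K_S.gguf -p '[gMASK]<|user|>hi<|assistant|>' -n 64 -s 123 --temp 0 -ngl 0
./llama-cli -m glm-4-9b-chat.Q5_K_S.gguf -p '[gMASK]<|user|>hi<|assistant|>' -n 64 -s 123 --temp 0 -ngl 99
```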
Below are some links about the ChatGLM models: