graph : normalize Q, K, V shapes and add comments #12449
Conversation
```cpp
ggml_build_forward_expand(gf, ggml_cpy(ctx0, k_cur, k_cache_view));

assert(v_cur->ne[0] == n_embd_v_gqa && v_cur->ne[1] == n_tokens);
v_cur = ggml_reshape_2d(ctx0, v_cur, n_embd_v_gqa, n_tokens);
```
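For orientation, here is a minimal, self-contained sketch of the convention the PR normalizes, using hypothetical dimensions (this is not the actual llama.cpp graph code): Q, K, V flow through the graph as 3D tensors of shape `[n_embd_head, n_head, n_tokens]` and are flattened to 2D only at the point of the cache copy, as in the diff above.

```cpp
// Minimal sketch with hypothetical dimensions -- not the actual llama.cpp
// code. Shows the 3D "per-head" view used inside the graph and the 2D
// flattening applied right before the KV cache copy.
#include "ggml.h"
#include <cassert>

int main() {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx0 = ggml_init(params);

    // hypothetical dimensions, for illustration only
    const int64_t n_embd_head  = 64;
    const int64_t n_head       = 8;
    const int64_t n_tokens     = 4;
    const int64_t n_embd_v_gqa = n_embd_head*n_head;

    // v_cur as produced by the V projection: [n_embd_v_gqa, n_tokens]
    struct ggml_tensor * v_cur = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_embd_v_gqa, n_tokens);

    // normalized 3D shape used while building the attention graph
    struct ggml_tensor * v_3d = ggml_reshape_3d(ctx0, v_cur, n_embd_head, n_head, n_tokens);

    // flattened back to 2D for the cache copy, mirroring the diff above
    assert(v_cur->ne[0] == n_embd_v_gqa && v_cur->ne[1] == n_tokens);
    struct ggml_tensor * v_2d = ggml_reshape_2d(ctx0, v_3d, n_embd_v_gqa, n_tokens);
    assert(v_2d->ne[0] == n_embd_v_gqa && v_2d->ne[1] == n_tokens);

    ggml_free(ctx0);
    return 0;
}
```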
fairydreaming left a comment:
I confirm that this fixed the remaining issue with T5 models. I went over the code and don't see any obvious problems.

Since these changes affect multiple models, I briefly tested some that I had on disk: llama-3.1, llama-3.2, phi-4, gemma-2, deepseek-r1, qwen-2.5, and the t5 family. All seem to work fine.
I applied this diff to b4914 and it still does not work:

```
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
system_info: n_threads = 4 (n_threads_batch = 4) / 4 | CUDA : ARCHS = 520,610,700,750 | FORCE_MMQ = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
sampler seed: 2296139272
llama_perf_sampler_print: sampling time = 46.71 ms / 128 runs ( 0.36 ms per token, 2740.55 tokens per second)
```
It does work now with -ngl 0 though, if that gives any clue as to the problem:

```
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
system_info: n_threads = 4 (n_threads_batch = 4) / 4 | CUDA : ARCHS = 520,610,700,750 | FORCE_MMQ = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
sampler seed: 2041154203
Heute regnet es [end of text]
llama_perf_sampler_print: sampling time = 2.02 ms / 7 runs ( 0.29 ms per token, 3458.50 tokens per second)
```
@steampunque Good catch, there still seems to be a problem when using T5 with the CUDA backend. I'm going to take a look at this now.
@steampunque Ok, it looks like the problem you found is unrelated to this PR; it's another issue introduced with the kv cache refactor PR. A temporary fix:
@fairydreaming Can you check if this alternative patch also fixes the issue:

```diff
diff --git a/src/llama-context.cpp b/src/llama-context.cpp
index 42332acf1..664703a89 100644
--- a/src/llama-context.cpp
+++ b/src/llama-context.cpp
@@ -1143,6 +1143,8 @@ int llama_context::encode(llama_batch & inp_batch) {
     if (model.arch == LLM_ARCH_T5 && t_embd) {
         //cross.t_embd = t_embd;
+        synchronize();
+
         cross.n_embd = t_embd->ne[0];
         cross.n_enc  = t_embd->ne[1];
         cross.v_embd.resize(cross.n_embd*cross.n_enc);
```
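To illustrate why the `synchronize()` helps, here is a hedged sketch under assumed names (`sched`, `gf`, and `t_embd` stand in for the backend scheduler, the encoder graph, and the encoder output tensor; this is not the actual `llama_context::encode()` code): with an asynchronous backend such as CUDA, graph execution may still be in flight when the host copies the cross-attention embeddings, so the read has to be fenced first.

```cpp
#include <vector>
#include "ggml.h"
#include "ggml-backend.h"

// Hedged sketch, not the actual llama_context::encode() code: sched, gf and
// t_embd are assumed stand-ins for the backend scheduler, the encoder graph
// and the encoder output tensor.
static std::vector<float> read_encoder_output(ggml_backend_sched_t sched,
                                              struct ggml_cgraph  * gf,
                                              struct ggml_tensor  * t_embd) {
    // kicks off the graph; on CUDA this can return before the GPU is done
    ggml_backend_sched_graph_compute_async(sched, gf);

    // without this fence the copy below can observe stale device memory --
    // consistent with the garbage T5 output seen with layers offloaded
    ggml_backend_sched_synchronize(sched);

    // now it is safe to pull the encoder embeddings back to the host
    std::vector<float> v_embd(t_embd->ne[0]*t_embd->ne[1]);
    ggml_backend_tensor_get(t_embd, v_embd.data(), 0, ggml_nbytes(t_embd));
    return v_embd;
}
```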
Thanks for tracking that down so quickly, t5 is back online:

```
bash-5.1$ lm "<2de> Today the sun shines"
```
@ggerganov Yeah, the
|
@ggerganov I know it's been a while, but I was recently taking a look at this change. Can I ask what the motivation for it was? Was it purely to fix the T5 bug described in #12435?
It makes the signatures of the functions more consistent. We could also pass the tensors as 2D shapes; it does not matter much, just a matter of convention.
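As a small illustration of the "just a convention" point (hypothetical sizes; not code from the PR): a 3D view `[n_embd_head, n_head, n_tokens]` and a 2D view `[n_embd_head*n_head, n_tokens]` alias the same contiguous buffer, since `ggml_reshape_*` creates a view without copying data.

```cpp
#include "ggml.h"
#include <cassert>

// Sketch with hypothetical sizes: reshaping between 2D and 3D changes only
// the shape metadata; both tensors share the same underlying storage.
int main() {
    struct ggml_init_params params = { 1024*1024, NULL, false };
    struct ggml_context * ctx0 = ggml_init(params);

    struct ggml_tensor * q_3d = ggml_new_tensor_3d(ctx0, GGML_TYPE_F32, 64, 8, 4);
    struct ggml_tensor * q_2d = ggml_reshape_2d(ctx0, q_3d, 64*8, 4);

    assert(q_2d->data == q_3d->data);                    // same storage, no copy
    assert(ggml_nelements(q_2d) == ggml_nelements(q_3d));

    ggml_free(ctx0);
    return 0;
}
```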
* graph : normalize Q, K, V shapes and add comments ggml-ci
* context : synchronize before getting cross attention data
* model : fix command-r attention norm check
ref #12435 (comment)