Conversation

Unfortunately, […]

A few remarks: […]

Yes, I'm still hesitating. But I think […]
Somehow perplexity computation with […] Edit: fixed.

Can we increment this value by 1? Edit: oh, it was all in llama.h/llama.cpp.

That would make even the unaffected formats (F16, Q8) incompatible. The clean way would be to define new formats Q4_4, Q4_5, etc., but that gets unwieldy quickly.

@sw It doesn't have to be, though: during loading, exceptions can be added in llama.cpp to treat the old F16 and Q8 formats with either file version 1 or 2 as forward compatible.
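
A minimal sketch of that idea, assuming the loader sees the file version and each tensor's type; the enum and `check_tensor_compat` are hypothetical names, not the actual llama.cpp API:

```cpp
// Hypothetical compatibility check: formats whose layout did not change
// (F16, Q8) are accepted from both file versions; the repacked Q4_*/Q5_*
// formats require a version-2 file.
#include <cstdint>
#include <stdexcept>

enum class tensor_type { F16, Q8_0, Q4_0, Q4_1, Q5_0, Q5_1 };

void check_tensor_compat(uint32_t file_version, tensor_type type) {
    const bool layout_unchanged = (type == tensor_type::F16 || type == tensor_type::Q8_0);
    if (layout_unchanged && (file_version == 1 || file_version == 2)) {
        return; // old and new files are both fine for these formats
    }
    if (file_version == 2) {
        return; // repacked formats are only valid in version-2 files
    }
    throw std::runtime_error("unsupported file version for this tensor type; please re-quantize");
}
```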

Close in favor of #1405

Great, I finally compiled it on my PC (no AVX2 support) AND with CUDA support. But this change makes none of my models load :(. I don't know how to quantize things; I've read a lot about it, and I doubt I even have the PC resources to do it.

@ProfessorSparrs If you have the F16 files, quantizing is very easy and WAY less resource-intensive than running the model. :) (check the […]
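
For reference, re-quantizing with the `quantize` tool that ships with llama.cpp is a one-liner along the lines of `./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin q4_0` (the paths here are illustrative, and the exact type argument depends on the version you built).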

Implementation of #1241.
Avoid unnecessary bit shuffling by packing the quants in a better way (sketched below).
Requires model re-quantization.
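
An illustrative sketch of the repacking, assuming a block of 32 quants as in Q4_0; the function names are mine, not ggml's:

```cpp
// Old layout: adjacent quants q[2j] and q[2j+1] share one byte, so SIMD dot
// products must shuffle nibbles apart before multiplying.
// New layout: q[j] and q[j + QK/2] share one byte, so the 16 low nibbles and
// the 16 high nibbles each form a contiguous half-block and no shuffle is needed.
#include <cstdint>

constexpr int QK = 32; // quants per block

void pack_old(const uint8_t q[QK], uint8_t out[QK / 2]) {
    for (int j = 0; j < QK / 2; ++j) {
        out[j] = (q[2*j] & 0x0F) | ((q[2*j + 1] & 0x0F) << 4);
    }
}

void pack_new(const uint8_t q[QK], uint8_t out[QK / 2]) {
    for (int j = 0; j < QK / 2; ++j) {
        out[j] = (q[j] & 0x0F) | ((q[j + QK/2] & 0x0F) << 4);
    }
}
```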

New timings vs. old timings for Q4_0, Q4_1, Q5_0, Q5_1: [benchmark tables not preserved]
Overall, all these numbers seem to have about ±10% variability from run to run. Not an ideal benchmark, but I'm not sure what else to do.