Running on Ubuntu, Intel Core i5-12400F, 32GB RAM.
Built according to the README. I'm running the program with:
python llamacpp_for_kobold.py ../llama.cpp/models/30Bnew/ggml-model-q4_0-ggjt.bin --threads 6
Generation seems to take ~5 seconds per token. This is substantially slower than llama.cpp, where I'm averaging around 900ms/token.
At first I thought it was an issue with the threading, but now I'm not so sure... Has anyone else observed similar performance discrepancies? Am I missing something?
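In case the gap comes partly from how we're each measuring, here is a minimal sketch (my own helper, not part of either project) of timing generation the same way for both backends:

```python
import time

def ms_per_token(generate, n_tokens):
    """Time a generation callable and return average milliseconds per token.

    `generate` is any function that produces `n_tokens` tokens when called;
    it is a stand-in for whichever backend is being benchmarked.
    """
    start = time.perf_counter()
    generate(n_tokens)
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / n_tokens

if __name__ == "__main__":
    # Dummy backend that "generates" at 10 ms per token, just to sanity-check the helper.
    dummy = lambda n: time.sleep(0.01 * n)
    print(f"{ms_per_token(dummy, 20):.0f} ms/token")
```

With a uniform measurement like this, a ~5000 ms vs ~900 ms per-token difference should be easy to confirm or rule out.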