Running on Ubuntu, Intel Core i5-12400F, 32GB RAM.
Built according to the README. I'm running the program with:
python llamacpp_for_kobold.py ../llama.cpp/models/30Bnew/ggml-model-q4_0-ggjt.bin --threads 6
Generation seems to take ~5 seconds per token. This is substantially slower than llama.cpp, where I'm averaging around 900ms/token.
At first I thought it was an issue with the threading, but now I'm not so sure... Has anyone else observed similar performance discrepancies? Am I missing something?
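In case the gap comes partly from how we're each measuring, here is a minimal sketch (my own helper, not part of either project) of timing generation the same way for both backends:

```python
import time

def ms_per_token(generate, n_tokens):
    """Time a generation callable and return average milliseconds per token.

    `generate` is any function that produces `n_tokens` tokens when called;
    it is a stand-in for whichever backend is being benchmarked.
    """
    start = time.perf_counter()
    generate(n_tokens)
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / n_tokens

if __name__ == "__main__":
    # Dummy backend that "generates" at 10 ms per token, just to sanity-check the helper.
    dummy = lambda n: time.sleep(0.01 * n)
    print(f"{ms_per_token(dummy, 20):.0f} ms/token")
```

With a uniform measurement like this, a ~5000 ms vs ~900 ms per-token difference should be easy to confirm or rule out.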