This is a fork of the ggml-org/llama.cpp project, focused on developing custom quantisation types — currently the HIFI family of quantisation variants.
The HIFI quantisation types aim to deliver better quality at the same (or similar) model sizes compared to the standard quantisation options. This is an ongoing, actively developed project and public contributions are welcome.
To build and use HIFI quantised models, follow the detailed instructions in the HIFI Build Guide, which covers:
- Cloning and building this fork
- Downloading and converting base models
- Creating imatrix files
- Quantising models with the HIFI types
- Running perplexity tests and benchmarks
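The steps above can be sketched as a shell session. The repository URL, file paths, and the HIFI type name below are placeholders (the real type names are defined in this fork; see the Build Guide), while the tool names and flags follow upstream llama.cpp:

```shell
# Clone and build this fork (placeholder URL)
git clone <this-fork-url> llama.cpp-hifi
cd llama.cpp-hifi
cmake -B build
cmake --build build --config Release -j

# Convert a downloaded Hugging Face model to GGUF
python convert_hf_to_gguf.py /path/to/base-model --outfile model-f16.gguf

# Create an importance matrix from a calibration text
./build/bin/llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix

# Quantise with a HIFI type (HIFI_TYPE is a placeholder; see the Build Guide)
./build/bin/llama-quantize --imatrix model.imatrix model-f16.gguf model-hifi.gguf HIFI_TYPE

# Evaluate quality and speed
./build/bin/llama-perplexity -m model-hifi.gguf -f test.txt
./build/bin/llama-bench -m model-hifi.gguf
```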
The upstream llama.cpp project enables LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware — locally and in the cloud.
- Plain C/C++ implementation without any dependencies
- Apple silicon is a first-class citizen — optimised via ARM NEON, Accelerate and Metal frameworks
- AVX, AVX2, AVX512 and AMX support for x86 architectures
- RVV, ZVFH, ZFH, ZICBOP and ZIHINTPAUSE support for RISC-V architectures
- 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantisation for faster inference and reduced memory use
- Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads GPUs via MUSA)
- Vulkan and SYCL backend support
- CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity
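Hybrid inference is controlled by the number of layers offloaded to the GPU. A minimal sketch, assuming a GPU-enabled build and a model file at `model.gguf`:

```shell
# Offload 20 of the model's layers to the GPU, keep the rest on the CPU
llama-cli -m model.gguf -ngl 20 -p "Hello"

# Offload as many layers as exist (a large value is clamped to the layer count)
llama-cli -m model.gguf -ngl 99 -p "Hello"
```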
For the full upstream project, see ggml-org/llama.cpp.
Supported models
Finetunes of the base models below are typically supported as well.
- LLaMA 🦙
- LLaMA 2 🦙🦙
- LLaMA 3 🦙🦙🦙
- Mistral 7B
- Mixtral MoE
- DBRX
- Jamba
- Falcon
- BERT
- Baichuan 1 & 2
- Aquila 1 & 2
- Starcoder models
- MPT
- Bloom
- Yi models
- StableLM models
- Deepseek models
- Qwen models
- Phi models
- GPT-2
- InternLM2
- Gemma
- Mamba
- Command-R models
- OLMo
- OLMo 2
- Granite models
- GPT-NeoX + Pythia
- BitNet b1.58 models
- Flan T5
- ChatGLM3-6b + ChatGLM4-9b
- GLM-4-0414
- SmolLM
- RWKV-6
- Hunyuan models
Supported backends
| Backend | Target devices |
|---|---|
| Metal | Apple Silicon |
| BLAS | All |
| SYCL | Intel and Nvidia GPU |
| CUDA | Nvidia GPU |
| HIP | AMD GPU |
| Vulkan | GPU |
| CANN | Ascend NPU |
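Each backend in the table above is selected at build time via a CMake option. The option names below match recent upstream llama.cpp (older trees used `LLAMA_*` instead of `GGML_*`); consult the upstream build documentation for your exact tree:

```shell
# CUDA (Nvidia GPU)
cmake -B build -DGGML_CUDA=ON
# HIP (AMD GPU)
cmake -B build -DGGML_HIP=ON
# Vulkan
cmake -B build -DGGML_VULKAN=ON
# SYCL (Intel GPU)
cmake -B build -DGGML_SYCL=ON
# Metal is enabled by default on macOS

cmake --build build --config Release -j
```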
A CLI tool for accessing and experimenting with most of llama.cpp's functionality.

```shell
llama-cli -m model.gguf
```

A lightweight, OpenAI API compatible HTTP server for serving LLMs.

```shell
llama-server -m model.gguf --port 8080
```

A tool for measuring the perplexity of a model over a given text; essential for evaluating quantisation quality.

```shell
llama-perplexity -m model.gguf -f file.txt
```

A tool for benchmarking inference performance across various parameters.

```shell
llama-bench -m model.gguf
```

This is an ongoing project and public contributions are welcome. Whether it's new quantisation types, performance improvements, bug fixes, or documentation, all contributions are appreciated.
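Once llama-server is running, it can be queried over its OpenAI-compatible API. A minimal sketch, assuming the server is listening on port 8080:

```shell
# Send a chat completion request to a running llama-server instance
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Say hello"}]
  }'
```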
- Open a PR or issue on this repository
- See CONTRIBUTING.md for general guidelines (inherited from upstream)
- Read the HIFI Build Guide to get familiar with the project workflow
This fork inherits the following single-header dependencies from the upstream project:
- yhirose/cpp-httplib - Single-header HTTP server, used by `llama-server` - MIT license
- stb-image - Single-header image format decoder, used by multimodal subsystem - Public domain
- nlohmann/json - Single-header JSON library, used by various tools/examples - MIT License
- miniaudio.h - Single-header audio format decoder, used by multimodal subsystem - Public domain
- subprocess.h - Single-header process launching solution for C and C++ - Public domain