This is a fork of the ggml-org/llama.cpp project, focused on developing custom quantisation types — currently the HIFI family of quantisation variants.
The HIFI quantisation types aim to deliver better quality at the same (or similar) model sizes compared to the standard quantisation options. This is an ongoing, actively developed project and public contributions are welcome.
To build and use HIFI quantised models, follow the detailed instructions in the HIFI Build Guide, which covers:
- Cloning and building this fork
- Downloading and converting base models
- Creating imatrix files
- Quantising models with the HIFI types
- Running perplexity tests and benchmarks
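The steps above can be sketched as a shell session. The repository URL, file paths, and the HIFI type name below are placeholders (the real type names are defined in this fork; see the Build Guide), while the tool names and flags follow upstream llama.cpp:

```shell
# Clone and build this fork (placeholder URL)
git clone <this-fork-url> llama.cpp-hifi
cd llama.cpp-hifi
cmake -B build
cmake --build build --config Release -j

# Convert a downloaded Hugging Face model to GGUF
python convert_hf_to_gguf.py /path/to/base-model --outfile model-f16.gguf

# Create an importance matrix from a calibration text
./build/bin/llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix

# Quantise with a HIFI type (HIFI_TYPE is a placeholder; see the Build Guide)
./build/bin/llama-quantize --imatrix model.imatrix model-f16.gguf model-hifi.gguf HIFI_TYPE

# Evaluate quality and speed
./build/bin/llama-perplexity -m model-hifi.gguf -f test.txt
./build/bin/llama-bench -m model-hifi.gguf
```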
The upstream llama.cpp project enables LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware — locally and in the cloud.
- Plain C/C++ implementation without any dependencies
- Apple silicon is a first-class citizen — optimised via ARM NEON, Accelerate and Metal frameworks
- AVX, AVX2, AVX512 and AMX support for x86 architectures
- RVV, ZVFH, ZFH, ZICBOP and ZIHINTPAUSE support for RISC-V architectures
- 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantisation for faster inference and reduced memory use
- Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads GPUs via MUSA)
- Vulkan and SYCL backend support
- CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity
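Hybrid inference is controlled by the number of layers offloaded to the GPU. A minimal sketch, assuming a GPU-enabled build and a model file at `model.gguf`:

```shell
# Offload 20 of the model's layers to the GPU, keep the rest on the CPU
llama-cli -m model.gguf -ngl 20 -p "Hello"

# Offload as many layers as exist (a large value is clamped to the layer count)
llama-cli -m model.gguf -ngl 99 -p "Hello"
```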
For the full upstream project, see ggml-org/llama.cpp.
Supported models
Finetunes of the base models below are typically supported as well.
- LLaMA 🦙
- LLaMA 2 🦙🦙
- LLaMA 3 🦙🦙🦙
- Mistral 7B
- Mixtral MoE
- DBRX
- Jamba
- Falcon
- BERT
- Baichuan 1 & 2
- Aquila 1 & 2
- Starcoder models
- MPT
- Bloom
- Yi models
- StableLM models
- Deepseek models
- Qwen models
- Phi models
- GPT-2
- InternLM2
- Gemma
- Mamba
- Command-R models
- OLMo
- OLMo 2
- Granite models
- GPT-NeoX + Pythia
- BitNet b1.58 models
- Flan T5
- ChatGLM3-6b + ChatGLM4-9b
- GLM-4-0414
- SmolLM
- RWKV-6
- Hunyuan models
Supported backends
| Backend | Target devices |
|---|---|
| Metal | Apple Silicon |
| BLAS | All |
| SYCL | Intel and Nvidia GPU |
| CUDA | Nvidia GPU |
| HIP | AMD GPU |
| Vulkan | GPU |
| CANN | Ascend NPU |
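Each backend in the table above is selected at build time via a CMake option. The option names below match recent upstream llama.cpp (older trees used `LLAMA_*` instead of `GGML_*`); consult the upstream build documentation for your exact tree:

```shell
# CUDA (Nvidia GPU)
cmake -B build -DGGML_CUDA=ON
# HIP (AMD GPU)
cmake -B build -DGGML_HIP=ON
# Vulkan
cmake -B build -DGGML_VULKAN=ON
# SYCL (Intel GPU)
cmake -B build -DGGML_SYCL=ON
# Metal is enabled by default on macOS

cmake --build build --config Release -j
```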
A CLI tool for accessing and experimenting with most of llama.cpp's functionality.

```shell
llama-cli -m model.gguf
```

A lightweight, OpenAI API compatible HTTP server for serving LLMs.

```shell
llama-server -m model.gguf --port 8080
```

A tool for measuring the perplexity of a model over a given text; essential for evaluating quantisation quality.

```shell
llama-perplexity -m model.gguf -f file.txt
```

A tool for benchmarking inference performance across various parameters.

```shell
llama-bench -m model.gguf
```

This is an ongoing project and public contributions are welcome. Whether it's new quantisation types, performance improvements, bug fixes, or documentation, all contributions are appreciated.
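Once llama-server is running, it can be queried over its OpenAI-compatible API. A minimal sketch, assuming the server is listening on port 8080:

```shell
# Send a chat completion request to a running llama-server instance
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Say hello"}]
  }'
```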
- Open a PR or issue on this repository
- See CONTRIBUTING.md for general guidelines (inherited from upstream)
- Read the HIFI Build Guide to get familiar with the project workflow
This fork inherits the following single-header dependencies from the upstream project:
- yhirose/cpp-httplib - Single-header HTTP server, used by `llama-server` - MIT license
- stb-image - Single-header image format decoder, used by multimodal subsystem - Public domain
- nlohmann/json - Single-header JSON library, used by various tools/examples - MIT License
- miniaudio.h - Single-header audio format decoder, used by multimodal subsystem - Public domain
- subprocess.h - Single-header process launching solution for C and C++ - Public domain