
llama.cpp — HIFI Quantisation Fork

License: MIT

This is a fork of the ggml-org/llama.cpp project, focused on developing custom quantisation types — currently the HIFI family of quantisation variants.

The HIFI quantisation types aim to deliver better quality at the same (or similar) model sizes compared to the standard quantisation options. This is an ongoing, actively developed project and public contributions are welcome.

Quick start

To build and use HIFI quantised models, follow the detailed instructions in the HIFI Build Guide, which covers:

  • Cloning and building this fork
  • Downloading and converting base models
  • Creating imatrix files
  • Quantising models with the HIFI types
  • Running perplexity tests and benchmarks

About llama.cpp

The upstream llama.cpp project enables LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware — locally and in the cloud.

  • Plain C/C++ implementation without any dependencies
  • Apple silicon is a first-class citizen — optimised via ARM NEON, Accelerate and Metal frameworks
  • AVX, AVX2, AVX512 and AMX support for x86 architectures
  • RVV, ZVFH, ZFH, ZICBOP and ZIHINTPAUSE support for RISC-V architectures
  • 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantisation for faster inference and reduced memory use
  • Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads GPUs via MUSA)
  • Vulkan and SYCL backend support
  • CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity
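The hybrid CPU+GPU mode mentioned above is controlled per layer. As a sketch (model path and layer count are illustrative):

```shell
# Offload the first 20 transformer layers to the GPU and keep the rest
# on the CPU; -ngl (--n-gpu-layers) sets how many layers go into VRAM.
llama-cli -m model.gguf -ngl 20 -p "Hello"
```

Raising `-ngl` speeds up inference until VRAM runs out; lowering it lets models larger than VRAM still run, just more slowly.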

For the full upstream project, see ggml-org/llama.cpp.

Supported models

Finetunes of the base models below are typically supported as well.

Text-only

Multimodal

Supported backends

| Backend | Target devices |
| ------- | -------------- |
| Metal   | Apple Silicon |
| BLAS    | All |
| SYCL    | Intel and Nvidia GPU |
| CUDA    | Nvidia GPU |
| HIP     | AMD GPU |
| Vulkan  | GPU |
| CANN    | Ascend NPU |

Key tools

llama-cli: a CLI tool for accessing and experimenting with most of llama.cpp's functionality.

llama-cli -m model.gguf

llama-server: a lightweight, OpenAI-compatible HTTP server for serving LLMs.

llama-server -m model.gguf --port 8080
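Because the server speaks the OpenAI chat API, any OpenAI-compatible client can talk to it. A minimal sketch (model path and prompt are illustrative):

```shell
# Start the server in the background, then query the
# OpenAI-compatible chat completions endpoint with curl.
llama-server -m model.gguf --port 8080 &

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello."}]}'
```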

llama-perplexity: a tool for measuring the perplexity of a model over a given text, essential for evaluating quantisation quality.

llama-perplexity -m model.gguf -f file.txt
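For judging a HIFI quant, a useful pattern is to run the same evaluation text through the unquantised baseline and the quantised model and compare the scores (file names here are illustrative):

```shell
# Lower perplexity = closer to the baseline = less quality loss.
llama-perplexity -m model-f16.gguf  -f eval.txt
llama-perplexity -m model-hifi.gguf -f eval.txt
```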

llama-bench: benchmarks inference performance across a range of parameters.

llama-bench -m model.gguf

Contributing

This is an ongoing project and public contributions are welcome. Whether it's new quantisation types, performance improvements, bug fixes, or documentation — all contributions are appreciated.

  • Open a PR or issue on this repository
  • See CONTRIBUTING.md for general guidelines (inherited from upstream)
  • Read the HIFI Build Guide to get familiar with the project workflow

Upstream documentation

This fork inherits extensive documentation from the upstream project.

Dependencies

  • yhirose/cpp-httplib - Single-header HTTP server, used by llama-server - MIT license
  • stb-image - Single-header image format decoder, used by multimodal subsystem - Public domain
  • nlohmann/json - Single-header JSON library, used by various tools/examples - MIT license
  • miniaudio.h - Single-header audio format decoder, used by multimodal subsystem - Public domain
  • subprocess.h - Single-header process launching solution for C and C++ - Public domain
