Multi-threaded quantization#1075

Merged
ggerganov merged 6 commits into master from multi-thread-quantize
Apr 20, 2023

@ikawrakow
Contributor

This PR adds multi-threading for quantization.

The gain is very minor for small models (e.g., LLaMA 7B) and simple quantization (Q4_0 and Q4_1), but very significant for large models and the now more elaborate Q4_2 quantization.

quantize-stats now finishes in just 14.5 seconds (7B) or 44 seconds (13B) on my computer for all 3 quantization types. The single-threaded version took 144 seconds (7B) or 242 seconds (13B).
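
For readers who want a picture of the general approach, here is a minimal sketch of chunked multi-threaded quantization. It is not the code from this PR: `toy_quantize_chunk`, `quantize_parallel` and the chunk size are hypothetical names and values chosen for illustration, and the toy quantizer only rounds clamped floats to 15 levels. The idea it shows is that when elements (or blocks) are quantized independently, threads can pull fixed-size chunks from a shared, mutex-protected counter and write to disjoint output ranges.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a real block quantizer: clamps each float in
// [start, start + n) to [-1, 1] and rounds it to one of 15 integer levels.
static size_t toy_quantize_chunk(const float * src, int8_t * dst, size_t start, size_t n) {
    for (size_t i = start; i < start + n; ++i) {
        const float v = std::max(-1.0f, std::min(1.0f, src[i]));
        dst[i] = static_cast<int8_t>(std::lround(v * 7.0f));
    }
    return n; // number of elements written
}

// Splits the input into fixed-size chunks and lets up to nthread threads
// claim chunks from a shared counter until the input is exhausted.
static size_t quantize_parallel(const float * src, int8_t * dst, size_t nelements, int nthread) {
    constexpr size_t chunk_size = 32 * 512; // assumed granularity, for illustration only
    const size_t nchunk = (nelements + chunk_size - 1) / chunk_size;
    const int nthread_use = std::max(1, std::min<int>(nthread, static_cast<int>(nchunk)));
    if (nthread_use < 2) {
        return toy_quantize_chunk(src, dst, 0, nelements);
    }

    std::mutex mutex;
    size_t next  = 0; // first element of the next unclaimed chunk
    size_t total = 0; // elements written by all threads

    auto worker = [&]() {
        size_t local = 0;
        while (true) {
            size_t first;
            {
                // Only the bookkeeping is serialized; the quantization itself runs unlocked.
                std::lock_guard<std::mutex> lock(mutex);
                first = next;
                next += chunk_size;
                if (first >= nelements) {
                    total += local; // publish this thread's result before exiting
                    break;
                }
            }
            const size_t count = std::min(chunk_size, nelements - first);
            local += toy_quantize_chunk(src, dst, first, count);
        }
    };

    std::vector<std::thread> workers;
    for (int i = 0; i < nthread_use - 1; ++i) {
        workers.emplace_back(worker);
    }
    worker(); // the calling thread also takes chunks
    for (auto & t : workers) {
        t.join();
    }
    return total;
}
```

Calling `quantize_parallel(src, dst, n, 4)` should produce the same output as a single `toy_quantize_chunk(src, dst, 0, n)` call, since each element is processed exactly once regardless of which thread claims its chunk.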

Iwan Kawrakow added 2 commits April 19, 2023 20:22
Not much gain for simple quantizations, but it will be important
for quantizations that require more CPU cycles.
It now does the job in ~14 seconds on my Mac for
Q4_0, Q4_1 and Q4_2. Single-threaded it was taking
more than 2 minutes after adding the more elaborate
version of Q4_2.
@ikawrakow ikawrakow requested review from sw and unbounded April 20, 2023 05:45
@DannyDaemonic
Contributor

This could make more accurate but slow quantization methods more practical. (See #835.)

Comment thread llama.cpp Outdated
Comment thread llama.cpp Outdated
Comment thread llama.cpp Outdated
Comment thread ggml.c Outdated
@ggerganov ggerganov added the performance Speed related topics label Apr 20, 2023
Iwan Kawrakow added 3 commits April 20, 2023 18:17
After changing chunk_size to const int as suggested by
@ggerganov, clang and GCC started to warn me that I don't
need to capture it in the lambda. So, I removed it from the
capture list. But that makes the MSVC build fail. So,
making it a constexpr to make every compiler happy.
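
The capture problem described in the commit message above can be sidestepped in standard C++ by giving the constant static storage duration: variables with static storage duration are never captured and can be used inside a lambda directly. A minimal sketch, with `count_chunks` as a hypothetical helper and the chunk size assumed for illustration (this is not necessarily what the merged code does):

```cpp
#include <cstddef>

// Hypothetical helper: number of chunks needed to cover nelements.
static size_t count_chunks(size_t nelements) {
    // Static storage duration: the lambda below can name chunk_size
    // without listing it in the capture list.
    static constexpr size_t chunk_size = 32 * 512;

    // Only the automatic variable nelements needs to be captured.
    auto chunks = [nelements]() {
        return (nelements + chunk_size - 1) / chunk_size;
    };
    return chunks();
}
```

Because the capture list never mentions `chunk_size`, there is nothing for one compiler to flag as unused or for another to report as missing.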
@prusnak
Contributor

prusnak commented Apr 20, 2023

Please resolve conflicts with the master branch

@ggerganov ggerganov merged commit 38de86a into master Apr 20, 2023
@ggerganov ggerganov deleted the multi-thread-quantize branch April 20, 2023 17:42
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
* Multi-threading quantization.

Not much gain for simple quantizations, but it will be important
for quantizations that require more CPU cycles.

* Multi-threading for quantize-stats

It now does the job in ~14 seconds on my Mac for
Q4_0, Q4_1 and Q4_2. Single-threaded it was taking
more than 2 minutes after adding the more elaborate
version of Q4_2.

* Reviewer comments

* Avoiding compiler confusion

After changing chunk_size to const int as suggested by
@ggerganov, clang and GCC started to warn me that I don't
need to capture it in the lambda. So, I removed it from the
capture list. But that makes the MSVC build fail. So,
making it a constexpr to make every compiler happy.

* Still fighting with lambda captures in MSVC

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request Apr 28, 2026