merge b1565 by Nexesenex · Pull Request #17 · Nexesenex/croco.cpp

Nexesenex · 2023-11-26T01:59:53Z

No description provided.

* llama : keep track of used KV cells + better KV cache management
* llama : zero KV cache used upon clear

  ggml-ci
* llama : allow exporting a view of the KV cache (#4180)
* Allow exporting a view of the KV cache
* Allow dumping the sequences per cell in common
* Track max contiguous cells value and position as well
* Fix max contiguous empty cells index calculation

  Make dump functions deal with lengths or sequences counts > 10 better
* Fix off by one error in dump_kv_cache_view
* Add doc comments for KV cache view functions

  Eliminate cell sequence struct; use llama_seq_id directly

  Minor cleanups
* common : add -dkvc arg for enabling kv cache dumps

---------

Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>

* ggml-cuda : support stablelm rope
* remove unused freq_base kernel parameter
* add n_dims parameter to llm_build_k_shift, default to n_rot via overload
* llama : fix llm_build_k_shift args

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Add openai-compatible POST /v1/chat/completions API endpoint to server example
* fix code style
* Update server README.md
* Improve server README.md
* Fix server.cpp code style according to review
* server : some style changes
* server : indentation
* server : enable special tokens during tokenization by default
* server : minor code style
* server : change random string generator
* straightforward /v1/models endpoint

---------

Co-authored-by: kir-gadjello <111190790+kir-gadjello@users.noreply.github.com>
Co-authored-by: Tobi Lütke <tobi@Tobis-MacBook-Pro.local>

* lookahead : init
* lookahead : generate and store n-grams
* lookahead : use loop instead recursion to generate n-grams
* lookahead : initial working implementation
* lookahead : filter repeating n-grams
* lookahead : use deterministic init
* lookahead : add to Makefile
* lookahead : fix a bug in the seq_id of the lookahead tokens
* lookahead : add comments

---------

Co-authored-by: slaren <slarengh@gmail.com>

* copy to llama.cpp as subdir
* attempt enabling metal, fails
* ggml metal compiles!
* Update README.md
* initial conversion to new format, utf8 errors?
* bug fixes, but now has an invalid memory access :(
* added O3, now has insufficient memory access
* begin sync with master
* update to match latest code, new errors
* fixed it!
* fix for loop conditionals, increase result size
* fix current workflow errors
* attempt a llama.swiftui workflow
* Update .github/workflows/build.yml

  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* ShareGPT4 compatibility (vision encoder only loading)

  Load only a CLIP vision encoder (as supplied by ShareGPT finetunes)

  Corrects the argument parsing for --img_mean and --img_std (which were previously not parsed but attempted to access)

  Defines defaults for img_mean and img_std which are equal to the llava 1.5 CLIP encoder, so you do not have to provide them
* Update convert-image-encoder-to-gguf.py

* cmake : fix joining of REAL_GIT_DIR
* fix includes with help from include-what-you-use
* make : remove unneeded deps and add test-rope target
* fix C includes in C++ source files
* Revert "fix includes with help from include-what-you-use"

  This reverts commit 635e9fa.

* add multiprompt support
* cleanup
* more cleanup
* remove atomicity of id_gen, and change lock_guard to unique_lock on completion requests
* remove all references to mutex_multitasks
* Update examples/server/server.cpp

  Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Update examples/server/server.cpp

  Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Update examples/server/server.cpp

  Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Update examples/server/server.cpp

  Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* change to set

---------

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* metal : implement soft_max_ext
* cuda : implement soft_max_ext
* ggml : implement soft_max_ext (CPU)
* batched-bench : print threads

  ggml-ci
* metal : simplify soft_max encoding

  ggml-ci
* cuda : use 512 threads for soft_max instead of 32
* ggml : update soft max cpu
* cuda : do warp-based block reduce
* cuda : increase max block size to 1024
* cuda : fix warp reduction initialization of shared mem
* metal : warp-based reduction for soft max kernel
* metal : warp-based reduce for rms_norm
* metal : simplify soft max kernel

  ggml-ci
* alloc : fix build with debug
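The commit messages above do not spell out what soft_max_ext computes. Assuming it is a softmax applied to `x * scale + mask` in one pass (an assumption, not confirmed on this page), a scalar CPU reference could be sketched as follows. The name `soft_max_ext_ref` is hypothetical and the code is not taken from ggml.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical reference: softmax over (x[i] * scale + mask[i]).
// An empty mask is treated as all zeros. Assumes at least one entry
// stays finite after masking.
std::vector<float> soft_max_ext_ref(const std::vector<float> & x,
                                    const std::vector<float> & mask,
                                    float scale) {
    const std::size_t n = x.size();
    std::vector<float> y(n);
    if (n == 0) {
        return y;
    }

    // Apply scale and mask, and track the maximum for numerical stability.
    float max_val = -INFINITY;
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = x[i] * scale + (mask.empty() ? 0.0f : mask[i]);
        max_val = std::max(max_val, y[i]);
    }

    // Exponentiate relative to the maximum and accumulate the sum.
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = std::exp(y[i] - max_val);
        sum += y[i];
    }

    // Normalize so the outputs sum to 1.
    for (float & v : y) {
        v /= sum;
    }
    return y;
}
```

Subtracting the row maximum before `exp` is the usual way to avoid overflow; the GPU commits listed above perform the same max and sum reductions across threads rather than in a serial loop.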

This commit adds a requirements file for the convert-hf-to-gguf.py script, and also adds the torch and transformers packages to it.

The motivation for this is that currently running convert-hf-to-gguf.py will produce the following error:

```console
$ python3 -m venv venv
$ source venv/bin/activate
(venv) $ pip install -r requirements.txt
Collecting numpy==1.24.4
Collecting sentencepiece==0.1.98
Collecting gguf>=0.1.0
Installing collected packages: sentencepiece, numpy, gguf
Successfully installed gguf-0.5.1 numpy-1.24.4 sentencepiece-0.1.98
(venv) $ python convert-hf-to-gguf.py --help
Traceback (most recent call last):
  File "llama.cpp/convert-hf-to-gguf.py", line 16, in <module>
    import torch
ModuleNotFoundError: No module named 'torch'
```

With this commit, and using requirements-hf-to-gguf.txt instead of requirements.txt, the script can be run and shows the help output.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* Support attention_bias on LLaMA architecture

  QKVO bias, should fix InternLM (#3133) and works for LLaMAfied Qwen models (#3743 (comment)).
* check existence of qkvo bias while loading llama models

  Tested on LLaMA2, CUDA and CPU.
* Update llama.cpp

* Basic JIT compilation for mul_mat, get_rows, and scale (#17)
* scale jit working
* preliminary working jit for getrows and mulmat, needs refining
* simplified mul_mat preprocessing switch statement
* get_rows fixes, mul_mat refinement
* formatted + last edits
* removed some extraneous prints
* fixed get_rows, fixed workgroup dispatch in mul_mat. no gibberish
* small fix
* some changes, working
* get_rows and mul_mat jit fixed and working
* Update formatting
* formatting
* Add header

---------

Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Start work on all-encompassing shader library
* refactor argmax, set_rows
* Refactor all but flashattention, mat mul
* flashattention and matrix multiplication moved to new format
* clean up preprocessing
* Formatting
* remove duplicate constants
* Split large shaders into multiple static strings

---------

Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com>

…better shader parameter handling (ggml-org#20173)

* K quant speedup (#20)
* Basic JIT compilation for mul_mat, get_rows, and scale (#17)
* scale jit working
* preliminary working jit for getrows and mulmat, needs refining
* simplified mul_mat preprocessing switch statement
* get_rows fixes, mul_mat refinement
* formatted + last edits
* removed some extraneous prints
* fixed get_rows, fixed workgroup dispatch in mul_mat. no gibberish
* small fix
* some changes, working
* get_rows and mul_mat jit fixed and working
* Update formatting
* formatting
* Add header

---------

Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Start work on all-encompassing shader library
* refactor argmax, set_rows
* Refactor all but flashattention, mat mul
* no gibberish, all k quants added, merged
* vec memory fix
* q6_k matching metal on my machine, tests passing
* Set tile size for q6_k separately
* Separate out fast shaders

---------

Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com>

* Move towards writeBuffer for params
* Move away from multiple buffers for set_rows errors, remove host buffer for parameter buffers, minor cleanups
* Remove extra file
* Formatting

---------

Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com>

)

* ggml: backend-agnostic tensor parallelism
* support for GPT-OSS, Qwen 3 MoE
* partial Vulkan fix
* add support for 4/8 GPUs
* unconditional peer access
* re-use buffers + ggml contexts
* fix output pattern
* NCCL support
* GGML: HIP: add RCCL support
* Remove shfl and AllReduce from backend interface
* move allocation workaround out of ggml-alloc.c
* 2d tensor set/get support
* Fix the seg fault without NCCL
* Apply suggestion from JohannesGaessler
* support for tensor dims % n_devs != 0
* fix view_offs scaling
* arbitrary num. of GPUs/tensor split
* fix compilation
* better granularity estimate
* Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA.

  Fix compilation errors.
* partial Qwen 3 Next support
* Fix qwen3 30b (#8)
* Fix crash with Qwen-30B-A3B Q4_0

  Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation.
* Decide block size based on tensor quantization type
* Fix crashes due to KV cache serialization (#9)

  KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset.
* metal : fix build (#7)
* static memory allocations, fix usage count
* fix tensor granularity
* more even memory distribution
* use BF16 for allreduce
* rebase fixup
* better error message for unsupported architectures
* Fix device mismatch during scatter of allReduce. (#11)

  There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies
* Enable the previous allreduce implementation. It is better in both perf and stability (#12)
* delay AllReduce for Moe for less I/O
* build : clean-up compile warnings
* backend : move most of the meta backend API to ggml-backend-impl.h
* cont : hide unused public API in the implementation
* llama : use llama_device + remove ggml_backend_dev_is_meta()
* ggml-backend : remove unused alloc include
* minor : remove regex include
* ggml : introduce ggml-ext.h for staging new APIs
* rebase fixup
* fix tests
* llama : more robust logic for determining Meta devices (#16)
* llama : more robust logic for determining Meta devices
* cont : fix devs size check

  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* cont : fix log type

  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* disable roundtrip for meta backend
* fix arch selection
* Qwen 3.5 support
* fix Gemma 4 MoE
* fix OpenVino, SYCL
* fix test-llama-archs for CPU-only builds
* Fix Qwen 3.5 MoE
* disable meta backend tests for WebGPU
* tests : filter CPU-based devices from the Meta backend tests (#17)
* meta : formatting, naming, indentation (#18)
* formatting : llama-model.cpp
* formatting : ggml-ext.h
* formatting : ggml-backend-meta.cpp
* meta : add TODO
* add documentation
* better error messages
* fix GPT-OSS

---------

Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
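The Qwen-30B-A3B crash described in the commit message above comes down to simple arithmetic: 768 rows in blocks of 256 give only 3 blocks, which cannot be divided evenly between 2 devices, while a finer granularity can. A minimal sketch, assuming a hypothetical helper `split_rows` and a two-GPU setup (neither is taken from the ggml sources). The value 32 used in the usage note is the Q4_0 block size.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: distribute `dim` rows over `n_devs` devices in whole
// blocks of `granularity` rows. Assumes dim % granularity == 0.
std::vector<int64_t> split_rows(int64_t dim, int64_t granularity, int n_devs) {
    const int64_t n_blocks = dim / granularity;
    std::vector<int64_t> rows(n_devs);
    for (int i = 0; i < n_devs; ++i) {
        // The first (n_blocks % n_devs) devices receive one extra block.
        const int64_t blocks = n_blocks / n_devs + (i < n_blocks % n_devs ? 1 : 0);
        rows[i] = blocks * granularity;
    }
    return rows;
}
```

With these assumptions, `split_rows(768, 256, 2)` yields an uneven 512 and 256 rows, whereas `split_rows(768, 32, 2)` yields 384 rows per device.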

jammm and others added 30 commits November 20, 2023 17:02

readme : update ROCm Windows instructions (#4122)

dfc7cd4

* Update README.md
* Update README.md

  Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

---------

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

finetune - update readme to mention llama support only (#4148)

0b871f1

stablelm : simplify + speedup generation (#4153)

8e672ef

docs : add llama-star arch idea

ff8238f

examples : fix typo in parallel example doc comment (#4181)

9d5949f

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

readme : update hot topics

d103d93

Fix incorrect format strings and uninitialized variables. (#4133)

55978ce

* Fix incorrect format strings and uninitialized variables.
* Address comments
* Add the missing include statement

readme : use PATH for Windows ROCm (#4195)

b35f3d0

* Update README.md to use PATH for Windows ROCm
* Update README.md
* Update README.md

main.swift : fix eos checking (#4197)

2568a4b

llama_token_eos(const struct llama_model *) is currently getting struct llama_context type variable context as a parameter.

convert : fix tensors using grad in some models (#4173)

189d684

llama : set metal log callback correctly (#4204)

e9c13ff

readme : update hot topics

04814e7

Update docs for yarn_ext_factor <0.0 as unspecified instead of NaN (#4189)

3014b54

llama : grammar reserve space in decode_utf8 (#4210)

f837c3a

* reserve space for codepoints
* improvement for the appended 0
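The two bullets above describe a common pattern: since a UTF-8 string never holds more code points than bytes, the output vector can be reserved once up front, with one extra slot for the terminating 0. A minimal sketch, assuming a simplified decoder that does not validate malformed input. The name `decode_utf8_sketch` is hypothetical and this is not the llama.cpp implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical simplified decoder: converts UTF-8 bytes to code points and
// appends a terminating 0. No validation of malformed sequences.
std::vector<uint32_t> decode_utf8_sketch(const std::string & src) {
    std::vector<uint32_t> out;
    // At most one code point per input byte, plus the appended 0.
    out.reserve(src.size() + 1);

    std::size_t i = 0;
    while (i < src.size()) {
        const uint8_t b = static_cast<uint8_t>(src[i]);
        std::size_t len = 1;
        uint32_t cp = b;
        // The lead byte determines the sequence length and the payload bits.
        if      ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len && i + k < src.size(); ++k) {
            cp = (cp << 6) | (static_cast<uint8_t>(src[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    out.push_back(0);
    return out;
}
```

Reserving once avoids repeated reallocation while the vector grows, which matters when the decoder runs on every grammar-constrained token.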

scripts : Use mmap in torch load (#4202)

1ddb52e

* Use mmap in torch load, prefer .bin files when loading
* Revert .bin > .safetensors preference

metal : fix yarn (#4220)

22da055

get the correct n_orig_ctx in metal

readme : update hot topics

9656026

lookahead : support -n -1 infinite generation

3e73d31

ggml : fix -Warray-bounds warning with gcc (#4231)

f3b2698

readme : add Amica to UI list (#4230)

0dab8cd

cmake : fix issue with version info not getting baked into LlamaConfig.cmake (#3970)

b38a16d

* Split CPP generation from build-info query
* Remove blank lines
* Add BUILD_SHARED_LIBS option

ggml : re-enable BLAS for CPU when src0 != F32 + remove redundant full offload checks in llama.cpp (#4240)

8406b09

* ggml : use blas even if src0 is not F32
* llama : use n_threads_batch only when n_tokens >= 32

  ggml-ci
* llama : revert n_threads_batch logic

  ggml-ci

ggml : restore abort() in GGML_ASSERT (#4242)

64e64aa

readme : add FreeChat (#4248)

4fea342

examples : add readme files

1f5cd83

ensan-hcl and others added 23 commits November 30, 2023 23:45

batched.swift : update README.md (#4214)

bde629b

docs: update how to run

docker : add finetune option (#4211)

3bd2c7c

readme : fix (#4135)

524907a

* fix: readme
* chore: resolve comments
* chore: resolve comments

main : pass LOG_TEE callback to llama.cpp log (#4033)

8efa0f6

* main : Call llama_log_set to use LOG_TEE
* tabs to spaces

make : fix Apple clang determination bug (#4272)

d2809a3

Co-authored-by: Will Findley <findley@gmail.com>

server : add --log-disable to disable logging to file (#4260)

1d14411

* add --log-disable to disable logging to file in the server example
* typo fix

llama : fix integer overflow during quantization (#4284)

880f579

happens with multi-threaded quantization of Qwen-72B ggml-ci
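The commit message does not describe the arithmetic involved. As a general illustration of how this class of bug arises, the sketch below computes an element count in 64-bit arithmetic and checks whether it would still fit in a 32-bit signed integer. The helper names and the dimensions mentioned afterwards are hypothetical and not taken from llama.cpp or from Qwen-72B.

```cpp
#include <cstdint>
#include <limits>

// Element count computed in 64-bit arithmetic, which is safe for the
// tensor sizes considered here.
constexpr int64_t n_elements_64(int64_t rows, int64_t cols) {
    return rows * cols;
}

// True if the same count would not fit in a 32-bit signed int, i.e. an
// `int` product or running counter would overflow.
constexpr bool overflows_int32(int64_t rows, int64_t cols) {
    return n_elements_64(rows, cols) > std::numeric_limits<int32_t>::max();
}
```

For example, a hypothetical 65536 x 32768 tensor has 2^31 elements, one more than INT32_MAX, so any 32-bit signed index or byte offset derived from it overflows. Large models and per-thread offset calculations make such products more likely.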

llama : add Qwen support (#4281)

37c746d

* enable qwen to llama.cpp
* llama : do not GPU split bias tensors

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

build : enable libstdc++ assertions for debug builds (#4275)

511f52c

swift : fix token_to_piece implementation (#4278)

b220222

* Fix token_to_piece implementation in Swift
* Fix errors

llama : support optional tensors (#4283)

d5a1cbd

llama : avoid using "optional" keyword (#4283)

5a7d312

llama : pad KV cache size (#4280)

d7b800b

* llama : pad KV cache size to 32
* metal : try to improve batched decoding
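Padding a size to 32 amounts to rounding it up to the next multiple of 32. A minimal sketch, assuming a hypothetical helper `pad_to_multiple` (illustrative only, not a function from the llama.cpp sources):

```cpp
#include <cstdint>

// Hypothetical helper: round n up to the next multiple of `pad` (pad > 0).
constexpr uint32_t pad_to_multiple(uint32_t n, uint32_t pad) {
    return ((n + pad - 1) / pad) * pad;
}
```

Under this sketch a requested size of 1000 becomes 1024, while a size that is already a multiple of 32 is left unchanged. Sizes aligned this way let kernels process the cache in fixed-width chunks without a special case for a partial tail.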

py : add grammar to oai like api (#4294)

6949b50

server : fix OpenAI API stop field to be optional (#4299)

33e171d

(cherry picked from commit mozilla-ai/llamafile@e8c92bc)

ggml : fix soft max out-of-bounds access (#4307)

adf3de4

ggml-ci

ggml : reuse ggml_get_n_tasks() in ggml_graph_plan() (#4308)

fbbc428

* ggml : fix soft max out-of-bounds access

  ggml-ci
* ggml : reuse ggml_get_n_tasks() in ggml_graph_plan()

  ggml-ci

Nexesenex deleted the branch Nexesenex:concedo_exp_llamaster_up December 4, 2023 00:43

Nexesenex closed this Dec 4, 2023

Nexesenex pushed a commit that referenced this pull request Dec 22, 2024

Merge mainline - Aug 12 2024 (#17)

8f43e55

* Merge mainline
* Fix after merge
* Remove CI check

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Nexesenex pushed a commit that referenced this pull request Jul 1, 2025

Merge pull request #17 from esolithe/concedo_experimental

8973e4c

merge: Experimental

Nexesenex wants to merge 59 commits into Nexesenex:concedo_exp_llamaster_up from ggml-org:master

19 participants