Concedo experimental by Nexesenex · Pull Request #18 · Nexesenex/croco.cpp

Nexesenex · 2023-12-01T07:20:24Z

No description provided.


* llama : keep track of used KV cells + better KV cache management
* llama : zero KV cache used upon clear ggml-ci
* llama : allow exporting a view of the KV cache (ggml-org#4180)
* Allow exporting a view of the KV cache
* Allow dumping the sequences per cell in common
* Track max contiguous cells value and position as well
* Fix max contiguous empty cells index calculation
  Make dump functions deal with lengths or sequences counts > 10 better
* Fix off by one error in dump_kv_cache_view
* Add doc comments for KV cache view functions
  Eliminate cell sequence struct; use llama_seq_id directly
  Minor cleanups
* common : add -dkvc arg for enabling kv cache dumps
---------
Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>


* ggml-cuda : support stablelm rope
* remove unused freq_base kernel parameter
* add n_dims parameter to llm_build_k_shift, default to n_rot via overload
* llama : fix llm_build_k_shift args
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Add openai-compatible POST /v1/chat/completions API endpoint to server example
* fix code style
* Update server README.md
* Improve server README.md
* Fix server.cpp code style according to review
* server : some style changes
* server : indentation
* server : enable special tokens during tokenization by default
* server : minor code style
* server : change random string generator
* straightforward /v1/models endpoint
---------
Co-authored-by: kir-gadjello <111190790+kir-gadjello@users.noreply.github.com>
Co-authored-by: Tobi Lütke <tobi@Tobis-MacBook-Pro.local>


* lookahead : init
* lookahead : generate and store n-grams
* lookahead : use loop instead recursion to generate n-grams
* lookahead : initial working implementation
* lookahead : filter repeating n-grams
* lookahead : use deterministic init
* lookahead : add to Makefile
* lookahead : fix a bug in the seq_id of the lookahead tokens
* lookahead : add comments
---------
Co-authored-by: slaren <slarengh@gmail.com>


* copy to llama.cpp as subdir
* attempt enabling metal, fails
* ggml metal compiles!
* Update README.md
* initial conversion to new format, utf8 errors?
* bug fixes, but now has an invalid memory access :(
* added O3, now has insufficient memory access
* begin sync with master
* update to match latest code, new errors
* fixed it!
* fix for loop conditionals, increase result size
* fix current workflow errors
* attempt a llama.swiftui workflow
* Update .github/workflows/build.yml
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>


* ggml: backend-agnostic tensor parallelism
* support for GPT-OSS, Qwen 3 MoE
* partial Vulkan fix
* add support for 4/8 GPUs
* unconditional peer access
* re-use buffers + ggml contexts
* fix output pattern
* NCCL support
* GGML: HIP: add RCCL support
* Remove shfl and AllReduce from backend interface
* move allocation workaround out of ggml-alloc.c
* 2d tensor set/get support
* Fix the seg fault without NCCL
* Apply suggestion from JohannesGaessler
* support for tensor dims % n_devs != 0
* fix view_offs scaling
* arbitrary num. of GPUs/tensor split
* fix compilation
* better granularity estimate
* Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA.
  Fix compilation errors.
* partial Qwen 3 Next support
* Fix qwen3 30b (#8)
* Fix crash with Qwen-30B-A3B Q4_0
  Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation.
* Decide block size based on tensor quantization type
* Fix crashes due to KV cache serialization (#9)
  KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset.
* metal : fix build (#7)
* static memory allocations, fix usage count
* fix tensor granularity
* more even memory distribution
* use BF16 for allreduce
* rebase fixup
* better error message for unsupported architectures
* Fix device mismatch during scatter of allReduce. (#11)
  There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies
* Enable the previous allreduce implementation. It is better in both perf and stability (#12)
* delay AllReduce for Moe for less I/O
* build : clean-up compile warnings
* backend : move most of the meta backend API to ggml-backend-impl.h
* cont : hide unused public API in the implementation
* llama : use llama_device + remove ggml_backend_dev_is_meta()
* ggml-backend : remove unused alloc include
* minor : remove regex include
* ggml : introduce ggml-ext.h for staging new APIs
* rebase fixup
* fix tests
* llama : more robust logic for determining Meta devices (#16)
* llama : more robust logic for determining Meta devices
* cont : fix devs size check
  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* cont : fix log type
  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
  ---------
  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* disable roundtrip for meta backend
* fix arch selection
* Qwen 3.5 support
* fix Gemma 4 MoE
* fix OpenVino, SYCL
* fix test-llama-archs for CPU-only builds
* Fix Qwen 3.5 MoE
* disable meta backend tests for WebGPU
* tests : filter CPU-based devices from the Meta backend tests (#17)
* meta : formatting, naming, indentation (#18)
* formatting : llama-model.cpp
* formatting : ggml-ext.h
* formatting : ggml-backend-meta.cpp
* meta : add TODO
* add documentation
* better error messages
* fix GPT-OSS
---------
Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
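
The Qwen-30B-A3B note in the commit message above (an intermediate dimension of 768 and a granularity of 256 forcing an uneven split) can be checked with a little arithmetic. The sketch below is only an illustration, not code from ggml: the helper names `split_blocks` and `is_even_split`, the choice of two GPUs, and the use of 32 (the Q4_0 block size) as the finer granularity are all assumptions made for the example.

```python
def split_blocks(dim, granularity, n_devices):
    """Hypothetical helper: hand out dim // granularity whole blocks to
    n_devices as evenly as possible and return the rows each device gets."""
    if dim % granularity != 0:
        raise ValueError("dim must be a multiple of granularity")
    n_blocks = dim // granularity
    base, extra = divmod(n_blocks, n_devices)
    # The first `extra` devices take one additional block each.
    return [(base + (1 if i < extra else 0)) * granularity
            for i in range(n_devices)]


def is_even_split(dim, granularity, n_devices):
    """True when every device ends up with the same number of rows."""
    return len(set(split_blocks(dim, granularity, n_devices))) == 1


# Assumed setup: 2 GPUs, dimension 768 as quoted in the commit message.
print(split_blocks(768, 256, 2))  # 3 blocks of 256 cannot be halved
print(split_blocks(768, 32, 2))   # 24 blocks of 32 divide cleanly
```

With a granularity of 256 there are only three blocks to distribute, so two devices cannot receive equal shares. A granularity tied to the quantization block size leaves 24 blocks, which is consistent with the "Decide block size based on tensor quantization type" item, though the actual selection logic in the commit may differ.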

jammm and others added 30 commits November 20, 2023 17:02

readme : update ROCm Windows instructions (ggml-org#4122)

dfc7cd4

* Update README.md
* Update README.md
  Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
---------
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

finetune - update readme to mention llama support only (ggml-org#4148)

0b871f1

stablelm : simplify + speedup generation (ggml-org#4153)

8e672ef

docs : add llama-star arch idea

ff8238f

examples : fix typo in parallel example doc comment (ggml-org#4181)

9d5949f

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

readme : update hot topics

d103d93

Fix incorrect format strings and uninitialized variables. (ggml-org#4133)

55978ce

* Fix incorrect format strings and uninitialized variables.
* Address comments
* Add the missing include statement

readme : use PATH for Windows ROCm (ggml-org#4195)

b35f3d0

* Update README.md to use PATH for Windows ROCm
* Update README.md
* Update README.md

main.swift : fix eos checking (ggml-org#4197)

2568a4b

llama_token_eos(const struct llama_model *) currently receives the variable context, which has type struct llama_context, as its parameter.

convert : fix tensors using grad in some models (ggml-org#4173)

189d684

llama : set metal log callback correctly (ggml-org#4204)

e9c13ff

readme : update hot topics

04814e7

Update docs for yarn_ext_factor <0.0 as unspecified instead of NaN (ggml-org#4189)

3014b54

llama : grammar reserve space in decode_utf8 (ggml-org#4210)

f837c3a

* reserve space for codepoints
* improvement for the appended 0

scripts : Use mmap in torch load (ggml-org#4202)

1ddb52e

* Use mmap in torch load, prefer .bin files when loading
* Revert .bin > .safetensors preference

metal : fix yarn (ggml-org#4220)

22da055

get the correct n_orig_ctx in metal

Fix GPT2 not loading due to graph too small

a6eb9b8

explore quiet mode

bffa781

trigger quiet mode when selecting remotetunnel

2f51a6a

readme : update hot topics

9656026

lookahead : support -n -1 infinite generation

3e73d31

ggml : fix -Warray-bounds warning with gcc (ggml-org#4231)

f3b2698

updated lite

ec1796b

Merge branch 'master' into concedo_experimental

8acd7be

# Conflicts:
#   Makefile
#   README.md

reduce max ctx to fit instead of crashing

0e5f16d

kasumi-1 and others added 14 commits November 27, 2023 19:39

readme : add Amica to UI list (ggml-org#4230)

0dab8cd

cmake : fix issue with version info not getting baked into LlamaConfig.cmake (ggml-org#3970)

b38a16d

* Split CPP generation from build-info query
* Remove blank lines
* Add BUILD_SHARED_LIBS option

ggml : re-enable BLAS for CPU when src0 != F32 + remove redundant full offload checks in llama.cpp (ggml-org#4240)

8406b09

* ggml : use blas even if src0 is not F32
* llama : use n_threads_batch only when n_tokens >= 32 ggml-ci
* llama : revert n_threads_batch logic ggml-ci

show more info about available APIs

d2ef458

ggml : restore abort() in GGML_ASSERT (ggml-org#4242)

64e64aa

Allocate a small amount of extra context for GGUF to deal with KV fragmentation causing issues in some scenarios.

ba5c333

Merge branch 'master' into concedo_experimental

581021a

# Conflicts:
#   .github/workflows/build.yml
#   CMakeLists.txt
#   README.md
#   scripts/build-info.cmake

added a proper quiet mode

b75152e

refined multiuser mode

66ef4a2

readme : add FreeChat (ggml-org#4248)

4fea342

examples : add readme files

1f5cd83

updated docs, shifted kv extra space to be subtracted from user's ctx value instead of added on load.

a012342

Merge branch 'master' into concedo_experimental

e9724cd

# Conflicts:
#   README.md

fixed chub ai imports (+1 squashed commits)

a195cde

Squashed commits: [cdb74264] fixed chub ai imports

Nexesenex marked this pull request as ready for review December 1, 2023 07:20

Nexesenex merged this pull request into Nexesenex:concedo_exp_llamaster_up Dec 1, 2023

Nexesenex pushed a commit that referenced this pull request Jul 1, 2025

Merge pull request #18 from esolithe/concedo_experimental

c772c59

Concedo experimental


Concedo experimental #18
Nexesenex merged 44 commits into Nexesenex:concedo_exp_llamaster_up from LostRuins:concedo_experimental

Nexesenex commented Dec 1, 2023

16 participants
