Work around for recalculating logits in cached prompts (Fixes #1585) #1609

Merged
DannyDaemonic merged 3 commits into ggml-org:master from DannyDaemonic:cached-logits-bandaid
May 29, 2023
Conversation

@DannyDaemonic
Contributor

This code just checks whether we've used up all of the given prompt but still have leftover tokens from the cache. In that case it leaves the last token in embd so it will be re-evaluated.

Evaluating one token isn't the end of the world, but it would be nice if it were faster. What I really wanted to do was something like this:

    llama_eval(ctx, embd.data(), 0, n_past - 1, params.n_threads);

But you can't eval 0 tokens.

I tried to edit llama_eval_internal to skip over the input calculations where n_past (N) is 0, but it's far more tightly integrated than I expected. Perhaps there was something to #1281 after all. I think it would be nicer if llama_eval just supported 0 tokens, but it would also be nice to have a separate API call that evaluates the logits at a given position n_past.

This API call could also be used for implementing the mockup in #1608 where you can click on a token and see alternatives.


@github-actions bot left a comment


clang-tidy made some suggestions

Comment thread on examples/main/main.cpp (outdated)
Comment thread on examples/main/main.cpp
@KerfuffleV2
Contributor

I can no longer reproduce the problem using the steps described here: #1550 (comment)

Also, as far as I can see prompt caching still works.

I don't know enough about the code to actually review this pull, but I can confirm that from a user's perspective it seems to fix the problem and doesn't break other stuff.

@SlyEcho
Contributor

SlyEcho commented May 29, 2023

Yeah, I discovered the same "fix" for #1570

@SlyEcho
Contributor

SlyEcho commented May 29, 2023

Merge in the latest master to fix the build errors.

@DannyDaemonic
Contributor Author

@SlyEcho Thanks, that fixed the CI errors. For some reason I thought I could just wait until things were fixed and then rerun the checks.

@DannyDaemonic DannyDaemonic merged commit 2483676 into ggml-org:master May 29, 2023
@DannyDaemonic DannyDaemonic deleted the cached-logits-bandaid branch May 29, 2023 12:13
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
…g#1585) (ggml-org#1609)

* Work around for recalculating logits in cached prompts
phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request Apr 28, 2026
…g#1585) (ggml-org#1609)

* Work around for recalculating logits in cached prompts
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026
…g#1585) (ggml-org#1609)

* Work around for recalculating logits in cached prompts