fix: #3722 - Optimize LSM Vector Index fallback scan to target specific bucket instead of entire type #3775
lekmaneb wants to merge 1 commit into ArcadeData:main
Code Review
This pull request refactors the vector index rebuilding logic to use bucket-based operations instead of type-based operations. The changes replace type-name lookups with bucket ID lookups for counting and scanning documents. Review feedback suggests removing a condition that could prevent recovery during total page corruption and optimizing the document count retrieval by calling the bucket object directly while ensuring a null check is performed.
| boolean documentScanPerformed = false; | ||
| final String typeName = getTypeName(); | ||
| if (typeName != null && !ridToLatestVector.isEmpty()) { | ||
| if (metadata.associatedBucketId != -1 && !ridToLatestVector.isEmpty()) { |
The condition !ridToLatestVector.isEmpty() prevents the fallback mechanism from triggering if the page parser fails to recover any vectors at all (e.g., due to severe corruption). Since the docCount check already handles the case where the bucket is empty, this extra check is unnecessary and prevents recovery in cases of total page corruption.
Suggested change:
- if (metadata.associatedBucketId != -1 && !ridToLatestVector.isEmpty()) {
+ if (metadata.associatedBucketId != -1) {
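To make the effect of the extra guard concrete, here is a minimal, self-contained sketch of the trigger decision. The class and method names (`FallbackTrigger`, `guardedTrigger`, `unguardedTrigger`) are hypothetical and not taken from LSMVectorIndex; only the 8/10 threshold and the emptiness check mirror the snippet under review.

```java
// Hypothetical sketch: the names below are illustrative, not ArcadeDB API.
final class FallbackTrigger {
  // Mirrors the reviewed condition: needs at least one page-parsed vector.
  static boolean guardedTrigger(final int parsedVectors, final long docCount) {
    return parsedVectors > 0 && parsedVectors < docCount * 8 / 10;
  }

  // Mirrors the suggested condition: the deficit check alone decides.
  static boolean unguardedTrigger(final int parsedVectors, final long docCount) {
    return parsedVectors < docCount * 8 / 10;
  }

  public static void main(final String[] args) {
    // Total page corruption: nothing parsed, 100 documents in the bucket.
    System.out.println(guardedTrigger(0, 100));   // false: fallback is skipped
    System.out.println(unguardedTrigger(0, 100)); // true: fallback runs
    // Empty bucket: the deficit check is already false without the guard.
    System.out.println(unguardedTrigger(0, 0));   // false
  }
}
```

Under these assumptions the guard only changes the outcome when nothing was parsed from a non-empty bucket, which is the case the reviewer wants the fallback to cover.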
| final com.arcadedb.engine.Bucket bucket = database.getSchema().getBucketById(metadata.associatedBucketId); | ||
| final long docCount = database.countBucket(bucket.getName()); | ||
| if (ridToLatestVector.size() < docCount * 8 / 10) { |
Using database.countBucket(bucket.getName()) is less efficient than calling bucket.count() directly on the bucket object. Also, there is a risk of NullPointerException if bucket is null. While the surrounding try-catch block handles exceptions, a null check is preferred for clarity and to avoid unnecessary exception overhead.
Suggested change:
- final com.arcadedb.engine.Bucket bucket = database.getSchema().getBucketById(metadata.associatedBucketId);
- final long docCount = database.countBucket(bucket.getName());
- if (ridToLatestVector.size() < docCount * 8 / 10) {
+ final com.arcadedb.engine.Bucket bucket = database.getSchema().getBucketById(metadata.associatedBucketId);
+ final long docCount = bucket != null ? bucket.count() : 0;
+ if (bucket != null && ridToLatestVector.size() < docCount * 8 / 10) {
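The null-safe pattern in this suggestion can be tried in isolation. The sketch below uses a hypothetical `Bucket` interface and a plain map in place of the schema lookup; neither is the ArcadeDB type, and they only imitate the shape of the reviewed code.

```java
import java.util.Map;

// Hypothetical stand-ins; they only imitate the shape of the reviewed code.
final class NullSafeBucketCount {
  interface Bucket {
    long count();
  }

  // Returns 0 when the id resolves to no bucket, so a deficit check of the
  // form (parsed < docCount * 8 / 10) is false and no exception is thrown.
  static long countOrZero(final Map<Integer, Bucket> bucketsById, final int bucketId) {
    final Bucket bucket = bucketsById.get(bucketId);
    return bucket != null ? bucket.count() : 0L;
  }

  public static void main(final String[] args) {
    final Bucket indexed = () -> 42L;
    final Map<Integer, Bucket> buckets = Map.of(7, indexed);
    System.out.println(countOrZero(buckets, 7)); // 42
    System.out.println(countOrZero(buckets, 9)); // 0, unknown bucket id
  }
}
```

Resolving a missing bucket to a count of 0 keeps the control flow explicit instead of relying on the surrounding try-catch to absorb a NullPointerException.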
Pull request overview
This PR fixes LSMVectorIndex’s “page-parse deficit” fallback path so it only counts/scans records from the bucket actually associated with the vector index (instead of scanning all buckets of a type), addressing incorrect counts and unnecessary scanning in multi-bucket types (issue #3722).
Changes:
- Resolve the target bucket via metadata.associatedBucketId rather than getTypeName().
- Replace database.countType(...) with database.countBucket(...) for the cross-check heuristic.
- Replace database.scanType(...) with database.scanBucket(...) for the fallback recovery scan.
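A small worked example may help show why the type-wide count skews the cross-check. The bucket names and document counts below are invented for illustration, and `CountScopeDemo` is a hypothetical class; only the 8/10 threshold comes from the diff.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical numbers: one type with two buckets, index bound to "Doc_0".
final class CountScopeDemo {
  // Same shape as the heuristic in the diff: fewer than 80% parsed is a deficit.
  static boolean deficit(final int parsedVectors, final long docCount) {
    return parsedVectors < docCount * 8 / 10;
  }

  // Type-wide count: sums every bucket of the type.
  static long typeWideCount(final Map<String, Long> docsPerBucket) {
    return docsPerBucket.values().stream().mapToLong(Long::longValue).sum();
  }

  public static void main(final String[] args) {
    final Map<String, Long> docsPerBucket = new LinkedHashMap<>();
    docsPerBucket.put("Doc_0", 100L); // bucket associated with the index
    docsPerBucket.put("Doc_1", 900L); // other bucket of the same type

    final int parsed = 95; // vectors recovered by the page parser

    // 95 < 1000 * 8 / 10 = 800, so the type-wide check reports a deficit.
    System.out.println(deficit(parsed, typeWideCount(docsPerBucket))); // true
    // 95 < 100 * 8 / 10 = 80 is false, so the bucket-scoped check does not.
    System.out.println(deficit(parsed, docsPerBucket.get("Doc_0")));   // false
  }
}
```

With these numbers the index is 95% complete for its own bucket, yet the type-wide count would still trigger a full fallback scan.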
| if (metadata.associatedBucketId != -1 && !ridToLatestVector.isEmpty()) { | ||
| try { | ||
| final long docCount = database.countType(typeName, false); | ||
| final com.arcadedb.engine.Bucket bucket = database.getSchema().getBucketById(metadata.associatedBucketId); | ||
| final long docCount = database.countBucket(bucket.getName()); | ||
| if (ridToLatestVector.size() < docCount * 8 / 10) { |
The change scopes the fallback cross-check/scan to metadata.associatedBucketId, which fixes the multi-bucket type behavior described in #3722, but there’s no regression test covering the scenario where a type has multiple buckets and the vector index is bound to only one bucket. Please add a test that creates a type with 2+ buckets, builds a bucket-scoped LSM vector index, triggers the page-parser-missed-vectors fallback, and asserts only records from the associated bucket are scanned/used (records from the other bucket must not affect the docCount heuristic or be added to ridToLatestVector).
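The assertions such a regression test would make can be sketched without the database. The following is plain Java over an in-memory list, not an ArcadeDB test: `Rec`, `scanBucket`, and the RID formatting are hypothetical stand-ins, and a real test would go through the project's own test harness and index API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of a two-bucket type; not ArcadeDB API.
final class ScopedScanCheck {
  static final class Rec {
    final int bucketId;
    final long position;
    final float[] vector;

    Rec(final int bucketId, final long position, final float[] vector) {
      this.bucketId = bucketId;
      this.position = position;
      this.vector = vector;
    }

    String rid() {
      return "#" + bucketId + ":" + position;
    }
  }

  // Stand-in for a bucket-scoped fallback scan: keeps only records whose
  // bucket id matches the bucket associated with the index.
  static Map<String, float[]> scanBucket(final List<Rec> all, final int associatedBucketId) {
    final Map<String, float[]> ridToVector = new HashMap<>();
    for (final Rec r : all)
      if (r.bucketId == associatedBucketId)
        ridToVector.put(r.rid(), r.vector);
    return ridToVector;
  }

  public static void main(final String[] args) {
    final List<Rec> all = new ArrayList<>();
    all.add(new Rec(1, 0, new float[] { 0.1f, 0.2f }));
    all.add(new Rec(1, 1, new float[] { 0.3f, 0.4f }));
    all.add(new Rec(2, 0, new float[] { 0.5f, 0.6f })); // other bucket

    final Map<String, float[]> recovered = scanBucket(all, 1);
    System.out.println(recovered.size());              // 2
    System.out.println(recovered.containsKey("#2:0")); // false
  }
}
```

The two checks at the end correspond to the requested assertions: the recovered map contains every record of the associated bucket and none from the other bucket.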
|
Claude code : to the 🗑️ Gemini CLI : to the 🔝 |
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #3775 +/- ##
==========================================
- Coverage 65.17% 65.02% -0.16%
==========================================
Files 1580 1580
Lines 116263 116275 +12
Branches 24658 24659 +1
==========================================
- Hits 75775 75606 -169
- Misses 30193 30355 +162
- Partials 10295 10314 +19
Bumps the github-actions group with 3 updates: [anthropics/claude-code-action](https://github.com/anthropics/claude-code-action) (1.0.76 to 1.0.82), [github/codeql-action](https://github.com/github/codeql-action) (4.34.1 to 4.35.1) and [zgosalvez/github-actions-ensure-sha-pinned-actions](https://github.com/zgosalvez/github-actions-ensure-sha-pinned-actions) (5.0.3 to 5.0.4).
…build is scoped to the specific bucket associated with the vector index, not the entire type.
Based on PR ArcadeData#3775 by @lekmaneb. Fixes issue ArcadeData#3722.
Fix: Scoped vector index fallback scan to specific bucket (#3722)
Description
This PR fixes an issue in LSMVectorIndex where the fallback mechanism for recovering missing vectors would incorrectly scan all documents of a given type rather than restricting the scan to the specific bucket associated with the index.

Previously, when the number of page-parsed vectors fell significantly short of the document count, the fallback relied on database.countType() and database.scanType(). If a type contained multiple buckets, this resulted in inaccurate counts and unnecessary scanning of documents outside the index's scope.

The implementation now uses metadata.associatedBucketId to retrieve the target bucket, replacing type-wide operations with database.countBucket() and database.scanBucket().

Changes Made
- Replaced getTypeName() usage with metadata.associatedBucketId to resolve the specific target bucket.
- Used database.countBucket(bucket.getName()) instead of database.countType().
- Used database.scanBucket(bucket.getName(), ...) to ensure only documents within the indexed bucket are scanned for missing vectors.

Related Issues
by Gemini