test: fix flaky distributed vector build results test by myandpr · Pull Request #6268 · lance-format/lance

myandpr · 2026-03-24T08:29:38Z

Summary

This fixes a flaky regression test added in #6220 (b80fbb3231cf58dd50e5670f9c56d309999bbd73).

The affected test is:

test_distributed_vector_build_commits_multiple_segments_and_preserves_query_results

Recent failures showed up in both of these runs:

https://github.com/lance-format/lance/actions/runs/23450834811 (main at 244c721504c6ef0b4c2f9700a342509976898d6e)
https://github.com/lance-format/lance/actions/runs/23460892697 (fix: preserve row last-updated metadata for rewrite-columns updates #6263)

In those failures, different platforms / index variants failed:

linux-arm / case_1_ivf_flat on main
linux-build / case_2_ivf_pq on fix: preserve row last-updated metadata for rewrite-columns updates #6263

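Failures on unrelated branches, on different platforms, and for different index variants point to a pre-existing flaky test rather than a regression in either change.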

Root Cause

The test compared the exact Top-K _rowid results between:

a single-segment index build, and
a distributed multi-segment index build

However, the query path used by the test is ANN (ANNIVFPartition) under the default probing behavior. With partial probing, the candidate set can differ slightly between single-segment and multi-segment layouts, especially near the tail of Top-K. That makes exact _rowid equality too strict for this test and causes intermittent failures.

Fix

Make the test probe all IVF partitions before comparing Top-K row ids:

add .minimum_nprobes(TWO_FRAG_NUM_PARTITIONS) to the test query

This keeps the existing strong assertion (ids_single == ids_split) but removes the probing-related source of nondeterminism.
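
To illustrate why full probing makes the comparison deterministic, here is a minimal, self-contained sketch. It is a toy 1-D IVF model and not Lance's implementation: `ToyIvf`, `build`, `search`, and `sample_data` are hypothetical names, and the two centroid layouts merely stand in for the single-segment and multi-segment builds. With one probe, the two layouts return different Top-K row ids; when every partition is probed, both match the exact result.

```rust
/// Toy 1-D IVF index (hypothetical, for illustration only).
/// Each partition holds the (row_id, value) pairs closest to its centroid.
struct ToyIvf {
    centroids: Vec<f32>,
    partitions: Vec<Vec<(u64, f32)>>,
}

/// Assign every row to the partition with the nearest centroid.
fn build(data: &[(u64, f32)], centroids: &[f32]) -> ToyIvf {
    let mut partitions = vec![Vec::new(); centroids.len()];
    for &(id, v) in data {
        let nearest = (0..centroids.len())
            .min_by(|&a, &b| (v - centroids[a]).abs().total_cmp(&(v - centroids[b]).abs()))
            .expect("at least one centroid");
        partitions[nearest].push((id, v));
    }
    ToyIvf { centroids: centroids.to_vec(), partitions }
}

/// Return the Top-K row ids, scanning only the `nprobes` closest partitions.
fn search(index: &ToyIvf, query: f32, k: usize, nprobes: usize) -> Vec<u64> {
    // Rank partitions by centroid distance to the query.
    let mut order: Vec<usize> = (0..index.centroids.len()).collect();
    order.sort_by(|&a, &b| {
        (query - index.centroids[a]).abs().total_cmp(&(query - index.centroids[b]).abs())
    });
    // Only rows in the probed partitions become candidates.
    let mut candidates: Vec<(f32, u64)> = Vec::new();
    for &p in order.iter().take(nprobes) {
        for &(id, v) in &index.partitions[p] {
            candidates.push(((v - query).abs(), id));
        }
    }
    // Sort by distance, breaking ties by row id.
    candidates.sort_by(|a, b| a.0.total_cmp(&b.0).then(a.1.cmp(&b.1)));
    candidates.into_iter().take(k).map(|(_, id)| id).collect()
}

/// Eight rows whose value equals their row id.
fn sample_data() -> Vec<(u64, f32)> {
    (0..8u64).map(|i| (i, i as f32)).collect()
}

fn main() {
    let data = sample_data();
    // Two layouts over the same rows, standing in for the two index builds.
    let layout_a = build(&data, &[1.5, 5.5]);
    let layout_b = build(&data, &[0.5, 4.5]);
    let (query, k) = (2.9_f32, 3);

    // Partial probing: the candidate sets differ, so the Top-K can differ.
    let partial_a = search(&layout_a, query, k, 1);
    let partial_b = search(&layout_b, query, k, 1);
    println!("nprobes=1: {:?} vs {:?}", partial_a, partial_b);
    assert_ne!(partial_a, partial_b);

    // Full probing: every row is a candidate, so both layouts agree.
    let full_a = search(&layout_a, query, k, 2);
    let full_b = search(&layout_b, query, k, 2);
    println!("nprobes=2: {:?} vs {:?}", full_a, full_b);
    assert_eq!(full_a, full_b);
}
```

In this sketch, passing the partition count as `nprobes` plays the role that `.minimum_nprobes(TWO_FRAG_NUM_PARTITIONS)` plays in the test query.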

Testing

Local verification:

export PROTOC=/opt/homebrew/opt/protobuf@3/bin/protoc
cargo test -p lance test_distributed_vector_build_commits_multiple_segments_and_preserves_query_results -- --nocapture

Observed result:

case_1_ivf_flat ... ok
case_2_ivf_pq ... ok
case_3_ivf_sq ... ok

I also verified during debugging that with full probing enabled, repeated runs of the previously flaky ivf_flat / ivf_pq cases became stable.

github-actions · 2026-03-24T08:30:35Z

Review

LGTM. Clean one-line fix for test flakiness.

Setting minimum_nprobes(TWO_FRAG_NUM_PARTITIONS) (which equals the total number of IVF partitions) ensures all partitions are probed during the query, making Top-K results deterministic and eliminating the source of flakiness. The fix is applied inside collect_row_ids so it affects both the single and split dataset queries symmetrically.

No issues found.

codecov · 2026-03-24T09:01:21Z

Codecov Report

✅ All modified and coverable lines are covered by tests.


myandpr · 2026-03-24T09:45:00Z

Hey @Xuanwo @wjones127, could you take a look at this PR when you have a chance?

This makes the distributed vector build test probe all IVF partitions before comparing Top-K row ids, which should stabilize the test and unblock CI.

Xuanwo

Thank you!

myandpr · 2026-03-24T12:02:23Z

@Xuanwo Thanks for the review — appreciate it!🙌


test: stabilize distributed vector build query results

0923903

github-actions Bot added the chore label Mar 24, 2026

myandpr mentioned this pull request Mar 24, 2026

fix: preserve row last-updated metadata for rewrite-columns updates #6263

Open

myandpr changed the title ~~test: stabilize distributed vector build query results~~ test: fix flaky distributed vector build query results test Mar 24, 2026

myandpr changed the title ~~test: fix flaky distributed vector build query results test~~ test: fix flaky distributed vector build results test Mar 24, 2026

Xuanwo approved these changes Mar 24, 2026


Xuanwo merged commit 30cad43 into lance-format:main Mar 24, 2026
42 checks passed

github-actions Bot mentioned this pull request Mar 24, 2026

test: stabilize distributed IVF grouped build query test #6281

Merged


Xuanwo merged 1 commit into lance-format:main from myandpr:fix-flaky-distributed-vector-test
