test: fix flaky distributed vector build results test #6268

Merged. Xuanwo merged 1 commit into lance-format:main from myandpr:fix-flaky-distributed-vector-test on Mar 24, 2026.
myandpr (Contributor) commented Mar 24, 2026

Summary

This fixes a flaky regression test added in #6220 (b80fbb3231cf58dd50e5670f9c56d309999bbd73).

The affected test is:

  • test_distributed_vector_build_commits_multiple_segments_and_preserves_query_results

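Recent failures showed up in both of these runs:

  • https://github.com/lance-format/lance/actions/runs/23450834811 (main at 244c721504c6ef0b4c2f9700a342509976898d6e)
  • https://github.com/lance-format/lance/actions/runs/23460892697 (#6263)

In those failures, different platforms / index variants failed:

  • linux-arm / case_1_ivf_flat on main
  • linux-build / case_2_ivf_pq on #6263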

That points to an existing flaky test.

Root Cause

The test compared the exact Top-K _rowid results between:

  • a single-segment index build, and
  • a distributed multi-segment index build

However, the query path used by the test is ANN (ANNIVFPartition) under the default probing behavior. With partial probing, the candidate set can differ slightly between single-segment and multi-segment layouts, especially near the tail of Top-K. That makes exact _rowid equality too strict for this test and causes intermittent failures.
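
To make the root cause concrete, here is a minimal, self-contained sketch of partial versus full probing on a toy 1-D IVF index. It is not lance's implementation: `ToyIvf`, `nearest_order`, and the sample data are hypothetical, and it assumes the two layouts end up with different partition boundaries, which may not match how lance actually lays out segments.

```rust
/// A toy IVF index over 1-D points (hypothetical, for illustration only).
/// Each partition holds the (row_id, value) pairs assigned to its centroid.
struct ToyIvf {
    centroids: Vec<f32>,
    partitions: Vec<Vec<(u64, f32)>>,
}

/// Indices of `centroids` sorted by distance to `x`, nearest first.
fn nearest_order(centroids: &[f32], x: f32) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..centroids.len()).collect();
    idx.sort_by(|&a, &b| {
        (centroids[a] - x).abs().total_cmp(&(centroids[b] - x).abs())
    });
    idx
}

impl ToyIvf {
    /// Assign every row to the partition of its nearest centroid.
    fn build(centroids: Vec<f32>, rows: &[(u64, f32)]) -> Self {
        let mut partitions: Vec<Vec<(u64, f32)>> = vec![Vec::new(); centroids.len()];
        for &(id, value) in rows {
            let p = nearest_order(&centroids, value)[0];
            partitions[p].push((id, value));
        }
        Self { centroids, partitions }
    }

    /// Probe the `nprobes` partitions closest to `query` and return the
    /// row ids of the `k` nearest candidates found in those partitions.
    fn top_k(&self, query: f32, k: usize, nprobes: usize) -> Vec<u64> {
        let mut candidates: Vec<(u64, f32)> = nearest_order(&self.centroids, query)
            .into_iter()
            .take(nprobes)
            .flat_map(|p| self.partitions[p].iter().copied())
            .collect();
        // Sort by distance to the query, breaking ties by row id.
        candidates.sort_by(|a, b| {
            (a.1 - query)
                .abs()
                .total_cmp(&(b.1 - query).abs())
                .then(a.0.cmp(&b.0))
        });
        candidates.into_iter().take(k).map(|(id, _)| id).collect()
    }
}
```

With rows 0.0 through 5.0 (row ids 0 to 5), query 2.4, and k = 3, centroids [1.0, 4.0] return ids [2, 1, 0] at nprobes = 1, while centroids [0.5, 3.0] return [2, 3, 4]. At nprobes = 2 every partition is probed and both layouts return [2, 3, 1].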

Fix

Make the test probe all IVF partitions before comparing Top-K row ids:

  • add .minimum_nprobes(TWO_FRAG_NUM_PARTITIONS) to the test query

This keeps the existing strong assertion (ids_single == ids_split) but removes the probing-related source of nondeterminism.
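
As a rough model of why this works, the sketch below treats the minimum probe count as a floor capped at the partition count, so setting it to the partition count makes the query exhaustive. `ToyQuery`, its assumed default of one probe, and the `NUM_PARTITIONS` value are all hypothetical stand-ins; only the `.minimum_nprobes(...)` call name comes from this PR, and lance's real probing logic may differ.

```rust
/// Hypothetical partition count for this sketch. The real value of
/// `TWO_FRAG_NUM_PARTITIONS` is not shown in this conversation.
const NUM_PARTITIONS: usize = 4;

/// Toy query settings; a stand-in, not lance's actual scanner type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ToyQuery {
    min_nprobes: usize,
}

impl ToyQuery {
    /// Start from an assumed default of one probed partition.
    fn new() -> Self {
        Self { min_nprobes: 1 }
    }

    /// Builder-style setter named after the call added in the PR.
    fn minimum_nprobes(mut self, n: usize) -> Self {
        self.min_nprobes = n;
        self
    }

    /// Partitions probed for an index with `num_partitions` partitions:
    /// the requested minimum, capped at what the index actually has.
    fn probed(&self, num_partitions: usize) -> usize {
        self.min_nprobes.min(num_partitions)
    }

    /// True when every partition is probed, i.e. the search is exhaustive.
    fn is_exhaustive(&self, num_partitions: usize) -> bool {
        self.probed(num_partitions) == num_partitions
    }
}
```

Under these assumptions, `ToyQuery::new()` is not exhaustive for a four-partition index, while `ToyQuery::new().minimum_nprobes(NUM_PARTITIONS)` is, which is the property the exact-equality assertion relies on.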

Testing

Local verification:

  • export PROTOC=/opt/homebrew/opt/protobuf@3/bin/protoc
  • cargo test -p lance test_distributed_vector_build_commits_multiple_segments_and_preserves_query_results -- --nocapture

Observed result:

  • case_1_ivf_flat ... ok
  • case_2_ivf_pq ... ok
  • case_3_ivf_sq ... ok

I also verified during debugging that with full probing enabled, repeated runs of the previously flaky ivf_flat / ivf_pq cases became stable.
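
For anyone who wants to repeat the stability check, a small harness along these lines could rerun the test and count failures. This is a sketch, not the exact procedure used here: `count_failures` and `cargo_test_passes` are hypothetical helpers, and the code assumes `cargo` is on `PATH` and the working directory is a lance checkout.

```rust
use std::process::Command;

/// Call `run_once` `runs` times and count how many calls report failure.
fn count_failures<F: FnMut() -> bool>(runs: usize, mut run_once: F) -> usize {
    (0..runs).filter(|_| !run_once()).count()
}

/// Run the cargo command from the Testing section once with the given
/// test-name filter. Returns true only if the process exits successfully;
/// a failure to launch cargo is also reported as false.
fn cargo_test_passes(filter: &str) -> bool {
    Command::new("cargo")
        .args(["test", "-p", "lance", filter, "--", "--nocapture"])
        .status()
        .map(|status| status.success())
        .unwrap_or(false)
}
```

A call such as `count_failures(20, || cargo_test_passes("test_distributed_vector_build_commits_multiple_segments_and_preserves_query_results"))` would then return the number of failing runs; the run count of 20 is arbitrary.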

github-actions bot added the chore label Mar 24, 2026

github-actions bot (Contributor) commented:

Review

LGTM. Clean one-line fix for test flakiness.

Setting minimum_nprobes(TWO_FRAG_NUM_PARTITIONS) (which equals the total number of IVF partitions) ensures all partitions are probed during the query, making Top-K results deterministic and eliminating the source of flakiness. The fix is applied inside collect_row_ids so it affects both the single and split dataset queries symmetrically.

No issues found.

codecov bot commented Mar 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


myandpr changed the title from "test: stabilize distributed vector build query results" to "test: fix flaky distributed vector build query results test" Mar 24, 2026

myandpr changed the title from "test: fix flaky distributed vector build query results test" to "test: fix flaky distributed vector build results test" Mar 24, 2026

myandpr (Contributor, Author) commented Mar 24, 2026

Hey @Xuanwo @wjones127, could you take a look at this PR when you have a chance?

This makes the distributed vector build test probe all IVF partitions before comparing Top-K row ids, which should stabilize the test and unblock CI.

Xuanwo (Collaborator) left a comment:

Thank you!

Xuanwo merged commit 30cad43 into lance-format:main on Mar 24, 2026. 42 checks passed.

myandpr (Contributor, Author) commented Mar 24, 2026

@Xuanwo Thanks for the review — appreciate it!🙌

westonpace pushed a commit that referenced this pull request Mar 24, 2026
## Summary
This fixes a flaky regression test added in #6220
(`b80fbb3231cf58dd50e5670f9c56d309999bbd73`).

The affected test is:
- `test_distributed_vector_build_commits_multiple_segments_and_preserves_query_results`

Recent failures showed up in both of these runs:
- https://github.com/lance-format/lance/actions/runs/23450834811
  `main` at `244c721504c6ef0b4c2f9700a342509976898d6e`
- https://github.com/lance-format/lance/actions/runs/23460892697
  #6263

In those failures, different platforms / index variants failed:
- `linux-arm / case_1_ivf_flat` on `main`
- `linux-build / case_2_ivf_pq` on #6263

That points to an existing flaky test.

## Root Cause
The test compared the exact Top-K `_rowid` results between:
- a single-segment index build, and
- a distributed multi-segment index build

However, the query path used by the test is ANN (`ANNIVFPartition`)
under the default probing behavior. With partial probing, the candidate
set can differ slightly between single-segment and multi-segment
layouts, especially near the tail of Top-K. That makes exact `_rowid`
equality too strict for this test and causes intermittent failures.

## Fix
Make the test probe all IVF partitions before comparing Top-K row ids:
- add `.minimum_nprobes(TWO_FRAG_NUM_PARTITIONS)` to the test query

This keeps the existing strong assertion (`ids_single == ids_split`) but
removes the probing-related source of nondeterminism.

## Testing
Local verification:
- `export PROTOC=/opt/homebrew/opt/protobuf@3/bin/protoc`
- `cargo test -p lance test_distributed_vector_build_commits_multiple_segments_and_preserves_query_results -- --nocapture`

Observed result:
- `case_1_ivf_flat ... ok`
- `case_2_ivf_pq ... ok`
- `case_3_ivf_sq ... ok`

I also verified during debugging that with full probing enabled,
repeated runs of the previously flaky `ivf_flat` / `ivf_pq` cases became
stable.
wjones127 pushed a commit to wjones127/lance that referenced this pull request Mar 29, 2026