feat: clearer progress reporting for IVF #6126

Merged
westonpace merged 2 commits into lance-format:main from wkalt:feat/index-build-progress-improvements
Mar 11, 2026
@wkalt
Contributor

@wkalt wkalt commented Mar 7, 2026

This breaks the "build_partitions" stage into "build_partitions" and "merge_partitions", and also updates the progress reporting on the shuffle phase to be in terms of rows instead of batches.
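One plausible reason to report shuffle progress in rows is that batches can differ in size, so a batch count may not track the amount of data processed. The sketch below illustrates the difference. `ShuffleProgress` and its methods are hypothetical names for illustration, not part of Lance's progress API, and the uneven batch sizes are an assumption.

```rust
/// Hypothetical row-based tracker; names are illustrative, not Lance's API.
#[derive(Debug)]
struct ShuffleProgress {
    total_rows: u64,
    completed_rows: u64,
    batches_seen: u64,
}

impl ShuffleProgress {
    fn new(total_rows: u64) -> Self {
        Self { total_rows, completed_rows: 0, batches_seen: 0 }
    }

    /// Record one shuffled batch by its row count rather than as "one batch".
    fn record_batch(&mut self, num_rows: u64) {
        self.batches_seen += 1;
        self.completed_rows = self.completed_rows.saturating_add(num_rows);
    }

    /// Fraction of rows completed, clamped to [0, 1]; 0 when the total is 0.
    fn row_fraction(&self) -> f64 {
        if self.total_rows == 0 {
            return 0.0;
        }
        (self.completed_rows as f64 / self.total_rows as f64).min(1.0)
    }

    /// What a batch-count report would show for an expected number of batches.
    fn batch_fraction(&self, expected_batches: u64) -> f64 {
        if expected_batches == 0 {
            return 0.0;
        }
        (self.batches_seen as f64 / expected_batches as f64).min(1.0)
    }
}
```

With three batches of 900, 50, and 50 rows, the first batch alone moves the row-based fraction to 0.9, while a batch-based report would show one third.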

github-actions bot added the enhancement (New feature or request) label Mar 7, 2026
@wkalt
Contributor Author

wkalt commented Mar 7, 2026

Here is an example of the progress output this change makes possible:

{"timestamp":"2026-03-07T14:45:41.136840Z","level":"INFO","fields":{"message":"index build progress","stage":"shuffle","event":"progress","total":"20000000","completed":"2630729","unit":"rows","pct":"13.2","rss_bytes":2835185664,"vss_bytes":4734652416},"target":"indexer::progress"}
{"timestamp":"2026-03-07T14:45:42.137026Z","level":"INFO","fields":{"message":"index build progress","stage":"shuffle","event":"progress","total":"20000000","completed":"2797961","unit":"rows","pct":"14.0","rss_bytes":3005493248,"vss_bytes":4744089600},"target":"indexer::progress"}
{"timestamp":"2026-03-07T14:45:43.137223Z","level":"INFO","fields":{"message":"index build progress","stage":"shuffle","event":"progress","total":"20000000","completed":"2897961","unit":"rows","pct":"14.5","rss_bytes":2757406720,"vss_bytes":4748283904},"target":"indexer::progress"}
{"timestamp":"2026-03-07T14:45:44.137407Z","level":"INFO","fields":{"message":"index build progress","stage":"shuffle","event":"progress","total":"20000000","completed":"3030729","unit":"rows","pct":"15.2","rss_bytes":2792321024,"vss_bytes":4750905344},"target":"indexer::progress"}
{"timestamp":"2026-03-07T14:45:45.137582Z","level":"INFO","fields":{"message":"index build progress","stage":"shuffle","event":"progress","total":"20000000","completed":"3195922","unit":"rows","pct":"16.0","rss_bytes":2812579840,"vss_bytes":4750905344},"target":"indexer::progress"}

@github-actions
Contributor

github-actions Bot commented Mar 7, 2026

PR Review

Clean, well-scoped change. A few minor observations:

1. build_partitions stage now has no progress updates (low severity)

After the split, the build_partitions stage starts and immediately completes after self.build_partitions().boxed().await? — which just sets up the stream, not doing heavy work. This means it's an effectively instantaneous stage with no stage_progress calls. Consider whether it's worth keeping as a separate stage at all, or if the stage_start/stage_complete pair for build_partitions could just be removed (leaving only merge_partitions which is where the real work happens).

2. num_rows_to_shuffle — minor inefficiency with fragment filtering

In num_rows_to_shuffle, dataset.get_fragments() collects all fragments, then filters. This is fine for typical use but worth noting. The buffer_unordered(16) is a reasonable choice mirroring Dataset::count_all_rows().
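The filter-then-count pattern described in item 2 can be sketched with the standard library alone. This is not the PR's code: `Fragment`, `count_rows`, and the `wanted_ids` filter are hypothetical, and scoped threads processed in chunks of 16 stand in for `buffer_unordered(16)`. Unlike `buffer_unordered`, which keeps up to 16 futures in flight continuously, this version waits for each chunk to finish before starting the next.

```rust
use std::thread;

/// Hypothetical fragment descriptor, for illustration only.
#[derive(Clone, Copy)]
struct Fragment {
    id: u64,
    num_rows: u64,
}

/// Stand-in for an I/O-bound per-fragment row count.
fn count_rows(fragment: &Fragment) -> u64 {
    fragment.num_rows
}

/// Collect all fragments, filter them, then count rows with at most 16
/// counts running at once.
fn num_rows_to_shuffle(fragments: &[Fragment], wanted_ids: Option<&[u64]>) -> u64 {
    // Keep every fragment when no filter is given.
    let selected: Vec<&Fragment> = fragments
        .iter()
        .filter(|f| wanted_ids.map_or(true, |ids| ids.contains(&f.id)))
        .collect();

    let mut total: u64 = 0;
    for chunk in selected.chunks(16) {
        total += thread::scope(|s| {
            // Spawn one scoped thread per fragment in this chunk.
            let handles: Vec<_> = chunk
                .iter()
                .map(|f| s.spawn(move || count_rows(f)))
                .collect();
            // Join them all and sum the per-fragment counts.
            handles
                .into_iter()
                .map(|h| h.join().expect("count thread panicked"))
                .sum::<u64>()
        });
    }
    total
}
```

The resulting total is what a row-based shuffle stage would use as its `total` value.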

Overall this is a clean improvement to progress reporting — switching from batch-count to row-count for shuffle progress and splitting the build/merge phases gives users much better visibility into index build progress.

@codecov

codecov Bot commented Mar 7, 2026

Codecov Report

❌ Patch coverage is 73.91304% with 6 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance/src/index/vector/builder.rs | 72.72% | 1 Missing and 5 partials ⚠️ |

Comment thread rust/lance/src/index/vector/builder.rs Outdated
.stage_start("build_partitions", num_partitions, "partitions")
.await?;
let build_idx_stream = self.build_partitions().boxed().await?;
progress.stage_complete("build_partitions").await?;
Collaborator

This part is a bit tricky because build_partitions only returns a stream. This stage could always finish so quickly that it becomes less useful.

What do you think about this?
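The laziness described here can be illustrated with a small sketch. It uses a standard-library iterator as a stand-in for the async stream, and the function bodies are hypothetical rather than Lance's implementation. Constructing the lazy value does no per-partition work; the work runs only when the consumer drives it.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical stand-in: returns a lazy iterator over partition ids.
/// `work_done` counts how many partitions have actually been built.
fn build_partitions(n: u32, work_done: Rc<Cell<u32>>) -> impl Iterator<Item = u32> {
    (0..n).map(move |id| {
        // This closure runs only when the iterator is consumed.
        work_done.set(work_done.get() + 1);
        id
    })
}

/// Hypothetical consumer: driving the iterator is where the work happens.
fn merge_partitions(partitions: impl Iterator<Item = u32>) -> u32 {
    let mut merged = 0;
    for _id in partitions {
        merged += 1;
    }
    merged
}
```

A stage that only wraps the call to `build_partitions` would therefore finish almost immediately, while the time is spent inside the consumer.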

Contributor Author

Yeah, thanks, I agree. When I ran this I noticed that the stage completed quickly, but I didn't realize it was because the stage only constructs a stream.

Maybe we should just rename build_partitions stage to merge_partitions here? I'm happy with that too; it'll make it clearer where the time is going.
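A rough sketch of what the renamed stage could look like: the stage events wrap the loop that consumes the partitions, so the reported time covers the real work. `RecordingProgress` is a hypothetical synchronous recorder. The `stage_start(name, total, unit)` shape follows the snippet under review, while the `stage_progress` signature is an assumption.

```rust
#[derive(Debug, PartialEq)]
enum Event {
    Start { stage: &'static str, total: u64, unit: &'static str },
    Progress { stage: &'static str, completed: u64 },
    Complete { stage: &'static str },
}

/// Hypothetical recorder that stores events instead of logging them.
#[derive(Default)]
struct RecordingProgress {
    events: Vec<Event>,
}

impl RecordingProgress {
    fn stage_start(&mut self, stage: &'static str, total: u64, unit: &'static str) {
        self.events.push(Event::Start { stage, total, unit });
    }
    fn stage_progress(&mut self, stage: &'static str, completed: u64) {
        self.events.push(Event::Progress { stage, completed });
    }
    fn stage_complete(&mut self, stage: &'static str) {
        self.events.push(Event::Complete { stage });
    }
}

/// Consume lazily built partitions, reporting one progress event per partition.
fn merge_with_progress(
    partitions: impl Iterator<Item = Vec<u32>>,
    num_partitions: u64,
    progress: &mut RecordingProgress,
) -> Vec<u32> {
    progress.stage_start("merge_partitions", num_partitions, "partitions");
    let mut merged = Vec::new();
    let mut done: u64 = 0;
    for part in partitions {
        merged.extend(part);
        done += 1;
        progress.stage_progress("merge_partitions", done);
    }
    progress.stage_complete("merge_partitions");
    merged
}
```

Because progress is reported inside the consuming loop, the stage no longer appears to complete instantly.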

Contributor Author

@Xuanwo updated per ^

Collaborator

Yeah, I think it makes more sense.

wkalt force-pushed the feat/index-build-progress-improvements branch from d5d09e3 to 78d6b6b (March 9, 2026 18:07)
wkalt added 2 commits March 10, 2026 06:48
This breaks the "build_partitions" stage into "build_partitions" and
"merge_partitions", and also updates the progress reporting on the
shuffle phase to be in terms of rows instead of batches.
wkalt force-pushed the feat/index-build-progress-improvements branch from 78d6b6b to 75ba906 (March 10, 2026 13:48)
westonpace merged commit fa64837 into lance-format:main Mar 11, 2026
28 checks passed
Labels

enhancement New feature or request

4 participants