chore: change the FTS benchmark data distribution by BubbleCal · Pull Request #5721 · lance-format/lance

BubbleCal · 2026-01-15T14:19:10Z

Switch to a Zipf distribution to match real-world dataset distributions.

github-actions · 2026-01-15T14:20:17Z

Code Review

Overall: This PR improves benchmark realism by using Zipf distribution for term frequencies, which better reflects real-world text data. The changes are well-structured.

P1 Issue

Potential out-of-bounds index in Zipf sampling (line ~61 in diff):

let idx = rng.sample(word_zipf) as usize - 1;

The Zipf distribution samples values in the range [1, VOCAB_SIZE]. However, floating-point precision issues could theoretically produce a value slightly above VOCAB_SIZE, causing idx to exceed vocab.len() - 1. Consider clamping to prevent potential panic:

let idx = (rng.sample(word_zipf) as usize - 1).min(VOCAB_SIZE - 1);
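The clamp suggested above can be sketched in isolation. This is only an illustration, not the benchmark's code: to stay free of external crates it swaps in a small xorshift PRNG and a table-based inverse-CDF Zipf sampler in place of the `rng.sample(word_zipf)` call, and the names `XorShift64`, `zipf_cdf`, `sample_zipf_rank`, `rank_to_index`, and `sample_word` are all hypothetical. The `saturating_sub` guard goes one step beyond the review's suggestion and is an assumption about what a defensive version might look like.

```rust
/// Tiny deterministic PRNG (xorshift64*), used only so this sketch needs no
/// external crates. It is a stand-in, not the generator the benchmark uses.
/// The seed must be non-zero.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x >> 12;
        x ^= x << 25;
        x ^= x >> 27;
        self.0 = x;
        x.wrapping_mul(0x2545F4914F6CDD1D)
    }

    /// Uniform f64 in [0, 1), built from the top 53 bits.
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Cumulative Zipf weights for ranks 1..=n with exponent `s`,
/// normalized so the last entry is 1.0.
fn zipf_cdf(n: usize, s: f64) -> Vec<f64> {
    let mut cdf = Vec::with_capacity(n);
    let mut total = 0.0;
    for rank in 1..=n {
        total += 1.0 / (rank as f64).powf(s);
        cdf.push(total);
    }
    for c in cdf.iter_mut() {
        *c /= total;
    }
    cdf
}

/// Maps a uniform draw `u` in [0, 1) to a rank in [1, n], returned as f64
/// to mimic a float-valued Zipf sampler.
fn sample_zipf_rank(cdf: &[f64], u: f64) -> f64 {
    // Index of the first cumulative weight that is >= u.
    let idx = cdf.partition_point(|&c| c < u);
    (idx + 1) as f64
}

/// Converts a 1-based rank to a 0-based index, guarding both ends:
/// `saturating_sub` covers a rank of 0, `min` covers a rank above `vocab_len`.
/// Assumes `vocab_len > 0`.
fn rank_to_index(rank: f64, vocab_len: usize) -> usize {
    (rank as usize).saturating_sub(1).min(vocab_len - 1)
}

/// Picks one word from a rank-ordered vocabulary (most frequent first).
fn sample_word<'a>(vocab: &[&'a str], cdf: &[f64], rng: &mut XorShift64) -> &'a str {
    let rank = sample_zipf_rank(cdf, rng.next_f64());
    vocab[rank_to_index(rank, vocab.len())]
}
```

With a vocabulary slice ordered from most to least frequent, `zipf_cdf(vocab.len(), 1.1)` would be built once and `sample_word` called per token. The exponent 1.1 is an arbitrary choice for illustration, not a value taken from the PR.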

Minor Observations (Non-blocking)

The invert_indexing benchmark still uses the old uniform distribution data generation (lance_datagen), while invert_search now uses Zipf. This inconsistency means the indexed data differs from the search queries' term distribution. If intentional, a comment explaining this would help future readers.
The import lance_datagen::{array, RowCount} at line 17 appears unused after this change and should be removed.

Overall, good improvement to benchmark realism. Please address the bounds check to prevent potential panics.

wjones127 · 2026-01-15T18:36:25Z

@BubbleCal I already changed random_sentence to use zipf distributed words here: bd18c03#diff-05a6cb2a51514a3883006392e5a3b6990e908be79d42414de787ba5c5b623c87

I think it's also better because it uses real words and has the stop words ranked high in distribution.

BubbleCal added 3 commits January 15, 2026 22:06

Use Zipf distribution in inverted bench

656110f

Use 15 tokens per inverted search

7fb7977

Pre-generate inverted search queries

2bc8cd8

github-actions Bot added the chore label Jan 15, 2026

BubbleCal requested a review from Xuanwo January 15, 2026 14:19

BubbleCal added 2 commits January 15, 2026 22:23

Clamp Zipf samples for bench vocab

38ca968

Format inverted bench

3e59369

BubbleCal requested a review from westonpace January 15, 2026 17:36

Xuanwo approved these changes Jan 15, 2026

BubbleCal merged commit c4d0953 into main Jan 15, 2026
27 of 28 checks passed

BubbleCal deleted the yang/zipf-inverted-bench branch January 15, 2026 17:47

jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026

chore: change the FTS benchmark data distribution (lance-format#5721)

741b741

switch to zipf distribution to match real world dataset distribution

andrea-reale mentioned this pull request Mar 30, 2026

emilk/fix write starvation rerun-io/lance#12

Closed

BubbleCal merged 5 commits into main from yang/zipf-inverted-bench
