Skip to content

chore: change the FTS benchmark data distribution#5721

Merged
BubbleCal merged 5 commits intomainfrom
yang/zipf-inverted-bench
Jan 15, 2026
Merged

chore: change the FTS benchmark data distribution#5721
BubbleCal merged 5 commits intomainfrom
yang/zipf-inverted-bench

Conversation

@BubbleCal
Copy link
Copy Markdown
Contributor

switch to zipf distribution to match real world dataset distribution

@github-actions github-actions Bot added the chore label Jan 15, 2026
@BubbleCal BubbleCal requested a review from Xuanwo January 15, 2026 14:19
@github-actions
Copy link
Copy Markdown
Contributor

Code Review

Overall: This PR improves benchmark realism by using Zipf distribution for term frequencies, which better reflects real-world text data. The changes are well-structured.

P1 Issue

Potential out-of-bounds index in Zipf sampling (line ~61 in diff):

let idx = rng.sample(word_zipf) as usize - 1;

The Zipf distribution samples values in the range [1, VOCAB_SIZE]. However, floating-point precision issues could theoretically produce a value slightly above VOCAB_SIZE, causing idx to exceed vocab.len() - 1. Consider clamping to prevent potential panic:

let idx = (rng.sample(word_zipf) as usize - 1).min(VOCAB_SIZE - 1);

Minor Observations (Non-blocking)

  1. The invert_indexing benchmark still uses the old uniform distribution data generation (lance_datagen), while invert_search now uses Zipf. This inconsistency means the indexed data differs from the search queries' term distribution. If intentional, a comment explaining this would help future readers.

  2. The import lance_datagen::{array, RowCount} at line 17 appears unused after this change and should be removed.

Overall, good improvement to benchmark realism. Please address the bounds check to prevent potential panics.

@BubbleCal BubbleCal requested a review from westonpace January 15, 2026 17:36
@BubbleCal BubbleCal merged commit c4d0953 into main Jan 15, 2026
27 of 28 checks passed
@BubbleCal BubbleCal deleted the yang/zipf-inverted-bench branch January 15, 2026 17:47
@wjones127
Copy link
Copy Markdown
Contributor

@BubbleCal I already changed random_sentence to use zipf distributed words here: bd18c03#diff-05a6cb2a51514a3883006392e5a3b6990e908be79d42414de787ba5c5b623c87

I think it's also better because it uses real words and has the stop words ranked high in distribution.

jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026
switch to zipf distribution to match real world dataset distribution
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants