chore: change the FTS benchmark data distribution #5721
Code Review

Overall: This PR improves benchmark realism by using Zipf distribution for term frequencies, which better reflects real-world text data. The changes are well-structured.

P1 Issue

Potential out-of-bounds index in Zipf sampling (line ~61 in diff):

let idx = rng.sample(word_zipf) as usize - 1;

The Zipf distribution samples values in the range

let idx = (rng.sample(word_zipf) as usize - 1).min(VOCAB_SIZE - 1);

Minor Observations (Non-blocking)
Overall, good improvement to benchmark realism. Please address the bounds check to prevent potential panics.
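To make the bounds concern concrete, here is a minimal, standard-library-only sketch of mapping a 1-based Zipf rank to a 0-based vocabulary index with a defensive clamp. It is not the PR's code and does not use the `rand` or `rand_distr` crates: the `XorShift` generator, `zipf_cdf`, `index_for`, `rank_counts`, and the `VOCAB_SIZE` value are all hypothetical names and values chosen for illustration.

```rust
// Hypothetical vocabulary size for this sketch; the benchmark's real value may differ.
const VOCAB_SIZE: usize = 1000;

/// Tiny deterministic PRNG (xorshift64*), standing in for a real RNG crate.
/// The seed must be non-zero.
struct XorShift(u64);

impl XorShift {
    fn next_f64(&mut self) -> f64 {
        self.0 ^= self.0 >> 12;
        self.0 ^= self.0 << 25;
        self.0 ^= self.0 >> 27;
        let bits = self.0.wrapping_mul(0x2545F4914F6CDD1D);
        // Use the top 53 bits to build a float in [0, 1).
        (bits >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Cumulative Zipf probabilities for ranks 1..=n with exponent s (assumes n >= 1).
fn zipf_cdf(n: usize, s: f64) -> Vec<f64> {
    let weights: Vec<f64> = (1..=n).map(|k| 1.0 / (k as f64).powf(s)).collect();
    let total: f64 = weights.iter().sum();
    let mut acc = 0.0;
    weights
        .iter()
        .map(|w| {
            acc += w / total;
            acc
        })
        .collect()
}

/// Map a uniform draw u to a 0-based index via the inverse CDF.
/// The rank is 1-based; if rounding leaves the last CDF entry below u,
/// the rank would be n + 1, so the index is clamped to the last slot.
fn index_for(cdf: &[f64], u: f64) -> usize {
    let rank = cdf.partition_point(|&c| c < u) + 1;
    rank.saturating_sub(1).min(cdf.len() - 1)
}

/// Draw `samples` indices and count how often each index appears.
fn rank_counts(cdf: &[f64], samples: usize, seed: u64) -> Vec<usize> {
    let mut rng = XorShift(seed);
    let mut counts = vec![0usize; cdf.len()];
    for _ in 0..samples {
        counts[index_for(cdf, rng.next_f64())] += 1;
    }
    counts
}
```

Calling `rank_counts(&zipf_cdf(VOCAB_SIZE, 1.0), 10_000, seed)` with any non-zero seed yields a histogram in which low indices dominate. The `saturating_sub` and `min` calls are cheap insurance: they only change the result when the rank falls outside `1..=n`, so they do not distort the distribution.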
@BubbleCal I already changed I think it's also better because it uses real words and has the stop words ranked high in the distribution.
switch to zipf distribution to match real world dataset distribution