fix: infer multivector sampling rows by BubbleCal · Pull Request #5534 · lance-format/lance

BubbleCal · 2025-12-18T14:56:23Z

Replace fixed 1030 assumption when sampling multivector rows for index training by inferring vectors-per-row from a non-null example. Fall back to n=1030 if no non-empty value is found. Adds unit tests for inferred n and fallback behavior.

chatgpt-codex-connector · 2025-12-18T14:56:29Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

github-actions · 2025-12-18T14:56:38Z

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

Xuanwo

Thank you!

Xuanwo · 2025-12-18T15:21:56Z

@claude review this pr

claude · 2025-12-18T15:22:13Z

Claude finished @Xuanwo's task in 1m 17s —— View job

PR Review

Todo List

Read the modified file to understand changes
Analyze code correctness and logic
Check test coverage and edge cases
Review code quality and adherence to standards
Post comprehensive review feedback

Summary

This PR improves the multivector training data sampling logic by dynamically inferring the number of vectors per row instead of hardcoding the value to 1030. The implementation is well-designed with good fallback behavior and test coverage.

Positive Observations

✅ Good implementation strategy: The two-phase approach (fast random sampling → fallback scanning) is efficient and handles edge cases well.

✅ Comprehensive test coverage: Tests cover both the happy path (inference succeeds) and fallback case (all nulls).

✅ Clear documentation: The comment explaining the estimation logic is helpful.

✅ Proper error handling: Appropriate use of warnings when falling back to default value.

Detailed Review

Code Logic (rust/lance/src/index/vector/utils.rs)

New function: estimate_multivector_vectors_per_row (lines 87-137)

The implementation follows a sensible two-phase strategy:

Fast path (lines 98-113): Tries 8 random samples of up to 64 rows each
- Efficiently skips null values
- Returns immediately on finding a non-empty value
- Good balance between speed and success rate
Fallback path (lines 115-130): Scans up to 1024 rows with null filtering
- Uses proper DataFusion filtering to avoid reading nulls
- Appropriate limit to prevent excessive scanning
- Clear warning when no valid value is found

Potential Issues/Questions:

Empty list handling: On line 109, you check if len > 0, but what if a multivector exists but has 0 vectors in it? Would that be a valid case or should it be treated as null/empty? If it's valid, you might skip it incorrectly. Consider whether this is the intended behavior.
Type safety: Lines 103 and 124 use as_list::<i32>() without checking if the array is actually a list type. While this should be guaranteed by the caller context, consider adding a type check or a comment explaining why it's safe.
Sampling bias: The random sampling approach (8 tries × 64 rows = ~512 rows) might not find a valid value in datasets with very sparse multivector data (e.g., <0.2% non-null). However, the fallback handles this well, so this is acceptable.

Integration with maybe_sample_training_data (lines 282-294)

The integration looks good. The estimation is only performed for List(_) types, which is correct.

Minor suggestion: Consider caching the vectors_per_row value in the dataset metadata if this function is called multiple times for the same column, though this might be premature optimization.

Test Coverage (lines 523-577)

Test 1: test_maybe_sample_training_data_multivector_infers_vectors_per_row

✅ Tests the happy path with non-null multivector data
✅ Verifies the correct number of training vectors is sampled
Uses vectors_per_row: 2, which is more realistic than the old assumption of 1030

Test 2: test_estimate_multivector_vectors_per_row_fallback_1030

✅ Tests the fallback case with 100% null data
✅ Verifies the default value of 1030 is returned
Good use of with_random_nulls(1.0) to create all-null data

Missing test cases (nice-to-have, not blocking):

Very sparse data (e.g., 1% non-null) to test the boundary between fast path and fallback
Variable-length multivectors to ensure the first non-empty one is representative
Edge case: num_rows == 0 (covered by early return but not tested)

Code Style & Standards

✅ Follows Rust naming conventions
✅ Appropriate use of async/await
✅ Good error messages with context
✅ Follows the project's pattern of using memory:// in tests

Performance Considerations

The performance impact should be minimal:

Fast path tries only ~512 rows in most cases
Fallback is limited to 1024 rows with null filtering
This is much better than the previous assumption which could lead to over-sampling

Recommendations

Consider documenting the "empty list" behavior: If a multivector row exists but has 0 vectors in it, clarify whether this should be treated as a valid value or skipped.

Optional enhancement: Add a debug log when the fast path succeeds to help with troubleshooting:

if len > 0 {
    debug!("Inferred {} vectors per row for column {}", len, column);
    return Ok(len);
}

Documentation: Consider adding a doc comment to estimate_multivector_vectors_per_row explaining:
- What it returns (number of vectors in a single multivector row)
- The fallback behavior
- Why 1030 is used as the default

Verdict

This is a solid improvement that removes a hardcoded assumption and makes the code more robust. The implementation is efficient, well-tested, and handles edge cases appropriately. The two minor suggestions above are optional enhancements, not blockers.

The PR looks good to merge! ✅

codecov · 2025-12-18T15:37:32Z

Codecov Report

❌ Patch coverage is 80.72289% with 16 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
rust/lance/src/index/vector/utils.rs	80.72%	6 Missing and 10 partials ⚠️

📢 Thoughts on this report? Let us know!

Xuanwo · 2025-12-18T15:39:04Z

@claude review this pr

claude · 2025-12-18T15:39:15Z

Claude finished @Xuanwo's task in 1m 43s —— View job

PR Review

Todo List

Check for new changes since last review
Read the modified file to understand implementation
Analyze code correctness and edge cases
Review test coverage (codecov shows gaps)
Verify adherence to project standards
Provide comprehensive review feedback

Summary

This PR successfully replaces the hardcoded 1030 assumption with dynamic inference of vectors-per-row for multivector sampling. The implementation is sound with a smart two-phase approach (fast random sampling → comprehensive fallback), and includes good test coverage. All CI checks pass.

Implementation Analysis

Core Function: `estimate_multivector_vectors_per_row` (lines 87-137)

Strengths:
✅ Efficient two-phase strategy:

Fast path: 8 random samples of up to 64 rows each (lines 98-113)
Fallback: Filtered scan of up to 1024 rows with null filtering (lines 115-130)
Early return on num_rows == 0 (lines 92-94)

✅ Proper null handling: Skips null values in both phases

✅ Good logging: Warning message when falling back to default (lines 132-135)

Observations:

Empty list handling (line 109): The check if len > 0 skips multivectors that exist but contain 0 vectors. This seems intentional since a 0-length multivector wouldn't be useful for training. However, this could theoretically cause the function to miss valid data if the dataset only contains empty multivectors. The fallback to 1030 is reasonable in this edge case.
Type assumptions (lines 103, 124): The code uses as_list::<i32>() without type checking. This is safe given the context (caller verifies it's a List(_) type), but could benefit from a comment explaining the safety invariant.
Fallback value justification: The value 1030 appears arbitrary. A comment explaining why 1030 specifically (historical? average? empirical?) would be helpful for future maintainers.

Integration: `maybe_sample_training_data` (lines 282-294)

Strengths:
✅ Correctly gates estimation to List(_) types only
✅ Proper calculation: sample_size_hint.div_ceil(vectors_per_row).max(128)
✅ Good justification for minimum of 128 rows

Observation:

The comment mentions "64 MiB" but doesn't show the math. For reviewers: 128 multivectors × ~1000 vectors/multivector × vector_size ≈ 64 MiB (reasonable estimate).

Test Coverage

Test 1: `test_maybe_sample_training_data_multivector_infers_vectors_per_row` (lines 523-549)

✅ Tests happy path with vectors_per_row: 2
✅ Verifies correct number of training vectors sampled
✅ Uses realistic multivector data (not the old 1030 assumption)

Test 2: `test_estimate_multivector_vectors_per_row_fallback_1030` (lines 551-577)

✅ Tests fallback with 100% null data
✅ Verifies default value of 1030 returned
✅ Good use of with_random_nulls(1.0)

Coverage Gaps (from codecov: 80.72%, 16 lines missing)

The codecov report indicates 16 lines with missing coverage. Based on the code structure, these likely include:

Likely uncovered lines:

Error paths in get_column_from_batch (lines 35-44, 50-56, 64-70, 74-80)
Empty result handling in the fallback scan (line 136 return path)
Edge cases in the fast path when no valid values are found

Assessment: The missing coverage is primarily error handling paths and edge cases. The core logic (estimation + fallback) is well-tested. The error path coverage is less critical since those paths validate input correctness.

Nice-to-have tests (not blockers):

Sparse data test (e.g., 1% non-null) to verify fast path → fallback transition
Variable-length multivectors to ensure first found value is used
Edge case: num_rows < 64 (though early return handles this)

Code Quality & Standards

✅ Rust conventions: Proper naming, formatting, error handling
✅ Project standards (CLAUDE.md):

Uses memory:// URI in tests ✓
Uses rstest for parameterized tests (existing test in file) ✓
Meaningful comments where logic isn't self-evident ✓
Async-first architecture ✓

✅ Documentation: Good inline comments explaining the estimation strategy

⚠️ Minor: The estimate_multivector_vectors_per_row function lacks a doc comment. Consider adding:

/// Estimates the number of vectors per row in a multivector column by sampling.
///
/// Attempts to find a non-null, non-empty multivector value through random sampling,
/// falling back to a filtered scan if necessary. Returns 1030 if no valid value is found.
///
/// # Arguments
/// * `dataset` - The dataset to sample from
/// * `column` - The multivector column name
/// * `num_rows` - Total number of rows in the dataset
///
/// # Returns
/// The inferred vectors-per-row, or 1030 as a conservative default

Performance Impact

Positive:

Fast path tries only ~512 rows (8 × 64) in most cases
Fallback limited to 1024 rows with efficient null filtering
Much better than previous assumption which could lead to over/under-sampling

Negligible overhead: The estimation runs once per index training, not per query.

Security & Correctness

✅ No security concerns
✅ Logic is sound and handles edge cases appropriately
✅ Proper error propagation with context

Recommendations

Required: None - the PR is ready to merge

Optional Enhancements:

Add doc comment to estimate_multivector_vectors_per_row (see example above)

Clarify the 1030 default with a comment explaining its origin:

// Fallback to 1030, a conservative estimate based on typical multivector datasets
// observed in production. This ensures adequate training data when inference fails.
warn!("Could not find a non-empty multivector value for column {}, falling back to n=1030", column);

Debug logging when fast path succeeds (helpful for troubleshooting):

if len > 0 {
    log::debug!("Inferred {} vectors per row for multivector column '{}'", len, column);
    return Ok(len);
}

Type safety comment explaining the as_list::<i32>() safety:

// Safe: caller verifies column is List(_) type before calling this function
let list_array = array.as_list::<i32>();

Verdict

✅ LGTM - Approved for merge

This is a solid improvement that removes a brittle hardcoded assumption. The implementation is efficient, well-tested, and handles edge cases gracefully. The 80% patch coverage is acceptable given that most uncovered lines are error paths. The optional enhancements above would improve maintainability but are not blocking.

The PR successfully addresses its stated goal and follows project conventions.

Xuanwo · 2025-12-18T15:42:24Z

ok, I got it. The prompts for @claude and claude's auto review is different.

Replace fixed 1030 assumption when sampling multivector rows for index training by inferring vectors-per-row from a non-null example. Fall back to n=1030 if no non-empty value is found. Adds unit tests for inferred n and fallback behavior.

BubbleCal added 2 commits December 18, 2025 22:27

Infer multivector sample rows

0503804

Fallback multivector n=1030

7ac46bf

BubbleCal changed the title ~~Infer multivector sampling rows~~ fix: infer multivector sampling rows Dec 18, 2025

github-actions Bot added the bug Something isn't working label Dec 18, 2025

fmt: vector utils

b1996b7

wkalt approved these changes Dec 18, 2025

View reviewed changes

Xuanwo approved these changes Dec 18, 2025

View reviewed changes

BubbleCal merged commit c2a1535 into main Dec 18, 2025
28 checks passed

BubbleCal deleted the multivector-sample-rows branch December 18, 2025 15:45

andrea-reale mentioned this pull request Mar 30, 2026

emilk/fix write starvation rerun-io/lance#12

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: infer multivector sampling rows#5534

fix: infer multivector sampling rows#5534
BubbleCal merged 3 commits intomainfrom
multivector-sample-rows

BubbleCal commented Dec 18, 2025

Uh oh!

chatgpt-codex-connector Bot commented Dec 18, 2025

Uh oh!

github-actions Bot commented Dec 18, 2025

Uh oh!

Xuanwo left a comment

Uh oh!

Xuanwo commented Dec 18, 2025

Uh oh!

claude Bot commented Dec 18, 2025 •

edited

Loading

Uh oh!

codecov Bot commented Dec 18, 2025

Uh oh!

Xuanwo commented Dec 18, 2025

Uh oh!

claude Bot commented Dec 18, 2025 •

edited

Loading

Uh oh!

Xuanwo commented Dec 18, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

BubbleCal commented Dec 18, 2025

Uh oh!

chatgpt-codex-connector Bot commented Dec 18, 2025

Uh oh!

github-actions Bot commented Dec 18, 2025

Uh oh!

Xuanwo left a comment

Choose a reason for hiding this comment

Uh oh!

Xuanwo commented Dec 18, 2025

Uh oh!

claude Bot commented Dec 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review

Todo List

Summary

Positive Observations

Detailed Review

Code Logic (rust/lance/src/index/vector/utils.rs)

Test Coverage (lines 523-577)

Code Style & Standards

Performance Considerations

Recommendations

Verdict

Uh oh!

codecov Bot commented Dec 18, 2025

Codecov Report

Uh oh!

Xuanwo commented Dec 18, 2025

Uh oh!

claude Bot commented Dec 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review

Todo List

Summary

Implementation Analysis

Core Function: estimate_multivector_vectors_per_row (lines 87-137)

Integration: maybe_sample_training_data (lines 282-294)

Test Coverage

Test 1: test_maybe_sample_training_data_multivector_infers_vectors_per_row (lines 523-549)

Test 2: test_estimate_multivector_vectors_per_row_fallback_1030 (lines 551-577)

Coverage Gaps (from codecov: 80.72%, 16 lines missing)

Code Quality & Standards

Performance Impact

Security & Correctness

Recommendations

Required: None - the PR is ready to merge

Optional Enhancements:

Verdict

Uh oh!

Xuanwo commented Dec 18, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

claude Bot commented Dec 18, 2025 •

edited

Loading

claude Bot commented Dec 18, 2025 •

edited

Loading

Core Function: `estimate_multivector_vectors_per_row` (lines 87-137)

Integration: `maybe_sample_training_data` (lines 282-294)

Test 1: `test_maybe_sample_training_data_multivector_infers_vectors_per_row` (lines 523-549)

Test 2: `test_estimate_multivector_vectors_per_row_fallback_1030` (lines 551-577)