Skip to content

fix: infer multivector sampling rows#5534

Merged
BubbleCal merged 3 commits intomainfrom
multivector-sample-rows
Dec 18, 2025
Merged

fix: infer multivector sampling rows#5534
BubbleCal merged 3 commits intomainfrom
multivector-sample-rows

Conversation

@BubbleCal
Copy link
Copy Markdown
Contributor

Replace fixed 1030 assumption when sampling multivector rows for index training by inferring vectors-per-row from a non-null example. Fall back to n=1030 if no non-empty value is found. Adds unit tests for inferred n and fallback behavior.

@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@github-actions
Copy link
Copy Markdown
Contributor

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@BubbleCal BubbleCal changed the title Infer multivector sampling rows fix: infer multivector sampling rows Dec 18, 2025
@github-actions github-actions Bot added the bug Something isn't working label Dec 18, 2025
Copy link
Copy Markdown
Collaborator

@Xuanwo Xuanwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@Xuanwo
Copy link
Copy Markdown
Collaborator

Xuanwo commented Dec 18, 2025

@claude review this pr

@claude
Copy link
Copy Markdown

claude Bot commented Dec 18, 2025

Claude finished @Xuanwo's task in 1m 17s —— View job


PR Review

Todo List

  • Read the modified file to understand changes
  • Analyze code correctness and logic
  • Check test coverage and edge cases
  • Review code quality and adherence to standards
  • Post comprehensive review feedback

Summary

This PR improves the multivector training data sampling logic by dynamically inferring the number of vectors per row instead of hardcoding the value to 1030. The implementation is well-designed with good fallback behavior and test coverage.

Positive Observations

Good implementation strategy: The two-phase approach (fast random sampling → fallback scanning) is efficient and handles edge cases well.

Comprehensive test coverage: Tests cover both the happy path (inference succeeds) and fallback case (all nulls).

Clear documentation: The comment explaining the estimation logic is helpful.

Proper error handling: Appropriate use of warnings when falling back to default value.

Detailed Review

Code Logic (rust/lance/src/index/vector/utils.rs)

New function: estimate_multivector_vectors_per_row (lines 87-137)

The implementation follows a sensible two-phase strategy:

  1. Fast path (lines 98-113): Tries 8 random samples of up to 64 rows each

    • Efficiently skips null values
    • Returns immediately on finding a non-empty value
    • Good balance between speed and success rate
  2. Fallback path (lines 115-130): Scans up to 1024 rows with null filtering

    • Uses proper DataFusion filtering to avoid reading nulls
    • Appropriate limit to prevent excessive scanning
    • Clear warning when no valid value is found

Potential Issues/Questions:

  1. Empty list handling: On line 109, you check if len > 0, but what if a multivector exists but has 0 vectors in it? Would that be a valid case or should it be treated as null/empty? If it's valid, you might skip it incorrectly. Consider whether this is the intended behavior.

  2. Type safety: Lines 103 and 124 use as_list::<i32>() without checking if the array is actually a list type. While this should be guaranteed by the caller context, consider adding a type check or a comment explaining why it's safe.

  3. Sampling bias: The random sampling approach (8 tries × 64 rows = ~512 rows) might not find a valid value in datasets with very sparse multivector data (e.g., <0.2% non-null). However, the fallback handles this well, so this is acceptable.

Integration with maybe_sample_training_data (lines 282-294)

The integration looks good. The estimation is only performed for List(_) types, which is correct.

Minor suggestion: Consider caching the vectors_per_row value in the dataset metadata if this function is called multiple times for the same column, though this might be premature optimization.

Test Coverage (lines 523-577)

Test 1: test_maybe_sample_training_data_multivector_infers_vectors_per_row

  • ✅ Tests the happy path with non-null multivector data
  • ✅ Verifies the correct number of training vectors is sampled
  • Uses vectors_per_row: 2, which is more realistic than the old assumption of 1030

Test 2: test_estimate_multivector_vectors_per_row_fallback_1030

  • ✅ Tests the fallback case with 100% null data
  • ✅ Verifies the default value of 1030 is returned
  • Good use of with_random_nulls(1.0) to create all-null data

Missing test cases (nice-to-have, not blocking):

  • Very sparse data (e.g., 1% non-null) to test the boundary between fast path and fallback
  • Variable-length multivectors to ensure the first non-empty one is representative
  • Edge case: num_rows == 0 (covered by early return but not tested)

Code Style & Standards

✅ Follows Rust naming conventions
✅ Appropriate use of async/await
✅ Good error messages with context
✅ Follows the project's pattern of using memory:// in tests

Performance Considerations

The performance impact should be minimal:

  • Fast path tries only ~512 rows in most cases
  • Fallback is limited to 1024 rows with null filtering
  • This is much better than the previous assumption which could lead to over-sampling

Recommendations

  1. Consider documenting the "empty list" behavior: If a multivector row exists but has 0 vectors in it, clarify whether this should be treated as a valid value or skipped.

  2. Optional enhancement: Add a debug log when the fast path succeeds to help with troubleshooting:

    if len > 0 {
        debug!("Inferred {} vectors per row for column {}", len, column);
        return Ok(len);
    }
  3. Documentation: Consider adding a doc comment to estimate_multivector_vectors_per_row explaining:

    • What it returns (number of vectors in a single multivector row)
    • The fallback behavior
    • Why 1030 is used as the default

Verdict

This is a solid improvement that removes a hardcoded assumption and makes the code more robust. The implementation is efficient, well-tested, and handles edge cases appropriately. The two minor suggestions above are optional enhancements, not blockers.

The PR looks good to merge! ✅


@codecov
Copy link
Copy Markdown

codecov Bot commented Dec 18, 2025

Codecov Report

❌ Patch coverage is 80.72289% with 16 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance/src/index/vector/utils.rs 80.72% 6 Missing and 10 partials ⚠️

📢 Thoughts on this report? Let us know!

@Xuanwo
Copy link
Copy Markdown
Collaborator

Xuanwo commented Dec 18, 2025

@claude review this pr

@claude
Copy link
Copy Markdown

claude Bot commented Dec 18, 2025

Claude finished @Xuanwo's task in 1m 43s —— View job


PR Review

Todo List

  • Check for new changes since last review
  • Read the modified file to understand implementation
  • Analyze code correctness and edge cases
  • Review test coverage (codecov shows gaps)
  • Verify adherence to project standards
  • Provide comprehensive review feedback

Summary

This PR successfully replaces the hardcoded 1030 assumption with dynamic inference of vectors-per-row for multivector sampling. The implementation is sound with a smart two-phase approach (fast random sampling → comprehensive fallback), and includes good test coverage. All CI checks pass.

Implementation Analysis

Core Function: estimate_multivector_vectors_per_row (lines 87-137)

Strengths:
Efficient two-phase strategy:

  • Fast path: 8 random samples of up to 64 rows each (lines 98-113)
  • Fallback: Filtered scan of up to 1024 rows with null filtering (lines 115-130)
  • Early return on num_rows == 0 (lines 92-94)

Proper null handling: Skips null values in both phases

Good logging: Warning message when falling back to default (lines 132-135)

Observations:

  1. Empty list handling (line 109): The check if len > 0 skips multivectors that exist but contain 0 vectors. This seems intentional since a 0-length multivector wouldn't be useful for training. However, this could theoretically cause the function to miss valid data if the dataset only contains empty multivectors. The fallback to 1030 is reasonable in this edge case.

  2. Type assumptions (lines 103, 124): The code uses as_list::<i32>() without type checking. This is safe given the context (caller verifies it's a List(_) type), but could benefit from a comment explaining the safety invariant.

  3. Fallback value justification: The value 1030 appears arbitrary. A comment explaining why 1030 specifically (historical? average? empirical?) would be helpful for future maintainers.

Integration: maybe_sample_training_data (lines 282-294)

Strengths:
✅ Correctly gates estimation to List(_) types only
✅ Proper calculation: sample_size_hint.div_ceil(vectors_per_row).max(128)
✅ Good justification for minimum of 128 rows

Observation:

  • The comment mentions "64 MiB" but doesn't show the math. For reviewers: 128 multivectors × ~1000 vectors/multivector × vector_size ≈ 64 MiB (reasonable estimate).

Test Coverage

Test 1: test_maybe_sample_training_data_multivector_infers_vectors_per_row (lines 523-549)

✅ Tests happy path with vectors_per_row: 2
✅ Verifies correct number of training vectors sampled
✅ Uses realistic multivector data (not the old 1030 assumption)

Test 2: test_estimate_multivector_vectors_per_row_fallback_1030 (lines 551-577)

✅ Tests fallback with 100% null data
✅ Verifies default value of 1030 returned
✅ Good use of with_random_nulls(1.0)

Coverage Gaps (from codecov: 80.72%, 16 lines missing)

The codecov report indicates 16 lines with missing coverage. Based on the code structure, these likely include:

Likely uncovered lines:

  • Error paths in get_column_from_batch (lines 35-44, 50-56, 64-70, 74-80)
  • Empty result handling in the fallback scan (line 136 return path)
  • Edge cases in the fast path when no valid values are found

Assessment: The missing coverage is primarily error handling paths and edge cases. The core logic (estimation + fallback) is well-tested. The error path coverage is less critical since those paths validate input correctness.

Nice-to-have tests (not blockers):

  • Sparse data test (e.g., 1% non-null) to verify fast path → fallback transition
  • Variable-length multivectors to ensure first found value is used
  • Edge case: num_rows < 64 (though early return handles this)

Code Quality & Standards

Rust conventions: Proper naming, formatting, error handling
Project standards (CLAUDE.md):

  • Uses memory:// URI in tests ✓
  • Uses rstest for parameterized tests (existing test in file) ✓
  • Meaningful comments where logic isn't self-evident ✓
  • Async-first architecture ✓

Documentation: Good inline comments explaining the estimation strategy

⚠️ Minor: The estimate_multivector_vectors_per_row function lacks a doc comment. Consider adding:

/// Estimates the number of vectors per row in a multivector column by sampling.
///
/// Attempts to find a non-null, non-empty multivector value through random sampling,
/// falling back to a filtered scan if necessary. Returns 1030 if no valid value is found.
///
/// # Arguments
/// * `dataset` - The dataset to sample from
/// * `column` - The multivector column name
/// * `num_rows` - Total number of rows in the dataset
///
/// # Returns
/// The inferred vectors-per-row, or 1030 as a conservative default

Performance Impact

Positive:

  • Fast path tries only ~512 rows (8 × 64) in most cases
  • Fallback limited to 1024 rows with efficient null filtering
  • Much better than previous assumption which could lead to over/under-sampling

Negligible overhead: The estimation runs once per index training, not per query.

Security & Correctness

✅ No security concerns
✅ Logic is sound and handles edge cases appropriately
✅ Proper error propagation with context

Recommendations

Required: None - the PR is ready to merge

Optional Enhancements:

  1. Add doc comment to estimate_multivector_vectors_per_row (see example above)

  2. Clarify the 1030 default with a comment explaining its origin:

    // Fallback to 1030, a conservative estimate based on typical multivector datasets
    // observed in production. This ensures adequate training data when inference fails.
    warn!("Could not find a non-empty multivector value for column {}, falling back to n=1030", column);
  3. Debug logging when fast path succeeds (helpful for troubleshooting):

    if len > 0 {
        log::debug!("Inferred {} vectors per row for multivector column '{}'", len, column);
        return Ok(len);
    }
  4. Type safety comment explaining the as_list::<i32>() safety:

    // Safe: caller verifies column is List(_) type before calling this function
    let list_array = array.as_list::<i32>();

Verdict

✅ LGTM - Approved for merge

This is a solid improvement that removes a brittle hardcoded assumption. The implementation is efficient, well-tested, and handles edge cases gracefully. The 80% patch coverage is acceptable given that most uncovered lines are error paths. The optional enhancements above would improve maintainability but are not blocking.

The PR successfully addresses its stated goal and follows project conventions.


@Xuanwo
Copy link
Copy Markdown
Collaborator

Xuanwo commented Dec 18, 2025

ok, I got it. The prompts for @claude and claude's auto review is different.

@BubbleCal BubbleCal merged commit c2a1535 into main Dec 18, 2025
28 checks passed
@BubbleCal BubbleCal deleted the multivector-sample-rows branch December 18, 2025 15:45
wjones127 pushed a commit to wjones127/lance that referenced this pull request Dec 19, 2025
Replace fixed 1030 assumption when sampling multivector rows for index
training by inferring vectors-per-row from a non-null example. Fall back
to n=1030 if no non-empty value is found. Adds unit tests for inferred n
and fallback behavior.
wjones127 pushed a commit that referenced this pull request Dec 19, 2025
Replace fixed 1030 assumption when sampling multivector rows for index
training by inferring vectors-per-row from a non-null example. Fall back
to n=1030 if no non-empty value is found. Adds unit tests for inferred n
and fallback behavior.
wjones127 pushed a commit to wjones127/lance that referenced this pull request Dec 30, 2025
Replace fixed 1030 assumption when sampling multivector rows for index
training by inferring vectors-per-row from a non-null example. Fall back
to n=1030 if no non-empty value is found. Adds unit tests for inferred n
and fallback behavior.
jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026
Replace fixed 1030 assumption when sampling multivector rows for index
training by inferring vectors-per-row from a non-null example. Fall back
to n=1030 if no non-empty value is found. Adds unit tests for inferred n
and fallback behavior.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants