fix: support hamming distance in IndicesBuilder #6295

Merged
Xuanwo merged 1 commit into lance-format:main from jmhsieh:fix-hamming-indices-builder
Mar 25, 2026

@jmhsieh jmhsieh commented Mar 25, 2026

Summary

  • IndicesBuilder rejected uint8 vector columns and didn't include "hamming" in its allowed distance types, even though the underlying Rust train_ivf_model supports hamming via k-modes
  • Relaxes _normalize_column to accept unsigned integer value types alongside floats
  • Adds "hamming" to _normalize_distance_type's allowed list
  • Adds test_ivf_centroids_hamming test with uint8 vectors
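The relaxed checks listed above can be sketched roughly as follows. This is a stdlib-only illustration, not the Lance source: the function names `normalize_distance_type` and `is_supported_value_type`, the dtype strings, and the set of previously allowed distance types are all assumptions made for the example.

```python
# Hypothetical sketch of the relaxed Python-side validation.
# Names and the exact allowed sets are assumptions, not copied from Lance.
ALLOWED_DISTANCE_TYPES = {"l2", "dot", "cosine", "hamming"}  # "hamming" is the new entry

FLOAT_TYPES = {"float16", "float32", "float64"}
UNSIGNED_INT_TYPES = {"uint8", "uint16", "uint32", "uint64"}


def normalize_distance_type(distance_type: str) -> str:
    """Lower-case the distance type and reject unknown values."""
    normalized = distance_type.lower()
    if normalized not in ALLOWED_DISTANCE_TYPES:
        raise ValueError(f"Unsupported distance type: {distance_type!r}")
    return normalized


def is_supported_value_type(dtype_name: str) -> bool:
    """Accept float value types and, after this change, unsigned integers."""
    # The distance type is not known at this point, so the check is
    # intentionally loose; compatibility is assumed to be enforced later.
    return dtype_name in FLOAT_TYPES or dtype_name in UNSIGNED_INT_TYPES
```

With these definitions, `normalize_distance_type("Hamming")` returns `"hamming"`, and `is_supported_value_type("uint8")` is true while signed types such as `"int8"` are still rejected.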

Test plan

  • test_ivf_centroids_hamming — end-to-end IVF training with uint8 vectors and hamming distance
  • Existing test_ivf_centroids tests still pass (float path unchanged)
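For readers unfamiliar with the metric the new test exercises, here is a minimal sketch of hamming distance over uint8 vectors and of assigning a vector to its nearest centroid. It uses only the standard library and represents each vector as `bytes`; the helper names `hamming_distance` and `nearest_centroid` are hypothetical and do not appear in the PR.

```python
from typing import Sequence


def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bits that differ between two equal-length uint8 vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # XOR leaves a 1 wherever the two bytes disagree; count those bits.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def nearest_centroid(vector: bytes, centroids: Sequence[bytes]) -> int:
    """Return the index of the centroid with the smallest hamming distance."""
    return min(
        range(len(centroids)),
        key=lambda i: hamming_distance(vector, centroids[i]),
    )
```

For example, `bytes([0b11110000, 0b00000001])` and `bytes([0b11111111, 0b00000000])` differ in four bits of the first byte and one bit of the second, so their hamming distance is 5.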

🤖 Generated with Claude Code

IndicesBuilder rejected uint8 vector columns and didn't allow "hamming"
as a distance type, even though the underlying Rust IVF training
supports hamming via k-modes. This relaxes the Python-side validation
to accept unsigned integer value types and adds "hamming" to the
allowed distance types.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
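The commit message mentions k-modes without describing it. As general background, a k-modes centroid update for binary vectors can be sketched as a per-bit majority vote over the vectors in a cluster. This is a generic illustration of the technique, not the Rust implementation in Lance; the function name `update_centroid` and the tie-breaking rule (ties resolve to 0) are assumptions.

```python
from typing import List


def update_centroid(members: List[bytes]) -> bytes:
    """Per-bit majority vote over equal-length uint8 vectors (generic k-modes step)."""
    if not members:
        raise ValueError("cluster has no members")
    dim = len(members[0])
    centroid = bytearray(dim)
    for byte_idx in range(dim):
        for bit in range(8):
            # Number of members that have this bit set.
            ones = sum((m[byte_idx] >> bit) & 1 for m in members)
            # Set the bit only on a strict majority; ties fall back to 0.
            if 2 * ones > len(members):
                centroid[byte_idx] |= 1 << bit
    return bytes(centroid)
```

The mode plays the role that the mean plays in k-means: it yields a centroid that is itself a valid bit vector, so hamming distance to it stays well defined.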
@github-actions bot added the bug and python labels Mar 25, 2026
@github-actions
Contributor

PR Review

Clean, focused change. The Rust layer already validates that uint8 vectors are only used with hamming distance (train_ivf_model match + validate_distance_type_for), so relaxing the Python-side check is safe — invalid combos like uint8+L2 will get a clear error from Rust.

One minor suggestion:

In _normalize_column, unsigned integer columns are now accepted regardless of the distance type (which isn't known yet at column-validation time). This is fine because Rust validates it downstream, but a brief comment in _normalize_column noting that the dtype/distance-type compatibility is enforced later in train_ivf would help future readers understand why the check is intentionally loose:

# Unsigned integer types (e.g. uint8) are accepted here for hamming distance;
# dtype/distance-type compatibility is validated downstream in train_ivf.

Otherwise LGTM — test coverage is good, docstrings are updated, and the change is minimal.
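The downstream rule the review relies on (uint8 vectors are only valid with hamming distance) can be illustrated with a small sketch. This is a hypothetical Python rendering of that one rule, not the Rust `validate_distance_type_for` logic; the function name, dtype strings, and error message are assumptions, and the sketch says nothing about how other combinations, such as float vectors with hamming, are handled.

```python
def check_dtype_distance_compat(dtype_name: str, distance_type: str) -> None:
    """Raise if a uint8 column is paired with a non-hamming distance type."""
    # Only the rule stated in the review is modelled here; every other
    # combination is passed through unchecked in this sketch.
    if dtype_name == "uint8" and distance_type != "hamming":
        raise ValueError(
            f"uint8 vectors require distance type 'hamming', got {distance_type!r}"
        )
```

Under this sketch, `check_dtype_distance_compat("uint8", "hamming")` returns without error, while `check_dtype_distance_compat("uint8", "l2")` raises a `ValueError`, which mirrors the uint8+L2 case mentioned in the review.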

Collaborator

@Xuanwo Xuanwo left a comment

Thank you!

@Xuanwo Xuanwo marked this pull request as ready for review March 25, 2026 18:28
@Xuanwo Xuanwo merged commit a6b033c into lance-format:main Mar 25, 2026
16 checks passed
wjones127 pushed a commit to wjones127/lance that referenced this pull request Mar 29, 2026