feat(core): add Levenshtein-based suggestions to not-found errors in schema #5976

Merged
wjones127 merged 3 commits into lance-format:main from HemantSudarshan:fix-5642-levenshtein-suggestions
Feb 25, 2026

Conversation

@HemantSudarshan
Contributor

What

Closes #5642 (incrementally)

Enhances "column not found" and "field not found" error messages in Schema to suggest the closest matching field name using Levenshtein distance.

Before:
LanceError(Schema): Column vectr does not exist

After:
LanceError(Schema): Column vectr does not exist. Did you mean 'vector'?

Changes

Single file modified: lance-core/src/datatypes/schema.rs

  • Added levenshtein_distance() — standard edit distance with two-row DP optimization
  • Added suggest_field() — finds closest field name (threshold: edit distance ≤ 1/3 of the longer name's length)
  • Enhanced 3 error sites:
    • FieldRef::into_id — "Field 'X' not found in schema"
    • Schema::do_project — "Column X does not exist"
    • Schema::project_by_schema — "Field X not found"
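
The two helpers listed above can be sketched roughly as follows. This is an illustrative reconstruction, not the code merged in this PR: the function names come from the description, but the signatures, the comparison by `char`, the integer form of the 1/3 threshold, and the tie-breaking (first closest candidate wins) are all assumptions.

```rust
/// Levenshtein edit distance, keeping only two rows of the DP table.
/// Assumption: strings are compared by Unicode scalar value (`char`).
fn levenshtein_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    if a.is_empty() {
        return b.len();
    }
    if b.is_empty() {
        return a.len();
    }
    // prev[j] holds the distance between the first i chars of `a`
    // and the first j chars of `b`; row 0 is 0, 1, 2, ...
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr: Vec<usize> = vec![0; b.len() + 1];
    for (i, ca) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr[j + 1] = substitution.min(deletion).min(insertion);
        }
        // The row just filled becomes the previous row for the next char.
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// Returns the closest candidate whose edit distance is non-zero and at
/// most one third of the longer name's length (hypothetical signature).
fn suggest_field<'a>(name: &str, candidates: &[&'a str]) -> Option<&'a str> {
    let name_len = name.chars().count();
    candidates
        .iter()
        .copied()
        .filter_map(|candidate| {
            let distance = levenshtein_distance(name, candidate);
            let longer = name_len.max(candidate.chars().count());
            // `distance * 3 <= longer` expresses "distance <= longer / 3"
            // without integer-division rounding.
            (distance > 0 && distance * 3 <= longer).then_some((distance, candidate))
        })
        .min_by_key(|(distance, _)| *distance)
        .map(|(_, candidate)| candidate)
}
```

Under these assumptions, `suggest_field("vectr", &["id", "vector", "text"])` yields `Some("vector")` (distance 1 against a longer length of 6), while a name unrelated to every candidate yields `None`.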

Design Decisions

  • No new dependencies — implemented Levenshtein inline rather than adding strsim crate
  • No new error variants — enhanced existing Error::InvalidInput and Error::Schema message strings
  • 1/3 threshold — per issue guidance: suggestions only appear when fewer than 1/3 of characters need to change, preventing unhelpful suggestions for completely unrelated names
  • Incremental scope — this PR covers schema.rs only; additional error sites (scanner, projection, etc.) can follow

Testing

Added 4 tests:

  • test_levenshtein_distance — 11 assertions covering identical, empty, single-edit, multi-edit, and completely different strings
  • test_suggest_field — 6 assertions: close match, no match, exact match rejection, empty list, short names
  • test_suggest_field_edge_cases — 2 assertions: all-different short names, picks-closest-among-multiple
  • test_project_with_suggestion — integration test: verifies Schema::project includes suggestion for typo, and omits it for completely wrong names

@github-actions github-actions Bot added the enhancement New feature or request label Feb 21, 2026
@wjones127 wjones127 self-assigned this Feb 21, 2026
@codecov
codecov Bot commented Feb 22, 2026

Codecov Report

❌ Patch coverage is 90.78947% with 14 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance-core/src/error.rs | 69.23% | 2 Missing and 6 partials ⚠️ |
| rust/lance-core/src/levenshtein.rs | 94.36% | 4 Missing ⚠️ |
| rust/lance-core/src/datatypes/schema.rs | 97.91% | 1 Missing ⚠️ |
| rust/lance/src/dataset/metadata.rs | 85.71% | 0 Missing and 1 partial ⚠️ |

@HemantSudarshan HemantSudarshan force-pushed the fix-5642-levenshtein-suggestions branch from dfd1517 to e62735f Compare February 22, 2026 06:25
Contributor

@wjones127 wjones127 left a comment

Thanks for making a PR. This is a good start, but it misses a key requirement from the original issue:

I would recommend implementing the suggestion in the Display implementation instead of when creating the error. That way, we don't perform a complex calculation for the suggestion when the error might be expected and immediately handled.

Here is an example of how that might work: https://github.com/lance-format/lance/pull/5830/changes#diff-d8784e118b0c126058a9c40901117a537eb97065e1def0ddb94ec47ce5a2a9aeR16-R44

I think I'd prefer to create a new error variant to accomplish this, rather than re-using existing variants.
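
One way to read this suggestion is sketched below: the error stores only the missing name and the candidate names, and the edit-distance search runs inside `Display::fmt`, so callers that handle the error without printing it never pay for it. This is a hedged illustration, not the code from the linked PR or from the merged change; `FieldNotFound`, its fields, `closest`, and `edit_distance` are hypothetical names, and the 1/3 threshold is written in an assumed integer form.

```rust
use std::fmt;

/// Hypothetical error type: holds the raw inputs, computes nothing up front.
#[derive(Debug)]
struct FieldNotFound {
    field_name: String,
    candidates: Vec<String>,
}

/// Two-row Levenshtein distance over `char`s (illustrative helper).
fn edit_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr: Vec<usize> = vec![0; b.len() + 1];
    for (i, ca) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            curr[j + 1] = substitution.min(prev[j + 1] + 1).min(curr[j] + 1);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// Closest candidate within one third of the longer name's length.
fn closest<'a>(name: &str, candidates: &'a [String]) -> Option<&'a str> {
    let name_len = name.chars().count();
    candidates
        .iter()
        .map(|c| (edit_distance(name, c), c.as_str()))
        .filter(|(d, c)| *d > 0 && *d * 3 <= name_len.max(c.chars().count()))
        .min_by_key(|(d, _)| *d)
        .map(|(_, c)| c)
}

impl fmt::Display for FieldNotFound {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "Column {} does not exist", self.field_name)?;
        // The suggestion is computed here, only when the error is rendered.
        if let Some(suggestion) = closest(&self.field_name, &self.candidates) {
            write!(f, ". Did you mean '{}'?", suggestion)?;
        }
        Ok(())
    }
}

impl std::error::Error for FieldNotFound {}
```

The trade-off in this sketch is that the candidate names are copied into the error when it is created; that copy is cheap compared with running the distance computation against every field, but it is not free.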

@HemantSudarshan HemantSudarshan force-pushed the fix-5642-levenshtein-suggestions branch from e62735f to 53028c0 Compare February 24, 2026 20:38
Contributor

@wjones127 wjones127 left a comment

This is great work! I will merge once tests are passing.

@wjones127 wjones127 merged commit 898599e into lance-format:main Feb 25, 2026
28 checks passed
wjones127 added a commit to wjones127/lance that referenced this pull request Feb 25, 2026
…schema (lance-format#5976)

Co-authored-by: Will Jones <willjones127@gmail.com>
wjones127 added a commit to wjones127/lance that referenced this pull request Feb 25, 2026
…schema (lance-format#5976)

wjones127 added a commit that referenced this pull request Feb 26, 2026
…schema (#5976)

Labels

enhancement New feature or request

Development

Successfully merging this pull request may close these issues.

Enhance not found errors with suggestions

2 participants