feat(core): add Levenshtein-based suggestions to not-found errors in schema #5976
Conversation
Force-pushed: dfd1517 to e62735f
wjones127
left a comment
Thanks for making a PR. This is a good start, but it misses a key requirement from the original issue:
I would recommend implementing the suggestion in the Display implementation instead of when creating the error. That way, we don't perform a complex calculation for the suggestion when the error might be expected and immediately handled.
Here is an example of how that might work: https://github.com/lance-format/lance/pull/5830/changes#diff-d8784e118b0c126058a9c40901117a537eb97065e1def0ddb94ec47ce5a2a9aeR16-R44
I think I'd prefer to create a new error variant to accomplish this, rather than re-using existing variants.
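The review suggestion above (compute the hint in `Display`, via a dedicated error variant) could be sketched roughly as follows. This is a minimal illustration, not the code from the linked PR: `FieldNotFound`, `closest`, `searches`, and the `SEARCHES` counter are hypothetical names, and the prefix-based `closest` is only a stand-in for a real edit-distance search. The counter exists solely to show that no search runs until the error is formatted.

```rust
use std::cell::Cell;
use std::fmt;

thread_local! {
    /// Counts how often the search runs (illustration only).
    static SEARCHES: Cell<usize> = Cell::new(0);
}

/// Hypothetical error type: stores raw inputs rather than a pre-built message.
#[derive(Debug)]
struct FieldNotFound {
    name: String,
    candidates: Vec<String>,
}

/// Stand-in for an edit-distance search (assumption): returns the candidate
/// sharing the longest non-empty common prefix with `name`.
fn closest<'a>(name: &str, candidates: &'a [String]) -> Option<&'a str> {
    SEARCHES.with(|c| c.set(c.get() + 1));
    candidates
        .iter()
        .map(|c| {
            let shared = c
                .chars()
                .zip(name.chars())
                .take_while(|(a, b)| a == b)
                .count();
            (shared, c.as_str())
        })
        .filter(|(shared, _)| *shared > 0)
        .max_by_key(|(shared, _)| *shared)
        .map(|(_, c)| c)
}

impl fmt::Display for FieldNotFound {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "Column {} does not exist", self.name)?;
        // The search happens here, only when someone formats the error.
        if let Some(suggestion) = closest(&self.name, &self.candidates) {
            write!(f, ". Did you mean '{}'?", suggestion)?;
        }
        Ok(())
    }
}

impl std::error::Error for FieldNotFound {}

/// Number of searches performed on the current thread.
fn searches() -> usize {
    SEARCHES.with(|c| c.get())
}
```

With this shape, a caller that matches on the error and recovers never pays for the search; only a caller that prints or logs it does. The trade-off is that the error has to carry the candidate names (or some handle to them) until it is formatted.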
Force-pushed: e62735f to 53028c0
wjones127
left a comment
This is great work! I will merge once tests are passing.
…schema (lance-format#5976)

### What

Closes lance-format#5642 (incrementally)

Enhances "column not found" and "field not found" error messages in `Schema` to suggest the closest matching field name using Levenshtein distance.

**Before:** `LanceError(Schema): Column vectr does not exist`

**After:** `LanceError(Schema): Column vectr does not exist. Did you mean 'vector'?`

### Changes

Single file modified: `lance-core/src/datatypes/schema.rs`

- Added `levenshtein_distance()`: standard edit distance with two-row DP optimization
- Added `suggest_field()`: finds closest field name (threshold: edit distance ≤ 1/3 of the longer name's length)
- Enhanced 3 error sites:
  - `FieldRef::into_id`: "Field 'X' not found in schema"
  - `Schema::do_project`: "Column X does not exist"
  - `Schema::project_by_schema`: "Field X not found"

### Design Decisions

- **No new dependencies**: implemented Levenshtein inline rather than adding `strsim` crate
- **No new error variants**: enhanced existing `Error::InvalidInput` and `Error::Schema` message strings
- **1/3 threshold**: per issue guidance, suggestions only appear when fewer than 1/3 of characters need to change, preventing unhelpful suggestions for completely unrelated names
- **Incremental scope**: this PR covers `schema.rs` only; additional error sites (scanner, projection, etc.) can follow

### Testing

Added 4 tests:

- `test_levenshtein_distance`: 11 assertions covering identical, empty, single-edit, multi-edit, and completely different strings
- `test_suggest_field`: 6 assertions: close match, no match, exact match rejection, empty list, short names
- `test_suggest_field_edge_cases`: 2 assertions: all-different short names, picks-closest-among-multiple
- `test_project_with_suggestion`: integration test that verifies `Schema::project` includes a suggestion for a typo and omits it for completely wrong names

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
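For readers unfamiliar with the two helpers named in the Changes list, here is a rough sketch of what a two-row Levenshtein distance and a thresholded suggestion lookup can look like. It is not the code from `schema.rs`: the signatures are assumptions, and the threshold is implemented as the inclusive bound `distance * 3 <= longer length` (the description states both "≤ 1/3" and "fewer than 1/3", so the exact comparison used in the PR is not confirmed here).

```rust
/// Edit distance between `a` and `b`, keeping only two DP rows in memory.
fn levenshtein_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();

    // prev[j] = distance between the first i chars of `a` and the first j chars of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0; b.len() + 1];

    for (i, &ca) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, &cb) in b.iter().enumerate() {
            let cost = usize::from(ca != cb);
            curr[j + 1] = (prev[j] + cost) // substitution (or match)
                .min(prev[j + 1] + 1) // deletion from `a`
                .min(curr[j] + 1); // insertion into `a`
        }
        // The row just computed becomes the previous row for the next char.
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// Returns the closest field name, if one is near enough to be a likely typo.
/// Assumed rule: 0 < distance and distance * 3 <= length of the longer name.
fn suggest_field<'a>(name: &str, fields: &[&'a str]) -> Option<&'a str> {
    fields
        .iter()
        .filter_map(|&field| {
            let dist = levenshtein_distance(name, field);
            let longer = name.chars().count().max(field.chars().count());
            // dist == 0 is an exact match, so there is nothing to suggest.
            (dist > 0 && dist * 3 <= longer).then_some((dist, field))
        })
        .min_by_key(|&(dist, _)| dist)
        .map(|(_, field)| field)
}
```

Under these assumptions, `vectr` against `vector` has distance 1 and a longer length of 6, so it is suggested, while a two-character name with one wrong character (1 out of 2) falls outside the bound and yields no suggestion. Using integer arithmetic for the comparison avoids floating-point rounding at the boundary.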