fix!: correctly handle nulls in btree and bitmap indices#3211
fix!: correctly handle nulls in btree and bitmap indices#3211westonpace merged 4 commits intolance-format:mainfrom
Conversation
wjones127
left a comment
There was a problem hiding this comment.
Looks good. Although I think we might consider this a breaking change, since IIUC older versions won't be able to read the new scalar indices, right? Or will they?
| /// for faster lookup | ||
| #[derive(Clone, Debug, DeepSizeOf)] | ||
| pub struct BTreeIndex { | ||
| pub struct BTreeIndex<const BATCH_SIZE: u64 = DEFAULT_BTREE_BATCH_SIZE> { |
There was a problem hiding this comment.
What's the benefit of making BATCH_SIZE a compile-time constant?
There was a problem hiding this comment.
Eh, it was less that I wanted to make it a compile time constant and more that I didn't have a good way of passing the batch size to BTreeIndex::load (a static method). However, I realize now I can put it in the file schema. Let me do that real quick.
…. Address clippy warnings
They will not be able to. They would get a runtime error about "legacy method called on v2 file". However, that's more of a forward compatibility concern right? This is technically not a breaking change according to semver but I'm fine considering it as one. Otherwise A/B style upgades will break I will update the title. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3211 +/- ##
==========================================
- Coverage 78.59% 78.58% -0.01%
==========================================
Files 244 244
Lines 83957 84021 +64
Branches 83957 84021 +64
==========================================
+ Hits 65986 66032 +46
- Misses 15181 15193 +12
- Partials 2790 2796 +6
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There were a few issues with our null handling in scalar indices.
First, it appears I assumed earlier that
X < NULLandX > NULLwould always be false. However, inarrow-rsthe ordering considersNULLto be "the smallest value" and soX < NULLalways evaluated to true. This required some changes to the logic in the btree and bitmap indices.Second, the btree index was still using the v1 file format because it relied on the page size to keep track of the index's batch size. I've instead made the batch size a configurable property (configurable in code, not configurable by users) and made it so that btree can use the v2 file format.
Finally, related to the above, I changed it so we now write v2 files for all scalar indices, even if the dataset is a v1 dataset. I think that's a reasonable decision at this point.
The logic to fallback and read the old v1 files was already in place (I believe @BubbleCal added it back when working on inverted index) but I added a migration test just to be sure we weren't breaking our btree / bitmap support.
Users with existing bitmap indices will get the new correct behavior without any changes.
Users with existing btree indices will get some of the new correct behavior but will need to retrain their indices to get all of the correct behavior.
BREAKING CHANGE: Bitmap and btree indices will no longer be readable by older versions of Lance. This is not a "backwards compatibility change" (no APIs or code will stop working) but rather a "forwards compatibility change" (you need to be careful in a multi-verison deployment or if you roll back)