fix: join job may cause inconsistent delta indices#5328
fix: join job may cause inconsistent delta indices#5328wjones127 merged 9 commits intolance-format:mainfrom
Conversation
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
|
|
||
| pub fn partition_size(&self, part: usize) -> usize { | ||
| self.lengths[part] as usize | ||
| self.lengths.get(part).copied().unwrap_or_default() as usize |
There was a problem hiding this comment.
this will return a zero in the previous case that panicked. Is that a correct fix? Seems like zero would be an incorrect or unexpected result. What's the consequence of a zero return value?
There was a problem hiding this comment.
this gets the size of the given partition, say delta index A has 3 partitions and delta index B has 4 partitions, when we calculate the total partition size of 4-th partition, index A just returns 0, and this is correct because we can consider that index A has 4 partitions but the 4-th is empty.
and after the fix, we will merge all delta indices into one, so the unwrap_or_default() path will never be reached
| ) -> Result<Option<Box<dyn RecordBatchStream + Unpin + 'static>>> { | ||
| if partition_id >= self.partition_sizes.len() { | ||
| return Ok(None); | ||
| } |
There was a problem hiding this comment.
similar question here - is it a cause for concern that this function would be asked for an out of bounds partition? If so should this be an error?
| "partition index is {} but the number of partitions is {}, skip loading it", | ||
| part_idx, | ||
| index.ivf_model().num_partitions() | ||
| ); |
There was a problem hiding this comment.
is it correct to think we'll see this in some cases initially but should not see it in the long term?
There was a problem hiding this comment.
Yes, this is just for fixing the corrupted delta indices, after merging them, we will never reach these branches
wjones127
left a comment
There was a problem hiding this comment.
This seems good. I agree we should avoid doing partition joins in remaps; that seems like the job of optimize indices.
There's some minor naming changes that I think could improve readability, but these are existing issues anyways.
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
fix lance-format#5312 - handle the case of delta indices have diff number of partitions, split/join will merge all delta indices into one - trigger join on incremental indexing, not remap --------- Signed-off-by: BubbleCal <bubble-cal@outlook.com>
fix #5312