fix: respect fragment restrictions in vector and FTS searches when requested fragments#5924
Conversation
…cases refactor refactor
|
ACTION NEEDED The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error please inspect the "PR Title Check" action. |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
wjones127
left a comment
There was a problem hiding this comment.
The implementation seems reasonable, but I would be more convinced if you improved the unit test.
| let batches: Vec<_> = scanner | ||
| .try_into_stream() | ||
| .await | ||
| .unwrap() | ||
| .try_collect::<Vec<_>>() | ||
| .await | ||
| .unwrap(); |
There was a problem hiding this comment.
For tests, I'd recommend just calling .try_into_batch(). That makes things simpler.
| let batches: Vec<_> = scanner | |
| .try_into_stream() | |
| .await | |
| .unwrap() | |
| .try_collect::<Vec<_>>() | |
| .await | |
| .unwrap(); | |
| let batches = scanner | |
| .try_into_batch() | |
| .await | |
| .unwrap(); |
| // Now we have 3 fragments: | ||
| // Fragment 0: i=0..200 (indexed) | ||
| // Fragment 1: i=200..400 (indexed) | ||
| // Fragment 2: i=400..410 (unindexed) |
There was a problem hiding this comment.
I think this test would be more convincing if:
- Running the query without the fragment filter produced rows in all fragments
- There was another unindexed fragment that you were excluding
- You had another test case where you were filtering for a indexed fragment.
Could you modify the tests to cover all three of those?
|
Thanks @wjones127 ! I have addressed your comments. I am working on another PR that fix indexed_fragments filtering but let me know if you think I should put them together. |
wjones127
left a comment
There was a problem hiding this comment.
Thanks for adding those tests. I think this looks better, but would like to give you the opportunity to make the tests a bit shorter.
| let batch = scanner.try_into_batch().await.unwrap(); | ||
| let i_col = batch.column_by_name("i").unwrap(); | ||
| let i_array = i_col.as_any().downcast_ref::<Int32Array>().unwrap(); | ||
|
|
||
| // Should only get results from fragment 2 (i=400..410) | ||
| let mut has_results = false; | ||
| for idx in 0..i_array.len() { | ||
| has_results = true; | ||
| let val = i_array.value(idx); | ||
| assert!( | ||
| (400..410).contains(&val), | ||
| "Expected only values from fragment 2 (i=400..410), but got i={}", | ||
| val | ||
| ); | ||
| } | ||
| assert!(has_results, "Expected some results from fragment 2"); |
There was a problem hiding this comment.
You could simplify these tests to something like this:
| let batch = scanner.try_into_batch().await.unwrap(); | |
| let i_col = batch.column_by_name("i").unwrap(); | |
| let i_array = i_col.as_any().downcast_ref::<Int32Array>().unwrap(); | |
| // Should only get results from fragment 2 (i=400..410) | |
| let mut has_results = false; | |
| for idx in 0..i_array.len() { | |
| has_results = true; | |
| let val = i_array.value(idx); | |
| assert!( | |
| (400..410).contains(&val), | |
| "Expected only values from fragment 2 (i=400..410), but got i={}", | |
| val | |
| ); | |
| } | |
| assert!(has_results, "Expected some results from fragment 2"); | |
| let batch = scanner.try_into_batch().await.unwrap(); | |
| assert!(batch.num_rows() > 0, "Expected some results from fragment 2"); | |
| batch["i"].as_primitive<Int32Type>() | |
| .iter() | |
| .for_each(|val| { | |
| assert!( | |
| (400..410).contains(&val), | |
| "Expected only values from fragment 2 (i=400..410), but got i={}", | |
| val | |
| ); | |
| }); |
| // Test 2: Query only one unindexed fragment (fragment 2), excluding fragment 3 | ||
| let fragment_2 = vec![fragments[2].clone()]; | ||
|
|
||
| let mut scanner = test_ds.dataset.scan(); | ||
| scanner | ||
| .full_text_search(FullTextSearchQuery::new("s-405".into())) | ||
| .unwrap() | ||
| .with_fragments(fragment_2); | ||
|
|
||
| let batch = scanner.try_into_batch().await.unwrap(); | ||
| let i_col = batch.column_by_name("i").unwrap(); | ||
| let i_array = i_col.as_any().downcast_ref::<Int32Array>().unwrap(); | ||
|
|
||
| // Should only get results from fragment 2 (i=400..410) | ||
| let mut has_results = false; | ||
| for idx in 0..i_array.len() { | ||
| has_results = true; | ||
| let val = i_array.value(idx); | ||
| assert!( | ||
| (400..410).contains(&val), | ||
| "Expected only values from fragment 2 (i=400..410), but got i={}", | ||
| val | ||
| ); | ||
| } | ||
| assert!(has_results, "Expected some results from fragment 2"); |
There was a problem hiding this comment.
I like the number of test better, but does seem to be verbose. Could you refactor these into a common test function? That would make it a lot shorter.
Fixes bug where vector and FTS searches ignore
with_fragments()filter when querying unindexed fragments, which will return results from indexed fragments that were not requested.Note, this does not fix the issues where
with_fragmentscontains both indexed_fragement and non_indexed_fragment for FTS and vector search, and I have a separate follow up PR to fix the behavior.This PR combines my previous effort for fixing the issue:
#5081
#5080