[WIP] Modify optimizer rules to leverage an index with deleted source data file(s) #175
Conversation
rapoth
left a comment
Looking good so far, thanks!
```scala
    NamedExpression.newExprId)
val deletedFileNames = index.excludedFiles.map(Literal(_)).toArray
val rel = baseRelation.copy(relation = relation, output = updatedOutput ++ Seq(lAttr))
val filter = Filter(condition = Not(In(lAttr, deletedFileNames)), rel)
```
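For context, here is a minimal, self-contained sketch of the same idea outside the Hyperspace code base. The helper name `excludeDeletedFiles` and its parameters are illustrative only; the actual rule operates on the index relation's own output and lineage attribute.

```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, In, Literal, Not}
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project}

// Illustrative helper: given the index scan plan, its lineage attribute, and the
// excluded (deleted) source files, wrap the scan in Filter + Project so that records
// originating from deleted files are dropped and the lineage column is hidden again.
def excludeDeletedFiles(
    indexScan: LogicalPlan,
    lineageAttr: Attribute,
    excludedFiles: Seq[String]): LogicalPlan = {
  val deletedFileNames = excludedFiles.map(Literal(_))
  // Keep only rows whose lineage (source file path) is NOT in the deleted-file list.
  val filter = Filter(Not(In(lineageAttr, deletedFileNames)), indexScan)
  // Project away the lineage column so the plan's output schema stays unchanged.
  Project(indexScan.output.filterNot(_.semanticEquals(lineageAttr)), filter)
}
```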
@pirz One thing to evaluate here is what happens when a large number of files are deleted. For instance, I can imagine the query plan getting pretty big. Can you run a quick benchmark to test this with a few thousand files deleted? I think this will have implications for query compilation time. An alternative way of doing this would be to express it as a JOIN against another table containing the list of deleted files.
CC: @imback82 @sezruby @apoorvedave1 for a second opinion in case they think this is unnecessary.
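A rough sketch of that alternative at the DataFrame level (names such as `excludeDeletedFilesViaJoin` and `lineageCol` are made up for illustration; the real change would be applied to the logical plan inside the rule). A broadcast left anti join against the deleted-file list keeps the plan size constant instead of embedding thousands of literals in an `In` expression.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

// Hypothetical illustration of the JOIN-based alternative: instead of injecting
// Not(In(lineage, <thousands of literals>)) into the plan, anti-join the index
// data against a small DataFrame holding the deleted file names.
def excludeDeletedFilesViaJoin(
    spark: SparkSession,
    indexData: DataFrame,       // index scan output, including the lineage column
    deletedFiles: Seq[String],  // index.excludedFiles
    lineageCol: String): DataFrame = {
  import spark.implicits._
  val deletedDf = deletedFiles.toDF(lineageCol)
  indexData
    .join(broadcast(deletedDf), Seq(lineageCol), "left_anti")
    .drop(lineageCol)
}
```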
It would be good if @pirz could run TPCH with the 100k-chunk dataset for the injected filter condition of deleted files while I'm OOF this week :)
- build index with lineage column on the full dataset (lineitem/part 0~99999 files)
- remove files
  - case 1) delete 200 files
  - case 2) delete 5000 files
  - case 3) delete 25k files
  - case 4) delete 50k files
- run
  - run w/o index
  - run w/o refreshed index => similar to partial index
  - run w/ refreshed index => similar to hybrid scan, delete-only
  - run w/ fully refreshed index
I usually run 4x4 cases, but I think it's better to quickly test 5k or 25k deleted files with a refreshed index to see the regression; see the sketch below.
(Jfyi, deleting files on remote storage one by one took quite a lot of time in my experience...)
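A minimal sketch of how one such benchmark run could be scripted. The dataset path, index name, columns, and the lineage config key are assumptions for illustration; it relies on the standard Hyperspace `createIndex`/`refreshIndex` APIs, and file deletion itself would happen out of band.

```scala
import com.microsoft.hyperspace._
import com.microsoft.hyperspace.index.IndexConfig
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("deleted-files-benchmark").getOrCreate()
val hs = new Hyperspace(spark)

// Assumed config key for lineage-enabled indexes.
spark.conf.set("spark.hyperspace.index.lineage.enabled", "true")

// 1) Build the index on the full dataset (placeholder path and columns).
val lineitem = spark.read.parquet("/tpch/lineitem")
hs.createIndex(lineitem, IndexConfig("lineitemIdx", Seq("l_orderkey"), Seq("l_quantity")))

// 2) Delete N source files out of band (200 / 5k / 25k / 50k), then refresh.
hs.refreshIndex("lineitemIdx")

// 3) Crude wall-clock timing of a representative query, with and without the index.
def time(label: String)(f: => Unit): Unit = {
  val start = System.nanoTime()
  f
  println(s"$label: ${(System.nanoTime() - start) / 1e9} s")
}

spark.enableHyperspace
time("with refreshed index") {
  spark.read.parquet("/tpch/lineitem")
    .filter("l_orderkey = 12345").select("l_quantity").collect()
}
spark.disableHyperspace
time("without index") {
  spark.read.parquet("/tpch/lineitem")
    .filter("l_orderkey = 12345").select("l_quantity").collect()
}
```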
Closed by #171
What is the context for this pull request?
This PR extends optimizer rules to consider and leverage indexes with deleted source data files.
This is needed as part of adding support for enforcing deletes at read time.
What changes were proposed in this pull request?
This PR makes changes to `FilterIndexRule` and `JoinIndexRule` and extends them so they can leverage an index with deleted source data files, whose index metadata has already been updated by a refresh call. This is done by transforming the query plan and adding an extra pair of `Filter -> Project` nodes on top of the index scan node to exclude index records coming from deleted source data files listed in the index's `excludedFiles`.
Here is an example query plan transformed by `FilterIndexRule` for `Filter -> Project`:
Here is an example query plan transformed by `FilterIndexRule` for `select *`:
Here is an example query plan transformed by `JoinIndexRule`:
Does this PR introduce any user-facing change?
Yes, with this PR users are able to leverage an index with deleted source data files during query time.
Once such an index is leveraged, query plans show an extra pair of `Filter` and `Project` nodes, added to exclude index records originating from deleted source data files.
How was this patch tested?
Test cases are added under `E2EHyperspaceRulesTests.scala`.
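A hedged sketch of the shape such a test case could take (the paths, index name, and helper like `deleteOneSourceDataFile` are hypothetical, not the actual test code in the suite):

```scala
// Illustrative only: an end-to-end style check that a filter query still uses the
// index after a source data file is deleted and the index is refreshed.
test("Filter rule leverages index after source data files are deleted") {
  spark.conf.set("spark.hyperspace.index.lineage.enabled", "true") // assumed config key

  val df = spark.read.parquet(sampleDataPath)                      // hypothetical path
  hyperspace.createIndex(df, IndexConfig("filterIdx", Seq("c1"), Seq("c2")))

  deleteOneSourceDataFile(sampleDataPath)                          // hypothetical helper
  hyperspace.refreshIndex("filterIdx")

  val query = spark.read.parquet(sampleDataPath).filter("c1 == 'a'").select("c2")
  val planString = query.queryExecution.optimizedPlan.toString

  // The rewritten plan should include the injected Filter over the lineage column.
  assert(planString.contains("Filter"))
}
```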