-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Closed
Labels
enhancementNew feature or requestNew feature or request
Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently the order of the physical plan optimizer rules does not consider the dependencies between rules, which may cause a few issues. For example,
- The rule of
CoalesceBatchesmay be influenced by the output of the rule ofRepartition. It's not necessary to add aCoalesceBatchesExecafter aRepartitionExec. - The
Repartitionmay change the partition from single to multiple so that the rule ofBasicEnforcementhas to be run twice. - The
BasicEnforcementis mixed with the selection of the global sort algorithm.
Describe the solution you'd like
It can be refined by the following steps:
- Extract the global sort algorithm selection from the
BasicEnforcementto be a separate rule,GlobalSortSelection. - Make the
Repartitionoptional. - Reorder the rules as following:
- AggregateStatistics
- Repartition(optional)
- GlobalSortSelection
- JoinSelection
- BasicEnforcement
- CoalesceBatches(optional)
The reason for this ordering is as follows:
- For
Repartition, in order to increase the parallelism, it will change the output partitioning of some operators in the plan tree, which will influence other rules. Therefore, it should be run as soon as possible. The reason to make it optional is it's not used for the distributed engine, Ballista. And it's conflicted with some parts of theBasicEnforcement, since it will introduce additional repartitioning while theBasicEnforcementaims at reducing unnecessary repartitioning. - For
GlobalSortSelection, since currently it will depend on the partition number to decide whether change the single node sort to parallel local sort and merge, it should be run after theRepartition. Since it will change the output ordering of some operators, it should be run beforeJoinSelectionandBasicEnforcement, which may depend on that. - For
JoinSelection, based on statistics, it will change the Auto mode to real join implementation, like collect left, or hash join, or future sort merge join, which will influence theBasicEnforcementto decide whether to add additional repartition and local sort to meet the distribution and ordering requirements. Therefore, it should be run beforeBasicEnforcement. - For
BasicEnforcement, before run this rule, please make sure that the whole plan tree is determined. - For
CoalesceBatches, it will not influence the distribution and ordering of the whole plan tree. Therefore, to avoid influencing other rules, it should be run at last.
Describe alternatives you've considered
Additional context
Metadata
Metadata
Assignees
Labels
enhancementNew feature or requestNew feature or request