Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
As explained by @alamb in the user defined plan example:
TopK Background:
A "Top K" node is a common query optimization used for queries such as "find the top 3 customers by revenue". The (simplified) SQL for such a query might be:
CREATE EXTERNAL TABLE sales(customer_id VARCHAR, revenue BIGINT)
STORED AS CSV location 'tests/customer.csv';
SELECT customer_id, revenue FROM sales ORDER BY revenue DESC limit 3;

And a naive plan would be:
> explain SELECT customer_id, revenue FROM sales ORDER BY revenue DESC limit 3;
+--------------+----------------------------------------+
| plan_type | plan |
+--------------+----------------------------------------+
| logical_plan | Limit: 3 |
| | Sort: #revenue DESC NULLS FIRST |
| | Projection: #customer_id, #revenue |
| | TableScan: sales |
+--------------+----------------------------------------+
While this plan produces the correct answer, the careful reader
will note it fully sorts the input before discarding everything
other than the top 3 elements.
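To make the cost difference concrete, below is a minimal, self-contained Rust sketch of the bounded-heap alternative (plain Rust, not DataFusion code; the function name top_k and the sample revenues are invented for illustration). It keeps at most k elements in a min-heap, so processing n rows costs O(n log k) rather than the O(n log n) of the full sort, and memory is bounded by k instead of by the input size:

use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Keep only the k largest values seen so far using a size-bounded min-heap.
fn top_k(values: impl IntoIterator<Item = i64>, k: usize) -> Vec<i64> {
    let mut heap: BinaryHeap<Reverse<i64>> = BinaryHeap::with_capacity(k + 1);
    for v in values {
        heap.push(Reverse(v));
        if heap.len() > k {
            // Evict the current minimum: it can never be in the top k.
            heap.pop();
        }
    }
    let mut out: Vec<i64> = heap.into_iter().map(|Reverse(v)| v).collect();
    out.sort_unstable_by(|a, b| b.cmp(a)); // descending, like ORDER BY revenue DESC
    out
}

fn main() {
    let revenues = vec![10, 50, 20, 90, 40, 70];
    assert_eq!(top_k(revenues, 3), vec![90, 70, 50]);
}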
The ClickBench benchmarks (https://github.com/ClickHouse/ClickBench) contain many queries that could benefit from this optimization, but today they are converted to a "naive" plan. See for example query 9:
9. EXPLAIN SELECT "RegionID", COUNT(DISTINCT "UserID") AS u FROM hits GROUP BY "RegionID" ORDER BY u DESC LIMIT 10;
DataFusion CLI v12.0.0
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Limit: skip=0, fetch=10 |
| | Sort: #u DESC NULLS FIRST |
| | Projection: #hits.RegionID, #COUNT(DISTINCT hits.UserID) AS u |
| | Projection: #group_alias_0 AS RegionID, #COUNT(alias1) AS COUNT(DISTINCT hits.UserID) |
| | Aggregate: groupBy=[[#group_alias_0]], aggr=[[COUNT(#alias1)]] |
| | Aggregate: groupBy=[[#hits.RegionID AS group_alias_0, #hits.UserID AS alias1]], aggr=[[]] |
| | TableScan: hits projection=[RegionID, UserID] |
| physical_plan | GlobalLimitExec: skip=0, fetch=10 |
| | SortExec: [u@1 DESC] |
| | CoalescePartitionsExec |
| | ProjectionExec: expr=[RegionID@0 as RegionID, COUNT(DISTINCT hits.UserID)@1 as u] |
| | ProjectionExec: expr=[group_alias_0@0 as RegionID, COUNT(alias1)@1 as COUNT(DISTINCT hits.UserID)] |
| | AggregateExec: mode=FinalPartitioned, gby=[group_alias_0@0 as group_alias_0], aggr=[COUNT(alias1)] |
| | CoalesceBatchesExec: target_batch_size=4096 |
| | RepartitionExec: partitioning=Hash([Column { name: "group_alias_0", index: 0 }], 16) |
| | AggregateExec: mode=Partial, gby=[group_alias_0@0 as group_alias_0], aggr=[COUNT(alias1)] |
| | AggregateExec: mode=FinalPartitioned, gby=[group_alias_0@0 as group_alias_0, alias1@1 as alias1], aggr=[] |
| | CoalesceBatchesExec: target_batch_size=4096 |
| | RepartitionExec: partitioning=Hash([Column { name: "group_alias_0", index: 0 }, Column { name: "alias1", index: 1 }], 16) |
| | AggregateExec: mode=Partial, gby=[RegionID@0 as group_alias_0, UserID@1 as alias1], aggr=[] |
| | RepartitionExec: partitioning=RoundRobinBatch(16) |
| | ParquetExec: limit=None, partitions=[home/danielheres/Code/gdd/ClickBench/datafusion/hits.parquet], projection=[RegionID, UserID] |
| | |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.015 seconds.
Describe the solution you'd like
Implement this feature.
Concretely, this means pushing the limit information down to SortPreservingMergeStream and keeping only the top N elements in its min_heap data structure, instead of buffering and sorting the full input.
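As a rough illustration of what "pushing the limit down" could look like, here is a sketch in plain Rust rather than DataFusion's actual stream types (merge_sorted_with_fetch is a hypothetical name, and the Vec<Vec<i64>> inputs stand in for already-sorted partitions): a limit-aware k-way merge emits rows in order and stops pulling from its inputs as soon as fetch rows have been produced.

use std::collections::BinaryHeap;

// Hypothetical limit-aware merge of inputs that are each sorted descending.
fn merge_sorted_with_fetch(inputs: Vec<Vec<i64>>, fetch: usize) -> Vec<i64> {
    // Max-heap of (head value, input index, offset within that input).
    let mut heap = BinaryHeap::new();
    for (i, input) in inputs.iter().enumerate() {
        if let Some(&v) = input.first() {
            heap.push((v, i, 0usize));
        }
    }
    let mut out = Vec::with_capacity(fetch);
    // The pushed-down limit bounds the loop: stop after `fetch` rows.
    while out.len() < fetch {
        let Some((v, i, off)) = heap.pop() else { break };
        out.push(v);
        if let Some(&next) = inputs[i].get(off + 1) {
            heap.push((next, i, off + 1));
        }
    }
    out
}

fn main() {
    // Two partitions, each already sorted descending (e.g. by revenue).
    let parts = vec![vec![90, 40, 10], vec![70, 50, 20]];
    assert_eq!(merge_sorted_with_fetch(parts, 3), vec![90, 70, 50]);
}

Combined with keeping only the top N elements in the heap (as the top_k sketch above does), no operator ever has to materialize or fully sort the whole input.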
Describe alternatives you've considered
n/a
Additional context
n/a